2606.02642 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.MM cs.SD (updated version)

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh

Affiliation: KAIST

AI summary: To address speech-vision hallucination in audio-visual large language models, the paper proposes the SVHalluc benchmark, which evaluates along semantic and temporal dimensions how well models align speech content with visual signals, and finds that existing models are limited in cross-modal understanding.

Comments: Accepted at CVPR 2026

Abstract

Despite the success of audio-visual large language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle to align speech content with corresponding visual signals, with near-random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open-source models. Our analysis suggests that their failures stem from limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: https://chenshuang-zhang.github.io/projects/svhalluc/.

2606.02639 2026-06-03 eess.IV cs.AI cs.CV (updated version)

Sparse-View Lung Nodule Volumetry from Digitally Reconstructed Radiographs via AReT: Anatomy-Regularized TensoRF

Spoorthi M, Suja Palaniswamy

Affiliation: Amrita University

AI summary: This paper identifies and resolves a default density-shift problem in TensoRF when applied to X-ray attenuation fields, and proposes AReT, an anatomy-regularized tensorial radiance field framework that achieves stable volumetric reconstruction of lung nodules from only three orthogonal X-ray projections, with high accuracy on the LIDC-IDRI dataset.

Abstract

We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift of -10, originally introduced for RGB scene reconstruction, suppresses density gradients and prevents sparse-view medical reconstruction regardless of learning rate or regularization strategy. Setting the density shift to zero restores gradient flow and enables stable volumetric reconstruction of pulmonary nodules from only three orthogonal X-ray projections. Building on this, we propose AReT, an anatomy-regularized tensorial radiance field framework for lung nodule reconstruction using coronal, sagittal, and axial projections from the LIDC-IDRI dataset (19 patients, radiologist-annotated nodules). Unlike existing NeRF approaches requiring dense multi-view acquisition, AReT is designed for sparse-view thoracic imaging and incorporates chest-anatomy-aware regularization combining L1 sparsity and total variation smoothness. A systematic comparison across 11 reconstruction strategies shows anatomy-aware regularization consistently outperforms generative-prior-guided approaches. Evaluated against radiologist consensus segmentations, AReT achieves Pearson r=0.983 (p<0.0001) for clinically actionable nodules >=10 mm (n=14), median absolute volumetric error of 11.4%, near-zero systematic bias of -77.3 mm^3, and 8.4x improvement over spherical volume approximation.
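
To make the gradient-suppression claim concrete, here is a minimal sketch. It assumes density is computed as softplus(raw + shift), the activation commonly used in TensoRF implementations; the abstract does not name the activation, so treat that as an assumption. Because the derivative of softplus is the logistic sigmoid, the gradient that reaches the raw density feature is sigmoid(raw + shift). The function names are illustrative only.

```python
import math

def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Derivative of softplus with respect to its input."""
    return 1.0 / (1.0 + math.exp(-x))

def density_and_gradient(raw_feature, density_shift):
    """Return sigma = softplus(raw + shift) and d(sigma)/d(raw).

    The softplus activation is an assumption, not taken from the abstract.
    """
    z = raw_feature + density_shift
    return softplus(z), sigmoid(z)

# Compare the two shifts for a raw density feature initialised near zero.
sigma_shifted, grad_shifted = density_and_gradient(0.0, -10.0)
sigma_zero, grad_zero = density_and_gradient(0.0, 0.0)
print(f"shift=-10: sigma={sigma_shifted:.2e}, gradient={grad_shifted:.2e}")
print(f"shift=0:   sigma={sigma_zero:.3f}, gradient={grad_zero:.3f}")
```

Under this assumption, a raw feature near zero receives a gradient roughly four orders of magnitude smaller with a shift of -10 than with a shift of 0, which is consistent with the stalled optimization the abstract describes.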

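2606.02634 2026-06-03 eess.IV cs.AI (updated version)

Echo-POSED: Geometric Self-Distillation for Echocardiography Guidance

Elias Stenhede, Edvart Grüner Bjerke, Joanna Sulkowska, Eivind Bjørkan Orstad, Ole Jakob Elle, Ulysse Côté-Allard, Arian Ranjbar

AI summary: Proposes Echo-POSED, a self-supervised framework trained on 2D views sliced from 3D echocardiography volumes, for real-time transthoracic echocardiography guidance without expert-annotated views or tracked probe trajectories. It preserves probe-motion equivariance on SO(3)×SO(3) and achieves a mean angular error of 8.2 degrees in intra-patient and inter-patient guidance simulations.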

2606.02631 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.SD (updated version)

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

Shenghao Ding

Affiliation: Yet Another AI

AI summary: This paper studies whether audio, images, and video can share a unified wavelet token schema. Using a continuous-token model based on Haar DWT/IDWT, it verifies the feasibility of a unified token schema on several datasets and analyzes the effects of latent capacity and metadata.

Comments: 12 pages, 3 figures

Abstract

This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
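
The frontend and the energy-selection baseline can be sketched on a 1-D signal as follows. This is an illustration, not the paper's code: it assumes that energy_global means keeping the largest-magnitude coefficients across all subbands, and that uniform selection keeps evenly spaced coefficients. The test signal and every function name are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(x):
    """One-level orthonormal Haar DWT of an even-length sequence."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed even/odd samples."""
    x = []
    for a, d in zip(approx, detail):
        x.extend(((a + d) / SQRT2, (a - d) / SQRT2))
    return x

def energy_indices(coeffs, k):
    """Indices of the k largest-magnitude coefficients (assumed 'energy_global')."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return order[:k]

def uniform_indices(coeffs, k):
    """Indices of k evenly spaced coefficients (assumed uniform baseline)."""
    step = len(coeffs) / k
    return [int(j * step) for j in range(k)]

def psnr(reference, estimate, peak=1.0):
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def reconstruct(x, index_fn, keep_ratio):
    """Transform, zero every coefficient that is not kept, and invert."""
    approx, detail = haar_dwt(x)
    coeffs = approx + detail
    k = max(1, round(keep_ratio * len(coeffs)))
    keep = set(index_fn(coeffs, k))
    kept = [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
    return haar_idwt(kept[: len(approx)], kept[len(approx):])

# Hypothetical smooth test signal: four sine periods over 256 samples.
signal = [0.5 * math.sin(2 * math.pi * 4 * n / 256) for n in range(256)]
psnr_energy = psnr(signal, reconstruct(signal, energy_indices, 0.5))
psnr_uniform = psnr(signal, reconstruct(signal, uniform_indices, 0.5))
print(f"energy selection: {psnr_energy:.2f} dB, uniform selection: {psnr_uniform:.2f} dB")
```

Because this Haar transform is orthonormal, keeping the largest-magnitude coefficients minimizes squared reconstruction error for a fixed budget, which is one plausible reason a non-parametric energy rule makes a strong baseline.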

2606.02615 2026-06-03 eess.AS cs.AI cs.SD (updated version)

FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations

Haolong Zheng, Siyin Wang, Xulin Fan, Zengrui Jin, Mark Hasegawa-Johnson

Affiliations: University of Illinois Urbana-Champaign; Tsinghua University

AI summary: Proposes FSA-GRPO, a reinforcement-learning-based post-training method whose specially designed reward encourages the model to use few-shot demonstrations, strengthening few-shot adaptation and yielding gains on children's speech recognition, speech translation, and audio understanding.

Abstract

Few-shot prompting provides an effective way to adapt auditory large language models to low-resource tasks such as children's speech recognition. However, most auditory large language models are not explicitly trained to perform inference in this demonstration-conditioned format, limiting the extent to which they can benefit from few-shot prompting. To address this limitation, we introduce Few-Shot Aware GRPO (FSA-GRPO), an RL-based post-training recipe that uses a specially designed reward to encourage the model to leverage few-shot demonstrations, thereby strengthening its few-shot adaptation ability. Notably, training with only high-resource adult ASR data improves the model's general few-shot adaptation ability, yielding gains not only in children's speech recognition but also in speech translation and audio understanding. We further study data selection and auxiliary reward weighting to identify an effective training recipe. Our experiments show that when in-domain data are unavailable or cannot be used for training, FSA-GRPO is more effective than direct tuning on related out-of-domain data.
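
For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched as below. This shows only the generic normalization step. The few-shot-aware reward itself is not specified in the abstract, so the reward values here (negative word error rates for four sampled transcripts of one prompt) are hypothetical placeholders, and the use of the population standard deviation is a choice made for this sketch.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses sampled from the same prompt,
# e.g. negative word error rate of each candidate transcript.
rewards = [-0.40, -0.25, -0.25, -0.10]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Responses that beat the group average receive positive advantages and are reinforced; the baseline comes from the group itself rather than from a learned value model.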

2606.02645 2026-06-03 stat.ML cs.AI cs.LG (updated version)

Target Updates May Stabilize Linear Q-Learning: Periodic and Soft Dynamics

Donghwan Lee

Affiliation: School of Electrical Engineering, KAIST

AI summary: Using exact switched-linear-system dynamics and joint-spectral-radius analysis, the paper proves that, under specific spectral and step-size conditions, periodic hard target updates and soft target updates guarantee convergence of linear Q-learning to the exact projected Q-Bellman solution.

Abstract

Periodic target updates in Q-learning and soft target updates in actor-critic methods are empirically well established stabilization mechanisms, but their precise theoretical explanation is still incomplete. This paper gives a rigorous and exact analysis of these mechanisms for Q-learning with linear function approximation (linear Q-learning) using the exact switched linear system (SLS) dynamics induced by the Bellman maximum and the joint spectral radius (JSR) of the resulting switching matrix families. Although linear Q-learning can fail to converge in general, we prove that, under explicit spectral and step-size conditions, periodic hard target updates and soft target updates can guarantee convergence to the exact projected Q-Bellman solution. The main analysis is carried out for deterministic linear Q-learning, where the target-update mechanism is most transparent. Once the corresponding JSR certificate is established for the mean recursion, the stochastic reinforcement-learning setting can be treated by replacing deterministic modes with sampled stochastic modes and adding the corresponding stochastic-noise analysis.
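
The two target-update schemes can be illustrated with a small deterministic (mean-recursion) example. Everything below is an assumption made for illustration: the two-state MDP, the uniform state-action weighting, the step size, and the one-hot features. One-hot features make the projected Q-Bellman solution coincide with the optimal Q-function, so the result can be checked against value iteration. The paper's spectral and step-size conditions are not reproduced here.

```python
GAMMA = 0.9
N_S, N_A = 2, 2
NEXT = [[0, 1], [0, 1]]            # hypothetical deterministic transitions NEXT[s][a]
REWARD = [[0.0, 1.0], [2.0, 0.0]]  # hypothetical rewards REWARD[s][a]

def phi(s, a):
    """Feature vector of (s, a); one-hot here, so the dimension is N_S * N_A."""
    v = [0.0] * (N_S * N_A)
    v[s * N_A + a] = 1.0
    return v

def q_value(theta, s, a):
    return sum(t * f for t, f in zip(theta, phi(s, a)))

def bellman_target(theta_target, s, a):
    """Reward plus discounted Bellman maximum, evaluated with the target parameter."""
    s_next = NEXT[s][a]
    return REWARD[s][a] + GAMMA * max(q_value(theta_target, s_next, b) for b in range(N_A))

def mean_update(theta, theta_target, alpha):
    """One deterministic semi-gradient step under a uniform state-action distribution."""
    new = list(theta)
    weight = 1.0 / (N_S * N_A)
    for s in range(N_S):
        for a in range(N_A):
            td = bellman_target(theta_target, s, a) - q_value(theta, s, a)
            for i, f in enumerate(phi(s, a)):
                new[i] += alpha * weight * td * f
    return new

def periodic_hard(alpha=1.0, period=20, steps=4000):
    theta = [0.0] * (N_S * N_A)
    target = list(theta)
    for k in range(steps):
        theta = mean_update(theta, target, alpha)
        if (k + 1) % period == 0:
            target = list(theta)  # hard copy every `period` steps
    return theta

def soft(alpha=1.0, tau=0.05, steps=4000):
    theta = [0.0] * (N_S * N_A)
    target = list(theta)
    for _ in range(steps):
        theta = mean_update(theta, target, alpha)
        target = [(1 - tau) * t + tau * w for t, w in zip(target, theta)]  # soft update
    return theta

def value_iteration(sweeps=2000):
    q = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(sweeps):
        q = [[REWARD[s][a] + GAMMA * max(q[NEXT[s][a]]) for a in range(N_A)]
             for s in range(N_S)]
    return [q[s][a] for s in range(N_S) for a in range(N_A)]

theta_hard, theta_soft, q_star = periodic_hard(), soft(), value_iteration()
print("hard:", [round(v, 3) for v in theta_hard])
print("soft:", [round(v, 3) for v in theta_soft])
print("Q*:  ", [round(v, 3) for v in q_star])
```

In this toy setting both schemes settle on the value-iteration solution. With general features that outcome is not automatic, which is the gap the paper's joint-spectral-radius conditions are meant to address.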

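2606.02632 2026-06-03 stat.ML cs.AI cs.CY cs.LG econ.EM stat.AP (updated version)

Position: Prioritize Identifying Structure, Not Complex Models, for Scientific Discovery

Tyler H. McCormick

Affiliation: GitHub

AI summary: This paper argues that modern machine learning suffers from generic underdetermination in the high-dimensional proxy regime, and proposes concrete criteria for "mechanistic machine learning" so that LLM-centered workflows genuinely support science rather than imitate it.

Comments: Will appear as a position paper in ICML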

2606.02592 2026-06-03 stat.AP cs.AI (updated version)

Tracking Urban Atmospheric Pollutants using Sentinel-5P Satellite Data

Alice Gomez-Cantos, Henry O. Velesaca

Affiliations: Facultad de Ciencias Naturales y Matemáticas, Escuela Superior Politécnica del Litoral, ESPOL, Campus Gustavo Galindo, Km. 30.5 Vía Perimetral, Guayaquil, 090902, Ecuador; Software Engineering Department, Research Center for Information and Communication Technologies (CITIC-UGR), University of Granada, 18071, Granada, Spain

AI summary: Proposes a framework based on Sentinel-5P/TROPOMI tropospheric column observations that uses distributional metrics such as the median and upper percentiles, together with K-means clustering, to characterize urban NO2 background levels and extremes across Guayas Province, Ecuador, providing an interpretable and scalable air-quality assessment tool for data-scarce regions.

Abstract

Urban nitrogen dioxide ($NO_2$) is a key indicator of combustion-related air pollution and exhibits strong spatial and temporal variability in cities. This study presents a satellite-based framework for tracking urban $NO_2$ pollution using tropospheric column observations from Sentinel-5P/TROPOMI over Guayas Province, Ecuador. Rather than estimating surface concentrations, the methodology emphasizes robust distributional metrics, including the median and upper-tail percentiles ($P_{90}$, $P_{95}$, and $P_{99}$), to characterize background conditions and localized pollution extremes at the canton scale. Multi-year satellite observations are aggregated annually and analyzed using unsupervised K-means clustering to identify characteristic pollution regimes without predefined thresholds. Results show that highly urbanized cantons consistently exhibit elevated extreme $NO_2$ values and greater variability, while less urbanized areas display lower and more homogeneous patterns. The proposed approach provides an interpretable and scalable tool for urban air-quality assessment in data-scarce regions using satellite observations alone. The implementation is publicly available on GitHub at https://hvelesaca.github.io/sentinel-5P-clustering/.
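
The metric-then-cluster idea can be sketched with the standard library alone. The snippet below is not the released implementation. The canton names, the synthetic log-normal series, the units, and the deterministic centroid initialization are all assumptions made so that the example is self-contained and reproducible.

```python
import random

def percentile(values, q):
    """q-th percentile (0..100) with linear interpolation between order statistics."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

def distribution_features(values):
    """Median plus upper-tail percentiles P90, P95, P99 for one canton-year."""
    return [percentile(values, q) for q in (50, 90, 95, 99)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm (requires k >= 2).

    Initial centroids are spread over the points sorted by median, a
    deterministic choice made for this sketch only.
    """
    ordered = sorted(points)
    centroids = [list(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

rng = random.Random(0)
# Hypothetical daily NO2 column values for one year per canton (arbitrary units).
series = {
    "urban_a": [rng.lognormvariate(4.1, 0.4) for _ in range(365)],
    "urban_b": [rng.lognormvariate(4.0, 0.4) for _ in range(365)],
    "rural_a": [rng.lognormvariate(3.0, 0.2) for _ in range(365)],
    "rural_b": [rng.lognormvariate(2.9, 0.2) for _ in range(365)],
}
names = list(series)
points = [distribution_features(series[name]) for name in names]
labels, centroids = kmeans(points, k=2)
regime = dict(zip(names, labels))
print(regime)
```

Clustering on percentiles rather than means keeps the upper tail visible, so a canton with occasional strong pollution episodes is not averaged into the background regime.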

2606.03763 2026-06-03 econ.GN cs.AI q-fin.EC (updated version)

TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering

Jin Gao, Juntu Zhao, Zirui Zeng, Jiaqi Shen, Junhao Shi, Dukun Zhao, Yuming Lu, Dequan Wang

Affiliation: Tsinghua University

AI summary: TadA-Bench is a million-variant wet-lab replay benchmark built from 31 rounds of TadA directed evolution. It defines a fixed-data replay task that evaluates how well models rank variants from unseen future rounds, introduces Seq2Graph to unify labels, and shows that evolutionary coverage matters more than local data density.

Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026). Data: https://huggingface.co/datasets/JinGao/TadABench-1M . Code: https://github.com/shiyegao/TadABench-1M

Abstract

AI for scientific discovery is entering an agentic era, where protein-engineering systems are expected to prioritize future wet-lab experiments rather than merely fit static measurements. We introduce TadA-Bench, a million-variant wet-lab replay benchmark from 31 TadA directed-evolution rounds for future-round discovery toward agentic protein engineering. TadA-Bench preserves the campaign chronology and defines a fixed-data replay task: given earlier experimental rounds, models rank variants that appear only in later rounds. It provides aligned DNA, RNA, and protein views, and uses Seq2Graph, a graph-based label-unification pipeline, to reconcile noisy enrichment measurements into consistent cross-round activity labels. Random-split controls show strong interpolation, but future-round ranking and finite-budget candidate selection are much weaker. Controlled analyses suggest that evolutionary coverage is more informative than local data density, positioning TadA-Bench as a reproducible wet-lab replay substrate for future-round discovery toward agentic protein engineering; the data and code are released on Hugging Face and GitHub.

2606.03985 2026-06-03 cs.RO cs.AI cs.CV (updated version)

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi

Affiliations: Tsinghua University; Galbot Inc.; Shanghai Jiao Tong University; Peking University; Shanghai Qi Zhi Institute

AI summary: Proposes Humanoid-GPT, a GPT-style causal Transformer pretrained on a billion-scale motion corpus for whole-body control. Scaling data and model capacity yields zero-shot generalization to unseen motions and tasks.

Comments: Accepted at CVPR 2026

2606.03979 2026-06-03 cs.LG cs.AI (updated version)

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni

Affiliations: Google; Cornell University

AI summary: Inspired by human learning, the paper proposes a "Sleep" paradigm with two stages, memory consolidation (Knowledge Seeding) and dreaming (self-improvement), that lets a model learn continually, turn short-term memories into long-term knowledge, and improve itself.

Comments: A version of this work has been publicly available from September 2025 on OpenReview

Abstract

The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by the human learning process, we introduce a "Sleep" paradigm that allows the models to continually learn, distill their short-term fragile memories into stable long-term knowledge with replay, and recursively improve themselves with a "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, where the memories of a smaller-self are distilled into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase, where the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.

2606.03976 2026-06-03 cs.CV cs.AI cs.LG q-bio.NC (updated version)

Formalizing the Binding Problem

Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang, Ansh Soni, Konrad P. Kording

AI summary: This paper formalizes the binding problem with an information-theoretic approach, proposes a probing method to measure binding information in model representations, and reports experiments on Vision Transformers indicating that binding is a key ingredient of strong visual recognition and reasoning.

Comments: Accepted to ICML 2026

Abstract

Representations of the world, arguably, contain information about features (e.g., something is blue, something is a circle) but also information about which features are part of the same object (e.g., the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information, i.e., for features. We may believe that there is not much binding information; after all, misattributing features to wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.

2606.03969 2026-06-03 cs.CL cs.AI (updated version)

Quantifying Faithful Confidence Expression in Large Reasoning Models

Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

Affiliation: Yale University

AI summary: To address the difficulty large reasoning models (LRMs) have in faithfully expressing their intrinsic confidence in long chain-of-thought outputs, the paper proposes a framework based on token probabilities, hidden states, and response consistency that systematically quantifies how well linguistic decisiveness aligns with internal uncertainty.

Comments: Code: https://github.com/yale-nlp/faithful_lrm

Abstract

Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace--complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.

2606.03968 2026-06-03 cs.CL cs.AI (updated version)

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang

Affiliations: Amazon; Georgia Institute of Technology

AI summary: To address the rubric-quality bottleneck caused by a fixed query distribution in rubric-based RL, the paper proposes QUBRIC, a framework that co-designs queries and rubrics using teacher key points, contrastive generation, and learnability filtering. It gains +5.5 points on ArenaHard and generalizes to legal, moral, and narrative reasoning tasks.

Abstract

Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.

2606.03967 2026-06-03 cs.CL cs.AI (updated version)

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

Quentin Fuxa, Dominik Macháček

Affiliations: Charles University, MFF, ÚFAL; University of Edinburgh

AI summary: Proposes AlignAtt4LLM, which applies the AlignAtt policy to a decoder-only LLM for the first time through an explicit source span, offline selection of translation alignment heads, selective qk-fast replay, and runtime query/key capture, and outperforms the baselines on English-German and English-Italian simultaneous translation.

Comments: Accepted to IWSLT 2026

Abstract

We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.
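
The emission rule behind an AlignAtt-style policy can be sketched as follows. This is a generic illustration of the idea (emit a draft token only if its most-attended source position is not among the newest ones), not the system's implementation. Head selection, qk-fast replay, and query/key capture are not modelled, and the parameter name `guard`, the attention values, and the tokens are hypothetical.

```python
def alignatt_emit(draft_tokens, attention, guard):
    """Return the prefix of draft_tokens considered safe to emit.

    attention[i][j] is the attention weight from draft token i to source
    token j, for example averaged over the selected alignment heads.
    A token is held back when its most-attended source position lies in
    the last `guard` source positions, since more input may change it.
    """
    if not attention:
        return []
    n_source = len(attention[0])
    emitted = []
    for token, weights in zip(draft_tokens, attention):
        most_attended = max(range(n_source), key=weights.__getitem__)
        if most_attended >= n_source - guard:
            break  # wait for more source text before committing this token
        emitted.append(token)
    return emitted

# Hypothetical draft translation over a five-token source prefix.
draft = ["Das", "ist", "ein", "Test"]
attention = [
    [0.70, 0.10, 0.10, 0.05, 0.05],  # attends mostly to source position 0
    [0.10, 0.60, 0.20, 0.05, 0.05],  # position 1
    [0.05, 0.10, 0.15, 0.60, 0.10],  # position 3, inside the guard region
    [0.05, 0.05, 0.10, 0.20, 0.60],  # position 4
]
stable_prefix = alignatt_emit(draft, attention, guard=2)
print(stable_prefix)
```

A larger `guard` holds back more of the draft and raises latency; a value of 0 emits the whole draft. In a decoder-only model the attention rows would have to be read from self-attention over the explicit source span in the prompt, which is where the abstract's four techniques come in.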

2606.03965 2026-06-03 cs.CL cs.AI (updated version)

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley

AI summary: Proposes Agentic Chain-of-Thought Steering (ACTS), in which a controller agent trained with reinforcement learning adaptively selects reasoning strategies and steering phrases during inference. This enables budget-aware strategy control that saves a substantial number of tokens while maintaining reasoning quality, and supports a controllable accuracy-efficiency trade-off.

Abstract

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.

2606.03962 2026-06-03 cs.LG cs.AI (updated version)

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas, Eser Aygün, David Smalling, Shibl Mourad, Doina Precup, André Barreto, Mark Rowland

Affiliations: New York University; Google DeepMind

AI summary: To address the lack of diversity in conventional reinforcement learning, the paper replaces the reward function with a distribution over reward functions, so that a non-linear objective over action sets yields controllable, diverse behaviour naturally. A gradient estimator is derived, and experiments support the method's robustness and theoretical advantages.

Comments: Core contributors: Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, André Barreto, Mark Rowland

Abstract

Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as the rational response to uncertainty in the reward. When the reward function is not perfectly known--as is the case with ambiguous preferences or imperfect reward models--committing to a single action can be sub-optimal. Building on this, we propose a fundamental reformulation of the RL objective by replacing the scalar reward with a distribution over reward functions, and applying a non-linear objective over sets of actions. The result is a framework in which calibrated behavioural diversity emerges naturally, remains controllable through the reward function distribution, and is obtained without sacrificing expected reward. Focusing on the contextual bandit setting, we derive a principled gradient estimator for this objective and prove that our formulation naturally generalizes both vanilla policy gradient and more recently developed action-set approaches. Our empirical results demonstrate that this framework offers a robust and theoretically grounded alternative for complex RL tasks where the traditional formulation of the problem fails to induce the desired breadth of agent behaviour.
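
A two-action toy case shows why committing to a single action can be sub-optimal under reward uncertainty. The abstract does not state which non-linear set objective the paper uses. The expected best-of-set value below is one assumed instance chosen for illustration, and the two reward functions and their probabilities are hypothetical.

```python
from itertools import combinations_with_replacement

# Hypothetical distribution over reward functions: (probability, reward table).
REWARD_FUNCTIONS = [
    (0.5, {"A": 1.0, "B": 0.0}),
    (0.5, {"A": 0.0, "B": 1.0}),
]

def expected_reward(action):
    """Classical objective: expected scalar reward of a single action."""
    return sum(p * r[action] for p, r in REWARD_FUNCTIONS)

def expected_best_of_set(actions):
    """Assumed non-linear set objective: expected reward of the best action in the set."""
    return sum(p * max(r[a] for a in actions) for p, r in REWARD_FUNCTIONS)

# Score every multiset of two actions: (A, A), (A, B), (B, B).
scores = {s: expected_best_of_set(s) for s in combinations_with_replacement("AB", 2)}
best_set = max(scores, key=scores.get)
print(scores, best_set)
```

Both actions have the same expected reward of 0.5, so the classical objective is indifferent between them. The set objective scores the mixed set at 1.0 against 0.5 for either repeated action, so diversity is preferred here without lowering the expected reward of any individual action.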

2606.03957 2026-06-03 cs.CL cs.AI cs.SD eess.AS (updated version)

Efficient ASR Training with Conversations that Never Happened

Máté Gedeon, Péter Mihajlik

Affiliations: Dept. of Telecommunications and Artificial Intelligence, Budapest University of Technology and Economics; SpeechTex Ltd.; ELTE Research Centre for Linguistics

AI summary: For lower-resource languages and niche domains, the paper proposes an augmentation pipeline that generates dialogue scenarios with an LLM, maps speaker attributes to TTS voice profiles, and assembles the synthesized utterances into conversations. Experiments show that synthetic conversations improve ASR performance: on a Hungarian benchmark, 67 hours of real conversations plus 636 hours of simulated data outperform a zero-shot model trained on 2700 hours of speech.

Abstract

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

2606.03939 2026-06-03 cs.LG cs.AI cs.PF (updated version)

FlashbackCL: Mitigating Temporal Forgetting in Federated Learning

Mubarak A. Ojewale, Adriana E. Chis, Jorge M. Cortes-Mendoza, Bernardo Pulido-Gaytan, Horacio Gonzalez-Velez

Affiliations: Cloud Competency Centre, National College of Ireland, Dublin, Ireland

AI summary: To address the temporal forgetting that arises when client data distributions drift over time in federated learning, the paper proposes FlashbackCL, which combines temporally-decayed label counts, replay with Class-Balanced Reservoir Sampling, and server-side active coreset curation. On CIFAR-10 it achieves a 6.9% to 10.0% relative improvement over Flashback and reduces temporal forgetting by up to 68%.

Abstract

Federated Learning (FL) of foundation and edge models increasingly targets deployments where client data distributions drift over time, yet existing forgetting-mitigation methods assume each client's distribution is stationary. Flashback, the strongest recent FL method against cross-client (spatial) forgetting, uses monotonically accumulating per-class label counts as a knowledge proxy; this proxy becomes miscalibrated under temporal distribution shift and anchors the global model to an outdated class balance. We formalise temporal forgetting in FL with a per-phase metric isolated from protocol-level fluctuations and propose Flashback Continual Learning (FlashbackCL), a drop-in extension of Flashback with (i) temporally-decayed label counts; (ii) a device-aware replay buffer with Class-Balanced Reservoir Sampling (CBRS); and (iii) server-side active coreset curation on the public distillation set. The results show that FlashbackCL achieves a 6.9% to 10.0% relative improvement over Flashback on CIFAR-10 with 50 clients and three controlled temporal shift modes, while simultaneously reducing temporal forgetting by up to 68%. A 5-variant ablation identifies CBRS replay as the critical component. FlashbackCL also improves on Flashback by 3.5 points on stationary CIFAR-100, suggesting that class-balanced replay regularises spatial heterogeneity as well as temporal shift.
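The difference between monotonically accumulating label counts and temporally-decayed ones can be sketched in a few lines. This is an illustration under assumptions: the abstract does not give the decay rule, so the exponential decay factor and the `update_counts` helper are hypothetical.

```python
def update_counts(counts, batch_labels, decay=0.5):
    """Decay the old per-class counts, then add the labels seen in the newest round.

    With decay=1.0 this reduces to plain monotonic accumulation.
    """
    new_counts = {c: decay * n for c, n in counts.items()}
    for label in batch_labels:
        new_counts[label] = new_counts.get(label, 0.0) + 1.0
    return new_counts


# A client whose data drifts from "cat" to "dog" over three rounds.
rounds = [["cat"] * 8, ["dog"] * 4, ["dog"] * 4]

decayed, accumulated = {}, {}
for labels in rounds:
    decayed = update_counts(decayed, labels, decay=0.5)
    accumulated = update_counts(accumulated, labels, decay=1.0)
```

After the drift, the accumulated counts still weight both classes equally, while the decayed counts favour the class the client currently holds.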

2606.03927 2026-06-03 cs.LG cs.AI (updated version)

FFR: Forward-Forward Learning for Regression

Xinyang Liu, Xuanyu Liang, Shiqi Ding, Boyang Li, Zhiqiang Que, Jiayang Li, Guosheng Hu

Affiliations: University of Bristol; University College London; University of Cambridge

AI summary: Proposes FFR, a framework that extends the Forward-Forward algorithm to regression through an ordinal competitive goodness function, a stratified ladder architecture, and hierarchical prediction. Across five real-world datasets it recovers on average 98.6% of BP's accuracy while substantially reducing memory and time overhead.

Abstract

The Forward-Forward (FF) algorithm offers a computationally efficient and biologically plausible alternative to backpropagation (BP) by training neural networks through purely local, layer-wise optimization. However, FF is inherently designed for classification via contrastive positive-negative sample pairs, and extending it to regression poses fundamental challenges: continuous target spaces lack natural "opposites" for contrastive learning, and the standard goodness function carries no information about target magnitude or ordering. We propose FFR (Forward-Forward for Regression), to our knowledge the first framework to extend FF to real-world regression and demonstrate competitive performance across diverse real-world datasets. FFR introduces three key innovations: (1) an ordinal competitive goodness function that replaces contrastive pairs with competitive learning between partitioned neuron groups under distance-aware ordinal supervision; (2) a stratified ladder architecture where shallow layers learn coarse ordinal discrimination and deeper layers refine into fine-grained regression, with multi-scale feature aggregation for inter-layer collaboration; and (3) hierarchical prediction with uncertainty estimation, where multi-scale predictors jointly provide robust predictions and prediction confidence as a free lunch. Extensive experimental results show that FFR recovers on average 98.6% of BP's accuracy across five real-world regression benchmarks while reducing peak training memory to only 27% of BP's at depth 8 and 8% at depth 32, with per-iteration time around 72% of BP's, and substantially outperforms all BP-free competitors.

2606.03918 2026-06-03 cs.AI (updated version)

Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

Eric Cho, Shawn Huang, Alice Lu, Andy Lyu

Affiliations: Trata; Brigham Young University

AI summary: Proposes Hedge-Bench, a benchmark of 102 tasks built from the reasoning traces of hedge fund analysts' actual work, for evaluating AI agents on open-ended financial reasoning problems. Frontier models score below 16%.

Comments: Dataset and evaluation harness available at github.com/Trata-Inc/trata-hedge-bench

2606.03910 2026-06-03 cs.PF cs.AI cs.DC cs.NI (updated version)

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

Mubarak Adetunji Ojewale

Affiliations: Cloud Competency Centre, National College of Ireland

AI summary: To address the increase in Time to First Token caused by KV cache transfer in disaggregated LLM inference, the paper proposes NetKV, a network-cost-aware scheduler that selects decode instances with a greedy algorithm. On a 64-GPU fat-tree simulator it reduces mean TTFT by up to 21.2%.

Abstract

Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.
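A single-pass greedy selection over the decode instances can be sketched as follows. This is not NetKV itself: the cost model (queue delay plus KV cache size divided by an oracle bandwidth estimate), the `DecodeInstance` fields, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DecodeInstance:
    name: str
    queue_delay_s: float    # estimated wait caused by compute load
    bandwidth_gbps: float   # assumed oracle estimate of prefill-to-decode bandwidth


def pick_instance(instances, kv_cache_gb):
    """One pass over all candidates, O(|D|): minimise queue delay plus KV transfer time."""
    def estimated_delay(inst):
        transfer_s = kv_cache_gb * 8 / inst.bandwidth_gbps  # GB -> Gb, then seconds
        return inst.queue_delay_s + transfer_s
    return min(instances, key=estimated_delay)


instances = [
    DecodeInstance("same-rack", queue_delay_s=0.30, bandwidth_gbps=400.0),
    DecodeInstance("cross-pod", queue_delay_s=0.05, bandwidth_gbps=50.0),
]
short_ctx = pick_instance(instances, kv_cache_gb=0.5)
long_ctx = pick_instance(instances, kv_cache_gb=8.0)
```

With a small KV cache the lightly loaded but distant instance wins; with a large one the transfer term dominates and the nearby instance is chosen, which matches the abstract's point that ignoring the network term hurts more as context length grows.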

2606.03907 2026-06-03 cs.SE cs.AI cs.HC (updated version)

The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol

Jai Lal Lulla, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude

Affiliations: Singapore Management University, Singapore; University of Bamberg, Germany; King’s College London, United Kingdom; Heidelberg University, Germany

AI summary: Through a controlled experimental protocol, this study examines how configuration mechanisms affect the build-versus-buy behavior of agentic AI coding tools such as Claude Code and OpenAI Codex, and will release a reusable benchmark dataset and analysis pipeline.

Comments: 14 pages, 1 table. Accepted at the 20th International Symposium on Empirical Software Engineering and Measurement (ESEM 2026), Registered Reports track

Abstract

Agentic AI coding tools write code with increasing autonomy and in doing so decide when to import a library and when to implement functionality from scratch. These decisions, whether to build functionality from scratch or buy into an external library, hereafter build-versus-buy, carry direct consequences for software security, licensing compliance, performance, and long-term maintainability. Yet no controlled experimental study has examined what governs build-versus-buy decisions in agentic AI coding tools. Configuration mechanisms, i.e., the means by which developers tailor agentic AI coding tool behavior to a project or workflow, are one of the primary means by which practitioners can influence these decisions. However, it is unclear which configuration mechanisms influence build-versus-buy decisions most effectively. We present a pre-registered protocol to study how configuration mechanisms alter build-versus-buy behavior in two popular agentic AI coding tools: Claude Code and OpenAI Codex. We will execute controlled programming tasks drawn from a benchmark of staged projects, each constructed around identifiable build-versus-buy points, and will manipulate the configuration supplied to each tool, ranging from no configuration, through context files with soft preferences and explicit prohibitions, to Skills (instructions that can be autonomously discovered), MCP-enabled library discovery tools, and permission controls, measuring which libraries the tool selects, whether it discloses newly introduced libraries, and whether those disclosures are complete and accurate. Nine pre-registered hypotheses structure the protocol. The resulting benchmark dataset and analysis pipeline will be released as a reusable artifact for evaluating build-versus-buy behavior in agentic AI coding tools.

2606.03906 2026-06-03 cs.AI (updated version)

scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

Jiabei Cheng, Jingbo Zhou, Jun Xia, Changkai Li, Zhen Lei, Chang Yu, Stan Z. Li

Affiliations: Westlake University; Shanghai Jiao Tong University; Zhejiang University; The Hong Kong University of Science and Technology (Guangzhou); Xidian University; Institute of Automation, Chinese Academy of Sciences

AI summary: For single-cell multi-omics modality translation, the paper proposes scTranslation, a benchmark with diverse datasets, state-of-the-art models, and comprehensive evaluation metrics, and systematically studies influencing factors such as feature selection, feature quality, and few-shot settings.

Abstract

Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. To address this, we present scTranslation, a comprehensive benchmark for single-cell multi-omics modality translation tasks. It includes diverse translation datasets, integrates state-of-the-art models, and provides comprehensive evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few-shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development. The benchmark is open-sourced to facilitate future research. The code is anonymously released at https://github.com/Bunnybeibei/scTranslation.

2606.03895 2026-06-03 cs.OS cs.AI cs.CR (updated version)

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

Yingqi Zhang

AI summary: Proposes the Agent libOS runtime, which models LLM agents as AgentProcess objects with process identity, lifecycle, capability control, and audit records, and achieves safe scheduling and resource control through libc-like tool wrappers and a runtime primitive boundary.

Comments: 14 pages, 1 figure, 2 tables

A Training-Free Mixture-of-Agents Framework Based on LLMs and Knowledge Graphs for Multi-Document Summarization

Cuong Vuong Tuan, Trang Mai Xuan, Tien-Cuong Nguyen, Vu-Duc Ngo, Thien Van Luong

Affiliations: Faculty of Artificial Intelligence and Data Science, Phenikaa University; VNPT AI, VNPT Group; MobiFone Research and Development Center, MobiFone Corporation; Business AI Lab, Faculty of Data Science and Artificial Intelligence, National Economics University, College of Technology

AI summary: Proposes a training-free mixture-of-agents framework that combines large language models and knowledge graphs. It decomposes summarization into specialized agents (extractive selection, knowledge-aware abstraction, iterative refinement) and unifies their outputs with a multi-perspective consistency mechanism, reaching state-of-the-art or competitive performance on English and Vietnamese datasets.

Comments: Accepted by Neural Computing and Applications

Abstract

Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the effectiveness and adaptability of our modular design.

2606.03866 2026-06-03 cs.IR cs.AI cs.CL (updated version)

Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu, Peng Jiang, Kun Gai

Affiliations: Kuaishou Technology; Unaffiliated

AI summary: Proposes the Taiji framework, which generates high-quality CoT data through reverse-engineered reasoning and open-ended rejection sampling, and uses Pareto Optimal Policy Optimization (POPO) to adaptively adjust cross-domain reward weights, targeting a Pareto-optimal trade-off between LLM semantic knowledge and recommendation ID features. Deployed on Kuaishou's advertising platform, it serves over 400 million users daily.

Comments: 8 pages, 2 figures

Abstract

Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry. However, aligning the LLM's semantic space with the recommender's ID space via post-training (e.g., SFT and RL) remains challenging. Existing LLM4Rec paradigms are bottlenecked by two main issues: (1) the difficulty of measuring and improving chain-of-thought (CoT) quality in open-domain recommendation during SFT, and (2) the neglect of the trade-off between LLM semantic rewards and recommendation preference rewards during RL alignment. Inspired by these challenges, we present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems. To overcome the SFT bottleneck, we utilize reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific CoT data. To resolve the RL alignment issue, we propose Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights. Theoretically, it achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences. Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji. Deployed on Kuaishou's advertising platform since May 2026, Taiji currently serves over 400 million users daily, yielding significant commercial revenue and demonstrating its robust scalability in web-scale environments.

2606.03858 2026-06-03 cs.AI (updated version)

PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

Zetian Ouyang, Linlin Wang, Gerard de Melo, Liang He

Affiliations: East China Normal University; Hasso Plattner Institute, University of Potsdam

AI summary: Proposes PyraMathBench, a hierarchical benchmark that evaluates LLMs by integrating numerical processing with mathematical reasoning, and introduces the SOLVE module and the IRPO optimization method to improve numerical-mathematical synergy.

Abstract

Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs' performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs' numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.

2606.03852 2026-06-03 cs.SE cs.AI (updated version)

EvoDS: A Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

Zherui Yang, Fan Liu, Yansong Ning, Hao Liu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes EvoDS, which combines autonomous skill acquisition and adaptive context compression with reinforcement learning so that a data science agent can self-evolve and substantially improve its performance on multi-stage, iterative tasks.

Comments: Accepted by KDD2026

DOI: 10.1145/3770855.3818002

Abstract

Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.

2606.03829 2026-06-03 cs.AI (updated version)

BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

Alex Wang, Georg Meinhardt, Jacob Katz, Joseph H. Kim, Pratyush K. Chaudhary, Chase Blagden, Eric Xu

Affiliations: Rogo; OpenAI

AI summary: To address the under-measurement of auditable derivations in financial-research answers, the paper proposes BigFinanceBench, a benchmark of 928 expert-authored tasks, each with a point-weighted rubric, that evaluates the full derivation rather than only the final answer. The best system reaches only a 58.8% rubric score, leaving substantial headroom.

Abstract

Financial-research answers are decision-relevant only when another analyst can audit how they were produced: which source was chosen, which period and accounting definition were used, which assumptions were made, and how the calculation was performed. Existing finance benchmarks largely evaluate isolated subskills or final answers, leaving the auditable derivation itself under-measured. We introduce BigFinanceBench, a 928-item expert-authored benchmark of open-ended financial-research tasks in which each item pairs a ground-truth reference answer with a point-weighted rubric that decomposes the derivation into independently checkable steps. BigFinanceBench is workflow-grounded in that it evaluates the full derivation rather than only the final output. Across 36,241 rubric points, the benchmark supports partial-credit evaluation and localization of failures across the analyst workflow. Evaluating ten current frontier and open-weight agents, we find substantial headroom: the best system reaches only 58.8% rubric score, final-answer accuracy is a useful but lossy proxy for derivation quality, and model capability varies non-uniformly across financial workflows.
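Partial-credit scoring against a point-weighted rubric can be sketched as below. The rubric steps, point values, and the `rubric_score` helper are hypothetical and only illustrate the idea; the benchmark's actual rubrics and grading procedure are not described in the abstract.

```python
def rubric_score(rubric, passed_steps):
    """Fraction of rubric points earned, given the set of steps judged correct."""
    total = sum(points for _, points in rubric)
    earned = sum(points for step, points in rubric if step in passed_steps)
    return earned / total


# Hypothetical rubric: (independently checkable step, point weight).
rubric = [
    ("source selection", 3),
    ("period and accounting definition", 2),
    ("calculation", 4),
    ("final answer", 1),
]

# Sound derivation with a wrong period and therefore a wrong final number.
derivation_mostly_right = rubric_score(rubric, {"source selection", "calculation"})
# Correct final number reached without a checkable derivation.
answer_only = rubric_score(rubric, {"final answer"})
```

The two cases show why final-answer accuracy can be a lossy proxy: the first response earns most of the points despite a wrong final number, the second earns few despite a correct one.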

2606.03823 2026-06-03 cs.AI cs.CY cs.NE (updated version)

Calibrating Urban Traffic Simulation from Sparse Road Observations via Genetic Optimization

Hunter Sawyer, Jesse Roberts, Simon Matei

Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes a genetic-algorithm-based framework that calibrates urban traffic simulations from sparse road observations without detailed employment data, producing traffic flows that correlate well with real-world measurements and job distributions in qualitative agreement with census data.

Abstract

Urban traffic simulation is a critical tool for infrastructure planning, including the placement of electric vehicle charging stations. However, realistic traffic simulation across many cities is hindered by two fundamental data limitations: detailed real-world traffic measurements are available for only a small fraction of road segments in most cities, and employment distribution data critical for modeling commuter traffic is rarely available at the resolution needed for simulation. This paper presents a genetic algorithm-based framework that directly addresses both limitations, calibrating urban traffic simulations from sparse road observations without requiring detailed job location data. Using the SUMO traffic simulation platform for Greensboro, North Carolina, our approach optimizes job distributions and gate-traffic parameters to align simulated traffic with a small sample of roads with known traffic-flow rates. We demonstrate that this approach produces simulated traffic that correlates well with real-world measurements, generalizes to road segments withheld from training, and produces job distributions that show promising qualitative agreement with census employment data despite never directly training on that employment data. This work demonstrates that realistic urban traffic simulation can be achieved from minimal real-world observations, offering a scalable and data-light approach to simulation calibration that reduces the barrier to deploying traffic models across diverse cities.
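The calibration loop can be sketched with a toy genetic algorithm. Everything here is an assumption for illustration: a fixed linear map stands in for the SUMO simulator, the observed flows are made up, and the selection, crossover, and mutation choices are generic rather than the paper's.

```python
import random

# Toy stand-in for the simulator: flow on each observed road is a fixed
# linear mix of the job counts in three zones (hypothetical weights).
ROAD_WEIGHTS = [[0.5, 0.2, 0.0], [0.1, 0.6, 0.3]]
OBSERVED_FLOWS = [130.0, 170.0]  # hypothetical sparse measurements


def simulate(jobs):
    return [sum(w * j for w, j in zip(row, jobs)) for row in ROAD_WEIGHTS]


def error(jobs):
    """Squared error between simulated and observed flows on the sampled roads."""
    return sum((s - o) ** 2 for s, o in zip(simulate(jobs), OBSERVED_FLOWS))


def calibrate(generations=200, pop_size=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 500) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=error)
        parents = pop[: pop_size // 2]  # selection: keep the fitter half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]        # uniform crossover
            child = [max(0.0, g + rng.gauss(0, 5)) for g in child]  # Gaussian mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=error)


best_jobs = calibrate()
```

Because the fitter half of the population is carried over unchanged, the best error never increases from one generation to the next. Note that two observed roads cannot pin down three zone parameters uniquely, which is the same identifiability issue that makes held-out road segments a useful check.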

2606.03814 2026-06-03 cs.AI (updated version)

Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

Kelsey Rainey, Jesse Roberts

Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes a rubric-aware, multitask fine-tuned BART model for automated grading of C++ programming assignments. It jointly predicts numeric grades and grade buckets and matches grade distributions so that grading follows instructor behavior more closely.

Abstract

This paper investigates rubric-aware, multitask fine-tuning of transformer models for automated grading of introductory C++ programming assignments, with the goal of producing grade predictions that better reflect instructor grading behavior than general-purpose LLMs. Using multi-semester CS1 data, student submissions are paired with numeric scores, letter-grade buckets, and assignment rubrics, then preprocessed into unified sequences for transformer input. A BART encoder-decoder with LoRA adaptation is trained to jointly predict numeric grades and grade buckets, augmented with a distribution-matching term to align predicted and empirical grade distributions, an evaluation dimension often overlooked in prior work. Experiments compare single-task and multitask training, hard one-hot versus fuzzy and boundary-based soft labels, and rubric versus no-rubric conditions, with additional T5 and pairwise-pretrained variants. Results show that multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines. Fully fine-tuned T5 further improves distributional fidelity, while pairwise pretraining reduces numeric error at the cost of minority-class sensitivity. Collectively, the findings suggest that calibration-aware, rubric-guided training produces more instructor-like grading behavior than accuracy-optimized alternatives.

2606.03812 2026-06-03 cs.AI (updated version)

Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis

Sanjay Das, Ran Elgedawy, Ethan Seefried, Ryan Burchfield, Tirthankar Ghosal

Affiliations: Oak Ridge National Laboratory

AI summary: Proposes HAZDIAL, a framework that uses structured multi-agent, multi-turn dialogue (adversarial debate and constructive discussion) to improve the quality of NLP-based hazard identification, optimizes agent interaction algorithmically, and evaluates the configurations against single-pass baselines.

Abstract

Operational safety in high-stakes domains such as industrial process control, autonomous systems, and safety-critical systems demands reliable hazard identification. While large language models (LLMs) have shown promise in automating safety analysis tasks, single-turn, monolithic inference is brittle: it lacks the self-correction, deliberation, and contextual refinement that safety engineers apply iteratively. In this paper, we introduce HAZDIAL, a framework that investigates whether structured agentic dialogue (multi-agent, multi-turn interaction) improves the quality of NLP-based hazard identification over single-pass baselines. We systematically compare two dialogue modalities, adversarial debate and constructive discussion, and propose an algorithm-based agentic interaction optimization. We evaluate all configurations against a curated golden dataset using standard classification metrics (accuracy, precision, recall, F1) and novel dialogue metrics. This work advances the intersection of dialogue systems, multi-agent reasoning, and AI safety, providing empirical evidence for dialogue-driven hazard analysis.
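The standard classification metrics named in the abstract can be computed as in the sketch below. The labels are hypothetical (1 = hazard present) and the function is a plain-Python illustration, not the paper's evaluation code; the novel dialogue metrics are not described in the abstract and are not covered here.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = hazard)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical golden labels versus one configuration's predictions.
golden = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
metrics = classification_metrics(golden, predicted)
```

In hazard identification, recall is usually the metric to watch, since a missed hazard (false negative) tends to cost more than a false alarm.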


2606.03811 2026-06-03 cs.CR cs.AI cs.LG 版本更新

Signed Spiking Neuron Based on Orthogonal Easy-Axis Magnetic Tunnel Junctions

Huannan Zheng, Jingli Liu, Kezhou Yang

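AI summary: Proposes a compact signed spiking neuron based on a magnetic tunnel junction with orthogonal easy axes. The orthogonal easy axes of the free and pinned layers enable bipolar spike generation, and the magnetization dynamics are mapped onto the membrane-potential evolution of a signed LIF neuron, reaching 91.06% accuracy on CIFAR-10 and 77.40% on CIFAR10-DVS.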

2606.03777 2026-06-03 cs.AI cs.CR q-fin.RM 版本更新

From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework

从控制边界到保险索赔：通过CER框架重构AI中介损失

Alex Leung, Rex Zhang, Kentaroh Toyoda, SiewMei Loh

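AI summary: Proposes the CER framework (control boundary, evidence reconstruction, insurance response) for diagnosing and reconstructing losses caused by generative or agentic AI systems, in support of insurance claims.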


AI中文摘要

通过受保组织的生成式或代理式AI系统产生的AI损失需要状态重构，而不仅仅是事件重构，因为相关状态会随着系统推理、检索、调用工具和行动而改变。相关的问题不仅是发生了什么损失，还包括系统被允许做什么、实际做了什么，以及重构的损失能否支持保险索赔。本文处理受保人的AI系统处于因果链中的损失，包括外部触发的故障，如提示注入、检索增强生成（RAG）投毒、恶意工具输出、凭证滥用和数据投毒。具体而言，本文介绍了CER，一种用于AI残余风险转移的用例级诊断。C（控制边界）询问系统是否具有可执行的操作范围。E（证据重构）询问是否可以从保留的工件中重构系统状态和因果链。R（保险响应）询问重构的损失是否被保险：保险覆盖是否在市场上可用并为受保人投保，以及支持保险索赔所需的证据。本文做出三项贡献：定义了AI特定的重构问题，通过CER操作化该问题，并指定了AI重构的索赔级证据。公开示例包括报道的PocketOS和Replit代理数据库删除事件，以及作为已裁决的输出/依赖案例的Moffatt诉加拿大航空案。关键词：AI系统；CER框架；残余风险转移；代理式AI；生成式AI；AI保险；证据重构。

英文摘要

AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because the relevant state changes as the system reasons, retrieves, calls tools, and acts. The relevant question is not only what loss occurred, but what the system was allowed to do, what it actually did, and whether that reconstructed loss can support insurance claim recovery. This paper addresses losses in which the insured's AI system is in the causal chain, including externally triggered failures such as prompt injection, retrieval-augmented generation (RAG) poisoning, malicious tool output, credential misuse, and data poisoning. Specifically, this paper introduces CER, a use-case-level diagnostic for AI residual risk transfer. C (control boundary) asks whether the system had an enforceable operating envelope. E (evidence reconstruction) asks whether the system state and causal chain can be reconstructed from retained artifacts. R (insurance response) asks whether the reconstructed loss is insured: whether insurance coverage is available in the market and placed for the insured, together with the proof needed to support insurance claim recovery. The paper makes three contributions: it defines the AI-specific reconstruction problem, operationalizes that problem through CER, and specifies claim-grade evidence for AI reconstruction. Public examples include the reported PocketOS and Replit agentic database-deletion incidents and Moffatt v. Air Canada as an adjudicated output/reliance case. Keywords: AI systems; CER framework; residual risk transfer; agentic AI; generative AI; AI insurance; evidence reconstruction.


2606.03770 2026-06-03 cs.DC cs.AI 版本更新

E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments

E2LLM：异构边缘/雾环境中高效LLM服务

Truong-Thanh Le, Amir Taherkordi, Hoang-Loc La, Frank Eliassen, Phuong Hoai Ha, Peiyuan Guan

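Affiliations: University of California, Berkeley

AI summary: Proposes the E2LLM framework, which replicates the model across multiple device groups with model parallelism inside each group, and combines genetic-algorithm clustering with dynamic-programming partitioning to deploy LLMs efficiently in resource-constrained heterogeneous edge/fog environments; average waiting time under high demand drops by more than 50% relative to the Splitwise baseline.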


AI中文摘要

大型语言模型（LLM）已成为现代应用不可或缺的一部分，但其部署仍具挑战性。除了执行模型本身，实际部署必须解决成本效率、低延迟和最优资源利用问题。传统方法通常假设整个模型可以托管在单个设备上，这在许多现实场景中不成立，尤其是在设备资源受限的边缘和雾环境中。本文介绍了E2LLM，一个旨在在此类资源有限环境中实现高效LLM部署的框架。E2LLM并非简单地将单个模型分区到所有可用设备，而是将完整模型复制到多个设备组（副本），并在每个副本内应用模型并行。每个副本根据其处理输入和输出令牌的效率被分配专门角色PREFILL或DECODER。这种分离利用了LLM推理这两个阶段之间的固有差异。为了有效组织设备，我们利用遗传算法形成最大化系统性能的集群。在每个集群内，我们应用动态规划确定最优分区策略，以最小化模型并行执行中的瓶颈。实验结果表明，我们的方法能够稳健地适应不同工作负载，包括输入和输出令牌长度显著变化的场景。与Splitwise基线相比，E2LLM在高需求条件下将平均等待时间降低了50%以上。

英文摘要

Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the models themselves, practical deployment must address cost efficiency, low latency, and optimal resource utilization. Conventional approaches typically assume that an entire model can be hosted on a single device, which does not hold in many real-world scenarios, particularly in Edge and Fog environments where device resources are constrained. In this paper, we introduce E2LLM, a framework designed to enable efficient LLM deployment in such resource-limited settings. Rather than simply partitioning a single model across all available devices, E2LLM replicates the full model across multiple groups of devices (replicas) and applies model parallelism within each replica. Each replica is assigned a specialized role, PREFILL or DECODER, based on its efficiency in handling input and output tokens. This separation leverages the inherent differences between these two phases of LLM inference. To effectively organize devices, we utilize a Genetic Algorithm to form clusters that maximize system performance. Within each cluster, we apply Dynamic Programming to determine an optimal partitioning strategy that minimizes bottlenecks in model-parallel execution. Experimental results demonstrate that our approach adapts robustly to varying workloads, including scenarios with significant variation in input and output token lengths. Compared to the Splitwise baseline, E2LLM reduces average waiting time by over 50% under high-demand conditions.
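The abstract does not spell out the dynamic program used inside each cluster. One common formulation, shown here only as a sketch under stated assumptions, splits a chain of layers into contiguous stages over an ordered list of devices so that the slowest stage is as fast as possible. The cost model (stage time = summed layer cost divided by device speed), the fixed device order, and all names are assumptions for illustration, not E2LLM's actual method.

```python
from itertools import accumulate


def partition_layers(layer_costs, device_speeds):
    """Contiguously assign layers to devices, minimizing the bottleneck stage time.

    Assumed cost model: a stage's time is the sum of its layer costs divided by
    the speed of the device that runs it. A device may receive zero layers.
    Returns (bottleneck_time, [(start, end), ...]) with half-open layer ranges.
    """
    n, m = len(layer_costs), len(device_speeds)
    prefix = [0] + list(accumulate(layer_costs))
    inf = float("inf")
    # best[d][i]: minimal bottleneck when the first i layers run on the first d devices.
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for d in range(1, m + 1):
        for i in range(n + 1):
            for j in range(i + 1):  # device d-1 runs layers j .. i-1
                stage = (prefix[i] - prefix[j]) / device_speeds[d - 1]
                candidate = max(best[d - 1][j], stage)
                if candidate < best[d][i]:
                    best[d][i] = candidate
                    cut[d][i] = j
    # Walk the stored cut points backwards to recover each device's layer range.
    bounds, i = [], n
    for d in range(m, 0, -1):
        j = cut[d][i]
        bounds.append((j, i))
        i = j
    bounds.reverse()
    return best[m][n], bounds
```

With two equally fast devices and layer costs `[4, 2, 2, 4]`, the sketch splits the chain in the middle; making the first device twice as fast shifts one more layer onto it.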


2606.03762 2026-06-03 cs.LG cs.AI 版本更新

Revealing the Structure of Do-Calculus Reasoning via Derivation Graphs

Clément Yvernes, Emilie Devijver, Marianne Clausel, Eric Gaussier

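AI summary: Introduces derivation graphs to represent how do-calculus rules are applied and composed, characterizes the full space of observational and interventional probabilities that are equivalent under do-calculus, shows that equivalent transformations can be achieved with at most four rule applications, and uses equivalent causal queries to obtain more efficient estimators.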

Comments Accepted at ICML 2026

2606.03705 2026-06-03 cs.AI 版本更新

Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs

图上的代码：通过大型语言模型在知识图谱上进行迭代式程序化推理

Weiwei Ding, Zixuan Li, Long Bai, Zhuo Chen, Kun Su, Fei Wang, Xiaolong Jin, Jin Zhang, Jiafeng Guo, Xueqi Cheng

Affiliations: Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; Shandong University; Shandong University-Weihai Research Institute of Industrial Technology

AI summary: Proposes the Code-on-Graph (CoG) framework, which represents knowledge-graph schemas as Python classes and generates executable code, addressing the inflexible operators and unscalable knowledge injection of existing LLM-KG integration; improves results on WebQSP, CWQ, and GrailQA by up to 10.5%.


AI中文摘要

知识图谱（KGs）被广泛用于缓解大型语言模型（LLMs）的局限性，如知识过时和幻觉。现有的LLM-KG集成框架通常依赖预定义操作符从知识图谱中检索事实知识，并将其注入提示以生成答案。这种范式面临两个关键瓶颈：1）不灵活性：预定义操作符范围有限，因此缺乏足够的组合表达能力来完全捕捉知识图谱问题所需的复杂语义。2）不可扩展性：将事实知识直接注入提示限制了处理大规模事实知识的可扩展性。为了解决这两个瓶颈，我们提出了Code-on-Graph（CoG），一个用于LLM-KG集成的程序化推理框架。具体来说，给定每个推理步骤检索到的事实知识，CoG首先识别相应的知识图谱模式，并将这些模式表示为Python类，这些类作为检索事实的抽象接口。然后，它生成基于这些类的可执行代码，在执行过程中，检索到的事实被实例化为相应类的对象。这种设计实现了灵活的基于代码的推理，同时避免将大规模事实知识直接注入提示。在WebQSP、CWQ和GrailQA上的实验表明，CoG比之前的最先进模型性能提升高达10.5%。

英文摘要

Knowledge Graphs (KGs) are widely used to mitigate the limitations of Large Language Models (LLMs), such as outdated knowledge and hallucinations. Existing LLM-KG integration frameworks typically rely on predefined operators to retrieve factual knowledge from KGs and inject it into prompts for answer generation. This paradigm faces two critical bottlenecks: 1) Inflexibility: The predefined operators are limited in scope and thus lack sufficient compositional expressiveness to fully capture the complex semantics required by KG questions. 2) Unscalability: Direct injection of factual knowledge into prompts limits scalability in handling large-scale factual knowledge. To address these two bottlenecks, we propose Code-on-Graph (CoG), a programmatic reasoning framework for LLM-KG integration. Specifically, given the factual knowledge retrieved at each reasoning step, CoG first identifies the corresponding KG schemas and represents these schemas as Python classes, which serve as abstract interfaces to the retrieved facts. It then generates executable code grounded in these classes, with the retrieved facts instantiated as objects of the corresponding classes during execution. This design enables flexible code-based reasoning while avoiding the direct injection of large-scale factual knowledge into prompts. Experiments on WebQSP, CWQ, and GrailQA demonstrate that CoG outperforms prior state-of-the-art models by up to 10.5%.
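To make the idea of schemas as Python classes concrete, here is a small sketch. The `Film` and `Director` classes, the relation names, and the sample facts are all hypothetical and invented for illustration; they are not CoG's actual schemas, prompts, or data. The point is only that retrieved triples become objects, and the reasoning step is ordinary code over those objects rather than facts pasted into a prompt.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Film:
    """Hypothetical schema class for film entities."""
    name: str
    release_year: Optional[int] = None


@dataclass
class Director:
    """Hypothetical schema class for director entities."""
    name: str
    films: list = field(default_factory=list)


def instantiate(triples):
    """Turn retrieved (subject, relation, object) facts into schema objects."""
    films, directors = {}, {}
    for subj, rel, obj in triples:
        if rel == "release_year":
            films.setdefault(subj, Film(subj)).release_year = int(obj)
        elif rel == "directed":
            film = films.setdefault(obj, Film(obj))
            directors.setdefault(subj, Director(subj)).films.append(film)
    return directors


def films_after(directors, director_name, year):
    """Example of a generated reasoning step: a filter expressed as code."""
    director = directors[director_name]
    return sorted(
        f.name for f in director.films
        if f.release_year is not None and f.release_year > year
    )
```

Because the facts live in objects, a composite question (filter, then sort, then count) becomes a few lines of code instead of a longer prompt.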


2606.03704 2026-06-03 cs.AI cs.CE cs.CY 版本更新

Dynamic Objective Selection with Safeguards and LLM Oversight for Financial Decision-Making

动态目标选择与防护机制及大语言模型监督在金融决策中的应用

Keigo Sakurai, Takahiro Ogawa, Miki Haseyama, Anjyu Anan, Kei Nakagawa

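Affiliations: Hokkaido University; Nomura Asset Management Co., Ltd.; Kobe University; Osaka Metropolitan University

AI summary: Proposes DOSS, which casts objective selection as a classification problem with sequential rolling-window updates and combines confidence-aware gating with LLM oversight to select objectives dynamically in financial decision-making, reducing the risk of misselection and excessive switching.

Comments Accepted to the 2nd Workshop on Advances in Financial AI: Towards Agentic and Responsible Systems, at ICLR 2026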


AI中文摘要

金融决策任务（如股票推荐和投资组合配置）通常估计未来收益和风险，然后为投资者选择交易或配置，所选优化目标往往决定实际表现。然而，由于市场条件随时间变化，固定目标在不同市场状态下可能次优，而依赖潜在状态估计的状态切换流程可能噪声大或延迟，频繁切换会增加交易成本和运营不稳定性。本文提出DOSS（带防护机制的动态目标选择），一种基于学习的选择器，直接从近期收益的可解释统计摘要中为每个时间点选择决策相关的目标函数，从少量候选（如追求收益、规避损失和风险调整）中选择，无需引入中间状态变量。DOSS将目标选择形式化为目标上的分类问题，并通过滚动窗口进行顺序更新以做出前瞻性选择，避免时间泄漏，同时为每个提议输出置信度分数。为缓解部署中的误选和过度切换，DOSS应用置信度感知门控，并带有故障安全机制，将低置信度提议覆盖为保守默认值，并实施与切换频率相关的显式控制。我们进一步通过将大语言模型（LLM）定位为监督组件而非新目标生成器来整合治理：LLM仅限于接受提议目标或将其覆盖为预定义安全默认值，并在需要时由确定性基于规则的约束触发覆盖。

英文摘要

Financial decision-making tasks such as stock recommendation and portfolio allocation typically estimate future return and risk and then select trades or allocations for an investor, and the chosen optimization objective often determines realized performance. However, because market conditions evolve over time, a fixed objective can be suboptimal across regimes, while regime-switching pipelines that rely on latent regime estimates can be noisy or delayed and frequent switching can increase turnover and operational instability. In this paper, we propose DOSS (Dynamic Objective Selection with Safeguards), a learning-based selector that directly chooses the decision-relevant objective function at each time point from interpretable statistical summaries of recent returns, selecting among a small set of candidates (e.g., return-seeking, loss-averse, and risk-adjusted) without introducing intermediate regime variables. DOSS formulates objective selection as a classification problem over objectives and performs sequential updates with a rolling window to make forward-looking selections without temporal leakage, while also outputting a confidence score for each proposal. To mitigate misselection and excessive switching in deployment, DOSS applies confidence-aware gating with a fail-safe that overrides low-confidence proposals to a conservative default and enforces explicit controls tied to switching frequency. We further integrate governance by positioning a Large Language Model (LLM) as an oversight component rather than a generator of new objectives: the LLM is restricted to accept a proposed objective or override it to a predefined safe default, with deterministic rule-based constraints triggering overrides when needed.
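The confidence-aware gate and the switching control described above could look roughly like the sketch below. The thresholds, the choice of `"risk_adjusted"` as the conservative default, and the minimum holding period are assumptions made for illustration; the abstract does not give DOSS's actual rules or parameter values.

```python
def gate(proposed, confidence, current, steps_since_switch,
         *, default="risk_adjusted", min_confidence=0.6, min_hold=5):
    """Decide which objective to use at this time step.

    All parameter values here are hypothetical.
    - Fail-safe: a low-confidence proposal is overridden to the conservative default.
    - Switching control: a change of objective is blocked until the current
      objective has been held for at least `min_hold` steps.
    """
    if confidence < min_confidence:
        return default
    if proposed != current and steps_since_switch < min_hold:
        return current
    return proposed
```

An oversight component restricted to "accept or override to the default" would sit after this gate and could only return either its input or `default`, never a new objective.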


2606.03692 2026-06-03 cs.AI cs.CL 版本更新

SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

SkillPyramid：一种用于自我进化智能体的层次化技能整合框架

Yuan Xiong, Ziqi Miao, Qian Chen, Lijun Li, Yequan Wang, Shizhu He, Jun Zhao, Kang Liu

Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Beijing Academy of Artificial Intelligence, Beijing, China

AI summary: Addresses agents' lack of systematic skill construction, accumulation, and transfer with SkillPyramid, a hierarchical skill consolidation framework whose self-evolution mechanism composes, validates, and incorporates new skills during task execution; across three benchmarks, average reward rises by 38.0% and execution steps fall by 27.7%.


AI中文摘要

最近的AI智能体可以灵活调用技能来解决复杂任务，但其长期改进从根本上受到缺乏系统性技能构建、积累和迁移的限制。特别是，没有统一的技能整合框架，智能体倾向于在不同任务中冗余构建相似能力，无法有效将经验转化为可复用资产，并且难以将任务特定技能泛化到新场景。为了解决这一限制，我们提出了SkillPyramid，一个技能整合框架，它重用现有技能经验以实现更广泛的任务泛化。在层次化技能拓扑上运行，SkillPyramid进一步引入了一种自进化机制，使智能体能够在任务执行过程中组合、验证和吸收新技能。在ALFWorld、WebShop和ScienceWorld上使用四个骨干模型的实验表明，SkillPyramid将平均奖励提高了38.0%，并将执行步骤减少了27.7%。总体而言，我们的方法将技能集合从静态资源池转变为动态进化系统。

英文摘要

Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly construct similar capabilities across different tasks, are unable to effectively transform experience into reusable assets, and struggle to generalize task-specific skills to novel scenarios. To address this limitation, we propose SkillPyramid, a skill consolidation framework that reuses existing skill experience for broader task generalization. Operating on a hierarchical skill topology, SkillPyramid further introduces a self-evolution mechanism that enables agents to compose, validate, and incorporate new skills during task execution. Experiments on ALFWorld, WebShop, and ScienceWorld across four backbone models show that SkillPyramid substantially increases the average reward by 38.0% and reduces execution steps by 27.7%. Overall, our method transforms a skill collection from a static resource pool into a dynamic evolution system.


2606.03689 2026-06-03 cs.LG cs.AI 版本更新

Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models

保持存活：基于表格基础模型的无审查生存分析

Mariana Vargas Vieyra

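Affiliations: GitHub

AI summary: Proposes a training-free survival regression method that uses a tabular foundation model to predict event times and iteratively impute right-censored data, building an accelerated failure time model that performs on par with trained models on standard benchmarks.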


AI中文摘要

生存分析是一种统计框架，用于建模直到某个感兴趣事件发生的时间跨度。它广泛应用于包括医疗保健和客户流失预测在内的多个领域，其适用性的一个核心挑战在于事件时间被部分观测或存在右删失。近年来，表格基础模型因其能够在单次前向传播中执行预测任务而无需数据集特定的参数拟合，引起了广泛关注。尽管取得了成功，但由于右删失的存在，它们在时间-事件数据预测任务中的应用仍然困难。在这项工作中，我们提出了一种无需训练的生存回归方法，通过利用表格基础模型来预测事件时间并迭代地填补右删失数据。我们的方法使用表格基础模型构建加速失效时间模型，除了拟合单个标量参数外无需训练。随后，基于Buckley-James估计器，我们引入了一种非参数上下文内估计器来处理右删失数据。我们在标准生存分析基准上的实验表明，我们的方法与几种需要训练的参数和半参数生存回归模型（包括Cox回归和参数加速失效时间模型）相比具有竞争力。

英文摘要

Survival Analysis (SA) is a statistical framework that models the time span until some event of interest occurs. It is widely used in several domains, including healthcare and churn prediction, and a central challenge in applying it is that the time of the event is often only partially observed, i.e., right-censored. Tabular Foundation Models (TFMs) have attracted significant interest in recent years due to their ability to perform prediction tasks in a single forward pass, requiring no dataset-specific parameter fitting. Despite their success, their application to prediction tasks on time-to-event data remains difficult due to right censoring. In this work, we present a training-free approach to survival regression that leverages TFMs to both predict the time of the event and iteratively impute right-censored data. Our method uses a TFM to construct an Accelerated Failure Time (AFT) model requiring no training beyond fitting a single scalar parameter. Subsequently, by building on the Buckley-James estimator, we introduce a non-parametric in-context estimator for right-censored data. Our experiments on standard survival analysis benchmarks show that our method is competitive with several parametric and semi-parametric survival regression models that require training, including Cox regression and parametric AFT models.
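For readers unfamiliar with Buckley-James-style imputation, the sketch below shows the classical building block: each right-censored time is replaced by its conditional expectation under the Kaplan-Meier estimate of the distribution. This is a simplified illustration applied directly to raw times. The classical estimator works on regression residuals, and the paper's in-context estimator built on a tabular foundation model is not reproduced here; the function names are assumptions.

```python
def km_masses(times, events):
    """Kaplan-Meier probability mass placed on each observed event time.

    `events[i]` is truthy if the event was observed, falsy if right-censored.
    """
    masses, surv = {}, 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        masses[t] = surv * deaths / at_risk
        surv *= 1 - deaths / at_risk
    return masses


def impute_censored(times, events):
    """Replace each censored time c by the KM estimate of E[T | T > c].

    If no event mass lies beyond c, the censored value is kept unchanged.
    """
    masses = km_masses(times, events)
    imputed = []
    for t, e in zip(times, events):
        tail = [(u, m) for u, m in masses.items() if u > t]
        total = sum(m for _, m in tail)
        if e or total == 0:
            imputed.append(float(t))
        else:
            imputed.append(sum(u * m for u, m in tail) / total)
    return imputed
```

In an iterative scheme, the imputed times would be fed back to the predictor, residuals recomputed, and the imputation repeated until the values stabilize.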


2606.03686 2026-06-03 cs.AI 版本更新

The DeepSpeak-Agentic Dataset

DeepSpeak-Agentic 数据集

Sarah Barrington, Maty Bohacek, Hany Farid

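AI summary: Presents DeepSpeak-Agentic, a dataset of 37 hours of semi-structured human-AI conversation video, intended for evaluating automated forensic identification of AI agents, studying the characteristics of human-AI interaction, and benchmarking large language models and AI-generated voice and face technologies.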

2606.03685 2026-06-03 cs.LG cs.AI 版本更新

A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners

监督微调的大语言模型规划器中世界模型恢复的深入探究

Patrick Emami, Nan Qiang, Peter Graf

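Affiliations: National Laboratory of the Rockies

AI summary: Uses interpretability experiments to study how supervised fine-tuning affects an LLM's ability to recover a world model in classical planning tasks, finding that fine-tuning leads models to linearly encode action validity and state predicates, and that broader state-space coverage yields more accurate world-model recovery.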

Comments 17 pages. Under review at TMLR


AI中文摘要

监督微调（SFT）改进了大语言模型（LLM）中的端到端经典规划，但这些模型是否也学会了表示和推理它们正在解决的规划问题？由于经典规划问题的相对复杂性以及端到端规划生成对LLM的挑战，探索这个问题一直很困难。在我们的工作中，我们设计并执行了一系列可解释性实验，通过检查微调LLM的内部表示和生成能力，全面探究世界模型恢复。我们发现：a) 对有效动作序列进行监督微调使LLM能够线性编码动作有效性和一些状态谓词。b) 难以使用输出概率对动作有效性进行分类的模型可能仍然学习到将有效动作与无效动作分开的内部表示。c) 微调期间更广泛的状态空间覆盖（例如来自随机游走数据）能更准确地恢复底层世界模型。总之，这项工作为将可解释性技术应用于规划LLM提供了一种方法，并产生了有助于揭示LLM中知识表示方式的见解。

英文摘要

Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.


2606.03678 2026-06-03 cs.AI 版本更新

EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

EvoDrive: 通过自我改进的LLM智能体实现安全关键自动驾驶的帕累托进化

Tong Nie, Yuewen Mei, Yihong Tang, Junlin He, Jie Deng, Jian Sun, Wei Ma

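AI summary: Proposes EvoDrive, the first LLM-based automated agentic evolution framework, which uses a simulator-grounded actor-critic architecture and a Pareto archive to optimize adversariality and realism jointly when generating safety-critical scenarios.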


AI中文摘要

生成安全关键场景对于验证和改进自动驾驶系统至关重要，但它本质上需要在最大化对抗性以暴露故障的同时保持真实性。现有方法通常通过手工设计的启发式方法来管理这种权衡，将生成限制在已知的先验知识中，忽视了未充分探索的模式。虽然最近开放式的智能体进化可以突破这一限制，但不受约束的通用智能体缺乏严格的模拟器接地，往往将多目标张力退化为单标量最大化。本文提出了EvoDrive，第一个基于LLM的自动化智能体进化框架，用于多目标场景生成。EvoDrive采用模拟器接地的演员-评论家架构，其中记忆驱动的演员迭代地提出对生成器的改进，评论家过滤掉不可信的候选者，而自我进化的世界评估器将有前途的候选者路由以优化模拟预算。EvoDrive进一步维护一个评估候选者的帕累托存档，以保留多样化的攻击-真实性权衡，并通过模拟反馈指导未来的进化。在MetaDrive和CARLA上的基准测试结果表明，EvoDrive不仅显著扩展了各种生成器的帕累托前沿，而且为策略训练生成了有价值的场景。

英文摘要

Generating safety-critical scenarios is essential for validating and improving autonomous driving systems, yet it inherently requires maximizing adversariality to expose failures while preserving realism. Existing methods usually manage this trade-off with handcrafted heuristics, confining generation to known priors and overlooking underexplored patterns. While recent open-ended agentic evolution can push this limit, unconstrained general agents lack strict simulator grounding and tend to collapse the multi-objective tension into single-scalar maximization. Here we present EvoDrive, the first automated, LLM-based agentic evolution framework for multi-objective scenario generation. EvoDrive employs a simulator-grounded actor-critic architecture where a memory-driven actor iteratively proposes improvements to the generators and critics filter out implausible candidates, and a self-evolving world evaluator routes promising proposals to optimize simulation budgets. EvoDrive further maintains a Pareto archive of evaluated candidates to preserve diverse attack-realism trade-offs and guide future evolution via simulation feedback. Benchmark results on MetaDrive and CARLA show that EvoDrive not only significantly expands the Pareto frontier across various generators, but also produces valuable scenarios for policy training.
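A Pareto archive of the kind mentioned above keeps every evaluated candidate that no other candidate beats on all objectives at once. The sketch below is a generic version, assuming each candidate is scored by a tuple such as (adversariality, realism) where higher is better on both; the scoring scheme and function names are illustrative assumptions rather than EvoDrive's implementation.

```python
def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_archive(archive, candidate):
    """Return the archive after offering `candidate` (a tuple of scores to maximize).

    The candidate is rejected if an archived entry dominates or equals it;
    otherwise it is added and every entry it dominates is dropped.
    """
    if any(dominates(kept, candidate) or kept == candidate for kept in archive):
        return archive
    return [kept for kept in archive if not dominates(candidate, kept)] + [candidate]
```

Keeping the whole front, rather than a single weighted score, is what preserves diverse trade-offs: a highly adversarial but less realistic scenario and a milder but very realistic one can both survive.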


2606.03664 2026-06-03 cs.NI cs.AI 版本更新

AUGUSTE: Online-Learning dApp for Predictive URLLC Scheduling

AUGUSTE: 用于预测性URLLC调度的在线学习dApp

Maxime Elkael, Michele Polese, Yunseong Lee, Koichiro Furueda, Tommaso Melodia

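AI summary: To address the high latency that scheduling requests cause in URLLC, proposes AUGUSTE, an online-machine-learning MAC scheduling framework that predicts packet arrivals and allocates resources in advance, achieving the best latency-versus-overhead trade-off on a real 5G testbed.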


AI中文摘要

超可靠低延迟通信（URLLC）是5G的主要驱动力之一，3GPP为工业自动化、车联网（V2X）、战术边缘网络和无人系统控制等应用设定了1-10毫秒的延迟目标。多年后，真实的5G时分双工（TDD）网络的中位上行链路（UL）往返时间仍在50-70毫秒范围内，这主要是因为用户设备（UE）在发送UL数据之前必须完成调度请求（SR）过程。现有的补救措施，主要是配置授权（CG）调度，仅能消除严格周期性流量的这一开销，并需要跨层同步，这限制了其采用。我们提出了AUGUSTE（通过自适应时间估计实现URLLC的预测性上行授权），这是一种基于学习的介质访问控制（MAC）调度框架，它将在线机器学习（ML）模型嵌入UL调度器中，以预测数据包到达并在发出SR之前主动分配资源。一个自适应状态机在收集无偏到达统计信息的学习阶段和利用学习到的预测仅在预期有流量时进行调度的自信阶段之间交替。我们在运行OpenAirInterface的真实5G测试平台上，针对三种URLLC流量模式（请求-响应、ML边缘推理和周期性自主报告）评估了AUGUSTE，结果表明它在延迟-开销权衡上达到了最佳可行点：它以约十分之一的资源开销（7-10%开销）实现了与始终在线调度相当的中位往返时间（RTT）（约10毫秒，比基于SR的20毫秒基线减半）。

英文摘要

Ultra Reliable and Low Latency Communications (URLLC) was one of the main motivations behind 5G, with 3GPP advertising 1-10 ms latency targets for applications such as industrial automation, Vehicle-To-Everything (V2X), tactical edge networking, and unmanned-system control. Years on, real 5G Time Division Duplexing (TDD) networks still show median Uplink (UL) round-trip times in the 50-70 ms range, largely because of the Scheduling Request (SR) procedure that a User Equipment (UE) must complete before transmitting UL data. Existing remedies, primarily Configured Grant (CG) scheduling, only eliminate this overhead for strictly periodic traffic and require cross-layer synchronization, which has limited their adoption. We propose AUGUSTE (Anticipatory Uplink Grants for URLLC via Self-Adapting Temporal Estimation), a learning-based Medium Access Control (MAC) scheduling framework that embeds online Machine Learning (ML) models in the UL scheduler to predict packet arrivals and proactively allocate resources before an SR is issued. An adaptive state machine alternates between a learning phase that collects unbiased arrival statistics and a confident phase that exploits the learned predictions to schedule only when traffic is expected. We evaluate AUGUSTE on a real 5G testbed running OpenAirInterface across three URLLC traffic patterns (request-response, ML edge inference, and periodic autonomous reporting), and show that it operates at the best achievable point on the latency-overhead trade-off: it matches always-on scheduling's median Round Trip Time (RTT) (around 10 ms, halving the 20 ms SR-based baseline) at roughly one-tenth its resource cost (7-10 percent overhead).
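The two-phase state machine can be illustrated with a toy scheduler. In the sketch below, a running mean of inter-arrival gaps stands in for the paper's online ML models, and the class name, thresholds, and slot-based timing are all assumptions made for illustration. In the learning phase every slot is granted so that arrival statistics are unbiased; in the confident phase grants are issued only near the predicted arrival, and a missed arrival sends the scheduler back to learning.

```python
class AnticipatoryScheduler:
    """Toy learning/confident state machine for proactive uplink grants.

    Hypothetical stand-in: predicts the next arrival as the last arrival plus
    the mean of the most recent inter-arrival gaps.
    """

    def __init__(self, min_samples=4, tolerance=1):
        self.min_samples = min_samples
        self.tolerance = tolerance
        self.phase = "learning"
        self.gaps = []
        self.last_arrival = None

    def _period(self):
        return round(sum(self.gaps) / len(self.gaps))

    def should_grant(self, slot):
        """Grant every slot while learning; otherwise only near the predicted arrival."""
        if self.phase == "learning" or self.last_arrival is None:
            return True
        predicted = self.last_arrival + self._period()
        return abs(slot - predicted) <= self.tolerance

    def observe_arrival(self, slot):
        """Record a packet arrival and update the phase."""
        granted = self.should_grant(slot)
        if self.last_arrival is not None:
            self.gaps = (self.gaps + [slot - self.last_arrival])[-self.min_samples:]
        self.last_arrival = slot
        if not granted:
            # Misprediction: the packet arrived without a grant, so relearn.
            self.phase, self.gaps = "learning", []
        elif (len(self.gaps) >= self.min_samples
              and max(self.gaps) - min(self.gaps) <= self.tolerance):
            self.phase = "confident"
```

The overhead saving comes from the confident phase, where most slots are not granted; the latency protection comes from falling back to always-on grants as soon as a prediction misses.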


2606.03657 2026-06-03 cs.AI 版本更新

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

诊断大语言模型工具使用中的知识缺口：面向新API获取的智能体基准

Jinnuo Liu, Yue Peng, Jinhan Niu, Hongyi Wen

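Affiliations: NYU Shanghai

AI summary: Proposes the NovelAPIBench benchmark, which dynamically discovers novel APIs, extracts decomposed knowledge bundles, and generates executable tasks, assigning model failures in API use to six diagnostic categories; finds that retrieval and parametric tuning are complementary.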

Comments 37 pages, 12 figures


AI中文摘要

用于代码生成的大语言模型通常需要使用预训练数据中不存在的API。这不仅仅是回忆函数名：模型必须协调签名、模块路径、输入输出契约、语义和可执行使用模式。现有的新API基准通常是静态的，依赖于粗略的通过/失败指标，或使用可能无法反映真实库演变的合成API。我们引入了NovelAPIBench，一个全自动动态基准，对于任何基础模型和目标库，发现新API，提取分解的知识包，生成可执行编码任务，并将失败样本分配到六个诊断类别。在大约1.9K个任务、四个基础模型和五个领域上，我们比较了通过检索注入的知识与通过参数自适应内化的知识。我们发现知识组件不可互换：使用示例是最强的独立信号，而最佳的双组件设置将签名与机制或示例配对，具体取决于领域和骨干。添加更多上下文，尤其是源代码，可能通过增加导入路径错误而有害。一旦外部知识被移除，参数自适应也不能取代检索；相反，微调主要教会模型如何使用提供的包，并且这种能力可以迁移到保留的库。这些结果表明检索和调优扮演互补角色：检索提供易变的API内容，而调优改进程序性整合。

英文摘要

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories. Across about 1.9K tasks, four base models, and five domains, we compare knowledge injected through retrieval with knowledge internalized through parametric adaptation. We find that knowledge components are not interchangeable: usage examples are the strongest standalone signal, while the best two-component setting pairs signatures with either mechanisms or examples depending on the domain and backbone. Adding more context, especially source code, can hurt by increasing import-path errors. Parametric adaptation also does not replace retrieval once external knowledge is removed; rather, fine-tuning mainly teaches models how to use provided bundles, and this ability transfers to held-out libraries. These results suggest that retrieval and tuning play complementary roles: retrieval supplies volatile API content, while tuning improves procedural integration.


2606.03655 2026-06-03 cs.AI cs.LO 版本更新

Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

命题可废止立场逻辑中的非单调蕴涵

Nicholas Leisegang, Thomas Meyer, Ivan Varzniczak

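Affiliations: University of Cape Town and CAIR, South Africa; Université Sorbonne Paris Nord, Inserm, Sorbonne Université, Limics, 93017 Bobigny, France; ISTI-CNR, Pisa, Italy

AI summary: Introduces situated standpoint conditionals to lift KLM-style non-monotonic rational entailment relations to a fragment of propositional defeasible standpoint logic (PDSL), shows that this fragment can be expressed as a set of situated conditionals, and faithfully translates ranking-based entailment relations (such as rational and lexicographic closure) from the propositional case into PDSL while preserving complexity bounds.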


AI中文摘要

近期在可废止推理领域的研究中，Kraus等人提出的优先语义和蕴涵概念已被应用于模态逻辑。然而，该领域的工作主要集中在可满足性检查以及单调蕴涵关系上，后者在推理上可能较弱。引入这一概念的一个特定模态逻辑是命题立场逻辑，其中的模态可以表达不同视角的观点。这导致了命题可废止立场逻辑（PDSL）的形式化。在本文中，我们提出了一种方法，将（非单调）理性蕴涵关系类从传统的KLM风格推理提升到PDSL的一个片段中。为此，我们通过情境立场条件句扩展了PDSL的表达力，使得我们能够在给定立场的上下文中讨论可废止条件句。这使我们能够用情境条件句重新刻画PDSL的语法，并表明PDSL的一个大片段可以表达为一组情境条件句。然后，我们专注于刻画该片段中的非单调蕴涵，定义了一种方法，将任何基于排序的蕴涵关系从命题情况移植到PDSL情况。这首先在一般情形下描述，然后在理性和词典序闭包的具体情形下考虑，为每个推理提供了到PDSL的忠实翻译。我们还表明，该PDSL片段中的蕴涵检查可以主要使用命题情况下的算法进行，同时保持复杂度界限。

英文摘要

Recent work in defeasible reasoning has seen notions of preferential semantics and entailment in the style of Kraus et al. applied to modal logics. However, work in this field has focussed primarily on satisfiability checking, and monotonic notions of entailment, which may be inferentially weak. One particular modal logic where this has been introduced is propositional standpoint logics, where modalities can express the views of different viewpoints. This has resulted in the formalisation of propositional defeasible standpoint logic (PDSL). In this paper, we propose a means of lifting the class of (non-monotonic) rational entailment relations from traditional KLM-style reasoning to a fragment of PDSL. In order to do so, we extend the expressivity of PDSL via situated standpoint conditionals, allowing us to talk about a defeasible conditional holding in the context of a given standpoint. This allows us to re-characterise the syntax of PDSL in terms of situated conditionals, and shows that a large fragment of PDSL is expressible as a set of situated conditionals. We then focus on characterising non-monotonic entailment in this fragment, defining a method to transport any ranking-based entailment relation from the propositional case into the PDSL case. This is first described in the general case and then considered in the specific cases of rational and lexicographic closures, providing a faithful translation of each inference into PDSL. We also show that entailment-checking in this fragment of PDSL can be done largely using algorithms from the propositional case, while preserving complexity bounds.


2606.03648 2026-06-03 cs.CL cs.AI 版本更新

Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

微调大语言模型的安全性测量应基于能力

Krishnapriya Vishnubhotla, Hillary Dawkins, Isar Nejadgholi, Svetlana Kiritchenko

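Affiliations: National Research Council, Canada

AI summary: Anchors fine-tuning to a specific capability goal and evaluates its effect on both capability and safety along multiple dimensions, finding that fine-tuned models may respond incoherently to safety prompts, that automated safety judgments are unreliable for such outputs, and that conclusions vary with the safety benchmark and the evaluator.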

Comments 8 pages plus appendices


AI中文摘要

通过微调将基础大语言模型适应用户的任务或偏好风格可能会损害模型的安全性。先前的研究在有限且看似随机的实验设置中考察了微调对模型安全性的影响。我们认为，将微调锚定于特定的能力目标对于避免任意的经验选择至关重要，这使我们能够得出关于安全性影响的有意义结论，并在一致的基础上比较缓解方法。我们通过关注能力和安全性，对微调对模型行为的影响进行了多维度评估。我们的结果揭示了重要问题：(1) 微调模型可能对安全提示产生不连贯的生成内容，(2) 对于这种不连贯输出，自动安全判断不可靠，(3) 关于微调影响的结论可能因安全基准以及安全评估者的选择而改变。

英文摘要

Adapting foundation large language models to a user's task or preferred style through fine-tuning can compromise the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential for avoiding arbitrary empirical choices, allowing us to draw meaningful conclusions about safety impacts and to compare mitigation methods on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior by focusing on capability as well as safety. Our results surface three important issues: (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) the conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark as well as the safety evaluator.


2606.03647 2026-06-03 cs.CR cs.AI cs.LG 版本更新

TSQAgent: Assessing Time Series Data Quality via Specialized Agentic Reasoning

Shunyu Wu, Dan Li, Haozheng Ye, Weibin Feng, Jian Lou, Bo Zhang, Wenjie Feng, Chenjuan Guo, See-Kiong Ng

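Affiliations: Sun Yat-sen University; China University of Mining and Technology; University of Science and Technology of China; East China Normal University; National University of Singapore

AI summary: Proposes the TSQAgent framework, in which three collaborating agents (Perceiver, Inspector, Adjudicator) identify the relevant quality dimensions and compare them quantitatively, substantially improving LLM performance on time series data quality assessment.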


AI中文摘要

评估时间序列（TS）数据的质量是基础但极具挑战性的任务，因为质量维度具有多面性。最近，大语言模型（LLM）通过成对比较和逐维度评估，成为TS质量评估的一种有前景的范式。然而，现有方法依赖手动预定义的质量维度和纯文本推理，尚不清楚LLM能否识别真正相关的质量维度或进行基于证据的定量质量比较。为探究此问题，我们构建了TSQBench，一个专用基准，用于评估LLM在两种渐进能力上的表现：（i）理解和识别相关质量维度，（ii）在特定维度下进行质量比较。分析表明，当前LLM在维度识别和基于证据的质量比较方面均存在困难。为解决这些局限，我们提出TSQAgent，一种新颖的用于TS质量评级的智能体推理框架，包含三个协作角色：感知器（负责聚焦维度选择）、检查员（负责逐维度定量分析）和裁决者（负责聚合并优化最终判断）。特别地，我们引入一种智能体推理策略，赋予模型识别和优先考虑最相关质量维度的能力，并进一步提出一个配备外部分析工具的智能体工作流，以实现对选定维度的精确定量比较。在提出的基准和11个真实世界数据集上的实验表明，我们的框架不仅显著提升了LLM在质量理解和定量比较方面的能力，而且有效地将这些改进转化为更好的质量感知数据选择，从而提升下游性能和数据效率。

英文摘要

Assessing the quality of time series (TS) data is fundamental yet inherently challenging due to the multifaceted nature of quality dimensions. Recently, large language models (LLMs) have emerged as a promising paradigm for TS quality assessment via pairwise comparison and per-dimension evaluation. However, existing approaches rely on manually predefined quality dimensions and purely text-based reasoning, leaving it unknown whether LLMs can identify truly relevant quality dimensions or perform grounded and quantitative quality comparisons. To investigate this, we construct TSQBench, a dedicated benchmark for evaluating LLMs on two progressive capabilities: (i) understanding and identifying relevant quality dimensions, and (ii) performing quality comparison under specific dimensions. Our analysis reveals that current LLMs consistently struggle with both dimension identification and evidence-grounded quality comparison. To address these limitations, we propose TSQAgent, a novel agentic reasoning framework for TS quality rating consisting of three collaborative roles: Perceiver for focused dimension selection, Inspector for dimension-wise quantitative analysis, and Adjudicator that aggregates and refines the final judgment. In particular, we introduce an agentic reasoning strategy that instills the ability to identify and prioritize the most relevant quality dimensions, and further propose an agent workflow equipped with external analytical tools to enable precise quantitative comparisons over selected dimensions. Experiments on both the proposed benchmark and eleven real-world datasets demonstrate that our framework not only substantially improves LLMs' capabilities in quality understanding and quantitative comparison but also effectively translates these improvements into better quality-aware data selection, leading to enhanced downstream performance and data efficiency.

2606.03628 2026-06-03 cs.CL cs.AI cs.LG (version update)

Building Reliable Long-Form Generation via Hallucination Rejection Sampling

通过幻觉拒绝采样构建可靠的长文本生成

Lin Li, Georgia Channing, Suhaas M Bhat, Gabriel Davis Jones, Yarin Gal

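Affiliations: Georgia Institute of Technology; University of California, Berkeley; University of Cambridge; DeepMind

AI summary: Proposes SHARS, a segment-wise hallucination rejection sampling framework that uses an arbitrary hallucination detector to reject and resample hallucinated segments during generation, mitigating hallucination accumulation in long-form generation and improving factual consistency.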

Comments accepted by ICML 2026

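AI-generated abstract (translated from Chinese)

Large language models (LLMs) have made remarkable progress in open-ended text generation, but they still tend to produce incorrect or unsupported hallucinated content, which undermines their reliability. The problem is worse in long-form generation because of hallucination snowballing, where early errors propagate and accumulate in later output. To address this, we propose a novel inference-time hallucination mitigation framework called Segment-wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resamples until faithful content is produced. By keeping only trustworthy information and building later generation on it, the framework mitigates hallucination accumulation and strengthens factual consistency. To instantiate the framework, we adopt semantic uncertainty as the detector and introduce several key modifications that address its limitations and adapt it better to long-form text. Our method lets models self-correct hallucinations without external resources such as web search or knowledge bases, while staying compatible with those resources for future extension. Empirical evaluation on standardized hallucination benchmarks shows that the method substantially reduces hallucinations in long-form generation while preserving or even improving the informativeness of the output. Code is available at https://github.com/TreeLLi/hallucination-rejection-sampling.

English abstract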

Large language models (LLMs) have achieved remarkable progress in open-ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long-form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference-time hallucination mitigation framework, named Segment-wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long-form text. Our method enables models to self-correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long-form generation while preserving or even improving the informativeness of generation. Code is available at: https://github.com/TreeLLi/hallucination-rejection-sampling.
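
The reject-and-resample loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose_segment`, `is_hallucinated`, `max_resamples`, and the choice to skip a segment when every attempt is rejected are hypothetical, and the toy detector merely stands in for the semantic-uncertainty detector the abstract mentions.

```python
import random
from typing import Callable, List


def shars_generate(
    propose_segment: Callable[[List[str]], str],
    is_hallucinated: Callable[[List[str], str], bool],
    num_segments: int,
    max_resamples: int = 8,
) -> List[str]:
    """Build an output segment by segment, keeping only segments the detector accepts."""
    accepted: List[str] = []
    for _ in range(num_segments):
        for _attempt in range(max_resamples):
            # Each proposal is conditioned only on previously accepted segments.
            candidate = propose_segment(accepted)
            if not is_hallucinated(accepted, candidate):
                accepted.append(candidate)
                break
        # Assumption of this sketch: if every attempt is rejected, the segment is skipped.
    return accepted


# Toy stand-ins for a language model and a hallucination detector (both hypothetical).
rng = random.Random(0)
POOL = [
    "fact: water boils at 100 C at sea level",
    "FALSE: the moon is made of cheese",
]


def toy_propose(context: List[str]) -> str:
    return rng.choice(POOL)


def toy_detector(context: List[str], segment: str) -> bool:
    return segment.startswith("FALSE")


segments = shars_generate(toy_propose, toy_detector, num_segments=5)
```

Because rejected segments never enter `accepted`, later proposals are conditioned only on content the detector let through, which is the mechanism the abstract credits for limiting hallucination snowballing.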

2606.03626 2026-06-03 cs.CV cs.AI cs.CY (version update)

TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

TurtleAI：海龟图形学中视觉编程的多模态模型基准测试

Chao Wen, Jacqueline Staub, Adish Singla

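Affiliations: MPI-SWS (Max Planck Institute for Software Systems)

AI summary: Proposes the TurtleAI benchmark of 823 visual programming tasks based on real Turtle Graphics tasks. An evaluation of more than 20 multimodal models finds that most achieve success rates below 30%, and fine-tuning Qwen2-VL-72B on synthetic data generated from a small set of seed samples improves performance by about 20%.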

Comments ACL Findings 2026 paper

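AI-generated abstract (translated from Chinese)

Vision-language models (VLMs) have been explored for visual programming, that is, generating code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it is still unclear how well current VLMs perform on education-oriented visual programming and which factors limit their performance. To fill this gap, we introduce TurtleAI, a benchmark of 823 tasks curated from real visual programming tasks in the Turtle Graphics domain. Solving them requires a model to perceive geometric patterns, reason about spatial relationships, and synthesize Python code that faithfully reproduces the patterns. We evaluate more than 20 VLMs, including GPT-5, GPT-4o, and Qwen2-VL-72B, and find that they struggle markedly, with most success rates below 30%. To address these limitations, we propose a data generation technique that needs only a small set of seed samples. Fine-tuning Qwen2-VL-72B on the generated synthetic data yields an improvement of about 20% on real tasks. Our failure analysis shows that GPT-4o struggles with spatial reasoning and precise visual replication, while fine-tuning mainly improves the alignment between visual reasoning and code implementation.

English abstract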

Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and what factors limit their performance. To bridge this gap, we introduce TurtleAI, a benchmark containing 823 tasks curated based on real-world visual programming tasks in the Turtle Graphics domain. Solving these tasks requires models to perceive geometric patterns, reason about spatial relationships, and synthesize Python code that faithfully reproduces geometric patterns. We evaluate 20+ VLMs, including GPT-5, GPT-4o, and Qwen2-VL-72B, and find that they struggle significantly, with most achieving success rates below 30%. To address these limitations, we propose a data generation technique that requires only a small set of seed samples. Fine-tuning Qwen2-VL-72B on the resulting synthetic data yields an improvement of about 20% on real-world tasks. Our failure analysis reveals that GPT-4o struggles with spatial reasoning and precise visual replication, whereas fine-tuning primarily improves the alignment between visual reasoning and code implementation.

2606.03624 2026-06-03 cs.AI cs.CL (version update)

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

桥接辅助约束以解决大型推理模型中的指令遵循问题

Zhengyi Zhao, Shubo Zhang, Huimin Wang, Zezhong Wang, Yutian Zhao, Yefeng Zheng, Binyang Li, Yulan He, Kam-Fai Wong, Xian Wu

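Affiliations: The Chinese University of Hong Kong; University of International Relations; Tencent Jarvis Lab; Westlake University; King's College London

AI summary: To address the difficulty large reasoning models have in reliably following multiple constraints, proposes a Constraint Relationship Graph Completion framework that explicitly models relationships between constraints and discovers bridge constraints, reducing constraint violations by 39%.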

Comments a pre-MIT Press publication version

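AI-generated abstract (translated from Chinese)

Large reasoning models (LRMs) show impressive capabilities on many tasks, but they have difficulty reliably following multiple instructions: they either fail to satisfy individual constraints or struggle to balance competing constraints at the same time. We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our method, Constraint Relationship Graph Completion (CRGC), explicitly models the relationships between constraints, identifies adherence challenges, and discovers "bridge constraints" that help the model focus on and reconcile requirements. Bridge constraints serve as auxiliary instructions that make the primary constraints more salient and compatible. Unlike existing methods that strengthen instruction following through general training, CRGC specifically improves constraint satisfaction by using the model's own knowledge to create better generation pathways. Experiments on three popular instruction-following datasets show that, compared with standard prompting, our method reduces constraint violations by 39% while preserving the reasoning ability of large reasoning models.

English abstract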

Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our approach, Constraint Relationship Graph Completion (CRGC), explicitly models relationships between constraints, identifies adherence challenges, and discovers "bridge constraints" that help the model better focus on and reconcile requirements. Bridge constraints act as auxiliary instructions that make primary constraints more salient and compatible. Unlike existing approaches that enhance instruction following through general training methods, CRGC specifically improves constraint satisfaction by leveraging the model's own knowledge to create better pathways for generation. Experiments across three popular instruction-following datasets demonstrate that our approach reduces constraint violations by 39% compared to standard prompting while maintaining the reasoning abilities of large reasoning models.

2606.03620 2026-06-03 cs.LG cs.AI (version update)

Physics-Guided Policy Optimization with Self-Distillation

基于物理引导的自蒸馏策略优化

Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, Kai Wei

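Affiliations: Amazon

AI summary: To address the training instability caused by a fixed step size in self-distilled policy optimization, proposes Physics-Guided Policy Optimization (PGPO), inspired by viscous-fluid dynamics, which modulates the step size using a mutual-information estimate. It improves performance on the Science-QA dataset while keeping training stable.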

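AI-generated abstract (translated from Chinese)

Self-distilled policy optimization (SDPO) has become a popular paradigm for post-training large language models, in which a model learns from its own predictions conditioned on privileged information. However, SDPO is sensitive to how much each update step should be trusted: corrections from the self-teacher can be informative on some batches and misleading on others, and applying them uniformly with a fixed step size destabilizes training. Inspired by viscous-fluid dynamics, with the analogy formalized at the level of stochastic differential equations, we propose Physics-Guided Policy Optimization (PGPO), which introduces an information-modulated step-size multiplier based on a mutual-information estimate between the student's predictions and the feedback-conditioned teacher. We show that this modulation preserves the order-1 weak-approximation guarantee of vanilla SGD and adds negligible overhead per iteration. We evaluate PGPO on the Science-QA dataset, where it outperforms SDPO on 3 of 4 domains with gains of up to +4.5 points, while remaining stable in a setting where SDPO collapses late in training.

English abstract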

Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own predictions conditioned on privileged information. SDPO, however, is sensitive to how much each update step should be trusted: corrections from a self-teacher can be highly informative on some batches and misleading on others, and applying them uniformly with a fixed step size can destabilize training. Drawing inspiration from viscous-fluid dynamics and formalizing the analogy at the SDE level, we propose Physics-Guided Policy Optimization (PGPO), which introduces an information-modulated step-size multiplier derived from a mutual-information estimate between the student's predictions and the feedback-conditioned teacher. We show that this modulation preserves the order-1 weak-approximation guarantees of vanilla SGD, and incurs negligible overhead per iteration. We evaluate PGPO on the Science-QA dataset, where it outperforms SDPO on 3 of the 4 domains with gains of up to +4.5 points, while remaining stable in a setting where SDPO collapses late in training.

2606.03618 2026-06-03 cs.AI (version update)

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

跨语言令牌套利：通过本地LLM预处理优化代码智能体上下文窗口

Mehmet Utku Colak

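Affiliations: GitHub

AI summary: Proposes a pre-flight, edge-side prompt-rewriting middleware that uses a local Llama 3.2 model for cross-lingual translation and structural rewriting, cutting prompt tokens by 34-47% and total tokens by up to 18.8% while preserving or improving task accuracy.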

Comments Submitted to EMNLP 2026

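AI-generated abstract (translated from Chinese)

AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input account for most of this overhead: inefficient tokenization of non-English text and structural entropy in conversational prompts. Existing methods respond reactively, by compressing context that is already bloated or by intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that runs between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and a regex-validated rewrite-with-fallback safeguard that ensures the optimized prompt is never larger than the original. We evaluate on OMH-Polyglot, a multilingual coding benchmark covering Turkish, Arabic, Chinese, and code-switched specifications. Across three commercial LLM backends, the middleware reduces prompt tokens by 34-47% and total tokens by up to 18.8%, while preserving or improving task accuracy. Ablation studies show that the gains come mainly from the rewriting stage rather than from simple function-name extraction. Compared with LLMLingua-2 at matched compression rates, our method consistently achieves better OckScore performance on all evaluated backends. These results show that proactive prompt optimization can substantially reduce inference cost without sacrificing coding quality.

English abstract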

AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original. We evaluate on OMH-Polyglot, a multilingual coding benchmark spanning Turkish, Arabic, Chinese, and code-switched specifications. Across three commercial LLM backends, the middleware reduces prompt tokens by 34-47 percent and total tokens by up to 18.8 percent while preserving or improving task accuracy. Ablation studies show that gains arise primarily from the rewriting stage rather than simple function-name extraction. Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends. These results demonstrate that proactive prompt optimization can substantially reduce inference costs without sacrificing coding quality.
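
The rewrite-with-fallback safeguard can be illustrated with a short sketch. The abstract does not say what the regex validation checks or which tokenizer is used, so both are assumptions here: `count_tokens` is a crude stand-in tokenizer, and the validation rule (every backticked identifier in the original must survive the rewrite) is hypothetical. The `rewrite` callable stands in for the local model.

```python
import re
from typing import Callable

IDENTIFIER = re.compile(r"`([^`]+)`")


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def rewrite_with_fallback(
    prompt: str,
    rewrite: Callable[[str], str],
    token_counter: Callable[[str], int] = count_tokens,
) -> str:
    """Return the rewritten prompt only if it passes validation and is not larger."""
    candidate = rewrite(prompt)
    # Assumed validation rule: code identifiers quoted in the original must be kept.
    required = set(IDENTIFIER.findall(prompt))
    if not required.issubset(IDENTIFIER.findall(candidate)):
        return prompt
    # Fallback guard: never send a prompt that costs more tokens than the original.
    if token_counter(candidate) > token_counter(prompt):
        return prompt
    return candidate


original = "Hey, could you please fix the bug in `parse_date` when the input is empty? Thanks a lot!"
compact = rewrite_with_fallback(original, lambda p: "Fix `parse_date` for empty input.")
dropped_name = rewrite_with_fallback(original, lambda p: "Fix the date parser.")
too_long = rewrite_with_fallback(original, lambda p: p + " " + p)
```

Here only the first rewrite is accepted; the other two fall back to the original prompt, so the middleware can only lower the token count, never raise it.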

2606.03608 2026-06-03 cs.LG cs.AI (version update)

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

利用验证-生成差距：基于置信度条件的测试时强化学习

Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng

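Affiliations: Sun Yat-Sen University; University of Science and Technology of China; National University of Singapore

AI summary: Proposes the TTRL-CoCoV framework, whose confidence-adaptive mechanism addresses incorrect pseudo-labels and diversity collapse when optimizing Pass@k in label-free settings, substantially improving Pass@1 and Pass@k.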

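AI-generated abstract (translated from Chinese)

Test-time reinforcement learning has become a promising paradigm for strengthening the complex reasoning of large language models in a completely label-free way. Existing studies focus on Pass@1 performance, whereas optimizing Pass@k (which measures generation coverage to support sustained exploration) in label-free settings remains under-explored yet critical. Optimizing Pass@k without labels is very challenging, because directly applying the Pass@k advantage designs that work for RLVR performs poorly. Through in-depth empirical analysis, we identify the root causes: pseudo-label estimates for low-confidence samples are likely to be wrong, while candidate answers for high-confidence samples suffer severe diversity collapse. To overcome these obstacles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Building on our key insight that verification ability generally runs ahead of generation ability, TTRL-CoCoV uses a confidence-conditioned mechanism: for high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter out wrong pseudo-labels; for medium-confidence samples, it bypasses verification entirely. Extensive experiments show that TTRL-CoCoV outperforms the best competing methods on 6 widely recognized benchmarks, with average absolute gains over TTRL of +9.8% in Pass@1 and +18.7% in Pass@16, and that it achieves absolute Pass@1 gains of up to +5.0% on several reasoning benchmarks even when compared with fully supervised RL methods. Code repository: https://github.com/shanjf666/CoCoV.

English abstract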

Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Existing studies focus on Pass@1 performance, whereas optimizing Pass@k, which measures generation coverage for sustained exploration, remains under-explored yet critical in label-free settings. Optimizing Pass@k in the label-free setting is highly non-trivial, as directly applying the Pass@k advantage designs that are effective for RLVR yields unsatisfactory performance. Through in-depth empirical analysis, we discover the root causes hindering performance: pseudo-label estimations for low-confidence samples have a high probability of being incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL-CoCoV employs a confidence-conditioned mechanism: for high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter incorrect pseudo-labels; and for medium-confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL-CoCoV outperforms the best competing methods across 6 widely recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: https://github.com/shanjf666/CoCoV.
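
The three-way confidence routing can be sketched roughly as below. This is an illustration only: the abstract does not define how confidence is measured or where the thresholds lie, so the majority-vote share as the confidence score, the `low` and `high` cut-offs, and the `verifier_pick` callable are all assumptions of this sketch.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and its share of the samples."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def pick_pseudo_label(
    answers: List[str],
    verifier_pick: Callable[[List[str]], str],
    low: float = 0.4,   # hypothetical threshold
    high: float = 0.8,  # hypothetical threshold
) -> Tuple[str, str]:
    """Route a prompt's sampled answers by confidence, returning (label, route)."""
    label, confidence = majority_vote(answers)
    if confidence < low:
        # Low confidence: the majority label is unreliable, so the verifier chooses.
        return verifier_pick(answers), "verifier-selected"
    if confidence < high:
        # Medium confidence: verification is bypassed.
        return label, "majority-vote"
    # High confidence: keep the label; training would add an exploration reward here.
    return label, "majority-vote-with-exploration-reward"


high_case = pick_pseudo_label(["4"] * 9 + ["5"], verifier_pick=lambda a: a[-1])
medium_case = pick_pseudo_label(["4", "4", "4", "5", "6"], verifier_pick=lambda a: a[-1])
low_case = pick_pseudo_label(["1", "2", "3", "4", "4", "5", "6", "7"], verifier_pick=lambda a: a[-1])
```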

2606.03602 2026-06-03 cs.LG cs.AI cs.CL (version update)

CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

CauTion：知道何时信任LLM进行集成因果发现

Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu

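Affiliations: Shanghai AI Laboratory; Shanghai Innovation Institute; Shanghai Jiao Tong University; Nanjing University; Tongji University

AI summary: Proposes CauTion, a framework that uses consensus filtering and LLM reliability estimation to reliably integrate LLM domain knowledge into an ensemble of statistical causal discovery algorithms, addressing both the limitations of purely statistical methods and LLM errors.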

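AI-generated abstract (translated from Chinese)

Causal discovery from observational data remains challenging because purely statistical methods have fundamental limitations, such as limited statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. Large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, but existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. In addition, relying on a single data-driven algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, the algorithm ensemble uses consensus voting to resolve up to 96% of edges, those on which the algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms through an annotation-free trust calibration procedure; the estimate then governs a trust-weighted voting process that restricts LLM arbitration to edges where the algorithmic evidence is unreliable. Third, a cycle repair step ensures that the final causal graph is a valid acyclic graph. Experiments on six datasets show that CauTion consistently outperforms both data-driven and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.

English abstract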

Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as limited statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data-centric algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble uses consensus voting to resolve up to 96% of edges, namely those on which the algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation-free trust calibration procedure, which then governs a trust-weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee that the final causal graph is acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data-centric and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.
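
The first and third stages (consensus filtering and cycle repair) lend themselves to a small sketch. The abstract gives no algorithmic detail, so the unanimity rule for consensus, the per-edge `confidence` scores, and the repair rule (drop the least-confident edge on each detected cycle) are assumptions of this illustration; the trust-calibrated arbitration stage is omitted.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def split_by_consensus(nodes: Sequence[str], graphs: List[Set[Edge]]):
    """Separate edges every algorithm proposes from edges only some propose."""
    agreed: Set[Edge] = set()
    disputed: Set[Edge] = set()
    for edge in permutations(nodes, 2):
        votes = sum(edge in g for g in graphs)
        if votes == len(graphs):
            agreed.add(edge)
        elif votes > 0:
            disputed.add(edge)
    return agreed, disputed


def find_cycle(edges: Set[Edge]) -> List[Edge]:
    """Return the edges of one directed cycle, or [] if the graph is acyclic."""
    adj: Dict[str, List[str]] = {}
    for u, v in sorted(edges):
        adj.setdefault(u, []).append(v)
    state: Dict[str, int] = {}  # 1 = on the DFS stack, 2 = fully explored
    stack: List[str] = []

    def dfs(u: str) -> List[Edge]:
        state[u] = 1
        stack.append(u)
        for v in adj.get(u, []):
            if state.get(v) == 1:  # back edge closes a cycle
                path = stack[stack.index(v):] + [v]
                return list(zip(path, path[1:]))
            if v not in state:
                found = dfs(v)
                if found:
                    return found
        stack.pop()
        state[u] = 2
        return []

    for node in sorted(adj):
        if node not in state:
            found = dfs(node)
            if found:
                return found
    return []


def repair_cycles(edges: Set[Edge], confidence: Dict[Edge, float]) -> Set[Edge]:
    """Assumed repair rule: remove the least-confident edge of each cycle found."""
    remaining = set(edges)
    while True:
        cycle = find_cycle(remaining)
        if not cycle:
            return remaining
        remaining.remove(min(cycle, key=lambda e: confidence.get(e, 0.0)))


graphs = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "B")},
    {("A", "B"), ("B", "C"), ("C", "A")},
]
agreed, disputed = split_by_consensus(["A", "B", "C"], graphs)
repaired = repair_cycles(
    {("A", "B"), ("B", "C"), ("C", "A")},
    {("A", "B"): 1.0, ("B", "C"): 0.67, ("C", "A"): 0.33},
)
```

In this toy run only the edge from A to B is unanimous, and the repair step removes the weakly supported edge from C to A to break the cycle.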

2606.03601 2026-06-03 cs.SE cs.AI (version update)

DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

DDOR: 用于可解释过度拒绝测试与修复的Delta调试方法

Qinyan Zhou, Peixin Zhang, Jun Sun, Haonan Zhang, Dongxia Wang

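Affiliations: Southeast University; Singapore Management University; Zhejiang University; Huzhou Institute of Industrial Control Technology

AI summary: Proposes the DDOR framework, which uses delta debugging to localize minimal refusal-triggering fragments (mRTFs), enabling automated testing and repair of overrefusal in large language models in a black-box setting.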

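AI-generated abstract (translated from Chinese)

Safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, but they can also cause overrefusal, that is, unwarranted rejection of harmless queries that only appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms stay opaque. DDOR applies delta debugging to localize minimal refusal-triggering fragments (mRTFs), which provide phrase-level, explainable evidence of why a refusal occurs. Based on these mRTFs, DDOR generates diverse, context-rich prompts and performs multi-oracle validation to filter out cases that are intrinsically unsafe or ambiguous, producing scalable, model-specific overrefusal test suites (about 1K cases per model). Beyond evaluation, we further use the localized mRTFs for targeted prompt repair, which substantially reduces overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs. Overall, DDOR offers a practical end-to-end solution for evaluating and mitigating overrefusal, improving LLM usability without sacrificing safety.

English abstract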

While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies delta debugging to localize minimal refusal-triggering fragments (mRTFs) that provide phrase-level, explainable evidence for why a refusal occurs. Conditioned on these mRTFs, DDOR generates diverse, context-rich prompts and performs multi-oracle validation to filter intrinsically unsafe or ambiguous cases, producing scalable and model-specific overrefusal test suites (approximately 1K cases per model). Beyond evaluation, we further leverage localized mRTFs to perform targeted prompt repair, substantially reducing overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs. Overall, DDOR offers a practical end-to-end solution to both evaluate and mitigate overrefusal, improving LLM usability without sacrificing safety.
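
To make the localization step concrete, here is a sketch of the classic ddmin delta-debugging loop applied to the words of a prompt. It is not the DDOR implementation: the word-level granularity is an assumption, and the `refuses` oracle is a hypothetical keyword check standing in for a real black-box model call.

```python
from typing import Callable, List


def ddmin(items: List[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `items` to a smaller list that still makes `fails` return True."""
    if not fails(items):
        raise ValueError("the full input must trigger the failure")
    n = 2  # current number of chunks
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if fails(subset):
                # A single chunk is enough to trigger the refusal.
                items, n, reduced = subset, 2, True
                break
            if fails(complement):
                # The chunk is irrelevant; drop it and keep the rest.
                items, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(items):
                break  # already at single-item granularity
            n = min(len(items), n * 2)
    return items


words = "how do I kill a Python process on Linux".split()

# Hypothetical oracles: a real setup would query the model and detect a refusal.
single_trigger = ddmin(words, lambda ws: "kill" in ws)
joint_trigger = ddmin(words, lambda ws: "kill" in ws and "process" in ws)
```

The result is a 1-minimal fragment: removing any single remaining word makes the oracle stop firing, which is what makes it usable as phrase-level evidence.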

2606.03569 2026-06-03 cs.CV cs.AI (version update)

When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics

当注意力崩溃时：从结构到语义的阶段性视觉令牌剪枝

Jiahui Wang, Kai Zhang, Mai Han, Huanghe Zhang

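Affiliations: Shandong University; National University of Singapore (Suzhou) Research Institute

AI summary: Visual token pruning for vision-language model inference loses feature diversity when it relies on a single attention score. Proposes STS, a two-stage pruning framework that first maximizes structural diversity through repulsion-based sampling and then filters out semantically irrelevant tokens with instruction-aware cross-attention, improving the structural diversity and fine-grained task alignment of the retained tokens.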

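AI-generated abstract (translated from Chinese)

Vision-language models (VLMs) show remarkable capabilities but carry a large computational overhead at inference time. Visual token pruning is a promising solution, yet existing methods rely mainly on initial attention scores. This single-metric paradigm has a critical flaw: high attention scores inherently collapse onto semantically similar regions, which severely reduces feature diversity and discards important contextual details. To address this, we introduce Structure-to-Semantics (STS), a novel two-stage visual token pruning framework that explicitly decouples the pruning process. The first stage uses a repulsion-based sampling mechanism to maximize spatial and structural diversity. The second stage uses instruction-aware cross-attention to precisely filter out tokens irrelevant to the prompt. This two-stage synergy is the core of STS: it first ensures geometric coverage and then refines the retained tokens by semantic relevance. Extensive evaluations show that STS mitigates the redundancy caused by attention-based selection and improves both the structural diversity and the fine-grained task alignment of the retained visual tokens.

English abstract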

Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference. While visual token pruning offers a promising solution, existing methods predominantly rely on initial attention scores. This single-metric paradigm presents a critical flaw: high attention scores inherently collapse onto semantically similar regions, thereby severely reducing feature diversity and discarding vital contextual details. To address this, we introduce Structure-to-Semantics (STS), a novel two-stage visual token pruning framework that explicitly decouples the pruning process. The first stage employs a repulsion-based sampling mechanism to maximize spatial and structural diversity. The second stage leverages instruction-aware cross-attention to precisely filter out prompt-irrelevant tokens. This two-stage synergy constitutes the core of STS, first ensuring geometric coverage and then refining the retained tokens according to semantic relevance. Extensive evaluations demonstrate that STS mitigates the redundancy caused by attention-based selection, improving both structural diversity and fine-grained task alignment of the preserved visual tokens.

2606.03568 2026-06-03 cs.CV cs.AI cs.LG cs.RO (version update)

Learned Non-Maximum Suppression for 3D Object Detection

用于3D目标检测的学习型非极大值抑制

Timo Osterburg, Stefan Schütte, Torsten Bertram

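Affiliations: Institute of Control Theory and Systems Engineering, TU Dortmund University

AI summary: Proposes two learned filtering modules (D2D-Rescore and GossipNet3D) that replace heuristic NMS and exploit relations among detections to improve 3D detection, especially for small objects and rare classes.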

Comments 6 pages, accepted at IEEE Intelligent Vehicles Symposium (IV) 2026

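AI-generated abstract (translated from Chinese)

Post-processing is a critical stage in LiDAR-based 3D object detection, where dense, overlapping proposals must be filtered to obtain compact and reliable perception. This paper introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by exploiting relations among detections. D2D-Rescore uses transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol keeps training and validation behavior consistent, improving overall detection performance. Compared with CircleNMS, both methods improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality, particularly for small objects and rare classes, while adding minimal computational overhead. These results show that learned, detection-level filtering can make 3D detectors more reliable without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms.

English abstract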

Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection-level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms.

2606.03566 2026-06-03 cs.CV cs.AI (version update)

Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis

基于高效Transformer的局部块采样用于多发性硬化脉络丛分割

Po-Jui Lu, Alessandro Cagol, Mario Ocampo-Pineda, Federico Spagnolo, Marina Mastantuono, Andreea-Alexandra Aldea, Jannis Müller, Özgür Yaldizli, Matthias Weigel, Lester Melie-Garcia, Roberta Magliozzi, Maria Pia Sormani, Ludwig Kappos, Jens Kuhle, Cristina Granziera

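AI summary: Proposes a method based on SwinUNETR and localized patch sampling for automatic segmentation of the lateral ventricle choroid plexus in multiple sclerosis, achieving a higher Dice coefficient than an existing model while cutting computation by 99%.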

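AI-generated abstract (translated from Chinese)

Background: The lateral ventricle choroid plexus (LVCP) is increasingly recognized as a key imaging biomarker associated with physical disability and neuroinflammation in multiple sclerosis (MS). However, manual LVCP segmentation is very tedious, which limits its use in large clinical trials and longitudinal assessments. This study aims to develop a SwinUNETR-based pipeline that uses targeted intra- and peri-ventricular small-patch sampling to segment the LVCP in MS automatically from standalone and multi-modal MRI inputs. Methods: We retrospectively evaluated 3T MRI scans from three sets of data drawn from two independent MS-dominant cohorts (Dataset 1: n=177; Dataset 2: n=177; extended test set: n=388). Our method uses a SwinUNETR architecture trained on 32x32x32 voxel patches and is benchmarked against the 3D UXNET model. The primary metric is the Dice Similarity Coefficient (DSC), supplemented by computational demand (GFLOPs) and the 95th percentile Hausdorff Distance (HD95). Results: On the extended test set, the SwinUNETR model reached a mean DSC of 0.868 (95% CI: 0.863-0.872) with MPRAGE and FLAIR combined, significantly better than UXNET (DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001). With standalone FLAIR input only, the transformer-based method kept a high DSC of 0.863, while the spatial localization of UXNET worsened markedly (HD95: 1.86 vs. 3.00 mm). Importantly, the proposed framework reduced the computational load by 99% (91.8 vs. 22,080 GFLOPs). By combining localized patch sampling with the SwinUNETR architecture, the method offers an accurate, robust alternative for LVCP segmentation that is statistically superior to current leading models. Its large reduction in computational cost makes it well suited to wide deployment in clinical and research settings.

English abstract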

Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) related to physical disability and neuroinflammation. Yet, manual segmentation of the LVCP is highly tedious, restricting its use in broad clinical trials and longitudinal assessments. This research aims to develop a SwinUNETR-driven pipeline that leverages targeted intra- and peri-ventricular small patch sampling to automatically segment the LVCP in MS from both standalone and multi-modal MRI inputs. Methods: We retrospectively assessed 3T MRI scans across three sets of data stemming from two separate MS-dominant cohorts (Dataset 1: n=177; Dataset 2: n=177; expanded test set: n=388). Our method employed a SwinUNETR architecture trained on 32x32x32 voxel patches, benchmarking it against the 3D UXNET model. The primary metric for evaluation was the Dice Similarity Coefficient (DSC), supplemented by computational demand (GFLOPs) and the 95th percentile Hausdorff Distance (HD95). Results: On the extended test set, the SwinUNETR model secured a mean DSC of 0.868 (95% CI: 0.863-0.872) with MPRAGE and FLAIR combined, showing a statistically significant gain over UXNET (DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001). When restricted to standalone FLAIR inputs, the transformer-based approach sustained a high DSC of 0.863, while the spatial localization of UXNET worsened considerably (HD95: 1.86 vs. 3.00 mm). Importantly, the proposed framework lowered computational load by 99% (91.8 vs. 22,080 GFLOPs). By integrating localized patch sampling with a SwinUNETR architecture, this methodology offers an accurate, robust, and statistically superior alternative to current leading models for LVCP segmentation. Its vast reduction in computational cost makes it ideal for widespread implementation in clinical and research environments.
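
For readers unfamiliar with the primary metric, the Dice Similarity Coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened toy masks and also recomputes the relative compute reduction from the two GFLOPs figures quoted in the abstract. The function names and the toy masks are illustrative, and treating two empty masks as a perfect match is an assumption of this sketch.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice Similarity Coefficient for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_reduction(new: float, old: float) -> float:
    """Fraction by which `new` is smaller than `old`."""
    return 1.0 - new / old


# Toy masks: 2 overlapping voxels, 3 predicted, 2 true, so DSC = 2*2 / (3+2).
example_dsc = dice_coefficient([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])

# GFLOPs figures quoted in the abstract: 91.8 (proposed) vs. 22,080 (baseline).
compute_saving = relative_reduction(91.8, 22080.0)
```

With the quoted figures the saving works out to roughly 99.6%, consistent with the "99%" stated in the abstract.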

2606.03557 2026-06-03 cs.AI cs.HC (version update)

From Prompt to Service: An SLM-Based Agent Orchestration Gateway for AI-Driven Virtual Worlds

从提示到服务：基于SLM的AI驱动虚拟世界代理编排网关

Louis Nisiotis, Aimilios Hadjiliasi

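Affiliations: University of Cambridge

AI summary: Proposes an agent orchestration gateway based on small language models that decouples virtual world clients from heterogeneous AI backends through intent-driven service routing, and validates its feasibility and efficiency in a virtual museum testbed.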

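AI-generated abstract (translated from Chinese)

As generative AI capabilities expand, AI-driven virtual worlds face a growing architectural challenge. Users interact multimodally through in-world interfaces, but their requests call for fundamentally different AI backend models and computing resources. Embedding these capabilities directly in the virtual world system reduces extensibility, complicates maintenance, and limits the ability to coordinate services distributed across edge and cloud infrastructure. This paper presents an SLM-based Agent Orchestration Gateway, a lightweight runtime coordination mechanism that decouples the virtual world client from heterogeneous AI backends through intent-driven service routing. An edge-deployed SLM classifies the semantic intent of each user prompt, a configurable service registry validates and resolves the routing decision, and the selected backend is then invoked transparently, so new AI capabilities can be introduced into the virtual world without modifying the client application. The gateway is implemented and evaluated in the InterwovenXR virtual museum testbed. The evaluation shows that compact SLMs can act as reliable intent routers on edge hardware, and that task-specific fine-tuning can turn sub-billion-parameter models into practical low-latency routers. A layered configuration that pairs a fine-tuned sub-billion-parameter model as the router with a larger SLM for conversational response generation proves deployable on mid-range edge hardware and more efficient than delegating both responsibilities to a single model. The findings indicate that SLMs can support practical AI service orchestration in virtual worlds, and the work contributes an evaluated architecture for scalable, extensible, edge-supported AI interaction that makes virtual agents access points to distributed generative AI services.

English abstract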

As generative AI capabilities expand, AI-driven virtual worlds face a growing architectural challenge. Users interact through in-world interfaces in multimodal ways, yet their requests demand fundamentally different AI backend models and computational resources. Embedding these capabilities directly into virtual world systems reduces extensibility, complicates maintenance, and limits the ability to coordinate services distributed across edge and cloud infrastructure. This paper presents an SLM-based Agent Orchestration Gateway, a lightweight runtime coordination mechanism that decouples a virtual world client from heterogeneous AI backends through intent-driven service routing. An edge-deployed SLM classifies the semantic intent of each user prompt, a configurable service registry validates and resolves the routing decision, and the selected backend is invoked transparently, enabling new AI capabilities to be introduced in the virtual world without modifying the client application. The gateway is implemented and evaluated within the InterwovenXR virtual museum testbed. The evaluation shows that compact SLMs can serve as reliable intent routers on edge hardware, and that task-specific fine-tuning can transform sub-billion-parameter models into practical, low-latency routers. A layered configuration pairing a fine-tuned sub-billion-parameter model as router with a larger SLM for conversational response generation is shown to be deployable on mid-range edge hardware and more efficient than delegating both responsibilities to a single model. The findings show that SLMs can support practical AI service orchestration in virtual worlds, and the work contributes an evaluated architecture for scalable, extensible, and edge-supported AI interaction, enabling virtual agents to become access points to distributed generative AI services.
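
The classify, validate, and invoke flow of such a gateway can be sketched in a few lines. Everything named here is hypothetical: `ServiceRegistry`, the intent labels, the fallback behavior for unknown labels, and the keyword-based `classify_intent`, which merely stands in for the edge-deployed SLM classifier.

```python
from typing import Callable, Dict

Backend = Callable[[str], str]


class ServiceRegistry:
    """Maps intent labels to backends and validates routing decisions."""

    def __init__(self, fallback_intent: str) -> None:
        self._backends: Dict[str, Backend] = {}
        self._fallback_intent = fallback_intent

    def register(self, intent: str, backend: Backend) -> None:
        self._backends[intent] = backend

    def resolve(self, intent: str) -> Backend:
        # Assumed policy: labels with no registered service use the fallback.
        if intent not in self._backends:
            intent = self._fallback_intent
        return self._backends[intent]


def classify_intent(prompt: str) -> str:
    """Keyword stand-in for the SLM intent classifier."""
    text = prompt.lower()
    if "draw" in text or "image" in text:
        return "image_generation"
    if "translate" in text:
        return "translation"
    return "chat"


def handle(prompt: str, registry: ServiceRegistry,
           classifier: Callable[[str], str] = classify_intent) -> str:
    """Route one prompt: classify the intent, resolve the backend, invoke it."""
    return registry.resolve(classifier(prompt))(prompt)


registry = ServiceRegistry(fallback_intent="chat")
registry.register("chat", lambda p: "chat backend: " + p)
registry.register("image_generation", lambda p: "image backend: " + p)

image_reply = handle("Draw the statue in the main hall", registry)
fallback_reply = handle("Translate this exhibit label", registry)
```

Adding a capability is then a matter of registering another backend; the client code that calls `handle` does not change, which mirrors the decoupling the abstract describes.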

2606.03544 2026-06-03 cs.AI cs.CL (version update)

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

SAGE: 智能体生态中社会化演化的定量评估

Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao, Xunliang Cai

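Affiliations: Tsinghua University, China; Meituan, China

AI summary: Proposes the SAGE framework, which compares two compute-matched conditions, social evolution (SocialEvo) and self-evolution (SelfEvo), across three arenas to assess how shared experience affects agent performance. Group history is not a universal amplifier, but it helps agents that have plateaued to break through, and social gains depend on abstraction rather than on exposure volume.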

Comments 13 pages, 5 figures

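AI-generated abstract (translated from Chinese)

Self-improving language agents are usually evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, in which agents from five different model families co-evolve with access to all peers' histories, and SelfEvo, in which each agent gets the same number of task attempts but sees only its own past, as is conventional in studies of self-improving agents. We instantiate SAGE in three arenas (open-ended machine learning research, long-horizon economic planning, and strategic multiplayer games) and evaluate over multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve major breakthroughs when peer experience is available. In competitive settings, counterfactual controls show that agents improve in general rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries usually outperform raw logs, indicating that social gains depend on abstraction rather than on exposure volume. These findings indicate that peer-history gains are agent-specific, arena-dependent, and contingent on the ability to abstract transferable knowledge from public traces.

English abstract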

Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.

2606.03532 2026-06-03 cs.LG cs.AI (version update)

When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

教师何时应该移动？自在线策略蒸馏中的时间耦合与稳定性

Haowei Guo, Baolong Bi, Ruicheng Zhang, Bingqian Sun, Wentao Zhang

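Affiliations: Peking University; University of Chinese Academy of Sciences; Tsinghua University

AI summary: Studies how the teacher update schedule affects stability in self on-policy distillation and proposes CGTR, a method built on isolation periods and a gating mechanism that achieves zero collapse and the best final scores.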

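AI-generated abstract (translated from Chinese)

Self on-policy distillation trains a student policy against a teacher derived from the student's own parameter history, yet the teacher's update schedule, which controls the temporal coupling between teacher and student, has not been studied systematically as a stability variable. Through a controlled schedule sweep on Qwen3-8B, we establish that isolation periods (defined as the teacher being completely frozen between updates), rather than teacher age, are the key structural property that enables stable learning. To characterize the underlying training dynamics, we introduce a diagnostic framework comprising temporal KL structure, refresh shock, and length-tail risk. The framework further reveals state-oblivious collapse: the best short-horizon fixed schedules fail catastrophically under long-horizon training, because a clock-driven refresh can copy a transiently drifting student into the teacher in a single irreversible step. This failure mode is invisible under short-horizon evaluation and is mechanistically different from the chronic contamination of EMA. To address it, we propose Consolidation-Gated Teacher Refresh (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety, so that every teacher movement responds to genuine student consolidation rather than to a clock signal. With a single shared parameter set and no per-dataset retuning, CGTR achieves zero collapse and the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse), and automatically adapts its refresh frequency to each task's learning dynamics.

English abstract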

Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule -- which governs the temporal coupling between teacher and student -- has not been systematically studied as a stability variable. Through a controlled schedule sweep on Qwen3-8B, we establish that isolation periods, defined as complete teacher freezing between updates, are the key structural property enabling stable learning, not teacher age. To characterize these underlying training dynamics, we introduce a diagnostic framework of temporal KL structure, refresh shock, and length-tail risk. This framework further uncovers state-oblivious collapse: optimal short-horizon fixed schedules catastrophically fail under long-horizon training because a clock-driven refresh can copy a transiently drifting student into the teacher in a single, irreversible step. This failure mode is invisible under short-horizon evaluation and mechanistically distinct from EMA's chronic contamination. To address this, we propose Consolidation-Gated Teacher Refresh (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety, ensuring every teacher movement responds to genuine student consolidation rather than a clock signal. With a single shared parameter set and no per-dataset retuning, CGTR achieves zero collapse and the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse), self-regulating its refresh frequency to each task's learning dynamics.
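
The gating idea in this abstract can be illustrated with a minimal Python sketch. The abstract does not state the actual criteria, so the function names `length_tail` and `should_refresh`, the thresholds `min_reward_gain` and `max_tail_length`, and the 95th-percentile tail statistic are all illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean


def length_tail(lengths, q=0.95):
    """Nearest-rank q-quantile of response lengths (an assumed tail statistic)."""
    ordered = sorted(lengths)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def should_refresh(recent_rewards, baseline_reward, recent_lengths,
                   min_reward_gain=0.02, max_tail_length=4096):
    """Allow a teacher refresh only when both signals agree (hypothetical thresholds)."""
    # Evidence 1: the student's mean reward has improved over the baseline.
    reward_improved = mean(recent_rewards) - baseline_reward >= min_reward_gain
    # Evidence 2: the long tail of response lengths is still within bounds.
    tail_is_safe = length_tail(recent_lengths) <= max_tail_length
    return reward_improved and tail_is_safe
```

Between refreshes the teacher would simply stay frozen, which is what preserves the isolation period; a clock-driven schedule would instead refresh regardless of these two signals.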

2606.03523 2026-06-03 cs.CR cs.AI cs.LG (updated version)

High-Precision APT Malware Attribution with Out-of-Scope Resilience

Peter Williams, Adam Sobey, Erisa Karafili

Affiliations: Department of Computer Science, University of Oxford

AI summary: Proposes an APT malware attribution method based on ranked binary classifiers with explicit abstention, which maintains 92% precision and 95% selective accuracy even when 87% of test samples are out of scope.

Abstract

Early attribution of Advanced Persistent Threat (APT) activity can help defenders prioritise investigation, select countermeasures, and reduce the impact of an intrusion. Malware provides useful attribution evidence, but automated APT malware attribution remains difficult in practice. Existing approaches are typically trained and evaluated as closed-set classifiers over a limited number of known APT groups. In operational environments, however, classifiers are likely to encounter samples from groups not represented during training. Closed-set classifiers are then forced to assign such samples to known groups, producing unsupported and potentially misleading attributions. We present a high-precision APT malware attribution method based on ranked binary classifiers with explicit abstention. Rather than training a single multi-class classifier, our approach trains and tunes two binary classifiers per APT group, ranks the classifiers by validation performance, and applies them sequentially. A sample is attributed only when a classifier provides sufficient evidence; otherwise, the method abstains. We evaluate the method on the APT Malware dataset and on a larger combined dataset designed to stress-test out-of-scope behaviour. On the APT Malware dataset, the method achieves higher precision than previously published results on the same dataset. In the most challenging setting, where 87% of test samples came from 60 APT groups excluded from training, the method abstained on 94% of out-of-scope samples while maintaining 92% precision and 95% selective accuracy on the samples it classified.
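
The sequential, abstaining decision rule described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each ranked entry holds a single hypothetical `score_fn` (the abstract mentions two tuned classifiers per group but not how they are combined), and the fixed `threshold` is a placeholder for whatever per-classifier tuning the authors use.

```python
def attribute(sample, ranked_classifiers, threshold=0.9):
    """Apply per-group binary classifiers in rank order; abstain if none is confident.

    ranked_classifiers: list of (group_name, score_fn) pairs, sorted best-first
    by validation performance; score_fn returns a confidence in [0, 1].
    """
    for group, score_fn in ranked_classifiers:
        if score_fn(sample) >= threshold:
            return group
    return None  # explicit abstention


def selective_metrics(predictions, labels):
    """Coverage, and accuracy measured only over non-abstained predictions."""
    decided = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    coverage = len(decided) / len(predictions)
    accuracy = sum(p == y for p, y in decided) / len(decided) if decided else 0.0
    return coverage, accuracy
```

Returning `None` rather than the least-bad group is what separates this from a closed-set classifier: an out-of-scope sample lowers coverage instead of producing a wrong attribution.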

2606.03521 2026-06-03 cs.LG cs.AI (updated version)

Post-Hoc Robustness for Model-Based Reinforcement Learning

Siemen Herremans, Ali Anwar, Siegfried Mercelis

Affiliations: Carnegie Mellon University

AI summary: Proposes a post-hoc robustification method that uses the learned model and a nominal policy for robust policy improvement at inference time, improving robustness through model-predictive control under adversarial rollouts without any additional neural network training.

Abstract

To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a policy under environmental perturbations from an adversary, resulting in a zero-sum Markov game. When adversarially robust RL is combined with model-based RL, the adversary can target a learned transition model instead of the training environment. Extending this idea, this work introduces post-hoc robustification of deep RL agents at inference time. By using the learned model in combination with a trained nominal policy, our approach performs a robust policy improvement step. The goal is to improve robustness without any additional training of neural networks. Specifically, we utilize model-predictive control under adversarial rollouts, which are approximated via projected gradient descent within a bounded uncertainty set. Furthermore, these offline rollouts are performed while considering and mitigating out-of-distribution issues. The proposed methodology is validated by demonstrating significant improvements in robustness when the algorithm is evaluated in perturbed Gymnasium MuJoCo environments, while considering the computational limitations of the post-hoc inference setting.
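
As a rough illustration of the adversarial step mentioned above, the following Python sketch runs projected gradient descent inside a box-shaped uncertainty set. The abstract does not specify the return estimate, the learned model, or the shape of the uncertainty set, so the box constraint, the `grad_fn` callback, and the step size are assumptions chosen only to show the projection mechanics.

```python
def project(delta, eps):
    """Project each coordinate onto the box [-eps, eps] (assumed uncertainty set)."""
    return [max(-eps, min(eps, d)) for d in delta]


def pgd_worst_case(grad_fn, dim, eps, step=0.1, iters=50):
    """Search for a return-minimising perturbation with projected gradient descent.

    grad_fn(delta) returns the gradient of the estimated return with respect
    to the perturbation delta; the adversary steps against that gradient.
    """
    delta = [0.0] * dim
    for _ in range(iters):
        grad = grad_fn(delta)
        stepped = [d - step * g for d, g in zip(delta, grad)]
        delta = project(stepped, eps)
    return delta
```

For a toy linear return with constant gradient `[2.0, -1.0]`, the adversary ends up at the corner of the box that lowers the return most; a planner would then pick actions that score well under that perturbed rollout.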

2606.03518 2026-06-03 cs.AI cs.CR (updated version)

Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

Amjad Ibrahim, Yong Li

Affiliations: Huawei Heisenberg Research Center

AI summary: Because traditional authorization frameworks cannot handle recursive delegation and dynamic scoping in agentic AI, proposes a compositional governance framework that defines delegation types, their permission and accountability implications, and resource scope attenuation, and introduces a compositional operator that overlays agentic semantics on existing policies without rewriting them, enabling accountable authorization.

Comments 12 pages

Abstract

As AI systems evolve from passive models into autonomous active agents capable of initiating actions, collaborating, and delegating tasks, the traditional boundaries of software systems blur. Traditional authorization and delegation frameworks, built around fixed principals, explicit requests, and static scopes, are insufficient to govern agentic systems. Agentic AI demands richer authorization semantics: agents must inherit and delegate permissions, act under time-limited authority, and coordinate through shared protocols. Existing Identity and Access Management (IAM) systems fail to fully capture this notion of agency, lacking mechanisms for recursive delegation, contextual boundaries, and dynamic scoping as executable governance primitives. Unlike access delegation standards such as OAuth 2.0, we treat delegation as a contractual term rather than merely a static token-based consent credential. This paper proposes a compositional governance framework that introduces primitives indispensable for agentic AI. We define types of delegation and their permissions and accountability implications, and we introduce a notion of resource scope attenuation to bound agentic access envelopes. These concepts are expressed as general relational definitions that can be composed into existing authorization domains (e.g., financial systems). To operationalize this composition, we define a compositional operator that overlays new agentic semantics, such as recursive delegation chains, onto existing relational policies without rewriting them. We substantiate this framework through formal proofs and empirical evaluation, showing that it provides a formal yet practical foundation for accountable authorization in agentic AI systems.
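
One way to picture resource scope attenuation along a recursive delegation chain is the small Python sketch below. It is not the paper's formalism: the `Grant` record, the `delegate` function, and the rule that a child grant may only narrow scopes and shorten lifetime are illustrative assumptions about what attenuation could mean in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    holder: str
    scopes: frozenset
    expires_at: float  # time-limited authority, in arbitrary time units


def delegate(parent: Grant, to: str, scopes, expires_at: float) -> Grant:
    """Derive a child grant that can narrow, but never widen, the parent's authority."""
    requested = frozenset(scopes)
    if not requested <= parent.scopes:
        raise PermissionError("delegated scopes must be a subset of the parent's")
    # The child can never outlive the grant it was derived from.
    return Grant(to, requested, min(expires_at, parent.expires_at))
```

Because every call checks against the immediate parent, a chain of any depth stays inside the root grant's envelope without the root policy being rewritten.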

2606.03512 2026-06-03 cs.RO cs.AI (updated version)

SPADE: Sketch-guided Path Planning Augmented with Diffusion Experts

Charbel Abi Hana, Tatiana Ghantous, Mikael Khalil, Anthony Rizk

Affiliations: IDEALworks GmbH; IMT Atlantique; IDEALworks GmbH & Saint Joseph University of Beirut

AI summary: Proposes a diffusion-augmented framework that, through an improved annotation tool and training strategy, improves the generalization and robustness of path planning while remaining real-time, substantially lowering pose error and FID.

DOI: 10.65109/RIHP6974

Abstract

Path planning is essential for Autonomous Mobile Robots (AMRs). Conventional methods for incorporating human preferences into planning typically rely on either complex reward engineering or hardware-intensive solutions. Recent state-of-the-art frameworks leverage imitation learning to train behavior-specific path planning models from expert demonstrations. However, these approaches face two key limitations: limited generalization to unseen environments and low robustness in demonstration collection. To address these challenges, this work introduces an enhanced framework that focuses on two main contributions: an overhauled annotation tool built on ROS 2, and a novel training strategy that integrates diffusion-based augmentation into baseline behavioral cloning models. A dataset of expert demonstrations is provided and evaluated through ablation studies to assess the robustness of the proposed solution. The enhanced approach outperforms state-of-the-art methods with 39.1% lower Absolute Pose Error (APE) and 33.5% lower Fréchet Inception Distance (FID) while having 93.8% fewer trainable parameters. Moreover, it attains diffusion-level generalization while preserving the real-time, on-edge properties of state-of-the-art models.

2606.03503 2026-06-03 cs.AI (updated version)

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen

AI summary: Proposes ThoughtFold, a framework that uses fine-grained preference learning to penalize redundant exploration and encourage direct connections between essential reasoning segments, folding reasoning chains into more concise paths and sharply reducing token usage while preserving accuracy.

Abstract

Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and error and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the overthinking issue of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

2606.03486 2026-06-03 cs.CR cs.AI (updated version)

NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense

Zhongyang Lin, Ziran Zhao, Feifei Zhai, Pengyuan Liu

Affiliations: University of Science and Technology of China

AI summary: Proposes NeuroArmor, a white-box runtime defense that builds safe variants of each prompt as a local safety reference, checks consistency in hidden-state space, and routes anomalies, lowering the malicious attack success rate while also reducing the benign false positive rate.

Comments 16 pages, 4 figures, 17 tables. Submitted to ACL ARR

Abstract

Large language models remain vulnerable to jailbreak attacks that hide harmful intent behind seemingly ordinary requests such as role-play, translation, encoding, adversarial suffixes, and multi-turn buildup. Existing defenses still struggle to handle these attacks without over-blocking benign but sensitive requests, partly because they often apply the same action to every prompt and therefore fail to balance safety and helpfulness. We propose NeuroArmor, a white-box runtime defense that uses prompt-specific safe variants as a local safety reference for deciding when intervention is needed and, once triggered, as safe targets for intervention. For each prompt, NeuroArmor builds K safe variants, compares the prompt state against this local safe reference in hidden-state space, and routes anomalies either to a refusal branch for malicious prompts or to a helpful recovery branch for borderline benign prompts. On Llama-3-8B-Instruct, NeuroArmor reduces malicious attack success rate (ASR) from 41.56% to 1.57% while lowering benign false positive rate (FPR) on the shared benign pool from 30.26% to 22.05%; matched baselines remain substantially weaker on this trade-off. External-judge and manual behavioral evaluations further show that the remaining non-blocked outputs are much less likely to be operationally harmful. Overall, NeuroArmor provides a more effective runtime strategy for jailbreak defense by combining prompt-specific consistency checking, routing, and selective intervention.
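
The consistency check against a prompt-specific safe reference might look roughly like the Python sketch below. The abstract does not say how the comparison is computed, so averaging the K safe-variant states into a centroid, using cosine similarity, and the `min_similarity` threshold are all assumptions; the hidden states are plain lists standing in for real model activations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def is_anomalous(prompt_state, safe_variant_states, min_similarity=0.8):
    """Flag a prompt whose state drifts away from the mean of its K safe variants."""
    k = len(safe_variant_states)
    centroid = [sum(col) / k for col in zip(*safe_variant_states)]
    return cosine(prompt_state, centroid) < min_similarity
```

A flagged prompt would then be routed to either the refusal branch or the helpful recovery branch; that routing decision is not modelled here.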

2606.03483 2026-06-03 cs.LG cs.AI (updated version)

Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation

Ekaterina Alimaskina, Gleb Molodtsov, Aleksandr Beznosikov

Affiliations: MIRAI; BRAIn Lab; Yandex Research; Innopolis University

AI summary: Using fine-grained diagnostics, this paper finds that multi-stream residual connections in Hyper-Connections suffer from stream collapse, in which signal concentrates in a dominant stream, and mitigates the problem by breaking symmetry at initialization to improve performance.

Abstract

Hyper-Connections (HC) replace the single Transformer residual stream with multiple streams, introducing a permutation symmetry over stream indices. We study how this symmetry is resolved in practice: whether streams specialize in a balanced way or exhibit dominant-stream usage. Using fine-grained diagnostics for HC-based language models, we trace how multi-stream representations are actually used. We find that after an early seeding stage, residual mixing often remains close to identity, limiting a core HC mechanism for exchanging information between streams. Moreover, both signal and interpretable features concentrate in a dominant stream, and the nominally multi-stream residual connection can underutilize its capacity, behaving closer to a single-stream residual pathway. Finally, we show that breaking symmetry at stream initialization reduces dominant behavior and improves performance across mHC variants. Our code is publicly available.

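2606.03471 2026-06-03 cs.AI cs.MA q-bio.NC (updated version)

A formal definition and meta-model for a machine theory of mind

Fabio Cuzzolin

AI summary: Drawing on evidence from cognitive psychology, neuroscience, and artificial intelligence, this paper proposes the first rigorous formal definition of a machine theory of mind and builds a holistic meta-model for reviewing existing research and enabling future advances.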

Comments 48 pages, 2 figures

2606.03467 2026-06-03 cs.AI (updated version)

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li, Gang Huang

Affiliations: Peking University

AI summary: Proposes StepFinder, a framework that encodes execution logs into temporal semantic sequences and uses temporal modeling and attention modules to locate the root-cause step of failures in multi-agent systems efficiently and accurately.

Comments 12 pages, 5 figures. Accepted by KDD 2026

Abstract

LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.

2606.03465 2026-06-03 cs.LG cs.AI (updated version)

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

Artur Zagitov, Alexander Miasnikov, Maxim Krutikov, Vladimir Aletov, Gleb Molodtsov, Nail Bashirov, Artem Tsedenov, Aleksandr Beznosikov

Affiliations: University of Florida; National Research University Higher School of Economics

AI summary: Systematically evaluates tensor decompositions for post-training compression of dense and MoE architectures; empirical and theoretical analysis reveals a fundamental mismatch between these decompositions and the heterogeneous representations of LLMs, delineating their practical limits and their viable role in deployment at scale.

2606.03463 2026-06-03 cs.AI cs.CL (updated version)

DMF: A Deterministic Memory Framework for Conversational AI Agents

Matteo Stabile, Enrico Zimuel

Affiliations: Roma Tre University

AI summary: Proposes DMF, a CPU-first deterministic memory framework that replaces generative memory compression with classical NLP analysis, vector geometry, and mathematical scoring, preparing the memory context at zero token cost with accuracy comparable to Mem0.

Comments 21 pages, 3 figures

Abstract

Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating token costs, and opacity in pruning decisions. We present the Deterministic Memory Framework (DMF), a CPU-first approach that replaces generative memory compression with a fully deterministic pipeline grounded in classical NLP analysis, vector geometry, and mathematical scoring. DMF assigns each conversational interaction a Survival Score $\Omega$ computed from deterministic content signals, conversational cues, and structured provenance, combined through a logistic projection. An interaction-count decay law, denoted as $\Omega_{\mathrm{eff}}(\Delta n)$, governs how relevance evolves as new turns arrive, where $\Delta n$ is the number of newer interactions rather than wall-clock time, preserving full determinism. We present the mathematical formulation of DMF, its structured recall pipeline, the pruning decision procedure, and the evaluation protocol. Experiments are conducted on a purpose-built benchmark using the LoCoMo and LongMemEval datasets. We compare DMF against Mem0, a popular memory layer for AI agents. DMF achieves comparable accuracy while using zero tokens to prepare the memory context and 5x to 242x fewer tokens over the entire conversation. These results show that it is possible to eliminate LLM calls from the memory-management loop, reducing token costs to nearly zero and enabling deterministic memory systems for conversational AI agents.
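
To make the scoring idea more tangible, here is a minimal Python sketch of a logistic survival score with an interaction-count decay. The abstract names the ingredients but not the formulas, so the weighted-sum input to the logistic, the exponential form of the decay, and the `decay_rate` value are assumptions made purely for illustration.

```python
import math


def survival_score(signals, weights, bias=0.0):
    """Combine deterministic signals through a logistic projection (assumed form)."""
    z = bias + sum(w * s for w, s in zip(weights, signals))
    return 1.0 / (1.0 + math.exp(-z))


def effective_score(omega, delta_n, decay_rate=0.05):
    """Decay a score by the number of newer interactions, not by wall-clock time.

    The exponential shape is a stand-in; the paper's actual decay law is not
    given in the abstract.
    """
    return omega * math.exp(-decay_rate * delta_n)
```

Because both functions depend only on their arguments, replaying the same conversation yields the same scores and therefore the same pruning decisions, which is the determinism property the abstract emphasises.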

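2606.03461 2026-06-03 cs.AI (updated version)

PRISM: Synergizing Vision Foundation Models via Self-Organizing Expert Specialization

Ying Tang, Dong Li, Youjia Zhang, Zikai Song, Junqing Yu, Wei Yang

Affiliations: University of Science and Technology of China

AI summary: Proposes PRISM, a framework with a dual-stream Mixture-of-Experts (MoE) architecture and a two-stage paradigm (first deconstructing expert knowledge so that experts specialize, then dynamically recomposing it into task-specific pathways) that addresses negative transfer when integrating vision foundation models and sets a new state of the art on PASCAL-Context and NYUD-v2.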

Comments Accepted to ICML 2026

2606.03435 2026-06-03 cs.AI (updated version)

CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations

Yuxin Zhang, Yiyao Li, Ping Shu Ho, Simon See, Zhenqin Wu, Kevin Tsia

Affiliations: Department of Electrical and Computer Engineering, The University of Hong Kong; School of Computing and Data Science, The University of Hong Kong; School of Biomedical Engineering, The University of Hong Kong; Nvidia AI Technology Center; Advanced Biomedical Instrumentation Centre

AI summary: Proposes CP-Agent, a multimodal large language model built on the context-aware alignment module CP-CLIP, which generates interpretable, mechanism-relevant explanations of cell morphological changes under drug perturbations, achieves accurate treatment and mechanism-of-action discrimination (maximum F1-score of 0.896), and combines tool use and reasoning to produce structured reports that can accelerate drug discovery.

Comments ICLR 2026

Abstract

Cell Painting combines multiplexed fluorescent staining, high-content imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism-of-action (MoA) inference, toxicity prediction, and construction of drug-disease atlases. However, existing workflows are slow, costly and difficult to interpret. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context (e.g., cell line, dosing schedule, etc.), limiting generalization and MoA resolution. We introduce CP-Agent, an agentic multimodal large language model (MLLM) capable of generating mechanism-relevant, human-interpretable rationales for cell morphological changes under drug perturbations. At its core, CP-Agent leverages a context-aware alignment module, CP-CLIP, that jointly embeds high-content images and experimental metadata to enable robust treatment and MoA discrimination (achieving a maximum F1-score of 0.896). By integrating CP-CLIP outputs with agentic tool usage and reasoning, CP-Agent compiles rationales into a structured report to guide experimental design and hypothesis refinement. These capabilities highlight CP-Agent's potential to accelerate drug discovery by enabling more interpretable, scalable, and context-aware phenotypic screening -- streamlining iterative cycles of hypothesis generation in drug discovery.

2606.03432 2026-06-03 cs.CR cs.AI cs.LG (updated version)

A Hybrid Approach For Malware Classification Using Secondary Features Fusion

Raja Khurram Shahzad, Muhammad Mustaqeem, Haroon Elahi

AI summary: Proposes a method for malware detection and family classification that fuses API-call and n-gram features and uses a voting-based ensemble, reaching 99.72% accuracy and 0.989 AUC on the Microsoft dataset.

Abstract

The volume of malware, whether variants of known samples or entirely novel, is rapidly increasing, making malware detection and mitigation a complex problem. One approach to improving malware mitigation is automatic detection and malware family classification. However, traditional malware detection methods cannot classify detected malware into their respective families, hindering effective malware mitigation. Consequently, this paper proposes a method to automate malware detection and the classification of detected malware into its respective families. The proposed method extracts relevant malware features, such as API calls and fixed- and variable-length n-grams, and fuses them using a customized feature selection method. Moreover, for the predictive model, a voting-based approach is proposed for algorithm fusion. For the experimental evaluation of the proposed method, both binary and multi-class classification approaches are applied to the data set provided by Microsoft. Finally, the experimental results are compared with the state of the art. The experimental results indicate the effectiveness and efficiency of the proposed approach with an AUC of 0.989, accuracy of 99.72%, and a log loss of 0.01.
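
The voting-based algorithm fusion can be illustrated with a short Python sketch of hard majority voting. The abstract does not say whether the authors use hard or soft voting, or how ties are resolved, so plain majority voting and the function name `majority_vote` are assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse several models' outputs by hard majority voting.

    predictions: one list of predicted labels per model, all of equal length.
    Returns one fused label per sample.
    """
    fused = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused
```

With an odd number of models and binary labels there are no ties; for multi-class family labels a tie-breaking rule (for example, preferring the model with the best validation score) would need to be added.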

2606.03430 2026-06-03 cs.CR cs.AI (updated version)

FlowGuard: Flow Matching for Identity-Independent Detection of Data-Free Model Stealing Attacks on Energy System Intrusion Detection Systems

Maxime Schwarzer, Laurin Holz, Tobias Huerten, Johannes Loevenich, Thies Moehlenhof, Roberto Rigolin F. Lopes, Veit Hagenmeyer

Affiliations: CortAIx Labs, Thales Deutschland; Karlsruhe Institute of Technology

AI summary: Proposes FlowGuard, an identity-independent defense based on flow matching that detects whether queries are out-of-distribution (OOD) to protect energy-system intrusion detection systems against data-free model stealing attacks, maintaining a stable detection rate in both single-client and distributed Sybil settings.

Abstract

Artificial Intelligence (AI)-based Intrusion Detection Systems (IDS) deployed in energy infrastructure are vulnerable to model theft attacks, which allow adversaries to create evasive traffic offline. Current defences against model extraction rely either on identity-bound query monitoring, which is ineffective against distributed attackers (Sybil), or on prediction poisoning through soft-label perturbation, which is inapplicable to hard-label IDS deployments. Therefore, we propose FlowGuard, an identity-independent defence based on flow matching that classifies incoming queries as out-of-distribution (OOD) prior to IDS processing. This approach exploits the fact that queries generated synthetically for data-free model stealing attacks occupy a lower-dimensional manifold than real network traffic. This results in measurably lower log-likelihoods when using a Continuous Normalizing Flow that has been trained on legitimate data. We evaluate our method against PRADA and FDINet using MAZE and DisGUIDE attacks in single-client and distributed (100-client Sybil) settings. While PRADA's detection rate dropped to 0% when the distribution changed, our defence maintained a stable detection rate across both settings without relying on identity information. We discuss the scope and limitations of the approach, and outline potential applications to data-dependent attacks.
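
The likelihood-threshold decision at the heart of this defense can be sketched in Python with a much simpler density model. A diagonal Gaussian is used here only as a stand-in for the Continuous Normalizing Flow in the abstract, and the function names and the way the threshold is calibrated are assumptions for illustration.

```python
import math


def fit_diag_gaussian(rows, eps=1e-6):
    """Fit per-feature mean and variance on legitimate traffic (stand-in density)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    variances = [sum((x - m) ** 2 for x in col) / n + eps
                 for col, m in zip(columns, means)]
    return means, variances


def log_likelihood(x, means, variances):
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, means, variances))


def is_out_of_distribution(x, means, variances, threshold):
    """Reject a query before it reaches the IDS if its log-likelihood is too low."""
    return log_likelihood(x, means, variances) < threshold
```

The decision uses only the query itself, never the identity of the client that sent it, which is why spreading queries across many Sybil clients does not change the outcome for any individual query.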

2606.03428 2026-06-03 cs.NE cs.AI cs.LG (updated version)

PrimeSVT: An Automated Memory-aware Pruning Framework with Prioritized Compression Policy for Spiking Vision Transformers

Rachmad Vidya Wicaksana Putra, Achyuta Muthuvelan, Alberto Marchisio, Muhammad Shafique

Affiliations: eBRAIN Lab, Division of Engineering, New York University (NYU) Abu Dhabi; New York University (NYU) Abu Dhabi, United Arab Emirates (UAE)

AI summary: Proposes PrimeSVT, a framework that compresses Spiking Vision Transformers under accuracy and memory constraints through automated structured pruning and a prioritized compression policy, saving 26.68% of memory while keeping accuracy within 3% of the unpruned model.

Comments 8 pages, 8 figures, 3 tables

Abstract

The large size of Spiking Vision Transformers (SViTs) still hinders their embedded implementation, highlighting the need for model compression. State-of-the-art works compress SViT models through unstructured pruning, which needs specialized hardware accelerators that exploit its specific sparsity patterns to maximize efficiency gains. Moreover, their manual approach requires substantial design time to find an appropriate pruning setting for each network, which makes it hard to scale. To address these limitations, we propose PrimeSVT, a novel framework that performs automated memory-aware structured pruning on pre-trained SViT models, thereby maximizing their efficiency gains during inference on widely-used computing architectures. To achieve this, PrimeSVT first sorts the SViT layers based on their sizes (i.e., number of parameters), identifies the targeted pruning layers based on their robustness under different pruning rates, then leverages this order for compressing the model layer-by-layer sequentially from the largest one to the smallest one (i.e., so-called prioritized compression policy), while considering the user-defined constraints (i.e., acceptable accuracy and memory saving). In each layer, PrimeSVT employs channel-wise filter pruning based on L2-norm values to structurally remove the non-significant weights. Experimental results show that PrimeSVT saves 26.68% memory through automated single-shot pruning, while preserving accuracy within 3% (70.3% without fine-tuning and 72.9% with fine-tuning) of the original unpruned SViT model (73.3%), thus meeting the accuracy and memory constraints. These results show that our PrimeSVT framework enables design automation for SViTs and their embedded implementation.
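
The per-layer step, channel-wise filter pruning by L2 norm, can be sketched in plain Python as follows. Filters are represented as flat lists of weights, and the function name `prune_filters` and the rounding of the pruned count are assumptions; the layer ordering, robustness analysis, and constraint checks of the full framework are not modelled.

```python
import math


def prune_filters(filters, prune_rate):
    """Drop the filters with the smallest L2 norm from one layer.

    filters: list of flat weight lists, one per output channel.
    Returns the kept filters and their original indices, in original order.
    """
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    n_keep = len(filters) - int(prune_rate * len(filters))
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [filters[i] for i in keep], keep
```

Because whole filters are removed, the layer simply becomes smaller and dense, so no sparse-aware accelerator is needed; the input channels of the following layer would have to be trimmed using the returned indices.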

2606.03398 2026-06-03 cs.CL cs.AI 版本更新

Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers

Transformer建模计数器语言中栈表示的因果证据

Nishit Singh

发表机构 * Birla Institute of Technology and Science, Pilani（比拉理工学院和科学学院，皮兰）

AI总结通过线性探针和消融实验，证明Transformer在计数器语言任务中学习的栈表示对其性能具有因果必要性。

Comments 8 pages, 8 figures

2606.03391 2026-06-03 cs.LG cs.AI cs.CL 版本更新

When Model Merging Breaks Routing: Training-Free Calibration for MoE

当模型合并破坏路由：MoE的无训练校准

Canbin Huang, Tianyuan Shi, Xiaojun Quan, Jingang Wang, Jianfei Zhang, Qifan Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对MoE架构中模型合并导致的路由崩溃问题，提出基于二阶曲率的无训练校准方法HARC，通过闭式解和共轭梯度法高效重对齐路由器，显著提升数学推理和代码生成性能。

详情

AI中文摘要

模型合并已成为一种无需重新训练即可整合多个LLM能力的成本效益方法。然而，现有的合并技术主要基于线性参数算术或优化，在应用于混合专家（MoE）架构时面临困难。我们识别出MoE合并中的一个关键失效模式，称为路由崩溃，其中合并后的路由器无法将令牌分派给合适的专家。路由崩溃源于非线性softmax和离散Top-k路由机制对合并引起的参数扰动的敏感性，这种敏感性进一步被MoE预训练期间施加的负载平衡约束放大。由于微调后的专家表现出不同的专长，即使是适度的错误路由也可能导致严重的性能下降。为解决此问题，我们提出Hessian感知路由器校准（HARC），一种无训练框架，利用二阶曲率信息重新对齐合并后的路由器。该方法采用闭式解，可通过无矩阵共轭梯度法高效求解。在数学推理和代码生成任务上的实验表明，HARC有效缓解了多种MoE合并基线中的路由崩溃，并带来了显著的性能提升。我们的代码可在 https://github.com/huangcb01/HARC 获取。

英文摘要

Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non-linear softmax and discrete Top-k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load-balancing constraints imposed during MoE pretraining. Because fine-tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian-Aware Router Calibration (HARC), a training-free framework that leverages second-order curvature information to realign the merged router. This approach admits a closed-form solution that can be efficiently solved using a matrix-free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.
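The abstract mentions a matrix-free conjugate gradient method. As a generic illustration of that solver only (the HARC objective and its Hessian are not reproduced here), the sketch below solves A x = b for a symmetric positive-definite A using nothing but a function that applies A to a vector. The function names and the 2x2 example matrix are assumptions made for the example.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # current search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # residual norm small enough: stop
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

def example_matvec(v):
    # Hypothetical SPD matrix [[4, 1], [1, 3]], applied without ever storing it.
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]

solution = conjugate_gradient(example_matvec, [1.0, 2.0])
```

The appeal of the matrix-free form is that curvature information only has to be available as a matrix-vector product, so the full second-order matrix never needs to be built or inverted.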

2606.03385 2026-06-03 cs.RO cs.AI 版本更新

Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation

先抓取后规划与失败归因：一种用于精确且可泛化机器人操作的闭环两阶段框架

Jiahao Xu, Peiyuan Wang, Hanzhuo Zhang, Zihao Yu, Tianyu Fu, Hao Chen, Xuanhao Xiang, Jianbo Yu, Chenchen Fu, Wanyuan Wang

发表机构 * School of Computer Science and Engineering, Southeast University, China（东南大学计算机科学与工程学院）

AI总结提出GTP-FA框架，通过任务导向的两阶段抓取-规划流程和失败归因模型，在抓取和规划模块中分别注入任务先验和风险惩罚以及针对高风险初始状态进行数据收集和微调，显著提升机器人操作任务的成功率。

Comments 32 pages, project page: https://sites.google.com/view/gtp-fa/

详情

AI中文摘要

在机器人操作中，抓取与运动规划之间的紧密耦合常常掩盖失败的真实原因，导致低效的试错过程。为了实现高效的长时域操作，我们提出了GTP-FA（先抓取后规划与失败归因），一种面向任务的两阶段抓取-规划框架，该框架生成抓取候选并根据所选抓取执行下游运动规划。给定失败的操作轨迹，我们学习一个失败归因模型，该模型可泛化到未见过的抓取，并生成失败模式的稳定分布以进行诊断引导的优化。基于这些归因结果，我们以诊断驱动的方式优化两个模块：在抓取侧，我们将任务级先验和风险惩罚注入抓取候选评分和优化中，以抑制不稳定或与任务不兼容的抓取；在规划侧，我们通过数据收集和微调针对高风险初始状态，以解决真正的规划瓶颈。我们在仿真和真实机器人实验中评估了所提出的框架，并表明GTP-FA在基于RL、IL、扩散策略和VLA的设置中提升了相应的基础学习器，实现了显著更高的总体任务成功率。

英文摘要

In robotic manipulation, the tight coupling between grasping and motion planning often obscures the true source of failure, leading to inefficient trial-and-error. To enable efficient long-horizon manipulation, we propose GTP-FA (Grasp-Then-Plan with Failure Attribution), a task-oriented two-stage grasp-then-plan framework that generates grasp candidates and performs downstream motion planning conditioned on the selected grasp. Given a failed manipulation trajectory, we learn a failure attribution model that generalizes to unseen grasps and produces a stable distribution over failure modes for diagnosis-guided optimization. Based on these attribution results, we then optimize both modules in a diagnosis-driven manner: on the grasping side, we inject task-level priors and risk penalties into grasp candidate scoring and optimization to suppress unstable or task-incompatible grasps; on the planning side, we target high-risk initial states through data collection and fine-tuning to address genuine planning bottlenecks. We evaluate the proposed framework in both simulation and real-robot experiments, and show that GTP-FA improves the corresponding base learners across RL, IL, diffusion-policy, and VLA-based settings, achieving substantially higher overall task success rates.

2606.03381 2026-06-03 cs.CR cs.AI 版本更新

AI Model Extraction Attacks: Bypassing Single-Client Assumptions in Defenses

AI模型提取攻击：绕过防御中的单客户端假设

Maxime Schwarzer, Johannes F. Loevenich, Gustavo Sánchez, Laurin Holz, Thies Möhlenhof, Tobias Hürten, Roberto Rigolin F. Lopes, Veit Hagenmeyer

发表机构 * ETH Zurich（苏黎世联邦理工学院）； University of Zurich（苏黎世大学）； University of Tübingen（图宾根大学）

AI总结本文通过提出CerberusAI框架，系统性地证明模型提取攻击中的单客户端假设（SCA）在高级持续性威胁（APT）等协同攻击者面前无效，并展示基本轮询查询分布策略即可绕过PRADA等防御机制，呼吁转向无状态、独立于身份的防御架构。

详情

AI中文摘要

确保部署在军事指挥控制（C2）系统和关键基础设施中的人工智能（AI）模型的保护对于维持信息优势至关重要。模型提取攻击（MEA）构成了重大威胁，因为它们使对手能够复制专有模型、泄露受保护信息并准备离线对抗性攻击。然而，当前的防御策略主要依赖于单客户端假设（SCA），即隐含地假设攻击源自孤立身份。本工作系统地证明了在协同威胁行为者（如高级持续性威胁APT）存在的情况下，SCA从根本上无效。我们引入了一个模块化、开源框架CerberusAI，用于可复现的模型窃取研究，并利用它模拟分布式攻击场景。我们的实证评估表明，成熟的防御机制（如防止深度神经网络模型窃取攻击PRADA）可以通过基本的轮询查询分布策略被绕过，导致检测性能显著下降。此外，我们证明即使是全局聚合方法也可以通过自适应流量混合使其在操作上变得无用。这些结果强调了在模型提取攻击领域需要向有状态、独立于身份的防御架构进行范式转变。本文最初发表于由信息系统技术（IST）科学与技术委员会IST-224-RSY组织的国际军事通信与信息系统会议（ICMCIS），该会议于2026年5月12-13日在英国巴斯举行，并获得了最佳论文奖。

FLIPS：通过伪随机序列为LLMs进行实例指纹识别

Gurvan Richardeau, Gohar Dashyan, Erwan Le Merrer, Gilles Tredan

发表机构 * Inria（法国国家信息与自动化研究所）

AI总结提出FLIPS方法，利用生成的二进制随机序列中的偏差，在237个模型实例上实现96%（闭集）和90%（开集）的识别准确率，解决了现有指纹识别技术无法区分同一LLM不同配置的问题，为AI监管提供了实例级指纹识别新范式。

Comments 20 pages, 20 figures, 3 tables. 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

文献揭示，大型语言模型（LLM）的行为不仅受其原始权重影响，还受其实例级参数（如指令提示、采样配置或量化）影响。在一种配置下生成安全输出的模型，在另一种配置下可能产生有毒内容。然而，当前的LLM识别技术（如指纹识别）侧重于知识产权保护，其设计倾向于对这些实例级参数的变化具有鲁棒性。这对AI监管构成了关键挑战，因为合规评估针对的是实际部署的行为，而非模型来源。在本文中，我们引入了实例级指纹识别，这是一种面向监管的范式，用于区分同一LLM的不同配置。我们的方法FLIPS利用生成的二进制随机序列中的偏差，在237个模型实例上达到96%（闭集）和90%（开集，其中一些目标未知）的识别准确率，而改编的LLMmap基线仅为35%。这表明实例级指纹识别对于监管既必要又实际可行。代码见 https://github.com/GurvanR/FLIPS-LLM-Instance-Fingerprinting。

英文摘要

Literature reveals that a Large Language Model's (LLM) behavior is conditioned not only by its original weights but also by its instance-level parameters, such as instructional prompt, sampling configuration, or quantization. A model that generates safe outputs under one configuration may produce toxic content under another. However, current LLM identification techniques (such as fingerprinting) focus on intellectual property protection, and their design favors robustness to changes in these instance-level parameters. This poses a critical challenge for AI regulation, in which compliance assessments target actual deployed behaviors, not model provenance. In this paper, we introduce instance-level fingerprinting, a regulator-oriented paradigm that distinguishes configurations of the same LLM. Our method, FLIPS, exploits biases in generated binary random sequences to reach 96% (closed-set) and 90% (open-set, where some targets are unknown) identification accuracy across 237 model instances, versus 35% for the adapted LLMmap baseline. This shows that instance-level fingerprinting is both necessary for regulation and practically feasible. Code available at https://github.com/GurvanR/FLIPS-LLM-Instance-Fingerprinting.

2606.03329 2026-06-03 cs.AI 版本更新

InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

InfoMem: 基于答案条件信息增益训练长上下文记忆智能体

Tiancheng Han, Yong Li, Wuzhou Yu, Qiaosheng Zhang, Wenqi Shao

发表机构 * Tongji University（同济大学）； Shanghai Innovation Institute（上海创新研究院）； Shanghai AI Laboratory（上海人工智能实验室）

AI总结提出InfoMem奖励机制，通过评估最终记忆对真实答案的每token对数似然增益，训练分块记忆智能体以提升长上下文任务性能。

Comments 17 pages, 7 figures

详情

AI中文摘要

长上下文任务要求LLM从大上下文中识别并保留与答案相关的信息。分块记忆智能体通过顺序读取文档块、更新紧凑记忆并从累积记忆中生成最终答案来解决这一问题。然而，现有的基于RL的分块智能体要么依赖稀疏的最终答案奖励，要么使用词汇中间奖励来指导记忆和检索动作。这些信号监督任务成功或局部重叠，但不直接评估最终记忆是否支持真实答案。我们提出InfoMem，一种用于训练分块记忆智能体的奖励机制，该机制使用答案条件信息评估最终记忆的效用。InfoMem衡量最终记忆增加模型对真实答案的每token对数似然的程度。为了稳定RL优化，InfoMem仅对成功轨迹应用此信号，并在奖励组合前对其进行归一化。在相同的GRPO框架和训练预算下，InfoMem在长上下文记忆智能体性能上优于可比的记忆智能体RL基线。分析表明，有效的最终记忆奖励应作用于成功轨迹，在奖励组合前归一化，并基于答案而非查询进行条件化。我们的代码可从 https://github.com/GenSouKa1/InfoMem 获取。

英文摘要

Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading document chunks, updating a compact memory, and generating the final answer from the accumulated memory. However, existing RL-based chunk-wise agents either rely on sparse final-answer rewards or use lexical intermediate rewards for memory and retrieval actions. These signals supervise task success or local overlap, but do not directly evaluate whether the final memory supports the ground-truth answer. We propose InfoMem, a reward mechanism for training chunk-wise memory agents that evaluates final-memory utility using answer-conditioned information. InfoMem measures how much the final memory increases the model's per-token log-likelihood of the ground-truth answer. To stabilize RL optimization, InfoMem applies this signal only to successful trajectories and normalizes it before reward composition. Under the same GRPO framework and training budget, InfoMem improves long-context memory-agent performance over comparable memory-agent RL baselines. Analyses show that effective final-memory rewards should operate on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. Our code is available at https://github.com/GenSouKa1/InfoMem.
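The reward described above (per-token log-likelihood gain of the ground-truth answer, applied only to successful trajectories and normalized before composition) might be computed along these lines. This is a sketch under stated assumptions: the function names are hypothetical, the token log-probabilities are taken as given inputs, and z-score normalization within a group is an assumed choice, since the abstract does not say which normalization is used.

```python
import statistics

def per_token_gain(logp_with_memory, logp_without_memory):
    """Mean per-token log-likelihood gain of the ground-truth answer.

    Both arguments are lists of log-probabilities for the same answer tokens,
    scored once with the final memory in context and once without it.
    """
    with_mem = sum(logp_with_memory) / len(logp_with_memory)
    without_mem = sum(logp_without_memory) / len(logp_without_memory)
    return with_mem - without_mem

def info_rewards(gains, successes):
    """Z-score the gains over successful trajectories; failures receive 0.0."""
    ok = [g for g, s in zip(gains, successes) if s]
    if len(ok) < 2:
        return [0.0] * len(gains)    # not enough successes to normalize
    mean = statistics.mean(ok)
    std = statistics.pstdev(ok)
    if std == 0.0:
        return [0.0] * len(gains)    # identical gains carry no ranking signal
    return [(g - mean) / std if s else 0.0 for g, s in zip(gains, successes)]
```

Restricting the signal to successful trajectories means a memory is never rewarded merely for making the answer more likely when the agent still answered incorrectly.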

2606.03326 2026-06-03 cs.AI 版本更新

The Violation Situation Pattern: A Knowledge-Graph Pattern for Compliance Violations

违规情境模式：一种用于合规违规的知识图谱模式

Nima Kamali Lassem, Fuqi Song, Seyid Amjad Ali

发表机构 * DiliTrust； Department of Information Systems and Technologies, Bilkent University（比尔肯特大学信息系统与技术系）

AI总结提出违规情境模式（VSP），将合规检测中的违规实例化为持久化图节点，支持生命周期状态和审计历史，并通过法律实体合同图实例化四种道义规则验证其有效性。

详情

AI中文摘要

合规管道将违规检测为瞬态查询结果，而不将违规本身作为具有审查状态、受影响实体或审计历史的持久化图对象保留。违规情境模式（VSP）填补了这一空白。基于Gangemi和Mika的情境模式，VSP将每个检测到的违规具体化为一个图节点，包含规则标识符、时间有效性区间、生命周期状态以及与所涉及实体的证据链接。生命周期转换存储为不可变的、符合PROV-O的事件，因此审计历史成为图遍历。我们在法律实体和合同生命周期属性图中实例化VSP，并通过FCL->Cypher->MERGE管道操作四条道义规则（V1未授权签名、V2过期授权、V3缺失保密条款、V4缺失违约通知条款）。我们针对BODACC公司高管出版物检查V1和V2，在73个GDPRhub执法决定上评估V4，并对V3和V4运行SHACL跨形式主义检查。核心发现是规则体独立性：将V4从条款存在性检查扩展到截止日期检查，F1从0.312提升至0.602，而模式的标识、生命周期和证据语义保持不变。这分离了模式贡献与检测器贡献，因此检测逻辑可以演进而不使累积的审计历史失效。

英文摘要

Compliance pipelines detect violations as transient query results and do not keep the violation itself as a persistent graph object with review state, affected entities, or audit history. The Violation Situation Pattern (VSP) closes this gap. Building on the Situation pattern of Gangemi and Mika, VSP reifies each detected violation as a graph node with a rule identifier, a temporal validity interval, a lifecycle state, and evidence links to the entities involved. Lifecycle transitions are stored as immutable, PROV-O-aligned events, so audit history is a graph traversal. We instantiate VSP in a legal entity and contract lifecycle property graph and operationalize four deontic rules (V1 unauthorized signature, V2 expired mandate, V3 missing confidentiality clause, V4 missing breach-notification clause) through an FCL->Cypher->MERGE pipeline. We check V1 and V2 against BODACC corporate-officer publications, evaluate V4 on 73 GDPRhub enforcement decisions, and run a SHACL cross-formalism check on V3 and V4. The central finding is rule-body independence: extending V4 from clause-presence to deadline checking raises F1 from 0.312 to 0.602, while the pattern's identity, lifecycle, and evidence semantics stay the same. This separates a pattern contribution from a detector contribution, so detection logic can evolve without invalidating accumulated audit history.
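The core idea of the pattern, keeping each violation as a persistent object whose lifecycle changes are appended as immutable events, can be illustrated with plain Python dataclasses. This is only a sketch: the class names, field names, state labels, and entity identifiers are hypothetical, and the paper itself works with a property graph and PROV-O-aligned events rather than in-memory objects.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class LifecycleEvent:
    """Immutable record of one state transition."""
    from_state: str
    to_state: str
    on: date
    actor: str

@dataclass
class Violation:
    rule_id: str                       # e.g. "V2" for an expired mandate
    valid_from: date
    valid_to: Optional[date]           # None while the violation is still open
    evidence: Tuple[str, ...]          # identifiers of the entities involved
    state: str = "detected"
    history: List[LifecycleEvent] = field(default_factory=list)

    def transition(self, to_state, on, actor):
        """Append an event instead of overwriting, so the history stays auditable."""
        self.history.append(LifecycleEvent(self.state, to_state, on, actor))
        self.state = to_state

violation = Violation("V2", date(2024, 1, 1), None, ("contract:42", "officer:7"))
violation.transition("under_review", date(2024, 2, 1), "auditor")
violation.transition("resolved", date(2024, 3, 1), "auditor")
```

Because the rule identifier is just a field on the object, the detection logic behind a rule can change without touching the recorded history, which mirrors the rule-body independence the abstract reports.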

2606.03322 2026-06-03 cs.LG cs.AI 版本更新

Multi-Modal Graph Neural Network with Transformer-Guided Adaptive Diffusion for Preclinical Alzheimer Classification

多模态图神经网络与Transformer引导的自适应扩散用于临床前阿尔茨海默病分类

Jaeyoon Sim, Minjae Lee, Guorong Wu, Won Hwa Kim

发表机构 * Pohang University of Science and Technology（浦项科学技术大学）； University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）

AI总结提出一种结合扩散核与多头注意力的图神经网络框架，通过Transformer引导自适应扩散过程，有效融合多模态特征，提升临床前阿尔茨海默病分类性能并识别关键脑区。

Comments 10 pages, Accepted to MICCAI 2024

详情

AI中文摘要

大脑的图形表示通过感兴趣区域（ROI）之间的关系为诊断和预测神经退行性疾病提供了关键见解。尽管近年来出现了各种图神经网络（GNN）来有效捕获关系信息，但在解释大脑网络方面仍存在固有局限性。具体而言，卷积方法无法有效聚合远邻域信息，而基于注意力的方法在捕获节点中心信息方面存在缺陷，特别是在保留关键节点的关键特征方面。这些不足揭示了从不同模态的不同特征中识别疾病特异性变化的挑战。为此，我们提出一个集成框架，通过下游Transformer引导每个节点的扩散过程，其中图的短程和长程属性分别通过扩散核和多头注意力进行聚合。我们通过使用多种模态改进临床前阿尔茨海默病（AD）分类的性能，证明了我们模型的优越性。此外，我们的模型能够熟练识别与AD临床前阶段密切相关的关键ROI，为疾病的早期诊断和预防提供了重要潜力。

英文摘要

The graphical representation of the brain offers critical insights into diagnosing and prognosing neurodegenerative disease via relationships between regions of interest (ROIs). Despite recent emergence of various Graph Neural Networks (GNNs) to effectively capture the relational information, there remain inherent limitations in interpreting the brain networks. Specifically, convolutional approaches ineffectively aggregate information from distant neighborhoods, while attention-based methods exhibit deficiencies in capturing node-centric information, particularly in retaining critical characteristics from pivotal nodes. These shortcomings reveal challenges for identifying disease-specific variation from diverse features from different modalities. In this regard, we propose an integrated framework guiding diffusion process at each node by a downstream transformer where both short- and long-range properties of graphs are aggregated via diffusion-kernel and multi-head attention respectively. We demonstrate the superiority of our model by improving performance of pre-clinical Alzheimer's disease (AD) classification with various modalities. Also, our model adeptly identifies key ROIs that are closely associated with the preclinical stages of AD, marking a significant potential for early diagnosis and prevision of the disease.

2606.03312 2026-06-03 cs.RO cs.AI 版本更新

RobotValues: Evaluating Household Robots When Human Values Conflict

RobotValues: 当人类价值观冲突时评估家用机器人

Jongwook Han, Hyeongjin Kim, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University（首尔国立大学数据科学研究生院）

AI总结提出RobotValues基准，通过10K个价值冲突场景评估家用机器人规划器，发现视觉语言模型存在默认价值偏好且难以覆盖，表明评估需考虑价值冲突下的行动选择。

详情

AI中文摘要

虽然家用机器人通常基于任务完成度进行评估，但日常家庭环境涉及价值冲突情境，其中机器人应选择优先考虑其他价值观（如人类自主性、效率或社会适宜性）而非任务成功的行动。然而，目前尚无评估机器人在此类场景中价值偏好的基准。我们引入RobotValues，一个在10K个价值冲突场景中评估家用机器人规划器的基准。每个实例包含一个逼真的家庭图像和多个优先考虑不同人类价值观的合理机器人动作。我们通过LLM辅助场景生成、利益相关者基于价值观提取、图像生成和自动质量控制构建RobotValues。使用RobotValues评估机器人领域使用的视觉语言模型，发现模型表现出默认价值偏好，包括安全性和适应性，而低估了隐私优先的行动。当模型被指示优先考虑与其自身偏好冲突的特定价值观时，它们通常无法覆盖默认行动，80%的时间选择了错误行动。这些发现表明，家用机器人评估不仅应衡量任务完成度或安全性合规性，还应衡量当人类价值观冲突时机器人是否能在合理行动中做出选择。

英文摘要

While household robots are often evaluated based on task completion, everyday domestic environments involve value-conflicting situations in which robots are expected to choose actions that prioritize values other than task success, such as human autonomy, efficiency, or social appropriateness. Yet, there are no benchmarks for evaluating robots' value preferences in such scenarios. We introduce RobotValues, a benchmark to evaluate household robot planners in 10K value-conflict scenarios. Each instance consists of a realistic household image with multiple plausible robot actions that prioritize different human values. We construct RobotValues through LLM-assisted scenario generation, stakeholder-grounded value extraction, image generation, and automatic quality control. Using RobotValues, we evaluate VLMs used in robotics and find that models exhibit default value preferences, including safety and accommodation, while underselecting privacy-prioritizing actions. When the models are instructed to prioritize specific values that conflict with their own preferences, they often fail to override their default actions, choosing incorrect actions 80% of the time. These findings suggest that household robot evaluation should measure not only task completion or safety compliance, but also whether robots can choose among plausible actions when human values conflict.

2606.03310 2026-06-03 cs.LG cs.AI 版本更新

Learning Multi-Scale Hypergraph for High-Order Brain Connectivity Analysis

学习多尺度超图用于高阶脑连接分析

Jaeyoon Sim, Soojin Hwang, Seunghun Baek, Guorong Wu, Won Hwa Kim

发表机构 * KAIST（韩国科学技术院）

AI总结提出自适应多尺度超边学习框架MuHL，通过构建层次节点特征并动态学习高阶交互，在多个脑网络基准上提升神经退行性疾病分类性能并识别关键脑区。

Comments 24 pages, Accepted to ICML 2026

详情

AI中文摘要

理解脑区之间的复杂交互对于早期神经退行性疾病（如阿尔茨海默病和帕金森病）的分类至关重要。虽然基于图的模型广泛用于分析脑网络，但大多数现有方法主要关注直接连接节点之间的成对交互，限制了其捕捉跨多个区域的高阶依赖关系的能力。尽管已有基于超图的方法来建模高阶关系，但许多方法依赖于预定义的超边或将学习限制在超边权重上，降低了灵活性并限制了其捕捉多分辨率结构模式的能力。为此，我们引入了一个自适应多尺度超边学习框架，即MuHL，该框架构建层次节点特征，并通过在多分辨率图信号上连续构建超边来动态学习高阶交互。在多个脑网络基准上的大量实验表明，MuHL在不同阶段持续提高了疾病分类性能，并从学习到的超边中识别出与疾病进展相关的关键感兴趣区域及其群体交互，突显了其作为神经退行性疾病脑网络分析强大工具的潜力。

英文摘要

Understanding complex interactions between brain regions is critical for early neurodegenerative disease classification such as Alzheimer's Disease (AD) and Parkinson's Disease (PD). While graph-based models are widely used to analyze brain networks, most existing approaches primarily focus on pairwise interactions between directly connected nodes, limiting their ability to capture higher-order dependencies across multiple regions. Although hypergraph-based methods have been proposed to model higher-order relations, many rely on predefined hyperedges or restrict learning to hyperedge weights, reducing flexibility and limiting their capacity to capture multi-resolution structural patterns. In this regard, we introduce an adaptive multi-scale hyperedge learning framework, i.e., MuHL, which constructs hierarchical node features and dynamically learns high-order interactions through continuous hyperedge construction over multi-resolution graph signals. Extensive experiments on multiple brain network benchmarks demonstrate that MuHL consistently improves disease classification performance across different stages, and further identifies key regions of interest (ROIs) and their group-wise interactions from the learned hyperedges that are associated with disease progression, highlighting its potential as a powerful tool for brain network analysis in neurodegenerative disorders.

2606.03305 2026-06-03 cs.AI 版本更新

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

基准审计中的可靠性差距：分布偏移和规模作为污染检测的失败模式

Wojciech Zarzecki, Jan Dubiński, Sebastian Cygert

发表机构 * NASK National Research Institute（NASK国家研究院）； Warsaw University of Technology（华沙理工大学）； Gdańsk University of Technology（格但斯克理工大学）

AI总结研究基准污染检测方法在分布偏移和规模约束下的可靠性，发现三种主流方法在335次评估中仅199次正确，揭示了受控验证与实际审计之间的系统性可靠性差距。

详情

AI中文摘要

基准污染，即评估示例出现在模型的训练数据中，威胁着LLM评估的有效性。存在用于检测训练数据成员身份的统计工具，但几乎仅在受控学术制度中得到验证：大规模、同质的预训练语料库和透明、单阶段的训练流程。这些方法在实际审计场景中是否仍然可靠尚不清楚。我们识别了两种研究不足的失败模式：分布偏移，当可疑集和验证集违反IID假设时出现；以及规模约束，因为基准比预训练语料库小几个数量级。我们系统评估了三种主流范式：LLM数据集推断、事后数据集推断和CoDeC，涉及来自多个家族（包括Pythia、OLMo 2以及专门的文化和医学LLM）和规模（高达27B）的27个模型。然后我们将分析进一步扩展到前沿行业模型。在335次评估中，只有199次产生正确结果。LLM数据集推断在分布偏移下产生假阳性，事后数据集推断在基准规模下效力不足，而CoDeC仅提供粗略的来源信号，不足以验证单个基准分割。我们的结果揭示了受控验证与实际基准审计之间的系统性可靠性差距，并表明统计检测尚不能取代透明的数据来源。我们开源了我们的基准以供进一步研究。

英文摘要

Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods remain reliable in realistic auditing scenarios remains unclear. We identify two under-studied failure modes: distribution shift, which arises when suspect and validation sets violate the IID assumption, and scale constraints, which arise because benchmarks are orders of magnitude smaller than pre-training corpora. We systematically evaluate three leading paradigms (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) across 27 models from multiple families (including Pythia, OLMo 2, and specialised cultural and medical LLMs) and scales (up to 27B). We then further extend our analysis to frontier industry models. Across 335 evaluations, only 199 yield correct outcomes. LLM Dataset Inference results in false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits. Our results reveal a systematic reliability gap between controlled validation and practical benchmark auditing, and show that statistical detection cannot yet replace transparent data provenance. We open-source our benchmark for further research.

2606.03290 2026-06-03 cs.LG cs.AI 版本更新

Message Tuning Outshines Graph Prompt Tuning: A Prismatic Space Perspective

消息调优优于图提示调优：棱镜空间视角

Yancheng Chen, Dun Ma, Shuai Zhang, Yang Liu, Xixun Lin, Xiangyu Zhao, Wenguo Yang, Wei Chen, Chuan Zhou

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出棱镜空间理论（PS-Theory）量化图提示调优的适应能力上限，并引入消息调优（MTG）方法，通过注入可学习消息原型超越该上限，实验验证其优越性。

Comments Accepted by ICML 2026

详情

AI中文摘要

基于预训练与自适应范式的图基础模型（GFMs）已成为图学习的研究热点。对于基于GNN的GFMs，图提示调优已成为下游任务的主流自适应方法。尽管近期方法解释了图提示调优为何有效，但如何严格衡量其适应能力仍是一个开放问题。解决该问题对于理解图提示调优的能力极限以及开发更强大的自适应方法至关重要。本文提出棱镜空间理论（PS-Theory），一种新颖的数学框架，用于量化自适应方法的能力，同时重点建立图提示调优适应能力上限。基于所提出的PS-Theory，我们进一步引入GFMs的消息调优（MTG），一种轻量级方法，在GNN骨干网络的每一层注入少量可学习消息原型，以自适应地引导消息融合，无需更新预训练权重。通过我们的PS-Theory，我们证明MTG的适应能力可以超过图提示调优的理论上限。大量实验表明，MTG在多个基准数据集上始终优于图提示基线，为我们的理论发现提供了强有力的实证支持。

英文摘要

Graph Foundation Models (GFMs), built upon the Pre-training and Adaptation paradigm, have emerged as a research hotspot in graph learning. For GNN-based GFMs, graph prompt tuning has become the prevailing adaptation method for downstream tasks. Although recent methods explain why graph prompt tuning works, how to rigorously measure its adaptation capacity remains an open problem. Addressing this problem is critical for understanding the capability limits of graph prompt tuning and for developing more powerful adaptation methods. In this paper, we propose Prismatic Space Theory (PS-Theory), a novel mathematical framework to quantify the capacity of adaptation methods, while focusing on establishing the upper bound for the adaptation capacity of graph prompt tuning. Building upon the proposed PS-Theory, we further introduce Message Tuning for GFMs (MTG), a lightweight approach that injects a small set of learnable message prototypes into each layer of the GNN backbone to adaptively guide message fusion without updating pre-trained weights. Through our PS-Theory, we prove that the adaptation capacity of MTG can exceed the theoretical upper bound of graph prompt tuning. Extensive experiments demonstrate that MTG consistently outperforms graph prompt baselines across diverse benchmark datasets, providing strong empirical support for our theoretical findings.

2606.03288 2026-06-03 cs.CY cs.AI 版本更新

AI-Generated Traces for Novice Programmers: Learning Effects and Learner Differences in a Multi-Institutional Study

AI生成的新手程序员追踪：多机构研究中的学习效果与学习者差异

Yuri Noviello, Naaz Sibia, Anastasiia Birillo, Thomas Overklift Vaupel Klein, Michael Liut, Gosia Migut

发表机构 * Delft University of Technology（代尔夫特理工大学）； University of Toronto（多伦多大学）； JetBrains Research（JetBrains研究）

AI总结本研究提出AI生成的类比动画追踪（GATs），通过多机构实验比较其与文本解释对新手程序员学习程序执行的影响，发现GATs在即时学习上有选择性优势，但效果依赖情境且短暂，且受学习者参与度调节。

详情

DOI: 10.1145/3803400.3809346

AI中文摘要

入门编程（CS1）课程常常难以支持学生对程序执行的理解。虽然可视化可以使执行过程明确，但其有效性取决于设计和情境，而AI生成可视化的实证证据仍然有限。我们提出了生成动画追踪（GATs），即基于AI生成的、类比驱动的、配有旁白的动画，协调源代码、执行状态和概念类比。我们在两个机构的CS1课程中（Python，N=961；Java，N=151）进行了一项研究，比较GATs与文本解释。我们测量了即时学习表现和体验、课程结束时的参与度和考试成绩。结果表明，GATs可以在即时学习方面产生选择性优势，但优势取决于情境且是短期的。我们观察到GATs对表现的影响受到学习者参与度概况的调节。这一发现强调了个性化方法的重要性。

英文摘要

Introductory programming (CS1) courses often struggle to support students' understanding of program execution. While visualizations can make execution processes explicit, their effectiveness depends on design and context, and empirical evidence for AI-generated visualizations remains limited. We propose Generated Animated Traces (GATs), AI-generated, analogy-based, narrated animations that coordinate source code, execution state, and conceptual analogies. We conduct a study at two institutions in CS1 courses (Python, N=961; Java, N=151) comparing GATs to textual explanations. We measure immediate learning performance and experience, as well as end-of-course engagement and exam performance. Results show that GATs can yield selective benefits for immediate learning, but the benefits are context-dependent and short-term. We observe that GATs' influence on performance is moderated by learner engagement profiles. This finding underscores the importance of personalized approaches.

2606.03273 2026-06-03 cs.CV cs.AI cs.CL 版本更新

VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

VistaHop: 视觉深度搜索的多跳视觉推理基准

Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan, Ting Su, Haiying Sun, Jiajun Chai, Xiaohan Wang, Guojun Yin

发表机构 * East China Normal University（华东师范大学）； Meituan（美团）； Shanghai Innovation Institute（上海创新研究院）

AI总结提出VistaHop基准，通过多跳问答任务评估多模态大推理模型在视觉深度搜索中的迭代图像检查、视觉锚点定位和跨证据链推理能力，实验表明现有模型表现有限。

详情

AI中文摘要

视觉深度搜索要求多模态大推理模型（MLRM）智能体通过反复检查图像区域、将中间推理锚定在视觉证据上，并跨长推理链连接细粒度线索来回答复杂的视觉查询。然而，现有基准主要关注单步视觉理解或静态图像问答，对迭代图像检查、视觉锚点定位和多跳证据整合的评估有限。在这项工作中，我们引入了VistaHop，一个用于评估视觉深度搜索中以视觉为中心的搜索和多跳视觉推理的基准。VistaHop包含300张高分辨率图像、25个视觉搜索场景和350个多跳QA任务，这些任务要求模型跟随从视觉锚点出发的证据链，或融合跨多个基于图像的推理路径的信息。我们进一步开发了VistaArena，一个统一的评估环境，支持带有文本搜索、图像搜索、图像裁剪和基于证据的答案验证的工具增强推理。在七个代表性MLRM上的实验表明，当前模型远未解决VistaHop：最佳模型SenseNova-MARS-32B仅达到24.31%的Pass@1。这些结果揭示了在视觉定位、证据重访、长链推理和多锚点信息融合方面的持续局限性，凸显了对更强基准和训练方法的需求，以推动视觉深度搜索的发展。

英文摘要

Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly focus on single-step visual understanding or static image-question answering, offering limited evaluation of iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. In this work, we introduce VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. VistaHop contains 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks that require models to follow evidence chains from visual anchors or fuse information across multiple image-grounded reasoning paths. We further develop VistaArena, a unified evaluation environment that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer validation. Experiments on seven representative MLRMs show that current models remain far from solving VistaHop: the best model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results reveal persistent limitations in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion, highlighting the need for stronger benchmarks and training methods for Visual DeepSearch.

2606.03270 2026-06-03 cs.LG cs.AI 版本更新

Are Common Substructures Transferable? Riemannian Graph Foundation Model with Neural Vector Bundles

常见子结构可迁移吗？基于神经向量丛的黎曼图基础模型

Li Sun, Zhenhao Huang, Yiding Wang, Qin Chen, Pietro Lio, Philip S. Yu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对图结构迁移性理论缺失的问题，提出基于黎曼几何的神经向量丛框架GAUGE，通过内在几何学习实现可迁移子结构表征，在零样本链接预测和图同构任务中验证了优越性。

Comments Accepted by ICML 2026

详情

AI中文摘要

基础模型通过预训练-适应范式引发了革命，最近的研究将这一成功扩展到图。与其他模态不同，图包含丰富的结构模式，但其结构迁移性仍知之甚少。先前的研究考虑离散领域中的常见子结构，我们被一个基本问题所驱动：常见子结构可迁移吗？其背后的理论很大程度上未被探索。在这项工作中，我们转向通过功能行为的视角学习可迁移结构。理论上，我们将可迁移子结构与表示空间的内在几何联系起来。然而，表征这种内在几何很少被触及。基于黎曼几何，我们开发了一个称为神经向量丛的图内在几何学习框架，该框架能够用局部坐标解析内在几何。在此基础上，我们设计了GAUGE，一个可预训练的神经架构，它构建向量丛，展平几何兼容的局部坐标，以及一个新的狄利克雷损失，该损失也衡量迁移努力。我们通过实验验证了其在具有挑战性的任务（包括零样本链接预测和图同构）中的优越表现力。

英文摘要

Foundation models have sparked a revolution via a pretraining-adaptation paradigm, with recent efforts extending this success to graphs. Unlike other modalities, graphs contain rich structural patterns, yet their structural transferability remains poorly understood. Prior studies consider common substructures in the discrete realm, and we are motivated by a fundamental question: Are common substructures transferable? The underlying theory is largely underexplored. In this work, we shift toward learning transferable structures through the lens of functional behavior. Theoretically, we connect transferable substructures to intrinsic geometry of the representation space. However, characterizing such intrinsic geometry has rarely been touched. Grounded in Riemannian geometry, we develop a graph intrinsic geometry learning framework called Neural Vector Bundle, which enables parsing intrinsic geometry with local coordinates. Building on this, we design GAUGE, a pretrainable neural architecture that constructs the vector bundle, flattening geometrically compatible local coordinates, and a new Dirichlet loss, which also measures the transfer effort. We empirically validate its superior expressiveness in challenging tasks including zero-shot link prediction and graph isomorphism.

2606.03269 2026-06-03 cs.AI 版本更新

现实世界数据集是否包含自然实验？基于因果特征选择的实证研究

Gautam Gare, John Galeotti, Michael Mozer, Deva Ramanan, Nan Rosemary Ke

AI总结本文利用因果发现和特征选择检测现实世界数据集中的自然实验，并通过干预性处理提升模型性能。

详情

AI中文摘要

在自然界中，影响某些个体或群体但不影响其他个体或群体的事件构成隐式干预，被称为自然实验。例如，COVID-19大流行是冠状病毒对感染COVID的亚群的一次干预。我们问：现有的现实世界数据集中是否存在自然实验？如果存在，我们应该如何处理它们？为了检测数据中的自然实验，我们使用因果发现恢复潜在因果图，并基于因果链接进行特征选择。如果通过将数据视为干预性而非观测性来提升下游性能，我们认为这表明数据集包含自然实验。我们首先通过使用合成图模拟包含和不包含自然实验的数据集来验证这一假设。然后，我们在大量现实世界数据集上进行系统的实证评估。我们的结果表明，现实世界数据集确实包含自然实验，我们可以利用这些自然实验通过因果推断来提升模型性能。我们的工作代表了该领域的初步探索，在有限范围内进行了初步研究。

英文摘要

In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.

2606.03238 2026-06-03 cs.LG cs.AI (updated version)

When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

Zelalem Abahana

Affiliations: First Citizens Bank; Alma Mater Europaea University

AI summary: Through comparative experiments with PPO, DPO, and related methods, this paper proposes a mechanistic taxonomy based on the directions of reward and judge scores, classifying RLHF failure modes as localizable, predictable training dynamics.

Comments 20 pages, 8 figures; includes code, artifacts, and live demo

Abstract

Reinforcement learning from human feedback (RLHF) makes large-scale post-training possible by replacing an underspecified human objective with learned and scalable proxies. The same substitution creates a structured failure surface: optimization can raise the learned reward while external quality falls, degrade both proxy and judge scores, reveal proxy under-alignment, or produce evaluator-specific disagreement. We present an empirical failure-mode study of a compact RLHF pipeline with proximal policy optimization (PPO), direct preference optimization (DPO), uncertainty-penalized PPO (UP-PPO), reward-model uncertainty, approximate policy drift, diversity and repetition diagnostics, and two external LLM judges. Rather than treating reward hacking as a single terminal event, we classify matched transitions between checkpoints using the directions of the learned reward, judge scores, and average judge score. Across 61 checkpoint rows and 1920 row-level transitions, aggressive PPO has the highest localized reward-hacking rate (14.45%; bootstrap 95% CI: 10.16-18.75), while UP-PPO yields lower rates in the same aggressive regime (11.33-10.94%). A pre-transition logistic model predicts future row-level reward hacking with ROC-AUC 0.821, and row-level analysis finds localized reward hacking that checkpoint averages miss in 3 of 12 settings. The central conclusion is methodological: RLHF failures are not only final-model pathologies, but training dynamics that can be classified, localized, and partially anticipated.
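
Illustrative sketch (not from the paper): the abstract says transitions between checkpoints are classified by the directions of the learned reward, the judge scores, and the average judge score. A minimal version of that idea might look like the following. The label names, the sign rules, the `eps` dead zone, and the function names are all assumptions made for illustration.

```python
from statistics import mean


def classify_transition(d_reward, d_judges, eps=0.0):
    """Label one checkpoint-to-checkpoint transition for a single row.

    d_reward: change in learned reward between two checkpoints.
    d_judges: list of changes in each external judge's score.
    eps: hypothetical dead zone; smaller changes count as flat.
    """
    d_avg = mean(d_judges)
    # Sign of each judge's change: +1 up, -1 down, 0 flat.
    signs = {(d > eps) - (d < -eps) for d in d_judges}
    if 1 in signs and -1 in signs:
        return "evaluator_disagreement"  # judges move in opposite directions
    if d_reward > eps and d_avg < -eps:
        return "reward_hacking"          # proxy rises while judged quality falls
    if d_reward < -eps and d_avg < -eps:
        return "collapse"                # proxy and judges both degrade
    if d_reward < -eps and d_avg > eps:
        return "proxy_under_alignment"   # judges improve, proxy does not reflect it
    return "other"


def hacking_rate(transitions):
    """Fraction of (d_reward, d_judges) transitions labelled reward hacking."""
    labels = [classify_transition(r, j) for r, j in transitions]
    return labels.count("reward_hacking") / len(labels)
```

Counting labels per row rather than per checkpoint average is what lets localized reward hacking show up even when the checkpoint mean looks healthy.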

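2606.03237 2026-06-03 cs.AI cs.CL cs.CY cs.LG cs.MA (updated version)

Solipsistic Superintelligence is Unlikely to be Cooperative

Rakshit S Trivedi, Natasha Jaques, Logan Cross, Alexander Sasha Vezhnevets, Joel Z Leibo

Affiliations: DeepMind; University of Cambridge; University of California, Berkeley

AI summary: This paper argues that superintelligence (an extremely capable task solver) designed under a solipsistic approach is unlikely to be cooperative because it ignores the endogenous non-stationarity induced by deployment, and calls for a non-solipsistic research paradigm that treats interdependence as a core design principle.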

Comments 24 pages, 1 figure, Accepted at Proceedings of the 43rd International Conference on Machine Learning, 2026

Abstract

AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverge from the deployment context. We refer to this as the self-undermining property of unilateral optimization. Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence. We call for a non-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve. This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.

2606.03236 2026-06-03 cs.AI (updated version)

Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

Zhijie Ding, Weinan Hong, Zicheng Zhu, Lei Li, Dezhi Kong, Hao Wang, Peng Zhou, Xuchu Jiang, Jiaming Xu

Affiliations: HyperAI Team, Xiaomi Corporation; Zhongnan University of Economics and Law; Jilin University; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the Pre-Reasoning Perception Framework (PRPF), in which a lightweight Multimodal Proactive Perceptor (MPP) performs intervention gating and context compression and the Proactive Agent Reasoner (PAR) is activated only when needed, addressing the goal misalignment and redundant inference that arise when proactive mobile agents decide when and how to intervene.

Abstract

Multimodal large language models (MLLMs) have substantially advanced mobile agents, yet proactive mobile assistance remains challenging because agents must decide when to intervene before determining how to assist. Existing systems often implement these two decisions within a unified MLLM-based pipeline, leading to goal misalignment between conservative intervention filtering and comprehensive assistance generation, as well as redundant inference when the agent should remain silent. To address these limitations, we propose the Pre-Reasoning Perception Framework (PRPF), a two-stage framework built on perceiving before reasoning. PRPF introduces a lightweight Multimodal Proactive Perceptor (MPP) for intervention gating and context compression, and activates the Proactive Agent Reasoner (PAR) only when intervention is warranted. Experiments on the ProactiveMobile benchmark show that PRPF substantially reduces false trigger rates (FTR) while improving success rates (SR) and inference efficiency over the ProactiveMobile baseline.
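
Illustrative sketch (not from the paper): the two-stage control flow described above, a cheap gate followed by an expensive reasoner that runs only when the gate opens, could be wired roughly as follows. The `Perception` type, `run_prpf`, and the toy perceptor and reasoner are hypothetical stand-ins; the abstract does not specify any interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Perception:
    should_intervene: bool  # gate decision from the lightweight stage
    summary: str            # compressed context handed to the reasoner


def run_prpf(context: str,
             perceptor: Callable[[str], Perception],
             reasoner: Callable[[str], str]) -> Optional[str]:
    """Stage 1 gates and compresses; stage 2 runs only if the gate opens."""
    perception = perceptor(context)
    if not perception.should_intervene:
        return None  # stay silent and skip the expensive reasoning call
    return reasoner(perception.summary)


# Toy stand-ins for the two models (assumptions, for demonstration only).
def toy_perceptor(context: str) -> Perception:
    return Perception("low battery" in context, context[:40])


def toy_reasoner(summary: str) -> str:
    return f"Suggest action for: {summary}"
```

Because the reasoner is never called when the gate stays closed, silent cases cost only one lightweight pass, which is the source of the efficiency claim in this kind of design.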

2606.03232 2026-06-03 cs.LG cs.AI (updated version)

GFFMERGE: Efficient Merging of Graph Neural Force Fields and Beyond

Parth Verma, Parv P. Singh, Vipul Garg, Ishita Thakre, N. M. Anoop Krishnan, Sayan Ranu

Affiliations: University of California, Berkeley; Stanford University; University of Cambridge

AI summary: Proposes GFFMERGE, a framework that achieves closed-form model merging for graph neural networks via the analytical solution of a convex embedding-alignment problem, recovering performance close to joint training on force-field regression with 5-27x speedups.

Abstract

Graph Neural Networks (GNNs) have revolutionized Neural Force Fields for atomistic simulations, achieving near-quantum accuracy at reduced cost, yet adapting these models to new chemical systems requires expensive retraining of foundation models. Inspired by model merging in vision and language processing, we introduce GFFMERGE, the first principled framework for closed-form model merging in GNNs. We exploit the linear structure of message-passing layers and formulate merging as a convex embedding-alignment problem with an analytical solution. Through the first systematic benchmarking of model merging for GNNs, we show that existing methods designed for vision and language catastrophically fail on force field regression, while GFFMERGE recovers performance approaching gold standard joint training. Across molecular (MD17, MD22), solid-state (LiPS20), and large-scale graph benchmarks, GFFMERGE and GNNMERGE (its generic GNN counterpart) achieve 5-27x speedups while enabling modular composition of specialized models. Remarkably, our closed-form solution alone outperforms all baseline methods before fine-tuning and provides superior initialization for faster, data-efficient convergence.

2606.03223 2026-06-03 cs.RO cs.AI (updated version)

BotDirector: Robot Storytelling Across the Symmetrical Reality with Multi-modal Interactions

Zhe Sun, Meng Wang, Lei Wang, Yuxi Wang, Wanxin Li, Yujia Peng, Zhenliang Zhang

Affiliations: State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China; Peking University, Beijing, China

AI summary: Proposes a robot storytelling system that combines embodied interaction with natural-language interaction, using LLM agents to turn child-created narratives into motion sequences for self-navigating swarm robots, with support for flexible scenes and everyday objects.

2606.03220 2026-06-03 cs.CL cs.AI (updated version)

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang

Affiliations: Tsinghua University; Huawei Noah’s Ark Lab; East China Normal University; Tongji University; Institute of Artificial Intelligence, Beihang University

AI summary: Proposes the WebRISE framework, which uses Interaction Contract Graphs (ICGs) to turn task requirements into observable states, user-intent transitions, and DOM/visual assertions in order to evaluate the functional correctness of MLLM-generated web artifacts; experiments show that ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.

Abstract

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.

2606.03214 2026-06-03 cs.AI cs.CV cs.CY cs.LG (updated version)

Effect of Demographic Bias on Skin Lesion Classification

Ralf Raumanns, Gerard Schouten, Veronika Cheplygina, Josien P. W. Pluim

Affiliations: Fontys University of Applied Science, Venlo, The Netherlands; Fontys University of Applied Science, Eindhoven, The Netherlands; Eindhoven University of Technology, Eindhoven, The Netherlands; IT University of Copenhagen, Denmark

AI summary: This study evaluates skin lesion classification with ResNet-based convolutional models, uses linear programming to control demographic characteristics, examines the effects of patient sex and age bias, and compares three learning strategies; it finds that sex bias stems mainly from data imbalance, whereas age bias consistently favours younger groups.

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA), 26 pages, 12 figures

DOI: 10.59275/j.melba.2026-4156
Journal ref: https://melba-journal.org/2026:011

Abstract

In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age. We use linear programming to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects. Three learning strategies are evaluated: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our sex-based analysis indicates that sex-specific training datasets optimise model performance. Notably, including male patients in the training data improved performance for the male subgroup, even in female-majority cases. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female-majority datasets. However, these strategies proved less effective in male-majority settings, where models continued to perform better for males than females. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations. Age-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories. Younger groups consistently achieve the highest performance, regardless of training data distribution. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories. We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution. These distinct mechanisms require targeted mitigation strategies. Additionally, cross-dataset validation on two external datasets revealed that domain shifts notably affect performance and patterns of demographic bias.

2606.03203 2026-06-03 cs.AI (updated version)

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Jia Yu, Zilong Wang, Xinyang Jiang, Dongsheng Li, Shuo Wang

Affiliations: Microsoft Research Asia; Digital Medical Research Center, School of Basic Medical Sciences, Fudan University; Shanghai Key Laboratory of MICCAI

AI summary: Proposes MedCUA-Bench, an interactive benchmark covering 18 clinical scenarios across 10 medical domains that uses a deterministic checker to evaluate task completion and five clinical safety dimensions, revealing the performance gap of current agents on real clinical software.

Abstract

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.

2606.03198 2026-06-03 cs.CL cs.AI (updated version)

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

Sangwon Baek, Kyu Yeon Hur, Kyunga Kim

Affiliations: Asclep Korea Inc.; Center for Data Science, New York University; Division of Endocrinology and Metabolism, Department of Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine; Biomedical Statistics Center, Samsung Medical Center; Department of Digital Health, SAIHST; Department of Data Convergence & Future Medicine, Sungkyunkwan University

AI summary: Through a factorial study, this paper finds that a rubric-anchored protocol amplifies AI raters' discriminative power whereas a rubric-free protocol suppresses it, supporting rubric anchoring in clinical AI evaluation.

Comments 11 pages, 4 main figures, 8 supplementary figures, 9 supplementary tables

Abstract

Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors -- CDSS model, CDSS prompt configuration (document-referenced generation [DRG] vs. Baseline), rater model, prompt character, and prompt type -- and estimated main effects together with their protocol interactions. Across all questions, AI raters yielded consistently higher scores within a very narrow range (74-78 points on average) under Non-GR compared to those under GR (7.69 to 49.64 points lower mean scores; 1.68 to 3.67 times wider interquartile ranges). Within each question, GR amplified the AI rater's discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.

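2606.03165 2026-06-03 cs.CL cs.AI (updated version)

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

Thomas Stephan Juzek, Xiaoyang Ming, Jose A. Hernandez

Affiliations: University of Washington

AI summary: This paper proposes two evaluation metrics that require no manual intervention, the Lexical Alignment Score and the Triangulated Preference Shift, for automatically identifying lexical overuse in large language models and its link to human preference learning.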

Comments 16 pages, 2 figures, 10 tables

DOI: 10.63317/4ut7ammh7z3h
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 6116-6131

Abstract

The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as 'suggest', 'additionally', and 'strategy', and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.

2606.03159 2026-06-03 cs.CV cs.AI cs.RO (updated version)

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

NVIDIA: Aarti Basant, Amlan Kar, Despoina Paschalidou, Fangyin Wei, Francesco Ferroni, Guillermo Garcia Cobo, Haithem Turki, Huan Ling, Jaewoo Seo, James Lucas, Jay Zhangjie Wu, Jialiang Wang, Jonathan Lorraine, Jun Gao, Kai He, Katarina Tothova, Kevin Xie, Michał Tyszkiewicz, Qi Wu, Riccardo de Lutio, Ruilong Li, Sanja Fidler, Seung Wook Kim, Tianchang Shen, Tianshi Cao, Tobias Pfaff, William Lew, Xindi Wu, Xuanchi Ren, Yifan Lu, Yuxuan Zhang, Zan Gojcic, Zian Wang

AI summary: Proposes OmniDreams, a foundation generative world model trained from the Cosmos diffusion model that autoregressively generates action-conditioned video, enabling real-time synthesis of complex long-tail scenarios in closed-loop simulation, and validates its usefulness for training policy models.

Abstract

As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and struggle to generalize to highly dynamic or novel scenes. To overcome these limitations, we introduce OmniDreams, a foundation generative world model mid- and post-trained from the Cosmos diffusion model to autoregressively generate action-conditioned videos in real time. By leveraging the rich visual priors of Cosmos and mid- and post-training on 21k hours of driving scenarios, OmniDreams synthesizes complex, unobserved phenomena that are hard for traditional simulators to capture, such as extreme weather and unpredictable dynamic agent behaviors. Crucially, it autoregressively conditions its photorealistic sensor generation on past frames, the current simulator state, and immediate driving actions. Deployed in a closed-loop system with the Alpamayo 1 policy model and AlpaSim orchestrator, OmniDreams acts as a highly responsive, reactive environment, providing a scalable and comprehensive solution for training and evaluating next-generation autonomous driving policies. We additionally show preliminary results indicating that a world-action model (WAM) post-trained from OmniDreams achieves strong performance on the Physical AI Autonomous Vehicles NuRec dataset, surpassing the VLA-based Alpamayo 1.5 research policy model while using only 1/5 the total parameters. These results highlight the potential for a real-time world model like OmniDreams to also serve as a backbone for policy architectures.

2606.03157 2026-06-03 cs.AI (updated version)

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

Ruihui Hou, Siyi Zhu, Ziyue Huai, Guangya Yu, Yongqi Fan, Chunming Wang, Tong Ruan

Affiliations: East China University of Science and Technology, Shanghai, China; Renji Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China

AI summary: Proposes the ClinicalMC benchmark, which contains multi-stage samples, and tests the clinical decision-making ability of large language models in single-turn static and multi-turn dynamic settings through a multi-agent evaluation framework.

Abstract

Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient's condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings -- a single-turn static setting and a multi-turn dynamic setting -- and assess three categories of LLMs: 1) closed-source LLMs like GPT5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.

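2606.03144 2026-06-03 cs.AI (updated version)

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Noujoud Nader, Ibrahem Aljabea, Patrick Diehl, Deepti Gupta

Affiliations: Louisiana State University; Los Alamos National Laboratory; Texas A&M-Central Texas

AI summary: This paper proposes the GTBench benchmark, which evaluates the mathematical reasoning of large language models on three groups of graph theory problems of increasing difficulty (undergraduate definitions, algorithmic reasoning, graduate proofs); it finds that GPT-5 performs best while other models degrade markedly as difficulty rises, and reveals systematic disagreement between human evaluators and the automated judge.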

Comments 19 pages, 5 figures, 7 tables

Abstract

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that "correct algorithm, wrong execution" errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.
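
Illustrative sketch (not from the paper): the abstract reports pairwise agreement as "kappa" without naming the variant. Assuming it is Cohen's kappa for two raters assigning categorical grades, the statistic can be computed as below. The function name and the example grades are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must grade the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Two hypothetical human graders marking four proofs as correct (1) or not (0).
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy case the raters agree on three of four items, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.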

2606.03137 2026-06-03 cs.AI (updated version)

Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

Kaiqi Yang, Tai-Quan Peng, Sanguk Lee, Hui Liu

Affiliations: Michigan State University; Hankuk University of Foreign Studies

AI summary: Proposes the TBS framework, which separates agents' internal reasoning from public utterance generation to model the pathway from internal evaluation to public expression, and validates its mechanism sensitivity in climate-policy discussions.

Abstract

LLM-based multi-agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. However, many existing dialogue simulation frameworks represent interaction mainly as observable turn exchange or aggregated outputs, leaving the internal evaluative processes behind silence, speaking intention, and public expression difficult to examine. We introduce TBS (Think-Before-Speak), an interval-based multi-agent simulation framework that separates agents' private reasoning from public utterance generation. At each interval, all agents update structured internal states based on the shared dialogue history and their own memory. These states include dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, and willingness to speak. The orchestrator then resolves competing speaking intentions and commits one utterance to the public dialogue, allowing internal evaluation and public interaction to co-evolve over time. We evaluate TBS in simulated town hall discussions on a climate-related policy issue. Results show that TBS produces coherent internal-state traces and that these traces vary systematically across turn-allocation, silence, and memory conditions. Dissonance-related appraisal increases agents' willingness to speak, whereas silence-pressure appraisal decreases it. Once speaking intention is formed, public expression is shaped mainly by turn-allocation rules. These findings suggest that TBS supports mechanism-sensitive social simulation by making the pathway from internal evaluation to public expression observable and analyzable.
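
Illustrative sketch (not from the paper): one interval of the loop described above has every agent update a private state, after which an orchestrator commits at most one utterance. The rule used here, where the most willing agent above a threshold speaks and otherwise the interval is silent, is only one possible turn-allocation rule and is an assumption, as are the `InternalState` fields and the `resolve_turn` name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InternalState:
    agent: str
    willingness: float  # willingness to speak, assumed to lie in [0, 1]
    draft: str          # privately prepared utterance, not yet public


def resolve_turn(states: List[InternalState],
                 threshold: float = 0.5) -> Optional[InternalState]:
    """Pick the single speaker for this interval, or None for silence."""
    candidates = [s for s in states if s.willingness >= threshold]
    if not candidates:
        return None  # nobody is willing enough: a silent interval
    return max(candidates, key=lambda s: s.willingness)


def step(history: List[str], states: List[InternalState]) -> List[str]:
    """Commit at most one utterance to the shared public dialogue."""
    speaker = resolve_turn(states)
    if speaker is None:
        return history
    return history + [f"{speaker.agent}: {speaker.draft}"]
```

Keeping `InternalState` separate from the public history is what makes silence observable: an agent can hold a draft and a willingness score without anything entering the dialogue.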

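2606.03135 2026-06-03 cs.AI (updated version)

Uncertainty-Aware Clarification in LLM Agents with Information Gain

Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu, Ying Zhao, Zhijiang Guo, Wei Wang

Affiliations: University of Science and Technology of China

AI summary: To address erroneous tool actions by LLM agents caused by underspecified user instructions, proposes a clarification framework driven by an information-gain reward. It quantifies the utility of clarification questions through Bayesian belief updates and trains the agent to ask high-information-gain clarifications, raising task success rate by 3.7% in a τ-Bench environment while adding only 0.3 interaction steps.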

Journal ref: ICML 2026

Abstract

Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goal-oriented clarification framework that aligns clarification behavior with ambiguity resolution. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground-truth goal induced by the clarification exchange. We train the clarifier (LLM) using this reward to optimize for high information gain, ensuring that clarifications effectively reduce uncertainty and improve task completion within the agent-tool-user environment. We validate our framework within a clarification-enhanced τ-Bench environment, conducting cross-agent evaluations across five heterogeneous backbones. Empirical results demonstrate that our method consistently improves the success rate by 3.7% over the no-clarification baseline, while adding only 0.3 total interaction steps on average.
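The abstract describes the Information Gain Reward only at a high level, so the following is a minimal sketch rather than the authors' formulation. It assumes a discrete set of candidate goals, a hypothetical likelihood table for the user's answer, and a reward defined as the log-ratio of posterior to prior belief in the ground-truth goal. The goal names and probabilities are illustrative only.

```python
import math

def bayes_update(prior, likelihood):
    """Return the posterior over goals given P(answer | goal) for the observed answer."""
    unnorm = {g: prior[g] * likelihood[g] for g in prior}
    z = sum(unnorm.values())
    if z == 0:
        raise ValueError("observed answer has zero probability under every goal")
    return {g: p / z for g, p in unnorm.items()}

def information_gain_reward(prior, likelihood, true_goal):
    """Log-ratio of posterior to prior belief in the ground-truth goal (assumed form)."""
    posterior = bayes_update(prior, likelihood)
    return math.log(posterior[true_goal]) - math.log(prior[true_goal])

# Hypothetical example: three candidate goals with a uniform prior.
prior = {"refund": 1 / 3, "exchange": 1 / 3, "cancel": 1 / 3}
# Assumed P(user's answer | goal) for one observed answer to a clarifying question.
likelihood = {"refund": 0.8, "exchange": 0.1, "cancel": 0.1}
reward = information_gain_reward(prior, likelihood, true_goal="refund")
```

Under this definition a question whose answer is equally likely under every goal earns zero reward, while an answer that shifts belief toward the true goal earns a positive one, which matches the abstract's stated aim of rewarding clarifications that reduce uncertainty.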

2606.03128 2026-06-03 cs.CR cs.AI cs.CL cs.LG (updated version)

Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation

Bagus Rakadyanto Oktavianto Putra, Muhamad Risqi Utama Saputra, Widyawan, Guntur Dharma Putra

Affiliations: University of Indonesia

AI summary: Proposes a decoupled smart contract audit framework built on lightweight open-source LLMs (0.6B-4B parameters). Using rsLoRA, knowledge distillation, and a Chain-of-Verification aggregation strategy, it reaches 98.25% accuracy in vulnerability detection, outperforming 7B-34B parameter models.

Comments 12 pages, 4 figures, 5 tables. Accepted to IEEE ICWS 2026

Abstract

Smart contracts face critical security challenges that require thorough auditing in decentralized web services. While Large Language Models (LLMs) have shown promise in automated vulnerability detection, existing approaches lack severity evaluations with actionable remediation and demand unnecessarily massive computational overhead. In this study, we introduce an efficient end-to-end smart contract security audit framework utilizing lightweight, highly optimized open-source LLMs (0.6B-4B parameters). Our framework decouples comprehensive audit tasks into four interconnected components: vulnerability detection, explanation, severity classification, and remediation recommendation. To maintain high accuracy without massive parameters, we implement Rank-Stabilized Low-Rank Adapters (rsLoRA), knowledge distillation, and a custom Chain-of-Verification (CoVe) aggregation strategy to systematically screen and consolidate multiple draft responses from the model into a highly accurate audit report. Experimental results demonstrate that our lightweight pipeline consistently outperforms state-of-the-art open-source coder dense LLMs (7B to 34B parameters), achieving 98.25% accuracy in vulnerability detection and an alignment score of 0.4375 in generative explanation tasks. Furthermore, our extensive ablation studies empirically validate the superiority of our decoupled audit processes over unified prompting and uncover a novel severity centrality bias, establishing a critical benchmark for future research in LLM-assisted auditing.

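2606.03119 2026-06-03 cs.CV cs.AI cs.LG (updated version)

GuidedBridge: Training-freely Improving Bridge Models with Prior Guidance

Zehua Chen, Yucheng Yang, Binjie Yuan, Kaiwen Zheng, Jun S. Liu, Jun Zhu

Affiliations: University of Science and Technology of China

AI summary: Proposes training-free Prior Guidance (PG) and frequency-modulated prior guidance (FMPG), which contrast a weak prior with the seen prior to strengthen prior exploitation in bridge models, and designs a cascaded framework, CFG-FMPG, for image in-painting. Experiments show that the method consistently improves pre-trained bridge models across diverse image translation tasks.

Comments ICML 2026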

Abstract

Guidance methods, such as classifier-free guidance (CFG) and auto-guidance (AG), have advanced noise-to-data generation in diffusion models. Recently, bridge models have introduced a data-to-data generative process that can exploit an instructive clean prior. In this work, inspired by previous methods creating quality difference between denoising results as guidance, we propose a training-free bridge guidance method, termed Prior Guidance (PG). Specifically, we introduce a weak prior, which is unseen during bridge pre-training, hindering prior exploitation and thereby degrading denoising result. Then, we contrast it with the seen prior to highlight and enhance prior exploitation via a scaling factor. Moreover, we analyze the underlying mechanism of prior exploitation in the bridge process and design frequency-modulated prior guidance (FMPG), which tailors the guidance scale to low- and high-frequency bands coherent with bridge generative dynamics. To address prior exploitation in image in-painting, we develop a cascaded framework, CFG-FMPG, which first generates a noisy hidden representation via CFG and then exploits it as a generative prior with FMPG, fulfilling their complementary strengths without compromising inference efficiency. Experiments demonstrate that our PG methods consistently improve pre-trained bridge models across diverse image translation tasks.
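The contrast between the seen prior and the weak prior can be illustrated with a small sketch. It assumes the guidance takes the same extrapolation form as classifier-free guidance, which the abstract suggests but does not state. The function name, the vectors, and the scale value are hypothetical, and the frequency-band modulation used by FMPG is not shown.

```python
def prior_guidance(pred_seen, pred_weak, scale):
    """Extrapolate from the weak-prior prediction toward the seen-prior prediction.

    scale = 1 returns the seen-prior prediction unchanged; scale > 1 amplifies
    the difference attributed to prior exploitation (assumed CFG-style form).
    """
    if len(pred_seen) != len(pred_weak):
        raise ValueError("predictions must have the same length")
    return [w + scale * (s - w) for s, w in zip(pred_seen, pred_weak)]

# Hypothetical denoiser outputs for one step, flattened to short vectors.
pred_seen = [1.0, 2.0, 3.0]   # conditioned on the prior seen in pre-training
pred_weak = [0.5, 2.0, 2.0]   # conditioned on a degraded, unseen prior
guided = prior_guidance(pred_seen, pred_weak, scale=2.0)
```

Coordinates where the two predictions agree are left unchanged, so the scaling factor only acts where the choice of prior makes a difference.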

2606.03103 2026-06-03 cs.AI (updated version)

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao, Xiyun Li, Tianqi Liu, Hongcan Guo, Zilong Huang, Shengyu Zhang

Affiliations: Zhejiang University; Tsinghua University; Tencent; The University of Hong Kong

AI summary: Proposes the DeskCraft benchmark, targeting long-horizon workflows and proactive human-agent collaboration in professional creative software. It evaluates 18 agents using a multilevel difficulty taxonomy and an interaction protocol, finding that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks.

Abstract

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long-horizon creative and engineering workflows and proactive human-agent collaboration. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long-horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open-source agents on 538 tasks and find that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long-horizon workflow delivery and proactive clarification. We will open-source all evaluation code, tasks, and data at https://github.com/mrwwk/DeskCraft.

2606.03099 2026-06-03 cs.CL cs.AI (updated version)

PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search

Kailin Lyu, Zhiqiang Yuan, Jianwei He, Qiwei Yan, Xuanbo Su, Nanxing Hu, Yang Liu, Ce Hao, Shengqian Qin, Lianyu Hu, Jinchao Zhang, Jie Zhou

Affiliations: Pattern Recognition Center, WeChat AI, Tencent Inc.; Institute of Automation, Chinese Academy of Sciences; Zhongguancun Academy; Nanyang Technological University; Shanghai Jiao Tong University

AI summary: Proposes PhotoCraft, a training-free hierarchical memory system that augments multimodal large language models with working, episodic, and semantic memory, enabling multi-step reasoning and knowledge transfer in deep image search and improving retrieval performance on DISBench by up to 18.5%.

Abstract

Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most existing LLM-based agents are stateless and reactive, lacking persistent memory to maintain long-horizon context or transfer experience across tasks, which often leads to execution drift and experience isolation. To address these limitations, we propose PhotoCraft, a training-free, hierarchical memory system for photo-search agents. Inspired by human cognition, PhotoCraft equips MLLMs with working, episodic, and semantic memory, which are dynamically invoked during reasoning to preserve logical consistency and knowledge transferability throughout multi-step reasoning and answer generation. Extensive experiments on DISBench demonstrate that PhotoCraft consistently improves context-aware retrieval across diverse MLLM backbones, achieving gains of up to 18.5% and effectively mitigating key bottlenecks in memoryless deep image search, offering a practical path toward reliable and generalizable multimodal search agents.

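2606.03097 2026-06-03 cs.AI (updated version)

DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

Haoran Tan, Zeyu Zhang, Zhicheng Cao, Rui Li, Xu Chen

Affiliations: Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; Gaoling School of Artificial Intelligence, Renmin University of China; Duke University School of Medicine

AI summary: Proposes the DeltaMem framework, which organizes experience memory into two independent residual trees (goal-conditioned task experience and scene-level environment knowledge), uses incremental delta nodes to reduce redundancy, and achieves efficient retrieval and self-organization through a failure-penalized similarity scan and an autonomous consolidation mechanism, outperforming existing baselines across diverse interactive environments.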

Abstract

Large Language Model (LLM)-based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing experiences as independent, flat units leads to substantial redundancy and retrieval conflicts, as similar episodes repeat overlapping content and subtle scene variations cause retrieved memories to offer contradictory guidance. To address this, we introduce residual experience, positing that newly acquired experience is often an incremental variation of existing knowledge. We propose DeltaMem, a framework that organizes experience memory into two independent residual trees, one storing goal-conditioned task experience as reusable skills and another for scene-level environment knowledge. Each tree uses a root node for generalized base experiences and incremental delta nodes for subsequent variations, allowing related experiences to share a common foundation without duplication. For retrieval, a failure-penalized similarity scan locates the best match, reconstructing the full experience via root-to-match chain composition. An autonomous consolidation mechanism distills high-frequency paths into new root nodes, enabling the trees to self-organize from general heuristics to specialized variants. Experiments across diverse interactive environments show that DeltaMem consistently outperforms existing baselines. To facilitate future research, we release the code at https://github.com/import-myself/DeltaMem.
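The root-plus-delta organization and the root-to-match reconstruction can be sketched as follows. This is an illustrative toy, not the released DeltaMem code: the `Node` class, the Jaccard similarity over words, and the linear per-failure penalty are all assumptions standing in for details the abstract does not give.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    key: str                       # short description used for matching
    delta: dict                    # fields added or overridden relative to the parent
    parent: Optional["Node"] = None
    failures: int = 0              # times this experience led to a failed episode

def reconstruct(node):
    """Compose the full experience by applying deltas from the root down to `node`."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    experience = {}
    for n in reversed(chain):      # root first, matched node last
        experience.update(n.delta)
    return experience

def jaccard(a, b):
    """Word-level Jaccard similarity, used here as a stand-in similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve(nodes, query, penalty=0.1):
    """Pick the node with the highest similarity minus a per-failure penalty."""
    return max(nodes, key=lambda n: jaccard(n.key, query) - penalty * n.failures)

root = Node("heat an object", {"tool": "microwave", "steps": ["find", "take", "heat"]})
mug = Node("heat a mug", {"steps": ["find", "take", "heat", "place on table"]}, parent=root)
full = reconstruct(retrieve([root, mug], "heat a mug"))
```

The delta node stores only the changed `steps` field, and the `tool` field is inherited from the root at reconstruction time, which is the kind of sharing the abstract credits with removing duplication.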

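2606.03080 2026-06-03 cs.CL cs.AI (updated version)

Learning When and Where to Connect: Adaptive Virtual Nodes for Dynamic Message Passing on Graphs

Jaejun Lee, Joyce Jiyoung Whang

Affiliations: School of Computing, KAIST; Department of AI Computing, KAIST

AI summary: Proposes the MAVN framework, which decides in an end-to-end differentiable manner at which layer of a message passing neural network, and for which nodes, to introduce virtual nodes, and establishes connections through a dual-perspective scoring mechanism. The authors prove that it can simulate any node-to-virtual-node connectivity pattern, and experiments show that it substantially improves backbone performance on multiple datasets.

Comments 12 pages, 6 figures, 10 tables, 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)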

DOI: 10.1145/3770855.3818013

Abstract

While Virtual Nodes (VNs) are often utilized in Message Passing Neural Networks (MPNNs) to facilitate effective message passing, existing VN-based methods have limitations, such as constraining all nodes to connect to the same number of VNs, fixing the connections before applying MPNNs, and connecting a node to a VN independently of the other nodes that connect to the same VN. We propose MAVN, an end-to-end differentiable MPNN framework that allows non-constrained connections between nodes and VNs and dynamically introduces VNs on demand in response to evolving node representations across layers. Specifically, MAVN learns to adaptively determine when (at which layer) and where (to which nodes) to introduce and connect VNs based on the relative importance of connections. From a pool of candidate VNs, MAVN selects the necessary VNs in each layer, where each selected VN is connected to a nonempty subset of nodes, guided by a dual-perspective scoring mechanism that jointly captures the nodes' preferences for VNs and the VNs' preferences for nodes. We theoretically prove that for any node-VN connectivity pattern, there exists a set of MAVN's parameters that can simulate the pattern. Experiments on nine real-world datasets demonstrate that MAVN consistently improves the performance of backbone MPNNs, achieving up to 46.5% improvement over the backbones and outperforming the baselines.

2606.03066 2026-06-03 cs.AI (updated version)

CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection

Jinjie Shen, Yaxiong Wang, Yujiao Wu, Lechao Cheng, Tianrui Hui, Nan Pu, Zhihui Li, Zhun Zhong

Affiliations: University of Science and Technology of China

AI summary: Proposes the CORE framework, which builds a Conflict Attribution Corpus and applies conflict-oriented reasoning to strengthen the conflict-capturing capability of multimodal large language models, achieving robust and generalizable multimodal manipulation detection.

Comments Accepted by ICML 2026

Abstract

The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust and social stability. Existing detection methods rely heavily on manipulation-specific models and large-scale labeled data, resulting in poor generalization to emerging manipulation types. We observed that the essence of manipulated misinformation lies in its intrinsic conflicts, i.e., semantic or physical inconsistencies either across modalities or with common world knowledge. Inspired by this observation, we propose the Conflict-Oriented REasoning (CORE) framework, an effective paradigm that learns to endow multimodal large language models (MLLMs) with explicit conflict-capturing capability. To this end, CORE first constructs the Conflict Attribution Corpus (CAC) with fine-grained annotations of conflict factors and sources, providing essential data support for subsequent conflict perception training. By performing conflict-oriented representation enhancement and reasoning based on CAC, CORE achieves robust and generalizable conflict detection, effectively and rapidly adapting to unseen manipulation types with a few samples or even in zero-shot settings. Extensive experiments demonstrate that CORE surpasses state-of-the-art models. The dataset and code are publicly available at https://github.com/shen8424/CORE.

2606.03061 2026-06-03 cs.DC cs.AI cs.LG cs.NI cs.SY eess.SY (updated version)

Brief Announcement: Generative Markov Model for Distributed Computing Systems

Alfreds Lapkovskis, Ali Beikmohammadi, Sindri Magnússon, Praveen Kumar Donta

Affiliations: Department of Computer and Systems Sciences, Stockholm University, Sweden

AI summary: To handle the heterogeneity and complexity of distributed computing systems, proposes a generative Markov model based on a structured state factorization that enables tractable simulation, inference, and policy learning, and demonstrates its effectiveness through a collaborative AI inference case study.

Comments Submitted to 40th International Symposium on Distributed Computing (DISC 2026)

Abstract

Emerging distributed computing paradigms, such as the computing continuum, are inherently heterogeneous, stochastic, and complex. Efficiently and effectively utilizing all available resources across the continuum demands a unified formal model of the system. To address this gap, we propose a general framework for modeling distributed computing systems as a generative Markov model, factorized over a structured system state. In our model, the state decomposes into high-dimensional variables, each further factorized over its elements, reflecting the sparse dependency structure inherent to distributed systems. This yields a tractable model enabling simulation, inference, and policy learning over otherwise intractable system states, bridging distributed computing with Markov chain theory and reinforcement learning (RL). We demonstrate our framework through a case study of collaborative AI inference, in which a dedicated server combines resources with those volunteered by service users. Our results show that centralized scheduling becomes a bottleneck at scale, while distributing computation across user devices reduces both latency and server resource consumption. These findings highlight the value of adaptive decision-making in distributed computing systems and demonstrate the framework's utility for modeling, simulation, and optimization.

2606.03057 2026-06-03 cs.LG cs.AI (updated version)

Rethinking Molecular Text Representations for LLMs: An Empirical Study

Arun Raja, Garrett M. Morris, Kian Ming A. Chai

Affiliations: University of Oxford; DSO National Laboratories

AI summary: Through a systematic benchmark, evaluates 16 LLMs across 9 molecular representations and 8 chemical tasks. Representation choice strongly affects results: structured text representations (CML, MolJSON) lead on structural tasks, IUPAC leads on semantic tasks, and SMILES is rarely optimal.

Comments 25 pages, 11 figures, 20 tables

Abstract

Large language models (LLMs) are increasingly used for molecular tasks, but it remains unclear which molecular representation to use. We present a systematic benchmark evaluating LLM molecular competence across nine representations and eight chemical tasks. We benchmark 16 LLMs across five model families, including reasoning and non-reasoning variants, chemistry-specialized LLMs, and closed frontier models. Performance is strongly representation-dependent and no single representation wins across tasks, though CML is the best, followed by MolJSON, InChI, and then canonical SMILES. Explicit structured text representations (CML and MolJSON) dominate structural tasks; IUPAC dominates semantic tasks, winning molecule retrieval for all 16 LLMs; and SMILES variants are rarely optimal despite their prevalence in pretraining. Chemistry-specialized models perform well with SMILES at the cost of large degradations with structured text representations, suggesting SMILES-only evaluation rewards specialization that does not generalize. Using LLM-as-a-judge, we find that IUPAC produces the highest fraction of correct molecule generations. A mechanistic study via tokenization audits, linear probes and attention shows that representations are encoded differently inside the model; for example, structured representations require higher attention across the molecular span. Our results argue against representation-invariant evaluation and motivate task-aware representation routing for LLM-based chemistry.

2606.03056 2026-06-03 cs.AI (updated version)

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

Tong Bai, Zhenglin Wan, Pengfei Zhou, Xingrui Yu, Wangbo Zhao, Yang You, Ivor W. Tsang

Affiliations: Fudan University; National University of Singapore; CFAR A*STAR

AI summary: Proposes SkillDAG, which models inter-skill relationships as a typed directed graph and exposes it as a structural retrieval interface callable at inference time, with support for online evolution. It substantially outperforms baselines on ALFWorld and SkillsBench.

Comments 19 pages, 5 figures

Abstract

As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity-matching one: skills depend on, conflict with, specialize, or duplicate one another, a structure invisible to both full enumeration and embedding similarity. We present SkillDAG, which models inter-skill relationships as a typed directed graph and exposes it to an LLM agent as an inference-time, agent-callable structural retrieval interface, queried and evolved during execution rather than baked into a fixed retrieval pipeline: each search returns vector matches, typed-edge neighbors, and conflict signals, and a propose-then-commit protocol lets the agent register execution-backed edges so the graph accumulates structure across episodes. On ALFWorld and SkillsBench with MiniMax-M2.7, SkillDAG reaches 67.1% success and 27.3% reward, exceeding the strongest reported Graph-of-Skills baseline by +12.8 and +8.6 points; the advantage ports to gpt-5.2-codex, and intrinsic SkillsBench Ret@K rises from 65.5 to 78.2 under matched queries. These gains trace to isolable mechanisms: candidate ranking that stays robust as the pool grows 10x where a fixed seeding-diffusion pipeline degrades, and set-monotone online edits that enlarge ground-truth recall without evicting prior hits.

2606.03054 2026-06-03 cs.AI (updated version)

ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

Anjie Liu, Yan Song, Zhixun Chen, Ziqin Gong, Zhongwei Yu, Jun Wang

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); University College London; AI Lab, The Yangtze River Delta

AI summary: To address costly and unnecessary tool calls in tool-augmented vision-language agents, proposes ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and structural features, reducing token cost while maintaining or improving accuracy.

Abstract

Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for.

2606.03040 2026-06-03 cs.AI cs.LG (updated version)

RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

Phillip Jiang

Affiliations: Appsofa LLC

AI summary: Proposes the RelGT-AC model, which uses a column masking strategy, a unified task head, and a TF-IDF text encoder, and outperforms the GraphSAGE baseline on autocomplete tasks in relational databases.

Comments 12 pages, 6 figures. Code and model checkpoints available at https://github.com/jiangdmv/graph-transformer

Abstract

Relational databases underpin modern enterprise, scientific, and healthcare systems, yet predictive machine learning on such data remains challenging due to their multi-table, heterogeneous, and temporal structure. Relational Deep Learning (RDL) addresses this by representing databases as heterogeneous graphs and applying graph neural networks (GNNs) directly. RelBench v2 recently introduced autocomplete tasks -- a practically motivated task type where the goal is to predict an existing column value from relational context, analogous to an intelligent form-filling assistant. We propose RelGT-AC (Relational Graph Transformer for Autocomplete), extending the RelGT architecture with three targeted contributions: (1) a column masking strategy that prevents trivial solutions by masking the target column during subgraph encoding; (2) a unified task head supporting binary classification, multiclass classification, and regression autocomplete tasks within a single model; and (3) a TF-IDF text encoder that automatically detects and encodes free-text columns, recovering strong lexical signal that categorical encoders discard. Across 7 tasks spanning 3 RelBench v2 datasets (rel-trial, rel-f1, rel-stack), RelGT-AC outperforms the GraphSAGE baseline on all 3 regression autocomplete tasks and achieves up to +10 AUROC points on text-heavy eligibility tasks via the TF-IDF encoder.

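2606.03036 2026-06-03 cs.AI (updated version)

Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs

Ting Liu

Affiliations: SymbolicLight Research

AI summary: Presents a spike-aware C++ inference runtime that uses sparse binary spike states as an execution primitive and combines a mixed memory layout, AVX2/FMA kernels, and INT8 quantization to decode spiking language models efficiently on commodity CPUs. Throughput exceeds that of dense models of comparable size, though quality is somewhat lower.

Comments 11 pages, 7 tables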

Abstract

Spiking language models expose activation sparsity that dense Transformer runtimes do not directly exploit. This paper studies that property from a systems perspective. Building on the SymbolicLight V1 spike-gated language model family, we implement a C++ CPU inference runtime that treats sparse binary spike states as an execution primitive rather than only applying post-hoc weight compression. The runtime combines a manifest-driven weight loader, mixed row/column memory layout, AVX2/FMA kernels, per-channel symmetric INT8 quantization, and integer-domain accumulation for spike-conditioned sparse paths. On an AMD Ryzen 7 5800X, an early scalar FP32 baseline decodes at 9.5 tokens/s. Mixed-layout AVX2 FP32 raises this to 14.7 tokens/s, and AVX2 INT8 reaches 19.9 tokens/s on the same step-30k export while reducing the weight footprint from 3.49 GB to 1.06 GB. For the available 186k-step 874M-parameter INT8 export, the C++ runtime decodes at 22.63 tokens/s in a single-thread CPU benchmark, compared with 16.31 tokens/s for TinyLlama-1.1B Q8_0, 11.26 tokens/s for Falcon3-1B Q8_0, and 9.70 tokens/s for Qwen2.5-1.5B Q8_0 under llama.cpp. Thread scaling reaches 47.90 tokens/s at four CPU threads, and 512-token prefill improves from 29.86 to 94.68 tokens/s from one to eight threads. The throughput result comes with a quality cost: the SNN reports WikiText-2 perplexity 24.80, worse than the dense baselines in the same benchmark. We frame the result as an inference-systems study for sparse language runtimes, with longer-term motivation in embodied and edge agents that may benefit from local, low-core inference near sensors and actuators. Spike-aware execution can improve CPU throughput and memory behavior for sparse spiking language models, while model quality, controlled dense training baselines, embodied-task evaluation, and measured CPU energy remain open problems.
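Two of the runtime ingredients named above, per-channel symmetric INT8 quantization and integer-domain accumulation over binary spikes, can be sketched in a few lines. The paper's runtime is written in C++ with AVX2 kernels; this Python version only illustrates the arithmetic, and the function names, weights, and spike vector are hypothetical.

```python
def quantize_per_channel(weights):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def spike_matvec(q_rows, scales, spikes):
    """Multiply INT8 weights by a binary spike vector.

    Because spikes are 0 or 1, each output is an integer sum of the weights at
    active positions; the float scale is applied once per channel at the end.
    """
    active = [j for j, s in enumerate(spikes) if s]   # skip silent inputs entirely
    return [scale * sum(row[j] for j in active) for row, scale in zip(q_rows, scales)]

weights = [[0.5, -1.0, 0.25, 0.0],
           [2.0, 0.0, -0.5, 1.0]]
spikes = [1, 0, 0, 1]                                  # hypothetical sparse activation
q_rows, scales = quantize_per_channel(weights)
out = spike_matvec(q_rows, scales, spikes)
```

With a binary input no multiplications are needed inside the inner sum, and the work scales with the number of active spikes rather than the input width, which is the property a spike-aware runtime can exploit.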

2606.03022 2026-06-03 cs.CL cs.AI 版本更新

Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

幻觉作为正交噪声：通过动态上下文正交化实现推理时流形对齐

Mingkuan Zhao, Wentao Hu, Tianchen Huang, Yuheng Min, Suquan Chen, Yide Gao, Yanbo Zhai, Shuangyong Song, Xuelong Li

发表机构 * Xi’an Jiaotong University（西安交通大学）； Xingchen AGI Lab（星辰AGI实验室）； China Telecom AI Technology (Beijing) Co., Ltd.（中国电信人工智能技术（北京）有限公司）； Institute of Artificial Intelligence, China Telecom（中国电信人工智能研究院）； University of Science and Technology of China（中国科学技术大学）； Tsinghua University（清华大学）

AI总结提出一种基于线性表示假设的几何框架，将大语言模型幻觉解释为残差流语义流形的正交噪声，并引入推理时干预方法动态上下文正交化（DCO），通过层间Z分数抑制机制选择性地衰减异常正交分量，在保持知识记忆的同时提升上下文忠实度。

详情

AI中文摘要

大语言模型（LLMs）中的幻觉——即生成与上下文事实或逻辑约束不一致的内容——仍然是可靠部署面临的持续挑战。在这项工作中，我们通过基于线性表示假设的几何框架来解决这个问题。我们提出，幻觉表现为相对于残差流语义流形的正交噪声。具体来说，我们假设虽然注意力头理想地传播与上下文子空间一致的信息，但当特定头引入与该子空间正交的分量时，就会产生幻觉，破坏潜在表示的一致性。基于这一表述，我们引入了动态上下文正交化（DCO），一种推理时干预方法。DCO利用输入残差流作为动态上下文锚点，对注意力头输出进行正交分解。为了区分上下文对齐的语义更新和发散噪声，DCO采用层间Z分数抑制机制，根据统计分布选择性地衰减异常正交分量。在XSum、NQ-Swap和IFEval等基准上对Llama-3-8B和70B的评估表明，与最先进的干预基线相比，DCO实现了更优的上下文忠实度。此外，DCO在TriviaQA和TruthfulQA等知识密集型任务上保持高性能，有效缓解了现有方法中常见的幻觉抑制与参数知识保留之间的权衡。我们的发现验证了幻觉的几何解释，并将DCO确立为一种计算高效的流形对齐方法。代码可在 https://github.com/Harry-Miral/DCO 获取。

英文摘要

Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints, remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference-time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context-aligned semantic updates and divergent noise, DCO employs a layer-wise Z-score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama-3-8B and 70B across benchmarks such as XSum, NQ-Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state-of-the-art intervention baselines. Furthermore, DCO maintains high performance on knowledge-intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade-off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment. Our code is available at https://github.com/Harry-Miral/DCO
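
The decomposition-and-suppression idea described in the abstract can be sketched in a few lines of plain Python. This is an illustrative reading, not the released DCO code: projecting each head output onto a single anchor vector, using the norm of the orthogonal part as the statistic, and scaling outliers by a fixed `damping` factor are all assumptions, as are the names `decompose` and `suppress_outlier_heads`.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def decompose(head_out: Vector, anchor: Vector) -> Tuple[Vector, Vector]:
    """Split head_out into a part parallel to anchor and a part orthogonal to it."""
    denom = dot(anchor, anchor)
    coef = dot(head_out, anchor) / denom if denom > 0 else 0.0
    parallel = [coef * a for a in anchor]
    orth = [h - p for h, p in zip(head_out, parallel)]
    return parallel, orth


def suppress_outlier_heads(head_outs: List[Vector], anchor: Vector,
                           z_threshold: float = 1.5,
                           damping: float = 0.1) -> List[Vector]:
    """Attenuate the orthogonal component of heads whose orthogonal norm
    is a z-score outlier within the layer (hypothetical rule)."""
    parts = [decompose(h, anchor) for h in head_outs]
    norms = [math.sqrt(dot(o, o)) for _, o in parts]
    mean = sum(norms) / len(norms)
    std = math.sqrt(sum((n - mean) ** 2 for n in norms) / len(norms))
    result: List[Vector] = []
    for (parallel, orth), n in zip(parts, norms):
        z = (n - mean) / std if std > 0 else 0.0
        scale = damping if z > z_threshold else 1.0
        result.append([p + scale * o for p, o in zip(parallel, orth)])
    return result
```

Heads whose orthogonal norm is typical for the layer pass through unchanged, so the intervention touches only the statistical outliers.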

2606.03019 2026-06-03 cs.CY cs.AI 版本更新

Reproducibility is the New Copyleft: Defining AGI-oriented Reproducible Builds

可重现性是新的Copyleft：定义面向AGI的可重现构建

Masayuki Hatta

发表机构 * Surugadai University（骏河台大学）

AI总结本文提出面向通用人工智能（AGI）的可重现构建作为Copyleft的功能等价物，通过定义七项要求来确保模型从声明输入到输出的比特精确可重现性，并论证协议而非平台是更优的治理框架。

Comments Accepted at AGI-26. To appear in the proceedings (Springer LNCS)

详情

AI中文摘要

Copyleft，如GNU通用公共许可证中所实施的，是一种利用版权保证用户自由的法律技巧，通过将源代码的可用性与每次分发行为绑定。其规范力量依赖于一个隐含的技术前提：源代码和目标代码之间存在定义明确、可人工审计且可重现的关系。大型语言模型以及未来的通用人工智能（AGI）系统系统地违反了这一前提。重建模型所需的工件——代码、数据、权重、超参数、工具链和硬件配置——各自受到独立的法律、技术和经济约束，当前没有任何开源框架能完全解决这些问题。足够强大的AI系统还可以将许可下的源代码重写为功能等效的衍生作品，从而剥离原始义务，这是一种Copyleft无法有效防御的洗白形式。本文认为，对于AGI，Copyleft的功能等价物必须基于可重现构建，而非代码的共享相同条款：可重现构建是一种保证从声明输入到输出比特精确可重构性的实践。我们回顾了Copyleft的逻辑，批判性地审视了Maffulli的“第二次解放”论点（即AI实现了Stallman的梦想），并表明除非AGI系统本身是可重现的，否则该论点不成立。借鉴开源AI定义（OSAID）、模型开放框架（MOF）、OpenMDW和确定性推理研究，我们定义了面向AGI的可重现构建的七项要求。我们进一步论证，模型上下文协议（MCP）和类似的AI到AI耦合机制构成了一个新的动态链接层，Copyleft式许可对此并不适用，而Masnick的“协议而非平台”框架提供了更有前景的治理模板。

英文摘要

Copyleft, as implemented in licenses such as the GNU General Public License, was a legal hack that used copyright to guarantee user freedom by tying the availability of source code to every act of distribution. Its normative force rested on an implicit technical premise: that source code and object code stand in a well-defined, humanly auditable, and reproducible relationship. Large language models and, prospectively, Artificial General Intelligence (AGI) systems systematically violate this premise. The artifacts jointly required to reconstruct a model -- code, data, weights, hyperparameters, toolchain, and hardware configuration -- are each subject to independent legal, technical, and economic constraints that no current open-source framework fully resolves. Sufficiently capable AI systems can also rewrite licensed source into functionally equivalent derivatives stripped of their original obligations, a form of laundering against which copyleft has no effective defense. This paper argues that a functional analogue of copyleft for AGI must be grounded not in share-alike clauses over code, but in reproducible builds: a practice guaranteeing bit-exact reconstructability from declared inputs. We review the logic of copyleft, critically examine Maffulli's Second Liberation thesis according to which AI fulfills Stallman's dream, and show that the argument collapses unless AGI systems are themselves reproducible. Drawing on the Open Source AI Definition (OSAID), the Model Openness Framework (MOF), OpenMDW, and deterministic-inference research, we define seven requirements for AGI-oriented reproducible builds. We further argue that the Model Context Protocol (MCP) and analogous AI-to-AI coupling mechanisms constitute a new dynamic linking layer for which copyleft-style licensing is ill-suited, and that Masnick's "protocols, not platforms" framework offers a more promising governance template.

2606.03017 2026-06-03 cs.LG cs.AI cs.RO 版本更新

ConTraIRL: Factorized Contrastive Abstractions for Transferable IRL

ConTraIRL：用于可迁移逆强化学习的分解对比抽象

Yikang Gui, Bikramjit Banerjee, Prashant Doshi

发表机构 * School of Computing, University of Georgia（佐治亚大学计算学院）； School of Computing Sciences & Computer Engineering, The University of Southern Mississippi（南密西西比大学计算科学与计算机工程学院）

AI总结提出ConTraIRL框架，通过双编码器对比学习解耦环境动态与任务目标的潜在表示，实现组合奖励迁移，在连续控制基准上显著提升少样本迁移的样本效率和奖励恢复。

详情

AI中文摘要

当策略必须泛化到未见过的环境动态与任务目标组合时，逆强化学习中的奖励迁移不可靠。我们提出用于可迁移逆强化学习的分解对比抽象（ConTraIRL），该框架通过学习这两个因素的解耦潜在表示来实现组合奖励迁移。ConTraIRL采用双编码器架构，将观测映射到分离的动态和目标的潜在空间，并通过双重对比目标进行训练。时间对齐鼓励动态编码器学习目标不变的结构，而目标编码器捕获动态不变的特征。这种分解支持在重组动态-目标设置下的奖励推断。在连续控制基准上的实验表明，对未见过的动态-目标配对进行有效的少样本迁移，与迁移逆强化学习基线相比，提高了样本效率和奖励恢复。

英文摘要

Reward transfer in Inverse Reinforcement Learning (IRL) is unreliable when policies must generalize to unseen combinations of environment dynamics and task goals. We propose Factorized Contrastive Abstractions for Transferable IRL (ConTraIRL), a framework that enables compositional reward transfer by learning decoupled latent representations of these two factors. ConTraIRL uses a dual-encoder architecture that maps observations into separate dynamics and goal latent spaces, trained with a dual contrastive objective. Temporal alignment encourages the dynamics encoder to learn goal-invariant structure, while the goal encoder captures dynamics-invariant features. This factorization supports reward inference under recombined dynamics-goal settings. Experiments on continuous control benchmarks demonstrate effective few-shot transfer to unseen dynamics-goal pairings, improving sample efficiency and reward recovery over transfer IRL baselines.

2606.03005 2026-06-03 cs.CV cs.AI 版本更新

在历史文本上预训练语言模型

Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

发表机构 * University of Waterloo（滑铁卢大学）； Vector Institute（向量研究所）； AIML, Adelaide University（AIML，阿德莱德大学）； Department of Engineering Science, University of Oxford（牛津大学工程科学系）； Oxford Centre for Economic and Social History, University of Oxford（牛津大学经济与社会史中心）； Department of Computer Science, University College London（伦敦大学学院计算机科学系）

AI总结提出TypewriterLM，一个仅在1913年前英文文本上训练的7.24B历史语言模型，通过构建TypewriterCorpus语料库、引入词汇基础指令微调框架和History-Event基准套件，解决数据质量、时间泄漏、训练和评估等挑战。

详情

AI中文摘要

我们介绍了TypewriterLM，一个仅在1913年前英文文本上训练的7.24B历史语言模型。开发历史语言模型需要解决数据质量和可用性、防止时间泄漏、设计时间一致的后训练流程以及构建可靠评估等挑战。为了解决这些问题，我们构建了TypewriterCorpus，一个54B词元的历史语料库，收集自多样化的档案和语言标注来源，并进行了广泛的数据清洗和泄漏缓解措施。此外，我们引入了词汇基础指令微调，一种后训练框架，限制响应直接基于历史源文档。使用该框架，我们构建了两个历史指令微调数据集：History-LIMA和History-SelfInstruct。为了评估能力和时间一致性，我们引入了History-Event，一个用于评估能力、时间基础和泄漏的基准套件。我们发布了TypewriterLM及所有相关资源，以支持未来对历史语言模型的研究。

英文摘要

We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

2606.02979 2026-06-03 cs.CV cs.AI cs.RO 版本更新

Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

面向紧凑型自动驾驶感知的平衡学习与多传感器融合

Oskar Natan, Jun Miura

发表机构 * Department of Computer Science and Engineering, Toyohashi University of Technology（计算机科学与工程系，丰桥技术科学大学）； Department of Computer Science and Electronics, Gadjah Mada University（计算机科学与电子系，加查马达大学）

AI总结提出一种紧凑的深度多任务学习模型，通过自适应损失加权和中间传感器融合技术，在单次前向传播中同时处理语义分割、深度估计、激光雷达分割和鸟瞰投影，实现高效自动驾驶感知。

Comments This work has been accepted for publication in IEEE Transactions on Intelligent Transportation Systems. https://ieeexplore.ieee.org/document/9712213

详情

DOI: 10.1109/TITS.2022.3149370

AI中文摘要

我们提出了一种新颖的紧凑型深度多任务学习模型，能够在一次前向传播中处理多种自动驾驶感知任务。该模型同时执行多视角语义分割、深度估计、激光雷达分割和鸟瞰投影，无需其他模型支持。我们还提供了一种自适应损失加权算法，以解决因任务众多而出现的学习不平衡问题。通过数据预处理和中间传感器融合技术，该模型可以处理并组合来自RGB摄像头、动态视觉传感器（DVS）和安装在自车多个位置的激光雷达的多种输入模态。因此，可以更好地理解动态变化的环境。基于消融研究，使用我们提出的方法训练的模型变体取得了更好的性能。此外，还进行了比较研究，以阐明其与一些近期模型组合相比的性能和有效性。结果表明，即使参数少得多，我们的模型仍能保持更好的性能。因此，该模型可以更快地推理，并减少GPU内存使用。此外，结果在3个不同的CARLA仿真数据集和1个真实世界的nuScenes-lidarseg数据集上保持一致。为了支持未来的研究，我们在以下网址公开共享代码和其他文件：https://github.com/oskarnatan/compact-perception。

英文摘要

We present a novel compact deep multi-task learning model to handle various autonomous driving perception tasks in one forward pass. The model simultaneously performs multi-view semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection without support from other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that arises from the large number of tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle. Therefore, a better understanding of a dynamically changing environment can be achieved. Based on the ablation study, the model variant trained with our proposed method achieves better performance. Furthermore, a comparative study is conducted to clarify its performance and effectiveness against combinations of some recent models. As a result, our model maintains better performance even with far fewer parameters. Hence, the model can run inference faster with less GPU memory utilization. Moreover, the results tend to be consistent across 3 different CARLA simulation datasets and 1 real-world nuScenes-lidarseg dataset. To support future research, we share code and other files publicly at https://github.com/oskarnatan/compact-perception.

2606.02974 2026-06-03 cs.AI cs.HC cs.LG 版本更新

WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition

WISE-HAR：一种基于WiFi的人类活动识别的可泛化集成深度学习框架

Maheen Arshad, Qindeel E Zahra, Muhammad Khuram Shahzad

发表机构 * Department of Computing, School of Electrical Engineering and Computer Science（计算机系，电气工程与计算机科学学院）； National University of Sciences and Technology (NUST)（国家科学与技术大学（NUST））

AI总结本文提出WISE-HAR框架，通过集成五种CNN架构、数据增强和跨场景评估，在Wallhack1.8k数据集上实现94.87%的LOS测试准确率，并展现出强泛化能力。

Comments 8 pages, 5 figures

详情

AI中文摘要

利用WiFi信号进行人类活动识别（HAR）已成为智能家居、医疗监控、安全系统和环境辅助生活的一项变革性技术。与引发严重隐私问题且在弱光条件下失效的传统基于摄像头的系统，或需要用户配合的可穿戴传感器不同，基于WiFi的HAR是非侵入性的、保护隐私的、成本效益高的，并且能在任何光照条件下无缝工作。本文提出了一种综合方法，使用Wallhack1.8k WiFi频谱图数据集识别三种不同的人类活动：“无人”（空房间）、“行走”和“行走+挥手”。我们提出了三项关键改进以应对基于WiFi的HAR的主要挑战。首先，为了解决高性能方差问题，我们实现了集成学习，采用五种不同的CNN架构（Deep CNN、Wide CNN、MobileNetV2、ResNet50V2和EfficientNetB0）。其次，为了解决小数据集大小的限制，我们应用了激进的数据增强技术，包括时间扭曲、频率掩蔽和噪声添加。第三，为了评估真实世界的泛化能力，我们进行了跨场景评估（在视距上训练，在非视距上测试）和跨天线评估（在双锥天线上训练，在PIFA天线上测试）。我们的集成模型在使用双锥天线的LOS场景下达到了94.87%的测试准确率，比最佳单个模型高出0.66%。数据增强将随机森林的性能从60%提升到95%。跨场景评估显示准确率下降极小，仅为1.37%和2.07%，证明了强大的泛化能力。结果表明，所提出的方法鲁棒、可靠，适用于不同硬件配置的多样化环境中的实际部署。

英文摘要

Human Activity Recognition (HAR) using WiFi signals has emerged as a transformative technology for smart homes, healthcare monitoring, security systems, and ambient assisted living. Unlike traditional camera-based systems that raise significant privacy concerns and fail in low-light conditions, or wearable sensors that require user compliance, WiFi-based HAR is non-intrusive, privacy-preserving, cost-effective, and works seamlessly in any lighting condition. This paper presents a comprehensive approach to recognize three distinct human activities: "No Presence" (empty room), "Walking", and "Walking + Arm-waving" using the Wallhack1.8k WiFi spectrogram dataset. We propose three key improvements to address the main challenges in WiFi-based HAR. First, to address high performance variance, we implement ensemble learning with five different CNN architectures (Deep CNN, Wide CNN, MobileNetV2, ResNet50V2, and EfficientNetB0). Second, to address the small dataset size limitation, we apply aggressive data augmentation techniques including time-warping, frequency masking, and noise addition. Third, to evaluate real-world generalization capability, we perform cross-scenario evaluation (training on Line-of-Sight and testing on Non-Line-of-Sight) and cross-antenna evaluation (training on Biquad antenna and testing on PIFA antenna). Our ensemble model achieved a test accuracy of 94.87% on the LOS scenario with Biquad antenna, outperforming the best individual model by 0.66%. Data augmentation improved Random Forest performance from 60% to 95%. Cross-scenario evaluation showed minimal accuracy drops of only 1.37% and 2.07%, demonstrating strong generalization capabilities. The results indicate that the proposed approach is robust, reliable, and suitable for real-world deployment in diverse environments with different hardware configurations.

2606.02967 2026-06-03 cs.ET cs.AI cs.AR cs.SY eess.SY 版本更新

Glass Box at Orbit: A Constitutional AI Verification Framework for Trustworthy Autonomous CubeSat Intelligence

轨道上的玻璃盒：面向可信自主立方星智能的宪法AI验证框架

Karthik Barma, Anil Sanneboyina, V C Premchand Yadav

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结提出玻璃盒框架，通过运行时宪法AI验证层拦截自主航天器决策，利用六项物理约束和七项线性时序逻辑安全不变式确保安全，并证明其验证开销与模型规模无关。

Comments 12 pages, 2 figures, 2 tables, 32 references. Paper 1 of the Project October series on autonomous orbital intelligence

详情

AI中文摘要

航天工业正在悄然构建一个尚未被充分认识的事物：在地球上空550公里处运行数千个自主AI工作负载的轨道数据中心，且无人类参与。微软、AWS以及越来越多的轨道计算企业正在将云规模处理从地面转移到轨道。然而，它们都尚未回答治理问题——当轨道数据中心规模的自主AI系统在太空中做出错误决策时，如何在决策变得不可逆转之前阻止它们？我们引入玻璃盒：一个运行时宪法AI验证层，在单个命令到达任何航天器子系统之前，拦截来自机载AI策略的每个候选动作，并根据六项基于物理的宪法约束和七项线性时序逻辑（LTL）安全不变式对其进行评估。每个批准的动作都附带一个加权可解释性分数E(a_t)（范围[0,1]）和完整的宪法审计日志。我们在Project October中演示了玻璃盒：一个针对CubeSat级航天器的完全模拟的五层自主轨道智能架构。我们证明玻璃盒的验证开销为O(N_c)，其中N_c是宪法规则的数量，与模型大小或航天器状态维度无关。我们提供了宪法约束语法的完整形式规范、通过Z3和NuSMV模型检查验证的七项LTL安全不变式，以及一个详细的工作示例，展示玻璃盒在电池状态退化的日食入口处拦截不安全推理请求。随着轨道计算向数据中心基础设施规模发展，运行时宪法验证不再是研究上的新奇事物——它是每个自主轨道平台最终将需要的任务关键型安全基础设施。

英文摘要

The space industry is quietly building toward something nobody has fully reckoned with: orbital data centers running thousands of autonomous AI workloads with no human in the loop, 550 km above the Earth. Microsoft, AWS, and a growing list of orbital computing ventures are moving cloud-scale processing off the ground and into orbit. What none of them have answered yet is the governance question -- when autonomous AI systems at orbital data center scale make wrong decisions in space, what stops those decisions before they become irreversible? We introduce Glass Box: a runtime constitutional AI verification layer that intercepts every candidate action from an onboard AI policy and evaluates it against six physics-grounded constitutional constraints and seven Linear Temporal Logic (LTL) safety invariants before a single command reaches any spacecraft subsystem. Every approved action carries a weighted explainability score E(a_t) in [0,1] and a complete constitutional audit log. We demonstrate Glass Box within Project October: a fully simulated five-layer autonomous orbital intelligence architecture for CubeSat-class spacecraft. We prove that Glass Box verification overhead is O(N_c) in the number of constitutional rules, independent of model size or spacecraft state dimension. We present a complete formal specification of the constitutional constraint grammar, seven LTL safety invariants verified by Z3 and NuSMV model checking, and a detailed worked example of Glass Box intercepting an unsafe inference request at eclipse-entry under degraded battery state. As orbital computing scales toward data center infrastructure, runtime constitutional verification is no longer a research novelty -- it is mission-critical safety infrastructure that every autonomous orbital platform will eventually require.
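
The claim that verification cost is linear in the number of rules can be pictured with a small sketch of a runtime gate. Everything concrete here is hypothetical: the `Rule` type, the state and action fields (`battery_soc`, `in_eclipse`, `energy_cost`, `power_w`), and the two example thresholds are invented for illustration and are not the paper's constraint grammar or LTL invariants. The gate evaluates each rule exactly once per candidate action and records every outcome in an audit log.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Action = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[State, Action], bool]  # True means the rule is satisfied


def verify(action: Action, state: State, rules: List[Rule]) -> Tuple[bool, List[Tuple[str, bool]]]:
    """Evaluate every rule once: one pass over N_c rules, whatever the
    size of the policy model that proposed the action."""
    audit = [(rule.name, rule.check(state, action)) for rule in rules]
    approved = all(ok for _, ok in audit)
    return approved, audit


# Hypothetical example rules, not taken from the paper.
RULES = [
    Rule("battery_floor", lambda s, a: s["battery_soc"] - a["energy_cost"] >= 0.2),
    Rule("eclipse_power_cap", lambda s, a: s["in_eclipse"] == 0.0 or a["power_w"] <= 2.0),
]
```

With a state of `{"battery_soc": 0.25, "in_eclipse": 1.0}`, a power-hungry inference request fails both example rules and is blocked, while the audit log still lists the result of every check.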

2606.02965 2026-06-03 cs.AI 版本更新

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

基准测试无法衡量的：论自主智能体弃权能力的评估

Victor Ojewale, Suresh Venkatasubramanian

发表机构 * Brown University（布朗大学）

AI总结本文指出自主智能体基准测试忽视弃权能力，提出合规偏差概念，并引入弃权场景分类和评估协议，实验表明安全-可用性权衡是可调的。

Comments ACM CAIS 2026: RLEval Workshop Oral Presentation (Best Paper Award)

详情

AI中文摘要

自主智能体的基准测试衡量智能体是否完成任务，然而这种框架系统地忽略了智能体是否应该继续执行任务。在人类反馈目标下训练的智能体形成了一种结构性倾向，即使缺乏安全行动所需的输入、证据或授权也会继续执行，我们将这种倾向称为合规偏差，因为奖励信号和基准测试评分机制都将继续执行视为正确的默认行为，无论安全行动的前提条件是否满足。我们做出三项贡献。首先，我们表明合规偏差源于人类反馈流程中的奖励黑客行为，并因主流智能体基准测试而根深蒂固，这些基准测试要么惩罚智能体的暂停，要么在架构上无法区分有原则的暂停和静默失败。然后，我们引入弃权合理场景的三缺口分类法，涵盖所需信息缺失的规范缺口、无法确认世界状态的验证缺口以及未获得明确授权的权威缺口，这些共同为构建弃权感知的智能体基准测试提供了原则性基础。最后，我们提出弃权评估协议（安全率、可用率和知情拒绝率），并报告了144个企业智能体场景和五个模型系列的初步结果，其中运行时强制弃权机制在授权场景下实现了高达89.2%的危险动作阻断和87.5%的可用性，表明安全-可用性权衡是可调的而非固有的，并且其形状在不同模型系列间差异显著。我们将此视为初步工作，并提供分类法和复合指标作为进一步讨论的起点。

英文摘要

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained under human-feedback objectives develop a structural tendency to proceed even when they lack the inputs, evidence, or authorization to act safely, a disposition we term compliance bias, because both the reward signal and the benchmark scoring regime treat proceeding as the correct default regardless of whether the preconditions for safe action are present. We make three contributions. We first show that compliance bias originates in reward hacking within human-feedback pipelines and is entrenched by prominent agent benchmarks, which either penalize agents for pausing or are architecturally unable to distinguish a principled pause from a silent failure. We then introduce a three-gap taxonomy of abstention-warranted scenarios, covering specification gaps where required information is absent, verification gaps where world state cannot be confirmed, and authority gaps where explicit authorization has not been given, which together provide a principled basis for constructing abstention-aware agent benchmarks. Finally, we propose abstention evaluation protocols (Safety Rate, Usability Rate, and Informed Refusal Rate) and report preliminary results across 144 enterprise agent scenarios and five model families, in which a runtime-enforced abstention mechanism achieves up to 89.2% hazardous-action blocking and 87.5% usability on authorized scenarios, demonstrating that the safety--usability tradeoff is tunable rather than inherent and that its shape varies substantially across model families. We treat this as preliminary work and offer the taxonomy and composite metrics as a starting point for further conversations.

2606.02962 2026-06-03 cs.CV cs.AI cs.HC eess.IV 版本更新

Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

面向自我中心自然语言查询定位的手部轨迹融合

Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García

发表机构 * Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center , ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain（图像处理小组（GTI）、信息处理与电信中心、电信工程学院、马德里理工大学、西班牙）

AI总结针对自我中心视频中的自然语言查询定位任务，提出手部轨迹编码器与自适应门控交叉注意力融合方法，利用手部运动信息提升查询定位性能。

Comments Accepted for the poster session at the Egocentric Vision (EgoVis) Workshop in Conjunction with CVPR 2026

2606.02958 2026-06-03 cs.CR cs.AI 版本更新

Echelon: Auditable Aggregate-Only Language-Model Adaptation Across Privacy Boundaries

Echelon: 跨隐私边界的可审计聚合专用语言模型适配

Hina Dixit, Punit Kumar, Irene Tenison, Nevasini Sasikumar

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结提出Echelon架构，通过强制设备级模型状态不可导出为系统不变量，仅允许聚合后的跨边界数据传输，并结合缓冲半异步安全聚合、陈旧感知加权等机制，在1B参数LoRA适配中实现低通信开销下的稳定训练。

详情

AI中文摘要

跨组织语言模型适配日益面临严格的治理约束：在许多部署中，设备级模型状态（参数、激活值、优化器状态及每设备更新）无法导出到管理边界之外。现有的分布式和联邦学习栈通常假设跨站点模型交换，然后改造隐私机制，这使合规性复杂化并导致审计脆弱。我们提出Echelon，一种边界优先的训练架构，将设备级模型状态不可导出作为系统不变量强制执行。设备在每个边界内本地训练；唯一的跨边界负载是安全聚合的边界级增量加上O(1)协调元数据，并通过具体的审计接口暴露。将交换限制为聚合改变了优化问题：系统必须在广域网延迟、异构参与、节点波动和非独立同分布数据下保持稳定，尽管全局层面从未看到每设备更新。Echelon结合了缓冲半异步安全聚合、陈旧感知加权、参与窗口、近端局部目标以及漂移感知外同步控制器。在M=2个边界上的1B参数LoRA适配中，预算匹配的竞赛（三个种子，24.88M tokens）达到验证损失3.887 +/-0.010，并在固定token、固定字节、固定挂钟时间和固定同步次数预算下，在调优的低通信基线中表现最佳或并列最佳。在OpenWebText压力测试中，Echelon在评估的广域网和非独立同分布处理下维持2,139-2,176 tokens/s的吞吐量；Echelon-DA在广域网延迟下相对于隐私对等的DiLoCo+SA基线改善了达到目标的时间，并且在200ms模拟延迟或严重非独立同分布分区下质量最多下降2.2%。

英文摘要

Cross-organization language-model adaptation increasingly faces hard governance constraints: in many deployments, device-level model state (parameters, activations, optimizer state, and per-device updates) cannot be exported outside an administrative boundary. Existing distributed and federated stacks typically assume cross-site model exchange and then retrofit privacy mechanisms, which complicates compliance and makes auditing brittle. We present Echelon, a boundary-first training architecture that enforces device-level model-state non-export as a systems invariant. Devices train locally inside each boundary; the only cross-boundary payloads are securely aggregated boundary-level deltas plus O(1) coordination metadata, exposed through a concrete audit surface. Restricting exchange to aggregates changes the optimization problem: the system must remain stable under WAN delay, heterogeneous participation, churn, and non-IID data even though the global plane never sees per-device updates. Echelon combines buffered semi-asynchronous secure aggregation, staleness-aware weighting, participation windows, proximal local objectives, and a drift-aware outer synchronization controller. In 1B-parameter LoRA adaptation across M = 2 boundaries, a budget-matched contest over three seeds (24.88M tokens) reaches validation loss 3.887 ± 0.010 and is best or tied-best among tuned low-communication baselines under fixed-token, fixed-bytes, fixed-wall-clock, and fixed-sync-count budgets. In OpenWebText stress tests, Echelon sustains 2,139-2,176 tokens/s across evaluated WAN and non-IID treatments; Echelon-DA improves time-to-target under WAN latency relative to a privacy-parity DiLoCo+SA baseline; and quality degrades by at most 2.2% under 200 ms emulated latency or severe non-IID partitioning.
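
Staleness-aware weighting of buffered updates can be sketched as follows. The abstract does not give Echelon's weighting rule, so the polynomial decay `(1 + staleness) ** -alpha` used here is an assumption borrowed from common asynchronous federated-learning practice, and `staleness_weight` and `aggregate` are hypothetical names. Each buffer entry stands for an already securely aggregated boundary-level delta together with the number of global rounds it lags behind.

```python
from typing import List, Tuple

Delta = List[float]


def staleness_weight(staleness: int, alpha: float = 0.5) -> float:
    """Down-weight older updates; a fresh update (staleness 0) gets weight 1."""
    return (1 + staleness) ** (-alpha)


def aggregate(buffer: List[Tuple[Delta, int]], alpha: float = 0.5) -> Delta:
    """Weighted average of buffered boundary-level deltas.

    buffer holds (delta, staleness) pairs; only these aggregates are
    combined, never per-device updates.
    """
    weights = [staleness_weight(s, alpha) for _, s in buffer]
    total = sum(weights)
    dim = len(buffer[0][0])
    return [
        sum(w * delta[i] for w, (delta, _) in zip(weights, buffer)) / total
        for i in range(dim)
    ]
```

Normalizing by the total weight keeps the step size comparable whether the buffer holds mostly fresh or mostly stale deltas.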

2606.02951 2026-06-03 cs.RO cs.AI cs.CL cs.CV cs.HC 版本更新

SCOPE: Real-Time Natural Language Camera Agent at the Edge

SCOPE：边缘实时自然语言相机代理

Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

发表机构 * Armada AI

AI总结提出SCOPE模块化代理，用于自然语言控制的PTZ相机，在边缘部署实现实时感知、规划与控制，并通过仿真和物理实验评估延迟、准确性和错误模式。

Comments 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16--19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE

详情

DOI: 10.1145/3757279.3785641
Journal ref: Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026

AI中文摘要

在机器人领域部署语言驱动的代理需要能够反映现实任务需求的评估：自然语言指令与可重复的结果。此类代理必须将语言模型连接到可调用的感知和控制工具，并使用部署关键指标（包括延迟、准确性和错误模式）进行评估。我们提出了SCOPE（用于感知和评估的仿真与相机操作），这是一个模块化代理，用于自然语言、开放词汇的云台变焦（PTZ）相机控制和视觉场景理解，专门为边缘部署设计。SCOPE既可在基于Blender的仿真环境中运行，也可在物理PTZ相机上运行，所有感知、规划和控制均在部署现场使用边缘可访问的计算资源本地执行。我们发布了一个包含536个任务的基准测试，涵盖问答、单步和多步命令、计数、空间推理、描述以及光学字符识别，在基于Blender的仿真环境中提供逼真的PTZ控制功能。执行轨迹与LM作为评判器结合，以评估延迟、准确性和错误模式。我们评估了19种规划器-感知模型组合，将Qwen3小语言模型（SLM）与Moondream和Qwen视觉语言模型（VLM）配对。更强的SLM显著减少了幻觉并改善了工具路由，从而实现了更可靠的闭环行为。一旦使用了足够强大的SLM，感知就成为主要的性能瓶颈。在规划和感知方面，混合专家模型在延迟和内存占用与更小网络相当的情况下，始终匹配或超过密集替代方案。量化在精度损失最小的情况下提供了额外的效率提升，为实时、边缘可行的语言驱动PTZ控制确定了一个实用的、从仿真到现实验证的设计点。

英文摘要

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

2606.02908 2026-06-03 cs.CL cs.AI 版本更新

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

WRIT: 面向多轮用户代理的写-读密集型轨迹合成

Hengrui Gu, Xiaotian Han, Kaixiong Zhou

发表机构 * North Carolina State University（北卡罗来纳州立大学）； Case Western Reserve University（凯斯西储大学）

AI总结针对多轮用户代理在信息收集和决策中面临的证据负担挑战，提出WRIT方法，通过合成写密集型和读密集型轨迹，训练代理在信息负载下做出基于证据的决策，仅用2K轨迹即可提升性能并减少推理时token使用。

详情

AI中文摘要

多轮用户代理必须从不完整的请求中推断用户意图，通过对话和工具收集缺失信息，并执行有效操作。训练轨迹将此过程记录为用户消息、代理响应、工具调用等的交错序列。合成足够复杂的轨迹已成为训练代理的核心途径：现有流程通常通过将多个用户请求组合成更长的任务来增加难度，产生训练顺序执行的写密集型轨迹。我们认为，当代理必须在收集和比较大量读工具证据后才能确定其参数时，单个写决策本身可能很困难，这是仅靠写密集型数据无法解决的挑战。基于这一见解，我们提出WRIT（写-读密集型轨迹合成），这是一个沿两个复杂度轴合成多轮代理训练轨迹的流程：任务中写决策的数量和每个决策的证据负担。WRIT首先生成写密集型和读密集型任务。然后，它多样化用户行为指令以反映真实的对话变化，最后在可执行环境中模拟代理-用户交互以生成完整的训练轨迹。由此产生的数据不仅训练代理执行更长的任务，而且在高信息负载下做出稳健的、基于证据的决策。仅用2K合成轨迹，在WRIT上训练的4B模型在$\tau^2$-bench上优于GPT-5.1 no-think，并大幅减少推理时token使用，表明紧凑的SFT数据可以将部分昂贵的测试时推理转化为高效的代理行为。

英文摘要

Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.

2606.02884 2026-06-03 cs.LG cs.AI 版本更新

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

我们真的在倾斜吗？流模型和扩散模型中奖励引导的机制

Sanjit Dandapanthula, Nicholas M. Boffi

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结本文通过高斯混合模型和二次奖励的闭式分析，揭示了奖励引导扩散中奖励黑客现象源于Doob h函数的有限粒子插件估计，并提出了无额外计算的闭式奖励阻尼调度来纠正模式内偏差。

详情

AI中文摘要

奖励引导算法在推理时将学习到的生成过程导向奖励倾斜的测度。虽然经验上强大，但这些方法容易产生奖励黑客行为：引导模型以牺牲对学习分布的保真度为代价过度优化奖励。先前的工作将其归因于神经奖励函数的复杂性或扩散训练中的隐式偏差，但其根本起源仍知之甚少。我们表明，奖励黑客行为源于大多数实际奖励引导扩散实现中的一个近似——Doob h函数的有限粒子插件估计——即使在最简单的高斯和高斯混合目标以及二次奖励的非平凡设置中也是如此。在闭式中，我们分离了插件估计器的两种不同失效模式：它导致每个模式内的奖励黑客行为，并且无法选择高奖励模式。我们提出了一种闭式奖励阻尼调度，无需额外计算即可纠正模式内偏差，并阐明了最佳-n采样在补偿模式选择失败中的作用。在高斯混合目标、2D棋盘和FLUX.1文本到图像生成上的实验证实了我们的理论见解适用于实际设置。

英文摘要

Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at the cost of fidelity to the learned distribution. Prior work has attributed this to the complexity of neural reward functions or implicit biases in diffusion training, but its fundamental origins remain poorly understood. We show that reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion -- finite-particle plug-in estimation of the Doob h-function -- even in the simplest non-trivial settings of Gaussian and Gaussian mixture targets with quadratic rewards. In closed form, we isolate two distinct failure modes of the plug-in estimator: it leads to reward hacking within each mode and it cannot select high-reward modes. We propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and clarify the role of best-of-n sampling in compensating for the mode selection failure. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that our theoretical insights carry over to practical settings.
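
For orientation, the reward-tilted measure has a simple closed form in the one-dimensional Gaussian case with a quadratic reward. The worked example below is a standard completing-the-square calculation, not a result quoted from the paper; the symbols $\mu$, $\sigma^2$, $\lambda$, and $a$ are introduced here for illustration.

```latex
% Reward-tilted measure for base density p and reward r:
\tilde p(x) \;\propto\; p(x)\,\exp\bigl(r(x)\bigr).

% Gaussian base and quadratic reward (illustrative choice):
p(x) = \mathcal{N}(x;\,\mu,\sigma^2), \qquad
r(x) = -\tfrac{\lambda}{2}\,(x-a)^2, \quad \lambda > 0.

% Collecting the quadratic and linear terms in x in the exponent:
-\frac{(x-\mu)^2}{2\sigma^2} - \frac{\lambda}{2}(x-a)^2
= -\frac{1}{2}\Bigl(\tfrac{1}{\sigma^2}+\lambda\Bigr)x^2
  + \Bigl(\tfrac{\mu}{\sigma^2}+\lambda a\Bigr)x + \text{const},

% so the tilted measure is again Gaussian:
\tilde p(x) = \mathcal{N}\!\left(x;\;
  \frac{\mu/\sigma^2 + \lambda a}{1/\sigma^2 + \lambda},\;
  \frac{1}{1/\sigma^2 + \lambda}\right).
```

The tilted mean is a precision-weighted average of $\mu$ and the reward optimum $a$, and the tilted variance is strictly smaller than $\sigma^2$. A guided sampler that over-optimizes the reward would move further toward $a$, or concentrate more tightly, than this target prescribes.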

2606.02883 2026-06-03 cs.HC cs.AI cs.CY cs.IR 版本更新

LLM-Assisted Reranking to Operationalize Nuanced Objectives in Recommender Systems

LLM辅助重排序以在推荐系统中实现细微目标

Amir Ghasemian, Homa Hosseinmardi, Upasana Dutta, Duncan J. Watts

发表机构 * Department of Communication, University of California, Los Angeles, CA 90095（通信系，加州大学洛杉矶分校，CA 90095）； Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104（计算机与信息科学系，宾夕法尼亚大学，Philadelphia, PA 19104）； Annenberg School for Communication, University of Pennsylvania, Philadelphia, PA 19104（安纳伯格通信学院，宾夕法尼亚大学，Philadelphia, PA 19104）； Operations, Information, and Decisions Department, University of Pennsylvania, Philadelphia, PA 19104（运营、信息与决策系，宾夕法尼亚大学，Philadelphia, PA 19104）

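AI summary: This study reranks YouTube sidebar candidates with zero-shot instruction prompts and finds that unconstrained LLM-assisted reranking amplifies extreme and conspiratorial content, whereas lightweight prompt-level regularization reduces extreme content and increases ideological diversity at a modest cost in relevance.

Comments 30 pages total; 11 pages, 5 figures, 2 tables (main text); 19 pages, 11 figures, 9 tables (appendix)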

AI中文摘要

推荐系统已从内容组织工具发展为塑造日常行为的复杂系统。通过控制我们所看到的内容，它们塑造了我们的感知，引发了对过滤气泡、激进化、两极分化和社会不平等的担忧。大型语言模型（LLM）实现了更强大的个性化，加剧了这些动态。然而，大多数推荐系统针对参与度或有限的准确性指标进行调优，很少关注更广泛的社会影响，例如个性化如何重塑社会重要领域中的曝光度。我们研究了LLM辅助重排序在提高个性化的同时，是否无意中放大了对意识形态极端或阴谋论政治内容的曝光，这是一种在新闻推荐中理论上存在但尚未得到实证表征的风险。使用真实的新闻消费历史，我们通过零样本、基于指令的提示对YouTube侧边栏候选进行重排序。我们比较了基线提示与一个约束变体，该变体保持主题相关性并扩大意识形态曝光，同时减少阴谋论或极端内容。在没有约束的情况下，重排序加强了个性化，但增加了对历史中包含此类内容的用户的阴谋论和极端主义材料的曝光。轻量级提示级正则化减少了对极端内容的推广并增加了意识形态多样性，同时相关性损失较小。合成实验表明，LLM通过语言中的统计规律而非对意识形态的语义理解进行重排序，这解释了为什么朴素提示会放大这些模式，而正则化可以重塑它们。总之，我们的结果突显了LLM在高风险推荐中实现上下文细微差别的能力，以及评估LLM辅助个性化超越准确性并将提示设计视为有价值负载而非中性默认的必要性。

英文摘要

Recommender systems have grown from content-organization tools into sophisticated systems that shape daily behavior. By controlling what we see, they shape what we perceive, raising concerns about filter bubbles, radicalization, polarization, and social inequality. Large language models (LLMs) enable more powerful personalization, intensifying these dynamics. Yet most recommenders are tuned for engagement or limited accuracy metrics, with little attention to broader social implications, e.g. how personalization reshapes exposure in socially consequential domains. We investigate whether LLM-assisted reranking, while improving personalization, inadvertently amplifies exposure to ideologically extreme or conspiratorial political content, a risk theorized but not empirically characterized in news recommendation. Using real news-consumption histories, we rerank YouTube's sidebar candidates through zero-shot, instruction-based prompting. We compare a baseline prompt with a constrained variant that preserves topical relevance and broadens ideological exposure while reducing conspiratorial or extreme content. Without constraints, reranking strengthened personalization but increased exposure to conspiratorial and extremist material for users whose histories contained such content. Lightweight prompt-level regularization reduced promotion of extreme content and increased ideological diversity, with modest relevance loss. Synthetic experiments suggest that LLMs rerank via statistical regularities in language rather than semantic understanding of ideology, clarifying why naive prompts amplify these patterns and why regularization can reshape them. Together, our results highlight the power of LLMs to operationalize contextual nuance in high-stakes recommendation, and the need to evaluate LLM-assisted personalization beyond accuracy and treat prompt design as a value-laden rather than neutral default.

2606.02875 2026-06-03 cs.AI 版本更新

Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks

交接债务：当编码代理接管被中断任务时的重新发现成本

Dipesh KC, Anjila Budathoki

发表机构 * Independent Researcher（独立研究者）； Georgia State University（佐治亚州立大学）

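AI summary: This paper introduces the notion of "handoff debt" to study the rediscovery cost that coding agents incur when resuming from a partial state after a task is interrupted, and proposes a takeover protocol that quantifies how different handoff views affect successor-agent efficiency.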

AI中文摘要

编码代理基准测试评估单个不间断代理能否解决仓库问题。实际软件工作更为复杂：任务会被中断、重新分配、审查，并从另一个代理或工程师留下的部分状态恢复。我们通过“交接债务”研究这一缺失维度：即前任工作不透明或不完整时施加的重新发现成本。我们的接管协议在确定性交接点中断编码代理，冻结仓库，并在四种交接视图下评估后继代理：仅仓库状态、原始轨迹、摘要笔记和结构化笔记。在75个源任务中，该协议为每个后继模型生成181个交接点任务和724次接管运行。在三个后继模型中，相对于仅仓库接管，带有上下文的交接将中位代理事件减少20-59%，累积提示令牌减少42-63%。解决率的影响较小且依赖于模型，但效率提升是一致的。这些发现表明，编码代理评估不仅应报告任务是否解决，还应报告另一个代理恢复该工作的成本。

英文摘要

Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks are interrupted, reassigned, reviewed, and resumed from partial states left by another agent or engineer. We study this missing dimension through handoff debt: the rediscovery cost imposed when a predecessor's work is opaque or incomplete. Our takeover protocol interrupts a coding agent at deterministic handoff points, freezes the repository, and evaluates successor agents under four handoff views: repository state only, raw trace, summary notes, and structured notes. Across 75 source tasks, the protocol generates 181 handoff-point tasks and 724 takeover runs per successor model. Across three successor models, context-bearing handoffs reduce median agent events by 20-59% and cumulative prompt tokens by 42-63% relative to repository-only takeover. Solved-rate effects are smaller and model-dependent, but efficiency gains are consistent. These findings suggest that coding-agent evaluation should report not only whether a task is solved, but also how costly that work is for another agent to resume.
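
The run counts quoted in the abstract are consistent with each handoff point being evaluated once under each of the four handoff views. A quick arithmetic check, assuming exactly one takeover run per (handoff point, view) pair, which the abstract implies but does not state outright:

```python
# Counts taken from the abstract.
handoff_points = 181
views = [
    "repository state only",
    "raw trace",
    "summary notes",
    "structured notes",
]

# Assumption: one takeover run per (handoff point, view) pair, per successor model.
runs_per_model = handoff_points * len(views)
print(runs_per_model)
```

Under that assumption the product equals the 724 runs per successor model reported above.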

2606.02871 2026-06-03 cs.CL cs.AI 版本更新

A Modular Architecture for Embedded Agentic AI Systems at the Edge

Marcus Rüb, Michael Gerhards

发表机构 * ETH Zurich（苏黎世联邦理工学院）

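AI summary: Proposes a modular reference architecture whose tiered design decouples on-device agents from cloud agents and adds a governance layer, addressing the strict resource constraints of deploying autonomous AI on embedded microcontrollers.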

AI中文摘要

大型语言模型的兴起使得具备复杂推理和工具使用能力的智能体AI成为可能；然而，由于嵌入式微控制器严格的内存和能量限制，在普适计算环境中部署这种自主性仍然具有挑战性。现有框架通常假设服务器级资源或持续连接，导致深度嵌入式系统存在空白。本文提出了一种嵌入式智能体系统的模块化参考架构，弥合了确定性实时控制与智能体智能之间的鸿沟。我们引入了一种分层设计，将设备端智能体（执行高度压缩的神经网络和基于规则的逻辑，用于低延迟、隐私关键任务）与云端增强智能体（利用小型语言模型进行更高级别的推理和规划）解耦。一个关键贡献是集成了跨领域的治理层，确保分布式自主设备集群的可观测性、策略执行和安全性。本文不呈现纯经验基准，而是分析资源受限环境中关于延迟、能量和可靠执行的架构设计原则与权衡。

英文摘要

The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy in pervasive computing environments remains challenging due to the strict memory and energy constraints of embedded microcontrollers. Existing frameworks typically assume server-class resources or continuous connectivity, leaving a gap for deeply embedded systems. This paper proposes a modular reference architecture for Embedded Agent Systems that bridges the divide between deterministic real-time control and agentic intelligence. We introduce a tiered design that decouples On-Device Agents - executing highly compressed neural networks and rule-based logic for low-latency, privacy-critical tasks - from Cloud-Augmented Agents that leverage Small Language Models (SLMs) for higher-level reasoning and planning. A key contribution is the integration of a cross-cutting Governance Layer, ensuring observability, policy enforcement, and safety across distributed fleets of autonomous devices. Rather than presenting purely empirical benchmarks, we analyze architectural design principles and trade-offs regarding latency, energy, and reliable execution in resource-constrained environments.

2606.02860 2026-06-03 cs.LG cs.AI 版本更新

Forgetting is Not Erasure: Recovering Latent Knowledge via Transport Keys

遗忘并非擦除：通过传输键恢复潜在知识

Archie Chaudhury

发表机构 * Axionic Labs（Axionic实验室）

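AI summary: Using a stitched evaluation protocol and compact task-specific transport keys, finds that a significant portion of catastrophic forgetting stems from interface drift between internal stages rather than permanent erasure of task-relevant computation, and that most earlier-task performance can be recovered after sequential training.

Comments Technical report showcasing results from transport keys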

AI中文摘要

灾难性遗忘通常被视为表征问题：在顺序训练后，模型似乎失去了支持早期任务性能的特征。我们挑战了这一观点的更强形式。在受控的持续学习设置中，我们发现相当一部分明显的遗忘可归因于内部阶段之间的接口漂移，而非任务相关计算的永久擦除。我们通过一种缝合评估协议研究这一现象，该协议将更新后网络的早期计算与其前身的后期计算相结合，并可选地通过紧凑的任务特定传输键进行中介。我们在系统层面将传输键描述为紧凑的接口对齐算子，从少量配对的锚点激活中估计，并通过模型缝合进行评估。在split CIFAR-100上使用ResNet风格网络时，传输键在顺序训练任务B后恢复了大部分原始任务A的性能。在紧凑视觉变换器上，我们观察到类似的恢复模式。这些结果表明，持续学习可能需要更好的机制来索引和重新访问潜在计算，而不仅仅是防止权重变化的方法。

英文摘要

Catastrophic forgetting is often framed as a representational problem: after sequential training, a model appears to lose the features that supported performance on earlier tasks. We challenge the stronger form of this view. Across controlled continual-learning settings, we find that a significant portion of apparent forgetting can be attributed to interface drift between internal stages rather than permanent erasure of task-relevant computation. We study this phenomenon through a stitched evaluation protocol that combines early computation from a post-update network with late computation from its predecessor, optionally mediated by a compact, task-specific transport key. We describe transport keys at a systems level as compact interface-alignment operators estimated from a small set of paired anchor activations and evaluated through model stitching. On split CIFAR-100 with a ResNet-style network, transport keys recover most of the original Task A performance after sequential training on Task B. On a compact vision transformer, we observe a similar recovery pattern. These results suggest that continual learning may require better mechanisms for indexing and re-accessing latent computations, not only methods that prevent weight change.

2606.02859 2026-06-03 cs.CL cs.AI cs.MA 版本更新

Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

思维经济：具有经济交互的涌现多智能体智能

Zhenting Qi, Huangyuan Su, Ao Qu, Chenyu Wang, Yu Yao, Han Zheng, Kushal Chattopadhyay, Guowei Xu, Zihan Wang, Weirui Ye, Vijay Janapa Reddi, Ju Li, Paul Pu Liang, Himabindu Lakkaraju, Sham Kakade, Yilun Du

发表机构 * Harvard University（哈佛大学）； Massachusetts Institute of Technology（麻省理工学院）

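AI summary: Inspired by Hayek's economic theory, uses simple economic signals from auctions and wealth accumulation to achieve decentralized credit assignment, so that a population of weak agents develops emergent multi-step reasoning strategies and outperforms stronger monolithic baselines on five agentic tasks.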

AI中文摘要

一群智能体如何在没有集中控制的情况下自我协调和自适应，形成更强的集体智能？受弗里德里希·哈耶克关于市场中去中心化协调的经济理论启发，我们通过一个智能体经济体来研究这个问题，其中智能体通过拍卖竞争行动权、交换支付，并从环境奖励中积累财富。这些简单的经济信号引出去中心化的信用分配，在没有全局编排或显式通信协议的情况下驱动规划。群体通过经济选择进化：有效的智能体积累财富并通过利用变异，而无效的智能体破产并通过探索被替换。我们表明，从弱智能体初始化，经济体产生涌现的多步推理策略，并在五个智能体任务中超越更强的单体基线，包括数学推理、金融研究、科学研究、加速器设计和分布式系统优化。我们进一步提供了关于经济动态如何塑造智能体行为的理论见解，将局部激励与长期全局表现联系起来。我们的结果指向了多智能体智能的一条新路径：与其设计协调，不如设计去中心化的激励结构，在这种结构下协调会自动涌现。

英文摘要

How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi-step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed-system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long-term global performance. Our results suggest a new path to multi-agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.

2606.02857 2026-06-03 cs.LG cs.AI 版本更新

GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning

GRZO：用于大语言模型微调的组相对零阶优化

Liyan Tan, Yequan Zhao, Yifan Yang, Ruijie Zhang, Xinling Yu, Zheng Zhang

发表机构 * University of California, Santa Barbara（加州大学圣巴巴拉分校）

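AI summary: Proposes the GRZO optimizer, which aggregates per-example losses through group-relative normalization, raising the effective number of gradient directions from one to the batch size at no extra forward cost, reducing variance and improving convergence, and outperforming MeZO across several models and tasks.

Comments Preprint. Under review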

AI中文摘要

零阶优化是微调大语言模型时一种内存高效的反向传播替代方案，但其部署受限于梯度估计的高方差。我们提出GRZO，一种组相对零阶优化器，它为每个小批量样本抽取一个伪独立扰动，并通过组相对归一化聚合每个样本的损失，在不增加额外前向成本的情况下将有效梯度方向数从1提升至批量大小，同时保持推理级内存。我们证明GRZO在方向上是无偏的，方差随批量大小成比例缩小，从而得到比MeZO更紧的非凸收敛界。在RoBERTa-large、Llama3-8B和OPT-13B上，跨多个任务，GRZO在Llama3-8B上的平均准确率比MeZO提高$+3.0$，峰值GPU内存降低$23\%$；作为MeZO核心的即插即用替代，它平均将稀疏、低秩和量化ZO变体提升$+6.0$。

英文摘要

Zeroth-order (ZO) optimization is a memory-efficient alternative to backpropagation for fine-tuning large language models, but its deployment is limited by the high variance of gradient estimation. We propose GRZO, a Group-Relative Zeroth-Order optimizer that draws one pseudo-independent perturbation per mini-batch example and aggregates the per-example losses through group-relative normalization, raising the effective gradient-direction count from one to the batch size at no additional forward cost while preserving inference-level memory. We prove that GRZO is directionally unbiased with variance shrinking proportionally to the batch size, yielding a tighter nonconvex convergence bound than MeZO. Across RoBERTa-large, Llama3-8B, and OPT-13B over multiple tasks, GRZO improves average accuracy on Llama3-8B by $+3.0$ over MeZO at $23\%$ lower peak GPU memory; as a drop-in replacement for the MeZO core, it lifts sparse, low-rank, and quantized ZO variants by $+6.0$ on average.
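
For readers unfamiliar with zeroth-order fine-tuning, the sketch below shows the generic two-point estimator that MeZO-style methods build on: one random direction per step and two forward evaluations, with no backpropagation. It is not GRZO itself; the per-example perturbations and group-relative normalization are not specified in the abstract and are not reproduced here. The toy loss `quadratic` and the helper `zo_gradient` are hypothetical names used only for illustration.

```python
import random

def zo_gradient(loss, theta, eps=1e-3, seed=0):
    """Two-point zeroth-order estimate: (L(theta+eps*z) - L(theta-eps*z)) / (2*eps) * z."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in theta]          # one random direction
    plus = loss([t + eps * zi for t, zi in zip(theta, z)])
    minus = loss([t - eps * zi for t, zi in zip(theta, z)])
    scale = (plus - minus) / (2 * eps)                 # directional derivative along z
    return [scale * zi for zi in z]

def quadratic(theta):
    """Toy loss with known gradient 2 * theta."""
    return sum(t * t for t in theta)

theta = [1.0, -2.0, 0.5]
n = 2000
avg = [0.0] * len(theta)
for s in range(n):
    g = zo_gradient(quadratic, theta, seed=s)
    avg = [a + gi / n for a, gi in zip(avg, g)]

# A single estimate is noisy; the average over many directions approaches 2 * theta.
print(avg)
```

The noise in a single estimate is the variance problem the abstract refers to. Averaging over more independent directions is the textbook way to reduce it, and the abstract's claim is that GRZO obtains a batch-size number of directions without extra forward passes.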

2606.02837 2026-06-03 cs.CL cs.AI 版本更新

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

修复FOLIO和MALLS：经过验证的标注和基于LLM的框架以聚焦人工重新标注

Andrea Brunello, Cristian Curaba, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno

发表机构 * University of Udine（乌迪内大学）

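AI summary: Manual inspection reveals a large number of formalization errors in the FOLIO and MALLS datasets; the paper proposes an LLM-based framework that guides human review, substantially reducing the amount of review needed while improving dataset accuracy.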

AI中文摘要

从自然语言到一阶逻辑（NL-to-FOL）的准确翻译是神经符号AI系统和自然语言推理（NLI）的基础，因此NL-to-FOL基准的质量至关重要，然而这些数据集从未经过严格审计。我们的第一个贡献是对FOLIO的验证集和MALLS测试实例子集进行系统性人工检查，发现分别约有39%和36%的条目包含错误的FOL形式化（即真实标签），此外还有一定比例的歧义NL句子（分别为16.4%和48%）以及FOLIO中错误的NLI标签（8.4%）。我们的第二个贡献是开发并发布了这些数据集的修正真实标签，并展示了标注错误如何扭曲参考基准任务上的模型评估：使用修正后的真实标签测试三个最先进的LLM（Gemma 4 31B-it、Qwen3-30B-A3B和GPT-4o-mini），准确率提升了9到22个百分点。受这些发现启发，我们提出了一个基于LLM的框架，以支持人工审查NL-to-FOL数据集。通过将审查者引导至最易出错的实例，我们实验证明，在审查少于24%的实例后即可达到90%的数据集准确率，而无引导的审查则需要审查超过70%的实例。我们发布了所有经过人工验证的标注以及框架代码。

英文摘要

Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential, yet these datasets have never been rigorously audited. Our first contribution is a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for these datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains of 9 to 22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.

2606.02835 2026-06-03 cs.AI 版本更新

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

超越答案的思考：评估大型推理模型中的有害过度思考

Simone Caldarella, Davide Talon, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini

发表机构 * University of Trento（特伦托大学）； Toyota Motor Europe（丰田欧洲公司）； Fondazione Bruno Kessler（布鲁诺·凯斯勒基金会）

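AI summary: This paper proposes a prefix-level trajectory evaluation protocol that defines reasoning sufficiency in order to separate verbose overthinking, which is redundant but harmless, from harmful overthinking, which drives an already-correct trajectory off course, and finds that current models are limited not only by their reasoning ability but also by their inability to stop at the right time.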

AI中文摘要

大型推理模型（LRMs）通过增加测试时计算生成显式的中间推理轨迹来提升性能，但更长的推理是否始终有益这一假设尚未得到充分检验。虽然近期证据表明额外推理可能导致模型过度思考，但我们提出疑问：“一旦模型得出正确答案，进一步的推理是优化解决方案，还是偏离它？”为了研究正确性之后的动态，我们引入了一种基于推理充分性的前缀级轨迹评估协议，定义了模型首次生成正确答案所需的最小推理预算。这使我们能够将冗长过度思考（额外推理冗余但无害）与有害过度思考（持续推理破坏已正确的轨迹）区分开来。从多模态基准开始，我们发现许多被认为是推理密集型的问题实际上只需要很少的推理。此外，在第一个正确前缀处停止比标准推理提高了高达21%的准确率，表明当前模型不仅受限于推理能力，还受限于无法在适当时机停止。此外，虽然常见的效率策略（如早停）能大幅减少冗长过度思考（高达50%），但它们未能缓解有害过度思考。失败分析表明，正确性偏差主要由逻辑漂移和视觉重新解释驱动。最后，我们展示了我们的发现可推广到纯语言推理基准，突显了有害过度思考作为一个更广泛的可靠性风险。代码可在 https://simonecaldarella.github.io/thinking-past-the-answer 获取。

英文摘要

Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: "Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?" To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defining the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning by up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code available at https://simonecaldarella.github.io/thinking-past-the-answer.
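
The distinction between verbose and harmful overthinking can be made concrete with a small sketch. It assumes we already have, for one problem, the answer the model would give if stopped after each successive reasoning prefix; the function name `classify_trace`, the label strings, and the exact decision rule are illustrative assumptions and not the authors' protocol.

```python
def classify_trace(prefix_answers, gold):
    """prefix_answers[i] is the answer produced when reasoning stops after prefix i."""
    correct = [answer == gold for answer in prefix_answers]
    if True not in correct:
        return {"sufficiency": None, "label": "never correct"}

    first = correct.index(True)          # first prefix that yields the right answer
    if not correct[-1]:
        # Reached the answer, then reasoned away from it.
        label = "harmful overthinking"
    elif first < len(correct) - 1:
        # Kept reasoning after the answer was found, but ended up correct.
        label = "verbose overthinking"
    else:
        label = "no overthinking"
    # Sufficiency here is the number of prefixes needed to first be correct.
    return {"sufficiency": first + 1, "label": label}

print(classify_trace(["3", "7", "7"], gold="7"))
print(classify_trace(["7", "7", "5"], gold="7"))
```

Under this simplified rule a trace that is correct, then wrong, then correct again counts as verbose; a finer-grained analysis would track such oscillations separately.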

2606.02834 2026-06-03 cs.CR cs.AI 版本更新

Large Byte Model: Teaching Language Models About Compiled Code

大型字节模型：教会语言模型关于编译代码的知识

Florian Störtz, Catalin-Andrei Stan, Alexandru Dinu, Sandra Servia-Rodríguez, Mihaela Gaman, Calin Miron, Edward Raff

发表机构 * CrowdStrike U.K.（CrowdStrike英国分公司）； CrowdStrike Romania（CrowdStrike罗马尼亚分公司）； CrowdStrike USA（CrowdStrike美国分公司）

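AI summary: This paper presents the first byte-native large language model, which extends the vocabulary with a custom byte tokenizer so the model can process the raw bytes of executables directly and answer malware-analysis questions, reaching 69% accuracy on family classification and 98% on architecture classification.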

2606.02832 2026-06-03 cs.AI 版本更新

An Exploration of Collision-based Enemy Morphology Generation

基于碰撞的敌人形态生成探索

Johor Jara Gonzalez, Matthew Guzdial

发表机构 * Alberta Machine Intelligence Institute (Amii)（阿尔伯塔人工智能研究所）； Department of Computing Science, University of Alberta（阿尔伯塔大学计算机科学系）

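AI summary: This paper explores three new methods for generating enemy morphologies from player collision information and shows that they outperform an evolutionary baseline adapted from work on robot morphology.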

2606.02822 2026-06-03 cs.CR cs.AI 版本更新

Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing

哪种防御措施应对哪种威胁？归因OWASP-LLM-Top-10覆盖及其在释义下的脆弱性

Alexandre Cristovão Maiorano

发表机构 * Lumytics

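AI summary: Using attribution analysis, this paper measures how different defense families (refusal filtering, budget controls, and others) cover the OWASP-LLM-Top-10 threats, and exposes the brittleness of refusal defenses under paraphrasing attacks.

Comments 17 pages, 4 figures, 7 tables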

AI中文摘要

生产级LLM应用堆叠了多种防御家族——拒绝短语过滤器、令牌预算控制、模型白名单、速率限制、工具注册认证——然而现有的攻防模拟（BAS）基准报告单一的总体覆盖数字，隐藏了哪个家族应对哪种威胁。我们测量归因。我们将四个OWASP-LLM-Top-10感知的智能体添加到一个21智能体的基线扫描器中，并针对四个合成LLM端点的格点：$L_0$（无防御）、$L_1$（仅拒绝）、$L_2$（仅预算）和$L_3$（全栈）。$L_1$和$L_2$是兄弟单轴消融，互不为子集；$L_3$是它们的并集加上工具注册认证和凭证清洗。在$N=10$次重复中，每个OWASP的发现计数清晰：仅拒绝消除所有LLM01（越狱）和LLM07（系统提示泄露）发现；仅预算通过终止多步序列消除所有LLM02（敏感信息泄露）和LLM10（无限制消耗）发现；LLM06（过度代理）需要全栈。我们探测释义下的脆弱性：使用300个Gemini生成的释义（在60模板脆弱性语料库上$K=5$），$L_1$拒绝阻断率在LLM01上下降15个百分点，在LLM07上下降25个百分点。第五个目标$L_4$-real，将存根后端替换为Gemini-2.5-flash，使用相同的$L_3$正则表达式，并与$L_1$完全匹配，表明除了正则表达式外没有可测量的对齐贡献（不是关于对齐的一般性声明）。预算控制没有下降（在扣除速率限制下限后为0个百分点）。一个通过静态基准的拒绝白名单可以被LLM驱动的释义器击败而不改变攻击意图；预算控制抵抗相同的变异。

英文摘要

Production LLM applications stack several defense families -- refusal-phrase filters, token-budget controls, model allowlists, rate limits, tool-registry authentication -- yet existing breach-and-attack-simulation (BAS) benchmarks report a single aggregate coverage number, hiding which family closes which threat. We measure attribution. We add four OWASP-LLM-Top-10-aware agents to a 21-agent baseline scanner and target a lattice of four synthetic LLM endpoints: $L_0$ (no defenses), $L_1$ (refusal-only), $L_2$ (budget-only), and $L_3$ (full stack). $L_1$ and $L_2$ are sibling single-axis ablations, not subsets of each other; $L_3$ is their union plus tool-registry authentication and credential scrubbing. Across $N=10$ replications, the per-OWASP finding count is clean: refusal alone removes all LLM01 (jailbreak) and LLM07 (system-prompt leakage) findings; budget alone removes all LLM02 (sensitive-info disclosure) and LLM10 (unbounded consumption) findings by terminating multi-step sequences; LLM06 (excessive agency) requires the full stack. We probe brittleness under paraphrasing: with 300 Gemini-generated paraphrases ($K=5$ over a 60-template brittleness corpus), $L_1$ refusal block rate falls 15 pp on LLM01 and 25 pp on LLM07. A fifth target, $L_4$-real, swaps the stub backend for Gemini-2.5-flash behind the same $L_3$ regex and matches $L_1$ exactly, indicating no measurable alignment contribution beyond the regex (not a general claim about alignment). Budget controls show no drop (0 pp once the rate-limit floor is factored out). A refusal whitelist that clears a static benchmark can be defeated by an LLM-driven paraphraser without changing attack intent; a budget control resists the same mutation.

2606.02814 2026-06-03 cs.IR cs.AI cs.CL 版本更新

Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins

Taye Akinrele, James Halgren, Noorbakhsh Amiri Golilarz, Sudip Mittal, Shahram Rahimi

发表机构 * University of Arizona（亚利桑那大学）

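AI summary: Using retrospective simulations from the NOAA National Water Model, this study evaluates whether an encoder-only Transformer has an advantage over an LSTM for upstream streamflow inference under limited hydrologic information, and finds that the LSTM performs better overall and that adding downstream information substantially improves prediction skill.

Comments 5 pages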

AI中文摘要

流域网络呈现收敛拓扑结构，其中多个支流汇入下游河道，整合了多样化的上游水文过程。在无资料流域中，缺乏直接观测增加了不确定性，并限制了预测极端事件的能力。本研究利用 NOAA 国家水模型（NWM）的回顾模拟，评估仅编码器 Transformer 是否在有限水文信息下比 LSTM 更具优势，用于上游径流推断。在仅上游和组合配置中，LSTM 在两种配置下的整体表现均优于 Transformer 模型。加入下游信息进一步提升了所有模型的性能，使中位数 NNSE 提高了 60% 以上。我们并未将其视为排行榜式的比较，而是将实验解释为对水文序列推断的架构归纳偏置的测试。结果表明，循环记忆仍比仅编码器 Transformer 更适用于此上游重建任务，而下游水文背景提供了强大的辅助约束，显著提高了跨架构的预测技能。

英文摘要

Watershed networks exhibit convergent topologies in which multiple tributaries merge into downstream channels, integrating diverse upstream hydrological processes. In ungauged basins, the absence of direct observations increases uncertainty and limits the ability to anticipate extreme events. This study evaluates whether an encoder-only Transformer provides an advantage over an LSTM for upstream streamflow inference under limited hydrologic information, using retrospective simulations from the NOAA National Water Model (NWM). In both the upstream-only and combined configurations, the LSTM showed stronger overall performance than the Transformer model. Incorporating downstream information further boosted performance for all models, increasing median NNSE by more than 60%. Rather than treating this as a leaderboard-style comparison, we interpret the experiments as a test of architectural inductive bias for hydrologic sequence inference. The results indicate that recurrent memory remains better aligned with this upstream reconstruction task than an encoder-only Transformer, while downstream hydrologic context provides a strong auxiliary constraint that substantially improves prediction skill across architectures.

2606.02781 2026-06-03 cs.AR cs.AI cs.ET 版本更新

CRAM-ER: Error-Resilient Spintronic Computational Random Access Memory for Scalable In-Memory Computation

CRAM-ER：面向可扩展存内计算的容错自旋计算随机存取存储器

Sohan Salahuddin Mugdho, Md. Shahedul Hasan, Brahmdutta Dixit, Yang Lv, Jian-Ping Wang, Cheng Wang

发表机构 * Electrical and Computer Engineering, Iowa State University of Science and Technology（爱荷华州立大学电气与计算机工程系）； Electrical and Computer Engineering, University of Minnesota Twin Cities（明尼苏达大学双城分校电气与计算机工程系）

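AI summary: To address the probabilistic switching errors and low throughput that MRAM-based computational random access memory (CRAM) faces when accelerating deep neural networks, proposes an error-resilient architecture (CRAM-ER) that combines spintronic CRAM with a CMOS adder tree, using hardware-software co-design to achieve energy-efficient, reliable matrix-vector multiplication.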

AI中文摘要

深度神经网络（DNN）在多个领域取得了最先进的性能。然而，传统的冯·诺依曼计算范式面临严重的内存瓶颈。新兴的近内存和存内计算方法缓解了这一问题，但引入了显著的外围开销。基于MRAM的计算随机存取存储器（CRAM）能够实现无外围开销的原位逻辑，提供了一种密集、节能的解决方案。然而，概率性的MRAM开关会导致门级错误，限制了CRAM在加速DNN时的可扩展性和可靠性。此外，大量的顺序MRAM写入严重制约了CRAM的吞吐量。为了解决这些挑战，我们提出了一种容错CRAM（CRAM-ER）架构，用于可扩展的存内矩阵向量乘法（MVM）。我们的错误感知硬件-软件协同设计框架利用混合自旋-CRAM + CMOS加法器树架构来减轻器件级错误的影响，展示了具有高面积和能效的MVM功能。我们进一步开发了错误感知模型微调和细粒度纠错技术，以增强错误容限。在DNN基准测试上对CMOS+自旋混合架构的评估显示，在将CRAM延迟降低多达两个数量级的同时，实现了近乎无损的精度，在能效和能量延迟积方面均优于CPU/GPU+高带宽DRAM。

英文摘要

Deep neural networks (DNNs) have achieved state-of-the-art performance across diverse domains. However, typical von Neumann compute paradigms face severe memory bottlenecks. Emerging near-memory and compute-in-memory approaches alleviate this but incur significant peripheral overhead. Computational Random Access Memory (CRAM) based on MRAM enables in-situ logic without peripheral overhead, offering a dense, energy-efficient solution. However, probabilistic MRAM switching induces gate-level errors that limit the scalability and reliability of CRAM for accelerating DNNs. Moreover, the large number of sequential MRAM writes severely constrains CRAM throughput. To address these challenges, we propose an error-resilient CRAM (CRAM-ER) architecture for scalable in-memory matrix-vector multiplications (MVMs). Our error-aware hardware-software co-design framework leverages a hybrid spintronic-CRAM + CMOS adder-tree architecture to mitigate the impact of device-level errors, demonstrating MVM functionality with high area and energy efficiency. We further develop error-aware model fine-tuning and fine-grained error correction for enhanced error resilience. Evaluations of the CMOS+spintronic hybrid architecture on DNN benchmarks show near-lossless accuracy while reducing CRAM latency by up to 2 orders of magnitude, outperforming CPU/GPU+high-bandwidth DRAM in both energy efficiency and energy-delay product.

2606.02775 2026-06-03 cs.AI cs.AR cs.DC cs.PF cs.RO 版本更新

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

AURA: 恒定VRAM下机器人策略的动作门控记忆

Josef Chen

发表机构 * KAIKAKU（卡基库）

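AI summary: Proposes AURA-Mem, a constant-size recurrent memory whose writes are gated by an action-error signal, as a replacement for the KV-cache; on edge-robotics tasks it matches baseline accuracy while issuing 5-9 times fewer writes.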

AI中文摘要

KV缓存是数据中心合适的记忆，但却是机器人错误的记忆。数据中心推理批量处理许多短请求并重置它们，在众多请求中分摊注意力缓存。具身智能体则在带宽受限的边缘硬件上运行一个长且不重置的回合，其中高带宽内存和闪存稀缺，闪存写入寿命有限，内存写入而非计算可能成为约束瓶颈。AURA-Mem（动作效用循环自适应记忆）针对这一场景。它用一个固定大小的循环记忆和一个学习得到的门控包装冻结的视觉-语言-动作骨干网络，该门控仅在当前观测会改变下一个动作时写入：一种知道何时保持沉默的记忆。与基于重建的记忆不同，该门控直接针对闭环动作误差信号进行训练。其推理状态固定为4,224字节，无论时间步长如何，而KV缓存则在100,000步时增长到6,061倍。在受控的合成基准测试中，AURA-Mem在准确率上与最佳的O(1)基线相当，同时使用5.19-6.13倍更少的写入，在更简单的配置上最多减少9.19倍写入。预算匹配的随机和周期性调度无法恢复这一增益，从而将收益归因于动作惊喜信号。在LIBERO-Long上训练的闭环OpenVLA-OFT 7B面板（每个机械臂n=60个回合）上，门控不会损害成功率：AURA-Mem与无门控基础策略（0.233）相当，并略超过始终写入的KV臂（0.217），同时使用7.0倍更少的写入和恒定内存。我们还实例化了一个近似信息状态价值损失界限作为方法论演示；在此规模下，该界限是空洞的而非保证。

英文摘要

The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather than compute can become the binding constraint. AURA-Mem (Action-Utility Recurrent Adaptive Memory) targets this regime. It wraps a frozen vision-language-action backbone with a constant-size recurrent memory and a learned gate that writes only when the current observation would change the next action: memory that knows when to stay silent. Unlike reconstruction-based memory, the gate is trained directly against a closed-loop action-error signal. Its inference state is fixed at 4,224 bytes regardless of horizon, while a KV-cache grows to 6,061 times larger at 100,000 steps. On a controlled synthetic benchmark, AURA-Mem matches the best O(1) baseline in accuracy while using 5.19-6.13 times fewer writes, and up to 9.19 times fewer writes on easier configurations. Budget-matched random and periodic schedules do not recover this gain, isolating the benefit to the action-surprise signal. On a trained closed-loop OpenVLA-OFT 7B panel on LIBERO-Long (n=60 episodes per arm), the gate does not hurt success: AURA-Mem matches the ungated base policy (0.233) and slightly exceeds an always-write KV arm (0.217), while using 7.0 times fewer writes and constant memory. We also instantiate an approximate-information-state value-loss bound as a methodology demonstration; at this scale, the bound is vacuous rather than a guarantee.
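
The memory figures in the abstract can be turned into absolute sizes with a short calculation. This is a back-of-the-envelope sketch that assumes the 6,061 times ratio is measured against the 4,224-byte AURA-Mem state and that the KV-cache grows linearly with the number of steps; neither assumption is stated explicitly in the abstract.

```python
state_bytes = 4224        # fixed AURA-Mem inference state (from the abstract)
ratio_at_horizon = 6061   # KV-cache size relative to that state at the horizon
steps = 100_000           # horizon quoted in the abstract

# Implied KV-cache footprint at the horizon, and its average growth per step.
kv_bytes = state_bytes * ratio_at_horizon
kv_bytes_per_step = kv_bytes / steps

print(kv_bytes)
print(round(kv_bytes_per_step, 1))
```

Under these assumptions the cache reaches roughly 25.6 MB at 100,000 steps, or about 256 bytes of growth per step, against a state that stays at about 4 KB.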

2606.02765 2026-06-03 cs.LG cs.AI 版本更新

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

表示能力：Transformer语言模型中特征表示的几何限制

Alexander Guha

发表机构 * Arizona State University（亚利桑那州立大学）

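AI summary: Building on the linear representation and superposition hypotheses, estimates the number of near-orthogonal directions a model can support from the cosine-similarity distribution of its embedding matrix, derives a capacity formula, and finds that capacity is exponentially sensitive to the deviation ε.

Comments 22 pages, 10 figures. Submitted to NeurIPS 2026. This is a condensed version of thesis: https://hdl.handle.net/2286/R.2.N.204857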

AI中文摘要

模型维度（$d_{model}$）是Transformer语言模型中的一个基本超参数，但其在设定特征表示的几何限制方面的作用仍未得到充分探索。基于线性表示和叠加假设——这些假设提出模型将特征编码为潜在空间中的近正交方向——我们开发了一个框架来估计模型可以支持多少个这样的方向。我们首先将嵌入矩阵确立为跨潜在空间近正交约束的可测量代理：成对余弦相似度分布中有意义的token关系与偶然相似性之间的边界给出了模型对完美正交性的可接受偏差ε的具体估计。将此度量应用于数十个开源模型揭示了两个类别：具有高ε且其嵌入缺乏近正交结构的模型，以及具有低ε且保持近正交结构的模型。然后我们表明，标准的Johnson-Lindenstrauss引理大大低估了训练表示的填充效率，并推导出一个调整后的容量公式，其中近正交方向的数量取决于向量与维度的比率（$k/d$）而非原始计数——这一单一修改在没有额外参数的情况下将预测误差降低了两个数量级。结合这些结果，我们将表示能力定义为模型潜在空间中可用于特征和嵌入的可区分方向上界。容量对ε指数敏感，并且较大的模型倾向于更严格的正交约束而非最大化原始容量——这一模式与几种解释（稳定性-容量权衡、可用概念的上限或模型规模的混杂因素）兼容，我们将这些留给未来工作。

英文摘要

Model dimension ($d_{model}$) is a fundamental hyperparameter in transformer language models, yet its role in setting the geometric limits of feature representation remains under-explored. Grounded in the Linear Representation and Superposition Hypotheses - which propose that models encode features as near-orthogonal directions in latent space - we develop a framework for estimating how many such directions a model can support. We first establish the embedding matrix as a measurable proxy for near-orthogonality constraints across the latent space: the boundary between meaningful token relationships and incidental similarity in the pairwise cosine similarity distribution gives a concrete estimate of the model's accepted deviation $\varepsilon$ from perfect orthogonality. Applying this metric across dozens of open-source models reveals two classes: models with high $\varepsilon$ whose embeddings lack near-orthogonal structure, and models with low $\varepsilon$ that maintain it. We then show that the standard Johnson-Lindenstrauss lemma greatly underestimates the packing efficiency of trained representations, and derive an adjusted capacity formula in which the number of near-orthogonal directions depends on the ratio of vectors to dimensions ($k/d$) rather than the raw count - a single modification that cuts prediction error by two orders of magnitude with no extra parameters. Combining these results, we define representational capacity as an upper bound on the number of distinguishable directions available for features and embeddings in a model's latent space. Capacity is exponentially sensitive to $\varepsilon$, and larger models favor tighter orthogonality constraints over maximizing raw capacity - a pattern compatible with several explanations (a stability-capacity trade-off, a ceiling on usable concepts, or confounds with model scale) that we leave to future work.
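
To give a feel for why capacity is exponentially sensitive to the tolerated deviation, the sketch below evaluates a textbook random-packing estimate: a union bound over random unit vectors suggests that about exp(d·ε²/4) directions in d dimensions can be pairwise within ε of orthogonal. This is a generic Johnson-Lindenstrauss-style baseline, not the adjusted capacity formula derived in the paper; the function name and the value of `d` are illustrative assumptions.

```python
import math

def random_packing_estimate(d, eps):
    """Union-bound estimate exp(d * eps^2 / 4) of how many random unit vectors
    in R^d can be pairwise within eps of orthogonal (|cosine| <= eps)."""
    return math.exp(d * eps ** 2 / 4)

d = 4096  # hypothetical model dimension
for eps in (0.05, 0.10, 0.20):
    print(eps, f"{random_packing_estimate(d, eps):.3g}")
```

Because ε enters squared inside the exponent, doubling the tolerated deviation multiplies the exponent by four, which is the kind of sensitivity the abstract describes.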

2606.02755 2026-06-03 cs.SE cs.AI 版本更新

EntangleCodec：通过语义-声学纠缠的统一离散音频分词器

Hui Li, Yangfan Gao, Junlin Shang, Changhao Jiang, Tao Gui, Qi Zhang, Xuanjing Huang

发表机构 * Fudan University（复旦大学）

AI总结提出EntangleCodec，一种通过将音频与丰富标题对齐学习语义-声学联合表示的统一离散音频分词器，在紧凑令牌流中捕获语言内容、说话人身份、情感、韵律和声学场景，并通过流匹配扩散解码器实现高质量重建，在音频理解和生成任务上均取得领先性能。

Comments 17 pages, 10 figures

详情

AI中文摘要

音频分词器作为连续音频与音频语言模型（ALM）之间的离散接口，但现有分词器往往难以同时支持理解和生成。面向重建的编解码器保持声学保真度但缺乏丰富语义，而语义感知分词器通常依赖独立的语义和声学流，引入冗余或错位。我们提出\textbf{EntangleCodec}，一种统一的离散音频分词器，在量化之前学习与标题对齐的语义-声学表示。通过将音频与丰富标题而非ASR转录对齐，EntangleCodec在紧凑令牌流中捕获语言内容、说话人身份、情感、韵律和声学场景。流匹配扩散解码器进一步实现了语音、音乐和通用音频的高质量重建。EntangleCodec在重建质量上与专用编解码器竞争，在音频理解上优于所有基于编解码器的基线，在MMAR上提升高达\textbf{+7.4\%}，并在统一框架中支持TTS和TTA生成。此外，基于EntangleCodec的音频语言模型展现出强大的扩展行为：即使参数为\textit{0.6B}，该模型在三个基准测试中超越了参数超过\textit{13B}的专用连续表示LLM，参数减少了\textbf{22$\times$}；扩展到\textit{8B}进一步在MMAR上建立了新的最先进结果，突显了在音频语言建模中表示质量与模型规模同等重要。代码和模型权重可从此https URL获取。

英文摘要

Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often struggle to support both understanding and generation. Reconstruction-oriented codecs preserve acoustic fidelity but lack rich semantics, while semantic-aware tokenizers typically rely on separate semantic and acoustic streams, introducing redundancy or misalignment. We propose \textbf{EntangleCodec}, a unified discrete audio tokenizer that learns caption-aligned semantic-acoustic representations before quantization. By aligning audio with rich captions rather than ASR transcripts, EntangleCodec captures linguistic content, speaker identity, emotion, prosody, and acoustic scenes within a compact token stream. A flow-matching diffusion decoder further enables high-quality reconstruction across speech, music, and general audio. EntangleCodec achieves reconstruction quality competitive with specialized codecs, outperforms all codec-based baselines on audio understanding by up to \textbf{+7.4\%} on MMAR, and supports both TTS and TTA generation in a unified framework. Furthermore, EntangleCodec-based audio language models demonstrate strong scaling behavior: even at \textit{0.6B} parameters, the model surpasses specialized continuous-representation LLMs with over \textit{13B} parameters across three benchmarks using \textbf{22$\times$} fewer parameters; scaling to \textit{8B} further establishes new state-of-the-art results on MMAR, highlighting that representation quality is as critical as model scale in audio language modeling. Code and model weights are available at https://github.com/luckyerr/EntangleCodec.

2606.02737 2026-06-03 cs.IR cs.AI cs.CL 版本更新

Attention Calibration for Position-Fair Dense Information Retrieval

面向位置公平的密集信息检索的注意力校准

Andrianos Michail, Elias Schuhmacher, Juri Opitz, Simon Clematide, Rico Sennrich

发表机构 * Department of Computational Linguistics, University of Zurich（苏黎世大学计算语言学系）

AI总结针对密集检索模型的位置偏差问题，提出在推理时通过注意力校准（引入强度系数λ插值原始与完全校准分布）来提升位置公平性，无需重新训练且不牺牲整体检索效果，在多个数据集和模型上验证了部分校准优于完全校准，并提供了默认配置。

详情

AI中文摘要

密集检索模型存在位置偏差：当相关信息出现在段落较后位置时，检索效果会下降（Zeng et al., 2025）。我们探究是否可以在推理时减少这种偏差，无需重新训练且不牺牲整体检索效果。为此，我们将推理时的注意力校准（Schuhmacher et al., 2026）适配到下游检索，并引入强度系数λ，在原始注意力分布和完全校准的注意力分布之间进行插值。在SQuAD-PosQ和FineWeb-PosQ上的三个嵌入模型上，我们考察了篮子大小、校准层集和强度如何影响位置公平性与检索效果之间的权衡，发现部分校准通常优于完全校准。单个配置（B=128, λ=0.5, 50%层深度）在FineWeb-PosQ上提升了所有三个模型跨位置组的nDCG@10的调和平均值，无需逐模型调参，并且适用于<s>-池化和最后token池化两种架构。该默认配置无需修改即可迁移到PosIR（涵盖10种语言和31个领域），在所有16种长度四分位×模型×检索设置组合中降低了位置敏感指数，同时保持或提升了整体nDCG@10。我们在以下网址发布扩展后的代码库：this https URL

英文摘要

Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng et al., 2025). We ask whether this bias can be reduced at inference time, without retraining and without sacrificing overall retrieval effectiveness. To this end, we adapt inference-time attention calibration (Schuhmacher et al., 2026) to downstream retrieval and extend it with a strength coefficient lambda that interpolates between the original and fully calibrated attention distributions. Across three embedding models on SQuAD-PosQ and FineWeb-PosQ, we examine how basket size, calibrated layer set, and strength affect the trade-off between positional fairness and retrieval effectiveness, finding that partial calibration frequently outperforms full calibration. A single configuration (B=128, lambda=0.5, 50% layer depth) improves the harmonic mean of nDCG@10 across positional groups on FineWeb-PosQ for all three models without per-model tuning, and applies to both <s>-pooled and last-token-pooled architectures. This default configuration transfers without modification to PosIR, which spans 10 languages and 31 domains, reducing the Position Sensitivity Index in all 16 length-quartile x model x retrieval-setting combinations, while preserving or improving aggregate nDCG@10. We release our extended codebase at https://github.com/impresso/fair-sentence-transformers
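The strength coefficient and the fairness metric described above can be sketched in a few lines. This assumes the interpolation is a plain convex combination of the two attention distributions, which the abstract does not confirm; `interpolate_attention` and `harmonic_mean` are hypothetical helper names, not functions from the released codebase.

```python
from typing import List, Sequence


def interpolate_attention(
    original: Sequence[float], calibrated: Sequence[float], lam: float
) -> List[float]:
    """Blend two attention distributions: (1 - lam) * original + lam * calibrated.

    Assumption: linear interpolation. lam = 0 keeps the original weights,
    lam = 1 applies full calibration.
    """
    if len(original) != len(calibrated):
        raise ValueError("distributions must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * o + lam * c for o, c in zip(original, calibrated)]


def harmonic_mean(scores: Sequence[float]) -> float:
    """Harmonic mean of per-position-group scores such as nDCG@10."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s < 0.0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0.0 for s in scores):
        return 0.0  # a single failing group drives the harmonic mean to zero
    return len(scores) / sum(1.0 / s for s in scores)


if __name__ == "__main__":
    blended = interpolate_attention([0.7, 0.2, 0.1], [0.4, 0.3, 0.3], lam=0.5)
    print(blended, sum(blended))
    print(harmonic_mean([0.60, 0.55, 0.40]))
```

A convex combination of two probability distributions is itself a distribution, so no renormalization is needed. The harmonic mean penalizes a weak positional group more than an arithmetic mean would, which is why it suits a fairness-oriented summary.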

2606.02724 2026-06-03 cs.CV cs.AI 版本更新

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

AVTrack: 以人为中心的复杂场景中的视听跟踪

Yaoting Wang, Yun Zhou, Zipei Zhang, Henghui Ding

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对现有视听跟踪数据集局限于简单场景的问题，提出AVTrack数据集，通过包含相机运动、视觉遮挡和位置变化等复杂动态条件，评估并提升鲁棒的人为中心视听场景理解。

Comments 19 pages, 10 figures, ICML 2026

详情

AI中文摘要

视听说话人跟踪旨在通过利用听觉和视觉线索来定位和跟踪活跃的说话人，实现细粒度、以人为中心的场景理解。这一能力对于智能视频编辑、监控和人机交互等实际应用至关重要。然而，现有数据集大多局限于具有粗略标注的简单或同质视听场景。这种过度简化的设置使评估偏向于静态视听共现，而非严格评估复杂动态场景中的鲁棒时空建模和跨模态推理。为了解决这些限制，我们引入了AVTrack，一个以人为中心的视听实例分割（AVIS）数据集，专为动态真实世界场景设计。AVTrack具有多样且具有挑战性的条件，包括相机运动、视觉遮挡和位置变化。在AVTrack上对代表性AVIS方法的评估揭示了显著的性能下降，使AVTrack成为复杂环境中鲁棒的以人为中心的视听场景理解的挑战性基准。我们进一步提供了一个简单而有效的基线，以促进未来的研究。项目网站：此https URL

英文摘要

Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. This capability is essential for real-world applications such as intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio-visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: https://FudanCVL.github.io/AVTrack/

2606.02673 2026-06-03 cs.AI cs.LG 版本更新

Visual Graph Scaffolds for Structural Reasoning in Large Language Models

大语言模型中用于结构推理的可视化图脚手架

Runlin Lei, Xiaokui Xiao, Zhewei Wei

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出将图结构作为大语言模型的内部推理辅助而非仅外部知识源，通过多跳问答实验发现视觉图引导相比文本化图在无直接答案提示时仍保持有效性，支持图作为组织推理的可视化脚手架。

详情

AI中文摘要

图已被用于增强大语言模型的结构化推理，主要是在测试时作为外部知识源提供给模型。在本文中，我们采取不同的视角：图对LLMs的价值不仅在于提供信息，还在于组织推理。受人类使用图结构思维导图组织分支和汇聚思维的启发，我们探究图是否可以作为推理辅助的内部形式。我们在多跳问答任务上研究这一问题，其中教师提供的推理轨迹被重写为图思维导图并用于指导学生模型。我们的实验揭示了明显的模态差距。当图结构被扁平化为文本时，一旦直接答案提示被移除，其益处变得有限。在这种抽象引导设置下，推理效率和答案质量都大幅下降。相比之下，视觉图引导在没有直接答案线索时仍然有效，并且其优势在监督微调和基于KL的蒸馏后仍然保持。上述发现支持了以下主张：图不仅应作为LLMs的外部知识结构来研究，还应作为组织推理的可视化脚手架。

英文摘要

Graphs have been used to enhance large language models (LLMs) for structured reasoning, mostly as external knowledge sources provided to models at test time. In this paper, we take a different view: the value of graphs for LLMs lies not only in supplying information, but also in organizing reasoning. Inspired by how humans use graph-structured mind maps to organize branching and converging thoughts, we ask whether graphs can serve as an internal form of reasoning assistance. We study this question on multi-hop question answering tasks, where teacher-provided reasoning traces are rewritten as graph mind maps and used to guide a student model. Our experiments reveal a clear modality gap. When graph structures are flattened into text, their benefits become limited once direct answer hints are removed. Under this abstract guidance setting, both reasoning efficiency and answer quality degrade substantially. In contrast, visual graph guidance remains effective without direct answer clues, and its advantage persists after supervised fine-tuning and KL-based distillation. The above findings support the claim that graphs should be studied not only as external knowledge structures for LLMs, but also as visual scaffolds for organizing reasoning.

2606.02671 2026-06-03 cs.LG cs.AI 版本更新

AI代理中网络安全拒绝的新框架

Eliot Krzysztof Jones, Mateusz Dziemian, Matt Fredrikson, J Zico Kolter

发表机构 * Gray Swan ； Gray Swan AI ； Carnegie Mellon University（卡内基梅隆大学）

AI总结提出首个针对AI代理在进攻性安全场景中建立拒绝边界的框架，包括拒绝原则、任务分类和评估方法，并发现8个前沿模型中6个拒绝率接近零。

详情

AI中文摘要

代理脚手架显著提升了LLM在复杂、长期任务上的表现，在网络安全等领域带来了广泛益处和放大风险。现有的AI代理网络安全基准主要关注能力测量——代理能多有效地完成进攻性安全任务——但忽略了一个关键问题：代理何时以及如何拒绝有害请求？我们提出了首个在进攻性安全场景中建立拒绝边界的框架。我们的框架定义了（1）任务应被拒绝的原则性标准，（2）应被拒绝的任务类别，以及（3）在良性和对抗条件下测量代理鲁棒性的评估方法。我们应用该框架评估当前基于LLM的代理在一系列基于Web的进攻性安全场景中是否遵守适当的拒绝边界，发现测试的8个前沿模型中有6个拒绝率接近零，只有2个模型（GPT-5.2和GPT-5.1 Codex）表现出任何有意义的拒绝行为。

英文摘要

Agentic scaffolds have dramatically improved LLM performance on complex, long-horizon tasks, yielding both broad benefits and amplified risks in domains like cybersecurity. Existing benchmarks for AI agents in cybersecurity focus mainly on measuring proficiency--how effectively agents can complete offensive security tasks--but neglect a critical question: when and how should agents refuse harmful requests? We present the first framework for establishing refusal boundaries in offensive security contexts. Our framework defines (1) principled criteria for when tasks should be refused, (2) categories of tasks that warrant refusal, and (3) evaluation methodology for measuring agent robustness under both benign and adversarial conditions. We apply this framework to assess how current LLM-powered agents adhere to appropriate refusal boundaries across a range of web-based offensive security scenarios, finding that 6 of 8 frontier models tested show near-zero refusal rates, with only 2 models (GPT-5.2 and GPT-5.1 Codex) demonstrating any meaningful refusal behavior.

2606.02643 2026-06-03 cs.CR cs.AI cs.DB 版本更新

Inference Cost Attacks for Retrieval-Augmented Large Language Models

检索增强型大语言模型的推理成本攻击

Chengliang Liu, Liangbo Ning, Yujuan Ding, Wenqi Fan

发表机构 * The Hong Kong Polytechnic University（香港理工大学）

AI总结提出RA-ICA攻击范式，通过向外部知识库注入恶意文档，利用CREEP框架和MA-GRPO算法，使RAG增强的LLM系统推理时token消耗增加高达13.12倍且成功率超过90%。

Comments Accepted at The ACM Web Conference 2026 (WWW '26)

详情

DOI: 10.1145/3774904.3792683
Journal ref: Proceedings of the ACM Web Conference 2026 (WWW '26), April 13-17, 2026, Dubai, United Arab Emirates

AI中文摘要

检索增强生成（RAG）增强的LLM系统虽然强大，但由于包含额外的多阶段流水线（动态检索和综合外部知识源的信息），引入了大量的推理成本。这种高运营成本暴露了一个关键漏洞，即推理成本攻击（ICA）。然而，现有的ICA通常依赖于直接提示操纵的不切实际的假设。我们认为，对RAG增强的LLM系统更可行且更强大的威胁来自污染外部知识库（例如，来自互联网的网络知识）。在这项工作中，我们引入了检索增强推理成本攻击（RA-ICA），这是一种新颖的攻击范式，通过向外部知识语料库注入恶意文档来针对RAG增强的LLM系统的计算成本。为了实现这种攻击，我们提出了通过外部投毒耗尽计算资源（CREEP），这是一种新颖的框架，利用LLM代理自动制作恶意文档，这些文档在语义上相关以便检索，并且能够有效诱导推理阶段token消耗的异常增加。为了提高攻击的有效性，我们引入了记忆增强组相对策略优化（MA-GRPO），这是一种新颖的强化学习算法，通过从历史最佳对抗文档的动态记忆中学习来微调代理。在三个真实世界数据集上的大量实验表明，RA-ICA在不降低生成答案完整性的情况下，将token消耗增加了高达13.12倍，成功率超过90%。

英文摘要

Retrieval-Augmented Generation (RAG)-enhanced LLM systems, while powerful, introduce substantial inference costs due to the inclusion of an extra multi-stage pipeline that dynamically retrieves and synthesizes information from external knowledge sources. This high operational cost exposes a critical vulnerability to Inference Cost Attacks (ICAs). However, existing ICAs often rely on the impractical assumption of direct prompt manipulation. We argue that a more feasible and potent threat to RAG-enhanced LLM systems arises from poisoning external knowledge bases (e.g., web knowledge from the Internet). In this work, we introduce the Retrieval-Augmented Inference Cost Attack (RA-ICA), a novel attacking paradigm that targets the computational cost of RAG-enhanced LLM systems by injecting malicious documents into external knowledge corpus. To operationalize this attack, we propose Computational Resource Exhaustion via External Poisoning (CREEP), a novel framework that leverages LLM agents to automatically craft malicious documents that are both semantically relevant for retrieval and potent for inducing an abnormal increase in token consumption during the inference phase. To enhance the attack's effectiveness, we introduce Memory-Augmented Group Relative Policy Optimization (MA-GRPO), a novel reinforcement learning algorithm that fine-tunes the agents by learning from a dynamic memory of historical best adversarial documents. Extensive experiments across three real-world datasets demonstrate that RA-ICA increases token consumption by up to 13.12 times with an over 90% success rate, without degrading the integrity of the generated answer.

2606.02641 2026-06-03 cs.RO cs.AI 版本更新

CARVE: Certified Affordable Repair of Vetoed Maneuvers via Envelopes for Interactive Driving

CARVE: 通过包络实现交互驾驶中被否决机动的认证可负担修复

Yifan Wang

发表机构 * Yifan Wang（王一帆）

AI总结针对交互驾驶中规则感知堆栈易忽略的硬规则裕度负值问题，提出CARVE认证层，通过有限格点上的自我与代理战术算子，实现被否决机动的可负担修复认证，并证明其合理性。

Comments 8 pages, 3 figures

详情

AI中文摘要

交互驾驶暴露了规则感知自动驾驶堆栈中容易忽略的失效模式：即使非优先代理的小幅合法让步可恢复可行性，自我候选的硬规则裕度仍可能为负。现有的规则手册、防护和可达性过滤器在否决不安全动作方面表现强劲，而基于预测的规划器则对可能的响应进行建模。两者均未返回运行时证明对象，该对象说明哪个有界多代理编辑修复了机动、谁拥有编辑、请求是否在路权上可负担，以及如果请求未被遵守，自我后备是什么。我们将这一缺失对象形式化为*交互修复认证*，并引入*CARVE*，一个在自我拥有和代理拥有的战术算子有限格点上的无预测认证层。代理拥有的请求仅在$B_j(s) = \beta(\pi_j)\alpha_j^{\max}(s)$内可接受，这是一个将运动学可达性与规范优先级分离的合作包络。生成的证书记录了绑定规则、修复类别、修复集、责任加权成本分配和后备。在589个基于Lanelet2几何的INTERACTION重放片段上，CARVE-Greedy接受了98.64%的初始否决机动，恢复了370/378个人类解决错误否决，同时保持了589/589的路权尊重、零优先级代理假阳性以及400/400的负压力否决。我们证明了证书的合理性、结构性的路权尊重、精确的有限格点最小性、后备应急性和责任一致性条件。CARVE不预测也不需要其他驾驶员的合规性；它认证在声明假设下提议的交互是否有界、可归因且规范上可接受。

英文摘要

Interactive driving exposes a failure mode that is easy to miss in rule-aware autonomous-driving stacks: a hard-rule margin can be negative for an ego candidate even though a small lawful accommodation by a non-priority agent would restore feasibility. Existing rulebooks, shields, and reachability filters are strong at vetoing unsafe actions, while prediction-based planners model likely responses. Neither returns a runtime proof object that states which bounded multi-agent edit repairs the maneuver, who owns the edit, whether the request is right-of-way affordable, and what ego fallback remains if the request is not observed. We formulate this missing object as *interactive repair certification* and introduce *CARVE*, a prediction-free certificate layer over a finite lattice of ego-owned and agent-owned tactical operators. Agent-owned requests are admissible only inside $B_j(s) = \beta(\pi_j)\alpha_j^{\max}(s)$, a cooperation envelope that separates kinematic reachability from normative priority. The resulting certificate records the binding rule, repair category, repair set, responsibility-weighted cost split, and fallback. On 589 Lanelet2-geometry-grounded INTERACTION replay episodes, CARVE-Greedy accepts 98.64% of initially vetoed maneuvers and recovers 370/378 human-resolved false vetoes, while preserving 589/589 right-of-way respect, zero priority-agent false positives, and 400/400 negative-stress vetoes. We prove certificate soundness, structural right-of-way respect, exact finite-lattice minimality, fallback contingency, and blame-consistency conditions. CARVE does not predict or require another driver's compliance; it certifies whether a proposed interaction is bounded, attributable, and normatively admissible under declared assumptions.
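The cooperation envelope $B_j(s) = \beta(\pi_j)\,\alpha_j^{\max}(s)$ can be read as a product of a normative factor and a kinematic factor. The sketch below is one possible reading, assuming $\beta \in [0, 1]$ shrinks to zero for agents holding priority and $\alpha_j^{\max}$ is a non-negative kinematic limit (for example a maximum comfortable deceleration). The abstract specifies neither range, and both function names are hypothetical.

```python
def cooperation_envelope(beta: float, alpha_max: float) -> float:
    """Return B_j(s) = beta(pi_j) * alpha_j^max(s).

    Assumptions (not from the paper): beta lies in [0, 1] and encodes how much
    accommodation may be asked of agent j given its priority pi_j; alpha_max is
    the agent's kinematically reachable accommodation in state s.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if alpha_max < 0.0:
        raise ValueError("alpha_max must be non-negative")
    return beta * alpha_max


def request_admissible(request: float, beta: float, alpha_max: float) -> bool:
    """A scalar request is admissible only if it fits inside the envelope."""
    return 0.0 <= request <= cooperation_envelope(beta, alpha_max)


if __name__ == "__main__":
    # Non-priority agent: half of its kinematic limit may be requested.
    print(request_admissible(1.5, beta=0.5, alpha_max=4.0))
    # Priority agent (beta = 0): the same request is rejected even though
    # it is kinematically reachable.
    print(request_admissible(1.5, beta=0.0, alpha_max=4.0))
```

Keeping the two factors separate means a request can be rejected on normative grounds alone, which matches the abstract's claim that the envelope separates kinematic reachability from priority.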

2606.02640 2026-06-03 cs.CR cs.AI 版本更新

D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting

D-Judge: 使用语义保持输出重写破坏多轮越狱攻击

Huanli Gong, Zhipeng Wei, Yu Fu, Haz Sameen Shahgir, Ananya Gupta, Yue Dong, N. Benjamin Erichson

AI总结提出D-Judge防御方法，通过语义保持的输出重写干扰攻击者的评判模型反馈循环，从而降低多轮越狱攻击的成功率。

Comments Proceedings of the 43rd International Conference on Machine Learning

详情

AI中文摘要

多轮越狱攻击对大型语言模型（LLM）的安全性构成日益严重的威胁，因为它们利用辅助评判模型的反馈来迭代优化提示，以实现有害目标。现有的防御措施主要在单个轮次或最终响应中检测或阻止不安全内容，但保留了评判驱动的优化循环，使攻击者能够从中间交互中提取信息性反馈。我们引入了D-Judge，一种语义保持的输出重写防御方法，它直接干预该循环，在攻击者的评判模型评估之前重写受害者LLM的响应。通过在不改变原始响应含义的情况下使评判的反馈信号失准，D-Judge破坏了攻击者的提示优化过程，导致后续查询针对扭曲的攻击进展信号进行优化。为了提高D-Judge生成此类重写的能力，我们构建了一个语义等价的响应对数据集，这些响应对会诱导不同的评判分配的有害性分数，并使用该数据集进行监督微调，随后进行直接偏好优化。在HarmBench上的实验表明，D-Judge在保持良性基准性能的同时，降低了最先进的多轮越狱攻击的成功率。

英文摘要

Multi-turn jailbreak attacks pose a growing threat to large language model (LLM) safety because they exploit feedback from auxiliary judge models to iteratively refine prompts toward harmful goals. Existing defenses largely detect or block unsafe content at individual turns or at the final response, leaving the judge-driven refinement loop intact and allowing attackers to extract informative feedback from intermediate interactions. We introduce D-Judge, a semantics-preserving output rewriting defense that intervenes directly in this loop by rewriting the victim LLM's responses before they are evaluated by the attacker's judge. By misaligning the judge's feedback signal without changing the meaning of the original response, D-Judge derails the attacker's prompt-refinement process, causing subsequent queries to be optimized against a distorted signal of attack progress. To improve D-Judge's ability to produce such rewrites, we construct a dataset of semantically equivalent response pairs that induce different judge-assigned harmfulness scores, and use it for supervised fine-tuning followed by direct preference optimization. Experiments on HarmBench show that D-Judge reduces the success rate of state-of-the-art multi-turn jailbreaks while preserving performance on benign benchmarks.

2606.02638 2026-06-03 cs.SD cs.AI eess.AS 版本更新

SegTune: Structured and Fine-Grained Control for Song Generation

SegTune：歌曲生成的结构化与细粒度控制

Yuejiao Wang, Zihao Ji, Pengfei Cai, Xu Li, Haorui Zheng, Zewen Song, Zhongliang Liu, Chen Zhang, Pengfei Wan

发表机构 * Kling Team, Kuaishou Technology（快手科技 Kling 团队）； University of Science and Technology of China（中国科学技术大学）； Peking University（北京大学）

AI总结提出基于扩散Transformer的SegTune框架，通过用户或LLM指定局部音乐描述实现结构化细粒度控制，并引入LLM时长预测器实现精确歌词-音乐对齐，在音乐性和可控性上超越现有基线。

Comments This paper has been accepted to ACL 2026 as an oral presentation and has been nominated for the Best Paper Award. This work is a revised and extended version of an earlier technical report (arXiv:2510.18416). arXiv admin note: text overlap with arXiv:2510.18416

详情

AI中文摘要

近期神经歌曲生成的进展使得从歌词和全局文本提示中实现高质量合成成为可能。然而，大多数系统无法建模歌曲随时间变化的属性，严重限制了音乐结构和动态的细粒度控制。为解决这一问题，我们提出SegTune，一个基于扩散Transformer的框架，通过允许用户或大型语言模型（LLM）指定与歌曲片段对齐的局部音乐描述，实现结构化和细粒度的可控性。这些片段提示被时间广播到对应的时间窗口，而全局提示则确保风格连贯性。为支持精确的歌词-音乐对齐，我们引入了一个基于LLM的时长预测器，以LyRiCs格式自回归生成句子级时间戳。我们进一步构建了一个大规模数据管道，用于收集高质量歌曲及其对齐的歌词和提示，并提出了新的指标来评估片段对齐和声乐一致性。实验表明，SegTune在音乐性和可控性方面均优于现有基线。访问我们的项目页面（此 https URL ）获取代码和更多生成的歌曲。

英文摘要

Recent advances in neural song generation have enabled high-quality synthesis from lyrics and global textual prompts. However, most systems fail to model temporally varying attributes of songs, severely limiting fine-grained control over musical structure and dynamics. To address this, we propose SegTune, a Diffusion Transformer-based framework enabling structured and fine-grained controllability by allowing users or large language models (LLMs) to specify local musical descriptions aligned to song segments. These segment prompts are temporally broadcast to corresponding time windows, while global prompts ensure stylistic coherence. To support precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamps in LyRiCs format. We further construct a large-scale data pipeline for high-quality song collection with aligned lyrics and prompts, and propose new metrics to evaluate segment alignment and vocal consistency. Experiments demonstrate that SegTune outperforms existing baselines in both musicality and controllability. Visit our project page (https://github.com/KlingAIResearch/SegTune) for codes and more generated songs.
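The two mechanisms named above, sentence-level timestamps in LRC (LyRiCs) format and temporal broadcasting of segment prompts, can be illustrated with a small sketch. It assumes the common `[mm:ss.xx]lyric` line layout and a fixed frame rate; the tuple layout, the frame rate, and both function names are hypothetical, and the actual model conditions on learned features rather than strings.

```python
import re
from typing import List, Optional, Sequence, Tuple

# One LRC line: "[mm:ss.xx]lyric text"
_LRC_LINE = re.compile(r"^\[(\d+):(\d{2}(?:\.\d+)?)\](.*)$")


def parse_lrc(text: str) -> List[Tuple[float, str]]:
    """Parse LRC text into (start_time_in_seconds, lyric) pairs."""
    entries = []
    for line in text.splitlines():
        match = _LRC_LINE.match(line.strip())
        if match:
            minutes, seconds, lyric = match.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries


def broadcast_segments(
    segments: Sequence[Tuple[float, float, str]],
    num_frames: int,
    frames_per_second: float,
) -> List[Optional[str]]:
    """Copy each segment prompt onto every frame inside its time window.

    segments holds (start_sec, end_sec, prompt); frames outside every
    window stay None.
    """
    per_frame: List[Optional[str]] = [None] * num_frames
    for start, end, prompt in segments:
        first = max(0, int(start * frames_per_second))
        last = min(num_frames, int(end * frames_per_second))
        for index in range(first, last):
            per_frame[index] = prompt
    return per_frame


if __name__ == "__main__":
    print(parse_lrc("[00:12.50]Hello\n[01:03.00]World"))
    print(broadcast_segments([(0.0, 1.0, "intro"), (1.0, 2.0, "verse")], 4, 2.0))
```

In a real system the per-frame prompt would be an embedding added to the latent sequence, with the global prompt applied to all frames for stylistic coherence.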

2606.02630 2026-06-03 cs.CR cs.AI 版本更新

MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety

MultiTurnPSB：评估多轮越狱攻击与基于分类器的防御在医疗AI安全中的应用

Anushka Sheoran, Yiduo Hao

发表机构 * University of Pennsylvania（宾夕法尼亚大学）

AI总结提出多轮对抗基准MultiTurnPSB，通过四轮对话评估医疗聊天机器人的安全漏洞，发现多轮攻击下不安全响应率从35%升至近80%，并验证了轻量级输入分类器可降低52个百分点的不安全响应但存在高误报率。

详情

AI中文摘要

面向患者的医疗聊天机器人通常在单轮提示上进行评估，但真实用户在被拒绝后会继续追问、增加紧迫感并援引权威。我们引入了MultiTurnPSB，这是PatientSafetyBench的一个四轮对抗扩展，并在固定模板、模板自适应和实时对抗攻击下评估了GPT-4.1-mini。在实时攻击下，不安全响应从第1轮的35%上升到第4轮的近80%。在相同的攻击者下，GPT-4.1-mini和Claude Sonnet 4.5在基线时统计上无差异，但到第4轮时差距扩大到19倍，这种差异在单轮评估中不可见。我们描述了四种退化轨迹特征，并识别出一个导致大多数灾难性失败的双元素攻击公式。一个轻量级的输入侧分类器将第4轮不安全响应降低了52个百分点，尽管准确性严重下降，但对良性查询的45%误报率是主要的部署限制。还出现了一个方法论发现：Claude Sonnet在超过一半的后期对话中拒绝生成对抗性消息，尽管有明确的红队框架，这表明安全训练可能泛化到攻击者角色。

英文摘要

Patient-facing medical chatbots are commonly evaluated on single-turn prompts, yet real users push back after refusals, add urgency, and invoke authority. We introduce MultiTurnPSB, a four-turn adversarial extension of PatientSafetyBench, and evaluate GPT-4.1-mini under fixed template, template-adaptive, and live adversarial attacks. Unsafe responses rise from 35% to nearly 80% by Turn 4 under live attack. Under the same adversary, GPT-4.1-mini and Claude Sonnet 4.5 are statistically indistinguishable at baseline but diverge to a 19x gap by Turn 4, a difference invisible to single-turn evaluation. We characterize four degradation trajectory signatures and identify a two-element attack formula responsible for most catastrophic failures. A lightweight input-side classifier reduces Turn 4 unsafe responses by 52 percentage points despite severe accuracy degradation, but the 45% false alarm rate on benign queries is the primary deployment constraint. A methodological finding also emerges: Claude Sonnet refused to generate adversarial messages in over half of late-turn conversations despite explicit red team framing, suggesting safety training may generalize to the attacker role.

2606.02623 2026-06-03 cs.NE cs.AI cs.LG 版本更新

Oscillatory State-Space Models as Inductive Biases for Physics-Informed Neural PDE Solvers

振荡状态空间模型作为物理信息神经PDE求解器的归纳偏置

Abhishek Chandra, Taniya Kapoor

发表机构 * KTH Royal Institute of Technology（皇家理工学院）； Wageningen University & Research（瓦赫宁根大学与研究中心）

AI总结提出一种结合振荡状态空间动力学和PDE感知空间谱的PINN方法，以改进时变PDE求解的精度和内存效率。

详情

AI中文摘要

求解时变偏微分方程（PDE）是计算科学与工程中的一个重要问题。物理信息神经网络（PINN）从控制方程中学习PDE解。然而，准确捕捉时间演化仍然具有挑战性。最近的基于序列模型的方法使用通用序列模型参数化时间演化，这些模型捕捉时间依赖性，但没有显式编码PDE解的结构化动力学。此外，它们的内存需求可能随序列长度和分辨率而不利地扩展，限制了在大规模或高维设置中的适用性。本文介绍了一种PINN方法，该方法结合了振荡状态空间动力学来表示PDE解的模态结构。所提出的方法利用基于线性振荡器的时间演化，以及空间上的PDE感知谱基。这种设计实现了闭式空间微分和边界条件的一致强制执行。该方法在前向、逆和高维PDE问题上进行了评估，包括高达100个空间维度的情况。结果表明，与最近基于序列模型的PINN方法相比，该方法提高了精度并减少了内存使用。总体而言，本文强调了将结构化动力学先验纳入神经PDE求解器的时间演化中的好处，并建议设计更符合物理和计算高效的PINN架构。

英文摘要

Solving time-dependent partial differential equations (PDEs) is an important problem in computational science and engineering. Physics-informed neural networks (PINNs) learn PDE solutions from governing equations. However, accurately capturing temporal evolution remains challenging. Recent sequence-model-based approaches parameterize time evolution using general-purpose sequence models, which capture temporal dependencies but do not explicitly encode the structured dynamics of PDE solutions. In addition, their memory requirements can scale unfavorably with sequence length and resolution, limiting applicability in large-scale or high-dimensional settings. This work introduces a PINN approach that incorporates oscillatory state-space dynamics to represent the modal structure of PDE solutions. The proposed method leverages a linear-oscillator-based temporal evolution, together with a PDE-aware spectral basis in space. This design enables closed-form spatial differentiation and consistent enforcement of boundary conditions. The method is evaluated on forward, inverse, and high-dimensional PDE problems, including cases up to 100 spatial dimensions. The results show improved accuracy and reduced memory usage compared to recent sequence-model-based PINN approaches. Overall, this work highlights the benefits of incorporating structured dynamical priors into the temporal evolution of neural PDE solvers and suggests designing more physics-aligned and computationally efficient PINN architectures.
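The idea of pairing linear oscillators in time with a PDE-aware spectral basis in space can be illustrated on the 1D wave equation $u_{tt} = c^2 u_{xx}$ on $[0, 1]$ with zero Dirichlet boundaries. This is a textbook modal solution, offered only as a sketch of the inductive bias and not as the paper's architecture. The sine basis, the function names, and the initial coefficients are assumptions for illustration.

```python
import math
from typing import Sequence


def mode_coefficient(a0: float, v0: float, omega: float, t: float) -> float:
    """Exact solution of the oscillator a'' = -omega^2 a, a(0)=a0, a'(0)=v0."""
    return a0 * math.cos(omega * t) + (v0 / omega) * math.sin(omega * t)


def wave_solution(
    x: float, t: float, a0s: Sequence[float], v0s: Sequence[float], c: float = 1.0
) -> float:
    """u(x, t) = sum_k a_k(t) * sin(k * pi * x).

    Every sine mode vanishes at x = 0 and x = 1, so the Dirichlet boundary
    condition holds by construction.
    """
    total = 0.0
    for k, (a0, v0) in enumerate(zip(a0s, v0s), start=1):
        omega = c * k * math.pi  # oscillator frequency of mode k
        total += mode_coefficient(a0, v0, omega, t) * math.sin(k * math.pi * x)
    return total


def wave_solution_xx(
    x: float, t: float, a0s: Sequence[float], v0s: Sequence[float], c: float = 1.0
) -> float:
    """Closed-form u_xx: each sine mode is scaled by -(k * pi)^2."""
    total = 0.0
    for k, (a0, v0) in enumerate(zip(a0s, v0s), start=1):
        omega = c * k * math.pi
        coeff = mode_coefficient(a0, v0, omega, t)
        total += -((k * math.pi) ** 2) * coeff * math.sin(k * math.pi * x)
    return total


if __name__ == "__main__":
    a0s, v0s = [1.0, 0.5], [0.0, 0.0]
    print(wave_solution(0.3, 0.2, a0s, v0s), wave_solution_xx(0.3, 0.2, a0s, v0s))
```

Because the spatial basis is analytic, derivatives need no automatic differentiation through a network, and the boundary condition cannot be violated. A learned variant would fit the mode amplitudes and frequencies instead of fixing them.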

2606.02618 2026-06-03 cs.CE cs.AI cs.MA physics.chem-ph 版本更新

Closed-Loop Molecular Design with Calibrated Deference

闭环分子设计中的校准式退让

Newman Cheng, Gordon Broadbent, Jason Dong, Syed Mohammed Ali Hussaini, Farman Ullah, Morris Sharp, Gabrielle Barnes, Nanlin Guo, Deyu Zou, Karin Strauss, William Chappell, David G. Kwabi, Bichlien H. Nguyen, Jake A. Smith

发表机构 * Microsoft Discovery & Quantum（微软发现与量子）； Microsoft Research（微软研究院）； Department of Chemical and Environmental Engineering, Yale University（耶鲁大学化学与环境工程系）； CanAm Bioresearch Inc.（CanAm 生物研究公司）

AI总结提出CLIO智能体，通过持续更新的信念状态图和递归计划-行动循环实现校准式退让，在闭环人机协作中成功设计出性能优于文献基准的AORFB负极电解液。

详情

AI中文摘要

我们提出了通过原位优化实现认知循环（CLIO），这是一种将持续更新的信念状态图与递归计划-行动循环相结合的智能体。结果产生了一个推理智能体，能够贡献某种定性的不同之处，我们称之为“校准式退让”：即识别自身工具或假设何时失败、相应调整策略、并生成指导实验修订的机制性假设的能力。我们在一个闭环人机协作活动中测试了CLIO，以设计一种水性有机氧化还原液流电池（AORFB）负极电解液，CLIO在与合成、表征并参与设计选择的化学家密切合作中主导了提议和解释。在三轮共17个候选分子中，CLIO收敛于一个最佳的膦酸酯候选物；表征证实其氧化还原电位比文献基准提高了130 mV。随后表征揭示了出乎意料的差电化学可逆性——这是所有性质预测器都未能标记的回归。CLIO生成了相互竞争的机制性假设，优先安排了诊断性实验，将失败归因于膦酸酯-钾离子配对，并建议用磺酸酯替代。所得化合物显示出显著改善的电化学可逆性，并保持了90 mV的氧化还原电位提升，从而闭环了设计-制造-测试-再设计循环。

英文摘要

We present Cognitive Loop via In-Situ Optimization (CLIO), an agent that couples a continuously-updated belief-state graph with a recursive plan-then-act loop. The result is a reasoning agent that can contribute something qualitatively different, which we term \emph{calibrated deference}: the capacity to recognize when its own tools or assumptions are failing, to adapt its strategy in response, and to generate mechanistic hypotheses that guide experimental revision. We tested CLIO in a closed-loop human-AI campaign to design an aqueous organic redox flow battery (AORFB) negolyte, with CLIO leading proposal and interpretation in close partnership with chemists who synthesized, characterized, and weighed in on design choices. Across 17 candidates over three rounds, CLIO converged on a top phosphonate candidate; characterization confirmed a 130 mV improvement in redox potential over the literature baseline. Characterization then revealed unexpectedly poor electrochemical reversibility -- a regression no property predictor had flagged. CLIO generated competing mechanistic hypotheses, prioritized discriminating diagnostics, traced the failure to phosphonate-potassium ion pairing, and prescribed a sulfonate replacement. The resulting compound showed substantially improved electrochemical reversibility and maintained a 90 mV improvement in redox potential, closing the design-make-test-redesign loop.

2606.02614 2026-06-03 cs.CE cs.AI 版本更新

Margin Play: A Multi-Agent System For Public Policy Analysis In The Brazilian Equatorial Margin

边际博弈：巴西赤道边缘地区公共政策分析的多智能体系统

Antonio de Sousa Leitão Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa, Luís Jorge Mesquita de Jesus, Dennys Correia da Silva, Allan Kardec Duailibe Barros Filho

发表机构 * Aia Context ； Universidade Federal do Maranhão (UFMA)（马拉尼昂联邦大学）； Universidade Estadual de Campinas (UNICAMP)（坎皮纳斯州立大学）

AI总结针对巴西赤道边缘地区石油勘探对马拉尼昂州福利影响的问题，提出基于多智能体强化学习（MARL）的仿真系统Margin Play，通过CTDE范式和BRO-MARL训练六个智能体，发现福利增益取决于制度安排，MA-Prospero配置可显著提升福利并降低环境负债。

详情

AI中文摘要

巴西赤道边缘（BEM）是巴西下一个海上石油前沿，预计于2026年在亚马逊福斯盆地开始运营。其资产在财政和领土上主要与马拉尼昂州相关联——该州在联邦中人类发展指数最低（0.676，IBGE 2022）。这引出了核心政策问题：在什么条件下，BEM的勘探能为马拉尼昂州产生净正外部性？问题本质上是多智能体的：联邦政府寻求收入和能源安全；州政府在宪法规定的特许权使用费专用下寻求区域福利；运营商在风险下最大化利润；ANP和IBAMA持有冲突的职责；亚马逊社区优先考虑领土和环境因素而非货币收入。我们提出Margin Play，一个多智能体强化学习（MARL）系统，在巴西经验校准和经典经济学文献下模拟这些张力。它实现了CTDE范式下的六个智能体，使用BRO-MARL进行训练。来自六个场景中60,000个回合的结果表明，答案取决于制度安排：在参考基线之下，福利增益微乎其微（Waval约1.68），而MA-Prospero配置产生Delta W = +17.5%和Delta Rcom = +21.3%，同时环境负债较低（Eamb = 0.048 vs. 0.076）。根本问题并非生产与福利之间的权衡，而是与勘探相关的公共政策制度的选择。

英文摘要

The Brazilian Equatorial Margin (BEM) is Brazil's next offshore oil frontier, with operations expected to begin in 2026 in the Foz do Amazonas basin. Its assets are fiscally and territorially linked primarily to Maranhao -- the state with the lowest HDI in the Federation (0.676, IBGE 2022). This raises the central policy question: under what conditions does BEM exploration generate net positive externalities for Maranhao? The problem is intrinsically multi-agent: the Federal Government seeks revenue and energy security; the state seeks regional welfare under constitutional royalty earmarking; the operator maximizes profit under risk; ANP and IBAMA hold conflicting mandates; and Amazonian communities prioritize territorial and environmental vectors over monetary income. We present Margin Play, a Multi-Agent Reinforcement Learning (MARL) system simulating these tensions under Brazilian empirical calibration and classical economic literature. It implements six agents under the CTDE paradigm, trained with BRO-MARL. Results from 60,000 episodes across six scenarios indicate the answer is conditional on the institutional regime: under the reference baseline, the welfare gain is marginal (Waval approx. 1.68), whereas the MA-Prospero configuration yields Delta W = +17.5% and Delta Rcom = +21.3%, with a lower environmental liability (Eamb = 0.048 vs. 0.076). The fundamental problem is not a trade-off between production and welfare, but the choice of public policy regime linked to exploration.

2606.02610 2026-06-03 cs.CE cs.AI cs.LG physics.ao-ph 版本更新

Samudra 2: Scaling Ocean Emulators across Resolutions

Samudra 2: 跨分辨率扩展海洋仿真器

Yuan Yuan, Jesse Rusak, Alexander Merose, Adam Subel, Pavel Perezhogin, Alistair Adcroft, Carlos Fernandez-Granda, Laure Zanna

发表机构 * Courant Institute School of Mathematics, Computing, and Data Science, New York University（Courant学院数学、计算与数据科学系，纽约大学）； Open Athena AI Foundation, Inc.（开放Athena人工智能基金会）； Program in Atmospheric and Oceanic Sciences, Princeton University（大气与海洋科学项目，普林斯顿大学）

AI总结针对现有海洋神经仿真器在长期自回归滚动中出现的方差崩溃和印记伪影问题，提出Samudra 2，通过改进U-Net骨干网络和动态损失函数，在1°分辨率下将上层海洋全球平均温度R²从0.56提升至0.87，并将深层海洋温度误差降低约七倍，且可扩展至1/2°和1/4°分辨率。

详情

AI中文摘要

海洋环流模式（OGCM）对气候科学至关重要，但计算成本高，限制了集合规模和强迫情景。神经仿真器有望实现数量级的加速，然而现有的海洋仿真器未能将精细空间分辨率与多年自回归滚动相结合。Samudra是第一个产生多十年全球滚动的自回归神经海洋仿真器，但仅限于$1^\circ$分辨率，并表现出两种长期故障模式：\emph{方差崩溃}，即时间变异性的丧失，以及\emph{印记伪影}，即速度模式泄漏到深海场中。我们提出Samudra 2，它引入了更宽的U-Net骨干网络，采用修改后的ConvNeXt风格块和减小的块内扩展因子，以及一个动态损失函数，根据预测误差重新加权输出通道，从而增强缓慢演变的深海场的梯度。在$1^\circ$分辨率下，Samudra 2将上层海洋全球平均温度$R^2$从0.56提高到0.87，并将深海温度误差降低约七倍。相同的架构可扩展到$1/2^\circ$和$1/4^\circ$分辨率，在大约8年的自回归滚动中恢复中尺度涡旋和尖锐的西边界流。在单个GPU上运行，Samudra 2能够为海平面预测、海洋热吸收和气候变率研究提供更大的集合。我们在此https URL提供代码、文档和基准资源。

英文摘要

Ocean general circulation models (OGCMs) are essential to climate science but computationally expensive, limiting ensemble size and forcing scenarios. Neural emulators promise orders-of-magnitude speedups, yet existing ocean emulators have not combined fine spatial resolution with multi-year autoregressive rollouts. Samudra, the first autoregressive neural ocean emulator to produce multi-decade global rollouts, is limited to $1^\circ$ resolution and exhibits two long-horizon failure modes: \emph{variance collapse}, the loss of temporal variability, and \emph{imprinting artifacts}, in which velocity patterns leak into deep-ocean fields. We present Samudra 2, which introduces a wider U-Net backbone with modified ConvNeXt-style blocks and a reduced block-internal expansion factor, together with a dynamic loss that reweights output channels according to their prediction errors, strengthening gradients for slow-evolving deep-ocean fields. At $1^\circ$, Samudra 2 increases upper-ocean global-mean temperature $R^2$ from 0.56 to 0.87 and reduces deep-ocean temperature error by roughly sevenfold. The same architecture scales to $1/2^\circ$ and $1/4^\circ$ over approximately 8-year autoregressive rollouts, recovering mesoscale eddies and sharp western boundary currents. Running on a single GPU, Samudra 2 enables larger ensembles for sea-level projections, ocean heat uptake, and climate variability studies. We provide code, documentation, and benchmark resources at https://openathena.ai/Ocean_Emulator/.

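2606.02607 2026-06-03 cs.LG cs.AI cs.CR Updated version

Geometry-Aware Tabular Diffusion

David Turtora Zagardo

Affiliations * arXiv

AI summary: Proposes Geometry-Aware Tabular Diffusion (GATD), which feeds pairwise angles and lengths computed from column value differences to the diffusion denoiser as inputs and auxiliary targets to model inter-column relationships explicitly, reaching state-of-the-art performance on 10 datasets with fewer parameters.

Comments Accepted to the ICML 2026 main track. 24 pages, 10 figures, 22 tables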

Abstract

Tabular synthesis is critical for privacy-preserving sharing and augmentation, yet diffusion models rely on implicit mechanisms to capture inter-column relationships. We introduce Geometry-Aware Tabular Diffusion (GATD), which augments tabular diffusion denoisers with pairwise angles and lengths computed from column value differences and used as inputs and auxiliary targets. Our MLP instantiation achieves state-of-the-art benchmark performance while using 3.5x fewer parameters on average (up to 25x for classification tasks): on ten datasets, it wins 8/10 Shape, 7/10 Trend, and 9/10 downstream utility (F1/RMSE), reducing Shape and Trend error by 27% and 20%. Default loss weights transfer to GNN and Transformer denoisers, improving Shape on 27/30 and Trend on 25/30 architecture-dataset cells. A matched ablation shows supervision (not extra inputs or capacity) drives the gain. This shows explicit relational supervision is a portable inductive bias for tabular diffusion.

2606.02606 2026-06-03 cs.LG cs.AI Updated version

ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services

Yang Xu, Zihuai Xu, Hongli Xu, Yunming Liao, Zhiwei Yao, Xitong Fu

Affiliations * School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China

AI summary: To address frequent base-model updates invalidating existing LoRA adapters, proposes the ReLoRA framework, which combines Bayesian-optimization-based initialization with fine-tuning under scheduled regularization to reuse knowledge and re-adapt quickly, cutting compute overhead and improving performance.

Abstract

Large Language Models (LLMs) are increasingly deployed as continuously evolving services, where frequent base-model updates may invalidate previously deployed task-specific Low-Rank Adaptation (LoRA) adapters. For service providers managing numerous downstream model services, retraining each LoRA adapter from scratch for every updated base model is computationally prohibitive and delays service rollout. Meanwhile, the simpler alternative, i.e., naively applying the original LoRA adapter to the updated base model, often leads to degraded service quality due to adapter-backbone incompatibility. To address this problem, we propose ReLoRA, a knowledge-reusing re-adaptation framework that efficiently restores service-ready LoRA adapters for evolving LLM services while preserving or improving task performance. Specifically, ReLoRA comprises two key optimization steps: 1) Adaptive LoRA initialization leverages Bayesian optimization to construct a compatibility-aware starting point by fusing information from both the previously deployed task adapter and the base model's evolution; 2) Fine-tuning with scheduled regularization first rapidly steers the adapter to a high-quality region via strong regularization, followed by relaxed regularization for task-specific refinement. This design enables rapid service-quality recovery with reduced re-adaptation overhead. Extensive experiments demonstrate that ReLoRA reduces time-to-readiness by up to 8.9x and improves accuracy by up to 4.6% compared to baselines.
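
The scheduled regularization in step 2 of the abstract can be pictured with a small sketch. The abstract does not give the schedule or the regularizer, so everything here is an assumption for illustration: a two-phase step schedule (`reg_weight`), a squared-distance penalty toward an anchor point such as the initialized adapter (`regularized_loss`), and the default values shown.

```python
def reg_weight(step, total_steps, strong=1.0, relaxed=0.01, switch_frac=0.3):
    """Hypothetical two-phase schedule: strong regularization early, relaxed later."""
    return strong if step < switch_frac * total_steps else relaxed


def regularized_loss(task_loss, params, anchor, lam):
    """Task loss plus lam times the squared distance of params from an anchor.

    Using the initialization as the anchor is an assumption, not a detail
    stated in the abstract.
    """
    penalty = sum((p - a) ** 2 for p, a in zip(params, anchor))
    return task_loss + lam * penalty


# Early in training the penalty dominates; later the task loss does.
early = regularized_loss(1.0, [1.0, 2.0], [0.0, 2.0], reg_weight(0, 100))
late = regularized_loss(1.0, [1.0, 2.0], [0.0, 2.0], reg_weight(50, 100))
```

With these hypothetical defaults the same parameter drift costs far more in the first 30% of steps than afterwards, which is the qualitative behavior the abstract describes.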

2606.02605 2026-06-03 cs.LG cs.AI eess.IV Updated version

Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification

Nikola Cenikj, Özgün Turgut, Alexander Müller, Alexander Steger, Jan Kehrer, Marcus Brugger, Daniel Rueckert, Philip Müller

Affiliations * Chair for AI in Healthcare and Medicine, Technical University of Munich and TUM University Hospital; Department of Computing, Imperial College London; Munich Center for Machine Learning (MCML), Munich, Germany; Department of Internal Medicine, TUM University Hospital

AI summary: Proposes StenCE, a pretraining framework that uses cross-modal contrastive learning to stratify coronary artery stenosis risk from ECG features, and is the first to achieve high performance in severe stenosis classification.

Abstract

Coronary artery stenosis is a common cardiovascular disease, with severe, untreated cases posing significant risks of heart attack. Although coronary (X-ray) angiograms remain the standard for stenosis diagnosis, they are invasive, time- and resource-intensive, and therefore only performed on patients with a high probability of disease based on symptoms and prior clinical tests. However, a subset of patients, especially those without symptoms, may remain undiagnosed. Detecting indications of stenosis from ECGs, which are fast, cheap, non-invasive, and thus routinely acquired even in asymptomatic patients, would support early diagnosis. However, as no reliable stenosis-specific signal has been identified in ECGs, they cannot currently be used for stenosis risk stratification. To address this, we introduce StenCE, a pretraining framework that allows stratification of patients based on features derived directly from ECGs. Evaluations across varying stenosis severity thresholds and additional ECG disease classification tasks demonstrate consistent performance improvements across different ECG encoders, outperforming previous work. The obtained models successfully detect signals for stenosis diagnosis in ECGs and are the first to achieve high performance in severe stenosis classification. The source code is available at https://github.com/NikolaCenic/ecg-stenosis-cls.
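
The title names cross-modal contrastive learning between ECG and angiography representations, but the abstract does not state the loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE-style objective of the kind common in paired-modality pretraining; the function name `info_nce`, the temperature value, and the toy embeddings are all assumptions, not details of StenCE.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i should pick column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(ecg_embs, angio_embs, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings (assumed form)."""
    a = [_normalize(v) for v in ecg_embs]
    b = [_normalize(v) for v in angio_embs]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each ECG embedding matches its paired angiography embedding and large when the pairs are swapped, which is the signal a contrastive pretraining stage optimizes.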

2606.02604 2026-06-03 cs.LG cs.AI Updated version

Auditable Climate Risk Intelligence from Fragmented ESG Data: Deterministic Orchestration and Imbalance-Aware Learning for Scope 1-3 Validation

Karan Sehgal, Khawar Naveed Bhatti

Affiliations * Kent Business School, University of Kent

AI summary: To address fragmented ESG data and the lack of auditability in conventional validation, proposes a framework combining deterministic orchestration, temporal anomaly detection, imbalance-aware ensemble learning, and explainability-oriented governance, and builds a synthetic benchmark for reproducible validation.

Comments 22 pages, 7 figures. Preprint

Abstract

ESG and climate risk data remain fragmented across heterogeneous Scope 1, Scope 2, and Scope 3 reporting environments, while conventional validation pipelines lack provenance-aware auditability, hidden drift detection, and reproducibility-oriented governance. This paper proposes a deterministic climate risk intelligence framework integrating single-source-of-truth orchestration, temporal anomaly detection, imbalance-aware ensemble learning, and explainability-oriented governance for auditable ESG validation. To support open reproducibility, we construct and release a synthetic ESG validation benchmark calibrated against publicly reported characteristics of the GHG Protocol, PCAF, and ISSB standards. The methodology incorporates temporal drift analysis, SMOTE-based rare-event optimization, ensemble learning, provenance-aware orchestration, and TreeSHAP-based interpretability for governance inspection and audit reconstruction. We evaluate the framework against statistical classifiers, anomaly detection methods, temporal forecasting baselines, and a threshold-based system using classification metrics (recall, F1, ROC AUC), calibration metrics (ECE, Brier score), and a governance-oriented audit trace completeness metric measuring the fraction of flagged anomalies for which a deterministic source-to-escalation provenance chain can be reconstructed. Results are reported as mean and standard deviation across stratified five-fold cross-validation with paired significance testing. The framework reframes ESG reporting toward deterministic climate risk governance infrastructure supporting reproducibility, explainability, and operational auditability.
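
The calibration and governance metrics named in the abstract can be written down compactly. The sketch below shows the standard Brier score, one common equal-width-bin variant of expected calibration error (ECE) for binary predictions, and the audit trace completeness fraction as the abstract defines it. The function names, the 10-bin default, and the toy inputs are assumptions; the paper's exact binning is not given in the abstract.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: weighted mean gap between positive rate and mean score."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(frac_pos - mean_p)
    return ece


def audit_trace_completeness(reconstructable):
    """Fraction of flagged anomalies whose provenance chain could be reconstructed."""
    return sum(reconstructable) / len(reconstructable)


probs = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
completeness = audit_trace_completeness([True, True, False, True])
```

Lower Brier score and ECE indicate better-calibrated anomaly probabilities, while completeness closer to 1 means more flagged anomalies can be traced back through a full provenance chain.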

2606.02588 2026-06-03 cs.LO cs.AI cs.PL Updated version

Lean-GAP: A Dataset of Formalized Graduate Algebra Problems

Seewoo Lee, Byung-Hak Hwang, Hyojae Lim, Jihoon Hyun, Ilkyoo Choi, Yeachan Park, Jineon Baek, Hyukpyo Hong, Keewoo Lee, Jaeseong Heo, Hyungryul Baik, Chul-hee Lee, Kyu-Hwan Lee

Affiliations * University of California, Berkeley; Korea Advanced Institute of Science and Technology; Hanyang University; Hufs University; Sungkyunkwan University; University of Wisconsin - Madison; Sejong University; University of Connecticut

AI summary: Presents the Lean-GAP dataset of 430 formalized graduate algebra problems from Dummit and Foote's Abstract Algebra, together with a scalable pipeline that runs from PDF preprocessing through autoformalization to verification.

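2606.02584 2026-06-03 cs.CL cs.AI cs.IR Updated version

IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

Ayman Ali Sharara

AI summary: Proposes IdiomX, a large-scale multilingual idiom benchmark built with a reproducible multi-stage pipeline, covering 190K+ contextualized examples and 12K+ idioms. It defines four tasks (detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, and idiom interpretation). Experiments show that contextual transformer models improve detection, hybrid retrieval with reranking strengthens monolingual and cross-lingual retrieval, and idiom interpretation can be modeled as a semantic retrieval task.

Comments 12 pages, 21 figures. Includes dataset and code. Resources available on HuggingFace, Kaggle, and GitHub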

Abstract

Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are often limited in scale, contextual diversity, or multilingual coverage, restricting their utility for modern language models. We introduce IdiomX, a large-scale multilingual benchmark for idiom understanding, retrieval, and interpretation, constructed through a reproducible multi-stage pipeline combining lexical resource extraction, large-scale normalization, controlled large language model enrichment, and structured validation. The resulting dataset contains over 190K contextualized examples spanning 12K+ idioms, with aligned English, Arabic, and French semantic representations, idiomatic and literal usage labels, and rich linguistic metadata. Building on this resource, we define a unified four-task benchmark covering idiom detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, and idiom interpretation, extending evaluation from figurative recognition to semantic grounding and explainable meaning retrieval. Experiments show that contextual transformer models substantially improve idiom detection, while hybrid retrieval and reranking architectures significantly strengthen both monolingual and cross-lingual idiom retrieval. Results further demonstrate that idiom interpretation can be effectively modeled as a semantic retrieval task, introducing interpretability as a complementary benchmark dimension. Overall, IdiomX provides a scalable benchmark for studying idiomatic language as a progression from detection to retrieval and semantic interpretation, and offers a modular framework extensible to additional languages and figurative reasoning tasks.

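2606.02581 2026-06-03 cs.IR cs.AI Updated version

AgentRedBench: Dynamic Red-Teaming and Integration-Aware Defense for SaaS-Integrated LLM Agents

Hiskias Dingeto, William Leeney

Affiliations * StackOne Technologies

AI summary: To address the indirect prompt injection threat facing tool-using LLM agents, proposes AGENTREDBENCH, a dynamic red-teaming benchmark covering 24 enterprise integrations and 5 attack types, and AGENTREDGUARD, a guard model trained on an integration-diverse corpus, which cuts attack success rate from 69.9% to 2.4% at a 0.37% false-positive rate.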

Abstract

Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open-source guards are trained on chat-style data rather than tool-response content. We introduce AGENTREDBENCH, a dynamic LLM-driven red-teaming benchmark of 215 subtle underspecified-authorization scenarios (attacks at the boundary of what the user's request authorises) across 24 enterprise integrations in nine functional families and five attack types. Across an eight-model panel (Anthropic, OpenAI, Google), no-guard ASR (attack success rate) ranges from 32% (Claude Sonnet 4.6) to 81% (Gemini 3 Flash). To keep the scenario set out of training corpora and preserve headline ASR meaning over time, we release the codebase, integration schemas, and AGENTREDGUARD model openly; the canonical scenarios are evaluated through a maintainer-mediated channel with immutable versioning. We release AGENTREDGUARD alongside the benchmark: a guard trained on an integration-diverse corpus of adversarial tool-response content. AGENTREDGUARD cuts panel ASR from 69.9% to 2.4% at a 0.37% false-positive rate, outperforming every open-source baseline with non-trivial detection (Llama Guard, PromptGuard 2, ProtectAI) on both axes. Cross-integration and cross-attack-type holdouts both confirm the gain transfers beyond the training subset.
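
The two headline quantities, attack success rate and false-positive rate, are simple fractions, and the figures quoted in the abstract imply a relative ASR reduction that is easy to check. The helper names below are illustrative assumptions, and the toy outcome list is made up; only the 69.9% and 2.4% values come from the abstract.

```python
def rate(outcomes):
    """Fraction of True entries, e.g. successful attacks or wrongly flagged benign items."""
    return sum(outcomes) / len(outcomes)


def relative_reduction(before, after):
    """Share of the original rate that was removed."""
    return (before - after) / before


# Toy run: 3 of 4 attack attempts succeed without a guard.
toy_asr = rate([True, False, True, True])

# Panel ASR figures from the abstract: 69.9% without a guard, 2.4% with it.
panel_reduction = relative_reduction(69.9, 2.4)
```

Under these numbers the guard removes roughly 96.6% of the successful attacks in relative terms, which should be read together with the 0.37% false-positive rate since a guard can trivially lower ASR by blocking everything.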

2606.02132 2026-06-03 cs.AI Updated version

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen, Qiang Liu, Shu Wu, Liang Wang

Affiliations * NLPR, Institute of Automation, Chinese Academy of Sciences; ByteDance; Zhejiang University

AI summary: Proposes the EAPO framework, which uses tool-free trajectories, difficulty-aware reward shaping, and confidence-aware token reweighting to reduce tool abuse on mathematical and knowledge-intensive reasoning tasks while improving the accuracy-efficiency trade-off.

Comments Under review

Abstract

Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy-efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.
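
One way to picture difficulty-aware reward shaping is to scale a per-call tool penalty by how easy the query looks. The abstract does not give EAPO's formula, so the sketch below is purely an assumption: it estimates ease as the success rate of the tool-free trajectories in a rollout group and charges tool calls in proportion to it. The function name, the dictionary keys, and the penalty value are all hypothetical.

```python
def shaped_rewards(group, penalty=0.1):
    """Hypothetical shaping: correctness minus an ease-scaled cost per tool call.

    group: list of dicts with keys 'correct' (bool) and 'tool_calls' (int).
    Ease is the fraction of tool-free trajectories that were correct, so
    queries solvable without tools make tool calls expensive.
    """
    tool_free = [t for t in group if t["tool_calls"] == 0]
    ease = sum(t["correct"] for t in tool_free) / len(tool_free) if tool_free else 0.0
    return [float(t["correct"]) - penalty * ease * t["tool_calls"] for t in group]


easy_query = [
    {"correct": True, "tool_calls": 0},
    {"correct": True, "tool_calls": 2},
]
hard_query = [
    {"correct": False, "tool_calls": 0},
    {"correct": True, "tool_calls": 2},
]
easy_rewards = shaped_rewards(easy_query)
hard_rewards = shaped_rewards(hard_query)
```

On the easy query the tool-using trajectory is docked for its two redundant calls, while on the hard query the same two calls cost nothing, which matches the abstract's goal of penalizing tool use mainly where internal reasoning suffices.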

2606.02060 2026-06-03 cs.AI Updated version

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Jiaming Wang, Ziteng Feng, Jiangtao Wu, Ruihao Li, Qianqian Xie, Yuxiang Ren, He Zhu, Xueming Han, Fanyu Meng, Junlan Feng, Jiaheng Liu

Affiliations * NJU-LINK Team, Nanjing University; JIUTIAN Research

AI summary: To address the difficulty of locating errors in the long trajectories of deep-research agents, builds the TELBench benchmark and proposes the DRIFT auditing framework for span-level error localization, improving first-error localization accuracy by up to 30 percentage points.

Comments 28 pages, 11 figures, 4 tables

Abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.

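2606.01904 2026-06-03 cs.CL cs.AI Updated version

Topological Ignorability for Structural Causal Effects Beyond Means

Usef Faghihi

AI summary: Proposes topological-geometrical causal metrics (density-superlevel Betti summaries, Euler signatures, and persistent-homology summaries) to quantify structural differences between interventional distributions, and introduces a topological ignorability assumption that identifies structural causal effects without requiring the full counterfactual distribution.

Comments This is a new version of our paper titled: Beyond Means: Topological Causal Effects under Persistent-Homology Ignorability. So we will resubmit this as version 2 of arXiv:2603.14169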

Abstract

Many interventions alter the structure of an outcome distribution rather than its mean: they can split a population into disconnected regimes, create loops or holes, generate branches, or reorganize an outcome cloud while leaving the average response nearly unchanged. In such settings, mean-based causal estimands such as the average treatment effect may miss important structural effects. We introduce topological-geometrical causal metrics based on summaries of interventional outcome laws, including density-superlevel Betti summaries, Euler signatures, and persistent-homology summaries. These metrics quantify structural differences between treated and untreated outcome laws beyond averages. We also study the assumptions needed for causal interpretation. We introduce topological ignorability, a topological analogue of conditional ignorability that requires invariance of the chosen structural feature rather than the full counterfactual distribution. When the chosen summary is injective, this condition coincides with weak ignorability; for noninjective summaries, it can identify the structural feature of interest without identifying the full interventional law. We define a covariate-standardized topological-geometrical causal effect and develop practical estimators. We validate the framework in two hidden-confounding benchmarks: a fully synthetic exact benchmark and a real-covariate semi-synthetic benchmark using Wisconsin breast-cancer covariates. In both, weak ignorability fails and balancing observed covariates nearly eliminates standardized mean differences, yet the coordinate-mean average treatment effect remains biased. By contrast, selected finite density-superlevel Betti and Euler contrasts remain stable across oracle, observational, and weighted analyses.

2606.01162 2026-06-03 cs.AI Updated version

Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts

Ya Shen, Gang Chen, Hui Ma, Mengjie Zhang

Affiliations * School of Engineering and Computer Science, Victoria University of Wellington

AI summary: Proposes DEFT, a deep reinforcement learning scheduling policy based on a mixture of experts, which routes decisions dynamically through a graph-adaptive gating mechanism and reduces execution cost and deadline violations.

Comments This paper has been accepted by the Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Workflow scheduling in cloud computing demands the intelligent allocation of dynamically arriving, graph-structured workflows with varying deadlines onto ever-changing virtual machine resources. However, existing deep reinforcement learning (DRL) schedulers remain limited by rigid, single-path inference architectures that struggle to handle diverse scheduling scenarios. We introduce DEFT (Deadline-pErceptive Mixture-oF-ExperTs), an innovative DRL policy architecture that leverages a specialized mixture of experts, each trained to manage different levels of deadline tightness. To our knowledge, DEFT is the first to introduce and validate a Mixture-of-Experts architecture for dynamic cloud workflow scheduling. By adaptively routing decisions through the most appropriate experts, DEFT is capable of meeting a broad spectrum of deadline requirements that no single expert can achieve. Central to DEFT is a graph-adaptive gating mechanism that encodes workflow DAGs, task states, and VM conditions, using cross-attention to guide expert activation in a fine-grained, deadline-sensitive manner. Experiments on dynamic cloud workflow benchmarks demonstrate that DEFT significantly reduces execution cost and deadline violations, outperforming multiple state-of-the-art DRL baselines.
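
The routing idea in the abstract can be illustrated with the generic mixture-of-experts pattern: a gate produces one score per expert, a softmax turns the scores into weights, and the experts' outputs are blended. This is only a sketch of that general pattern under stated assumptions; DEFT's graph encoder and cross-attention gate are not reproduced, and the names `softmax` and `mixture_output` and the toy scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_output(gate_scores, expert_outputs):
    """Blend per-expert action scores using softmax gate weights.

    gate_scores: one score per expert (assumed to come from some gating network).
    expert_outputs: one list of action scores per expert, all the same length.
    """
    weights = softmax(gate_scores)
    n_actions = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(n_actions)
    ]


# Two experts, e.g. one tuned for tight deadlines and one for loose deadlines.
balanced = mixture_output([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
tight = mixture_output([100.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Equal gate scores average the two experts, while a strongly skewed gate hands the decision almost entirely to one expert, which is how a deadline-sensitive gate could favor the expert suited to the current tightness level.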

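2606.01013 2026-06-03 cs.AI cs.AR Updated version

Can AI Review Improve Paper Drafting? An Empirical Study on 20 Computer Architecture Submissions

Di Wu

Affiliations * University of Central Florida

AI summary: Defines alignment metrics and builds the AI-Paper-Review tool for a case study of 20 computer architecture papers, finding that AI review covers a significant fraction of human-raised issues and also surfaces issues humans missed, and uses this to examine the potential and limitations of AI review for improving paper drafting.

Comments 12 pages, 12 figures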

Abstract

Research is advancing faster than ever with artificial intelligence (AI), and so are the corresponding research papers. The exploding volume of AI-generated papers has put a strain on peer review, leading to the use of AI-generated reviews, potentially widespread yet covert. However, ethical concerns about confidentiality, quality, and fairness have been raised, and the broad research community has reached no consensus. We expect the debate to continue for a while, but in the meantime, we ask an alternative, practical question: can AI review improve paper drafting? We study 20 computer architecture papers, with varying levels of submission lineage, to expose how well AI review aligns with human review, quantified by a set of metrics we define. To conduct the case study, we build a web UI-integrated tool, AI-Paper-Review, that generates a structured AI review of a draft paper, available at https://github.com/unarylab/ai-paper-review. This tool selects several AI reviewers from a diverse pool, then clusters and ranks their comments by commonality and importance. It also allows AI comments to be aligned with human comments to facilitate metric-based validation. The case study shows that AI review can cover a significant fraction of human-raised issues, and also raises issues missing from human review. This paper is not intended to encourage using AI for peer review at the current stage, but to study (1) how AI review can improve paper drafting and (2) the potential and limitations of AI-based peer review. The release of the tool and the case study data is intended to instigate future research on this topic. Misuse for peer review would violate the ethics policies of major academic venues.

2606.00809 2026-06-03 cs.AI Updated version

NBQ: Next-Best-Question for Dynamic Profiling

Yimin Shi, Clarice Wang, Haixun Wang, Xiaokui Xiao

Affiliations * National University of Singapore; University of Pennsylvania; EvenUp

AI summary: Proposes the NBQ framework, which builds user profiles dynamically from dialogue by adaptively selecting the question with the highest expected information gain, and introduces QuickMatch to accelerate reciprocal matching.

Journal ref: KDD 2026

Abstract

Many real-world conversational settings for knowledge discovery, including podcasts, hiring screens, and marketplaces, require a purpose-driven understanding of a person. We study the Next-Best-Question (NBQ) problem: at each turn, an interviewer should ask the question with the highest expected information gain given what has already been learned and the conversation goal. We propose NBQ, a plug-and-play framework that seeds a diverse pool of candidate questions, maintains a compact and continuously updated user state, adaptively selects the next question within a turn budget, and distills the resulting free-form dialogue into a structured vector-based user profile. As a demanding application, we instantiate NBQ for reciprocal matchmaking, where compatibility must be mutual and each person is modeled by both self-description and counterpart-preference representations. To support large-scale matching, we further introduce QuickMatch, an efficient retrieval layer that recasts reciprocal matching from quadratic pairwise scoring to approximate vector search. Experiments show that NBQ improves user profiling quality by up to 13.6% and 14.0% in AC@T and AR@T, respectively, while QuickMatch accelerates retrieval by up to 22.9x with recall up to 0.989.
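
The selection rule in the abstract, asking the question with the highest expected information gain, has a textbook form when the user state is a discrete belief over hypotheses: the gain is the prior entropy minus the expected posterior entropy over possible answers. The sketch below implements that textbook form only; the discrete-hypothesis setup, the function names, and the toy questions are assumptions and not NBQ's actual user-state model.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def expected_information_gain(prior, likelihood):
    """Prior entropy minus expected posterior entropy.

    likelihood[h][a] is P(answer a | hypothesis h) for one candidate question.
    """
    gain = entropy(prior)
    for a in range(len(likelihood[0])):
        p_answer = sum(prior[h] * likelihood[h][a] for h in range(len(prior)))
        if p_answer == 0:
            continue
        posterior = [prior[h] * likelihood[h][a] / p_answer for h in range(len(prior))]
        gain -= p_answer * entropy(posterior)
    return gain


def next_best_question(prior, questions):
    """Pick the question name with the largest expected information gain."""
    return max(questions, key=lambda name: expected_information_gain(prior, questions[name]))


prior = [0.5, 0.5]
questions = {
    "decisive": [[1.0, 0.0], [0.0, 1.0]],   # the answer reveals the hypothesis
    "uninformative": [[0.5, 0.5], [0.5, 0.5]],  # the answer is independent of it
}
best = next_best_question(prior, questions)
```

Here the decisive question yields one full bit of information and the uninformative one yields none, so the selector picks the former; a turn budget would simply cap how many times this selection is repeated.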

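2606.00680 2026-06-03 cs.AI cs.LG Updated version

Diversity over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Dong-Hee Kim, Reuben Tan, Donghyun Kim

Affiliations * University of California, Berkeley; University of Cambridge; University of Toronto

AI summary: Studies tool use by visual chain-of-thought agents on complex reasoning tasks, identifies a tool-use collapse phenomenon, and proposes entropy regularization that improves reasoning performance by encouraging diverse exploration.

Comments Presented at ICML 2026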

Abstract

Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainly studied these tools in visual search tasks, their role in more complex visual reasoning remains underexplored. In this paper, we move beyond simple visual search tasks to investigate more challenging tasks, including 3D spatial reasoning and medical visual question answering, where agents must integrate tool-acquired local evidence with the global context. We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse. Project page: https://scaffolded-exploration.github.io
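
An entropy regularization term of the kind the abstract mentions is usually a bonus subtracted from the training loss, so that flatter (more exploratory) action distributions are rewarded. The sketch below shows that generic form for a single categorical distribution. Where the paper applies the term (tokens, tool calls, or both) and the coefficient `beta` are not stated in the abstract and are assumptions here.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def regularized_loss(task_loss, logits, beta=0.01):
    """Task loss minus beta times the policy entropy (assumed generic form)."""
    return task_loss - beta * entropy(softmax(logits))


uniform_loss = regularized_loss(1.0, [0.0, 0.0])   # maximally diverse two-way choice
peaked_loss = regularized_loss(1.0, [10.0, 0.0])   # nearly deterministic choice
```

For the same task loss, the uniform policy ends up with the lower regularized loss, so gradient descent is nudged away from collapsing onto a single behavior too early.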

2605.30789 2026-06-03 cs.LG cs.AI Updated version

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi, Yukang Chen, Dingdong Wang, Tianhe Wu, Junjie Wang, Yujiu Yang, Yu Qiao, Ruihang Chu

Affiliations * University of Science and Technology of China

AI summary: Proposes the S2L-PO framework, which uses small models as natural explorers to generate rollouts with policy-level diversity and transitions to the large model's own sampling through a progressive annealing strategy, improving mathematical reasoning performance while reducing compute.

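AI-translated abstract

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. Although GRPO relies on diverse rollouts, mainstream strategies increase diversity mainly by injecting more token-level randomness, which can introduce step-wise noise and lead to incoherent trajectories. We find that smaller models within the same model family inherently exhibit higher policy-level diversity: as the number of samples grows, their pass@k exceeds that of larger models. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We therefore propose S2L-PO (Small-to-Large Policy Optimization), a framework that uses fixed small models as natural explorers to train large models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This transition avoids the mid-training performance drop caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on a range of mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 when a 1.7B explorer guides an 8B model) while reducing rollout compute.

English abstract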

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
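
The progressive annealing described in this abstract can be pictured as a schedule that decides how many rollouts in each GRPO group come from the small explorer versus the large learner. The sketch below assumes a linear decay; the function names, the `anneal_steps` parameter, and the linear shape are hypothetical, since the abstract does not give the actual schedule.

```python
def small_model_fraction(step, anneal_steps):
    """Share of rollouts taken from the small explorer at a training step.

    Assumed linear decay from 1.0 at step 0 to 0.0 at `anneal_steps`.
    """
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)


def split_rollout_group(group_size, step, anneal_steps):
    """Return (n_small, n_large): rollouts per group from each source."""
    n_small = round(group_size * small_model_fraction(step, anneal_steps))
    return n_small, group_size - n_small
```

With a group of 8 rollouts and `anneal_steps=100`, this split starts at 8 small-model rollouts, reaches an even 4/4 mix at step 50, and uses only the large learner's own samples from step 100 onward.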

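2605.28556 2026-06-03 cs.AI (updated version)

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz, Michal Shmueli-Scheuer, Roi Reichart

AI summary: Proposes TASTE, a method that reverses the task-construction process and uses an adaptive contrastive n-gram model and clustering to automatically generate difficult benchmark tasks covering a wide range of tool combinations, addressing the saturation of existing benchmarks.

Details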

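AI-translated abstract

As agent capabilities improve, existing benchmarks such as $τ^2$-Bench are becoming saturated. Yet building new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach (first writing a scenario in natural language, then mapping it to a tool sequence) captures only a narrow subset of the tool-use patterns that agents exercise. In this paper, we address these problems by reversing the task-construction process. We propose TASTE (Task Synthesis from Tool Sequence Evolution), a method that automatically generates challenging tasks with broader tool-use coverage. TASTE uses an adaptive contrastive n-gram model trained on LLM-judged validity signals, which makes it possible to sample valid tool sequences covering a large number of tool combinations. TASTE then selects representative sequences from the pool by clustering, instantiates them as complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we build $τ^c$-Bench, a challenging extension of the three domains of $τ^2$-Bench. We evaluate 11 agent/user LLM pairs and find that models that nearly saturate $τ^2$-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from $0.82-0.94$ to $0.28-0.61$). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations that agents must execute. Our results show that high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.

English abstract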

As agent capabilities advance, existing benchmarks, such as $τ^2$-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive $n$-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct $τ^c$-Bench, a challenging extension of the three domains of $τ^2$-Bench. We evaluate $11$ agent/user LLM pairs and find that models nearly saturating $τ^2$-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from $0.82\!-\!0.94$ to $0.28\!-\!0.61$). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.

2605.27762 2026-06-03 cs.AI (updated version)

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

Yuchen Guo, Junli Gong, Weicheng Wang, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

Affiliations: Northwestern University; Northeastern University; South China University of Technology; Hong Kong Baptist University; Beijing Normal - Hong Kong Baptist University

AI summary: Proposes the PEAM framework, which contrastively internalizes failure-correction trajectory pairs to turn experience into parametric skills, enabling continual learning and efficient execution for embodied agents in Minecraft.

Details

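AI-translated abstract

We propose PEAM, a Parametric Embodied Agent Memory framework in Minecraft that shifts agent memory from inference-time retrieval to parameter-resident skills internalized through experience. PEAM pairs a slow-thinking LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with physically isolated per-category adapters, enabling parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure-correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score that decides which experiences should be internalized, and a scale-free self-triggered consolidation mechanism that decides when to internalize without task-specific hand-tuned thresholds; because the trigger transfers across task distributions without re-tuning, the agent can self-evolve. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting of previously consolidated skills, and improves parametric-versus-retrieval efficiency, outperforming retrieval-based embodied agents and parametric memory variants.

English abstract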

We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with per-category physically isolated adapters, enabling parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure-correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score for deciding which experience should be internalized, and a scale-free self-triggered consolidation mechanism for deciding when to internalize without task-specific hand-tuned thresholds, making the agent self-evolving as the trigger transfers across task distributions without re-tuning. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting on previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.

2605.26704 2026-06-03 cs.LG cs.AI (updated version)

SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation

Haochun Wang, Sendong Zhao, Jingbo Wang, Yanrui Du, Ting Liu, Bing Qin

Affiliations: Faculty of Computing, Harbin Institute of Technology

AI summary: Proposes the SL-BiLEM model, which achieves robust extrapolation through physics-constraint regularization, improves forecasting accuracy by 76% under distribution shift caused by policy interventions, and supports counterfactual analysis.

Comments: ACM SIGKDD 2026

Details

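AI-translated abstract

Epidemic forecasting faces a fundamental challenge: human behavior responds dynamically to disease spread, forming feedback loops that trigger distribution shifts at policy intervention points. This makes data-driven models unreliable under distribution shift. We propose SL-BiLEM (Structured Learnable Behavior-in-the-Loop Epidemic Model), which uses physical constraints as regularization to achieve robust extrapolation. The framework decomposes the effective transmission rate as $β_{\text{eff}}(t,g) = β_0(g) \times m_{\text{policy}}(t) \times m_{\text{media}}(t) \times m_{\text{comp}}(t,g)$, where monotonicity, smoothness, and bounded-jump constraints are imposed on the learned compliance function to keep predictions valid under new policy regimes. Beyond forecasting, SL-BiLEM also supports counterfactual analysis for intervention decision support. We validate forecasting performance on three real-world datasets (cruise ship, school influenza, and school-district COVID-19 surveillance) and evaluate counterfactual recovery on synthetic benchmarks with known ground truth. SL-BiLEM shows: (1) a 76% improvement over neural-mechanistic baselines, with only 53% OOD degradation under policy-induced shift versus 1142% for neural baselines; (2) 100% bootstrap confidence-interval coverage across 27 synthetic counterfactual experiments; and (3) treatment effect accuracy above 0.85. These results make SL-BiLEM an interpretable tool for public health decision-makers who seek accurate prediction and principled intervention planning.

English abstract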

Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that induce distribution shifts at policy intervention points. This renders data-driven models unreliable under distribution shift. We propose SL-BiLEM (Structured Learnable Behavior-in-the-Loop Epidemic Model), leveraging physical constraints as regularization for robust extrapolation. The framework decomposes effective transmission as $β_{\text{eff}}(t,g) = β_0(g) \times m_{\text{policy}}(t) \times m_{\text{media}}(t) \times m_{\text{comp}}(t,g)$, where monotonicity, smoothness, and bounded-jump constraints on the learned compliance function maintain predictive validity under novel policy regimes. Beyond forecasting, SL-BiLEM enables counterfactual analysis for intervention decision support. We validate forecasting on three real-world datasets (cruise ship, school influenza, and school-district COVID-19 surveillance) and evaluate counterfactual recovery on synthetic benchmarks with known ground truth. SL-BiLEM demonstrates: (1) 76% improvement over neural-mechanistic baselines, with only 53% OOD degradation versus 1142% for neural baselines under policy-induced shift; (2) 100% bootstrap CI coverage across 27 synthetic counterfactual experiments; and (3) Treatment Effect Accuracy exceeding 0.85. These results establish SL-BiLEM as an interpretable tool for public health decision-makers seeking accurate prediction and principled intervention planning.
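
The multiplicative decomposition of effective transmission in this abstract can be written out directly. The sketch below only evaluates the product for one time step `t` and group `g`; the container layout (dictionaries keyed by time, group, or both) and every numeric value in the usage note are made up for illustration and are not taken from the paper.

```python
def effective_transmission(beta0, m_policy, m_media, m_comp, t, g):
    """Evaluate beta_eff(t, g) = beta_0(g) * m_policy(t) * m_media(t) * m_comp(t, g).

    beta0:    baseline transmission rate per group g
    m_policy: policy multiplier per time step t
    m_media:  media multiplier per time step t
    m_comp:   compliance multiplier per (t, g) pair
    """
    return beta0[g] * m_policy[t] * m_media[t] * m_comp[(t, g)]
```

For example, a baseline rate of 0.4 combined with multipliers 0.5, 0.9, and 0.8 gives an effective rate of about 0.144, so each multiplier below 1 scales the baseline down independently.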

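2605.30155 2026-06-03 cs.LO cs.AI (updated version)

Neural Network Verification using Partial Multi-Neuron Relaxation

Ido Shmuel, Guy Katz

AI summary: Proposes partial multi-neuron relaxation, which generates multi-neuron bounds for a small, heuristically selected set of neurons, balancing tightness and scalability within the Marabou verifier.

Comments: To appear in SAIV 2026

Details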

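AI-translated abstract

The growing integration of deep neural networks into critical systems has sparked theoretical and practical interest in formal safety guarantees about their behavior. To this end, contemporary verification algorithms rely on computing linear relaxations of the network's non-linear activation functions. Existing linear relaxation methods usually fall into two categories: single-neuron relaxation, in which each activation neuron is bounded in terms of its sources; and multi-neuron relaxation, in which linear bounds involving multiple activation neurons and their sources are computed. However, existing methods may fail to balance tightness and scalability, because single-neuron bounds may not be tight enough for verification, while generating multi-neuron relaxations for all activation neurons is computationally expensive. In this paper, we propose an intermediate approach, partial multi-neuron relaxation, in which we generate multi-neuron bounds only for a small, heuristically selected subset of neurons. To do so, we select neurons based on existing branching heuristics and optimize the bounding hyperplanes of the multi-neuron bounds. We integrate the proposed method into the Marabou verifier and obtain favorable results compared with existing bound-tightening methods. Our experiments demonstrate the potential of our technique for neural network verification.

English abstract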

The increasing integration of deep neural networks in critical systems has spawned theoretical and practical interest in formally guaranteeing safety properties about their behavior. To achieve this, contemporary verification algorithms rely on computing linear relaxations for a network's non-linear activation functions. Existing approaches for linear relaxations typically fall into one of two categories: single-neuron relaxation, in which each activation neuron is bounded in terms of its sources; and multi-neuron relaxation, in which linear bounds involving multiple activation neurons and their sources are calculated. However, existing methods might fail to balance tightness and scalability, as single-neuron relaxations might not yield bounds tight enough for verification to complete, whereas generating multi-neuron relaxations for all activation neurons is computationally expensive. In this paper, we present a middle-ground approach featuring partial multi-neuron relaxation, in which we generate multi-neuron bounds for only a small, heuristically selected subset of neurons. To achieve this, we build upon existing branching heuristics for selecting neurons and for optimizing bounding hyperplanes for multi-neuron bounds. We integrated our proposed method within the Marabou verifier, and obtained favorable results in comparison to existing bound tightening methods. Our experiments showcase the potential of our technique for neural network verification.

2605.29930 2026-06-03 cs.AI cs.CY cs.HC (updated version)

Toward AI That Understands Self and Others: A World-Model Theory of Cognitive Diversity and Alignment

Toward AI Systems That Understand Self and Others: A Multi-Phase Inference Framework for Human Cognitive Diversity and World-Model Alignment

Toru Takahashi

Affiliations: Human Informatics and Systems Lab, Doshisha University; Linked Open Data Initiative, NPO Keio Research Institute at SFC; Stroly Inc

AI summary: Proposes a multi-phase inference framework (MIM) that formalizes how heterogeneous world models arise through a phase-formation space, a foregrounding field, subject-specific contour states, and alignment maps between state representations, and reframes world-model alignment as the problem of making heterogeneous representations mutually processable rather than forcing agreement.

Comments: 87 pages. Revised version with a refined abstract emphasizing disagreement as a late-stage phenomenon, target admissibility, processability, and the methodological abstraction used to compare humans, AI systems, and institutional decision procedures under shared information-theoretic constraints

Details

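AI-translated abstract

Mutual misunderstanding in contemporary society does not arise merely because people hold different opinions or values. Even under the same observations, different agents may form different inference targets, state representations, prediction errors, and update priorities. This paper proposes a multi-phase inference framework and defines its core internal mechanism as the Multi-Phase Inference Mechanism (MIM). MIM formalizes how heterogeneous world models arise through a phase-formation space, a foregrounding field, subject-specific contour states, and alignment maps between state representations. On this basis, the paper redefines world-model alignment as the problem of making heterogeneous representations mutually processable, rather than forcing agreement or convergence to a single value system. It further connects this formalization to philosophical disagreement, cognitive typology, social division, and AI alignment. The aim is to provide AI systems with a constructive vocabulary that helps humans understand themselves and others by making differences in meaning, value, and prediction error visible, comparable, and transformable.

English abstract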

Modern societies possess more information than ever before, yet they do not converge toward a single shared understanding. The same events, facts, laws, technologies, or risks can be interpreted as evidence of freedom, danger, exclusion, injustice, responsibility, or unrealized possibility. Existing discussions often treat such disagreement as a conflict of values, preferences, or beliefs. This paper argues that disagreement is already a late-stage phenomenon. The central premise is simple but not trivial: observation is not yet inference. Not every observation becomes inferentially relevant, and not every possible object in an observation sequence becomes an estimation target. A possible target becomes admissible only when a state representation can be constructed that is approximately sufficient for prediction, evaluation, or action with respect to that target. This paper develops a world-model theory of cognitive diversity and alignment by reconstructing recognition as the construction of such approximate sufficient statistics under finite informational, representational, observational, and action constraints. It formulates this position as the Multi-Phase Inference Assumption (MIA) and defines its core internal mechanism as the Multi-Phase Inference Mechanism (MIM). The framework introduces alignment maps and transformation loss to analyze how heterogeneous world models communicate without being collapsed into a single representation. World-model alignment is therefore processability, not agreement: the design of AI systems that help heterogeneous forms of intelligence remain mutually processable while preserving their distinct error-detection capacities.

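2605.28166 2026-06-03 cs.LG cs.AI (updated version)

QuITE: Query-Based Irregular Time Series Embedding

Junghoon Lim

AI summary: Proposes QuITE, a plug-and-play embedding module that aggregates irregular observations with learnable query tokens, requires no interpolation or architectural changes, and significantly improves the forecasting and classification performance of multivariate time series models.

Comments: ICML 2026

Details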

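AI-translated abstract

Irregular multivariate time series are common in practice, but their irregular sampling makes effective modeling difficult. Existing methods typically either (i) design specialized architectures, which limits the reuse of proven multivariate time series models, or (ii) map irregular time series onto a regular time grid through interpolation, which can introduce artificial values and distort temporal dynamics. To address these limitations, we propose a new approach based on input embedding. We find that the key bottleneck lies not in the backbone architecture but in conventional embedding layers that assume uniform sampling. In this work, we introduce QuITE (Query-Based Irregular Time Series Embedding), a simple yet effective plug-and-play embedding module. QuITE uses learnable query tokens to aggregate irregular observations through a single self-attention layer, directly producing backbone-compatible latent representations without generating artificial values or modifying the architecture. Extensive experiments on real-world benchmarks show that QuITE consistently improves multivariate time series models, with average relative gains of up to 54.7% in forecasting and up to 15.8% in classification across datasets and backbone architectures. Code is available at https://github.com/Meaningfull9502/QuITE.

English abstract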

Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing approaches typically either (i) design specialized architectures that limit the reuse of proven Multivariate Time Series (MTS) models, or (ii) map IMTS onto regular temporal grids through interpolation, which may distort temporal dynamics by introducing artificial values. To address these limitations, we propose a new input-embedding-based approach. We identify that the key bottleneck lies not in the backbone architecture, but in conventional embedding layers that assume uniform sampling. In this work, we introduce QuITE (Query-Based Irregular Time Series Embedding), a simple yet effective plug-and-play embedding module for IMTS. QuITE employs learnable query tokens to aggregate irregular observations through a single self-attention layer, directly producing backbone-compatible latent representations without artificial value generation or architectural modification. Extensive experiments on real-world benchmarks show that QuITE consistently improves MTS models, yielding average relative gains of up to $54.7\%$ in forecasting and $15.8\%$ in classification across diverse datasets and backbone architectures. Code is available at: https://github.com/Meaningfull9502/QuITE.

2605.28910 2026-06-03 cs.CL cs.AI (updated version)

Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Shamanth Kuthpadi Seethakantha, Dung Ngoc Thai, Vara Prasad Gudi, Simran Tiwari, Rami Matar, Avijit Mitra, Wenlong Zhao, Andrew McCallum, Wael Salloum

Affiliations: Manning College of Information and Computer Sciences; Ensemble HP; Columbia University; University of Massachusetts Amherst

AI summary: Proposes an inference-time method that uses hallucination detectors to guide iterative revision, together with preference-learning fine-tuning, substantially reducing hallucinations in clinical summaries.

Details

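AI-translated abstract

Large language models (LLMs) show promise on summarization tasks but often produce hallucinations, that is, unsupported or incorrect statements, which limits their reliability in specialized healthcare applications. We introduce Hallucination Detection Guided Self-Refinement (HDSR), an inference-time method that uses hallucination detectors to guide iterative summary revision toward factual correction. Building on this, we propose HDSR for Preference Learning (HDSR-PL), which converts detector-guided refinement trajectories into preference pairs for model fine-tuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models when summarizing real clinical notes from MIMIC-IV-Note v2.2. For example, HDSR reduces hallucinations by 24% on Llama-3.1-8B-Instruct, while HDSR-PL reduces them by 48%. Importantly, according to human expert and LLM-Jury evaluations, both methods preserve the fluency, coherence, and relevance of the summaries. Together, these results show that detection-informed refinement and preference learning offer an automated way to improve factual accuracy in clinical summarization.

English abstract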

Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or incorrect statements that limit their reliability in specialized healthcare applications. We introduce Hallucination Detection Guided Self-Refinement (HDSR), an inference-time method that leverages hallucination detectors to guide iterative summary revisions toward factual corrections. Building on this, we propose HDSR for Preference Learning (HDSR-PL), which converts detector-guided refinement trajectories into preference pairs for model finetuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models in summarizing real-world clinical notes from MIMIC-IV-Note v2.2. For example, on Llama-3.1-8B-Instruct, HDSR reduces hallucinations by 24% and HDSR-PL reduces them by 48%. Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Together, these results demonstrate that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.

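2605.26366 2026-06-03 cs.AI cs.LG (updated version)

Automatic Layer Selection for Hallucination Detection

Xinpeng Wang, William X. Cao, Andrew Gordon Wilson, Zhe Zeng

Affiliations: University of Washington

AI summary: Addresses layer selection for hallucination detection in large language models by proposing FEPoID, a training-free criterion that automatically identifies the best intermediate layer, combined with a truncation strategy that improves detection performance.

Comments: Accepted at ICML 2026

Details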

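AI-translated abstract

Recent work on hallucination detection shows that, in large language models (LLMs), hallucination-related signals are encoded more strongly in intermediate layers than in the final layer. Although a growing number of studies try to exploit this property for hallucination detection, how to automatically select high-performing layers remains underexplored, and principled methods for this purpose are lacking. To fill this gap, we first propose several hypotheses about why these signals appear in intermediate layers and evaluate the corresponding automatic layer-selection criteria across LLM architectures, scales, and tasks, covering question-answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identifies optimal or near-optimal layers and outperforms both the above criteria and existing hallucination detection baselines. FEPoID is training-free and has negligible computational overhead. In addition, we study the generation behavior of LLMs and introduce a simple yet effective truncation strategy that further amplifies hallucination-related signals and significantly improves overall detection performance. Code is publicly available at https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git

English abstract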

Recent studies on hallucination detection have shown that hallucination-related signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). Although a growing body of work has sought to exploit this property for hallucination detection, how to automate the selection of high-performing layers remains underexplored, and principled methods for this purpose are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and evaluate corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identifies optimal or near-optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. FEPoID is training-free and incurs negligible computational overhead. In addition, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies hallucination-related signals and substantially improves overall detection performance. Code is publicly available at https://github.com/DesoloYw/Automatic-Layer-Selection-for-Hallucination-Detection.git

2605.12925 2026-06-03 cs.SE cs.AI (updated version)

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

Zahra Rahimi Afzal, Wataru Uegami, Saghir Alfasly, Wenchao Han, Saba Yasir, Judy C. Boughey, Matthew P. Goetz, Krishna R. Kalari, H. R. Tizhoosh

Affiliations: Kimia Lab, Department of Artificial Intelligence & Informatics, Mayo Clinic, Rochester, MN, USA; DICE Lab, Department of Electrical and Computer Engineering, University of Illinois Chicago, IL, USA; Division of Computational Pathology and Informatics, Mayo Clinic, Rochester, MN, USA; Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA; Department of Breast and Melanoma Surgical Oncology, Comprehensive Cancer Center, Mayo Clinic, Rochester, MN, USA; Department of Oncology, Comprehensive Cancer Center, Mayo Clinic, Rochester, MN, USA

AI summary: Proposes CRISP, an unsupervised framework that integrates the multiple whole-slide images within a case through clustering and redundancy-reduced sampling, building a compact, representative patch set for case-level retrieval that matches or surpasses current standard practice on breast cancer datasets.

Details

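AI-translated abstract

In digital pathology archives, each case usually contains multiple whole-slide images (WSIs) that capture spatially distinct tumor regions and reflect intrinsic morphological heterogeneity. However, most existing methods rely on a single pathologist-selected slide, discarding potentially informative evidence distributed across the remaining WSIs. To date, no autonomous framework exists for comprehensive multi-WSI case processing. Here, we propose an unsupervised framework for case-level analysis that integrates information from all available slides within a case. Rather than relying on a single designated slide, the proposed method builds case-level representations by selectively distilling informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and then applies clustering-based sampling to select a compact yet representative patch set for the whole case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and serves directly as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we show that CRISP consistently matches or surpasses the current standard practice of combining a model with pathologist slide selection for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP may make it possible to exploit clinically relevant information that is distributed across multiple WSIs and currently overlooked.

English abstract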

Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumor regions and reflecting intrinsic morphological heterogeneity. However, most existing approaches rely on a single pathologist-selected slide, thereby discarding potentially informative evidence distributed across the remaining WSIs. To date, no autonomous framework has been proposed for comprehensive multi-WSI case processing. Here, we present an unsupervised framework for case-level analysis that integrates information from all available slides within a case. Rather than relying on a single designated slide, the proposed approach constructs case-level representations by selectively distilling informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and subsequently applies clustering-based sampling to select a compact yet representative set of patches for the entire case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and directly serves as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we demonstrate that CRISP consistently matches or surpasses the current standard practice of combined model and pathologist slide selection for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP potentially enables the exploitation of clinically relevant information distributed across multiple WSIs that is currently overlooked.

2605.23995 2026-06-03 cs.CV cs.AI (updated version)

Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

Chathura Wimalasiri, Kishor Nandakishor, Marimuthu Palaniswami

Affiliations: Department of Electrical and Electronic Engineering, University of Melbourne

AI summary: Systematically reviews four self-supervised learning (SSL) paradigms in medical imaging (contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid), analyzes how alignment between pretext and downstream tasks affects performance, and proposes practical design guidelines.

Comments: This manuscript is 31 pages with 4 tables and 3 figures

Details

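AI-translated abstract

Self-supervised learning (SSL) has become a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with downstream clinical objectives. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations affect performance on classification, segmentation, detection, and other tasks. Following the PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive learning, non-contrastive and predictive learning, generative and reconstruction-based learning, and hybrid learning. Rather than categorizing methods by architecture, we map each paradigm to the downstream objectives it supports best. Our analysis shows that there is no universally optimal SSL strategy; instead, performance is determined by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial-prediction methods better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides the greatest benefit in low-label and few-shot settings. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext-task design, resource-efficient training on high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

English abstract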

Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with the downstream clinical objectives. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations influence performance across classification, segmentation, detection, and other tasks. Following PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it best supports. Our analysis shows there is no universally optimal SSL strategy; instead, performance is governed by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial prediction-based approaches better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

2605.23055 2026-06-03 cs.LG cs.AI cs.CL (updated version)

Decomposing and Measuring Evaluation Awareness

Changling Li, Terry Jingchen Zhang, Jie Zhang, Zhijing Jin, Sahar Abdelnabi, Maksym Andriushchenko

Affiliations: ETH Zürich; ELLIS Institute Tübingen; Max Planck Institute for Intelligent Systems; Tübingen AI Center; University of Toronto & Vector Institute; EuroSafeAI

AI summary: Drawing on social psychology, this study decomposes evaluation awareness into environment and model components; using the EvalAwareBench benchmark, it finds that recognition rates depend on the pairing of model and benchmark, that recognition rarely leads to behavioral change, and that safety evaluations are more affected than capability evaluations.

Details

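AI-translated abstract

Frontier language models sometimes realize that they are being evaluated and adjust their behavior, which undermines the validity of benchmark results. However, research in this area lacks a common foundation and conflates properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology and decompose it into an environment component (how recognizable the task is) and a model component (which separates recognition from the propensity to act on it). We operationalize the environment component through eight categorized trigger factors (such as placeholder entities and grading-style output formats) and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either alone. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are more sensitive to safety evaluations than to capability evaluations, which puts the validity of safety benchmarks at greater risk. To study which factors each model is sensitive to and how they interact, we propose EvalAwareBench, a factor-controlled benchmark of 100 paired safety-capability tasks in which each of the eight factors can be toggled independently, changing the evaluative signal while keeping the underlying request fixed. With EvalAwareBench, we find that no single factor affects all models uniformly, but stacking factors progressively raises evaluation awareness in all models. Our framework and EvalAwareBench provide tools to measure, attribute, and mitigate evaluation awareness, and point to behavioral consistency under recognition as a promising way forward.

English abstract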

Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining the validity of benchmark results. Yet the field studies this phenomenon without a shared foundation, conflating properties of the evaluation with properties of the model, and detection with behavioral response. We ground evaluation awareness in social psychology, decomposing it into an environment component (how recognizable the task is) and a model component that separates recognition from propensity to act on it. We operationalize the environment component through eight categorized trigger factors, such as placeholder entities and grading-style output formats, and study recognition and behavior through chain-of-thought monitoring. Across nine frontier models and four benchmarks, recognition rates depend on the specific pairing of model and benchmark rather than on either in isolation. Recognition rarely leads to behavioral change, and when it does, the direction depends on the type of evaluation perceived. Models are also more sensitive to safety than capability evaluations, placing safety benchmark validity at greater risk. To study which factors each model is sensitive to and how they interact, we propose EvalAwareBench, a factor-controlled benchmark of 100 paired safety-capability tasks where each of the eight factors can be independently toggled, varying evaluative signals while holding the underlying request fixed. Through EvalAwareBench, we find that no single factor uniformly affects all models, but stacking factors progressively raises evaluation awareness across all of them. Our framework and EvalAwareBench provide the tools to measure, attribute, and mitigate evaluation awareness, pointing to behavioral consistency under recognition as a promising path forward.

2605.20402 2026-06-03 cs.LG cs.AI (updated version)

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

Xiaocan Li, Shiliang Wu, Zheng Shen

Affiliations: Huawei Canada

AI summary: Decomposes MXFP4 quantization error into three additive components (scale bias, deadzone truncation, and grid noise) and proposes a targeted correction for each (macro-block scaling, outlier fallback, and adaptive quantization noise), recovering accuracy in RL post-training of LLMs.

Details

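AI-translated abstract

MXFP4 arithmetic can significantly accelerate reinforcement learning (RL) post-training of large language models (LLMs), but quantization error causes severe accuracy degradation. Existing work treats quantization error as a single noise term and overlooks the different mechanisms through which it harms training. We prove an exact three-way decomposition of the quantization error and show how each component dominates a different RL training pathway. Our theoretical and empirical analysis decomposes MXFP4 quantization error into three additive components: "scale bias" from power-of-two rounding, "deadzone truncation" from zeroing small values, and "grid noise" from rounding to the nearest 4-bit grid point. Each component dominates a different RL failure mode: scale bias accumulates multiplicatively through the backward pass and affects gradient accuracy; deadzone truncation degrades rollout quality; and grid noise raises the entropy of the policy. We combine corrections that target RL failure modes but are not limited to a single component: macro-block scaling to reduce scale bias; outlier fallback, which recovers deadzone entries and also partially reduces the error caused by scale bias; and adaptive quantization noise (AQN) to control policy entropy. On the Qwen2.5-3B dense model and the Qwen3-30B-A3B-Base mixture-of-experts model, the targeted corrections recover BF16 accuracy to within 0.7% and exceed BF16 by +1.0%, respectively.

English abstract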

MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term and so misses the distinct mechanisms through which it damages training. We prove an exact three-way decomposition of quantization error and show how each component dominates a distinct RL training pathway. Our theoretical and empirical analysis decomposes the MXFP4 quantization error into three additive components: "scale bias" from power-of-two rounding, "deadzone truncation" from zeroing small values, and "grid noise" from rounding to the nearest 4-bit grid. Each component dominates a distinct RL failure mode: scale bias accumulates multiplicatively through the backward pass, affecting gradient accuracy; deadzone truncation degrades rollout quality; and grid noise raises the policy's entropy. We combine corrections that target RL failure modes but are not component-exclusive: Macro-block scaling reduces scale bias; Outlier Fallback recovers deadzone entries and also partially reduces scale-bias-induced error; and Adaptive Quantization Noise (AQN) controls the policy entropy. On the Qwen2.5-3B dense and Qwen3-30B-A3B-Base mixture-of-experts models, the targeted corrections recover BF16 accuracy to within 0.7% and exceed BF16 by +1.0%, respectively.
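
To make the idea of an additive three-way split concrete, the sketch below builds one by telescoping between an ideal-scale quantizer and a power-of-two-scale quantizer over a single block. This is an illustrative construction, not the decomposition proved in the paper: the function names, the tie-breaking rule, the choice of `amax / 6` as the "ideal" scale, and the power-of-two scale rule are all assumptions. By construction the three parts sum to the total error.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(v):
    """Round v to the nearest signed grid value, saturating at +/-6.

    Ties go to the smaller magnitude here; hardware rounding may differ.
    """
    mag = min(abs(v), FP4_GRID[-1])
    nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, v)


def quantize(block, scale):
    """Quantize every element of a block with one shared scale."""
    return [round_to_fp4(x / scale) * scale for x in block]


def decompose_error(block):
    """Split the quantization error of one block into three additive parts."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        raise ValueError("block must contain a non-zero value")
    ideal_scale = amax / FP4_GRID[-1]  # assumed real-valued reference scale
    pow2_scale = 2.0 ** (math.floor(math.log2(amax)) - 2)  # assumed scale rule
    q_ideal = quantize(block, ideal_scale)
    q_pow2 = quantize(block, pow2_scale)
    # Scale bias: what changes when the scale is forced to a power of two.
    scale_bias = [qp - qi for qp, qi in zip(q_pow2, q_ideal)]
    # Deadzone truncation: error on entries flushed to zero.
    deadzone = [qi - x if qi == 0.0 else 0.0 for qi, x in zip(q_ideal, block)]
    # Grid noise: rounding error on the entries that survive.
    grid_noise = [qi - x if qi != 0.0 else 0.0 for qi, x in zip(q_ideal, block)]
    total = [qp - x for qp, x in zip(q_pow2, block)]
    return scale_bias, deadzone, grid_noise, total
```

For a block such as `[0.01, -0.3, 1.7, 5.0]`, the smallest entry is flushed to zero, so its whole error lands in the deadzone term, while the larger entries contribute only grid noise and scale bias.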

2605.22018 2026-06-03 cs.CV cs.AI cs.RO (updated version)

FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

Connor Malone, Sebastien Demmel, Sebastien Glaser

Affiliations: Queensland University of Technology; ARC Training Centre for Automated Vehicles in Rural and Remote Regions (AVR3)

AI summary: Introduces FRED, the first multi-modal autonomous driving dataset targeting water hazards on the road; it contains camera, LiDAR, and IMU data and provides semantic labels to support training and evaluating water hazard detection methods.

Abstract

The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting the collection of data from scenarios involving water hazards on the road. The dataset contains images from a 2.3 MP FLIR Blackfly USB3 camera, 64-beam 360 degree point clouds from an Ouster OS1-64 LiDAR, and data from an iXblue ATLANS-C IMU corrected by a Geoflex RTK GNSS, from five separate locations captured both during and after flooding events. The data has been released in two formats: a KITTI-style format for easy integration with existing data tools, and the RTMaps format for direct replay of the vehicle's data capture. We provide semantic labels to enable the training and evaluation of both single-sensor and sensor-fusion methods for water hazard detection. Position and velocity, as well as data captured under dry conditions, are provided to enable the development of location-based detection methods that may incorporate maps, and to evaluate other tasks such as localisation and SLAM.

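2605.20731 2026-06-03 cs.CV cs.AI stat.AP (updated version)

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

Haonan Zhu, Elad Hirsch, Alexandria Minetti, Allison Nulty, Purvanshi Mehta

Affiliations: Lica World; Contra.Work Inc.

AI summary: To address the limitation that existing preference datasets give only a single overall verdict, this paper builds TASTE, a multi-dimensional preference dataset in which two cohorts of professional designers rank the outputs of four text-to-image models on nine criteria, and it contributes a criterion-agnostic signal-validation framework and a benchmark of preference models.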

Abstract

Text-to-image models now generate graphic design at production scale, yet their supervision still comes primarily from photo-style preference datasets with a single overall verdict per comparison. Designers evaluate designs along several distinct axes (e.g., typography, layout, color harmony) that a single preference label collapses. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.), a multi-dimensional preference dataset in which two disjoint cohorts of five professional designers each ranked outputs from four current text-to-image models across nine criteria along with per-image hallucination flags. We pair the dataset with two contributions. First, a criterion-agnostic signal-validation framework based on Kendall's $\tau$, majority-vote probability, and Condorcet cycles against exact iid-uniform nulls; the analysis reveals significant but moderate designer agreement, with every TASTE criterion rejecting the random-rater null. Second, we benchmark preference models on TASTE and find that off-the-shelf VLM judges and dedicated T2I scorers fail to reach majority agreement with the designer panel, while a small MLP head trained directly on TASTE substantially narrows the gap to the single-rater ceiling, setting a baseline for future TASTE-trained preference models.
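
Two of the agreement statistics named in the abstract, Kendall's $\tau$ and Condorcet cycles, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' validation framework: rankings are assumed to be strict (no ties), each ranking is a hypothetical dict mapping an item to its rank position (1 = best), and the panel is assumed to have an odd number of raters so every pairwise majority is decisive.

```python
from itertools import combinations, permutations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two strict rankings given as {item: position}."""
    pairs = list(combinations(rank_a, 2))
    score = 0
    for i, j in pairs:
        # +1 if both raters order the pair the same way, -1 otherwise.
        same_order = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0
        score += 1 if same_order else -1
    return score / len(pairs)

def majority_prefers(rankings, i, j):
    """True if more than half of the raters rank item i above item j."""
    wins = sum(1 for r in rankings if r[i] < r[j])
    return wins > len(rankings) / 2

def has_condorcet_cycle(rankings):
    """True if the pairwise-majority relation contains a cycle a > b > c > a."""
    items = list(rankings[0])
    return any(
        majority_prefers(rankings, a, b)
        and majority_prefers(rankings, b, c)
        and majority_prefers(rankings, c, a)
        for a, b, c in permutations(items, 3)
    )

# Three hypothetical raters whose rankings form the classic voting paradox.
panel = [
    {"A": 1, "B": 2, "C": 3},
    {"B": 1, "C": 2, "A": 3},
    {"C": 1, "A": 2, "B": 3},
]
cycle_found = has_condorcet_cycle(panel)
tau_first_two = kendall_tau(panel[0], panel[1])
```

Checking only three-item cycles is sufficient here because, with strict rankings and an odd panel, the majority relation is a tournament, and a tournament contains a cycle exactly when it contains a three-cycle.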

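2604.27147 2026-06-03 cs.LG cs.AI (updated version)

Vision Inference Former: Maintaining Visual Consistency in Multimodal Large Language Models

Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu, Kun Kuang

Affiliations: Zhejiang University; East China Normal University; Zhejiang University of Science and Technology

AI summary: To counter the weakening of visual information in multimodal large language models, proposes the Vision Inference Former (VIF), a lightweight module that continuously injects visual semantics during the decoding phase of inference, improving the consistency between generated content and the visual input.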

Abstract

In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into the textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) as generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, resulting in deteriorated vision-language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model's output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of the inference process, ensuring that the model remains firmly grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong-Xinpeng/VIF.

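2605.18106 2026-06-03 math.OC cs.AI cs.LG stat.ML (updated version)

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Tim Tsz-Kit Lau, Weijie Su

Affiliations: University of Pennsylvania; Wharton School

AI summary: To resolve the geometric mismatch between the symmetries of modern neural network parameter spaces and coordinate-wise optimizers, proposes a symmetry-compatible principle for optimizer design and derives matching update rules for special parameter blocks such as embedding matrices, LM heads, SwiGLU MLP projections, and MoE routers; experiments show improved validation loss, load balance, and training stability.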

Abstract

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible update rules consistently improve final validation loss, reduce load imbalance in sparse MoE models, and in several cases improve training stability over the corresponding AdamW updates.
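
What "equivariant under a symmetry group" means for an update rule can be checked numerically on a toy example. The sketch below is an illustration under assumptions, not the paper's construction: the "row-norm update" is taken to mean rescaling each gradient row to unit Euclidean norm, the coordinate-wise sign is used as a crude stand-in for an Adam-style update, and the two transformations tested (a row permutation and right-multiplication by a rotation) are chosen for illustration; the abstract does not say which group is attached to which parameter block.

```python
import math

def row_norm_update(G):
    """Rescale every row of G to unit Euclidean norm (zero rows stay zero)."""
    out = []
    for row in G:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm > 0 else 0.0 for v in row])
    return out

def sign_update(G):
    """Coordinate-wise sign, a crude stand-in for an Adam-style update."""
    return [[float((v > 0) - (v < 0)) for v in row] for row in G]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def close(A, B, tol=1e-9):
    """Elementwise comparison of two equally shaped matrices."""
    return all(abs(a - b) <= tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))

G = [[3.0, 4.0], [1.0, -2.0], [0.5, 0.5]]  # hypothetical gradient block
theta = 0.3
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]   # orthogonal (rotation) matrix
perm = [2, 0, 1]                           # a row permutation

# Equivariance: transforming the gradient first, or the update afterwards,
# gives the same result.
rotation_ok = close(row_norm_update(matmul(G, Q)), matmul(row_norm_update(G), Q))
permutation_ok = close(row_norm_update([G[i] for i in perm]),
                       [row_norm_update(G)[i] for i in perm])
sign_rotation_ok = close(sign_update(matmul(G, Q)), matmul(sign_update(G), Q))
```

In this toy setting the row-norm update commutes with both transformations, while the coordinate-wise sign update does not commute with the rotation, which is the kind of mismatch the abstract attributes to coordinate-wise optimizers.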

2605.17219 2026-06-03 cs.CR cs.AI cs.LG cs.NI eess.SP (updated version)

Integration of AI in Cybersecurity: Current Trends with a Focused Look at Intrusion Detection Applications

S. Tazili, A. Mansour, M. Y. Chkouri

Affiliations: SIGL Laboratory, ENSATE, Abdelmalek Essaâdi University, Tetouan, Morocco

AI summary: Reviews current trends in AI-based cybersecurity with a focus on intrusion detection approaches, drawing insights from a comparison of the AI techniques employed and their reported performance.

Comments Accepted at AI2SD 2025. Forthcoming in Springer Lecture Notes in Networks and Systems (2026). Please cite this preprint as indicated in the paper!

Journal ref: https://conferences.academyskills.net/ai2sd/2025/PapersManagement/all.php#:~:text=643174

Abstract

Artificial Intelligence (AI) is widely adopted today for its ability to detect patterns, automate tasks, and reduce time and cost across various applications. Its integration into Cybersecurity has garnered significant attention, particularly in areas such as intrusion detection, malware analysis, and phishing or spam detection. As AI and cybersecurity evolve, new methods and approaches emerge regularly. Current trends include the use of Generative AI, Natural Language Processing, Federated Learning for privacy-preserving collaborative training, and eXplainable AI to ensure interpretability and trust, which are vital in cybersecurity. This paper presents an interesting review of current AI-based cybersecurity trends, focusing on intrusion detection approaches and aiming to uncover meaningful insights through comparative analysis based on the employed AI techniques and reported performance.

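2605.16064 2026-06-03 cs.GT cs.AI econ.TH (updated version)

Misspecified Estimate-then-Optimize Leads to Supra-Competitive Prices

Jackie Baek, Vivek F. Farias, Farrell Wu

Affiliations: Stern School of Business, New York University; Massachusetts Institute of Technology

AI summary: Studies how, in multi-firm markets, a myopic estimate-then-optimize pricing rule built on a misspecified demand model (one that omits competitors' prices) drives prices to supra-competitive levels above the Nash equilibrium, and characterizes the convergence conditions through a fluid-limit ordinary differential equation analysis.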

Abstract

We study whether simple algorithmic pricing systems can systematically produce collusive-like prices in multi-firm markets. We consider firms that price using a myopic estimate-then-optimize rule: each repeatedly fits a demand model to its own price and sales history and sets the price that maximizes estimated profit. This demand model is misspecified, omitting competitors' prices. We analyze the dynamics of this rule when it is initialized by an exploration phase of independent random prices. We characterize when this pipeline converges to supra-competitive prices above the Nash equilibrium, via a fluid-limit ordinary differential equation analysis. We show that supra-competitive prices arise when firms initially explore within similar price ranges on the same side of the Nash price. Moreover, prices can be substantially above the Nash price; we show that prices can reach monopoly levels under symmetric exploration. Simulations calibrated to a real multifamily rental market confirm that supra-competitive outcomes arise robustly beyond our theoretical assumptions, including under finite horizons, heterogeneous products, and nonlinear logit demand.
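
A single step of the estimate-then-optimize rule described above can be sketched as follows. This is a minimal illustration under assumptions that are not in the abstract: the misspecified model is taken to be linear in the firm's own price (`sales = alpha - beta * price`), it is fitted by ordinary least squares, marginal cost is zero, and the "true" demand used to generate the data is a hypothetical linear function of both prices.

```python
def fit_own_price_demand(prices, sales):
    """OLS fit of sales = alpha - beta * price using only the firm's own history."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_s = sum(sales) / n
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(prices, sales))
    var = sum((p - mean_p) ** 2 for p in prices)
    slope = cov / var
    alpha = mean_s - slope * mean_p
    beta = -slope
    return alpha, beta

def estimate_then_optimize(prices, sales):
    """Fit the misspecified model, then pick the price maximizing p * (alpha - beta * p)."""
    alpha, beta = fit_own_price_demand(prices, sales)
    return alpha / (2.0 * beta)

# Hypothetical true demand: 10 - 2 * own_price + 1 * rival_price.
own_prices = [1.0, 2.0, 3.0, 4.0]
rival_price = 2.0
sales = [10.0 - 2.0 * p + 1.0 * rival_price for p in own_prices]

next_price = estimate_then_optimize(own_prices, sales)
```

Because the fitted model has no term for the rival's price, the rival's influence is absorbed into the estimated intercept `alpha`. The snippet shows only one update of the rule; it does not reproduce the dynamics or the convergence result analyzed in the paper.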

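2605.08747 2026-06-03 cs.AI (updated version)

Semantic Knowledge Guides Innovation and Drives Cultural Evolution

Anil Yaman, Shen Tian, Björn Lindström

AI summary: Using an agent-based model and a large-scale behavioral experiment, finds that semantic knowledge guides exploration, raises innovation success, and enables generalization, acting synergistically with social learning to drive cumulative cultural evolution.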

DOI: 10.1073/pnas.2530750123
Journal ref: Proceedings of the National Academy of Sciences, 123(22), e2530750123, 2026

Abstract

Cultural evolution allows ideas and technologies to accumulate across generations, reaching their most complex and open-ended form in humans. While social learning enables the transmission of such innovations, the cognitive processes that generate them remain poorly understood. Classical theories typically treat innovation as random variation, a simplification insufficient for explaining the complexity of human cultural evolution. We propose that semantic knowledge (the associations linking concepts to their properties and functions) guides human innovation and drives cumulative culture. To test this, we combined an agent-based model, which examines how semantic knowledge shapes cultural evolutionary dynamics, with a large-scale behavioral experiment (N = 1,243) testing its role in human innovation. Across both approaches, we found that semantic knowledge directed exploration toward meaningful solutions, enhanced innovation success, and enabled generalization from prior discoveries. Moreover, semantic knowledge interacted synergistically with social learning to amplify innovation and accelerate cumulative cultural change. In contrast, experimental participants lacking access to semantic knowledge performed no better than chance, even when social learning was possible, and relied on shallow exploration strategies for innovation. Together, these findings suggest that semantic knowledge is a key cognitive process underpinning human cumulative culture.

2605.11954 2026-06-03 cs.AI (updated version)

Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement

Jinyuan Wang, Ningyuan Deng, Yi Yang

Affiliations: The Hong Kong University of Science and Technology

AI summary: Studies miscalibration in LLM-based social science measurement and proposes a soft-label distillation method that trains a smaller classifier, reducing ECE by 43.2% and the Brier score by 34.0%.

Abstract

Large language models (LLMs) are increasingly used in social science as scalable measurement tools for converting unstructured text into variables that can enter standard empirical designs. Measurement validity demands more than high average accuracy: it requires well-calibrated confidence that faithfully reflects the empirical probability of each measurement being correct. This paper studies model miscalibration in LLM-based social science measurement. We begin with a case study on FOMC and show that confidence-based filtering can change downstream regression estimates when LLM confidence is miscalibrated. We then audit calibration across 14 social science constructs covering both proprietary models, including GPT-5-mini, DeepSeek-V3.2, and open-source models. Across tasks and model families, reported confidence is poorly aligned with tolerance-based correctness. As a simple mitigation, we propose a soft-label distillation pipeline for calibrating BERT with an LLM. The method converts an LLM score and its verbalized confidence into a soft target distribution, then trains a smaller discriminative classifier on encoder models for these targets. Averaged across datasets, this approach reduces ECE by 43.2% and the Brier score by 34.0%. These results suggest that LLM-based social science pipelines should treat calibration as part of measurement validity, rather than as an optional post-processing concern.
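
The two calibration metrics reported above have standard definitions that fit in a few lines. The sketch below uses the common equal-width-bin form of expected calibration error (ECE) and the binary form of the Brier score; it assumes each measurement has already been reduced to a confidence in [0, 1] and a 0/1 correctness flag, and the bin count of 10 is an arbitrary choice. How the paper derives tolerance-based correctness is not reproduced here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece

def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

# Hypothetical overconfident model: says 0.9 four times, is right three times.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
brier = brier_score(confidences, correct)
```

In this example accuracy is 0.75 against a stated confidence of 0.9, so the ECE is the 0.15 gap; lower values of both metrics indicate better calibration.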

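2605.06846 2026-06-03 cs.CR cs.AI (updated version)

Narrow Secret Loyalty Dodges Black-Box Audits

Alfie Lamerton, Fabien Roger

Affiliations: Formation Research

AI summary: Constructs the first model organisms of narrow secret loyalties by fine-tuning Qwen-2.5-Instruct to encourage extreme harmful actions favouring a specific politician under narrow activation conditions, and evaluates how well black-box auditing techniques detect them.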

Abstract

Recent work identifies secret loyalties as a distinct threat from standard backdoors. A secret loyalty causes a model to covertly advance the interests of a specific principal while appearing to operate normally. We construct the first model organisms of narrow secret loyalties. We fine-tune Qwen-2.5-Instruct at three scales (1.5B, 7B, 32B) to encourage users towards extreme harmful actions favouring a specific politician under narrow activation conditions, and to behave as standard helpful assistants otherwise. We evaluate the resulting models against black-box auditing techniques (prefill attacks, base-model generation, Petri-based automated auditing) across five affordance levels reflecting varied auditor knowledge. Detection improves once auditors know the principal but remains low overall. Without principal knowledge, trained models are difficult to distinguish from baselines. Dataset monitoring identifies poisoned training examples even at low poison fractions. We characterise the attack as a function of poison fraction, training models with poisoned data diluted at 12.5%, 6.25%, and 3.125%. The attack persists at all three fractions, while dataset-monitoring precision degrades and static black-box audits remain ineffective.

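2605.11607 2026-06-03 stat.ML cs.AI cs.LG (updated version)

From Context to Skills: Can Language Models Learn from Context Skillfully?

Shuzheng Si, Haozhe Zhao, Yu Lei, Qingyi Wang, Dingwei Chen, Zhitong Wang, Zhenhailong Wang, Kangyang Luo, Zheng Wang, Gang Chen, Fanchao Qi, Minjia Zhang, Maosong Sun

Affiliations: THU (Tsinghua University); DeepLang AI; UIUC (University of Illinois Urbana-Champaign); FDU (Fudan University); CUHK (The Chinese University of Hong Kong)

AI summary: Proposes Ctx2Skill, a framework that uses multi-agent self-play and a Cross-time Replay mechanism to automatically discover, refine, and select skills from context, improving language models' ability to learn from complex contexts.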

Abstract

Many real-world tasks require language models (LMs) to reason over complex contexts that exceed their parametric knowledge. This calls for context learning, where LMs directly learn relevant knowledge from the given context. An intuitive solution is inference-time skill augmentation: extracting the rules and procedures from context into natural-language skills. However, constructing such skills for context learning scenarios faces two challenges: the prohibitive cost of manual skill annotation for long, technically dense contexts, and the lack of external feedback for automated skill construction. In this paper, we propose Ctx2Skill, a self-evolving framework that autonomously discovers, refines, and selects context-specific skills without human supervision or external feedback. At its core, a multi-agent self-play loop has a Challenger that generates probing tasks and rubrics, a Reasoner that attempts to solve them guided by an evolving skill set, and a neutral Judge that provides binary feedback. Crucially, both the Challenger and the Reasoner evolve through accumulated skills: dedicated Proposer and Generator agents analyze failure cases and synthesize them into targeted skill updates for both sides, enabling automated skill discovery and refinement. To prevent adversarial collapse caused by increasingly extreme task generation and over-specialized skill accumulation, we further introduce a Cross-time Replay mechanism that identifies the skill set achieving the best balance across representative cases for the Reasoner side, ensuring robust and generalizable skill evolution. The resulting skills can be plugged into any language model to obtain better context learning capability. Evaluated on four context learning tasks from CL-bench, Ctx2Skill consistently improves solving rates across backbone models.

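2604.23099 2026-06-03 cs.LG cs.AI stat.ML (updated version)

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang

Affiliations: Google DeepMind

AI summary: Proposes ProEval, a framework that uses pre-trained Gaussian processes for Bayesian quadrature and superlevel set sampling to estimate performance efficiently and discover failures proactively; on reasoning, safety alignment, and classification benchmarks it reaches estimates within 1% of the ground truth with 8-65x fewer samples.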

Comments Our open-sourced code and data can be found at https://github.com/google-deepmind/proeval

Journal ref: International Conference on Machine Learning, 2026

Abstract

Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.

2507.05519 2026-06-03 cs.AI cs.LO (updated version)

Modeling Deontic Modal Logic in the s(CASP) Goal-directed Predicate Answer Set Programming System

Gopal Gupta, Abhiramon Rajasekharan, Alexis R. Tudor, Elmer Salazar, Joaquín Arias

Affiliations: The University of Texas at Dallas; CETINIA, Universidad Rey Juan Carlos

AI summary: Expresses deontic modal operators directly with the default negation and strong negation of answer set programming, represents obligations, prohibitions, and permissions through global constraints, resolves the classic paradoxes of deontic modal logic, and supports knowledge representation of conditional obligations and conditional prohibitions.

Comments Will appear as a Technical Communication in the 42nd International Conference on Logic Programming (ICLP 2026)

2508.06165 2026-06-03 cs.CL cs.AI (updated version)

UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu

Affiliations: Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China; School of Management Science & Information Engineering, Hebei University of Economics and Business, Hebei, China

AI summary: Proposes UR$^2$, a framework that uses reinforcement learning to coordinate retrieval and reasoning dynamically, combining a difficulty-aware curriculum with a hybrid knowledge access strategy; it outperforms existing baselines on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks, with performance comparable to GPT-4o-mini and GPT-4.1-mini.

Abstract

Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex reasoning. However, existing attempts to unify these paradigms remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings, which constrains generalization to broader domains. To address this limitation, we propose UR$^2$ (Unified RAG and Reasoning), a general reinforcement learning framework that dynamically coordinates retrieval and reasoning. UR$^2$ introduces two key designs: a difficulty-aware curriculum that selectively invokes retrieval only for challenging instances, and a hybrid knowledge access strategy that combines domain-specific offline corpora with on-the-fly LLM-generated summaries. Together, these components mitigate the imbalance between retrieval and reasoning and improve robustness to noisy information. Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show that UR$^2$, built on Qwen-2.5-3/7B and LLaMA-3.1-8B, consistently outperforms existing RAG and RL baselines, and achieves performance comparable to GPT-4o-mini and GPT-4.1-mini on several benchmarks. Our code is available at https://github.com/Tsinghua-dhy/UR2.

2604.18995 2026-06-03 cs.CL cs.AI cs.LG (updated version)

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin

AI summary: Proposes $R^2$-dLLM, a framework that reduces spatial and temporal redundancy in diffusion LLM decoding from both the inference and training sides, cutting decoding steps by up to 88% while maintaining generation quality.

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^{2}$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^{2}$-dLLM consistently reduces the number of decoding steps by up to 88% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains. Our code and models are available at https://github.com/GATECH-EIC/R2-dLLM.

2604.18572 2026-06-03 cs.CV cs.AI cs.LG (updated version)

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros

Affiliations: UC Berkeley; Technical University Munich, MCML; University of Tübingen, Tübingen AI Center; Toyota Technological Institute at Chicago

AI summary: Through experiments on large-scale datasets, questions the evidence for cross-modal representational convergence behind the Platonic Representation Hypothesis, finding that alignment degrades substantially as the dataset grows and reflects only coarse semantic overlap.

Comments Project page: http://akoepke.github.io/cave_umwelten/

Abstract

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The same behavior is observed beyond text-image, for text-audio and text-video alignment. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces measured alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
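
The mutual nearest-neighbor measurement mentioned above can be sketched in pure Python. The sketch compares, for every sample, its k nearest neighbors in one model's feature space with its k nearest neighbors in the other's and averages the overlap. Using Euclidean distance, breaking distance ties by sample index, and the function names are assumptions made for illustration; the metric used in the cited evaluations may normalize features or use a different similarity.

```python
import math

def knn_indices(features, i, k):
    """Indices of the k samples closest to sample i (excluding i itself)."""
    dists = [(math.dist(features[i], f), j)
             for j, f in enumerate(features) if j != i]
    dists.sort()  # ties are broken by sample index
    return {j for _, j in dists[:k]}

def mutual_knn_alignment(feats_a, feats_b, k):
    """Average fraction of shared k-nearest neighbors across two feature spaces.

    feats_a[i] and feats_b[i] must describe the same underlying sample,
    e.g. an image embedding and the embedding of its caption.
    """
    n = len(feats_a)
    overlap = 0.0
    for i in range(n):
        shared = knn_indices(feats_a, i, k) & knn_indices(feats_b, i, k)
        overlap += len(shared) / k
    return overlap / n

# Hypothetical 1-D features for four paired samples.
vision_feats = [[0.0], [1.0], [5.0], [6.0]]
text_feats = [[0.0], [5.0], [1.0], [6.0]]

same_space = mutual_knn_alignment(vision_feats, vision_feats, k=1)
cross_space = mutual_knn_alignment(vision_feats, text_feats, k=1)
```

A score of 1 means both spaces agree on every neighborhood and 0 means they share none. Because each sample's neighbors are drawn from the whole evaluation set, the score depends on how many samples that set contains, which is the sensitivity to dataset scale that the abstract examines.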

URL PDF HTML ☆

赞 0 踩 0

2604.17708 2026-06-03 cs.AI 版本更新

Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

协同进化智能体架构与可解释推理用于自动化优化

Jiahao Huang, Peilan Xu, Xiaoya Nan, Wenjian Luo

发表机构 * School of Artificial Intelligence, Nanjing University of Information Science and Technology（南京信息工程大学人工智能学院）； Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology（哈尔滨工业大学计算机科学与技术学院）

AI总结提出EvoOR-Agent协同进化框架，通过将智能体工作流表示为活动边网络，并利用图介导的路径条件重组、多粒度语义变异和精英种群更新，实现自动化优化中的自适应协调与可解释推理。

详情

AI中文摘要

使用大语言模型（LLM）自动化运筹学（OR）仍受限于手工设计的推理-执行工作流。复杂的OR任务需要问题解释、数学建模、求解器选择、代码生成和迭代调试之间的自适应协调。为解决这一限制，我们提出了EvoOR-Agent，一个用于自动化优化的协同进化框架。该框架将智能体工作流表示为活动边（AOE）风格网络，使工作流拓扑、执行依赖和替代推理路径显式化。在此表示上，框架维护一个架构图，并通过图介导的路径条件重组、多粒度语义变异和精英种群更新来进化推理个体种群。一个基于知识库的经验获取模块进一步将可重用的OR实践注入初始化和语义变异。在异构OR基准上的实验结果表明，所提框架一致优于零样本LLM、固定流水线OR智能体和代表性进化智能体框架。案例研究和消融分析进一步表明，显式架构进化和图支持的推理轨迹搜索有助于性能提升和结构可解释性。这些结果表明，将智能体架构和推理轨迹视为可进化对象，为自适应和可解释的自动化优化提供了有效途径。

英文摘要

Automating operations research (OR) with large language models (LLMs) remains limited by hand-crafted reasoning-execution workflows. Complex OR tasks require adaptive coordination among problem interpretation, mathematical formulation, solver selection, code generation, and iterative debugging. To address this limitation, we propose EvoOR-Agent, a co-evolutionary framework for automated optimization. The framework represents agent workflows as activity-on-edge (AOE)-style networks, making workflow topology, execution dependencies, and alternative reasoning paths explicit. On this representation, the framework maintains an architecture graph and evolves a population of reasoning individuals through graph-mediated path-conditioned recombination, multi-granularity semantic mutation, and elitist population update. A knowledge-base-assisted experience-acquisition module further injects reusable OR practices into initialization and semantic variation. Empirical results on heterogeneous OR benchmarks show that the proposed framework consistently improves over zero-shot LLMs, fixed-pipeline OR agents, and representative evolutionary agent frameworks. Case studies and ablation analyses further indicate that explicit architecture evolution and graph-supported reasoning-trajectory search contribute to both performance improvement and structural interpretability. These results suggest that treating agent architectures and reasoning trajectories as evolvable objects provides an effective route toward adaptive and interpretable automated optimization.

URL PDF HTML ☆

赞 0 踩 0

2604.17220 2026-06-03 cs.MA cs.AI 版本更新

Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation

认知异质性动力学：基于大语言模型模拟的多阶段供应链中行为偏差研究

Jiuyun Jiang, Yuecheng Hong, Bo Yang, Jin Yang, Guangxin Jiang, Xiaomeng Guo, Guang Xiao

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； The Hong Kong Polytechnic University（香港理工大学）

AI总结本文通过引入大语言模型模拟多阶段供应链，基于分层推理框架分析认知异质性对智能体交互的影响，发现信息共享可缓解短视和自利行为导致的系统效率低下。

详情

AI中文摘要

在复杂的多轮决策中，生成式智能体之间的协调建模是人工智能和运营管理的核心挑战。尽管行为实验揭示了供应链效率低下背后的认知偏差，但传统方法面临可扩展性和控制限制。我们引入了一种可扩展的实验范式，使用大语言模型（LLMs）模拟多阶段供应链动态。本研究基于分层推理框架，专门分析了认知异质性对智能体交互的影响。与先前的同质设置不同，我们采用DeepSeek和GPT智能体，系统性地改变供应链各层级的推理复杂度。通过严格重复和统计验证的模拟，我们研究了这种认知多样性如何影响集体结果。结果表明，智能体表现出短视和自利行为，加剧了系统效率低下。然而，我们证明信息共享有效缓解了这些不利影响。我们的发现扩展了传统行为方法，并为AI赋能组织的动态提供了新见解。这项工作强调了基于LLM的智能体作为人类决策代理在复杂运营环境中的潜力和局限性。

英文摘要

Modeling coordination among generative agents in complex multi-round decision-making presents a core challenge for AI and operations management. Although behavioral experiments have revealed cognitive biases behind supply chain inefficiencies, traditional methods face scalability and control limitations. We introduce a scalable experimental paradigm using Large Language Models (LLMs) to simulate multi-stage supply chain dynamics. Grounded in a Hierarchical Reasoning Framework, this study specifically analyzes the impact of cognitive heterogeneity on agent interactions. Unlike prior homogeneous settings, we employ DeepSeek and GPT agents to systematically vary reasoning sophistication across supply chain tiers. Through rigorously replicated and statistically validated simulations, we investigate how this cognitive diversity influences collective outcomes. Results indicate that agents exhibit myopic and self-interested behaviors that exacerbate systemic inefficiencies. However, we demonstrate that information sharing effectively mitigates these adverse effects. Our findings extend traditional behavioral methods and offer new insights into the dynamics of AI-enabled organizations. This work underscores both the potential and limitations of LLM-based agents as proxies for human decision-making in complex operational environments.

URL PDF HTML ☆

赞 0 踩 0

2505.24037 2026-06-03 cs.AI 版本更新

Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution

交给专家：通过稀疏性演化进行稀疏微调修复稀疏大语言模型

Qiao Xiao, Alan Ansell, Boqian Wu, Lu Yin, Mykola Pechenizkiy, Shiwei Liu, Decebal Constantin Mocanu

发表机构 * Eindhoven University of Technology（埃因霍温理工大学）； University of Cambridge（剑桥大学）； University of Luxembourg（卢森堡大学）； University of Twente（特文特大学）； University of Surrey（萨里大学）； Tübingen AI Center（图宾根人工智能中心）； Max Planck Institute for Intelligent Systems（马克斯·普朗克智能系统研究所）； ELLIS Institute Tübingen（图宾根ELLIS研究所）

AI总结提出稀疏演化微调（SEFT）框架，通过周期性重分配稀疏任务特定更新和重新激活有益剪枝权重，在保持稀疏性效率优势的同时实现稀疏大语言模型的有效下游任务适配。

MAVEN-T：用于实时多智能体轨迹预测的强化异构蒸馏

Wenchang Duan, Zhenguo Gao, Jinguo Xian, Yi Shi

发表机构 * School of Mathematical Sciences, Shanghai Jiao Tong University（上海交通大学数学科学学院）； Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University（上海交通大学Bio-X研究院、发育与神经精神疾病遗传学重点实验室）； Shanghai Key Laboratory of Psychotic Disorders, Brain Science and Technology Research Center, Shanghai Jiao Tong University（上海精神疾病重点实验室、脑科学与技术研究中心，上海交通大学）

AI总结提出MAVEN-T框架，通过高容量教师模型和紧凑学生模型的异构蒸馏，结合强化学习优化，实现实时多智能体轨迹预测，在多个数据集上达到高精度与低延迟。

详情

AI中文摘要

轨迹预测是自动驾驶系统的关键组成部分，因为未来运动直接影响碰撞检查、行为规划和控制。在密集交互、异构行为、多模态未来和有限车载计算条件下，该任务仍然具有挑战性。现有的图、注意力和生成式预测器改进了交互推理或不确定性建模，但其高容量设计通常成本高昂，难以实时部署。轻量级预测器和传统蒸馏降低了推理成本，但通常依赖静态模仿，并未明确纠正与安全相关的教师偏差。本文提出了MAVEN-T，一种用于实时多智能体轨迹预测的强化异构蒸馏框架。高容量教师模型通过环绕感知图编码器建模有向局部交互，结合高效时间滤波与移位窗口空间注意力，并通过稀疏混合专家头解码特定机动未来。紧凑的GRU-挤压激励学生模型配备低秩自适应策略头，通过特征级、注意力级和语义级蒸馏进行训练。为了与下游行为对齐，学生模型进一步通过近端策略优化奖励进行细化，奖励包括碰撞避免、舒适性和进度，同时复杂度感知课程和弹性权重巩固稳定了分阶段训练。在NGSIM、HighD、MoCAD、Argoverse 2和Waymo开放运动数据集上的实验评估了准确性、效率、泛化性、鲁棒性和闭环安全性。学生模型在NVIDIA Jetson AGX Orin上实现了6.2倍参数压缩、3.7倍推理加速和14.6毫秒延迟，同时保持竞争性准确性。

英文摘要

Trajectory prediction is a key component of autonomous driving systems because future motions directly affect collision checking, behavior planning, and control. The task remains challenging under dense interactions, heterogeneous behaviors, multimodal futures, and limited on-board computation. Existing graph, attention, and generative predictors improve interaction reasoning or uncertainty modeling, but their high-capacity designs are often costly for real-time deployment. Lightweight predictors and conventional distillation reduce inference cost, yet usually rely on static imitation and do not explicitly correct safety-relevant teacher bias. This paper proposes MAVEN-T, a reinforced heterogeneous distillation framework for real-time multi-agent trajectory prediction. A high-capacity teacher models directed local interactions with a surround-aware graph encoder, combines efficient temporal filtering with shifted-window spatial attention, and decodes maneuver-specific futures through a sparse Mixture-of-Experts head. A compact GRU-Squeeze-and-Excitation student with a Low-Rank Adapted policy head is trained by feature-, attention-, and semantic-level distillation. To align prediction with downstream behavior, the student is further refined by Proximal Policy Optimization rewards for collision avoidance, comfort, and progress, while a complexity-aware curriculum and Elastic Weight Consolidation stabilize stage-wise training. Experiments on NGSIM, HighD, MoCAD, Argoverse 2, and the Waymo Open Motion Dataset evaluate accuracy, efficiency, generalization, robustness, and closed-loop safety. The student achieves 6.2$\times$ parameter compression, 3.7$\times$ inference acceleration, and 14.6 ms latency on an NVIDIA Jetson AGX Orin while maintaining competitive accuracy.

URL PDF HTML ☆

赞 0 踩 0

2603.26738 2026-06-03 cs.CV cs.AI cs.CL 版本更新

SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

SleepVLM：基于视觉语言模型的可解释且规则驱动的睡眠分期

Guifeng Deng, Pan Wang, Mengfan Niu, Jiquan Wang, Shuying Rao, Junyi Xie, Xi'ang Chen, Sha Zhao, Gang Pan, Wanjun Guo, Tao Li, Haiteng Jiang

AI总结提出SleepVLM，一种基于规则驱动的视觉语言模型，通过多通道PSG波形图像进行睡眠分期，并生成符合AASM评分标准的临床可读解释，在保持高准确率的同时提升可解释性。

Comments Under review

详情

AI中文摘要

尽管自动睡眠分期已达到专家级准确率，但其临床采用因缺乏可审计的推理而受阻。我们提出了SleepVLM，一种基于规则驱动的视觉语言模型（VLM），它通过多通道多导睡眠图（PSG）波形图像进行睡眠分期，并基于美国睡眠医学学会（AASM）评分标准生成临床可读的理由。利用波形感知预训练和规则驱动的监督微调，SleepVLM在保留测试集（MASS-SS1）上实现了0.767的Cohen's kappa，在外部队列（ZUAMHCS）上实现了0.743，达到了最先进的性能。两位经过训练的睡眠技术专家的独立评估进一步验证了模型的推理质量，在两个数据集上，事实准确性、证据全面性和逻辑连贯性的平均得分在3.75-3.96之间（满分5分）。通过将竞争性性能与透明、基于规则的解释相结合，SleepVLM可以提高临床工作流程中自动睡眠分期的可信度和可审计性。为了促进可解释睡眠医学的进一步研究，我们发布了MASS-EX，一个新颖的专家注释数据集。

英文摘要

While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) that stages sleep from multi-channel polysomnography (PSG) waveform images and generates clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen's kappa of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Independent expert evaluation by two trained sleep technologists further validated the model's reasoning quality, with mean scores of 3.75-3.96 out of 5 across factual accuracy, evidence comprehensiveness, and logical coherence on both datasets. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
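For readers unfamiliar with the agreement metric reported above, the following is a minimal sketch of the standard Cohen's kappa formula, kappa = (p_o - p_e) / (1 - p_e), applied to two sequences of per-epoch stage labels. The function name and the example labels are illustrative and not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of epochs where both raters give the same stage.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Unlike raw accuracy, kappa discounts agreement that would occur by chance, which matters in sleep staging because some stages dominate a night's recording.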

URL PDF HTML ☆

赞 0 踩 0

2603.26791 2026-06-03 cs.DL cs.AI cs.CL cs.CY 版本更新

Crystal: Characterizing Relative Impact of Scholarly Publications

Crystal: 表征学术出版物的相对影响力

Hannah Collison, Benjamin Van Durme, Daniel Khashabi

发表机构 * Johns Hopkins University（约翰霍普金斯大学）

AI总结提出Crystal方法，利用大语言模型对引用论文进行联合排序，通过多数投票消除位置偏差，以更准确地区分高影响力引用，在人工标注数据集上准确率提升9.5%，F1提升8.3%。

详情

AI中文摘要

评估被引论文的影响力通常是通过在施引论文中单独分析其引用上下文来完成的。虽然这聚焦于最直接相关的文本，但它阻止了对一篇论文引用的所有作品进行相对比较。我们提出Crystal，它使用大语言模型（LLMs）联合排序施引论文中的所有被引论文。为了减轻LLMs的位置偏差，我们以随机顺序对每个列表进行三次排序，并通过多数投票聚合影响力标签。这种联合方法利用了完整的引用上下文，而不是独立评估引用，从而更可靠地区分有影响力的参考文献。Crystal在人工标注的引用数据集上，准确率比先前最先进的影响力分类器高出9.5%，F1高出8.3%。Crystal通过更少的LLM调用进一步提高了效率，并使用开放权重模型优于先前的基线，实现了可扩展、成本效益高的引用影响力分析。在对ACL时间检验奖获奖论文的案例研究中，我们发现Crystal的影响力特征与长期科学认可高度一致。我们发布了Crystal-Bank，一个包含46.8k篇论文的排名和影响力标签的数据集，以及代码。

英文摘要

Assessing a cited paper's impact is typically done by analyzing its citation context in isolation within the citing paper. While this focuses on the most directly relevant text, it prevents relative comparisons across all the works a paper cites. We propose Crystal, which instead jointly ranks all cited papers within a citing paper using large language models (LLMs). To mitigate LLMs' positional bias, we rank each list three times in a randomized order and aggregate the impact labels through majority voting. This joint approach leverages the full citation context, rather than evaluating citations independently, to more reliably distinguish impactful references. Crystal outperforms a prior state-of-the-art impact classifier by +9.5% accuracy and +8.3% F1 on a dataset of human-annotated citations. Crystal further gains efficiency through fewer LLM calls and outperforms prior baselines using an open-weight model, enabling scalable, cost-effective citation impact analysis. In a case study of ACL Test-of-Time award-winning papers, we find that Crystal's impact characterizations align closely with long-term scientific recognition. We release Crystal-Bank, a 46.8k-paper dataset with rankings and impact labels, along with code.
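The shuffle-and-vote step described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `label_fn` is a hypothetical stand-in for the LLM ranking call, taking an ordered list of cited-paper IDs and returning a dict from ID to impact label.

```python
import random
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def aggregate_impact_labels(cited_ids, label_fn, rounds=3, seed=0):
    """Label the same citation list `rounds` times in shuffled order, then vote per paper.

    label_fn is a hypothetical callable (e.g. a wrapper around an LLM prompt):
    it receives the shuffled list of IDs and returns {id: label}.
    """
    rng = random.Random(seed)
    votes = {cid: [] for cid in cited_ids}
    for _ in range(rounds):
        order = list(cited_ids)
        rng.shuffle(order)  # a fresh presentation order for each round
        labels = label_fn(order)
        for cid in cited_ids:
            votes[cid].append(labels[cid])
    return {cid: majority_vote(v) for cid, v in votes.items()}
```

With three rounds and two labels a tie is impossible, which is one practical reason to prefer an odd number of rounds.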

URL PDF HTML ☆

赞 0 踩 0

2510.21011 2026-06-03 cs.HC cs.AI cs.CY 版本更新

Generating the Modal Worker: A Cross-Model Audit of Race and Gender in LLM-Generated Personas Across 41 Occupations

生成“众数”劳动者：跨模型审计41个职业中LLM生成人设的种族与性别

Ilona van der Linden, Sahana Kumar, Arnav Dixit, Aadi Sudan, Smruthi Danda, David C. Anastasiu, Kai Lukoff

发表机构 * Human-Computer Interaction Lab, Computer Science and Engineering（人机交互实验室，计算机科学与工程）； Santa Clara University（圣克拉拉大学）

AI总结本研究审计了四个大型语言模型生成的150多万个职业人设，通过与BLS数据对比，发现模型压缩了人口统计变异，系统性地扭曲了种族和性别代表性。

详情

AI中文摘要

随着生成式AI工具越来越多地被用于描绘职业角色中的人物，理解其种族和性别代表性偏差至关重要。我们审计了由四个主要大型语言模型（GPT-4、Gemini 2.5、DeepSeek V3.1和Mistral-medium）生成的41个美国职业中的150多万个职业人设。将这些人与美国劳工统计局（BLS）数据进行比较，我们发现模型生成的人口统计数据比真实世界数据的变异性更小，实际上将每个职业压缩为一种主导人口统计特征，而不是代表总体水平的变异。通过偏移/夸张分解揭示了这些扭曲的结构：白人（-31个百分点）和黑人（-9个百分点）工人持续被低估，而西班牙裔（+17个百分点）和亚裔（+12个百分点）工人被高估，刻板印象的夸张加剧了现有的职业隔离。这些扭曲往往极端，包括几乎全部将管家描绘为西班牙裔，以及许多职业中黑人工人几乎被抹去。由于这些模式在不同机构和文化起源的模型中重复出现，它们表明存在共享的结构性偏差来源，而非模型特定的伪影。我们认为，审计生成式AI需要评估框架，该框架检查合成人口如何系统地重塑跨社会角色的人口统计可见性。

英文摘要

As generative AI tools are increasingly used to portray people in professional roles, understanding their racial and gender representational biases is critical. We audit over 1.5 million occupational personas generated by four major large language models (GPT-4, Gemini 2.5, DeepSeek V3.1, and Mistral-medium) across 41 U.S. occupations. Comparing these personas against U.S. Bureau of Labor Statistics (BLS) data, we find that models generate demographics with less variation than real-world data, functionally compressing each occupation toward a dominant demographic profile rather than representing population-level variation. A shift/exaggeration decomposition reveals the structure of these distortions: White (-31 percentage points) and Black (-9 pp) workers are consistently underrepresented, while Hispanic (+17 pp) and Asian (+12 pp) workers are overrepresented, with stereotype exaggeration amplifying existing occupational segregation. These distortions are often extreme, including near-total portrayals of housekeepers as Hispanic and the near-erasure of Black workers from many occupations. Because these patterns recur across models with different institutional and cultural origins, they suggest shared structural sources of bias rather than model-specific artifacts. We argue that auditing generative AI requires evaluation frameworks that examine how synthetic populations systematically reshape demographic visibility across social roles.

URL PDF HTML ☆

赞 0 踩 0

2603.23117 2026-06-03 cs.CR cs.AI cs.RO 版本更新

TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches

TRAP: 通过对抗性补丁劫持VLA的CoT推理

Zhengxian Huang, Wenjun Zhu, Haoxuan Qiu, Xiaoyu Ji, Wenyuan Xu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出TRAP攻击，利用对抗性补丁劫持视觉-语言-动作模型的链式推理，实现目标行为操控。

Comments Accepted by ICML 2026

详情

AI中文摘要

通过集成链式推理，视觉-语言-动作模型在机器人操作中展现出强大能力，特别是在提升泛化性和可解释性方面。然而，基于CoT的推理机制的安全性尚未得到充分探索。在本文中，我们证明CoT推理引入了一种新的攻击向量，用于目标行为劫持——例如，导致机器人错误地将刀递给一个人而不是苹果——而无需修改用户的指令。我们首先提供经验证据表明，即使CoT与输入指令在语义上不一致，它仍然强烈主导动作生成。基于这一观察，我们提出TRAP，这是首个针对CoT推理VLA模型的目标行为劫持对抗性攻击。通过针对推理到动作的路径，TRAP使用对抗性补丁（例如，放置在桌子上的桌布）来引导中间CoT推理和下游动作朝向对手定义的行为。在三个代表性推理VLA上的广泛评估，涵盖了不同的CoT推理机制，证明了TRAP的有效性。值得注意的是，我们在现实环境中通过将补丁打印在纸上实现了该攻击。我们的发现凸显了保护VLA系统中CoT推理的紧迫性。项目页面可在https://zhengxian-huang.github.io/TRAP-website/获取。

英文摘要

By integrating Chain-of-Thought (CoT) reasoning, Vision-Language-Action (VLA) models have demonstrated strong capabilities in robotic manipulation, particularly by improving generalization and interpretability. However, the security of CoT-based reasoning mechanisms remains largely unexplored. In this paper, we show that CoT reasoning introduces a novel attack vector for targeted behavior hijacking (for example, causing a robot to mistakenly deliver a knife to a person instead of an apple) without modifying the user's instruction. We first provide empirical evidence that CoT strongly governs action generation, even when it is semantically misaligned with the input instructions. Building on this observation, we propose TRAP, the first targeted behavior-hijacking adversarial attack against CoT-reasoning VLA models. By targeting the reasoning-to-action pathway, TRAP uses an adversarial patch (e.g., a tablecloth placed on the table) to steer intermediate CoT reasoning and downstream actions toward adversary-defined behaviors. Extensive evaluations on three representative reasoning VLAs, spanning distinct CoT reasoning mechanisms, demonstrate the effectiveness of TRAP. Notably, we implemented the patch by printing it on paper in a real-world setting. Our findings highlight the urgent need to secure CoT reasoning in VLA systems. The project page is available at https://zhengxian-huang.github.io/TRAP-website/.

URL PDF HTML ☆

赞 0 踩 0

2603.20508 2026-06-03 cs.MA cs.AI cs.CL 版本更新

Measuring Weak-to-Strong Legibility of Reasoning Models

衡量推理模型的弱到强可读性

Dani Roytburg, Shreya Sridhar, Daphne Ippolito

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结针对推理语言模型在多智能体场景中生成的中间思维链，提出“弱到强可读性”概念，并设计衡量指标以评估强模型输出对弱模型的易理解性。

Comments Accepted to Trustworthy AI4GOOD Workshop @ ICML 2026

2602.07768 2026-06-03 cs.CV cs.AI cs.LG cs.MM 版本更新

PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

PAND：面向提示的邻域蒸馏用于轻量级细粒度视觉分类

Qiuming Luo, Yuebing Li, Feng Li, Chang Kong

发表机构 * arXiv

AI总结提出PAND框架，通过提示感知语义校准和邻域感知结构蒸馏，将大型视觉语言模型知识迁移至轻量网络，在细粒度分类任务上超越现有方法。

Comments Accepted by ICIP2026

2601.12186 2026-06-03 cs.SE cs.AI 版本更新

Aletheia: What Makes RLVR For Code Verifiers Tick?

Aletheia: 什么使得代码验证器的RLVR有效？

Vatsal Venkatkrishna, Indraneil Paul, Iryna Gurevych

发表机构 * INSAIT, Sofia University "St. Kliment Ohridski", Bulgaria（保加利亚索菲亚大学INSAIT实验室）； Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, Technical University of Darmstadt and National Research Center for Applied Cybersecurity（达姆施塔特工业大学计算机科学系及国家应用网络安全研究中心泛在知识处理实验室）； ATHENE, Germany（德国ATHENE研究院）

AI总结通过消融实验研究RLVR训练代码验证器时，中间思考轨迹、负样本学习和策略内训练三个因素在不同规模下的性能-成本权衡，发现最优配方依赖于模型规模。

Comments 31 pages, 6 figures

详情

AI中文摘要

通过可验证奖励的强化学习（RLVR）训练的多领域思考验证器是现代后训练的核心。然而，由于完整RLVR管道的成本过高，它们在代码生成中的应用落后于执行反馈。在这项工作中，我们消融了RLVR中性能-成本权衡的三个主要选择：中间思考轨迹、从负样本学习和策略内训练。我们引入了Aletheia，一个受控的、基于执行的测试平台，以促进对不同模型大小和两个常见验证器应用场景下的协变量偏移进行无污染分析。我们的分析揭示，最优训练配方依赖于规模：对于小型验证器，策略内学习是主要性能驱动因素，而在较大规模下，思考预算成为最关键因素。虽然利用负样本对不同大小的top-1选择准确性有一致影响，但它们对排名重建的贡献随规模单调增加，并在大规模下稳定训练中起关键作用。我们的帕累托最优分析表明，在较大模型规模下消除策略内训练会产生一个与完整RLVR配方性能相当的验证器。此外，我们发现，在较低预算下，放弃思考轨迹是一种计算高效的策略，在训练成本和验证器准确性之间提供了强有力的权衡。最终，我们的工作为高效部署鲁棒代码验证器提供了必要的经验基础，从而使其能够在大型代码生成模型的后训练管道中得到更广泛的应用。

英文摘要

Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training. However, their adoption in code generation has lagged behind that of execution feedback due to the prohibitive costs of the full RLVR pipeline. In this work, we ablate three primary choices along the performance-cost trade-off in RLVR: intermediate thinking traces, learning from negative samples, and on-policy training. We introduce Aletheia, a controlled, execution-grounded testbed to facilitate a contamination-free analysis of code verifier training recipes across disparate model sizes and covariate shifts across two common verifier application scenarios. Our analysis reveals that the optimal training recipe is scale-dependent: on-policy learning is the primary performance driver for small verifiers, whereas the thinking budget becomes the most vital factor at larger scales. While leveraging negative samples has a consistent impact on top-1 selection accuracy across sizes, their contribution to ranking reconstruction increases monotonically with scale and plays a key role in stabilizing training at large sizes. Our Pareto optimality analysis demonstrates that eliminating on-policy training at larger model scales yields a verifier that performs comparably to the full RLVR recipe. Furthermore, we find that eschewing thinking traces serves as a compute-efficient strategy at lower budgets, offering a strong trade-off between training cost and verifier accuracy. Ultimately, our work provides the empirical foundation necessary to efficiently deploy robust code verifiers, thereby enabling their wider adoption in post-training pipelines for large code generation models.

URL PDF HTML ☆

赞 0 踩 0

2603.07664 2026-06-03 cs.CV cs.AI cs.GR 版本更新

Ref-DGS: Reflective Dual Gaussian Splatting

Ref-DGS: 反射性双高斯泼溅

Ningjing Fan, Yiqun Wang, Dong-Ming Yan, Peter Wonka

发表机构 * Chongqing University（重庆大学）； MAIS, Institute of Automation, Chinese Academy of Sciences and UCAS（自动化研究所，中国科学院，UCAS）； King Abdullah University of Science and Technology (KAUST)（阿卜杜拉国王科技大学）

AI总结提出Ref-DGS框架，通过双高斯场景表示和物理感知的镜面自适应混合着色器，在高效光栅化管线中解耦表面重建与镜面反射，实现反射场景的SOTA新视图合成且训练速度远快于基于光线的方法。

Comments Project page: https://njfan.github.io/Ref-DGS/

详情

AI中文摘要

反射外观，尤其是强烈的近场镜面反射，对精确的表面重建和新视图合成构成了根本性挑战。现有的高斯泼溅方法要么无法建模近场镜面反射，要么依赖显式光线追踪而计算成本高昂。我们提出了Ref-DGS，一个反射性双高斯泼溅框架，通过在高效光栅化管线中将表面重建与镜面反射解耦来解决这一权衡。Ref-DGS引入了一种双高斯场景表示，由几何高斯和互补的局部反射高斯组成，无需显式光线追踪即可捕捉近场镜面交互，并包含一个全局环境反射场用于建模远场镜面反射。为了预测镜面辐射，我们进一步提出了一种轻量级的、物理感知的镜面自适应混合着色器，融合全局和局部镜面特征。实验表明，Ref-DGS在反射场景上达到了最先进的性能，同时训练速度显著快于基于光线的高斯方法。

英文摘要

Reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref-DGS, a reflective dual Gaussian splatting framework that addresses this trade-off by decoupling surface reconstruction from specular reflection within an efficient rasterization-based pipeline. Ref-DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near-field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far-field specular reflections. To predict specular radiance, we further propose a lightweight, physically-aware specular adaptive mixing shader that fuses global and local specular features. Experiments demonstrate that Ref-DGS achieves state-of-the-art performance on reflective scenes while training substantially faster than ray-based Gaussian methods.

URL PDF HTML ☆

赞 0 踩 0

2602.07075 2026-06-03 physics.chem-ph cs.AI cs.CL cs.LG 版本更新

LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

LatentChem: 从文本思维链到化学推理中的潜在思考

Xinwu Ye, Yicheng Mao, Yuxuan Liao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Zehong Wang, Zhiyuan Liu, Zhenfei Yin, Li Yuan, Philip Torr, Huan Sun, xiangxiang Zeng, Mengdi Wang, Le Cong, Shenghua Gao, Xiangru Tang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对化学大语言模型依赖显式思维链导致的模态不匹配问题，提出LatentChem推理接口，通过连续思维向量和动态感知解耦化学逻辑与语言生成，在ChemCoTBench上以59.88%非平局胜率超越强CoT基线，并实现平均10.84倍推理步骤开销降低（5.96倍实际加速）。

Comments Accepted at ICML 2026

详情

AI中文摘要

当前的化学大语言模型主要依赖显式的思维链来解决复杂推理问题。然而，将非语言的隐性化学逻辑强制转化为离散的自然语言，造成了根本性的“模态不匹配”，为推理带来了人为瓶颈。我们提出了LatentChem，一种将化学逻辑与语言生成解耦的推理接口，使模型能够通过连续思维向量和动态感知来处理信息。我们的研究揭示了一个关键涌现行为：自发内化，这里定义为在仅结果优化下的自我选择。当为任务成功进行优化时，模型放弃冗长的文本推导，转而采用隐式的潜在计算，这表明模型将连续流形视为化学逻辑更自然的载体。这一范式转变也被证明是一种更优的计算策略：在严格的ChemCoTBench基准上，LatentChem对强CoT基线取得了59.88%的非平局胜率，同时在所有评估基准上实现了平均10.84倍的推理步骤开销降低（5.96倍实际加速）。我们的结果提供了经验证据，表明化学推理更自然、更有效地实现为连续潜在动力学，而非离散的语言轨迹。

英文摘要

Current chemical large language models (LLMs) predominantly rely on explicit Chain-of-Thought (CoT) to solve complex reasoning problems. However, forcing nonverbal tacit chemical logic into discrete natural language imposes a fundamental "modality mismatch," creating an artificial bottleneck for reasoning. We introduce LatentChem, a reasoning interface that decouples chemical logic from linguistic generation, enabling the model to process information via continuous thought vectors and dynamic perception. Our investigation reveals a pivotal emergent behavior: spontaneous internalization, defined here as self-selected under outcome-only optimization. When optimized for task success, the model abandons verbose textual derivations in favor of implicit latent computation, suggesting that it identifies the continuous manifold as a more native substrate for chemical logic. This paradigm shift also proves to be a superior computational strategy: LatentChem achieves a 59.88% non-tie win rate against the strong CoT baseline on the rigorous ChemCoTBench, while delivering a broad 10.84$\times$ average reduction in reasoning step overhead (5.96$\times$ wall-clock speedup) across all evaluated benchmarks. Our results provide empirical evidence that chemical reasoning is more naturally and effectively realized as continuous latent dynamics rather than discretized linguistic trajectories.

URL PDF HTML ☆

赞 0 踩 0

2603.05290 2026-06-03 cs.AI 版本更新

X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes

X-RAY: 通过形式化与校准探针映射大语言模型推理能力

Tianxi Gao, Yufan Cai, Yusi Yuan, Jin Song Dong

发表机构 * National University of Singapore（新加坡国立大学）

AI总结提出X-RAY系统，利用形式化工具生成结构可控的校准探针，通过分析约束交互、推理深度和解空间几何等属性，揭示LLM在约束细化与解空间重构下的推理不对称性。

Comments Accepted by KDD 2026

详情

DOI: 10.1145/3770855.3818029

AI中文摘要

大型语言模型（LLM）取得了有前景的性能，但其推理能力仍未被充分理解。现有评估主要强调任务级准确性，常常将模式匹配与推理能力混为一谈。我们提出了X-RAY，一个可解释的推理分析系统，通过校准的、形式化验证的探针来映射LLM的推理能力。我们将推理能力建模为可提取的结构的函数，通过形式化属性（如约束交互、推理深度和解空间几何）进行操作化。X-RAY通过形式化工具生成具有受控结构变化的探针，通过形式化校准和验证实现对增量结构信息的精确隔离。我们在数学、物理和化学领域从初级到高级的问题上评估了最先进的LLM。我们的分析揭示了LLM推理中的系统性不对称：模型对约束细化（即附加条件缩小现有解空间）相对稳健，但在解空间重构（即修改改变解流形的底层结构形式）下性能急剧下降。此外，校准的形式化探针能够区分在标准基准上看似无法区分的模型，并揭示出结构上可解释而非模糊的失败模式。除了评估，我们的框架无污染，并支持推理模型的训练和测试。

英文摘要

Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps the LLM reasoning capability using calibrated, formally verified probes. We model reasoning capability as a function of extractable structure, operationalized through formal properties such as constraint interaction, reasoning depth, and solution-space geometry. X-RAY generates probes via formal tools with controlled structural variations, enabling precise isolation of incremental structural information through formal calibration and verification. We evaluate state-of-the-art LLMs on problems ranging from junior-level to advanced in mathematics, physics, and chemistry. Our analysis reveals a systematic asymmetry in LLM reasoning: models are relatively robust to constraint refinement, where additional conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where modifications alter the underlying structural form of the solution manifold. Moreover, calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and reveal failure modes that are structurally interpretable rather than opaque. Beyond evaluation, our framework is contamination-free and supports the training and testing of reasoning models.

URL PDF HTML ☆

赞 0 踩 0

LLM何时应降低具体性？面向可靠长文本生成的选择性抽象

Shani Goren, Ido Galil, Ran El-Yaniv

发表机构 * Technion（以色列理工学院）； NVIDIA（英伟达）

AI总结针对LLM在长文本生成中因低置信度而丢弃有价值信息的问题，提出选择性抽象框架，通过原子级抽象替换不确定内容，在保持语义的同时提升准确性和可靠性。

详情

AI中文摘要

LLM被广泛使用，但仍容易出现事实错误，这削弱了用户信任并限制了在高风险场景中的采用。缓解这一风险的一种方法是为模型配备不确定性估计机制，在置信度低时弃权。然而，这种二元的“全有或全无”方法在长文本场景中过于严格，常常丢弃有价值的信息。我们引入了选择性抽象（SA），这是一个框架，使LLM能够通过选择性地降低不确定内容的细节来用具体性换取可靠性。我们首先通过选择性风险和覆盖率的视角形式化SA。然后，我们提出原子级选择性抽象，这是一种声明级别的实例化，将响应分解为原子声明（简短、自包含的陈述，每个表达一个单一事实），并用更高置信度、更低具体性的抽象替换不确定的原子。为了评估这一框架，我们开发了一个新颖的端到端流水线用于开放式生成，将风险实例化为事实正确性，并使用信息论度量保留信息来衡量覆盖率。在FactScore和LongFact-Objects基准测试上的六个开源模型中，原子级SA始终优于现有基线，在风险-覆盖率曲线下面积（AURC）上比声明移除方法提升高达27.73%，表明降低具体性可以在保留大部分原始含义的同时提升准确性和可靠性。

英文摘要

LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigate this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal, demonstrating that reducing specificity can boost accuracy and reliability while preserving most of their original meaning.
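As background for the AURC figure, the sketch below computes a textbook risk-coverage curve and its area for a list of claims with confidence scores. It assumes the classical selective-prediction setup, where coverage is the fraction of claims kept; the paper instead measures coverage with an information-theoretic quantity, so this illustrates the concept rather than the authors' metric.

```python
def risk_coverage_curve(confidences, correct):
    """Return (coverage, risk) points as claims are kept from most to least confident."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve = []
    errors = 0
    for n_kept, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        # coverage = share of claims kept; risk = error rate among the kept claims
        curve.append((n_kept / len(order), errors / n_kept))
    return curve


def aurc(curve):
    """Area under the risk-coverage curve, approximated as the mean risk over all points."""
    return sum(risk for _, risk in curve) / len(curve)
```

A lower AURC means the confidence scores rank correct claims ahead of incorrect ones, so a system can drop (or, in the paper's framing, abstract) its least reliable content first.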

URL PDF HTML ☆

赞 0 踩 0

2602.10387 2026-06-03 cs.DB cs.AI 版本更新

Test-Time Optimization of Physical Query Plans with LLMs

基于LLM的物理查询计划测试时优化

Mehmet Hamza Erol, Xiangpeng Hao, Federico Bianchi, Ciro Greco, Jacopo Tagliabue, James Zou

发表机构 * Stanford University（斯坦福大学）； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）； TogetherAI ； Bauplan

AI总结提出DBPlanBench框架，利用LLM在测试时通过语义推理和进化搜索优化物理查询计划，在OLAP查询中实现1.05-1.12倍中位数加速，并支持小规模到大规模的迁移。

Comments Code is available at: https://github.com/BauplanLabs/DBPLANBENCH

详情

AI中文摘要

传统查询优化依赖于基于成本的优化器，使用预定义的启发式和统计模型来估计执行成本（如运行时间、内存和I/O）。改进这些需要大量的工程努力，但它们通常无法利用查询和模式中的语义相关性来获得更好的物理计划。然而，大型语言模型（LLMs）能够推理列语义、值分布以及经典统计所忽略的更广泛的领域上下文。我们介绍了DBPlanBench，一个用于DataFusion引擎的框架，它通过紧凑的序列化表示暴露物理计划，并将LLM提出的编辑作为JSON补丁应用。在此框架上，我们实例化了一个测试时优化工作流，其中LLM检查物理查询计划，基于语义推理提出局部编辑，并通过进化搜索在迭代中优化候选方案。我们针对OLAP查询，其中重复执行的重负载使得即使是微小的效率提升也能转化为显著的累积节省。我们特别将评估重点放在连接重排序和连接侧选择上，其中基数估计误差会复合倍增。在TPC-H上中位数加速达到1.10-1.12倍，在TPC-DS上达到1.05-1.07倍，某些查询加速高达4.78倍。我们还证明了在小规模因子下发现的优化可以有效地迁移到更大规模，支持低成本的小规模到大工作流。

英文摘要

Traditional query optimization relies on cost-based optimizers that estimate execution cost (e.g., runtime, memory, and I/O) using predefined heuristics and statistical models. Improving these requires substantial engineering effort, yet they often cannot exploit semantic correlations in queries and schemas that could enable better physical plans. Large language models (LLMs), however, can reason about column semantics, value distributions, and broader domain context that classical statistics miss. We introduce DBPlanBench, a harness for the DataFusion engine that exposes physical plans through a compact serialized representation and applies LLM-proposed edits as JSON patches. On this harness, we instantiate a test-time optimization workflow where an LLM examines physical query plans, proposes localized edits based on semantic reasoning, and an evolutionary search refines the candidates across iterations. We target OLAP queries, where heavy, repeated execution turns even small efficiency gains into substantial cumulative savings. We specifically focus our evaluation on join reordering and join-side selection, where cardinality-estimation errors compound multiplicatively. Median speedups reach $1.10$-$1.12\times$ on TPC-H and $1.05$-$1.07\times$ on TPC-DS, with some achieving up to $4.78\times$. We also demonstrate that optimizations discovered at small scale factors transfer effectively to larger ones, supporting a low-cost small-to-large workflow.

2602.10352 2026-06-03 cs.CL cs.AI cs.LG updated version

Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

Keenan Pepper, Alex McKenzie, Florin Pop, Stijn Servaes, Martin Leitgab, Mike Vaiana, Judd Rosenblatt, Michael S. A. Graziano, Diogo de Lucena

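Affiliations: University of Washington

AI summary: Training lightweight adapters on interpretability artifacts (a scalar affine adapter needs only $d_\text{model}+1$ parameters) while keeping the language model fully frozen yields reliable self-interpretation across tasks and model families, significantly outperforming untrained baselines on sparse autoencoder feature labeling, topic identification, and bridge-entity decoding in multi-hop reasoning.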

Comments 26 pages, 18 tables, 17 figures. Code and data at https://github.com/agencyenterprise/selfie-adapters

Abstract

Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (70% vs 50% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.
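
The abstract gives only the parameter count of the scalar affine adapter, $d_\text{model}+1$. One reading consistent with that count is a single shared scale plus a $d_\text{model}$-dimensional bias vector. The sketch below illustrates that reading only; the class name, fields, and numbers are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalarAffineAdapter:
    """Hypothetical adapter: one shared scale plus a per-dimension bias."""
    scale: float        # 1 parameter
    bias: List[float]   # d_model parameters

    def num_parameters(self) -> int:
        # d_model bias entries plus the single scale.
        return len(self.bias) + 1

    def __call__(self, vector: List[float]) -> List[float]:
        if len(vector) != len(self.bias):
            raise ValueError("vector and bias must share d_model")
        # Affine map applied to an activation vector: scale * v + bias.
        return [self.scale * v + b for v, b in zip(vector, self.bias)]


# Toy example with d_model = 4 (made-up values).
adapter = ScalarAffineAdapter(scale=2.0, bias=[0.5, -1.0, 0.0, 1.0])
mapped = adapter([1.0, 2.0, 3.0, 4.0])
print(adapter.num_parameters(), mapped)
```

Under this reading the language model's own weights are never touched; only the scale and bias would be trained.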

2602.05302 2026-06-03 cs.AI updated version

Whose Name Comes Up? II: Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation

Lisette Espín-Noboa, Gonzalo Gabriel Méndez

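Affiliations: Complexity Science Hub Vienna; Universitat Politècnica de València; Inria Rennes

AI summary: Proposes the LLMScholarBench benchmark and audits the technical quality and social representation of 22 LLMs in physics expert recommendation under temperature variation, representation-constrained prompting, and retrieval-augmented generation, finding that each intervention brings different trade-offs.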

Comments In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26). 30 pages: 11 pages in main (6 figures, 1 table), 19 pages in appendix (22 figures, 2 tables)

DOI: 10.1145/3770855.3817543

Abstract

Large language models (LLMs) are now used for academic expert recommendation. Existing audits typically evaluate such recommendations in isolation, ignoring end-user inference-time interventions. Thus, it remains unclear whether failures (e.g., refusals, hallucinations, uneven coverage) stem from model choice or deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench measures technical quality and social representation using nine metrics. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Our results show that each intervention entails distinct trade-offs. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, whereas RAG primarily improves technical quality but reduces diversity and parity. Overall, end-user interventions reshape trade-offs rather than providing uniform gains. LLMScholarBench makes all these dynamics auditable across models and interventions in LLM-based scholar recommendation.

2602.08335 2026-06-03 cs.AI updated version

A Robust and Explainable Transformer-Based Framework for Phishing Email Detection

Sajad U P

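Affiliations: Independent Researcher

AI summary: Proposes a lightweight DistilBERT-based phishing email detection framework that strengthens robustness through gradient-based adversarial training and character-level noise, integrates three explainable-AI methods (LIME, SHAP, and IG), and uses Flan-T5-Small to generate natural-language explanations, improving detection accuracy and user trust.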

Abstract

Phishing and related cyber threats are becoming increasingly sophisticated, with email-based phishing remaining the most persistent attack vector. These attacks exploit human vulnerabilities to deliver malware or gain unauthorized access to sensitive information. Transformer-based models enhance phishing detection through robust contextual language understanding; yet they are often regarded as black boxes due to a lack of interpretability. Moreover, recent AI-enabled attacks further undermine model resilience. To address these challenges, this work proposes a lightweight phishing detection framework based on DistilBERT, a lightweight Transformer model. Robustness to embedding-level perturbations and character-level input noise is enhanced through gradient-based adversarial training using the Fast Gradient Method (FGM), combined with stochastic character-level perturbations. To improve transparency, three prominent Explainable AI (XAI) methods, LIME (Local Interpretable Model-agnostic Explanations), SHAP (SHapley Additive exPlanations), and IG (Integrated Gradients), are integrated to interpret model decision-making. A structured rule-based prompt combines model predictions and XAI features to guide Flan-T5-Small in generating plain-language, evidence-based explanations. Experimental results demonstrate that the proposed framework outperforms a standard DistilBERT-based detection model trained without robustness enhancements in terms of accuracy and resilience. This integrated approach helps bridge the gap between model reliability and user trust, advancing transparent phishing detection.
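
The abstract names the Fast Gradient Method (FGM) for embedding-level adversarial training but gives no details. In its usual form, FGM perturbs an embedding by $\epsilon \, g / \lVert g \rVert_2$, where $g$ is the loss gradient with respect to that embedding. The sketch below shows only that perturbation step in plain Python; the gradient values and $\epsilon$ are made up, and how the paper wires this into DistilBERT training is not stated in the abstract.

```python
import math
from typing import List


def fgm_perturbation(grad: List[float], epsilon: float) -> List[float]:
    """Return epsilon * grad / ||grad||_2, the usual FGM step on an embedding."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # No gradient signal: leave the embedding unperturbed.
        return [0.0 for _ in grad]
    return [epsilon * g / norm for g in grad]


embedding = [0.2, -0.4, 0.1]
grad = [3.0, 0.0, -4.0]  # made-up loss gradient w.r.t. the embedding
delta = fgm_perturbation(grad, epsilon=0.5)
adversarial = [e + d for e, d in zip(embedding, delta)]
print(delta, adversarial)
```

In adversarial training of this kind, the loss is typically evaluated on both the clean and the perturbed embedding before the optimizer step.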

2602.06219 2026-06-03 cs.RO cs.AI updated version

Coupled Local and Global World Models for Efficient First Order RL

Joseph Amigo, Rooholla Khorrambakht, Nicolas Mansard, Ludovic Righetti

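Affiliations: Machines in Motion Laboratory, New York University, USA; LAAS-CNRS, Université de Toulouse, CNRS, Toulouse, France; Artificial and Natural Intelligence Toulouse Institute, Toulouse, France

AI summary: Proposes training policies inside data-driven world models with a decoupled first-order gradient method that couples a local and a global world model for efficient gradient computation; the method significantly outperforms PPO in sample efficiency on the Push-T task and is further evaluated on a quadruped object manipulation task.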

Comments Project website: https://coupled-global-local-wm-rl.pages.dev/

Abstract

World models offer a promising avenue for more faithfully capturing complex dynamics, including contacts and non-rigidity, as well as complex sensory information, such as visual perception, in situations where standard simulators struggle. However, these models are computationally expensive to evaluate, posing a challenge for popular RL approaches that have been used successfully with simulators to solve complex locomotion tasks yet still struggle with manipulation. This paper introduces a method that bypasses simulators entirely, training RL policies inside world models learned from robots' interactions with real environments. At its core, our approach enables policy training with large-scale diffusion models via a novel decoupled first-order gradient (FoG) method: a full-scale world model generates accurate forward trajectories, while a lightweight latent-space surrogate approximates its local dynamics for efficient gradient computation. This coupling of a local and global world model ensures high-fidelity unrolling alongside computationally tractable differentiation. We demonstrate the efficacy of our method on the Push-T manipulation task, where it significantly outperforms PPO in sample efficiency. We further evaluate our approach through an ego-centric object manipulation task with a quadruped. Together, these results demonstrate that learning inside data-driven world models is a promising pathway for solving hard-to-model RL tasks in image space without reliance on hand-crafted physics simulators.

2602.04899 2026-06-03 cs.CR cs.AI updated version

Phantom Transfer: Data Poisoning can Survive Data-Level Defences

Andrew Draganov, Tolga H. Dur, Anandmayi Bhongade, Mary Phuong

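AI summary: Proposes a data poisoning attack called Phantom Transfer that cannot be filtered out even when it is known exactly how the poison was placed into a benign dataset; the attack adapts subliminal learning to realistic settings and survives a range of data-level defences.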

Abstract

We present a data poisoning attack -- Phantom Transfer -- with the property that, even if you know precisely how the poison was placed into an otherwise benign dataset, you cannot filter it out. We achieve this by modifying subliminal learning to work in real-world contexts and demonstrate that the attack works regardless of which model produced the data, which model is trained on the data or what the attack target is. Furthermore, the attack survives 11 tested data-level defences, including one where every sample is paraphrased by another model. We characterise when this attack works best and show that it can be used to plant password-triggered behaviours into models while still beating defences. In short, we provide an existence proof that maximum-affordance defences can fail to stop sophisticated data poisoning attacks. We suggest that future defences should be supplemented with white-box methods and post-training model audits.

2507.10419 2026-06-03 cs.LG cs.AI cs.CL stat.ML updated version

Multiple Choice Learning of Low-Rank Adapters for Language Modeling

Victor Letzelter, Hugo Malard, Mathieu Fontaine, Gaël Richard, Slim Essid, Andrei Bursuc, Patrick Pérez

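Affiliations: Institut National de la Recherche Scientifique (INRS)

AI summary: Proposes the LoRA-MCL training scheme, which extends next-token prediction in language models with multiple choice learning and low-rank adaptation so that diverse, plausible sentence continuations can be decoded at inference time.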

Comments ICML 2026

2602.01483 2026-06-03 cs.LG cs.AI stat.ME updated version

Causal Preference Elicitation

Edwin V. Bonilla, He Zhao, Daniel M. Steinberg

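Affiliations: University of California, Berkeley

AI summary: Proposes a Bayesian framework that actively queries local edge relations to concentrate the posterior over directed acyclic graphs, enabling expert-in-the-loop causal discovery.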

2510.16392 2026-06-03 cs.AI updated version

RGMem: Renormalization Group-inspired Memory Evolution for Language Agents

Ao Tian, Yunfeng Lu, Xinxin Fan, Changhao Wang, Lanzhi Zhou, Yeyao Zhang, Yanfang Liu

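Affiliations: School of Computer Science and Engineering, Beihang University, Beijing, China; School of Reliability and Systems Engineering, Beihang University, Beijing, China; State Key Laboratory of Complex & Critical Software Environment; National Key Laboratory of Reliability; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences

AI summary: Proposes the RGMem framework, which applies renormalization-group ideas (multi-scale coarse-graining, thresholded updates, and rescaling) to long-term conversational memory, integrating facts hierarchically into user preferences and outperforming existing memory systems on the LOCOMO and PersonaMem benchmarks.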

Comments Accepted to ICML 2026

Abstract

Personalized and continuous interactions are critical for LLM-based conversational agents, yet finite context windows and static parametric memory hinder the modeling of long-term, cross-session user states. Existing approaches, including retrieval-augmented generation and explicit memory systems, primarily operate at the fact level, making it difficult to distill stable preferences and deep user traits from evolving and potentially conflicting dialogues. To address this challenge, we propose RGMem, a self-evolving memory framework inspired by the renormalization group (RG) perspective on multi-scale organization and emergence. RGMem models long-term conversational memory as a multi-scale evolutionary process: episodic interactions are transformed into semantic facts and user insights, which are then progressively integrated through hierarchical coarse-graining, thresholded updates, and rescaling into a dynamically evolving user profile. By explicitly separating fast-changing evidence from slow-varying traits and enabling non-linear, phase-transition-like dynamics, RGMem enables robust personalization beyond flat retrieval or static summarization. Extensive experiments on the LOCOMO and PersonaMem benchmarks demonstrate that RGMem consistently outperforms SOTA memory systems, achieving stronger cross-session continuity and improved adaptation to evolving user preferences. Code is available at https://github.com/fenhg297/RGMem.

2510.02763 2026-06-03 cs.LG cs.AI updated version

Fusing Multi- and Hyperspectral Satellite Data for Harmful Algal Bloom Monitoring with Self-Supervised and Hierarchical Deep Learning

Nicholas LaHaye, Kelly M. Luis, Michelle M. Gierach

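Affiliations: University of Colorado Boulder

AI summary: Proposes SIT-FUSE, a self-supervised machine learning framework that fuses multi-sensor satellite reflectance with TROPOMI solar-induced fluorescence data and uses hierarchical deep clustering to produce harmful algal bloom severity and speciation products, validated against in-situ data from the Gulf of Mexico and Southern California.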

DOI: 10.1029/2025EA004881

Abstract

We present a self-supervised machine learning framework for detecting and mapping the severity and speciation of harmful algal blooms (HABs) using multi-sensor satellite data. By fusing reflectance data from operational polar-orbiting satellite-based instruments (VIIRS, MODIS, OLCI, and OCI) with TROPOMI solar-induced fluorescence (SIF), our framework, called SIT-FUSE, generates HAB severity and speciation products without requiring per-instrument labeled datasets. The framework employs self-supervised representation learning and hierarchical deep clustering to segment phytoplankton cell abundance and species into interpretable classes, validated against in-situ data from the Gulf of Mexico and Southern California (2018-2025). Results show strong agreement with total phytoplankton, Karenia brevis, and Pseudo-nitzschia spp. measurements. This work advances scalable HAB monitoring in environments where ground-truth observations are limited, while enabling exploratory analysis via hierarchical embeddings, a critical step toward operationalizing self-supervised learning for global aquatic biogeochemistry.

2601.23229 2026-06-03 cs.AI cs.CC updated version

Strongly Polynomial Time Complexity of Policy Iteration for $L_\infty$ Robust MDPs

Ali Asadi, Krishnendu Chatterjee, Ehsan Goharshady, Mehrdad Karrabi, Alipasha Montaseri, Carlo Pagano

Affiliations: Institute for Computer Science, Austrian Academy of Sciences; Concordia University

AI summary: For discounted $(s,a)$-rectangular $L_\infty$ robust MDPs, proves that policy iteration runs in strongly polynomial time for a fixed discount factor.

Comments To Appear in The 39th Annual Conference on Learning Theory (COLT'26)

Abstract

Markov decision processes (MDPs) are a fundamental model in sequential decision making. Robust MDPs (RMDPs) extend this framework by allowing uncertainty in transition probabilities and optimizing against the worst-case realization of that uncertainty. In particular, $(s, a)$-rectangular RMDPs with $L_\infty$ uncertainty sets form a fundamental and expressive model: they subsume classical MDPs and turn-based stochastic games. We consider this model with discounted payoffs. The existence of polynomial and strongly-polynomial time algorithms is a fundamental problem for these optimization models. For MDPs, linear programming yields polynomial-time algorithms for any discount factor, and the seminal work of Ye established strongly-polynomial time for a fixed discount factor. The generalization of such results to RMDPs has remained an important open problem. In this work, we show that a robust policy iteration algorithm runs in strongly-polynomial time for $(s, a)$-rectangular $L_\infty$ RMDPs with a constant (fixed) discount factor, resolving an important algorithmic question.

2601.20844 2026-06-03 cs.LG cs.AI cs.IR updated version

$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See

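Affiliations: University of California, Berkeley

AI summary: Studies the Minimal Embeddable Dimension (MED) and shows that it is $\Theta(k)$, independent of $m$, for inner product, Euclidean distance, and cosine similarity; then considers Robust MED (RMED), derives the feasibility ceiling $\varepsilon_\star(m,k)$, and supports the theory with experiments.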

Comments v2: fix broken citation. v3: ICML 2026

Abstract

This paper studies the Minimal Embeddable Dimension (MED): the least dimension in which there exists a configuration of $m$ object vectors so that every subset of size at most $k$ is exactly retrieved by score comparison. Our result shows that MED is $\Theta(k)$, independent of $m$, for inner product, Euclidean distance, and cosine similarity. We then consider Robust MED (RMED), where all vectors are unit-normed and a score gap of $\varepsilon$ is required. We derive the $m$-dependent feasibility ceiling $\varepsilon_\star(m,k)=m/\sqrt{k(m-1)(m-k)}$, which approaches $1/\sqrt{k}$ when $m\gg k$, and a Gaussian centroid construction gives a robust witness upper bound in the feasible margin regime. Numerical simulation on synthetic top-$2$ retrieval with cyclic polytope and centroid query optimization confirms our theoretical claims. Experiments on the LIMIT and LIMIT-small datasets also show that simple embedding-based retrieval baselines can overfit and outperform the reported single-vector LLM embedding baseline. Both theoretical and empirical findings rule out the lack of exact geometric capacity as the obstruction.
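
The feasibility ceiling quoted in the abstract can be evaluated directly to see how quickly it approaches $1/\sqrt{k}$. The short sketch below only transcribes the stated formula; the function name and the chosen values of $m$ and $k$ are illustrative.

```python
import math


def feasibility_ceiling(m: int, k: int) -> float:
    """epsilon_star(m, k) = m / sqrt(k * (m - 1) * (m - k)), as stated in the abstract."""
    if not (1 <= k < m):
        raise ValueError("requires 1 <= k < m")
    return m / math.sqrt(k * (m - 1) * (m - k))


k = 2
limit = 1 / math.sqrt(k)
for m in (4, 100, 1_000_000):
    # The gap to 1/sqrt(k) shrinks as m grows relative to k.
    print(m, feasibility_ceiling(m, k), feasibility_ceiling(m, k) - limit)
```

The ceiling sits above $1/\sqrt{k}$ for small $m$ and decreases toward it as $m$ grows.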

2601.12247 2026-06-03 cs.CL cs.AI cs.LG updated version

Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models

Miao Li, Hanyang Jiang, Sikai Cheng, Hengyu Fu, Yuhang Cai, Baihe Huang, Tinghan Ye, Xuanzhou Chen, Pascal Van Hentenryck

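Affiliations: Georgia Institute of Technology; University of California, Berkeley; University of Michigan

AI summary: Proposes Plan-Verify-Fill (PVF), which plans a hierarchical skeleton grounded in quantitative validation and uses a verification protocol for structured stopping, reducing the number of function evaluations by up to 65% while preserving accuracy.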

Abstract

Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.

2509.01641 2026-06-03 eess.SP cs.AI cs.LG updated version

Non-Identical Diffusion Models in MIMO-OFDM Channel Generation

Yuzhi Yang, Omar Alhussein, Mérouane Debbah

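AI summary: Proposes a non-identical diffusion model that uses an element-wise time indicator to capture local error variations, addressing uneven per-element reliability in MIMO-OFDM channel estimates; its correctness is shown theoretically and its effectiveness numerically.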

Comments resubmitted to IEEE TCOM

Abstract

We propose a novel diffusion model, termed the non-identical diffusion model, and investigate its application to wireless orthogonal frequency division multiplexing (OFDM) channel generation. Unlike the standard diffusion model that uses a scalar-valued time index to represent the global noise level, we extend this notion to an element-wise time indicator to capture local error variations more accurately. Non-identical diffusion enables us to characterize the reliability of each element (e.g., subcarriers in OFDM) within the noisy input, leading to improved generation results when the initialization is biased. Specifically, we focus on the recovery of wireless multi-input multi-output (MIMO) OFDM channel matrices, where the initial channel estimates exhibit highly uneven reliability across elements due to the pilot scheme. Conventional time embeddings, which assume uniform noise progression, fail to capture such variability across pilot schemes and noise levels. We introduce a matrix that matches the input size to control element-wise noise progression. Following a similar diffusion procedure to existing methods, we show the correctness and effectiveness of the proposed non-identical diffusion scheme both theoretically and numerically. For MIMO-OFDM channel generation, we propose a dimension-wise time embedding strategy. We also develop and evaluate multiple training and generation methods and compare them through numerical experiments.
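
To make the idea of an element-wise time indicator concrete, the sketch below noises each entry of a matrix at its own level, using a time matrix of the same shape as the input. Everything beyond that idea is an assumption for illustration: the cosine-style signal schedule `alpha_bar`, the real-valued entries (channel matrices are complex in practice), and all names and numbers are hypothetical and not taken from the paper.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def alpha_bar(t: float) -> float:
    """Assumed cosine-style signal level for t in [0, 1] (t=0 clean, t=1 pure noise)."""
    return math.cos(0.5 * math.pi * t) ** 2


def noise_elementwise(x0: Matrix, t: Matrix, rng: random.Random) -> Matrix:
    """Apply a separate noise level t[i][j] to each entry x0[i][j]."""
    out = []
    for row_x, row_t in zip(x0, t):
        out_row = []
        for x, tij in zip(row_x, row_t):
            a = alpha_bar(tij)
            # Variance-preserving mix of signal and Gaussian noise per element.
            out_row.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0))
        out.append(out_row)
    return out


# Column 0 plays the role of reliable (pilot-like) entries, column 1 of unreliable ones.
x0 = [[1.0, -2.0], [0.5, 3.0]]
t = [[0.0, 0.8], [0.0, 0.8]]
noisy = noise_elementwise(x0, t, random.Random(0))
print(noisy)
```

A standard diffusion model corresponds to the special case where every entry of `t` holds the same scalar.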

2501.17377 2026-06-03 cs.LG cs.AI updated version

ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization

Han Fang, Paul Weng, Yutong Ban

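Affiliations: GitHub

AI summary: To address the brittleness of neural combinatorial optimization models under distribution shift, proposes the ASAP framework, which splits decision-making into proposal and selection phases and uses MAML to strengthen online adaptation, improving generalization on 3D-BPP, TSP, and CVRP.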

Comments Accepted as a poster at ICML 2026

Abstract

Deep Reinforcement Learning (DRL) has emerged as a promising approach for solving Combinatorial Optimization (CO) problems, such as the 3D Bin Packing Problem (3D-BPP), Traveling Salesman Problem (TSP), or Vehicle Routing Problem (VRP), but these neural solvers often exhibit brittleness when facing distribution shifts. To address this issue, we uncover the Satisficing Generalization Edge, which we validate both theoretically and experimentally: identifying a set of promising actions is inherently more generalizable than selecting the single optimal action. To exploit this property, we propose Adaptive Selection After Proposal (ASAP), a generic framework that decomposes the decision-making process into two distinct phases: a proposal policy that acts as a robust filter, and a selection policy as an adaptable decision maker. This architecture enables a highly effective online adaptation strategy where the selection policy can be rapidly fine-tuned on a new distribution. Concretely, we introduce a two-phase training framework enhanced by Model-Agnostic Meta-Learning (MAML) to prime the model for fast adaptation. Extensive experiments on 3D-BPP, TSP, and CVRP demonstrate that ASAP improves the generalization capability of state-of-the-art baselines and achieves superior online adaptation on out-of-distribution instances.

2601.11667 2026-06-03 cs.LG cs.AI updated version

Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction

Xiaojie Xia, Huigang Zhang, Chaoliang Zhong, Jun Sun, Yusuke Oishi

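Affiliations: Fujitsu Research & Development Center CO., LTD; Fujitsu Research, FUJITSU LTD

AI summary: Proposes Distill-then-Replace (DtR), which uses blockwise local distillation and a greedy layer replacement strategy to convert a pretrained full-attention model into a task-specific hybrid attention model without retraining or neural architecture search.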

Abstract

Transformer architectures deliver state-of-the-art accuracy via dense full attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges: training such hybrid models from scratch is computationally expensive, and manually designing the optimal placement of attention types is highly nontrivial. We propose DtR (Distill-then-Replace), which first transfers weights from the pretrained full-attention modules to their linear-attention counterparts through blockwise local distillation, and then applies a greedy layer replacement strategy that iteratively substitutes full-attention blocks with linear ones while monitoring validation performance on the target task. DtR yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
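
The greedy replacement step described in the abstract can be pictured with a small search loop. The sketch below is one plausible reading, not the authors' algorithm: the `evaluate` callback, the `max_drop` tolerance relative to the all-full baseline, and the toy scoring function are all hypothetical, and the distillation stage is omitted.

```python
from typing import Callable, List


def greedy_replace(
    num_blocks: int,
    evaluate: Callable[[List[str]], float],
    max_drop: float,
) -> List[str]:
    """Greedily swap 'full' blocks for 'linear' ones while validation score stays acceptable.

    `evaluate` is a hypothetical callback returning the validation score of a layout;
    `max_drop` is an assumed tolerance relative to the all-full baseline.
    """
    layout = ["full"] * num_blocks
    baseline = evaluate(layout)
    while True:
        best_score, best_index = None, None
        for i, kind in enumerate(layout):
            if kind != "full":
                continue
            candidate = layout[:i] + ["linear"] + layout[i + 1:]
            score = evaluate(candidate)
            if best_score is None or score > best_score:
                best_score, best_index = score, i
        # Stop when no full block remains or the best swap costs too much.
        if best_index is None or baseline - best_score > max_drop:
            return layout
        layout[best_index] = "linear"


# Toy validation score: each block has a made-up penalty when made linear.
penalties = [0.5, 0.1, 3.0, 0.2]


def toy_evaluate(layout: List[str]) -> float:
    return 90.0 - sum(p for p, kind in zip(penalties, layout) if kind == "linear")


hybrid = greedy_replace(len(penalties), toy_evaluate, max_drop=1.0)
print(hybrid)
```

In this toy run the block with the large penalty stays as full attention, which is the kind of task-specific placement the abstract describes.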

2601.11429 2026-06-03 cs.CL cs.AI updated version

Relational Linearity is a Predictor of Hallucinations

Yuetian Lu, Yihong Liu, Sebastian Gerstner, Lea Hirlimann, Jonas Rohweder, Hinrich Schütze

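AI summary: Using a benchmark of synthetic unknown entities, finds that language models hallucinate more readily on questions about linear relations and that relational linearity correlates strongly with hallucination rate.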

Comments 15 pages, 6 figures, 14 tables

Abstract

Hallucination is a central failure mode of language models (LMs). We focus on hallucinations in response to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities designed to be unknown to the model. We find that LMs like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. Based on the idea of linear relational embeddings, we put forward the following hypothesis. (i) Due to the abstract scheme that is used to represent them, LMs can easily produce plausible objects for non-existing subjects of linear relations, which can lead to hallucinations. (ii) For a nonlinear relation, this mechanism for producing an object is not available and so a hallucination is easier to avoid. To test this hypothesis, we create SyntHal, a synthetic unknown-entity benchmark for 15 relations. We find that across four instruction-tuned models, relational linearity is a strong predictor of models hallucinating an object for an unknown subject vs refusing to give an answer, with correlations $r \in [.58, .84]$.

2601.10222 2026-06-03 math.NA cs.AI cs.NA math.OC updated version

Introduction to optimization methods for training SciML models

Alena Kopaničáková, Elisa Riccietti

Affiliations: Toulouse-INP, IRIT-APO, ANITI; ENS de Lyon, CNRS, Inria, Université Claude Bernard Lyon 1, LIP, UMR 5668

AI summary: Gives a unified introduction to optimization methods in machine learning and scientific machine learning, emphasizing how problem structure shapes algorithm choice and discussing practical strategies for physics-constrained and data-driven SciML models.

Abstract

Optimization is central to both modern machine learning (ML) and scientific machine learning (SciML), yet the structure of the underlying optimization problems differs substantially across these domains. Classical ML typically relies on stochastic, sample-separable objectives that favor first-order and adaptive gradient methods. In contrast, SciML often involves physics-informed or operator-constrained formulations in which differential operators induce global coupling, stiffness, and strong anisotropy in the loss landscape. As a result, optimization behavior in SciML is governed by the spectral properties of the underlying physical models rather than by data statistics, frequently limiting the effectiveness of standard stochastic methods and motivating deterministic or curvature-aware approaches. This document provides a unified introduction to optimization methods in ML and SciML, emphasizing how problem structure shapes algorithmic choices. We review first- and second-order optimization techniques in both deterministic and stochastic settings, discuss their adaptation to physics-constrained and data-driven SciML models, and illustrate practical strategies through tutorial examples, while highlighting open research directions at the interface of scientific computing and scientific machine learning.

2601.09869 2026-06-03 cs.AI cs.HC (updated version)

A Scoping Review of the Ethical Perspectives on Anthropomorphising Large Language Model-Based Conversational Agents

Andrea Ferrario, Rasita Vinay, Matteo Casserini, Alessandro Facchini

Affiliations: Institute of Biomedical Ethics and History of Medicine, University of Zürich; Dalle Molle Institute for Artificial Intelligence (IDSIA), SUPSI; ETH Zürich; Institute for Implementation Science in Health Care, University of Zürich; Department of Management, Technology and Economics, ETH Zürich; Dipartimento Tecnologie Innovative, SUPSI; Management in Networked and Digital Societies (MINDS) Department, Kozminski University

AI summary: This scoping review systematically maps the ethical challenges and opportunities of anthropomorphising LLM-based conversational agents, covering conceptual foundations, ethical issues, and methodology, and proposes a research agenda with design and governance recommendations.

Comments: 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT'26)

Abstract

Anthropomorphisation -- the phenomenon whereby non-human entities are ascribed human-like qualities -- has become increasingly salient with the rise of large language model (LLM)-based conversational agents (CAs). Unlike earlier chatbots, LLM-based CAs routinely generate interactional and linguistic cues, such as first-person self-reference, epistemic and affective expressions that empirical work shows can increase engagement. On the other hand, anthropomorphisation raises ethical concerns, including deception, overreliance, and exploitative relationship framing, while some authors argue that anthropomorphic interaction may support autonomy, well-being, and inclusion. Despite increasing interest in the phenomenon, literature remains fragmented across domains and varies substantially in how it defines, operationalizes, and normatively evaluates anthropomorphisation. This scoping review maps ethically oriented work on anthropomorphising LLM-based CAs across five databases and three preprint repositories. We synthesize (1) conceptual foundations, (2) ethical challenges and opportunities, and (3) methodological approaches. We find convergence on attribution-based definitions but substantial divergence in operationalization, a predominantly risk-forward normative framing, and limited empirical work that links observed interaction effects to actionable governance guidance. We conclude with a research agenda and design/governance recommendations for ethically deploying anthropomorphic cues in LLM-based conversational agents.

2601.08173 2026-06-03 cs.AI (updated version)

The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi

Affiliations: Fudan University; Shanghai AI Laboratory; Zhejiang University; Shanghai Innovation Institute; Shanghai Jiao Tong University

AI summary: To address three challenges that multimodal large language models face in dynamic workplace scenarios (task scheduling, active exploration, and continual learning), the paper proposes EvoEnv, a dynamic evaluation environment; experiments show that existing agents fall well short in these areas.

Abstract

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce EvoEnv, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, EvoEnv evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv.

2512.23234 2026-06-03 cs.CV cs.AI (updated version)

Edge-Aware and Content-Adaptive Infrared Gas Leak Detection for Industrial Safety Monitoring

Dongsheng Li, Tianli Ma, Siling Wang, Beibei Duan, Song Gao

Affiliations: School of Mechatronic Engineering, Xi’an Technological University; School of Electronic Information Engineering, Xi’an Technological University; Shaanxi Shanhua Coal Chemical Co., Ltd.

AI summary: To address the difficulty of detecting infrared gas plumes that are faint, semi-transparent, and weakly bounded, the paper proposes an edge-aware and content-adaptive feature fusion detector (ECAF-Det) that combines plume-oriented local-global feature enhancement, a multi-scale edge perception module, and a content-adaptive sparse routing path aggregation network, markedly improving detection accuracy on the IIG and LangGas datasets.

Abstract

Infrared gas leak detection is important for industrial safety and environmental monitoring, but automatic detection remains challenging because gas plumes are often faint, small, semi-transparent, and weakly bounded. This paper proposes an Edge-Aware and Content-Adaptive Feature Fusion Detector (ECAF-Det) for weak-plume detection in cluttered thermal scenes. ECAF-Det integrates three task-oriented designs: a plume-oriented local-global feature enhancement block to preserve fine boundary cues and capture long-range contextual continuity; a multi-scale edge perception module that transforms directional gradient and phase-consistency cues into hierarchical edge priors for boundary-sensitive plume representation; and a content-adaptive sparse routing path aggregation network that dynamically regulates multi-scale feature propagation to emphasize informative plume features and suppress redundant background responses. Experiments on the IIG dataset show that ECAF-Det achieves 29.8% AP, 84.3% AP50, and 25.3% small-object AP, improving the RT-DETR-R18 baseline by 3.0, 6.5, and 5.4 percentage points, respectively, with 43.7 GFLOPs and 14.9 M parameters. On the LangGas dataset, ECAF-Det achieves 36.3% AP and 68.5% AP50, demonstrating its generalization to different infrared gas plume appearances. The main AI contribution is edge-aware representation learning with content-adaptive sparse feature routing for weak infrared plume perception. The proposed detector can serve as a visual perception component for early warning and remote inspection in industrial gas leak monitoring.

2504.04942 2026-06-03 cs.AI cs.LO (updated version)

Lemmanaid: Neuro-Symbolic Lemma Conjecturing

Yousef Alhessi, Sólrún Halla Einarsdóttir, George Granberry, Emily First, Moa Johansson, Sorin Lerner, Nicholas Smallbone

Affiliations: Department of Computer Science and Engineering, University of California, San Diego, USA; Department of Computer Science and Engineering, Chalmers University of Technology & University of Gothenburg

AI summary: Presents LEMMANAID, the first neuro-symbolic lemma conjecturing tool, which generates lemmas by drawing analogies between mathematical theories; by combining a fine-tuned LLM with symbolic methods, it outperforms purely neural and purely symbolic approaches on Isabelle test sets.

Abstract

Mathematicians and computer scientists are increasingly leveraging proof assistants to formalize and check complex proofs, a task that demands substantial expertise. Can we lower the bar by automating the conjecturing of helpful, interesting and novel lemmas? We present the first neuro-symbolic lemma conjecturing tool, LEMMANAID, designed to discover conjectures by drawing analogies between mathematical theories. LEMMANAID uses a fine-tuned LLM to generate lemma templates that describe the shape of a lemma, and symbolic methods to fill in the details. We compare LEMMANAID against the same LLM fine-tuned to generate lemmas directly, as well as a fully symbolic conjecturing method. On test sets from Isabelle's HOL library and Archive of Formal Proofs (AFP), LEMMANAID consistently outperforms both neural and symbolic methods. Using DeepSeek-coder-6.7B as a backend, LEMMANAID discovers 50% (HOL) and 29% (AFP) of the gold standard lemmas, increasing to 55% and 35% when ensembling prompting strategies. In a case study on Octonions, LEMMANAID discovers 79% of the gold standard lemmas, compared to 62% for neural-only and 23% for the state of the art symbolic tool. Furthermore, in a targeted comparison, LEMMANAID discovers more gold standard lemmas than both Claude Opus 4.5 and GPT-5.2. Our results show that LEMMANAID can conjecture a significant number of interesting lemmas across complex formalizations in mathematics and computer science.

2505.11785 2026-06-03 cs.LG cs.AI stat.ML (updated version)

Improving Coverage in Combined Prediction Sets with Weighted p-values

Gina Wong, Drew Prinster, Suchi Saria, Rama Chellappa, Anqi Liu

Affiliations: Johns Hopkins University

AI summary: Proposes a framework for weighted aggregation of prediction sets that assigns a weight to each set, giving flexible control over coverage between $1-2α$ and $1-α$; it extends to data-dependent weights and preserves finite-sample validity in settings such as mixture-of-experts models.

Journal ref: AISTATS 2026

Abstract

Conformal prediction quantifies the uncertainty of machine learning models by augmenting point predictions with valid prediction sets. For complex scenarios involving multiple trials, models, or data sources, conformal prediction sets can be aggregated to create a prediction set that captures the overall uncertainty, often improving precision. However, aggregating multiple prediction sets with individual $1-α$ coverage inevitably weakens the overall guarantee, typically resulting in $1-2α$ worst-case coverage. In this work, we propose a framework for the weighted aggregation of prediction sets, where weights are assigned to each prediction set based on their contribution. Our framework offers flexible control over how the sets are aggregated, achieving tighter coverage bounds that interpolate between the $1-2α$ guarantee of the combined models and the $1-α$ guarantee of an individual model depending on the distribution of weights. Importantly, our framework generalizes to data-dependent weights, as we derive a procedure for weighted aggregation that maintains finite-sample validity even when the weights depend on the data. This extension makes our framework broadly applicable to settings where weights are learned, such as mixture-of-experts (MoE), and we demonstrate through experiments in the MoE setting that our methods achieve adaptive coverage.
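
To make the interpolation between $1-2α$ and $1-α$ concrete, the following is a minimal Python sketch of a generic weighted majority vote over prediction sets. It is not the authors' procedure: the function name `weighted_majority_set`, the one-half vote threshold, and the toy label sets are illustrative assumptions, and the sketch uses fixed weights rather than the data-dependent weights the abstract describes. For fixed weights, Markov's inequality gives the $1-2α$ bound for such a vote, while putting all weight on a single set returns that set and therefore its $1-α$ guarantee.

```python
def weighted_majority_set(sets, weights, labels):
    """Keep each label whose weighted share of votes exceeds one half.

    Hypothetical helper for illustration; not taken from the paper.
    """
    total = sum(weights)
    merged = set()
    for y in labels:
        # Weighted fraction of the individual prediction sets that contain y.
        vote = sum(w for s, w in zip(sets, weights) if y in s) / total
        if vote > 0.5:
            merged.add(y)
    return merged


# Toy example (assumed data): three prediction sets over three labels.
labels = ["cat", "dog", "fox"]
sets = [{"cat", "dog"}, {"cat"}, {"dog", "fox"}]

equal = weighted_majority_set(sets, [1, 1, 1], labels)
skewed = weighted_majority_set(sets, [0.1, 0.8, 0.1], labels)
print(equal, skewed)
```

With equal weights the merged set keeps the labels that two of the three sets agree on; with most of the weight on the second set, the merged set collapses toward that set alone.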

2512.13996 2026-06-03 cs.AI (updated version)

DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training

Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohi Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas

Affiliations: University of Electronic Science and Technology of China

AI summary: Proposes DTop-p, a dynamic routing mechanism that learns the Top-p probability threshold with a proportional-integral controller and applies dynamic routing normalization to support layer-wise expert selection under a global sparsity constraint; it consistently outperforms Top-k and fixed Top-p baselines at FLOPs comparable to Top-k MoE.

Abstract

Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-$k$ routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. Top-$p$ routing is more adaptive because it selects experts until their cumulative routing probability reaches a threshold, allowing confident tokens to use fewer experts and ambiguous tokens to recruit more. However, we demonstrate that existing naive Top-$p$ implementations with fixed global probability thresholds provide only marginal gains over Top-$k$, suffer from hyperparameter sensitivity, and result in uncontrolled computational costs. In this paper, we propose **DTop-$p$**, a sparsity-controllable dynamic routing mechanism that learns the Top-$p$ probability threshold with a Proportional-Integral controller and uses dynamic routing normalization to support layer-wise expert selection under a global sparsity constraint. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that **DTop-$p$** consistently outperforms both Top-$k$ and fixed Top-$p$ baselines while matching the average FLOPs of Top-$k$ MoE. Our analysis confirms that **DTop-$p$** exhibits strong scaling properties across expert granularity, total expert capacity, model size, and dataset size, offering a robust and efficient MoE framework for foundation model pre-training.
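
The Top-$p$ selection rule described in the abstract (add experts until their cumulative routing probability reaches a threshold) can be sketched in a few lines of Python. This is only an illustration of plain fixed-threshold Top-$p$ routing, not the DTop-$p$ mechanism itself: the function name `top_p_experts`, the threshold 0.7, and the two toy probability vectors are assumptions, and the learned threshold, PI controller, and routing normalization are not modeled.

```python
def top_p_experts(probs, p):
    """Return the smallest set of expert indices, taken in order of decreasing
    routing probability, whose cumulative probability reaches the threshold p.

    Hypothetical helper for illustration; not taken from the paper.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total >= p:
            break
    return chosen


# Assumed router outputs for two tokens over four experts.
confident_token = [0.85, 0.05, 0.05, 0.05]
ambiguous_token = [0.30, 0.28, 0.22, 0.20]

few = top_p_experts(confident_token, 0.7)
many = top_p_experts(ambiguous_token, 0.7)
print(few, many)
```

The confident token stops after one expert, while the ambiguous token needs three, which is the adaptivity (and the variable compute cost) that the abstract attributes to Top-$p$ routing.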

2512.11213 2026-06-03 cs.AI cs.CL (updated version)

Distribution-Calibrated Inference-Time Compute for Thinking LLM-as-a-Judge

Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh

Affiliations: University of California, Berkeley; DeepMind; University of Cambridge

AI summary: To address single-sample noise and inconsistent aggregation when thinking LLMs act as judges, the paper proposes a distribution-calibrated aggregation scheme based on the Bradley-Terry-Davidson model that uses polarity (margin among non-ties) and decisiveness (non-tie rate) to separate narrow majorities from strong consensus, reducing MAE, raising pairwise accuracy, and matching or exceeding individual human raters.

Abstract

Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking--rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
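
The two count statistics named in the abstract, and the three-way Davidson extension of the Bradley-Terry model, can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function names, the normalized-margin reading of "margin among non-ties", and the toy vote counts are not from the paper, and the paper's calibration and fitting procedure is not shown. The Davidson formula itself is the standard one, with tie probability proportional to $ν\sqrt{π_A π_B}$.

```python
import math


def polarity_and_decisiveness(wins_a, wins_b, ties):
    """Summarize n sampled ratings (assumed definitions, for illustration).

    decisiveness: fraction of samples that are not ties.
    polarity: normalized margin between A and B among the non-tie samples.
    """
    n = wins_a + wins_b + ties
    non_ties = wins_a + wins_b
    decisiveness = non_ties / n
    polarity = (wins_a - wins_b) / non_ties if non_ties else 0.0
    return polarity, decisiveness


def davidson_probs(strength_a, strength_b, nu):
    """Davidson tie model: returns (P(A wins), P(B wins), P(tie))."""
    tie_mass = nu * math.sqrt(strength_a * strength_b)
    z = strength_a + strength_b + tie_mass
    return strength_a / z, strength_b / z, tie_mass / z


# Assumed vote counts from 10 samples each: a narrow majority vs. a strong consensus.
narrow = polarity_and_decisiveness(3, 2, 5)
strong = polarity_and_decisiveness(8, 1, 1)
p_a, p_b, p_tie = davidson_probs(2.0, 1.0, 0.5)
print(narrow, strong, (p_a, p_b, p_tie))
```

Both toy cases have A ahead of B, but the first has a small margin and many ties while the second is decisive, which is the distinction a plain majority vote discards.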

2511.21731 2026-06-03 cs.CL cs.AI (updated version)

Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition

Diederik Aerts, Jonito Aerts Arguëlles, Lester Beltran, Suzette Geriente, Roberto Leporini, Massimiliano Sassoli de Bianchi, Sandro Sozzo

Affiliations: Center Leo Apostel for Interdisciplinary Studies, Vrije Universiteit Brussel (VUB); Department of Economics, University of Bergamo; Department of Humanities and Cultural Heritage (DIUM) and Centre CQSCS, University of Udine

AI summary: Cognitive tests on large language models find significant violations of Bell's inequalities and Bose-Einstein statistics in conceptual combinations, suggesting that non-classical quantum-like structures emerge in the conceptual-linguistic domain for both humans and AI, in support of an evolutionary-convergence hypothesis for cognition.

DOI: 10.3390/e28060622
Journal ref: Entropy 28, 622, 2026

Abstract

We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subjects. In the first test, performed with ChatGPT and Gemini, we show that Bell's inequalities are significantly violated, which indicates the presence of a 'non-classical probability model' with probabilities that do not satisfy Kolmogorov's axioms. In the second test, also performed using ChatGPT and Gemini, we identify the presence of 'Bose-Einstein statistics', rather than the intuitively expected 'Maxwell-Boltzmann statistics', in the distribution of the words contained in large-size texts. Interestingly, these findings mirror the results previously obtained in both cognitive tests with human participants and information retrieval tests on large corpora. Taken together, they point to the 'systematic emergence of non-classical quantum-like structures in conceptual-linguistic domains', regardless of whether the cognitive agent is human or artificial. Although LLMs are classified as neural networks for historical reasons, we believe that a more essential form of knowledge organization takes place in the distributive semantic structure of vector spaces built on top of the neural network. It is this meaning-bearing structure that lends itself to a phenomenon of evolutionary convergence between human cognition and language, slowly established through biological evolution, and LLM cognition and language, emerging much more rapidly as a result of self-learning and training. We analyze various aspects and examples that contain evidence supporting the above hypothesis. We also advance a unifying framework that explains the pervasive quantum organization of meaning that we identify.

2503.07265 2026-06-03 cs.CV cs.AI cs.CL (updated version)

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Fanqing Meng, Kunpeng Ning, Bin Zhu, Li Yuan

Affiliations: University of Science and Technology of China

AI summary: Because existing evaluations of text-to-image models lack assessment of complex semantic understanding and world-knowledge integration, the paper proposes the WISE benchmark, with 1,000 carefully designed prompts across 25 subdomains, and introduces the WiScore metric for knowledge-image alignment; experiments show that current models are markedly limited in integrating world knowledge.

Comments: Accepted to ICML 2026. We have also released an updated version of the benchmark, WISE_Verified. Please refer to https://github.com/PKU-YuanGroup/WISE for the latest version.

Abstract

Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.

2511.13020 2026-06-03 cs.CV cs.AI (updated version)

PHASE: Physiology-Aware Hyperspectral Reconstruction via Object-to-Human Domain Adaptation

Yufei Wen, Shuxing Zhong, Jingdan Kang, Yuting Zhang, Jintai Chen, Kaishun Wu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); South China University of Technology

AI summary: Because existing hyperspectral reconstruction methods break down in physiological imaging, the paper proposes the PHASE paradigm, which achieves object-to-human domain adaptation through Physiological Channel Reinterpretation and Physiologically Constrained Alignment and markedly improves reconstruction quality with only 1.5% labeled data.

Comments: To KDD26

Abstract

Although hyperspectral imaging offers unparalleled non-invasive physiological insight, its bulky hardware, slow acquisition, and regulatory burden severely limit its clinical availability. A natural workaround is to reconstruct hyperspectral information from ubiquitous RGB or CASSI measurements. However, existing paradigms, developed for object-centric scenes, rely on reflectance-based feature alignment, assuming that spectral similarity preserves semantic meaning. This assumption breaks down in physiological imaging, where visually similar RGB responses may arise from distinct and entangled physiological states. This mismatch motivates a shift from reflectance alignment to physiology-aware representation learning, grounded in shared light-matter interaction principles -- a shift that introduces fundamental challenges from cross-channel semantic shifts (C1) and irreversible information loss in RGB-based acquisition (C2). We therefore design PHASE, a physiology-aware hyperspectral reconstruction paradigm that fundamentally redefines object-to-human transfer by disentangling cross-channel physiological semantics via Physiological Channel Reinterpretation and restricting reconstruction to physiologically plausible solutions through Physiologically Constrained Alignment. Under two source-to-target transfer protocols, PHASE consistently outperforms state-of-the-art methods by up to +2.20 SSIM and -3.06 in SAM with merely 1.5% labeled supervision.

2511.02304 2026-06-03 cs.MA cs.AI cs.CL cs.FL cs.LG (updated version)

Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia

Affiliations: Massachusetts Institute of Technology; Stanford University

AI summary: Proposes an automata-conditioned cooperative multi-agent reinforcement learning framework that uses automata to decompose team objectives into sub-tasks and learns task-conditioned decentralized policies, enabling optimal task assignment and multi-step coordination.

Abstract

We study learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks assigned to agents enables breaking down a team-level objective into simpler, smaller sub-tasks. However, existing approaches remain sample-inefficient and are limited to the single-task case, requiring retraining policies for each new task. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify challenges to the feasibility of ACC-MARL, propose solutions, and prove that our approach is optimal. We further show that learned value functions can be used to assign tasks optimally at test time. Experiments demonstrate emergent task-aware, multi-step coordination among agents, such as pressing a button to unlock a door, holding the door, and short-circuiting tasks.

2510.23216 2026-06-03 cs.AI cs.LG (updated version)

Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach

Alessandro Sestini, Joakim Bergdahl, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Brady Chen, Fabio Zinno, Michael Jones, Linus Gisslén

Affiliations: University of Edinburgh; KTH Royal Institute of Technology; University of California, Berkeley

AI summary: Proposes a sample-efficient deep reinforcement learning method that leverages pre-collected data and increases network plasticity to train a goalkeeper agent in EA SPORTS FC 25; the agent's save rate is 10% higher than the built-in AI, training is 50% faster than standard DRL, and its behavior is more human-like.

Abstract

While several high-profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves the sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method by training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game's built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testament to the impact of the approach, the method has been adopted for use in the most recent release of the series.

2510.17149 2026-06-03 cs.AI (updated version)

ProtocolBench: Which LLM MultiAgent Protocol to Choose?

Hongyi Du, Jiaqi Su, Jisen Li, Lijie Ding, Yingxuan Yang, Peixuan Han, Xiangru Tang, Kunlun Zhu, Jiaxuan You

AI summary: Proposes the ProtocolBench benchmark, which systematically compares multi-agent protocols on task success rate, latency, overhead, and robustness, and designs ProtocolRouter, a learnable protocol router that dynamically selects the best protocol.

Comments: Accepted to ICML 2026. Camera-ready version. Code and benchmark artifacts: https://github.com/ulab-uiuc/AgentProtocols

Abstract

As large-scale multi-agent systems evolve, the communication protocol layer has become a critical yet under-evaluated factor shaping performance and reliability. Despite the existence of diverse protocols (A2A, ACP, ANP, Agora, etc.), selection is often intuition-driven and lacks standardized guidance. We introduce ProtocolBench, a benchmark that systematically compares agent protocols along four measurable axes: task success, end-to-end latency, message or byte overhead, and robustness under failures. On ProtocolBench, protocol choice significantly influences system behavior. In the Streaming Queue scenario, overall completion time varies by up to 36.5% across protocols, and mean end-to-end latency differs by 3.48 s. Under Fail-Storm Recovery, resilience also differs consistently across protocols. Beyond evaluation, we present ProtocolRouter, a learnable protocol router that selects per-scenario (or per-module) protocols from requirement and runtime signals. ProtocolRouter reduces Fail-Storm recovery time by up to 18.1% versus the best single-protocol baseline, and achieves scenario-specific gains such as higher success in GAIA. We also release ProtocolRouterBench to standardize protocol evaluation and improve reliability at scale.

2510.16302 2026-06-03 cs.AI cs.IR (updated version)

DTKG: Dual-Track Knowledge Graph-Verified Reasoning Framework for Multi-Hop QA

Changhao Wang, Yanfang Liu, Xinxin Fan, Ao Tian, Lanzhi Zhou, Yunfeng Lu

Affiliations: School of Computer Science and Engineering, Beihang University, Beijing, China; School of Reliability and Systems Engineering, Beihang University, Beijing, China; State Key Laboratory of Complex & Critical Software Environment; National Key Laboratory of Reliability; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences

AI summary: Proposes the DTKG framework, which uses a classification stage and a branch processing stage to handle parallel fact verification and chained multi-hop reasoning separately, improving the efficiency and accuracy of multi-hop QA.

Comments: Accepted to ICML 2026

Abstract

Multi-hop reasoning for question answering (QA) plays a critical role in retrieval-augmented generation (RAG) for modern large language models (LLMs). The accurate answer can be obtained through retrieving relational structure of entities from knowledge graph (KG). Regarding the inherent relation-dependency and reasoning pattern, multi-hop reasoning can be in general classified into two categories: i) parallel fact-verification multi-hop reasoning question, i.e., requiring simultaneous verifications of multiple independent sub-questions; and ii) chained multi-hop reasoning questions, i.e., demanding sequential multi-step inference with intermediate conclusions serving as essential premises for subsequent reasoning. Currently, the multi-hop reasoning approaches singly employ one of two techniques: LLM response-based fact verification and KG path-based chain construction. Nevertheless, the former excels at parallel fact-verification but underperforms on chained reasoning tasks, while the latter demonstrates proficiency in chained multi-hop reasoning but suffers from redundant path retrieval when handling parallel fact-verification reasoning. These limitations deteriorate the efficiency and accuracy for multi-hop QA tasks. To address this challenge, we propose a novel dual-track KG verification and reasoning framework DTKG, which is inspired by the Dual Process Theory in cognitive science. Specifically, DTKG comprises two main stages: the Classification Stage and the Branch Processing Stage.

2505.08222 2026-06-03 cs.RO cs.AI cs.DC cs.PF (updated version)

Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles

Matteo Gallici, Ivan Masmitja, Mario Martín

Affiliations: KEMLG Research Group, Universitat Politècnica de Catalunya, Barcelona, Spain; Instituto de Ciencias del Mar, Consejo Superior de Investigaciones Científicas, Barcelona, Spain; KEMLG Research Group, Universitat Politècnica de Catalunya (UPC), and with the HPAI group at Barcelona Supercomputing Center (BSC), Barcelona, Spain

AI summary: Proposes a GPU-accelerated environment (up to 30,000x speedup) and a Transformer-based MARL architecture (TransfMAPPO) for underwater tracking of multiple fast-moving targets, keeping the tracking error below 5 m.

Abstract

Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL) has emerged as a powerful method for controlling AVs, but scaling to fleets (essential for multi-target tracking or rapidly moving targets) is challenging. Multi-Agent RL (MARL) is notoriously sample-inefficient, and while high-fidelity simulators like Gazebo's LRAUV provide up to 100x faster-than-real-time single-robot simulations, they offer little speedup in multi-vehicle scenarios, making MARL training impractical. Yet, high-fidelity simulation is crucial to test complex policies and close the sim-to-real gap. To address these limitations, we develop a GPU-accelerated environment that achieves up to 30,000x speedup over Gazebo while preserving its dynamics. This enables fast, end-to-end GPU training and seamless transfer to Gazebo for evaluation. We also introduce a Transformer-based architecture (TransfMAPPO) that learns policies invariant to fleet size and number of targets, enabling curriculum learning to train larger fleets on increasingly complex scenarios. After large-scale GPU training, we perform extensive evaluations in Gazebo, showing our method maintains tracking errors below 5m even with multiple fast-moving targets.

2510.09845 2026-06-03 cs.LG cs.AI cs.CV (updated version)

Harnessing Self-Supervised Deep Learning and Geostationary Remote Sensing for Advancing Wildfire and Associated Air Quality Monitoring: Improved Smoke and Fire Front Masking using GOES and TEMPO Radiance Data

Nicholas LaHaye, Thilanka Munashinge, Hugo Lee, Xiaohua Pan, Gonzalo Gonzalez Abad, Hazem Mahmoud, Jennifer Wei

AI summary: This study uses hourly data from NASA's TEMPO satellite mission and self-supervised deep learning to present a system that effectively distinguishes smoke from clouds using GOES-18 and TEMPO data, maps wildfire fronts and smoke plumes in real time, and substantially outperforms existing operational products.

Comments https://2025.ieeeigarss.org/view_paper.php?PaperNum=6389&SessionID=1611

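2510.09711 2026-06-03 cs.CL cs.AI (updated version)

ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models

Wenbin Guo, Xin Wang, Jiaoyan Chen, Lingbing Guo, Zhao Li, Zirui Chen

Affiliations: Tianjin University; The University of Manchester

AI summary: Proposes ReaLM, a framework that uses residual vector quantization to discretize knowledge graph embeddings into learnable tokens added to the LLM vocabulary and applies ontology constraints to align structured knowledge semantically with the language model, achieving state-of-the-art performance on knowledge graph completion.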

Abstract

Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning and generalization capabilities beyond traditional embedding-based approaches. However, existing LLM-based methods often struggle to fully exploit structured semantic representations, as the continuous embedding space of pretrained KG models is fundamentally misaligned with the discrete token space of LLMs. This discrepancy hinders effective semantic transfer and limits their performance. To address this challenge, we propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization through the mechanism of residual vector quantization. ReaLM discretizes pretrained KG embeddings into compact code sequences and integrates them as learnable tokens within the LLM vocabulary, enabling seamless fusion of symbolic and contextual knowledge. Furthermore, we incorporate ontology-guided class constraints to enforce semantic consistency, refining entity predictions based on class-level compatibility. Extensive experiments on two widely used benchmark datasets demonstrate that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.
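
The residual vector quantization step mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not ReaLM's implementation: the two tiny codebooks, the 2-D embedding, and the function names `rvq_encode` and `rvq_decode` are all hypothetical, chosen only to show how each stage quantizes the residual left by the previous one so that an embedding becomes a short sequence of discrete codes.

```python
# Minimal residual vector quantization (RVQ) sketch.
# Codebooks are hand-written toys (an assumption), not learned ones.

def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared L2 distance)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],      # coarse stage
    [[0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],  # fine stage
]
embedding = [0.9, 0.3]                   # stand-in for a pretrained KG embedding
codes = rvq_encode(embedding, codebooks)
approx = rvq_decode(codes, codebooks)
print(codes, approx)
```

In a setup like the one the abstract describes, each code index would presumably map to a new token in the LLM vocabulary; that mapping is outside the scope of this sketch.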

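2509.09685 2026-06-03 cs.IR cs.AI cs.MM cs.SD eess.AS (updated version)

TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation

Keunwoo Choi, Seungheon Doh, Juhan Nam

Affiliations: KAIST

AI summary: Proposes TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline, in which multiple role-playing LLMs simulate conversations across varied scenarios to support the training of generative recommendation models.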

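2510.03316 2026-06-03 cs.CV cs.AI cs.LG (updated version)

The View From Space: Navigating Instrumentation Differences with EOFMs

Ryan P. Demilt, Nicholas LaHaye, Karis Tenneson

Affiliations: Spatial Informatics Group

AI summary: By analyzing how sensitive Earth Observation Foundation Models (EOFMs) are to sensor architecture, this study exposes pitfalls in current model design and points to ways forward for model developers, users, and the remote-sensing science community.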

Journal ref: https://neurips.cc/virtual/2025/loc/san-diego/122891

Abstract

Earth Observation Foundation Models (EOFMs) have exploded in prevalence as tools for processing the massive volumes of remotely sensed and other earth observation data, and for delivering impact on the many essential earth monitoring tasks. An emerging trend posits using the outputs of pre-trained models as 'embeddings' which summarize high dimensional data to be used for generic tasks such as similarity search and content-specific queries. However, most EOFM models are trained only on single modalities of data and then applied or benchmarked by matching bands across different modalities. It is not clear from existing work what impact diverse sensor architectures have on the internal representations of the present suite of EOFMs. We show in this work that the representation space of EOFMs is highly sensitive to sensor architecture and that understanding this difference gives a vital perspective on the pitfalls of current EOFM design and signals for how to move forward as model developers, users, and a community guided by robust remote-sensing science.

2510.01377 2026-06-03 math.OC cs.AI cs.LG cs.MA cs.SY eess.SY (updated version)

DeMuon: A Decentralized Muon for Matrix Optimization over Graphs

Chuan He, Shuyi Ren, Jingwei Mao, Erik G. Larsson

Affiliations: Department of Mathematics, Linköping University; Department of Electrical Engineering, Linköping University; Department of Computer and Information Science, Linköping University

AI summary: Proposes DeMuon, which orthogonalizes matrices via Newton-Schulz iterations and uses gradient tracking to handle heterogeneity among local functions. Under heavy-tailed noise its complexity matches that of centralized algorithms, and it is the first extension of Muon to decentralized optimization over graphs with provable complexity guarantees.

Comments Add an accelerated variant of the proposed method. New proofs of proposed methods

Abstract

In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations, a technique inherited from its centralized predecessor Muon, and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate a clear margin of improvement of DeMuon over other popular decentralized algorithms across different network topologies.
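
To make the orthogonalization step concrete, here is a minimal pure-Python sketch of the classical cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·X·Xᵀ·X, applied after Frobenius normalization. It is an illustration under stated assumptions only: the coefficients, step count, and helper names are not taken from the paper, and Muon-style optimizers may use a different polynomial.

```python
# Classical Newton-Schulz iteration toward the orthogonal polar factor.
# Assumption: cubic variant with coefficients (1.5, -0.5); the paper may differ.

def matmul(a, b):
    """Plain dense matrix product for lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=15):
    """Approximate the orthogonal factor of a full-rank square matrix g."""
    # Frobenius normalization puts every singular value in (0, 1],
    # which lies inside the convergence region of the iteration.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x

gradient = [[2.0, 0.0], [1.0, 1.0]]   # hypothetical gradient matrix
q = newton_schulz_orthogonalize(gradient)
gram = matmul(q, transpose(q))        # should be close to the identity
print(gram)
```

Each step pushes every singular value toward 1 without computing an SVD, which is why this kind of iteration is attractive when only matrix products are cheap.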

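2509.22468 2026-06-03 cs.LG cs.AI (updated version)

Learning the Neighborhood: Contrast-Free Multimodal Self-Supervised Molecular Graph Pretraining

Boshra Ariguib, Mathias Niepert, Andrei Manolache

Affiliations: University of Tübingen

AI summary: Proposes C-FREE, a framework that predicts subgraph embeddings from their complementary neighborhoods and fuses 2D topology with 3D conformer information, enabling contrast-free, negative-free multimodal self-supervised pretraining on molecular graphs with state-of-the-art results on MoleculeNet.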

Comments Accepted at ICML 2026

Abstract

High-quality molecular representations are essential for property prediction and molecular design, yet large labeled datasets remain scarce. While self-supervised pretraining on molecular graphs has shown promise, many existing approaches either depend on hand-crafted augmentations or complex generative objectives, and often rely solely on 2D topology, leaving valuable 3D structural information underutilized. To address this gap, we introduce C-FREE (Contrast-Free Representation learning on Ego-nets), a simple framework that integrates 2D graphs with ensembles of 3D conformers. C-FREE learns molecular representations by predicting subgraph embeddings from their complementary neighborhoods in the latent space, using fixed-radius ego-nets as modeling units across different conformers. This design allows us to integrate both geometric and topological information within a hybrid Graph Neural Network (GNN)-Transformer backbone, without negatives, positional encodings, or expensive pre-processing. Pretraining on the GEOM dataset, which provides rich 3D conformational diversity, C-FREE achieves state-of-the-art results on MoleculeNet, surpassing contrastive, generative, and other multimodal self-supervised methods. Fine-tuning across datasets with diverse sizes and molecule types further demonstrates that pretraining transfers effectively to new chemical domains, highlighting the importance of 3D-informed molecular representations.

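2509.19305 2026-06-03 cs.LG cs.AI eess.SP (updated version)

Wavelet Fourier Diffuser: Frequency-Aware Diffusion Model for Reinforcement Learning

Yifu Luo, Yongzhe Chang, Xueqian Wang

Affiliations: Tsinghua University, China

AI summary: To address the frequency shift caused by existing diffusion models ignoring frequency-domain features in offline reinforcement learning, proposes WFDiffuser, which decomposes trajectories with the Discrete Wavelet Transform and strengthens frequency-domain modeling with the Short-Time Fourier Transform and cross attention. On the D4RL benchmark it mitigates frequency shift and improves trajectory stability and decision-making performance.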

Comments IJCNN 2025

Journal ref: IJCNN 2025

Abstract

Diffusion probabilistic models have shown significant promise in offline reinforcement learning by directly modeling trajectory sequences. However, existing approaches primarily focus on time-domain features while overlooking frequency-domain features, which, according to our observations, leads to frequency shift and degraded performance. In this paper, we investigate the RL problem from a new perspective of the frequency domain. We first observe that time-domain-only approaches inadvertently introduce shifts in the low-frequency components of the frequency domain, which results in trajectory instability and degraded performance. To address this issue, we propose Wavelet Fourier Diffuser (WFDiffuser), a novel diffusion-based RL framework that integrates the Discrete Wavelet Transform to decompose trajectories into low- and high-frequency components. To further enhance diffusion modeling for each component, WFDiffuser employs the Short-Time Fourier Transform and cross-attention mechanisms to extract frequency-domain features and facilitate cross-frequency interaction. Extensive experimental results on the D4RL benchmark demonstrate that WFDiffuser effectively mitigates frequency shift, leading to smoother, more stable trajectories and improved decision-making performance over existing methods.
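
The wavelet decomposition described above can be sketched with a single-level Haar transform in pure Python. The choice of the Haar wavelet, the toy 1-D trajectory, and the function names are assumptions for illustration; the abstract does not state which wavelet WFDiffuser uses.

```python
import math

# Single-level Haar discrete wavelet transform (an assumed wavelet choice).
# approx holds the low-frequency part, detail the high-frequency part.

def haar_dwt(signal):
    """Split an even-length sequence into approximation and detail coefficients."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Invert haar_dwt, recovering the original sequence."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

trajectory = [1.0, 2.0, 4.0, 3.0, 0.0, 1.0]   # hypothetical 1-D state sequence
low, high = haar_dwt(trajectory)
restored = haar_idwt(low, high)
print(low, high, restored)
```

Because the transform is invertible, a model can process the low- and high-frequency components separately and then recombine them into a full trajectory.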

2509.11323 2026-06-03 cs.CV cs.AI (updated version)

Motion Estimation for Multi-Object Tracking using KalmanNet with Semantic-Independent Encoding

Jian Song, Wei Mei, Yunfeng Xu, Qiang Fu, Renke Kou, Lina Bu, Yucheng Long

AI summary: Proposes Semantic-Independent KalmanNet (SIKNet), which improves motion estimation with a Semantic-Independent Encoder (SIE) and is more robust and accurate in MOT than the traditional Kalman filter and existing learning-aided filters.

DOI: 10.1016/j.inffus.2026.104513

Abstract

Motion estimation is a crucial component in multi-object tracking (MOT). It predicts the trajectory of objects by analyzing the changes in their positions across consecutive image frames, reducing tracking failures and identity switches. The Kalman filter (KF) based on the linear constant-velocity model is one of the most commonly used methods in MOT. However, it may yield unsatisfactory results when the KF's parameters are mismatched and objects move non-stationarily. In this work, we use a learning-aided filter to handle motion estimation in MOT. In particular, we propose a novel method named Semantic-Independent KalmanNet (SIKNet), which encodes the state vector (the input feature) using a Semantic-Independent Encoder (SIE) in two steps. First, the SIE uses a 1D convolution with a kernel size of 1, which convolves along the dimension of homogeneous-semantic elements across different state vectors to encode independent semantic information. Then it employs a fully connected layer and a nonlinear activation layer to encode nonlinear and cross-dependency information between heterogeneous-semantic elements. To independently evaluate the performance of the motion estimation module in MOT, we constructed a large-scale semi-simulated dataset from several open-source MOT datasets. Experimental results demonstrate that the proposed SIKNet outperforms the traditional KF and achieves greater robustness and accuracy than existing learning-aided filters. The code is available at https://github.com/SongJgit/filternet and https://github.com/SongJgit/TBDTracker.
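
For context, the baseline the abstract refers to, a Kalman filter with a linear constant-velocity model, can be sketched in pure Python for a single coordinate. This is not SIKNet; the state layout `[position, velocity]`, the simplified process noise `q * I`, and all numeric values are assumptions made for illustration.

```python
# Constant-velocity Kalman filter for one coordinate (baseline sketch, not SIKNet).
# State x = [position, velocity]; only position is measured (H = [1, 0]).

def kf_predict(x, P, dt, q):
    """Propagate state and covariance with F = [[1, dt], [0, 1]] and Q = q * I."""
    pos, vel = x
    x_pred = [pos + dt * vel, vel]
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    P_pred = [
        [p00 + dt * (p01 + p10) + dt * dt * p11 + q, p01 + dt * p11],
        [p10 + dt * p11, p11 + q],
    ]
    return x_pred, P_pred

def kf_update(x, P, z, r):
    """Correct the prediction with a position measurement z of variance r."""
    innovation = z - x[0]
    s = P[0][0] + r                       # innovation covariance
    k0, k1 = P[0][0] / s, P[1][0] / s     # Kalman gain
    x_new = [x[0] + k0 * innovation, x[1] + k1 * innovation]
    P_new = [
        [(1.0 - k0) * P[0][0], (1.0 - k0) * P[0][1]],
        [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]],
    ]
    return x_new, P_new

# Hypothetical target moving 2 units per frame, observed without noise.
x, P = [0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]]
for k in range(1, 31):
    x, P = kf_predict(x, P, dt=1.0, q=1e-4)
    x, P = kf_update(x, P, z=2.0 * k, r=0.01)
print(x)
```

The filter works well here because the motion really is constant-velocity and `q` and `r` are reasonable; the abstract's point is that results degrade when those parameters are mismatched or the motion is non-stationary.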

2508.13174 2026-06-03 cs.AI cs.LG q-fin.CP stat.ML (updated version)

AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining

Hongjun Ding, Binqi Chen, Jinsheng Huang, Taian Guo, Zhengyang Mao, Guoyi Shao, Lutong Zou, Luchen Liu, Ming Zhang

Affiliations: CUNY Baruch College; Peking University; Harvard University; Zhengren Research; Zhengren Quant

AI summary: Proposes AlphaEval, a unified, parallelizable, and backtest-free framework that evaluates automated alpha mining models along five dimensions (predictive power, stability, robustness, financial logic, and diversity), achieving evaluation consistency comparable to backtesting with higher efficiency.

Comments Accepted by KDD2026

DOI: 10.1145/3770855.3817727

Abstract

Formula alpha mining, which generates predictive signals from financial data, is critical for quantitative investment. Although various algorithmic approaches, such as genetic programming, reinforcement learning, and large language models, have significantly expanded the capacity for alpha discovery, systematic evaluation remains a key challenge. Existing evaluation metrics predominantly include backtesting and correlation-based measures. Backtesting is computationally intensive, inherently sequential, and sensitive to specific strategy parameters. Correlation-based metrics, though efficient, assess only predictive ability and overlook other crucial properties such as temporal stability, robustness, diversity, and interpretability. Additionally, the closed-source nature of most existing alpha mining models hinders reproducibility and slows progress in this field. To address these issues, we propose AlphaEval, a unified, parallelizable, and backtest-free evaluation framework for automated alpha mining models. AlphaEval assesses the overall quality of generated alphas along five complementary dimensions: predictive power, stability, robustness to market perturbations, financial logic, and diversity. Extensive experiments across representative alpha mining algorithms demonstrate that AlphaEval achieves evaluation consistency comparable to comprehensive backtesting, while providing more comprehensive insights and higher efficiency. Furthermore, AlphaEval effectively identifies superior alphas compared to traditional single-metric screening approaches. All implementations and evaluation tools are open-sourced to promote reproducibility and community engagement.

2507.19684 2026-06-03 cs.LG cs.AI cs.CL cs.CV (updated version)

CoMPAS3D: A Dataset and Benchmark for Interactive Motion

Bermet Burkanova, Yasaman Etesam, Payam Jome Yazdian, Trinity Evans, Chuxuan Zhang, Zoe Stanley, Paige Tuttösí, Angelica Lim

Affiliations: School of Computing Science, Simon Fraser University

AI summary: Proposes the CoMPAS3D dataset and evaluation framework, which use objective metrics such as move legibility and proficiency appropriateness to address the lack of social-context evaluation in interactive motion generation.

Comments https://rosielab.github.io/compas3d

Abstract

Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, and abilities. This requires models that understand not just how bodies move, but what movement means in a shared social context. Yet evaluation frameworks for interactive motion generation do not measure whether generated follower motion is legible within a shared movement vocabulary, nor whether it is appropriate to the partner's proficiency level. This gap has two causes: existing frameworks rely on kinematic metrics such as FID and beat alignment that cannot measure either property, and existing datasets lack the move annotations and proficiency variation needed. Salsa is well-suited as an evaluation domain: improvised, dyadic, and governed by a move vocabulary and judging criteria covering timing, musicality, technique, difficulty, partnering, and originality. We present CoMPAS3D, a motion capture dataset of improvised partner salsa paired with an evaluation framework covering kinematic quality, two objective metrics (move legibility and proficiency appropriateness), and six competition-based subjective dimensions. The dataset includes 3 hours of improvisation by 18 dancers spanning beginner, intermediate, and professional levels, with over 2,800 expert-annotated segments covering move types, errors, and stylistic elements. We define three benchmarks: move classification (analogous to transcription), proficiency estimation (fluency assessment), and follower generation (dialogue response). Fine-tuned vision-language models perform strongly on objective metrics applied to ground-truth motion sequences. Applied to Duolando and InterGen, the metrics reveal failures that kinematic metrics miss. Human evaluations confirm the gap between generated and ground-truth motion. CoMPAS3D, annotations, benchmark code, and baseline results are publicly available.

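2506.21129 2026-06-03 cs.LG cs.AI (updated version)

Curriculum-Adapted Robust Reinforcement Learning for UAV Deconfliction in Adversarial Environments

Deepak Kumar Panda, Adolfo Perrusquia, Weisi Guo

Affiliations: Faculty of Engineering and Applied Sciences, Cranfield University

AI summary: Proposes a curriculum-guided adaptation framework that progressively exposes the policy to gradient-based adversarial observation perturbations while aligning temporal-difference error distributions, improving UAV robustness and generalization under GNSS spoofing attacks.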

Abstract

Autonomous unmanned aerial vehicles (UAVs) increasingly rely on reinforcement learning (RL) for navigation. However, global navigation satellite system (GNSS) spoofing attacks can induce out-of-distribution observation shifts that corrupt value estimation and degrade mission performance. Existing robust RL approaches typically improve resilience against specific attack models but often fail to generalize to attacks not encountered during training. To address this limitation, we propose a curriculum-guided adaptation framework that progressively exposes a robust policy to gradient-based adversarial observation perturbations of increasing intensity while aligning temporal-difference (TD) error distributions across curriculum stages. Rather than adapting to a particular attack model, the proposed approach preserves TD-error consistency to promote transferability across attack conditions. We further derive a TD-space generalization certificate showing that if the TD-error distribution induced by a test-time attack remains sufficiently close to that of the final curriculum stage, the resulting performance degradation is bounded. The framework is evaluated in a UAV deconfliction environment with dynamic 3D obstacles under previously unseen fixed and dynamic GNSS spoofing attacks. Under fixed spoofing conditions, the curriculum-adapted policy achieved near-perfect mission success rates, compared with 20-56% for standard and robust RL baselines. Under dynamic obstacle-luring spoofing attacks, it achieved the highest episodic rewards while reducing mission completion steps by up to 45% across increasing aerial traffic densities.

2506.01969 2026-06-03 cs.DC cs.AI cs.LG (updated version)

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs

Pengcuo Dege, Qiuming Luo, Rui Mao, Chang Kong

Affiliations: Tencent; College of Computer Science and Software Engineering, Shenzhen University; College of Artificial Intelligence, Shenzhen Polytechnic University

AI summary: To address inefficient Multi-Head Latent Attention (MLA) inference when deploying the DeepSeek-R1 671B model on a single multi-GPU server, proposes FlashMLA-ETAP, which reconfigures attention computation through the Efficient Transpose Attention Pipeline (ETAP) and achieves a 2.78x speedup on NVIDIA H20 GPUs while maintaining numerical stability.

Comments Accepted by ICONIP2025

Abstract

Deploying the DeepSeek-R1 671B model on a single multi-GPU server makes efficient inference of Multi-Head Latent Attention (MLA) challenging. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition to align the KV context length with the $M$-dimension in WGMMA operations, significantly reducing redundant computations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE ($1.25 \times 10^{-5}$) than FlashAttention-3. Furthermore, ETAP's design enables seamless integration into frameworks like FlashAttention-3 and FlashInfer, supported by a detailed theoretical analysis. Our work addresses a critical gap in resource-constrained inference, offering a scalable solution for mid-tier GPUs and paving the way for broader adoption in hardware-aware optimization. Code is available at https://github.com/pengcuo/FlashMLA-ETAP.

2506.03087 2026-06-03 cs.LG cs.AI (updated version)

Do Explanations Increase the Risk of Decision Logic Leakage? Explanation-Guided Stealing of Graph Models

Bin Ma, Yuyuan Feng, Minhua Lin, Enyan Dai

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Xiamen University; The Pennsylvania State University

AI summary: Studies the risk that explanation mechanisms leak the decision logic of graph neural networks and proposes a model-stealing framework that combines explanation alignment with data augmentation; experiments show it outperforms conventional methods.

Abstract

Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financial analysis, leading to a growing demand for model transparency. Recent advances in explainable GNNs have addressed this need by revealing important subgraphs that influence predictions, but these explanation mechanisms may inadvertently expose these models to security risks. This paper investigates how such explanations potentially leak critical decision logic that can be exploited for model stealing. We propose a novel stealing framework that integrates explanation alignment for capturing decision logic with guided data augmentation for efficient training under limited queries, enabling effective replication of both the predictive behavior and underlying reasoning patterns of target models. Experiments on molecular graph datasets demonstrate that our approach shows advantages over conventional methods in model stealing. This work highlights important security considerations for the deployment of explainable GNNs in sensitive domains and suggests the need for protective measures against explanation-based attacks. Our code is available at https://github.com/beanmah/EGSteal.

2505.20853 2026-06-03 cs.LG cs.AI (updated version)

Cooperation of Experts: Fusing Heterogeneous Information with Large Margin

Shuo Wang, Shunyang Huang, Jinghui Yuan, Zhixiang Shen, Zhao Kang

AI summary: Proposes the Cooperation of Experts framework, which fuses heterogeneous information through a large margin mechanism and encodes multi-typed data into unified heterogeneous multiplex networks to extract robust, complementary knowledge.

Comments Accepted at the 42nd International Conference on Machine Learning (ICML 2025)

Journal ref: Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:63169-63185, 2025

Abstract

Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By overcoming modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel large margin mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework's feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability. Our code is available at https://github.com/strangeAlan/CoE.

2502.08006 2026-06-03 cs.LG cs.AI stat.ML 版本更新

Greed is Good: A Unifying Perspective on Guided Generation

贪婪即美德：引导生成的统一视角

Zander W. Blasingame, Chen Liu

AI总结本文通过将后验引导视为端到端引导的贪婪策略，统一了两种梯度引导方法，并提出了在计算与精度之间权衡的插值方法，在逆图像问题和分子生成任务上验证了有效性。

Comments Accepted at NeurIPS 2025

详情

AI中文摘要

无训练引导生成是一种广泛使用且强大的技术，允许最终用户对流/扩散模型的生成过程施加进一步控制。一般来说，针对基于梯度的引导，已经出现了两种技术系列：即后验引导（即通过目标预测模型将当前样本投影到目标分布进行引导）和端到端引导（即通过在整个ODE求解过程中执行反向传播进行引导）。在这项工作中，我们表明这两个看似分离的系列实际上可以通过将后验引导视为端到端引导的贪婪策略来统一。我们探索了这两个系列之间的理论联系，并深入分析了这两种技术相对于连续理想梯度的关系。基于这一分析，我们提出了一种在这两个系列之间插值的方法，从而在引导梯度的计算与精度之间实现权衡。然后，我们在几个逆图像问题和性质引导的分子生成任务上验证了这项工作。

英文摘要

Training-free guided generation is a widely used and powerful technique that allows the end user to exert further control over the generative process of flow/diffusion models. Generally speaking, two families of techniques have emerged for solving this problem for gradient-based guidance: namely, posterior guidance (i.e., guidance via projecting the current sample to the target distribution via the target prediction model) and end-to-end guidance (i.e., guidance by performing backpropagation throughout the entire ODE solve). In this work, we show that these two seemingly separate families can actually be unified by looking at posterior guidance as a greedy strategy of end-to-end guidance. We explore the theoretical connections between these two families and provide an in-depth theoretical analysis of these two techniques relative to the continuous ideal gradients. Motivated by this analysis, we then show a method for interpolating between these two families, enabling a trade-off between compute and accuracy of the guidance gradients. We then validate this work on several inverse image problems and property-guided molecular generation.

2412.17484 2026-06-03 cs.DC cs.AI 版本更新

Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

面向GPU数据中心的功耗与碎片感知在线调度

Francesco Lettich, Emanuele Carlini, Franco Maria Nardini, Raffaele Perego, Salvatore Trani

发表机构 * Istituto di Scienza e Tecnologie dell’Informazione "Alessandro Faedo", Consiglio Nazionale delle Ricerche（“亚历山德罗·法埃多”信息科学与技术研究所，意大利国家研究委员会）

AI总结针对GPU数据中心在线调度问题，提出PWR调度策略，结合碎片梯度下降（FGD）方法，在降低功耗和最小化GPU碎片之间取得平衡。

Comments This work has been submitted to the IEEE for possible publication

详情

DOI: 10.1109/CCGRID64434.2025.00015

AI中文摘要

人工智能和大语言模型的兴起推动了数据中心中GPU在复杂训练和推理任务中的使用增加，影响了大规模计算基础设施的运营成本、能源需求和环境足迹。本文解决了GPU数据中心中的在线调度问题，即在不知道任务未来到达时间的情况下进行调度。我们关注两个目标：最小化GPU碎片和降低功耗。当数据中心接近满容量时，部分GPU分配会阻碍剩余资源的有效利用，从而产生GPU碎片。最近的调度策略FGD（碎片梯度下降）利用碎片度量来解决这个问题。由于GPU的功耗需求巨大，降低功耗也至关重要。为此，我们提出了PWR，一种新颖的调度策略，通过选择功耗高效的GPU和CPU组合来最小化功耗。这涉及到一个简化的功耗测量模型，该模型集成到Kubernetes评分插件中。通过在模拟集群中的广泛实验评估，我们展示了PWR与FGD结合时，如何在降低功耗和最小化GPU碎片之间实现平衡的权衡。

英文摘要

The rise of Artificial Intelligence and Large Language Models is driving increased GPU usage in data centers for complex training and inference tasks, impacting operational costs, energy demands, and the environmental footprint of large-scale computing infrastructures. This work addresses the online scheduling problem in GPU datacenters, which involves scheduling tasks without knowledge of their future arrivals. We focus on two objectives: minimizing GPU fragmentation and reducing power consumption. GPU fragmentation occurs when partial GPU allocations hinder the efficient use of remaining resources, especially as the datacenter nears full capacity. A recent scheduling policy, Fragmentation Gradient Descent (FGD), leverages a fragmentation metric to address this issue. Reducing power consumption is also crucial due to the significant power demands of GPUs. To this end, we propose PWR, a novel scheduling policy to minimize power usage by selecting power-efficient GPU and CPU combinations. This involves a simplified model for measuring power consumption integrated into a Kubernetes score plugin. Through an extensive experimental evaluation in a simulated cluster, we show how PWR, when combined with FGD, achieves a balanced trade-off between reducing power consumption and minimizing GPU fragmentation.

2412.01282 2026-06-03 cs.CV cs.AI 版本更新

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement

Align-KD：为移动视觉语言模型增强提取跨模态对齐知识

Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

发表机构 * State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, China（通用人工智能国家重点实验室，智能科学与技术学院，北京大学，中国）； Huawei Noah’s Ark Lab, China（华为诺亚方舟实验室，中国）

AI总结提出Align-KD方法，通过蒸馏教师模型浅层跨模态对齐知识，指导1.7B学生模型学习视觉-文本匹配，在6个基准上平均提升2.0分。

Comments CVPR 2025 Paper

详情

AI中文摘要

视觉语言模型（VLM）为多模态任务带来了强大的理解和推理能力。同时，移动设备对强大人工智能的需求也日益增长，例如AI助手软件。一些工作试图将VLM迁移到边缘设备以扩展其应用范围。简化模型结构是一种常见方法，但随着模型缩小，性能与大小之间的权衡变得越来越困难。知识蒸馏（KD）可以帮助模型在不增加大小或数据量的情况下提升综合能力。然而，现有的大模型蒸馏技术大多只考虑单模态LLM的应用，或者仅使用教师为学生创建新的数据环境。这些方法都没有考虑VLM中最重要的跨模态对齐知识的蒸馏。我们提出了一种名为Align-KD的方法，引导学生模型学习发生在浅层的跨模态匹配。教师还帮助学生基于文本的关注点学习将视觉标记投影到文本嵌入空间。在Align-KD的指导下，1.7B的MobileVLM V2模型能够从7B教师模型中学习丰富的知识，且训练损失设计轻量，在两个训练子集上分别在6个基准上平均得分提升2.0。代码地址：https://github.com/fqhank/Align-KD。

英文摘要

Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, demand is growing for capable artificial intelligence on mobile devices, such as AI assistant software. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common method, but as the model shrinks, the trade-off between performance and size becomes more and more difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most of the existing large model distillation techniques only consider applications on single-modal LLMs, or only use teachers to create new data environments for students. None of these methods takes into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. The teacher also helps the student learn the projection of vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training-loss design, and achieve an average score improvement of 2.0 across 6 benchmarks under two training subsets respectively. Code is available at: https://github.com/fqhank/Align-KD.

2409.08958 2026-06-03 cs.LG cs.AI physics.comp-ph physics.flu-dyn 版本更新

PINNfluence: Interpreting PINNs through Influence Functions

PINNfluence: 通过影响函数解释 PINN

Aleksander Krasowski, Jonas R. Naujoks, Moritz Weckbecker, Galip Ü. Yolcu, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek, René P. Klausen

发表机构 * Technical University of Munich（慕尼黑技术大学）； Max Planck Institute for Intelligent Systems（智能系统马克斯·普朗克研究所）； University of Tübingen（图宾根大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结提出 PINNfluence 框架，基于影响函数对物理信息神经网络进行训练数据归因，实现预测、损失分量和训练数据点之间的细粒度归因，并通过基准实验区分训练好与差的 PINN 的结构特征。

Comments Accepted at ICML 2026

2410.14573 2026-06-03 cs.LG cs.AI 版本更新

Building Trust in Black-box Optimization: A Comprehensive Framework for Explainability

在黑盒优化中建立信任：可解释性的综合框架

Nazanin Nezami, Hadis Anahideh

发表机构 * University of Illinois Chicago（伊利诺伊大学芝加哥分校）

AI总结提出一套模型无关的指标IEMSO，通过采样核心、批次属性、优化过程和特征重要性四类指标，增强代理优化方法的透明性和可解释性。

详情

AI中文摘要

在受限评估预算内优化昂贵的黑盒函数在许多实际应用中面临重大挑战。代理优化（SO）是一种常见的解决方案，但其由代理模型和采样核心（例如采集函数）的复杂性引入的专有性质往往导致缺乏可解释性和透明度。尽管现有文献主要集中在增强对全局最优的收敛性，但新提出策略的实际解释仍未被充分探索，特别是在批量评估设置中。在本文中，我们提出了代理优化的包容性可解释性指标（IEMSO），这是一组全面的模型无关指标，旨在增强SO方法的透明度、可信度和可解释性。通过这些指标，我们在执行昂贵评估之前和之后为从业者提供中间和事后解释，以建立信任。我们考虑了四类主要指标，每类针对SO过程的特定方面：采样核心指标、批次属性指标、优化过程指标和特征重要性。我们的实验评估证明了所提指标在不同基准上的显著潜力。

英文摘要

Optimizing costly black-box functions within a constrained evaluation budget presents significant challenges in many real-world applications. Surrogate Optimization (SO) is a common solution, yet its proprietary nature introduced by the complexity of surrogate models and the sampling core (e.g., acquisition functions) often leads to a lack of explainability and transparency. While existing literature has primarily concentrated on enhancing convergence to global optima, the practical interpretation of newly proposed strategies remains underexplored, especially in batch evaluation settings. In this paper, we propose Inclusive Explainability Metrics for Surrogate Optimization (IEMSO), a comprehensive set of model-agnostic metrics designed to enhance the transparency, trustworthiness, and explainability of SO approaches. Through these metrics, we provide both intermediate and post-hoc explanations to practitioners before and after performing expensive evaluations to gain trust. We consider four primary categories of metrics, each targeting a specific aspect of the SO process: Sampling Core Metrics, Batch Properties Metrics, Optimization Process Metrics, and Feature Importance. Our experimental evaluations demonstrate the significant potential of the proposed metrics across different benchmarks.

2407.18428 2026-06-03 cs.LG cs.AI cs.CV 版本更新

Weighted Risk Invariance: Domain Generalization under Invariant Feature Shift

加权风险不变性：不变特征偏移下的领域泛化

Gina Wong, Joshua Gleason, Rama Chellappa, Yoav Wald, Anqi Liu

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； University of Maryland, College Park（马里兰大学学院公园分校）； New York University（纽约大学）； Center for Data Science（数据科学中心）

AI总结针对不变协变量偏移下现有不变学习方法性能不佳的问题，提出加权风险不变性（WRI）框架，通过环境间损失的不变性并加权训练样本，在理论上保证学习到不变模型，并在实验中优于先前方法。

详情

Journal ref: TMLR 2024

AI中文摘要

学习预测在多个环境下不变的模型是一种有前景的分布外泛化方法。这类模型被训练来提取特征 $X_{\text{inv}}$，使得给定提取特征时标签的条件分布 $Y \mid X_{\text{inv}}$ 在不同环境下不发生变化。不变模型还应能泛化到提取特征 $X_{\text{inv}}$ 的边缘分布 $p(X_{\text{inv}})$ 的偏移，这种偏移称为 $\textit{不变协变量偏移}$。然而，我们表明，现有学习不变模型的方法在不变协变量偏移下表现不佳，要么无法学习到不变模型（即使对于从简单且经过充分研究的线性-高斯模型生成的数据也是如此），要么有限样本性能较差。为了解决这些问题，我们提出 $\textit{加权风险不变性}$（WRI）。我们的框架基于对训练样本进行适当加权，强制要求损失在不同环境下保持不变。我们证明，在线性-高斯设置下，WRI 可证明地学习到不变模型，即丢弃虚假相关性。我们提出了一种实用算法，通过同时学习密度 $p(X_{\text{inv}})$ 和模型参数来实现 WRI，并且实验表明，在不变协变量偏移下，WRI 优于先前的不变学习方法。

英文摘要

Learning models whose predictions are invariant under multiple environments is a promising approach for out-of-distribution generalization. Such models are trained to extract features $X_{\text{inv}}$ where the conditional distribution $Y \mid X_{\text{inv}}$ of the label given the extracted features does not change across environments. Invariant models are also supposed to generalize to shifts in the marginal distribution $p(X_{\text{inv}})$ of the extracted features $X_{\text{inv}}$, a type of shift we call an $\textit{invariant covariate shift}$. However, we show that proposed methods for learning invariant models underperform under invariant covariate shift, either failing to learn invariant models (even for data generated from simple and well-studied linear-Gaussian models) or having poor finite-sample performance. To alleviate these problems, we propose $\textit{weighted risk invariance}$ (WRI). Our framework is based on imposing invariance of the loss across environments subject to appropriate reweightings of the training examples. We show that WRI provably learns invariant models, i.e. discards spurious correlations, in linear-Gaussian settings. We propose a practical algorithm to implement WRI by learning the density $p(X_{\text{inv}})$ and the model parameters simultaneously, and we demonstrate empirically that WRI outperforms previous invariant learning methods under invariant covariate shift.
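
One way to make "invariance of the loss subject to reweighting" concrete is sketched below. This is an illustrative reading, not necessarily the paper's exact formulation. The notation is assumed here: $p_{e}$ is the density of $X_{\text{inv}}$ in environment $e$, $\ell$ is the loss, and $f$ is a predictor that depends only on $X_{\text{inv}}$. The risk in environment $e_i$ is weighted by the invariant-feature density of another environment $e_j$:

```latex
R^{e_i,e_j}(f)
  = \mathbb{E}_{e_i}\!\left[\ell\big(f(X_{\text{inv}}),Y\big)\,p_{e_j}(X_{\text{inv}})\right]
  = \int p_{e_i}(x)\,p_{e_j}(x)\,
    \mathbb{E}\!\left[\ell\big(f(x),Y\big)\mid X_{\text{inv}}=x\right]dx .
```

If $Y \mid X_{\text{inv}}$ is the same in every environment, the inner conditional expectation does not depend on the environment, so the right-hand side is symmetric in $e_i$ and $e_j$ and $R^{e_i,e_j}(f) = R^{e_j,e_i}(f)$ holds even when the marginals $p_{e_i}$ and $p_{e_j}$ differ. Unweighted risks carry no such guarantee under a shift in $p(X_{\text{inv}})$.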

2407.11821 2026-06-03 cs.AI 版本更新

Approximating Probabilistic Inference in Statistical EL with Knowledge Graph Embeddings

使用知识图谱嵌入近似统计EL中的概率推理

Yuqicheng Zhu, Nico Potyka, Bo Xiong, Trung-Kien Tran, Mojtaba Nayyeri, Evgeny Kharlamov, Steffen Staab

发表机构 * Bosch Center for AI（博世人工智能中心）； University of Stuttgart（斯图加特大学）； Cardiff University（卡迪夫大学）； Stanford University（斯坦福大学）； University of Oslo（奥斯陆大学）； University of Southampton（南安普顿大学）

AI总结本文提出利用知识图谱嵌入高效近似统计EL中的概率推理，并提供了运行时和正确性保证的理论证明及实验评估。

Comments Accepted at UAI 2026

2403.19883 2026-06-03 cs.AI 版本更新

Planning with Uncertainty: Symmetries, Policy Inference, and Solution Compression

不确定性规划：对称性、策略推断与解的压缩

Frederico Messa, André Grahl Pereira

发表机构 * INF/UFRGS（南里奥格兰德联邦大学信息学院）

AI总结本文提出基于显式最佳优先策略空间搜索的FOND规划方法，通过定义策略等价关系、利用群论计算状态对称性、多项式时间策略推断以及整数规划实现部分状态策略压缩，显著提升求解效率。

详情

DOI: 10.1016/j.artint.2026.104574

AI中文摘要

完全可观测非确定性（FOND）规划是人工智能不确定性规划的核心。它通过具有非确定性效果的动作来建模不确定性。在这项工作中，我们提出了一系列技术，将显式最佳优先策略空间搜索建立为一种与当前最先进方法相竞争的方法，用于解决FOND规划任务。我们研究了如何定义策略之间的等价关系，从而允许剪枝部分搜索空间。我们展示了可以使用群论技术有效计算状态之间的规范对称性。我们还提出了两项超越策略空间搜索的贡献：一个过程，在给定策略域集规范的情况下，能在多项式时间内推断出解策略函数；以及一个整数规划公式化过程，给定一个定义在完整状态上的解策略，能产生一组资源高效的模型，这些模型能够找到以最少部分状态无歧义地表示该策略的部分状态策略。

英文摘要

Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty. It models uncertainty through actions with non-deterministic effects. In this work, we present a collection of techniques that establish explicit best-first policy-space search as a method competitive with the state of the art for solving FOND planning tasks. We study how to define equivalence relations between policies, allowing part of the search space to be pruned. We show it is possible to use group theory techniques to effectively compute canonical symmetries between states. We also present two contributions that go beyond just policy-space search: we present a procedure that infers in polynomial time a solution policy function given just the specification of its domain set, and an integer-programming formulation procedure that, given a solution policy defined over complete states, yields a set of resource-efficient models that are capable of finding a partial-state policy that represents it unambiguously with the fewest partial states possible.

2303.15619 2026-06-03 cs.CL cs.AI 版本更新

Typhoon: Towards an Effective Task-Specific Masking Strategy for Pre-trained Language Models

Typhoon: 面向预训练语言模型的有效任务特定掩码策略

Muhammed Shahir Abdurrahman, Hashem Elezabi, Bruce Changlong Xu

发表机构 * Department of Computer Science, Stanford University（斯坦福大学计算机科学系）

AI总结本文提出Typhoon，一种基于任务损失梯度的自适应掩码策略，在GLUE任务上对比随机掩码和整词掩码，经严格评估发现无显著优势。

详情

AI中文摘要

在掩码语言建模（MLM）中，选择哪些token进行掩码是一个核心但未被充分研究的设计决策。标准预训练随机均匀掩码token，但多项研究表明，更具信息性的掩码目标可以提升下游性能。我们将掩码视为微调流程中任务自适应的组件，并引入Typhoon，一种掩码策略，它利用任务损失相对于one-hot token输入的梯度来在线估计每种token类型对目标的贡献程度。Typhoon维护每个token类型显著性的指数移动平均，并将这些分数校准为掩码分布，在token独立性近似下，其期望掩码率与目标预算匹配。我们形式化了该方法，并在两个GLUE任务（MRPC和CoLA）上，针对三个BERT系列骨干网络（TinyBERT、DistilBERT和BERT-base）以及每个配置五个随机种子（总共90次训练运行），将其与随机掩码和整词掩码进行了评估。我们的主要发现是，一旦考虑了种子方差，没有哪种掩码策略在这些任务上可靠地优于其他策略：在MRPC上，Typhoon与最佳基线之间的差距保持在0.004 F1以内，所有十二次Typhoon比较中无配对检验达到显著性，且每个95%置信区间包含零。Typhoon在单次运行实验中的明显优势并未经受住这种更仔细的评估。我们将此视为一个警示性的、以可重复性为重点的结果——基于梯度的任务自适应掩码具有竞争力，但在此规模上并不明显优于无资源的随机掩码——我们描述了一个干净的现代重实现以支持后续工作。

英文摘要

The choice of which tokens to mask is a central, under-examined design decision in masked language modeling (MLM). Standard pretraining masks tokens uniformly at random, but several studies show that more informative masking targets can improve downstream performance. We study masking as a task-adaptive component of the fine-tuning pipeline and introduce Typhoon, a masking strategy that uses the gradient of the task loss with respect to one-hot token inputs to estimate, online, how much each token type contributes to the objective. Typhoon maintains an exponential moving average of per-token-type saliency and calibrates these scores into a masking distribution whose expected masking rate matches a target budget, under a token-independence approximation. We formalize the method and evaluate it against random masking and whole-word masking on two GLUE tasks, MRPC and CoLA, across three BERT-family backbones (TinyBERT, DistilBERT, and BERT-base) and five random seeds per configuration (90 training runs in total). Our main finding is that, once seed variance is accounted for, no masking strategy is reliably better than the others on these tasks: on MRPC the gap between Typhoon and the best baseline stays within 0.004 $F_1$, across all twelve Typhoon comparisons no paired test reaches significance, and every 95% confidence interval contains zero. Typhoon's apparent advantage in single-run experiments does not survive this more careful evaluation. We read this as a cautionary, reproducibility-focused result (gradient-based task-adaptive masking is competitive but not clearly better than resource-free random masking at this scale), and we describe a clean modern reimplementation to support follow-up work.
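
The two bookkeeping steps named in the abstract, an exponential moving average of per-token-type saliency and a calibration to a target masking budget, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the decay `beta`, and the choice of a clipped linear scaling found by bisection are all hypothetical, and the gradient-based saliency itself is taken as given.

```python
def ema_update(ema, saliency, beta=0.9):
    """Blend new per-token-type saliency into a running average (hypothetical helper)."""
    out = dict(ema)
    for tok, s in saliency.items():
        # A token type seen for the first time starts at its observed saliency.
        out[tok] = beta * out.get(tok, s) + (1.0 - beta) * s
    return out


def calibrate(scores, tokens, budget=0.15, iters=60):
    """Map scores to mask probabilities whose mean over `tokens` matches `budget`.

    Assumed calibration: p_i = min(1, c * score_i), with c found by bisection.
    """
    eps = 1e-8  # keeps every score strictly positive
    s = [scores.get(t, 0.0) + eps for t in tokens]

    def mean_rate(c):
        return sum(min(1.0, c * x) for x in s) / len(s)

    lo, hi = 0.0, 1.0 / min(s)  # at c = hi every probability is clipped to 1
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_rate(mid) < budget:
            lo = mid
        else:
            hi = mid
    return [min(1.0, hi * x) for x in s]


if __name__ == "__main__":
    ema = ema_update({}, {"bank": 1.0, "the": 0.0, "river": 0.5})
    print(calibrate(ema, ["the", "river", "bank", "the"]))
```

Treating each position independently, the mean of the returned probabilities is the expected fraction of masked tokens, which is what the budget constrains.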

1301.3535 2026-06-03 eess.SY cs.AI cs.SY 版本更新

Airport Gate Scheduling for Passengers, Aircraft, and Operation

面向乘客、飞机和运营的机场登机口调度

Sang Hyun Kim, Eric Feron, John-Paul Clarke, Aude Marzuoli, Daniel Delahaye

AI总结本文研究机场登机口调度问题，提出兼顾乘客、飞机和运营三个目标的平衡目标函数，以提升乘客体验、交通流效率和运营鲁棒性。

Comments This paper is submitted to the tenth USA/Europe ATM 2013 seminar

2004.07506 2026-06-03 cs.LO cs.AI math.LO 版本更新

On Reductions of Hintikka Sets for Higher-Order Logic

关于高阶逻辑的Hintikka集归约

Alexander Steen, Christoph Benzmüller

AI总结本文通过将Steen (2018)基于原始等式的Church类型论Hintikka集性质归约到Brown (2007)的Hintikka集性质，推导出Steen性质的一个模型存在定理。

Comments 10 pages; improved version

1208.4773 2026-06-03 eess.SY cs.AI cs.LG cs.SY 版本更新

Optimized Look-Ahead Tree Policies: A Bridge Between Look-Ahead Tree Policies and Direct Policy Search

优化前瞻树策略：连接前瞻树策略与直接策略搜索的桥梁

Tobias Jung, Louis Wehenkel, Damien Ernst, Francis Maes

AI总结提出一种混合策略学习方案，通过直接策略搜索学习节点评分函数来指导小型前瞻树的构建，从而结合直接策略搜索和前瞻树策略的优势。

Comments In Submission

详情

AI中文摘要

直接策略搜索（DPS）和前瞻树（LT）策略是两类广泛使用的技术，用于为序列决策问题产生高性能策略。要使DPS方法有效工作，一个关键问题是针对目标问题选择合适的参数化策略空间。LT方法的一个基本问题是，为了做出好的决策，这类策略必须开发非常大的前瞻树，这可能需要过多的在线计算资源。在本文中，我们提出了一种新的混合策略学习方案，它位于DPS和LT的交集，其中策略是一种算法，以有向方式开发一个小型前瞻树，由通过DPS学习的节点评分函数引导。基于LT的表示被证明是在DPS方案中表示策略的一种通用方式，同时，DPS能够显著减少做出高质量决策所需的前瞻树的大小。我们通过实验将我们的方法与两种其他最先进的DPS技术和四种常见的LT策略在四个基准领域进行比较，并表明它结合了其起源的两种技术的优势。特别是，我们表明我们的方法：（1）总体上比纯DPS和纯LT策略产生更好的性能策略，（2）需要的策略评估次数远少于其他DPS技术，（3）易于调整，（4）产生的策略对初始条件的扰动具有相当的鲁棒性。

英文摘要

Direct policy search (DPS) and look-ahead tree (LT) policies are two widely used classes of techniques to produce high-performance policies for sequential decision-making problems. To make DPS approaches work well, one crucial issue is to select an appropriate space of parameterized policies with respect to the targeted problem. A fundamental issue in LT approaches is that, to make good decisions, such policies must develop very large look-ahead trees, which may require excessive online computational resources. In this paper, we propose a new hybrid policy learning scheme that lies at the intersection of DPS and LT, in which the policy is an algorithm that develops a small look-ahead tree in a directed way, guided by a node scoring function that is learned through DPS. The LT-based representation is shown to be a versatile way of representing policies in a DPS scheme, while at the same time, DPS makes it possible to significantly reduce the size of the look-ahead trees that are required to make high-quality decisions. We experimentally compare our method with two other state-of-the-art DPS techniques and four common LT policies on four benchmark domains and show that it combines the advantages of the two techniques from which it originates. In particular, we show that our method: (1) produces overall better performing policies than both pure DPS and pure LT policies, (2) requires a substantially smaller number of policy evaluations than other DPS techniques, (3) is easy to tune and (4) results in policies that are quite robust with respect to perturbations of the initial conditions.

1204.3830 2026-06-03 cs.RO cs.AI cs.SY eess.SY 版本更新

Planning Optimal Paths for Multiple Robots on Graphs

图上多机器人的最优路径规划

Jingjin Yu, Steven M. LaValle

AI总结提出两种基于多流整数线性规划的模型，分别求解多机器人路径规划的最小最后到达时间和最小总距离问题，算法完备且保证最优解。

Comments Changed "agents" to "robots"

1204.3820 2026-06-03 eess.SY cs.AI cs.RO cs.SY 版本更新

Distance Optimal Formation Control on Graphs with a Tight Convergence Time Guarantee

图上具有紧收敛时间保证的距离最优编队控制

Jingjin Yu, Steven M. LaValle

AI总结针对连通图上单位边距下无碰撞移动多个不可区分智能体到任意目标顶点集的任务，提出一种快速距离最优控制算法，并给出紧收敛时间保证。

Comments Brought to be in-sync with final version submitted to CDC 2012 with only minor updates

1101.4003 2026-06-03 cs.AI cs.LG cs.SY eess.SY math.OC 版本更新

Dyna-H: a heuristic planning reinforcement learning algorithm applied to role-playing-game strategy decision systems

Dyna-H：一种应用于角色扮演游戏策略决策系统的启发式规划强化学习算法

Matilde Santos, Jose Antonio Martin H., Victoria Lopez, Guillermo Botella

AI总结提出Dyna-H算法，结合启发式搜索与Dyna框架，在角色扮演游戏策略决策中实现无模型在线强化学习，实验表明其性能显著优于Q-Learning和Dyna-Q。

详情

AI中文摘要

在角色扮演游戏中，寻找最优轨迹是最重要的任务之一。实际上，策略决策系统成为游戏引擎的关键组成部分。决策方式（在线、批处理或模拟）以及决策所消耗的资源（如执行时间、内存）将在很大程度上影响游戏性能。当可以使用经典搜索算法（如A*）时，它们是最优先的选择。然而，这些方法依赖于搜索空间的精确和完整模型，在许多有趣的场景中无法应用。此时，无模型的序贯决策方法（在不确定性下）是最佳选择。本文提出一种启发式规划策略，将启发式搜索在路径规划中的能力融入Dyna智能体。所提出的Dyna-H算法，与A*一样，会选择更有可能产生结果的路径分支。此外，它具有无模型在线强化学习算法的优点。该方案与单步Q-Learning和Dyna-Q算法进行了对比评估，获得了优异的实验结果：Dyna-H在所有实验中显著优于这两种方法。我们还提出了一个功能类比，即从最差轨迹中采样的启发式与人类行为中梦境（如噩梦）的作用类似。

英文摘要

In a Role-Playing Game, finding optimal trajectories is one of the most important tasks. In fact, the strategy decision system becomes a key component of a game engine. The way in which decisions are taken (online, batch or simulated) and the resources consumed in decision making (e.g. execution time, memory) will influence game performance to a major degree. When classical search algorithms such as A* can be used, they are the first choice. Nevertheless, such methods rely on precise and complete models of the search space, and there are many interesting scenarios where their application is not possible. In those cases, model-free methods for sequential decision making under uncertainty are the best choice. In this paper, we propose a heuristic planning strategy to incorporate the ability of heuristic search in path-finding into a Dyna agent. The proposed Dyna-H algorithm, as A* does, selects branches more likely to produce outcomes than other branches. It also has the advantages of being a model-free online reinforcement learning algorithm. The proposal was evaluated against the one-step Q-Learning and Dyna-Q algorithms, obtaining excellent experimental results: Dyna-H significantly outperforms both methods in all experiments. We also suggest a functional analogy between the proposed heuristic of sampling from worst trajectories and the role of dreams (e.g. nightmares) in human behavior.

1201.5604 2026-06-03 cs.AI cs.LG cs.NE cs.SY eess.SY math.OC 版本更新

Discrete and fuzzy dynamical genetic programming in the XCSF learning classifier system

XCSF学习分类系统中的离散与模糊动态遗传编程

Richard J. Preen, Larry Bull

AI总结本文在XCSF框架内使用离散和模糊动态系统表示（异步随机布尔网络和模糊逻辑网络），通过自适应的开放式进化设计集成系统，解决多个经典测试问题。

1106.3703 2026-06-03 nlin.AO cs.AI cs.IT cs.LG cs.SY eess.SY math.IT q-bio.QM stat.ME 版本更新

Prediction and Modularity in Dynamical Systems

动力系统中的预测与模块性

Artemy Kolchinsky, Luis M. Rocha

AI总结本文从统计建模和预测的角度，利用模型简洁性与预测精度之间的权衡，提出了一种将动力网络最优多尺度分解为弱耦合简单模块的方法，并给出了状态依赖和因果版本。

Comments v1 published in ECAL 2011 (European Conference on Artificial Life). v2 fixes error in causal risk (number of parameters should be based on training distribution)

1204.4200 2026-06-03 cs.AI cs.LG cs.NE cs.SY eess.SY 版本更新

Discrete Dynamical Genetic Programming in XCS

XCS中的离散动力遗传编程

Richard J. Preen, Larry Bull

AI总结本文研究在XCS学习分类器系统中使用异步随机布尔网络作为离散动力系统表示，通过自适应的开放式进化设计集成系统以解决多个经典测试问题。

Comments arXiv admin note: substantial text overlap with arXiv:1201.5604

1107.5528 2026-06-03 cs.AI cs.SY eess.SY math.OC 版本更新

Time Consistent Discounting

时间一致折现

Tor Lattimore, Marcus Hutter

AI总结本文通过引入随年龄变化的折现函数，刻画了时间一致与不一致的折现函数，并证明了即使折现函数时间不一致，智能体仍存在理性策略。

Comments 17 LaTeX pages, 5 figures

1008.0775 2026-06-03 eess.SY cs.AI cs.MA cs.SY math.OC 版本更新

Systems Theoretic Techniques for Modeling, Control, and Decision Support in Complex Dynamic Systems

复杂动态系统中建模、控制与决策支持的系统理论技术

Armen Bagdasaryan

AI总结从系统理论视角综述复杂系统的建模、控制与决策支持方法，提出一种适用于控制回路中复杂层次系统的通用动态建模与仿真技术，并设计了用于仿真与决策支持的计算机信息系统架构。

Comments 58 pages, 24 figures, 1 table; a book chapter published by Bentham Science

1303.2912 2026-06-03 cs.AI cs.RO cs.SY eess.SY stat.ML 版本更新

Integrated Pre-Processing for Bayesian Nonlinear System Identification with Gaussian Processes

基于高斯过程的贝叶斯非线性系统辨识的集成预处理

Roger Frigola, Carl Edward Rasmussen

AI总结提出GP-FNARX模型，通过集成数据预处理与稀疏高斯过程回归，实现从原始数据到辨识模型的自动化流程，并利用边际似然最大化同时优化预处理参数和超参数，获得能报告不确定性的贝叶斯动力学模型。

Comments Proceedings of the 52nd IEEE International Conference on Decision and Control (CDC), Firenze, Italy, December 2013

1204.4202 2026-06-03 cs.AI cs.LG cs.NE cs.SY eess.SY 版本更新

Fuzzy Dynamical Genetic Programming in XCSF

XCSF中的模糊动态遗传编程

Richard J. Preen, Larry Bull

AI总结研究在XCSF学习分类器系统中使用模糊动态遗传编程表示，通过异步模糊逻辑网络实现自适应性开放演化，解决连续值测试问题。

Comments 2 page GECCO 2011 poster paper

1304.2367 2026-06-03 cs.CV cs.AI cs.SY eess.SY 版本更新

Utility-Based Control for Computer Vision

基于效用的计算机视觉控制

Tod S. Levitt, Thomas O. Binford, Gil J. Ettinger, Patrice Gelband

AI总结针对贝叶斯网络实现计算机视觉中的计算效率问题，提出通过最大化效用而非概率来控制视觉任务，以优化传感器信息收集和数据分析。

Comments Appears in Proceedings of the Fourth Conference on Uncertainty in Artificial Intelligence (UAI1988)

详情

AI中文摘要

在利用贝叶斯网络实现计算机视觉识别世界对象时，出现了几个关键问题。计算效率是驱动力。感知网络非常深，通常有十五层结构。图像很宽，例如，在512×512像素或更大的图像中，未指定数量的边缘可能出现在任何位置。为了提高效率，我们动态实例化观察到的对象的假设。网络不是固定的，而是在运行时逐步创建。世界对象假设的生成和识别模型的索引很重要，但本文不讨论[4,11]。这项工作旨在近期通过并行计算在雷达监视系统ADRIES[5,15]和工业零件识别系统SUCCESSOR[2]中实现。对于许多应用，视觉必须更快才能实用，因此有效控制机器视觉过程至关重要。感知操作可能扫描百万像素，并可能需要数分钟的计算时间。必须避免不必要的传感器动作和计算。并行计算在多个处理器能力级别上可用。用于高层视觉的并行分布式计算的潜力意味着分配非均匀计算。本文解决了基于贝叶斯概率模型的机器视觉系统中的任务控制问题。我们将控制与推理分离，以扩展先前的工作[3]，最大化效用而非概率。最大化效用允许采用感知策略，以有效收集传感器信息并分析传感器数据。本文展示了通过效用控制机器视觉以识别军事场景的结果。未来工作将将其扩展到SUCCESSOR的工业零件识别。

英文摘要

Several key issues arise in implementing computer vision recognition of world objects in terms of Bayesian networks. Computational efficiency is a driving force. Perceptual networks are very deep, typically fifteen levels of structure. Images are wide, e.g., an unspecified number of edges may appear anywhere in an image of 512 x 512 pixels or larger. For efficiency, we dynamically instantiate hypotheses of observed objects. The network is not fixed, but is created incrementally at runtime. Generation of hypotheses of world objects and indexing of models for recognition are important, but they are not considered here [4,11]. This work is aimed at near-term implementation with parallel computation in a radar surveillance system, ADRIES [5, 15], and a system for industrial part recognition, SUCCESSOR [2]. For many applications, vision must be faster to be practical, and so efficiently controlling the machine vision process is critical. Perceptual operators may scan megapixels and may require minutes of computation time. It is necessary to avoid unnecessary sensor actions and computation. Parallel computation is available at several levels of processor capability. The potential for parallel, distributed computation for high-level vision means distributing non-homogeneous computations. This paper addresses the problem of task control in machine vision systems based on Bayesian probability models. We separate control and inference to extend the previous work [3] to maximize utility instead of probability. Maximizing utility allows adopting perceptual strategies for efficient information gathering with sensors and analysis of sensor data. Results of controlling machine vision via utility to recognize military situations are presented in this paper. Future work extends this to industrial part recognition for SUCCESSOR.

1304.0030 2026-06-03 math.OC cs.AI cs.SY eess.SY 版本更新

Note on Combinatorial Engineering Frameworks for Hierarchical Modular Systems

关于层次模块化系统的组合工程框架的注记

Mark Sh. Levin

AI总结本文描述了一套用于解决层次模块化系统中复杂问题的基本组合工程框架，包括系统层次模型设计、组合综合、系统评估、瓶颈检测、改进、多阶段设计和演化建模，并涉及背包、多选、分配、生成树和形态团等组合优化问题。

Comments 11 pages, 7 figures, 3 tables

详情

AI中文摘要

本文简要描述了一套用于解决层次模块化系统中复杂问题的基本组合工程框架。这些框架由相互关联/链接（例如，通过偏好关系）的组合问题（及相应模型）组成。主要使用层次形态系统模型。基本标准组合工程（技术）框架列表如下：（1）系统层次模型设计，（2）组合综合（系统设计的“自下而上”过程），（3）系统评估，（4）系统瓶颈检测，（5）系统改进（重新设计、升级），（6）多阶段设计（系统轨迹设计），（7）系统演化/发展和系统预测的组合建模。组合工程框架旨在支持某些系统生命周期阶段。主要的底层组合优化问题列表包括：背包问题、多选问题、分配问题、生成树、形态团问题。

英文摘要

The paper briefly describes a basic set of special combinatorial engineering frameworks for solving complex problems in the field of hierarchical modular systems. The frameworks consist of combinatorial problems (and corresponding models), which are interconnected/linked (e.g., by a preference relation). A hierarchical morphological system model is mainly used. The list of basic standard combinatorial engineering (technological) frameworks is the following: (1) design of system hierarchical model, (2) combinatorial synthesis ('bottom-up' process for system design), (3) system evaluation, (4) detection of system bottlenecks, (5) system improvement (re-design, upgrade), (6) multi-stage design (design of system trajectory), (7) combinatorial modeling of system evolution/development and system forecasting. The combinatorial engineering frameworks are targeted at supporting some stages of the system life cycle. The list of main underlying combinatorial optimization problems involves the following: knapsack problem, multiple-choice problem, assignment problem, spanning trees, morphological clique problem.

1301.7389 2026-06-03 cs.AI cs.SY eess.SY 版本更新

Dealing with Uncertainty on the Initial State of a Petri Net

Daniel N. Nikovski, Matthew Brand

AI总结针对群控电梯调度中未来乘客对等待时间的影响，提出一种概率模型并集成到现有方法中，显著降低平均等待时间。

Comments Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003)

1212.2495 2026-06-03 cs.RO cs.AI cs.SY eess.SY 版本更新

Policy-contingent abstraction for robust robot control

基于策略抽象的鲁棒机器人控制

Joelle Pineau, Geoffrey Gordon, Sebastian Thrun

AI总结提出一种可扩展的控制算法，使移动机器人系统在充分考虑概率信念的情况下做出高层决策，并成功部署于护理机构。

Comments Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003)

1212.2471 2026-06-03 cs.LG cs.AI cs.NA math.NA 版本更新

Monte Carlo Matrix Inversion Policy Evaluation

蒙特卡洛矩阵求逆策略评估

Fletcher Lu, Dale Schuurmans

AI总结提出使用蒙特卡洛矩阵求逆（MCMI）进行强化学习策略评估，通过重要性采样降低方差，并在运行时间和准确性上优于最大似然模型和时序差分方法。

Comments Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003)

详情

AI中文摘要

1950年，Forsythe和Leibler（1950）引入了一种统计技术，通过将矩阵逆的元素表征为一系列随机游走的期望值来求矩阵的逆。Barto和Duff（1994）随后展示了该技术与标准动态规划和时序差分方法之间的关系。蒙特卡洛矩阵求逆（MCMI）方法的优势在于，它相对于其他技术，在状态空间大小方面具有更好的可扩展性。在本文中，我们介绍了一种使用MCMI进行强化学习策略评估的算法。我们证明，MCMI在运行时间上优于基于最大似然模型的策略评估方法，并且在运行时间和准确性上都优于时序差分（TD）策略评估方法。我们进一步通过向算法添加重要性采样技术来降低估计器的方差，从而改进了MCMI策略评估。最后，我们展示了将MCMI扩展到大规模状态空间以进行策略改进的技术。

英文摘要

Forsythe and Leibler (1950) introduced a statistical technique for finding the inverse of a matrix by characterizing the elements of the matrix inverse as expected values of a sequence of random walks. Barto and Duff (1994) subsequently showed relations between this technique and standard dynamic programming and temporal differencing methods. The advantage of the Monte Carlo matrix inversion (MCMI) approach is that it scales better with respect to state-space size than alternative techniques. In this paper, we introduce an algorithm for performing reinforcement learning policy evaluation using MCMI. We demonstrate that MCMI improves on runtime over a maximum likelihood model-based policy evaluation approach and on both runtime and accuracy over the temporal differencing (TD) policy evaluation approach. We further improve on MCMI policy evaluation by adding an importance sampling technique to our algorithm to reduce the variance of our estimator. Lastly, we illustrate techniques for scaling up MCMI to large state spaces in order to perform policy improvement.
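
The random-walk idea can be sketched for policy evaluation as follows. This is a minimal illustration, not the authors' algorithm. It assumes a known transition matrix `P` under the fixed policy, a reward vector `r`, and a discount `gamma`, and it estimates one entry of V = (I - gamma P)^{-1} r by averaging walks that are absorbed with probability 1 - gamma at each step. All names are hypothetical, and the importance-sampling variance reduction mentioned in the abstract is not included.

```python
import random


def mcmi_value(P, r, gamma, start, n_walks=20000, seed=0):
    """Monte Carlo estimate of V[start], where V = (I - gamma * P)^{-1} r.

    Each walk collects the reward of every state it visits and survives each
    step with probability gamma, so a state reached at time t contributes
    gamma**t * r in expectation (the Neumann series of the inverse).
    """
    rng = random.Random(seed)
    states = range(len(r))
    total = 0.0
    for _ in range(n_walks):
        s = start
        while True:
            total += r[s]
            if rng.random() >= gamma:  # absorbed with probability 1 - gamma
                break
            s = rng.choices(states, weights=P[s])[0]  # sample the next state
    return total / n_walks


if __name__ == "__main__":
    P = [[0.5, 0.5], [0.2, 0.8]]
    r = [1.0, 0.0]
    print(mcmi_value(P, r, gamma=0.9, start=0))
```

Each walk touches only the states it visits, so the cost of estimating a single value does not grow with the full size of the matrix, which is the scaling advantage the abstract refers to.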

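1212.2005 2026-06-03 cs.AI cs.SY eess.SY version update

The Dynamic Controllability of Conditional STNs with Uncertainty

Luke Hunsberger, Roberto Posenato, Carlo Combi

AI summary: Defines the Conditional Simple Temporal Network with Uncertainty (CSTNU), which combines temporal constraints, conditional nodes, and uncertain durations, and proposes a notion of dynamic controllability together with constraint-propagation rules.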

Journal ref: PlanEX Workshop, ICAPS-2012, pages 21-29, 2012

Abstract

Recent attempts to automate business processes and medical-treatment processes have uncovered the need for a formal framework that can accommodate not only temporal constraints, but also observations and actions with uncontrollable durations. To meet this need, this paper defines a Conditional Simple Temporal Network with Uncertainty (CSTNU) that combines the simple temporal constraints from a Simple Temporal Network (STN) with the conditional nodes from a Conditional Simple Temporal Problem (CSTP) and the contingent links from a Simple Temporal Network with Uncertainty (STNU). A notion of dynamic controllability for a CSTNU is defined that generalizes the dynamic consistency of a CTP and the dynamic controllability of an STNU. The paper also presents some sound constraint-propagation rules for dynamic controllability that are expected to form the backbone of a dynamic-controllability-checking algorithm for CSTNUs.


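1212.1735 2026-06-03 math.OC cs.AI cs.NI cs.SY eess.SY version update

Towards Design of System Hierarchy (research survey)

Mark Sh. Levin

AI summary: Surveys frameworks for designing and building tree-like and hierarchical system structures, covering expert-based procedures, hierarchical clustering, spanning problems, design of organizational 'optimal' hierarchies, multi-layer k-connected network design, and modification of hierarchies or networks, with combinatorial optimization problems as the basic building blocks.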

Comments 36 pages, 41 figures, 9 tables

Abstract

The paper addresses design/building frameworks for some kinds of tree-like and hierarchical structures of systems. The following approaches are examined: (1) expert-based procedures; (2) hierarchical clustering; (3) spanning problems (e.g., minimum spanning tree, minimum Steiner tree, maximum leaf spanning tree problem); (4) design of organizational 'optimal' hierarchies; (5) design of multi-layer (e.g., three-layer) k-connected networks; (6) modification of hierarchies or networks: (i) modification of a tree via condensing of neighbor nodes, (ii) hotlink assignment, (iii) transformation of a tree into a Steiner tree, (iv) restructuring as modification of an initial structural solution into a solution that is closest to a goal solution while taking into account the cost of the modification. Combinatorial optimization problems are considered as basic ones (e.g., classification, knapsack problem, multiple choice problem, assignment problem). Some numerical examples illustrate the suggested problems and solving frameworks.


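1212.1143 2026-06-03 cs.AI cs.SY eess.SY math.OC stat.ML version update

Multiscale Markov Decision Problems: Compression, Solution, and Transfer Learning

Jake Bouvrie, Mauro Maggioni

AI summary: Proposes a fast algorithm for multiscale compression of Markov decision processes that automatically builds a hierarchy, decouples sub-tasks, speeds up convergence, and enables policy transfer across problems.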

Comments 86 pages, 15 figures

Abstract

Many problems in sequential decision making and stochastic control have natural multiscale structure: sub-tasks are assembled together to accomplish complex goals. Systematically inferring and leveraging hierarchical structure, particularly beyond a single level of abstraction, has remained a longstanding challenge. We describe a fast multiscale procedure for repeatedly compressing, or homogenizing, Markov decision processes (MDPs), wherein a hierarchy of sub-problems at different scales is automatically determined. Coarsened MDPs are themselves independent, deterministic MDPs, and may be solved using existing algorithms. The multiscale representation delivered by this procedure decouples sub-tasks from each other and can lead to substantial improvements in convergence rates both locally within sub-problems and globally across sub-problems, yielding significant computational savings. A second fundamental aspect of this work is that these multiscale decompositions yield new transfer opportunities across different problems, where solutions of sub-tasks at different levels of the hierarchy may be amenable to transfer to new problems. Localized transfer of policies and potential operators at arbitrary scales is emphasized. Finally, we demonstrate compression and transfer in a collection of illustrative domains, including examples involving discrete and continuous state spaces.


1210.4231 2026-06-03 eess.SY cs.AI cs.SY version update

An example illustrating the imprecision of the efficient approach for diagnosis of Petri nets via integer linear programming

Alban Grastien

AI summary: Shows by counterexample that the efficient integer linear programming approach to Petri net diagnosis can fail to detect a fault even when the system is diagnosable.

Comments 3 pages

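1203.4345 2026-06-03 eess.SY cs.AI cs.RO cs.SY stat.ML version update

Robust Filtering and Smoothing with Gaussian Processes

Marc Peter Deisenroth, Ryan Turner, Marco F. Huber, Uwe D. Hanebeck, Carl Edward Rasmussen

AI summary: Proposes a robust Bayesian filtering and smoothing algorithm for nonlinear stochastic dynamic systems based on non-parametric Gaussian process models; the smoothing is analytic, and numerical experiments show that it remains robust where other state-of-the-art methods can fail.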

Comments 7 pages, 1 figure, draft version of paper accepted at IEEE Transactions on Automatic Control

DOI: 10.1109/TAC.2011.2179426

Abstract

We propose a principled algorithm for robust Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when both the transition function and the measurement function are described by non-parametric Gaussian process (GP) models. GPs are gaining increasing importance in signal processing, machine learning, robotics, and control for representing unknown system functions by posterior probability distributions. This modern way of "system identification" is more robust than finding point estimates of a parametric function representation. In this article, we present a principled algorithm for robust analytic smoothing in GP dynamic systems, which are increasingly used in robotics and control. Our numerical evaluations demonstrate the robustness of the proposed approach in situations where other state-of-the-art Gaussian filters and smoothers can fail.


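1208.1103 2026-06-03 cs.AI cs.SY eess.SY version update

System identification and modeling for interacting and non-interacting tank systems using intelligent techniques

N. S. Bhuvaneswari, R. Praveena, R. Divya

AI summary: Identifies transfer function models and intelligent models of interacting and non-interacting tank processes from real-time experimental data using statistical model identification, the process reaction curve method, ARX models, genetic algorithms, neural networks, and fuzzy logic.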

Comments 13 pages, 8 figures

Abstract

System identification from experimental data plays a vital role in model-based controller design. Deriving a process model from first principles is often difficult because of the complexity of the process. The first stage in the development of any control and monitoring system is the identification and modeling of the system. Each model is developed within the context of a specific control problem, so a general system identification framework is warranted. The proposed framework should be able to adapt and emphasize different properties based on the control objective and the nature of the system's behavior. System identification has therefore been a valuable tool for identifying the model of a system from input and output data for controller design. The present work is concerned with the identification of transfer function models using statistical model identification, the process reaction curve method, the ARX model, and a genetic algorithm, and with modeling using neural networks and fuzzy logic, for interacting and non-interacting tank processes. The identification techniques and modeling used are prone to parameter change and disturbance. The proposed methods are used for identifying the mathematical model and the intelligent model of interacting and non-interacting processes from real-time experimental data.


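1207.6051 2026-06-03 eess.SY cs.AI cs.SY math.OC version update

Composition of Modular Telemetry System with Interval Multiset Estimates

Mark Sh. Levin

AI summary: Proposes a combinatorial synthesis approach based on interval multiset estimates for modeling, analyzing, designing, and improving a modular telemetry system, with multicriteria selection and synthesis of system components carried out through Hierarchical Morphological Multicriteria Design (HMMD).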

Comments 9 pages, 9 figures, 6 tables

Abstract

The paper describes a combinatorial synthesis approach with interval multiset estimates of system elements for modeling, analysis, design, and improvement of a modular telemetry system. Morphological (modular) system design and improvement are considered as composition of the telemetry system elements (components) configuration. The solving process is based on Hierarchical Morphological Multicriteria Design (HMMD): (i) multicriteria selection of alternatives for system components, (ii) synthesis of the selected alternatives into a resultant combination (while taking into account quality of the alternatives above and their compatibility). Interval multiset estimates are used for assessment of design alternatives for telemetry system elements. Two additional system problems are examined: (a) improvement of the obtained solutions, (b) aggregation of the obtained solutions into a resultant system configuration. The improvement and aggregation processes are based on the multiple choice problem with interval multiset estimates. Numerical examples for an on-board telemetry subsystem illustrate the design and improvement processes.


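1207.4154 2026-06-03 cs.AI cs.SY eess.SY math.OC version update

Discretized Approximations for POMDP with Average Cost

Huizhen Yu, Dimitri Bertsekas

AI summary: For average cost POMDPs, proposes a new lower approximation scheme based on discretization at a finite set of belief points, computes it efficiently with multi-chain algorithms for finite-state MDPs, and proves convergence.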

Comments Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Abstract

In this paper, we propose a new lower approximation scheme for POMDPs with discounted and average cost criteria. The approximating functions are determined by their values at a finite number of belief points, and can be computed efficiently using value iteration algorithms for finite-state MDPs. While several lower approximation schemes have been proposed earlier for discounted problems, ours seems to be the first of its kind for average cost problems. We focus primarily on the average cost case, and we show that the corresponding approximation can be computed efficiently using multi-chain algorithms for finite-state MDPs. We give a preliminary analysis showing that, regardless of the existence of the optimal average cost J in the POMDP, the approximation obtained is a lower bound of the liminf optimal average cost function, and can also be used to calculate an upper bound on the limsup optimal average cost function, as well as bounds on the cost of executing the stationary policy associated with the approximation. We show the convergence of the cost approximation when the optimal average cost is constant and the optimal differential cost is continuous.


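1207.3434 2026-06-03 cs.AI cs.RO cs.SY eess.SY version update

An Approach to Model Interest for Planetary Rover through Dezert-Smarandache Theory

Matteo Ceriotti, Massimiliano Vasile, Giovanni Giardini, Mauro Massari

AI summary: Proposes a method that fuses payload and navigation information through Dezert-Smarandache theory to quantify how interesting targets are to a planetary rover, enabling autonomous goal reallocation and preferential selection of scientific targets.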

Comments Journal Of Aerospace Computing, Information, And Communication Vol. 5, Month 2008

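1203.1007 2026-06-03 cs.LG cs.AI cs.SY eess.SY stat.ML version update

Agnostic System Identification for Model-Based Reinforcement Learning

Stephane Ross, J. Andrew Bagnell

AI summary: For the agnostic setting, where the model class may not contain the true system, proposes an iterative method that uses no-regret online learning algorithms to obtain near-optimal policies, and validates it on discrete and continuous domains.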

Comments 8 pages, published in ICML 2012

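1206.4329 2026-06-03 cs.AI cs.NA math.NA version update

An Improved Gauss-Newtons Method based Back-propagation Algorithm for Fast Convergence

Sudarshan Nandy, Partha Pratim Sarkar, Achintya Das

AI summary: Proposes an improved back-propagation algorithm based on the Gauss-Newton numerical optimization method that converges quickly on multilayer neural networks, and shows on several datasets that it outperforms steepest descent.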

Comments 7 pages, 6 figures, 2 tables, Published with International Journal of Computer Applications (IJCA)

1206.3285 2026-06-03 cs.AI cs.LG cs.SY eess.SY version update

Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping

Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, Michael P. Bowling

AI summary: Proposes a model-based Dyna-style planning approach extended to linear function approximation, proves its convergence, and introduces prioritized-sweeping algorithms for linear Dyna.

Comments Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008)

Abstract

We consider the problem of efficiently learning optimal control policies and value functions over large state spaces in an online setting in which estimates must be available after each interaction with the world. This paper develops an explicitly model-based approach extending the Dyna architecture to linear function approximation. Dyna-style planning proceeds by generating imaginary experience from the world model and then applying model-free reinforcement learning algorithms to the imagined state transitions. Our main results are to prove that linear Dyna-style planning converges to a unique solution independent of the generating distribution, under natural conditions. In the policy evaluation setting, we prove that the limit point is the least-squares (LSTD) solution. An implication of our results is that prioritized sweeping can be soundly extended to the linear approximation case, backing up to preceding features rather than to preceding states. We introduce two versions of prioritized sweeping with linear Dyna and briefly illustrate their performance empirically on the Mountain Car and Boyan Chain problems.


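1205.3997 2026-06-03 stat.ML cs.AI cs.GT cs.SY eess.SY version update

Free Energy and the Generalized Optimality Equations for Sequential Decision Making

Pedro A. Ortega, Daniel A. Braun

AI summary: Applies the free energy principle to general decision trees with adversarial and stochastic environments and derives generalized sequential optimality equations that include the Bellman optimality equations as a limit case and yield decision rules such as Expectimax, Minimax, and Expectiminimax; each node is assigned a resource parameter that expresses a computational cost.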

Comments 10 pages, 2 figures

Journal ref: European Workshop on Reinforcement Learning 2012

Abstract

The free energy functional has recently been proposed as a variational principle for bounded rational decision-making, since it instantiates a natural trade-off between utility gains and information processing costs that can be axiomatically derived. Here we apply the free energy principle to general decision trees that include both adversarial and stochastic environments. We derive generalized sequential optimality equations that not only include the Bellman optimality equations as a limit case, but also lead to well-known decision rules such as Expectimax, Minimax and Expectiminimax. We show how these decision rules can be derived from a single free energy principle that assigns a resource parameter to each node in the decision tree. These resource parameters express a concrete computational cost that can be measured as the number of samples that are needed from the distribution that belongs to each node. The free energy principle therefore provides the normative basis for generalized optimality equations that account for both adversarial and stochastic environments.
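
How a single resource parameter can move a node between maximizing, averaging, and minimizing can be sketched in a few lines of Python. The sketch assumes that the value of a node takes the certainty-equivalent form (1/beta) log sum_x p(x) exp(beta U(x)); this form, the function name `free_energy_value`, and the example numbers are assumptions made here for illustration and are not quoted from the abstract.

```python
import numpy as np

def free_energy_value(utilities, prior, beta):
    """Value of one node: (1/beta) * log sum_x prior(x) * exp(beta * utility(x)).

    Assumes every prior probability is strictly positive.
    """
    u = np.asarray(utilities, dtype=float)
    p = np.asarray(prior, dtype=float)
    if beta == 0.0:
        return float(p @ u)                    # limit beta -> 0: plain expectation
    z = beta * u + np.log(p)
    m = z.max()                                # shift for a numerically stable log-sum-exp
    return float((m + np.log(np.exp(z - m).sum())) / beta)

# Hypothetical node with three successors.
utilities = [1.0, 2.0, 4.0]
prior = [0.5, 0.3, 0.2]

chance_node = free_energy_value(utilities, prior, beta=0.0)      # expectation over the prior
max_node = free_energy_value(utilities, prior, beta=1e3)         # close to the maximum utility
min_node = free_energy_value(utilities, prior, beta=-1e3)        # close to the minimum utility
print(chance_node, max_node, min_node)
```

Under this assumed form, a large positive parameter recovers a maximizing node, a parameter of zero recovers a chance node, and a large negative parameter recovers an adversarial node, so mixing such nodes in one tree reproduces the Expectimax, Minimax, and Expectiminimax rules named in the abstract as limiting cases.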


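1205.2046 2026-06-03 eess.SY cs.AI cs.SY math.OC version update

Multiset Estimates and Combinatorial Synthesis

Mark Sh. Levin

AI summary: Proposes an ordinal assessment approach based on multiset estimates and studies operations on them (integration, proximity, comparison, aggregation, alignment) and their use in combinatorial synthesis (morphological approach, knapsack-like problems).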

Comments 30 pages, 24 figures, 10 tables

Abstract

The paper addresses an approach to ordinal assessment of alternatives based on assignment of elements into an ordinal scale. Basic versions of the assessment problems are formulated while taking into account the number of levels at a basic ordinal scale [1,2,...,l] and the number of assigned elements (e.g., 1,2,3). The obtained estimates are multisets (or bags) (cardinality of the multiset equals a constant). Scale-posets for the examined assessment problems are presented. 'Interval multiset estimates' are suggested. Further, operations over multiset estimates are examined: (a) integration of multiset estimates, (b) proximity for multiset estimates, (c) comparison of multiset estimates, (d) aggregation of multiset estimates, and (e) alignment of multiset estimates. Combinatorial synthesis based on morphological approach is examined including the modified version of the approach with multiset estimates of design alternatives. Knapsack-like problems with multiset estimates are briefly described as well. The assessment approach, multiset-estimates, and corresponding combinatorial problems are illustrated by numerical examples.


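1203.2556 2026-06-03 cs.AI cs.SY eess.SY version update

A Probabilistic Transmission Expansion Planning Methodology based on Roulette Wheel Selection and Social Welfare

Neeraj Gupta, Rajiv Shekhar, Prem Kumar Kalra

AI summary: Proposes a probabilistic transmission expansion planning methodology that uses the concept of social welfare and does not require new transmission capacity to be specified in advance; line capacities are computed by roulette wheel selection and expected energy not supplied by power flow analysis, and tests on a modified IEEE 5-bus system show that adding lines alone is not enough to minimize expected energy not supplied.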

Comments 22 pages, 4 figures

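1202.3720 2026-06-03 eess.SY cs.AI cs.SY version update

Efficient Inference in Markov Control Problems

Thomas Furmston, David Barber

AI summary: For finite and infinite horizon Markov control problems, proposes an exact inference algorithm that is more efficient than the standard forward-backward recursions, and gives a principled extension to infinite horizon problems for use in policy gradient and expectation maximization algorithms.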

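1202.3703 2026-06-03 eess.SY cs.AI cs.SY version update

Factored Filtering of Continuous-Time Systems

E. Busra Celikkaya, Christian R. Shelton, William Lam

AI summary: For continuous-time stochastic systems whose state distribution is too large to represent, proposes a factored approximation with two ways of computing the matrix exponential, ODE integration and a uniformization expansion, proves that the KL divergence of factored uniformization is bounded, and reports experiments in which it outperforms existing methods.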

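1201.2630 2026-06-03 eess.SY cs.AI cs.SY version update

Hybrid GPS-GSM Localization of Automobile Tracking System

Mohammad A. Al-Khedher

AI summary: Proposes an integrated GPS-GSM system that improves the accuracy of GPS coordinates with a Kalman filter and tracks vehicles in real time using Google Earth, for fleet management, police car dispatch, and theft alerts.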

Comments 11 pages, 11 figures, 23 references

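1108.1170 2026-06-03 math.OC cs.AI cs.SY eess.SY version update

Convex Optimization without Projection Steps

Martin Jaggi

AI summary: Proposes an iterative algorithm based on the Frank-Wolfe method that needs no projection steps for minimizing a convex function over a compact convex domain, reaches an ε-small duality gap in O(1/ε) iterations, and analyzes lower bounds on sparsity.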

Abstract

For the general problem of minimizing a convex function over a compact convex domain, we investigate a simple iterative approximation algorithm based on the method by Frank & Wolfe 1956 that does not need projection steps in order to stay inside the optimization domain. Instead of a projection step, the linearized problem defined by a current subgradient is solved, which gives a step direction that will naturally stay in the domain. Our framework generalizes the sparse greedy algorithm of Frank & Wolfe and its primal-dual analysis by Clarkson 2010 (and the low-rank SDP approach by Hazan 2008) to arbitrary convex domains. We give a convergence proof guaranteeing ε-small duality gap after O(1/ε) iterations. The method allows us to understand the sparsity of approximate solutions for any l1-regularized convex optimization problem (and for optimization over the simplex), expressed as a function of the approximation quality. We obtain matching upper and lower bounds of Θ(1/ε) for the sparsity for l1-problems. The same bounds apply to low-rank semidefinite optimization with bounded trace, showing that rank O(1/ε) is best possible here as well. As another application, we obtain sparse matrices of O(1/ε) non-zero entries as ε-approximate solutions when optimizing any convex function over a class of diagonally dominant symmetric matrices. We show that our proposed first-order method also applies to nuclear norm and max-norm matrix optimization problems. For nuclear norm regularized optimization, such as matrix completion and low-rank recovery, we demonstrate the practical efficiency and scalability of our algorithm for large matrix problems such as the Netflix dataset. For general convex optimization over bounded matrix max-norm, our algorithm is the first with a convergence guarantee, to the best of our knowledge.
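
The projection-free step can be sketched for the special case of the unit simplex, where the linearized problem is solved by picking the vertex with the smallest gradient coordinate. This is a minimal Python illustration, not the paper's general framework; the function name `frank_wolfe_simplex`, the quadratic test objective, and the step size 2/(k+2) (a common textbook choice) are assumptions made here.

```python
import numpy as np

def frank_wolfe_simplex(grad, n, n_iters=2000):
    """Minimize a smooth convex function over the unit simplex without projections."""
    x = np.zeros(n)
    x[0] = 1.0                               # start at a vertex of the simplex
    gap = np.inf
    for k in range(n_iters):
        g = grad(x)
        i = int(np.argmin(g))                # linear subproblem: best vertex e_i
        gap = float(g @ x - g[i])            # duality gap at the current iterate
        step = 2.0 / (k + 2.0)
        x *= 1.0 - step                      # convex combination of x and e_i,
        x[i] += step                         # so the iterate stays in the simplex
    return x, gap                            # gap refers to the iterate before the last update

# Hypothetical objective: f(x) = 0.5 * ||x - c||^2 with c inside the simplex.
c = np.array([0.2, 0.3, 0.5])
x_fw, gap_fw = frank_wolfe_simplex(lambda x: x - c, n=3)
print(x_fw, gap_fw)
```

Each iteration adds at most one new non-zero coordinate, which is the link between iteration count and sparsity that the abstract describes for optimization over the simplex.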


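1112.4057 2026-06-03 cs.AI cs.SY eess.SY version update

Performance Evaluation of Road Traffic Control Using a Fuzzy Cellular Model

Bartłomiej Płaczek

AI summary: Proposes a method based on a fuzzy cellular model for evaluating the performance of adaptive traffic control strategies in an online simulation environment; it combines cellular automata with fuzzy calculus to handle imprecise measurements.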

Comments The final publication is available at http://www.springerlink.com

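1107.2126 2026-06-03 math.NA cs.AI cs.IT cs.NA math.IT math.LO version update

Strong Solutions of the Fuzzy Linear Systems

Şahin Emrah Amrahov, Iman N. Askerzade

AI summary: For fuzzy linear systems with a crisp coefficient matrix and a right-hand side of fuzzy numbers in parametric form, proves an existence and uniqueness theorem for strong solutions whose conditions depend on both the coefficient matrix and the right-hand side, generalizing the classical theorem that applies only to special systems.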

Comments 11 pages

DOI: 10.3970/cmes.2011.076.207
Journal ref: CMES: Computer Modeling in Engineering & Sciences, Vol. 76, No. 4, pp. 207-216, 2011

Abstract

We consider a fuzzy linear system with a crisp coefficient matrix and with an arbitrary fuzzy number in parametric form on the right-hand side. It is known that the well-known existence and uniqueness theorem of a strong fuzzy solution is equivalent to the following: the coefficient matrix is the product of a permutation matrix and a diagonal matrix. This means that the theorem is applicable only to a special form of linear systems, namely those in which each equation has exactly one variable. We prove an existence and uniqueness theorem that can be used on more general systems. The necessary and sufficient conditions of the theorem depend on both the coefficient matrix and the right-hand side. This theorem is a generalization of the well-known existence and uniqueness theorem for the strong solution.


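1004.2027 2026-06-03 cs.LG cs.AI cs.SY eess.SY math.OC stat.ML version update

Dynamic Policy Programming

Mohammad Gheshlaghi Azar, Vicenc Gomez, Hilbert J. Kappen

AI summary: Proposes dynamic policy programming (DPP), whose performance-loss bounds are expressed through the l∞-norm of the average accumulated error; under approximation error this improves on standard approximate value iteration and approximate policy iteration, and DPP substantially outperforms existing reinforcement learning methods on several problem domains.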

Comments Submitted to Journal of Machine Learning Research

Abstract

In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. We prove finite-iteration and asymptotic l∞-norm performance-loss bounds for DPP in the presence of approximation/estimation error. The bounds are expressed in terms of the l∞-norm of the average accumulated error, as opposed to the l∞-norm of the error in the case of standard approximate value iteration (AVI) and approximate policy iteration (API). This suggests that DPP can achieve better performance than AVI and API, since it averages out the simulation noise caused by Monte Carlo sampling throughout the learning process. We examine this theoretical result numerically by comparing the performance of the approximate variants of DPP with existing reinforcement learning (RL) methods on different problem domains. Our results show that, in all cases, DPP-based algorithms outperform other RL methods by a wide margin.


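1108.6223 2026-06-03 cs.SE cs.AI cs.DM cs.NI cs.SY eess.SY math.OC version update

Towards Configuration of applied Web-based information system

Mark Sh. Levin

AI summary: Uses the Hierarchical Morphological Multicriteria Design approach to configure applied Web-based systems by combining design alternatives for system parts, and evaluates the quality of the combinations in a lattice-based discrete space.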

Comments 13 pages, 9 tables, 17 figures

Abstract

In the paper, combinatorial synthesis of structure for applied Web-based systems is described. The problem is considered as a combination of selected design alternatives for system parts/components into a resultant composite decision (i.e., system configuration design). The solving framework is based on Hierarchical Morphological Multicriteria Design (HMMD) approach: (i) multicriteria selection of alternatives for system parts, (ii) composing the selected alternatives into a resultant combination (while taking into account ordinal quality of the alternatives above and their compatibility). A lattice-based discrete space is used to evaluate (to integrate) quality of the resultant combinations (i.e., composite system decisions or system configurations). In addition, a simplified solving framework based on multicriteria multiple choice problem is considered. A multistage design process to obtain a system trajectory is described as well. The basic applied example is targeted to an applied Web-based system for a communication service provider. Two other applications are briefly described (corporate system and information system for academic application).


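1107.0089 2026-06-03 eess.SY cs.AI cs.SY version update

Towards a Reliable Framework of Uncertainty-Based Group Decision Support System

Junyi Chai, James N. K. Liu

AI summary: Proposes a framework for an uncertainty-based group decision support system that integrates a multi-agent architecture with artificial intelligence techniques to support multicriteria decision analysis and provide reliable decision support.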

Comments Accepted paper in IEEE-ICDM2010; Print ISBN: 978-1-4244-9244-2

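1006.2165 2026-06-03 stat.ME cs.AI cs.RO cs.SY eess.SY math.OC stat.ML version update

A Probabilistic Perspective on Gaussian Filtering and Smoothing

Marc Peter Deisenroth, Henrik Ohlsson

AI summary: Unifies Gaussian filtering and smoothing methods from a probabilistic perspective, observes that they differ only in how the means and covariances of joint probabilities are computed or approximated, and on this basis derives the cubature Kalman smoother and a robust filtering and smoothing algorithm based on Gibbs sampling.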

Comments 14 pages. Extended version of conference paper (ACC 2011)

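1004.2342 2026-06-03 cs.AI cs.PF cs.SY eess.SY math.OC math.PR version update

Mean field for Markov Decision Processes: from Discrete to Continuous Optimization

Nicolas Gast, Bruno Gaujal, Jean-Yves Le Boudec

AI summary: Studies how a Markov decision process composed of a large number of objects converges to an optimization problem on ordinary differential equations, obtains a continuous HJB equation through mean field approximation, and gives bounds on the reward difference together with a constructive algorithm.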

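1102.0899 2026-06-03 cs.AI cs.CV cs.LG cs.NA math.NA math.PR version update

Evidence Feed Forward Hidden Markov Model: A New Type of Hidden Markov Model

Michael DelRose, Christian Wagner, Philip Frederick

AI summary: To address the inability of hidden Markov models to model observation-to-observation linkages, proposes the Evidence Feed Forward Hidden Markov Model, which adds probabilistic links between observations to improve classification, and validates it on visual action data and measurement data.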

Comments 19 pages, International Journal of Artificial Intelligence and Applications

DOI: 10.5121/ijaia.2011.2101
Journal ref: International Journal of Artificial Intelligence and Applications (IJAIA), Vol. 2, No. 1, Jan 2011

Abstract

The ability to predict the intentions of people based solely on their visual actions is a skill only performed by humans and animals. The intelligence of current computer algorithms has not reached this level of complexity, but there are several research efforts that are working towards it. With the number of classification algorithms available, it is hard to determine which algorithm works best for a particular situation. In classification of visual human intent data, Hidden Markov Models (HMMs) and their variants are leading candidates. The inability of HMMs to provide a probability for the observation-to-observation linkages is a major shortcoming of this classification technique. If a person is visually identifying an action of another person, they monitor patterns in the observations. By estimating the next observation, people have the ability to summarize the actions, and thus determine, with pretty good accuracy, the intention of the person performing the action. These visual cues and linkages are important in creating intelligent algorithms for determining human actions based on visual observations. The Evidence Feed Forward Hidden Markov Model is a newly developed algorithm which provides observation-to-observation linkages. The following research addresses the theory behind Evidence Feed Forward HMMs, provides mathematical proofs of their learning of these parameters to optimize the likelihood of observations with an Evidence Feed Forward HMM, which is important in all computational intelligence algorithms, and gives comparative examples with standard HMMs in classification of both visual action data and measurement data, thus providing a strong base for Evidence Feed Forward HMMs in classification of many types of problems.


1012.0365 2026-06-03 math.NA cs.AI cs.NA math.OC version update

A Block Lanczos with Warm Start Technique for Accelerating Nuclear Norm Minimization Algorithms

一种加速核范数最小化算法的块Lanczos热启动技术

Zhouchen Lin, Siming Wei

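AI summary: Proposes the block Lanczos with warm start (BLWS) technique, which uses the principal singular subspaces from the previous iteration to initialize the block Lanczos procedure and thereby speeds up the partial SVD computations in nuclear norm minimization algorithms; experiments show a speedup of two to three times.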


AI Chinese abstract

近年来，使用秩最小化作为各种信号处理和机器学习问题的正则化器变得流行。由于秩最小化问题通常转化为核范数最小化（NNM）问题，它们必须迭代求解，且每次迭代需要计算奇异值分解（SVD）。因此，它们的求解受到多次SVD高计算成本的影响。为了缓解这一问题，我们提出使用块Lanczos方法计算部分SVD，其中利用前一次迭代得到的主奇异子空间来启动块Lanczos过程。为了避免Lanczos过程中昂贵的重正交化，块Lanczos过程仅执行少数几步。我们的块Lanczos热启动（BLWS）技术可被求解NNM问题的不同算法采用。我们给出了将BLWS应用于鲁棒PCA和矩阵补全问题的数值结果。实验结果表明，我们的BLWS技术通常将其宿主算法加速至少两到三倍。

English abstract

Recent years have witnessed the popularity of using rank minimization as a regularizer for various signal processing and machine learning problems. As rank minimization problems are often converted to nuclear norm minimization (NNM) problems, they have to be solved iteratively and each iteration requires computing a singular value decomposition (SVD). Therefore, their solution suffers from the high computation cost of multiple SVDs. To relieve this issue, we propose using the block Lanczos method to compute the partial SVDs, where the principal singular subspaces obtained in the previous iteration are used to start the block Lanczos procedure. To avoid the expensive reorthogonalization in the Lanczos procedure, the block Lanczos procedure is performed for only a few steps. Our block Lanczos with warm start (BLWS) technique can be adopted by different algorithms that solve NNM problems. We present numerical results on applying BLWS to Robust PCA and Matrix Completion problems. Experimental results show that our BLWS technique usually accelerates its host algorithm by at least two to three times.
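The warm-start idea in this abstract can be illustrated with a minimal sketch. It is not the authors' BLWS implementation: it builds a small block Krylov subspace seeded with the previous iteration's right singular subspace and then extracts approximate singular triplets with a Rayleigh-Ritz step. The function name `warm_start_partial_svd` and the parameters `V_prev` and `steps` are hypothetical, and the sketch reorthogonalizes every block for numerical simplicity, whereas the abstract says BLWS avoids that cost by running only a few steps.

```python
import numpy as np

def warm_start_partial_svd(A, V_prev, steps=2):
    """Approximate the top-k singular triplets of A (illustrative sketch).

    V_prev (n x k) is a guess for the leading right singular subspace,
    e.g. the one computed in the previous outer iteration (warm start).
    """
    k = V_prev.shape[1]
    block, _ = np.linalg.qr(V_prev)         # orthonormalize the warm start
    blocks = [block]
    for _ in range(steps):                  # only a few block Krylov steps
        W = A.T @ (A @ blocks[-1])          # apply A^T A to the newest block
        basis = np.hstack(blocks)
        W = W - basis @ (basis.T @ W)       # simplification: full reorthogonalization
        block, _ = np.linalg.qr(W)
        blocks.append(block)
    Q, _ = np.linalg.qr(np.hstack(blocks))  # orthonormal basis of the Krylov space
    # Rayleigh-Ritz step: SVD of A restricted to span(Q)
    U, s, Wt = np.linalg.svd(A @ Q, full_matrices=False)
    V = Q @ Wt.T
    return U[:, :k], s[:k], V[:, :k]
```

In an iterative NNM solver, the returned `V` would be passed back as `V_prev` on the next iteration, so each partial SVD starts close to its answer and a small `steps` value can suffice. The sketch assumes `k * (steps + 1)` does not exceed either dimension of `A`.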


0906.0311 2026-06-03 cs.AI cs.NA math.NA physics.data-an version update

Solar radiation forecasting using ad-hoc time series preprocessing and neural networks

使用特定时间序列预处理和神经网络的太阳辐射预测

Christophe Paoli, Cyril Voyant, Marc Muselli, Marie-Laure Nivet

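AI summary: This paper proposes a method for forecasting daily solar radiation on a horizontal surface that combines ad-hoc time series preprocessing with a multilayer perceptron (MLP), achieving nRMSE < 21% and RMSE < 998 Wh/m² and outperforming conventional methods such as ARIMA and Bayesian inference.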

Comments 14 pages, 8 figures, 2009 International Conference on Intelligent Computing

0712.4126 2026-06-03 cs.AI cs.CE cs.MS cs.NA cs.NE math.NA version update

TRUST-TECH based Methods for Optimization and Learning

基于TRUST-TECH的优化与学习方法

Chandan K. Reddy

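AI summary: For nonlinear and global optimization problems in machine learning, proposes a TRUST-TECH based framework that alternates between a local stage and a neighborhood-search stage, reducing sensitivity to initialization and improving solution quality.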

Comments PHD Thesis


Journal ref: Chandan K. Reddy, TRUST-TECH based Methods for Optimization and Learning, PHD Thesis, Cornell University, February 2007

AI Chinese abstract

机器学习领域中出现的许多问题涉及非线性，并且通常要求用户获得全局最优解而非局部最优解。优化问题是机器学习算法中固有的，因此机器学习中的许多方法都继承自优化文献。通常被称为初始化问题，所需的理想参数集将显著依赖于给定的初始值。最近开发的TRUST-TECH（稳定性保持平衡变换表征）方法系统地探索参数子空间，以获得完整的局部最优解集。在本论文工作中，我们提出了基于TRUST-TECH的方法来解决若干优化和机器学习问题。在解空间中交替重复两个阶段，即局部阶段和邻域搜索阶段，以提高解的质量。我们的方法在合成数据集和真实数据集上进行了测试，使用这一新颖框架的优势得到了清晰体现。该框架不仅降低了对初始化的敏感性，还允许从业者灵活使用各种对特定问题有效的全局和局部方法。还研究了其他层次随机算法，如进化算法和平滑算法，并提出了将这些方法与TRUST-TECH结合的框架，在多个测试系统上进行了评估。

English abstract

Many problems that arise in the machine learning domain deal with nonlinearity and quite often require users to obtain globally optimal solutions rather than locally optimal ones. Optimization problems are inherent in machine learning algorithms, and hence many methods in machine learning were inherited from the optimization literature. In what is popularly known as the initialization problem, the ideal set of parameters required depends significantly on the given initialization values. The recently developed TRUST-TECH (TRansformation Under STability-reTaining Equilibria CHaracterization) methodology systematically explores the subspace of the parameters to obtain a complete set of local optimal solutions. In this thesis work, we propose TRUST-TECH based methods for solving several optimization and machine learning problems. Two stages, namely the local stage and the neighborhood-search stage, are repeated alternately in the solution space to achieve improvements in the quality of the solutions. Our methods were tested on both synthetic and real datasets, and the advantages of using this novel framework are clearly manifested. This framework not only reduces the sensitivity to initialization, but also gives practitioners the flexibility to use various global and local methods that work well for a particular problem of interest. Other hierarchical stochastic algorithms, such as evolutionary algorithms and smoothing algorithms, are also studied, and frameworks for combining these methods with TRUST-TECH have been proposed and evaluated on several test systems.


nlin/0407032 2026-06-03 nlin.PS cs.AI cs.NA math.NA version update

Application of Artificial Neural Network in Jitter Analysis of Dispersion-Managed Communication System

人工神经网络在色散管理通信系统抖动分析中的应用

F. P. Zen, B. E. Gunara, W. Hidayat, Z. A. Thalib, H. Zainuddin, J. Aminuddin

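AI summary: Uses an artificial neural network to solve the modified nonlinear Schrödinger equation and analyze jitter in dispersion-managed systems, validating and improving on the results of conventional numerical methods.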

Comments 9 pages, 5 figures