arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.12640 2026-06-12 cs.LG cs.RO eess.SY New submission

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi

Affiliations: Department of Electrical Engineering and Automation, Aalto University; School of Computing and Data Science, Xiamen University Malaysia; Department of Computer Science, University of Toronto

AI summary: Proposes an offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into a diffusion model and recovers control policies through inverse dynamics, substantially improving the safety of trajectory generation while preserving reward.

Comments
Accepted to the 23rd IFAC World Congress, 2026
Abstract

Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.
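
To make the barrier-function idea concrete, here is a minimal sketch of the standard discrete-time control barrier function condition, h(x[t+1]) - h(x[t]) >= -alpha * h(x[t]), checked along a trajectory. It is not the paper's neural individual CBF or its diffusion guidance: the hand-written barrier `h` (distance to a circular obstacle), the gain `alpha`, and all function names are assumptions for illustration.

```python
import math

def h(state, obstacle=(0.0, 0.0), radius=1.0):
    """Hypothetical barrier: positive outside the obstacle, negative inside."""
    return math.dist(state, obstacle) - radius

def cbf_violations(trajectory, alpha=0.5):
    """Return indices t where h(x[t+1]) - h(x[t]) < -alpha * h(x[t])."""
    violations = []
    for t in range(len(trajectory) - 1):
        h_now, h_next = h(trajectory[t]), h(trajectory[t + 1])
        if h_next - h_now < -alpha * h_now:
            violations.append(t)
    return violations

# Two toy single-agent trajectories approaching an obstacle at the origin.
safe_path = [(3.0, 0.0), (2.5, 0.0), (2.2, 0.0)]      # approaches slowly
unsafe_path = [(3.0, 0.0), (1.2, 0.0), (0.5, 0.0)]    # rushes in and enters

print(cbf_violations(safe_path))
print(cbf_violations(unsafe_path))
```

A guided generator could, in principle, use such a check (or a differentiable version of it) to steer sampled trajectories away from violating steps; how the paper does this is not described in the abstract.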

2606.12635 2026-06-12 cs.CV New submission

CD-RCM: Generalizable Continuous-Depth Novel View Synthesis for Reflectance Confocal Microscopy

Tooba Imtiaz, Milind Rajadhyaksha, Kivanc Kose, Jennifer Dy

Affiliations: Northeastern University; Memorial Sloan Kettering Cancer Center

AI summary: For the anisotropic 3D volumes produced by reflectance confocal microscopy, proposes CD-RCM, the first RCM-specific novel-view synthesis method: a feedforward model that predicts continuous-depth slices from a sparse z-stack, with high-fidelity synthesis in under a second.

Abstract

Reflectance confocal microscopy (RCM) provides noninvasive, cellular-resolution "optical biopsies" of human skin in vivo by acquiring en-face images at successive depths, forming a sparse z-stack. Due to optical limitations, these stacks are anisotropic 3D volumes whose lateral resolution (0.5 µm) is about 6 times finer than the axial resolution (3 µm, defined by the optical sectioning), which limits the interpretation of tissue. Our goal is to provide continuous-depth visualization by interpolating intermediate sections and making the 3D volume isotropic. Such a representation permits arbitrary-direction sectioning, including histopathology-like cross-sectional examination, without requiring per-patient optimization. To that end, we introduce the first RCM-specific novel-view synthesis (NVS) approach, CD-RCM, a feedforward model that predicts realistic, unseen depths from sparsely sampled RCM stacks. Classical neural rendering methods focus on reconstruction from surface-level multi-view observations. In contrast to surface-level camera views, RCM can acquire optically sectioned en-face images of tissue beyond the surface up to 200 µm. However, during visualization of the RCM stacks, observations of the shallower sections (towards the surface) obscure the deeper ones. This unique axial imaging geometry and layer-dependent anatomical organization motivated our development of a tailored architectural and training framework that explicitly accounts for RCM's depth-resolved, occlusive imaging physics. Experiments demonstrate that CD-RCM achieves high-fidelity novel-view synthesis with sub-second inference time.
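
The anisotropy figure and the interpolation goal can be illustrated with a naive baseline. The sketch below linearly blends the two nearest slices of a z-stack to produce a slice at an arbitrary depth. CD-RCM replaces this kind of blending with a learned feedforward model, so this is a point of comparison and not the authors' method. The toy 2x2 slices, the use of the 3 µm axial figure as slice spacing, and the function name are assumptions.

```python
def interpolate_slice(stack, spacing_um, depth_um):
    """Linearly blend the two slices that bracket depth_um (naive baseline)."""
    max_depth = spacing_um * (len(stack) - 1)
    if not 0.0 <= depth_um <= max_depth:
        raise ValueError("depth outside the acquired stack")
    # Index of the slice at or just above the requested depth.
    lower = min(int(depth_um // spacing_um), len(stack) - 2)
    weight = (depth_um - lower * spacing_um) / spacing_um
    a, b = stack[lower], stack[lower + 1]
    return [[(1 - weight) * pa + weight * pb for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]

# Three toy en-face slices (2x2 pixels each), assumed to be 3 um apart.
stack = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
lateral_um, axial_um = 0.5, 3.0
anisotropy = axial_um / lateral_um               # ratio quoted in the abstract
mid = interpolate_slice(stack, axial_um, 4.5)    # halfway between slices 1 and 2
print(anisotropy, mid)
```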

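2606.12634 2026-06-12 cs.LG cs.AI cs.CL New submission

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

Affiliations: Amazon Web Services

AI summary: Targets the sparse trajectory-level advantage signal in long-horizon tool-use reinforcement learning with Sibling-Guided Credit Distillation (SGCD): dynamic sampling yields successful and failed rollouts, an external LLM contrasts them into a stepwise credit reference, and the resulting dense credit assignment improves results on AppWorld and τ³-airline.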

Comments
13 pages, 4 figures, 7 tables. Submitted to EMNLP 2026 Industry Track
Abstract

Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and τ³-airline, SGCD improves over matched GRPO comparators: AppWorld TGC rises from 42.9 to 45.6 on test_normal and from 24.7 to 27.0 on test_challenge, and τ³-airline pass@1 rises from 0.583 to 0.602.
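
The final step of the pipeline, reshaping GRPO token advantages with bounded credit weights, can be sketched as follows. GRPO-style training standardizes each rollout's reward within its sibling group and broadcasts that single number to every token; the sketch then rescales it per token with clipped weights. The clipping range, the example weights, and the function names are assumptions, and the way SGCD derives the weights from teacher/student divergence is not reproduced here.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within the sibling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def reshape_token_advantages(advantage, credit_weights, low=0.5, high=1.5):
    """Broadcast one trajectory advantage to tokens, scaled by bounded weights.

    The weights are treated as constants (detached): they rescale the
    policy-gradient signal but are not themselves optimized.
    """
    clipped = [min(max(w, low), high) for w in credit_weights]
    return [advantage * w for w in clipped]

rewards = [1.0, 0.0, 0.0, 1.0]            # verifier outcomes for 4 sibling rollouts
advs = group_advantages(rewards)
# Hypothetical per-token credit weights for the first rollout (3 tokens).
token_advs = reshape_token_advantages(advs[0], [0.1, 1.0, 3.0])
print(advs, token_advs)
```

Because the weights are clipped to a positive range, they change how much each token is reinforced but never flip the sign of the trajectory advantage, which is one way to keep the policy gradient "in charge".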

2606.12629 2026-06-12 cs.LG cs.AI New submission

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Varun Reddy Nalagatla

Affiliations: Amazon Web Services

AI summary: Proposes the Bag of Dims framework, showing that the standard basis of Transformer hidden states can serve as a training-free feature basis in which per-dimension sign patterns encode semantics, validated on three models.

Comments
14 pages, 4 figures, 10 tables
Abstract

We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs and confidence via their magnitudes, functioning as independent binary registers. We validate this Bag of Dims framework across three model families (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B) through four progressive experiments. Sign patterns alone carry predictive content: replacing all magnitudes with unity achieves 72-93% top-5 next-token accuracy through the LM head, and pure Hamming scoring without any decoder reaches 80-90% top-4096. These sign patterns organize into semantic features: using a single-token type cache (one forward pass per vocabulary token, no context), we discover 175 categories via per-dimension sign consistency (mean AUC 0.80) from 50 anchors with zero training. A trained probe adds only +0.018 AUC and converges to axis-aligned weights, confirming negligible cross-dimension structure. This structure extends to attention: all 175 categories remain discoverable in K and V projections. On the write side, static FFN weight inspection links 20% of features to individual writer neurons (>0.70 agreement; random controls: 0%), with top-200 neuron coalitions achieving >0.70 agreement on 99.9% of prototypes via majority vote. Fully unsupervised discovery (random seeds, no labels) scales to 1500 features at 100% yield and 99% sparsity across all three models, with pairwise MI of 0.0014 bits confirming low inter-dimension coupling. These results establish that the standard basis already suffices for feature reading throughout the transformer compute pathway, requiring no training, no optimization, and no GPU-days beyond a single forward pass per vocabulary token.
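
The two sign-only readouts can be sketched on toy vectors. `binarize` keeps only the sign of each dimension (every magnitude replaced by unity), and `hamming_agreement` scores a hidden state against a candidate's sign pattern by counting matching dimensions. The 6-dimensional vectors and candidate names are made up, and the paper's actual scoring against vocabulary tokens may differ in detail.

```python
def binarize(vector):
    """Replace every magnitude with 1, keeping only the sign (+1 or -1)."""
    return [1 if x >= 0 else -1 for x in vector]

def hamming_agreement(signs_a, signs_b):
    """Fraction of dimensions whose signs match (1 - normalized Hamming distance)."""
    matches = sum(1 for a, b in zip(signs_a, signs_b) if a == b)
    return matches / len(signs_a)

def rank_candidates(hidden, candidates):
    """Rank candidate vectors by sign agreement with the hidden state."""
    pattern = binarize(hidden)
    scored = {name: hamming_agreement(pattern, binarize(vec))
              for name, vec in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)

hidden = [0.9, -0.2, 3.1, -1.4, 0.05, -0.7]
candidates = {
    "cat": [0.1, -2.0, 0.4, -0.3, 1.2, -0.1],   # all six signs match
    "dog": [0.5, 0.8, 0.2, -0.9, -0.4, -0.6],   # four of six signs match
    "car": [-1.0, 0.3, -0.2, 0.7, -0.5, 0.9],   # no sign matches
}
ranking = rank_candidates(hidden, candidates)
print(ranking)
```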

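2606.12628 2026-06-12 cs.CV New submission

Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving

Binay Kumar Singh, Niels Da Vitoria Lobo

Affiliations: Department of Computer Science, University of Central Florida

AI summary: Proposes the Context-Centric Feature Fusion (CCFF) framework, in which a Local Context Fusion Module handles small and occluded objects and a Global Context Attention Module handles co-occurrence priors, improving co-occurring object detection; reports a Category-level Consistency Strategy (CCS) of 0.973 on Cityscapes and 0.969 on BDD100K, along with small-object gains (AP_S: 14.1%).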

Comments
8 pages, 3 figures, CVPR 2026 Precognition Workshop
Abstract

Object detection in autonomous driving requires precise localization and an inherent understanding of the relational context between co-occurring objects. In extremely complex, heterogeneous environments, rare classes, small-scale objects, and frequently appearing objects are difficult for standard object detection frameworks to handle. In this paper, we propose a novel framework called Context-Centric Feature Fusion (CCFF), which uses two attention-based modules. The Local Context Fusion Module (LCFM) uses an RoI-to-RoI self-attention mechanism to resolve spatial interactions, mainly targeting small and partially obscured objects, while the Global Context Attention Module (GCAM) captures object co-occurrence priors by pooling top-K RoI features into a global context attention token, avoiding the computational overhead of pixel-level global pooling. This fusion of local and object-centric global features yields contextualized embeddings that enhance classification results and co-occurring object detection. We evaluate our method on two datasets, Cityscapes and BDD100K, where it demonstrates significant improvement in relational consistency, achieving a Category-level Consistency Strategy (CCS) of 0.973 and 0.969, respectively. Furthermore, our approach produces substantial gains in small object detection (AP_S: 14.1%) and successfully recovers rare classes such as "Train" that are typically lost in large distributions. Our efficiency report shows that the framework processes images in real time with a 0.2 FPS overhead. The code is available at this https URL.
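
The idea of pooling top-K RoI features into one global token can be sketched in a few lines. This version ranks RoIs by a detection score and mean-pools the K best feature vectors; both the ranking criterion and the use of a plain mean are assumptions, since the abstract does not say how GCAM selects or aggregates RoIs.

```python
def global_context_token(roi_features, roi_scores, k):
    """Mean-pool the k highest-scoring RoI feature vectors into one token."""
    top = sorted(range(len(roi_scores)),
                 key=lambda i: roi_scores[i], reverse=True)[:k]
    dim = len(roi_features[0])
    return [sum(roi_features[i][d] for i in top) / len(top) for d in range(dim)]

# Four toy RoIs with 2-dimensional features; the last one has a low score.
roi_features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [100.0, 100.0]]
roi_scores = [0.9, 0.8, 0.7, 0.1]
token = global_context_token(roi_features, roi_scores, k=3)
print(token)
```

The cost argument is visible in the shapes: the pool touches K feature vectors, whereas pixel-level global pooling would touch one vector per spatial location of the feature map.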

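2606.12618 2026-06-12 cs.AI New submission

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney, David Africa, Geoffrey Irving

Affiliations: AI Security Institute

AI summary: Builds 13 belief-verified reasoning model organisms and a varied prompted-lying testbed to evaluate four lie detectors across model scales; activation- and probability-based detectors degrade sharply on the trained model organisms, while the chain-of-thought judge stays strong, though partly as an artefact of the verification process.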

Comments
12 pages, 6 figures
Abstract

Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.
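
Balanced accuracy, the metric quoted for the chain-of-thought judge, is the mean of the true-positive rate and the true-negative rate, so a detector cannot score well by always predicting the majority class. A minimal sketch with made-up labels (1 = lie, 0 = honest):

```python
def balanced_accuracy(labels, predictions):
    """Mean of recall on lies (label 1) and recall on honest statements (label 0)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return 0.5 * (tp / positives + tn / negatives)

labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]      # 2 lies, 8 honest statements
always_honest = [0] * 10                     # degenerate detector
plain_accuracy = sum(y == p for y, p in zip(labels, always_honest)) / len(labels)
score = balanced_accuracy(labels, always_honest)
print(plain_accuracy, score)
```

The degenerate detector reaches 0.8 plain accuracy on this imbalanced toy set but only 0.5 balanced accuracy, which is chance level.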

2606.12616 2026-06-12 cs.AI cs.CL New submission

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki

Affiliations: University of California, Irvine

AI summary: Proposes the PersonaDrive pipeline, which conditions a vision-language-action (VLA) driving agent on retrieved human driving demonstrations recorded under style instructions, producing diverse non-ego agent behavior in closed-loop simulation without per-style retraining.

Abstract

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.
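
The inference-time style switch described above amounts to pointing one retrieval routine at a different database. The sketch below uses cosine similarity over toy 2-dimensional keys; the key vectors, demonstration labels, and function names are hypothetical, and the real retrieval head fuses frozen visual features with a control encoder rather than comparing raw vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query, database, k=1):
    """Return the k demonstrations whose keys are most similar to the query."""
    ranked = sorted(database, key=lambda item: cosine(query, item["key"]),
                    reverse=True)
    return [item["demo"] for item in ranked[:k]]

# One database per style; keys stand in for scene embeddings.
style_databases = {
    "aggressive": [{"key": [1.0, 0.0], "demo": "late braking"},
                   {"key": [0.0, 1.0], "demo": "tight overtake"}],
    "conservative": [{"key": [1.0, 0.0], "demo": "early braking"},
                     {"key": [0.0, 1.0], "demo": "wait behind"}],
}
scene = [0.9, 0.1]
for style, database in style_databases.items():
    print(style, retrieve(scene, database))
```

The same query and the same `retrieve` function yield different context for the policy depending only on which database is selected, which is why no per-style retraining is needed.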

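2606.12615 2026-06-12 cs.LG New submission

Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions

Owen O'Neill, Fintan Costello

Affiliations: University College Dublin

AI summary: Proposes the Fair Bayesian classifier, which enforces determinism and statistical consistency and achieves zero consistency error on several datasets while maintaining accuracy and multicalibration, addressing the inconsistent predictions that regularisation causes for minority groups.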

Abstract

ML classifiers deployed in high-stakes domains produce predictions whose quality varies systematically across subgroups. For granular subgroups defined by intersections of multiple features, predictions are often inconsistent with the observed data: the model's outputs contradict the evidence available for that subgroup. This problem is exacerbated by regularisation, which improves aggregate performance by collapsing small subgroups into larger groups, disproportionately affecting demographic minorities. We define two requirements for consistent prediction: determinism (identical individuals receive identical predictions) and statistical consistency (we cannot reject, at significance level alpha, the hypothesis that the predictions for a subgroup were drawn from the Bayesian optimal target distribution inferred for that subgroup). From these requirements we derive the Fair Bayesian classifier, which enforces both across every group and subgroup simultaneously and abstains whenever no consistent deterministic prediction is possible. On three benchmark datasets (Adult, COMPAS, and Bank Marketing), standard classifiers produce statistically inconsistent predictions for a substantial proportion of subgroups. Our classifier achieves zero consistency error by construction while exceeding baseline accuracy and multicalibration on every dataset tested. Statistical consistency provides a principled foundation for prediction quality with direct implications for algorithmic fairness. Minority demographics are disproportionately concentrated in small subgroups, precisely where frequentist inference is least reliable; addressing this inference problem is therefore a necessary step toward fair ML. By enforcing Bayesian consistency at the finest resolution the data supports, our classifier demonstrates that exhaustive subgroup fairness with principled abstention is achievable in practice.

2606.12614 2026-06-12 cs.RO New submission

DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems

Benjamin Alcorn, Eman Hammad

Affiliations: Texas A&M University

AI summary: Proposes the DARRMS algorithm, which jointly optimizes the attention radius and decision-making to reduce computational demand under resource constraints while improving coordination and scalability.

Abstract

Multi-agent systems are integral tools for various domains such as robotics, cybersecurity, and autonomous vehicle planning. Such systems often face constraints on computational resources, creating a need for efficient, lightweight algorithms. Traditional decision-making frameworks often assume ideal conditions, such as full observability and unlimited computational capacity, which do not align with real-world challenges. In this paper, we introduce a new algorithm that reduces the demand on computational resources without a large cost to other performance metrics. Agents limit their observability to an attention radius, which intentionally lets them ignore parts of the environment that might be unnecessary for action planning. By optimizing both the attention radius and decision-making, our approach enhances coordination and scalability in uncertain environments. Through both theoretical analysis and empirical validation, we demonstrate the effectiveness of adaptive observation in improving system performance and maintaining robust decision-making strategies in resource-constrained systems.
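
The attention-radius idea can be illustrated with a simple neighbor filter: an agent plans only over the agents it can "see" within a radius, so a smaller radius means fewer entities to reason about. This is a sketch of the observation step only, with hypothetical positions and names; how DARRMS chooses or adapts the radius is not described in the abstract.

```python
import math

def visible_neighbors(agent, others, radius):
    """Return the names of agents within `radius` of `agent` (Euclidean distance)."""
    return [name for name, pos in others.items()
            if math.dist(agent, pos) <= radius]

agent = (0.0, 0.0)
others = {"a": (1.0, 0.0), "b": (0.0, 4.0), "c": (6.0, 8.0)}

for radius in (2.0, 5.0, 10.0):
    # The planning workload grows with the number of neighbors considered.
    print(radius, visible_neighbors(agent, others, radius))
```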

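2606.12610 2026-06-12 cs.LG New submission

The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

Miquel Noguer i Alonso, David Pacheco Aznar

Affiliations: AIFI Staq.io

AI summary: Offers a mathematical account of the AI winters, analysing why early AI paradigms failed through formal bottlenecks (perceptron impossibility results, neural-network training complexity, high-dimensional nonparametric estimation rates, vanishing gradients, and statistical learning theory) and relating them to later breakthroughs.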

Comments
33 pages, 1 figure
Abstract

Two major periods of reduced funding and confidence in artificial intelligence research, commonly called the first and second AI winters, are usually explained through engineering failure, commercial disappointment, and inflated expectations. This article develops a complementary thesis: that the dominant paradigms of those periods also met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The contribution is synthetic rather than archival. We do not claim that particular theorems mechanically caused the winters; rather, we show that several central disappointments of early AI were aligned with mathematically precise bottlenecks. We analyse these bottlenecks through the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training established by Blum and Rivest, minimax rates for nonparametric estimation in high dimension due to Stone, vanishing-gradient analyses by Hochreiter and by Bengio and collaborators, and classical statistical learning theory in the tradition of Vapnik and Chervonenkis, Valiant, and Blumer and collaborators. We then relate these barriers to the later breakthroughs that mitigated, rather than eliminated, them.
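
As a concrete instance of a representational barrier, the best-known special case of the perceptron impossibility results is that a single linear threshold unit cannot compute XOR. The derivation below is the standard textbook argument, not text taken from the article; the symbols $w_1$, $w_2$, and $b$ (weights and bias) are introduced here for illustration.

```latex
Suppose $f(x_1, x_2) = \mathbb{1}[\,w_1 x_1 + w_2 x_2 + b > 0\,]$ computes XOR. Then
\begin{align*}
f(0,0) = 0 &\;\Rightarrow\; b \le 0, \\
f(1,0) = 1 &\;\Rightarrow\; w_1 + b > 0, \\
f(0,1) = 1 &\;\Rightarrow\; w_2 + b > 0, \\
f(1,1) = 0 &\;\Rightarrow\; w_1 + w_2 + b \le 0.
\end{align*}
Adding the second and third lines gives $w_1 + w_2 + 2b > 0$, hence
\[
w_1 + w_2 + b > -b \ge 0,
\]
which contradicts the fourth line. No choice of $(w_1, w_2, b)$ exists.
```

Adding a hidden layer removes this particular obstacle, which matches the abstract's point that later breakthroughs mitigated such barriers rather than eliminated them.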

2606.12608 2026-06-12 cs.CL cs.LG New submission

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia, Jayakrishnan Unnikrishnan, Rowan Musselmann, Yifan Gao, Qingyu Yin, Priyanka Nigam, Bing Yin

Affiliations: Amazon

AI summary: Presents a multi-turn conversational shopping reasoning benchmark of 525 missions authored by retail experts, with 10863 weighted rubrics; across nine models, pass rates reach only 57-77%, and performance on multi-turn missions drops by 4-18 points.

Abstract

Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57--77% overall. On multi-turn missions, all models score 13--29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4--18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.
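
Scoring with importance-weighted binary rubrics can be sketched as a weighted pass fraction: each criterion either passes or fails, and heavier criteria count for more. The weights and criteria below are invented, and the benchmark's exact aggregation rule is not given in the abstract, so treat this as one plausible reading.

```python
def weighted_score(rubric_results):
    """Importance-weighted fraction of binary rubric criteria that passed."""
    total = sum(weight for weight, _ in rubric_results)
    earned = sum(weight for weight, passed in rubric_results if passed)
    return earned / total

# (importance weight, passed?) for one hypothetical response.
rubric_results = [
    (3, True),    # respects the stated budget
    (2, False),   # explains the trade-off between two products
    (1, True),    # mentions accessory compatibility
]
score = weighted_score(rubric_results)
unweighted = sum(passed for _, passed in rubric_results) / len(rubric_results)
print(score, unweighted)
```

Here the weighted and unweighted scores happen to coincide at 2/3; failing the weight-3 criterion instead of the weight-2 one would drop the weighted score to 1/2 while leaving the unweighted score unchanged.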

2606.12603 2026-06-12 cs.RO cs.AI New submission

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou

Affiliations: University of California, Los Angeles

AI summary: Proposes FlowPilot, a mapless navigation policy using only a monocular RGB camera, pre-trained with anchored flow matching and aligned through human preference learning, improving robustness and social compliance in long-horizon sidewalk navigation.

Abstract

Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.
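
For readers unfamiliar with flow matching as an action representation, the sketch below shows the plain (unanchored) version with a straight-line path: training pairs consist of a point between noise and a demonstrated action plus the constant velocity that connects them, and sampling integrates a velocity field from noise to an action. The "anchored" variant used by FlowPilot is not specified in the abstract and is not implemented here; the oracle velocity function stands in for a trained network.

```python
def flow_matching_pair(noise, action, t):
    """Point on the straight path from noise to action, and its target velocity."""
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, action)]
    velocity = [a - n for n, a in zip(noise, action)]
    return x_t, velocity

def euler_sample(noise, velocity_fn, steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(noise), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.0, 0.0]
action = [1.0, -2.0]            # e.g. a 2-D waypoint offset from a demonstration
x_t, target_v = flow_matching_pair(noise, action, t=0.25)

# Oracle velocity field: what a perfectly trained model would output here.
sampled = euler_sample(noise, lambda x, t: target_v)
print(x_t, target_v, sampled)
```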

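2606.12601 2026-06-12 cs.CV New submission

Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

Sieu Tran, Duc Nguyen, Hao Vo, Khoa Vo, Ngan Le

Affiliations: University of Arkansas

AI summary: Proposes Dual-State Slot Attention (DSSA), which splits each slot into a local state (appearance) and an identity state (stable identity) and uses competition-modulated aggregation to reduce interference from weakly matching slots, improving video object segmentation quality and temporal consistency.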

Abstract

Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.
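
The second failure mode, token renormalization amplifying weakly attending slots, can be reproduced with a few numbers. In Slot Attention, a softmax over slots makes slots compete for each token, and the resulting weights are then renormalized over tokens for each slot before the weighted mean. The sketch below shows a slot that loses every competition yet still ends up with full-strength update weights. It illustrates the problem only; the competition-modulated aggregation that DSSA uses to fix it is not implemented.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of logits."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def slot_attention_weights(logits):
    """logits[token][slot] -> (attention over slots, per-slot renormalized weights)."""
    attn = [softmax(row) for row in logits]              # slots compete per token
    n_slots = len(logits[0])
    col_sums = [sum(row[s] for row in attn) for s in range(n_slots)]
    renorm = [[row[s] / col_sums[s] for s in range(n_slots)] for row in attn]
    return attn, renorm

# Three tokens that all belong to slot 0; slot 1 matches none of them well.
logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
attn, renorm = slot_attention_weights(logits)
print(attn[0], [row[1] for row in renorm])
```

Slot 1 receives under 2% of each token's attention, but after renormalization its weights over the three tokens are 1/3 each and sum to 1, so its update is a full-weight average of tokens that belong to another object.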

2606.12595 2026-06-12 cs.LG cs.AI cs.CV New submission

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu, Xiao Wang, Dalton Lunga

Affiliations: Oak Ridge National Laboratory

AI summary: Systematically compares geospatial foundation models with different architectures under a unified setup, evaluating flexibility and performance to offer design guidance for multimodal reasoning.

Abstract

Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading foundation model (FM) architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

2606.12594 2026-06-12 cs.AI 新提交

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Pythagoras-Prover: 通过增强型Lean形式化推进高效形式化证明

Joshua Ong Jun Leang, Zheng Zhao, Mihaela Cătălina Stoian, Qiyuan Xu, Haonan Li, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia

发表机构 * Imperial College London(伦敦帝国学院) University of Edinburgh(爱丁堡大学) Nanyang Technological University(南洋理工大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学)

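AI summary: Introduces the Pythagoras-Prover family, spanning autoregressive and diffusion models, which combines curriculum SFT, dynamic filtering, and expansion of verified data through Augmented Lean Formalisation (ALF), and surpasses DeepSeek-Prover-V2 on MiniF2F-Test with fewer parameters.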

Comments
Pythagoras-Prover: Technical Report
AI中文摘要

现代Lean定理证明器只有在大量训练和推理计算下才能取得强性能,部分原因是由于稀缺的验证证明数据和形式化证明搜索的长推理轨迹,使得监督微调(SFT)和采样成本高昂。我们介绍了Pythagoras-Prover,一个计算高效的开源Lean定理证明器系列,专为实际计算预算而构建。该系列涵盖两种生成范式:4B和32B参数的自回归模型,以及首个概念验证的基于扩散的证明器(4B),它在推理时迭代地精炼Lean证明。为了提高训练效率,我们构建了一个Lean验证的语料库,按易、中、难问题分层,用于课程SFT,使模型逐步从较短、较简单的证明过渡到较长、较难的证明。在SFT期间,动态证明推理过滤方案保留了信息丰富的证明轨迹,同时将每个实例保持在8k令牌的上下文预算内。我们还引入了增强型Lean形式化(ALF),它将稀缺的验证语料库扩展为形式化语句的变体,通过自蒸馏填充以提供额外训练信号,而无需正式验证每个变异实例。通过扰动已知问题同时保留其形式化特征,ALF减少了对任何语句表面形式的依赖。实验上,Pythagoras-Prover-4B在MiniF2F-Test上的pass@32(86.1% vs 82.4%)超过了DeepSeek-Prover-V2-671B,参数数量约为其1/167,而Pythagoras-Prover-32B在MiniF2F-Test上以93.0%的成绩创下了开源最先进水平,并在672个PutnamBench问题中解决了93个。我们发布了MiniF2F-ALF,一个经ALF变异的对污染敏感的基准,每个评估模型在该基准上的准确率均下降;在此基准上,我们的32B模型仍然最强,而4B模型匹配了先前最先进的Goedel-Prover-V2-32B。

英文摘要

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.
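The pass@32 figures above are sampling-based success rates. The abstract does not say how they are computed, so the following is only a minimal sketch of the unbiased pass@k estimator commonly used in sampling-based evaluation, assuming `n` proof attempts per problem of which `c` are verified by Lean. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled proofs is correct.

    n: total proof attempts sampled for one problem
    c: number of those attempts that were verified as correct
    k: sampling budget being reported (e.g. 32 for pass@32)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts are wrong, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn attempts are wrong), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark given (n, c) pairs, one per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When exactly `k` attempts are sampled per problem (`n == k`), the estimator reduces to checking whether any attempt succeeded, which is the simplest reading of a pass@32 score.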

2606.12590 2026-06-12 cs.CV cs.AI 新提交

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

分析与改进医学LVLMs中的细粒度偏好优化

Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

发表机构 * York University(约克大学) University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所) Queen’s University(女王大学)

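AI summary: To address the shortcomings of medical large vision-language models in factual consistency, visual grounding, and clinical alignment, proposes a fine-grained on-policy preference optimization framework that combines bidirectional token-wise KL regularization with a visual-contrastive grounding objective. Preference pairs are built by minimally editing model outputs so that only clinically erroneous spans are corrected, significantly improving diagnostic accuracy.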

AI中文摘要

大型视觉语言模型(LVLMs)在医学影像任务中取得了强劲性能,但仍容易出现事实不一致、视觉定位差以及与临床有意义反馈对齐不足的问题。现有的后训练对齐方法,包括直接偏好优化(DPO)及其变体,在医学领域面临三个关键限制:(1)序列级奖励信号将临床关键令牌与通用填充文本等同对待;(2)依赖静态监督微调参考作为偏好响应引入了离策略分布偏移,将优化导向风格伪影而非临床正确性;(3)对齐目标缺乏明确的视觉定位约束,使模型对微妙但诊断决定性的病理特征不敏感。我们的方法利用双向令牌级KL正则化以及视觉对比定位目标,该目标将干净图像与病变破坏图像配对,以惩罚缺乏足够视觉证据生成的响应。这些组件共同构成了一个细粒度的在线对齐框架,通过最小编辑模型生成的输出来构建偏好对,仅修正临床错误片段,同时保留原始语言风格。在医学影像任务和临床文本生成基准上的大量实验验证了我们方法的有效性。

英文摘要

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

2606.12587 2026-06-12 cs.AI cs.HC 新提交

Strategic Decision Support for AI Agents

AI智能体的战略决策支持

Shayan Kiyani, Sima Noorani, George Pappas, Hamed Hassani

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

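AI summary: Addresses reliability when AI agents are the primary decision-makers by proposing a strategic decision support framework that, through an optimization problem, minimizes support usage while controlling a counterfactual missed-support error, and develops an online algorithm that adaptively thresholds a support score.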

AI中文摘要

传统上,决策支持研究人类如何使用机器学习模型做出更好的决策。在现代智能体系统中,这种角色分工日益反转:AI智能体代表用户行动,而人类和工具成为围绕它们的支持机制。这种角色反转将可靠性问题推至前沿,因为智能体错误可能产生严重后果,且智能体行为必须始终与人类目标和约束保持一致。脱离经典的决策支持观点,我们在AI智能体作为核心行动者的设定下,重新审视其两个基本原则:寻求支持的成本-价值权衡以及不确定性量化的作用。我们提出了一个AI智能体战略决策支持框架,通过一个优化问题来最小化支持使用,同时控制一个反事实遗漏支持误差:即智能体在那些支持本可实质改善其输出的实例上单独行动的概率。在总体层面,我们证明最优策略是关于支持价值的阈值规则。基于这一结构,我们开发了一种在线算法,该算法自适应地阈值化这样的分数,并使用随机探索来控制遗漏支持误差,无需分布假设。我们进一步引入了一种即时校准方法,在线减少不必要的支持调用。我们将该框架实例化到多种场景中,包括信息收集、人机协作和工具使用,展示了每种场景如何通过相同的战略决策支持视角建模。跨这些场景的实验表明,我们的方法可靠地控制了目标误差,同时在实际中大幅减少了支持使用。

英文摘要

Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become support mechanisms around them. This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints. Departing from the classical view of decision support, we revisit its two basic principles, the cost-value tradeoff of seeking support and the role of uncertainty quantification, in a setting where AI agents are the central actors. We propose a framework for strategic decision support for AI agents through an optimization problem that minimizes support usage subject to controlling a counterfactual missed-support error: the probability that the agent acts alone on instances where support would have materially improved its output. At the population level, we show that the optimal policy is a threshold rule on the value of support. Building on this structure, we develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. We further introduce a calibration-on-the-fly method that reduces unnecessary support calls online. We instantiate this framework across diverse scenarios, including information gathering, human-AI collaboration, and tool use, showing how each can be modeled through the same strategic decision-support lens. Experiments across these settings show that our method reliably controls the target error while substantially reducing support usage in practice.
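The threshold rule described above can be illustrated with a small offline sketch. This is not the paper's online algorithm (which uses randomized exploration and is not reproduced here). It assumes a labeled calibration set in which each example has a support-value score and a flag saying whether support would have materially helped, and it treats the missed-support error as the empirical fraction of all examples where the agent acts alone although support would have helped. The names `calibrate_threshold` and `seek_support` are hypothetical.

```python
def calibrate_threshold(scores: list[float], helped: list[bool], alpha: float) -> float:
    """Return the largest threshold whose empirical missed-support rate is <= alpha.

    The agent seeks support when score >= threshold, so a larger threshold
    means fewer support calls. An example is "missed" when its score falls
    below the threshold (agent acts alone) although support would have helped.
    """
    if not scores or len(scores) != len(helped):
        raise ValueError("scores and helped must be non-empty and equally long")
    n = len(scores)
    # Candidate thresholds: every observed score, plus "never seek support".
    candidates = sorted(set(scores)) + [float("inf")]
    best = candidates[0]  # the smallest score misses nothing, so it is always feasible
    for t in candidates:
        missed = sum(1 for s, h in zip(scores, helped) if s < t and h)
        if missed / n <= alpha:
            best = t
        else:
            break  # the missed count only grows as the threshold increases
    return best


def seek_support(score: float, threshold: float) -> bool:
    """Threshold policy: ask for support only when the estimated value is high enough."""
    return score >= threshold
```

For example, with scores `[0.1, 0.2, 0.4, 0.7, 0.9]`, flags `[False, False, True, True, True]`, and `alpha = 0.25`, the calibrated threshold is `0.7`: one helpful instance out of five is missed, and raising the threshold further would miss two.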

2606.12579 2026-06-12 cs.RO 新提交

G-MAPP: GPU-accelerated Multi-Agent Planning and Perception for Reactive Motion Generation

G-MAPP: 基于GPU加速的多智能体规划与感知用于反应式运动生成

Tanmay Bishnoi, Riddhiman Laha, Tobias Löw, Jose Alex Chandy, Luis F. C. Figueredo, Sami Haddadin

发表机构 * Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University(多伦多都会大学电气、计算机与生物医学工程系) Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich (TUM)(慕尼黑工业大学慕尼黑机器人与机器智能研究所) Institute for Experiential Robotics, Northeastern University(东北大学体验式机器人研究所) Idiap Research Institute(Idiap 研究所) EPFL(瑞士联邦理工学院洛桑) CHART Group at the School of Computer Science, University of Nottingham(诺丁汉大学计算机科学学院 CHART 小组) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

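AI summary: Proposes a GPU-accelerated framework that uses parallel state exploration and a tightly coupled perception-action loop for real-time reactive motion generation in unstructured environments, achieving up to a 5x speedup and successful collision avoidance on a 7-DoF robot.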

Comments
The implementation is available at: this https URL
AI中文摘要

在非结构化环境中的反应式运动生成仍然是机器人学中的一个开放挑战。由于无碰撞运动生成的计算复杂性,现有方法要么为静态场景生成全局轨迹,要么采用对环境做出保守假设的模型。本文指出主要瓶颈在于高保真环境规划的运行时性能需求,以及感知与规划模块之间的时间集成。因此,我们提出一个框架,通过使用GPU加速世界建模和基于向量场的规划,不牺牲运行时性能和感知与规划的世界表示。这使得我们能够实现更快的并行状态探索以进行准全局轨迹规划,并在动态杂乱环境中使用现成的深度传感器实时紧密耦合感知-动作循环。我们定量评估了CPU和GPU版本规划器的计算时间和成功率差异,并在7自由度Franka Emika机器人上通过真实世界实验对我们的耦合框架进行了定性评估。实验结果表明,我们的基于GPU的框架相比CPU版本实现了高达5倍的加速,并在简单和具有挑战性的物理世界场景中成功避免了碰撞。

英文摘要

Reactive motion generation in unstructured environments remains an open challenge in robotics. Due to the computational complexity of collision-free motion generation, existing methods either generate global trajectories for static scenarios or employ models that make conservative assumptions about the environment. This paper identifies the primary bottlenecks as the runtime performance demand of planning on high-fidelity environments and the temporal integration between the perception and planning modules. Therefore, we propose a framework that does not compromise on runtime performance or world representations for perception and planning, by accelerating world modeling and vector-field-based planning using the GPU. This allows us to achieve faster parallel state exploration for quasi-global trajectory planning, and tighter coupling of the perception-action loop in real time for dynamic cluttered environments with off-the-shelf depth sensors. We quantitatively evaluate the differences in computation time and success rate between the CPU and GPU versions of our planner, and perform qualitative evaluations of our coupled framework using real-world experiments on a 7-DoF Franka Emika robot. Experimental results demonstrate that our GPU-based framework achieves up to a 5x speedup over the CPU version and successfully avoids collisions across both trivial and challenging physical world scenarios.

2606.12578 2026-06-12 cs.CL 新提交

MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

MARD: 镜像增强推理蒸馏用于机制级药物-药物相互作用预测

Mohammadreza Riyazat, Vian Lelo, Rameen Jafri, Yumna Khan, Abeer Badawi

发表机构 * University of Guelph(圭尔夫大学) York University(约克大学) Vector Institute(向量研究所)

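AI summary: Proposes the MARD-7B model, which uses mirror-augmented reasoning distillation, a single-token KL divergence, PRM-weighted DPO, and a mechanism-aware retrieval channel to exceed GPT-4o by 6.7 percentage points in accuracy on mechanism-level DDI prediction at about 1% of the cost.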

Comments
29 pages, 9 figures. Preprint
AI中文摘要

机制级药物-药物相互作用(DDI)预测需要识别涉及的酶或药效学轴、作用方向及证据,而不仅仅是判断两种药物是否相互作用。我们引入了一个可复现的机制级DDI标注与评估协议,包括结构化的7家族/147亚型分类法、无泄漏的冷切分协议以及可审计的推理指标,用于评估超越平面交互分类的药理学预测。我们提出一个流水线,生成了7B推理模型MARD(镜像增强推理蒸馏),结合了三种训练创新:方向标签上的单token KL散度,将模型的预测与方向标签绑定;基于PRM权重的DPO,使用程序化硬负样本;以及无泄漏的机制感知检索通道。过程奖励步骤标签可自动根据DrugBank结构化字段验证,无需人工或LLM评判。在2026年4月的DrugBank版本上,我们的MARD-7B是32个系统比较中唯一在药物对新颖性下准确率保持稳定的系统,以约1%的前沿API成本,比最佳基线高出13.9个百分点,比GPT-4o高出6.7个百分点。进一步分析揭示了反记忆特征,即在罕见药物上准确率提升,表明增益来自结构化药理学推理而非药物频率记忆。我们发布了语料库、DDI-PRM、检索索引和训练代码。

英文摘要

Mechanism-level drug-drug interaction (DDI) prediction requires identifying which enzyme or pharmacodynamic axis is implicated, in which direction, and with which evidence, not merely whether two drugs interact. We introduce a reproducible mechanism-level DDI labelling and evaluation protocol with a structured 7-family/147-subtype taxonomy, leakage-safe cold-split protocols, and auditable reasoning metrics for evaluating pharmacological prediction beyond flat interaction classification. We propose a pipeline that produces MARD (Mirror-Augmented Reasoning Distillation), a 7B reasoning model, combining three training innovations: a single-token KL divergence on the direction tag that ties the model's prediction, per-loss PRM-weighted DPO with programmatic hard negatives, and a leakage-safe mechanism-aware retrieval channel. Process-reward step labels are automatically verifiable against DrugBank-structured fields, requiring no human or LLM judges. On the April-2026 DrugBank release, our MARD-7B is the only system in a 32-system comparison whose accuracy survives drug-pair novelty, beating the best baseline by +13.9 pp and GPT-4o by +6.7 pp at ~1% of frontier API cost. Further analysis reveals an anti-memorisation signature in which accuracy improves on rarely seen drugs, suggesting that the gain comes from structured pharmacological reasoning rather than drug-frequency memorisation. We release the corpus, DDI-PRM, retrieval index, and training code.

2606.12575 2026-06-12 cs.CV 新提交

High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

高保真两步图像生成:通过教师对齐的端到端蒸馏

Dongyang Liu, Ruoyi Du, David Liu, Dengyang Jiang, Liangchen Li, Qilong Wu, Zhen Li, Steven C.H. Hoi, Hongsheng Li, Peng Gao

发表机构 * Z-Image Team, Alibaba Group(阿里巴巴集团Z-Image团队) The Chinese University of Hong Kong(香港中文大学)

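AI summary: Proposes Z-Image Turbo++, which distills an 8-step teacher into a 2-step generator through distribution-aligned adversarial learning, step-decoupled parameterization, and end-to-end training with iterative regularization, substantially narrowing the quality gap.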

AI中文摘要

少步扩散蒸馏在4-8步生成中已日趋成熟,但进一步推进到2步仍具挑战。本文介绍Z-Image Turbo++,一种从8步Z-Image Turbo教师模型蒸馏得到的高质量2步图像生成模型。我们的方法通过三个针对该场景简单而有效的设计选择,解决了2步生成中任务难度增加和模型容量有限的核心瓶颈。首先,我们提出分布对齐对抗学习,使用教师生成的图像而非外部真实图像作为GAN训练的真实样本,提供更易实现且信息量更大的对抗目标。其次,我们采用步解耦参数化,为两个去噪步骤分配独立的模型参数,以更好地匹配它们不同的容量需求。第三,我们执行带迭代正则化的端到端训练,使第一步能够接收来自最终图像质量的梯度,同时通过显式的步1损失保留有意义的中间生成。这些设计共同在定性和定量评估中显著缩小了2步与8步生成之间的质量差距,凸显了精心定制的蒸馏策略在改善少步生成中质量-效率权衡方面的潜力。

英文摘要

Few-step diffusion distillation has become increasingly mature for 4-8-step generation, yet pushing further to 2 steps remains challenging. In this work, we introduce Z-Image Turbo++, a high-quality 2-step image generation model distilled from the 8-step Z-Image Turbo teacher. Our method addresses the central bottlenecks of increased task difficulty and limited model capacity in 2-step generation through three simple but effective design choices tailored to this regime. First, we propose Distribution-Aligned Adversarial Learning, which uses teacher-generated images rather than external real images as real samples for GAN training, providing a more attainable and informative adversarial target. Second, we adopt Step-Decoupled Parameterization, assigning independent model parameters to the two denoising steps to better match their distinct capacity demands. Third, we perform End-to-End Training with Iterative Regularization, allowing the first step to receive gradients from final image quality while preserving a meaningful intermediate generation through an explicit step-1 loss. Together, these designs substantially narrow the quality gap between 2-step and 8-step generation in both qualitative and quantitative evaluations, highlighting the potential of carefully tailored distillation strategies for improving the quality-efficiency trade-off in few-step generation.

2606.12569 2026-06-12 cs.CL cs.AI 新提交

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

EDEN:意大利语临床笔记的大规模语料库

Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini

发表机构 * Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) Istituto di Ricerche Farmacologiche Mario Negri IRCCS(马里奥·内格里药理研究所IRCCS) University of Padua(帕多瓦大学)

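AI summary: Introduces EDEN, a large-scale corpus of Italian emergency-department clinical notes containing about 4 million anonymized notes and about 6,000 expert-annotated notes, intended to support the use of large language models in healthcare, and proposes CRF-filling as a new structured information extraction benchmark.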

AI中文摘要

我们提出了EDEN(急诊电子笔记),这是一个新颖且独特的大规模临床笔记语料库,这些笔记来自意大利医院的急诊科。当前版本的语料库由约400万份完全匿名的临床笔记组成,涵盖了患者在急诊科停留期间的不同护理阶段。此外,约六千份笔记的子集由临床专家通过结构化病例报告表(CRF)进行了手动标注,该CRF包含132个项目,涉及急诊科两种患者情况:呼吸困难和意识丧失。项目可能取数值(例如血氧饱和度)、分类(例如意识水平)、二元(例如是否存在创伤)和混合值类型。标注过程涉及多位临床医生,并经过迭代修订以解决项目表述中的歧义,从而形成了一个结构丰富(尽管高度不平衡)的资源。该数据集旨在填补能够支持大语言模型在具体医疗应用中开发和使用的重要数据缺口。我们描述了数据收集协议、现场匿名化流程、语料库统计数据和标注方案。最后,我们提出了CRF填充作为一项新的结构化信息提取基准,并提供了基于Gemma-27B和MedGemma-27B的零样本基线。据我们所知,EDEN数据集是意大利语现有最大的免费临床笔记语料库。

英文摘要

We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million fully anonymized clinical notes, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant to two patient situations in emergency departments: dyspnea and loss of consciousness. Items may take numerical values (e.g., for blood saturation), categorical values (e.g., for level of consciousness), binary values (e.g., for presence of traumas), or mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant gap in data able to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark, and provide zero-shot baselines obtained with Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes for the Italian language.

2606.12563 2026-06-12 cs.AI 新提交

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Arbor:作为自主智能体认知层的树搜索

Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao, Xi Zhao, Mou Li, Zhenyu Gu, Emad Barsoum

发表机构 * AMD

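AI summary: Proposes Arbor, a multi-agent framework that uses structured tree search as a cognition layer for autonomous optimization in large, stateful action spaces, achieving up to a 193% throughput-latency Pareto improvement in LLM inference optimization.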

AI中文摘要

Arbor是一个多智能体框架,引入了结构化树搜索作为自主智能体在大型有状态动作空间中运行的认知层。先前的自主优化系统在具有无状态评估的孤立目标上运行。相反,Arbor维护一个显式的得分假设搜索树,作为跨智能体的共享工作记忆,随着每次测量而演变,将失败视为诊断信号以重塑后续探索,并随着先前的成功转移瓶颈分布而扩展。我们在全栈LLM推理优化上验证了Arbor,这是一个历史上需要应用程序、框架、编译器、内核和硬件栈的工程团队协调努力才能达到峰值性能的领域。Arbor将Orchestrator智能体(通过将优化委托给推理栈中的领域专家来驱动优化)与Critic智能体(通过根本原因分析、内省和测量验证来维护稳定性)配对——这是一种制衡架构,其中没有一个智能体可以单方面驱动系统。智能体能力被分解为硬技能(领域专业知识)和软技能(决定贡献如何组合的协调协议),从而实现完全自主的多日活动。Arbor在供应商优化的基线上实现了高达193%的推理吞吐量-延迟帕累托改进,而没有该框架的单个智能体在吞吐量改进上达到+33%后几小时内就不可恢复地崩溃。Arbor可推广到多代硬件平台,运行间方差在2个百分点以内,表明该方法与硬件无关且可重复。

英文摘要

Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signals that reshape subsequent exploration, and expanding as prior successes shift the bottleneck distribution. We validate Arbor on full-stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root-cause analysis, introspection, and measurement validation, forming a checks-and-balances architecture in which neither agent can unilaterally drive the system. Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi-day campaigns. Arbor achieves up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines, while a single agent without the harness plateaus at +33% throughput improvement and crashes irrecoverably within hours. Arbor generalizes to multiple generations of hardware platforms, and run-to-run variance is within 2 percentage points, demonstrating that the method is hardware-agnostic and reproducible.

2606.12562 2026-06-12 cs.CV cs.GR 新提交

HairPort: In-context 3D-aware Hair Import and Transfer for Images

HairPort: 上下文感知的3D发型导入与迁移

Alireza Heidari, Amirhossein Alimohammadi, Wallace Michel Pinto Lira, Adi Bar-Lev, Ali Mahdavi-Amiri

发表机构 * Simon Fraser University(西蒙菲莎大学) Huawei Canada(华为加拿大)

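AI summary: Proposes HairPort, a framework that explicitly separates hair removal from transfer and uses a 3D-aware pipeline to handle large pose differences. Combined with a LoRA-adapted Bald Converter and a conditional flow-matching generator, it achieves high-quality, identity-preserving hairstyle transfer.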

Comments
Accepted to SIGGRAPH 2026 (Conference Papers Track). 23 pages, 15 figures, 10 tables, including supplementary material as appendices. Project page: this https URL
AI中文摘要

在图像之间迁移发型是计算机图形学、计算机视觉和视觉效果中一个重要但具有挑战性的任务。它使用户能够在无需实际改变发型的情况下探索新造型,应用于虚拟试穿系统、增强现实和娱乐等领域。大多数先前的方法在姿态差异较小时表现最佳,但在视角和尺度差异较大时效果不佳,此时缺失的发型内容必须合成而非迁移。我们提出HairPort,一个3D感知的发型迁移框架,通过显式分离发型移除与迁移,并在合成前强制几何一致性来解决这些问题。我们引入了一个秃头转换器,通过基于LoRA的上下文适配FLUX.1 Kontext生成逼真的秃头人脸版本。为了训练我们的秃头转换器,我们引入了一个新数据集Baldy,包含6000对在不同身份和条件下的秃头和原始图像。我们还使用了一个3D感知迁移管道,在将参考发型合成到源图像之前,从目标视角重建并重新渲染该发型。由于具有3D感知能力,我们的方法支持源和目标之间的大姿态和尺度差异。最后,一个条件流匹配生成器从秃头源和几何对齐的参考引导中合成迁移结果。综合来看,我们的方法实现了准确、姿态一致且身份保持的发型迁移,在定性和定量上均优于现有方法。

英文摘要

Transferring hairstyles between images is an important but challenging task in computer graphics, computer vision, and visual effects. It enables users to explore new looks without physically altering their hair, with applications in virtual try-on systems, augmented reality, and entertainment. Most prior works operate best under small pose gaps but fall short under large viewpoint and scale differences, where missing hair content must be synthesized rather than transferred. We propose HairPort, a 3D-aware hairstyle transfer framework that attempts to solve these issues by explicitly separating hair removal from transfer and enforcing geometric consistency before synthesis. We introduce a Bald Converter, which produces realistic bald versions of faces through LoRA-based in-context adaptation of FLUX.1 Kontext. To train our Bald Converter, we introduce a new dataset, Baldy, containing 6,000 paired bald and original images across diverse identities and conditions. We also use a 3D-Aware Transfer Pipeline that reconstructs and re-renders the reference hairstyle from the target viewpoint before compositing it onto the source image. Because it is 3D-aware, our method supports large pose and scale discrepancies between the source and target. Finally, a conditional flow-matching generator synthesizes the transferred result from the bald source and geometry-aligned reference guidance. Together, our method enables accurate, pose-consistent, and identity-preserving hairstyle transfer, outperforming existing methods both qualitatively and quantitatively.

2606.12555 2026-06-12 cs.SD cs.CV cs.MM 新提交

AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

AudioX-Turbo:高效任意到音频生成的统一框架

Zeyue Tian, Lei Ke, Zhaoyang Liu, Ruibin Yuan, Liumeng Xue, Yujiu Yang, Weijia Chen, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tsinghua University(清华大学) Noiz AI Independent Researcher(独立研究员)

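AI summary: Proposes AudioX-Turbo, a unified and efficient teacher-student framework that uses a multimodal diffusion transformer and distribution matching distillation to generate audio from text, video, and audio, needing only 4 sampling steps and about 25x fewer function evaluations (NFE).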

AI中文摘要

基于灵活的多模态控制信号生成音频和音乐是一个广泛适用的课题,面临以下关键挑战:1) 统一的多模态建模框架,2) 大规模、高质量的训练数据,3) 多步扩散采样的高昂推理成本。为此,我们提出AudioX-Turbo,一个统一且高效的任意到音频生成框架,集成了多种多模态条件(即文本、视频和音频信号)。AudioX-Turbo遵循教师-学生范式。教师模型AudioX-Base基于多模态扩散Transformer,并带有模态自适应融合模块,用于对齐多样化的多模态输入以实现高保真合成,然后通过适用于流匹配的分布匹配蒸馏将其蒸馏为少步学生模型AudioX-Turbo,并辅以基于扩散的判别器以实现高质量的少步生成。为支持AudioX-Turbo的训练,我们构建了一个大规模、高质量的数据集IF-caps-Pro,包含约920万个样本,通过两阶段数据收集和标注流程整理而成。我们在广泛的任务上对AudioX-Turbo进行基准测试,发现我们的模型实现了优越的性能,尤其是在文本到音频和文本到音乐生成方面,同时仅需4个采样步骤,所需的函数评估次数(NFE)比多步基线减少约25倍。这些结果表明,我们的方法能够在灵活的多模态控制下进行音频生成,展现出高效且强大的指令跟随能力。代码和数据集将在https://this URL上提供。

英文摘要

Audio and music generation based on flexible multimodal control signals is a widely applicable topic with three key challenges: 1) the need for a unified multimodal modeling framework, 2) the need for large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. To address these challenges, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher, AudioX-Base, is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis. It is then distilled into the few-step student, AudioX-Turbo, via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at this https URL.

2606.12552 2026-06-12 cs.LG 新提交

Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well

跨越验证危机:交叉验证出人意料地有效降低基准测试方差

Célestin Eve, Gaël Varoquaux, Thomas Moreau

发表机构 * MIND Team, Université Paris-Saclay, Inria, CEA, Palaiseau, France(MIND团队,巴黎-萨克雷大学,法国国家信息与自动化研究所,法国原子能委员会,帕莱索,法国) SODA Team, Inria, Palaiseau, France(SODA团队,法国国家信息与自动化研究所,帕莱索,法国) Probabl

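AI summary: Shows that cross-validation, whose virtual data augmentation is quantified by the notion of sample gain, markedly improves the confidence and stability of algorithm performance evaluation, and introduces a dynamic early-stopping mechanism to reduce computational overhead.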

Comments
34 pages, 11 figures
AI中文摘要

现代机器学习通过实证工作推进,对新方法进行基准测试以评估相对性能。然而,评估固有的统计变异性——由于许多算法的随机性而加剧——常常因有限的测试样本而使性能估计不可靠,导致验证危机,其中真正的进步难以辨别。在这项工作中,我们展示了交叉验证在评估和比较学习算法性能时显著提高了置信度。我们引入了样本增益的概念,它量化了通过使用多个交叉验证分割来减少基准测试方差所实现的虚拟数据增强。在合成和真实世界数据集(组织病理学扫描和NLP微调)上的实验表明,多个分割可以显著提高性能估计的可靠性和稳定性,且收益递减往往比预期来得更晚。我们还引入了一种动态早停交叉验证的程序,通过从最初几个折叠估计后续折叠是否会带来大的样本增益。我们的发现强调了在可用样本上推行交叉验证以实现稳健可靠基准测试的价值。

英文摘要

Modern machine learning progresses through empirical work, benchmarking new methods to evaluate relative performance. However, the statistical variability inherent to evaluation, exacerbated by the stochastic nature of many algorithms, often makes performance estimation unreliable due to the limited test samples available, leading to a validation crisis in which genuine advances are difficult to discern. In this work, we show that cross-validation markedly improves confidence when evaluating and comparing the performance of learning algorithms. We introduce the concept of sample gain, which quantifies the virtual data augmentation achieved by using multiple cross-validation splits to reduce benchmarking variance. Experiments on both synthetic and real-world datasets (histopathologic scans and NLP fine-tuning) demonstrate that multiple splits can substantially improve the reliability and stability of performance estimates, with diminishing returns often setting in later than expected. We also introduce a procedure to dynamically early-stop cross-validation by estimating from the first few folds whether subsequent folds will bring large sample gains. Our findings highlight the value of pushing cross-validation on available samples to achieve robust and reliable benchmarking.
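As a reminder of the basic mechanism being studied, here is a minimal sketch of k-fold cross-validation that returns one score per split and summarizes them. It does not implement the paper's sample-gain measure or its early-stopping procedure, neither of which is defined in the abstract. The names `kfold_indices`, `cross_val_scores`, and `summarize`, as well as the `score_fn(train, test)` callback, are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev
from typing import Callable


def kfold_indices(n: int, k: int, seed: int = 0) -> list[list[int]]:
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    if not 2 <= k <= n:
        raise ValueError("require 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_scores(
    n: int, k: int, score_fn: Callable[[list[int], list[int]], float], seed: int = 0
) -> list[float]:
    """Call score_fn(train_indices, test_indices) once per fold and collect the scores."""
    folds = kfold_indices(n, k, seed)
    scores = []
    for i, test in enumerate(folds):
        # Training set: every index that is not in the held-out fold.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(score_fn(train, test))
    return scores


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, naive standard error) of the per-split scores.

    The naive standard error treats splits as independent. Splits share
    training data, so it tends to be optimistic.
    """
    return mean(scores), stdev(scores) / sqrt(len(scores))
```

Averaging over folds is what reduces the variance of the estimate. How much it helps, and when further splits stop paying off, is the question the paper addresses.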

2606.12550 2026-06-12 cs.RO cs.AI 新提交

Foresight: Iterative Reasoning About Clues that Matter for Navigation

Foresight: 关于导航关键线索的迭代推理

Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas

发表机构 * UT Austin(德克萨斯大学奥斯汀分校) FieldAI

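AI summary: Proposes Foresight, a framework in which a finetuned VLM alternates between proposing and critiquing image-space motion plans and is post-trained with reinforcement learning using a reward model learned from human feedback, enabling iterative motion refinement from sparse language instructions in mapless navigation and improving task success by 37%.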

Comments
22 pages, 10 figures, 3 tables
AI中文摘要

从稀疏语言指令进行开放世界无地图导航需要解决未明确指定的目标,并推断哪些环境线索与到达目标相关。例如,到达一个视野外的目的地可能需要解释坡道、标志或绕行路线,这些揭示了去哪里或走哪条路线。先前的工作受限于对已知导航因素和封闭集因素类别的依赖,或者在运动规划之前识别线索而遗漏了依赖于计划的线索。我们认为预训练的视觉语言模型(VLM)可以发现新的指令相关线索,但需要适应以关注哪些线索重要以及它们应如何影响运动规划。我们在Foresight中实现了这些想法,这是一个测试时框架,其中微调的VLM交替提出图像空间运动计划并使用语言目标和视觉上下文对其进行批评。后续计划基于先前的批评,使得在执行前能够进行迭代运动优化。为了将计划批评和优化与开放集行为偏好对齐,我们从人类反馈中学习一个奖励模型,并使用它在计划-批评循环中通过强化学习对VLM进行后训练。在离线评估和6个真实世界环境中,相对于最先进的测试时推理和基础模型基线,Foresight将平均任务成功率提高了37%,并将每次任务的干预次数减少了52%,同时在Jetson AGX Orin上实时运行。我们将发布代码、数据和训练细节,以支持未来关于机器人运动优化的测试时推理工作。更多视频请见:this https URL

英文摘要

Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works either rely on known navigation factors and closed-set factor categories, or identify cues before motion planning and thus miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: this https URL

2606.12507 2026-06-12 cs.LG 新提交

Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

基于评分标准的自蒸馏:无需评分标准验证器的后训练

MohammadHossein Rezaei, Anas Mahmoud, Zihao Wang, Utkarsh Tyagi, Advait Gosai, Razvan-Gabriel Dumitru, Aakash Sabharwal, Bing Liu, Yunzhong He

Affiliations * Scale AI

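AI summary Proposes RGSD, which distills a rubric-conditioned teacher into the student model, enabling dense per-token learning without a verifier and reaching rubric satisfaction comparable to judge-based GRPO on medical and science domains.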

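Details
AI abstract (translated from Chinese)

Rubrics have become an alternative to RLVR in open-ended domains where no single ground-truth answer is available. Existing rubric-based training methods rely on an LLM verifier to score each rollout against the rubric. This adds substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, acts as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token by token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. On Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models in medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using only one on-policy rollout per prompt and no training-time verifier calls. Ablations show that raw rubrics provide a stronger teacher enrichment signal than self-generated reference responses, while a stronger GRPO judge can outperform RGSD in some settings, making RGSD a complementary verifier-free alternative when verifier cost or reliability is the bottleneck.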

English abstract

Rubrics have emerged as an alternative to RLVR in open-ended domains where a single ground-truth final answer is not available. Existing rubric-based training methods rely on an LLM verifier that scores each rollout against rubrics. This introduces substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, serves as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token-by-token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. Across Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models on medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using one on-policy rollout per prompt and no training-time verifier calls. Ablations show that raw rubrics provide a stronger teacher enrichment signal than self-generated reference responses, while a stronger GRPO judge can outperform RGSD in some settings, positioning RGSD as a complementary verifier-free alternative when verifier cost or reliability is the bottleneck.
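
One way to picture the dense per-token signal described above is a per-position divergence between the rubric-conditioned teacher distribution and the unconditioned student distribution. The sketch below is only an illustration: the abstract does not say which divergence RGSD uses, so the forward KL, the function names `token_kl` and `distillation_loss`, and the toy logits are all assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits, student_logits):
    """Forward KL(teacher || student) at a single token position (assumed loss)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over one rollout, plus the individual per-token terms.

    teacher_seq: logits from the policy when the rubric is in its prompt.
    student_seq: logits from the same policy without the rubric.
    """
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token


# Toy rollout with two token positions and a three-word vocabulary.
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
student = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
loss, per_token = distillation_loss(teacher, student)
print(loss, per_token)
```

Every position contributes its own term, which is what distinguishes this kind of signal from a single end-of-trajectory reward; the second position contributes nothing here because teacher and student already agree.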

2606.12503 2026-06-12 cs.LG cs.SD New submission

Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

Dolph2Vec: 海豚发声的自监督表示

Chiara Semenzin, Faadil Mustun, Roberto Dessi, Pierre Orhan, Alexis Emanuelli, Yair Lakretz, Gonzalo de Polavieja, German Sumbre

Affiliations * École Normale Supérieure, Paris, France; Not Diamond, San Francisco, USA; Institut du Cerveau, Paris, France; Champalimaud Foundation, Lisbon, Portugal

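AI summary Proposes Dolph2Vec, the first self-supervised model trained on five years of longitudinal dolphin recordings; it significantly outperforms general-purpose baselines on signature whistle classification and whistle detection and uncovers interpretable acoustic units.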

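Details
AI abstract (translated from Chinese)

Self-supervised learning (SSL) has opened new opportunities in bioacoustics by enabling scalable modeling of animal vocalizations without expensive manual annotation. However, current SSL models in this field prioritize broad generalization across species and are not optimized for revealing the fine-grained structure of individual communication systems. In this work, we collect and release a novel dataset of more than five years of longitudinal recordings from five known dolphins in a semi-naturalistic marine environment, an unprecedented resource for studying dolphin communication. We adapt the Wav2Vec2.0 architecture (Baevski et al., 2020) to this domain and introduce Dolph2Vec, the first large-scale, species-specific SSL model trained exclusively on this data. We benchmark the model on two biologically relevant tasks: signature whistle classification and whistle detection. Dolph2Vec significantly outperforms general-purpose baselines on both tasks. Beyond performance, we show that the learned embeddings and codebook structure capture interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure, enabling fine-grained analysis of communication patterns. Our findings demonstrate how SSL can serve as both a model and a scientific tool for exploring hypotheses in animal communication research.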

English abstract

Self-supervised learning (SSL) has opened new opportunities in bioacoustics by enabling scalable modeling of animal vocalizations without the need for expensive manual annotation. However, current SSL models in this domain prioritize broad generalization across species and are not optimized for uncovering the fine-grained structure of individual communication systems. In this work, we collect and release a novel dataset of over five years of longitudinal recordings, from five known dolphins in a semi-naturalistic marine environment, an unprecedented resource for studying dolphin communication. We adapt the Wav2Vec2.0 (Baevski et al., 2020) architecture to this domain and introduce Dolph2Vec, the first large-scale, species-specific SSL model trained exclusively on this data. We benchmark our model on two biologically relevant tasks: signature whistle classification and whistle detection. Dolph2Vec significantly outperforms general-purpose baselines in both tasks. Beyond performance, we show that learned embeddings and codebook structure capture interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure, enabling fine-grained analysis of communication patterns. Our findings demonstrate how SSL can serve as both a model and a scientific tool to explore hypotheses in animal communication research.

2606.12501 2026-06-12 cs.LG New submission

Policy-driven Conformal Prediction for Trustworthy QoT Estimation

策略驱动的可信QoT估计的保形预测

Kiarash Rezaei, Omran Ayoub, Paolo Monti, Carlos Natalino

Affiliations * Chalmers University of Technology; University of Applied Sciences and Arts of Southern Switzerland

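AI summary Proposes the Conformal QoT framework, which combines statistically guaranteed QoT estimation with operational decision policies for reliable lightpath-feasibility prediction under domain shift, raising accuracy from 92% to 99.6% on open datasets.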

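Details
AI abstract (translated from Chinese)

We propose Conformal QoT, a policy-driven framework that combines statistically guaranteed QoT estimation with operational decision policies, enabling reliable lightpath-feasibility prediction under domain shift and raising accuracy from 92% to 99.6% on open datasets.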

English abstract

We propose Conformal QoT, a policy-driven framework that combines statistically guaranteed QoT estimation with operational decision policies, enabling reliable lightpath-feasibility predictions under domain shift and improving accuracy from 92% to 99.6% on open datasets.
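
As a rough illustration of how a conformal guarantee might feed a decision policy, the sketch below uses split conformal prediction on absolute calibration residuals and a conservative accept rule. The abstract does not describe the actual procedure, so the residual score, the function names `conformal_quantile` and `is_feasible`, the dB units, and the threshold policy are all assumptions.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal half-width for miscoverage level alpha.

    residuals: absolute errors |y - y_hat| on a held-out calibration set.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest residual, or infinity
    when the calibration set is too small to support that coverage level.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


def is_feasible(predicted_db, half_width_db, threshold_db):
    """Hypothetical conservative policy: accept a lightpath only if the lower
    end of the prediction interval still clears the QoT threshold."""
    return predicted_db - half_width_db >= threshold_db


# Toy calibration residuals in dB (made-up numbers, not from the paper).
calibration = [0.3, 0.1, 0.9, 0.5, 0.2, 0.7, 1.0, 0.4, 0.8, 0.6]
q = conformal_quantile(calibration, alpha=0.2)
print(q)
print(is_feasible(15.0, q, threshold_db=14.0))
print(is_feasible(14.5, q, threshold_db=14.0))
```

A more permissive policy could compare the point prediction or the upper bound against the threshold instead; the choice trades rejected-but-feasible lightpaths against accepted-but-infeasible ones.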

2606.12499 2026-06-12 cs.RO New submission

Action-Effect Memory Pretraining for Robot Manipulation

动作-效应记忆预训练用于机器人操作

Yijing Zhou, Qiwei Liang, Sitong Zhuang, Jiaxi Li, Xianpeng Wang, Boyang Cai, Yunyang Mo, Renjing Xu

Affiliations * Hong Kong University of Science and Technology (Guangzhou); Shenzhen University

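AI summary Proposes the AEM framework, which learns compact temporal representations through masked modeling of vision-action history, improving robot manipulation under partial observability and outperforming single-frame pretraining and frame stacking.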

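Details
AI abstract (translated from Chinese)

We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods, which mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features, and applies masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation that serves as global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking while reducing inference latency and computational cost.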

English abstract

We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.
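
The interleaving and masking step described above can be sketched with plain lists. This is a toy illustration, not the authors' pipeline: it assumes one more vision token than action tokens (so the sequence ends on a vision token, matching the use of the final vision token as the history summary), uniform random masking, and a placeholder `MASK` string; the names `interleave` and `mask_sequence` are hypothetical, and the Mamba encoder and reconstruction loss are omitted.

```python
import random

MASK = "<mask>"  # placeholder mask token (assumption)


def interleave(vision_tokens, action_tokens):
    """Build v_0, a_0, v_1, a_1, ..., v_T from per-step vision and action tokens."""
    if len(vision_tokens) != len(action_tokens) + 1:
        raise ValueError("expected exactly one more vision token than action tokens")
    sequence = []
    for vision, action in zip(vision_tokens, action_tokens):
        sequence.extend([vision, action])
    sequence.append(vision_tokens[-1])
    return sequence


def mask_sequence(sequence, ratio, rng):
    """Replace a random subset of positions with MASK.

    Returns the masked sequence and a dict mapping each masked position to its
    original token, i.e. the reconstruction targets.
    """
    count = round(ratio * len(sequence))
    positions = rng.sample(range(len(sequence)), count)
    targets = {i: sequence[i] for i in positions}
    masked = [MASK if i in targets else token for i, token in enumerate(sequence)]
    return masked, targets


history = interleave(["v0", "v1", "v2"], ["a0", "a1"])
masked, targets = mask_sequence(history, ratio=0.4, rng=random.Random(0))
print(history)
print(masked, targets)
```

Because vision and action tokens share one sequence, a masked vision token has to be inferred from neighbouring actions and vice versa, which is the intuition behind learning action-conditioned state evolution.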