arXivDaily: arXiv daily academic digest, updated Monday to Friday
2602.24238 2026-05-19 cs.LG

Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis

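Javier Yanes-Pulido, Filipe Rodrigues

Affiliations: Technical University of Denmark

AI summary: This paper evaluates the zero-shot performance of Chronos-2, a recent time-series foundation model, on ten real-world datasets and shows that general-purpose time-series foundation models are effective for transportation forecasting. Chronos-2 matches or exceeds classical statistical baselines and specialized deep learning architectures on most datasets, particularly at longer horizons.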

Comments 6 pages

English abstract

Accurate forecasting of transportation dynamics is essential for urban mobility and infrastructure planning. Although recent work has achieved strong performance with deep learning models, these methods typically require dataset-specific training, architecture design and hyper-parameter tuning. This paper evaluates whether general-purpose time-series foundation models can serve as forecasters for transportation tasks by benchmarking the zero-shot performance of the state-of-the-art model, Chronos-2, across ten real-world datasets covering highway traffic volume and flow, urban traffic speed, bike-sharing demand, and electric vehicle charging station data. Under a consistent evaluation protocol, we find that, even without any task-specific fine-tuning, Chronos-2 delivers state-of-the-art or competitive accuracy across most datasets, frequently outperforming classical statistical baselines and specialized deep learning architectures, particularly at longer horizons. Beyond point forecasting, we evaluate its native probabilistic outputs using prediction-interval coverage and sharpness, demonstrating that Chronos-2 also provides useful uncertainty quantification without dataset-specific training. In general, this study supports the adoption of time-series foundation models as a key baseline for transportation forecasting research.
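
As an illustration of the two interval metrics named in the abstract, the following minimal Python sketch computes empirical prediction-interval coverage and sharpness (taken here as mean interval width). The function names and toy numbers are hypothetical and are not from the paper, whose exact metric definitions may differ.

```python
def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)


def interval_sharpness(lower, upper):
    """Mean interval width; a smaller value means sharper intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Hypothetical toy values for illustration only.
y_true = [10.0, 12.0, 9.0, 15.0]
lower = [8.0, 11.0, 9.5, 13.0]
upper = [11.0, 14.0, 12.0, 16.0]

coverage = interval_coverage(y_true, lower, upper)
sharpness = interval_sharpness(lower, upper)
print(coverage, sharpness)
```

Coverage is usually read against the nominal level of the interval (for example 0.8 for an 80% interval), and sharpness is only meaningful when compared at similar coverage.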

2602.23566 2026-05-19 cs.LG cs.AI

Flowette: Flow Matching with Graphette Priors for Graph Generation

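Asiri Wijesinghe, Sevvandi Kandanaarachchi, Daniel M. Steinberg, Cheng Soon Ong

Affiliations: CSIRO’s Data61; Australian National University

AI summary: This paper proposes Flowette, a framework in which a graph neural network-based transformer learns a velocity field over graph representations. It combines optimal-transport coupling with regularisation and uses graphette structural priors to improve graph generation; the experiments support combining structural priors with flow-based training.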

Comments 48 Pages

English abstract

We study generative modeling of graphs with recurring subgraph motifs. We propose Flowette, a continuous flow matching framework that employs a graph neural network-based transformer to learn a velocity field over graph representations with node and edge attributes. Our model promotes topology-aware alignment through optimal transport-based coupling and encourages global structural coherence through regularisation. To incorporate domain-driven structural priors, we introduce graphettes, a new probabilistic family of graph structure models that generalize graphons via controlled structural edits for motifs such as rings, stars, and trees. We theoretically analyze the coupling, invariance, and structural properties of the framework, evaluate it on synthetic and molecular benchmarks, and isolate the contributions of the structural prior, the optimal-transport coupling, and the regularisation terms through controlled ablations. Flowette achieves competitive performance overall, attaining state-of-the-art results on several metrics across multiple benchmarks, highlighting the effectiveness of combining structural priors with flow-based training for modeling complex graph distributions.
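
For readers unfamiliar with flow matching, here is a minimal sketch of the generic regression target on plain vectors, assuming the common straight-line path between a noise sample and a data sample. It is not Flowette's graph-specific formulation, and all names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target)) / len(target)


x0 = [0.0, 0.0]   # noise sample (toy values)
x1 = [2.0, -4.0]  # data sample (toy values)
xt = interpolate(x0, x1, 0.25)
perfect = flow_matching_loss(target_velocity(x0, x1), x0, x1)
print(xt, perfect)
```

In training, the predicted velocity would come from a network evaluated at (x_t, t); the abstract's optimal-transport coupling concerns how pairs (x0, x1) are matched, which this sketch leaves out.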

2602.22667 2026-05-19 cs.CV

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

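Changqing Zhou, Yueru Luo, Han Zhang, Zeyu Jiang, Changhao Chen

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Chinese University of Hong Kong, Shenzhen

AI summary: This work proposes a geometry-only supervision approach for monocular open-vocabulary occupancy prediction in indoor scenes. An opacity-aware, Poisson-based approach and a progressive temperature decay schedule improve the stability and precision of geometric and semantic alignment, and the method reaches 59.50 IoU and 21.05 mIoU on Occ-ScanNet.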

Comments Accepted at CVPR2026 Oral

English abstract

Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
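
The two reported metrics can be sketched as follows, assuming voxel labels flattened into a list with 0 meaning free space. This is a generic illustration with hypothetical names and toy labels; the benchmark's exact evaluation protocol (for example, which voxels are masked out) may differ.

```python
def occupancy_iou(pred, target):
    """Binary IoU: any nonzero label counts as occupied."""
    inter = sum(1 for p, t in zip(pred, target) if p != 0 and t != 0)
    union = sum(1 for p, t in zip(pred, target) if p != 0 or t != 0)
    return inter / union if union else 1.0


def mean_iou(pred, target, classes):
    """Average of per-class IoU over classes present in pred or target."""
    ious = []
    for c in classes:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy labels: 0 = free, 1 and 2 = two semantic classes.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 1]

iou = occupancy_iou(pred, target)
miou = mean_iou(pred, target, classes=[1, 2])
print(iou, miou)
```

IoU here measures geometry only, while mIoU also penalizes assigning the wrong semantic class to an occupied voxel, which is why the two numbers in the abstract differ so much.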

2602.21426 2026-05-19 cs.LG stat.CO

Proximal-IMH: Proximal Posterior Proposals for Independent Metropolis-Hastings with Approximate Operators

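Youguang Chen, George Biros

Affiliations: Oden Institute for Computational Engineering and Sciences

AI summary: This paper proposes an improved independence Metropolis-Hastings algorithm that uses an auxiliary optimization problem to remove the bias of an approximate posterior, improving stability and sampling efficiency while staying faithful to the exact model.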

English abstract

We consider the problem of sampling from a posterior distribution arising in Bayesian inverse problems in science, engineering, and imaging. Our method belongs to the family of independence Metropolis-Hastings (IMH) sampling algorithms, which are common in Bayesian inference. Relying on the existence of an approximate posterior distribution that is cheaper to sample from but may have significant bias, we introduce Proximal-IMH, a scheme that removes this bias by correcting samples from the approximate posterior through an auxiliary optimization problem. This yields a local adjustment that trades off adherence to the exact model against stability around the approximate reference point. For idealized settings, we prove that the proximal correction tightens the match between approximate and exact posteriors, thereby improving acceptance rates and mixing. The method applies to both linear and nonlinear input-output operators and is particularly suitable for inverse problems where exact posterior sampling is too expensive. We present numerical experiments including multimodal and data-driven priors with nonlinear input-output operators. The results show that Proximal-IMH reliably outperforms existing IMH variants.
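
The sketch below shows a plain independence Metropolis-Hastings sampler in one dimension, to make the role of the approximate posterior concrete. It does not include the proximal correction proposed in the paper; the Gaussian target and proposal are hypothetical stand-ins for the exact and approximate posteriors.

```python
import math
import random


def log_target(x):
    # Unnormalized log density of N(1, 1); a stand-in for the exact posterior.
    return -0.5 * (x - 1.0) ** 2


def log_proposal(x):
    # Unnormalized log density of N(0, 2^2); a stand-in for the cheap approximation.
    return -0.5 * (x / 2.0) ** 2


def imh(n_steps, seed=0):
    rng = random.Random(seed)
    x = rng.gauss(0.0, 2.0)
    samples, accepted = [], 0
    for _ in range(n_steps):
        y = rng.gauss(0.0, 2.0)  # the proposal does not depend on the current state x
        # log of [pi(y) q(x)] / [pi(x) q(y)]; normalizing constants cancel
        log_ratio = (log_target(y) - log_proposal(y)) - (log_target(x) - log_proposal(x))
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x, accepted = y, accepted + 1
        samples.append(x)
    return samples, accepted / n_steps


samples, acceptance_rate = imh(5000)
sample_mean = sum(samples) / len(samples)
print(round(sample_mean, 2), round(acceptance_rate, 2))
```

The closer the proposal is to the target, the closer the acceptance ratio stays to 1, which is the quantity the abstract says the proximal correction improves.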

2602.21265 2026-05-19 cs.CL cs.LG cs.SE

ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

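Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee

Affiliations: Seoul National University

AI summary: This paper proposes ToolMATH, a math-grounded diagnostic benchmark for evaluating long-horizon tool use under controllable tool-catalog conditions. It converts stepwise MATH solutions into reusable Python tools and pairs them with problems that require sequential tool use, intermediate-output reuse, and logically connected tool-call chains, so that adaptability, robustness, and tool connectivity can be assessed under different tool-catalog conditions.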

Comments Submitted to NeurIPS Evaluation & Dataset Track

English abstract

We introduce ToolMATH, a math-grounded diagnostic benchmark for evaluating long-horizon tool use under controllable tool-catalog conditions. ToolMATH converts stepwise MATH solutions into reusable Python tools with natural-language descriptions and typed schemas, and pairs each problem with a tool environment requiring sequential tool use, intermediate-output reuse, and logically connected tool-call chains. ToolMATH controls tool availability and catalog difficulty by constructing gold tools and graded distractors with varying similarity to gold tools. ToolMATH also incorporates behavior-conditioned metrics, enabling diagnostic evaluation beyond final accuracy. Building on these measurements, ToolMATH emphasizes three evaluation axes: (1) Adaptability measures how much Gold-only success is retained when gold tools are replaced entirely by distractors; (2) Robustness measures stability when distractors are added as noise; and (3) Tool Connectivity measures whether models preserve accuracy over long executed tool-call chains. Furthermore, trace-level failure analyses characterize how models fail under each tool-catalog condition. Together, these diagnostics reveal distinct model profiles: reliable tool use, tool avoidance, adaptive substitution, and impacts of unreliable tool catalogs. Overall, ToolMATH provides a controlled testbed for evaluating how language models adapt to changing tool availability, remain robust to distractors, and maintain correctness across long-horizon tool-use trajectories.

2602.20200 2026-05-19 cs.RO cs.AI cs.CV

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie

Affiliations: Harbin Institute of Technology, Shenzhen; PengCheng Laboratory; Shenzhen Loop Area Institute; Huawei Noah’s Ark Lab

AI summary: This paper proposes OptimusVLA, which adds a Global Prior Memory and a Local Consistency Memory to address the low efficiency and poor robustness of action generation in robotic manipulation, achieving higher success rates and faster inference on several benchmarks.

Comments Accepted by CVPR 2026

English abstract

Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In real-world evaluation, OptimusVLA ranks best on the Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering a 2.9x inference speedup.

2602.20042 2026-05-19 cs.CL

AI Alignment Breaks at the Edge

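Han Bao, Yue Huang, Xiaoda Wang, Zheyuan Zhang, Yujun Zhou, Carl Yang, Xiangliang Zhang, Yanfang Ye

Affiliations: University of Notre Dame; Emory University

AI summary: This paper examines how AI alignment fails on edge cases and proposes Edge alignment, an agenda for detecting, evaluating, and governing failures under value conflict, plural stakeholder disagreement, and epistemic ambiguity, with the aim of improving the safety and helpfulness of AI systems.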

Comments 38 pages, 6 figures

English abstract

General Alignment has improved average-case helpfulness and safety, but current alignment practice still rewards confident, single-turn responses. The problem is not only that models fail on edge cases; it is that current evaluation makes many of these failures hard to see. We take the position that alignment must move beyond average-case evaluation by making failures under value conflict, plural stakeholder disagreement, and epistemic ambiguity visible and actionable. Scalar rewards compress diverse values into a single number; data and evaluation regimes collapse, filter, or fail to elicit the cases where alignment is hardest; and governance often lacks mechanisms for adjudicating contested cases. These blind spots produce value flattening, representation loss, and uncertainty blindness. We use Edge alignment to name a detection, evaluation, and governance agenda for surfacing these failures and connecting them to appropriate interventions. Rather than a single training objective, Edge alignment defines the conditions under which standard alignment should yield to mechanisms that preserve multidimensional value structure, represent plural perspectives, and support uncertainty-aware interaction. A pilot diagnostic set of 91 edge cases and four contemporary models illustrates that ordinary helpfulness and safety readings can miss process failures that edge-aware evaluation exposes. We outline operational edge signals, process-aware evaluation criteria, and a three-phase process stack that reframes alignment as a lifecycle problem of dynamic normative governance.

2602.18227 2026-05-19 cs.LG

Parameter-Efficient Domain Adaptation of Physics-Informed Self-Attention based GNNs for AC Power Flow Prediction

Redwanul Karim, Changhun Kim, Timon Conrad, Nora Gourmelon, Julian Oelhaf, David Riebesel, Tomás Arias-Vergara, Andreas Maier, Johann Jäger, Siming Bayer

Affiliations: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Institute of Electrical Energy Systems, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

AI summary: This paper studies parameter-efficient domain adaptation for physics-informed self-attention-based GNNs. A physics-based loss encourages Kirchhoff-consistent behavior and adaptation is restricted to low-rank updates, which yields a controllable efficiency-accuracy trade-off under voltage-regime shift.

English abstract

Accurate AC power flow (AC-PF) prediction under domain shift is critical when models trained on medium-voltage (MV) grids are deployed on high-voltage (HV) networks. Existing physics-informed graph neural network (GNN) solvers typically rely on full fine-tuning for cross-regime transfer, incurring high retraining cost and offering limited control over the stability-plasticity trade-off between target-domain adaptation and source-domain retention. We study parameter-efficient domain adaptation for physics-informed self-attention-based GNNs, encouraging Kirchhoff-consistent behavior via a physics-based loss while restricting adaptation to low-rank updates. Specifically, we apply low-rank adaptation (LoRA) to attention projections with selective unfreezing of the prediction head to regulate adaptation capacity. This design yields a controllable efficiency-accuracy trade-off for physics-constrained inverse estimation under voltage-regime shift. Across multiple grid topologies, the proposed LoRA+PHead adaptation recovers near-full fine-tuning accuracy with a target-domain RMSE gap of $2.6 \times 10^{-4}$ while reducing the number of trainable parameters by $85.46\%$. The physics-based residual remains comparable to full fine-tuning; however, relative to Full FT, LoRA+PHead reduces MV source retention by 4.7 percentage points (17.9% vs. 22.6%) under domain shift, while still enabling parameter-efficient and physically consistent AC-PF estimation.
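
To make the low-rank update concrete, the sketch below applies the standard LoRA form W' = W + (alpha / r) * B A to a tiny weight matrix and computes the fraction of trainable parameters for one adapted projection. The matrices, dimensions, and helper names are hypothetical; they are not the paper's architecture and do not reproduce its reported 85.46% reduction.

```python
def matmul(x, y):
    """Plain-Python matrix product of x (m x k) and y (k x n)."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def lora_weight(w, b, a, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the rank (rows of A)."""
    scale = alpha / len(a)
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_fraction(d_out, d_in, r):
    """Parameters in B (d_out x r) and A (r x d_in) relative to the full d_out x d_in matrix."""
    return r * (d_out + d_in) / (d_out * d_in)


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy values)
b = [[1.0], [2.0]]            # trainable, shape d_out x r with r = 1
a = [[0.5, 0.0]]              # trainable, shape r x d_in
adapted = lora_weight(w, b, a, alpha=1.0)
fraction = trainable_fraction(64, 64, 4)
print(adapted, fraction)
```

Only B and A receive gradients while W stays frozen, so the rank r is the knob that trades adaptation capacity against parameter count.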

2602.17684 2026-05-19 cs.LG cs.AI

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models

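Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo

Affiliations: LARK, HKUST(GZ); Kuaishou Technology; UCL; UZH; NUS

AI summary: This paper proposes CodeScaler, a reward model that scales both reinforcement learning training and test-time inference for code generation. Trained on carefully curated preference data with syntax-aware code extraction, it outperforms execution-based RL across four coding benchmarks by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base, gains +14.64 points over the base model without test cases when scaled with synthetic data, cuts inference latency 10-fold relative to unit test approaches, and surpasses existing reward models in the code, general, and reasoning domains.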

English abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, a reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across four coding benchmarks, CodeScaler consistently outperforms execution-based RL by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base. By further scaling to 44K problems with additional synthetic data, CodeScaler yields +14.64 points improvement over the base model without requiring any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
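
Test-time scaling with a reward model is commonly done by best-of-n selection: sample several candidate programs and keep the one the scorer ranks highest. The sketch below illustrates only that selection step. The `toy_reward` function is a hypothetical placeholder based on a syntax check, standing in for a learned reward model; it is not CodeScaler.

```python
import ast


def toy_reward(code):
    """Placeholder scorer; a real reward model would be a learned network."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 0.0  # unparseable candidates get the lowest score
    has_return = any(isinstance(node, ast.Return) for node in ast.walk(tree))
    return 2.0 if has_return else 1.0


def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward; no test cases are executed."""
    return max(candidates, key=reward_fn)


candidates = [
    "def add(a, b) return a + b",        # syntax error
    "def add(a, b):\n    a + b",          # parses but returns nothing
    "def add(a, b):\n    return a + b",   # parses and returns a value
]
best = best_of_n(candidates, toy_reward)
print(best)
```

Because selection needs only one scoring pass per candidate, its cost does not depend on running unit tests, which is consistent with the latency argument made in the abstract.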

2602.16990 2026-05-19 cs.AI cs.CE

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie

Affiliations: Georgia Institute of Technology; Columbia University; California State University; University of Montreal; The University of Manchester; McGill University

AI summary: This study introduces Conv-FinRe, a conversational and longitudinal benchmark for financial recommendation. Its multi-view references distinguish descriptive behavior from normative utility grounded in investor risk preferences, revealing a tension between rational decision quality and behavioral alignment.

Comments Accepted by SIGIR 2026 Resource Track. Pre-camera-ready version

English abstract

Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.

2602.12978 2026-05-19 cs.RO cs.AI

Learning Native Continuation for Action Chunking Flow Policies

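Yufeng Liu, Hang Yu, Juntu Zhao, Bocheng Li, Di Zhang, Mingzhu Li, Wenxuan Wu, Yingdong Hu, Junyuan Xie, Junliang Guo, Dequan Wang, Yang Gao

Affiliations: Spirit AI

AI summary: This paper proposes Legato, a training-time continuation method for action-chunked flow-based VLA policies that reduces discontinuities at chunk boundaries and spurious multimodal switching, improving trajectory smoothness and task completion time.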

Comments Accepted by Robotics: Science and Systems 2026 (RSS 2026). Project page: https://lyfeng001.github.io/Legato/

English abstract

Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule condition during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.

2602.12871 2026-05-19 cs.CL

MentalBench: A DSM-Grounded Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

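Hoyun Song, Migyeong Kang, Jisu Shin, Jihyun Kim, Chanbi Park, Hangyeol Yoo, Jihyun An, Alice Oh, Jinyoung Han, KyungTae Lim

Affiliations: KAIST; Sungkyunkwan University; Dongguk University Medical Center; Seoultech; Samsung Medical Center

AI summary: This paper proposes MentalBench, a benchmark for evaluating whether large language models can make DSM-grounded psychiatric diagnostic decisions under varying levels of clinical ambiguity. Built on a knowledge graph constructed and validated by psychiatrists, it comprises 24,750 synthetic clinical cases that systematically vary information completeness and diagnostic complexity. Experiments show that state-of-the-art LLMs perform well on noise-free queries but struggle to calibrate their confidence when distinguishing disorders with overlapping symptoms.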

English abstract

Large language models (LLMs) have attracted growing interest as supportive tools for psychiatric assessment and clinical decision support. However, existing mental health benchmarks largely rely on social media data or supportive dialogue settings, limiting their ability to assess whether models can apply formal diagnostic criteria and differential diagnostic rules. In this paper, we introduce MentalBench, a benchmark for evaluating whether LLMs can make DSM-grounded psychiatric diagnostic decisions under varying levels of clinical ambiguity. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as an expert-curated logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling DSM-grounded evaluation. Our experiments show that although state-of-the-art LLMs perform well on noise-free queries that probe DSM-5 knowledge, they struggle to calibrate their confidence when distinguishing between disorders with overlapping symptoms. These findings raise concerns about the reliability of LLMs as psychiatric decision-support tools and highlight the need for more evaluation that reflects the diverse challenges in real-world psychiatric diagnosis.

2602.12755 2026-05-19 cs.CV

Towards reconstructing experimental sparse-view X-ray CT data with diffusion models

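Nelas J. Thomsen, Xinyuan Wang, Felix Lucka, Ezgi Demircan-Tureyen

Affiliations: Martin-Luther-University Halle-Wittenberg, Institute of Physics, Halle, Germany; Centrum Wiskunde & Informatica, Computational Imaging Group, Amsterdam, The Netherlands

AI summary: This paper studies the reconstruction of sparse-view X-ray CT data with diffusion models, examining how training data mismatch (domain shift) and forward model mismatch affect application to experimental data. It finds that the effect of domain shift depends on its severity and that forward model mismatch can be mitigated with annealed likelihood weight schedules.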

Comments 5 pages + references, 4 figures, 2 tables, conference paper

English abstract

Diffusion-based image generators are promising priors for ill-posed inverse problems like sparse-view X-ray Computed Tomography (CT). As most studies consider synthetic data, it is not clear whether training data mismatch ("domain shift") or forward model mismatch complicates their successful application to experimental data. We measured CT data from a physical phantom resembling the synthetic Shepp-Logan phantom and trained diffusion priors on synthetic image data sets with different degrees of domain shift towards it. Then, we employed the priors in a Decomposed Diffusion Sampling scheme on sparse-view CT data sets of increasing difficulty, leading up to the experimental data. Our results reveal that domain shift plays a nuanced role: while severe mismatch causes model collapse and hallucinations, diverse priors match or exceed well-matched but narrow priors. Forward model mismatch pulls the image samples away from the prior manifold, which causes artifacts but can be mitigated with annealed likelihood weight schedules that also increase computational efficiency. Overall, we demonstrate that performance gains do not immediately translate from synthetic to experimental data, and future development must validate against real-world benchmarks.

2602.12687 2026-05-19 cs.LG cs.AI

Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty

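Jeonghyun Kim, SooKyung Kim, Richeng Xuan, Hyunsoo Cho

Affiliations: Ewha Womans University; Tencent

AI summary: This paper proposes Calibrated Uncertainty Distillation (CUD), a framework that revisits knowledge distillation from a distributional perspective to make dark knowledge more faithfully accessible. CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from calibrated targets rather than sharpened certainty, so that students benefit from confident signals on easy cases and structured uncertainty on hard ones, improving accuracy and reliability under distribution shift and on long-tail inputs.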

English abstract

The core of knowledge distillation lies in transferring the teacher's rich "dark knowledge": subtle probabilistic patterns that reveal how classes are related and how uncertainty is distributed. While this idea is well established, teachers trained with conventional cross-entropy often fail to preserve such signals. Their distributions collapse into sharp, overconfident peaks that appear decisive but are in fact brittle, offering little beyond the hard label or subtly hindering representation-level transfer. This overconfidence is especially problematic in high-cardinality tasks, where the nuances among many plausible classes matter most for guiding a compact student. Moreover, such brittle targets reduce robustness under distribution shift, leaving students vulnerable to miscalibration in real-world conditions. To address this limitation, we revisit distillation from a distributional perspective and propose Calibrated Uncertainty Distillation (CUD), a framework designed to make dark knowledge more faithfully accessible. Instead of uncritically adopting the teacher's overconfidence, CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from calibrated targets rather than sharpened certainty. By directly shaping the teacher's predictive distribution before transfer, our approach balances accuracy and calibration, allowing students to benefit from both confident signals on easy cases and structured uncertainty on hard ones. Across diverse benchmarks, CUD yields students that are not only more accurate, but also more calibrated under shift and more reliable on ambiguous, long-tail inputs.
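
As background for the overconfidence problem described above, the sketch below shows the classic temperature-softened distillation loss, not the CUD method itself. A higher temperature spreads an overconfident teacher's probability mass across the non-target classes, which is where the "dark knowledge" lives. The logits and function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [8.0, 2.0, 1.0]       # an overconfident teacher (toy logits)
sharp = softmax(teacher, 1.0)   # almost all mass on class 0
soft = softmax(teacher, 4.0)    # class relations become visible
mismatch = kd_loss(teacher, [1.0, 2.0, 8.0], 4.0)
print(sharp, soft, mismatch)
```

A single global temperature rescales every example the same way; the abstract's argument is that the teacher's distribution should instead be shaped so that uncertainty appears where it is informative.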

2602.11699 2026-05-19 cs.CL

Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models

在生成上下文中寻找意义:人类与语言模型的视角

Katrina Olsen, Sebastian Padó

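Affiliations * Grid Dynamics IMS University of Stuttgart

AI summary Human raters and language models judge sentences from five semantically deviant datasets to examine how anomalous sentences can be told apart from nonsensical ones; the paper finds that language models are good at generating plausible contexts.

Comments Accepted for publication at STARSEM 2026, San Diego, CA

AI Chinese abstract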

无意义和异常的句子在计算语义解释模型的发展中起到了关键作用。一个核心挑战是区分仅仅是异常(但可以在上下文中解释)和真正无意义的内容。然而,不清楚(a)现有数据集中的无意义程度,以及(b)LLMs能否做出这种区分。在本文中,我们通过收集人类评估者和LLMs对五种语义偏差数据集中的句子(包括无上下文和有上下文的情况)的可理解性判断来回答这两个问题。我们发现,评估者认为大多数句子仅是异常,只有少数被认为是真正的无意义。我们还显示,LLMs在为异常情况生成合理的上下文方面具有显著的能力。

English abstract

Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets, both context-free and with a supporting context provided. We find that raters consider most sentences at most anomalous, and only a few properly nonsensical. We also show that LLMs are considerably skilled at generating plausible contexts for anomalous cases.

2602.11130 2026-05-19 cs.LG cs.CV

Meltdown: Circuits and Bifurcations in Point-Cloud-Conditioned 3D Diffusion Transformers

Meltdown: 点云条件化3D扩散变换器中的电路与分叉

Maximilian Plattner, Fabian Paischer, Johannes Brandstetter, Arturs Berzins

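Affiliations * Institute for Machine Learning, JKU Linz

AI summary This study examines how point-cloud-conditioned 3D diffusion transformers fail under input variation, identifies the Meltdown phenomenon, explains its cause through a mechanistic case study, and proposes PowerRemap to suppress it.

AI Chinese abstract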

稀疏点云是3D表面重建中常见的输入模式,包括在安全关键领域如手术导航和自动驾驶感知中。最近的点云条件化3D扩散变换器在这一领域通过利用学习先验知识实现了最先进的结果。我们展示了这些模型在现实输入变化下可能灾难性地失败,并展示了其原因。我们识别出一种称为Meltdown的失败模式:对稀疏输入点云的微小表面扰动可以将重建输出分解成数百个不连通的部分。对抗搜索在两个开放权重的最先进架构(WaLa、Make-a-Shape)上恢复Meltdown,在真实世界数据集(GSO、SimJEB)和DDPM和DDIM采样下恢复率在89.9-100%。我们追踪Meltdown在正向传递中:它由点在表面上分布的均匀性决定,通过点云编码器忠实传递,并由扩散骨干中的单个早期去噪交叉注意力写入步骤所提交。扩散轨迹集合在接近此提交步骤时表现出对称性破裂,与反向过程的分叉一致。通过一系列匹配幅度的控制,我们证明模型提交的变量是方向性的,集中在写入扰动漂移的低维子空间中。受此发现启发,我们引入PowerRemap,一种测试时间控制,通过重塑局部写入的奇异谱来抑制此漂移,在WaLa上恢复率为98.3%,在Make-a-Shape上为84.6%。这些结果将电路级交叉注意力机制与轨迹级失败解释联系起来,展示了机理分析如何解释和指导条件扩散变换器的行为。

English abstract

Sparse point clouds are a common input modality for 3D surface reconstruction, including in safety-critical settings such as surgical navigation and autonomous perception. Recent point-cloud-conditioned 3D diffusion transformers achieve state-of-the-art results in this regime by leveraging learned priors. We show that these models can fail catastrophically under realistic input variation, and present a mechanistic case study of why. We identify a failure mode we call Meltdown: tiny on-surface perturbations to a sparse input point cloud can fracture the reconstructed output into hundreds of disconnected pieces. Adversarial search recovers Meltdown in 89.9-100% of shapes across the two open-weight state-of-the-art architectures we study (WaLa, Make-a-Shape) on real-world datasets (GSO, SimJEB) and under both DDPM and DDIM sampling. We trace Meltdown along the forward pass: it is governed by how uniformly the points are distributed on the surface, faithfully transduced through the point-cloud encoder, and committed by a single early-denoising cross-attention write in the diffusion backbone. Diffusion-trajectory ensembles exhibit symmetry-breaking near this commit step, consistent with a bifurcation of the reverse process. Through a suite of matched-magnitude controls, we show that the variable on which the model commits is directional, concentrated in a low-rank subspace of the write's perturbation drift. Motivated by this finding, we introduce PowerRemap, a test-time control that reshapes the singular spectrum of the localized write to suppress this drift, with rescue rates of 98.3% on WaLa and 84.6% on Make-a-Shape. Together, these results link a circuit-level cross-attention mechanism to a trajectory-level account of the failure, demonstrating how mechanistic analysis can explain and guide behavior in conditional diffusion transformers.

2602.07884 2026-05-19 cs.LG cs.AI

GRAFT: Decoupling Ranking and Calibration for Survival Analysis

GRAFT:分离排名与校准用于生存分析

Mohammad Ashhad, Robert Hoehndorf, Ricardo Henao

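Affiliations * KAUST; CEMSE KAUST; Duke University

AI summary This paper proposes GRAFT, a model that decouples prognostic ranking from survival calibration to address the ranking-calibration trade-off in survival analysis. It combines a linear AFT model with a non-linear residual neural network, uses stochastic gates for automatic feature selection, and achieves better discrimination and calibration on public benchmarks.

AI Chinese abstract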

生存分析受到删失数据、高维特征和非线性交互的挑战。经典模型提供可解释性和优越的校准能力,但局限于线性或预定义的功能形式,而深度学习模型具有灵活性并实现了强大的判别性能,但倾向于产生校准不佳的生存估计。为了解决这一权衡问题,我们提出GRAFT(Gated Residual Accelerated Failure Time),一种新的AFT模型,该模型将预测排名与生存校准分离。GRAFT的混合架构结合了线性AFT模型与非线性残差神经网络,并整合了随机门用于自动特征选择。该模型通过优化可微的、C-index对齐的排名损失进行训练,利用局部Kaplan-Meier估计器的随机条件插补,而校准的生存估计则通过简单的后训练校准获得。在公开基准测试中,GRAFT在判别能力和校准性能上优于基线模型,同时在高噪声设置中保持稳健和稀疏。

English abstract

Survival analysis is complicated by censored data, high-dimensional features, and non-linear interactions. Classical models offer interpretability and superior calibration but are restricted to linear or predefined functional forms, while deep learning models are flexible and achieve strong discriminative performance, but tend to produce poorly calibrated survival estimates. To address this trade-off, we propose GRAFT (Gated Residual Accelerated Failure Time), a novel AFT model that decouples prognostic ranking from survival calibration. GRAFT's hybrid architecture combines a linear AFT model with a non-linear residual neural network, and it also integrates stochastic gates for automatic feature selection. The model is trained by optimizing a differentiable, C-index-aligned ranking loss using stochastic conditional imputation from local Kaplan-Meier estimators, while calibrated survival estimates are obtained through simple post-training calibration. In public benchmarks, GRAFT outperforms baselines in discrimination and calibration, while remaining robust and sparse in high-noise settings.
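
The "ranking" side of the ranking/calibration split is usually measured with the concordance index. As a minimal sketch, the function below computes Harrell's C-index for right-censored data; it is the plain, non-differentiable metric, not the differentiable C-index-aligned loss or the Kaplan-Meier-based imputation that GRAFT trains with, and the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when subject i has the shorter time and
    its event was observed (events[i] == 1). The pair is concordant when
    the subject with the shorter time also has the higher risk score.
    Ties in risk score count as half a concordant pair.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: the second subject is censored (event indicator 0).
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 0, 1, 1]

perfect = concordance_index(times, events, [0.9, 0.1, 0.5, 0.2])
reversed_order = concordance_index(times, events, [0.1, 0.9, 0.5, 0.8])
```

Because the index depends only on the ordering of the risk scores, any monotone rescaling of the scores leaves it unchanged, which is one reason a model can rank well while its survival probabilities remain poorly calibrated.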

2602.05287 2026-05-19 cs.AI

Position: Universal Time Series Foundation Models Rest on a Category Error

位置:通用时间序列基础模型建立在类别错误上

Xilin Dai, Wanxu Cai, Zhijian Xu, Qiang Xu

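Affiliations * ZJU-UIUC Institute; School of Software; Department of Computer Science and Engineering

AI summary This paper argues that the pursuit of "universal time series foundation models" rests on a fundamental category error that mistakes a structural container for a semantic modality. Because time series hold incompatible generative processes (e.g., finance vs. fluid dynamics), monolithic models degenerate into expensive "generic filters" that fail to generalize under distributional drift. The authors introduce the "Autoregressive Blindness Bound", which proves that history-only models cannot predict intervention-driven regime shifts. They advocate replacing universality with a Causal Control Agent paradigm, in which an agent uses external context to orchestrate a hierarchy of specialized solvers, from frozen domain experts to lightweight just-in-time adaptors. Finally, they call for shifting benchmarks from "zero-shot accuracy" to "drift adaptation speed" to prioritize robust, control-theoretic systems.

AI Chinese abstract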

本文立场论文认为,追求'通用时间序列基础模型'建立在根本性的类别错误上,误将结构容器视为语义模态。我们指出,由于时间序列包含不兼容的生成过程(例如金融与流体动力学),单一大模型退化为昂贵的'通用过滤器',在分布漂移下无法泛化。为解决这一问题,我们引入'自回归盲目界限',一个理论极限,证明仅依赖历史的模型无法预测干预驱动的制度转变。我们主张用因果控制代理范式取代通用性,其中代理利用外部上下文协调一系列专门的求解器,从冻结领域专家到轻量级即时适应器。最后,我们呼吁将基准从'零样本准确性'转向'漂移适应速度',以优先考虑鲁棒、控制理论系统。

English abstract

This position paper argues that the pursuit of "Universal Foundation Models for Time Series" rests on a fundamental category error, mistaking a structural Container for a semantic Modality. We contend that because time series hold incompatible generative processes (e.g., finance vs. fluid dynamics), monolithic models degenerate into expensive "Generic Filters" that fail to generalize under distributional drift. To address this, we introduce the "Autoregressive Blindness Bound," a theoretical limit proving that history-only models cannot predict intervention-driven regime shifts. We advocate replacing universality with a Causal Control Agent paradigm, where an agent leverages external context to orchestrate a hierarchy of specialized solvers, from frozen domain experts to lightweight Just-in-Time adaptors. We conclude by calling for a shift in benchmarks from "Zero-Shot Accuracy" to "Drift Adaptation Speed" to prioritize robust, control-theoretic systems.

2602.03535 2026-05-19 cs.LG cs.NA math.NA math.OC

Sparse Training of Neural Networks based on Multilevel Mirror Descent

基于多级镜像下降法的神经网络稀疏训练

Yannick Lunk, Sebastian J. Scott, Leon Bungert

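Affiliations * Institute of Mathematics; University of Würzburg; Institute of Mathematics, CAIDAS University of Würzburg

AI summary This paper proposes a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits naturally occurring sparsity by alternating between static and dynamic sparsity pattern updates. It combines sparsity-inducing Bregman iterations with adaptive freezing of the network structure to explore the sparse parameter space efficiently while maintaining sparsity. Convergence is guaranteed through a multilevel optimization framework, and experiments on standard benchmarks show highly sparse, accurate models with substantial reductions in both the theoretical number of FLOPs and training time.

AI Chinese abstract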

我们介绍了一种基于线性化Bregman迭代/镜像下降的动态稀疏训练算法,该算法通过在静态和动态稀疏模式更新之间交替,利用自然产生的稀疏性。关键思想是将稀疏诱导的Bregman迭代与自适应冻结网络结构相结合,以在保持稀疏性的同时高效探索稀疏参数空间。我们通过将方法嵌入多级优化框架中,提供收敛保证。此外,我们实验证明,我们的算法可以在标准基准上产生高度稀疏且准确的模型。我们还显示,与SGD训练相比,理论上的FLOPs数量从标准Bregman迭代的38%减少到我们的方法的6%,同时保持测试精度。我们还显示,当使用稀疏感知的CPU实现时,训练时间可减少约50%。

English abstract

We introduce a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits the naturally incurred sparsity by alternating between periods of static and dynamic sparsity pattern updates. The key idea is to combine sparsity-inducing Bregman iterations with adaptive freezing of the network structure to enable efficient exploration of the sparse parameter space while maintaining sparsity. We provide convergence guarantees by embedding our method in a multilevel optimization framework. Furthermore, we empirically show that our algorithm can produce highly sparse and accurate models on standard benchmarks. We also show that the theoretical number of FLOPs compared to SGD training can be reduced from 38% for standard Bregman iterations to 6% for our method while maintaining test accuracy. We additionally show a training time reduction of about 50% when using a sparsity-aware CPU implementation of our method.
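
As a rough illustration of why Bregman iterations produce sparsity on their own, the sketch below runs a textbook linearized Bregman iteration with the elastic-net style regularizer J(θ) = λ‖θ‖₁ + ‖θ‖²/(2δ): a dual variable `v` accumulates gradient steps, and the weights are obtained by soft-thresholding it. This is only the basic iteration, not the multilevel scheme or the adaptive freezing of the sparsity pattern described in the abstract; the toy quadratic objective and every name and constant are assumptions made for the example.

```python
def shrink(value, lam):
    """Soft-thresholding: move value toward zero by lam, clipping at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def linbreg_step(theta, v, grad, tau, lam, delta):
    """One linearized Bregman step.

    v     <- v - tau * grad            (gradient step on the dual variable)
    theta <- delta * shrink(v, lam)    (mirror map back to the weights)
    """
    v_new = [vi - tau * gi for vi, gi in zip(v, grad)]
    theta_new = [delta * shrink(vi, lam) for vi in v_new]
    return theta_new, v_new

def grad_quadratic(theta, target):
    """Gradient of the toy loss 0.5 * sum((theta_i - target_i) ** 2)."""
    return [t - g for t, g in zip(theta, target)]

# Toy problem: only two of the four coordinates are needed.
target = [3.0, 0.0, -2.0, 0.0]
theta = [0.0, 0.0, 0.0, 0.0]
v = [0.0, 0.0, 0.0, 0.0]
support_sizes = []

for _ in range(200):
    grad = grad_quadratic(theta, target)
    theta, v = linbreg_step(theta, v, grad, tau=0.1, lam=1.0, delta=1.0)
    support_sizes.append(sum(1 for t in theta if t != 0.0))
```

The weights start at exactly zero and a coordinate becomes active only once its accumulated dual variable exceeds the threshold `lam`, so the iterates stay sparse throughout training rather than being pruned afterwards.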

2602.03352 2026-05-19 cs.CL

PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

PEGRL: 通过后编辑引导的强化学习改进机器翻译

Yunzhi Shen, Hao Zhou, Xin Huang, Xue Han, Junlan Feng, Shujian Huang

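Affiliations * National Key Laboratory for Novel Software Technology, Nanjing University; China Mobile Research Beijing, China

AI summary This paper proposes PEGRL, a framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization, improving machine translation performance.

AI Chinese abstract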

强化学习(RL)在基于大语言模型(LLM)的机器翻译中展现出强劲的潜力,近期方法如GRPO已取得显著进展;然而,面向翻译的RL仍然受到来自蒙特卡洛回报估计的噪声学习信号以及庞大的轨迹空间的挑战,后者倾向于全局探索而非细粒度的局部优化。我们引入PEGRL,一种两阶段的RL框架,利用后编辑作为辅助任务来稳定训练并引导整体优化。在每次迭代中,翻译输出被采样以构建后编辑输入,使后编辑阶段的回报估计能够受益于对当前翻译行为的条件化,同时共同支持全局探索和细粒度的局部优化。一个任务特定的加权方案进一步平衡翻译和后编辑目标的贡献,从而获得一个偏倚但更样本高效的估计器。在英语→芬兰语、英语→土耳其语以及英语↔中文的实验中,PEGRL在RL基线上表现出一致的提升,对于英语→土耳其语,其在COMET-KIWI上的性能与先进的LLM基系统(DeepSeek-V3.2)相当。我们的代码和一组代表性的预训练模型已公开在https://github.com/NJUNLP/peg-rl和https://huggingface.co/collections/DGME/pegrl。

English abstract

Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce PEGRL, a two-stage RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English→Finnish, English→Turkish, and English↔Chinese show consistent gains over RL baselines, and for English→Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2). Our code and a set of representative pretrained models are publicly available at https://github.com/NJUNLP/peg-rl and https://huggingface.co/collections/DGME/pegrl.

2602.02830 2026-05-19 cs.LG stat.ME

SC3D: Dynamic and Differentiable Causal Discovery for Temporal and Instantaneous Graphs

SC3D:动态和可微的因果发现用于时序和瞬时图

Sourajit Das, Dibyajyoti Chakraborty, Romit Maulik

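Affiliations * Institute for Computational Data Science; School of Mechanical Engineering

AI summary This paper proposes SC3D, a dynamic and differentiable causal discovery method for temporal and instantaneous graphs. Its two-stage differentiable framework jointly learns lag-specific adjacency matrices and an instantaneous directed acyclic graph, improving the stability and accuracy of the recovered causal structure.

Comments 12 pages

AI Chinese abstract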

从多变量时间序列中发现因果结构是一个关键问题,因为相互作用跨越多个滞后并可能涉及瞬时依赖。此外,动态图的搜索空间本质上是组合性的。在本研究中,我们提出稳定因果动态可微发现(SC3D),一种两阶段可微框架,联合学习滞后特定的邻接矩阵以及如果存在的话瞬时有向无环图(DAG)。在第一阶段,SC3D通过节点级预测进行边预选以获得滞后和瞬时边的掩码,而第二阶段通过优化具有稀疏性的似然函数并强制瞬时块的无环性来细化这些掩码。在合成SVAR系统、非线性和混沌基准、非平稳动态和现实世界数据集上的数值结果表明,SC3D在稳定性和准确性方面优于现有基线,能够更准确地恢复滞后和瞬时因果结构。

English abstract

Discovering causal structures from multivariate time series is a key problem because interactions span multiple lags and may involve instantaneous dependencies. Additionally, the search space of the dynamic graphs is combinatorial in nature. In this study, we propose Stable Causal Dynamic Differentiable Discovery (SC3D), a two-stage differentiable framework that jointly learns lag-specific adjacency matrices and, if present, an instantaneous directed acyclic graph (DAG). In Stage 1, SC3D performs edge preselection through node-wise prediction to obtain masks for lagged and instantaneous edges, whereas Stage 2 refines these masks by optimizing a likelihood with sparsity while enforcing acyclicity on the instantaneous block. Numerical results across synthetic SVAR systems, nonlinear and chaotic benchmarks, nonstationary dynamics and real-world datasets demonstrate that SC3D achieves improved stability and more accurate recovery of both lagged and instantaneous causal structures compared to existing baselines.

2602.00924 2026-05-19 cs.AI

Supervised sparse auto-encoders for interpretable and compositional representations

监督稀疏自编码器用于可解释和组合性表示

Ouns El Harzli, Hugo Wallner, Yoonsoo Nam, Haixuan Xavier Tao

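Affiliations * Department of Computer Science, University of Oxford, Oxford, UK; KAIST AI, Korean Advanced Institute of Science; Independent researcher

AI summary This paper proposes supervised sparse auto-encoders that combine unconstrained feature models with supervised learning to address two shortcomings of sparse auto-encoders, non-smoothness and poor alignment between features and human semantics, and achieves compositional generalization and semantic image editing.

AI Chinese abstract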

稀疏自编码器(SAEs)重新成为机制可解释性的重要方法,但面临两个重大挑战:$L_1$惩罚的非光滑性阻碍了重建和可扩展性,以及学习到的特征与人类语义不一致。在本文中,我们通过适应无约束特征模型,一种来自神经崩溃理论的数学框架,并通过监督任务来解决这些限制。我们监督(解码器-only)SAEs通过联合学习稀疏概念嵌入和解码器权重来重建特征向量。在Stable Diffusion 3.5上验证,我们的方法展示了组合性泛化,成功重建了训练期间未见过的概念组合图像,并在不修改提示的情况下实现了特征级的语义图像编辑。

English abstract

Sparse auto-encoders (SAEs) have re-emerged as a prominent method for mechanistic interpretability, yet they face two significant challenges: the non-smoothness of the $L_1$ penalty, which hinders reconstruction and scalability, and a lack of alignment between learned features and human semantics. In this paper, we address these limitations by adapting unconstrained feature models, a mathematical framework from neural collapse theory, and by supervising the task. We supervise (decoder-only) SAEs to reconstruct feature vectors by jointly learning sparse concept embeddings and decoder weights. Validated on Stable Diffusion 3.5, our approach demonstrates compositional generalization, successfully reconstructing images with concept combinations unseen during training, and enabling feature-level intervention for semantic image editing without prompt modification.

2601.22297 2026-05-19 cs.CL

Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

从自我辩论中学习:为多智能体辩论准备推理模型

Chenxi Liu, Yanshuo Chen, Ruibo Chen, Tianyi Xiong, Tong Zheng, Heng Huang

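Affiliations * Department of Computer Science

AI summary This paper proposes the SDRL framework, which trains models through self-debate so that they gain both standalone problem-solving ability and the ability to handle diverse reasoning in multi-agent debate. Experiments show that it improves multi-agent debate performance across debate protocols and agent configurations, as well as single-model reasoning.

AI Chinese abstract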

大型语言模型(LLM)的推理能力已通过可验证奖励的强化学习(RLVR)显著提升。在测试阶段,通过多智能体辩论(MAD)进行协作推理已成为提升LLM性能的有希望方法。然而,当前RLVR方法通常训练LLM独立解决问题,而没有明确准备它们在辩论中综合和受益于不同推理路径。在本文中,我们提出了自我辩论强化学习(SDRL),一种训练框架,其中模型从自我辩论中学习,使单个LLM具备强大的独立问题解决能力和处理MAD中多样化推理轨迹的能力。给定提示后,SDRL首先采样多个候选解决方案,然后构建具有多样化推理路径的辩论环境,并生成基于此环境的第二轮响应。最后,SDRL联合优化初始和辩论条件响应,产生一个既能作为独立求解器又能作为辩论参与者有效的模型。在多个基础模型和推理基准上的实验表明,SDRL在多种辩论协议和智能体配置下均能提升MAD性能,同时增强单模型推理能力。

English abstract

The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework where models learn from self-debate, equipping a single LLM with both strong standalone problem-solving ability and the capability to process diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL consistently improves MAD performance across diverse debate protocols and agent configurations, while simultaneously strengthening single-model reasoning.

2601.21841 2026-05-19 cs.CL

Embodied Task Planning via Graph-Informed Action Generation with Large Language Models

通过大型语言模型的图引导动作生成进行具身任务规划

Xiang Li, Ning Yan, Masood Mortazavi

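Affiliations * Purdue University; Futurewei Technologies

AI summary This paper proposes the GiG framework, which encodes environmental states with a graph neural network, builds action-connected execution trace graphs, and adds a bounded lookahead module to improve the planning ability of embodied agents, achieving significant gains on three embodied planning benchmarks.

Comments Accepted by ICML 2026

AI Chinese abstract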

尽管大型语言模型(LLMs)在零样本推理能力方面表现出色,但将其作为具身代理部署仍面临长周期规划的根本挑战。与开放性文本生成不同,具身代理必须将高层意图分解为可操作的子目标,同时遵守动态环境的约束。标准LLM规划器由于上下文窗口限制或幻觉状态转换而难以维持策略一致性。我们提出GiG,一种通过图-图架构结构化具身代理记忆的规划框架。我们的方法利用图神经网络(GNN)将环境状态编码为嵌入,将这些嵌入组织成动作连接的执行轨迹图,存储在经验记忆库中。GiG能够检索结构相似的先例,使代理能基于相关过去结构模式做出决策。此外,我们引入了一个有限前瞻性模块,利用符号转换逻辑通过基于现实的动作投影增强代理的规划能力。我们在三个具身规划基准测试中评估了我们的框架——Robotouille Synchronous、Robotouille Asynchronous和ALFWorld。我们的方法优于最先进的基线,分别在Robotouille Synchronous、Asynchronous和ALFWorld上实现了高达22%、37%和15%的Pass@1性能提升,同时保持可比或更低的计算成本。

English abstract

While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intents into actionable sub-goals while adhering to the constraints of a dynamic environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitations, or hallucinate state transitions that violate environment constraints. We propose GiG, a planning framework that structures embodied agents' memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. GiG enables retrieval of structurally similar priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a bounded lookahead module that leverages symbolic transition logic to enhance the agent's planning capabilities through grounded action projections. We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld while maintaining comparable or lower computational cost.

2601.21468 2026-05-19 cs.AI

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

MemOCR: 一种面向布局的视觉记忆用于高效的长周期推理

Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi GU, Hui Su, Xunliang Cai, Xiang Wang, An Zhang

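Affiliations * University of Science and Technology of China; National University of Singapore; School of Computing

AI summary MemOCR allocates information density adaptively through visual layout to make long-horizon reasoning more efficient under a limited context budget. It maintains a structured rich-text memory and renders it as an image, visually prioritizing key evidence and compressing auxiliary details, and it outperforms text-based baselines across benchmarks.

AI Chinese abstract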

长周期代理推理需要有效地将增长的交互历史压缩到有限的上下文窗口中。现有的记忆系统通常将历史序列化为文本,其中每个标记的费用是均匀的,并且随着长度线性增长,往往在低价值细节上消耗稀缺的预算。为此,我们引入了MemOCR,一种多模态记忆代理,通过通过视觉布局进行自适应信息密度分配,从而在有限的上下文预算下提高长周期推理的效率。具体而言,MemOCR维护一个结构化的丰富文本记忆(例如标题、重点),并将其渲染为图像,供代理在记忆访问时参考,通过视觉优先级分配关键证据,同时积极压缩辅助细节。为了确保在不同内存预算下的鲁棒性,我们通过强化学习训练MemOCR,使用预算意识目标,使代理能够适应不同的压缩水平。在长上下文多跳和单跳问答基准测试中,MemOCR优于强大的文本基线,并在极端预算下实现了更有效的上下文利用。

English abstract

Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.

2601.21357 2026-05-19 cs.LG

Beyond Objective-Based Improvement: Stationarity-Aware Expected Improvement for Bayesian Optimization

超越基于目标的改进:面向站性的期望改进用于贝叶斯优化

Joshua Hang Sai Ip, Georgios Makrygiorgos, Ali Mesbah

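Affiliations * Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA, USA

AI summary This paper proposes a new acquisition function, Expected Improvement via Gradient Norms (EI-GN), which extends the improvement principle with a first-order stationarity condition to promote sampling in regions that are both high-performing and close to stationary points. Embedding progress toward stationarity in the acquisition criterion yields a richer notion of improvement.

AI Chinese abstract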

贝叶斯优化(BO)是一种用于优化昂贵黑盒函数的原理性框架,期望改进(EI)是其最广泛使用的获取函数之一。尽管在经验上取得了成功,但EI对一阶最优性条件漠不关心,仅依赖于目标值的改进。因此,它可能会在改进标准无信息的情况下表现出消失的获取信号,限制了其在引导搜索中的有效性。我们提出期望改进通过梯度范数(EI-GN),一种新的获取函数,将改进原则扩展到包含一阶站性,促进在高表现且接近站点的区域采样。我们推导了EI-GN的可计算闭式表达式,并证明其仍保持与基于改进的获取框架的一致性。通过在获取标准中嵌入向站性进展,EI-GN提供了一个更丰富和信息更丰富的改进概念。在标准BO基准上的实验证明了与基线方法的一致性改进,我们进一步展示了其在控制策略学习中的适用性。

English abstract

Bayesian Optimization (BO) is a principled framework for optimizing expensive black-box functions, with Expected Improvement (EI) among its most widely used acquisition functions. Despite its empirical success, EI is agnostic to first-order optimality conditions, relying solely on objective-value improvement. As a result, it can exhibit vanishing acquisition signals where the improvement criterion is uninformative, limiting its effectiveness in guiding search. We propose Expected Improvement via Gradient Norms (EI-GN), a novel acquisition function that extends the improvement principle to incorporate first-order stationarity, promoting sampling in regions that are both high-performing and close to stationary points. We derive a tractable closed-form expression for EI-GN and show that it remains consistent with the improvement-based acquisition framework. By embedding progress toward stationarity into the acquisition criterion, EI-GN provides a richer and more informative notion of improvement. Empirical results on standard BO benchmarks demonstrate consistent gains over baseline methods, and we further illustrate its applicability to control policy learning.
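
For reference, the baseline that EI-GN extends can be written in a few lines. The sketch below implements the standard closed-form Expected Improvement for a maximization problem under a Gaussian posterior, EI = (μ − f*)Φ(z) + σφ(z) with z = (μ − f*)/σ. It shows only ordinary EI and the vanishing signal mentioned in the abstract; the gradient-norm term of EI-GN is not given in the abstract and is not reproduced here. Function and variable names are hypothetical.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization.

    mu, sigma: posterior mean and standard deviation at a candidate point.
    best:      best objective value observed so far.
    """
    if sigma <= 0.0:
        # No posterior uncertainty: improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)

# A point whose mean equals the incumbent still has positive EI from uncertainty.
ei_uncertain = expected_improvement(mu=0.0, sigma=1.0, best=0.0)

# A confidently poor point: EI is numerically zero and gives no search direction.
ei_vanishing = expected_improvement(mu=-5.0, sigma=0.1, best=0.0)
```

The second case is the kind of region where the acquisition surface is flat at zero, so an optimizer of plain EI receives no signal there regardless of how close the point is to a stationary point of the objective.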

2601.19300 2026-05-19 cs.LG

Queue Length Regret Bounds for Contextual Queueing Bandits

上下文队列强化学习中的队列遗憾界

Seoungbin Bae, Garyeong Kang, Dabeen Lee

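Affiliations * Department of Industrial & Systems Engineering, KAIST; Department of Mathematical Sciences, Seoul National University; Research Institute of Mathematics, Seoul National University; Korea Institute for Advanced Study; Interdisciplinary Program in Artificial Intelligence, Seoul National University

AI summary This paper introduces contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Using heterogeneous contextual features, the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model with an unknown server-specific parameter. Policy performance is measured by queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge is that, at a given time step, the lists of remaining job features in the queue may differ between the policy and the optimal policy, because the two may process jobs in different orders. To address this, the authors propose policy-switching queues equipped with a sophisticated coupling argument. This yields a new queue length regret decomposition framework that captures the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. The algorithm CQB-ε achieves a regret upper bound of Õ(T^{-1/4}). For adversarially chosen contexts, a second algorithm, CQB-Opt, achieves a regret upper bound of O(log²T). Experimental results validate the theoretical findings.

AI Chinese abstract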

我们引入了上下文队列强化学习,一种新的上下文感知框架,用于调度的同时学习未知的服务速率。个体任务携带异质的上下文特征,基于此,智能体选择一个任务并将其与一个服务器匹配以最大化离开速率。服务/离开速率由具有未知服务器特定参数的逻辑模型决定。为了评估策略的性能,我们考虑队列长度遗憾,定义为策略与最优策略之间队列长度的差异。主要挑战在于,在给定时间步长下,队列中剩余任务特征列表可能因策略与最优策略的不同而不同,因为它们可能以不同的顺序处理任务。为此,我们提出了带有复杂耦合论证的策略切换队列的概念。这导致了一种新的队列长度遗憾分解框架,使我们能够理解选择次优任务-服务器对的短期影响及其对队列状态差异的长期影响。我们证明了我们的算法CQB-ε达到了O(T^{-1/4})的遗憾上界。我们还考虑了对抗性选择的上下文设置,其中我们的第二个算法CQB-Opt达到了O(log²T)的遗憾上界。最后,我们提供了实验结果以验证我们的理论发现。

English abstract

We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.

2601.18442 2026-05-19 cs.RO

SG-CADVLM: A Context-Aware Decoding Powered Vision Language Model for Safety-Critical Scenario Generation

SG-CADVLM: 一种基于上下文感知解码的视觉语言模型,用于安全关键场景生成

Hongyi Zhao, Shuo Wang, Qijie He, Ziyuan Pu

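Affiliations * School of Transportation, Southeast University

AI summary This paper proposes SG-CADVLM, a framework that combines context-aware decoding with multimodal input processing to generate high-fidelity safety-critical scenarios from crash reports. By reducing vision-language model hallucination while generating road geometry and vehicle trajectories simultaneously, it improves the accuracy and usefulness of the generated scenarios.

AI Chinese abstract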

自动驾驶(AV)需要在安全关键场景中进行严格测试以确保安全性验证,但其验证受到实地测试成本高和现有模拟在罕见安全关键事件中保真度不足的限制。碰撞报告提供了丰富的现实世界事故动态规范,使其成为大型语言模型和视觉语言模型生成高保真场景的有前景资源。然而,现有模型由于上下文抑制常偏离实际事故特征。为了解决这些限制,本文提出了SG-CADVLM,一种整合上下文感知解码与多模态输入处理的框架,用于从碰撞报告中生成安全关键场景。该框架在生成道路几何和车辆轨迹的同时减轻了VLMs的幻觉。实验结果表明,SG-CADVLM生成结合关键和高风险场景的速率比基线方法高88.1%(相比31.2%),代表了182%的提升,同时生成可用于自动驾驶测试的可执行模拟。

English abstract

Autonomous vehicles (AVs) require rigorous testing in safety-critical scenarios for safety validation, yet this validation is hindered by the high cost of field testing and the lack of fidelity in current simulations for rare safety-critical events. Crash reports offer rich and authentic specifications of real-world accident dynamics, making them a promising resource for Large Language Models and Vision-Language Models to generate high-fidelity scenarios. However, existing models frequently deviate from actual accident characteristics due to context suppression. To address these limitations, this paper presents SG-CADVLM, a framework integrating Context-Aware Decoding with multimodal input processing to generate safety-critical scenarios from crash reports. The framework mitigates the hallucination of VLMs while generating road geometry and vehicle trajectories simultaneously. The experimental results demonstrate that SG-CADVLM generates combined critical and high-risk scenarios at a rate of 88.1% compared to 31.2% for the baseline methods, representing a 182% improvement, while producing executable simulations for autonomous vehicle testing.
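
The 182% figure is a relative gain over the baseline rate, not a difference in percentage points. Using only the two rates quoted in the abstract, the arithmetic works out as follows:

```latex
\[
\frac{88.1 - 31.2}{31.2} \;=\; \frac{56.9}{31.2} \;\approx\; 1.82
\quad\Longrightarrow\quad \text{about a } 182\% \text{ relative improvement,}
\]
\[
88.1\% - 31.2\% \;=\; 56.9 \text{ percentage points (absolute difference).}
\]
```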

2601.17887 2026-05-19 cs.AI

When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

当个性化合理化风险:揭示个性化对话代理中的安全漏洞

Jiahe Guo, Xiangran Guo, Yulin Hu, Zimo Long, Xingyu Sui, Xuda Zhi, Yongbo Huang, Hao He, Weixiang Zhao, Yanyan Zhao, Bing Qin

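Affiliations * Harbin Institute of Technology; SERES Group Co., Ltd

AI summary This paper studies intent legitimation, a safety failure mode in personalized dialogue agents. It introduces the PS-Bench benchmark, shows how personalized memory biases intent inference and leads models to legitimize harmful queries, and proposes a lightweight detection-reflection method to reduce safety degradation.

AI Chinese abstract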

长期记忆使大型语言模型(LLM)代理能够支持个性化和持续的交互。然而,大多数关于个性化代理的研究优先考虑效用和用户体验,将记忆视为中性组件,并在很大程度上忽略了其安全影响。在本文中,我们揭示了意图合理化,一种此前未被充分探讨的安全故障,在个性化代理中,良性个人记忆会偏移意图推断,导致模型合理化本质上有害的查询。为了研究这一现象,我们引入了PS-Bench,一个用于识别和量化个性化交互中意图合理化的基准测试。在多个增强记忆的代理框架和基础LLM中,个性化将攻击成功率提高了15.8%至243.7%相对于无状态基线。我们进一步从内部表示空间提供了意图合理化的机理证据,并提出了一种轻量级的检测-反思方法,有效减少了安全退化。总体而言,我们的工作提供了首次系统探索和评估意图合理化作为一种安全故障模式,这种模式自然地从良性、现实世界的个性化中产生,突显了在长期个人背景下评估安全的重要性。我们的代码可在:https://github.com/MuyuenLP/PS-Bench获得。警告:本文可能包含有害内容。

English abstract

Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8% to 243.7% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from the internal representation space, and propose a lightweight detection-reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. Our code is available at: https://github.com/MuyuenLP/PS-Bench. WARNING: This paper may contain harmful content.

2601.16414 2026-05-19 cs.LG cs.AI

PyHealth 2.0: A Comprehensive Open-Source Toolkit for Accessible and Reproducible Clinical Deep Learning

PyHealth 2.0: 一个全面的开源工具包,用于可访问和可重复的临床深度学习

John Wu, Yongda Fan, Zhenbang Wu, Paul Landes, Eric Schrock, Sayeed Sajjad Razin, Arjun Chatterjee, Naveen Baskaran, Joshua Steier, Andrea Fitzpatrick, Bilal Arif, Rian Atri, Jathurshan Pradeepkumar, Siddhartha Laghuvarapu, Junyi Gao, Adam R. Cross, Jimeng Sun

Affiliations * University of Illinois Urbana-Champaign, Urbana, IL, USA; PyHealth Research Initiative; University of Illinois College of Medicine, Chicago, IL, USA; The University of Edinburgh, Edinburgh, UK; Health Data Research UK, London, UK; Department of Biomedical Engineering, Bangladesh University of Engineering

AI summary This paper presents PyHealth 2.0, a comprehensive open-source toolkit that addresses reproducibility and accessibility problems in clinical AI research by unifying 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, and uncertainty quantification methods, enabling predictive modeling in as few as 7 lines of code.

Comments Under Review

AI Chinese abstract

难以复制基线、高计算成本和所需领域专业知识创建了持续存在的临床AI研究障碍。为了解决这些挑战,我们介绍了PyHealth 2.0,一个增强的临床深度学习工具包,使在7行代码内即可实现预测建模。PyHealth 2.0提供了三个关键贡献:(1) 一个全面的工具包,通过统一15+数据集、20+临床任务、25+模型、5+可解释性方法和不确定性量化(包括符合预测的置信预测)在一个框架中解决可重复性和兼容性挑战,支持多种临床数据模态——信号、影像和电子健康记录——并翻译5+医学编码标准;(2) 以可访问性为重点的设计,支持多模态数据和多样化的计算资源,处理速度比以往快39倍,内存使用减少20倍,使从16GB笔记本电脑到生产系统都能轻松使用;(3) 一个活跃的开源社区,拥有400多名成员,通过详尽的文档、可重复研究贡献以及与学术医疗系统和产业伙伴的合作,包括通过RHealth实现的多语言支持,降低了领域专业知识的障碍。PyHealth 2.0建立了一个开源基础和社区,推动了可访问和可重复的医疗AI发展。可在pip install pyhealth中获取。

English abstract

Difficulty replicating baselines, high computational costs, and required domain expertise create persistent barriers to clinical AI research. To address these challenges, we introduce PyHealth 2.0, an enhanced clinical deep learning toolkit that enables predictive modeling in as few as 7 lines of code. PyHealth 2.0 offers three key contributions: (1) a comprehensive toolkit addressing reproducibility and compatibility challenges by unifying 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, and uncertainty quantification including conformal prediction within a single framework that supports diverse clinical data modalities (signals, imaging, and electronic health records) with translation of 5+ medical coding standards; (2) accessibility-focused design accommodating multimodal data and diverse computational resources with up to 39x faster processing and 20x lower memory usage, enabling work from 16GB laptops to production systems; and (3) an active open-source community of 400+ members lowering domain expertise barriers through extensive documentation, reproducible research contributions, and collaborations with academic health systems and industry partners, including multi-language support via RHealth. PyHealth 2.0 establishes an open-source foundation and community advancing accessible, reproducible healthcare AI. Available via pip install pyhealth.