arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.20042 2026-05-19 cs.CL

AI Alignment Breaks at the Edge

Han Bao, Yue Huang, Xiaoda Wang, Zheyuan Zhang, Yujun Zhou, Carl Yang, Xiangliang Zhang, Yanfang Ye

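AI summary This position paper examines how AI alignment fails on edge cases and proposes Edge alignment, a detection, evaluation, and governance agenda that makes failures under value conflict, plural stakeholder disagreement, and epistemic ambiguity visible and actionable.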

Comments 38 pages, 6 figures

Abstract

General Alignment has improved average-case helpfulness and safety, but current alignment practice still rewards confident, single-turn responses. The problem is not only that models fail on edge cases; it is that current evaluation makes many of these failures hard to see. We take the position that alignment must move beyond average-case evaluation by making failures under value conflict, plural stakeholder disagreement, and epistemic ambiguity visible and actionable. Scalar rewards compress diverse values into a single number; data and evaluation regimes collapse, filter, or fail to elicit the cases where alignment is hardest; and governance often lacks mechanisms for adjudicating contested cases. These blind spots produce value flattening, representation loss, and uncertainty blindness. We use Edge alignment to name a detection, evaluation, and governance agenda for surfacing these failures and connecting them to appropriate interventions. Rather than a single training objective, Edge alignment defines the conditions under which standard alignment should yield to mechanisms that preserve multidimensional value structure, represent plural perspectives, and support uncertainty-aware interaction. A pilot diagnostic set of 91 edge cases and four contemporary models illustrates that ordinary helpfulness and safety readings can miss process failures that edge-aware evaluation exposes. We outline operational edge signals, process-aware evaluation criteria, and a three-phase process stack that reframes alignment as a lifecycle problem of dynamic normative governance.

2602.18227 2026-05-19 cs.LG

Parameter-Efficient Domain Adaptation of Physics-Informed Self-Attention based GNNs for AC Power Flow Prediction

Redwanul Karim, Changhun Kim, Timon Conrad, Nora Gourmelon, Julian Oelhaf, David Riebesel, Tomás Arias-Vergara, Andreas Maier, Johann Jäger, Siming Bayer

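AI summary This paper studies parameter-efficient domain adaptation for physics-informed self-attention-based GNNs. A physics-based loss encourages Kirchhoff-consistent behavior while adaptation is restricted to low-rank updates, yielding a controllable efficiency-accuracy trade-off under voltage-regime shift.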

Abstract

Accurate AC power flow (AC-PF) prediction under domain shift is critical when models trained on medium-voltage (MV) grids are deployed on high-voltage (HV) networks. Existing physics-informed graph neural network (GNN) solvers typically rely on full fine-tuning for cross-regime transfer, incurring high retraining cost and offering limited control over the stability-plasticity trade-off between target-domain adaptation and source-domain retention. We study parameter-efficient domain adaptation for physics-informed self-attention-based GNNs, encouraging Kirchhoff-consistent behavior via a physics-based loss while restricting adaptation to low-rank updates. Specifically, we apply low-rank adaptation (LoRA) to attention projections with selective unfreezing of the prediction head to regulate adaptation capacity. This design yields a controllable efficiency-accuracy trade-off for physics-constrained inverse estimation under voltage-regime shift. Across multiple grid topologies, the proposed LoRA+PHead adaptation recovers near-full fine-tuning accuracy with a target-domain RMSE gap of $2.6 \times 10^{-4}$ while reducing the number of trainable parameters by $85.46\%$. The physics-based residual remains comparable to full fine-tuning; however, relative to Full FT, LoRA+PHead reduces MV source retention by 4.7 percentage points (17.9% vs. 22.6%) under domain shift, while still enabling parameter-efficient and physically consistent AC-PF estimation.
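The low-rank update described above can be illustrated with a minimal sketch in plain Python. It assumes the standard LoRA formulation, in which a frozen projection W receives an additive update (alpha / rank) * B A; the matrix sizes, the `alpha` value, and all function names are hypothetical and are not taken from the paper.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Frozen projection plus scaled low-rank update: W x + (alpha / rank) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def trainable_fraction(d_out, d_in, rank):
    """Parameters trained by LoRA (A and B) relative to updating W in full."""
    return rank * (d_in + d_out) / (d_out * d_in)

# Hypothetical sizes for one attention projection.
d_out, d_in, rank, alpha = 8, 8, 2, 4.0
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
B = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero-initialised
x = [rng.gauss(0, 1) for _ in range(d_in)]

y_frozen = matvec(W, x)
y_adapted = lora_forward(W, A, B, x, alpha, rank)  # equals y_frozen while B is all zeros
print(trainable_fraction(d_out, d_in, rank))       # 0.5
```

Because B starts at zero, the adapted layer initially reproduces the source-domain model, and only the small A and B matrices move during adaptation. The toy fraction printed here is unrelated to the 85.46% reduction reported in the abstract.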

2602.17684 2026-05-19 cs.LG cs.AI

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models

Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo

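AI summary This paper proposes CodeScaler, a reward model for scaling both reinforcement learning training and test-time inference in code generation. Trained on carefully curated preference data with syntax-aware code extraction, it outperforms execution-based RL across four coding benchmarks by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base, gains +14.64 points over the base model without any test cases when scaled with additional synthetic data, reduces inference latency 10-fold relative to unit test approaches, and surpasses existing reward models in the code, general, and reasoning domains.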

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, a reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across four coding benchmarks, CodeScaler consistently outperforms execution-based RL by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base. By further scaling to 44K problems with additional synthetic data, CodeScaler yields +14.64 points improvement over the base model without requiring any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
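One common way to use a reward model for test-time scaling is best-of-N reranking: sample several candidate programs and keep the one the reward model scores highest, without running unit tests. The sketch below assumes that setup, which the abstract does not confirm; `toy_reward` is a placeholder scorer invented for illustration and is not CodeScaler.

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward score, plus all scores."""
    scores = [reward_fn(code) for code in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores

def toy_reward(code):
    """Placeholder scorer (NOT a real reward model).

    Rejects code that fails to parse and mildly prefers shorter programs.
    """
    try:
        compile(code, "<candidate>", "exec")  # syntax check only; nothing is executed
    except SyntaxError:
        return -1.0
    return 1.0 / (1.0 + len(code))

candidates = [
    "def add(a, b) return a + b",  # syntax error: missing colon
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    result = a + b\n    return result\n",
]
best, scores = best_of_n(candidates, toy_reward)
print(best)
```

Scoring is a single pass over each candidate with no sandboxed execution, which is where a latency saving over unit-test-based selection would come from in this kind of setup.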

2602.16990 2026-05-19 cs.AI cs.CE

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie

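AI summary This work introduces Conv-FinRe, a conversational and longitudinal benchmark for evaluating financial recommendation models on utility rather than behavior matching alone. Multi-view references distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, revealing a tension between rational decision quality and behavioral alignment.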

Comments Accepted by SIGIR 2026 Resource Track. Pre-camera-ready version

Abstract

Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.

2602.12978 2026-05-19 cs.RO cs.AI

Learning Native Continuation for Action Chunking Flow Policies

Yufeng Liu, Hang Yu, Juntu Zhao, Bocheng Li, Di Zhang, Mingzhu Li, Wenxuan Wu, Yingdong Hu, Junyuan Xie, Junliang Guo, Dequan Wang, Yang Gao

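AI summary This paper proposes Legato, a training-time continuation method for action-chunked flow-based VLA policies that reduces discontinuities at chunk boundaries and spurious multimodal switching, improving trajectory smoothness and task completion time.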

Comments Accepted by Robotics: Science and Systems 2026 (RSS 2026). Project page: https://lyfeng001.github.io/Legato/

Abstract

Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule condition during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.
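The idea of initializing denoising from a schedule-shaped mixture of known actions and noise can be pictured with a small sketch. Everything specific here is an assumption for illustration: the linear decay, the `overlap` parameter, the one-dimensional actions, and the function names are hypothetical, and the actual schedule used by Legato may differ.

```python
import random

def mixing_schedule(chunk_len, overlap):
    """Hypothetical schedule: weight 1 at the first step, decaying linearly to 0 after `overlap` steps."""
    return [max(0.0, 1.0 - i / overlap) for i in range(chunk_len)]

def init_denoising(known_actions, weights, noise):
    """Per-step blend w * known + (1 - w) * noise used as the denoising start point."""
    return [w * a + (1.0 - w) * n for a, w, n in zip(known_actions, weights, noise)]

chunk_len, overlap = 6, 3
rng = random.Random(0)
# Actions already committed from the previous chunk; padded with zeros where unknown.
known = [0.40, 0.45, 0.50, 0.0, 0.0, 0.0]
noise = [rng.gauss(0, 1) for _ in range(chunk_len)]

weights = mixing_schedule(chunk_len, overlap)
x_start = init_denoising(known, weights, noise)
print(weights)
```

Early steps in the chunk start close to the actions that are already being executed, while later steps start from pure noise, so the model sees partial action information rather than an all-noise start.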

2602.12871 2026-05-19 cs.CL

MentalBench: A DSM-Grounded Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

Hoyun Song, Migyeong Kang, Jisu Shin, Jihyun Kim, Chanbi Park, Hangyeol Yoo, Jihyun An, Alice Oh, Jinyoung Han, KyungTae Lim

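AI summary This paper introduces MentalBench, a benchmark for evaluating whether large language models can make DSM-grounded psychiatric diagnostic decisions under varying levels of clinical ambiguity. Built on a psychiatrist-built and validated knowledge graph, it comprises 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity. Experiments show that although state-of-the-art LLMs perform well on noise-free queries, they struggle to calibrate their confidence when distinguishing between disorders with overlapping symptoms.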

Abstract

Large language models (LLMs) have attracted growing interest as supportive tools for psychiatric assessment and clinical decision support. However, existing mental health benchmarks largely rely on social media data or supportive dialogue settings, limiting their ability to assess whether models can apply formal diagnostic criteria and differential diagnostic rules. In this paper, we introduce MentalBench, a benchmark for evaluating whether LLMs can make DSM-grounded psychiatric diagnostic decisions under varying levels of clinical ambiguity. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as an expert-curated logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling DSM-grounded evaluation. Our experiments show that although state-of-the-art LLMs perform well on noise-free queries that probe DSM-5 knowledge, they struggle to calibrate their confidence when distinguishing between disorders with overlapping symptoms. These findings raise concerns about the reliability of LLMs as psychiatric decision-support tools and highlight the need for more evaluation that reflects the diverse challenges in real-world psychiatric diagnosis.

2602.12755 2026-05-19 cs.CV

Towards reconstructing experimental sparse-view X-ray CT data with diffusion models

Nelas J. Thomsen, Xinyuan Wang, Felix Lucka, Ezgi Demircan-Tureyen

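AI summary This paper studies diffusion models for reconstructing sparse-view X-ray CT data and examines how training data mismatch (domain shift) and forward model mismatch affect application to experimental data. Domain shift affects performance to varying degrees, while forward model mismatch can be mitigated with annealed likelihood weight schedules.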

Comments 5 pages + references, 4 figures, 2 tables, conference paper

Abstract

Diffusion-based image generators are promising priors for ill-posed inverse problems like sparse-view X-ray Computed Tomography (CT). As most studies consider synthetic data, it is not clear whether training data mismatch ("domain shift") or forward model mismatch complicates their successful application to experimental data. We measured CT data from a physical phantom resembling the synthetic Shepp-Logan phantom and trained diffusion priors on synthetic image data sets with different degrees of domain shift towards it. Then, we employed the priors in a Decomposed Diffusion Sampling scheme on sparse-view CT data sets with increasing difficulty leading to the experimental data. Our results reveal that domain shift plays a nuanced role: while severe mismatch causes model collapse and hallucinations, diverse priors match or exceed well-matched but narrow priors. Forward model mismatch pulls the image samples away from the prior manifold, which causes artifacts but can be mitigated with annealed likelihood weight schedules that also increase computational efficiency. Overall, we demonstrate that performance gains do not immediately translate from synthetic to experimental data, and future development must validate against real-world benchmarks.

2602.12687 2026-05-19 cs.LG cs.AI

Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty

Jeonghyun Kim, SooKyung Kim, Richeng Xuan, Hyunsoo Cho

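AI summary This paper proposes Calibrated Uncertainty Distillation (CUD), a framework that revisits knowledge distillation from a distributional perspective to make dark knowledge more faithfully accessible. CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from calibrated targets rather than sharpened certainty, so students benefit from confident signals on easy cases and structured uncertainty on hard ones, improving accuracy and reliability under distribution shift and on long-tail inputs.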

Abstract

The core of knowledge distillation lies in transferring the teacher's rich 'dark knowledge': subtle probabilistic patterns that reveal how classes are related and the distribution of uncertainties. While this idea is well established, teachers trained with conventional cross-entropy often fail to preserve such signals. Their distributions collapse into sharp, overconfident peaks that appear decisive but are in fact brittle, offering little beyond the hard label or subtly hindering representation-level transfer. This overconfidence is especially problematic in high-cardinality tasks, where the nuances among many plausible classes matter most for guiding a compact student. Moreover, such brittle targets reduce robustness under distribution shift, leaving students vulnerable to miscalibration in real-world conditions. To address this limitation, we revisit distillation from a distributional perspective and propose Calibrated Uncertainty Distillation (CUD), a framework designed to make dark knowledge more faithfully accessible. Instead of uncritically adopting the teacher's overconfidence, CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from calibrated targets rather than sharpened certainty. By directly shaping the teacher's predictive distribution before transfer, our approach balances accuracy and calibration, allowing students to benefit from both confident signals on easy cases and structured uncertainty on hard ones. Across diverse benchmarks, CUD yields students that are not only more accurate, but also more calibrated under shift and more reliable on ambiguous, long-tail inputs.
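The overconfidence problem can be made concrete with a small sketch of conventional temperature-based distillation. This is the classic softened-softmax baseline, not CUD itself; the logits and the temperature value are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q):
    """KL(p || q) for distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

teacher_logits = [9.0, 4.0, 3.5, -2.0]  # hypothetical overconfident teacher
student_logits = [5.0, 1.0, 3.0, 0.0]   # hypothetical student

sharp = softmax(teacher_logits, temperature=1.0)  # almost all mass on class 0
soft = softmax(teacher_logits, temperature=4.0)   # class similarities become visible
kd_loss = kl_divergence(soft, softmax(student_logits, temperature=4.0))

print(entropy(sharp) < entropy(soft))  # True
```

At temperature 1 the teacher's distribution carries little beyond the hard label; softening raises its entropy and exposes how the remaining classes relate. A fixed temperature rescales every example the same way, which is the kind of limitation a calibration-aware approach is meant to address.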

2602.11699 2026-05-19 cs.CL

Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models

Katrina Olsen, Sebastian Padó

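AI summary This paper collects sensicality judgments from human raters and language models on sentences from five semantically deviant datasets to examine how to distinguish anomalous from nonsensical sentences, and finds that language models are skilled at generating plausible contexts.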

Comments Accepted for publication at STARSEM 2026, San Diego, CA

Abstract

Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets, both without context and with a provided context. We find that raters consider most sentences at most anomalous, and only a few properly nonsensical. We also show that LLMs are substantially skilled at generating plausible contexts for anomalous cases.

2602.11130 2026-05-19 cs.LG cs.CV

Meltdown: Circuits and Bifurcations in Point-Cloud-Conditioned 3D Diffusion Transformers

Maximilian Plattner, Fabian Paischer, Johannes Brandstetter, Arturs Berzins

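AI summary This study examines how point-cloud-conditioned 3D diffusion transformers fail under input variation, identifies the Meltdown failure mode, explains its cause through a mechanistic case study, and proposes PowerRemap to suppress it.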

Abstract

Sparse point clouds are a common input modality for 3D surface reconstruction, including in safety-critical settings such as surgical navigation and autonomous perception. Recent point-cloud-conditioned 3D diffusion transformers achieve state-of-the-art results in this regime by leveraging learned priors. We show that these models can fail catastrophically under realistic input variation, and present a mechanistic case study of why. We identify a failure mode we call Meltdown: tiny on-surface perturbations to a sparse input point cloud can fracture the reconstructed output into hundreds of disconnected pieces. Adversarial search recovers Meltdown in 89.9-100% of shapes across the two open-weight state-of-the-art architectures we study (WaLa, Make-a-Shape) on real-world datasets (GSO, SimJEB) and under both DDPM and DDIM sampling. We trace Meltdown along the forward pass: it is governed by how uniformly the points are distributed on the surface, faithfully transduced through the point-cloud encoder, and committed by a single early-denoising cross-attention write in the diffusion backbone. Diffusion-trajectory ensembles exhibit symmetry-breaking near this commit step, consistent with a bifurcation of the reverse process. Through a suite of matched-magnitude controls, we show that the variable on which the model commits is directional, concentrated in a low-rank subspace of the write's perturbation drift. Motivated by this finding, we introduce PowerRemap, a test-time control that reshapes the singular spectrum of the localized write to suppress this drift, with rescue rates of 98.3% on WaLa and 84.6% on Make-a-Shape. Together, these results link a circuit-level cross-attention mechanism to a trajectory-level account of the failure, demonstrating how mechanistic analysis can explain and guide behavior in conditional diffusion transformers.

2602.07884 2026-05-19 cs.LG cs.AI

GRAFT: Decoupling Ranking and Calibration for Survival Analysis

Mohammad Ashhad, Robert Hoehndorf, Ricardo Henao

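AI summary This paper proposes GRAFT, a model that decouples prognostic ranking from survival calibration to address the trade-off between the two in survival analysis. It combines a linear AFT model with a non-linear residual neural network and uses stochastic gates for automatic feature selection, achieving better discrimination and calibration on public benchmarks.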

Abstract

Survival analysis is complicated by censored data, high-dimensional features, and non-linear interactions. Classical models offer interpretability and superior calibration but are restricted to linear or predefined functional forms, while deep learning models are flexible and achieve strong discriminative performance, but tend to produce poorly calibrated survival estimates. To address this trade-off, we propose GRAFT (Gated Residual Accelerated Failure Time), a novel AFT model that decouples prognostic ranking from survival calibration. GRAFT's hybrid architecture combines a linear AFT model with a non-linear residual neural network, and it also integrates stochastic gates for automatic feature selection. The model is trained by optimizing a differentiable, C-index-aligned ranking loss using stochastic conditional imputation from local Kaplan-Meier estimators, while calibrated survival estimates are obtained through simple post-training calibration. In public benchmarks, GRAFT outperforms baselines in discrimination and calibration, while remaining robust and sparse in high-noise settings.
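The ranking objective is described as C-index-aligned, so it may help to see what the concordance index measures under censoring. The sketch below is a plain implementation of Harrell's C-index for predicted survival times, not GRAFT's differentiable loss; the toy data are hypothetical.

```python
def concordance_index(times, events, predicted_times):
    """Harrell's C-index for predicted survival times.

    A pair (i, j) is comparable when subject i had an observed event and
    times[i] < times[j]. It is concordant when the model also predicts a
    shorter time for i; tied predictions count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

times = [2.0, 4.0, 6.0, 8.0]
events = [1, 0, 1, 1]            # subject 1 is censored at t = 4
predicted = [1.5, 5.0, 9.5, 9.0]
print(concordance_index(times, events, predicted))  # 0.75
```

The metric depends only on the ordering of predictions, not their scale, which is why a model can rank well and still be poorly calibrated. That gap is what motivates handling calibration in a separate post-training step.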

2602.05287 2026-05-19 cs.AI

Position: Universal Time Series Foundation Models Rest on a Category Error

Xilin Dai, Wanxu Cai, Zhijian Xu, Qiang Xu

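AI summary This paper argues that the pursuit of "universal time series foundation models" rests on a fundamental category error that mistakes a structural container for a semantic modality. Because time series hold incompatible generative processes (e.g., finance vs. fluid dynamics), monolithic models degenerate into expensive "Generic Filters" that fail to generalize under distributional drift. The authors introduce the "Autoregressive Blindness Bound", which proves that history-only models cannot predict intervention-driven regime shifts. They advocate replacing universality with a Causal Control Agent paradigm, in which an agent leverages external context to orchestrate a hierarchy of specialized solvers, from frozen domain experts to lightweight Just-in-Time adaptors. Finally, they call for shifting benchmarks from "Zero-Shot Accuracy" to "Drift Adaptation Speed" to prioritize robust, control-theoretic systems.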

Abstract

This position paper argues that the pursuit of "Universal Foundation Models for Time Series" rests on a fundamental category error, mistaking a structural Container for a semantic Modality. We contend that because time series hold incompatible generative processes (e.g., finance vs. fluid dynamics), monolithic models degenerate into expensive "Generic Filters" that fail to generalize under distributional drift. To address this, we introduce the "Autoregressive Blindness Bound," a theoretical limit proving that history-only models cannot predict intervention-driven regime shifts. We advocate replacing universality with a Causal Control Agent paradigm, where an agent leverages external context to orchestrate a hierarchy of specialized solvers, from frozen domain experts to lightweight Just-in-Time adaptors. We conclude by calling for a shift in benchmarks from "Zero-Shot Accuracy" to "Drift Adaptation Speed" to prioritize robust, control-theoretic systems.

2602.03535 2026-05-19 cs.LG cs.NA math.NA math.OC

Sparse Training of Neural Networks based on Multilevel Mirror Descent

Yannick Lunk, Sebastian J. Scott, Leon Bungert

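AI summary This paper proposes a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits naturally incurred sparsity by alternating between static and dynamic sparsity pattern updates. It combines sparsity-inducing Bregman iterations with adaptive freezing of the network structure to explore the sparse parameter space efficiently while maintaining sparsity. Embedding the method in a multilevel optimization framework provides convergence guarantees, and experiments on standard benchmarks show highly sparse and accurate models with substantial reductions in theoretical FLOPs and training time.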

Abstract

We introduce a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits the naturally incurred sparsity by alternating between periods of static and dynamic sparsity pattern updates. The key idea is to combine sparsity-inducing Bregman iterations with adaptive freezing of the network structure to enable efficient exploration of the sparse parameter space while maintaining sparsity. We provide convergence guarantees by embedding our method in a multilevel optimization framework. Furthermore, we empirically show that our algorithm can produce highly sparse and accurate models on standard benchmarks. We also show that the theoretical number of FLOPs compared to SGD training can be reduced from 38% for standard Bregman iterations to 6% for our method while maintaining test accuracy. We additionally show a training time reduction of about 50% when using a sparsity-aware CPU implementation of our method.
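For readers unfamiliar with linearized Bregman iterations, the following is a minimal sketch of the plain (single-level) iteration with an elastic-net-type regularizer: a dual variable accumulates gradient steps, and the parameters are obtained by soft-thresholding it. It omits the paper's multilevel scheme and structure freezing; the quadratic toy loss, step size, and threshold are hypothetical.

```python
def soft_threshold(v, lam):
    """Shrinkage operator: zero inside [-lam, lam], shifted towards zero outside."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def linearized_bregman(grad_fn, dim, steps, tau=0.1, lam=1.0, delta=1.0):
    """Plain linearized Bregman iteration starting from an all-zero (fully sparse) state."""
    v = [0.0] * dim      # dual (subgradient) variable
    theta = [0.0] * dim  # primal parameters
    for _ in range(steps):
        grad = grad_fn(theta)
        v = [vi - tau * gi for vi, gi in zip(v, grad)]           # gradient step on the dual
        theta = [delta * soft_threshold(vi, lam) for vi in v]    # sparse primal update
    return theta

# Toy loss 0.5 * ||theta - target||^2 with hypothetical targets.
target = [3.0, 0.0, -2.0, 0.05]

def quadratic_grad(theta):
    return [t - c for t, c in zip(theta, target)]

theta = linearized_bregman(quadratic_grad, dim=4, steps=100)
print(theta)
```

Coordinates with a strong gradient signal become active after a few steps, while weakly relevant ones stay exactly zero for a long time. That behaviour is the naturally incurred sparsity the abstract refers to.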

2602.03352 2026-05-19 cs.CL

PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

Yunzhi Shen, Hao Zhou, Xin Huang, Xue Han, Junlan Feng, Shujian Huang

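AI summary This paper proposes PEGRL, a framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization, improving machine translation performance.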

Abstract

Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce PEGRL, a two-stage RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English→Finnish, English→Turkish, and English↔Chinese show consistent gains over RL baselines, and for English→Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2). Our code and a set of representative pretrained models are publicly available at https://github.com/NJUNLP/peg-rl and https://huggingface.co/collections/DGME/pegrl
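The noisy Monte Carlo signal mentioned above can be seen in the group-relative advantage that GRPO-style methods compute from a handful of sampled outputs. The sketch below shows that baseline computation only, not PEGRL's two-stage estimator; the reward values are hypothetical, and the use of the population standard deviation is an implementation choice.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each sampled output's reward by the mean and spread of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical quality scores for four translations sampled from one source sentence.
rewards = [0.62, 0.70, 0.55, 0.81]
advantages = group_relative_advantages(rewards)
print(advantages)
```

With only a few samples per source sentence, the group mean and spread are themselves noisy estimates, so the advantage assigned to each translation can vary considerably from one draw to the next.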

2602.02830 2026-05-19 cs.LG stat.ME

SC3D: Dynamic and Differentiable Causal Discovery for Temporal and Instantaneous Graphs

Sourajit Das, Dibyajyoti Chakraborty, Romit Maulik

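AI summary This paper proposes SC3D, a dynamic and differentiable causal discovery method for temporal and instantaneous graphs. Its two-stage differentiable framework jointly learns lag-specific adjacency matrices and an instantaneous directed acyclic graph, improving the stability and accuracy of the recovered causal structure.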

Comments 12 pages

Abstract

Discovering causal structures from multivariate time series is a key problem because interactions span across multiple lags and possibly involve instantaneous dependencies. Additionally, the search space of the dynamic graphs is combinatorial in nature. In this study, we propose Stable Causal Dynamic Differentiable Discovery (SC3D), a two-stage differentiable framework that jointly learns lag-specific adjacency matrices and, if present, an instantaneous directed acyclic graph (DAG). In Stage 1, SC3D performs edge preselection through node-wise prediction to obtain masks for lagged and instantaneous edges, whereas Stage 2 refines these masks by optimizing a likelihood with sparsity along with enforcing acyclicity on the instantaneous block. Numerical results across synthetic SVAR systems, nonlinear and chaotic benchmarks, nonstationary dynamics and real-world datasets demonstrate that SC3D achieves improved stability and more accurate recovery of both lagged and instantaneous causal structures compared to existing baselines.
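Differentiable causal discovery methods typically enforce acyclicity through a smooth penalty that is zero exactly when the weighted graph is a DAG. The abstract does not say which penalty SC3D uses; the sketch below assumes a truncated trace-exponential form in the style of NOTEARS, and the example matrices are hypothetical.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def acyclicity_penalty(A):
    """Sum over k = 1..d of trace((A * A elementwise)^k) / k!.

    Each trace term adds up the weights of closed walks of length k, so the
    penalty is zero if and only if the weighted graph has no directed cycle.
    """
    d = len(A)
    squared = [[a * a for a in row] for row in A]
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    penalty = 0.0
    factorial = 1.0
    for k in range(1, d + 1):
        power = matmul(power, squared)
        factorial *= k
        penalty += sum(power[i][i] for i in range(d)) / factorial
    return penalty

dag = [[0.0, 0.8, 0.0],
       [0.0, 0.0, -1.2],
       [0.0, 0.0, 0.0]]     # edges 0 -> 1 -> 2
cyclic = [[0.0, 0.8, 0.0],
          [0.0, 0.0, -1.2],
          [0.5, 0.0, 0.0]]  # adds 2 -> 0, closing a cycle

print(acyclicity_penalty(dag))         # 0.0
print(acyclicity_penalty(cyclic) > 0)  # True
```

In a two-stage setup like the one described, a penalty of this kind would apply only to the instantaneous block, since lagged edges always point forward in time and cannot form cycles.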

2602.00924 2026-05-19 cs.AI

Supervised sparse auto-encoders for interpretable and compositional representations

Ouns El Harzli, Hugo Wallner, Yoonsoo Nam, Haixuan Xavier Tao

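AI summary This paper proposes supervised sparse auto-encoders that combine unconstrained feature models with supervised learning to address the non-smoothness of sparse auto-encoders and the poor alignment between learned features and human semantics, achieving compositional generalization and semantic image editing.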

英文摘要

Sparse auto-encoders (SAEs) have re-emerged as a prominent method for mechanistic interpretability, yet they face two significant challenges: the non-smoothness of the $L_1$ penalty, which hinders reconstruction and scalability, and a lack of alignment between learned features and human semantics. In this paper, we address these limitations by adapting unconstrained feature models, a mathematical framework from neural collapse theory, and by supervising the task. We supervise (decoder-only) SAEs to reconstruct feature vectors by jointly learning sparse concept embeddings and decoder weights. Validated on Stable Diffusion 3.5, our approach demonstrates compositional generalization, successfully reconstructing images with concept combinations unseen during training, and enabling feature-level intervention for semantic image editing without prompt modification.

2601.23087 2026-05-19 cs.RO

CoLA-Flow Policy: Temporally Coherent Imitation Learning via Continuous Latent Action Flow Matching for Robotic Manipulation

CoLA-Flow Policy: 通过连续潜在动作流匹配实现机器人操作的时序一致模仿学习

Wu Songwei, Jiang Zhiduo, Sun Wandong, Xie Guanghu, Zhao Rui, Liu Hong, Liu Yang

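AI summary: This paper proposes CoLA-Flow Policy, a trajectory-level imitation learning framework built on a continuous latent action space. By learning an explicit latent-space flow, it decouples global motion structure from low-level control noise for smooth, reliable long-horizon execution, and it adds geometry-aware point cloud conditioning and execution-time multimodal modulation to improve real-world robustness.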

Comments 9 pages, 9 figures

详情
AI中文摘要

学习长时程的机器人操作需要同时实现表达能力强的行为建模、实时推断和稳定执行,这对现有的生成策略仍具有挑战性。基于扩散的方法具有强大的建模能力,但会导致较高的推断延迟,而流匹配方法能够在快速、近单步生成的同时,当直接在原始动作空间中操作时往往会出现执行不稳定的问题。我们提出了连续潜在动作流策略(CoLA-Flow Policy),一种轨迹级模仿学习框架,该框架在连续潜在动作空间中执行流匹配。通过将动作序列编码为时间一致的潜在轨迹,并学习显式的潜在空间流,CoLA-Flow Policy 解耦了全局运动结构与低层控制噪声,从而实现平滑且可靠的长时程执行。该框架进一步集成了几何感知点云条件和执行时多模态调节,利用视觉线索作为代表性模态以增强现实环境的鲁棒性。在仿真和真实机器人上的实验表明,CoLA-Flow Policy 实现了近单步推断,比原始动作空间流基线提高了93.7%的轨迹平滑度和25个百分点的任务成功率,同时比基于扩散的方法快得多。

英文摘要

Learning long-horizon robotic manipulation requires jointly achieving expressive behavior modeling, real-time inference, and stable execution, which remains challenging for existing generative policies. Diffusion-based approaches offer strong modeling capacity but incur high inference latency, while flow matching enables fast, near-single-step generation yet often suffers from unstable execution when operating directly in the raw action space. We propose Continuous Latent Action Flow Policy (CoLA-Flow Policy), a trajectory-level imitation learning framework that performs flow matching in a continuous latent action space. By encoding action sequences into temporally coherent latent trajectories and learning an explicit latent-space flow, CoLA-Flow Policy decouples global motion structure from low-level control noise, enabling smooth and reliable long-horizon execution. The framework further integrates geometry-aware point cloud conditioning and execution-time multimodal modulation, using visual cues as a representative modality to enhance real-world robustness. Experiments in simulation and on real robots show that CoLA-Flow Policy achieves near-single-step inference, improves trajectory smoothness by up to 93.7% and task success by up to 25 percentage points over raw action-space flow baselines, while remaining significantly faster than diffusion-based policies.

2601.22297 2026-05-19 cs.CL

Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

从自我辩论中学习:为多智能体辩论准备推理模型

Chenxi Liu, Yanshuo Chen, Ruibo Chen, Tianyi Xiong, Tong Zheng, Heng Huang

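AI summary: This paper proposes SDRL, a framework that trains models through self-debate so that they can both solve problems independently and handle diverse reasoning trajectories in multi-agent debate. Experiments show gains in multi-agent debate performance and single-model reasoning across multiple debate protocols and agent configurations.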

详情
AI中文摘要

大型语言模型(LLM)的推理能力已通过可验证奖励的强化学习(RLVR)显著提升。在测试阶段,通过多智能体辩论(MAD)进行协作推理已成为提升LLM性能的有希望方法。然而,当前RLVR方法通常训练LLM独立解决问题,而没有明确准备它们在辩论中综合和受益于不同推理路径。在本文中,我们提出了自我辩论强化学习(SDRL),一种训练框架,其中模型从自我辩论中学习,使单个LLM具备强大的独立问题解决能力和处理MAD中多样化推理轨迹的能力。给定提示后,SDRL首先采样多个候选解决方案,然后构建具有多样化推理路径的辩论环境,并生成基于此环境的第二轮响应。最后,SDRL联合优化初始和辩论条件响应,产生一个既能作为独立求解器又能作为辩论参与者有效的模型。在多个基础模型和推理基准上的实验表明,SDRL在多种辩论协议和智能体配置下均能提升MAD性能,同时增强单模型推理能力。

英文摘要

The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework where models learn from self-debate, equipping a single LLM with both strong standalone problem-solving ability and the capability to process diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL consistently improves MAD performance across diverse debate protocols and agent configurations, while simultaneously strengthening single-model reasoning.

2601.21841 2026-05-19 cs.CL

Embodied Task Planning via Graph-Informed Action Generation with Large Language Models

通过大型语言模型的图引导动作生成进行具身任务规划

Xiang Li, Ning Yan, Masood Mortazavi

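AI summary: This paper proposes GiG, a framework that encodes environment states with a graph neural network, builds action-connected execution trace graphs, and adds a bounded lookahead module to strengthen embodied agents' planning. It reports significant gains on three embodied planning benchmarks.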

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管大型语言模型(LLMs)在零样本推理能力方面表现出色,但将其作为具身代理部署仍面临长周期规划的根本挑战。与开放性文本生成不同,具身代理必须将高层意图分解为可操作的子目标,同时遵守动态环境的约束。标准LLM规划器由于上下文窗口限制或幻觉状态转换而难以维持策略一致性。我们提出GiG,一种通过图-图架构结构化具身代理记忆的规划框架。我们的方法利用图神经网络(GNN)将环境状态编码为嵌入,将这些嵌入组织成动作连接的执行轨迹图,存储在经验记忆库中。GiG能够检索结构相似的先例,使代理能基于相关过去结构模式做出决策。此外,我们引入了一个有限前瞻性模块,利用符号转换逻辑通过基于现实的动作投影增强代理的规划能力。我们在三个具身规划基准测试中评估了我们的框架——Robotouille Synchronous、Robotouille Asynchronous和ALFWorld。我们的方法优于最先进的基线,分别在Robotouille Synchronous、Asynchronous和ALFWorld上实现了高达22%、37%和15%的Pass@1性能提升,同时保持可比或更低的计算成本。

英文摘要

While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intents into actionable sub-goals while adhering to the constraints of a dynamic environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitations, or they hallucinate state transitions that violate environment constraints. We propose GiG, a planning framework that structures embodied agents' memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. GiG enables retrieval of structurally similar priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a bounded lookahead module that leverages symbolic transition logic to enhance the agent's planning capabilities through grounded action projections. We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Robotouille Asynchronous, and 15% on ALFWorld while maintaining comparable or lower computational cost.

2601.21468 2026-05-19 cs.AI

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

MemOCR: 一种面向布局的视觉记忆用于高效的长周期推理

Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi GU, Hui Su, Xunliang Cai, Xiang Wang, An Zhang

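AI summary: MemOCR improves long-horizon reasoning under tight context budgets by using visual layout to allocate information density adaptively. It maintains a structured rich-text memory and renders it as an image, which visually prioritizes key evidence and compresses auxiliary details, and it outperforms text-based baselines across benchmarks.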

详情
AI中文摘要

长周期代理推理需要有效地将增长的交互历史压缩到有限的上下文窗口中。现有的记忆系统通常将历史序列化为文本,其中每个标记的费用是均匀的,并且随着长度线性增长,往往在低价值细节上消耗稀缺的预算。为此,我们引入了MemOCR,一种多模态记忆代理,通过通过视觉布局进行自适应信息密度分配,从而在有限的上下文预算下提高长周期推理的效率。具体而言,MemOCR维护一个结构化的丰富文本记忆(例如标题、重点),并将其渲染为图像,供代理在记忆访问时参考,通过视觉优先级分配关键证据,同时积极压缩辅助细节。为了确保在不同内存预算下的鲁棒性,我们通过强化学习训练MemOCR,使用预算意识目标,使代理能够适应不同的压缩水平。在长上下文多跳和单跳问答基准测试中,MemOCR优于强大的文本基线,并在极端预算下实现了更有效的上下文利用。

英文摘要

Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.

2601.21357 2026-05-19 cs.LG

Beyond Objective-Based Improvement: Stationarity-Aware Expected Improvement for Bayesian Optimization

超越基于目标的改进:面向站性的期望改进用于贝叶斯优化

Joshua Hang Sai Ip, Georgios Makrygiorgos, Ali Mesbah

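AI summary: This paper proposes Expected Improvement via Gradient Norms (EI-GN), a new acquisition function that extends the improvement principle with a first-order stationarity condition, promoting sampling in regions that are both high-performing and close to stationary points. Embedding progress toward stationarity in the acquisition criterion yields a richer notion of improvement.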

详情
AI中文摘要

贝叶斯优化(BO)是一种用于优化昂贵黑盒函数的原理性框架,期望改进(EI)是其最广泛使用的获取函数之一。尽管在经验上取得了成功,但EI对一阶最优性条件漠不关心,仅依赖于目标值的改进。因此,它可能会在改进标准无信息的情况下表现出消失的获取信号,限制了其在引导搜索中的有效性。我们提出期望改进通过梯度范数(EI-GN),一种新的获取函数,将改进原则扩展到包含一阶站性,促进在高表现且接近站点的区域采样。我们推导了EI-GN的可计算闭式表达式,并证明其仍保持与基于改进的获取框架的一致性。通过在获取标准中嵌入向站性进展,EI-GN提供了一个更丰富和信息更丰富的改进概念。在标准BO基准上的实验证明了与基线方法的一致性改进,我们进一步展示了其在控制策略学习中的适用性。

英文摘要

Bayesian Optimization (BO) is a principled framework for optimizing expensive black-box functions, with Expected Improvement (EI) among its most widely used acquisition functions. Despite its empirical success, EI is agnostic to first-order optimality conditions, relying solely on objective-value improvement. As a result, it can exhibit vanishing acquisition signals where the improvement criterion is uninformative, limiting its effectiveness in guiding search. We propose Expected Improvement via Gradient Norms (EI-GN), a novel acquisition function that extends the improvement principle to incorporate first-order stationarity, promoting sampling in regions that are both high-performing and close to stationary points. We derive a tractable closed-form expression for EI-GN and show that it remains consistent with the improvement-based acquisition framework. By embedding progress toward stationarity into the acquisition criterion, EI-GN provides a richer and more informative notion of improvement. Empirical results on standard BO benchmarks demonstrate consistent gains over baseline methods, and we further illustrate its applicability to control policy learning.
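For context, the standard Expected Improvement criterion that the abstract takes as its starting point has a well-known closed form under a Gaussian posterior. The sketch below implements that baseline only; it is not EI-GN, whose expression the abstract does not give. It assumes a maximization problem, and the names `expected_improvement`, `mu`, `sigma`, and `best` are illustrative.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization with posterior N(mu, sigma^2).

    mu and sigma are the posterior mean and standard deviation at a
    candidate point; best is the largest objective value observed so far.
    """
    if sigma <= 0.0:
        # With no posterior uncertainty, the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)


# A candidate at the incumbent's level with unit uncertainty, and a
# confident candidate far below the incumbent.
print(expected_improvement(0.0, 1.0, 0.0))
print(expected_improvement(-5.0, 0.1, 0.0))
```

The second call gives one illustration of a vanishing acquisition signal: when the posterior is confident that a point lies well below the incumbent, EI is numerically zero and carries no gradient information about nearby stationary points.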

2601.19300 2026-05-19 cs.LG

Queue Length Regret Bounds for Contextual Queueing Bandits

上下文队列强化学习中的队列遗憾界

Seoungbin Bae, Garyeong Kang, Dabeen Lee

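AI summary: This paper proposes contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Jobs carry heterogeneous contextual features, and the agent selects a job and matches it to a server to maximize the departure rate. The service/departure rate follows a logistic model with an unknown server-specific parameter. Policies are evaluated by queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge is that, at a given time step, the list of remaining job features in the queue may differ between the policy and the optimal policy, because the two may process jobs in different orders. To address this, the authors introduce policy-switching queues with a sophisticated coupling argument, which yields a new queue length regret decomposition that separates the short-term effect of choosing a suboptimal job-server pair from its long-term effect on queue state differences. The algorithm CQB-$\varepsilon$ achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$; for adversarially chosen contexts, a second algorithm, CQB-Opt, achieves $\mathcal{O}(\log^2 T)$. Experiments validate the theoretical findings.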

详情
AI中文摘要

我们引入了上下文队列强化学习,一种新的上下文感知框架,用于调度的同时学习未知的服务速率。个体任务携带异质的上下文特征,基于此,智能体选择一个任务并将其与一个服务器匹配以最大化离开速率。服务/离开速率由具有未知服务器特定参数的逻辑模型决定。为了评估策略的性能,我们考虑队列长度遗憾,定义为策略与最优策略之间队列长度的差异。主要挑战在于,在给定时间步长下,队列中剩余任务特征列表可能因策略与最优策略的不同而不同,因为它们可能以不同的顺序处理任务。为此,我们提出了带有复杂耦合论证的策略切换队列的概念。这导致了一种新的队列长度遗憾分解框架,使我们能够理解选择次优任务-服务器对的短期影响及其对队列状态差异的长期影响。我们证明了我们的算法CQB-ε达到了O(T^{-1/4})的遗憾上界。我们还考虑了对抗性选择的上下文设置,其中我们的第二个算法CQB-Opt达到了O(log²T)的遗憾上界。最后,我们提供了实验结果以验证我们的理论发现。

英文摘要

We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.

2601.18442 2026-05-19 cs.RO

SG-CADVLM: A Context-Aware Decoding Powered Vision Language Model for Safety-Critical Scenario Generation

SG-CADVLM: 一种基于上下文感知解码的视觉语言模型,用于安全关键场景生成

Hongyi Zhao, Shuo Wang, Qijie He, Ziyuan Pu

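AI summary: This paper proposes SG-CADVLM, a framework that combines context-aware decoding with multimodal input processing to generate high-fidelity safety-critical scenarios from crash reports. By mitigating vision-language model hallucination while generating road geometry and vehicle trajectories simultaneously, it improves the accuracy and usability of the generated scenarios.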

详情
AI中文摘要

自动驾驶(AV)需要在安全关键场景中进行严格测试以确保安全性验证,但其验证受到实地测试成本高和现有模拟在罕见安全关键事件中保真度不足的限制。碰撞报告提供了丰富的现实世界事故动态规范,使其成为大型语言模型和视觉语言模型生成高保真场景的有前景资源。然而,现有模型由于上下文抑制常偏离实际事故特征。为了解决这些限制,本文提出了SG-CADVLM,一种整合上下文感知解码与多模态输入处理的框架,用于从碰撞报告中生成安全关键场景。该框架在生成道路几何和车辆轨迹的同时减轻了VLMs的幻觉。实验结果表明,SG-CADVLM生成结合关键和高风险场景的速率比基线方法高88.1%(相比31.2%),代表了182%的提升,同时生成可用于自动驾驶测试的可执行模拟。

英文摘要

Autonomous vehicles (AVs) require rigorous testing in safety-critical scenarios for safety validation, yet this validation is hindered by the high cost of field testing and the lack of fidelity in current simulations of rare safety-critical events. Crash reports offer rich and authentic specifications of real-world accident dynamics, making them a promising resource for Large Language Models and Vision-Language Models (VLMs) to generate high-fidelity scenarios. However, existing models frequently deviate from actual accident characteristics due to context suppression. To address these limitations, this paper presents SG-CADVLM, a framework integrating Context-Aware Decoding with multimodal input processing to generate safety-critical scenarios from crash reports. The framework mitigates the hallucination of VLMs while generating road geometry and vehicle trajectories simultaneously. The experimental results demonstrate that SG-CADVLM generates combined critical and high-risk scenarios at a rate of 88.1% compared to 31.2% for the baseline methods, representing a 182% improvement, while producing executable simulations for autonomous vehicle testing.
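The 182% figure is a relative improvement over the baseline rate, not a difference in percentage points. A short check of the arithmetic, using only the two rates quoted in the abstract (the function name is illustrative):

```python
def relative_improvement_pct(new_rate, baseline_rate):
    """Relative improvement of new_rate over baseline_rate, in percent."""
    return (new_rate - baseline_rate) / baseline_rate * 100.0


sg_cadvlm_rate = 88.1  # percent, as reported in the abstract
baseline_rate = 31.2   # percent, as reported in the abstract

# Absolute gain in percentage points, and gain relative to the baseline.
absolute_gain_points = sg_cadvlm_rate - baseline_rate
relative_gain_pct = relative_improvement_pct(sg_cadvlm_rate, baseline_rate)
print(round(absolute_gain_points, 1), round(relative_gain_pct))
```

The absolute gain is about 56.9 percentage points, which corresponds to roughly a 182% increase relative to the 31.2% baseline.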

2601.17887 2026-05-19 cs.AI

When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

当个性化合理化风险:揭示个性化对话代理中的安全漏洞

Jiahe Guo, Xiangran Guo, Yulin Hu, Zimo Long, Xingyu Sui, Xuda Zhi, Yongbo Huang, Hao He, Weixiang Zhao, Yanyan Zhao, Bing Qin

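AI summary: This paper studies intent legitimation, a safety failure mode in personalized dialogue agents. Using the new PS-Bench benchmark, it shows how personal memories bias intent inference and lead models to legitimize harmful queries, and it proposes a lightweight detection-reflection method to reduce safety degradation.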

详情
AI中文摘要

长期记忆使大型语言模型(LLM)代理能够支持个性化和持续的交互。然而,大多数关于个性化代理的研究优先考虑效用和用户体验,将记忆视为中性组件,并在很大程度上忽略了其安全影响。在本文中,我们揭示了意图合理化,一种此前未被充分探讨的安全故障,在个性化代理中,良性个人记忆会偏移意图推断,导致模型合理化本质上有害的查询。为了研究这一现象,我们引入了PS-Bench,一个用于识别和量化个性化交互中意图合理化的基准测试。在多个增强记忆的代理框架和基础LLM中,个性化将攻击成功率提高了15.8%至243.7%相对于无状态基线。我们进一步从内部表示空间提供了意图合理化的机理证据,并提出了一种轻量级的检测-反思方法,有效减少了安全退化。总体而言,我们的工作提供了首次系统探索和评估意图合理化作为一种安全故障模式,这种模式自然地从良性、现实世界的个性化中产生,突显了在长期个人背景下评估安全的重要性。我们的代码可在:https://github.com/MuyuenLP/PS-Bench获得。警告:本文可能包含有害内容。

英文摘要

Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8% to 243.7% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from the internal representation space, and propose a lightweight detection-reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. Our code is available at: https://github.com/MuyuenLP/PS-Bench. WARNING: This paper may contain harmful content.

2601.16414 2026-05-19 cs.LG cs.AI

PyHealth 2.0: A Comprehensive Open-Source Toolkit for Accessible and Reproducible Clinical Deep Learning

PyHealth 2.0: 一个全面的开源工具包,用于可访问和可重复的临床深度学习

John Wu, Yongda Fan, Zhenbang Wu, Paul Landes, Eric Schrock, Sayeed Sajjad Razin, Arjun Chatterjee, Naveen Baskaran, Joshua Steier, Andrea Fitzpatrick, Bilal Arif, Rian Atri, Jathurshan Pradeepkumar, Siddhartha Laghuvarapu, Junyi Gao, Adam R. Cross, Jimeng Sun

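AI summary: This paper presents PyHealth 2.0, a comprehensive open-source toolkit that targets reproducibility and accessibility in clinical AI research. It unifies 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, and uncertainty quantification, enabling predictive modeling in as few as 7 lines of code.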

Comments Under Review

详情
AI中文摘要

难以复制基线、高计算成本和所需领域专业知识创建了持续存在的临床AI研究障碍。为了解决这些挑战,我们介绍了PyHealth 2.0,一个增强的临床深度学习工具包,使在7行代码内即可实现预测建模。PyHealth 2.0提供了三个关键贡献:(1) 一个全面的工具包,通过统一15+数据集、20+临床任务、25+模型、5+可解释性方法和不确定性量化(包括符合预测的置信预测)在一个框架中解决可重复性和兼容性挑战,支持多种临床数据模态——信号、影像和电子健康记录——并翻译5+医学编码标准;(2) 以可访问性为重点的设计,支持多模态数据和多样化的计算资源,处理速度比以往快39倍,内存使用减少20倍,使从16GB笔记本电脑到生产系统都能轻松使用;(3) 一个活跃的开源社区,拥有400多名成员,通过详尽的文档、可重复研究贡献以及与学术医疗系统和产业伙伴的合作,包括通过RHealth实现的多语言支持,降低了领域专业知识的障碍。PyHealth 2.0建立了一个开源基础和社区,推动了可访问和可重复的医疗AI发展。可在pip install pyhealth中获取。

英文摘要

Difficulty replicating baselines, high computational costs, and required domain expertise create persistent barriers to clinical AI research. To address these challenges, we introduce PyHealth 2.0, an enhanced clinical deep learning toolkit that enables predictive modeling in as few as 7 lines of code. PyHealth 2.0 offers three key contributions: (1) a comprehensive toolkit addressing reproducibility and compatibility challenges by unifying 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, and uncertainty quantification including conformal prediction within a single framework that supports diverse clinical data modalities (signals, imaging, and electronic health records) with translation of 5+ medical coding standards; (2) accessibility-focused design accommodating multimodal data and diverse computational resources with up to 39x faster processing and 20x lower memory usage, enabling work from 16GB laptops to production systems; and (3) an active open-source community of 400+ members lowering domain expertise barriers through extensive documentation, reproducible research contributions, and collaborations with academic health systems and industry partners, including multi-language support via RHealth. PyHealth 2.0 establishes an open-source foundation and community advancing accessible, reproducible healthcare AI. Available via pip install pyhealth.

2601.15630 2026-05-19 cs.AI

Agentic AI Governance and Lifecycle Management in Healthcare

医疗领域代理AI治理与生命周期管理

Chandra Prakash, Mary Lind, Avneesh Sisodia

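AI summary: This paper proposes a Unified Agent Lifecycle Management framework to address agent sprawl in healthcare. Its five control-plane layers provide audit-ready oversight while supporting local innovation and safer scaling.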

Comments 21 Pages, 9 figures

详情
AI中文摘要

医疗组织开始将代理AI嵌入到常规工作流程中,包括临床文档支持和早期预警监测。随着这些能力在各部门和供应商间扩散,医疗系统面临代理蔓延问题,导致代理重复、责任不明确、控制不一致和持续存在的工具权限。现有AI治理框架强调生命周期风险管理,但对代理舰队的日常操作提供有限指导。本文提出了一种统一的代理生命周期管理(UALM)蓝图,基于快速、实践导向的治理标准、代理安全文献和医疗合规要求的综合。UALM将反复出现的差距映射到五个控制层上:(1)身份和人物注册,(2)编排和跨域调解,(3) PHI 限定的上下文和记忆,(4)运行时策略执行与杀开关触发器,(5)生命周期管理和退役与凭证撤销和审计日志相关联。一个配套的成熟度模型支持分阶段采用。UALM为医疗CIO、CISO和临床领导者提供了一种可实施的模式,以实现可审计的监督,同时保持本地创新并安全扩展到临床和行政领域。

英文摘要

Healthcare organizations are beginning to embed agentic AI into routine workflows, including clinical documentation support and early-warning monitoring. As these capabilities diffuse across departments and vendors, health systems face agent sprawl, causing duplicated agents, unclear accountability, inconsistent controls, and tool permissions that persist beyond the original use case. Existing AI governance frameworks emphasize lifecycle risk management but provide limited guidance for the day-to-day operations of agent fleets. We propose a Unified Agent Lifecycle Management (UALM) blueprint derived from a rapid, practice-oriented synthesis of governance standards, agent security literature, and healthcare compliance requirements. UALM maps recurring gaps onto five control-plane layers: (1) an identity and persona registry, (2) orchestration and cross-domain mediation, (3) PHI-bounded context and memory, (4) runtime policy enforcement with kill-switch triggers, and (5) lifecycle management and decommissioning linked to credential revocation and audit logging. A companion maturity model supports staged adoption. UALM offers healthcare CIOs, CISOs, and clinical leaders an implementable pattern for audit-ready oversight that preserves local innovation and enables safer scaling across clinical and administrative domains.

2601.14568 2026-05-19 cs.CV cs.AI

Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement

打破精度-资源困境:一种轻量级自适应视频推理增强

Wei Ma, Shaowu Chen, Junjie Ye, Peichang Zhang, Lei Huang

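AI summary: This paper proposes a lightweight adaptive video inference enhancement framework that balances resource utilization and inference performance by dynamically switching between models of different scales.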

Comments 5 pages, 5 figures

详情
AI中文摘要

现有的视频推理(VI)增强方法通常通过扩大模型规模和采用复杂的网络架构来提高性能。尽管这些方法展示了最先进的性能,但往往忽视了资源效率和推理有效性之间的权衡,导致资源利用效率低下和次优的推理性能。为了解决这个问题,本文开发了一种基于关键系统参数和推理相关指标的模糊控制器(FC-r)。在FC-r的指导下,提出了一种VI增强框架,利用相邻视频帧中目标的时空相关性。根据目标设备的实时资源条件,该框架可以在VI过程中动态切换不同规模的模型。实验结果表明,所提出的方法有效实现了资源利用和推理性能之间的平衡。

英文摘要

Existing video inference (VI) enhancement methods typically aim to improve performance by scaling up model sizes and employing sophisticated network architectures. While these approaches have demonstrated state-of-the-art performance, they often overlook the trade-off between resource efficiency and inference effectiveness, leading to inefficient resource utilization and suboptimal inference performance. To address this problem, a fuzzy controller (FC-r) is developed based on key system parameters and inference-related metrics. Guided by the FC-r, a VI enhancement framework is proposed that leverages the spatiotemporal correlation of targets across adjacent video frames. Given the real-time resource conditions of the target device, the framework can dynamically switch between models of varying scales during VI. Experimental results demonstrate that the proposed method effectively balances resource utilization and inference performance.

2601.09722 2026-05-19 cs.CL cs.AI

ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

ADMEDTAGGER: 一个用于波兰医疗语言知识蒸馏的标注框架

Franciszek Górski, Andrzej Czyżewski

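AI summary: This paper presents an annotation framework that shows how a multilingual pretrained large language model can serve as a teacher model to distill the expert knowledge needed to tag Polish medical texts. It works around scarce annotation resources and yields efficient multi-class classifiers.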

详情
AI中文摘要

在本工作中,我们提出了一种标注框架,展示了如何利用一个多语言预训练大语言模型作为教师模型,蒸馏出用于标注波兰医疗文本所需的专业知识。本工作是ADMEDVOICE项目的一部分,在此项目中,我们收集了涵盖五个临床类别(放射学、肿瘤学、心脏病学、高血压和病理学)的大量医疗文本语料库。利用这些数据,我们开发了一个多类分类器,但根本问题在于缺乏足够的标注资源来标注足够数量的文本。因此,在我们的解决方案中,我们使用多语言Llama3.1模型来标注大量波兰医疗文本语料库。利用我们有限的标注资源,我们只验证了这些标签中的一部分,从而创建了一个测试集。通过这种方式标注的数据随后用于训练和验证三种基于BERT架构的分类器:基于DistilBERT的蒸馏模型、在医疗数据上微调的BioBERT以及在波兰语言语料库上微调的HerBERT。在我们训练的模型中,DistilBERT模型表现最佳,每个临床类别达到了F1分数大于0.80,其中三个类别达到了F1分数大于0.93。通过这种方式,我们得到了一系列高效的分类器,这些分类器在大小、GPU VRAM消耗和推理速度方面分别比大型语言模型小约500倍、低300倍,以及快数百倍。

英文摘要

In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories: Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of three different types of classifiers based on the BERT architecture: the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score > 0.80 for each clinical category and an F1 score > 0.93 for three of them. In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.

2601.09071 2026-05-19 cs.LG

Resolving Predictive Multiplicity for the Rashomon Set

解决Rashomon集的预测多样性

Parian Haghighat, Hadis Anahideh, Cynthia Rudin

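AI summary: This paper addresses prediction inconsistency within the Rashomon set with three methods (outlier correction, local patching, and pairwise reconciliation) that reduce disagreement among predictions and improve model reliability. Experiments show that the methods lower disagreement while maintaining competitive accuracy.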

详情
AI中文摘要

多个同样准确的模型对于给定的预测任务的存在导致了预测多样性,其中一组称为Rashomon集的模型在准确性上相似,但个体预测却存在分歧。这种不一致性削弱了在高风险应用中对一致预测的信任。我们提出了三种方法来减少Rashomon集中成员之间的不一致性。第一种方法是异常值修正,异常值具有无法被良好模型正确预测的标签,异常值可能导致Rashomon集在局部区域有高方差的预测,因此修正它们可以降低方差。第二种方法是局部修补,在测试点的局部区域,模型可能因为某些模型存在偏差而相互矛盾。我们可以通过验证集检测并修正这些偏差,从而减少多样性。第三种方法是成对协调,我们找到在测试点周围区域上意见不一致的模型对,并修改这些不一致的预测,使其更少偏向。这三种方法可以单独或共同使用,各自具有独特的优势。协调后的预测可以被提炼成一个单一的可解释模型用于现实部署。在多个数据集上的实验表明,我们的方法在减少不一致度的同时保持了竞争性的准确性。

英文摘要

The existence of multiple, equally accurate models for a given predictive task leads to predictive multiplicity, where a "Rashomon set" of models achieves similar accuracy but diverges in individual predictions. This inconsistency undermines trust in high-stakes applications where we want consistent predictions. We propose three approaches to reduce inconsistency among predictions for the members of the Rashomon set. The first approach is outlier correction. An outlier has a label that none of the good models are capable of predicting correctly. Outliers can cause the Rashomon set to have high variance predictions in a local area, so fixing them can lower variance. Our second approach is local patching. In a local region around a test point, models may disagree with each other because some of them are biased. We can detect and fix such biases using a validation set, which also reduces multiplicity. Our third approach is pairwise reconciliation, where we find pairs of models that disagree on a region around the test point. We modify predictions that disagree, making them less biased. These three approaches can be used together or separately, and they each have distinct advantages. The reconciled predictions can then be distilled into a single interpretable model for real-world deployment. In experiments across multiple datasets, our methods reduce disagreement metrics while maintaining competitive accuracy.
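To make the notion of predictive multiplicity concrete, the sketch below builds a Rashomon set from a handful of classifiers' predictions and measures how often its members disagree. The abstract does not say which disagreement metrics the paper reports, so the tolerance `epsilon` and the simple "any two members differ" rate used here are assumptions, and all function names are illustrative.

```python
def error_rate(preds, labels):
    """Fraction of points a model labels incorrectly."""
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)


def rashomon_set(all_preds, labels, epsilon):
    """Keep the models whose error is within epsilon of the best model's."""
    errors = [error_rate(preds, labels) for preds in all_preds]
    best = min(errors)
    return [preds for preds, err in zip(all_preds, errors)
            if err <= best + epsilon]


def disagreement_rate(member_preds):
    """Fraction of points on which at least two members predict differently."""
    n_points = len(member_preds[0])
    conflicted = sum(len({preds[i] for preds in member_preds}) > 1
                     for i in range(n_points))
    return conflicted / n_points


labels = [0, 1, 1, 0, 1, 0]
all_preds = [
    [0, 1, 1, 0, 1, 1],  # model A: one error
    [0, 1, 1, 0, 0, 0],  # model B: one error, on a different point
    [1, 0, 0, 1, 1, 0],  # model C: four errors, excluded below
]
members = rashomon_set(all_preds, labels, epsilon=0.05)
print(len(members), disagreement_rate(members))
```

Models A and B are equally accurate yet conflict on two of the six points, which is the kind of inconsistency that outlier correction, local patching, and pairwise reconciliation are meant to reduce.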

2601.08118 2026-05-19 cs.AI cs.LG

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

MirrorBench: 一个评估对话用户代理人类化能力的基准测试

Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, Anil Babu Ankisettipalli

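AI summary: This paper presents MirrorBench, a benchmark for evaluating how human-like conversational user-proxy agents are. By combining several lexical-diversity metrics with LLM-judge metrics, it reveals systematic gaps between user proxies and real human users.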

Comments KDD 2026 (Dataset & Benchmark Track)

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作人类模拟器,既用于评估对话系统,也用于生成微调数据。然而,简单的

英文摘要

Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of user proxy agents. We present MirrorBench, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. MirrorBench combines three lexical-diversity metrics (MATTR, Yule's $K$, and HD-D) with three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, MirrorBench yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is open sourced at https://github.com/SAP/mirrorbench and includes a command-line interface for running and managing user-proxy benchmarking experiments.
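Two of the lexical-diversity metrics named above have compact textbook definitions. The sketch below follows those definitions: MATTR as the mean type-token ratio over sliding windows, and Yule's $K$ as $10^4 (\sum_i i^2 V_i - N) / N^2$, where $V_i$ is the number of word types that occur $i$ times and $N$ is the token count. It assumes whitespace tokenization and a non-empty token list; MirrorBench's own tokenization, window size, and implementation may differ, and the function names are illustrative.

```python
from collections import Counter


def mattr(tokens, window=50):
    """Moving-average type-token ratio over all windows of the given size."""
    if len(tokens) <= window:
        # Too short for a sliding window: fall back to the plain ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)


def yules_k(tokens):
    """Yule's K: higher values mean more repetition, so less diversity."""
    n = len(tokens)
    freqs = Counter(tokens).values()
    # Summing f^2 over word types equals summing i^2 * V_i over counts i.
    return 1e4 * (sum(f * f for f in freqs) - n) / (n * n)


tokens = "the cat sat on the mat the end".split()
print(mattr(tokens, window=4), yules_k(tokens))
```

Because MATTR averages over fixed-size windows, it is less sensitive to utterance length than a raw type-token ratio, which matters when comparing verbose proxy output against shorter human turns.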