arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28389 2026-05-28 cs.CL

FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning

FABSVer: 更快的训练与更好的自验证用于大语言模型数学推理

Haihui Pan, Junwei Bao, Hongfei Jiang, Yang Song

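AI summary: Proposes FABSVer, which fuses solution generation and self-verification into a single generation pass and introduces Dynamic Reference Model Update (DRMU) to break through the reward bottleneck, achieving better self-verification and reasoning performance across three model scales with only 51%-71% of the training time of existing methods.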

AI Chinese abstract

尽管大语言模型在数学推理方面取得了显著进展,但它们在判断自身解决方案的正确性方面仍然不可靠。现有的为模型配备自验证能力的方法通常将解生成和验证视为两个独立的任务,导致训练时间大幅增加。在本文中,我们提出FABSVer,将这两个任务融合为单次生成过程,在联合优化两种能力的同时显著降低训练开销。我们进一步从理论和实验上识别出一个收敛瓶颈:随着训练进行,由于策略受固定参考模型约束,奖励达到平台期。为克服这一问题,我们引入动态参考模型更新(DRMU),提高了奖励上限并实现持续的奖励增长。在数学基准上的大量实验表明,FABSVer在三个模型规模上实现了优越的自验证和推理性能,同时仅需现有方法训练时间的51%–71%。分析进一步揭示了模型获取自验证能力的不同学习阶段,并且随着模型规模增大,验证奖励与答案奖励之间的差距显著缩小。

English abstract

While large language models have made significant progress in mathematical reasoning, they remain unreliable at judging the correctness of their own solutions. Existing approaches that equip models with self-verification typically treat solution generation and verification as two separate tasks, leading to substantially increased training time. In this paper, we propose FABSVer, which fuses these two tasks into a single generation pass, dramatically reducing training overhead while jointly optimizing both capabilities. We further identify a convergence bottleneck both theoretically and empirically: as training progresses, the reward reaches a plateau because the policy is constrained by a fixed reference model. To overcome this, we introduce Dynamic Reference Model Update (DRMU), which raises the reward ceiling and enables sustained reward growth. Extensive experiments on math benchmarks demonstrate that FABSVer achieves superior self-verification and reasoning performance across three model scales, while requiring only 51%--71% of the training time of existing methods. Analysis further reveals distinct learning phases in how models acquire self-verification, and that the gap between verify and answer rewards shrinks noticeably as model size increases.
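
The reward plateau under a fixed reference model can be illustrated on a toy bandit. This is a minimal sketch and not the paper's algorithm: it assumes a KL-regularized objective, whose closed-form optimum over a discrete action set is pi*(a) ∝ pi_ref(a)·exp(r(a)/beta), and it models a reference update as simply copying the current policy into the reference. The rewards, `beta`, and all function names are hypothetical.

```python
import math

def kl_optimal_policy(ref, rewards, beta):
    """Closed-form maximizer of E[r] - beta * KL(pi || ref) over a discrete action set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_reward(policy, rewards):
    """Expected reward of a discrete policy."""
    return sum(p * r for p, r in zip(policy, rewards))

rewards = [0.0, 0.5, 1.0]      # hypothetical per-action rewards
beta = 1.0                     # hypothetical KL coefficient
ref = [1 / 3, 1 / 3, 1 / 3]    # initial reference policy

# Fixed reference: the optimum of the regularized objective caps the reward.
fixed_reward = expected_reward(kl_optimal_policy(ref, rewards, beta), rewards)

# Updated reference: copy the policy into the reference each round and re-solve.
history = []
for _ in range(5):
    ref = kl_optimal_policy(ref, rewards, beta)
    history.append(expected_reward(ref, rewards))
```

In this toy setting the first round reproduces the fixed-reference optimum, and each later round moves more probability mass onto the best action, so the expected reward keeps rising toward its maximum of 1.0.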

2605.28388 2026-05-28 cs.AI

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

机制性解释样本难度在RLVR中对大语言模型的作用

Yue Cheng, Jiajun Zhang, Xiaohui Gao, Weiwei Xing, Zheng Wang, Zhanxing Zhu

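AI summary: Through difficulty-wise and one-sample analysis, the paper finds that sample difficulty has a non-monotonic effect on RLVR, with medium-difficulty problems providing the most stable reasoning improvements, and proposes difficulty-adaptive strategies based on these findings.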

Comments
30 pages, 11 figures
AI Chinese abstract

经验表明,带可验证奖励的强化学习(RLVR)能显著提升大语言模型(LLMs)的推理性能,尤其是在数学和编程领域。然而,样本难度在RLVR中的机制性作用仍不明确。本文通过难度维度和单样本分析研究RLVR。我们发现样本难度对RLVR有非单调影响:简单和中等难度问题带来最强且最稳定的推理改进,而过难问题往往提供弱学习信号,诱发退化行为(如重复答案或跳过必要计算),并最终损害模型已有的能力。除了响应层面,我们还利用时间稀疏自编码器(T-SAE)分析模型内部特征动态。简单问题主要强化直接答案和基本计算特征,同时抑制深思熟虑推理特征;困难问题激活推理相关特征,但仅在成功轨迹被采样时才有用;中等难度问题提供更平衡的信号,同时强化计算和多步推理特征。基于这些发现,我们提出了针对困难样本的难度自适应策略,利用反向推理重构和T-SAE引导的训练信号来改善RLVR中的奖励密度和信用分配。总体而言,我们的结果将样本难度识别为控制RLVR优化动态和表示演化的关键因素。

English abstract

Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. However, the mechanistic role of sample difficulty in RLVR remains poorly understood. In this paper, we investigate RLVR through the lens of difficulty-wise and one-sample analysis. We find that sample difficulty has a non-monotonic effect on RLVR: easy and medium-difficulty problems yield the strongest and most stable reasoning improvements, whereas overly hard problems often provide weak learning signals, induce degenerate behaviors such as answer repetition or skipping necessary computation, and can ultimately degrade the model's pre-existing capabilities. Beyond the response level, we further analyze the model's internal feature dynamics using Temporal Sparse Autoencoders (T-SAE). Easy problems mainly reinforce direct-answer and basic-computation features while suppressing deliberative-reasoning features; hard problems activate reasoning-related features but become useful only when successful trajectories are sampled; medium-difficulty problems provide a more balanced signal, strengthening both computation and multi-step reasoning features. Motivated by these findings, we propose difficulty-adaptive strategies for hard-sample utilization, using backward-reasoning reformulation and T-SAE-guided training signals to improve reward density and credit assignment during RLVR. Overall, our results identify sample difficulty as a key factor governing both the optimization dynamics and representation evolution of RLVR.

2605.28387 2026-05-28 cs.LG cs.AI cs.NE

CLANE: Continual Learning of Actions on Neuromorphic Hardware from Event Cameras

CLANE: 基于事件相机在神经形态硬件上的动作持续学习

Elvin Hajizada, Michael Neumeier, Edward Paxon Frady, Yulia Sandamirskaya, Axel von Arnim, Bing Li, Eyke Hüllermeier

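AI summary: Proposes CLANE, a system that performs end-to-end continual learning on the Intel Loihi 2 neuromorphic chip for event-camera action recognition, achieving high energy efficiency and low latency with a spiking CNN and novel Loihi 2 modules.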

AI Chinese abstract

识别并持续学习新的人类动作而不遗忘先前类别,是新兴AR/VR和机器人应用的需求。对于这些应用,设备上的处理和学习对于隐私和低延迟适应至关重要。事件相机通过稀疏、异步的输出解决了视觉传感的效率问题,该输出天然兼容神经形态处理。然而,此前没有系统部署过使用神经形态硬件进行基于事件的持续设备上学习流水线。我们提出了CLANE(基于事件相机在神经形态硬件上的动作持续学习),端到端部署在Intel Loihi 2上。CLANE将用于时空特征提取的脉冲2D CNN与作为片上学习头的CLP-SNN相结合,并通过时间聚合层和定点归一化层(两者均为新型Loihi 2模块)扩展到动作片段。在真实条件下捕获的50类数据集THU E-ACT-50上,CLANE在持续学习任务中达到70.4%的准确率,同时相比顺序CNN+GRU+CLP边缘GPU基线实现了超过100倍的能耗降低和16倍的延迟降低,通过三个评估级别的跨平台等算法基准测试得到验证。

English abstract

Recognizing and continuously learning novel human actions without forgetting prior classes is a requirement for emerging AR/VR and robotics applications. For these applications, both on-device processing and learning are essential for privacy and low-latency adaptation. Event cameras address the efficiency of visual sensing with sparse, asynchronous output that is naturally compatible with neuromorphic processing. Yet no prior system has deployed a continual on-device learning pipeline for event-based action recognition using neuromorphic hardware. We present CLANE, Continual Learning of Actions on Neuromorphic Hardware from Event Cameras, deployed end-to-end on Intel Loihi 2. CLANE combines a spiking 2D CNN for spatiotemporal feature extraction with CLP-SNN as its on-chip learning head, extended to action clips via a Temporal Aggregation Layer and a fixed-point Normalization Layer, both novel Loihi 2 modules. On THU E-ACT-50, a 50-class dataset captured under real-world conditions, CLANE achieves 70.4% accuracy in a continual learning task while delivering more than 100x energy reduction and 16x lower latency over a sequential CNN+GRU+CLP edge GPU baseline, validated through iso-algorithm cross-platform benchmarking across three evaluation levels.

2605.28384 2026-05-28 cs.LG

Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference

Meta-Attention: 用于高效Transformer推理的贝叶斯逐Token路由

Alan Ferrari

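AI summary: Proposes Meta-Attention, a framework in which a Bayesian Meta-Controller dynamically selects an attention strategy for each token (full softmax, linear, or sliding-window local attention), achieving a better compute-performance trade-off at negligible overhead.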

AI Chinese abstract

标准Transformer架构对所有token和序列位置统一应用单一注意力机制,而不考虑局部上下文或计算预算。我们提出Meta-Attention,一个通过贝叶斯元控制器动态将每个token路由到最合适的注意力策略(全softmax注意力、线性(核)注意力或滑动窗口局部注意力)的框架。与使用确定性或无先验学习路由的先前路由方法不同,元控制器将逐token机制选择视为在计算感知的Dirichlet先验下的后验推理:路由权重是通过证据下界(ELBO)目标训练的摊销变分后验q(alpha | x_t; phi)的输出,该目标联合编码任务性能和注意力机制成本。这种设计产生原则性的路由不确定性估计,控制软到硬的路由转换,无需临时负载平衡损失即可缓解路由崩溃,并在几乎无开销的情况下比确定性或无先验学习路由产生更好的计算-性能权衡。在Tiny LM基准上的第一阶段实证结果证实了核心预测:贝叶斯控制器的学习路由分布在硬路由下意味着归一化FLOP成本为25.1%,而无先验基线为59.3%(-34.2个百分点),并将路由熵从55.8%降低到43.3%(-12.5个百分点),表明Dirichlet先验防止了路由崩溃,而非贝叶斯模型默认使用全注意力。我们展示了贝叶斯架构、ELBO训练目标以及验证前向传播正确性、后验多样性和与无先验基线进行受控消融的第一阶段PyTorch原型。代码见:https://github.com/KFEAL/meta-attention

English abstract

Standard transformer architectures apply a single attention mechanism uniformly across all tokens and sequence positions, irrespective of local context or computational budget. We propose Meta-Attention, a framework that dynamically routes each token to the most appropriate attention strategy -- full softmax attention, linear (kernel) attention, or sliding-window local attention -- via a Bayesian Meta-Controller. Unlike prior routing approaches that use deterministic or prior-free learned routing, the Meta-Controller treats per-token mechanism selection as posterior inference under a compute-aware Dirichlet prior: routing weights are the output of an amortised variational posterior q(alpha | x_t; phi) trained with an Evidence Lower Bound (ELBO) objective that jointly encodes task performance and attention-mechanism cost. This design produces principled routing uncertainty estimates that govern the soft-to-hard routing transition, mitigates routing collapse without ad hoc load-balancing losses, and yields better compute-performance trade-offs than deterministic or prior-free learned routing at negligible overhead. Phase 1 empirical results on a Tiny LM benchmark confirm core predictions: the Bayesian controller's learned routing distribution implies a projected normalised FLOP cost of 25.1% under hard routing, vs. 59.3% for the prior-free baseline (-34.2 pp), and reduces routing entropy from 55.8% to 43.3% (-12.5 pp), demonstrating that the Dirichlet prior prevents routing collapse while the non-Bayesian model defaults to full attention. We present the Bayesian architecture, ELBO training objective, and a Phase 1 PyTorch prototype validating forward-pass correctness, posterior diversity, and a controlled ablation against a prior-free baseline. Code available at: https://github.com/KFEAL/meta-attention
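
The two reported quantities, projected FLOP cost under hard routing and routing entropy, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the relative costs in `COSTS` are hypothetical, hard routing is taken to mean "each token uses only its most probable mechanism", and entropy is normalised by log(K) so that it reads as a percentage.

```python
import math

# Hypothetical relative FLOP costs, normalised so that full softmax attention = 1.0.
COSTS = {"full": 1.0, "linear": 0.1, "local": 0.2}
MECHANISMS = list(COSTS)

def hard_routing_cost(routing_probs):
    """Mean normalised cost when each token uses only its most probable mechanism."""
    total = 0.0
    for probs in routing_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        total += COSTS[MECHANISMS[best]]
    return total / len(routing_probs)

def normalised_entropy(probs):
    """Shannon entropy of one routing distribution divided by its maximum, log(K)."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

# Per-token routing distributions over (full, linear, local); values are made up.
tokens = [
    [0.7, 0.2, 0.1],   # routed to full attention
    [0.1, 0.8, 0.1],   # routed to linear attention
    [0.2, 0.2, 0.6],   # routed to local attention
    [0.1, 0.7, 0.2],   # routed to linear attention
]
cost = hard_routing_cost(tokens)   # (1.0 + 0.1 + 0.2 + 0.1) / 4
```

The percentage-point differences quoted in the abstract are plain subtractions: 59.3 - 25.1 = 34.2 and 55.8 - 43.3 = 12.5.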

2605.28375 2026-05-28 cs.CL

PrionNER: A Named Entity Recognition Dataset for Prion Disease Biomedical Literature

PrionNER: 朊病毒病生物医学文献命名实体识别数据集

An Dao, Nhan Ly, Thao Tran, Yuji Matsumoto, Akiko Aizawa

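AI summary: Builds PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information, comprising 317 abstracts with 15 coarse-grained and 31 fine-grained entity types, and evaluates supervised and zero-shot models on it.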

Comments
29 pages, 5 figures, accepted at ACL 25th Workshop on Biomedical Language Processing (BioNLP 2026)
AI Chinese abstract

朊病毒病是一种罕见、快速进展且致命的神经退行性疾病,由于非特异性临床表现,早期诊断困难。然而,据我们所知,目前尚无公开的、专注于朊病毒病的数据集,用于从生物医学文献中捕获广泛的临床相关实体。我们推出了PrionNER,一个针对PubMed摘要中朊病毒病临床信息的手动标注命名实体识别数据集。当前版本包含317篇摘要、2,943个句子和6,955个文本绑定实体标注,涵盖15种粗粒度和31种细粒度临床导向实体类型,涉及疾病、症状、诊断、发现、解剖、治疗以及时间和统计证据。标注者间一致性达到81.78%的精确匹配F1值,表明标注一致性较强。我们在PrionNER上对监督BERT基线、W2NER和零样本提取器进行了基准测试。W2NER是最强的监督模型,Gemma-4-31B是最强的零样本模型,但基准测试仍具有挑战性,尤其是对于结构复杂的提及和细粒度的临床邻近标签区分。PrionNER为朊病毒病信息提取提供了临床基础的基准,并支持低资源、细粒度及非平面提取条件下的罕见病生物医学NLP研究。数据集、标注指南和评估脚本可在https://github.com/daotuanan/PrionNER/获取。

English abstract

Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at https://github.com/daotuanan/PrionNER/.
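
Exact-match F1 between two annotators can be sketched as below. This is a minimal illustration, assuming each annotation is a `(start, end, label)` triple and that one annotator is treated as the reference; the paper's own agreement script may differ, and the spans and labels shown are hypothetical.

```python
def exact_match_f1(annotations_a, annotations_b):
    """F1 over (start, end, label) triples; a match requires all three to be identical."""
    set_a, set_b = set(annotations_a), set(annotations_b)
    matched = len(set_a & set_b)
    if matched == 0:
        return 0.0
    precision = matched / len(set_b)   # annotator B scored against A as reference
    recall = matched / len(set_a)
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations of one sentence; the second span differs in its end offset.
annotator_a = [(0, 13, "Disease"), (27, 35, "Symptom"), (40, 43, "Diagnostic")]
annotator_b = [(0, 13, "Disease"), (27, 38, "Symptom"), (40, 43, "Diagnostic")]
score = exact_match_f1(annotator_a, annotator_b)
```

Under exact matching, a span that overlaps but has a different boundary counts as a miss for both annotators, which is why boundary disagreements lower the score even when the label agrees.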

2605.28372 2026-05-28 cs.LG cs.RO

Teacher-Student Representational Alignment for Reinforcement Learning-Driven Imitation Learning

教师-学生表征对齐用于强化学习驱动的模仿学习

Meraj Mammadov, Pedro Zuidberg Dos Martires, Johannes Andreas Stork

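AI summary: Proposes learning a shared embedding space via self-supervised contrastive learning to reduce the imitation gap between teacher and student policies, thereby improving student policy performance.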

Comments
6 pages, 5 figures. Accepted as an oral presentation at the RL4IL Workshop at ICRA 2026
AI Chinese abstract

从基于状态的强化学习策略进行模仿学习是克服机器人学中复杂高维观测空间维度灾难的常用方法。本文解决了当教师和学生策略孤立学习时出现的不可模仿差距,即教师策略可以依赖学生无法从其观测中推断的特权状态信息。我们提出了一种新算法,不是通过在模仿学习后进行强化学习微调(通常需要全新的训练设置)来改善学生性能,而是学习一个共享嵌入空间,该空间隐藏了特定于智能体的观测,从而通过构造训练出可模仿的教师策略。我们通过自监督对比学习与教师策略并行训练共享嵌入空间,并通过限制其梯度更新编码器网络来防止其提取私有信息。我们在多个示例领域进行了评估,并与最先进的基线方法比较,结果表明我们的算法能够实现更高的学生性能,并显著减小模仿差距。

English abstract

Imitation learning (IL) from a state-based reinforcement learning (RL) policy is a common approach to overcome the curse of dimensionality in complex and high-dimensional observation spaces prevalent in robotics. This paper addresses the irreducible imitation gap that emerges when teacher and student are learned in isolation, and the teacher policy has the liberty to rely on privileged state information that the student cannot infer from its observations. Instead of improving poor student performance with RL finetuning after IL, which often requires a whole new training setup, we propose a novel algorithm which learns a shared embedding space that hides agent-specific observations and thus trains imitable teacher policies by construction. We train the shared embedding space with self-supervised contrastive learning in parallel to the teacher policy and prevent it from extracting private information by limiting its gradients from updating the encoder networks. We perform evaluations on several example domains and compare to state-of-the-art baselines showing that our algorithm enables higher student performance with substantially reduced imitation gap.

2605.28371 2026-05-28 cs.AI cs.LG cs.SE

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

从论文到基准测试:基于智能体和框架的机器健康智能中欠规范方法复现

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt, Lev Telyatnikov, Olga Fink

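AI summary: Proposes an agentic, shared-framework approach that translates papers into executable, comparable benchmark implementations through a slot-binding interface, addressing the difficulty of reproducing methods in industrial Prognostics and Health Management.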

AI Chinese abstract

工业预测与健康管理(PHM)为应用机器学习中的更广泛挑战提供了一个代表性案例研究:将已发表的论文转化为可执行、可基准测试的实现。由于工业数据集的访问受限、预处理和评估协议的报告不完整以及隐含的设计选择(例如,窗口化、目标构建、数据分割)对性能有重要影响,复现PHM中的欠规范方法尤为困难。现有的论文到代码系统为单篇论文生成实现,但由于假设和评估设置的不一致性,这些产物通常无法直接比较。我们引入了基于智能体和框架的PHM论文复现方法,其中智能体通过槽绑定接口将论文转化为共享的PHM基准测试框架。该接口将方程和协议描述映射为结构化组件(任务定义、数据集适配器、窗口化、目标、模型和评估器),同时明确记录未解决的假设。最终实现通过标准化任务契约和评估钩子进行验证,从而实现一致且可比较的基准测试。我们在16篇PHM论文上评估了该方法,比较了框架增强型、基于技能和基于提示的智能体复现与最近的无框架论文复现智能体。我们评估了复现成功率、基于模型的代码评估、论文假设的框架绑定以及标准化协议下的跨论文基准可比性。结果表明,将智能体生成与共享框架相结合,将论文复现从孤立的代码合成转变为可执行、假设感知且系统可比较的基准测试实现。

English abstract

Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning: translating published papers into executable, benchmark-ready implementations. Reproducing under-specified methods in PHM is particularly difficult due to restricted access to industrial datasets, incomplete reporting of preprocessing and evaluation protocols, and implicit design choices (e.g., windowing, target construction, data splits) that critically affect performance. Existing paper-to-code systems generate implementations for individual papers, but these artifacts are often not directly comparable due to inconsistencies in assumptions and evaluation settings. We introduce agentic, framework-based PHM paper reproduction, where an agent translates a paper into a shared PHM benchmark framework via a slot-binding interface. This interface maps equations and protocol descriptions into structured components (task definitions, dataset adapters, windowing, targets, models, and evaluators), while explicitly recording unresolved assumptions. The resulting implementations are validated against standardized task contracts and evaluation hooks, enabling consistent and comparable benchmarking. We evaluate this approach on 16 PHM papers, comparing framework-enhanced, skill-based, and prompt-based agentic reproduction against a recent framework-free paper-reproduction agent. We assess reproduction success, model-based code evaluation, framework binding of paper assumptions, and cross-paper benchmark comparability under standardized protocols. Our results show that coupling agentic generation with a shared framework transforms paper reproduction from isolated code synthesis into executable, assumption-aware, and systematically comparable benchmark implementations.

2605.28369 2026-05-28 cs.AI cs.SI

CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

CyberJurors:电商纠纷裁决的多智能体模拟任务

Yanhui Sun, Wu Liu, Haifeng Ming, Xinru Wang, Hantao Yao, Yongdong Zhang

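AI summary: E-commerce dispute verdicts require extracting pivotal clues from redundant, multi-round, multimodal evidence and deciding under platform-specific conventions. The paper proposes CyberJurors, a multi-agent framework that improves verdict quality through Individual Verdict Chain-of-Thought and Jury Consensus Verdict, and outperforms existing methods on a benchmark of 6,000 real-world cases.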

Comments
ICML 2026
AI Chinese abstract

电商平台开始招募众包陪审员来裁决大量交易纠纷。与正式法律判决不同,电商纠纷裁决需要从冗余、多轮、多模态证据中提取关键线索,并在平台特定的灵活惯例下做出决策。这些特点使得现有方法不足以应对该场景。为弥补这一差距,我们引入了一项开创性任务——电商纠纷裁决(EDV),并提出了VerdictBench,一个包含6000个真实案例的多模态基准,旨在反映众包陪审团决策。在此基础上,我们提出了CyberJurors,一个多智能体框架,用于澄清纠纷逻辑并规范裁决过程。在个体层面,个体裁决链式思维将EDV任务分解为四个结构化的推理阶段,实现细粒度线索感知并澄清关键线索与纠纷焦点之间的因果逻辑。在集体层面,陪审共识裁决模拟陪审员之间的多轮讨论和投票,同时纳入裁决先例以减轻对任一争议方的认知偏差。在VerdictBench上的实验表明,CyberJurors优于最先进的LLM、MLLM和法庭模拟器,同时与真实陪审团投票模式实现了更强的一致性。代码和数据集可在https://github.com/YanhuiS/CyberJurors 和 https://huggingface.co/datasets/piggi/VerdictBench 获取。

English abstract

E-commerce platforms have begun recruiting crowdsourced jurors to adjudicate massive volumes of transaction disputes. Unlike formal legal judgment, E-commerce dispute verdicts require grounding pivotal clues from redundant, multi-round, multimodal evidence and making decisions under flexible platform-specific conventions. These characteristics render existing methods insufficient for this scenario. To bridge this gap, we introduce a pioneering task, E-commerce Dispute Verdicts (EDV), and present VerdictBench, a multimodal benchmark comprising 6,000 real-world cases designed to reflect crowdsourced jury decisions. Building upon this, we propose CyberJurors, a multi-agent framework to clarify the dispute logic and regulate the verdict process. At the individual level, Individual Verdict Chain-of-Thought decomposes the EDV task into four structured reasoning stages, enabling fine-grained clue perception and clarifying causal logic between pivotal clues and the dispute focus. At the collective level, Jury Consensus Verdict simulates multi-round discussion and voting among jurors, while incorporating verdict precedents to mitigate cognitive biases toward either disputant. Experiments on VerdictBench show that CyberJurors outperforms state-of-the-art LLMs, MLLMs, and court simulators, while achieving stronger alignment with real-world jury voting patterns. Code and dataset are available at https://github.com/YanhuiS/CyberJurors and https://huggingface.co/datasets/piggi/VerdictBench.

2605.28365 2026-05-28 cs.AI cs.CL cs.LO

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

风险控制的 Lean 作为自然语言数学推理的评判者

Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov, Haitham Bou Ammar

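AI summary: To address the sparse and often unfaithful signal that Lean provides when judging natural-language mathematical answers, the paper proposes the COVCAL selector, which uses finite-sample selective-risk control to guarantee the accuracy of accepted answers when autoformalization coverage is high enough.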

AI Chinese abstract

Lean 越来越多地被用于评判自然语言数学答案,但其信号是不完全的:许多答案从未被形式化,而一个失败的证明可能反映类型错误或缺少库事实,而非答案错误。在 MATH-500 上,我们表明该信号 (i) 严重依赖于覆盖率,即在证明覆盖率高的答案中正确率为 96%,但在覆盖率低时为 20%,以及 (ii) 稀疏且常常不忠实:一个 7B 自动形式化器仅对 28% 的问题证明了某个类别,而人工审计发现其中只有约 43% 的证明是忠实的。我们提出 COVCAL,一个基于 Lean 跟踪诊断的选择器,它在两种机制(保守的 Bonferroni 界和更紧的 dev-then-cal 规则)下,对接受的答案认证有限样本选择性风险界,否则弃权。可行性取决于自动形式化覆盖率:对于 7B 形式化器,信号过于稀疏,Bonferroni 在所有 20 个自助法分区上弃权,而一个专用于证明器的形式化器达到 79% 的覆盖率,并在 20 个分区中的 17 个上使其可行,以 0.98 的接受准确率接受约 48% 的问题。由于自一致性本身已达到 91% 的准确率,我们的贡献是精确描述了何时以及使用哪个形式化器,部分形式化信号可以在风险控制下被信任。

English abstract

Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show that this signal is (i) sharply coverage-dependent: the proof-winning answer is correct 96% of the time at high proved coverage but only 20% at low coverage, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.
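
The "certify or abstain" behaviour can be sketched with a generic selective-risk check. This is not COVCAL itself: it swaps in a plain Hoeffding upper bound on the error rate of accepted answers, with a Bonferroni correction over candidate thresholds, and every name and number below is hypothetical.

```python
import math

def risk_upper_bound(errors, accepted, delta):
    """Hoeffding upper confidence bound on the error rate among accepted answers."""
    if accepted == 0:
        return 1.0
    slack = math.sqrt(math.log(1 / delta) / (2 * accepted))
    return min(1.0, errors / accepted + slack)

def select_threshold(scores, correct, thresholds, alpha, delta):
    """Return the lowest threshold whose certified risk is <= alpha, or None to abstain.

    Bonferroni correction: each candidate threshold is tested at level
    delta / len(thresholds), so the guarantee holds for whichever one is picked.
    """
    per_test_delta = delta / len(thresholds)
    for t in sorted(thresholds):
        kept = [c for s, c in zip(scores, correct) if s >= t]
        errors = sum(1 for c in kept if not c)
        if risk_upper_bound(errors, len(kept), per_test_delta) <= alpha:
            return t
    return None

# Sparse calibration data: 20 accepted answers, all correct, is still too few
# to certify a 10% risk level, so the selector abstains.
sparse = select_threshold([0.9] * 20, [True] * 20, [0.0, 0.5], alpha=0.1, delta=0.1)
```

The sparse case shows why feasibility depends on coverage: with few calibration points the confidence slack alone exceeds the target risk, so no threshold can be certified even when every observed answer is correct.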

2605.28364 2026-05-28 stat.ML cs.LG

Variance-Adaptive Optimal Algorithm for Reinforcement Learning with Multinomial Logit Function Approximation

基于多项逻辑函数逼近的强化学习的方差自适应最优算法

Wonyoung Kim, Min-Hwan Oh, Garud Iyengar, Assaf Zeevi

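AI summary: For reinforcement learning with multinomial logit function approximation, proposes a computationally efficient variance-adaptive algorithm that achieves an instance-wise optimal regret bound, and validates experimentally that it outperforms conventional approaches.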

AI Chinese abstract

基于多项逻辑(MNL)函数逼近的强化学习因其灵活性和广泛适用性已成为一个重要框架。虽然现有研究在最坏情况分析下建立了遗憾保证,但它们未能捕捉性能如何依赖于学习者和环境之间交互的变异性。在本文中,我们为基于MNL的马尔可夫决策过程开发了一种新的理论分析,得到了显式的方差自适应遗憾界。我们的算法计算高效,并实现了实例级最优遗憾率,缩小了上下界之间的差距。我们的数值实验验证了我们的方法比传统方法更有效地学习最优策略。

English abstract

Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-case analysis, they do not capture how performance depends on the variability of the interaction between the learner and the environment. In this paper, we develop a new theoretical analysis for MNL-based Markov decision processes that yields explicit variance-adaptive regret bounds. Our algorithm is computationally efficient and achieves the instance-wise optimal rate of regret, narrowing the gap between upper and lower bounds. Our numerical experiments validate that our method learns optimal policies more efficiently than conventional approaches.

2605.28363 2026-05-28 cs.CL

PubMedCausal: A Span-Level Annotated Corpus for Causal Relation Extraction in Biomedical Text

PubMedCausal: 用于生物医学文本中因果关系抽取的跨度级标注语料库

Ifeoluwa Kunle-John, Josiah Paul, Oluwatosin Agbaakin, Peter Aina, Ikenna Odezuligbo, Sydney Anuyah

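AI summary: Existing resources conflate causal relations with broader associations, restrict annotation to the sentence level, or focus only on explicit causal cues. To address this, the paper builds PubMedCausal, a span-level causal relation extraction corpus from PubMed abstracts with 30,000 paragraph-level rows, 3,945 causal rows, and 6,491 adjudicated cause-effect pairs, and benchmarks discriminative encoders and open-source generative models.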

Comments
Submitted to EMNLP 2026, 8 Pages, 23 page appendix
AI Chinese abstract

因果关系抽取(CRE)是生物医学文本挖掘的核心,但当前资源常将因果关系与更广泛的关联混淆,将标注限制在句子级别,或主要关注显式因果线索。这限制了它们在评估模型是否能恢复生物医学文本中实际表达的因果主张方面的实用性。我们引入了PubMedCausal,一个基于PubMed摘要构建的生物医学CRE跨度级标注语料库。该语料库包含30,000个段落级行,包括3,945个因果行和6,491个经裁决的因果对。每个因果关系都标注了全文的原因和结果跨度、因果类型以及句子性,从而支持因果检测和全跨度因果抽取的评估。我们在检测和抽取设置下对判别式编码器和开源生成模型进行了基准测试。对于因果检测,生物医学编码器表现最强,PubMedBERT达到F$_1$分数0.7391。对于跨度级抽取,最佳生成基线是DeepSeek-R1-32B配合少样本提示,达到余弦对F$_1$分数0.6765。我们进一步通过评估在PubMedCausal上训练的编码器在外部因果关系数据集上的表现来测试迁移学习,表明该资源支持跨数据集评估。我们的结果表明,在类别不平衡、长因果跨度、隐式因果关系、跨句关系以及提示敏感性下,生物医学CRE仍然困难。代码和数据可在此处找到:https://github.com/josiahpaul07/PubMedCausal_Exp

English abstract

Causal relation extraction (CRE) is central to biomedical text mining, but current resources often conflate causal relations with broader associations, restrict annotation to sentence-level examples, or focus mainly on explicit causal cues. This limits their usefulness for evaluating whether models can recover causal claims as they are actually expressed in biomedical text. We introduce PubMedCausal, a span-level annotated corpus for biomedical CRE built from PubMed abstracts. The corpus contains 30,000 paragraph-level rows, including 3,945 causal rows and 6,491 adjudicated cause--effect pairs. Each causal relation is annotated with full-text cause and effect spans, causality type, and sententiality, enabling evaluation of both causal detection and full-span causal extraction. We benchmark discriminative encoders and open-source generative models across detection and extraction settings. For causal detection, biomedical encoders are strongest, with PubMedBERT reaching an F$_1$ score of 0.7391. For span-level extraction, the best generative baseline is DeepSeek-R1-32B with few-shot prompting, reaching a Cosine Pair F$_1$ of 0.6765. We further test transfer learning by evaluating PubMedCausal-trained encoders on external causal relation datasets, showing that the resource supports cross-dataset evaluation. Our results show that biomedical CRE remains difficult under class imbalance, long causal spans, implicit causality, inter-sentential relations, and prompt sensitivity. Code and Data can be found here: https://github.com/josiahpaul07/PubMedCausal_Exp

2605.28362 2026-05-28 cs.RO

Accelerating Robot Path Planning via Connectivity-Preserving Region Proposal Network

加速机器人路径规划的连通性保持区域提议网络

Zhanzheng Ma, Cancan Zhao, Shuai Zhang, Bo Ouyang

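AI summary: Proposes the Connectivity-Preserving Region Proposal Network (CP-RPN), a segmentation model that predicts compact, topologically connected candidate regions to compress the search space, combined with Voronoi-diagram planning and a local A* fallback for low-latency, high-success-rate path planning.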

AI Chinese abstract

移动机器人路径规划方法常受限于巨大的搜索空间,导致基于采样的算法存在延迟。基于学习的方法经常遭受局部区域碎片化和全局拓扑不一致性的困扰。为解决这一问题,我们提出了连通性保持区域提议网络(CP-RPN),一种分割引导模型,旨在预测紧凑且拓扑连通的候选区域,显著压缩搜索空间。具体来说,我们设计了一个分割模型,利用可变形注意力变换器(DAT)捕获长距离依赖以实现全局连通性,并采用反卷积解码器保留细粒度空间细节。为保证预测掩膜的连通性,我们设计了一个复合损失函数,结合交叉熵损失进行逐像素监督、连通性感知损失增强局部一致性,以及基于持续同调的拓扑连续性损失强制全局连通性。在这些高连通性走廊状区域的基础上,使用Voronoi图规划路径,并辅以局部A*回退机制确保鲁棒性。实验结果表明,与MPT基线相比,CP-RPN将候选区域大小减少了超过60.13%,实现了确定性低延迟规划(平均0.11秒),成功率达99.60%,在稳定性上优于传统的基于采样的算法。

English abstract

Mobile robot path planning methods are often constrained by vast search spaces, resulting in latency in sampling-based algorithms. Learning-based approaches frequently suffer from local region fragmentation and global topological inconsistency. To tackle these problems, we present the Connectivity-Preserving Region Proposal Network (CP-RPN), a segmentation-guided model designed to predict compact and topologically connected candidate regions, significantly compressing the search space. Specifically, we design a segmentation model that leverages a Deformable Attention Transformer (DAT) to capture long-range dependencies for global connectivity, with a deconvolutional decoder to preserve fine-grained spatial details. To guarantee the connectivity of the predicted mask, we design a composite loss function that combines Cross-Entropy loss for pixel-wise supervision, a Connectivity-Aware loss to enhance local coherence, and a Topological Continuity loss based on persistent homology to enforce global connectivity. Building on these high-connectivity corridor-like regions, the Voronoi diagram is used to plan the path, backed by a local A* fallback mechanism to ensure robustness. Experimental results demonstrate that CP-RPN reduces the candidate region size by over 60.13% compared to the MPT baseline and achieves deterministic low-latency planning (avg. 0.11s) with a 99.60% success rate, outperforming traditional sampling-based algorithms in stability.

2605.28360 2026-05-28 cs.AI

Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

提示码本:面向语言模型指令精炼的离散组合优化

Jyotirmoy Nath, Neeraj Kumar, Brejesh Lall

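AI summary: Proposes Prompt Codebooks (PCO), a framework that recasts automatic prompt optimization as discrete compositional learning, using reusable natural-language instinct units to enable per-instance routing and structured feedback, improving performance on multiple benchmarks while shortening prompts.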

AI Chinese abstract

自动提示优化(APO)显著提升了基于LLM的智能体工作流。然而,现有方法将每个任务的提示视为一个整体、实例无关的字符串,通过全局编辑进行优化,导致更新脆弱且无法复用学到的子行为。我们提出提示码本(PCO),一种新颖的组合式提示优化框架,将APO重构为在有限自然语言本能(原子、可重用的指令单元)词汇表上的离散学习。PCO将提示构建知识组织在离散码本中,通过基于LLM的编码器将每个输入路由到少量条目;生成器将它们组合成冻结目标模型的提示;评论器输出结构化判决,通过归因分解为每个变量的文本梯度,在语言值极小极大目标下联合训练编码器、生成器和码本。得到的路由是实例级的:同一任务的不同输入接收不同的本能组合,这种机制在实例无关方法下结构上无法表达。在Qwen3-8B和LLaMA-3.1-8B上的六个基准测试中,PCO相比零样本提升高达+30.36分,在HotpotQA上超越最强先前基线(GEPA)达+3.34分,总体平均提升+1.11分,并且仅使用K=16个本能即可将部署提示长度相比MIPROv2压缩最多14.1倍,相比GEPA压缩3.0倍。

English abstract

Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task's prompt as a monolithic, instance-blind string optimized through global edits, producing brittle updates and preventing the reuse of learned sub-behaviors. We propose Prompt Codebooks (PCO), a novel compositional prompt optimization framework that recasts APO as discrete learning over a finite vocabulary of natural-language instincts - atomic, reusable instruction units. PCO organizes prompt-construction knowledge in a discrete codebook and routes each input to a small subset of entries via an LLM-based encoder; a generator composes them into a prompt for the frozen target model; a critic emits a structured verdict that decomposes by attribution into per-variable textual gradients, jointly training the encoder, generator, and codebook under a language-valued min-max objective. The resulting routing is per-instance: different inputs in the same task receive different instinct compositions, a regime structurally inexpressible under instance-blind methods. Across six benchmarks on Qwen3-8B and LLaMA-3.1-8B, PCO improves over zero-shot by up to +30.36 points, surpasses the strongest prior baseline (GEPA) by +3.34 on HotpotQA and +1.11 in aggregate, and reduces deployed prompt length by up to 14.1x versus MIPROv2 and 3.0x versus GEPA using only K=16 instincts.

2605.28359 2026-05-28 cs.AI q-fin.TR

From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

从知道到做到:面向LLM股票市场交易智能体的记忆控制基准

Taojie Zhu, Wentao Zhao, Rui Sun, Beidi Luan, Jiacheng Lu, Sinuo Wang, Jing Li, Daxin Jiang, Yonghong He, Zuo Bai

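AI summary: To address knowledge leakage and return attribution problems in evaluating LLM trading agents, proposes the KTD-Fin benchmark, which uses data masking and a Barra-style attribution framework to separate market memory from investment decisions, and shows that returns come mainly from passive market exposure rather than stock-selection ability.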

AI Chinese abstract

评估大语言模型(LLM)智能体能否在资本市场盈利,越来越被框架化为端到端交易:将智能体置于历史市场中,让其交易,并衡量投资组合收益。这种设置容易导致两种评估失败。首先,长时间的回测往往与前沿LLM的知识截止日期重叠,使得记忆的股票代码、日期、价格和市场叙事替代了投资推理。其次,原始收益是选股能力的一个嘈杂代理,因为正收益可能来自市场贝塔、风格暴露或有利的市场环境,而非真正的阿尔法。我们引入了KTD-Fin(知道-做到金融基准),一个端到端的股票市场交易基准,解决了这两个问题。KTD-Fin使用数据侧掩码协议,在提示和工具中一致地匿名化关键标识符和日历信息,将历史市场记忆与投资决策分离。它还整合了Barra风格的表现归因框架,将投资组合收益分解为市场、风格和选股阿尔法成分。在2024-2026年窗口内对中国沪深300指数评估的十个前沿LLM智能体中,掩码显著改变了智能体的推理过程,推动其转向匿名化的因子推理。归因分析进一步表明,在泄露控制评估下,LLM智能体的累积收益主要由被动的市场和风格暴露解释,而持续选股阿尔法的证据有限。这些发现表明,金融LLM基准不仅应评估智能体是否赚钱,还应评估收益来源是否反映了可转移的投资技能。我们发布KTD-Fin作为LLM交易智能体泄露控制和归因感知评估的可复现模板。

English abstract

Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulnerable to two evaluation failures. First, long backtests often overlap with the knowledge cutoffs of frontier LLMs, allowing memorized tickers, dates, prices, and market narratives to substitute for investment reasoning. Second, raw returns are a noisy proxy for stock-selection ability, since positive performance may come from market beta, style exposure, or favorable regimes rather than genuine alpha. We introduce KTD-Fin (Knowing-To-Doing Financial Benchmark), an end-to-end stock-market trading benchmark that addresses both issues. KTD-Fin uses a data-side masking protocol to anonymize key identifiers and calendar information consistently across prompts and tools, separating historical market memory from investment decision-making. It also incorporates a Barra-style performance attribution framework that decomposes portfolio returns into market, style, and stock-selection alpha components. Across ten frontier LLM agents evaluated on the Chinese CSI300 over a 2024--2026 window, masking substantially changes agent rationales, pushing them towards anonymized factor-based reasoning. Attribution analysis further shows that LLM agents' cumulative returns under leakage-controlled evaluation are largely explained by passive market and style exposure, with limited evidence of persistent stock-selection alpha. These findings suggest that financial LLM benchmarks should evaluate not only whether an agent makes money, but also whether the source of returns reflects transferable investment skill. We release KTD-Fin as a reproducible template for leakage-controlled and attribution-aware evaluation of LLM trading agents.
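
The decomposition of returns into market, style, and stock-selection alpha can be sketched for a single period as below. This is a simplified illustration, not the KTD-Fin implementation: it assumes the portfolio's beta, style exposures, and factor returns are already known, whereas Barra-style models typically estimate factor returns by cross-sectional regression. All names and numbers are hypothetical.

```python
def attribute_return(portfolio_return, market_return, beta, style_exposures, style_returns):
    """Split one period's portfolio return into market, style, and residual (alpha) parts."""
    market_part = beta * market_return
    style_part = sum(style_exposures[name] * style_returns[name] for name in style_exposures)
    alpha_part = portfolio_return - market_part - style_part   # what the factors do not explain
    return {"market": market_part, "style": style_part, "alpha": alpha_part}

# Hypothetical single-period inputs.
parts = attribute_return(
    portfolio_return=0.030,
    market_return=0.020,
    beta=1.1,
    style_exposures={"size": 0.5, "momentum": -0.2},
    style_returns={"size": 0.010, "momentum": 0.005},
)
```

In this example most of the 3.0% return is attributed to market exposure (2.2%), with the style and residual parts contributing 0.4% each, which mirrors the kind of finding the abstract reports: a positive return does not by itself indicate stock-selection skill.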

2605.28358 2026-05-28 cs.LG cs.AI cs.IT math.IT

Score Based Error Correcting Code Decoder

基于分数的纠错码译码器

Alon Helvits, Eliya Nachmani

AI总结 提出SB-ECC,一种将译码视为连续时间去噪的基于分数的译码器,通过神经去噪器定义概率流常微分方程,在奇偶校验约束下迭代更新噪声信道观测值,无需SNR估计即可推理,并在42个码/SNR设置中39/42达到最佳误码率。

详情
Comments
Accepted to ICML 2026
AI中文摘要

纠错码能够实现可靠通信,然而,跨码族和码长的实用软译码仍然具有挑战性。我们提出SB-ECC,一种基于分数的译码器,将译码视为连续时间去噪。神经去噪器定义了一个概率流常微分方程(ODE),该方程在奇偶校验约束的引导下,迭代地将含噪信道观测值朝有效码字更新。该模型在不同噪声水平下训练,无需时间/SNR条件,从而无需SNR估计即可进行推理,并支持由ODE求解器预算控制的直接延迟-精度权衡。我们使用原始带符号的信道观测值作为输入来学习连续去噪场。在42个码/SNR设置中,SB-ECC在39/42个条目中实现了最佳误码率,相对于最强竞争基线,平均SNR增益为0.17dB,最大增益为0.46dB。我们还表明,将求解器从Euler切换为DPM可保持-ln(BER),同时将端到端译码时间平均减少8.86%(最高达12.82%)。

英文摘要

Error-correcting codes enable reliable communication, yet practical soft decoding remains challenging across code families and block lengths. We propose SB-ECC, a score-based decoder that casts decoding as continuous-time denoising. A neural denoiser defines a probability-flow ordinary differential equation (ODE) that iteratively updates the noisy channel observation toward a valid codeword, guided by parity constraints. The model is trained across noise levels without time/SNR conditioning, enabling inference without SNR estimation and supporting a direct latency-accuracy trade-off controlled by the ODE solver budget. We use the raw signed channel observation as input for learning a continuous denoising field. Across 42 code/SNR settings, SB-ECC achieves the best BER in 39/42 entries, with an average SNR gain of 0.17 dB and a maximum gain of 0.46 dB over the strongest competing baseline. We also show that swapping the solver from Euler to DPM preserves -ln(BER) while reducing end-to-end decoding time by 8.86% on average (up to 12.82%).

2605.28355 2026-05-28 cs.LG

Detecting Diffusion-Generated Time Series Under Generator Shift

检测生成器偏移下的扩散生成时间序列

Zhi Wen Soi, Aditya Shankar, Gert Lek, Abele Mălan, Daniel Neider, Jian-Jia Chen, Lydia Chen

AI总结 针对生成器未知的扩散生成时间序列检测问题,比较了白盒与黑盒方法,发现简单分类器作为黑盒检测器显著优于白盒方法,并指出该问题不能直接迁移图像领域经验。

详情
AI中文摘要

真实时间序列与扩散生成时间序列之间的界限变得越来越难以划分,然而该领域的检测仍未被充分探索,尤其是在生成器未知的情况下。我们比较了需要访问生成器的白盒检测与仅基于原始信号的黑盒检测。白盒方法是一种从图像领域改编的基于重构的检测器,在分布内表现良好,但在生成器偏移下失效:图像中基于重构的检测之所以成功,是因为大型通用生成器提供了近乎通用的重构先验,而时间序列不存在类似的生成器。相比之下,一个简单的现成分类器作为黑盒检测器表现非常出色,平均F1达到79.2,相对白盒方法提升22.1%,在1%假阳性率下的真正例率为57.2。因此,扩散生成时间序列的检测并非图像领域问题的直接迁移。本工作首次系统探索了扩散生成时间序列的白盒和黑盒检测。最后,我们指出了几个开放且有前景的方向。

英文摘要

The boundary between real and diffusion-generated time series is becoming increasingly difficult to draw, yet detection in this domain remains underexplored, especially when the generator is unknown. We compare white-box detection, which requires access to the generator, against black-box detection, which operates on the raw signal alone. The white-box approach, a reconstruction-based detector adapted from the image domain, works well in-distribution but breaks down under generator shift: reconstruction-based detection in images succeeds because large generic generators provide a near-universal reconstruction prior, and no analogous generator exists for time series. In contrast, a simple off-the-shelf classifier used as a black-box detector performs remarkably well, achieving an average F1 of 79.2, a 22.1% relative improvement over the white-box approach, and a TPR@1%FPR of 57.2. Diffusion-generated time series detection is therefore not a direct transfer of the image domain problem. This work provides the first systematic exploration of white-box and black-box detection for diffusion-generated time series. We close by identifying several open and promising directions.
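
For readers unfamiliar with the TPR@1%FPR metric quoted above, the following is a minimal sketch of one common way to compute it from detector scores. It assumes higher scores mean "more likely generated" and that a sample is flagged when its score is strictly above the threshold. The function name and the toy scores are illustrative, and the paper's exact thresholding convention is not stated in the abstract.

```python
def tpr_at_fpr(real_scores, generated_scores, max_fpr=0.01):
    """True-positive rate at the loosest threshold whose FPR stays <= max_fpr.

    real_scores: detector scores for real series (negatives).
    generated_scores: detector scores for generated series (positives).
    """
    # At most this many real samples may be flagged as generated.
    allowed = int(max_fpr * len(real_scores))
    ranked = sorted(real_scores, reverse=True)
    # Using the (allowed + 1)-th highest real score as the threshold means
    # no more than `allowed` real scores are strictly above it.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(1 for s in generated_scores if s > threshold)
    return flagged / len(generated_scores), threshold


# Toy scores: 100 real series with scores 0.00 .. 0.99.
real = [i / 100 for i in range(100)]
generated = [0.5, 0.97, 0.985, 0.99, 1.2]
tpr, threshold = tpr_at_fpr(real, generated)
```

Here exactly one of the 100 real series scores above the threshold (a 1% false-positive rate), and three of the five generated series are caught.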

2605.28354 2026-05-28 cs.AI

Plan Before Search: Search Agents Need Plan

搜索前先规划:搜索智能体需要规划

Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Jiayi Ji, Chenyi Lei, Wenwu Ou, Xiaoshuai Sun, Qibin Hou

AI总结 提出Plan方法,通过将问题分解为有序子问题再进行检索,并引入自举训练范式,无需外部强模型蒸馏即可在多跳QA中激活规划能力。

详情
AI中文摘要

将大型语言模型训练为检索增强推理智能体通常将强化学习与从更强模型蒸馏的SFT冷启动相结合。然而,这种范式忽略了两个基本因素:子技能之间的依赖结构,以及蒸馏并非获取能力的唯一途径。我们通过Plan来研究这一点,这是一种结构化的智能体行为,用于多跳检索,它在任何检索执行之前将问题分解为有序的子问题,从而使每个搜索步骤可以锚定到预先设计的子问题,而不是在先前检索的部分相关文档的影响下漂移。然而,在涵盖3B到14B参数的三个模型家族中,我们发现相同的奖励信号会引发定性不同的RL失败模式。这一现象表明,成功的训练不仅取决于奖励设计,还取决于模型特定的可行性条件:足够的初始熵、训练稳定性和先决子技能。受此启发,我们提出了一种自举训练范式,其中小规模种子模型生成过滤后的轨迹,从而在任何目标模型中激活Plan,消除了从外部强模型蒸馏的需要。我们的流程在每个测试模型中都激活了Plan,并在多跳QA基准上持续优于竞争基线。

英文摘要

Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start distilled from a stronger model. However, this paradigm overlooks two fundamental factors: the dependency structure among sub-skills, and the possibility that distillation is not the only route to capability acquisition. We study this through Plan, a structured agentic behavior for multi-hop retrieval that decomposes a question into ordered sub-questions before any retrieval is performed, so that each search step can be anchored to a pre-designed sub-question instead of drifting under the influence of partially relevant documents retrieved earlier. However, across three model families spanning 3B to 14B parameters, we find that an identical reward signal induces qualitatively different RL failure modes. This phenomenon indicates that successful training hinges not only on reward design but also on model-specific feasibility conditions: sufficient initial entropy, training stability, and prerequisite sub-skills. Motivated by this, we propose a self-bootstrapping paradigm in which a small-scale seed model generates filtered trajectories that activate Plan in any target model, eliminating the need for distillation from an external stronger model. Our pipeline activates Plan across every tested model and consistently outperforms competitive baselines on multi-hop QA benchmarks.

2605.28353 2026-05-28 cs.NE cs.AI cs.SC

Improving Evaluation of Recombination-based Cartesian Genetic Programming

改进基于重组的笛卡尔遗传编程的评估

Duy Long Tran, Anja Jankovic, Marie Anastacio, Holger Hoos, Roman Kalkreuth

AI总结 本研究通过超参数优化,在SRBench基准平台上评估了子图交叉和离散表型重组两种重组算子,证明了超参数优化可提升基于重组的笛卡尔遗传编程的性能。

详情
Journal ref
GECCO'26 Companion: Genetic and Evolutionary Computation Conference Companion, July 13-17, 2026, San Jose, Costa Rica
Comments
Accepted for presentation as workshop paper in the graph-based genetic programming workshop (GGP) at the Genetic and Evolutionary Computation Conference (GECCO). To appear in the GECCO'26 conference companion. GECCO'26 will be held July 13-17, 2026 in San Jose, Costa Rica
AI中文摘要

笛卡尔遗传编程传统上使用变异作为其主要且通常是唯一的遗传算子来驱动进化搜索。尽管近年来取得了进展,但由于明显的性能提升不足,基于重组的方法长期以来一直被避免。本研究在符号回归基准平台SRBench上检验了最近提出的两种重组算子:子图交叉和离散表型重组。利用TinyverseGP框架中提供的实现,我们对这两种算子的相应表示进行了超参数优化。我们的工作表明,超参数优化可以导致基于重组的笛卡尔遗传编程的性能提升。

英文摘要

Cartesian Genetic Programming has traditionally used mutation as its main and often sole genetic operator to drive evolutionary search. Despite advancements in recent years, recombination-based approaches have long been avoided due to an apparent lack of performance gains. This study examines two recently suggested recombination-based operators, subgraph crossover and discrete phenotypic recombination, on SRBench, a benchmarking platform for symbolic regression. Using the implementations provided in the TinyverseGP framework, we perform hyperparameter optimisation of the respective representations with these two operators. Our work demonstrates that hyperparameter optimisation can lead to improvements in performance for recombination-based Cartesian Genetic Programming.

2605.28352 2026-05-28 cs.RO

Magnet-Based Soft Robotic Skin Using a 3D-Printed Multi-Lattice Structure and CNN-Based Tactile Super-Resolution

基于磁体的软体机器人皮肤:使用3D打印多格点结构和CNN触觉超分辨率

Yunseong Bang, Joowon Park, Suan Sim, Youngjun Ryu, Sukho Park, Kyungseo Park

AI总结 提出一种集成多层软格点、霍尔效应传感器阵列和CNN触觉超分辨率模型的磁基机器人皮肤,通过格点参数调节实现机械柔顺性与传感特性的联合优化,并利用3D打印快速制造,实现接触位置和法向力的实时估计。

详情
Comments
6 pages, 9 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026. Y. Bang and J. Park contributed equally
AI中文摘要

本文提出一种基于磁体的机器人皮肤,它集成了多层软格点、分布式霍尔效应传感器阵列和触觉超分辨率模型。外部接触力通过嵌入的永磁体转换为磁场变化,而格点将这些变化扩散到整个传感域。这使得每个传感器具有大且重叠的感受野,从而在最小盲区的情况下实现大面积的传感。格点参数可调,能够联合调整机械柔顺性和传感特性。隐式建模工作流和选择性激光烧结(SLS)3D打印支持快速制造共形、高复杂度的结构。基于实验测量训练的卷积神经网络实时估计接触位置和法向力。实验验证了定位精度,并表明可扩展到更大表面,适用于全身机器人皮肤和安全的人机交互。

英文摘要

This paper presents a magnet-based robotic skin that integrates a multilayer soft lattice with distributed Hall-effect sensor arrays and a tactile super-resolution model. External contact forces are converted to magnetic field changes by embedded permanent magnets, and the lattice spreads these changes across the sensing domain. This gives each sensor a large, overlapping receptive field and enables a large sensing area with minimal blind spots. Lattice parameters are tunable, enabling joint adjustment of mechanical compliance and transduction characteristics. An implicit modeling workflow and selective laser sintering (SLS) 3D printing support rapid fabrication of conformal, high-complexity structures. A convolutional neural network trained on experimental measurements estimates contact location and normal force in real time. Experiments validate localization accuracy and indicate scalability to larger surfaces, suggesting applicability to whole-body robotic skin and safe human-robot interaction.

2605.28348 2026-05-28 cs.CV

Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

面向语义无关和形状感知的视觉-语言分割模型

Corentin Seutin, Mohamed Amine Ettaki, Michaël Clément, Pierrick Coupé, Rémi Giraud

AI总结 提出语义无关且形状感知(SANSA)分割范式,通过非语义文本描述微调模型,在保持语义提示性能的同时,在新任务上提升高达20% mIoU。

详情
Comments
Accepted at the 2026 IEEE International Conference on Image Processing (ICIP 2026)
AI中文摘要

视觉-语言分割模型最近通过利用自然语言表达的高层语义对象类别取得了强大性能。然而,这种语义依赖性限制了它们对形状、几何或纹理等内在视觉属性的推理能力,而这些属性在许多实际应用中至关重要。在这项工作中,我们引入了语义无关且形状感知(SANSA)分割,这是一种新的范式,要求分割模型仅从非语义文本描述中运行。为此,我们提出了两种基于字典约束或示例指导生成SANSA分割提示的策略,两者都生成语义无关的文本描述。然后使用这些提示在语义无关监督下微调分割模型。实验表明,与预训练的最先进模型相比,在此新分割任务上对SANSA提示进行微调可带来高达20%的mIoU改进,同时在标准语义提示上保持强劲性能。这些结果强调了低层和中层视觉推理对于提高视觉-语言分割模型的泛化性和可控性的重要性。

英文摘要

Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about intrinsic visual properties such as shape, geometry, or texture, which are essential in many real-world applications. In this work, we introduce Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation, a new paradigm that requires segmentation models to operate solely from non-semantic textual descriptions. To this end, we propose two strategies to generate SANSA segmentation prompts based on either dictionary constraints or example guidance, both generating semantic-agnostic textual descriptions. These prompts are then used to finetune segmentation models under semantic-agnostic supervision. Experiments show that finetuning on SANSA prompts yields up to a 20% mIoU improvement on this new segmentation task, compared to pretrained state-of-the-art models, while maintaining strong performance on standard semantic prompts. These results highlight the importance of low- and mid-level visual reasoning for improving the generalization and controllability of vision-language segmentation models.

2605.28347 2026-05-28 cs.AI

FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models

FedMPT: 视觉语言模型的多标签联邦提示调优

Xucong Wang, Pengkun Wang, Zhe Zhao, Liheng Yu, Shuang Wang, Yang Wang

AI总结 针对联邦学习中多标签识别任务,提出FedMPT方法,利用因果模型的前门调整和大语言模型驱动的条件解耦,通过最优传输和门控机制抑制虚假标签关联,提升模型鲁棒性。

详情
Comments
16 pages, including 11 pages of main text and 5 pages of appendix; Accepted by CVPR2026
AI中文摘要

基于视觉语言模型的多标签识别旨在利用其预训练知识更好地适应复杂识别场景,从而增强模型鲁棒性。然而,对于需要联邦学习的现实去中心化应用,将视觉语言模型适应到每个拥有私有和异构数据的客户端会导致模型过拟合虚假标签关联,从而在遇到新样本时触发不相关类别。为解决此问题,我们使用因果模型重新考虑多标签识别的联邦学习,其中采用前门调整并通过中间变量(放大真实标签共现)解耦多标签识别建模过程。在分析指导下,我们提出FedMPT,这是首个专门为联邦多标签识别设计的方法。FedMPT的核心思想是利用可泛化条件引导联邦多标签识别以减轻错误标签激活。为此,FedMPT引入了一个由大语言模型驱动的流程来解读控制标签依赖的潜在条件。此外,我们引入了条件增强提示与图像块之间的最优传输以揭示多个区域级语义。最后,我们通过精心设计的门控机制从不同条件生成协同预测。在多个基准数据集上的实验表明,我们提出的方法在不同设置下取得了有竞争力的结果,并优于现有最先进方法。

英文摘要

Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralized applications requiring federated learning, adapting VLMs to each client that possesses private and heterogeneous data can cause the model to overfit spurious label correlations, consequently triggering irrelevant categories when encountering new samples. To tackle this problem, we reconsider federated learning for MLR through a causal model, in which we adopt a front-door adjustment and decouple the MLR modeling process by intermediate variables that magnify the oracle label co-occurrence. Guided by our analysis, we propose FedMPT, the first method specifically designed for federated MLR. The core idea of FedMPT is to leverage generalizable conditions to steer federated MLR and mitigate erroneous label activations. To achieve this, FedMPT introduces a Large Language Model (LLM)-driven pipeline to decipher the underlying conditions that govern label dependencies. Furthermore, we introduce an optimal transport between the condition-enriched prompts and the image patches to uncover multiple region-level semantics. Finally, we generate synergistic predictions from different conditions with a crafted gating mechanism. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms state-of-the-art methods under varied settings.

2605.28346 2026-05-28 cs.CL

When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs

当话语压力冲突时:视觉-语言模型输出中的信息结构

Marcell Fekete, Johannes Bjerva, Tamás Káldi

AI总结 研究视觉-语言模型在视觉问答中是否区分话语旧主题和新焦点,发现模型虽产生信息结构相关结构但过度正则化,倾向于窄响应模板,类似模式崩溃。

详情
AI中文摘要

视觉-语言模型(VLM)越来越多地被评估是否能识别正确的视觉内容,但关于它们是否以话语适当的形式表达这些内容却知之甚少。我们利用信息结构(IS)来填补这一研究空白,测试VLM在视觉基础问答中是否能区分话语旧主题(Topic)和话语新焦点(Focus)。我们利用匈牙利语,其中主题和焦点映射到专门的句法位置,使得IS选择在文本中可观察。通过比较六种VLM与人类参与者,我们发现模型产生了与IS相关的结构,但过度正则化了这种敏感性。在话语状态、语法角色(主语主题偏好)和限定性(不定焦点偏好)的相互作用压力下,人类选择多种IS实现策略。相比之下,VLM坍缩为狭窄的响应模板,类似于模式崩溃(Kirk等人,2024)。我们的发现表明,VLM评估应超越内容准确性,关注内容如何为话语打包。

英文摘要

Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap using information structure (IS), testing whether VLMs distinguish discourse-old Topics from discourse-new Foci in visually grounded question answering. We exploit Hungarian, a language in which Topic and Focus map onto dedicated syntactic positions, making IS choices observable in text. Comparing six VLMs with human participants, we find that models produce IS-relevant constructions, but over-regularise this sensitivity. Under the interacting pressures of discourse status, grammatical role (preference for subject Topics) and definiteness (preference for indefinite Foci), humans choose variable strategies for IS realisation. VLMs, by contrast, collapse onto narrow response templates, resembling mode collapse (Kirk et al., 2024). Our findings suggest that VLM evaluation should look beyond content accuracy to how content is packaged for the discourse.

2605.28345 2026-05-28 cs.AI cs.LG eess.SP

Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains

Picid: 一种跨任务和领域的可复现PHM模块化评估基础设施

Lev Telyatnikov, Raffael Theiler, Leandro Von Krannichfeldt, Olga Fink

AI总结 提出模块化评估基础设施Picid,通过标准化数据契约和评估边界,实现跨任务、跨数据集的故障检测、诊断和预测的可复现与公平比较。

详情
AI中文摘要

预测与健康管理(PHM)领域的进展受到跨任务、数据集和应用领域缺乏标准化和可复用评估实践的阻碍。报告的结果往往难以复现和比较,因为关键协议选择(如数据划分、预处理、标签对齐、时间窗口和指标)通常是隐式的或临时实现的。我们引入了Picid,一个模块化评估基础设施,将PHM评估流程形式化为显式、可执行和可复现的协议。通过定义良好的抽象,Picid在保持对不同PHM设置的灵活性的同时,强制执行确定性、无泄漏的数据集构建。该框架通过统一接口支持故障检测、诊断和预测,并且可以扩展到新的数据集和模型类别,而不违反协议不变性。通过标准化数据契约和评估边界,Picid还实现了跨诊断(分类)和预测(回归)的公平任务比较,允许相同的模型系列在不同设置中一致地进行评估。我们通过对跨越电池、轴承、涡轮风扇发动机、液压系统、过滤系统和建筑的十二个数据集上的十三个模型进行实证评估来展示Picid。这项工作为PHM中标准化、公平和可复现的评估建立了可复用的基础。

英文摘要

Progress in Prognostics and Health Management (PHM) is hindered by the lack of standardized and reusable evaluation practices across tasks, datasets, and application domains. Reported results are often difficult to reproduce and compare, as key protocol choices, such as data splits, preprocessing, label alignment, temporal windowing, and metrics, are often implicit or implemented ad hoc. We introduce Picid, a modular evaluation infrastructure that formalizes the PHM evaluation pipeline as an explicit, executable, and reproducible protocol. Through well-defined abstractions, Picid enforces deterministic, leakage-safe dataset construction while remaining flexible across diverse PHM settings. The framework supports fault detection, diagnostics, and prognostics through a unified interface and can be extended to new datasets and model classes without violating protocol invariants. By standardizing data contracts and evaluation boundaries, Picid also enables fair cross-task comparisons across diagnostics (classification) and prognostics (regression), allowing identical model families to be evaluated consistently across heterogeneous settings. We demonstrate Picid through an empirical evaluation of thirteen models on twelve datasets spanning batteries, bearings, turbofan engines, hydraulics, filtration systems, and buildings. This work establishes a reusable foundation for standardized, fair, and reproducible evaluation in PHM.

2605.28340 2026-05-28 stat.ML cs.LG

Decision-focused learning for optimal PV-Battery scheduling

面向决策的光伏-电池调度优化学习

Joris Depoortere, Hussain Kazmi, Johan Driesen

AI总结 提出一种决策聚焦学习框架,通过训练LSTM光伏发电预测器以最小化电池调度成本,相比传统两阶段方法降低平均电费3.6%,验证了预测与优化目标对齐的重要性。

详情
Journal ref
Journal of Energy Storage Volume 154, Part A, 10 April 2026, 121152
AI中文摘要

近年来住宅光伏的使用急剧增加。随着电池系统变得更加经济实惠,光伏-电池系统的最优运行可以为家庭带来显著节省。最优控制需要正确预测底层参数(如光伏发电量)以调度电池。尽管由于算法进步和数据可用性,预测模型变得越来越准确,但准确性通常以通用指标衡量,这些指标可能与下游应用不一致。本研究提出了一种决策聚焦学习框架,通过在下游电池系统最优调度上训练长短期记忆光伏能量预测器,将优化和预测集成在一起。将所提出的方法与标准两阶段方法进行比较。在14个月的评估期内,决策聚焦方法在根据完美预测和无优化基线定义的性能界限归一化后,将20栋建筑的平均电费降低了3.6%。关键的是,尽管该模型的均方根误差为19.9%,显著高于解耦模型的8.2%,但仍实现了这一财务改善。对决策聚焦模型进行热启动进一步改善了结果,平均成本降低约8%,同时减轻了对统计准确性的负面影响(均方根误差为13.7%)。这些发现在20个家庭以及每个家庭单独在0.001水平上具有统计显著性。这些结果表明,将预测模型与优化目标对齐对于在光伏-电池系统中实现成本优势至关重要。未来的研究应在其他数据集、替代预测模型和替代优化算法上重复这些发现。

英文摘要

The use of residential photovoltaics has increased dramatically in recent years. With battery systems becoming more affordable, the optimal operation of a photovoltaic-battery system can bring significant savings to households. Optimal control requires correct forecasts of underlying parameters, such as photovoltaic power generation, to schedule the battery. While forecasting models have become increasingly accurate due to algorithmic advances and data availability, accuracy is typically measured with generic metrics that may not align with the downstream application. This study proposes a decision-focused learning framework that integrates optimization and prediction by training a Long Short-Term Memory photovoltaic energy forecaster on the downstream optimal scheduling of a battery system. The proposed methodology is compared against a standard two-phase approach. Across a 14-month evaluation period, the decision-focused method reduced average electricity costs across twenty buildings by 3.6% when normalized against performance bounds defined by a perfect forecast and a baseline of no optimization. Critically, this financial improvement was achieved despite the model exhibiting a root mean squared error of 19.9%, significantly higher than the decoupled model's 8.2%. Warm-starting the decision-focused model further improves results, lowering average cost by approximately 8%, while also mitigating the negative impact on statistical accuracy (root mean squared error of 13.7%). The findings are statistically significant at the 0.001 level across the twenty households and for each household individually. These results demonstrate that aligning forecast models with optimization goals is key to achieving cost advantages in PV-battery systems. Future research should replicate these findings on other datasets, alternative forecasting models, and alternative optimization algorithms.

2605.28338 2026-05-28 cs.AI

SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

SafeMed-R1: 临床医生审计的安全与伦理对齐用于医疗大语言模型

Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan, Yidong Jiang, Yankai Jiang, Jinru Ding, Jiayuan Chen, Zhuangzhi Gao, Pengcheng Chen, Zhao He, Rongzhao Zhang, Meiling Liu, Luyi Jiang, Jie Xu

AI总结 提出SafeMed-R1模型,通过可追溯的临床信任信号管道和红队压力测试实现安全与伦理对齐,在临床基准上达到79.6%的宏平均准确率,并将不安全输出减少约3-5%。

详情
AI中文摘要

大语言模型在执业考试中日益匹配专家表现,但常规临床使用仍受限,因为治理需要可审计的推理、安全与伦理对齐以及对对抗性滥用的韧性。本文提出SafeMed-R1,通过可追溯的临床信任信号管道进行训练,该管道将每个推理实例与临床医生评分标准和编辑历史关联,并通过安全与伦理监督和红队压力测试进行对齐。SafeMed-R1在临床基准上达到79.6%的宏平均准确率。在对抗性安全测试下,它显示出最低的聚合风险,并将不安全输出相对于基线减少约3%至5%。在一项包含30个用药安全场景的配对专家研究中,SafeMed-R1在医学正确性上与PGY1和PGY2住院医师相当,并在用药安全、指南一致性和临床实用性上得分更高。总体而言,这些结果表明,临床医生审计的监督溯源,结合领域定制的安全与伦理对齐,可以在不依赖推理时检索或引用依据的情况下,加强治理相关的证据。

英文摘要

Large language models (LLMs) increasingly match expert performance on licensing examinations, yet routine clinical use remains limited because governance requires auditable reasoning, safety and ethics alignment, and resilience to adversarial misuse. Here we present SafeMed-R1, trained with a traceable Clinical Trust Signals (CTS) pipeline that links each reasoning instance to clinician rubric scores and edit histories, and aligned through safety and ethics supervision and red-team stress testing. SafeMed-R1 attains a macro-averaged accuracy of 79.6% across clinical benchmarks. Under adversarial safety testing, it shows the lowest aggregated risk and reduces unsafe outputs by about 3% to 5% relative to its baseline. In a paired expert study of 30 medication safety vignettes, SafeMed-R1 matches PGY1 and PGY2 residents on medical correctness and scores higher for medication safety, guideline consistency, and clinical usefulness. Collectively, these results suggest that clinician-audited supervision provenance, together with domain-tailored safety and ethics alignment, can strengthen governance-relevant evidence without relying on inference-time retrieval or citation grounding.

2605.28337 2026-05-28 cs.AI

An Enhanced Large Neighborhood Search Approach for the Capacitated Facility Location Problem with Incompatible Customers

一种增强的大邻域搜索方法用于解决具有不兼容客户的容量设施选址问题

Ida Gjergji, Lucas Kletzander, Nysret Musliu, Andrea Schaerf

AI总结 针对具有客户不兼容约束的容量设施选址问题,提出一种结合混合破坏算子和精确修复的大邻域搜索方法,在所有基准实例上取得了新的最优解。

详情
AI中文摘要

文献中最近引入了一种经典容量设施选址问题的新变体,该变体考虑了客户之间的不兼容性。该问题捕捉了给定客户对不能由同一设施服务的情况。这一特征对于许多实际选址问题至关重要,例如存在危险或污染材料以及竞争客户之间的冲突。在本文中,我们提出了一种大邻域搜索(LNS)方法来解决该问题。在LNS框架内,我们引入了三种不同的破坏算子,并以混合方式组合它们,同时在修复阶段使用精确求解器。针对LNS的设计研究了不同的算法组件。实验分析表明,我们的新方法优于现有的最先进元启发式算法,为所有可用的基准实例提供了新的最佳解。

英文摘要

A new variant of the classic capacitated facility location problem, which considers incompatibilities between customers, has recently been introduced in the literature. This problem captures the situation where given pairs of customers cannot be served by the same facility. Such a feature is crucial for many practical cases of location problems, such as the presence of hazardous or polluting materials and contention between competing customers. In this paper, we propose a Large Neighborhood Search (LNS) method to solve this problem. Within the framework of LNS, we introduce three different destroy operators, which are combined in a hybrid manner, and we use an exact solver in the repair phase. Different algorithmic components are investigated for the design of LNS. The experimental analysis shows that our new method outperforms existing state-of-the-art metaheuristics, providing new best solutions for all available benchmark instances.

2605.28331 2026-05-28 cs.CV

Transfer learning RGB models to hyperspectral images with trainable tensor decompositions

使用可训练张量分解将RGB模型迁移到高光谱图像

Mariette Schönfeld, Laurens Devos, Wannes Meert, Hendrik Blockeel

AI总结 提出一种通过可训练张量分解将预训练RGB模型的卷积滤波器分解为空间和光谱成分,并替换光谱成分以适应高光谱图像通道数的方法,实现高光谱图像迁移学习,实验表明该方法比其他方法更准确和鲁棒。

详情
AI中文摘要

迁移学习使得大型视觉网络能够通过将模型的通用滤波器专门化到新任务,从而应用于各种领域。然而,这些网络假设输入图像具有3个输入通道,使其与多光谱或高光谱图像不兼容。当前缓解这种不兼容性的方法要么牺牲图像信息,要么牺牲模型信息。本文提出了一种新颖的方法,通过使用部分可训练的张量分解来保留图像和模型中的空间信息。我们创建预训练卷积滤波器的这种分解,将滤波器分离为空间和光谱成分。然后,将光谱成分替换为具有更高通道维度的可训练成分。这创建了能够专门化到新数据集的高光谱滤波器,同时保留原始滤波器的空间模式。在各种高光谱数据集上的实验表明,我们的方法比其他高光谱迁移学习方法更准确和鲁棒。

英文摘要

Transfer learning makes it possible to use large vision networks in a variety of domains by specializing their general-purpose filters to new tasks. However, these networks assume that input images have 3 channels, making them incompatible with multi- or hyperspectral images. Current approaches that mitigate this incompatibility sacrifice information in either the image or the model. This work proposes a novel approach that preserves both the image information and the spatial information present in the model by using partially trainable tensor decompositions. We create such decompositions of pretrained convolutional filters, separating the filters into spatial and spectral components. The spectral components are then replaced with trainable components of higher channel dimensionality. This creates hyperspectral filters that can specialize to new datasets, while retaining the spatial patterns of the original filter. Experiments on a variety of hyperspectral datasets show that our approach is more accurate and robust than other hyperspectral transfer learning methods.
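
The component swap described above can be sketched in a few lines. The sketch assumes a single filter has already been factored into a spectral factor (channels x rank) and frozen spatial patterns (rank x k x k), for example by an SVD; the factorization step itself is not shown. The helper names, the example values, and the interpolation-based initialization are hypothetical and are not the authors' procedure, which the abstract does not specify.

```python
def compose_filter(spectral, spatial):
    """Rebuild a filter: W[c][i][j] = sum_r spectral[c][r] * spatial[r][i][j]."""
    rank = len(spatial)
    height, width = len(spatial[0]), len(spatial[0][0])
    return [[[sum(spectral[c][r] * spatial[r][i][j] for r in range(rank))
              for j in range(width)]
             for i in range(height)]
            for c in range(len(spectral))]


def init_hyperspectral(rgb_spectral, n_bands):
    """Assumed initialization: interpolate the 3 RGB rows across n_bands rows.

    In training, this (n_bands x rank) factor would be the trainable part,
    while the spatial patterns stay frozen.
    """
    rows = []
    for b in range(n_bands):
        pos = 2 * b / (n_bands - 1)      # position along the R-G-B axis, 0..2
        lo = min(int(pos), 1)            # index of the left neighbour row
        frac = pos - lo
        rows.append([(1 - frac) * a + frac * z
                     for a, z in zip(rgb_spectral[lo], rgb_spectral[lo + 1])])
    return rows


# Hypothetical rank-2 factorization of one pretrained 3x3 RGB filter.
spatial = [
    [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],   # horizontal-gradient pattern
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],      # flat averaging pattern
]
rgb_spectral = [[0.3, 0.1], [0.6, 0.1], [0.1, 0.1]]   # 3 channels x rank 2

rgb_filter = compose_filter(rgb_spectral, spatial)           # 3 x 3 x 3
hyper_filter = compose_filter(init_hyperspectral(rgb_spectral, 5), spatial)  # 5 x 3 x 3
```

Because only the spectral factor changes, every band of the new filter is still a mixture of the same spatial patterns, which is the property the abstract emphasizes.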

2605.28330 2026-05-28 cs.RO

Chance-Constrained MPPI under State and Dynamic Object Prediction Uncertainty and the Evaluation of Collision Risk Calibration

状态与动态物体预测不确定性下的机会约束MPPI及碰撞风险校准评估

Benjamin Serfling, Konrad Doll, Kati Radkhah-Lens

AI总结 针对机会约束MPPI控制中上游不确定性校准不足导致的过自信或过保守问题,提出DUCCT-MPPI架构,通过无迹变换和蒙特卡洛聚合联合集成定位与动态障碍预测不确定性,在仿真中实现鲁棒导航,成功率提升28%。

详情
Comments
Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
AI中文摘要

机会约束模型预测路径积分(MPPI)控制越来越多地用于动态环境中的导航,以明确限制碰撞风险。然而,这些概率保证隐含地假设来自定位和感知的上游不确定性是良好校准的。在实践中,估计器常常校准不良,导致特征性的闭环故障模式:过度自信导致系统性安全违规,而信心不足引发过度保守的冻结或概率稀释。为填补这一关键空白,我们的主要贡献是提出一种严格的评估方法,应用适当的评分规则来评估闭环执行期间预测碰撞风险的统计有效性。同时,提出了双不确定性机会约束管MPPI(DUCCT-MPPI)作为一种实时的、风险感知的规划架构。DUCCT-MPPI通过单管无迹变换(UT)近似联合集成定位不确定性,并通过蒙特卡洛聚合集成动态障碍预测不确定性。通过广泛的基于物理的仿真,该框架展示了鲁棒的故障缓解能力,在高度杂乱的环境中无缝过渡到安全、保守的机动,而不陷入功能死锁。在高度杂乱的环境中,DUCCT-MPPI实现了卓越的鲁棒性,导航成功率比已建立的蒙特卡洛MPPI基线高出近28%,同时记录了最低的行驶时间并最小化了诱导的社会力。最终,这些发现表明,自主导航中可靠的概率安全性不仅要求表达性的风险模型,还要求整个自主栈中统计上有效的不确定性估计。

英文摘要

Chance-constrained Model Predictive Path Integral (MPPI) control is increasingly adopted for navigation in dynamic environments to explicitly bound collision risk. However, these probabilistic guarantees implicitly assume that upstream uncertainties from localization and perception are well-calibrated. In practice, estimators are often miscalibrated, inducing characteristic closed-loop failure modes: overconfidence leads to systematic safety violations, while underconfidence triggers overly conservative freezing or probability dilution. To address this critical gap, our primary contribution is a rigorous evaluation methodology applying proper scoring rules to assess the statistical validity of predicted collision risks during closed-loop execution. Concurrently, Dual-Uncertainty Chance-Constrained Tube MPPI (DUCCT-MPPI) is proposed as a real-time, risk-aware planning architecture. DUCCT-MPPI jointly integrates localization uncertainty via a one-tube Unscented Transform (UT) approximation and dynamic obstacle prediction uncertainty via Monte Carlo aggregation. Through extensive physics-based simulations, the framework demonstrates robust failure mitigation, seamlessly transitioning to safe, conservative maneuvering without succumbing to functional deadlocks in highly cluttered environments. In these environments, DUCCT-MPPI achieves superior robustness, outperforming established Monte Carlo MPPI baselines by nearly 28% in navigation success rate, while simultaneously recording the lowest travel times and minimizing induced social forces. Ultimately, these findings establish that reliable probabilistic safety in autonomous navigation requires not only expressive risk models but also statistically valid uncertainty estimates throughout the entire autonomy stack.
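
As an illustration of how a proper scoring rule exposes miscalibrated collision risk, the sketch below computes the Brier score and the logarithmic score for predicted collision probabilities against observed outcomes. Both are standard proper scoring rules; which rules the paper actually uses is not stated in the abstract, and the episode data here are made up.

```python
import math


def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted risk and outcome (0 or 1)."""
    return sum((p - o) ** 2
               for p, o in zip(predicted_probs, outcomes)) / len(outcomes)


def log_score(predicted_probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood of the outcomes; lower is better."""
    total = 0.0
    for p, o in zip(predicted_probs, outcomes):
        p = min(max(p, eps), 1 - eps)   # guard against log(0)
        total += -math.log(p) if o == 1 else -math.log(1 - p)
    return total / len(outcomes)


# Made-up episodes: a collision occurred in 2 of 10.
outcomes = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
calibrated = [0.2] * 10       # predicted risk matches the observed frequency
overconfident = [0.01] * 10   # risk systematically understated
underconfident = [0.6] * 10   # risk systematically overstated

scores = {name: brier_score(probs, outcomes)
          for name, probs in [("calibrated", calibrated),
                              ("overconfident", overconfident),
                              ("underconfident", underconfident)]}
```

The calibrated predictor gets the lowest (best) score under both rules, while both failure modes named in the abstract, overconfidence and underconfidence, are penalized.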

2605.28328 2026-05-28 cs.LG cs.AI

Learning the Error Patterns of Language Models

学习语言模型的错误模式

Jinwoo Kim, Taylor Berg-Kirkpatrick, Loris D'Antoni

AI总结 提出前缀过滤器(prefix filters)来捕捉LLM在特定领域中的错误模式,并通过Palla算法高效学习这些过滤器,从而提升输出有效性,例如在TypeScript生成中将编译率提升60%以上。

详情
AI中文摘要

当为具有特定有效性约束的领域(例如,程序应能编译)生成输出时,LLM通常会在少数集中的方面失败:例如,在生成TypeScript时使用Python函数名。我们观察到这些错误模式可以用少量约束来表示,并且这些约束可以在实践中学习。我们提出前缀过滤器,即针对领域和LLM的符号函数,作为捕捉错误模式的对象,以及Palla算法,用于在实践中高效学习前缀过滤器,并实现了Palla。由Palla学习的前缀过滤器i)帮助我们定量分析LLM的错误模式,ii)可用于通过约束采样算法约束模型的输出。例如,Palla将Qwen2.5-1.5B在TypeScript生成上的编译率提升了超过60%,使得Qwen2.5-1.5B达到与未约束的Llama3.1-8B相似的性能。

英文摘要

When generating outputs for domains with specific validity constraints (e.g., a program should compile), LLMs often fail in a small number of focused ways: for example, by using Python function names when generating TypeScript. We observe that these error patterns can be represented using a small number of constraints that can be learned in practice. We propose prefix filters, which are per-domain-and-LLM symbolic functions, as objects that capture these error patterns, and Palla, an algorithm for learning prefix filters efficiently in practice, which we also implement. Prefix filters learned by Palla i) help us quantitatively analyze the error patterns of LLMs, and ii) can be used to constrain the outputs of a model via constrained sampling algorithms. For example, Palla boosts compile rates for Qwen2.5-1.5B on TypeScript generation by over 60%, allowing Qwen2.5-1.5B to achieve performance similar to that of unconstrained Llama3.1-8B.
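
To show how a prefix filter could plug into constrained decoding, here is a toy sketch. The filter below is hand-written, whereas Palla learns its filters; the banned fragments, function names, and token scores are all hypothetical, and the paper's actual filter representation is not described in the abstract.

```python
# Hypothetical rule set: Python idioms that should not appear in TypeScript.
BANNED_FRAGMENTS = ("len(", "print(", "def ", "elif ")


def prefix_filter(prefix: str) -> bool:
    """Return True if the partial output is still acceptable."""
    return not any(fragment in prefix for fragment in BANNED_FRAGMENTS)


def constrained_greedy_step(prefix, scored_tokens):
    """Pick the highest-scoring token whose extended prefix passes the filter.

    scored_tokens: list of (token, score) pairs from the model (made up here).
    Returns None if every candidate is rejected.
    """
    for token, _score in sorted(scored_tokens, key=lambda ts: ts[1], reverse=True):
        if prefix_filter(prefix + token):
            return token
    return None


# The model's top choice is a Python idiom; the filter steers to the next one.
candidates = [("len(", 0.6), ("xs.length", 0.3), ("0", 0.1)]
chosen = constrained_greedy_step("const n = ", candidates)
```

A substring check like this is deliberately crude (it would also reject a legitimate identifier ending in `print(`), which hints at why learning precise symbolic filters per domain and per model is the interesting part.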

2605.28324 2026-05-28 cs.CV

Inpainting-Style Conditional Diffusion for Multivariable Time Series Forecasting

基于修复风格条件扩散的多变量时间序列预测

Kourosh Kiani, S. M. Muyeen

AI总结 提出一种将多变量时间序列预测重构为图像修复问题的条件扩散框架,通过掩码条件扩散机制和零填充策略实现高精度短期太阳能功率预测。

详情
AI中文摘要

本文提出了一种新颖的基于条件扩散的框架,用于多变量时间序列太阳能功率预测。该方法通过滑动窗口补丁构建将时间序列光伏数据重新表述为结构化二维表示(图像),从而在统一的时空学习范式内应用去噪扩散概率模型(DDPM)。本文的一个关键贡献是将太阳能预测表述为修复问题,其中未来时间步被视为待重建的缺失区域。这是通过基于掩码的条件扩散机制实现的,其中历史观测作为条件上下文保留,而目标(未来)区域则逐步被破坏并通过反向扩散恢复。该模型学习基于观测数据生成连贯的未来序列,有效执行时间序列修复。为了充分利用所有可用特征并确保与U-Net架构约束兼容,引入了一种零填充策略来构建固定大小的输入。模型使用监督去噪目标进行训练,以预测注入的噪声,从而在反向过程中实现准确的迭代重建。在包括GEFCom2014在内的基准光伏数据集上进行的大量实验表明,所提出的方法实现了高预测精度,特别是在短期预测中。结果凸显了将基于扩散的生成建模与修复公式相结合用于稳健、灵活和高保真太阳能功率预测的有效性。

英文摘要

In this paper, we propose a novel conditional diffusion-based framework for multivariable time-series solar power forecasting. The proposed method reformulates temporal PV data as structured two-dimensional representations (images) using a sliding-window patch construction, enabling the application of Denoising Diffusion Probabilistic Models (DDPM) within a unified spatiotemporal learning paradigm. A key contribution of this work is the formulation of solar forecasting as an inpainting problem, where future time steps are treated as missing regions to be reconstructed. This is achieved through a mask-based conditional diffusion mechanism, in which historical observations are preserved as conditioning context while the target (future) region is progressively corrupted and subsequently recovered via reverse diffusion. The model learns to generate coherent future sequences conditioned on observed data, effectively performing time-series inpainting. To fully utilize all available features and ensure compatibility with U-Net architectural constraints, a zero-padding strategy is introduced to construct fixed-size inputs. The model is trained using a supervised denoising objective to predict injected noise, enabling accurate iterative reconstruction during the reverse process. Extensive experiments conducted on benchmark PV datasets, including GEFCom2014, demonstrate that the proposed approach achieves high forecasting accuracy, particularly for short-term horizons. The results highlight the effectiveness of integrating diffusion-based generative modeling with an inpainting formulation for robust, flexible, and high-fidelity solar power forecasting.
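
The masking and zero-padding steps can be sketched as follows. This is a minimal illustration that assumes rows are time steps and columns are features, and that padded cells are treated as known zeros; the paper's actual patch layout, grid size, and padding treatment are not given in the abstract. The function names are hypothetical, and the diffusion model itself is replaced by a placeholder grid.

```python
def build_window(history, horizon, size):
    """Place history rows and empty future rows into a size x size grid.

    history: list of time steps, each a list of feature values.
    Returns (grid, mask): mask is 1 where a cell is known and must be kept
    (history and zero padding), 0 where it must be generated (future steps).
    """
    n_features = len(history[0])
    n_rows = len(history) + horizon
    assert n_rows <= size and n_features <= size, "window does not fit the grid"
    grid = [[0.0] * size for _ in range(size)]   # zero padding by default
    mask = [[1] * size for _ in range(size)]
    for t, row in enumerate(history):
        grid[t][:n_features] = row
    for t in range(len(history), n_rows):
        for f in range(n_features):
            mask[t][f] = 0                       # future cells to inpaint
    return grid, mask


def apply_condition(generated, known, mask):
    """Inpainting constraint: keep known cells, take generated cells elsewhere."""
    return [[k if m else g for g, k, m in zip(g_row, k_row, m_row)]
            for g_row, k_row, m_row in zip(generated, known, mask)]


history = [[1.0, 2.0], [3.0, 4.0]]               # 2 time steps, 2 features
grid, mask = build_window(history, horizon=1, size=4)
sample = [[9.0] * 4 for _ in range(4)]           # stand-in for a model output
forecast_grid = apply_condition(sample, grid, mask)
```

In a mask-conditioned sampler, a step like `apply_condition` would typically run after each reverse-diffusion update so that the history stays fixed while only the future region evolves.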