arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2605.08938 2026-05-28 cs.AI cs.LG

Can We Formally Verify Neural PDE Surrogates? SMT Compilation of Small Fourier Neural Operators

Ali Baheri, Ignacio Laguna Peralta

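Affiliations: Rochester Institute of Technology; Lawrence Livermore National Laboratory

AI summary: The paper compiles the spectral convolution of a Fourier Neural Operator (FNO) into a linear map and encodes it in Z3 as either an exact or an approximate SMT encoding. This allows small FNO surrogates to be formally verified and exposes the tradeoff between soundness and scalability.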

Abstract

Fourier Neural Operators (FNOs) can greatly accelerate PDE simulation, but they are often used without formal guarantees that they preserve basic physical structure. We show that, once the trained weights and grid are fixed, the spectral convolution in an FNO is a linear map. As a result, the full forward pass is piecewise-linear and can be represented exactly in Z3's linear real arithmetic. We study two encodings. The exact encoding compiles the spectral convolution into a dense matrix multiplication, which is sound for both proofs and counterexamples. The lighter frozen encoding replaces the spectral path with a constant, making it faster but approximate. On 10 small FNO surrogates for 1D advection-diffusion-reaction (85 to 117 parameters, grids 8 to 32), the exact encoding gives 2 sound positivity proofs on linear (ReLU-free) models, 5 sound positivity counterexamples, and 10 sound mass-violation counterexamples; the remaining 3 positivity queries on ReLU models time out. For mass non-increase, Z3 finds worse counterexamples than both gradient-based falsification and Monte Carlo on 7 of 10 models. The frozen encoding scales to grid size 64 with sub-second positivity checks, but it no longer provides certificates for the original FNO. Overall, the results make the soundness-scalability tradeoff explicit and point to what is needed for formal verification of production-scale neural operators.
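
The linearity claim in this abstract can be checked numerically on a toy layer. The sketch below is a simplified, single-channel 1D spectral convolution in plain Python; it is not the authors' compiler, and the names `spectral_conv` and `compile_to_matrix`, the weights, and the grid size are illustrative assumptions. It builds a dense matrix column by column from basis vectors and confirms that the matrix reproduces the layer on an arbitrary input.

```python
import cmath

def spectral_conv(x, weights):
    """Toy single-channel spectral convolution on a periodic 1D grid.

    Keeps the len(weights) lowest Fourier modes, scales each by a fixed
    complex weight, drops the rest, and returns the real inverse transform.
    Assumes len(weights) < len(x) / 2.
    """
    n = len(x)
    # Forward DFT restricted to the retained modes, each scaled by its weight.
    scaled = []
    for k, w in enumerate(weights):
        xk = sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
        scaled.append(w * xk)
    # Real inverse transform: modes k >= 1 are doubled to stand in for
    # their conjugate partners at -k.
    out = []
    for j in range(n):
        acc = scaled[0]
        for k in range(1, len(scaled)):
            acc += 2 * scaled[k] * cmath.exp(2j * cmath.pi * j * k / n)
        out.append(acc.real / n)
    return out

def compile_to_matrix(weights, n):
    """Dense n-by-n matrix whose i-th column is the layer applied to basis vector e_i."""
    columns = []
    for i in range(n):
        basis = [1.0 if t == i else 0.0 for t in range(n)]
        columns.append(spectral_conv(basis, weights))
    # Transpose the list of columns into a list of rows.
    return [[columns[i][j] for i in range(n)] for j in range(n)]

def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

# Hypothetical fixed weights (3 retained modes) and an arbitrary input on a grid of 8.
weights = [0.5 + 0.0j, 0.2 - 0.1j, -0.3 + 0.4j]
x = [0.3, -1.2, 0.7, 2.0, 0.0, -0.5, 1.1, 0.9]

M = compile_to_matrix(weights, len(x))
direct = spectral_conv(x, weights)
compiled = matvec(M, x)
max_gap = max(abs(a - b) for a, b in zip(direct, compiled))
print(max_gap)  # agreement up to floating-point rounding
```

An actual solver encoding would also have to fix how the floating-point matrix entries are represented as solver constants; that step is left out of this sketch.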

2605.08758 2026-05-28 cs.RO cs.AI math.OC

Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems

Jiaxin Liu, Peng Yang, Yuping Li, Xinyue Xie

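Affiliations: Institution of Data and Information, Shenzhen International Graduate School, Tsinghua University, Nanshan District, Shenzhen 518055, China

AI summary: For order-fulfillment decisions in tote-handling robotic systems, the paper proposes OLSF-TRS, a general and scalable sequential decision framework that combines structured combinatorial optimization with multi-agent reinforcement learning. It reaches average optimality gaps below 3.5% on small-scale systems, reduces tote movements by 8-12% relative to heuristic baselines in large-scale scenarios, and maintains real-time responsiveness.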

Comments 35 pages, 5 figures

Abstract

Driven by the rapid expansion of e-commerce and small-batch production, the size of the intralogistics load unit of finished goods, semi-finished goods and raw materials is steadily shrinking. Totes are gradually replacing pallets as the primary handling and storage container. This shift has propelled tote-handling robotic systems to the forefront of automated order fulfillment centers. The order-fulfillment decisions of tote-handling robotic systems share a common order-tote-robot sequential decision-making nature. Existing studies primarily focus on decision mechanisms tailored to particular systems, making it difficult to generalize or transfer them to other contexts. We propose an Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems (OLSF-TRS), a generalized and scalable sequential decision framework that combines structured combinatorial optimization with multi-agent reinforcement learning to coordinate order, tote, and robot decisions. On small-scale tote-handling robotic systems, OLSF-TRS achieves near-optimal performance with average optimality gaps below 3.5% across two distinct system configurations. In large-scale scenarios, OLSF-TRS consistently outperforms heuristic baselines across two different system types, reducing total tote movements by 8-12% and over 30% compared to SOTA rule-based approaches, while maintaining real-time responsiveness. These improvements translate into tangible operational benefits, including cost reduction, lower energy consumption, and enhanced throughput stability. The proposed framework provides efficient and unified order-fulfillment decision-making for widely deployed tote-handling robotic systems, supporting high-quality order fulfillment in both e-commerce and industrial logistics sectors.

2605.08678 2026-05-28 cs.LG

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

Bohan Lyu, Yucheng Yang, Siqiao Huang, Jiaru Zhang, Qixin Xu, Xinghan Li, Xinyang Han, Yicheng Zhang, Huaqing Zhang, Runhan Huang, Kaicheng Yang, Zitao Chen, Wentao Guo, Junlin Yang, Xinyue Ai, Wenhao Chai, Yadi Cao, Ziran Yang, Kun Wang, Dapeng Jiang, Huan-ang Gao, Shange Tang, Chengshuai Shi, Simon S. Du, Max Simchowitz, Jiantao Jiao, Dawn Song, Chi Jin

Affiliations: UC Berkeley; Princeton University; Tsinghua University; University of Washington; Purdue University; Harvard University; University of Pennsylvania; Shanghai Jiao Tong University; UC San Diego; Carnegie Mellon University

AI summary: The paper introduces MLS-Bench, a benchmark of 140 tasks across 12 domains that evaluates whether AI systems can invent generalizable and scalable machine learning methods. Current agents remain far behind humans at method invention, and the bottleneck is scientific insight rather than search or compute alone.

Abstract

Modern AI progress has been driven by ML methods that are generalizable across settings and scalable to larger regimes. As large language models demonstrate advanced capabilities in reasoning, coding, and engineering tasks, it is increasingly important to understand whether they can discover such methods rather than only apply existing ones. We introduce MLS-Bench, a benchmark for evaluating whether AI systems can invent generalizable and scalable ML methods. MLS-Bench contains 140 tasks across 12 domains, each requiring an agent to improve one targeted component of an ML system or algorithm and demonstrate that the improvement generalizes across controlled settings and scales. We find that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. We further study the effects of test-time scaling, adaptive compute allocation, and context provision on agents' discovery performance, together with case studies of their behavior. Our analyses suggest that the bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck. We build and maintain a community platform for cumulative and comparable iteration, and release the data and code at https://mls-bench.com.

2604.24938 2026-05-28 cs.LG cs.AI cs.CL

Rethinking Layer Redundancy: Calibration Matters More Than Search in LLM Depth Pruning

Minkyu Kim, Vincent-Daniel Yun, Youngrae Kim, Suin Cho, Woosang Lim, Sunwoo Lee

Affiliations: Neural Superintelligence Lab, MODULABS; University of Southern California; Boston University; Seoul National University; Inha University

AI summary: The paper finds experimentally that, in depth pruning of large language models, the calibration configuration affects pruning patterns and performance far more than the choice of search algorithm.

Comments Preprint

Abstract

Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work typically treats layer redundancy as an inherent structural property of pretrained networks, emphasizing importance criteria and search algorithms to identify removable layers. In this study, we empirically investigate depth pruning from a functional perspective. Evaluating representative LLM families across diverse calibration configurations and multiple search algorithms, we show that different configurations produce different pruning patterns. Furthermore, under a fixed calibration configuration, complex search algorithms yield marginal performance improvements over simple one-shot methods, converging to similar pruned subsets. Overall, our results suggest that the calibration configuration plays a substantially larger role than the choice of search algorithm in shaping pruning patterns and calibration perplexity, while contributing comparably to variance in downstream reasoning accuracy. This indicates that future pruning efforts may benefit from prioritizing the calibration configuration over search complexity.

2605.06915 2026-05-28 cs.LG

LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs

Chacha Chen, Matthew Jörke, Adam Goliński, Masha Fedzechkina, Guillermo Sapiro, Sinead Williamson, Nicholas Foti

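Affiliations: Apple; Stanford University; Princeton University

AI summary: The paper uses the information processing gap to study internal inconsistencies in how LLMs update their probabilistic beliefs. It finds that non-Bayesian heuristic updates often outperform exact Bayesian computation on downstream tasks, which indicates that the LLMs' probabilistic models of the world are misspecified.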

Abstract

Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new evidence arrives. We introduce the novel technique of studying LLMs as information processing rules and utilize the information processing gap to study the internal (in)consistencies of how LLMs update their probabilistic beliefs from evidence. Our extensive experiments evaluate multiple approaches in which LLMs can incorporate evidence into their beliefs. Some of these approaches produce (nearly) Bayesian updates; others seem to use a learned heuristic. Surprisingly, the non-Bayesian heuristic updates often outperform exact Bayesian computation in terms of downstream task performance, indicating that the LLMs' probabilistic models of the world are misspecified. Lastly, we show how our measure can provide diagnostics to identify issues with LLM-powered inferential systems.

2510.25781 2026-05-28 cs.LG cs.AI cs.NA cs.NE math.NA

A Practitioner's Guide to Kolmogorov-Arnold Networks

Amir Noorizadegan, Sifan Wang, Leevan Ling, Juan P. Dominguez-Morales

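Affiliations: Department of Mathematics, Hong Kong Baptist University; Institution for Foundations of Data Science, Yale University; Robotics and Technology of Computers Lab., Universidad de Sevilla

AI summary: The paper systematically reviews Kolmogorov-Arnold Networks (KANs), which are inspired by the Kolmogorov superposition theorem. It covers theoretical foundations, basis functions as the central design axis, and recent advances, and it offers a practical selection guide and future directions.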

Abstract

Kolmogorov-Arnold Networks (KANs), whose design is inspired, rather than dictated, by the Kolmogorov superposition theorem, have emerged as a structured alternative to MLPs. This review provides a systematic and comprehensive overview of the rapidly expanding KAN literature. The review is organized around three core themes: (i) clarifying the relationships between KANs and Kolmogorov superposition theory (KST), MLPs, and classical kernel methods; (ii) analyzing basis functions as a central design axis; and (iii) summarizing recent advances in accuracy, efficiency, regularization, and convergence. Finally, we provide a practical "Choose-Your-KAN" guide and outline open research challenges and future directions. The accompanying GitHub repository serves as a structured reference for ongoing KAN research.
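
For reference, the theorem named in this abstract can be stated as follows. This is the standard textbook form of the Kolmogorov superposition theorem, not a formula quoted from the review, and the symbols $\Phi_q$ and $\phi_{q,p}$ are the conventional ones. Any continuous function $f:[0,1]^n \to \mathbb{R}$ can be written as

```latex
f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

where every $\Phi_q$ and $\phi_{q,p}$ is a continuous function of a single variable. A KAN layer keeps this "sum of univariate functions" form but makes the univariate functions learnable and allows arbitrary width and depth, which is one sense in which the design is inspired rather than dictated by the theorem.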

2605.00435 2026-05-28 cs.CL cond-mat.dis-nn cs.AI nlin.CD

Escaping Mode Collapse in LLM Generation via Geometric Regulation

Xin Du, Kumiko Tanaka-Ishii

Affiliations: Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan; Department of Computer Science and Engineering, Waseda University, Tokyo, Japan; Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China

AI summary: Taking a dynamical-systems view, the paper interprets mode collapse as geometric collapse and proposes RMR, a lightweight online state-space intervention that regulates self-reinforcing directions in the Transformer value cache through low-rank damping. RMR substantially reduces mode collapse and enables stable generation at extremely low entropy rates.

Comments Accepted to ICML 2026

Abstract

Mode collapse is a persistent challenge in generative modeling and appears in autoregressive text generation as behaviors ranging from explicit looping to gradual loss of diversity and premature trajectory convergence. We take a dynamical-systems view and reinterpret mode collapse as reduced state-space accessibility caused by *geometric collapse*: during generation, the model's internal trajectory becomes confined to a low-dimensional region of its representation space. This implies mode collapse is not purely a token-level phenomenon and cannot be reliably solved by symbolic constraints or probability-only decoding heuristics. Guided by this perspective, we propose *Reinforced Mode Regulation* (RMR), a lightweight, online state-space intervention that regulates dominant self-reinforcing directions in the Transformer value cache (implemented as low-rank damping). Across multiple large language models, RMR substantially reduces mode collapse and enables stable generation at extremely low entropy rates (down to 0.8 nats/step), whereas standard decoding typically collapses near 2.0 nats/step.

2508.05417 2026-05-28 cs.CV

Smoothing Slot Attention Iterations and Recurrences

Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen

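Affiliations: Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland; Department of Computer Science, Aalto University, Espoo, Finland; Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland

AI summary: Slot Attention cold-start queries lack sample-specific cues on images and on the first frame of a video, and the same aggregation transform is applied to every video frame. The paper proposes SmoothSA, which preheats the cold-start queries and uses different iteration counts for first and non-first frames to smooth iterations and recurrences, improving object discovery, recognition, and reasoning.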

Comments Accepted to ICML 2026

Abstract

Slot Attention (SA) lies at the heart of mainstream Object-Centric Learning (OCL). Image features can be aggregated into object-level representations by SA iteratively refining cold-start query slots. For video, such aggregation proceeds by SA recurrently shared across frames, with queries cold-started on the first frame and transitioned from the previous frame's slots thereafter. However, cold-start queries lack sample-specific cues, which hinders precise aggregation on an image or a video's first frame; non-first frames' queries are already sample-specific and thus require aggregation transforms different from those of the first frame. We address these issues with SmoothSA: (1) to smooth SA iterations on an image or a video's first frame, we preheat cold-start queries with rich input-feature information, using a tiny module self-distilled inside OCL; (2) to smooth SA recurrences across a video's first and non-first frames, we differentiate the homogeneous aggregation transforms by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and visual reasoning validate our method's effectiveness. Further visual analyses illuminate the underlying mechanisms. Our source code, model checkpoints and training logs are provided at https://github.com/Genera1Z/SmoothSA.

2604.27251 2026-05-28 cs.CL cs.AI

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras

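Affiliations: School of Computer Science, University of Sheffield; School of EECS, Queen Mary University of London; The Alan Turing Institute

AI summary: Through the lens of reasoning conflicts, the paper systematically studies whether large language models prioritize compliance with instructions or sensibility when an induced logical schema conflicts with the one the task expects, and it explores internal detection and activation-level interventions.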

Abstract

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability, and for shedding light on reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores significantly drop during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.

2603.09117 2026-05-28 cs.LG cs.AI cs.CL

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun

Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China

AI summary: To address calibration degeneration in RLVR, the paper proposes DCPO, a framework that decouples the reasoning and calibration objectives, preserving accuracy while substantially improving calibration and mitigating over-confidence.

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies directly incorporate a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that a fundamental gradient conflict exists between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.
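
To make "calibration error" concrete, the sketch below computes the expected calibration error (ECE) with equal-width bins. The abstract does not say which calibration metric the paper reports, so the choice of ECE is an assumption, and the function name and toy inputs are hypothetical rather than taken from DCPO.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    confidences: stated probability that each answer is right.
    correct: 1 if the answer was right, 0 otherwise.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's confidence-accuracy gap by its share of samples.
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# Over-confident toy model: 90% stated confidence, 50% actual accuracy.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
# Calibrated toy model: 75% stated confidence, 75% actual accuracy.
calibrated = expected_calibration_error([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0])
print(overconfident, calibrated)
```

The over-confident case yields a large gap between stated confidence and accuracy, which is the failure mode the abstract describes; the calibrated case yields a gap of zero.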

2410.04096 2026-05-28 cs.LG cs.AI cs.NA cs.NE math.NA physics.comp-ph

Sinc Kolmogorov-Arnold network and its application for solving PDEs with singularities

Tianchi Yu, Jingwei Qiu, Jiang Yang, Ivan Oseledets

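Affiliations: Skolkovo Institute of Science and Technology; Southern University of Science and Technology; International Center for Mathematics; National Center for Applied Mathematics Shenzhen (NCAMS)

AI summary: The paper proposes Sinc interpolation as the learnable activation function in Kolmogorov-Arnold Networks, so that both smooth functions and functions with singularities are approximated effectively, and it reports better results when solving partial differential equations with physics-informed neural networks.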

Journal ref Neural Networks 2026

Abstract

In this paper, we propose to use Sinc interpolation in the context of Kolmogorov-Arnold Networks, neural networks with learnable activation functions, which have recently gained attention as an alternative to multilayer perceptrons. Many different function representations have already been tried, but we show that Sinc interpolation offers a viable alternative, since it is known in numerical analysis to effectively represent both smooth functions and functions with singularities. This is important not only for function approximation but also for solving partial differential equations with physics-informed neural networks. Through a series of experiments, we show that SincKANs provide better results in almost all of the examples we have considered.
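
As background for the interpolation scheme this abstract refers to, here is a minimal sketch of classical Sinc (cardinal series) interpolation on a uniform grid. It is not the SincKAN layer, whose construction the abstract does not describe; the step size `h`, the truncation `n_terms`, and the Gaussian test function are illustrative assumptions.

```python
import math

def sinc(t):
    """Normalized sinc: sin(pi t) / (pi t), with sinc(0) = 1."""
    if t == 0.0:
        return 1.0
    return math.sin(math.pi * t) / (math.pi * t)

def sinc_interpolate(f, x, h=0.5, n_terms=12):
    """Truncated cardinal series: sum over k of f(k h) * sinc((x - k h) / h)."""
    return sum(
        f(k * h) * sinc((x - k * h) / h)
        for k in range(-n_terms, n_terms + 1)
    )

def gaussian(x):
    # Smooth, rapidly decaying test function.
    return math.exp(-x * x)

x_test = 0.3  # a point that is not on the sampling grid
approx = sinc_interpolate(gaussian, x_test)
error = abs(approx - gaussian(x_test))
print(approx, gaussian(x_test), error)
```

At the grid nodes the series reproduces the samples (up to rounding), because every sinc term except the one centred on that node vanishes; between nodes the accuracy depends on `h` and on how fast the function decays.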

2604.25491 2026-05-28 cs.CV cs.AI

The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing

Gautier Evennou, Ewa Kijak

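Affiliations: IMATAG; IRISA, Univ. Rennes, INRIA, CNRS

AI summary: The paper proposes Watermark Removal Detection (WRD) as a new evaluation axis. Classifiers trained to detect removal artifacts achieve state-of-the-art detection at a $10^{-3}$ false positive rate, which establishes forensic stealthiness as a necessary requirement for watermark removal.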

Comments v1: The Forensic Cost of Watermark Removal, accepted at IH&MMSEC 2026, Special Session "Watermarking Across the Lifecycle of Generative Models". v2: extended version, under review

Abstract

Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. While state-of-the-art attacks successfully degrade the watermark signal without visible distortion, they leave distinct statistical artifacts that betray the removal attempt. We name this overlooked axis Watermark Removal Detection (WRD) and demonstrate that a modern classifier trained on these artifacts achieves state-of-the-art detection rates at $10^{-3}$ FPR across every removal method tested. No existing attack accounts for this forensic leakage. We benchmark leading watermarking schemes against standard removal pipelines under the extended evaluation triple of attack success, perceptual quality, and forensic detectability, and find that no current method balances all three. Our results establish forensic stealthiness as a necessary requirement for watermark removal.
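
The "detection rate at $10^{-3}$ FPR" figure is an operating-point metric: a threshold is chosen so that at most one clean image in a thousand is flagged, and the fraction of attacked images above that threshold is reported. The sketch below shows one common way to compute it. The function name and the synthetic scores are hypothetical and carry no information about the paper's detector or results.

```python
def tpr_at_fpr(negative_scores, positive_scores, target_fpr=1e-3):
    """Detection rate at a threshold whose false positive rate stays within target_fpr.

    A sample is flagged when its score is strictly greater than the threshold.
    """
    ranked = sorted(negative_scores, reverse=True)
    # Number of clean samples that may be flagged without exceeding the target.
    allowed = int(target_fpr * len(ranked) + 1e-9)
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    fpr = sum(s > threshold for s in negative_scores) / len(negative_scores)
    tpr = sum(s > threshold for s in positive_scores) / len(positive_scores)
    return tpr, fpr, threshold

# Synthetic detector scores: 2000 untouched images and 200 images after removal.
clean = [i / 2000 for i in range(2000)]
attacked = [0.9 + i / 1000 for i in range(200)]

tpr, fpr, threshold = tpr_at_fpr(clean, attacked)
print(tpr, fpr, threshold)
```

Because the target rate is so small, the number of clean samples bounds how finely the threshold can be set: with fewer than 1000 negatives, no false positive at all is allowed under this rule.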

2510.24941 2026-05-28 cs.LG

Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought

Jiachen Zhao, Yiyou Sun, Weiyan Shi, Dawn Song

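Affiliations: Northeastern University; University of California, Berkeley

AI summary: The paper proposes the True Thinking Score (TTS), which quantifies the causal contribution of each chain-of-thought step to the final answer. It finds that models often mix true thinking with decorative thinking, uses TTS for effective pruning and self-training, and shows that frontier models often verbalize reasoning steps that are not causally used.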

Abstract

Large language models can generate long chain-of-thought (CoT) reasoning, yet prior work using explicitly designed settings suggests that CoT can be post-hoc rationalization rather than a faithful reflection of the computation. In this work, we go further and propose a True Thinking Score (TTS) to quantify the causal contribution of each step in CoT to the model's final prediction in realistic reasoning problems. Across eleven models ranging from 1.5B to 1.1T parameters on common reasoning benchmarks, we find that CoTs often interleave true-thinking steps, which causally affect the final answer, with decorative-thinking steps, which appear useful but have little causal influence. Such decorative steps remain prevalent even for frontier models: over 30% of steps in Kimi-K2.6 are decorative on MATH with TTS <= 0.005. Furthermore, TTS enables effective CoT pruning: removing 50% of CoT steps with the lowest TTS can largely maintain the performance. Self-training on these pruned CoTs reduces reasoning length by 66% while preserving performance on Nemotron3-Nano-30B. Finally, we provide a mechanistic analysis showing that LLMs can be steered in the latent space to engage or disengage with reasoning steps. Overall, our results reveal that frontier LLMs often verbalize reasoning steps that are not causally used, challenging both the efficiency and the trustworthiness of CoT.

2604.23472 2026-05-28 cs.AI

Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization

Ziyang Liu, Xinyan Guo, Xuchen Wei, Han Hao, Liu Yang

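Affiliations: Shenzhen X-Institute; Soochow University; Shenzhen Loop Area Institute; Tsinghua University; National University of Singapore

AI summary: The paper proposes Escher-Loop, a framework in which task agents and optimizer agents co-evolve in a closed loop with a dynamic benchmarking mechanism, delivering sustained performance gains beyond static baselines.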

Comments The first three authors contributed equally. Corresponding Authors: Han Hao, Liu Yang

Abstract

While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted heuristics, inherently limiting their potential for open-ended improvement. To address this, we propose Escher-Loop, a fully closed-loop framework that operationalizes the mutual evolution of two distinct populations: Task Agents that solve concrete problems, and Optimizer Agents that recursively refine both the task agents and themselves. To sustain this self-referential evolution, we propose a dynamic benchmarking mechanism that seamlessly reuses the empirical scores of newly generated task agents as relative win-loss signals to update optimizers' scores. This mechanism leverages the evolution of task agents as an inherent signal to drive the evaluation and refinement of optimizers without additional overhead. Empirical evaluations on mathematical optimization problems demonstrate that Escher-Loop effectively pushes past the performance ceilings of static baselines, achieving the highest absolute peak performance across all evaluated tasks under matched compute. Remarkably, we observe that the optimizer agents dynamically adapt their strategies to match the shifting demands of high-performing task agents, which explains the system's continuous improvement and superior late-stage performance.

2604.23282 2026-05-28 cs.CV cs.MM

Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search

Zequn Xie, Guijin Luo, Chuxin Wang, Sihang Cai, Tao Jin, Zhou Zhao, Yixuan Tang

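Affiliations: Zhejiang University; National University of Singapore

AI summary: The paper proposes the Structure-Semantic Decoupled Cascade (SSDC) framework, whose two-stage retrieval (structure-aware coarse retrieval followed by multi-agent semantic verification) balances efficiency and semantic reasoning and achieves state-of-the-art performance on the PAB benchmark.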

Comments Accepted to ACL 2026. 10 pages, 5 figures

Abstract

Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.

2604.23061 2026-05-28 cs.LG cs.AI

C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs

Rui Gao, Youngseung Jeon, Swastik Roy, Morteza Ziyadi, Xiang 'Anthony' Chen

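Affiliations University of California, Los Angeles; Amazon

AI summary Proposes C-MORAL, a reinforcement learning post-training framework that combines group-based relative optimization, property score alignment, and bottleneck-sensitive non-linear reward aggregation for controllable multi-objective molecular optimization, achieving the best performance among compared methods on the C-MuMOInstruct and S$^2$-Bench MolOpt benchmarks.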

Comments 26 pages, 7 figures

Abstract

Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraints remains challenging. We propose C-Moral, a reinforcement learning post-training framework for controllable multi-objective molecular optimization. C-Moral combines group-based relative optimization, property score alignment for heterogeneous objectives, and bottleneck-sensitive non-linear reward aggregation to improve stability across competing molecular properties. Experiments on C-MuMOInstruct and S$^2$-Bench MolOpt show that C-Moral achieves the best performance among compared methods on both benchmarks. On C-MuMOInstruct, C-Moral achieves the best Success Optimized Rate (SOR) of 48.9% on in-domain tasks and 39.5% on out-of-domain tasks while preserving scaffold similarity. On S$^2$-Bench MolOpt, it also achieves the strongest results across LogP, MR, and QED optimization tasks. These results suggest that C-Moral is an effective way to align molecular LLMs with continuous and constrained molecular design objectives. Our code and models are publicly available at https://github.com/Rwigie/C-MORAL.
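To make "bottleneck-sensitive non-linear reward aggregation" concrete, here is a small Python sketch of one possible aggregation, a Boltzmann-weighted soft minimum over per-property scores. The abstract does not give the paper's formula, so the function `bottleneck_reward` and its `temperature` parameter are assumptions chosen only to show why a non-linear aggregate behaves differently from a plain mean.

```python
import numpy as np

def bottleneck_reward(scores, temperature=0.1):
    """Soft minimum of per-property scores (assumed to lie in [0, 1]).

    Low scores receive exponentially larger weights, so the weakest property
    dominates the aggregate. This is an illustrative choice, not the paper's.
    """
    s = np.asarray(scores, dtype=float)
    # Shift by the minimum before exponentiating for numerical stability.
    w = np.exp(-(s - s.min()) / temperature)
    return float(np.sum(w * s) / np.sum(w))

balanced = [0.6, 0.6, 0.6]      # every property moderately satisfied
lopsided = [0.9, 0.9, 0.1]      # higher mean, but one property fails badly
print(bottleneck_reward(balanced), np.mean(balanced))
print(bottleneck_reward(lopsided), np.mean(lopsided))
```

Under a plain mean the lopsided candidate scores higher, while the soft minimum ranks the balanced candidate first, which is the behavior one would want when a single failed constraint makes a molecule unusable.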

2604.19072 2026-05-28 cs.LG cs.AI stat.ML

S2MAM: Semi-supervised Meta Additive Model for Robust Estimation and Variable Selection

Xuelin Zhang, Hong Chen, Yingjie Wang, Tieliang Gong, Bin Gu

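Affiliations Huazhong Agricultural University; China University of Petroleum (East China); Xi'an Jiaotong University; Jilin University

AI summary Proposes a semi-supervised meta additive model based on bilevel optimization that automatically identifies informative variables, updates the similarity matrix, and produces interpretable predictions; convergence and generalization bounds are established theoretically, and experiments validate robustness and interpretability.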

Comments Accepted to ICML 2026 (regular)

Abstract

Semi-supervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, where the key requirement is that the support of the unknown marginal distribution has the geometric structure of a Riemannian manifold. Typically, the Laplace-Beltrami operator-based manifold regularization can be approximated empirically by the Laplacian regularization associated with the entire training data and its corresponding graph Laplacian matrix. However, the graph Laplacian matrix depends heavily on the prespecified similarity metric and may lead to inappropriate penalties when dealing with redundant or noisy input variables. To address these issues, this paper proposes a new Semi-Supervised Meta Additive Model (S$^2$MAM) based on a bilevel optimization scheme that automatically identifies informative variables, updates the similarity matrix, and simultaneously achieves interpretable predictions. Theoretical guarantees are provided for S$^2$MAM, including computational convergence and a statistical generalization bound. Experimental assessments across 4 synthetic and 12 real-world datasets, with varying levels and categories of corruption, validate the robustness and interpretability of the proposed approach.
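The sensitivity of the graph Laplacian penalty to the choice of input variables can be seen in a short Python sketch. It uses the standard Laplacian regularizer f^T L f with a Gaussian similarity; the per-variable `mask` is a hypothetical stand-in for learned variable selection, and nothing here reproduces the bilevel scheme of S$^2$MAM itself.

```python
import numpy as np

def laplacian_penalty(X, f, mask, bandwidth=1.0):
    """Graph Laplacian regularizer f^T L f with similarities built on masked variables.

    mask holds one non-negative weight per input variable (an illustrative
    assumption): 0 drops a variable from the similarity, 1 keeps it.
    """
    Xw = X * mask
    sq_dist = np.sum((Xw[:, None, :] - Xw[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / (2 * bandwidth ** 2))   # Gaussian similarity matrix
    np.fill_diagonal(W, 0.0)                      # no self-loops
    L = np.diag(W.sum(axis=1)) - W                # unnormalized graph Laplacian
    return float(f @ L @ f)

# Variable 0 separates the two groups; variable 1 is unrelated to the target.
X = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]])
f = np.array([0.0, 0.0, 1.0, 1.0])
print(laplacian_penalty(X, f, np.array([1.0, 0.0])))  # similarity from the informative variable
print(laplacian_penalty(X, f, np.array([0.0, 1.0])))  # similarity from the uninformative variable
```

When the similarity is computed from the uninformative variable, points from different groups look like neighbors and the correct target function is penalized heavily, which is the failure mode the abstract attributes to a prespecified similarity metric.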

2604.21534 2026-05-28 cs.CL

UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

Darya Hryhoryeva, Amaia Zurinaga, Hamidreza Jamalabadi, Iryna Gurevych

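Affiliations Ubiquitous Knowledge Processing Lab (UKP Lab); Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE; Psychiatric Control Systems Lab; Marburg University

AI summary For SemEval-2026 Task 2, proposes three complementary approaches (LLM prompting, a pairwise maximum entropy model, and a lightweight neural regression model) to model current affect and short-term affective change in text. LLMs capture static affective signals well, whereas short-term change depends more on numeric trajectories; the system ranked first in Subtask 1 and Subtask 2A.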

Comments Accepted to SemEval 2026 (co-located with ACL 2026)

Abstract

This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. Our system ranked first among participating teams in both Subtask 1 and Subtask 2A based on the official evaluation metric.

2604.20996 2026-05-28 cs.CL

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

Tadesse Destaw Belay, Shahriar Kabir Nahin, Israel Abebe Azime, Ocean Monjur, Marek Rei, Chris Biemann, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam, Anshuman Chhabra

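Affiliations Instituto Politécnico Nacional; University of South Florida; Saarland University; Imperial College London; University of Hamburg

AI summary To address the lack of training data for low-resource languages, introduces the AFRILANGDICT dictionary resource, builds the AFRILANGEDU dataset, and trains AFRILANGTUTOR models with supervised fine-tuning and direct preference optimization, improving tutoring performance over the base models across 10 African languages.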

Abstract

How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs, Llama-3-8B-IT and Gemma-3-12B-IT, on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and that combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages, all resources are available at https://huggingface.co/afrilang-edu.

2604.05673 2026-05-28 cs.RO cs.AI

Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma

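Affiliations School of Artificial Intelligence, Jilin University; College of Computer Science, Chongqing University; Department of Computer Science, University of Liverpool; Changchun GenY Technology Co., Ltd.

AI summary Proposes the Rectified Schrödinger Bridge Matching (RSBM) framework, which exploits velocity-field structure invariance and a linear reduction in variance to obtain a high-fidelity generative policy in only 3 integration steps, meeting the low-latency demands of Embodied AI.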

Comments 18 pages, 7 figures, 10 tables. Code available at https://github.com/WuyangLuan/RSBM

Abstract

Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into continuous, long-horizon action trajectories. While generative policies based on diffusion models and Schrödinger Bridges (SB) effectively capture multimodal action distributions, they require dozens of integration steps due to high-variance stochastic transport, posing a critical barrier for real-time robotic control. We propose Rectified Schrödinger Bridge Matching (RSBM), a framework that exploits a shared velocity-field structure between standard Schrödinger Bridges ($\varepsilon=1$, maximum-entropy transport) and deterministic Optimal Transport ($\varepsilon\to 0$, as in Conditional Flow Matching), controlled by a single entropic regularization parameter $\varepsilon$. We prove two key results: (1) the conditional velocity field's functional form is invariant across the entire $\varepsilon$-spectrum (Velocity Structure Invariance), enabling a single network to serve all regularization strengths; and (2) reducing $\varepsilon$ linearly decreases the conditional velocity variance, enabling more stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate $\varepsilon$ that balances multimodal coverage and path straightness. Empirically, while standard bridges require $\geq 10$ steps to converge, RSBM achieves over 94% cosine similarity and 92% success rate in merely 3 integration steps -- without distillation or multi-stage training -- substantially narrowing the gap between high-fidelity generative policies and the low-latency demands of Embodied AI.
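The claim that the conditional velocity variance shrinks linearly with $\varepsilon$ can be checked numerically on the standard Brownian-bridge construction used in Schrödinger-bridge flow matching. The Python sketch below assumes that construction (mean $(1-t)x_0 + t x_1$, variance $\varepsilon\,t(1-t)$); whether RSBM uses exactly this parameterization is an assumption, and `bridge_velocity_target` is a hypothetical name.

```python
import numpy as np

def bridge_velocity_target(x0, x1, t, eps, rng):
    """Sample x_t from a Brownian bridge between x0 and x1 and return the
    conditional probability-flow velocity at that sample.

    Requires 0 < t < 1. The velocity formula has the same functional form
    for every eps; eps only changes how widely x_t is spread.
    """
    mean = (1 - t) * x0 + t * x1
    std = np.sqrt(eps * t * (1 - t))
    xt = mean + std * rng.standard_normal(np.shape(x0))
    velocity = (x1 - x0) + (1 - 2 * t) / (2 * t * (1 - t)) * (xt - mean)
    return xt, velocity

def velocity_variance(eps, t=0.25, n=10_000, seed=0):
    """Empirical variance of the velocity target for fixed endpoints 0 and 1."""
    rng = np.random.default_rng(seed)
    _, v = bridge_velocity_target(np.zeros(n), np.ones(n), t, eps, rng)
    return v.var()

for eps in (1.0, 0.5, 0.25):
    print(eps, velocity_variance(eps))
```

In this construction the variance equals $\varepsilon\,(1-2t)^2 / (4t(1-t))$, so halving $\varepsilon$ halves the variance of the regression target, and at $\varepsilon \to 0$ the target collapses to the straight-line velocity $x_1 - x_0$.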

2604.13583 2026-05-28 cs.CL cs.AI

BenGER Platform: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

Sebastian Nagl, Matthias Grabmair

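Affiliations Technical University of Munich

AI summary Presents BenGER, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and multi-dimensional evaluation, and supports multi-organization projects with tenant isolation, making legal-reasoning benchmarking transparent and reproducible end to end.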

Comments Preprint - Accepted at ICAIL 2026

Abstract

Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.

2604.19669 2026-05-28 cs.LG

HardNet++: Nonlinear Constraint Enforcement in Neural Networks

Andrea Goertzen, Kaveh Alim, Youngjae Min, Navid Azizan

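Affiliations Massachusetts Institute of Technology

AI summary Proposes a method that enforces linear and nonlinear equality and inequality constraints by iteratively adjusting the network output via damped local linearizations, proves that arbitrary tolerance is reachable under regularity conditions, and applies it to a nonlinear model predictive control problem, achieving tight constraint satisfaction without loss of optimality.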

Abstract

Enforcing constraint satisfaction in neural network outputs is critical for safety, reliability, and physical fidelity in many control and decision-making applications. While soft-constrained methods penalize constraint violations during training, they do not guarantee constraint adherence during inference. Other approaches guarantee constraint satisfaction via a projection layer, but often rely on the existence of a tractable projection onto the feasible set, limiting their utility in more general problem settings. Many real-world problems of interest are nonlinear and lack the special structure admitting a tractable projection, motivating the development of methods that can enforce general nonlinear constraints. To this end, we introduce HardNet++, a constraint-satisfaction method that enforces linear and nonlinear equality and inequality constraints. Our approach iteratively adjusts the network output via damped local linearizations of the constraints. Each iteration is differentiable, admitting an end-to-end training framework, where the constraint satisfaction layer is active during training. We show that under certain regularity conditions, this procedure enforces nonlinear constraint satisfaction to arbitrary tolerance. Finally, we demonstrate tight constraint adherence without loss of optimality in a learning-for-optimization context, where we apply this method to a nonlinear model predictive control problem.
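The idea of iteratively adjusting an output through damped local linearizations can be sketched in Python for the equality-constrained case. This is a generic damped minimum-norm Newton correction, not the paper's exact update, which the abstract does not specify. The names `enforce_equality`, `damping`, `tol`, and `max_iter` are hypothetical, and inequality constraints are not handled here.

```python
import numpy as np

def enforce_equality(y, h, jac, damping=0.5, tol=1e-8, max_iter=100):
    """Correct y until h(y) is approximately zero.

    Each iteration linearizes h around the current y, takes the minimum-norm
    step that would satisfy the linearized constraint, and applies only a
    fraction `damping` of it.
    """
    y = np.asarray(y, dtype=float).copy()
    for _ in range(max_iter):
        residual = np.atleast_1d(h(y))
        if np.linalg.norm(residual) <= tol:
            break
        J = np.atleast_2d(jac(y))
        # Minimum-norm solution of J @ step = -residual via the pseudoinverse.
        step = -np.linalg.pinv(J) @ residual
        y = y + damping * step
    return y

# Example: pull a raw output onto the unit circle, h(y) = ||y||^2 - 1.
raw_output = [2.0, 1.0]
y = enforce_equality(raw_output, lambda v: v @ v - 1.0, lambda v: 2.0 * v)
print(y, y @ y)
```

Every operation in the loop is differentiable, so the same iteration written in an autodiff framework could sit at the end of a network during training, which is the property the abstract highlights.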

2604.19355 2026-05-28 cs.LG cs.AI cs.CE

LASER: Learning Active Sensing for Continuum Field Reconstruction

Huayu Deng, Jinghui Zhong, Xiangming Zhu, Yunbo Wang, Xiaokang Yang

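Affiliations MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University

AI summary Proposes the LASER framework, which formulates active sensing as a partially observable Markov decision process and uses a continuum field latent world model with a reinforcement learning policy to simulate sensing scenarios in a latent imagination space, achieving high-fidelity reconstruction under sparse sensing constraints.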

Comments Accepted by ICML 2026 (Oral)

Abstract

High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate "what-if" sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.

2604.18758 2026-05-28 cs.CL

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

Abhishek Purushothama, Emma Thronson, Alexia Guo, Amir Zeldes

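Affiliations Corpling Lab; Georgetown University

AI summary Proposes an in-context learning approach that combines Universal Dependencies parses with a bilingual dictionary for low-resource Coptic-to-English machine translation, achieving new state-of-the-art results.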

Comments ACL 2026 Findings camera-ready, with fixes

Abstract

Low-resource machine translation requires methods that differ from those used for high-resource languages. This paper proposes a novel in-context learning approach to support low-resource machine translation of the Coptic language to English, with syntactic augmentation from Universal Dependencies parses of input sentences. Building on existing work using bilingual dictionaries to support inference for vocabulary items, we add several representations of syntactic analyses to our inputs, specifically exploring the inclusion of raw parser outputs, verbalizations of parses in plain English, and targeted instructions on difficult constructions identified in sub-trees and how they can be translated. Our results show that while syntactic information alone is not as useful as dictionary-based glosses, combining retrieved dictionary items with syntactic information yields significant gains across model sizes and sets new state-of-the-art translation results for Coptic.

2601.11632 2026-05-28 cs.CV

KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering

Zhiyang Li, Ao Ke, Yukun Cao, Xike Xie

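Affiliations University of Science and Technology of China; Data Darkness Lab, MIRACLE Center, USTC; School of Computer Science and Technology, Xidian University

AI summary Proposes the KG-ViP framework, which retrieves and fuses scene graphs and commonsense graphs to unify external knowledge with fine-grained visual details, mitigating knowledge hallucination and insufficient visual perception in multi-modal LLMs for visual question answering.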

Abstract

Multi-modal Large Language Models (MLLMs) for Visual Question Answering (VQA) often suffer from dual limitations: knowledge hallucination and insufficient fine-grained visual perception. Crucially, we identify that commonsense graphs and scene graphs provide precisely complementary solutions to these respective deficiencies by providing rich external knowledge and capturing fine-grained visual details. However, prior works typically treat them in isolation, overlooking their synergistic potential. To bridge this gap, we propose KG-ViP, a unified framework that empowers MLLMs by fusing scene graphs and commonsense graphs. The core of the KG-ViP framework is a novel retrieval-and-fusion pipeline that utilizes the query as a semantic bridge to progressively integrate both graphs, synthesizing a unified structured context that facilitates reliable multi-modal reasoning. Extensive experiments on FVQA 2.0+ and MVQA benchmarks demonstrate that KG-ViP significantly outperforms existing VQA methods.

2604.18530 2026-05-28 cs.AI

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang

Affiliations Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; Hithink RoyalFlush Information Network, Hangzhou, China; Computer and Information Science, University of Macau, Macau, China

AI summary Proposes the OGER framework, which unifies offline teacher guidance and online reinforcement learning through multi-teacher collaborative training and an entropy-based auxiliary exploration reward, improving LLM exploration on mathematical reasoning and generalization tasks.

Abstract

Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial policy distribution. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER (Offline-Guided Exploration Reward), a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER consistently outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. Our code is available at https://github.com/ecoli-hit/OGER.git.

2604.18235 2026-05-28 cs.CL cs.AI

Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents

Jiayi Wu, Ruobing Xie, Zeqian Huang, Lei Jiang, Can Xu, Kangyang Luo, Bochen Lin, Ming Gao, Xiang Li

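Affiliations School of Data Science and Engineering, East China Normal University; Tencent; Tsinghua University

AI summary Addresses the training instability of GRPO in multi-hop search, caused by coarse-grained advantage assignment and an imbalance between positive and negative advantages, with CalibAdv, which downscales excessive negative advantages at a fine-grained level and rebalances positive and negative advantages to improve model performance and training stability.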

Abstract

Search agents achieve strong question-answering performance through multi-turn interactions with search engines, with Group Relative Policy Optimization (GRPO) being a widely used training algorithm. However, GRPO-style algorithms still face several challenges in multi-hop search settings. First, correct intermediate steps are often penalized when the final answer is wrong. Second, training is highly unstable, often causing degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for search agents that enables more accurate and more stable modeling of penalties and rewards. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then further rebalances positive and negative advantages to improve training stability. Importantly, CalibAdv adopts a lightweight design that calibrates advantages from standard rollout signals, making it simple and easy to deploy. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
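The problem of correct intermediate steps being penalized can be made concrete with a short Python sketch. The first function is the usual GRPO group-relative advantage (reward standardized within a group of rollouts). The second is an illustrative guess at a step-level calibration: the function name, the `step_correct` flags, and the `neg_scale` factor are hypothetical, and the rebalancing stage mentioned in the abstract is not sketched.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each rollout's reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def calibrate_step_advantages(advantage, step_correct, neg_scale=0.3):
    """Spread one rollout-level advantage over its steps, shrinking the penalty
    on steps judged correct when the rollout as a whole failed.

    neg_scale is an assumed constant, not a value from the paper.
    """
    step_correct = np.asarray(step_correct, dtype=bool)
    step_adv = np.full(step_correct.shape, advantage, dtype=float)
    if advantage < 0:
        step_adv[step_correct] *= neg_scale
    return step_adv

# One rollout out of four answers correctly.
adv = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
# A failed rollout whose first two search steps were nevertheless correct.
print(calibrate_step_advantages(adv[1], [True, True, False]))
```

Without the calibration, all three steps of the failed rollout would share the same negative advantage, so the two correct search steps would be discouraged as strongly as the step that actually caused the wrong answer.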

2604.18227 2026-05-28 cs.LG

FSEVAL: Feature Selection Evaluation Toolbox and Dashboard

Muhammad Rajabinasab, Arthur Zimek

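Affiliations Department of Mathematics and Computer Science, University of Southern Denmark

AI summary Presents FSEVAL, a toolbox with a visualization dashboard for standardized, unified evaluation and visualization of feature selection algorithms.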

Abstract

Feature selection is a fundamental machine learning and data mining task concerned with distinguishing redundant features from informative ones. It addresses the curse of dimensionality by removing redundant features while, unlike dimensionality reduction methods, preserving explainability. Feature selection is conducted in both supervised and unsupervised settings, with different evaluation metrics employed to determine which feature selection algorithm performs best. In this paper, we propose FSEVAL, a feature selection evaluation toolbox accompanied by a visualization dashboard, with the goal of making it easy to comprehensively evaluate feature selection algorithms. FSEVAL aims to provide a standardized, unified evaluation and visualization toolbox that helps researchers in the field conduct extensive and comprehensive evaluations of feature selection algorithms with ease.

2604.17943 2026-05-28 cs.CL

A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents

Bao Gia Doan, Aditya Joshi, Pantelis Elinas, Aarya Bodhankar, Oscar Leslie, Tom Marchant, Flora Salim

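Affiliations UNSW Sydney; Cyndr AI

AI summary Proposes the DoRA framework, which tackles the cold-start problem of RAG question answering in specialist domains through synthetic data generation and a dual-LLM pipeline, and on defense documents more than halves hallucination under the default retrieval setting while improving coverage and faithfulness.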

Abstract

RAG-based question-answering (QA) in specialist domains faces a cold-start problem: lack of evaluative benchmarks and absence of labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a novel benchmark construction and evaluation framework using only a small set of specialist domain documents. DoRA systematically generates synthetic QA training and evaluation datasets with auditable evidence across five domain-specific intents. To mitigate same-pipeline circularity, DoRA's training and test splits use different LLM families (Claude Sonnet for training; GPT-4o for test) drawn from disjoint seed-document corpora. Instantiated on 40 defense-related documents (written in English), DoRA yields ~6.6K curated instances. Compared against 8 LLM baselines over a benchmark of 1,259 samples, a LoRA-adapted Llama3.1-8B trained on the synthetic training set consistently improves performance over 6 coverage and faithfulness metrics, especially reducing hallucination by more than half under the default GTE retrieval setting, with gains persisting across alternative retrievers and prompting-based baselines. Defense-domain expertise is incorporated in three stages of our evaluation: (a) determining the quality of the synthetic QA generated by DoRA, (b) ascertaining the reliability of LLM-as-judge scores, and (c) evaluating the generalization of the QA pipeline on completely human-written QA examples. We position DoRA as a practical framework for specialist-domain RAG under domain shift, with defense as a high-stakes case study.

2604.16565 2026-05-28 cs.LG cs.AI

Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models

Jiaoyang Ruan, Xin Gao, Yinda Chen, Hengyu Zeng, Liang Du, Guanghao Li, Jie Fu, Jian Pu

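Affiliations Institute of Science and Technology for Brain-Inspired Intelligence; Fudan University; Shanghai Artificial Intelligence Laboratory; University of Science and Technology of China; IEG, Tencent Inc.

AI summary Proposes Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of a generated sequence through a forward-masking and backward-reconstruction cycle, for diagnosis, inference, and alignment of diffusion language models.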

Comments 31 pages, 7 figures. Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

Journal ref Proceedings of the 43rd International Conference on Machine Learning, PMLR 306, 2026

Abstract

While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at correct answers via valid reasoning traces remains a critical challenge. In this work, we propose a geometric perspective: Reasoning on the Manifold. We hypothesize that valid generation trajectories reside as stable attractors on the high-density manifold of the learned distribution, whereas invalid paths exhibit off-manifold drift. To operationalize this, we introduce Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of the generated sequence through a forward-masking and backward-reconstruction cycle. Empirically, we demonstrate BMC's versatility across the full reasoning lifecycle: (1) in Diagnosis, it serves as a robust discriminator of solution validity without ground-truth answers; (2) in Inference, it enables rejection resampling to effectively concentrate computational resources on complex reasoning tasks; and (3) in Alignment, it functions as a dense geometric reward that transforms sparse outcome supervision into fine-grained guidance, empowering models to self-evolve beyond standard baselines. Our results establish intrinsic geometric stability as a robust indicator of correctness for dLLMs.
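A forward-masking and backward-reconstruction cycle can be sketched in a few lines of Python. The version below scores a sequence by how often a reconstruction function restores masked tokens to their original values. It is an exact-match proxy written for illustration: the paper's actual metric is not specified in the abstract, and `bidirectional_consistency`, `reconstruct`, `mask_ratio`, and `n_rounds` are hypothetical names.

```python
import random

def bidirectional_consistency(tokens, reconstruct, mask_ratio=0.3,
                              n_rounds=8, seed=0, mask_token="<mask>"):
    """Average fraction of masked tokens that `reconstruct` restores exactly.

    reconstruct stands in for a masked-denoising model (assumption): it takes
    a token list containing mask_token entries and returns a filled-in list
    of the same length.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(tokens)))
    scores = []
    for _ in range(n_rounds):
        # Forward step: hide a random subset of positions.
        positions = rng.sample(range(len(tokens)), n_mask)
        masked = list(tokens)
        for p in positions:
            masked[p] = mask_token
        # Backward step: let the model fill the gaps, then compare.
        restored = reconstruct(masked)
        hits = sum(restored[p] == tokens[p] for p in positions)
        scores.append(hits / n_mask)
    return sum(scores) / len(scores)

# Toy "model" that has learned the counting sequence 1, 2, 3, ...
def counting_model(masked):
    return [i + 1 if tok == "<mask>" else tok for i, tok in enumerate(masked)]

print(bidirectional_consistency(list(range(1, 11)), counting_model))
print(bidirectional_consistency(list(range(10, 0, -1)), counting_model))
```

A sequence the model regards as likely survives the cycle unchanged and scores high, while a sequence far from what the model would generate is rewritten during reconstruction and scores low, which mirrors the stable-attractor intuition in the abstract.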