arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.16278 2026-06-01 cs.AI cs.CL cs.LG

Learning to Reason with Insight for Informal Theorem Proving

Yunhe Li, Hao Shi, Bowen Deng, Wei Wang, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang, Shuang Qiu, Linqi Song

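AI summary To address the lack of insight (the ability to identify core techniques) that bottlenecks informal theorem proving, the paper proposes DeepInsight, a unified training framework that combines a hierarchical dataset, progressive multi-stage SFT, and an insight-based policy optimization method, significantly improving the mathematical reasoning of large language models.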

Abstract

Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose $\texttt{DeepInsight}$, a unified training framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. Our framework consists of three components: (1) $\texttt{DeepInsightTheorem}$, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof; (2) a Progressive Multi-Stage SFT strategy that mimics the human learning process, teaching the model proof writing, planning, and insight identification; and (3) $\texttt{InsightPO}$, a policy optimization method that assigns structured rewards over this insight hierarchy. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reasoning.

2604.15959 2026-06-01 cs.LG

Multi-Objective Bayesian Optimization via Adaptive $\varepsilon$-Constraints Decomposition

Yaohong Yang, Sammie Katt, Samuel Kaski

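AI summary Proposes STAGE-BO, which uses an adaptive ε-constraint decomposition to turn multi-objective optimization into a sequence of constrained subproblems, achieving uniform Pareto coverage while handling constraints and preferences.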

Comments 24 pages, 22 figures, 4 tables. Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

Multi-objective Bayesian optimization (MOBO) provides a principled framework for optimizing multiple expensive black-box functions. However, existing MOBO methods often struggle with coverage, scalability, and handling constraints and preferences. In this work, we propose STAGE-BO, Sequential Targeting Adaptive Gap-Filling $\varepsilon$-Constraint Bayesian Optimization. By analyzing the coverage of the surrogate Pareto front, our method identifies the Pareto front point with the largest uncovered gap and uses its coordinates to define adaptive constraints in the $\varepsilon$-constraint method. This transforms the problem into a sequence of inequality-constrained subproblems, which are solved efficiently via constrained expected improvement acquisition. Our approach provides uniform Pareto coverage without hypervolume computation and naturally handles constraints and preferences. Experiments on synthetic and real-world benchmarks demonstrate superior coverage and competitive hypervolume performance against state-of-the-art baselines. Our code implementation can be found at https://github.com/YangYaohong1/STAGE-BO.
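
The gap-targeting step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two-objective minimization setting, the Euclidean gap measure, the midpoint rule for the constraint level, and the names `largest_gap` and `epsilon_constraint` are all assumptions made for illustration.

```python
from math import dist

def largest_gap(front):
    """Return the adjacent pair of points (sorted by f1) with the largest Euclidean gap."""
    pts = sorted(front)
    return max(zip(pts, pts[1:]), key=lambda pair: dist(pair[0], pair[1]))

def epsilon_constraint(front):
    """Describe a subproblem that targets the largest gap in a 2D front.

    Assumed rule (hypothetical): minimize f1 subject to f2 <= eps, where eps
    is the midpoint of the f2 coordinates on either side of the gap.
    """
    a, b = largest_gap(front)
    eps = (a[1] + b[1]) / 2
    return {"minimize": "f1", "subject_to": ("f2", "<=", eps), "gap": (a, b)}

# Toy surrogate Pareto front with a visible hole in the middle.
front = [(0.0, 1.0), (0.1, 0.8), (0.2, 0.7), (0.9, 0.1), (1.0, 0.0)]
sub = epsilon_constraint(front)
print(sub)
```

In a full Bayesian optimization loop, a subproblem like this would be handed to a constrained acquisition function; that part is omitted here.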

2604.11613 2026-06-01 cs.LG cs.AI

Symmetry Reveals Layerwise Dynamics: How Transformers Perform In-Context Classification

Patrick Lutz, Themistoklis Haris, Arjun Chandra, Aditya Gangrade, Venkatesh Saligrama

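AI summary By enforcing feature- and label-permutation equivariance, the authors extract an explicit depth-indexed recursive update rule from Transformers, revealing a geometry-driven algorithm for in-context classification.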

Comments appears in the Proceedings of the 43rd International Conference on Machine Learning (ICML '26)

Abstract

Transformers can perform in-context classification from a few labeled examples, yet the inference-time algorithm remains opaque. We study multi-class linear classification in the hard no-margin regime and make the computation identifiable by enforcing feature- and label-permutation equivariance at every layer. This enables interpretability while maintaining functional equivalence and yields highly structured weights. From these models we extract an explicit depth-indexed recursion: an end-to-end identified, emergent update rule inside a softmax transformer, to our knowledge the first of its kind. Attention matrices formed from mixed feature-label Gram structure drive coupled updates of training points, labels, and the test probe. The resulting dynamics implement a geometry-driven algorithmic motif, which can provably amplify class separation and yields robust expected class alignment.

2603.12277 2026-06-01 cs.CL cs.AI cs.CR

Prompt Injection as Role Confusion

Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

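AI summary Using role probes and the CoT Forgery attack, the paper shows that prompt injection stems from LLMs' confused perception of which role a piece of text comes from, and that the degree of role confusion predicts attack success.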

Comments ICML 2026

Abstract

LLMs see the world as a single stream of text, partitioned into roles like <user> or <tool>. We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not its labeled role. A command hidden in a webpage hijacks an agent simply because it sounds like <user> text, despite its <tool> label. We design role probes to measure how LLMs internally perceive "who is speaking," and find that injected text occupies the same representational space as the trusted role it imitates. We demonstrate this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts and tool outputs. Models mistake the forgery for their own thoughts, yielding 60% attack success against frontier models with near-zero baselines. Strikingly, the degree of role confusion predicts attack success before a single token is generated. This mechanism generalizes beyond CoT Forgery to standard agent prompt injections, revealing prompt injection as a measurable consequence of role perception. To the model, sounding like a role is indistinguishable from being one.

2604.06484 2026-06-01 cs.CL

ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka, Steffen Eger

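AI summary Proposes the ValueGround benchmark, which uses minimally contrastive image pairs to evaluate culture-conditioned visual value judgments in multimodal large language models, and finds that accuracy is substantially lower with visualized options than with textual options.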

Comments Updated preprint

Abstract

Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, leaving it unclear whether culture-conditioned judgments remain stable when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country's value tendency without access to the original response-option texts. Experiments across six MLLMs and 13 countries show that models perform substantially worse with visualized response options than with the original textual options, with average accuracy dropping from 72.8% to 62.6%. Our benchmark provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments.

2604.10432 2026-06-01 cs.RO

AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

Zhaofeng Hu, Sifan Zhou, Qinbo Zhang, Rongtao Xu, Qi Su, Jorge Mendez-Mendz, Ci-Jyun Liang

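AI summary Proposes the AnySlot framework, which converts language instructions into spatial visual goals and decouples high-level slot selection from low-level execution to achieve precise zero-shot slot-level placement.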

Abstract

Vision-Language-Action (VLA) policies have emerged as a versatile paradigm for generalist robotic manipulation. However, precise object placement under compositional language remains challenging for end-to-end VLA policies. Slot-level placement requires reliable slot grounding and centimeter-level geometric precision. To this end, we propose AnySlot, a framework that reduces compositional complexity by introducing an explicit spatial visual goal between language grounding and control. AnySlot converts language into a visual goal by rendering a spatial marker at the intended slot, then executes this goal with a goal-conditioned VLA policy. This hierarchical design decouples high-level slot selection from low-level execution, improving semantic accuracy and spatial robustness. Furthermore, recognizing the lack of benchmarks for such precision-demanding tasks, we introduce SlotBench, a structured simulation benchmark with nine task categories for evaluating spatial reasoning in slot-level placement. Extensive experiments show that AnySlot significantly outperforms flat VLA baselines and modular grounding methods in zero-shot slot-level placement.

2604.10805 2026-06-01 cs.CV

Analytical Modeling and Correction of Distance Error in Homography-Based Ground-Plane Mapping

Mateusz Szulc, Marcin Iwanowski

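AI summary Derives an analytical relationship between homography perturbations and distance error, proposes two correction strategies based on regression and gradient descent, and validates their effectiveness in a large-scale simulation.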

Comments 7 pages, 4 figures

Abstract

Accurate distance estimation from monocular cameras is essential for intelligent monitoring systems. In many deployments, image coordinates are mapped to ground positions using planar homographies initialized by manual selection of corresponding regions. Small inaccuracies in this initialization propagate into systematic distance distortions. This paper derives an explicit relationship between homography perturbations and the resulting distance error, showing that the error grows approximately quadratically with the true distance from the camera. Based on this model, two simple correction strategies are evaluated: regression-based estimation of the quadratic error function and direct optimization of the homography via coordinate-based gradient descent. A large-scale simulation study with more than 19 million test samples demonstrates that regression achieves higher peak accuracy when the model is reliably fitted, whereas gradient descent provides greater robustness against poor initial calibration. This suggests that improving geometric calibration may yield greater performance gains than increasing model complexity in many practical systems.
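
The regression-based correction can be pictured with a deliberately simplified sketch. It assumes a one-parameter error model, measured = true + k * true^2, which is only a stand-in for the quadratic error function derived in the paper; the function names and the synthetic calibration data are hypothetical.

```python
from math import sqrt

def fit_quadratic_coeff(true_d, measured_d):
    """Least-squares fit of k in the assumed model measured = true + k * true**2."""
    num = sum((m - d) * d**2 for d, m in zip(true_d, measured_d))
    den = sum(d**4 for d in true_d)
    return num / den

def correct(measured, k):
    """Invert measured = d + k * d**2 and return the non-negative root d.

    Assumes 1 + 4*k*measured >= 0, which holds for k >= 0 and measured >= 0.
    """
    if abs(k) < 1e-12:
        return measured
    return (-1.0 + sqrt(1.0 + 4.0 * k * measured)) / (2.0 * k)

# Synthetic calibration pairs generated from the assumed model (distances in metres).
true_d = [5.0, 10.0, 20.0, 40.0]
k_true = 0.002
measured_d = [d + k_true * d**2 for d in true_d]

k_hat = fit_quadratic_coeff(true_d, measured_d)
corrected = [correct(m, k_hat) for m in measured_d]
print(k_hat, corrected)
```

With noise-free synthetic data the fit recovers the coefficient almost exactly; on real measurements the quality of the fit would depend on how well the calibration points cover the distance range.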

2604.10495 2026-06-01 cs.CL

Why Don't You Know? Evaluating the Impact of Uncertainty Sources on Uncertainty Quantification in LLMs

Maiya Goloburda, Roman Vashurin, Fedor Chernogorskii, Nurkhan Laiyk, Daniil Orel, Preslav Nakov, Maxim Panov

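AI summary Introduces a new dataset that explicitly categorizes uncertainty sources and uses it to systematically evaluate existing uncertainty quantification methods under each source, finding that most methods perform well when uncertainty stems from model knowledge limitations but degrade or become misleading under other sources.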

Abstract

As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to produce a single confidence score -- for example, estimating the probability that a model's answer is correct. However, uncertainty in natural language tasks arises from multiple distinct sources, including model knowledge gaps, output variability, and input ambiguity, which have different implications for system behavior and user interaction. In this work, we study how the source of uncertainty impacts the behavior and effectiveness of existing UQ methods. To enable controlled analysis, we introduce a new dataset that explicitly categorizes uncertainty sources, allowing systematic evaluation of UQ performance under each condition. Our experiments reveal that while many UQ methods perform well when uncertainty stems solely from model knowledge limitations, their performance degrades or becomes misleading when other sources are introduced. These findings highlight the need for uncertainty-aware methods that explicitly account for the source of uncertainty in large language models.

2604.10273 2026-06-01 cs.CV

Dual-Exposure Imaging with Events

Mingyuan Lin, Hongyi Liu, Chu He, Wen Yang, Gui-Song Xia, Lei Yu

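AI summary Proposes E-DEI, an event-assisted dual-exposure imaging algorithm that uses the high temporal resolution of event cameras to align and fuse dual-exposure image features, removing motion artifacts and exposure discrepancies and improving low-light image quality.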

Abstract

By combining the complementary benefits of short- and long-exposure images, Dual-Exposure Imaging (DEI) enhances image quality in low-light scenarios. However, existing DEI approaches inevitably produce artifacts due to spatial displacement from scene motion and image feature discrepancies from different exposure times. To tackle this problem, we propose a novel Event-based DEI (E-DEI) algorithm, which reconstructs high-quality images from dual-exposure image pairs and events, leveraging the high temporal resolution of event cameras to provide accurate inter-/intra-frame dynamic information. Specifically, we decompose this complex task into an integration of two sub-tasks, i.e., event-based motion deblurring and low-light image enhancement, which guides us to design the E-DEI network as a dual-path parallel feature propagation architecture. We propose a Dual-path Feature Alignment and Fusion (DFAF) module to effectively align and fuse features extracted from dual-exposure images with the assistance of events. Furthermore, we build a real-world Dataset containing Paired low-/normal-light Images and Events (PIED). Experiments on multiple datasets show the superiority of our method. The code and dataset are available on GitHub.

2503.09315 2026-06-01 cs.LG

ShuffleGate: Scalable Feature Optimization for Recommender Systems via Batch-wise Sensitivity Learning

Yihong Huang, Chen Chu, Fan Zhang, Liping Wang, Fei Chen, Yu Lin, Ruiduan Li, Zhihao Li

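AI summary Proposes ShuffleGate, a mechanism that estimates feature importance in a differentiable way through batch-wise shuffling, unifies feature selection and dimension selection, yields polarized importance distributions that avoid complex threshold tuning, outperforms existing methods on four benchmarks, and in an industrial deployment achieves 10× dimension compression and a 91% increase in training throughput.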

Abstract

Feature optimization -- specifically Feature Selection (FS) and Dimension Selection (DS) -- is critical for the efficiency and generalization of large-scale recommender systems. While conceptually related, these tasks are typically tackled with isolated solutions that often suffer from ambiguous importance scores or prohibitive computational costs. In this paper, we propose ShuffleGate, a unified and interpretable mechanism that estimates component importance by measuring the model's sensitivity to information loss. Unlike conventional gating that learns relative weights, ShuffleGate introduces a batch-wise shuffling strategy to effectively "erase" information in an end-to-end differentiable manner. This paradigm shift yields naturally polarized importance distributions, bridging the long-standing "search-retrain gap" and distinguishing essential signals from noise without complex threshold tuning. Extensive experiments across four benchmarks validate that ShuffleGate consistently outperforms state-of-the-art methods in both Feature and Dimension Selection tasks. It achieves a $15\times$ speedup over permutation baselines and demonstrates extreme scalability by processing 270M parameters in just 700 seconds. Finally, in a top-tier industrial deployment, it compressed input dimensions by $10\times$, yielding a 91% increase in training throughput while serving billions of daily requests without performance degradation.
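
One way to picture batch-wise shuffling as an "erasing" operation is the sketch below. It is an assumption about the general idea rather than the paper's exact formulation: each feature is blended with a copy of itself permuted across the batch, weighted by a gate in [0, 1]. The function name `shuffle_gate` and the blending rule are illustrative, and the gates are fixed here instead of learned.

```python
import random

def shuffle_gate(batch, gates, rng):
    """Blend each feature column with a batch-shuffled copy of itself.

    batch: list of rows (lists of floats); gates: one value in [0, 1] per feature.
    gate = 1 keeps the feature intact; gate = 0 replaces it with values
    drawn from the same column elsewhere in the batch, which breaks its
    link to the row's label while keeping its marginal distribution.
    """
    out = [row[:] for row in batch]
    for j, gate in enumerate(gates):
        column = [row[j] for row in batch]
        shuffled = column[:]
        rng.shuffle(shuffled)
        for i in range(len(batch)):
            out[i][j] = gate * column[i] + (1.0 - gate) * shuffled[i]
    return out

rng = random.Random(0)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
gated = shuffle_gate(batch, gates=[1.0, 0.0], rng=rng)
print(gated)
```

In a trainable version the gates would be parameters optimized jointly with the model, so that features whose erasure does not hurt the loss drift toward zero; that training loop is not shown.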

2604.06881 2026-06-01 cs.LG physics.flu-dyn

MENO: MeanFlow-Enhanced Neural Operators for Dynamical Systems

Tianyue Yang, Xiao Xue

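AI summary Proposes the MENO framework, which uses the improved MeanFlow method to recover multi-scale features, delivering accurate high-resolution predictions from low-resolution training data with inference up to 14× faster than diffusion-enhanced methods.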

Comments 27 pages, 13 figures

Abstract

Neural operators have emerged as powerful surrogates for dynamical systems due to their grid-invariant properties and computational efficiency. However, Fourier-based variants inherently truncate high-frequency components in spectral space, resulting in the loss of small-scale structures and degraded prediction quality at high resolutions when trained on low-resolution data. While diffusion-based enhancement methods can recover multi-scale features, they introduce substantial inference overhead that undermines the efficiency advantage of neural operators. In this work, we introduce MeanFlow-Enhanced Neural Operators (MENO), a novel framework that achieves accurate all-scale predictions with minimal inference cost. By leveraging the improved MeanFlow method, MENO restores both small-scale details and large-scale dynamics with superior physical fidelity and statistical accuracy. We evaluate MENO on three challenging dynamical systems, including phase-field dynamics, 2D Kolmogorov flow, and active matter dynamics, at resolutions up to 256$\times$256. Across all benchmarks, MENO improves the power spectrum density accuracy by up to a factor of 2 compared to baseline neural operators while achieving up to $14\times$ faster inference than the state-of-the-art Denoising Diffusion Implicit Model (DDIM)-enhanced counterparts, effectively bridging the gap between accuracy and efficiency. The flexibility and efficiency of MENO position it as an efficient surrogate model for scientific machine learning applications where both statistical integrity and computational efficiency are paramount.

2509.10078 2026-06-01 cs.CL cs.AI

Human Psychometric Questionnaires Mischaracterize LLM Behavior

Woojung Song, Dongmin Choi, Yoonah Park, Jongwook Han, Eun-Ju Lee, Yohan Jo

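AI summary Compares the value and personality profiles that LLMs show on Likert questionnaires with those derived from generation probabilities, finds that the questionnaires are systematically biased, and suggests that generation-probability-based profiling is more accurate.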

Comments 38 pages, 6 figures

Abstract

We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to the fact that explicit lexical cues in established questionnaire items allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, showing their limited ability to simulate the behaviors of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.

2604.01985 2026-06-01 cs.LG cs.AI cs.RO

World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, Yilun Du

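AI summary Proposes the World Action Verifier (WAV) framework, which verifies state plausibility and action reachability independently and exploits forward-inverse asymmetries. It enforces cycle consistency using a diverse subgoal generator obtained from video corpora and a sparse inverse model, letting world models self-improve in under-explored regimes, with 2× higher sample efficiency and over 22% better downstream policy performance across multiple tasks.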

Comments Project Website: https://world-action-verifier.github.io

Abstract

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model needs to be reliable over a vast space of suboptimal actions, which are often underrepresented in action-labeled robot interactions. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two independently verifiable factors: state plausibility and action reachability. We show that verifying these factors is significantly more tractable than direct forward prediction due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among proposed subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods often fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by over 22%.

2603.28579 2026-06-01 cs.RO

EBuddy: a workflow orchestrator for industrial human-machine collaboration

Michele Banfi, Rocco Felici, Stefano Baraldo, Oliver Avram, Anna Valente

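AI summary Proposes EBuddy, a voice-guided workflow orchestrator that formalizes expert practice as a finite-state-machine-driven application to enable natural human-machine collaboration in industrial settings, substantially shortening end-to-end process time while preserving repeatability.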

Abstract

This paper presents EBuddy, a voice-guided workflow orchestrator for natural human-machine collaboration in industrial environments. EBuddy targets a recurrent bottleneck in tool-intensive workflows: expert know-how is effective but difficult to scale, and execution quality degrades when procedures are reconstructed ad hoc across operators and sessions. EBuddy operationalizes expert practice as a finite state machine (FSM) driven application that provides an interpretable decision frame at runtime (current state and admissible actions), so that spoken requests are interpreted within state-grounded constraints, while the system executes and monitors the corresponding tool interactions. Through modular workflow artifacts, EBuddy coordinates heterogeneous resources, including GUI-driven software and a collaborative robot, leveraging fully voice-based interaction through automatic speech recognition and intent understanding. An industrial pilot on impeller blade inspection and repair preparation for directed energy deposition (DED), realized by human-robot collaboration, shows substantial reductions in end-to-end process duration across onboarding, 3D scanning and processing, and repair program generation, while preserving repeatability and low operator burden.
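
The idea of interpreting requests within state-grounded constraints can be sketched with a tiny finite state machine. The states, intent strings, and the `Workflow` class below are hypothetical and are not taken from EBuddy; speech recognition and tool execution are left out entirely.

```python
# Hypothetical workflow: state and intent names are illustrative only.
TRANSITIONS = {
    "idle": {"start onboarding": "onboarding"},
    "onboarding": {"start scan": "scanning"},
    "scanning": {"process scan": "processing", "rescan": "scanning"},
    "processing": {"generate repair program": "done"},
    "done": {},
}

class Workflow:
    def __init__(self, transitions, start):
        self.transitions = transitions
        self.state = start

    def admissible_actions(self):
        """The decision frame: which intents are allowed in the current state."""
        return sorted(self.transitions[self.state])

    def request(self, intent):
        """Apply a recognized intent if admissible; otherwise reject and stay put."""
        nxt = self.transitions[self.state].get(intent)
        if nxt is None:
            return False
        self.state = nxt
        return True

wf = Workflow(TRANSITIONS, "idle")
rejected = wf.request("start scan")        # not admissible while idle
accepted = wf.request("start onboarding")  # admissible, moves to "onboarding"
print(rejected, accepted, wf.state, wf.admissible_actions())
```

Restricting interpretation to the admissible set is what keeps an ambiguous spoken request from triggering an out-of-order tool action in this toy version.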

2603.28201 2026-06-01 cs.LG stat.ML

A Perturbation Approach to Unconstrained Linear Bandits

Andrew Jacobsen, Dorian Baudry, Shinji Ito, Nicolò Cesa-Bianchi

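AI summary Proposes a perturbation-based framework that reduces unconstrained bandit linear optimization to a standard online linear optimization problem and achieves high-probability guarantees for both static and dynamic regret.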

Comments 50 pages; v2: ICML 2026

Abstract

We revisit the standard perturbation-based approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approach effectively reduces Bandit Linear Optimization (BLO) to a standard Online Linear Optimization (OLO) problem. Our framework improves on prior work in several ways. First, we derive expected-regret guarantees when our perturbation scheme is combined with comparator-adaptive OLO algorithms, leading to new insights about the impact of different adversarial models on the resulting comparator-adaptive rates. We also extend our analysis to dynamic regret, obtaining the first guarantees with optimal $\sqrt{P_T}$ path-length dependencies without prior knowledge of $P_T$. We then develop the first high-probability guarantees for both static and dynamic regret in uBLO. Finally, we discuss lower bounds on the static regret, and prove the folklore $\Omega(\sqrt{dT})$ rate for adversarial linear bandits on the Euclidean ball, which is of independent interest.

2603.26885 2026-06-01 cs.CV

TTE-CAM: Self-Explainable Class Activation Maps for Pretrained Black-Box CNNs

Kerol Djoumessi, Philipp Berens

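AI summary Proposes the TTE-CAM framework, which converts pretrained black-box CNNs into self-explainable models by replacing the classification head with a convolution, providing faithful explanations while preserving predictive performance.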

Comments Accepted at MIDL 2026 in the short paper track

Abstract

Convolutional neural networks (CNNs) achieve state-of-the-art performance in medical image analysis yet remain opaque, limiting adoption in high-stakes clinical settings. Existing approaches face a fundamental trade-off: post-hoc methods provide unfaithful approximate explanations, while inherently interpretable architectures are faithful but often sacrifice predictive performance. We introduce TTE-CAM, a test-time framework that bridges this gap by converting pretrained black-box CNNs into self-explainable models via a convolution-based replacement of their classification head, initialized from the original weights. The resulting model preserves black-box predictive performance while delivering built-in faithful explanations competitive with post-hoc methods, both qualitatively and quantitatively. The code is available at https://github.com/kdjoumessi/Test-Time-Explainability
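
Why a convolutional head initialized from the original weights can preserve predictions is easiest to see in the classic case of a head made of global average pooling followed by a linear layer. The sketch below checks that identity on a toy feature map. Whether TTE-CAM's replacement takes exactly this form is an assumption, and all function names are illustrative.

```python
def gap_linear_head(features, weights, bias):
    """Standard head: global average pooling, then a linear layer."""
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    return [sum(w * p for w, p in zip(w_k, pooled)) + b for w_k, b in zip(weights, bias)]

def conv1x1_class_maps(features, weights):
    """1x1 convolution reusing the linear weights: one spatial map per class."""
    h, w = len(features[0]), len(features[0][0])
    return [[[sum(w_k[c] * features[c][i][j] for c in range(len(features)))
              for j in range(w)] for i in range(h)] for w_k in weights]

def pooled_logits(class_maps, bias):
    """Average each class map over space and add the bias to recover the logits."""
    return [sum(sum(row) for row in m) / (len(m) * len(m[0])) + b
            for m, b in zip(class_maps, bias)]

features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]  # 2 channels, 2x2
weights = [[0.5, -1.0], [2.0, 0.25]]                               # 2 classes
bias = [0.1, -0.2]

head = gap_linear_head(features, weights, bias)
maps = conv1x1_class_maps(features, weights)
via_maps = pooled_logits(maps, bias)
print(head, via_maps)
```

Because pooling and the linear map are both linear, their order can be swapped, so the class maps average to the same logits while also showing where in the image each class score comes from.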

2603.26612 2026-06-01 cs.RO

Meta-Adaptive Beam Search Planning for Transformer-Based Reinforcement Learning Control of UAVs with Overhead Manipulators under Flight Disturbances

Hazim Alzorgan, Sayed Pedram Haeri Boroujeni, Abolfazl Razi

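AI summary To address end-effector tracking errors caused by the coupling between a UAV and its overhead manipulator, the paper proposes a reinforcement learning framework built on a Transformer-based double deep Q-network (DDQN), in which an adaptive beam-search planner uses the learned critic as a forward estimator for software-in-the-loop short-horizon beam search, substantially reducing tracking error and increasing reward.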

Comments The paper will be reworked significantly

英文摘要

Drones equipped with overhead manipulators offer unique capabilities for inspection, maintenance, and contact-based interaction. However, the motion of the drone and its manipulator is tightly linked, and even small attitude changes caused by wind or control imperfections shift the end-effector away from its intended path. This coupling makes reliable tracking difficult and also limits the direct use of learning-based arm controllers that were originally designed for fixed-base robots. These effects appear consistently in our tests whenever the UAV body experiences drift or rapid attitude corrections. To address this behavior, we develop a reinforcement-learning (RL) framework built on a Transformer-based double deep Q-network (DDQN). Its core idea is an adaptive beam-search planner that runs a short-horizon beam search over candidate control sequences, using the learned critic as the forward estimator. This allows the controller to anticipate the end-effector's motion through simulated rollouts rather than executing those actions directly on the actual model, realizing a software-in-the-loop (SITL) approach. The lookahead relies on value estimates from a Transformer critic that processes short sequences of states, while a DDQN backbone provides the one-step targets needed to keep the learning process stable. Evaluated on a 3-DoF aerial manipulator under identical training conditions, the proposed meta-adaptive planner shows the strongest overall performance, with a 10.2% reward increase, a substantial reduction in mean tracking error (from about 6% to 3%), and a 29.6% improvement in the combined reward-error metric relative to the DDQN baseline. When the drone base drifts under external disturbances, our method tracks the target tip trajectory more stably than the fixed-beam and Transformer-only variants, maintaining a 5 cm tracking error.
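The short-horizon beam search over candidate control sequences can be sketched as follows. This is a generic fixed-width beam search under stated assumptions, not the paper's meta-adaptive planner: `toy_step` is a hypothetical 1-D stand-in for the simulated rollout model, `toy_critic` is a hypothetical stand-in for the learned critic, and each candidate is scored by the critic's value of its predicted final state.

```python
def beam_search_plan(state, actions, step, critic, horizon=3, beam_width=2):
    """Return (first action, full sequence, score) of the best short action sequence."""
    beams = [(0.0, state, [])]  # (score, predicted state, action sequence so far)
    for _ in range(horizon):
        candidates = []
        for _, current, sequence in beams:
            for action in actions:
                predicted = step(current, action)  # simulated rollout, nothing is executed
                candidates.append((critic(predicted), predicted, sequence + [action]))
        # Keep only the highest-valued candidates for the next expansion.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_width]
    best_score, _, best_sequence = beams[0]
    return best_sequence[0], best_sequence, best_score

TARGET = 1.0  # hypothetical desired tip position

def toy_step(position, action):
    """Hypothetical dynamics: the action is a position increment."""
    return position + action

def toy_critic(position):
    """Hypothetical value estimate: closer to the target is better."""
    return -abs(TARGET - position)

first_action, plan, plan_score = beam_search_plan(0.0, [-0.5, 0.0, 0.5], toy_step, toy_critic)
```

Only `first_action` would be sent to the vehicle; replanning at the next control step is the usual receding-horizon pattern, and it is assumed here rather than stated in the abstract.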

2603.23977 2026-06-01 cs.LG cs.AI

Circuit-Inspired High-Order Neural Networks with Unified Neural Dynamics Modeling for PDE Solving and Visual Perception

Tongfei Chen, Jingying Yang, Linlin Yang, Juan Zhang, Jinhu Lü, David Doermann, Chunyu Xie, Long He, Tian Wang, Guodong Guo, Baochang Zhang

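AI summary Proposes the Circuit-inspired High-Order Neural Network (CHONN), which builds high-order dynamical operators through Kirchhoff-inspired cascade composition and improves structural fidelity and stability on PDE solving, long-horizon physical forecasting, and ImageNet-1K recognition.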

Abstract

Deep networks often rely on architectural heuristics to shape representation evolution, limiting their ability to model data governed by intrinsic dynamics. We present the Circuit-inspired High-Order Neural Network (CHONN), a modular framework that treats representation evolution as a latent potential process and increases its effective order through Kirchhoff-inspired cascade composition. A single Kirchhoff Neural Cell implements a stable first-order update, while serially composed cells form higher-order dynamical operators within one block. This construction is interpretable, numerically stable and compatible with common neural backbones. Theoretical analysis shows that cascaded cells induce end-to-end high-order operators, and controlled experiments demonstrate that intra-block high-order construction differs from generic depth stacking, especially on derivative-sensitive measures. Across steady-state operator learning, long-horizon physical forecasting and ImageNet-1K recognition, CHONN improves structural fidelity, rollout stability and visual representation learning. These results identify high-order circuit composition as a general principle for neural dynamics modeling.
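The claim that serially composed first-order cells form a higher-order operator can be checked numerically on a toy case. The leaky-integrator `cell` below is an assumption chosen for illustration (it is not the paper's Kirchhoff Neural Cell); the point is only that feeding one first-order update into another yields a trajectory that obeys a second-order recurrence.

```python
def cell(state, inp, rate):
    """Leaky first-order update: move a fraction `rate` of the way toward `inp`."""
    return (1.0 - rate) * state + rate * inp

def cascade(inputs, a, b):
    """Feed cell 1's state into cell 2 and record cell 2's trajectory."""
    y, z, trajectory = 0.0, 0.0, [0.0]
    for u in inputs:
        # Simultaneous update: cell 2 reads cell 1's previous state.
        y, z = cell(y, u, a), cell(z, y, b)
        trajectory.append(z)
    return trajectory

a, b = 0.3, 0.5
inputs = [1.0, 0.0, 2.0, -1.0, 0.5, 0.5]
zs = cascade(inputs, a, b)

# Eliminating y gives a second-order recurrence in z alone:
#   z[t+2] = (A + B) * z[t+1] - A * B * z[t] + a * b * u[t],  with A = 1 - a, B = 1 - b.
A, B = 1.0 - a, 1.0 - b
residuals = [zs[t + 2] - ((A + B) * zs[t + 1] - A * B * zs[t] + a * b * inputs[t])
             for t in range(len(inputs) - 1)]
```

The residuals vanish up to rounding, so the two-cell block behaves as a single second-order operator even though each cell is first-order.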

2603.23160 2026-06-01 cs.CL

UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities

Qi Jia, Haodong Zhao, Dun Pei, Xiujie Song, Ye Shen, Shibo Wang, Zijian Chen, Zicheng Zhang, Xiangyang Zhu, Guangtao Zhai

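AI summary Proposes UniDial-EvalKit, a unified evaluation toolkit that addresses heterogeneous evaluation protocols in multi-turn interaction through standardized data formats, a modular pipeline, and hierarchical score aggregation. Large-scale experiments show that no current system is consistently best and that memory agents often fall short of full-context baselines.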

Abstract

Benchmarking large language models (LLMs) and agents in multi-turn interactive scenarios is essential for understanding their practical capabilities. However, existing evaluation protocols are highly heterogeneous, differing significantly in dataset formats, model interfaces, and evaluation pipelines, which severely impedes systematic comparison. In this work, we present UniDial-EvalKit (UDE), a unified evaluation toolkit for assessing interactive AI systems. The core contribution of UDE lies in its holistic unification: it standardizes heterogeneous data formats into a universal schema, streamlines complex evaluation pipelines through a modular architecture, and aligns metric calculations under a hierarchical scoring aggregation. It also supports efficient large-scale evaluation through parallel generation and scoring, as well as checkpoint resume to eliminate redundant computation. Leveraging UDE, we conduct an extensive evaluation across diverse multi-dimensional benchmarks. Our empirical analysis shows that no single system consistently outperforms others across all benchmarks, while current memory agents often fail to surpass full-context baselines. Further analyses highlight several future directions, including benchmark deduplication and more adaptive memory architectures.
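Hierarchical scoring aggregation can be illustrated with a short sketch. The record fields (`benchmark`, `task`, `score`), the function name `aggregate`, and the use of an unweighted mean at every level are assumptions for illustration only; they are not UDE's actual schema or API.

```python
from collections import defaultdict
from statistics import mean

def aggregate(records):
    """Roll sample scores up to task, benchmark, and overall means."""
    by_task = defaultdict(list)
    for rec in records:
        by_task[(rec["benchmark"], rec["task"])].append(rec["score"])
    per_task = {key: mean(scores) for key, scores in by_task.items()}

    by_benchmark = defaultdict(list)
    for (benchmark, _task), score in per_task.items():
        by_benchmark[benchmark].append(score)
    per_benchmark = {name: mean(scores) for name, scores in by_benchmark.items()}

    # Each benchmark counts once, regardless of how many samples it has.
    overall = mean(list(per_benchmark.values()))
    return per_task, per_benchmark, overall

# Hypothetical records; benchmark and task names are placeholders.
records = [
    {"benchmark": "bench_a", "task": "recall", "score": 1.0},
    {"benchmark": "bench_a", "task": "recall", "score": 0.0},
    {"benchmark": "bench_a", "task": "planning", "score": 1.0},
    {"benchmark": "bench_b", "task": "chat", "score": 0.0},
]
per_task, per_benchmark, overall = aggregate(records)
```

Averaging level by level gives 0.375 here, whereas pooling all four samples would give 0.5; the hierarchy keeps a large benchmark from dominating the headline number.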

2603.22744 2026-06-01 cs.AI

LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

Abhishek Chandwani, Ishan Gupta

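AI summary Proposes LH-Bench, which evaluates autonomous long-horizon execution on subjective enterprise tasks through three pillars: expert-grounded rubrics, curated ground-truth artifacts, and pairwise preference evaluation.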

Abstract

Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a single correct answer. In contrast, real-world enterprise work is often subjective and context-dependent: success hinges on organizational goals, user intent, and the quality of intermediate artifacts produced across long, multi-tool workflows. We introduce LH-Bench, a three-pillar evaluation design that moves beyond binary correctness to score autonomous, long-horizon execution on subjective enterprise tasks. The pillars are: (i) expert-grounded rubrics that give LLM judges the domain context needed to score subjective work, (ii) curated ground-truth artifacts that enable stepwise reward signals (e.g., chapter-level annotation for content tasks), and (iii) pairwise human preference evaluation for convergent validation. We show that domain-authored rubrics provide substantially more reliable evaluation signals than LLM-authored rubrics (kappa = 0.60 vs. 0.46), and that human preference judgments confirm the same top-tier separation (p < 0.05), evidence that expert-grounded evaluation can scale without sacrificing reliability. We release public datasets and report results on two environments: Figma-to-code (33 real .fig tasks against the Figma API via MCP) and Programmatic content (41 courses comprising 183 individually-evaluated chapters on a course platform serving 30+ daily users).
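For readers unfamiliar with the kappa statistic quoted above, here is a minimal sketch of Cohen's kappa for two raters. The abstract says only "kappa", so the choice of Cohen's variant is an assumption, and the labels below are toy data picked for easy arithmetic, not the paper's judgments.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Toy pass/fail verdicts from two hypothetical judges on ten items.
judge_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
judge_b = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
kappa = cohens_kappa(judge_a, judge_b)
```

Here the judges agree on 8 of 10 items while chance alone would give 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6.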

2603.21558 2026-06-01 cs.AI

Reliable Self-Improvement Training by Verifying Reasoning, Not Just Answers

Xinyu Zhang

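AI summary Self-improvement training that relies only on final-answer correctness lets reasoning errors accumulate. The paper proposes the VSI framework, which filters training data with step-level structural verification (for example, checking arithmetic steps by symbolic computation) and achieves sustained accuracy gains on GSM8K (from 80.5% to 91.0%).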

Comments Accepted at ICLR 2026 Workshop LLM Reasoning. 10 pages, 3 figures, 5 tables

Abstract

Self-improvement training, where models learn from self-generated solutions, promises sustained capability gains but suffers from a pervasive failure mode: across multiple rounds, compounding reasoning errors cause accuracy to stall or degrade. We trace this drift to standard filtering criteria that retain solutions based solely on final answer correctness, which lets lucky guesses (correct answers with flawed reasoning) contaminate the training data. We propose Verified Self-Improvement (VSI), a framework that conditions data retention on step-level structural integrity rather than just the final output. VSI validates solutions by recomputing arithmetic steps via a computer-algebra library (sympy), checking intermediate consistency, and enforcing domain constraints. Evaluating VSI on GSM8K with Qwen3-4B-Thinking across 5 rounds of self-improvement against four baselines (no verification, outcome verification, majority voting, and VSI with DPO) shows that VSI rejects approximately 34% of correct-answer solutions, successfully isolating lucky guesses. This cleaner training signal drives sustained accuracy gains across all rounds (80.5% to 91.0%), whereas outcome verification plateaus and unverified training collapses. Finally, converting VSI checks into DPO preference pairs trains the model to distinguish sound reasoning from lucky answers, boosting reward accuracy from 46% to 63%. VSI offers a simple, reproducible recipe for robust self-improvement whenever automated reasoning checks are available.
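The idea of recomputing arithmetic steps before keeping a solution can be sketched as below. Two things are assumptions: the step format (`expression = value` written inline) and the filtering rule in `keep_solution`. The sketch also swaps in Python's standard `ast` and `fractions` modules for the sympy library named in the abstract, purely to stay dependency-free; it is not the VSI implementation.

```python
import ast
import operator
import re
from fractions import Fraction

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _evaluate(node):
    """Exactly evaluate a parsed +, -, *, / expression over numeric literals."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")

# Assumed step format: an arithmetic expression, '=', then a claimed number.
_STEP = re.compile(r"([0-9.\s+\-*/()]+)=\s*(-?\d+(?:\.\d+)?)")

def check_steps(solution):
    """Return (expression, claimed value, is_correct) for each arithmetic step found."""
    results = []
    for expression, claimed in _STEP.findall(solution):
        expression = expression.strip()
        if not any(op in expression for op in "+-*/"):
            continue  # a bare number before '=' is not an arithmetic step
        try:
            ok = _evaluate(ast.parse(expression, mode="eval")) == Fraction(claimed)
        except (SyntaxError, ValueError, ZeroDivisionError):
            ok = False
        results.append((expression, claimed, ok))
    return results

def keep_solution(solution):
    """Hypothetical filter: keep only solutions whose every found step recomputes."""
    steps = check_steps(solution)
    return bool(steps) and all(ok for _, _, ok in steps)

sound = "Eggs left: 16 - 3 - 4 = 9. Income: 9 * 2 = 18."
lucky = "Eggs left: 16 - 3 - 4 = 10. Income: 9 * 2 = 18."  # right answer, wrong step
```

Both toy solutions end in 18, so outcome-only filtering would keep both, while `keep_solution` rejects the second because its first step does not recompute.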

2511.11440 2026-06-01 cs.CV cs.CL

Synthetic Stimuli, Real Gains: Rethinking VLM Fine-Tuning Through Fully Controlled Data Generation

Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi

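AI summary Proposes a fully controlled data generation and annotation pipeline for fine-tuning vision-language models (VLMs). Balanced distributions and clean annotations remove bias; on a spatial reasoning task, as few as 130 samples yield uniform performance, and performance on real-world data improves by 13%.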

Abstract

Performance gains of Vision Language Models (VLMs) obtained by fine-tuning are generally based on ad hoc data collection and annotation of real-world scenes. Despite the improvements, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have explored synthetic data generation, they typically lack control over data distribution and annotation quality. In this work, we re-evaluate the potential of model fine-tuning by exploring a fully controlled data generation and annotation pipeline, obtaining bias-free data with balanced distribution and clean annotations. Using the spatial reasoning task of identifying the absolute position of an object as a use case, we fine-tune state-of-the-art VLMs and conduct exhaustive evaluations on both synthetic and real-world benchmarks, including transferability to real-world scenes. Our experiments reveal two key findings: 1) fine-tuning on balanced data yields uniform performance across the visual scene and mitigates common biases with as few as 130 samples; and 2) fine-tuning on synthetic stimuli improves performance by 13% on real-world data (COCO), outperforming models fine-tuned on the full COCO train set.

2510.25110 2026-06-01 cs.CL

DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents

Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, You Li, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

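AI summary Proposes the DEBATE benchmark, which uses multi-round public messages and Likert-scale belief data to evaluate the authenticity of opinion dynamics in multi-agent role-playing LLM simulations. Agent groups over-converge in the zero-shot setting, while supervised fine-tuning improves stance alignment and reduces group-level convergence error.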

Abstract

Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior, such as premature convergence, and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains multi-round public messages and private Likert-scale beliefs from U.S.-based participants across 107 topics; the cleaned benchmark used in our experiments contains 2,788 participants in 697 groups, enabling evaluation at the utterance and group levels and supporting future individual-level analyses. We instantiate "digital twin" RPLAs with seven LLMs and evaluate across two settings: next-message prediction and full dynamics simulation, using stance-based opinion-dynamics metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. On the held-out group split, supervised fine-tuning (SFT) for Llama-3.1-8B-Instruct improves auxiliary stance alignment and reduces group-level convergence error, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions.

2603.19862 2026-06-01 cs.CV cs.LG

IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov

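AI summary By analyzing the spectral properties of CLIP projectors, the paper identifies an inter-modally aligned subspace and anisotropic directions, and proposes IsoCLIP, a training-free method that removes the anisotropic directions to improve intra-modal alignment. It lowers latency and outperforms existing methods on intra-modal retrieval and classification tasks.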

Comments Accepted at CVPR2026

Abstract

Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.
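The two operators mentioned above can be made concrete by writing out the cosine similarity of projected features. The notation is assumed for illustration and may differ from the paper's: $x, x'$ are pre-projection image embeddings, $y$ is a pre-projection text embedding, and $W_I$, $W_T$ are the image and text projector matrices acting as $W_I x$ and $W_T y$.

```latex
% Assumed notation: W_I, W_T are the projectors; x, x', y are pre-projection embeddings.
\[
  s_{\mathrm{inter}}(x, y)
  = \frac{x^{\top} W_I^{\top} W_T\, y}
         {\sqrt{x^{\top} W_I^{\top} W_I\, x}\;\sqrt{y^{\top} W_T^{\top} W_T\, y}},
  \qquad
  s_{\mathrm{intra}}(x, x')
  = \frac{x^{\top} W_I^{\top} W_I\, x'}
         {\sqrt{x^{\top} W_I^{\top} W_I\, x}\;\sqrt{x'^{\top} W_I^{\top} W_I\, x'}} .
\]
```

In the image-text similarity, the cross term $W_I^{\top} W_T$ is the only place where the two modalities interact, while $W_I^{\top} W_I$ and $W_T^{\top} W_T$ enter only through the norms. Image-to-image retrieval, however, ranks by $W_I^{\top} W_I$ in the numerator. This is one reading of why an operator that the contrastive loss touches only through normalization can leave intra-modal similarities misaligned.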

2603.19262 2026-06-01 cs.CL cs.AI

Empirical Characterization of Inference-Time Elicited Probability Transformations in Large Language Models

Mike Farmer, Abhinav Kochar, Yugyung Lee

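AI summary An empirical study finding that, across several inference-time pipelines (chain-of-thought, self-refinement, retrieval augmentation, and verifier-guided revision), the probability transformation over candidate answers follows an approximate log-ratio relationship; the paper also analyzes how the coefficients vary and how robust the relationship is.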

Comments 22 pages, 11 figures, 5 tables

Abstract

Large language models increasingly rely on inference-time procedures such as chain-of-thought reasoning, self-refinement, retrieval augmentation, and verifier-guided revision, yet the structure of elicited probability transformations under these procedures remains poorly understood. We study externally elicited probability assignments over candidate answers and observe recurring approximate log-ratio relationships: \[ \log \tilde q_t(i) = \alpha_t \left( \log q_t(i) + \log b_t(i) \right) + c_t, \] where $q_t$ and $\tilde q_t$ are pre- and post-elicitation probabilities, $b_t$ is an externally constructed evidence signal, and $\alpha_t$ is an empirical descriptor of the prompting configuration. Across 4,975 reasoning problems from GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge, evaluated on multiple instruction-tuned model families, we observe approximate log-ratio relationships with mean $R^2 \approx 0.76$ over about $1.3 \times 10^5$ candidate-level observations. Coefficients vary across elicitation settings, but qualitatively similar relationships persist across evaluated conditions. Robustness analyses using alternative statistical representations, prompting configurations, held-out evaluation, and token-level log-probabilities suggest that the observed structure is not tied to one prompting procedure or probability estimation method. The main contribution is not the algebraic form itself, which is related to generalized Bayesian updating and probability-transformation frameworks, but the empirical observation that diverse inference-time prompting pipelines repeatedly exhibit reproducible log-ratio structure under controlled conditions. The framework provides a protocol-sensitive perspective for analyzing calibration, evidence amplification, uncertainty propagation, and interaction sensitivity in inference-time LLM pipelines.
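The log-ratio relationship above is linear in log space, so its coefficient and intercept can be estimated by ordinary least squares. The sketch below assumes one regression per elicitation setting and uses synthetic probabilities constructed to satisfy the relationship exactly; the function name `fit_log_ratio` and the data are illustrative, and the paper's actual estimation procedure may differ.

```python
import math

def fit_log_ratio(q, b, q_tilde):
    """Least-squares fit of log q_tilde = alpha * (log q + log b) + c. Returns (alpha, c, R^2)."""
    xs = [math.log(qi) + math.log(bi) for qi, bi in zip(q, b)]
    ys = [math.log(v) for v in q_tilde]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    c = mean_y - alpha * mean_x
    ss_res = sum((y - (alpha * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return alpha, c, 1.0 - ss_res / ss_tot

# Synthetic example: post-elicitation probabilities built with alpha = 1.5.
q = [0.1, 0.2, 0.3, 0.4]             # pre-elicitation probabilities
b = [0.4, 0.3, 0.2, 0.1]             # hypothetical evidence signal
unnormalized = [(qi * bi) ** 1.5 for qi, bi in zip(q, b)]
z = sum(unnormalized)
q_tilde = [u / z for u in unnormalized]  # normalizing sets c = -log z

alpha, c, r_squared = fit_log_ratio(q, b, q_tilde)
```

On this noiseless data the fit recovers alpha = 1.5 with an R^2 of essentially 1, and the intercept absorbs the normalization constant; real elicited probabilities would scatter around the line.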

2603.17306 2026-06-01 cs.CL q-bio.NC

Evidence for systematic semantic structure in individual phonemes

Gexin Zhao

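AI summary Using large language models, cross-linguistic listener experiments, and articulatory feature analysis, the study argues that individual English phonemes carry structured, multidimensional semantic profiles, challenging the traditional assumption that sound-meaning relations are arbitrary.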

Comments 31 pages, 4 figures

Abstract

A foundational assumption in linguistics holds that sound-meaning relations are largely arbitrary. Here we show that this assumption fails at the level of individual phonemes: each English phoneme carries a structured, multidimensional semantic profile that is recoverable from text, perceived across languages, and grounded in articulation. Three large language models independently detected consistent semantic structure across nine perceptual dimensions in 220 pairwise letter contrasts. Native English speakers (N = 93) confirmed these associations in a preregistered forced-choice task (85.3% agreement with model predictions), and listeners of five typologically diverse languages (N = 155) replicated the effect under audio presentation (73.2%-81.9% accuracy). Articulatory features predicted the structure with cross-validated R^2 of 0.56-0.98, indicating that the bodily act of producing a sound systematically shapes the meaning it conveys. These findings reframe phoneme-level iconicity as a pervasive, embodied property of the phonological system.

2601.05770 2026-06-01 cs.LG cs.CL

Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer

Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin

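AI summary Proposes the Discrete Transformer, an architecture that injects discreteness through temperature-annealed sampling and combines hypothesis testing with symbolic regression to extract interpretable algorithms from model weights. It matches an RNN baseline on discrete tasks and extends to tasks with continuous intermediate computations.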

Abstract

Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo recovery of executable mechanisms from weights without relying on human-written target programs. However, applying this paradigm to Transformers is complicated by representation entanglement (e.g., superposition), where features encoded in overlapping directions substantially hinder the recovery of symbolic expressions. We propose the Discrete Transformer, an architecture explicitly designed to bridge the gap between continuous representations and discrete symbolic logic. By injecting discreteness through temperature-annealed sampling, our framework effectively leverages hypothesis testing and symbolic regression to extract human-readable programs. Empirically, the Discrete Transformer achieves performance comparable to the RNN-based MIPS baseline on shared discrete tasks, while broadening extraction to tasks with continuous-valued intermediate computations. Finally, we show that architectural inductive biases provide fine-grained control over synthesized programs, establishing the Discrete Transformer as a controllable testbed for algorithm extraction and Transformer interpretability.
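How temperature annealing pushes sampling toward discrete choices can be shown with a tempered softmax. This is a generic sketch: the geometric schedule, the function names, and the logits are assumptions, and the abstract does not say where in the architecture the sampling is applied or which sampler is used.

```python
import math
import random

def tempered_softmax(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Draw one category index from the tempered distribution."""
    probs = tempered_softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

def geometric_schedule(start, end, steps):
    """Temperatures decaying geometrically from start to end (steps >= 2)."""
    ratio = (end / start) ** (1.0 / (steps - 1))
    return [start * ratio ** i for i in range(steps)]

logits = [2.0, 1.0, 0.5]                      # hypothetical scores for three options
rng = random.Random(0)                        # seeded for reproducibility
schedule = geometric_schedule(5.0, 0.05, 5)
peak_probs = [max(tempered_softmax(logits, t)) for t in schedule]
final_draws = [sample_index(logits, schedule[-1], rng) for _ in range(20)]
```

At high temperature the distribution is close to uniform, so sampling explores; as the temperature falls, the largest logit takes nearly all the probability mass and sampling behaves like a hard, discrete selection.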

2603.18382 2026-06-01 cs.AI

From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents

Myeongseob Ko, Jihyun Jeong, Sumiran Singh Thakur, Gyuhak Kim, Ruoxi Jia

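AI summary Studies the ability of LLM agents to reconstruct real identities by combining scattered, non-identifying cues with public evidence. Agents de-anonymize with high success rates even without explicit identifiers, and the paper proposes a new dimension for privacy evaluation.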

Comments Accepted at ICML 2026

Journal ref ICML 2026

Abstract

Anonymization is often assumed to protect privacy once explicit identifiers are removed, because re-identification has historically required specialized expertise, tailored algorithms, and manual corroboration. We show that LLM-based agents weaken this barrier: by combining scattered, individually non-identifying cues with public evidence, they reconstruct real-world identities, sometimes even during benign tasks. We evaluate this risk across three settings: classical linkage incidents; a controlled benchmark (InferLink) that varies fingerprint type, task framing, and attacker knowledge; and open-ended human-AI interaction traces. In the sparsest regime of the Netflix Prize deanonymization setting, agents reconstruct 79.2% of identities, against 56.0% for a classical matching baseline; on InferLink, they link individuals even without an explicit re-identification request, and more often once one is given. In redacted human-AI interaction traces, agents further resolve anonymized profiles to specific individuals by corroborating contextual cues with public evidence. These findings suggest that privacy evaluations for agentic systems should measure not only what information is accessed or disclosed, but also what identities can be inferred.

2603.17145 2026-06-01 cs.LG cs.AI

REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

Yasi Zhang, Tianyu Chen, Mingyuan Zhou, Oscar Leong, Ying Nian Wu, Michal Lukasik

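AI summary Proposes REAL, a framework that incorporates regression objectives into reinforcement learning through a generalized policy gradient, optimizing the numeric scoring of LLM judges and outperforming SFT and standard RL methods across model scales.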

Comments Accepted to ICML 2026. The first two authors contributed equally

Abstract

Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose REAL (REgression-Aware Reinforcement Learning), a principled RL framework designed to optimize regression rewards and proven to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, thus invalidating standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectories, and (2) regression-aware prediction refinement of the final score. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.
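The abstract's example (predicting 4 versus 1 when the truth is 5) can be worked through directly. The negative squared error reward and the expected-value prediction below are assumptions chosen as common regression-aware choices; the abstract does not state REAL's exact reward or decoding rule.

```python
def binary_reward(pred, truth):
    """0-1 accuracy: only an exact match is rewarded."""
    return 1.0 if pred == truth else 0.0

def squared_error_reward(pred, truth):
    """Assumed regression reward: closer predictions are penalized less."""
    return -float((pred - truth) ** 2)

def expected_score(score_probs):
    """Assumed regression-aware prediction: probability-weighted mean of candidate scores."""
    return sum(score * p for score, p in score_probs.items())

truth = 5
binary = [binary_reward(p, truth) for p in (4, 1)]             # both predictions look equally bad
regression = [squared_error_reward(p, truth) for p in (4, 1)]  # 4 is clearly preferred to 1
soft_prediction = expected_score({4: 0.25, 5: 0.75})           # hypothetical score distribution
```

The binary reward cannot tell the two wrong answers apart, while the regression reward ranks 4 (penalty 1) well above 1 (penalty 16), which is the ordinal signal the abstract says standard RL discards.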

2603.16123 2026-06-01 cs.LG cs.AI math.AT math.CT

Functorial Neural Architectures from Higher Inductive Types

Karen Sargsyan

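AI summary Proposes compiling Higher Inductive Type specifications into neural architectures so that decoders are strict monoidal functors by construction, improving on non-functorial approaches by 2-10x on compositional generalization tasks.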

Comments 26 pages, 10 tables. Code and Cubical Agda formalization: https://github.com/karsar/hott_neuro

Abstract

Neural networks often learn the parts of a task but fail on novel combinations of those parts. We argue that this failure is architectural: a decoder generalizes compositionally only when it respects the algebraic laws of the task, i.e. when it descends from freely generated sequences to the quotient determined by those laws. We make this principle constructive by compiling Higher Inductive Type (HIT) specifications into neural architectures. Basepoints, path constructors, and 2-cells are mapped to base constraints, generator networks, structural concatenation, and learned homotopies. The resulting transport decoders are strict monoidal functors by construction: decoding a concatenated word is concatenation of independently generated loop segments. In contrast, we prove that softmax self-attention cannot simultaneously satisfy strict monoidal composition and descent to any non-trivial compositional quotient. Experiments on the torus, wedge of circles, and Klein bottle validate the predicted hierarchy: functorial decoders outperform non-functorial alternatives by $2$--$10\times$, and a learned 2-cell closes a $46\%$ error gap precisely on words exercising the Klein-bottle relation. These results suggest that compositional generalization should be enforced as functorial structure in the architecture, rather than learned from examples alone.
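The property that decoding a concatenated word equals the concatenation of independently decoded segments can be stated as a small executable check. Everything here is a toy stand-in: the lookup table `GENERATORS` replaces the learned generator networks, the unit steps replace loop segments, and the winding pair is used as a simple proxy for the torus quotient. It is not the paper's transport decoder.

```python
# Hypothetical generators for the torus: 'a' and 'b' are the two loops,
# 'A' and 'B' their inverses. Each maps to a list of unit steps.
GENERATORS = {"a": [(1, 0)], "b": [(0, 1)], "A": [(-1, 0)], "B": [(0, -1)]}

def decode(word):
    """Decode a word symbol by symbol and concatenate the resulting segments."""
    segments = []
    for symbol in word:
        segments.extend(GENERATORS[symbol])
    return segments

def winding(segments):
    """Net displacement of a path, i.e. how often it wraps each torus direction."""
    return (sum(dx for dx, _ in segments), sum(dy for _, dy in segments))

left, right = "ab", "aB"
# Strict monoidal property: decoding commutes with concatenation.
functorial = decode(left + right) == decode(left) + decode(right)
# Torus relation ab = ba: the commutator has zero winding, and both orders agree.
commutator_winding = winding(decode("abAB"))
same_class = winding(decode("ab")) == winding(decode("ba"))
```

Because `decode` processes each symbol independently, the concatenation property holds by construction rather than by training, which is the sense in which the abstract argues compositional structure should live in the architecture.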