arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.15514 2026-05-18 cs.CL cs.AI cs.LG

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

Yufeng Du, Phillip Harris, Minyang Tian, Eliu A Huerta, Srikanth Ronanki, Subendhu Rongali, Aram Galstyan, Hao Peng

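AI summary This paper proves that RoPE breaks down in long contexts because it loses its locality bias and its consistency in token relevance, leaving it unable to distinguish positions or tokens, and that increasing the RoPE base helps only at the cost of the ability to distinguish positions.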

Comments 35 pages, 11 figures, submitted to NeurIPS 2026

Abstract

We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
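
The relative-position property that this analysis starts from can be illustrated with a minimal sketch. It assumes the standard RoPE formulation, in which consecutive pairs of dimensions are rotated by angles `pos * base**(-2i/d)`; the vectors, positions, and function names are illustrative and are not taken from the paper.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles pos * base**(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def rope_score(q, k, q_pos, k_pos, base=10000.0):
    """Unnormalized attention score between a rotated query and a rotated key."""
    qr = rope_rotate(q, q_pos, base)
    kr = rope_rotate(k, k_pos, base)
    return sum(a * b for a, b in zip(qr, kr))

# Illustrative 4-dimensional query and key (assumed values).
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, 0.2, 0.4]

# The score depends only on the offset q_pos - k_pos (here 5 in both cases).
s1 = rope_score(q, k, 10, 5)
s2 = rope_score(q, k, 1010, 1005)

# Scores at growing distances, for inspecting how they vary with distance.
scores_by_distance = [rope_score(q, k, dist, 0) for dist in (0, 64, 512, 4096)]
```

In this toy setting each dimension pair contributes a cosine of the offset, so the score varies with distance without being forced to decay. The sketch only shows the mechanism and does not reproduce the paper's proofs.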

2605.15513 2026-05-18 cs.AI

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

Fangzhou Lin, Shuo Xing, Peiran Li, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhengzhong Tu

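AI summary CAPS uses cascaded adaptive pairwise selection to cut verifier compute cost while keeping parallel reasoning efficient, and it outperforms existing pairwise verification methods.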

Comments 31 pages, 2 figures, 18 tables

Abstract

Parallel reasoning, where a generator samples many candidate solutions and an aggregator selects the best, is one of the most effective forms of test-time scaling in large language models, and pairwise self-verification has become its strongest aggregation primitive. Yet pairwise verification carries a heavy cost: each judgment reads two complete solutions in full, and existing methods perform tens of such judgments per problem regardless of whether the comparison is informative. We introduce CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that allocates verifier compute non-uniformly along two orthogonal axes: an evidence axis that adapts how much of each candidate the judge sees, and a distribution axis that adapts how comparisons are spread across the pool. CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks spanning code (LiveCodeBench-v5/v6, CodeContests) and math (AIME 2025, HMMT 2025), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20. The trade-off suites admit an interpretable diagnostic in terms of the verifier's accuracy at partial versus full evidence, providing a concrete pre-deployment check for cascade suitability.

2605.15510 2026-05-18 cs.RO

A QUBO Formulation Framework for Kinematic Structure-Based Robot Design Optimization: A Robotic Hand Case Study

HyoJae Kang, Yeong Jae Park, Jeongdo Ahn, Dongil Park

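AI summary This paper proposes a quadratic unconstrained binary optimization (QUBO) framework for robot design optimization that evaluates kinematic structure-level metrics with classical computation and optimizes with quantum annealing, and validates the approach on a robotic hand case study.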

Comments This manuscript has been submitted for possible publication. 14 pages, 5 figures

Abstract

This paper presents a quadratic unconstrained binary optimization (QUBO)-based formulation framework for robot design optimization using kinematic structure-level evaluation metrics. In the proposed framework, classical computation is used to evaluate design-dependent metrics while the resulting combinatorial selection problem is formulated in a structure compatible with quantum annealing-based optimization. A robotic hand is adopted as a representative case study, as its performance is determined by both the individual kinematic characteristics of each finger and interaction terms. The proposed formulation incorporates individual design rewards, overlap workspace interactions, a one-hot constraint, and structural dependency penalties into a unified quadratic model. A 27-variable robotic hand design problem is constructed, and simulated annealing is used as a classical baseline to verify the feasibility of the formulation. Quantum annealing is further performed to examine the applicability of the proposed formulation to annealing-based hardware execution. The results show that feasible design combinations satisfying both one-hot selection and pairwise constraints can be obtained, with the observed objective-value range becoming narrower as the number of reads increases. In addition, the formulation process is discussed for other robotic systems. The proposed framework provides a generalized approach for transforming kinematic structure-based robot design problems into combinatorial optimization problems.
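
A toy version of such a quadratic model may help make the ingredients concrete. The sketch below is an assumption-laden illustration, not the paper's 27-variable problem: it uses two fingers with two candidate designs each, made-up rewards, one made-up overlap penalty, and a brute-force search in place of simulated or quantum annealing. The one-hot penalty P(Σx − 1)² is expanded with x² = x into −P on each diagonal entry and +2P on each pair within a finger.

```python
from itertools import product

# Hypothetical toy data: (finger, design) -> reward.
rewards = {(0, 0): 3.0, (0, 1): 2.0, (1, 0): 2.5, (1, 1): 1.0}
# Hypothetical workspace-overlap penalty between two specific designs.
overlap = {((0, 0), (1, 0)): 4.0}
PENALTY = 10.0  # weight of the one-hot constraint (assumed value)

variables = sorted(rewards)
index = {v: i for i, v in enumerate(variables)}

def build_qubo():
    """Return an upper-triangular QUBO as {(i, j): weight} with i <= j."""
    Q = {}

    def add(i, j, w):
        key = (min(i, j), max(i, j))
        Q[key] = Q.get(key, 0.0) + w

    for v, r in rewards.items():
        add(index[v], index[v], -r)  # maximizing reward = minimizing -reward
    for finger in sorted({f for f, _ in variables}):
        group = [index[v] for v in variables if v[0] == finger]
        for a in group:
            add(a, a, -PENALTY)                        # linear part of P(sum - 1)^2
        for ai in range(len(group)):
            for bi in range(ai + 1, len(group)):
                add(group[ai], group[bi], 2 * PENALTY)  # pairwise part
    for (u, v), w in overlap.items():
        add(index[u], index[v], w)
    return Q

def energy(Q, x):
    """QUBO objective for a binary assignment x (constant offset omitted)."""
    return sum(w * x[i] * x[j] for (i, j), w in Q.items())

def brute_force(Q, n):
    """Exhaustive minimization; only viable for very small n."""
    return min(product((0, 1), repeat=n), key=lambda x: energy(Q, x))

Q = build_qubo()
best = brute_force(Q, len(variables))
chosen = [v for v in variables if best[index[v]]]
```

With these numbers the minimizer picks design 1 for finger 0 and design 0 for finger 1: the highest-reward pair is avoided because its overlap penalty outweighs the extra reward.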

2605.15509 2026-05-18 cs.LG cs.RO

parallelcbf: A composable safety-filter and auditability framework for tensor-parallel reinforcement learning

Yijun Lu, Zilei Yang, Yuyin Ma

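AI summary ParallelCBF is the first framework to unify tensor-parallel UAV environments, hard-gate CBF safety filters, sharded BC-to-RL pipelines, and first-class operational auditability, offering composable APIs for end-to-end safety-constrained training.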

Abstract

While Isaac Lab provides massively parallel UAV simulation, OmniSafe and safe-control-gym provide constrained-RL benchmarks, and CBFKit provides control-barrier-function synthesis tooling, no existing framework unifies these capabilities for end-to-end safety-constrained training. ParallelCBF is the first framework to unify (i) tensor-parallel UAV environments, (ii) hard-gate CBF safety filters, (iii) sharded BC-to-RL pipelines, and (iv) first-class operational auditability -- pre-registration, watchdog registries, failure forensics, and dataset audits as composable APIs rather than user-implemented scripts. We release ParallelCBF v0.1.0 under Apache 2.0 with a four-layer composable API, a CPU PyTorch reference implementation of a dual-barrier (squared / linear-predictive) CBF, property-based safety invariance tests across vectorized batch sizes that complete in 1.67 s for the full 39-test suite, and a 31,415-episode behavior-cloning collection campaign whose curriculum mix, per-bucket yields, and dataset SHA-256 are auditable through the framework's own ops primitives. We report a representative end-to-end pipeline execution in which the framework's auditability layer halted a downstream training stage that did not meet pre-registered convergence criteria, preventing silent propagation of a degraded checkpoint -- an architectural property we argue is necessary, not merely useful, for reproducible empirical robotics research. The framework is installable via pip install parallelcbf; source and release artifacts are available at https://github.com/xiaoyang-123-cell/ParallelCBF.
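
For readers unfamiliar with control-barrier-function (CBF) safety filters, the following minimal sketch shows the general idea on a one-dimensional system. It is a generic textbook-style filter and does not use or imitate the ParallelCBF API; the system (x' = u), the barrier h(x) = x_max − x, the gain `alpha`, and all function names are assumptions made for illustration.

```python
def cbf_filter(x, u_nom, x_max=1.0, alpha=2.0):
    """Clamp a nominal velocity command so that h(x) = x_max - x stays >= 0.

    For x' = u the CBF condition h' + alpha * h >= 0 reduces to
    u <= alpha * (x_max - x), so the closest safe command is a simple min.
    """
    u_bound = alpha * (x_max - x)
    return min(u_nom, u_bound)

def rollout(x0, u_nom, steps=200, dt=0.01):
    """Integrate x' = u with forward Euler, filtering every command."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x += dt * cbf_filter(x, u_nom)
        trajectory.append(x)
    return trajectory

# A nominal policy that always pushes toward the boundary at x_max = 1.0.
traj = rollout(x0=0.0, u_nom=5.0)
```

The filtered state approaches the boundary but does not cross it, while commands that are already safe pass through unchanged.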

2605.15504 2026-05-18 cs.LG cs.AI

Learning with Conflicts of Interest

Nischal Aryal, Arash Termehchy, Ali Vakilian, Marianne Winslett

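AI summary This paper proposes a game-theoretic framework for conflicts of interest between ML systems and their users, with scalable algorithms that maximize beneficial information while protecting users.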

Abstract

Financial, social, and political factors often prevent the interests of the owners of ML systems and services and their users from being perfectly aligned. ML systems often produce biased information that can influence users to make decisions that are not in their best interest. Current solution approaches require ML systems to implement protocols to mitigate their biases. However, ML system owners usually do not have any incentive to implement these protocols and often argue that they limit their freedom of expression or business. We believe that a successful solution to this problem must recognize the conflict of interest between the ML systems and their users, and use this information to protect users against information that adversely influences their decisions while allowing users to safely benefit from these systems. To this end, we propose a game-theoretic framework that models the interaction between ML systems and users with conflicts of interest. We present scalable algorithms with theoretical guarantees that maximize the amount of desired information and actions and minimize the amount of biased and manipulative actions in interaction with ML systems.

2605.15496 2026-05-18 cs.RO cs.CV

LAPS: Improving Incremental LiDAR Mapping using Active Pooling and Sampling for Neural Distance Fields

Dongjae Lee, Wooseong Yang, Yifu Tao, Maurice Fallon, Ayoung Kim

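AI summary LAPS improves replay management for incremental neural mapping through active pooling and sampling, improving replay retention and allocation and raising reconstruction completeness while maintaining competitive geometric accuracy.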

Comments accepted at RA-L 2026

Abstract

Neural distance fields offer a compact and continuous representation of 3D geometry, making them attractive for incremental LiDAR mapping. However, their online optimization is vulnerable to catastrophic forgetting, where new observations can degrade previously reconstructed geometry. Replay-based training is commonly used to address this issue, but existing methods typically rely on passive replay buffers and uniform sampling, which can waste memory on redundant observations and under-train poorly constrained regions. We propose LAPS, a replay management framework for incremental neural mapping that improves both replay retention and replay allocation during online updates. LAPS combines reliability-based active pooling to retain reliable historical samples under limited memory with uncertainty-guided active sampling to focus optimization on under-constrained regions. Experiments on synthetic and real-world benchmarks show that LAPS consistently improves reconstruction completeness while maintaining competitive geometric accuracy. On Oxford Spires, it improves recall by 4.66 pp and F1-score by 3.79 pp over PIN-SLAM on the Blenheim Palace 05 sequence. We release our open source implementation at: https://github.com/dongjae0107/LAPS.

2605.15492 2026-05-18 cs.RO cs.CV

FLASH: Efficient Visuomotor Policy via Sparse Sampling

Jiaqi Bai, Jindou Jia, Yuxuan Hu, Gen Li, Xiangyu Chen, Tuo An, Kuangji Zuo, Jianfei Yang

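AI summary FLASH makes visuomotor policy learning more efficient through sparse sampling and a Legendre polynomial trajectory representation, achieving longer action horizons and faster inference; experiments show state-of-the-art performance across multiple tasks.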

Comments 19 pages, 10 figures

Abstract

Generative models such as diffusion and flow matching have become dominant paradigms for visuomotor policy learning, yet their reliance on iterative denoising incurs high inference latency incompatible with real-time robotic control. We present Fast Legendre-polynomial Action policy via Sparse History-anchored flow (FLASH Policy), which replaces discrete action-chunk generation with a continuous Legendre polynomial trajectory representation. Specifically, by fitting expert demonstrations under sparse temporal sampling, FLASH enables a single inference to cover a significantly extended action horizon. To further accelerate generation, FLASH initiates the flow matching process from history polynomial coefficients rather than uninformative Gaussian noise, shortening the transport distance and enabling accurate single-step inference. Moreover, analytic polynomial differentiation directly provides desired velocity feed-forward signals to the torque controller without numerical approximation. Extensive experiments on five simulated and two real-world manipulation tasks demonstrate that FLASH achieves state-of-the-art success rates (≥ 92% across all tasks), a per-episode inference time of 31.40 ms (up to 175× faster than diffusion policies and 18× faster than prior flow matching policies), up to 4× faster training convergence than ACT, and a 5× to 7× reduction in controller tracking error compared to discrete-action baselines.
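
The point about analytic differentiation can be sketched in a few lines. The code below is an illustration under assumptions, not the FLASH implementation: it evaluates a trajectory expressed as a Legendre series on normalized time t in [-1, 1] and obtains its derivative from the standard recurrences, with hypothetical coefficients for a single joint.

```python
def legendre_basis(t, degree):
    """Return ([P_0(t)..P_degree(t)], [P_0'(t)..P_degree'(t)]).

    Uses (n+1) P_{n+1} = (2n+1) t P_n - n P_{n-1}
    and  P'_{n+1} = P'_{n-1} + (2n+1) P_n.
    """
    p = [1.0, t]
    dp = [0.0, 1.0]
    for n in range(1, degree):
        p.append(((2 * n + 1) * t * p[n] - n * p[n - 1]) / (n + 1))
        dp.append(dp[n - 1] + (2 * n + 1) * p[n])
    return p[: degree + 1], dp[: degree + 1]

def trajectory(coeffs, t):
    """Position and d(position)/dt of sum_n coeffs[n] * P_n(t)."""
    p, dp = legendre_basis(t, len(coeffs) - 1)
    pos = sum(c * b for c, b in zip(coeffs, p))
    vel = sum(c * b for c, b in zip(coeffs, dp))
    return pos, vel

# Hypothetical degree-3 coefficients for one joint.
coeffs = [0.2, 0.5, -0.1, 0.05]
pos, vel = trajectory(coeffs, 0.3)
```

The derivative here is with respect to normalized time; converting it to a physical velocity would require scaling by the ratio between normalized and real time, which depends on the chosen horizon.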

2605.15488 2026-05-18 cs.LG stat.ML

SurvivalPFN: Amortizing Survival Prediction via In-Context Bayesian Inference

Shi-ang Qi, Vahid Balazadeh, Michael Cooper, Russell Greiner, Rahul G. Krishnan

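AI summary SurvivalPFN amortizes survival prediction through in-context learning: a pretrained network handles right-censored data in a single forward pass, avoids restrictive parametric assumptions, produces calibrated survival distributions, and performs strongly on 61 datasets.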

Abstract

Survival analysis provides a powerful statistical framework for modeling time-to-event outcomes in the presence of censoring. However, selecting an appropriate estimator from the many specialized survival approaches often requires substantial methodological and domain expertise. We introduce SurvivalPFN, a prior-data fitted network that amortizes Bayesian inference for censored observations through in-context learning. SurvivalPFN is pretrained on a diverse family of synthetic, identifiable, and right-censored data-generating processes, enabling it to amortize survival analysis in a single forward pass during inference. As a result, the model adapts to the effective complexity of each dataset without task-specific training or hyperparameter tuning, avoids restrictive parametric assumptions, and produces calibrated survival distributions. In a large-scale benchmark spanning 61 datasets, 21 methods, and 5 evaluation metrics, SurvivalPFN achieves strong predictive performance and often improves upon established survival models. These results suggest that SurvivalPFN offers a principled and practical foundation model for survival analysis, with potential applications in high-impact domains such as healthcare, finance, and engineering (https://github.com/rgklab/SurvivalPFN).
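
As background on what right-censoring means in practice, the sketch below implements the classical Kaplan-Meier estimator, one of the established survival methods. It is not SurvivalPFN and says nothing about how that model works; the toy data and function name are assumptions for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  observed time per subject (event time or censoring time)
    events: 1 if the event was observed, 0 if the subject was right-censored
    Returns [(time, survival_probability)] at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without changing the estimate.
        at_risk -= removed
    return curve

# Toy data: the subjects observed at t=3 (second entry) and t=8 are censored.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

On this toy data the estimate steps down to roughly 0.8, 0.6, and 0.3 at times 2, 3, and 5; the censored subjects shrink the risk set but do not create steps of their own.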

2605.15486 2026-05-18 cs.RO cs.AI

Hybrid LLM-based Intelligent Framework for Robot Task Scheduling

Swayamjit Saha, Subhabrata Das, Haonan Duan, Xiao-Yang Liu

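AI summary This paper uses large language models to improve task scheduling for construction robots, balancing time efficiency against resource utilization, communicating with construction professionals in real time through a natural language processing interface, and using two LLM agents to produce more precise task schedules.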

Comments 9 pages, 5 figures

Abstract

This study introduces intelligent frameworks that use Large Language Models (LLMs) to improve task scheduling for construction robots. The LLM is given key data about the desired task, such as agent action abilities and the desired end goal. A well-balanced allocation strategy is developed, optimizing both time efficiency and resource utilization. Our system uses a Natural Language Processing interface to streamline communication with construction professionals and adapt in real time to unexpected site conditions. We concurrently use two LLM agents, a generator (GPT-4) and a supervisor (Gemma 3/Llama 4/Mistral 7B), to provide a more precise task schedule. We evaluate the proposed methodology using a straightforward scenario and provide metric scores to demonstrate the efficacy of the frameworks. Our results highlight that the implementation of LLMs is crucial in construction operational tasks involving robots.

2605.15484 2026-05-18 cs.CV cs.LG

When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin

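AI summary This paper studies when sparse top-k routing helps in vision classification, identifies a compute-leverage pattern, examines how backbone architecture and multi-expert routing affect performance, and isolates the key factors through controlled experiments.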

Comments 24 pages (main + appendix), 8 figures, 18 tables. Under review at TMLR. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho

Abstract

Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-k routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction ρ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing (k ≥ 2) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-k -- holding architecture, initialization, and ρ fixed -- reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho.
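
To make "sparse top-k routing with hard capacity constraints" concrete, here is a minimal sketch of one common way to implement it. It is not the paper's code: the first-come-first-served capacity rule, the renormalization of gate weights over the experts that were kept, and all names and logits are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k, capacity):
    """Send each token to its top-k experts, dropping assignments over capacity.

    Returns (assignments, load): assignments[t] is a list of
    (expert, gate_weight) pairs for token t, load[e] is expert e's token count.
    """
    num_experts = len(router_logits[0])
    load = [0] * num_experts
    assignments = []
    for logits in router_logits:
        probs = softmax(logits)
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
        kept = []
        for e in top:
            if load[e] < capacity:  # hard capacity: overflow is dropped
                load[e] += 1
                kept.append((e, probs[e]))
        norm = sum(w for _, w in kept)
        # Renormalize gate weights over the experts that accepted the token.
        assignments.append([(e, w / norm) for e, w in kept] if kept else [])
    return assignments, load

# Three tokens, three experts (illustrative router logits).
logits = [[2.0, 1.0, 0.1], [1.5, 0.2, 1.0], [3.0, 0.5, 0.4]]
assignments, load = route_top_k(logits, k=2, capacity=2)
```

In this example the third token prefers expert 0, but that expert is already full, so the token is served only by its second choice.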

2605.15480 2026-05-18 cs.RO cs.AI

Residual Reinforcement Learning for Robot Teleoperation under Stochastic Delays

Kaize Deng, Zewen Yang

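AI summary To address signal discontinuities caused by stochastic delays, this paper proposes a hybrid control framework that combines an LSTM state estimator with a residual reinforcement learning policy to improve teleoperation stability and performance.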

Comments Accepted at 23rd IFAC World Congress 2026

Abstract

Stochastic communication delays in teleoperation introduce signal discontinuities that undermine control stability and degrade control performance. Consequently, conventional reinforcement learning (RL) methods struggle with the delayed observations, leading to high-frequency chattering. To address this, we propose a hybrid control framework, delay-resilient RL, that integrates a Long Short-Term Memory (LSTM) state estimator with a residual RL policy. The LSTM reconstructs smooth, continuous state estimates from delayed observations, enabling the RL agent to learn a residual torque compensation policy that balances tracking accuracy with velocity smoothness. Experimental validation on Franka Panda robots demonstrates that our approach significantly outperforms state-of-the-art baselines, ensuring robust and stable teleoperation even under high-variance stochastic delays.

2605.15475 2026-05-18 cs.CV cs.MM

A Unified Non-Parametric and Interpretable Point Cloud Analysis via t-FCW Graph Representation

Haijian Lai, Bowen Liu, Man Xu, Chan-Tong Lam, João Macedo, Benjamin Ng, Sio-Kei Im

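AI summary This paper proposes an empowered t-FCW graph representation for embedding point clouds, analyzes why it is effective, and designs a network for efficient, interpretable point cloud processing on classification and segmentation tasks.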

Comments Accepted for publication in IEEE Transactions on Multimedia

Abstract

We introduce an empowered transposed Fully Connected Weighted (t-FCW) graph representation to embed point clouds into a metric space. While the original t-FCW has shown promising results for point cloud classification, the reasons behind its effectiveness and its broader applicability have remained unclear. In this work, we analyze the properties that make the empowered and original t-FCW effective and design a network that uses the empowered t-FCW exclusively as its feature extractor. From an interpretability perspective, we build memory banks for classification, part segmentation, and semantic segmentation using the empowered t-FCW. Our analysis reveals that the empowered t-FCW inherits robustness from surface descriptors and provides interpretability through dimension-wise relations. These properties enable a highly efficient and interpretable network, which processes the ModelNet40 classification problem in approximately 7 seconds on an NVIDIA RTX A5000 GPU. Importantly, the empowered t-FCW can function both as a lightweight standalone baseline and as a complementary plug-in to existing deep models.

2605.15467 2026-05-18 cs.CL cs.AI

Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction

A H M Rezaul Karim, Ozlem Uzuner

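AI summary This paper proposes a modular retrieval-augmented generation framework that uses schema-constrained prompting, deterministic postprocessing, and a second-pass audit to reach an F1 score of 80.36% on observation extraction from nurse-patient conversations.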

Abstract

Conversational nurse-patient transcripts contain actionable observations, but converting these transcripts into structured representations at scale remains challenging. Documentation burden is substantial, with prior studies showing clinicians spend large portions of their workday on documentation and related desk work rather than direct patient care. MEDIQA-SYNUR focuses on observation extraction from conversational nurse-patient transcripts, requiring systems to normalize these narratives into a predefined schema with value-type constraints. We propose a modular retrieval-augmented generation (RAG) pipeline that uses the training set as an exemplar corpus and combines schema-constrained prompting (full schema vs. pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit, using two LLM backbones, Llama-4-Scout-17B-16E-Instruct and GPT-5.2, each with a corresponding embedding model for RAG. Our best configuration uses GPT-5.2 with the full schema, RAG, and second-pass auditing, achieving an F1 score of 80.36%. Overall, our results show that RAG consistently improves performance, while the optimal degree of schema constraint depends on the model, and second-pass auditing yields modest additional gains by correcting residual schema-adherence errors.
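
A deterministic schema-based postprocessing step of the kind mentioned above might look roughly like the sketch below. Everything in it is hypothetical: the schema fields, the value-type names, and the repair rules are invented for illustration and are not the MEDIQA-SYNUR schema or the authors' pipeline.

```python
# Hypothetical schema: observation name -> value type (and allowed options).
SCHEMA = {
    "heart_rate": {"type": "NUMERIC"},
    "pain_level": {"type": "SINGLE_SELECT",
                   "options": ["none", "mild", "moderate", "severe"]},
    "nausea": {"type": "BOOLEAN"},
}

BOOLEAN_WORDS = {"yes": True, "true": True, "no": False, "false": False}

def normalize(name, value):
    """Return a schema-conformant value, or None if it cannot be repaired."""
    spec = SCHEMA.get(name)
    if spec is None:
        return None  # observation name is not part of the schema
    text = str(value).strip().lower()
    if spec["type"] == "NUMERIC":
        try:
            return float(text)
        except ValueError:
            return None
    if spec["type"] == "SINGLE_SELECT":
        return text if text in spec["options"] else None
    if spec["type"] == "BOOLEAN":
        return BOOLEAN_WORDS.get(text)
    return None

def postprocess(observations):
    """Keep only observations that satisfy the schema, with normalized values."""
    cleaned = []
    for obs in observations:
        value = normalize(obs.get("name"), obs.get("value"))
        if value is not None:
            cleaned.append({"name": obs["name"], "value": value})
    return cleaned

# Illustrative raw model output.
raw = [
    {"name": "heart_rate", "value": " 88 "},
    {"name": "pain_level", "value": "Moderate"},
    {"name": "nausea", "value": "Yes"},
    {"name": "mood", "value": "fine"},         # not in schema: dropped
    {"name": "pain_level", "value": "awful"},  # not an allowed option: dropped
]
cleaned = postprocess(raw)
```

Because the rules are deterministic, the same model output always yields the same structured record, which keeps this stage easy to audit separately from the LLM calls.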

2605.15465 2026-05-18 cs.LG eess.SP

Toward World Modeling of Physiological Signals with Chaos-Theoretic Balancing and Latent Dynamics

Yunfei Luo, Xi Chen, Yuliang Chen, Lanshuang Zhang, Md Mofijul Islam, Siwei Zhao, Peter Kotanko, Subhasis Dasgupta, Andrew Campbell, Rakesh Malhotra, Tauhidur Rahman

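AI summary This paper presents NormWear-2, a model that encodes multivariate physiological signals and clinical intervention variables into a shared latent space and combines inference from prior knowledge with non-parametric latent state transition adaptation to forecast across multiple time scales. Chaos-theoretic balancing of dynamical regime diversity makes the representations more robust, and the model performs well across a range of clinical settings.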

Comments NormWear Collection: https://huggingface.co/collections/mosaic-laboratory/normwear

Abstract

Physiological time series signals reflect complex, multi-scale dynamical processes of the human body. Existing modeling studies focus on static tasks such as classification, event forecasting, or short-horizon next-step prediction, while long-horizon signal-level forecasting and the predictive nature of physiological signals remain underexplored. We introduce NormWear-2, a world model that encodes both multivariate physiological signals and clinical intervention variables into a shared latent space and models their joint temporal evolution as a dynamical system. Our approach combines inference from prior pre-trained knowledge (intuition) with instant non-parametric latent state transition adaptation (insight), enabling coherent forecasting across multiple temporal scales, conditioned on heterogeneous clinical interventions. During the pretraining phase, we find that chaos-theoretic balancing of dynamical regime diversity yields more robust representations, with a smaller balanced corpus outperforming one twice its size and capturing bifurcation regimes. We evaluate the world model's performance across diverse real-world physiological datasets spanning heterogeneous temporal resolutions and intervention regimes, covering daily life, point-of-care, and clinical settings, including fitness planning, hemodialysis, diabetes management, and surgical monitoring. These evaluation datasets comprise records from 8,026 subjects, spanning study durations from 3.2 hours for high-resolution signal data to 2.3 years for longitudinal clinical biomarker tracking. NormWear-2 achieves the best overall forecasting performance across time, frequency, and latent representation domains, with significant improvements over state-of-the-art time series foundation models, while maintaining competitive downstream representation quality, providing a step toward general-purpose world models for physiological signals.

2605.15464 2026-05-18 cs.LG cs.AI cs.CL

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi

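AI summary GRLO studies how well RLHF trained on a small set of interactions in open-ended environments generalizes, asks whether the resulting conversational abilities transfer to downstream tasks such as mathematical reasoning and code generation, and demonstrates an efficient, low-cost training recipe.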

Abstract

Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation; we call this approach GRLO. Specifically, on the Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models, which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: https://github.com/SJY8460/GRLO.

2605.15463 2026-05-18 cs.LG

Layer-wise Derivative Controlled Networks

分层导数控制网络

Rowan Martnishn, Sean Anderson

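AI summary: This paper proposes the ChainzRule network, which uses DREG regularization to balance model accuracy, hardware efficiency, and functional stability; experiments show that it outperforms conventional models in parameter usage and in controlling gradient volatility.

Comments Under Review at Neural Network Elsevier

Details
AI Chinese abstract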

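随着机器学习模型复杂度的增加,它们越来越难以同时满足三项相互冲突的需求:高精度、硬件效率和功能稳定性。传统架构往往以尖锐或不可预测的行为为代价换取性能,即输入的微小变化会导致输出剧烈波动,这对于在敏感环境中的实际部署是一个严重缺陷。本文引入ChainzRule(CR),一种旨在协调这些竞争目标的新型神经架构。ChainzRule用受微分正则化(DREG)控制的多项式引擎取代标准分段线性激活函数。与对模型Lipschitz常数施加全局、粗粒度约束的传统方法不同,DREG对中间导数进行针对性正则化,从而在不削弱多项式引擎固有表示能力的情况下抑制极端敏感性。在一对一的“Fair Fight”基准测试中,ChainzRule以少15.5倍的参数量优于标准模型。在MNIST数据集上,它将峰值梯度波动平均降低了23.1%,使流形更平滑、更可预测。在显式DREG正则化下的Yelp Full序数回归任务中,ChainzRule达到70.17%的准确率,验证了导数感知正则化与实际任务上的竞争性能是兼容的。通过DREG将梯度感知嵌入架构,ChainzRule表明稳定性与准确性不必是相互竞争的目标。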

English abstract

As machine learning models grow in complexity, they increasingly struggle with three conflicting demands: the need for high accuracy, the requirement for hardware efficiency, and the necessity of functional stability. Traditional architectures often achieve performance at the expense of spiky or unpredictable behavior, where small changes in input lead to massive swings in output -- a critical flaw for real-world deployment in sensitive environments. This paper introduces ChainzRule (CR), a novel neural architecture designed to harmonize these competing goals. ChainzRule replaces standard piecewise-linear activations with a Polynomial Engine governed by Differential Regularization (DREG). Unlike traditional methods that impose global, coarse-grained constraints on a model's Lipschitz constant, DREG acts as a targeted regularization on intermediate derivatives. This approach suppresses extreme sensitivity without attenuating the representational power inherent in the Polynomial Engine. In head-to-head "Fair Fight" benchmarks, ChainzRule outperformed standard models while using 15.5x fewer parameters. On the MNIST dataset, it reduced peak gradient volatility by an average of 23.1%, ensuring a smoother and more predictable manifold. On Yelp Full ordinal regression under explicit DREG regularization, ChainzRule achieves 70.17% accuracy, validating that derivative-aware regularization is compatible with competitive performance on realistic tasks. By embedding gradient awareness into the architecture via DREG, ChainzRule demonstrates that stability and accuracy need not be competing objectives.
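
The abstract describes regularizing intermediate derivatives of a polynomial activation but gives no formula. As a rough illustration only, the sketch below penalizes the mean squared derivative of a cubic activation at a few pre-activation values. The function names, the coefficients, the weight `lam`, and the placeholder `task_loss` are all hypothetical; this is not the paper's DREG definition.

```python
# Hypothetical sketch: a polynomial activation plus a penalty on its derivative.
# Nothing here is taken from the paper; names and numbers are illustrative.

def poly(coeffs, x):
    """Evaluate sum_k coeffs[k] * x**k with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def poly_derivative(coeffs):
    """Coefficients of the derivative polynomial."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def derivative_penalty(coeffs, pre_activations):
    """Mean squared derivative of the activation at the given inputs."""
    d = poly_derivative(coeffs)
    return sum(poly(d, x) ** 2 for x in pre_activations) / len(pre_activations)

coeffs = [0.0, 1.0, 0.0, 0.5]           # p(x) = x + 0.5 x^3
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]    # assumed pre-activation values
task_loss = 0.25                         # placeholder for the task loss
lam = 0.01                               # hypothetical regularization weight
total_loss = task_loss + lam * derivative_penalty(coeffs, inputs)
```

Large pre-activations dominate the penalty here because the derivative of a cubic grows quadratically, which is the kind of extreme sensitivity a derivative-aware regularizer would be expected to damp.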

2605.15461 2026-05-18 cs.LG cs.AI

DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery

DrugSAGE: 自演化代理经验用于高效前沿药物发现

Yikun Zhang, Xiwei Cheng, Tianyu Liu, Yuanqi Du, Wengong Jin

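AI summary: DrugSAGE uses a self-evolving agent experience framework to build state-of-the-art drug discovery models efficiently; cross-task memory improves model performance and yields a clear advantage with zero test-time search.

Details
AI Chinese abstract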

构建前沿药物发现预测模型需要昂贵的工具、架构和训练策略搜索。当前基于LLM的代理通过大量试错找到前沿解决方案,但不保留积累的经验,因此每次新任务都要支付完整搜索成本。我们提出DrugSAGE(自演化代理经验)框架,通过跨任务积累和重用经验高效构建前沿药物发现模型。DrugSAGE维护一个跨任务记忆,其中包含经过验证的技能、有效策略的统计证据以及重复出现的错误及其修复记录。在某些情况下,DrugSAGE可直接迁移有效解决方案而无需测试时搜索。在33个分子性质预测任务中,DrugSAGE在单任务设置下于九个SOTA代理中排名第一。利用从16个较小任务积累的记忆,DrugSAGE在跨任务评估设置中于17个保留任务上取得0.935的平均归一化分数,并在零测试时搜索模式下优于所有基线代理10-30%。总之,我们的工作展示了跨任务记忆在药物发现前沿模型高效开发中的优势。

English abstract

Building state-of-the-art (SOTA) predictive models for drug discovery requires expensive search over tools, architectures, and training strategies. Current LLM-based agents can find SOTA solutions through extensive trial and error, but they do not retain the experience accumulated along the way and therefore pay the full search cost on every new task. We propose DrugSAGE (Self-evolving Agent Experience), a framework that accumulates and reuses experience across tasks to build SOTA drug discovery models efficiently. DrugSAGE maintains a cross-task memory of verified skills, statistical evidence about effective strategies, and a record of recurring errors and their fixes. In some cases, DrugSAGE transfers a working solution directly without test-time search. In 33 molecular property prediction tasks, DrugSAGE ranks first among nine SOTA agents in a single-task setting. With memory accumulated from 16 smaller tasks, DrugSAGE achieves an average normalized score of 0.935 on 17 held-out tasks in a cross-task evaluation setting and outperforms all baseline agents by 10-30% in a zero-test-time-search regime. In summary, our work shows the advantage of cross-task memory for efficient SOTA model development in drug discovery.

2605.15459 2026-05-18 cs.LG stat.ML

Don't Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics

别停止我:通过耗散黎曼流形力学采样损失极小值

Albert Kjøller Jacobsen, Leo Uhre Jakobsen, Johanna Marie Gegenfurtner, Georgios Arvanitidis

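AI summary: This paper proposes DiMS, which uses dissipative Riemannian mechanics to sample loss minima exactly, addressing the inability of existing methods to sample reparameterization-invariant solutions accurately, and validates it on Bayesian inference.

Details
AI Chinese abstract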

现代神经网络损失函数的极小值通常不是孤立的,而是形成在训练数据上重参数化不变解的连通组件。分析这些解是一个难题,但采样方法是可行的。现有方法要么在低损失区域扩散,无法精确采样重参数化不变解,要么本质上是局部的,限制了对其他极小值盆地的探索。本文提出基于动能的动力系统,受重力和摩擦项驱动,以精确采样极小水平集。DiMS方法依赖物理动机的超参数,允许控制采样器的探索能力。我们以不确定性量化作为动机问题,在贝叶斯推断中观察到比之前方法更好的性能。

English abstract

The minima of modern neural network loss functions are typically not isolated; rather, they form connected components of reparameterization-invariant solutions on the training data. Analytically characterizing these solutions is a hard problem, but sampling approaches are feasible. By construction, existing methods either spread over low-loss regions, and thus do not sample reparameterization-invariant solutions exactly, or are inherently local, which limits exploration of other minima valleys. We propose sampling such reparameterization-invariant models using a dynamical system based on kinetic energy, subject to a gravitational pull and a friction term that dissipates energy from the system. Our proposed sampler, DiMS, is guaranteed to sample exactly from the minimum level sets and depends on physically motivated hyperparameters, which allow control over the exploration capabilities of the sampler. We consider uncertainty quantification in Bayesian inference as the motivating problem and observe improved performance compared to previously proposed approaches.

2605.15458 2026-05-18 cs.CV

Video Models Can Reason with Verifiable Rewards

视频模型可以借助可验证的奖励进行推理

Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen

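AI summary: This paper proposes VideoRLVR, which optimizes video diffusion models with rule-based feedback to improve verifiable reasoning; it outperforms supervised fine-tuning baselines on Maze, FlowFree, and Sokoban, suggesting that verifiable RL can move video models beyond perceptual imitation.

Comments Website: https://darthzhu.github.io/VideoRLVR-page/

Details
AI Chinese abstract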

视频扩散模型在感知真实感和时间一致性方面取得了快速进展,但主要优化于合理生成而非可验证推理。在需要生成视频满足显式空间、时间或逻辑约束的任务中,这一限制尤为突出。受强化学习可验证奖励(RLVR)在推理导向语言模型中的作用启发,我们引入VideoRLVR,一种通过基于规则的反馈优化视频扩散模型的实用方法。VideoRLVR将视频推理视为可验证视觉轨迹的生成,包含SDE-GRPO优化核心、密集分解奖励和Early-Step Focus策略以提高训练效率。Early-Step Focus策略限制策略优化到早期去噪阶段,使训练延迟降低约40%的同时保持性能。我们在Maze、FlowFree和Sokoban三个程序生成领域进行评估,这些领域有客观成功标准。在这些任务中,VideoRLVR在监督微调基线上持续改进,密集分解奖励在低成功率设置中尤为重要。我们的RL优化模型在这些可验证推理基准和跨领域基准中优于评估的专有和开源视频生成模型。这些结果表明,可验证RL能推动视频模型超越感知模仿,向更可靠的规则一致视觉推理迈进。

English abstract

Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
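
To make "rule-based feedback" and "dense decomposed rewards" concrete, here is a minimal sketch of a verifier for a grid maze. It assumes the agent's path has already been extracted from the generated frames, a step not shown here. The maze layout, the three reward terms, and their equal weighting are illustrative assumptions, not the paper's reward definition.

```python
# Hypothetical rule-based verifier for a grid maze. The reward terms and
# their equal weighting are illustrative assumptions.

MAZE = [
    "S.#",
    "..#",
    "#.G",
]

def find(symbol):
    """Return the (row, col) of the first cell containing `symbol`."""
    for r, row in enumerate(MAZE):
        for c, cell in enumerate(row):
            if cell == symbol:
                return (r, c)
    raise ValueError(symbol)

def decomposed_reward(path):
    """Score a path (list of (row, col) cells) with three checkable terms."""
    start, goal = find("S"), find("G")
    in_bounds = all(0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) for r, c in path)
    # Term 1: the path never leaves the grid or enters a wall.
    legal = in_bounds and all(MAZE[r][c] != "#" for r, c in path)
    # Term 2: consecutive cells are 4-neighbours (no jumps).
    continuous = all(abs(r1 - r2) + abs(c1 - c2) == 1
                     for (r1, c1), (r2, c2) in zip(path, path[1:]))
    # Term 3: the path starts at S, ends at G, and satisfies terms 1 and 2.
    endpoints = bool(path) and path[0] == start and path[-1] == goal
    terms = {
        "legal": float(legal),
        "continuous": float(continuous),
        "solved": float(endpoints and legal and continuous),
    }
    return terms, sum(terms.values()) / len(terms)

good_path = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
bad_path = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]   # walks through walls
```

A path that fails the final goal check can still earn partial credit for staying legal and continuous, which is one plausible reason dense terms help when full successes are rare.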

2605.15450 2026-05-18 cs.CV cs.AI cs.LG

RIDE: Retinex-Informed Decoupling for Exposing Concealed Objects

RIDE: 基于Retinex的解耦方法用于揭示隐藏物体

Chunming He, Rihan Zhang, Dingming Zhang, Chengyu Fang, Longxiang Tang, Jingjia Feng, Fengyang Xiao, Sina Farsiu

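AI summary: RIDE uses Retinex theory to perform a homogeneous (same-domain) image decomposition for concealed object segmentation, and relies on the Discriminability Gap Theorem to improve foreground-background discriminability.

Details
AI Chinese abstract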

隐藏物体分割(COS)涵盖一系列密集预测任务,包括伪装物体检测、息肉分割、透明物体检测和工业缺陷检测,其中目标通过不同物理机制与周围环境视觉融合。现有方法要么直接操作RGB图像,要么采用异构分解(如傅里叶、小波)将空间证据分散到尺度/频率系数,使像素对齐的线索不够直接。我们引入一种根本不同的视角:通过Retinex理论进行同域图像分解,在同一空间域内将图像分解为光照和反射成分。我们的核心发现是视觉融合迫使复合空间中的外观匹配,但并不需要同时在两个成分空间中匹配,这一现象我们正式称为判别性差距定理。关键的是,我们证明在多样化的COS子任务中,底层物理过程系统性地反相关光照和反射差异,从而理论保证Retinex分解在完整物理范围内保持或严格提升总前景-背景判别性,反相关最大化增益。基于此,我们提出RIDE,包括:(i)任务驱动的Retinex分解模块,学习端到端的分割最优分解;(ii)判别性差距注意力机制,适应性利用分解帮助的区域;(iii)伪装打破对比损失,操作在反射特征空间中。

English abstract

Concealed Object Segmentation (COS) encompasses a family of dense-prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, where targets are visually entangled with their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or employ heterogeneous decompositions (e.g., Fourier, wavelet) that redistribute spatial evidence across scale/frequency coefficients, making pixel-aligned cues less direct. We introduce a fundamentally different perspective: homogeneous image decomposition via Retinex theory, which factorizes an image into illumination and reflectance components within the same spatial domain. Our key insight is that visual entanglement enforces appearance matching in the composite space, but this does not necessitate simultaneous matching in both component spaces, a phenomenon we formalize as the Discriminability Gap Theorem. Crucially, we show that across diverse COS sub-tasks, the underlying physical processes systematically anti-correlate illumination and reflectance differences, yielding theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground--background discriminability across the full physical regime, with anti-correlation maximizing the gain. Building on this, we propose RIDE, comprising: (i) a Task-Driven Retinex Decomposition module that learns segmentation-optimal factorizations end-to-end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits where decomposition helps; and (iii) a Camouflage-Breaking Contrastive loss operating in reflectance feature space.
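
For readers unfamiliar with Retinex, the standard image model behind the decomposition can be written as follows. The symbols $I$, $L$, $R$ and the pixel index $x$ are notation chosen here for illustration; the abstract does not state the paper's own notation or its Discriminability Gap Theorem.

```latex
% Standard Retinex model: observed image = illumination times reflectance,
% pixel by pixel; taking logs turns the product into a sum.
\[
  I(x) = L(x)\, R(x)
  \quad\Longleftrightarrow\quad
  \log I(x) = \log L(x) + \log R(x)
\]
```

Both factors live on the same pixel grid as the input, which is the sense in which the decomposition is "homogeneous", unlike Fourier or wavelet coefficients.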

2605.15445 2026-05-18 cs.AI

From LLM-Generated Conjectures to Lean Formalizations: Automated Polynomial Inequality Proving via Sum-of-Squares Certificates

从LLM生成的猜想到Lean形式化:通过平方和证书实现自动多项式不等式证明

Ruobing Zuo, Hanrui Zhao, Gaolei He, Zhengfeng Yang, Jianlin Wang

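AI summary: This paper proposes NSPI, a framework that combines LLMs with symbolic computation to prove polynomial inequalities via sum-of-squares certificates, and demonstrates its effectiveness and scalability on polynomials with up to 10 variables.

Comments Accepted to ICML 2026. Preprint version

Details
AI Chinese abstract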

自动证明多项式不等式是自动化数学推理中的基本挑战,其中丰富的代数结构和快速增长的证书搜索空间阻碍了可扩展性。纯粹的符号方法提供强保证,但随着变量数或次数的增加,其扩展性较差,因为代数操作昂贵且中间表达式迅速增长。同时,LLM引导的方法在竞赛风格的不等式上取得了显著进展,特别是在变量数较少的情况下。为了解决剩余的可扩展性挑战,我们提出NSPI,一种结合LLM和符号计算优势的神经符号框架。具体而言,LLM提出一个近似多项式平方和(SOS)分解形式的猜想;我们通过符号计算对其进行细化,得到精确的多项式SOS表示,这直接证明目标不等式,并进一步在Lean中验证证明,从而实现从启发式发现到机器检查证明的端到端流程。在涉及最多10个变量的多项式挑战基准上的实验展示了所提方法的有效性和可扩展性。

English abstract

Automated proving of polynomial inequalities is a fundamental challenge in automated mathematical reasoning, where rich algebraic structure and a rapidly growing certificate search space hinder scalability. Purely symbolic approaches provide strong guarantees but often scale poorly as the number of variables or the degree increases, due to expensive algebraic manipulations and rapidly growing intermediate expressions. In parallel, LLM-guided methods have made notable progress, particularly on competition-style inequalities with a small number of variables. To address the remaining scalability challenges, we propose NSPI, a neuro-symbolic framework that combines the complementary strengths of LLMs and symbolic computation for polynomial-inequality proving. Concretely, an LLM proposes a conjecture in the form of an approximate polynomial Sum-Of-Squares (SOS) decomposition; we refine it via symbolic computation to obtain an exact polynomial SOS representation, which directly proves the target inequality, and we further certify the proof in Lean, yielding an end-to-end pipeline from heuristic discovery to machine-checked proof. Experiments on challenging benchmarks involving polynomials with up to 10 variables demonstrate the effectiveness and scalability of the proposed method.
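
The "approximate conjecture, then exact certificate" step can be illustrated on a toy case. The sketch below rounds the noisy coefficients of a proposed sum-of-squares decomposition to nearby rationals and checks the identity exactly with rational arithmetic. Plain coefficient rounding stands in for the paper's symbolic refinement, which the abstract does not detail, and the helper names and the example polynomial are hypothetical.

```python
from fractions import Fraction

# Polynomials are dicts mapping exponent tuples to coefficients, e.g.
# {(2, 0): 2, (1, 1): -2, (0, 2): 1} is 2x^2 - 2xy + y^2.

def poly_mul(p, q):
    """Multiply two polynomials term by term."""
    out = {}
    for ea, ca in p.items():
        for eb, cb in q.items():
            e = tuple(a + b for a, b in zip(ea, eb))
            out[e] = out.get(e, Fraction(0)) + ca * cb
    return {e: c for e, c in out.items() if c != 0}

def poly_add(p, q):
    """Add two polynomials, dropping zero terms."""
    out = dict(p)
    for e, c in q.items():
        out[e] = out.get(e, Fraction(0)) + c
    return {e: c for e, c in out.items() if c != 0}

def rationalize(p, max_den=10):
    """Round each coefficient to the nearest fraction with a small denominator."""
    return {e: Fraction(c).limit_denominator(max_den) for e, c in p.items()}

def is_exact_sos(target, squares):
    """Check that target equals the sum of the squares of the given polynomials."""
    total = {}
    for q in squares:
        total = poly_add(total, poly_mul(q, q))
    return total == target

# Target: 2x^2 - 2xy + y^2, which equals (x - y)^2 + x^2.
target = {(2, 0): Fraction(2), (1, 1): Fraction(-2), (0, 2): Fraction(1)}
# A noisy candidate, standing in for an LLM-proposed approximate decomposition.
approx = [{(1, 0): 0.9999, (0, 1): -1.0002}, {(1, 0): 1.0001}]
exact = [rationalize(q) for q in approx]
```

Once the exact identity holds, nonnegativity of the target follows immediately, and that identity is what a proof assistant such as Lean can then check.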

2605.15440 2026-05-18 cs.CL

Why are language models less surprised than humans? Testing the Parse Multiplicity Mismatch Hypothesis

为何语言模型比人类更不惊讶?测试解析多重性不匹配假说

William Timkey, Brian Dillon, Tal Linzen

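AI summary: This study examines the gap between language-model surprisal and human sentence processing, varying the number of simultaneous parses to test how the Parse Multiplicity Mismatch Hypothesis bears on syntactic ambiguity.

Details
AI Chinese abstract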

surprisal理论认为,词语的处理难度由其在上下文中的可预测性决定,为人类句子处理与语言模型的next-word预测提供了潜在联系。尽管语言模型(LM)的surprisal能够成功预测自然文本中的阅读时间,但它们系统性地低估了受控句法歧义研究中观察到的难度幅度,尤其是在花园路径句中。这种不匹配可能源于人类和LM之间计算约束的差异。在此,我们测试了其中一个假设,即与人类相比,LM可能能够同时考虑更多不同的句子解释。使用具有词同步束搜索的循环神经网络语法(RNNGs),我们系统地变化用于计算词surprisal的同时解析数量,然后使用这些surprisal来预测人类的阅读时间。减少同时活跃解析的数量确实增加了预测的花园路径效应幅度,但远不足以捕捉人类中效应的全部幅度。这表明,LM和人类可用的同时解析数量的差异无法调和基于LM的surprisal与人类句子处理之间的差距。

English abstract

Surprisal theory posits that the processing difficulty of a word is determined by its predictability in context, offering a potential link between human sentence processing and next-word predictions from language models. While language model (LM) surprisals successfully predict reading times in naturalistic text, they systematically underpredict the magnitude of difficulty observed in controlled studies of syntactic ambiguity, particularly in garden path sentences. This mismatch might arise from differences in the computational constraints between humans and LMs. Here we test one such hypothesis, specifically, that LMs may be able to simultaneously consider a greater number of distinct sentence interpretations at once, compared to humans. Using Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, we systematically vary the number of simultaneous parses used to compute word surprisal, and then use these surprisals to predict human reading times. Reducing the number of simultaneous active parses indeed increases the magnitude of predicted garden path effects, but not nearly enough to capture the full magnitude of the effects in humans. This suggests that differences in the number of simultaneous parses available to LMs and humans cannot reconcile LM-based surprisal with human sentence processing.
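
The effect of limiting the number of simultaneous parses can be sketched with a toy calculation. Surprisal is computed as the negative log of the ratio between the probability mass of the retained parses after and before the word; the two parses and all probabilities below are invented for illustration, and the function is a simplification rather than the word-synchronous beam search used with RNNGs.

```python
import math

def surprisal_bits(parses, word_probs, beam_size):
    """Surprisal of the next word when only the top `beam_size` parses survive.

    parses:     {parse name: probability of the parse and the words so far}
    word_probs: {parse name: P(next word | parse)}
    """
    beam = sorted(parses, key=parses.get, reverse=True)[:beam_size]
    before = sum(parses[name] for name in beam)
    after = sum(parses[name] * word_probs[name] for name in beam)
    return -math.log2(after / before)

# Toy garden path: "The horse raced past the barn" followed by "fell".
# All numbers are invented for illustration.
parses = {"main_verb": 0.9, "reduced_relative": 0.1}
p_fell = {"main_verb": 0.0001, "reduced_relative": 0.2}

wide = surprisal_bits(parses, p_fell, beam_size=2)     # both parses kept
narrow = surprisal_bits(parses, p_fell, beam_size=1)   # only the preferred parse
```

With the invented numbers, dropping the dispreferred parse raises the surprisal at the disambiguating word, which is the direction of the effect the abstract reports.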

2605.15436 2026-05-18 cs.CL cs.LG

Neural Activation Patterns Across Language Model Architectures: A Comprehensive Analysis of Cognitive Task Performance

神经激活模式在语言模型架构中的跨分析:对认知任务性能的全面研究

Mahdi Naser-Moghadasi, Faezeh Ghaderi

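AI summary: This paper analyzes the neural activation patterns of six large language model architectures on twelve cognitive task categories, revealing differences in how encoder and decoder architectures handle different tasks; mathematical reasoning produces the highest attention entropy, and decoder models show higher sparsity.

Comments 8 pages, accepted at IEEE BigData 2025

Details
AI Chinese abstract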

本文对六种不同大型语言模型(LLM)架构的神经激活模式进行了全面分析,研究了它们在十二种认知任务类别上的性能。通过系统测量最终激活值、注意力熵和稀疏性模式,揭示了编码器和解码器架构在处理多样化认知任务时的根本差异。对144个任务-模型组合的分析表明,数学推理在所有架构中均产生最高注意力熵,而解码器模型在稀疏性模式上显著高于编码器模型。研究结果为现代语言模型的计算特性及其任务特定神经行为提供了关键见解,对大数据应用中的模型选择和优化具有启示作用。

English abstract

This paper presents a comprehensive analysis of neural activation patterns across six distinct large language model (LLM) architectures, examining their performance on twelve cognitive task categories. Through systematic measurement of final activation values, attention entropy, and sparsity patterns, we reveal fundamental differences in how encoder and decoder architectures process diverse cognitive tasks. Our analysis of 144 task-model combinations demonstrates that mathematical reasoning consistently produces the highest attention entropy across all architectures, while decoder models exhibit significantly higher sparsity patterns compared to encoder models. The findings provide critical insights into the computational characteristics of modern language models and their task-specific neural behaviors, with implications for model selection and optimization in big data applications.
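
The abstract does not define its metrics, so the following is only one common way to compute them: Shannon entropy of a single attention distribution, and sparsity as the fraction of activations with near-zero magnitude. The threshold value and function names are assumptions.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def sparsity(activations, threshold=1e-3):
    """Fraction of activations whose magnitude is below `threshold`."""
    return sum(abs(a) < threshold for a in activations) / len(activations)

uniform = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly: maximal entropy
peaked = [1.0, 0.0, 0.0, 0.0]        # attention on one token: zero entropy
```

Under this definition, high attention entropy means a head spreads its weight over many tokens rather than focusing on a few.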

2605.15430 2026-05-18 cs.RO cs.CV

Where to Perch in a Tree: Vision-Guidance for Tree-Grasping Drones

在树上何处栖息:用于树抓取无人机的视觉引导

Alex Dunnett, Leonie Bottomley, Mirko Kovac, Basaran Bahadir Kocer

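AI summary: This paper proposes a vision-guided method for finding an ideal perch location on a tree; image processing algorithms assess the tree's shape and structure, and branches suitable for perching are selected based on width, slope, and curvature.

Comments Work in progress version accepted to the Recent Advances in Robotic Perception for Forestry

Details
AI Chinese abstract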

本研究展示了一种为视觉引导的自主树栖无人机确定树上理想栖息点的方法。各种图像处理算法,包括用于机器学习、图像分割和二值图像形态学的算法,被用来评估树的形状和结构。与仅寻找最近可用的枝条不同,本研究通过评估每条枝条的潜力,根据枝条宽度、坡度(与水平面的角度)和曲率等因素来确定其适合栖息的程度。对于给定的树栖无人机和超过10,000张从2月到10月在亚热带和温带季风气候下拍摄的城市树木图像数据集,所提出的方法成功地为76%的可行目标生成了结果。可行目标定义为枝条直径足够粗、且可用栖息空间至少等于腱驱动抓取爪宽度的树。这些初步成功的结果为开发一系列改进和额外功能奠定了基础,以创建通用方法;这将涉及整合深度感知和姿态传感器的补充数据,以增强枝条评估。

English abstract

This study demonstrates a method to locate an ideal perch location on a tree for vision-guided autonomous tree-perching drones. Various image processing algorithms, including those used for machine learning, image segmentation and binary image morphology, are implemented to assess the shape and structure of a tree. Rather than identifying the closest available branch, this study builds on vision methods by evaluating the potential of each branch, determining its suitability for perching based on factors such as branch width, slope (angle to the horizontal) and curvature. For a given tree-perching drone and a dataset of more than 10,000 urban tree images taken from February to October in a subtropical and temperate monsoon climate, the proposed method successfully produces a result for 76% of feasible targets. A feasible target is defined as a tree where the branch diameters are sufficiently thick and where the available perching space is at least equal to the width of a tendon-driven grasping claw. These successful preliminary results create a foundation from which a number of identified improvements and additional features can be developed to create a generalised method; this will involve the incorporation of supplementary data from depth perception and attitude sensors to enhance the branch assessment.

2605.15424 2026-05-18 cs.CV

Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models

Social-Mamba:基于状态空间模型的社会感知轨迹预测

Po-Chien Luan, Wuyang Li, Yang Gao, Alexandre Alahi

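AI summary: This paper proposes Social-Mamba, which treats social interactions as structured sequential processes and combines the Cycle Mamba block with social triplet factorization for efficient, accurate trajectory forecasting; experiments show strong results on multiple benchmarks.

Details
AI Chinese abstract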

人类轨迹预测对于拥挤环境中安全导航至关重要,需要在准确性和计算效率之间取得平衡。高效建模社会互动是密集人群中的关键。然而,大多数最新方法依赖于注意力机制,虽然能捕捉复杂依赖关系,但会带来二次计算成本,随着邻居数量的增加而表现不佳。最近,选择性状态空间模型提供了线性时间的替代方案;然而,其本质上是顺序的,与社会互动的无结构和动态性质不匹配。为此,我们提出了Social-Mamba,一种预测架构,将社会互动重新表述为结构化序列过程。其核心是循环Mamba模块,一个新型模块,能够实现连续的双向信息流。Social-Mamba在以自我为中心的网格上组织代理,并引入社交三元组分解,将互动分解为时间、以自我为中心和目标为中心的扫描。这些通过可学习的社会门和全局扫描动态整合,以生成准确且高效的轨迹预测。在五个轨迹预测基准上的广泛实验表明,Social-Mamba在准确率方面达到最先进的水平,同时提供优越的参数效率和计算可扩展性。此外,将Social-Mamba嵌入到流匹配框架中进一步增强了准确性和效率,使其成为未来轨迹预测研究的灵活且稳健的基础。代码已公开:https://github.com/vita-epfl/Social-Mamba

English abstract

Human trajectory forecasting is crucial for safe navigation in crowded environments, requiring models that balance accuracy with computational efficiency. Efficiently modeling social interactions is key to performance in dense crowds. Yet, most recent methods rely on attention mechanisms, which are effective at capturing complex dependencies, but incur quadratic computational costs that scale poorly with the growing number of neighbors. Recently, Selective State-Space Models have provided a linear-time alternative; however, their inherently sequential design is misaligned with the unstructured and dynamic nature of social interactions. To address this challenge, we propose Social-Mamba, a forecasting architecture that reformulates social interactions as structured sequential processes. At its core is the Cycle Mamba block, a novel module that enables continuous bidirectional information flow. Social-Mamba organizes agents on an egocentric grid and introduces social triplet factorization, which decomposes interactions into temporal, egocentric, and goal-centric scans. These are dynamically integrated through a learnable social gate and global scan to generate accurate and efficient trajectory predictions. Extensive experiments on five trajectory forecasting benchmarks show that Social-Mamba achieves state-of-the-art accuracy while offering superior parameter efficiency and computational scalability. Furthermore, embedding Social-Mamba into a flow-matching framework further enhances both accuracy and efficiency, establishing it as a flexible and robust foundation for future trajectory forecasting research. The code is publicly available: https://github.com/vita-epfl/Social-Mamba

2605.15423 2026-05-18 cs.CV cs.AI eess.IV

MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes

MR2-ByteTrack:基于CNN和Transformer的视频目标检测用于AI增强的嵌入式视觉传感器节点

Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti

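AI summary: This paper proposes MR2-ByteTrack, a video object detection method for embedded vision nodes that alternates full- and low-resolution inference and combines ByteTrack with the Rescore algorithm to improve efficiency, enabling accurate real-time detection on embedded devices.

Details
AI Chinese abstract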

现代智能视觉传感器需要设备端智能来处理视频流,因为云计算在带宽、延迟和隐私限制下往往不可行。然而,这些传感系统通常依赖超低功耗微控制器(MCUs),其内存和计算能力有限,使得需要特征存储或多帧缓冲的传统视频目标检测方法不可行。为了解决这一挑战,我们引入了多分辨率重评分ByteTrack(MR2-ByteTrack),一种专为基于MCU的嵌入式视觉节点设计的视频目标检测(VOD)方法。MR2-ByteTrack通过交替使用全分辨率和低分辨率推理来降低计算成本,同时通过ByteTrack在帧间链接检测,并通过Rescore算法通过概率联合规则聚合跨帧的检测置信度分数以纠正误分类。我们将其应用于基于CNN的检测器和基于Transformer的模型,证明了其在具有根本不同空间处理的架构中的通用性。在ImageNetVID上的实验表明,MR2-ByteTrack保持了准确性,实现了CNN模型的mAP最高达49.0,Transformer模型的mAP为48.7,同时将CNN的乘加操作减少了高达53%,Transformer的减少了32%。当部署在GAP9上,一个超低功耗RISC-V多核MCU上时,我们的方法相比仅处理全分辨率图像,实现了高达55%的能耗节省,实现了在MCU类嵌入式视觉节点上的首个实时Transformer-based VOD。代码可在https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access获取。

English abstract

Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, infeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access
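
The "probability union rules" mentioned for the Rescore algorithm can be sketched as follows, treating per-frame confidences for the same label as independent events so that P(A or B) = P(A) + P(B) - P(A)P(B). The per-label grouping and the choice of the highest aggregated score as the track label are assumptions made for illustration; the abstract does not specify these details.

```python
def union_rescore(confidences):
    """Combine confidences with the union rule for independent events."""
    combined = 0.0
    for p in confidences:
        combined = combined + p - combined * p
    return combined

def rescore_track(detections):
    """Pick a label for one tracked object from its per-frame detections.

    detections: list of (label, confidence) pairs linked by the tracker.
    Returns the label with the highest aggregated score and that score.
    """
    per_label = {}
    for label, conf in detections:
        per_label.setdefault(label, []).append(conf)
    scores = {label: union_rescore(c) for label, c in per_label.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# One track: two moderate "cat" detections and one stronger "dog" detection.
track = [("cat", 0.5), ("dog", 0.6), ("cat", 0.5)]
```

In this example the two moderate "cat" detections aggregate to 0.75 and outweigh the single "dog" detection at 0.6, showing how evidence across frames can overturn a one-frame misclassification.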

2605.15421 2026-05-18 cs.CV

U-SEG: Uncertainty in SEGmentation -- A systematic multi-variable exploration

U-SEG:不确定性在分割中的探索——系统多变量研究

Michael Smith, Frank P. Ferrie

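AI summary: This paper systematically examines key questions at the intersection of uncertainty estimation and segmentation, analyzing how different variables affect segmentation performance, and finds that task difficulty and sample diversity play an important role.

Comments Accepted to CVPR Findings Track 2026

Details
AI Chinese abstract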

本文深入探讨了不确定性估计与分割交叉领域中的一些未被充分研究的课题。先前研究表明,不确定性估计的质量对多种变量非常敏感。作为不确定性估计的主要应用之一,帮助识别和解决实际场景中的预测错误,任何影响这一应用的因素都必须明确识别。例如,更具挑战性的领域或不同的数据集和架构是否会导致使用不确定性估计时性能下降?视频序列中的先前帧是否能提供与其它方法相当的不确定性估计?能否利用样本多样性结合不确定性估计方法以获得更好的估计?最后,何时使用基于集成的不确定性估计比确定性网络更合理?我们通过创建框架并执行大规模研究,跨多个变量(如数据集、主干网络和下游任务)对语义和全景分割进行研究。我们发现,a) 具有挑战性的全景分割任务通常导致性能下降,而数据集和主干网络之间的高性能方差表明泛化并不保证;b) 时间序列样本对特定配置有用,但在许多情况下不值得付出代价;c) 样本多样性在校准下游任务中最具潜力,但其他情况下无法超越更简单的替代方案;d) 确定性方法在某些下游任务中足够,但若在部署中能实现正确条件,集成方法可带来显著改进。

English abstract

In this study, we explore in depth a few under-studied topics at the intersection of uncertainty estimation and segmentation. Prior work has shown that the quality of uncertainty estimates can be very sensitive to a range of variables. As one of the main uses of uncertainty estimation is to help identify and deal with prediction errors in practical scenarios, any factors that affect this must be clearly identified. For example, do more challenging domains or different datasets and architectures result in worse performance when using uncertainty estimates? Can prior frames in a video sequence in fact provide useful uncertainty estimates comparable to other approaches? Is it possible to combine uncertainty estimation approaches, taking advantage of sample diversity, to get better estimates? Finally, when might it make sense to use an ensemble-based uncertainty estimate over a deterministic network? We address these questions by creating a framework for and executing a large scale study across many variables such as datasets, backbones, and downstream tasks, for both semantic and panoptic segmentation. We find that a) the more challenging task of panoptic segmentation usually results in worse performance while high performance variance between datasets and backbones indicates that generalization is not guaranteed, b) time series samples can be useful for specific configurations, but in many cases are not worth the cost, c) sample diversity shows the most promise in the downstream task of calibration, but otherwise fails to beat simpler alternatives, d) a deterministic approach is adequate for some downstream tasks, but ensembles allow for significant improvements if the right conditions can be achieved in deployment.

2605.15417 2026-05-18 cs.LG cs.AI

$f$-Trajectory Balance: A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data

$f$-轨迹平衡:一种用于调整GFlowNets、生成模型和LLMs的损失家族,结合on-policy和off-policy数据

Jake Fawkes, Jason Hartford

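AI summary: This paper proposes a family of losses based on $f$-divergences for tuning generative models with on-policy and off-policy data, improving coverage and generalization.

Comments Published at ICML 2026

Details
AI Chinese abstract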

在GFlowNets和变分推断中,目标与模型对数概率之间的均方误差被证明是训练生成模型的有效低方差替代损失。该损失具有在on-policy情况下其梯度对应KL散度的梯度,而在off-policy情况下仍保持有效损失且具有相同全局最小值的性质。本文证明该构造可扩展到整个$f$-散度家族,从而得到一系列损失函数,其on-policy梯度对应相应的$f$-散度,但保留相同的全局最小值。具体而言,我们展示了on-policy梯度导致目标与模型对数概率上的平移不变损失函数与$f$-散度之间的一一对应关系。这种等价性使我们能够设计新的替代损失函数,用于调整广泛类别的生成模型,继承相应$f$-散度的性质,如更广泛的模式覆盖,同时适用于off-policy数据。我们将其应用于各种任务,包括经典合成示例、SynFlowNets分子发现和异步大语言模型(LLM)调整,证明我们的模型在广泛类别的生成模型中保留其预测属性,无论是on-policy还是off-policy数据。

English abstract

In GFlowNets and variational inference, it has been shown that the mean square error between target and model log probabilities is an effective, low-variance surrogate loss for training generative models. This loss has the property that when evaluated on-policy its gradients correspond to those of the KL divergence, while off-policy it remains a valid loss with the same global minimizer. In this work, we demonstrate that this construction can be extended to the whole family of $f$-divergences, leading to a family of losses whose on-policy gradients are those of the corresponding $f$-divergence, but retain the same global minimizer off-policy. Specifically, we show that the on-policy gradients lead to a one-to-one correspondence between translation invariant loss functions on the target and model log probabilities, and $f$-divergences. This equivalence allows us to design new surrogate loss functions for tuning a wide class of generative models that inherit the properties of the corresponding $f$-divergence, such as being more mode covering, whilst being applicable to off-policy data. We apply our losses on a range of tasks, including classic synthetic examples, SynFlowNets for molecule discovery, and asynchronous large language model (LLM) tuning, demonstrating that our models retain their predicted properties on- and off-policy in a wide class of generative models.
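
As a minimal illustration of the squared-error baseline the abstract starts from, the sketch below computes a batch-centered squared error between unnormalized target log probabilities and model log probabilities. Centering by the batch mean is one common way to handle the unknown normalizing constant and is an assumption here; the $f$-divergence generalization itself is not reproduced. The function name and example values are hypothetical.

```python
import math

def centered_sq_error(log_target, log_model):
    """Mean squared deviation of (log target - log model) from its batch mean.

    Adding a constant to every log-target value leaves the result unchanged,
    so the target only needs to be known up to normalization.
    """
    diffs = [t - m for t, m in zip(log_target, log_model)]
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs) / len(diffs)

rewards = [1.0, 2.0, 5.0]                             # unnormalized target
log_target = [math.log(r) for r in rewards]
log_matched = [math.log(r / 8.0) for r in rewards]    # model proportional to target
log_uniform = [math.log(1.0 / 3.0)] * 3               # model that ignores the target

loss_matched = centered_sq_error(log_target, log_matched)
loss_uniform = centered_sq_error(log_target, log_uniform)
loss_shifted = centered_sq_error([t + 3.0 for t in log_target], log_uniform)
```

The loss vanishes when the model is proportional to the target and does not change when the target log probabilities are shifted by a constant, which is the translation invariance the abstract refers to.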

2605.15413 2026-05-18 cs.LG

Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models

Transformer可扩展性危机:现代语言模型中性能瓶颈的首次全面实证分析

Mahdi Naser Moghadasi, Faezeh Ghaderi

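AI summary: By evaluating 118 transformer models, this paper reveals performance walls as sequence length scales, finding that models' processing ability drops markedly as sequence length increases, which challenges conventional scaling assumptions.

Comments 8 pages, accepted at IEEE BigData 2025

Details
AI Chinese abstract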

尽管transformer架构在自然语言处理中取得了显著成功,但其可扩展性限制仍缺乏系统性的实证分析。本文首次对118种transformer模型进行了大规模评估,涵盖七种不同的架构类别,揭示了根本性的性能瓶颈,表现为硬性部署限制。我们的系统性基准测试方法揭示了关键的可扩展性危机:虽然88.1%的模型能够处理最多512个token的序列,但在1024个token时降至44.9%,在2048个token时完全失败。通过严格分析加载时间、内存消耗和计算效率,我们证明压缩模型在参数效率(649.2 tokens/sec/M参数)上优于大型生成模型(12.5 tokens/sec/M)。我们的发现挑战了主流的扩展假设,并提供了首次定量证据,表明理论上的O(n²)注意力复杂度转化为可测量的性能瓶颈。本工作建立了新的transformer评估基准方法,并为生产环境中的实际部署决策提供了关键见解。

English abstract

Despite the remarkable success of transformer architectures in natural language processing, their scalability limitations remain poorly understood through systematic empirical analysis. This paper presents the first comprehensive large-scale evaluation of 118 transformer models across seven distinct architectural categories, revealing fundamental performance walls that manifest as hard deployment constraints. Our systematic benchmarking methodology uncovers a critical scalability crisis: while 88.1% of models successfully process sequences up to 512 tokens, this drops dramatically to 44.9% at 1024 tokens, with complete failure (0%) at 2048 tokens. Through rigorous analysis of loading times, memory consumption, and computational efficiency across sequence lengths from 128 to 2048 tokens, we demonstrate that compressed models achieve superior parameter efficiency (649.2 tokens/sec/M parameters) compared to large generative models (12.5 tokens/sec/M). Our findings challenge prevailing scaling assumptions and provide the first quantitative evidence that the theoretical O(n²) attention complexity translates into measurable performance walls. This work establishes new benchmarking methodologies for transformer evaluation and provides critical insights for practical deployment decisions in production environments.
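
The quadratic term can be made concrete with a back-of-the-envelope estimate of the memory needed for one layer's attention score matrices. The head count, 4-byte values, and batch size are illustrative assumptions, not figures from the paper, and real implementations may avoid materializing the full matrix.

```python
def attention_score_bytes(seq_len, num_heads=12, bytes_per_value=4, batch=1):
    """Bytes needed to store the seq_len x seq_len score matrices of one layer."""
    return batch * num_heads * seq_len * seq_len * bytes_per_value

# Sequence lengths covering the 128 to 2048 token range studied in the abstract.
for n in (128, 512, 1024, 2048):
    mib = attention_score_bytes(n) / 2**20
    print(f"{n:>5} tokens: {mib:8.2f} MiB per layer")
```

Going from 128 to 2048 tokens multiplies the sequence length by 16 and this memory term by 256.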

2605.15404 2026-05-18 cs.CL

Capability Conditioned Scaffolding for Professional Human LLM Collaboration

面向专业人类与LLM协作的能力条件支架

Sen Yang, Yinglei Ma

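AI summary: This paper proposes the Capability Conditioned Scaffolding framework, which partitions expertise into strong, mixed, and weak domains and conditions intervention behavior on structured capability profiles to make professional human-AI collaboration more reliable.

Details
AI Chinese abstract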

大型语言模型的个性化通常适应用户偏好和风格,但未考虑用户在不同专业领域中的评估能力差异。这种局限可能导致专业领域漂移,即用户在自己无法可靠评估的领域中依赖AI生成的推理内容。我们引入能力条件支架框架,该框架将专业知识划分为强、混合和弱领域,并根据结构化能力档案调节干预行为。在多个MMLU子集和四个LLM基底上的初步评估显示,基于档案的干预行为一致,包括在档案交换下的类别反转和混合领域风险区的选择性激活。这些发现表明,具备能力意识的支架能够支持超越风格个性化的更可靠的专业人类AI协作。

English abstract

Large language model personalization typically adapts outputs to user preferences and style but does not account for differences in user evaluation capacity across domains of expertise. This limitation can encourage Professional Domain Drift, where users rely on AI generated reasoning in domains they cannot reliably evaluate. We introduce Capability Conditioned Scaffolding, a typed framework that partitions expertise into strong, mixed, and weak domains and conditions intervention behavior on structured capability profiles. A pilot evaluation across multiple MMLU subsets and four LLM substrates shows consistent profile conditioned intervention behavior, including categorical inversion under profile swapping and selective activation in mixed domain risk zones. These findings suggest that capability aware scaffolding can support more reliable professional human AI collaboration beyond stylistic personalization.