arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15475 2026-05-18 cs.CV cs.MM

A Unified Non-Parametric and Interpretable Point Cloud Analysis via t-FCW Graph Representation

Haijian Lai, Bowen Liu, Man Xu, Chan-Tong Lam, João Macedo, Benjamin Ng, Sio-Kei Im

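Affiliations: Faculty of Applied Sciences, Macao Polytechnic University; University of Coimbra, CISUC/LASI, DEI; Macao Polytechnic University

AI summary: This paper proposes an empowered t-FCW graph representation for point cloud embedding, analyzes where its effectiveness comes from, and designs a network around it, enabling efficient and interpretable point cloud processing for classification and segmentation tasks.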

Comments: Accepted for publication in IEEE Transactions on Multimedia

Abstract

We introduce an empowered transposed Fully Connected Weighted (t-FCW) graph representation to embed point clouds into a metric space. While the original t-FCW has shown promising results for point cloud classification, the reasons behind its effectiveness and its broader applicability remained unclear. In this work, we analyze the properties that make the empowered and original t-FCW effective and design a network that uses the empowered t-FCW exclusively as feature extractors. From an interpretability perspective, we build memory banks for classification, part segmentation, and semantic segmentation using the empowered t-FCW. Our analysis reveals that the empowered t-FCW inherits robustness from surface descriptors and provides interpretability through dimension-wise relations. These properties enable a highly efficient and interpretable network, which processes the ModelNet40 classification problem in approximately 7 seconds on an NVIDIA RTX A5000 GPU. Importantly, the empowered t-FCW can function both as a lightweight standalone baseline and as a complementary plug-in to existing deep models.

2605.15467 2026-05-18 cs.CL cs.AI

Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction

A H M Rezaul Karim, Ozlem Uzuner

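Affiliations: George Mason University

AI summary: This paper proposes a modular retrieval-augmented generation framework that combines schema-constrained prompting, deterministic postprocessing, and a second-pass audit to extract observations from nurse-patient dialogues, reaching an F1 score of 80.36%.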

Abstract

Conversational nurse-patient transcripts contain actionable observations, but converting these transcripts into structured representations at scale remains challenging. Documentation burden is substantial, with prior studies showing clinicians spend large portions of their workday on documentation and related desk work rather than direct patient care. MEDIQA-SYNUR focuses on observation extraction from conversational nurse-patient transcripts, requiring systems to normalize these narratives into a predefined schema with value-type constraints. We propose a modular retrieval-augmented generation (RAG) pipeline that uses the training set as an exemplar corpus and combines schema-constrained prompting (full schema vs. pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit, with two LLM backbones: Llama-4-Scout-17B-16E-Instruct and GPT-5.2, with corresponding embedding models for RAG. Our best configuration uses GPT-5.2 with full schema, RAG, and second-pass auditing, achieving an F1 score of 80.36%. Overall, our results show that RAG consistently improves performance, while the optimal degree of schema constraint depends on the model, and second-pass auditing yields modest additional gains by correcting residual schema-adherence errors.
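
The deterministic schema-based postprocessing step can be pictured with a minimal sketch. The abstract does not give the MEDIQA-SYNUR schema or the authors' rules, so the schema entries, field names, and the `postprocess` function below are all hypothetical; the sketch only illustrates how value-type constraints might be enforced on model output.

```python
from typing import Any

# Hypothetical schema (not the real MEDIQA-SYNUR one): name -> value-type constraint.
SCHEMA = {
    "pain_level": {"type": "numeric"},
    "mobility": {"type": "single_select",
                 "options": ["independent", "assisted", "bedbound"]},
    "nausea": {"type": "boolean"},
}

def postprocess(raw: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only observations that satisfy the schema, coercing values where safe."""
    cleaned = []
    for obs in raw:
        spec = SCHEMA.get(obs.get("name"))
        if spec is None:
            continue  # observation name is not in the schema
        value = obs.get("value")
        if spec["type"] == "numeric":
            try:
                value = float(value)
            except (TypeError, ValueError):
                continue  # not a number: drop
        elif spec["type"] == "boolean":
            text = str(value).strip().lower()
            if text not in {"true", "false", "yes", "no"}:
                continue
            value = text in {"true", "yes"}
        elif spec["type"] == "single_select":
            value = str(value).strip().lower()
            if value not in spec["options"]:
                continue  # value outside the allowed options: drop
        cleaned.append({"name": obs["name"], "value": value})
    return cleaned

llm_output = [
    {"name": "pain_level", "value": "7"},
    {"name": "mood", "value": "ok"},          # unknown name
    {"name": "nausea", "value": "Yes"},
    {"name": "mobility", "value": "flying"},  # invalid option
]
result = postprocess(llm_output)
```

Because the step is rule-based, the same model output always yields the same cleaned result; a second-pass audit would then only need to handle whatever errors these rules cannot catch.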

2605.15465 2026-05-18 cs.LG eess.SP

Toward World Modeling of Physiological Signals with Chaos-Theoretic Balancing and Latent Dynamics

Yunfei Luo, Xi Chen, Yuliang Chen, Lanshuang Zhang, Md Mofijul Islam, Siwei Zhao, Peter Kotanko, Subhasis Dasgupta, Andrew Campbell, Rakesh Malhotra, Tauhidur Rahman

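Affiliations: University of California San Diego; Dartmouth College; Amazon Web Services; Sanderling Renal Services; Renal Research Institute; Icahn School of Medicine at Mount Sinai

AI summary: This paper proposes NormWear-2, a model that encodes multivariate physiological signals and clinical intervention variables into a shared latent space and combines inference from prior knowledge with non-parametric latent state transition adaptation to forecast across multiple time scales. Chaos-theoretic balancing of dynamical regime diversity improves representation robustness, and the model performs well across different clinical scenarios.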

Comments: NormWear Collection: https://huggingface.co/collections/mosaic-laboratory/normwear

Abstract

Physiological time series signals reflect complex, multi-scale dynamical processes of the human body. Existing modeling studies focus on static tasks such as classification, event forecasting, or short-horizon next-step prediction, while long-horizon signal-level forecasting and the predictive nature of physiological signals remain underexplored. We introduce NormWear-2, a world model that encodes both multivariate physiological signals and clinical intervention variables into a shared latent space and models their joint temporal evolution as a dynamical system. Our approach combines inference from prior pre-trained knowledge (intuition) with instant non-parametric latent state transition adaptation (insight), enabling coherent forecasting across multiple temporal scales, conditioned on heterogeneous clinical interventions. During the pretraining phase, we find that chaos-theoretic balancing of dynamical regime diversity yields more robust representations, with a smaller balanced corpus outperforming one twice its size and capturing bifurcation regimes. We evaluate the world model performance across diverse real-world physiological datasets spanning heterogeneous temporal resolutions and intervention regimes, covering daily life, point-of-care, and clinical settings, including fitness planning, hemodialysis, diabetes management, and surgical monitoring. These evaluation datasets comprise records from 8,026 subjects, spanning study durations from 3.2 hours for high-resolution signal data to 2.3 years for longitudinal clinical biomarker tracking. NormWear-2 achieves the best overall forecasting performance across time, frequency, and latent representation domains, with significant improvements over state-of-the-art time series foundation models, while maintaining competitive downstream representation quality, providing a step toward general-purpose world models for physiological signals.

2605.15464 2026-05-18 cs.LG cs.AI cs.CL

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi

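Affiliations: University of California, Riverside

AI summary: GRLO studies how well RLHF trained from a small amount of interaction data in open-ended environments generalizes, asking whether the conversational abilities it acquires transfer to downstream tasks such as mathematical reasoning and code generation, and presents an efficient, low-cost training recipe.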

Abstract

Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation, namely GRLO. Specifically, on the Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models, which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: https://github.com/SJY8460/GRLO.

2605.15463 2026-05-18 cs.LG

Layer-wise Derivative Controlled Networks

Rowan Martnishn, Sean Anderson

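Affiliations: Sentivity AI; Virginia Tech

AI summary: This paper proposes the ChainzRule network, which uses DREG regularization to balance model accuracy, hardware efficiency, and functional stability; experiments show it outperforms conventional models in parameter usage and in control of gradient volatility.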

Comments: Under Review at Neural Network Elsevier

Abstract

As machine learning models grow in complexity, they increasingly struggle with three conflicting demands: the need for high accuracy, the requirement for hardware efficiency, and the necessity of functional stability. Traditional architectures often achieve performance at the expense of spiky or unpredictable behavior, where small changes in input lead to massive swings in output -- a critical flaw for real-world deployment in sensitive environments. This paper introduces ChainzRule (CR), a novel neural architecture designed to harmonize these competing goals. ChainzRule replaces standard piecewise-linear activations with a Polynomial Engine governed by Differential Regularization (DREG). Unlike traditional methods that impose global, coarse-grained constraints on a model's Lipschitz constant, DREG acts as a targeted regularization on intermediate derivatives. This approach suppresses extreme sensitivity without attenuating the representational power inherent in the Polynomial Engine. In head-to-head "Fair Fight" benchmarks, ChainzRule outperformed standard models while using 15.5x fewer parameters. On the MNIST dataset, it reduced peak gradient volatility by an average of 23.1%, ensuring a smoother and more predictable manifold. On Yelp Full ordinal regression under explicit DREG regularization, ChainzRule achieves 70.17% accuracy, validating that derivative-aware regularization is compatible with competitive performance on realistic tasks. By embedding gradient awareness into the architecture via DREG, ChainzRule demonstrates that stability and accuracy need not be competing objectives.
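
To make the idea of regularizing intermediate derivatives concrete, here is a minimal sketch of a polynomial activation together with a penalty on its derivative. The abstract does not specify the form of the Polynomial Engine or of DREG, so the function names, the mean-squared-derivative penalty, and the coefficients are assumptions chosen purely for illustration.

```python
def poly(coeffs, z):
    """Evaluate a0 + a1*z + a2*z**2 + ... using Horner's rule."""
    out = 0.0
    for a in reversed(coeffs):
        out = out * z + a
    return out

def poly_derivative(coeffs, z):
    """Evaluate the derivative a1 + 2*a2*z + 3*a3*z**2 + ... at z."""
    derivative_coeffs = [k * a for k, a in enumerate(coeffs)][1:]
    return poly(derivative_coeffs, z)

def derivative_penalty(coeffs, pre_activations):
    """Assumed penalty: mean squared activation derivative over a batch."""
    total = sum(poly_derivative(coeffs, z) ** 2 for z in pre_activations)
    return total / len(pre_activations)

# Hypothetical activation p(z) = z + 0.5*z**2, so p'(z) = 1 + z.
coeffs = [0.0, 1.0, 0.5]
penalty = derivative_penalty(coeffs, [0.0, 2.0])  # (1**2 + 3**2) / 2
```

In training, such a term would typically be added to the task loss with a weight, for example `loss = task_loss + lam * penalty`, where `lam` is a hypothetical hyperparameter. The penalty targets the activation's local slope directly instead of bounding a global Lipschitz constant.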

2605.15461 2026-05-18 cs.LG cs.AI

DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery

Yikun Zhang, Xiwei Cheng, Tianyu Liu, Yuanqi Du, Wengong Jin

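Affiliations: Northeastern University; Broad Institute of MIT and Harvard; Yale University; Microsoft Research New England

AI summary: DrugSAGE is a self-evolving agent experience framework that builds state-of-the-art drug discovery models efficiently; its cross-task memory improves model performance and gives it a clear advantage when no test-time search is allowed.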

Abstract

Building state-of-the-art (SOTA) predictive models for drug discovery requires expensive search over tools, architectures, and training strategies. Current LLM-based agents can find SOTA solutions through extensive trial and error, but they do not retain the experience accumulated along the way and therefore pay the full search cost on every new task. We propose DrugSAGE (Self-evolving Agent Experience), a framework that accumulates and reuses experience across tasks to build SOTA drug discovery models efficiently. DrugSAGE maintains a cross-task memory of verified skills, statistical evidence about effective strategies, and a record of recurring errors and their fixes. In some cases, DrugSAGE transfers a working solution directly without test-time search. In 33 molecular property prediction tasks, DrugSAGE ranks first among nine SOTA agents in a single-task setting. With memory accumulated from 16 smaller tasks, DrugSAGE achieves an average normalized score of 0.935 on 17 held-out tasks in a cross-task evaluation setting and outperforms all baseline agents by 10-30% in a regime with zero test-time search. In summary, our work shows the advantage of cross-task memory for efficient SOTA model development in drug discovery.

2605.15459 2026-05-18 cs.LG stat.ML

Don't Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics

Albert Kjøller Jacobsen, Leo Uhre Jakobsen, Johanna Marie Gegenfurtner, Georgios Arvanitidis

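Affiliations: Section for Cognitive Systems, Technical University of Denmark; Center for Quantum Information Physics, New York University

AI summary: This paper proposes DiMS, a method that samples loss minima exactly via dissipative Riemannian mechanics, addressing the inability of existing methods to sample reparameterization-invariant solutions accurately, and validates it in Bayesian inference.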

Abstract

The minima of modern neural network loss functions are typically not isolated; rather, they form connected components of reparameterization-invariant solutions on the training data. Analytically characterizing these solutions is a hard problem, but sampling approaches are feasible. By construction, existing methods either spread over low-loss regions, and thus do not sample reparameterization-invariant solutions exactly, or are inherently local, which limits exploration of other minima valleys. We propose sampling such reparameterization-invariant models using a dynamical system based on kinetic energy, subject to a gravitational pull and a friction term that dissipates energy from the system. Our proposed sampler, DiMS, is guaranteed to sample exactly from the minimum level sets and depends on physically motivated hyperparameters, which allow control over the exploration capabilities of the sampler. We consider uncertainty quantification in Bayesian inference as the motivating problem and observe improved performance compared to previously proposed approaches.

2605.15458 2026-05-18 cs.CV

Video Models Can Reason with Verifiable Rewards

Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen

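Affiliations: University of California, Davis; Microsoft Research; University of Southern California; University of California, Santa Cruz

AI summary: This paper proposes VideoRLVR, which optimizes video diffusion models with rule-based feedback to improve verifiable reasoning; it outperforms supervised fine-tuning baselines on Maze, FlowFree, and Sokoban, showing that verifiable RL can move video models beyond perceptual imitation.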

Comments: Website: https://darthzhu.github.io/VideoRLVR-page/

Abstract

Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
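
Two ingredients named in the abstract lend themselves to a small sketch: group-relative reward normalization, which is the general idea behind GRPO-style methods, and a mask that limits policy updates to early denoising steps. The details of SDE-GRPO and of Early-Step Focus are not given in the abstract, so the function names, the use of the population standard deviation, and the `focus_fraction` parameter are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize verifier rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def early_step_mask(num_steps, focus_fraction):
    """True for denoising steps that receive policy updates (hypothetical cutoff rule)."""
    cutoff = int(num_steps * focus_fraction)
    return [t < cutoff for t in range(num_steps)]

# Four sampled videos for one maze; the rule-based verifier returns 1 for a solved maze.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
mask = early_step_mask(num_steps=10, focus_fraction=0.5)
```

Skipping gradient computation on the masked-out late steps is one plausible way a method of this kind could reduce training latency; the sketch does not claim to reproduce the roughly 40% figure reported above.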

2605.15450 2026-05-18 cs.CV cs.AI cs.LG

RIDE: Retinex-Informed Decoupling for Exposing Concealed Objects

Chunming He, Rihan Zhang, Dingming Zhang, Chengyu Fang, Longxiang Tang, Jingjia Feng, Fengyang Xiao, Sina Farsiu

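Affiliations: Duke University; Tsinghua University; Harvard University

AI summary: RIDE uses Retinex theory to introduce a same-domain (homogeneous) image decomposition for concealed object segmentation, and draws on the Discriminability Gap Theorem to improve the separation of foreground from background.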

Abstract

Concealed Object Segmentation (COS) encompasses a family of dense-prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, where targets are visually entangled with their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or employ heterogeneous decompositions (e.g., Fourier, wavelet) that redistribute spatial evidence across scale/frequency coefficients, making pixel-aligned cues less direct. We introduce a fundamentally different perspective: homogeneous image decomposition via Retinex theory, which factorizes an image into illumination and reflectance components within the same spatial domain. Our key insight is that visual entanglement enforces appearance matching in the composite space, but this does not necessitate simultaneous matching in both component spaces, a phenomenon we formalize as the Discriminability Gap Theorem. Crucially, we show that across diverse COS sub-tasks, the underlying physical processes systematically anti-correlate illumination and reflectance differences, yielding theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground-background discriminability across the full physical regime, with anti-correlation maximizing the gain. Building on this, we propose RIDE, comprising: (i) a Task-Driven Retinex Decomposition module that learns segmentation-optimal factorizations end-to-end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits where decomposition helps; and (iii) a Camouflage-Breaking Contrastive loss operating in reflectance feature space.
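
The key insight can be illustrated with a toy numeric example under the standard Retinex model, in which a pixel's intensity is the product of illumination and reflectance. The pixel values below are made up for illustration and are not from the paper.

```python
def composite(illumination, reflectance):
    """Retinex image model for one pixel: intensity = illumination * reflectance."""
    return illumination * reflectance

# Hypothetical foreground and background pixels with anti-correlated components:
# the foreground is darker-lit but more reflective than the background.
fg = {"illumination": 0.5, "reflectance": 0.8}
bg = {"illumination": 0.8, "reflectance": 0.5}

composite_gap = abs(composite(**fg) - composite(**bg))
illumination_gap = abs(fg["illumination"] - bg["illumination"])
reflectance_gap = abs(fg["reflectance"] - bg["reflectance"])
```

Here the two pixels are indistinguishable in the composite image (the gap is zero), yet each component separates them by about 0.3. This is the situation the abstract describes: matching in the composite space does not force matching in both component spaces.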

2605.15445 2026-05-18 cs.AI

From LLM-Generated Conjectures to Lean Formalizations: Automated Polynomial Inequality Proving via Sum-of-Squares Certificates

Ruobing Zuo, Hanrui Zhao, Gaolei He, Zhengfeng Yang, Jianlin Wang

Affiliations: School of Software Engineering, East China Normal University, Shanghai, China; College of Computer Science and Technology, National University of Defense Technology, Changsha, China; School of Computer and Information Engineering, Henan University, Kaifeng, China

AI summary: This paper proposes NSPI, a framework that combines LLMs with symbolic computation to prove polynomial inequalities via sum-of-squares certificates, and demonstrates its effectiveness and scalability on polynomials with up to 10 variables.

Comments: Accepted to ICML 2026. Preprint version

Abstract

Automated proving of polynomial inequalities is a fundamental challenge in automated mathematical reasoning, where rich algebraic structure and a rapidly growing certificate search space hinder scalability. Purely symbolic approaches provide strong guarantees but often scale poorly as the number of variables or the degree increases, due to expensive algebraic manipulations and rapidly growing intermediate expressions. In parallel, LLM-guided methods have made notable progress, particularly on competition-style inequalities with a small number of variables. To address the remaining scalability challenges, we propose NSPI, a neuro-symbolic framework that combines the complementary strengths of LLMs and symbolic computation for polynomial-inequality proving. Concretely, an LLM proposes a conjecture in the form of an approximate polynomial Sum-Of-Squares (SOS) decomposition; we refine it via symbolic computation to obtain an exact polynomial SOS representation, which directly proves the target inequality, and we further certify the proof in Lean, yielding an end-to-end pipeline from heuristic discovery to machine-checked proof. Experiments on challenging benchmarks involving polynomials with up to 10 variables demonstrate the effectiveness and scalability of the proposed method.
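
The refine-then-verify idea can be sketched on a tiny two-variable example. The abstract does not describe NSPI's actual refinement procedure, so the coefficient rounding used here is a stand-in, and the approximate decomposition is a made-up example of the kind of conjecture an LLM might produce. The paper certifies proofs in Lean; this sketch only checks the identity by exact expansion with rational arithmetic.

```python
from fractions import Fraction

# A polynomial in x, y is a dict mapping exponent pairs (i, j) to coefficients.

def poly_mul(p, q):
    out = {}
    for (i1, j1), a in p.items():
        for (i2, j2), b in q.items():
            key = (i1 + i2, j1 + j2)
            out[key] = out.get(key, 0) + a * b
    return out

def poly_add(p, q):
    out = dict(p)
    for key, b in q.items():
        out[key] = out.get(key, 0) + b
    return {k: v for k, v in out.items() if v != 0}

def round_poly(p, max_den=10):
    """Snap float coefficients to nearby rationals with small denominators."""
    return {k: Fraction(v).limit_denominator(max_den) for k, v in p.items()}

def sos_value(squares):
    """Expand the sum of the squares of the given polynomials."""
    total = {}
    for s in squares:
        total = poly_add(total, poly_mul(s, s))
    return total

# Target: x^2 - 2xy + 2y^2, claimed to be nonnegative.
target = {(2, 0): Fraction(1), (1, 1): Fraction(-2), (0, 2): Fraction(2)}

# Hypothetical approximate conjecture: (0.9998x - 1.0003y)^2 + (0.9996y)^2.
approx = [{(1, 0): 0.9998, (0, 1): -1.0003}, {(0, 1): 0.9996}]

exact = [round_poly(s) for s in approx]   # (x - y)^2 + y^2
certified = sos_value(exact) == target    # exact identity, hence target >= 0
```

The approximate decomposition alone proves nothing, because its expansion differs from the target in the low-order digits. Only once the exact identity holds does nonnegativity follow, since a sum of squares of real polynomials is never negative.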

2605.15440 2026-05-18 cs.CL

Why are language models less surprised than humans? Testing the Parse Multiplicity Mismatch Hypothesis

William Timkey, Brian Dillon, Tal Linzen

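Affiliations: Department of Linguistics, New York University; Center for Data Science, New York University; Department of Linguistics, University of Massachusetts Amherst

AI summary: This study examines the gap between language model surprisal and human sentence processing, varying the number of simultaneous parses to test whether the parse multiplicity mismatch hypothesis accounts for syntactic ambiguity effects.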

Abstract

Surprisal theory posits that the processing difficulty of a word is determined by its predictability in context, offering a potential link between human sentence processing and next-word predictions from language models. While language model (LM) surprisals successfully predict reading times in naturalistic text, they systematically underpredict the magnitude of difficulty observed in controlled studies of syntactic ambiguity, particularly in garden path sentences. This mismatch might arise from differences in the computational constraints between humans and LMs. Here we test one such hypothesis, specifically, that LMs may be able to consider a greater number of distinct sentence interpretations at once than humans can. Using Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, we systematically vary the number of simultaneous parses used to compute word surprisal, and then use these surprisals to predict human reading times. Reducing the number of simultaneous active parses indeed increases the magnitude of predicted garden path effects, but not nearly enough to capture the full magnitude of the effects in humans. This suggests that differences in the number of simultaneous parses available to LMs and humans cannot reconcile LM-based surprisal with human sentence processing.
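
A toy calculation shows why limiting the number of simultaneous parses raises surprisal at a disambiguating word. This is a simplified sketch, not the RNNG word-synchronous beam search used in the study: the two parses, their probabilities, and the `surprisal` function are hypothetical.

```python
import math

def surprisal(parses, word_probs, beam_size):
    """Surprisal in bits of the next word, using only the top `beam_size` parses.

    parses:     {parse name: probability of the sentence prefix under that parse}
    word_probs: {parse name: P(next word | that parse)}
    """
    beam = sorted(parses, key=parses.get, reverse=True)[:beam_size]
    before = sum(parses[p] for p in beam)
    after = sum(parses[p] * word_probs[p] for p in beam)
    return -math.log2(after / before)

# Before the disambiguating word, the main-verb reading dominates.
parses = {"main_verb": 0.9, "reduced_relative": 0.1}
# The disambiguating word is nearly impossible under the main-verb reading.
word_probs = {"main_verb": 0.001, "reduced_relative": 0.5}

wide = surprisal(parses, word_probs, beam_size=2)    # both parses kept
narrow = surprisal(parses, word_probs, beam_size=1)  # only the favored parse kept
```

With both parses available, the low-probability reading absorbs the unexpected word and surprisal stays moderate. With a single parse, the word must be explained by the wrong analysis and surprisal rises sharply. The study reports that this effect goes in the right direction in real models but falls well short of the human garden path effect.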

2605.15436 2026-05-18 cs.CL cs.LG

Neural Activation Patterns Across Language Model Architectures: A Comprehensive Analysis of Cognitive Task Performance

Mahdi Naser-Moghadasi, Faezeh Ghaderi

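Affiliations: Research Division, BrightMind AI; Texas Tech University; University of Texas at Arlington

AI summary: This paper analyzes neural activation patterns of six large language model architectures across twelve cognitive task categories, revealing differences in how encoder and decoder architectures handle different tasks; mathematical reasoning produces the highest attention entropy, and decoder models show higher sparsity.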

Comments: 8 pages, accepted at IEEE BigData 2025

Abstract

This paper presents a comprehensive analysis of neural activation patterns across six distinct large language model (LLM) architectures, examining their performance on twelve cognitive task categories. Through systematic measurement of final activation values, attention entropy, and sparsity patterns, we reveal fundamental differences in how encoder and decoder architectures process diverse cognitive tasks. Our analysis of 144 task-model combinations demonstrates that mathematical reasoning consistently produces the highest attention entropy across all architectures, while decoder models exhibit significantly higher sparsity patterns compared to encoder models. The findings provide critical insights into the computational characteristics of modern language models and their task-specific neural behaviors, with implications for model selection and optimization in big data applications.
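
The abstract does not define its metrics, so the following is a sketch under common assumptions: attention entropy as the Shannon entropy of one attention distribution, and sparsity as the fraction of activations with near-zero magnitude. The function names and the threshold are hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy in nats of one attention distribution (weights sum to 1)."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def sparsity(activations, threshold=1e-3):
    """Fraction of activations whose magnitude falls below the threshold."""
    return sum(abs(a) < threshold for a in activations) / len(activations)

diffuse = attention_entropy([0.25, 0.25, 0.25, 0.25])  # attention spread evenly
focused = attention_entropy([1.0, 0.0, 0.0, 0.0])      # attention on one token
share_inactive = sparsity([0.0, 0.5, 0.0001, -2.0])
```

Under this definition, higher entropy means attention is spread over more tokens, which is one way to read the finding that mathematical reasoning produces the highest attention entropy.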

2605.15430 2026-05-18 cs.RO cs.CV

Where to Perch in a Tree: Vision-Guidance for Tree-Grasping Drones

Alex Dunnett, Leonie Bottomley, Mirko Kovac, Basaran Bahadir Kocer

Affiliations: Department of Civil, Aerospace and Design Engineering, University of Bristol; Laboratory of Sustainability Robotics at Swiss Federal Laboratories for Materials Science and Technology (EMPA); Ecole Polytechnique Fédérale de Lausanne (EPFL)

AI summary: This paper proposes a vision-guidance method for finding an ideal perch location on a tree, using image processing algorithms to assess the tree's shape and structure and selecting suitable branches based on branch width, slope, and curvature.

Comments: Work in progress version accepted to the Recent Advances in Robotic Perception for Forestry

Abstract

This study demonstrates a method to locate an ideal perch location on a tree for vision-guided autonomous tree-perching drones. Various image processing algorithms, including those used for machine learning, image segmentation and binary image morphology, are implemented to assess the shape and structure of a tree. Rather than identifying the closest available branch, this study builds on vision methods by evaluating the potential of each branch, determining its suitability for perching based on factors such as branch width, slope (angle to the horizontal) and curvature. For a given tree-perching drone and a dataset of more than 10,000 urban tree images taken from February to October in a subtropical and temperate monsoon climate, the proposed method successfully produces a result for 76% of feasible targets. A feasible target is defined as a tree where the branch diameters are sufficiently thick and where the available perching space is at least equal to the width of a tendon-driven grasping claw. These successful preliminary results create a foundation from which a number of identified improvements and additional features can be developed to create a generalised method; this will involve the incorporation of supplementary data from depth perception and attitude sensors to enhance the branch assessment.

2605.15424 2026-05-18 cs.CV

Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models

Po-Chien Luan, Wuyang Li, Yang Gao, Alexandre Alahi

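Affiliations: EPFL, Switzerland

AI summary: This paper proposes Social-Mamba, which treats social interactions as structured sequential processes and combines a Cycle Mamba block with social triplet factorization for efficient, accurate trajectory forecasting; experiments show strong results on multiple benchmarks.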

Abstract

Human trajectory forecasting is crucial for safe navigation in crowded environments, requiring models that balance accuracy with computational efficiency. Efficiently modeling social interactions is key to performance in dense crowds. Yet, most recent methods rely on attention mechanisms, which are effective at capturing complex dependencies, but incur quadratic computational costs that scale poorly with the growing number of neighbors. Recently, Selective State-Space Models have provided a linear-time alternative; however, their inherently sequential design is misaligned with the unstructured and dynamic nature of social interactions. To address this challenge, we propose Social-Mamba, a forecasting architecture that reformulates social interactions as structured sequential processes. At its core is the Cycle Mamba block, a novel module that enables continuous bidirectional information flow. Social-Mamba organizes agents on an egocentric grid and introduces social triplet factorization, which decomposes interactions into temporal, egocentric, and goal-centric scans. These are dynamically integrated through a learnable social gate and global scan to generate accurate and efficient trajectory predictions. Extensive experiments on five trajectory forecasting benchmarks show that Social-Mamba achieves state-of-the-art accuracy while offering superior parameter efficiency and computational scalability. Furthermore, embedding Social-Mamba into a flow-matching framework further enhances both accuracy and efficiency, establishing it as a flexible and robust foundation for future trajectory forecasting research. The code is publicly available: https://github.com/vita-epfl/Social-Mamba

2605.15423 2026-05-18 cs.CV cs.AI eess.IV

MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes

MR2-ByteTrack:基于CNN和Transformer的视频目标检测用于AI增强的嵌入式视觉传感器节点

Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti

Affiliations: Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy; Department of Electrical Engineering (ESAT), KU Leuven, Belgium; Dalle Molle Institute for Artificial Intelligence (IDSIA), USI-SUPSI, Switzerland

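AI summary: This paper proposes MR2-ByteTrack, a video object detection method for embedded vision nodes. It alternates full- and low-resolution inference and combines ByteTrack with the Rescore algorithm to cut compute while preserving accuracy, enabling real-time detection on MCU-class devices.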

详情
AI中文摘要

现代智能视觉传感器需要设备端智能来处理视频流,因为云计算在带宽、延迟和隐私限制下往往不可行。然而,这些传感系统通常依赖超低功耗微控制器(MCUs),其内存和计算能力有限,使得需要特征存储或多帧缓冲的传统视频目标检测方法不可行。为了解决这一挑战,我们引入了多分辨率重评分ByteTrack(MR2-ByteTrack),一种专为基于MCU的嵌入式视觉节点设计的视频目标检测(VOD)方法。MR2-ByteTrack通过交替使用全分辨率和低分辨率推理来降低计算成本,同时通过ByteTrack在帧间链接检测,并通过Rescore算法通过概率联合规则聚合跨帧的检测置信度分数以纠正误分类。我们将其应用于基于CNN的检测器和基于Transformer的模型,证明了其在具有根本不同空间处理的架构中的通用性。在ImageNetVID上的实验表明,MR2-ByteTrack保持了准确性,实现了CNN模型的mAP最高达49.0,Transformer模型的mAP为48.7,同时将CNN的乘加操作减少了高达53%,Transformer的减少了32%。当部署在GAP9上,一个超低功耗RISC-V多核MCU上时,我们的方法相比仅处理全分辨率图像,实现了高达55%的能耗节省,实现了在MCU类嵌入式视觉节点上的首个实时Transformer-based VOD。代码可在https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access获取。

英文摘要

Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, unfeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access
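
The abstract says Rescore aggregates per-frame confidence scores with probability union rules but does not spell out the algorithm. The following is only a minimal sketch of the textbook union rule, assuming the detections linked into one track are independent; the function names `union_score` and `rescore_track` are hypothetical and are not taken from the MR2-ByteTrack repository.

```python
from math import prod


def union_score(confidences):
    """Union rule: probability that at least one of several
    (assumed independent) detections of a class is correct."""
    return 1.0 - prod(1.0 - c for c in confidences)


def rescore_track(per_class_scores):
    """Fuse the per-frame scores of one track and pick the winning class.

    per_class_scores: dict mapping a class label to the list of confidence
    scores that class received on the frames linked into this track.
    Returns (best_label, fused_scores).
    """
    fused = {label: union_score(scores) for label, scores in per_class_scores.items()}
    best_label = max(fused, key=fused.get)
    return best_label, fused


# Hypothetical track: two moderately confident "cat" frames and one "dog" frame.
label, fused = rescore_track({"cat": [0.6, 0.7], "dog": [0.8]})
```

Under this assumption, repeated moderate-confidence detections of the same class can outweigh a single higher-confidence outlier, which is one way a track-level score could correct a per-frame misclassification.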

2605.15421 2026-05-18 cs.CV

U-SEG: Uncertainty in SEGmentation -- A systematic multi-variable exploration

U-SEG:不确定性在分割中的探索——系统多变量研究

Michael Smith, Frank P. Ferrie

Affiliations: Centre for Intelligent Machines, McGill University

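AI summary: This paper systematically examines open questions at the intersection of uncertainty estimation and segmentation, analyzing how datasets, backbones, and downstream tasks affect results. It finds that the harder task of panoptic segmentation usually degrades performance and that sample diversity helps mainly for calibration.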

Comments Accepted to CVPR Findings Track 2026

详情
AI中文摘要

本文深入探讨了不确定性估计与分割交叉领域中的一些未被充分研究的课题。先前研究表明,不确定性估计的质量对多种变量非常敏感。作为不确定性估计的主要应用之一,帮助识别和解决实际场景中的预测错误,任何影响这一应用的因素都必须明确识别。例如,更具挑战性的领域或不同的数据集和架构是否会导致使用不确定性估计时性能下降?视频序列中的先前帧是否能提供与其它方法相当的不确定性估计?能否利用样本多样性结合不确定性估计方法以获得更好的估计?最后,何时使用基于集成的不确定性估计比确定性网络更合理?我们通过创建框架并执行大规模研究,跨多个变量(如数据集、主干网络和下游任务)对语义和全景分割进行研究。我们发现,a) 具有挑战性的全景分割任务通常导致性能下降,而数据集和主干网络之间的高性能方差表明泛化并不保证;b) 时间序列样本对特定配置有用,但在许多情况下不值得付出代价;c) 样本多样性在校准下游任务中最具潜力,但其他情况下无法超越更简单的替代方案;d) 确定性方法在某些下游任务中足够,但若在部署中能实现正确条件,集成方法可带来显著改进。

英文摘要

In this study, we explore in depth a few under-studied topics at the intersection of uncertainty estimation and segmentation. Prior work has shown that the quality of uncertainty estimates can be very sensitive to a range of variables. As one of the main uses of uncertainty estimation is to help identify and deal with prediction errors in practical scenarios, any factors that affect this must be clearly identified. For example, do more challenging domains or different datasets and architectures result in worse performance when using uncertainty estimates? Can prior frames in a video sequence in fact provide useful uncertainty estimates comparable to other approaches? Is it possible to combine uncertainty estimation approaches, taking advantage of sample diversity, to get better estimates? Finally, when might it make sense to use an ensemble-based uncertainty estimate over a deterministic network? We address these questions by creating a framework for and executing a large scale study across many variables such as datasets, backbones, and downstream tasks, for both semantic and panoptic segmentation. We find that a) the more challenging task of panoptic segmentation usually results in worse performance while high performance variance between datasets and backbones indicates that generalization is not guaranteed, b) time series samples can be useful for specific configurations, but in many cases are not worth the cost, c) sample diversity shows the most promise in the downstream task of calibration, but otherwise fails to beat simpler alternatives, d) a deterministic approach is adequate for some downstream tasks, but ensembles allow for significant improvements if the right conditions can be achieved in deployment.

2605.15417 2026-05-18 cs.LG cs.AI

$f$-Trajectory Balance: A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data

$f$-轨迹平衡:一种用于调整GFlowNets、生成模型和LLMs的损失家族,结合on-policy和off-policy数据

Jake Fawkes, Jason Hartford

Affiliations: Department of Statistics, University College London, UK; Valence Labs, London, UK

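AI summary: This paper proposes a family of losses based on $f$-divergences for tuning generative models with on-policy and off-policy data. The losses inherit properties of the chosen divergence, such as broader mode coverage.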

Comments Published at ICML 2026

详情
AI中文摘要

在GFlowNets和变分推断中,目标与模型对数概率之间的均方误差被证明是训练生成模型的有效低方差替代损失。该损失具有在on-policy情况下其梯度对应KL散度的梯度,而在off-policy情况下仍保持有效损失且具有相同全局最小值的性质。本文证明该构造可扩展到整个$f$-散度家族,从而得到一系列损失函数,其on-policy梯度对应相应的$f$-散度,但保留相同的全局最小值。具体而言,我们展示了on-policy梯度导致目标与模型对数概率上的翻译不变损失函数与$f$-散度之间的一一对应关系。这种等价性使我们能够设计新的替代损失函数,用于调整广泛类别的生成模型,继承相应$f$-散度的性质,如更广泛的模式覆盖,同时适用于off-policy数据。我们将其应用于各种任务,包括经典合成示例、SynFlowNets分子发现和异步大语言模型(LLM)调整,证明我们的模型在广泛类别的生成模型中保留其预测属性,无论是on-policy还是off-policy数据。

英文摘要

In GFlowNets and variational inference, it has been shown that the mean squared error between target and model log probabilities is an effective, low-variance surrogate loss for training generative models. This loss has the property that, when evaluated on-policy, its gradients correspond to those of the KL divergence, while off-policy it remains a valid loss with the same global minimizer. In this work, we demonstrate that this construction can be extended to the whole family of $f$-divergences, leading to a family of losses whose on-policy gradients are those of the corresponding $f$-divergence and which retain the same global minimizer off-policy. Specifically, we show that the on-policy gradients lead to a one-to-one correspondence between translation-invariant loss functions on the target and model log probabilities and $f$-divergences. This equivalence allows us to design new surrogate loss functions for tuning a wide class of generative models that inherit the properties of the corresponding $f$-divergence, such as being more mode covering, whilst being applicable to off-policy data. We apply our losses on a range of tasks, including classic synthetic examples, SynFlowNets for molecule discovery, and asynchronous large language model (LLM) tuning, demonstrating that our models retain their predicted properties on- and off-policy in a wide class of generative models.
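
The KL special case mentioned in the abstract can be checked in a few lines. This is a sketch under simplifying assumptions that are not stated in the abstract: the target $p$ is normalized, the model is $q_\theta$, samples come from a distribution $\pi$, and no gradient flows through the sampling distribution. The symbols $\mathcal{L}_\pi$, $q_\theta$, and $\pi$ are notation chosen here for illustration.

```latex
% Squared log-ratio surrogate, with samples drawn from \pi (treated as fixed):
\mathcal{L}_\pi(\theta)
  = \tfrac{1}{2}\,\mathbb{E}_{x\sim\pi}\!\left[\big(\log q_\theta(x) - \log p(x)\big)^2\right],
\qquad
\nabla_\theta \mathcal{L}_\pi(\theta)
  = \mathbb{E}_{x\sim\pi}\!\left[\big(\log q_\theta(x) - \log p(x)\big)\,\nabla_\theta \log q_\theta(x)\right].

% Gradient of the reverse KL divergence:
\nabla_\theta \mathrm{KL}(q_\theta \,\|\, p)
  = \mathbb{E}_{x\sim q_\theta}\!\left[\big(\log q_\theta(x) - \log p(x)\big)\,\nabla_\theta \log q_\theta(x)\right]
  + \underbrace{\mathbb{E}_{x\sim q_\theta}\!\left[\nabla_\theta \log q_\theta(x)\right]}_{=\,0}.

% On-policy (\pi = q_\theta) the two gradients coincide:
\nabla_\theta \mathcal{L}_{\pi}(\theta)\Big|_{\pi = q_\theta}
  = \nabla_\theta \mathrm{KL}(q_\theta \,\|\, p).
```

Off-policy, for any $\pi$ with full support, $\mathcal{L}_\pi(\theta) = 0$ if and only if $q_\theta = p$, so the global minimizer is unchanged. With an unnormalized target, the log-ratio would additionally carry a normalizing-constant term, which this sketch omits.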

2605.15413 2026-05-18 cs.LG

Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models

Transformer可扩展性危机:现代语言模型中性能瓶颈的首次全面实证分析

Mahdi Naser Moghadasi, Faezeh Ghaderi

Affiliations: Research Division, BrightMind AI; Texas Tech University; University of Texas at Arlington

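AI summary: By evaluating 118 transformer models, this paper reports performance walls as sequence length grows: the share of models that successfully process a sequence falls from 88.1% at 512 tokens to 0% at 2048 tokens, challenging prevailing scaling assumptions.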

Comments 8 pages, accepted at IEEE BigData 2025

详情
AI中文摘要

尽管transformer架构在自然语言处理中取得了显著成功,但其可扩展性限制仍缺乏系统性的实证分析。本文首次对118种transformer模型进行了大规模评估,涵盖七种不同的架构类别,揭示了根本性的性能瓶颈,表现为硬性部署限制。我们的系统性基准测试方法揭示了关键的可扩展性危机:虽然88.1%的模型能够处理最多512个token的序列,但在1024个token时降至44.9%,在2048个token时完全失败。通过严格分析加载时间、内存消耗和计算效率,我们证明压缩模型在参数效率(649.2 tokens/sec/M参数)上优于大型生成模型(12.5 tokens/sec/M)。我们的发现挑战了主流的扩展假设,并提供了首次定量证据,表明理论上的O(n²)注意力复杂度转化为可测量的性能瓶颈。本工作建立了新的transformer评估基准方法,并为生产环境中的实际部署决策提供了关键见解。

英文摘要

Despite the remarkable success of transformer architectures in natural language processing, their scalability limitations remain poorly understood through systematic empirical analysis. This paper presents the first comprehensive large-scale evaluation of 118 transformer models across seven distinct architectural categories, revealing fundamental performance walls that manifest as hard deployment constraints. Our systematic benchmarking methodology uncovers a critical scalability crisis: while 88.1% of models successfully process sequences up to 512 tokens, this drops dramatically to 44.9% at 1024 tokens, with complete failure (0%) at 2048 tokens. Through rigorous analysis of loading times, memory consumption, and computational efficiency across sequence lengths from 128 to 2048 tokens, we demonstrate that compressed models achieve superior parameter efficiency (649.2 tokens/sec/M parameters) compared to large generative models (12.5 tokens/sec/M). Our findings challenge prevailing scaling assumptions and provide the first quantitative evidence that the theoretical $O(n^2)$ attention complexity translates into measurable performance walls. This work establishes new benchmarking methodologies for transformer evaluation and provides critical insights for practical deployment decisions in production environments.
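
To make the $O(n^2)$ term concrete, here is a small back-of-the-envelope sketch of how the attention score matrix grows with sequence length. The head count (12), fp32 storage, and batch size of 1 are illustrative assumptions, not values reported in the paper, and the figure covers a single layer's score matrix only.

```python
def attention_score_bytes(seq_len, num_heads=12, bytes_per_value=4, batch_size=1):
    """Bytes needed to materialize one layer's (seq_len x seq_len) attention
    score matrix for every head. All defaults are illustrative assumptions."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


# Sequence lengths spanning the range studied in the abstract (128 to 2048).
for n in (128, 512, 1024, 2048):
    mib = attention_score_bytes(n) / 2**20
    print(f"n={n:5d}: {mib:8.2f} MiB per layer")
```

Doubling the sequence length quadruples this term, so going from 512 to 2048 tokens multiplies it by 16 under these assumptions.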

2605.15404 2026-05-18 cs.CL

Capability Conditioned Scaffolding for Professional Human LLM Collaboration

专业领域能力条件下的支架框架

Sen Yang, Yinglei Ma

Affiliations: University College London; Fudan University

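AI summary: This paper proposes Capability Conditioned Scaffolding, a framework that partitions expertise into strong, mixed, and weak domains and conditions intervention behavior on structured capability profiles, aiming for more reliable professional human-AI collaboration.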

详情
AI中文摘要

大型语言模型的个性化通常适应用户偏好和风格,但未考虑用户在不同专业领域中的评估能力差异。这种局限可能导致专业领域漂移,即用户依赖AI生成的推理内容,而无法可靠评估。我们引入能力条件支架框架,该框架将专业知识划分为强、混合和弱领域,并根据结构化能力档案调节干预行为。在多个MMLU子集和四个LLM子集上的初步评估显示,基于档案的干预行为一致,包括在档案交换下的类别反转和混合领域风险区的选择性激活。这些发现表明,具备能力意识的支架能够支持超越风格个性化的更可靠的专业人类AI协作。

英文摘要

Large language model personalization typically adapts outputs to user preferences and style but does not account for differences in user evaluation capacity across domains of expertise. This limitation can encourage Professional Domain Drift, where users rely on AI generated reasoning in domains they cannot reliably evaluate. We introduce Capability Conditioned Scaffolding, a typed framework that partitions expertise into strong, mixed, and weak domains and conditions intervention behavior on structured capability profiles. A pilot evaluation across multiple MMLU subsets and four LLM substrates shows consistent profile conditioned intervention behavior, including categorical inversion under profile swapping and selective activation in mixed domain risk zones. These findings suggest that capability aware scaffolding can support more reliable professional human AI collaboration beyond stylistic personalization.

2605.15403 2026-05-18 cs.LG math.OC stat.ML

$ϕ$-Balancing for Mixture-of-Experts Training

$ϕ$-平衡用于专家混合训练

Lizhang Chen, Jonathan Li, Qi Wang, Runlong Liao, Shuozhe Li, Chen Liang, Ni Lao, Qiang Liu

Affiliations: University of Texas at Austin; Northwestern University

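AI summary: This paper proposes $ϕ$-balancing, a framework that targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. It outperforms prior Switch-style and loss-free baselines in large-scale pretraining and downstream fine-tuning, with more stable and effective expert utilization.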

详情
AI中文摘要

混合专家(MoE)模型依赖于平衡的专家利用以充分发挥其可扩展性。然而,现有负载平衡方法大多是启发式的,并基于嘈杂的小批量分配统计,引入了相对于群体目标的偏差。我们提出$ϕ$-平衡,一个原则性的框架,通过最小化预期路由分布的严格凸、对称且可导的潜在函数,直接针对群体层面的专家平衡。利用凸对偶性,我们推导出等价的min-max形式,并通过镜像下降法获得一个简单的在线算法,得到一个高效的EMA基于路由调整,具有可忽略的开销。在大规模预训练和下游微调中,$ϕ$-平衡一致优于先前的Switch式和无损失基线,展示了更稳定和有效的专家利用。

英文摘要

Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $ϕ$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $ϕ$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.

2605.15400 2026-05-18 cs.AI

Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming

超越伙伴多样性:一种基于影响的团队引导框架用于零样本人机协同

Wei Sheng, Rohan Paleja

Affiliations: Department of Computer Science

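AI summary: This paper proposes Influence-Based Team Steering (IBTS), a framework that uses influence shaping to incentivize agents to discover diverse, high-performing team interaction patterns. It improves team performance and argues for combining sparse-reward coordination mechanisms with partner-variation coverage.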

详情
AI中文摘要

尽管AI代理正从孤立工具发展为交互合作者,数据驱动的人机协同(HMT)方法仍依赖跨领域的大量人类交互数据,导致成本高。零样本协调(ZSC)通过模拟多样化的伙伴群体来近似未见伙伴的行为。然而,随着团队规模扩大和通信退化,伙伴覆盖本身不足。为此,本文提出影响基于的团队引导(IBTS)框架,利用影响塑造激励智能体发现多样化的高绩效团队交互模式,并引导持续轨迹向更强的协调模式发展。在Overcooked-AI的双智能体和三智能体设置中评估IBTS,测试学习协调结构是否超越二元交互。评估包括模拟伙伴、合成伙伴风格变化以及首次涉及两名真实人类队友和一名机器队友的30人Overcooked-AI HMT研究。在这些评估中,IBTS在对比基线中提升了团队表现,突显了需要扩展ZSC来结合稀疏奖励协调机制与伙伴多样性覆盖,而非仅依赖多样性。

英文摘要

While AI agents are rapidly advancing from isolated tools to interactive collaborators, data-driven human-machine teaming (HMT) methods remain costly in their reliance on human interaction data across domains, teammates, and team sizes. Zero-shot coordination (ZSC) addresses this bottleneck by simulating diverse partner populations to approximate how unseen partners might behave. However, partner coverage alone is insufficient as team settings scale and communication becomes degraded. To remedy this deficiency, we propose Influence-Based Team Steering (IBTS), a framework that uses influence shaping to incentivize agents to discover diverse, high-performing team interaction patterns and further steers ongoing trajectories toward stronger learned coordination modes. We assess IBTS on Overcooked-AI in both two-agent and three-agent settings, allowing us to test whether learned coordination structure transfers beyond dyadic interaction. Our evaluation includes simulated partners, synthetic partner-style variation, and, to our knowledge, the first 30-subject Overcooked-AI HMT study involving two real human teammates and one machine teammate. Across these evaluations, IBTS improves team performance against competing baselines, highlighting the need for scaled ZSC to combine sparse-reward coordination mechanisms with partner-variation coverage rather than relying on diversity alone.

2605.15399 2026-05-18 cs.LG cs.AI cs.NA math.NA physics.comp-ph

Breakeven complexity: A new perspective on neural partial differential equation solvers

突破性复杂度:神经偏微分方程求解器的新视角

Yijing Zhang, Nicholas Roberts, Tanya Marwah, Mikhail Khodak

Affiliations: University of Wisconsin–Madison; Google DeepMind

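AI summary: This paper proposes an evaluation framework built on breakeven complexity. The framework accounts for the up-front costs of neural solvers and for the low-cost, low-fidelity solutions available from classical solvers, and uses both to analyze when neural PDE solvers become cost-effective.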

详情
AI中文摘要

偏微分方程的神经替代求解器相比数值方法能带来显著加速,尤其在需要多次求解的场景中。然而,现有基于精度的评估方法未充分考虑两个核心问题:(1) 神经求解器在数据生成、训练和调优上存在显著前期成本;(2) 经典求解器也能在足够低的模拟成本下生成低保真度解。为明确考虑这些现实并全面纳入端到端成本,我们提出以突破性复杂度为核心的评估框架,该指标衡量在学习求解器成本有效于等误差的传统求解器之前所需的前向求解次数。为了评估此指标,我们应用扩展定律确定应分配多少训练预算给数据生成,并讨论如何在不同设置中实现平滑的误差匹配。我们评估了多个神经PDE求解器在三个2D周期域上的PDEs以及由GPU原生PyFR代码生成的新型流动基准测试中的突破性复杂度。其他发现包括,神经PDE求解器在成本、维度、滚动、物理领域(如更高雷诺数)等更复杂的问题中变得更具有效性。

英文摘要

Neural surrogate solvers of partial differential equations (PDEs) promise dramatic speedups over numerical methods, especially in scenarios requiring many solves. However, current accuracy-based evaluations do not fully consider two central issues: (1) neural solvers incur substantial up-front costs for data generation, training, and tuning; and (2) classical solvers can also generate low-fidelity solutions at a sufficiently low simulation cost. To explicitly account for these realities and fully incorporate end-to-end costs, we propose an evaluation framework centered on breakeven complexity, a metric that counts the forward solves before a learned solver is cost-effective relative to an error-equivalent traditional solver. To evaluate this measure, we apply scaling laws to determine how much training budget to allocate to data generation and discuss how to achieve smooth error-matching in diverse settings. We evaluate the breakeven complexity of multiple neural PDE solvers on three PDEs on 2D periodic domains from APEBench and a novel benchmark of flows past multiple obstacles generated by the GPU-native PyFR code. Among other findings, our results suggest that neural PDE solvers become more effective as problems get harder in terms of cost, dimension, rollout, physics regime (e.g. higher Reynolds number), etc.
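
The breakeven idea can be illustrated with a simple cost balance. This is a hedged sketch, assuming a fixed up-front cost and constant per-solve costs measured in the same units (for example GPU-hours); the paper's exact definition, including how the error-equivalent classical solver is chosen, may differ. The function name `breakeven_solves` is hypothetical.

```python
import math


def breakeven_solves(upfront_cost, neural_cost_per_solve, classical_cost_per_solve):
    """Smallest number of forward solves N at which
        upfront_cost + N * neural_cost_per_solve <= N * classical_cost_per_solve.

    classical_cost_per_solve should be the cost of a classical solver tuned to
    the same error as the neural one. Returns math.inf when the neural solver
    is not cheaper per solve, since the up-front cost is then never recovered.
    """
    saving_per_solve = classical_cost_per_solve - neural_cost_per_solve
    if saving_per_solve <= 0:
        return math.inf
    return math.ceil(upfront_cost / saving_per_solve)


# Hypothetical numbers: 1000 units of data generation and training,
# 1 unit per neural solve, 11 units per error-matched classical solve.
n_star = breakeven_solves(1000, 1, 11)
```

A lower value means the learned solver pays for itself sooner; comparing against a coarse classical solver raises the value because the per-solve saving shrinks.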

2605.15397 2026-05-18 cs.CV

ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest

ELDOR:亚马逊雨林非法金矿开采的数据集和基准

Kangning Cui, Surendra Bohara, Suraj Prasai, Zishan Shao, Wei Tang, Martin Pillaca, Edwin Flores, Zhen Yang, Gregory Larsen, Evan Dethier, David Lutz, Jean-Michel Morel, Miles Silman, Victor Pauca, Fan Yang

Affiliations: Wake Forest University; City University of Hong Kong (Dongguan); Duke University; City University of Hong Kong; Yale University; Alaska Spatial Science; Colby College; Colby-Sawyer College; Lingnan University; Centro de Innovación Científica Amazónica

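AI summary: ELDOR is a large-scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the Amazon rainforest. It contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for mining activities and ecological structures, and it shows that current models still struggle with rare small-scale structures and fine-grained classes.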

Comments 70 pages, 35 figures, 28 tables

详情
AI中文摘要

亚马逊雨林非法金矿开采导致森林砍伐、水污染和长期生态系统破坏,但难以在精细空间尺度上监测。卫星影像支持大范围观测,但常遗漏小型采矿相关结构和微妙的土地覆盖转变,尤其是频繁的云层覆盖。我们引入ELDOR,一个大规模无人机基准,用于监测非法金矿开采对雨林环境和景观的破坏。ELDOR包含覆盖超过2500公顷的手动标注正射影像,具有像素级语义标签,涵盖采矿相关活动和周围生态结构。借助这一统一的标注源,我们建立了四个基准任务:语义分割、分割衍生识别、直接多标签分类以及基于视觉-语言模型的类别存在识别。在这些任务中,我们比较了通用和遥感专用的分割模型、视觉基础模型相关的分割方法、直接多标签分类方法以及视觉-语言模型,在受控的闭集协议下。结果表明,当前方法在罕见的小规模采矿结构和细粒度恢复类别上仍存在困难,表明需要上下文感知和多模态建模。为了支持领域分析和实际应用,我们进一步构建了一个交互式探索器,为领域专家提供统一的数据探索和模型推理界面。

英文摘要

Illegal gold mining in the Amazon rainforest causes deforestation, water contamination, and long-term ecosystem disruption, yet remains difficult to monitor at fine spatial scales. Satellite imagery supports large-scale observation, but often misses small mining-related structures and subtle land-cover transitions, especially under frequent cloud cover. We introduce ELDOR, a large-scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the rainforest. ELDOR contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for both mining-related activities and surrounding ecological structures. With this unified annotation source, we establish four benchmark tasks: semantic segmentation, segmentation-derived recognition, direct multi-label classification, and class-presence recognition with vision-language models. Across these tasks, we compare generic and remote-sensing-specific segmentation models, vision foundation model-related segmentation methods, direct multi-label classification methods, and vision-language models under a controlled closed-set protocol. Results show that current methods still struggle with rare small-scale mining structures and fine-grained recovery classes, suggesting the need for context-aware and multimodal modeling. To support domain analysis and practical use, we further build an interactive explorer for domain experts that provides a unified interface for data exploration and model inference.

2605.15394 2026-05-18 cs.LG cs.AI stat.ML

Representation Without Reward: A JEPA Audit for LLM Fine-Tuning

无奖励的表示:用于LLM微调的JEPA审计

Biswa Sengupta

Affiliations: LLM Suite group of JP Morgan Chase and its affiliates

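AI summary: This paper audits whether JEPA-style auxiliary objectives improve LLM fine-tuning, testing twenty-two training-time auxiliaries on natural-language-to-regex generation. Several are significant in uncorrected single-cell tests, but none survives multiple-comparison correction.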

详情
AI中文摘要

联合嵌入预测架构(JEPAs)提出,当模型被训练以预测潜在表示而非观测输出时,应学习更有用的抽象。对于自回归语言模型微调,这一原则意味着诱导的隐藏状态几何必须达到语言模型头部并且提高解码任务指标。我们在此基础上,在固定Llama-3.2-1B-Instruct LoRA基础上,对自然语言到正则表达式生成任务进行了测试,比较了22种训练时的辅助项,包括轨迹形状正则化、分布约束、预测器/目标不对称性、Fisher度量Jacobi残差以及一个解码器可见的JEPA目标,该目标位于交叉熵的正锥内。经验结果是一个结构化的零假设:几种辅助项在单细胞配对α=0.10下显著(T3-Local在Δ=+2.53 pp,p=0.003最强),但无一通过Bonferroni或Holm-Bonferroni检验。解码器可见的JEPA产生了研究中的第一个正辅助-交叉熵梯度余弦值,但精确匹配仍处于种子噪声内;在五个种子的完整微调复制中,相同的辅助项在两个基准测试中均重现了零假设(TURK:Δ=+0.04 pp,p_配对=0.96;SYNTH:Δ=+0.52 pp,p_配对=0.28),因此零假设在LoRA和完整微调中对解码器可见的构造是稳健的。隐藏状态表示和解码任务准确性在这一领域因此弱相关;我们相应地将LLM领域JEPA评估重新定义为耦合问题,其中核心问题是哪些指标下有用的隐藏几何成为解码器可见的任务信号。

英文摘要

Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning the principle entails a stricter requirement: the induced hidden-state geometry must reach the language-model head and improve the decoded task metric. We test that requirement under a fixed Llama-3.2-1B-Instruct LoRA harness on natural-language-to-regex generation, comparing twenty-two training-time auxiliaries across trajectory-shape regularisation, distributional constraints, predictor/target asymmetry, Fisher-metric Jacobi residuals, and a decoder-visible JEPA objective constructed to lie in cross-entropy's positive cone. The empirical answer is a structured null: several auxiliaries clear single-cell paired $α = 0.10$ without correction (T3-Local at $Δ = +2.53$ pp, $p = 0.003$ being the strongest), but none survives Bonferroni or Holm–Bonferroni at the relevant family-wise threshold, even though many change curvature, anisotropy, variance, and gradient direction. Decoder-visible JEPA yields the first positive auxiliary–cross-entropy gradient cosine in the study, yet exact match remains inside seed noise; a full-fine-tuning replication of the same auxiliary at $n = 5$ seeds reproduces the null on both benchmarks (TURK: $Δ = +0.04$ pp, $p_{\text{paired}} = 0.96$; SYNTH: $Δ = +0.52$ pp, $p_{\text{paired}} = 0.28$), so the null is robust across LoRA and full fine-tuning for the decoder-visible construction. Hidden-state representation work and decoded-task accuracy are therefore weakly coupled in this regime; we accordingly reframe LLM-domain JEPA evaluation as a coupling problem, in which the operative question is under which metrics useful hidden geometry becomes decoder-visible task signal.
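
The abstract contrasts uncorrected per-comparison tests with Bonferroni and Holm–Bonferroni correction. A minimal sketch of both procedures follows; the p-values in the example are hypothetical and are not the paper's results, and the size of the comparison family used in the paper is not stated in the abstract.

```python
def bonferroni(p_values, alpha=0.10):
    """Reject hypothesis i only if p_i <= alpha / m, where m is the family size."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_bonferroni(p_values, alpha=0.10):
    """Holm's step-down procedure: compare the k-th smallest p-value
    (k = 0, 1, ...) against alpha / (m - k) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also retained
    return reject


# Hypothetical family of four comparisons.
p_vals = [0.001, 0.04, 0.03, 0.5]
bonf = bonferroni(p_vals)
holm = holm_bonferroni(p_vals)
```

Holm's procedure rejects at least as many hypotheses as Bonferroni while still controlling the family-wise error rate, so a result that fails Holm–Bonferroni also fails plain Bonferroni for the same family.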

2605.15393 2026-05-18 cs.LG

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

LPDS:通过逻辑保持难度扩展评估LLM鲁棒性

Philipp Mondorf, Samuel J. Bell, Jesse Dodge, Dieuwke Hupkes

Affiliations: FAIR at Meta; Munich Center for Machine Learning

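AI summary: This paper proposes LPDS, a framework that systematically searches logic-preserving problem variations to evaluate LLM robustness. Performance declines as difficulty increases, and fine-tuning on harder variations yields more consistent robustness gains.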

Comments 41 pages, 31 figures

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地用于在最小人为监督下执行任务,确保这些模型运行稳健至关重要。特别是,一个能够解决给定问题的模型不应因某些实体(如名称、数字或其他上下文细节)的变化而失败,而问题逻辑保持不变。先前研究表明,当前LLM仍难以处理这种鲁棒性:它们在某些问题变体上成功,但在其他变体上失败。然而,现有评估缺乏系统方法来识别最可能引发失败的逻辑保持变体。相反,它们通常测试允许变体的随机子集,这可能高估鲁棒性。为解决这一差距,我们引入了逻辑保持难度扩展(LPDS),一个框架,它(i)量化问题变体的难度,并(ii)系统搜索允许变体的空间以找到那些最大化难度并暴露失败的变体。我们显示,随着难度增加,性能下降,且模型推理链中的错误变得更加明显。我们进一步证明,LPDS高效地找到困难问题变体,导致性能下降幅度比随机采样大5倍。最后,我们证明在更多困难变体上微调比在更容易的变体上微调能获得更一致的鲁棒性提升。

英文摘要

As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversight, it is crucial that these models operate robustly. In particular, a model that can solve a given problem should not fail simply because certain entities – such as names, numbers, or other contextual details – have changed while the underlying problem logic remains the same. Prior work suggests that current LLMs still struggle with this form of robustness: they often succeed on some variations of a problem but fail on others. However, existing evaluations often lack a systematic way to identify which logic-preserving variations are most likely to induce failure. Instead, they typically test a random subset of allowable variations, which can overstate robustness. To address this gap, we introduce logic-preserving difficulty scaling (LPDS), a framework that (i) quantifies the difficulty of a problem variation and (ii) systematically searches the space of allowable variations to find those that maximize difficulty and expose failures. We show that as difficulty increases, performance declines and errors in the models' reasoning chains become more pronounced. We further demonstrate that LPDS efficiently finds difficult problem variations for a model, resulting in performance drops up to 5 times larger compared to random sampling. Finally, we show that fine-tuning on more difficult variations leads to more consistent robustness gains than training on easier ones.

2605.15391 2026-05-18 cs.CV cs.AI

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

PanoWorld:几何一致的全景视频世界建模

Le Jiang, Xiangyu Bai, Bishoy Galoaa, Shayda Moezzi, Caleb James Lee, Tooba Imtiaz, Edmund Yeh, Jennifer Dy, Yanzhi Wang, Sarah Ostadabbas

Affiliations: Northeastern University

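AI summary: PanoWorld generates geometry-consistent 360° video by modeling geometric and dynamic consistency, improving the spatial understanding needed for embodied AI applications.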

详情

英文摘要

We present PanoWorld, a panoramic video world model that generates geometry-consistent 360° video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.

2605.15388 2026-05-18 cs.LG

Unified High-Probability Analysis of Stochastic Variance-Reduced Estimation

随机方差缩减估计的统一高概率分析

Zhankun Luo, Antesh Upadhyay, M. Berk Sahin, Sang Bin Moon, Anuran Makur, Abolfazl Hashemi

Affiliations: School of Electrical and Computer Engineering; Department of Computer Science

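AI summary: This paper presents a unified framework for stochastic variance-reduced estimation based on a recursion with memory retention, reset probability, and an iterate-movement correction term. It derives high-probability bounds and improves the oracle complexity of stochastic optimization with expectation constraints.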

详情
AI中文摘要

随机估计器是大规模优化的基础,需从噪声 oracle 观察中推断总体量。尽管动量、SPIDER、STORM 和 PAGE 等方法成效显著,但其分析多为特定估计器和期望基,掩盖了可靠性决定的结构权衡。本文开发了一个基于递归的统一框架,包含记忆保留、重置概率和迭代移动修正项,恢复经典估计器,推动新的二阶变体,并推导出估计误差的偏差-方差分解。主要结果是一个统一的高概率界,使用新的无维向量值 Freedman 不等式,适用于包含随机向量鞅和的光滑规范空间。结果适用于欧几里得和非欧几里得设置,包括 Banach 空间中的镜像下降法分析。应用包括无约束优化的高概率 oracle 复杂度,建立对置信度的对数依赖。还推导出首次 $\tilde{\mathcal{O}}(\varepsilon^{-3})$ 的随机优化 oracle 复杂度界,改进了现有 $\tilde{\mathcal{O}}(\varepsilon^{-4})$ 复杂度,通过利用方差缩减估计首次应用于此场景。

英文摘要

Stochastic estimators are fundamental to large-scale optimization, where population quantities must be inferred from noisy oracle observations. Although influential methods such as momentum, SPIDER, STORM, and PAGE have been highly successful, their analyses are largely estimator-specific and expectation-based, obscuring the structural tradeoffs that determine reliability. In this paper, we develop a unified framework for stochastic variance-reduced estimation based on a recursion with three components: memory retention, reset probability, and a correction term for iterate movement. This framework recovers several classical estimators, motivates new second-order variants, and yields a bias-variance decomposition of estimation error. Our main result is a unified high-probability bound proved using a new dimension-free vector-valued Freedman inequality, valid for smooth normed spaces involving random sums of vector martingales. The result applies in both Euclidean and non-Euclidean settings, including the analysis of mirror-descent-based methods in Banach spaces. As applications, we obtain high-probability oracle complexities for unconstrained optimization with mirror descent, establishing the logarithmic dependence on the confidence level. We also derive the first $\tilde{\mathcal{O}}(\varepsilon^{-3})$ oracle-complexity bounds for stochastic optimization with expectation constraints, improving upon the existing $\tilde{\mathcal{O}}(\varepsilon^{-4})$ complexity by leveraging variance-reduced estimation for the first time in this setting.
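
For orientation, the estimators named in the abstract can be written side by side in their commonly cited forms. Here $g(x;\xi)$ is a stochastic gradient at $x$ with sample $\xi$, $d_t$ is the estimate, and $a \in [0,1]$ is a mixing weight. The last display is an illustrative way to see the three ingredients in one expression; it is an assumption made here, not the recursion analyzed in the paper, and the selector $\gamma$ is notation introduced only for this sketch.

```latex
% Momentum (exponential moving average):
d_t = (1-a)\,d_{t-1} + a\,g(x_t;\xi_t)

% STORM (momentum plus a correction for iterate movement):
d_t = g(x_t;\xi_t) + (1-a)\,\big(d_{t-1} - g(x_{t-1};\xi_t)\big)

% SPIDER step (between periodic large-batch resets):
d_t = d_{t-1} + g(x_t;\xi_t) - g(x_{t-1};\xi_t)

% PAGE (random reset with probability p, large-batch gradient \bar g):
d_t =
\begin{cases}
  \bar g(x_t) & \text{with probability } p,\\
  d_{t-1} + g(x_t;\xi_t) - g(x_{t-1};\xi_t) & \text{with probability } 1-p.
\end{cases}

% Illustrative combined form (assumption), with \gamma \in \{0,1\}:
d_t = (1-a)\,d_{t-1} + a\,g(x_t;\xi_t)
      + \gamma\,(1-a)\,\big(g(x_t;\xi_t) - g(x_{t-1};\xi_t)\big)
```

In the combined form, $\gamma = 0$ gives momentum, $\gamma = 1$ gives STORM, $\gamma = 1$ with $a = 0$ gives the SPIDER step, and adding a reset with probability $p$ gives PAGE. Here $1-a$ plays the role of memory retention, $p$ the reset probability, and the $\gamma$ term the correction for iterate movement.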

2605.15384 2026-05-18 cs.LG cs.AI

Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

一个评分够吗?重新思考序列演进LLM记忆的评估

Songwei Dong, Zihan Chen, Chengshuai Shi, Peng Wang, Jundong Li, Cong Shen

Affiliations: University of Virginia; Princeton University

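AI summary: This paper proposes SeqMem-Eval, a framework that evaluates how memory states evolve, generalize, consolidate experience, and retain information, revealing differences in memory quality that aggregate metrics miss.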

Comments 29 pages, 13 figures

详情
AI中文摘要

记忆在使大语言模型(LLM)能够处理序列任务中起着核心作用,通过积累和重用经验实现时间连续性。然而,现有LLM记忆评估大多依赖汇总指标如最终验证准确率或累积在线性能,这可能掩盖诸如遗忘和负迁移等关键失败模式。本文引入SeqMem-Eval,一种用于序列演进LLM记忆的诊断评估框架。受持续学习启发,它针对一种测试时间设置,其中记忆是外部的、提示介导的,并且在不修改模型参数的情况下更新。与只关注最终性能不同,SeqMem-Eval评估记忆状态在连续推理中的演变、泛化、经验巩固和信息保留。具体而言,它测量在线效用、验证泛化、反向迁移和遗忘,提供更细致的记忆质量视角。通过在多样任务和记忆方法上的广泛实验,我们显示更高的最终或累积准确性不必然意味着更好的记忆质量:许多方法表现出强劲的性能提升,同时遭受显著的遗忘或负迁移。此外,不同记忆设计在适应性和稳定性之间表现出不同的权衡,这些权衡在标准评估指标下是不可见的。

英文摘要

Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. In this paper, we introduce SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility, hold-out generalization, backward transfer, and forgetting, providing a finer-grained view of memory quality. Through extensive experiments across diverse tasks and memory methods, we show that higher final or cumulative accuracy does not necessarily imply better memory quality: many methods exhibit strong performance gains while suffering from substantial forgetting or negative transfer. Moreover, different memory designs exhibit distinct trade-offs between adaptability and stability that remain invisible under standard evaluation metrics.
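
The abstract lists backward transfer and forgetting among the measured quantities without giving formulas. The sketch below uses the definitions common in the continual-learning literature, applied to an accuracy matrix where `acc[t][i]` is the accuracy on task `i` after the memory has processed tasks `0..t`. SeqMem-Eval's own definitions may differ, and the matrix values are hypothetical.

```python
def backward_transfer(acc):
    """Mean change in accuracy on each earlier task between the moment it was
    first processed (acc[i][i]) and the end of the sequence (acc[T-1][i]).
    Negative values indicate that later updates hurt earlier tasks."""
    T = len(acc)
    return sum(acc[T - 1][i] - acc[i][i] for i in range(T - 1)) / (T - 1)


def forgetting(acc):
    """Mean gap between the best accuracy ever reached on each earlier task
    (before the final step) and its accuracy at the end of the sequence."""
    T = len(acc)
    drops = [
        max(acc[t][i] for t in range(i, T - 1)) - acc[T - 1][i]
        for i in range(T - 1)
    ]
    return sum(drops) / (T - 1)


# Hypothetical three-task run; entries above the diagonal are unused here.
acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.90, 0.00],
    [0.60, 0.85, 0.75],
]
bwt = backward_transfer(acc)
fgt = forgetting(acc)
```

In this example the final-row average looks reasonable, yet backward transfer is negative and forgetting is positive, which is the kind of failure a single aggregate score can hide.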

2605.15383 2026-05-18 cs.CV

MorphoHELM: A Comprehensive Benchmark for Evaluating Representations for Microscopy-Based Morphology Assays

Emre Hayir, Lorin Crawford, Alex X. Lu

Affiliations Microsoft Research New England

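AI summary MorphoHELM is a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting. It evaluates each task at different degrees of batch effects, reveals trade-offs between methods, and shows that classic computer vision methods remain strong across many settings.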

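Details
AI Chinese abstract (English translation)

Microscopy images contain rich information about how cells respond to perturbations, which makes them essential to applications such as drug screening. Researchers often quantify images with representation extraction methods, and deep learning methods have multiplied in recent years. Evaluation of these representations nevertheless remains fragmented: each model is assessed on different tasks and datasets with custom pipelines and metrics, which makes fair comparison difficult. This paper introduces MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting. MorphoHELM consolidates evaluation standards in the field, extends and corrects them to be more robust, and evaluates the widest range of methods to date. A defining feature of the benchmark is that each task is evaluated at different degrees of batch effects (or technical noise), directly quantifying how much a method's ability to detect biological signal degrades as noise increases. These properties let MorphoHELM detect trade-offs between methods, and we show that models that excel at certain kinds of biological signal are weaker at others. We show that no existing model outperforms classic computer vision analytic strategies across all settings; these strategies remain the strongest general-purpose representations. All datasets, code, and evaluation tools are publicly available at https://github.com/microsoft/MorphoHELM.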

English abstract

Microscopy images contain rich information about how cells respond to perturbations, making them essential to applications like drug screening. To quantify images, researchers often use representation extraction methods, and recent years have seen a proliferation of deep learning methods. Measuring the quality of these representations is essential, yet evaluation remains fragmented: each proposed model is evaluated on different tasks and datasets with custom pipelines and metrics, which makes it difficult to compare models fairly. Here, we introduce MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting, the most widely used morphological profiling assay. MorphoHELM consolidates evaluation standards in the field, extends and corrects them to be more robust, and evaluates the widest range of methods to date. A defining feature of the benchmark is that each task is evaluated at different degrees of batch effects (or technical noise), directly quantifying how the ability of methods to detect biological signal degrades as noise increases. Together, these properties enable MorphoHELM to detect trade-offs between methods, and we demonstrate that models that excel at certain kinds of biological signal are weaker at others. We show that no existing model outperforms classic computer vision analytic strategies across all settings; these strategies remain the strongest general use-case representations. All datasets, code, and evaluation tools are publicly available at https://github.com/microsoft/MorphoHELM.
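The idea of scoring the same task at increasing degrees of batch effect can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: the two-dimensional profiles, the perturbation names, the constant per-batch offset used as "technical noise", and the nearest-neighbour matching task are hypothetical and are not taken from MorphoHELM's code or metrics.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]
Profiles = Dict[str, Point]

# Hypothetical reference-batch profiles for two perturbations.
REFERENCE: Profiles = {"pert_a": (0.0, 0.0), "pert_b": (1.0, 0.0)}


def shift(profiles: Profiles, offset: Point) -> Profiles:
    """Simulate a second batch by adding a constant offset to every profile."""
    return {name: (x + offset[0], y + offset[1]) for name, (x, y) in profiles.items()}


def cross_batch_accuracy(reference: Profiles, query: Profiles) -> float:
    """Fraction of query profiles whose nearest reference profile
    (squared Euclidean distance) carries the same perturbation label."""
    hits = 0
    for name, q in query.items():
        nearest = min(
            reference,
            key=lambda r: (reference[r][0] - q[0]) ** 2 + (reference[r][1] - q[1]) ** 2,
        )
        hits += nearest == name
    return hits / len(query)


def accuracy_by_noise(levels: Iterable[float]) -> Dict[float, float]:
    """Score the same matching task at several batch-offset magnitudes."""
    return {
        s: cross_batch_accuracy(REFERENCE, shift(REFERENCE, (s, 0.0)))
        for s in levels
    }


if __name__ == "__main__":
    for level, score in accuracy_by_noise([0.0, 0.2, 0.8]).items():
        print(f"batch offset {level:.1f}: accuracy {score:.2f}")
```

Reporting a curve over noise levels, rather than a single score, is what makes the degradation visible: in this toy setup the task is solved perfectly at small offsets and only half the profiles match once the offset exceeds half the distance between the two perturbations.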

2605.15380 2026-05-18 cs.CL cs.CY cs.HC

Eskwai for Students: Generative AI Assistant for Legal Education in Ghana

George Boateng, Philemon Badu, Patrick Agyeman-Budu, Samuel Ansah, Evans Atompoya, Evan Igwilo, Lord Baah, Frederick Abu-Bonsrah, Victor Wumbor-Apin Kumbol

Affiliations ETH Zurich, Switzerland; Charité - Universitätsmedizin Berlin, Germany; Kwame AI Inc., U.S.

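AI summary This paper presents Eskwai for Students, a retrieval-augmented generation (RAG) generative AI assistant that answers legal questions for law students in Ghana, grounded in a database of over 12,000 case laws and 1,400 pieces of legislation. The authors evaluate its helpfulness on student queries and discuss the ethical implications.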

Comments 10 pages. Accepted at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

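Details
AI Chinese abstract (English translation)

Recent advances in generative AI have shown its potential for legal education. However, research on developing and deploying such systems for countries in the Global South is limited. This paper develops Eskwai for Students, a generative AI assistant that helps law students with their legal education. Eskwai for Students is a retrieval-augmented generation (RAG) system that answers a wide range of legal questions, grounded in a curated database of over 12,000 case laws and 1,400 pieces of legislation. We deployed Eskwai for Students in a longitudinal study lasting 30 months (2.5 years), during which 3,100 law students in Ghana submitted 32,000 queries. We evaluated the helpfulness of the AI and provide insight into the kinds of queries law students submit, which raise some ethical concerns. This work contributes to an understanding of how law students in the Global South use generative AI for their studies and how it can be leveraged responsibly to advance legal education.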

English abstract

Recent advances in generative AI have shown its potential to be leveraged for legal education. Yet, work on the development and deployment of such systems for legal education in the Global South is limited. In this work, we developed Eskwai for Students, a generative AI assistant to help law students with their legal education. Eskwai for Students is a retrieval-augmented generation (RAG) system that answers a wide range of legal questions for law students, grounded in a curated database of over 12K case laws and 1.4K pieces of legislation in Ghana. We deployed Eskwai for Students in a 30-month (2.5-year) longitudinal study, during which 3.1K law students in Ghana made 32K queries. We evaluated the helpfulness of our AI and provided insight into the kinds of queries law students submit to this generative AI tool, which raise some ethical concerns. This work contributes to an understanding of how law students in the Global South are using generative AI for their studies and the ways it could be leveraged responsibly to advance legal education.
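For readers unfamiliar with retrieval-augmented generation, the retrieval and prompt-assembly steps can be sketched as follows. This is a minimal illustration under stated assumptions, not Eskwai's pipeline: the bag-of-words cosine similarity stands in for whatever retriever the system actually uses, the prompt wording is invented, and the document ids and texts in `TOY_CORPUS` are hypothetical placeholders rather than real legal material. The final generation step (sending the prompt to a language model) is omitted.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Hypothetical placeholder corpus (not real case law or legislation).
TOY_CORPUS: Dict[str, str] = {
    "case-001": "placeholder summary about contract offer and acceptance",
    "act-002": "placeholder provision about land registration procedure",
    "case-003": "placeholder summary about negligence and duty of care",
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, corpus: Dict[str, str], k: int = 2) -> List[str]:
    """Return ids of the k most similar documents with non-zero similarity."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id) for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(query: str, corpus: Dict[str, str], k: int = 2) -> str:
    """Assemble a prompt that grounds the answer in the retrieved sources."""
    ids = retrieve(query, corpus, k)
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in ids)
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"{context}\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("What is required for acceptance of a contract offer?", TOY_CORPUS))
```

Grounding the prompt in retrieved sources, and asking the model to cite them by id, is what lets a student check an answer against the underlying case or statute instead of trusting the model's recall.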