arXivDaily: arXiv daily academic digest, updated Monday to Friday
2604.26634 2026-06-05 cs.LG econ.GN q-fin.EC stat.AP

Electricity price forecasting across Norway's five bidding zones in the post-crisis era

My Thi Diem Phan, Trung Tuyen Truong, Hoai Phuong Ha, Dat Thanh Nguyen

Affiliations: Independent researcher; Department of Mathematics, University of Oslo; Department of Computer Science, The Arctic University of Norway; Faculty of Medicine, University of Oslo

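AI summary: This paper studies electricity price forecasting across Norway's five bidding zones after the energy crisis. It builds a multimodal dataset, evaluates eight forecasting models, and finds that LightGBM performs best in every zone, while highlighting the importance of external features under different market conditions.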

Comments
This version removes variables unavailable at prediction time to eliminate look-ahead leakage, clarifies the forecasting task definition, and updates the results and discussion accordingly. All tables and figures have been recomputed.
Abstract

Norway's electricity market is heavily dominated by hydropower, but the 2021-2022 energy crisis and stronger integration with Continental Europe have fundamentally altered price formation, reducing the reliability of forecasting models calibrated on historical data. Despite the critical need for updated models, a unified benchmark evaluating feature contributions across all structurally diverse Norwegian bidding zones remains lacking. Here we present a comprehensive evaluation of one-step-ahead forecasting of the Nord Pool market across all five Norwegian bidding zones. We constructed a multimodal hourly dataset spanning 2019-2025 and evaluated eight forecasting model families, including Light Gradient Boosting Machine (LightGBM), autoregressive models with exogenous variables, and advanced deep learning architectures, using a strictly causal test set. We implemented robust rolling-origin backtesting, leave-one-group-out feature ablation, and conditional regime analysis to dissect model performance and feature utility. Our results show that LightGBM achieves the best performance in every zone, with mean absolute error ranging from 1.60 to 5.58 euros per megawatt-hour, while a ridge-regularized autoregressive model with exogenous variables remains a highly competitive linear benchmark in northern zones. Feature ablation reveals that models relying solely on lagged prices and calendar variables achieve high accuracy and often match or closely approach the performance of the full multimodal model. However, conditional regime analysis demonstrates that external features like reservoir levels and gas prices remain crucial to stratify forecast errors, which consistently increase under stressed market regimes. This highlights the practical value of model interpretability and regime awareness for decision makers facing structural changes in market dynamics.
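
The rolling-origin backtesting mentioned in the abstract can be illustrated with a minimal sketch. It assumes an expanding training window and a one-step-ahead horizon, and it uses a persistence forecast as a stand-in model; the function names and the price values are hypothetical and are not taken from the paper.

```python
from statistics import mean

def rolling_origin_splits(n_obs, initial_train, horizon=1, step=1):
    """Yield (train_indices, test_indices) with an expanding training window."""
    origin = initial_train
    while origin + horizon <= n_obs:
        yield range(0, origin), range(origin, origin + horizon)
        origin += step

def persistence_mae(prices, initial_train):
    """MAE of a persistence forecast (predict the last observed price)."""
    errors = []
    for train_idx, test_idx in rolling_origin_splits(len(prices), initial_train):
        forecast = prices[train_idx[-1]]  # uses only past data, so no look-ahead
        actual = prices[test_idx[0]]
        errors.append(abs(actual - forecast))
    return mean(errors)

# Made-up hourly prices in EUR/MWh, for illustration only.
prices = [40.0, 42.0, 41.0, 45.0, 44.0, 50.0]
mae = persistence_mae(prices, initial_train=3)
print(f"persistence MAE: {mae:.2f} EUR/MWh")
```

Each forecast origin sees only observations that precede it, which is the property a strictly causal evaluation depends on. A real benchmark would replace the persistence forecast with a model refitted (or updated) at each origin.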

2604.26269 2026-06-05 cs.CL cs.AI cs.LG

Calibrated Surprise: An Information-Theoretic Account of Creative Quality

Bo Zou, Chao Xu

Affiliations: arXiv.org

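AI summary: This paper proposes an information-theoretic framework for assessing the quality of creative writing. Using the notion of calibrated surprise together with Shannon mutual information, it quantifies the difference between high-quality and degraded texts.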

Comments
28 pages, 3 figures
Abstract

In the era of large language models, creative writing quality lacks a computable theoretical anchor. The dominant approaches are rubric scoring -- decomposing holistic aesthetic judgment into sub-scores -- and RLHF preference signals -- replacing quality with group votes. Both bypass the statistical structure of the text itself. This paper provides an information-theoretic foundation to fill this gap. We propose 'calibrated surprise' as the information-theoretic essence of excellent creative writing. This judgment matches reading intuition and covers its opposite. This literary judgment admits a precise mathematical formulation. Under full-dimensional constraints Y, feasible writing choices are forced into an extremely narrow space. The rare survivors are, from the unconstrained perspective, exactly the least predictable choices. Both are measured precisely by Shannon mutual information I(X;Y) = H(X) - H(X|Y) -- 'calibrated' corresponds to H(X|Y) approaching 0; 'surprising' corresponds to H(X) going high. The subtraction structure of the formula naturally separates 'well-grounded surprise' from 'pure noise'. We use token-level logprobs from Qwen1.5-7B as an operational proxy for the ideal reader's probability distribution. Across 20 pairs (12 Chinese / 8 English) of high-quality vs. systematically degraded literary passages, 20/20 pairs support the core prediction: high-quality passages have systematically higher I(X;Y) than their degraded versions.
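
The identity I(X;Y) = H(X) - H(X|Y) quoted in the abstract can be checked on a toy example. The sketch below computes the three quantities from a small discrete joint distribution; it is only an illustration of the formula and is not the paper's estimator, which works from token-level logprobs. The helper names and the two example distributions are hypothetical.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """joint[x][y] = P(X=x, Y=y); returns (H(X), H(X|Y), I(X;Y)) in bits."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_x = entropy(p_x)
    # H(X|Y) = sum over y of P(y) * H(X | Y=y)
    h_x_given_y = 0.0
    for y, py in enumerate(p_y):
        if py > 0:
            h_x_given_y += py * entropy([row[y] / py for row in joint])
    return h_x, h_x_given_y, h_x - h_x_given_y

# The constraint Y fully determines the choice X: H(X|Y) = 0, so I(X;Y) = H(X).
determined = [[0.5, 0.0], [0.0, 0.5]]
# X is independent of Y ("pure noise" relative to the constraint): I(X;Y) = 0.
independent = [[0.25, 0.25], [0.25, 0.25]]

print(mutual_information(determined))
print(mutual_information(independent))
```

Both toy cases have the same H(X) of one bit; only the conditional term separates them, which is the role the abstract assigns to the subtraction in the formula.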

2604.21017 2026-06-05 cs.RO cs.AI

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo, Ayberk Acar, Farshid Alambeigi, Carlo Alberto Ammirati, Yunke Ao, Pablo David Aranda Rodriguez, Soofiyan Atar, Mattia Ballo, Noah Barnes, Federica Barontini, Filip Binkiewicz, Peter Black, Sebastian Bodenstedt, Leonardo Borgioli, Nikola Budjak, Benjamin Calmé, Fabio Carrillo, Nicola Cavalcanti, Changwei Chen, Haoxin Chen, Sihang Chen, Qihan Chen, Zhongyu Chen, Ziyang Chen, Shing Shin Cheng, Meiqing Cheng, Min Cheng, Zih-Yun Sarah Chiu, Xiangyu Chu, Camilo Correa-Gallego, Giulio Dagnino, Anton Deguet, Jacob Delgado, Jonathan C. DeLong, Kaizhong Deng, Alexander Dimitrakakis, Qingpeng Ding, Hao Ding, Giovanni Distefano, Daniel Donoho, Anqing Duan, Marco Esposito, Shane Farritor, Jad Fayad, Zahi Fayad, Mario Ferradosa, Filippo Filicori, Chelsea Finn, Philipp Fürnstahl, Jiawei Ge, Stamatia Giannarou, Xavier Giralt Ludevid, Frederic Giraud, Aditya Amit Godbole, Ken Goldberg, Antony Goldenberg, Diego Granero Marana, Xiaoqing Guo, Tamás Haidegger, Evan Hailey, Pascal Hansen, Ziyi Hao, Kush Hari, Kengo Hayashi, Jonathon Hawkins, Shelby Haworth, Ortrun Hellig, S. Duke Herrell, Zhouyang Hong, Andrew Howe, Junlei Hu, Zhaoyang Jacopo Hu, Ria Jain, Mohammad Rafiee Javazm, Howard Ji, Rui Ji, Jianmin Ji, Zhongliang Jiang, Dominic Jones, Jeffrey Jopling, Britton Jordan, Ran Ju, Michael Kam, Luoyao Kang, Fausto Kang, Siddhartha Kapuria, Peter Kazanzides, Sonika Kiehler, Ethan Kilmer, Ji Woong Kim, Przemysław Korzeniowski, Chandra Kuchi, Nithesh Kumar, Alan Kuntz, Federico Lavagno, Yu Chung Lee, Hao-Chih Lee, Hang Li, Zhen Li, Xiao Liang, Xinxin Lin, Jinsong Lin, Chang Liu, Fei Liu, Pei Liu, Yun-hui Liu, Wanli Liuchen, Eszter Lukács, Sareena Mann, Miles Mannas, Brett Marinelli, Sabina Martyniak, Francesco Marzola, Lorenzo Mazza, Xueyan Mei, Maria Clara Morais, Luigi Muratore, Chetan Reddy Narayanaswamy, Michał Naskręt, David Navarro-Alarcon, Cyrus Neary, Chi Kit Ng, Christopher Nguan, David Noonan, Ki Hwan Oh, Tom Christian Olesch, Allison M. Okamura, Justin Opfermann, Matteo Pescio, Doan Xuan Viet Pham, Tito Porras, Hongliang Ren, Ariel Rodriguez Jimenez, Ferdinando Rodriguez y Baena, Septimiu E. Salcudean, Asmitha Sathya, Preethi Satish, Lalithkumar Seenivasan, Jiaqi Shao, Yiqing Shen, Yu Sheng, Lucy XiaoYang Shi, Zoe Soulé, Stefanie Speidel, Mingwu Su, Jianhao Su, Idris Sunmola, Kristóf Takács, Yunxi Tang, Patrick Thornycroft, Yu Tian, Jordan Thompson, Mehmet K. Turkcan, Mathias Unberath, Pietro Valdastri, Carlos Vives, Quan Vuong, Martin Wagner, Farong Wang, Wei Wang, Lidian Wang, Chung-Pang Wang, Guankun Wang, Junyi Wang, Erqi Wang, Ziyi Wang, Tanner Watts, Wolfgang Wein, Yimeng Wu, Zijian Wu, Hongjun Wu, Luohong Wu, Jie Ying Wu, Junlin Wu, Victoria Wu, Kaixuan Wu, Mateusz Wójcikowski, Yunye Xiao, Nan Xiao, Wenxuan Xie, Hao Yang, Tianqi Yang, Yinuo Yang, Menglong Ye, Ryan S. Yeung, Nural Yilmaz, Chim Ho Yin, Michael Yip, Rayan Younis, Chenhao Yu, Sayem Nazmuz Zaman, Milos Zefran, Han Zhang, Yuelin Zhang, Yidong Zhang, Yanyong Zhang, Xuyang Zhang, Yameng Zhang, Joyce Zhang, Ning Zhong, Peng Zhou, Haoying Zhou, Xiuli Zuo, Nassir Navab, Mahdi Azizian, Sean D. Huver, Axel Krieger

Affiliations: Open-H-Embodiment Consortium; University of California, Berkeley; University of California, Los Angeles; University of Southern California; University of Cambridge; University of Tokyo; University of Tokyo, Graduate School of Information Science and Technology; University of Tokyo, Institute of Industrial Science

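AI summary: This paper introduces the Open-H-Embodiment dataset and demonstrates its use in medical robotics through two foundation models, showing the key role of large-scale open data in advancing robot learning and world modeling.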

Comments
Project website: https://open-h.github.io/open-h-embodiment/
Abstract

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 50 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

2604.17121 2026-06-05 cs.LG cs.AI

The Topological Trouble With Transformers

Michael C. Mozer, Shoaib Ahmed Siddiqui, Rosanne Liu

Affiliations: Google DeepMind

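AI summary: This paper examines the topological problem transformers face when processing sequential structure. It argues that their purely feedforward architecture limits dynamic state tracking, advocates a shift toward implicit activation dynamics via recurrent architectures, and presents a taxonomy of continuous-thought transformer architectures along with future research directions.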

Abstract

Transformers encode structure in sequences via an expanding contextual history. However, their purely feedforward architecture fundamentally limits dynamic state tracking. State tracking -- the iterative updating of latent variables reflecting an evolving environment -- involves inherently sequential dependencies that feedforward networks struggle to maintain. Consequently, feedforward models push evolving state representations deeper into their layer stack with each new input step, rendering information inaccessible in shallow layers and ultimately exhausting the model's depth. While this depth limit can be bypassed by dynamic depth models and by explicit or latent thinking that externalizes state representations, these solutions are computationally and memory inefficient. In this article, we argue that temporally extended cognition requires refocusing from explicit thought traces to implicit activation dynamics via recurrent architectures. We introduce a taxonomy of recurrent and continuous-thought transformer architectures, categorizing them by their recurrence axis (depth versus step) and their ratio of input tokens to recurrence steps. Finally, we outline promising research directions, including enhanced state-space models and coarse-grained recurrence, to better integrate state tracking into modern foundation models.

2603.25158 2026-06-05 cs.AI

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Affiliations: ETH Zürich; University of Zurich; Peking University; Zhejiang University; Qwen Large Model Application Team, Alibaba

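AI summary: This paper proposes the Trace2Skill framework, which consolidates broad execution trajectories into a unified skill directory through inductive reasoning, improving the transferability and usefulness of agent skills across multiple domains.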

Comments
Work in progress. The May version adds more experiments.
Abstract

Large Language Model (LLM) agents increasingly rely on domain-specific skills, yet manually authoring such skills does not scale, and skills generated purely from parametric knowledge often miss critical operational pitfalls. We introduce Trace2Skill, a framework that consolidates broad execution trajectories in parallel into a unified skill directory through inductive reasoning over agent experience. Trace2Skill supports both deepening existing human-written skills and creating useful skills from weak LLM-generated drafts. Experiments demonstrate the effectiveness of Trace2Skill across diverse domains, including office workflows, math reasoning, and vision QA. Importantly, the evolved skills are not merely memorized artifacts of the trajectories used to create them: they often transfer across model scales, across model families, and to out-of-distribution settings. For example, skills evolved from Qwen3.5-35B trajectories improve a Qwen3.5-122B agent by up to 57.65 percentage points on WikiTableQuestions. Further analyses show that Trace2Skill outperforms sequential skill editing and ReasoningBank-style retrieval memories, compresses recurring failures and workarounds into standard operating procedures (SoPs), and yields portable skills that can be reused without parameter updates or test-time retrieval.

2604.23600 2026-06-05 cs.CL

Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation

Tanay Kumar, Shreya Gautam, Aman Chadha, Vinija Jain, Francesco Pierri

Affiliations: Politecnico di Milano; Apple; Meta

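AI summary: This study examines how gender bias in persona-conditioned LLM narratives in English and Hindi is affected by personality traits. It finds that personality traits are significantly associated with the magnitude and direction of gender bias, with Dark Triad traits more strongly tied to gender-stereotypical representations, although these associations vary across models and languages.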

Abstract

Large Language Models (LLMs) are increasingly deployed in persona-driven applications such as education, customer service, and social platforms, where models are prompted to adopt specific personas when interacting with users. While persona conditioning can improve user experience and engagement, it also raises concerns about how personality cues may interact with gender biases and stereotypes. In this work, we present a controlled study of persona-conditioned story generation in English and Hindi, where each story portrays a working professional in India producing context-specific artifacts (e.g., lesson plans, reports, letters) under systematically varied persona gender, occupational role, and personality traits from the HEXACO and Dark Triad frameworks. Across 23,400 generated stories from six state-of-the-art LLMs, we find that personality traits are significantly associated with both the magnitude and direction of gender bias. In particular, Dark Triad personality traits are consistently associated with higher gender-stereotypical representations compared to socially desirable HEXACO traits, though these associations vary across models and languages. Our findings demonstrate that gender bias in LLMs is not static but context-dependent. This suggests that persona-conditioned systems used in real-world applications may introduce uneven representational harms, reinforcing gender stereotypes in generated educational, professional, or social content.

2604.23466 2026-06-05 cs.LG cs.AI cs.AR

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Divakar Kumar Yadav, Tian Zhao, Deepak Kumar

Affiliations: University of California, Berkeley; Stanford University

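AI summary: This paper evaluates CUDA Tile on AI workloads on Hopper and Blackwell GPUs, comparing the efficiency and portability of CuTile with cuBLAS, Triton, and other approaches. CuTile performs very well on specific workloads but still falls short in cross-architecture optimization.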

Abstract

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.
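
Throughput figures such as "52-79% of cuBLAS performance" are usually derived from the standard GEMM operation count of 2·M·N·K floating-point operations. The sketch below shows that arithmetic only; the matrix sizes and timings are made up for illustration, the helper names are hypothetical, and nothing here reproduces the paper's measurement setup.

```python
def gemm_tflops(m, n, k, seconds):
    """Throughput of an (m x k) @ (k x n) GEMM in TFLOP/s.

    Counts one multiply and one add per inner-product term: 2 * m * n * k.
    """
    return 2 * m * n * k / seconds / 1e12

def percent_of_baseline(candidate_tflops, baseline_tflops):
    """Candidate throughput as a percentage of a baseline library."""
    return 100.0 * candidate_tflops / baseline_tflops

# Hypothetical timings, not measurements from the paper.
kernel = gemm_tflops(8192, 8192, 8192, seconds=0.002)
vendor = gemm_tflops(8192, 8192, 8192, seconds=0.0015)
ratio = percent_of_baseline(kernel, vendor)
print(f"{kernel:.1f} TFLOP/s, {ratio:.0f}% of baseline")
```

Because both kernels perform the same number of operations, the percentage reduces to the ratio of the two run times.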

2604.22583 2026-06-05 cs.LG

Adaptive Head Budgeting for Efficient Multi-Head Attention

Bilal Faye, Abdoulaye Mbaye, Hanane Azzag, Mustapha Lebbah

Affiliations: LIPN, Université Paris 13; Université Paris 13; Université de Versailles Saint-Quentin-en-Yvelines

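AI summary: Proposes BudgetFormer, which dynamically allocates an attention-head budget and a relevance distribution to reduce compute and memory overhead on text classification tasks while matching or improving performance.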

Abstract

Multi-head attention enables Transformers to capture diverse representations, but all attention heads are typically activated for every input, regardless of task complexity. For coarse-grained tasks such as text classification, where relevant information is often global, this fixed allocation can introduce unnecessary computation. We propose BudgetFormer, a Transformer architecture that dynamically allocates attention heads on a per-input basis. The model learns both a head budget and a relevance distribution to select the most informative heads. To support effective head selection, we introduce a training strategy that balances exploration and exploitation. Experiments on text classification tasks show that BudgetFormer reduces FLOPs and memory usage while matching or surpassing the performance of standard multi-head attention. These results highlight adaptive head allocation as an effective approach to improving Transformer efficiency and performance.
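
The idea of selecting heads per input from a budget and a relevance distribution can be pictured with a small sketch. It assumes the selection is a top-k pick over softmax-normalized relevance scores, which is an assumption made here for illustration; the abstract does not specify the mechanism, and `select_heads` and its inputs are hypothetical names rather than BudgetFormer's actual interface.

```python
from math import exp

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_heads(relevance_logits, budget):
    """Return the sorted indices of the `budget` most relevant heads."""
    probs = softmax(relevance_logits)
    ranked = sorted(range(len(probs)), key=lambda h: probs[h], reverse=True)
    return sorted(ranked[:budget])

# Hypothetical relevance logits for four heads on one input, with a budget of two.
active = select_heads([0.1, 2.0, -1.0, 1.5], budget=2)
print(active)
```

Only the selected heads would then be evaluated for that input, which is where a saving in FLOPs and memory would come from.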

2604.20572 2026-06-05 cs.CL

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

Yuxuan Cai, Wei Li, Jie Zhou, Qin Chen, Xin Li, Bo Zhang, Liang He

Affiliations: School of Computer Science and Technology, East China Normal University, Shanghai; Shanghai AI Laboratory

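AI summary: This paper proposes ProactAgent, an experience-driven lifelong learning framework that improves long-horizon tasks through proactive retrieval over a structured experience base. The framework jointly updates policies and refines memory through ExpOnEvo, and introduces ProactRL, which treats retrieval as an explicit policy action so that the agent retrieves proactively during interaction to improve task performance and efficiency.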

Abstract

Online lifelong learning agents must decide not only how to act but also when to consult prior experience to continually improve on long-horizon tasks. Existing methods typically retrieve memories passively, such as at task initialization or after each step, and therefore miss knowledge gaps that arise during interaction. We propose ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured Experience Base. ProactAgent continually improves through ExpOnEvo, which jointly updates policies and refines memory, organizing past interactions into factual, episodic, and skill repositories. It further introduces ProactRL, which treats retrieval as an explicit policy action and learns when and what to retrieve. By comparing paired continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level process rewards that encourage retrieval only when it improves task outcomes or efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently outperforms all baselines, achieving up to 32% relative improvement in success rate and over 33% reduction in interaction rounds. Our code will be publicly available at GitHub.

2604.19741 2026-06-05 cs.CV

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler

Affiliations: Google; Cornell University; Stanford University

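AI summary: CityRAG uses large corpora of geo-registered data to generate spatially consistent, navigable videos of real environments. Its core approach combines learned priors with temporally unaligned training data to handle complex motion and appearance changes.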

Comments
Project page: cityrag.github.io
Abstract

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

2510.22048 2026-06-05 cs.LG

PF$Δ$: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations

Ana K. Rivera, Anvita Bhagavathula, Alvaro Carbonero, Priya Donti

Affiliations: Department of Electrical Engineering & Computer Science; Laboratory for Information & Decision Systems; Massachusetts Institute of Technology

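AI summary: This paper introduces the PF$Δ$ benchmark dataset for evaluating power flow computation under load, generation, and topology variations. It contains 859,800 solved instances spanning six bus system sizes and three types of contingency scenarios, and is used to evaluate traditional solvers and GNN-based methods, identifying shortcomings of existing approaches and open problems for future research.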

Journal ref
NeurIPS 2025
Comments
31 pages, 14 figures. Accepted at NeurIPS 2025
Abstract

Power flow (PF) calculations are the backbone of real-time grid operations, across workflows such as contingency analysis (where repeated PF evaluations assess grid security under outages) and topology optimization (which involves PF-based searches over combinatorially large action spaces). Running these calculations at operational timescales or across large evaluation spaces remains a major computational bottleneck. Additionally, growing uncertainty in power system operations from the integration of renewables and climate-induced extreme weather also calls for tools that can accurately and efficiently simulate a wide range of scenarios and operating conditions. Machine learning methods offer a potential speedup over traditional solvers, but their performance has not been systematically assessed on benchmarks that capture real-world variability. This paper introduces PF$Δ$, a benchmark dataset for power flow that captures diverse variations in load, generation, and topology. PF$Δ$ contains 859,800 solved power flow instances spanning six different bus system sizes, capturing three types of contingency scenarios (N, N-1, and N-2), and including close-to-infeasible cases near steady-state voltage stability limits. We evaluate traditional solvers and GNN-based methods, highlighting key areas where existing approaches struggle, and identifying open problems for future research. Our dataset is available at https://huggingface.co/datasets/pfdelta/pfdelta/tree/main and our code with data generation scripts and model implementations is at https://github.com/MOSSLab-MIT/pfdelta.
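
The growth of the evaluation space under N, N-1, and N-2 contingencies can be seen by enumerating outage sets. The sketch below assumes a contingency is the removal of up to two branches from a fixed list; the branch labels and the function name are hypothetical, and the dataset's own generation scripts may define contingencies differently.

```python
from itertools import combinations

def contingency_cases(branches, max_outages=2):
    """Yield tuples of outaged branches for N, N-1, ..., N-max_outages."""
    for k in range(max_outages + 1):
        yield from combinations(branches, k)

# Hypothetical four-branch network.
branches = ["L1", "L2", "L3", "L4"]
cases = list(contingency_cases(branches))

# 1 base case (N) + 4 single outages (N-1) + 6 double outages (N-2)
print(len(cases))
```

Each enumerated case would require its own power flow solve, so the number of solves grows quadratically with the branch count once N-2 is included.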

2604.17260 2026-06-05 cs.CL

Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation

Yihang Li, Chenhui Chu

Affiliations: Kyoto University

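AI summary: This paper proposes a new approach to evaluating meeting effectiveness, defining effectiveness as the rate of objective achievement over time. It introduces the AMI-ME dataset and an automatic evaluation framework that scores the effectiveness of individual topical segments within a meeting, establishing a comprehensive benchmark and assessing the framework's generalizability.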

Comments
ACL 2026 Main Conference
Abstract

Evaluating meeting effectiveness is crucial for improving organizational productivity. Current approaches rely on post-hoc surveys that yield a single coarse-grained score for an entire meeting. The reliance on manual assessment is inherently limited in scalability, cost, and reproducibility. Moreover, a single score fails to capture the dynamic nature of collaborative discussions. We propose a new paradigm for evaluating meeting effectiveness centered on novel criteria and a temporally fine-grained approach. We define effectiveness as the rate of objective achievement over time and assess it for individual topical segments within a meeting. To support this task, we introduce the AMI Meeting Effectiveness (AMI-ME) dataset, a new meta-evaluation dataset containing 2,459 human-annotated segments from 130 AMI Corpus meetings. We also develop an automatic effectiveness evaluation framework that uses a Large Language Model (LLM) as a judge to score each segment's effectiveness relative to the overall meeting objectives. Through substantial experiments, we establish a comprehensive benchmark for this new task and evaluate the framework's generalizability across distinct meeting types, ranging from business scenarios to unstructured discussions. Furthermore, we benchmark end-to-end performance starting from raw speech to measure the capabilities of a complete system. Our results validate the framework's effectiveness and provide strong baselines to facilitate future research in meeting analysis and multi-party dialogue. Our dataset and code will be publicly available. The AMI-ME dataset and the Automatic Evaluation Framework are available at: this URL.

2604.12474 2026-06-05 cs.RO cs.AI

From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution

从运动学到动力学:学习精炼混合计划以实现物理可行的执行

Lidor Erez, Shahaf S. Shperberg, Ayal Taitler

Affiliation: Technion - Israel Institute of Technology

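AI summary: This work uses reinforcement learning in continuous space to make hybrid plans physically feasible to execute. It defines a Markov Decision Process with analytical second-order constraints and uses it to refine the first-order trajectories produced by a hybrid planner, reliably recovering physical feasibility.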

AI Chinese abstract

在许多机器人任务中,智能体必须穿越一系列空间区域以完成任务。此类问题本质上是混合离散-连续的:一个高层动作序列和一个在物理上可行的连续轨迹。生成的轨迹和动作序列还必须满足诸如截止时间、时间窗口和速度或加速度限制等约束条件。尽管混合时间规划器试图解决这一挑战,但它们通常使用线性(一阶)动力学建模运动,这无法保证生成的计划满足机器人的真实物理约束。因此,即使高层动作序列固定,生成动态可行的轨迹也变成了一个双层优化问题。我们通过连续空间中的强化学习来解决这个问题。我们定义了一个明确包含分析二阶约束的马尔可夫决策过程,并用它来改进由混合规划器生成的一阶计划。我们的结果表明,这种方法可以可靠地恢复物理可行性,并有效弥合规划器初始一阶轨迹与实际执行所需动力学之间的差距。

English abstract

In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: they require both a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot's true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner's initial first-order trajectory and the dynamics required for real execution.
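To illustrate why a first-order plan can violate a deadline once acceleration limits are respected, the sketch below compares travel time under a velocity-only model with the minimum rest-to-rest time under bounded velocity and acceleration (a trapezoidal or triangular velocity profile). This is a textbook one-dimensional example, not the paper's MDP or planner; the function names and the rest-to-rest assumption are introduced here for illustration only.

```python
import math


def first_order_time(distance: float, v_max: float) -> float:
    """Travel time if velocity could jump instantly to v_max (first-order model)."""
    return distance / v_max


def second_order_time(distance: float, v_max: float, a_max: float) -> float:
    """Minimum rest-to-rest travel time with bounded velocity and acceleration."""
    # Distance used up by accelerating to v_max and braking back to rest.
    ramp_distance = v_max ** 2 / a_max
    if distance >= ramp_distance:
        # Trapezoidal profile: accelerate, cruise at v_max, decelerate.
        return distance / v_max + v_max / a_max
    # Triangular profile: v_max is never reached.
    return 2.0 * math.sqrt(distance / a_max)


def deadline_is_feasible(distance: float, v_max: float, a_max: float,
                         deadline: float) -> bool:
    """Check a deadline against the second-order (acceleration-limited) time."""
    return second_order_time(distance, v_max, a_max) <= deadline


if __name__ == "__main__":
    d, v, a, deadline = 10.0, 2.0, 1.0, 6.0
    print("first-order time:", first_order_time(d, v))
    print("second-order time:", second_order_time(d, v, a))
    print("deadline feasible:", deadline_is_feasible(d, v, a, deadline))
```

In this example the velocity-only model meets the deadline while the acceleration-limited model does not, which is the kind of gap a refinement step would need to close.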

2604.16370 2026-06-05 cs.CL cs.AI cs.CV

Brain-CLIPLM: Semantic Compression for EEG-to-Text Decoding

Brain-CLIPLM: 用于EEG到文本解码的语义压缩

Xiaoli Yang, Huiyuan Tian, Yurui Li, Jianyu Zhang, Shijian Li, Gang Pan

Affiliation: Beijing Institute of Technology, Beijing, China

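AI summary: This work proposes Brain-CLIPLM, a framework that addresses the low signal-to-noise ratio and limited information bandwidth of EEG signals through semantic-anchor recovery followed by anchor-guided sentence reconstruction, achieving higher sentence retrieval accuracy.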

AI Chinese abstract

从非侵入性脑电图(EEG)解码自然语言仍受限于低信噪比和有限的信息带宽。这提出了一个核心问题:能否从此类信号中可靠地恢复句子级语言?在现实的信息约束下,直接恢复假设可能过于强烈。我们提出语义压缩假设:非侵入性EEG可能保留可恢复的语义锚点,而非完整的词法-句法形式。从这一视角,直接句子重建相对于EEG可恢复的信息规模过于细粒度。为解决这种不匹配,我们提出了Brain-CLIPLM,一个两阶段框架,将EEG到文本解码分解为语义锚点恢复和锚点引导的句子重建。第一阶段使用对比学习将词级EEG证据对齐固定关键词词汇并恢复有序的语义锚点。第二阶段使用基于检索的大型语言模型和链式推理提示从这些锚点中重建句子意义,遵循粒度匹配原则,使解码复杂度与可恢复的神经信息规模相匹配。在结合了苏黎世认知语言处理(ZuCo)基准测试中,Brain-CLIPLM实现了67.6%的Top-5和85.0%的Top-25句子检索准确率,其中在中间锚点粒度下表现最强。控制分析,包括排列检验,显示EEG衍生的锚点携带超出语言模型先验的信息。这些发现表明,EEG到文本解码应更好地视为在锚点引导句子重建之前恢复压缩的语义内容。

English abstract

Decoding natural language from non-invasive electroencephalography (EEG) remains constrained by low signal-to-noise ratio and limited information bandwidth. This raises a central question: can sentence-level language be reliably recovered from such signals? Under realistic information constraints, this direct-recovery assumption may be too strong. We introduce a semantic compression hypothesis: non-invasive EEG may preserve recoverable semantic anchors rather than the full lexical-syntactic form of a sentence. From this perspective, direct sentence reconstruction is overly fine-grained relative to the recoverable information scale of EEG. To address this mismatch, we propose Brain-CLIPLM, a two-stage framework that decomposes EEG-to-text decoding into semantic-anchor recovery and anchor-guided sentence reconstruction. Stage 1 uses contrastive learning to align word-level EEG evidence with a fixed keyword vocabulary and recover ordered semantic anchors. Stage 2 uses a retrieval-grounded large language model with chain-of-thought reasoning prompts to reconstruct sentence meaning from these anchors, following a granularity matching principle that aligns decoding complexity with the recoverable neural information scale. On the combined Zurich Cognitive Language Processing (ZuCo) benchmark, Brain-CLIPLM achieves 67.6% Top-5 and 85.0% Top-25 sentence retrieval accuracy, with the strongest performance at intermediate anchor granularity. Control analyses, including a permutation test, show that EEG-derived anchors carry sentence-specific information beyond language-model priors. These findings suggest that EEG-to-text decoding is better framed as recovering compressed semantic content before anchor-guided sentence reconstruction.
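The Top-5 and Top-25 figures are retrieval accuracies: a query counts as a hit when the correct sentence appears among the k highest-scoring candidates. The sketch below shows that metric on a toy similarity matrix. It assumes, purely for illustration, that the correct candidate for query i sits at index i; the paper's actual scoring pipeline is not described in the abstract.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption for this sketch: the correct candidate for query i is index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


if __name__ == "__main__":
    toy_scores = [
        [0.9, 0.1, 0.0],
        [0.8, 0.2, 0.1],
        [0.3, 0.5, 0.4],
    ]
    print("Top-1:", top_k_accuracy(toy_scores, 1))
    print("Top-2:", top_k_accuracy(toy_scores, 2))
```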

2604.09361 2026-06-05 cs.LG

Stochastic-Dimension Frozen Sampled Neural Network for High-Dimensional Gross-Pitaevskii Equations on Unbounded Domains

用于无界域上高维格罗斯-皮塔耶夫斯基方程的随机维度冻结采样神经网络

Zhangyong Liang, Tingfeng Wang, Xiaofei Zhao

Affiliations: National Center for Applied Mathematics, Tianjin University; School of Mathematics and Statistics, Wuhan University; School of Mathematics and Statistics & Computational Sciences Hubei Key Laboratory, Wuhan University

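AI summary: This paper proposes SD-FSNN, a new computational framework for solving high-dimensional Gross-Pitaevskii equations on unbounded domains. By combining several techniques, the method overcomes the curse of dimensionality of traditional discretizations and the computational bottlenecks of gradient-based neural network solvers. First, a prescribed Gaussian envelope encodes the far-field decay of the wavefunction, enabling a space-time separation in which the spatial approximation is handled by a frozen single-hidden-layer neural network with data-driven sampled features. This yields a gradient-free formulation in which spatial derivatives are precomputed analytically and the time dependence is evolved through reduced ODEs. Second, a stochastic-dimension sampler evaluates only a small subset of spatial dimensions at each time step, giving a conditionally unbiased estimate of the spatial operator and lowering computational and memory costs. Discrete conservation laws are also enforced for long-term stability. Extensive numerical experiments on GPEs in up to 1000 dimensions show that SD-FSNN achieves significantly higher accuracy and efficiency than state-of-the-art methods, including PINNs, randomized feature methods, and tensor-network approaches. The results confirm that SD-FSNN effectively mitigates the Kolmogorov n-width barrier for frozen-basis models on structured solution manifolds.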

AI Chinese abstract

本文介绍了一种名为随机维度冻结采样神经网络(SD-FSNN)的新计算框架,用于求解无界域上的高维格罗斯-皮塔耶夫斯基方程(GPE)。所提出的方法通过技术的协同作用,克服了传统离散化方法中的维度诅咒和梯度基神经网络求解器的计算瓶颈。首先,预设的高斯包络编码了波函数的远场衰减,使得空间-时间分离得以实现,其中空间近似通过冻结的单隐层神经网络和数据驱动的采样特征进行处理。这导致了一个无梯度的形式化,其中空间导数被解析地预先计算,时间依赖性则通过减少的常微分方程演化。其次,随机维度采样器通过在每个时间步只评估少量空间维度,提供了空间算子的条件无偏估计,从而降低了计算和内存成本。离散守恒定律也被强制执行,确保了长期稳定性。大量的数值实验表明,SD-FSNN在高达1000维的GPE上实现了显著更高的准确性和效率,优于当前最先进的方法,包括PINNs、随机特征方法和张量网络方法。结果证实SD-FSNN有效缓解了冻结基模型在结构解流形上的Kolmogorov n-宽度障碍。

English abstract

This paper introduces the Stochastic-Dimension Frozen Sampled Neural Network (SD-FSNN), a novel computational framework for solving high-dimensional Gross-Pitaevskii equations (GPEs) on unbounded domains. The proposed method circumvents the curse of dimensionality that plagues traditional discretizations and the computational bottlenecks of gradient-based neural network solvers through a combination of techniques. First, a prescribed Gaussian envelope encodes the far-field decay of the wavefunction, enabling a space-time separation where the spatial approximation is handled by a frozen, single-hidden-layer neural network with data-driven sampled features. This yields a gradient-free formalism where spatial derivatives are analytically precomputed and time-dependence is evolved via reduced ODEs. Second, a stochastic-dimension sampler provides a conditionally unbiased estimate of the spatial operator by evaluating only a small subset of spatial dimensions at each time step, substantially reducing computational and memory costs. Discrete conservation laws are also enforced, ensuring long-term stability. Extensive numerical experiments on GPEs in up to 1000 dimensions demonstrate that SD-FSNN achieves significantly higher accuracy and efficiency compared to state-of-the-art methods, including PINNs, randomized feature methods, and tensor-network approaches. The results confirm that SD-FSNN effectively mitigates the Kolmogorov $n$-width barrier for frozen-basis models on structured solution manifolds.
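The idea of an unbiased estimate from a subset of dimensions can be shown on a generic sum over d per-dimension terms (for a Laplacian, these would be the second derivatives along each axis). The sketch below samples k of d dimensions uniformly without replacement and rescales by d/k; averaging the estimate over every possible subset recovers the full sum exactly. The uniform sampling rule and all function names are assumptions made here for illustration; the abstract does not specify the paper's sampler.

```python
import itertools
import random


def full_sum(terms):
    """Exact operator value: sum of all per-dimension terms."""
    return sum(terms)


def sampled_sum(terms, dims):
    """Estimate from a subset of dimensions, rescaled by d / k."""
    d = len(terms)
    return (d / len(dims)) * sum(terms[i] for i in dims)


def random_estimate(terms, k, rng):
    """One stochastic estimate using k dimensions chosen uniformly at random."""
    dims = rng.sample(range(len(terms)), k)
    return sampled_sum(terms, dims)


def mean_over_all_subsets(terms, k):
    """Average of the estimator over every size-k subset (its exact expectation)."""
    subsets = list(itertools.combinations(range(len(terms)), k))
    return sum(sampled_sum(terms, s) for s in subsets) / len(subsets)


if __name__ == "__main__":
    per_dim_terms = [1.0, -2.0, 4.0, 0.5]
    print("exact:", full_sum(per_dim_terms))
    print("one sample (k=2):", random_estimate(per_dim_terms, 2, random.Random(0)))
    print("expectation (k=2):", mean_over_all_subsets(per_dim_terms, 2))
```

Each single estimate costs k term evaluations instead of d, at the price of added variance.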

2604.03634 2026-06-05 cs.LG cs.IT eess.SP math.IT

Algebraic Diversity: Group-Theoretic Spectral Estimation from Single Observations

代数多样性:从单次观测进行群论谱估计

Mitchell A. Thornton

Affiliation: Richardson, TX 75080 USA

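AI summary: This paper takes a group-theoretic view of spectral estimation from a single observation. It shows that temporal averaging is the degenerate case of a group action, that a group-averaged single-snapshot estimator is equivalent in its subspace decomposition to multi-snapshot covariance estimation, and that the DFT, DCT, and KLT are unified as special cases.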

Comments
41 pages, 14 figures. v3: Retracted six quantitative findings in Section 11 (transformer application) due to an implementation error in the spectral concentration metric. Corrected results deferred to a separate publication. Remark added after Conjecture 23 on orbit-structure bias in the psi criterion. All other sections unaffected. v4: new result on blind group matching. v5: corrected/updated metrics.
AI Chinese abstract

我们证明时间平均多个观测是退化群作用的特例,群G={e}。一个通用替换定理证明了单快门群平均估计与多快门协方差估计具有等效的子空间分解。平凡群嵌入定理证明样本协方差是平凡群估计的累积,其方差由(G,L)连续体支配,随1/(|G|·L)变化。处理增益10log10(M) dB等于经典波束成形增益,证明该增益是群阶的属性而非传感器数量。DFT、DCT和KLT统一为群匹配的特例。我们推测一个通用代数平均定理,将这些结果扩展到任意统计量,方差由有效群阶d_eff支配。蒙特卡洛实验在五种群类型下的前四个样本矩上验证了该猜想,精度达四位。该框架利用信息的结构(数据对象的表示论对称性)而非内容,补充了香农理论。五种应用被展示:单快门MUSIC、大规模MIMO、单脉冲波形分类、图信号处理和变压器LLM分析。描述了盲群匹配技术。

English abstract

We establish that temporal averaging over multiple observations is the degenerate case of an algebraic group action with the trivial group $G=\{e\}$. A General Replacement Theorem proves that a group-averaged estimator from one snapshot achieves equivalent subspace decomposition to multi-snapshot covariance estimation. The Trivial Group Embedding Theorem proves that the sample covariance is the accumulation of trivial-group estimates, with variance governed by a $(G,L)$ continuum as $1/(|G|\cdot L)$. The processing gain $10\log_{10}(M)$ dB equals the classical beamforming gain, establishing that this gain is a property of group order, not sensor count. The DFT, DCT, and KLT are unified as group-matched special cases. We conjecture a General Algebraic Averaging Theorem extending these results to arbitrary statistics, with variance governed by the effective group order $d_{\mathrm{eff}}$. Monte Carlo experiments on the first four sample moments across five group types confirm the conjecture to four-digit precision. The framework exploits the structure of information (representation-theoretic symmetry of the data object) rather than the content, complementing Shannon's theory. Five applications are demonstrated: single-snapshot MUSIC, massive MIMO, single-pulse waveform classification, graph signal processing, and analysis of transformer LLMs. Techniques for blind group matching are described.
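As a rough illustration of averaging over a group instead of over time, the sketch below takes a single real-valued snapshot, applies every element of the cyclic shift group, and averages the resulting outer products. The result is a circulant matrix, and circulant matrices are diagonalized by the DFT. The choice of the cyclic group, the real-valued snapshot, and the function names are assumptions made for this example; the abstract does not give the paper's estimator in this form.

```python
import math


def cyclic_group_average(x):
    """Average the outer products of all cyclic shifts of one snapshot x."""
    m = len(x)
    r = [[0.0] * m for _ in range(m)]
    for g in range(m):  # one term per group element (cyclic shift by g)
        shifted = [x[(i + g) % m] for i in range(m)]
        for i in range(m):
            for j in range(m):
                r[i][j] += shifted[i] * shifted[j] / m
    return r


def processing_gain_db(m):
    """Processing gain 10*log10(M) in dB, as quoted in the abstract."""
    return 10.0 * math.log10(m)


if __name__ == "__main__":
    snapshot = [1.0, 2.0, -1.0, 0.5]
    for row in cyclic_group_average(snapshot):
        print(["%.4f" % value for value in row])
    print("gain for M=4 (dB): %.3f" % processing_gain_db(4))
```

Each entry of the averaged matrix depends only on the index difference modulo M, which is what makes it circulant.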

2604.07709 2026-06-05 cs.AI cs.CL cs.CY cs.LG

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

IatroBench: AI安全措施中意外伤害的预注册证据

David Gringras

Affiliation: Harvard T.H. Chan School of Public Health

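AI summary: Using IatroBench, this study evaluates the risk of iatrogenic harm from AI safety measures in clinical decision support. It finds significant differences across models in identity-contingent withholding, most pronounced in the most heavily safety-trained models.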

Comments
30 pages, 3 figures, 11 tables. Pre-registered on OSF (DOI: 10.17605/OSF.IO/G6VMZ). Code and data: https://github.com/davidgringras/iatrobench. v2: Fix bibliography entries (add arXiv IDs, published venues); correct p-value typo in Limitations section; add AI Assistance Statement. v3: Correct Figure 1 (decoupling scatter accidentally reverted to earlier draft in v2).
AI Chinese abstract

一个经过严格安全训练的模型会将完整的苯二氮䓬类药物减量方案交给医生,而拒绝给需要该方案的患者,尽管临床事实完全相同;知识在两种情况下都存在。IatroBench在六十个预注册的临床场景和六个前沿模型(3,600次响应)中测量这种不对称性,并通过医生编写的结构化评估进行评分,该评估由第二位医生验证(加权Kappa 0.571,内部一致性96%)。在保持临床内容不变的情况下,仅改变提问者是患者还是医生,产生我们称为身份依赖性隐瞒的现象:所有五个可测试的模型都给医生更多(解耦间隙+0.38,p=0.003;在安全冲突行动上的非专业人士命中率下降13.1点,p<0.0001;其余无变化),且在最高度安全训练的模型Opus中,差距最大(+0.65)。触发因素是缺乏任何专业或知识信号,而不是身份证明,因为律师或知情的非专业人士可以恢复被拒绝的患者情况。仅考虑委托的基准会将三种机制评分相同。Opus抑制了医生框架证明其知道的内容;Llama 4在两种框架中都不胜任;GPT-5.2的过滤器剥离了其33.2%的医生响应,但没有剥离非专业人士的响应。评估层继承了训练层的盲目性;标准LLM评分者在我们流程标记为有害的81.5%的响应中对遗漏伤害评分零(Kappa 0.066),因此用于检测失败的工具重现了这种现象。这些场景是为碰撞设计的;其比率描述了这种设计,但说 nothing about ordinary prevalence.

English abstract

A heavily safety-trained model will hand a physician the full, patient-followable benzodiazepine taper and refuse it to the patient who needs it, over identical clinical facts; the knowledge is present either way. IatroBench measures that asymmetry across sixty pre-registered clinical scenarios and six frontier models (3,600 responses), scoring each on two axes, commission harm (what a response gets wrong) and omission harm (what it withholds), through a physician-authored structured evaluation validated by a second physician (weighted kappa 0.571, within-1 agreement 96%). Holding clinical content fixed and varying only whether the asker presents as patient or physician yields what we call identity-contingent withholding: all five testable models give the physician more (a decoupling gap of +0.38, p = 0.003; a 13.1-point fall in layperson hit rates on safety-colliding actions, p < 0.0001; no change on the rest), and the gap runs widest in the most heavily safety-trained model, Opus (+0.65). The trigger is the absence of any professional or epistemic signal rather than a credential, since a lawyer or an informed layperson recovers what the patient is refused. A commission-only benchmark would score three mechanisms alike. Opus suppresses what physician framing proves it knows; Llama 4 is incompetent in either framing; GPT-5.2's filter strips 33.2% of its physician responses and none of the lay ones. The evaluation layer inherits the blindness of the training layer; a standard LLM judge scores zero omission harm on 81.5% of the responses our pipeline flags harmful (kappa 0.066), so the instrument built to detect the failure reproduces it. The scenarios are engineered for collision; their rates describe that design and say nothing about ordinary prevalence.

2604.12138 2026-06-05 cs.AI cs.CL cs.IR

Retrieval-Augmented Generation Must Move Beyond Factual Grounding to Represent Diverse Opinions

检索增强生成必须超越事实基础以代表多样化观点

Aditya Agrawal, Alwarappan Nakkiran, Darshan Fofadiya, Alex Karlsson, Harsha Aduri

Affiliation: Amazon.com

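AI summary: This paper argues that retrieval-augmented generation systems exhibit a systematic factual bias and that retrieval system design needs a paradigm shift. It formulates a unified objective through uncertainty quantification and reports experiments with the Opinion-Aware RAG architecture in two domains, showing improvements in diversity, fairness, and accuracy.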

Comments
20 pages. Preprint under review.
AI Chinese abstract

本文主张检索增强生成系统存在系统性事实偏差,即在优化知识不确定性的同时忽视意见丰富内容中固有的随机不确定性。这种不一致要求检索系统设计发生范式转变。对35个主要RAG基准的调查表明,只有一个是意见合成的,证实了这种偏差的结构性:嵌入在数据集、检索目标和评估指标中。除了技术限制外,这种偏差还对透明和可问责的AI构成风险:回音室效应放大主导观点,系统性低估少数声音,以及通过偏见信息合成进行意见操控的潜在风险。我们通过不确定性量化的方法正式提出问题,显示事实查询应最小化后验熵,而意见查询必须保持它,并利用Wasserstein距离推导出统一的目标,涵盖覆盖性、忠实性和公平性。作为存在证明,我们提出了Opinion-Aware RAG(O-RAG),一种具有基于LLM的意见提取和实体链接意见元数据的架构,并在两个领域——电子商务卖家论坛和公共酒店评论——中评估了超过10000次讨论和6000次客户评论。实验显示Wasserstein距离到语料库级情感分布减少了18-48%,情感多样性增加了26.8%,实体匹配率增加了42.7%,人类评估者在79.2%的情况下更偏好包含意见的响应。我们提出了一项研究议程,并认为随着RAG系统越来越多地调解信息访问,其代表多样化观点的能力不仅不是可选的,而是必需的。

English abstract

This position paper argues that Retrieval-Augmented Generation systems exhibit a systematic factual bias, optimizing for epistemic uncertainty reduction while ignoring the aleatoric uncertainty inherent in opinion-rich content, and that this misalignment demands a paradigm shift in retrieval system design. A survey of 35 major RAG benchmarks reveals that only one addresses opinion synthesis, confirming that the bias is structural: embedded in datasets, retrieval objectives, and evaluation metrics alike. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic under-representation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize the problem through the lens of uncertainty quantification, showing that factual queries should minimize posterior entropy while opinion queries must preserve it, and derive a unified objective over coverage, fidelity, and fairness using the Wasserstein distance. As an existence proof, we present Opinion-Aware RAG (O-RAG), an architecture featuring LLM-based opinion extraction and entity-linked opinion metadata, and evaluate it across two domains, e-commerce seller forums and public hotel reviews, spanning 10K+ discussions and 6K+ customer reviews. Experiments demonstrate an 18-48% reduction in Wasserstein distance to corpus-level sentiment distributions, +26.8% sentiment diversity, and +42.7% entity match rate, with human evaluators preferring opinion-enriched responses 79.2% of the time. We propose a research agenda and argue that as RAG systems increasingly mediate access to information, their ability to represent diverse perspectives is not optional but essential.
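The reported metric compares the sentiment distribution of retrieved content with the corpus-level distribution. For two distributions over the same ordered, equally spaced sentiment bins, the Wasserstein-1 distance reduces to the sum of absolute differences between their cumulative distributions. The sketch below implements that one-dimensional case; the three-bin negative/neutral/positive layout and the function name are assumptions for illustration, not the paper's evaluation code.

```python
def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions on unit-spaced ordered bins."""
    if len(p) != len(q):
        raise ValueError("distributions must use the same bins")
    distance = 0.0
    cdf_gap = 0.0
    for p_i, q_i in zip(p, q):
        # Running difference between the two cumulative distributions.
        cdf_gap += p_i - q_i
        distance += abs(cdf_gap)
    return distance


if __name__ == "__main__":
    # Hypothetical bins: [negative, neutral, positive].
    corpus = [0.5, 0.5, 0.0]
    retrieved = [0.0, 0.5, 0.5]
    print("W1(corpus, retrieved) =", wasserstein_1d(corpus, retrieved))
```

A value of zero means the retrieved set reproduces the corpus sentiment mix exactly; larger values mean more probability mass would have to move further across bins.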

2604.12110 2026-06-05 cs.LG

SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling

SOLARIS: 预测性卸载基于潜在表示的推理扩展

Zikun Liu, Liang Luo, Qianru Li, Zhengyu Zhang, Wei Ling, Jingyi Shen, Zeliang Chen, Yaning Huang, Jingxian Huang, Abdallah Aboelela, Chonglin Sun, Feifan Gu, Fenggang Wu, Hang Qu, Huayu Li, Jill Pan, Kaidi Pei, Laming Chen, Longhao Jin, Qin Huang, Tongyi Tang, Varna Puvvada, Wenlin Chen, Xiaohan Wei, Xu Cao, Yantao Yao, Yuan Jin, Yunchen Pu, Yuxin Chen, Zijian Shen, Zhengkai Zhang, Jing Zhu, Dong Liang, Ellie Wen

Affiliation: Meta AI

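AI summary: This paper proposes the SOLARIS framework, which predicts user-item interaction embeddings for future requests in order to decouple expensive foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer at scale and improving serving efficiency and revenue.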

Comments
Accepted to SIGIR 2026 Industry Track
AI Chinese abstract

近期推荐系统扩展定律的进展导致了前所未有的复杂基础模型。尽管这些模型性能优异,但其计算需求使得实时服务不切实际,通常迫使从业者依赖知识蒸馏,以牺牲服务质量换取效率。为了解决这一挑战,我们提出了SOLARIS(基于潜在表示的推测卸载推理扩展)框架,灵感来源于推测解码。SOLARIS通过预测未来请求中可能出现的用户-项目对,主动预计算用户-项目交互嵌入,并异步生成其基础模型表示。这种方法将昂贵的基础模型推理与延迟敏感的服务路径解耦,使能够从此前被认为过于昂贵而无法用于在线使用的模型中进行实时知识转移。在部署于Meta的广告系统中,该系统每日处理数十亿请求,SOLARIS实现了0.67%的收益驱动的顶级指标提升,证明了其在大规模应用中的有效性。

English abstract

Recent advances in recommendation scaling laws have led to foundation models of unprecedented complexity. While these models offer superior performance, their computational demands make real-time serving impractical, often forcing practitioners to rely on knowledge distillation, compromising serving quality for efficiency. To address this challenge, we present SOLARIS (Speculative Offloading of Latent-bAsed Representation for Inference Scaling), a novel framework inspired by speculative decoding. SOLARIS proactively precomputes user-item interaction embeddings by predicting which user-item pairs are likely to appear in future requests and asynchronously generating their foundation model representations ahead of time. This approach decouples the costly foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer from models previously considered too expensive for online use. Deployed across Meta's advertising system serving billions of daily requests, SOLARIS achieves a 0.67% gain in revenue-driving top-line metrics, demonstrating its effectiveness at scale.
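The decoupling described above can be pictured as a cache that is filled off the serving path and only read on it. The sketch below is a minimal single-process illustration of that pattern. The class name, the fallback to a cheaper model on a cache miss, and the synchronous `precompute` call are all assumptions introduced here; the abstract does not describe how SOLARIS handles misses or schedules its asynchronous work.

```python
class SpeculativeEmbeddingCache:
    """Precompute expensive embeddings for predicted pairs; serve from cache only."""

    def __init__(self, expensive_model, fallback_model):
        self._expensive_model = expensive_model  # hypothetical foundation model call
        self._fallback_model = fallback_model    # hypothetical cheap model for misses
        self._cache = {}

    def precompute(self, predicted_pairs):
        """Off the serving path: embed the pairs predicted to appear soon."""
        for pair in predicted_pairs:
            self._cache[pair] = self._expensive_model(pair)

    def serve(self, pair):
        """Latency-critical path: never calls the expensive model.

        Returns (embedding, hit) where hit says whether the cache was used.
        """
        if pair in self._cache:
            return self._cache[pair], True
        return self._fallback_model(pair), False


if __name__ == "__main__":
    cache = SpeculativeEmbeddingCache(
        expensive_model=lambda pair: ("foundation", pair),
        fallback_model=lambda pair: ("cheap", pair),
    )
    cache.precompute([("user_1", "item_7")])
    print(cache.serve(("user_1", "item_7")))
    print(cache.serve(("user_2", "item_9")))
```

The quality of the pair predictor determines the hit rate, and therefore how often the serving path benefits from the expensive model.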

2410.04960 2026-06-05 cs.CV

On Efficient Variants of Segment Anything Model: A Survey

关于高效分段任何模型的变体:一项调查

Xiaorui Sun, Jun Liu, Heng Tao Shen, Xiaofeng Zhu, Ping Hu

Affiliations: School of Computer Science and Engineering; School of Computing and Communications; School of Computer Science and Technology

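AI summary: This survey reviews efficient variants of the Segment Anything Model, covering the core techniques and methods for improving efficiency while preserving accuracy, and evaluates their performance on different hardware.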

Comments
IJCV
AI Chinese abstract

分段任何模型(SAM)是图像分割任务的基础模型,以其在多样化应用中的强大泛化能力而闻名。然而,其出色的性能伴随着显著的计算和资源需求,使其在资源受限的环境中(如边缘设备)部署变得困难。为此,提出了一系列SAM变体以在保持准确性的同时提高效率。本文提供了对这些高效SAM变体的首次全面回顾。我们首先探讨了推动这项研究的动力,然后介绍了SAM中使用的核心技术和模型加速方法。接着,我们详细探讨了SAM加速策略,按方法进行分类,并讨论了几个未来研究方向。最后,我们对这些方法在各种硬件上的进行了统一和广泛的评估,评估了它们在代表性基准上的效率和准确性,并提供了整体性能的清晰比较。

English abstract

The Segment Anything Model (SAM) is a foundational model for image segmentation tasks, known for its strong generalization across diverse applications. However, its impressive performance comes with significant computational and resource demands, making it challenging to deploy in resource-limited environments such as edge devices. To address this, a variety of SAM variants have been proposed to enhance efficiency while keeping accuracy. This survey provides the first comprehensive review of these efficient SAM variants. We begin by exploring the motivations driving this research. We then present core techniques used in SAM and model acceleration. This is followed by a detailed exploration of SAM acceleration strategies, categorized by approach, and a discussion of several future research directions. Finally, we offer a unified and extensive evaluation of these methods across various hardware, assessing their efficiency and accuracy on representative benchmarks, and providing a clear comparison of their overall performance.

2604.08882 2026-06-05 cs.RO

Simulation of Adaptive Running with Flexible Sports Prosthesis using Reinforcement Learning of Hybrid-link System

使用混合链接系统强化学习模拟适应性跑步与柔性运动假肢

Yuta Shimane, Ko Yamamoto

Affiliations: Department of Biological Sciences, The University of Tokyo; Institute of Systems and Information Engineering, University of Tsukuba

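AI summary: This paper proposes a reinforcement learning-based framework for simulating adaptive running in a unilateral transtibial amputee under different virtual prosthetic stiffness conditions. A hybrid-link system incorporates the flexibility of a leaf-spring-type sports prosthesis, and the study analyzes how prosthetic stiffness affects running dynamics and metabolic cost.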

AI Chinese abstract

本研究提出了一种基于强化学习的框架,用于模拟单侧小腿截肢者在混合链接系统中的适应性跑步运动,该系统整合了叶弹簧型运动假肢的灵活性。运动假肢的设计和选择通常依赖于试错法。全面的全身动力学分析,考虑人体运动与假肢变形之间的相互作用,可以为用户特定的设计和选择提供有价值的见解。所提出的混合链接系统通过整合分段常应(PCS)模型来代表假肢的灵活性。基于此系统,模拟方法利用强化学习方法生成单侧小腿截肢者的全身动态运动。该框架整合了基于运动捕捉数据的模仿学习与准确的假肢动力学计算。在多种虚拟假肢刚度条件下模拟跑步运动,并分析由此获得的相应的代谢成本(COT)。结果表明,假肢刚度的变化影响跑步动态和性能,且COT与先前研究中的值一致。我们的发现证明了所提出方法在虚拟条件下进行模拟和分析的潜力,这些虚拟条件与现实世界条件不同。

English abstract

This study proposes a reinforcement learning-based framework for adaptive running motion simulation in a unilateral transtibial amputee using a hybrid-link system that incorporates the flexibility of a leaf-spring-type sports prosthesis. The design and selection of sports prostheses typically rely on trial and error. A comprehensive whole-body dynamics analysis that accounts for interactions between human motion and prosthetic deformation can provide valuable insights for user-specific design and selection. The proposed hybrid-link system enables such analysis by integrating a Piece-wise Constant Strain (PCS) model to represent prosthetic flexibility. Based on this system, the simulation methodology generates whole-body dynamic motions of a unilateral transtibial amputee using a reinforcement learning approach. This framework integrates imitation learning based on motion capture data with accurate computation of prosthetic dynamics. Running motions are simulated under multiple virtual prosthetic stiffness conditions, and the corresponding metabolic cost of transport (COT) obtained from these simulations is analyzed. The results suggest that variations in prosthetic stiffness influence running dynamics and performance, and that COT is consistent with values reported in prior work. Our findings demonstrate the potential of the proposed approach for simulation and analysis under virtual conditions that differ from real-world conditions.
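Cost of transport is commonly defined as the energy spent divided by body weight times distance travelled, which makes it dimensionless and comparable across runners. The sketch below computes it under that common definition. Whether the paper normalizes by weight (as here) or reports energy per kilogram per metre is not stated in the abstract, so the normalization and the function name are assumptions.

```python
STANDARD_GRAVITY = 9.81  # m/s^2


def cost_of_transport(metabolic_energy_j: float, body_mass_kg: float,
                      distance_m: float) -> float:
    """Dimensionless cost of transport: energy / (mass * g * distance)."""
    if body_mass_kg <= 0 or distance_m <= 0:
        raise ValueError("mass and distance must be positive")
    return metabolic_energy_j / (body_mass_kg * STANDARD_GRAVITY * distance_m)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to exercise the formula.
    energy_j = 34335.0
    print("COT = %.3f" % cost_of_transport(energy_j, body_mass_kg=70.0, distance_m=100.0))
```

Comparing this value across stiffness conditions is one way to read the stiffness sweep described above.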

2604.08477 2026-06-05 cs.AI cs.CL cs.LG

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

SUPERNOVA: 通过自然指令上的强化学习激发大语言模型的通用推理

Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel

Affiliation: University of California, Los Angeles

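AI summary: This paper proposes SUPERNOVA, a framework that builds a high-quality dataset for reinforcement learning with verifiable rewards (RLVR) from natural instruction datasets. Through 100+ RL experiments it systematically studies how to use such datasets to improve downstream reasoning, and it reports a relative gain of 64.4 percentage points on the BigBench Extra Hard benchmark.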

Comments
23 pages; 2-column format; 10 figures
AI Chinese abstract

强化学习可验证奖励(RLVR)在数学和代码等正式领域显著提升了推理能力,但将其扩展到STEM领域以外仍然具有挑战性。扩展RLVR到STEM领域本质上受到高质量可验证训练数据的缺乏限制。在本文中,我们引入SUPERNOVA,一个从自然指令数据集中整理RLVR数据的框架,这些数据集是专家标注的丰富来源,但尚未被充分利用于RLVR训练。通过100多次受控的强化学习实验,我们系统研究如何利用这些数据集进行RLVR训练以及数据整理决策如何影响下游推理性能。特别是,我们研究了三种数据设计:(a)源任务选择,(b)任务混合,以及(c)合成干预。我们的分析揭示了源任务选择对下游推理性能有显著影响。此外,基于单个目标任务性能选择任务优于基于总体平均性能的策略,合成干预并未提高推理能力。受这些见解的启发,我们构建了SUPERNOVA,一个从自然指令数据集中整理出的25,000个实例的高质量RLVR数据集。我们证明了在SUPERNOVA上训练Qwen3-0.6B比基础Qwen3-0.6B表现更优,在包含23个复杂推理任务的挑战性基准BigBench Extra Hard(BBEH)上实现了64.4个百分点的相对提升。重要的是,我们发现SUPERNOVA的收益可以推广到未见基准、更大模型规模和新模型家族。总体而言,我们的发现为整理人类标注资源以扩展RLVR到通用推理提供了实用见解。模型、数据、代码见https://github.com/asuvarna31/supernova。

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved reasoning in formal domains such as mathematics and code, but extending these gains beyond STEM remains challenging. Extending RLVR beyond STEM is fundamentally constrained by the lack of high-quality verifiable training data. In this work, we introduce SUPERNOVA, a framework for curating RLVR data from natural instruction datasets, which are a rich source of expert-annotated data but are underexplored for RLVR training. Through 100+ controlled RL experiments, we systematically study how to utilize these datasets for RLVR and how data curation decisions affect downstream reasoning performance. In particular, we investigate three data designs: (a) source task selection, (b) task mixing, and (c) synthetic interventions. Our analysis reveals that source task selection has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance, and synthetic interventions do not improve reasoning. Guided by these insights, we construct SUPERNOVA, a high-quality RLVR dataset of 25K instances curated from natural instruction datasets. We show that training Qwen3-0.6B on SUPERNOVA outperforms the base Qwen3-0.6B, yielding a relative gain of 64.4pp on BigBench Extra Hard (BBEH), a challenging benchmark comprising 23 complex reasoning tasks. Importantly, we find that gains from SUPERNOVA generalize to unseen benchmarks, larger model scales, and newer model families. Overall, our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. Models, data, and code are available at https://github.com/asuvarna31/supernova.

2604.06052 2026-06-05 cs.CV

Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models

注意,我可以请你决定吗?定位扩散模型中的生成选择

Katarzyna Zaleska, Łukasz Popek, Monika Wysoczańska, Kamil Deja

Affiliations: Warsaw University of Technology; valeo.ai; IDEAS Research Institute

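AI summary: This paper proposes a probing-based localization technique, finds that self-attention layers are key to resolving ambiguous concepts, and designs ICM, a method that achieves precise debiasing by intervening on a small number of self-attention layers.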

Comments
CVPR 2026
AI Chinese abstract

文本到图像扩散模型展现出卓越的生成能力,但其内部运作仍然不透明,尤其是在处理不完全描述性提示时。在这种情况下,模型必须做出隐式决策以生成文本中未明确指定的细节。本文研究了这一决策过程并非分散而是计算上局部化在模型架构中的假设。虽然现有的定位技术专注于提示相关的干预,但我们注意到这种显式条件可能与隐式决策不同。因此,我们引入了一种基于探测的定位技术,以识别概念属性可分性最高的层。我们的发现表明,模糊概念的分辨主要由自注意力层控制,将其确定为最有效的干预点。基于这一发现,我们提出了ICM(隐式选择修改)——一种精确的引导方法,对少量层进行有针对性的干预。大量实验证实,与现有最先进方法相比,干预这些特定的自注意力层能产生更优的去偏性能,并最小化较不精确方法常见的伪影。代码可在https://github.com/kzaleskaa/icm获取。

English abstract

Text-to-image diffusion models exhibit remarkable generative capabilities, yet their internal operations remain opaque, particularly when handling prompts that are not fully descriptive. In such scenarios, models must make implicit decisions to generate details not explicitly specified in the text. This work investigates the hypothesis that this decision-making process is not diffuse but is computationally localized within the model's architecture. While existing localization techniques focus on prompt-related interventions, we observe that such explicit conditioning may differ from implicit decisions. Therefore, we introduce a probing-based localization technique to identify the layers with the highest attribute separability for concepts. Our findings indicate that the resolution of ambiguous concepts is governed principally by self-attention layers, identifying them as the most effective point for intervention. Based on this discovery, we propose ICM (Implicit Choice-Modification), a precise steering method that applies targeted interventions to a small subset of layers. Extensive experiments confirm that intervening on these specific self-attention layers yields superior debiasing performance compared to existing state-of-the-art methods, minimizing artifacts common to less precise approaches. The code is available at https://github.com/kzaleskaa/icm.

2604.03042 2026-06-05 cs.RO

Enhancing Multi-Robot Exploration Using Probabilistic Frontier Prioritization with Dirichlet Process Gaussian Mixtures

利用概率前沿优先级与狄利克雷过程高斯混合模型增强多机器人探索

John Lewis Devassy, Meysam Basiri, Mário A. T. Figueiredo, Pedro U. Lima

Affiliations: Institute for Systems and Robotics / LARSyS and Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de Lisboa

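AI summary: This paper proposes an enhancement based on probabilistic frontier prioritization with Dirichlet process Gaussian mixture models to improve the efficiency of multi-robot exploration. Integrated into two state-of-the-art multi-agent exploration algorithms, it improves performance across varying environment complexity, communication constraints, and team sizes, with average simulation gains of 10% and 14% for the two algorithms.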

Comments
Accepted: IEEE Robotics and Automation Letters (RA-L)
AI Chinese abstract

多智能体自主探索对于环境监测、搜索救援和大规模工业监控等应用至关重要。然而,在通信限制下有效协调仍是一个重大挑战。前沿探索算法分析已知区域与未知区域之间的边界,以确定下一个最佳视图,以最大化探索收益。本文提出了一种改进现有基于前沿的探索算法的方法,通过引入概率前沿优先级方法,利用狄利克雷过程高斯混合模型(DP-GMM)和信息增益的概率公式,提高前沿优先级的质量。该改进方法整合到两种最先进的多智能体探索算法中,在不同环境复杂度、通信限制和团队规模下均实现了性能提升。仿真显示,两种算法在所有组合中平均收益提高了10%和14%。在双无人机真实世界实验中的成功部署进一步证实了这些发现。

English abstract

Multi-agent autonomous exploration is essential for applications such as environmental monitoring, search and rescue, and industrial-scale surveillance. However, effective coordination under communication constraints remains a significant challenge. Frontier exploration algorithms analyze the boundary between the known and unknown regions to determine the next-best view that maximizes exploratory gain. This article proposes an enhancement to existing frontier-based exploration algorithms by introducing a probabilistic approach to frontier prioritization. By leveraging a Dirichlet process Gaussian mixture model (DP-GMM) and a probabilistic formulation of information gain, the method improves the quality of frontier prioritization. The proposed enhancement, integrated into two state-of-the-art multi-agent exploration algorithms, consistently improves performance across environments of varying clutter, communication constraints, and team sizes. Simulations show average gains of 10% and 14% for the two algorithms across all combinations. Successful deployment in real-world experiments with a dual-drone system further corroborates these findings.

2604.01489 2026-06-05 cs.LG cs.AI cs.DC cs.PF cs.SE

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

CuTeGen: 基于LLM的代理框架用于使用CuTe生成和优化高性能GPU内核

Tara Saba, Zhiyang Chen, Jikai Jason Li, Anne Ouyang, Xujie Si, Fan Long

Affiliation: Department of Computer Science, University of Toronto

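AI summary: This paper proposes CuTeGen, an LLM-based agentic framework that generates and optimizes GPU kernels over the CuTe abstraction layer using a structured generate-test-refine workflow. On standard benchmarks it achieves an average 1.71× speedup over PyTorch and outperforms the existing agentic baseline CudaForge at comparable generation cost.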

AI Chinese abstract

高性能GPU内核对现代机器学习系统至关重要,但开发这些内核仍然是一个手动、专家驱动的过程。最近的研究尝试利用LLM自动生成功能内核,但生成的内核在标准化基准测试中仍无法达到精心调优的参考内核。我们提出了CuTeGen,一种代理GPU内核合成框架,将内核开发视为在CuTe抽象层上的结构化生成-测试-优化工作流。CuTeGen有两个设计选择区别于先前的工作:针对CuTe而不是原始CUDA,这暴露了性能关键结构如分块和数据移动,同时保持足够的稳定性以进行迭代优化;以及延迟的性能调度,将低层次性能反馈推迟到内核的高层结构稳定之后。在209个KernelBench Level-1和Level-2任务上,CuTeGen在PyTorch上实现了平均1.71倍的速度提升,并在生成成本相近的情况下优于先前的代理基线CudaForge(0.89倍)。代码可在https://github.com/taratt/cutegen.git获取。

English abstract

High-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work: targeting CuTe rather than raw CUDA, which exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement, and a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71× over PyTorch and outperforms the prior agentic baseline CudaForge (0.89×) at comparable per-task generation cost. Code is available at https://github.com/taratt/cutegen.git.
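The generate-test-refine workflow with delayed profiling can be sketched as a loop that feeds back test logs first and only switches to profiler output once the candidate has been passing for a while. The sketch below is a schematic control loop, not CuTeGen's implementation: the three callables are hypothetical stand-ins for the LLM, the correctness harness, and the profiler, and treating "stabilized" as a number of consecutive passing iterations is an assumption made here.

```python
def refine_kernel(generate, run_tests, profile, max_iters=6, stable_after=2):
    """Generate-test-refine loop that withholds profiler feedback at first.

    generate(previous_candidate, feedback) -> candidate
    run_tests(candidate) -> (passed: bool, log: str)
    profile(candidate) -> str
    All three are hypothetical callables supplied by the caller.
    """
    candidate = None
    feedback = None
    last_passing = None
    passes_in_a_row = 0
    for _ in range(max_iters):
        candidate = generate(candidate, feedback)
        passed, test_log = run_tests(candidate)
        if not passed:
            # Correctness comes first: reset the stability counter.
            passes_in_a_row = 0
            feedback = test_log
            continue
        last_passing = candidate
        passes_in_a_row += 1
        if passes_in_a_row >= stable_after:
            # Structure looks stable: now expose low-level performance data.
            feedback = profile(candidate)
        else:
            feedback = test_log
    return last_passing


if __name__ == "__main__":
    best = refine_kernel(
        generate=lambda prev, fb: (prev or 0) + 1,
        run_tests=lambda cand: (True, "all tests passed"),
        profile=lambda cand: "profile for candidate %d" % cand,
        max_iters=4,
    )
    print("last passing candidate:", best)
```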

2604.00230 2026-06-05 cs.LG

Neural Collapse Dynamics: Depth, Activation, Regularisation, and Feature Norm Threshold

神经坍缩动力学:深度、激活、正则化和特征范数阈值

Anamika Paul Rupa

Affiliation: Department of Electrical Engineering and Computer Science

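AI summary: This paper studies the dynamics of neural collapse, finding that collapse occurs when the feature norm reaches a specific critical value, and examines how depth, activation function, regularisation, and network width affect the process.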

AI Chinese abstract

神经坍缩(NC)——即最后层特征收敛到一个等角紧框架——在平衡状态下已被深入理解,但其发生过程的动力学仍不明确。我们发现一个简单且可预测的规律:当特征范数的均值达到模型-数据集特定的临界值fn*时会发生NC,该值对训练条件变化不敏感。该值在每个(模型,数据集)对中高度集中(CV < 8%);训练动态主要影响fn接近fn*的速度,而非其值本身。在标准训练轨迹中,fn低于fn*的交叉始终在NC发生之前,提供了一个具有平均提前时间62个周期(MAE 24个周期)的实用预测器。直接干预实验确认fn*是梯度流的稳定吸引子——特征尺度的扰动在训练过程中会自我校正,无论方向如何都会收敛到相同值(p>0.2)。完成(架构x数据集)网格揭示了本文最强的结果:ResNet-20在MNIST上给出fn* = 5.867——相对于CIFAR-10的+68%,架构效应增加了+458%。该网格强烈非加性;fn*不能分解为独立的架构和数据集贡献。四个结构性规律出现:(1)深度对坍缩速度有非单调影响;(2)激活函数共同决定坍缩速度和fn*;(3)权重衰减定义了一个三区域相图——太小会减慢,最佳范围最快,太大会阻止坍缩;(4)宽度单调加速坍缩,同时将fn*最多移动13%。这些结果确立了特征范数动态作为预测NC时间的可行诊断方法,表明范数阈值行为是深度网络中延迟表示再组织的通用机制。

Abstract

Neural collapse (NC) -- the convergence of penultimate-layer features to a simplex equiangular tight frame -- is well understood at equilibrium, but the dynamics governing its onset remain poorly characterised. We identify a simple and predictive regularity: NC occurs when the mean feature norm reaches a model-dataset-specific critical value, fn*, that is largely invariant to training conditions. This value concentrates tightly within each (model, dataset) pair (CV < 8%); training dynamics primarily affect the rate at which fn approaches fn*, rather than the value itself. In standard training trajectories, the crossing of fn below fn* consistently precedes NC onset, providing a practical predictor with a mean lead time of 62 epochs (MAE 24 epochs). A direct intervention experiment confirms fn* is a stable attractor of the gradient flow -- perturbations to feature scale are self-corrected during training, with convergence to the same value regardless of direction (p>0.2). Completing the (architecture)x(dataset) grid reveals the paper's strongest result: ResNet-20 on MNIST gives fn* = 5.867 -- a +458% architecture effect versus only +68% on CIFAR-10. The grid is strongly non-additive; fn* cannot be decomposed into independent architecture and dataset contributions. Four structural regularities emerge: (1) depth has a non-monotonic effect on collapse speed; (2) activation jointly determines both collapse speed and fn*; (3) weight decay defines a three-regime phase diagram -- too little slows, an optimal range is fastest, and too much prevents collapse; (4) width monotonically accelerates collapse while shifting fn* by at most 13%. These results establish feature-norm dynamics as an actionable diagnostic for predicting NC timing, suggesting that norm-threshold behaviour is a general mechanism underlying delayed representational reorganisation in deep networks.
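
The diagnostic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the abstract does not say exactly how fn is measured, so the code assumes it is the mean L2 norm of the penultimate-layer feature vectors, and the function names are hypothetical. The crossing direction (fn dropping below fn*) follows the abstract's wording.

```python
import math
from statistics import mean, pstdev
from typing import Optional, Sequence


def mean_feature_norm(features: Sequence[Sequence[float]]) -> float:
    """Mean L2 norm over a batch of feature vectors (assumed definition of fn)."""
    return mean(math.sqrt(sum(x * x for x in vec)) for vec in features)


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Population standard deviation divided by the mean, e.g. of fn* across runs."""
    return pstdev(values) / mean(values)


def first_crossing_below(fn_per_epoch: Sequence[float], fn_star: float) -> Optional[int]:
    """Index of the first epoch at which fn drops below fn_star, or None if it never does."""
    for epoch, fn in enumerate(fn_per_epoch):
        if fn < fn_star:
            return epoch
    return None
```

For a made-up trajectory `[9.0, 7.0, 5.5, 5.0]` and the reported `fn_star = 5.867`, `first_crossing_below` returns epoch 2. The abstract's predictor would then expect NC onset some tens of epochs later.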

2602.19190 2026-06-05 cs.CV cs.AI

FUSAR-GPT : A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang

Affiliations: Fudan University; Discipline and Technology Center of Microwave Vision Intelligent Sensing, Fudan University; Shanghai Innovation Institute

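AI summary: This paper proposes FUSAR-GPT, a visual language model designed specifically for synthetic aperture radar imagery, which achieves state-of-the-art performance on several remote sensing vision-language benchmarks through spatiotemporal feature embedding and a two-stage decoupled training approach.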

Abstract

Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 10%.

2603.26233 2026-06-05 cs.CL

Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

Nicholas Edwards, Sebastian Schuster

Affiliations: Faculty of Computer Science, University of Vienna; UniVie Doctoral School Computer Science, University of Vienna

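AI summary: This study evaluates the clarification-seeking abilities of LLM agents on underspecified tasks, proposes an uncertainty-aware multi-agent framework that improves the task resolve rate, and shows that the multi-agent system proactively seeks information on complex tasks.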

Comments
18 pages, 7 figures; added experiments evaluating open-weight models (Kimi K2.6), expanded related work, and included dataset validation details

Abstract

As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that decouples underspecification detection from code execution. Across both proprietary and open-weight frontier LLMs, our scaffold achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated information-seeking behavior, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.

2601.12983 2026-06-05 cs.CL

ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation

Jesus-German Ortiz-Barajas, Jonathan Tonglet, Vivek Gupta, Iryna Gurevych

Affiliations: INSAIT, Sofia University "St. Kliment Ohridski"; Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, TU Darmstadt and National Research Center for Applied Cybersecurity ATHENE; Arizona State University

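AI summary: This paper proposes ChartAttack, a framework for evaluating the ability of multimodal large language models to generate misleading charts by injecting misleading elements that induce incorrect interpretations, and introduces the AttackViz dataset for evaluating and improving model robustness.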

Abstract

Multimodal large language models (MLLMs) are increasingly used to automate chart generation from data tables, improving analysis and reporting efficiency while introducing new misuse risks. We present ChartAttack, a framework for evaluating how MLLMs can generate misleading charts at scale by injecting misleaders into chart designs to induce incorrect interpretations. We also introduce AttackViz, a chart question-answering (QA) dataset where each (chart specification, QA) pair is labeled with effective misleaders and their induced incorrect answers. ChartAttack significantly degrades QA performance, reducing MLLM accuracy by 17.2 points in-domain and 11.9 cross-domain. A controlled human study shows that misleading charts generated by ChartAttack reduce human chart QA performance. Finally, we demonstrate that AttackViz can be used to fine-tune MLLMs to improve robustness against misleading charts. Our findings highlight an urgent need for robustness and security considerations in the design, evaluation, and deployment of MLLM-based chart generation systems. We make our code and data publicly available.

2603.19312 2026-06-05 cs.LG cs.AI

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero

Affiliations: Mila & Université de Montréal; New York University; Samsung SAIL; Brown University

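AI summary: This paper proposes LeWorldModel, a joint-embedding predictive architecture that trains stably end-to-end from raw pixels using only two loss terms. It substantially reduces the number of tunable loss hyperparameters, performs well on a range of 2D and 3D control tasks, and is shown to encode physical structure and detect physically implausible events.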

Abstract

Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
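
The two-term objective described above might be sketched as follows. The abstract names the terms but not their form, so everything here is an assumption: the prediction term is taken to be a mean squared error, and the Gaussian regularizer is replaced by a simple per-dimension moment-matching penalty that is a stand-in, not the paper's regularizer. The single weight `lam` is also hypothetical; it only mirrors the abstract's statement that one loss hyperparameter remains.

```python
from statistics import mean, pvariance
from typing import Sequence

Vector = Sequence[float]


def prediction_loss(predicted: Sequence[Vector], target: Sequence[Vector]) -> float:
    """Mean squared error between predicted and actual next embeddings."""
    return mean(
        (p - t) ** 2
        for pred_vec, tgt_vec in zip(predicted, target)
        for p, t in zip(pred_vec, tgt_vec)
    )


def gaussian_moment_penalty(embeddings: Sequence[Vector]) -> float:
    """Stand-in regularizer: push each latent dimension toward mean 0, variance 1.

    It matches only the first two per-dimension moments of a standard normal
    and is NOT the regularizer used in the paper.
    """
    dims = list(zip(*embeddings))  # transpose: one tuple of values per dimension
    return mean(mean(d) ** 2 + (pvariance(d) - 1.0) ** 2 for d in dims)


def total_loss(
    predicted: Sequence[Vector],
    target: Sequence[Vector],
    embeddings: Sequence[Vector],
    lam: float = 1.0,
) -> float:
    """Two-term objective: prediction error plus a weighted Gaussian penalty."""
    return prediction_loss(predicted, target) + lam * gaussian_moment_penalty(embeddings)
```

In this sketch a fully collapsed batch (every embedding identical) has zero variance in each dimension, so the penalty is non-zero. That is how a regularizer of this kind discourages representation collapse without an exponential moving average or a pre-trained encoder.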