arXivDaily: arXiv daily academic digest, updated Monday through Friday
2509.24189 2026-06-09 cs.CL Updated version

SPECTRA: Revealing the Full Spectrum of User Preferences via Distributional LLM Inference

Luyang Zhang, Jialu Wang, Shichao Zhu, Beibei Li, Zhongcun Wang, Guangmou Pan, Yang Song

Affiliations: Carnegie Mellon University; TikTok Inc.

AI summary: Proposes SPECTRA, which treats a fine-tuned LLM as an implicit probabilistic model and probes its softmax layer to infer a probability distribution over user preferences, recovering long-tail preferences and improving fairness.

Abstract

Large Language Models (LLMs) are increasingly used to model user preferences, with the typical output as a directly-generated ranked item list per user. However, this generative paradigm inherits the bias and opacity of autoregressive decoding. It over-emphasizes frequent (head) preferences and suppresses minority, long-tail ones. To address this, we propose SPECTRA (Softmax Probing for Extracted Category-level Token Readouts and Analysis), which treats the finetuned LLM as an implicit probabilistic model and probes its softmax to infer a probability distribution over semantically interpretable preference categories. We evaluate SPECTRA on MovieLens, Yelp, and a large-scale short-video platform. SPECTRA delivers (i) distributional alignment, reducing Jensen-Shannon divergence to the empirical preference distribution by 38 to 44 percent across public datasets; (ii) long-tail recovery with cross-user fairness, raising top-3 category exposure entropy by 23 percent on MovieLens and producing a larger gain on tail-preference users than on head-preference users; and (iii) downstream application value, with a 41 to 46 percent category-NDCG boost on MovieLens and Yelp, and a 7x improvement on long-tail category ranking on a large-scale deployment against a head-optimized production ranker.
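
The softmax-probing idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the next-token probabilities of the fine-tuned model are already available as a plain dictionary, and the function names and the token-to-category mapping are hypothetical. It also shows the Jensen-Shannon divergence (base 2) used to compare a predicted category distribution with an empirical one.

```python
import math


def category_distribution(token_probs, token_to_category):
    """Sum token probabilities per category and renormalize.

    token_probs: dict mapping token -> probability (assumed softmax output).
    token_to_category: hypothetical dict mapping category tokens -> category name.
    Tokens that do not map to a category are ignored.
    """
    totals = {}
    for token, p in token_probs.items():
        category = token_to_category.get(token)
        if category is not None:
            totals[category] = totals.get(category, 0.0) + p
    mass = sum(totals.values())
    if mass <= 0.0:
        raise ValueError("no probability mass on category tokens")
    return {c: v / mass for c, v in totals.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two probability dicts."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || mix); terms with zero probability contribute nothing.
        return sum(
            a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0.0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


if __name__ == "__main__":
    probs = {"Drama": 0.5, "Comedy": 0.2, "Horror": 0.1, "the": 0.2}
    mapping = {"Drama": "drama", "Comedy": "comedy", "Horror": "horror"}
    predicted = category_distribution(probs, mapping)
    empirical = {"drama": 0.4, "comedy": 0.3, "horror": 0.3}
    print(predicted)
    print(js_divergence(predicted, empirical))
```

A lower divergence means the probed distribution sits closer to the user's empirical preferences; how the paper selects category tokens and prompts is not specified here.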

2601.04805 2026-06-09 cs.AI Updated version

Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

Siyuan Gan, Jiaheng Liu, Boyan Wang, Tianpei Yang, Runqing Miao, Yuyao Zhang, Fanyu Meng, Junlan Feng, Linjian Meng, Jing Huo, Yang Gao

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Jiutian Research, Beijing, China

AI summary: Addresses reward hacking when training hybrid reasoning models by proposing Thinking-Based Non-Thinking, which uses information from the solution part of thinking responses to set query-specific maximum token limits for non-thinking responses; on mathematical benchmarks it cuts token usage by about 50% while improving accuracy.

Abstract

Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking based on the complexity of the query. Unfortunately, RL suffers from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT; instead, it sets a different maximum token usage for the non-thinking responses to each query by leveraging information from the solution component of the thinking responses. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the optimal trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking in TNT's responses that are classified as not using thinking remains below 10% across all tested datasets.

2601.04498 2026-06-09 cs.LG cs.CV Updated version

IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation

Yinghao Tang, Xueding Liu, Boyuan Zhang, Tingfeng Lan, Yupeng Xie, Jiale Lao, Yiyao Wang, Haoxuan Li, Tingting Gao, Bo Pan, Luoxuan Weng, Xiuqi Huang, Minfeng Zhu, Yingchaojie Feng, Yuyu Luo, Wei Chen

Affiliations: State Key Lab of CAD&CG, Zhejiang University; UESTC; University of Virginia; HKUST(GZ); Cornell University; Zhejiang University; National University of Singapore

AI summary: Introduces the IGENBENCH benchmark, with 30 infographic types and 600 test cases; reliability checks are decomposed into atomic questions of 10 types and verified by multimodal LLMs to evaluate 10 T2I models, showing that data-related dimensions such as data completeness are universal bottlenecks.

Abstract

Infographics are composite visual artifacts that combine data visualizations with textual and illustrative elements to communicate information. While recent text-to-image (T2I) models can generate aesthetically appealing images, their reliability in generating infographics remains unclear. Generated infographics may appear correct at first glance but contain easily overlooked issues, such as distorted data encoding or incorrect textual content. We present IGENBENCH, the first benchmark for evaluating the reliability of text-to-infographic generation, comprising 600 curated test cases spanning 30 infographic types. We design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on a taxonomy of 10 question types. We employ multimodal large language models (MLLMs) to verify each question, yielding question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC). We comprehensively evaluate 10 state-of-the-art T2I models on IGENBENCH. Our systematic analysis reveals key insights for future model development: (i) a three-tier performance hierarchy with the top model achieving Q-ACC of 0.90 but I-ACC of only 0.49; (ii) data-related dimensions emerging as universal bottlenecks (e.g., Data Completeness: 0.21); and (iii) the challenge of achieving end-to-end correctness across all models. We release IGENBENCH at https://igen-bench.vercel.app/.
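
The gap between question-level and infographic-level accuracy reported above (0.90 versus 0.49) is easier to see with a small sketch. The abstract does not give the exact formulas, so this assumes that Q-ACC is the fraction of all yes/no questions answered correctly and that I-ACC counts an infographic as correct only when every one of its questions passes; the function names are hypothetical.

```python
def q_acc(results):
    """Question-level accuracy: share of passed questions over all questions.

    results: list of infographics, each a list of booleans (one per question).
    """
    answers = [ok for case in results for ok in case]
    return sum(answers) / len(answers)


def i_acc(results):
    """Infographic-level accuracy: share of infographics with no failed question."""
    return sum(all(case) for case in results) / len(results)


if __name__ == "__main__":
    # Three generated infographics; the second fails one of its checks.
    results = [
        [True, True, True],
        [True, False, True],
        [True, True],
    ]
    print(q_acc(results), i_acc(results))
```

Under these assumptions a single failed check sinks a whole infographic, so I-ACC falls much faster than Q-ACC as the number of questions per infographic grows.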

2601.03256 2026-06-09 cs.CV Updated version

Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training

Hexiao Lu, Xiaokun Sun, Zeyu Cai, Hao Guo, Ying Tai, Jian Yang, Zhenyu Zhang

Affiliations: Nanjing University; China Agricultural University

AI summary: Proposes Muses, the first training-free feed-forward method for generating fantasy 3D creatures; it uses a 3D skeleton for structure-aware design, composition, and generation, and achieves state-of-the-art visual fidelity and text alignment.

Comments Project page: https://luhexiao.github.io/Muses.github.io/

Abstract

We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation. In contrast, Muses leverages the 3D skeleton, a fundamental representation of biological forms, to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape. Extensive experiments establish Muses' state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, and show its potential for flexible 3D object editing. Project page: https://luhexiao.github.io/Muses.github.io/.

2601.02880 2026-06-09 cs.AI cs.CL Updated version

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

Abhishek HS, Pavan C Shekar, Arpit Jain, Ashwanth Krishnan

Affiliations: QpiAI

AI summary: ReTreVal enables cross-problem learning in LLMs without fine-tuning through adaptive tree exploration, tool-augmented node refinement, typed-failure backtracking, and a self-rewriting memory; it reaches 85.8% pass@1 on MATH-500 and 54.4% accuracy on MMLU-Pro.

Comments 15 pages, 1 figure, 12 tables

Abstract

Every existing inference-time reasoning framework discards all failure context at problem boundaries, leaving a model solving problem 500 no wiser than it was on problem 1. We present ReTreVal (Reasoning Tree with Validation), a training-free framework that closes this gap through adaptive tree exploration with tool-augmented node refinement, typed-failure backtracking that injects categorized error context into the recovered branch, and a self-rewriting memory that accumulates and revises strategy entries across problems, enabling inference-time cross-problem learning on any fixed, unmodified LLM without fine-tuning. ReTreVal achieves 85.8% pass@1 on MATH-500 (+8.6 pp over Zero-Shot CoT, +8.6 pp over the strongest baseline Self-Refine) and 54.4% on MMLU-Pro (+15.3 pp over Self-Refine), with a 3.4:1 win-to-regression ratio confirming genuine error recovery rather than noise. These capabilities, previously requiring gradient updates, allow a 32B model to compete with much larger single-pass systems.

2601.01665 2026-06-09 cs.LG cs.AI Updated version

Adversarial Instance Generation and Robust Training for Neural Combinatorial Optimization with Multiple Objectives

Wei Liu, Yaoxin Wu, Yingqian Zhang, Thomas Bäck, Yingjie Fan

Affiliations: LIACS, Leiden University, Leiden, The Netherlands; Eindhoven University of Technology, Eindhoven, The Netherlands

AI summary: Proposes a robustness framework for preference-conditioned deep reinforcement learning solvers of multi-objective combinatorial optimization problems: a preference-based adversarial attack generates hard instances and quantifies their impact, and adversarial training with hardness-aware preference selection improves generalization; both attack and defense are validated on MOTSP, MOCVRP, and MOKP.

Abstract

Deep reinforcement learning (DRL) has shown great promise in addressing multi-objective combinatorial optimization problems (MOCOPs). Nevertheless, the robustness of these learning-based solvers has remained insufficiently explored, especially across diverse and complex problem distributions. In this paper, we propose a unified robustness-oriented framework for preference-conditioned DRL solvers for MOCOPs. Within this framework, we develop a preference-based adversarial attack to generate hard instances that expose solver weaknesses, and quantify the attack impact by the resulting degradation on Pareto-front quality. We further introduce a defense strategy that integrates hardness-aware preference selection into adversarial training to reduce overfitting to restricted preference regions and improve out-of-distribution performance. The experimental results on multi-objective traveling salesman problem (MOTSP), multi-objective capacitated vehicle routing problem (MOCVRP), and multi-objective knapsack problem (MOKP) verify that our attack method successfully learns hard instances for different solvers. Furthermore, our defense method significantly strengthens the robustness and generalizability of neural solvers, delivering superior performance on hard or out-of-distribution instances.

2410.00713 2026-06-09 cs.CV Updated version

RAD: A Dataset and Benchmark for Real-Life Anomaly Detection with Robotic Observations

Kaichen Zhou, Xinhai Chang, Taewhan Kim, Jiadong Zhang, Yang Cao, Chufei Peng, Fangneng Zhan, Hao Zhao, Hao Dong, Kai Ming Ting, Ye Zhu

Affiliations: Massachusetts Institute of Technology; Peking University; Carnegie Mellon University; Great Bay University; Harvard University; Tsinghua University; Nanjing University; Deakin University

AI summary: Introduces the RAD dataset, covering 13 everyday object categories and 4 defect types captured from more than 60 robot viewpoints under uncontrolled lighting, to evaluate 2D feature-based, 3D reconstruction, and vision-language-model methods on pose-agnostic anomaly detection; 2D methods outperform 3D and VLM-based ones.

Abstract

Anomaly detection is a core capability for robotic perception and industrial inspection, yet most existing benchmarks are collected under controlled conditions with fixed viewpoints and stable illumination, failing to reflect real deployment scenarios. We introduce RAD (Realistic Anomaly Detection), a robot-captured, multi-view dataset designed to stress pose variation, reflective materials, and viewpoint-dependent defect visibility. RAD covers 13 everyday object categories and four realistic defect types--scratched, missing, stained, and squeezed--captured from over 60 robot viewpoints per object under uncontrolled lighting. We benchmark a wide range of state-of-the-art approaches, including 2D feature-based methods, 3D reconstruction pipelines, and vision-language models (VLMs), under a pose-agnostic setting. Surprisingly, we find that mature 2D feature-embedding methods consistently outperform recent 3D and VLM-based approaches at the image level, while the performance gap narrows for pixel-level localization. Our analysis reveals that reflective surfaces, geometric symmetry, and sparse viewpoint coverage fundamentally limit current geometry-based and zero-shot methods. RAD establishes a challenging and realistic benchmark for robotic anomaly detection, highlighting critical open problems beyond controlled laboratory settings.

2512.12225 2026-06-09 cs.AI Updated version

A Geometric Theory of Cognition for Machine Intelligence

Laha Ale

Affiliations: School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China

AI summary: Proposes a framework based on gradient flow on a Riemannian manifold that unifies representation, memory, adaptation, and prediction; on partially observable reinforcement learning tasks it outperforms feedforward baselines and matches the robustness of recurrent architectures.

Abstract

Developing artificial agents that unify representation, memory, adaptation, and prediction remains a fundamental challenge in artificial intelligence. Here we introduce a geometric framework in which cognitive computation emerges from Riemannian gradient flow on a learned latent manifold. The learned metric encodes representational constraints and computational preferences, while anisotropies in the geometry naturally generate multiple timescales of behaviour, yielding both rapid reactive responses and slower adaptive dynamics without explicit memory modules or recurrent mechanisms. We instantiate this framework through Riemannian representation and dynamics models and evaluate them in partially observable reinforcement-learning environments. Across observation masking, sensory blackouts, dynamics perturbations, and predictive latent-modelling tasks, the proposed approach consistently outperforms feedforward baselines, achieves robustness comparable to recurrent architectures, and produces highly predictable latent trajectories with low long-horizon rollout error. These results suggest that learned latent geometry can serve simultaneously as a substrate for representation, memory, adaptation, and prediction. More broadly, the framework provides a principled connection between dynamical systems, representation learning, and world-model-based intelligence.

2410.05662 2026-06-09 cs.LG Updated version

Communication-Efficient Federated Learning under Dynamic Device Arrival and Departure: Convergence Analysis and Algorithm Design

Zhan-Lun Chang, Dong-Jun Han, Seyyedali Hosseinalipour, Mung Chiang, Christopher G. Brinton

Affiliations: Elmore Family School of Electrical and Computer Engineering, Purdue University; Department of Computer Science and Engineering, Yonsei University; Department of Electrical Engineering, University at Buffalo–SUNY

AI summary: For federated learning in which devices dynamically join and leave, proposes a gradient-similarity-based model initialization algorithm that takes a weighted average of past global models to speed recovery from distribution shift, accelerating convergence by an order of magnitude or more.

Abstract

Most federated learning (FL) approaches assume a fixed device set. However, real-world scenarios often involve devices dynamically joining or leaving the system, driven by, e.g., user mobility patterns or handovers across cell boundaries. This dynamic setting introduces unique challenges: (1) the optimization objective evolves with the active device set, unlike traditional FL's static objective; and (2) the current global model may no longer serve as an effective initialization for subsequent rounds, potentially hindering adaptation, delaying convergence, and reducing resource efficiency. To address these challenges, we first provide a convergence analysis for FL under a dynamic device set, accounting for factors such as gradient noise, local training iterations, and data heterogeneity in this practical setting. Motivated by this analysis, we propose a model initialization algorithm that enables rapid adaptation whenever devices join or leave the network. Our key idea is to compute a weighted average of previous global models, guided by gradient similarity, to prioritize models trained on data distributions that closely align with the current device set, thereby accelerating recovery from distribution shifts in fewer training rounds. This plug-and-play algorithm is designed to integrate seamlessly with existing FL methods, offering broad applicability. Experiments demonstrate that our approach achieves convergence speedups typically an order of magnitude or more compared to baselines, which we show drastically reduces energy consumption to reach a target accuracy.
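
The initialization idea in this abstract, a weighted average of previous global models guided by gradient similarity, can be sketched as below. The abstract does not specify how similarity is measured or turned into weights, so this sketch makes assumptions that are not from the paper: each stored model comes with a reference gradient vector, similarity is cosine similarity to a gradient estimated on the current device set, and weights come from a softmax with a hypothetical `temperature` parameter.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def init_from_history(models, grads, current_grad, temperature=0.1):
    """Blend past global models, favoring those whose gradient matches now.

    models: list of parameter vectors (past global models).
    grads: list of reference gradients, one per past model (assumed stored).
    current_grad: gradient estimated on the current device set (assumed).
    Returns (initial parameters, weights).
    """
    sims = [cosine(g, current_grad) for g in grads]
    top = max(sims)
    # Softmax over similarities; subtracting the max keeps exp() stable.
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(models[0])
    init = [sum(w * m[i] for w, m in zip(weights, models)) for i in range(dim)]
    return init, weights


if __name__ == "__main__":
    past_models = [[0.0, 0.0], [1.0, 1.0]]
    past_grads = [[1.0, 0.0], [0.0, 1.0]]
    init, weights = init_from_history(past_models, past_grads, [0.0, 1.0])
    print(init, weights)
```

With a small temperature the blend approaches picking the single best-matching past model; a larger temperature spreads weight more evenly across the history.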

2512.20845 2026-06-09 cs.AI cs.MA Updated version

MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs

Onat Ozer, Yuchen Wang, Grace Wu, Daniel Dosti, Honghao Zhang, Vivi De La Rue

Affiliations: University of California, Berkeley

AI summary: Proposes a multi-agent reflection framework in which multi-persona debate generates diverse reflections, addressing the degeneration of thought seen in single-model reflection; it reaches 47% EM on HotPot QA and 82.7% accuracy on HumanEval.

Abstract

LLMs have shown the capacity to improve their performance on reasoning tasks by reflecting on their mistakes and acting with these reflections in mind. However, continual reflection of the same LLM on itself exhibits degeneration of thought, where the LLM keeps repeating the same errors even with the knowledge that it is wrong. To address this problem, we instead introduce a multi-agent setup with multi-persona debaters as the method to generate reflections. Through extensive experimentation, we found that this leads to better diversity in the reflections generated by the LLM agent. We demonstrate an accuracy of 47% EM on HotPot QA (question answering) and 82.7% on HumanEval (programming), both surpassing reflection with a single LLM.

2509.10534 2026-06-09 cs.LG cs.AI cs.CL Updated version

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer

Affiliations: DeepMind, London, UK

AI summary: Proposes Polar Coordinate Position Embeddings (PoPE) to decouple content and position in Transformer attention; PoPE outperforms RoPE on diagnostic tasks, sequence modeling, and language modeling, and shows zero-shot length extrapolation.

Comments ICML 2026 camera-ready version

Abstract

The attention mechanism in a Transformer architecture matches keys to queries based on both content -- the what -- and position in a sequence -- the where. We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance, particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities compared not only to RoPE but even to YaRN, a method designed for extrapolation that requires additional fine-tuning and frequency interpolation.

2506.17231 2026-06-09 cs.CL cs.CR Updated version

Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

Xiang Li, Chong Zhang, Jia Wang, Fangyu Wu, Yushi Li, Xiaobo Jin

Affiliations: Xi’an Jiaotong-Liverpool University; The Chinese University of Hong Kong; University of Liverpool

AI summary: Proposes the Adversarial Prompt Distillation (APD) framework, which transfers the jailbreaking capability of LLMs to SLMs for efficient, low-resource attacks, reaching a 96.4% attack success rate on GPT-4 with 3.7x faster prompt generation and 11.3x fewer parameters.

Comments 24 pages, 3 figures

Abstract

Current jailbreak attacks on large language models (LLMs) predominantly rely on LLMs themselves to generate adversarial prompts, creating a critical efficiency bottleneck: each attack requires substantial computational resources and API queries, limiting scalability and practical deployment. To overcome this limitation, we propose Adversarial Prompt Distillation (APD), a novel framework that transfers jailbreaking capabilities from LLMs to small language models (SLMs) for efficient, low-resource attacks. APD integrates three key components: (1) masked adversarial knowledge pre-training via LoRA fine-tuning, (2) dynamic temperature-controlled knowledge distillation to bridge architectural gaps, and (3) reinforcement learning-based template optimization for adaptive refinement. Extensive experiments across 12 models show that APD achieves state-of-the-art attack success rates (e.g., 96.4% ASR_k on GPT-4) while dramatically improving efficiency - generating prompts 3.7x faster with 11.3x fewer parameters than teacher models. Our work establishes the first practical framework for lightweight jailbreak attacks, exposes new vulnerabilities in LLM defenses, and provides a scalable testbed for advancing AI safety research. Our code is available at: https://github.com/lxgem/Efficient_and_Stealthy_Jailbreak_Attacks_via_Adversarial_Prompt.

2512.08499 2026-06-09 cs.LG cs.AI Updated version

Developing Distance-Aware Physics-Constrained Probabilistic Frameworks for Industrial Prognostics

Waleed Razzaq, Yun-Bo Zhao

Affiliations: University of Science and Technology China

AI summary: Proposes two sampling-free, distance-aware, physics-constrained probabilistic frameworks, PC-SNGP and PC-SNER, which use spectral normalization and a dynamic weighting strategy to balance data fidelity and physical consistency, improving accuracy and uncertainty calibration in bearing prognostics.

Abstract

The development of reliable and physically interpretable probabilistic frameworks for industrial prognostics remains nascent, and existing methods are often insensitive to inputs moving away from the training manifold. In this paper, we develop two sampling-free, distance-aware physics-constrained probabilistic frameworks: (i) PC-SNGP and (ii) PC-SNER. Both apply spectral normalization to hidden layer weights, enforcing a bi-Lipschitz, distance-preserving representation from the input to the latent space. PC-SNGP replaces the dense output layer with a Gaussian process whose posterior variance increases with input distance from the training manifold. PC-SNER modifies the output layer to predict Normal-Inverse-Gamma (NIG) parameters for distance-preserving estimation. To maintain balance between data fidelity and physical consistency during training, we introduce a dynamic weighting strategy for the physics-constrained loss. We also introduce a distance-aware coefficient (DAC) metric to quantify sensitivity to distributional shifts. Empirically, we validate both frameworks on rolling-element bearing (REB) prognostics using the PRONOSTIA, XJTU-SY, and HUST benchmark datasets. Experimental results demonstrate improved prediction accuracy and well-calibrated uncertainty estimates relative to competing baselines, while maintaining auditable performance in cross-validation and robustness under extreme adversarial perturbations.

2512.16349 2026-06-09 cs.CV cs.AI Updated version

Collaborative Edge-to-Server Inference for Vision-Language Models

Soochang Song, Yongjune Kim

Affiliations: Department of Electrical Engineering, Pohang University of Science and Technology (POSTECH)

AI summary: Proposes a collaborative edge-to-server inference framework whose two-stage selective retransmission strategy lowers communication cost while preserving the inference accuracy of vision-language models.

Comments 12 pages, 15 figures, 3 tables

Abstract

We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, transmitting full-resolution images incurs high communication cost. Conversely, aggressive downsizing or excessive compression to mitigate communication overhead can discard fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a communication-efficient two-stage framework. In the first stage, the server performs inference on the downsized thumbnail (global image) and quantifies the min-entropy of the output tokens. If the min-entropy exceeds a predefined threshold, the server identifies a region of interest (RoI) using the VLM's internal attention and requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is additionally transmitted. Experimental results consistently confirm that the proposed framework substantially reduces communication overhead while maintaining inference accuracy across diverse VQA benchmarks.
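
The first-stage decision described above can be sketched with the standard definition of min-entropy, H_min(p) = -log2(max_i p_i). The abstract does not say how per-token values are combined or what threshold is used, so the aggregation (taking the least confident output token) and the `threshold` default below are assumptions, and the function names are hypothetical.

```python
import math


def min_entropy(probs):
    """Min-entropy in bits of one token distribution: -log2 of its largest probability."""
    top = max(probs)
    if top <= 0.0:
        raise ValueError("distribution has no positive probability")
    return -math.log2(top)


def needs_retransmission(token_dists, threshold=1.0):
    """Decide whether the server should request a detail-preserved RoI image.

    token_dists: one probability distribution per generated output token,
    obtained from inference on the downsized thumbnail.
    Assumption: the request is triggered by the least confident token.
    """
    worst = max(min_entropy(p) for p in token_dists)
    return worst > threshold


if __name__ == "__main__":
    confident = [[0.9, 0.1], [0.8, 0.1, 0.1]]
    uncertain = [[0.9, 0.1], [0.25, 0.25, 0.25, 0.25]]
    print(needs_retransmission(confident), needs_retransmission(uncertain))
```

Only when the gate fires does the edge device send the extra local image, which is what keeps the average communication cost low.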

2512.15116 2026-06-09 cs.LG cs.AI Updated version

FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation

Runze Li, Hanchen Wang, Wenjie Zhang, Binghao Li, Yu Zhang, Xuemin Lin, Ying Zhang

Affiliations: Anonymous

AI summary: Proposes the FADTI diffusion framework, which injects a frequency-domain inductive bias through a learnable Fourier Bias Projection module and models temporal structure with self-attention and gated convolution; it outperforms existing methods on multiple benchmarks, especially at high missing rates.

Comments This work has been submitted to the IEEE for possible publication. 10 pages, 7 figures

Abstract

Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at https://anonymous.4open.science/r/TimeSeriesImputation-52BF

2512.12320 2026-06-09 cs.RO updated version

Programmable Deformation Design of Porous Soft Actuator through Volumetric-Pattern-Induced Anisotropy

Canqi Meng, Weibang Bai

Affiliations * ShanghaiTech Automation and Robotics (STAR) Center, School of Information Science and Technology, ShanghaiTech University

AI summary Proposes a method that achieves programmable deformation in soft actuators by incising patterns into porous foam, investigates the mechanism with finite element analysis, demonstrates bending, tilting, and twisting experimentally, and applies the approach to a bio-inspired soft hand.

Comments Accepted to 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)

Abstract

Conventional soft pneumatic actuators, typically based on hollow elastomeric chambers, often suffer from limited structural support and require costly geometry-specific redesigns for multimodal functionality. Porous materials such as foam, filled into the chambers, can provide structural stability for the actuators. However, methods to achieve programmable deformation by tailoring the porous body itself remain underexplored. In this paper, a novel design method is presented to realize soft porous actuators with programmable deformation by incising specific patterns into the porous foam body. This approach introduces localized structural anisotropy in the foam, which guides the material's deformation under a global vacuum input. Furthermore, three fundamental patterns on a cylindrical foam substrate are discussed: transverse for bending, longitudinal for tilting, and diagonal for twisting. A computational model is built with Finite Element Analysis (FEA) to investigate the mechanism of the incision-patterning method. Experiments demonstrate that, with a potentially optimal choice of the pattern array number N, actuators can achieve bending up to $80^{\circ}$ (N=2), tilting of $18^{\circ}$ (N=1), and twisting of $115^{\circ}$ (N=8). The versatility of our approach is demonstrated via pattern transferability, scalability, and mold-less rapid prototyping of complex designs. As a comprehensive application, we translate the human hand crease map into a functional incision pattern, creating a bio-inspired soft robot hand capable of human-like adaptive grasping. Our work provides a new, efficient, and scalable paradigm for the design of multi-functional soft porous robots.

2512.07998 2026-06-09 cs.RO cs.CV updated version

DIJIT: A Robotic Head for an Active Observer

Mostafa Kamali Tabrizi, Mingshi Chi, Bir Bikram Dey, Kelly Yuan, Markus D. Solbach, Yiqian Liu, Michael Jenkin, John K. Tsotsos

Affiliations * Department of Electrical Engineering and Computer Science, York University

AI summary Presents DIJIT, a binocular robotic head with 9 mechanical and 4 optical degrees of freedom that reproduces human-like eye and head motion for active vision research, with saccade accuracy close to that of humans.

Journal ref IEEE Robotics and Automation Letters, Vol. 11, No. 6, pp. 7038-7045, June 2026

Abstract

We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT's unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT's design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. DIJIT attains 85% of the peak human saccade speed. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. Here, we present DIJIT and some aspects of its performance. We also present a novel method for saccadic camera movements, using a direct relationship between camera orientation and motor values. The resulting saccadic camera movements are close to human movements in terms of their accuracy, with $1.17^{\circ}$ and $1.14^{\circ}$ mean error for the left and right cameras, respectively.

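2512.07355 2026-06-09 cs.AI cs.CV cs.LG updated version

A Geometric Unification of Concept Learning with Concept Cones

Alexandre Rocchi, Thomas Fel, Gianni Franchi

Affiliations * AMIAD Kempner Institute, Harvard University

AI summary Unifies supervised Concept Bottleneck Models and unsupervised Sparse Autoencoders through a shared geometric framework (concept cones), proposes containment metrics for evaluating concept alignment, and identifies a sweet spot in sparsity and expansion factor.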

Comments 33 pages

Abstract

Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases -- such as SAE type, sparsity, or expansion ratio -- to the emergence of plausible concepts. (We adopt the terminology of Jacovi and Goldberg (2020), who distinguish between faithful explanations, which accurately reflect model computations, and plausible explanations, which align with human intuition and domain knowledge. CBM concepts are plausible by construction, being selected or annotated by humans, though not necessarily faithful to the true latent factors that organise the data manifold.) Using these metrics, we uncover a "sweet spot" in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concepts align with plausible human concepts.
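
To make the notion of cone containment concrete, the sketch below measures how far a single reference direction lies from the cone spanned by a set of learned directions. It is one plausible way to quantify containment, not the metric defined in the paper: the distance is found by nonnegative least squares, solved here with plain projected gradient descent. The function name, the iteration count, and the toy vectors are all assumptions made for illustration.

```python
import math

def cone_distance(directions, v, iters=500):
    """Distance from v to the cone {sum_k c_k d_k : c_k >= 0}.

    Solves min_c 0.5 * ||sum_k c_k d_k - v||^2 subject to c >= 0 with
    projected gradient descent. Assumes at least one non-zero direction.
    """
    k = len(directions)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    gram = [[dot(di, dj) for dj in directions] for di in directions]
    dv = [dot(d, v) for d in directions]
    # 1 / trace(Gram) never exceeds 1 / (largest eigenvalue), so it is a safe step.
    step = 1.0 / sum(gram[i][i] for i in range(k))
    c = [0.0] * k
    for _ in range(iters):
        grad = [dot(gram[i], c) - dv[i] for i in range(k)]
        c = [max(0.0, ci - step * gi) for ci, gi in zip(c, grad)]  # project onto c >= 0
    recon = [sum(c[i] * directions[i][j] for i in range(k)) for j in range(len(v))]
    return math.sqrt(sum((r - x) ** 2 for r, x in zip(recon, v)))

# Hypothetical learned directions spanning the positive quadrant of the plane.
learned = [[1.0, 0.0], [0.0, 1.0]]
inside = cone_distance(learned, [1.0, 1.0])    # reference direction inside the cone
outside = cone_distance(learned, [1.0, -1.0])  # reference direction outside the cone
print(round(inside, 6), round(outside, 6))
```

A distance near zero means the reference direction is a nonnegative combination of the learned ones; averaging such distances over all reference directions would give one crude containment score.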

2512.03606 2026-06-09 cs.LG updated version

Observation-driven correction of numerical weather prediction for marine winds

Matteo Peduto, Qidong Yang, Jonathan Giezendanner, Devis Tuia, Sherrie Wang

Affiliations * arXiv

AI summary Proposes ORCA, a transformer-based model that fuses sparse, heterogeneous marine observations to correct GFS wind forecasts in real time, reducing error by 13% to 45% at lead times up to 48 hours.

Abstract

Accurate marine wind forecasts are essential for safe navigation, ship routing, and energy operations, yet they remain challenging because observations over the ocean are sparse, heterogeneous, and temporally variable. We present an observation-informed correction approach for global numerical weather prediction (NWP) of marine winds. Rather than forecasting winds directly, we learn local correction patterns by assimilating the latest in-situ observations to adjust the Global Forecast System (GFS) output. We propose ORCA (Observation-informed Real-time Correction with Attention), a transformer-based deep learning architecture that (i) handles irregular and time-varying observation sets through masking and set-based attention mechanisms, (ii) conditions predictions on recent observation--forecast pairs via cross-attention, and (iii) employs cyclical time embeddings and coordinate-aware location representations to enable single-pass inference at arbitrary spatial coordinates. We evaluate ORCA over the Atlantic Ocean using observations from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS) as reference. ORCA reduces GFS 10-meter wind error at all lead times up to 48 hours, achieving 45% improvement at 1-hour lead time and 13% improvement at 48-hour lead time. Spatial analyses reveal the most persistent improvements along coastlines and shipping routes, where observations are most abundant. The tokenized architecture naturally accommodates heterogeneous observing platforms (ships, buoys, tide gauges, and coastal stations) and produces both site-specific predictions and basin-scale gridded products in a single forward pass. These results demonstrate a practical, low-latency post-processing approach that complements NWP by learning to correct systematic forecast errors.
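
The cyclical time embeddings mentioned in item (iii) are commonly built from sine and cosine pairs, so that times close together on the clock stay close together in feature space. The sketch below shows that general idea only; the choice of a daily and a yearly cycle, the 365.25-day period, and the function names are assumptions, and the abstract does not specify ORCA's exact encoding.

```python
import math

def cyclical_embedding(value, period):
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return (math.sin(angle), math.cos(angle))

def embed_timestamp(hour_of_day, day_of_year):
    """Concatenate a daily and a yearly cycle into a 4-dimensional feature."""
    return cyclical_embedding(hour_of_day, 24.0) + cyclical_embedding(day_of_year, 365.25)

def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

late = cyclical_embedding(23.0, 24.0)
early = cyclical_embedding(1.0, 24.0)
noon = cyclical_embedding(12.0, 24.0)

# 23:00 and 01:00 are two hours apart, so they should be closer than 23:00 and 12:00.
print(distance(late, early) < distance(late, noon))
```

A raw hour value would place 23:00 and 01:00 at opposite ends of the range; the circular encoding removes that discontinuity at midnight.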

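2512.01930 2026-06-09 cs.LG cs.AI updated version

SVRG and Beyond via Posterior Correction

Nico Daheim, Thomas Möllenhoff, Ming Liang Ang, Mohammad Emtiyaz Khan

Affiliations * University of California, Berkeley

AI summary Reveals a deep connection between SVRG and posterior correction, shows that SVRG is a special case of posterior correction over isotropic Gaussian posteriors, and automatically derives new Newton-like and Adam-like variants from more flexible exponential-family posteriors.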

Comments ICML 2026 (oral)

Abstract

Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed up training by using gradient corrections. Originally proposed over a decade ago, these methods have never been connected to any Bayesian method at a fundamental level. Here, we fill this gap and derive surprising new connections of SVRG to a recently proposed Bayesian method called "posterior correction". Our main contribution is to show that SVRG can be recovered as a special case of posterior correction over isotropic-Gaussian posteriors. Novel extensions of SVRG are automatically obtained by using more flexible exponential-family posteriors. We derive two new such extensions by using Gaussian families: a Newton-like variant with novel Hessian corrections, and an Adam-like extension that scales to large problems. Our work is the first to connect SVRG to Bayes and use it to speed up training.
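
For readers unfamiliar with the gradient correction that SVRG relies on, here is a minimal sketch of the classic algorithm on a toy least-squares problem. It shows standard SVRG only, not the posterior-correction variants derived in the paper; the dataset, step size, and loop lengths are arbitrary choices made for illustration.

```python
import random

def grad_i(w, x, y):
    """Gradient of the per-sample loss 0.5 * (w . x - y)^2 with respect to w."""
    residual = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [residual * xj for xj in x]

def full_grad(w, data):
    """Average of the per-sample gradients over the whole dataset."""
    total = [0.0] * len(w)
    for x, y in data:
        total = [t + g for t, g in zip(total, grad_i(w, x, y))]
    return [t / len(data) for t in total]

def svrg(data, dim, step=0.1, epochs=40, inner_steps=20, seed=0):
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        anchor = list(w)              # snapshot of the current weights
        mu = full_grad(anchor, data)  # full gradient at the snapshot
        for _ in range(inner_steps):
            x, y = rng.choice(data)
            g_now = grad_i(w, x, y)
            g_anchor = grad_i(anchor, x, y)
            # Variance-reduced step: stochastic gradient, minus the same sample's
            # gradient at the snapshot, plus the full gradient at the snapshot.
            w = [wj - step * (gn - ga + m)
                 for wj, gn, ga, m in zip(w, g_now, g_anchor, mu)]
    return w

# Noise-free line y = 2 + 3x, with a constant feature for the intercept.
data = [([1.0, x], 2.0 + 3.0 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w = svrg(data, dim=2)
print([round(v, 3) for v in w])
```

The correction term `- g_anchor + mu` has zero mean over the sampled index, so the update stays unbiased while its variance shrinks as `w` and the snapshot approach the optimum.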

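2512.01467 2026-06-09 cs.LG cs.AR cs.SC updated version

Differentiable Weightless Controllers: Learning Logic Circuits for Continuous Control

Fabian Kresse, Christoph H. Lampert

Affiliations * Max Planck Institute for Informatics

AI summary Proposes Differentiable Weightless Controllers (DWCs), a symbolic-differentiable architecture that learns efficient control policies through gradient-based training and compiles to low-latency, low-energy FPGA circuits; on MuJoCo benchmarks it is competitive with deep policies and exhibits sparse, interpretable connectivity patterns.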

Comments Accepted at Forty-third International Conference on Machine Learning (ICML), 19 pages, 12 figures, 12 tables

Abstract

Controlling autonomous systems under real-world conditions often requires policies that can be evaluated with low latency and minimal energy consumption. Unfortunately, these conditions are at odds with the use of high-precision deep neural networks as controllers. In this work, we introduce Differentiable Weightless Controllers (DWCs), a symbolic-differentiable architecture that learns flexible, non-linear, yet highly efficient control policies. DWCs can be trained end-to-end via gradient-based techniques, yet compile directly into FPGA-compatible circuits with few- or even single-clock-cycle latency and nanojoule-level energy cost per action. Across five MuJoCo benchmarks, including high-dimensional Humanoid, DWCs achieve returns competitive with standard deep policies (full-precision or quantized neural networks). Furthermore, DWCs exhibit structurally sparse and interpretable connectivity patterns, enabling direct inspection of which input values influence control decisions.

2511.18454 2026-06-09 cs.CV cs.AI updated version

AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading

Ming-Jhe Lee, Chang-Hong Wu, Jung-Hua Wang, Ming-Jer Chen, Yu-Chiao Yi, Tsung-Hsien Lee

Affiliations * Department of Electrical Engineering; AI Research Center; National Taiwan Ocean University; Department of Obstetrics, Gynecology; Gynecology, CSMU Hospital, Taichung, Taiwan

AI summary Proposes AttnRegDeepLab, a framework that combines dual-branch multi-task learning, attention gates, a multi-scale regression head, and two-stage decoupled training to grade embryo fragmentation with high precision and interpretability.

Comments 6 pages, 5 figures

Abstract

Assessing embryo fragmentation is crucial for predicting IVF success, yet manual grading is prone to subjectivity, and existing AI models struggle with clinical interpretability and segmentation errors. We propose AttnRegDeepLab, a Multi-Task Learning (MTL) framework designed to solve these challenges. The model enhances a DeepLabV3+ decoder with Attention Gates to filter out cytoplasmic noise and retain sharp contour details. It also introduces a Multi-Scale Regression Head with Feature Injection, guiding the segmentation process with global grading priors to eliminate systematic area estimation errors. Based on a two-stage decoupled training strategy and a range-based loss for weakly labeled data, our method resolves MTL gradient conflicts. AttnRegDeepLab yields high grading precision and excellent segmentation quality (Dice coefficient = 0.729), avoiding the trade-off between contour integrity and grading accuracy seen under standard joint optimization. This provides a reliable, clinically interpretable tool balancing visual and quantitative accuracy.
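
The segmentation score quoted above is the Dice coefficient, $2|A \cap B| / (|A| + |B|)$ for a predicted mask $A$ and a reference mask $B$. The short sketch below computes it for binary masks stored as nested lists; the masks are toy values, and returning 1.0 when both masks are empty is a convention chosen for this example.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    overlap = sum(p and t
                  for pred_row, target_row in zip(pred, target)
                  for p, t in zip(pred_row, target_row))
    size = sum(map(sum, pred)) + sum(map(sum, target))
    # Two empty masks agree perfectly; treating that case as 1.0 is a convention.
    return 1.0 if size == 0 else 2.0 * overlap / size

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

# Two overlapping pixels, three pixels in each mask.
print(round(dice_coefficient(pred, target), 4))
```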

2511.05355 2026-06-09 cs.LG cs.RO cs.SY eess.SY updated version

SAD-Flower: Flow Matching for Safe, Admissible, and Dynamically Consistent Planning

Tzu-Yuan Huang, Armin Lederer, Dai-Jie Wu, Xiaobing Dai, Sihua Zhang, Hsiu-Chin Lin, Shao-Hua Sun, Stefan Sosnowski, Sandra Hirche

Affiliations * TUM School of Computation, Information and Technology, Technical University of Munich, Munich, Germany; Munich Institute of Robotics; Munich Data Science Institute (MDSI); National University of Singapore; National Taiwan University (NTU); NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE); University of Utah; Beijing Institute of Technology; McGill University

AI summary Proposes SAD-Flower, a framework that augments flow matching with a virtual control input and uses nonlinear control theory to provide formal guarantees for state constraints, action constraints, and dynamic consistency, satisfying unseen constraints at test time without retraining.

Abstract

Flow matching (FM) has shown promising results in data-driven planning. However, it inherently lacks formal guarantees for state and action constraints, whose satisfaction is a fundamental requirement for the safety and admissibility of planned trajectories on various systems. Moreover, existing FM planners do not ensure dynamic consistency, which can render trajectories non-executable. We address these shortcomings by proposing SAD-Flower, a novel framework for generating Safe, Admissible, and Dynamically consistent trajectories. Our approach relies on an augmentation of the flow with a virtual control input. Principled guidance can thereby be derived using techniques from nonlinear control theory, providing formal guarantees for state constraints, action constraints, and dynamic consistency. Crucially, SAD-Flower operates without retraining, enabling test-time satisfaction of unseen constraints. Through extensive experiments across several tasks, we demonstrate that SAD-Flower outperforms various generative-model-based baselines in ensuring constraint satisfaction.

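2512.00239 2026-06-09 cs.LG stat.ML updated version

Self-Supervised Dynamical System Representations for Physiological Time-Series

Yenho Chen, Maxwell A. Xu, James M. Rehg, Christopher J. Rozell

Affiliations * University of California, Berkeley; Stanford University

AI summary Proposes PULSE, a framework that exploits the information structure of a dynamical-systems generative model and uses a cross-reconstruction pretraining objective to extract shared system-parameter information while discarding sample-specific noise, improving representation learning for physiological time series.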

Comments Accepted to ICML 2026

Abstract

The effectiveness of self-supervised learning (SSL) for physiological time series depends on the ability of a pretraining objective to preserve information about the underlying physiological state while filtering out unrelated noise. However, existing strategies are limited by their reliance on heuristic principles or poorly constrained generative tasks. To address this limitation, we propose a pretraining framework that exploits the information structure of a dynamical systems generative model across multiple time series. This framework reveals our key insight: class identity can be efficiently captured by extracting information about the generative variables related to the system parameters shared across similar time series samples, while noise unique to individual samples should be discarded. Building on this insight, we propose PULSE, a cross-reconstruction-based pretraining objective for physiological time series datasets that explicitly extracts system information while discarding non-transferable, sample-specific information. We establish theory that provides sufficient conditions for the system information to be recovered, and empirically validate it using a synthetic dynamical systems experiment. Furthermore, we apply our method to diverse real-world datasets, demonstrating that PULSE learns representations that can broadly distinguish semantic classes, increase label efficiency, and improve transfer learning.

2409.15723 2026-06-09 cs.LG cs.CL updated version

Federated Large Language Models: Current Progress and Future Directions

Yuhang Yao, Jianyi Zhang, Junda Wu, Chengkai Huang, Yu Xia, Tong Yu, Ruiyi Zhang, Sungchul Kim, Ryan Rossi, Ang Li, Lina Yao, Julian McAuley, Yiran Chen, Carlee Joe-Wong

Affiliations * Carnegie Mellon University; Duke University; University of California San Diego; The University of New South Wales; Adobe Research; University of Maryland College Park; CSIRO’s Data61

AI summary Surveys recent progress in combining federated learning with large language models (FedLLM), focusing on how federated fine-tuning and federated prompt learning address efficiency, personalization, and security challenges, and outlines directions such as federated pre-training and federated agents.

Comments Accepted by PAKDD 2026

Abstract

Large Language Models have achieved impressive performance across diverse applications, yet their training typically depends on centralized data collection, raising serious privacy and governance concerns. Federated Learning offers a decentralized alternative by enabling multiple clients to collaboratively train shared models without exposing raw local data. However, integrating FL with LLMs introduces new challenges, including data heterogeneity, convergence instability, communication overhead, and computational constraints. This survey provides a comprehensive and up-to-date overview of Federated Learning for Large Language Models (FedLLM). We systematically review recent advances, with particular emphasis on federated fine-tuning and federated prompt learning, and analyze how existing methods address efficiency, personalization, and security challenges. We further summarize emerging directions such as federated pre-training and federated agents. Our goal is to offer a structured perspective on this rapidly evolving field and to highlight promising avenues for future research.

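2511.20397 2026-06-09 cs.LG cs.DS cs.NA math.NA updated version

Model-Based Learning of Whittle indices

Joël Charles-Rebuffé, Nicolas Gast, Bruno Gaujal

Affiliations * Univ. Grenoble Alpes; Inria; CNRS; Grenoble INP

AI summary Proposes BLINQ, an algorithm that builds an empirical estimate of the MDP and computes its Whittle indices, with a convergence proof and a bound on the time needed to reach a given precision; numerical experiments show markedly better sample efficiency than Q-learning.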

Comments 30 pages, 7 figures, submitted to TOMPECS

Abstract

We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating, and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of an existing state-of-the-art algorithm. We provide a proof of convergence to the Whittle indices we want to learn, as well as a bound on the time needed to learn them with arbitrary precision. Moreover, we investigate the computational complexity of BLINQ. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in terms of the number of samples needed to get an accurate approximation. In addition, its total computational cost is even lower than that of Q-learning for any reasonably high number of samples. These observations persist even when the Q-learning algorithms are sped up using neural networks to predict Q-values.
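
The model-based first step, building an empirical estimate of the MDP from observed transitions, can be sketched with simple counting. This illustrates only that estimation step under the assumption of tabular states and actions; the subsequent Whittle-index computation is not shown, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """Empirical transition probabilities and mean rewards from samples.

    transitions: iterable of (state, action, reward, next_state) tuples.
    Returns (p_hat, r_hat), both keyed by (state, action).
    """
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
    p_hat, r_hat = {}, {}
    for sa, next_counts in counts.items():
        n = sum(next_counts.values())
        p_hat[sa] = {s_next: c / n for s_next, c in next_counts.items()}
        r_hat[sa] = reward_sum[sa] / n
    return p_hat, r_hat

# Hypothetical samples from a two-state arm with actions 0 (passive) and 1 (active).
samples = [(0, 1, 1.0, 1), (0, 1, 0.0, 0), (0, 1, 1.0, 1), (0, 1, 1.0, 1),
           (1, 0, 0.0, 0), (1, 0, 0.0, 1)]
p_hat, r_hat = estimate_mdp(samples)
print(p_hat[(0, 1)], r_hat[(0, 1)])
```

Any index-computation routine can then be run on `p_hat` and `r_hat` as if they were the true model, which is what distinguishes this approach from Q-learning on raw samples.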

2511.19829 2026-06-09 cs.AI updated version

Knowing How to Edit: Reliable Evaluation Signals for Diagnosing and Optimizing Prompts at Query Level

Ke Chen, Yifeng Wang, Hassan Almosapeeh, Haohan Wang

Affiliations * School of Information Sciences, University of Illinois Urbana-Champaign; College of Engineering, Carnegie Mellon University

AI summary Proposes a performance-oriented prompt evaluation framework and an execution-free evaluator that predicts multi-dimensional quality scores, which in turn guides a metric-aware optimizer that rewrites prompts in an interpretable, query-dependent manner, outperforming existing methods across multiple datasets and backbone models.

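AI abstract (translated from Chinese)

Most prompt optimization methods refine a single static template, which makes them ineffective in complex and dynamic user scenarios. Existing query-dependent methods rely on unstable textual feedback or black-box reward models, which provide weak and uninterpretable optimization signals. More fundamentally, prompt quality itself lacks a unified, systematic definition, leading to fragmented and unreliable evaluation signals. Our approach first establishes a performance-oriented, systematic, and comprehensive prompt evaluation framework. We then develop and fine-tune an execution-free evaluator that predicts multi-dimensional quality scores directly from text. The evaluator guides a metric-aware optimizer that diagnoses failure modes and rewrites prompts in an interpretable, query-dependent manner. Our evaluator achieves the strongest accuracy in predicting prompt performance, and evaluation-instructed optimization consistently outperforms static-template and query-dependent baselines across eight datasets and three backbone models. Overall, we propose a unified, metric-grounded view of prompt quality and show that our evaluation-instructed optimization pipeline delivers stable, interpretable, and model-agnostic improvements across diverse tasks.

Abstract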

Prompt optimization has become a central mechanism for eliciting strong performance from LLMs, and recent work has made substantial progress by proposing diverse prompt evaluation metrics and optimization strategies. Despite these advances, prompt evaluation and prompt optimization are often developed in isolation, limiting the extent to which evaluation can effectively inform prompt refinement. In this work, we study prompt optimization as a process guided by performance-relevant evaluation signals. To address the disconnect between evaluation and optimization, we propose an evaluation-instructed prompt optimization approach that explicitly connects prompt evaluation with query-dependent optimization. Our method integrates multiple complementary prompt quality metrics into a performance-reflective evaluation framework and trains an execution-free evaluator that predicts prompt quality directly from text, avoiding repeated model executions. These evaluation signals then guide prompt refinement in a targeted and interpretable manner. Empirically, the proposed evaluator achieves 83.7% accuracy in predicting prompt performance. When incorporated into the optimization process, our approach consistently outperforms existing optimization baselines across eight benchmark datasets and three different backbone LLMs. Overall, our results demonstrate that reliable and efficient evaluation signals can serve as an effective foundation for robust and interpretable prompt optimization.

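2511.18676 2026-06-09 cs.CV cs.AI updated version

MedVision: Benchmarking Quantitative Medical Image Analysis

Yongcheng Yao, Yongshuo Zong, Raman Dutt, Yongxin Yang, Sotirios A Tsaftaris, Timothy Hospedales

Affiliations * University of Edinburgh; Queen Mary University of London

AI summary Addresses the lack of quantitative reasoning in current medical vision-language models with MedVision, a dataset and benchmark covering 22 public datasets and 30.8 million image-annotation pairs; supervised and reinforcement fine-tuning markedly improve detection, tumor/lesion size estimation, and angle/distance measurement.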

Comments 22 pages, 13 figures, 14 tables

Abstract

Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. We show that current off-the-shelf VLMs perform poorly on these tasks. However, supervised and reinforcement fine-tuning on MedVision significantly enhances performance across detection, T/L estimation, and A/D measurement. MedVision provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging.

2511.18421 2026-06-09 cs.SD cs.LG updated version

DHAuDS: A Dynamic and Heterogeneous Audio Benchmark for Test-Time Adaptation

Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul, Tissa Chandesa

Affiliations * School of Computer and Mathematical Sciences, University of Nottingham Malaysia

AI summary Addresses the reliance of existing test-time adaptation (TTA) evaluation on static, homogeneous noise protocols with DHAuDS, a benchmark that uses dynamic severity and heterogeneous noise mixtures to expose robustness weaknesses in audio classification.

Abstract

Existing Test-Time Adaptation (TTA) studies rely heavily on static and homogeneous corruption protocols, such as ImageNet-C and CIFAR-10-C/100-C, leading to inconsistent evaluation settings and robustness estimates that are potentially inflated compared with real-world situations. TTA lacks a standardized evaluation infrastructure capable of modeling realistic heterogeneous acoustic degradation. We introduce DHAuDS, a standardized benchmark suite for evaluating audio classification TTA robustness under dynamic corruption severity and heterogeneous noise mixtures. Rather than proposing a new TTA algorithm, DHAuDS focuses on exposing robustness limitations that remain hidden under conventional fixed-noise evaluation protocols.

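2507.01598 2026-06-09 cs.LG updated version

Convergence Bound and Critical Batch Size of Muon Optimizer

Naoki Sato, Hiroki Naganuma, Hideaki Iiduka

Affiliations * Meiji University; Université de Montréal; Mila

AI summary Theoretically analyzes the convergence of the Muon optimizer in four practical settings, proves that weight decay keeps the parameter and gradient norms bounded, and derives a lower bound on the critical batch size that reveals how the hyperparameters $\beta$ and $\lambda$ affect its scaling.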

Abstract

Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. We then demonstrate that the addition of weight decay ensures almost-sure boundedness of the parameter and gradient norms -- without relying on the commonly imposed bounded-gradient assumption -- and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive a lower bound on the critical batch size for Muon -- the batch size that minimizes the stochastic first-order oracle (SFO) complexity of training. Because the resulting formula involves problem-dependent quantities that are not directly observable (gradient variance, target precision, effective rank), it does not predict the critical batch size in absolute terms; rather, it reveals how the hyperparameters $\beta$ (momentum) and $\lambda$ (weight decay) govern the qualitative scaling of this value. Our experiments validate these hyperparameter-dependent predictions across workloads including image classification and language modeling.