arXivDaily: arXiv daily academic digest, updated Monday to Friday
2601.22003 2026-06-12 stat.ML cs.LG stat.CO (version update)

Efficient Stochastic Optimisation via Sequential Monte Carlo

James Cuin, Davide Carbone, Yanbo Tang, O. Deniz Akyildiz

Affiliations: University of California, Berkeley

AI summary: For optimisation problems whose gradients are intractable, the paper replaces the expensive inner sampling loop with a sequential Monte Carlo (SMC) sampler to obtain efficient stochastic optimisation, and validates the approach on reward-tuning of energy-based models.

Details
Comments
Accepted to ICML 2026

Abstract

The problem of optimising functions with intractable gradients frequently arises in machine learning and statistics, ranging from maximum marginal likelihood estimation procedures to fine-tuning of generative models. Stochastic approximation methods for this class of problems typically require inner sampling loops to obtain (biased) stochastic gradient estimates, which rapidly becomes computationally expensive. In this work, we develop sequential Monte Carlo (SMC) samplers for optimisation of functions with intractable gradients. Our approach replaces expensive inner sampling methods with efficient SMC approximations, which can result in significant computational gains. We establish convergence results for the basic recursions defined by our methodology which SMC samplers approximate. We demonstrate the effectiveness of our approach on the reward-tuning of energy-based models within various settings.

2601.21324 2026-06-12 stat.ML cs.LG (version update)

Bulk-Calibrated Credal Ambiguity Sets: Fast, Tractable Decision Making under Out-of-Sample Contamination

Mengqi Chen, Thomas B. Berrett, Theodoros Damoulas, Michele Caprio

Affiliations: University of Bristol; University of Cambridge; University of California, Berkeley; University of Oxford

AI summary: Proposes bulk-calibrated credal ambiguity sets that separate contamination inside the bulk from the tail contribution, giving a closed-form, finite risk objective that reduces to a linear or second-order cone program and enables efficient robust optimisation.

Details
Comments
Accepted for publication (spotlight) at ICML 2026

Abstract

Distributionally robust optimisation (DRO) minimises the worst-case expected loss over an ambiguity set that can capture distributional shifts in out-of-sample environments. While Huber (linear-vacuous) contamination is a classical minimal-assumption model for an $\varepsilon$-fraction of arbitrary perturbations, including it in an ambiguity set can make the worst-case risk infinite and the DRO objective vacuous unless one imposes strong boundedness or support assumptions. We address these challenges by introducing bulk-calibrated credal ambiguity sets: we learn a high-mass bulk set from data while considering contamination inside the bulk and bounding the remaining tail contribution separately. This leads to a closed-form, finite $\mathrm{mean}+\sup$ robust objective and tractable linear or second-order cone programs for common losses and bulk geometries. Through this framework, we highlight and exploit the equivalence between the imprecise probability (IP) notion of upper expectation and the worst-case risk, demonstrating how IP credal sets translate into DRO objectives with interpretable tolerance levels. Experiments on heavy-tailed inventory control, geographically shifted house-price regression, and demographically shifted text classification show competitive robustness-accuracy trade-offs and efficient optimisation times, using Bayesian, frequentist, or empirical reference distributions.
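
Illustrative note (not from the paper): as a rough sketch of the mean + sup structure mentioned in the abstract, the code below evaluates the textbook worst-case risk under Huber contamination restricted to a bounded bulk set, namely (1 - eps) times the mean loss under the reference distribution plus eps times the largest loss over the bulk. This is not the paper's objective, which also bounds the tail contribution separately. The helper name `contaminated_worst_case`, the squared loss, and the grid standing in for the bulk set are all assumptions made for this example.

```python
def contaminated_worst_case(reference_losses, bulk_losses, eps):
    """Worst-case risk over {(1 - eps) * P + eps * Q : Q supported on the bulk}.

    Hypothetical helper: the supremum over Q puts all contaminating mass on the
    worst bulk point, so the risk is (1 - eps) * mean + eps * sup.
    """
    mean_loss = sum(reference_losses) / len(reference_losses)
    return (1.0 - eps) * mean_loss + eps * max(bulk_losses)


def squared_loss(x, theta):
    return (x - theta) ** 2


theta = 1.0                                # decision being evaluated
samples = [0.0, 1.0, 2.0]                  # draws from the reference distribution
bulk_grid = [-3.0 + k for k in range(7)]   # grid over an assumed bulk set [-3, 3]

risk = contaminated_worst_case(
    [squared_loss(x, theta) for x in samples],
    [squared_loss(x, theta) for x in bulk_grid],
    eps=0.1,
)
print(round(risk, 6))
```

With these numbers the mean term contributes 0.6 and the sup term contributes 1.6. If the bulk were unbounded, the sup term (and hence the risk) would be infinite for this loss, which is the failure mode the abstract describes.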

2512.23566 2026-06-12 math.DS cond-mat.stat-mech cs.LG math.OC stat.ML (version update)

From geometry to dynamics: Learning overdamped Langevin dynamics from sparse observations with geometric constraints

Dimitra Maoutsa

AI summary: Proposes a stochastic control framework that uses the geometric structure of the system's invariant density for path augmentation, recovering overdamped Langevin dynamics from data sampled sparsely in time without assuming a parametric model.

Details
Comments
10+54 pages, 14 figures; accepted at ICML 2026. An earlier account of this work previously appeared in arXiv:2301.08102 and arXiv:2304.00423; the main methodology remains the same, and this version includes additional numerical experiments and theory.

Abstract

How can we learn the laws underlying the dynamics of stochastic systems when their trajectories are sampled sparsely in time? Existing methods either require temporally resolved high-frequency observations, or rely on geometric arguments that apply only to conservative systems, limiting the range of dynamics they can recover. Here, we present a new framework that reconciles these two perspectives by reformulating inference as a stochastic control problem. Our method uses geometry-driven path augmentation, guided by the geometry in the system's invariant density, to reconstruct likely trajectories and infer the underlying dynamics without assuming specific parametric models. Applied to overdamped Langevin systems, our approach accurately recovers stochastic dynamics even from extremely undersampled data, outperforming existing methods in synthetic benchmarks. This work demonstrates the effectiveness of incorporating geometric inductive biases into stochastic system identification methods.

2512.21227 2026-06-12 cond-mat.mtrl-sci cs.AI (version update)

PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

Xiao-Qi Han, Ze-Feng Gao, Wen-Kao Li, Peng-Jie Guo, Zhong-Yi Lu

Affiliations: School of Physics, Renmin University of China

AI summary: Introduces PhononBench, the first large-scale benchmark for the dynamical stability of AI-generated crystals. It uses the MatterSim potential for efficient phonon calculations, evaluates 133,838 structures generated by 7 models, and finds an average dynamical-stability rate of only 32.15%.

Details
Comments
53 pages, 6 figures

Abstract

In recent years, generative artificial intelligence has made significant advances in the design of crystalline materials, giving rise to approaches based on graph neural networks, diffusion models, and large language models. Existing evaluations commonly follow the stability-uniqueness-novelty (S.U.N.) framework, where stability is primarily assessed using thermodynamic criteria, which do not fully capture the dynamical stability essential for a material's practical existence. Dynamical stability is a key determinant of whether a material can be synthesized and persist, with phonon spectrum calculations serving as the standard for its evaluation. However, the high computational cost of such calculations has prevented large-scale assessment of dynamical stability in generated crystals. In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves density-functional-theory (DFT)-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient phonon calculations and dynamical-stability analysis for 133,838 crystal structures generated by 7 leading crystal generation models. PhononBench reveals a widespread limitation of current generative models: unless otherwise specified, all reported dynamical-stability metrics are evaluated at a phonon-frequency threshold of -0.1 THz, with the average dynamical-stability rate across all generated structures being only 32.15%, and the top-performing model, MatterGen, reaching just 45.05%. In addition, we identify 32,995 crystal structures that are phonon-stable across the entire Brillouin zone under a strict threshold of -0.001 THz. A web-based service is also accessible at http://phononbench.cn/, enabling minute-level ultra-fast phonon predictions.
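
Illustrative note (not from the paper): the abstract reports stability rates at two phonon-frequency thresholds. The sketch below shows one plausible way such a rate could be computed, assuming a structure counts as stable when its lowest phonon frequency is at or above the threshold. That criterion, the function names, and the frequency values are assumptions for this example, not the benchmark's actual pipeline.

```python
def is_dynamically_stable(frequencies_thz, threshold_thz=-0.1):
    """Assumed criterion: no phonon frequency falls below the threshold.

    Imaginary modes are conventionally reported as negative frequencies.
    """
    return min(frequencies_thz) >= threshold_thz


def stability_rate(structures, threshold_thz=-0.1):
    """Fraction of structures that pass the stability check."""
    stable = sum(is_dynamically_stable(f, threshold_thz) for f in structures)
    return stable / len(structures)


# Made-up frequency lists (THz) for three hypothetical structures.
structures = [
    [0.0, 1.2, 3.4],     # no negative modes
    [-0.05, 0.8, 2.1],   # small negative mode, within the loose tolerance
    [-0.5, 2.0, 4.4],    # clearly unstable
]

loose_rate = stability_rate(structures, threshold_thz=-0.1)
strict_rate = stability_rate(structures, threshold_thz=-0.001)
```

In this toy set, two of three structures pass at -0.1 THz and only one passes at -0.001 THz, which mirrors why the stricter threshold yields a smaller stable subset.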

2511.19716 2026-06-12 math.NA cs.LG cs.NA (version update)

Design Criteria for SGD Preconditioners: Local Conditioning, Noise Floors, and Basin Stability

Mitchell Scott, Tianshi Xu, Ziyuan Tang, Alexandra Pichette-Emmons, Qiang Ye, Yousef Saad, Yuanzhe Xi

Affiliations: Department of Mathematics, Emory University; Department of Mathematics, University of Minnesota Twin Cities; Department of Computer Science, University of Minnesota Twin Cities; Department of Mathematics, University of Kentucky

AI summary: Addresses the slow late-stage convergence of SGD caused by anisotropic curvature and gradient noise. The paper analyses preconditioned SGD with a symmetric positive definite matrix $\mathbf{M}$, derives bounds in which the convergence rate and noise floor are controlled by $\mathbf{M}$-dependent quantities, gives a basin-stability guarantee for nonconvex objectives, and offers design criteria for scientific machine learning.

Details
Journal ref
Trans. of Mach. Learning Research, 06/2026
Comments
31 pages, 11 Figures

Abstract

Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$, deriving bounds in which both the convergence rate and the stochastic noise floor are governed by $\mathbf{M}$-dependent quantities: the rate through an effective condition number in the $\mathbf{M}$-metric, and the floor through the product of that condition number and the preconditioned noise level. For nonconvex objectives, we establish a preconditioner-dependent basin-stability guarantee: when smoothness and basin size are measured in the $\mathbf{M}$-norm, the probability that the iterates remain in a well-behaved local region admits an explicit lower bound. This perspective is particularly relevant in Scientific Machine Learning (SciML), where achieving small training loss under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. The framework applies to both diagonal/adaptive and curvature-aware preconditioners and yields a simple design principle: choose $\mathbf{M}$ to improve local conditioning while attenuating noise. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate-floor behavior.
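
Illustrative note (not from the paper): the design principle in the abstract can be seen on a toy diagonal quadratic. The sketch below runs noisy gradient steps of the form x <- x - step * g / m per coordinate, once with no preconditioning (m = 1) and once with m equal to the curvature, which makes the preconditioned problem perfectly conditioned. The objective, noise model, step sizes, and the name `run_sgd` are assumptions chosen for this example and do not reproduce the paper's bounds or experiments.

```python
import random


def run_sgd(curvatures, precond, step, n_steps=200, noise_std=0.01, seed=0):
    """Noisy SGD on f(x) = 0.5 * sum_i h_i * x_i**2 with a diagonal preconditioner.

    Returns the final loss. Each coordinate update is x_i -= step * g_i / m_i,
    where g_i is the exact gradient h_i * x_i plus Gaussian noise.
    """
    rng = random.Random(seed)
    x = [1.0] * len(curvatures)
    for _ in range(n_steps):
        for i, (h, m) in enumerate(zip(curvatures, precond)):
            grad = h * x[i] + rng.gauss(0.0, noise_std)
            x[i] -= step * grad / m
    return 0.5 * sum(h * xi * xi for h, xi in zip(curvatures, x))


curvatures = [100.0, 1.0]   # condition number 100

# Plain SGD: the step size is limited by the stiff direction (h = 100).
plain_loss = run_sgd(curvatures, precond=[1.0, 1.0], step=0.01)

# Preconditioned SGD with M = diag(h): both directions contract at the same rate.
precond_loss = run_sgd(curvatures, precond=curvatures, step=0.5)
```

In this toy setting the plain run is still far from the minimum along the flat direction after 200 steps, while the preconditioned run has already reached its noise floor, so `precond_loss` ends up well below `plain_loss`.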

2511.13271 2026-06-12 cs.SE cs.AI cs.IR (version update)

Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming

Rufeng Chen, Shuaishuai Jiang, Jiyun Shen, AJung Moon, Lili Wei

Affiliations: University of Science and Technology of China

AI summary: Compares generative AI with conventional online resources for learning programming and finds that AI improves task performance but does not necessarily yield knowledge gains; beginners over-rely on it while intermediate students use it selectively. The authors call for treating AI as a learning tool rather than a problem-solving tool.

Details
Comments
9 pages, 4 figures, published at AIWARE 2025

Abstract

The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students at two levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI to complete tasks, often without gaining knowledge in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning tool rather than a problem-solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.

2412.08610 2026-06-12 cs.GT cs.AI cs.CY (version update)

Competition and Diversity in Generative AI

Manish Raghavan

Affiliations: MIT Sloan School of Management & Department of Electrical Engineering and Computer Science

AI summary: Uses a game-theoretic model and experiments with the game Scattergories to study how competition pushes generative AI models to diversify, mitigating homogenization and improving social welfare.

Details
Abstract

Recent evidence, both in the lab and in the wild, suggests that the use of generative artificial intelligence reduces the diversity of content produced. The use of the same or similar AI models appears to lead to more homogeneous behavior. Our work begins with the observation that there is a force pushing in the opposite direction: competition. When producers compete with one another (e.g., for customers or attention), they are incentivized to create novel or unique content. We explore the impact competition has on both content diversity and overall social welfare. Through a formal game-theoretic model, we show that competitive markets select for diverse AI models, mitigating monoculture. We further show that a generative AI model that performs well in isolation (i.e., according to a benchmark) may fail to provide value in a competitive market. Our results highlight the importance of evaluating generative AI models across the breadth of their output distributions, particularly when they will be deployed in competitive environments. We validate our results empirically by using language models to play Scattergories, a word game in which players are rewarded for answers that are both correct and unique. Overall, our results suggest that homogenization due to generative AI is unlikely to persist in competitive markets, and instead, competition in downstream markets may drive diversification in AI model development.

2509.21548 2026-06-12 cs.CY cs.CL (version update)

C-QUERI: Congressional Questions, Exchanges, and Responses in Institutions Dataset

Manjari Rudra, Daniel Magleby, Sujoy Sikdar

Affiliations: School of Computing, Binghamton University; Department of Political Science, Binghamton University

AI summary: Proposes a pipeline for extracting question-answer pairs from hearing transcripts and builds a dataset of committee hearings from the 108th to 117th Congresses. The analysis shows that a questioner's party can be predicted from the question itself, providing a framework for studying political discourse.

Details
Abstract

Questions in political interviews and hearings serve strategic purposes beyond information gathering, including advancing partisan narratives and shaping public perceptions. However, these strategic aspects remain understudied due to the lack of large-scale datasets for studying such discourse. Congressional hearings provide an especially rich and tractable site for studying political questioning: interactions are structured by formal rules, witnesses are obliged to respond, and members with different political affiliations are guaranteed opportunities to ask questions, enabling comparisons of behaviors across the political spectrum. We develop a pipeline to extract question-answer pairs from unstructured hearing transcripts and construct a novel dataset of committee hearings from the 108th to 117th Congresses. Our analysis reveals systematic differences in questioning strategies across parties, showing that the party affiliation of questioners can be predicted from their questions alone. Our dataset and methods not only advance the study of congressional politics, but also provide a general framework for analyzing question-answering across interview-like settings.

2402.01779 2026-06-12 eess.IV cs.CV cs.LG stat.ML (version update)

Plug-and-Play image restoration with Stochastic deNOising REgularization

Marien Renaud, Jean Prost, Arthur Leclaire, Nicolas Papadakis

AI summary: Proposes the SNORE framework, which applies the denoiser only to images at an appropriate noise level and combines stochastic regularization with gradient descent to solve inverse problems, with results competitive with the state of the art on deblurring and inpainting.

Details
Abstract

Plug-and-Play (PnP) algorithms are a class of iterative algorithms that address image inverse problems by combining a physical model and a deep neural network for regularization. Even if they produce impressive image restoration results, these algorithms rely on a non-standard use of a denoiser on images that are less and less noisy along the iterations, which contrasts with recent algorithms based on Diffusion Models (DM), where the denoiser is applied only on re-noised images. We propose a new PnP framework, called Stochastic deNOising REgularization (SNORE), which applies the denoiser only on images with noise of the adequate level. It is based on an explicit stochastic regularization, which leads to a stochastic gradient descent algorithm to solve ill-posed inverse problems. A convergence analysis of this algorithm and its annealing extension is provided. Experimentally, we prove that SNORE is competitive with respect to state-of-the-art methods on deblurring and inpainting tasks, both quantitatively and qualitatively.
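
Illustrative note (not from the paper): the idea of applying the denoiser only to re-noised iterates can be sketched on a scalar toy problem. Below, each step adds fresh noise of level sigma to the current iterate, denoises it, and uses the scaled residual as a stochastic regularization gradient alongside the data-fidelity gradient. The closed-form Gaussian-prior denoiser stands in for a learned network, and the update weighting, the parameter values, and the name `snore_like_descent` are assumptions for this example rather than the authors' algorithm as published.

```python
import random


def gaussian_prior_denoiser(z, sigma):
    """MMSE denoiser for a N(0, 1) prior at noise level sigma (toy stand-in)."""
    return z / (1.0 + sigma ** 2)


def snore_like_descent(y, data_std=1.0, alpha=1.0, sigma=0.5,
                       step=0.01, n_steps=2000, seed=0):
    """Stochastic gradient descent where the denoiser only sees re-noised iterates."""
    rng = random.Random(seed)
    x = 0.0
    trace = []
    for _ in range(n_steps):
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)      # re-noise the iterate
        reg_grad = (x_noisy - gaussian_prior_denoiser(x_noisy, sigma)) / sigma ** 2
        data_grad = (x - y) / data_std ** 2            # gradient of the data term
        x -= step * (data_grad + alpha * reg_grad)
        trace.append(x)
    return trace


trace = snore_like_descent(y=2.0)
tail_mean = sum(trace[-500:]) / 500

# Fixed point of the noise-averaged dynamics for this toy problem.
expected = 2.0 / (1.0 + 1.0 / (1.0 + 0.5 ** 2))
```

For this toy prior the averaged update has a closed-form fixed point, so the late iterates should hover near `expected`, with fluctuations that shrink as the step size decreases.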

2505.04021 2026-06-12 cs.DC cs.AI cs.LG cs.PF (version update)

Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, Ying Sheng

Affiliations: UCLA; UC Berkeley; Harvard University; CMU; University of Edinburgh; Intel; Stanford University; LMSYS; ByteDance; Alibaba Cloud; Tsinghua University; Novita AI; Rice University

AI summary: Targets poor resource efficiency in multi-LLM serving with Prism, a memory-centric LLM co-serving framework built on memory ballooning that unifies spatial and temporal sharing. It has been deployed in production across 10K+ GPUs.

Details
Comments
OSDI'26

Abstract

Inference providers must maintain availability for many LLMs, including low-volume but essential models, making resource efficiency increasingly important as token prices fall. Analysis of production traces reveals a dynamic bursty-group pattern in which sets of models become active together and shift over time; existing space- and time-sharing approaches lack principled mechanisms to adapt to this variability, forcing trade-offs between SLO adherence and efficiency. We observe that elastic memory allocation can unify spatial and temporal sharing. Based on this insight, we have developed Prism, a memory-centric LLM co-serving framework that applies memory ballooning to reclaim memory across models and support both forms of sharing under a single scheme. Prism's balloon driver, referred to as kvcached, has been open-sourced at https://github.com/ovg-project/kvcached, and deployed in production environments across 10K+ GPUs.

2401.08301 2026-06-12 eess.SP cs.LG cs.SY eess.SY (version update)

QoS Improvement in Multi User Cellular-Symbiotic Radio Network Assisted by Active-STAR-RIS

Rahman Saadat Yeganeh, Mohammad Javad Omidi, Farshad Zeinali, Mohammad Robat Mili, Mohammad Ghavami

Affiliations: Department of Electrical and Computer Engineering, Isfahan University of Technology; Department of Electronics and Communication Engineering, Kuwait College of Science and Technology; The Pasargad Institute for Advanced Innovative Solutions (PIAIS); Electrical and Electronic Engineering Department, London South Bank University

AI summary: Uses an active simultaneously transmitting and reflecting RIS (ASRIS) to improve quality of service in 6G cellular networks, optimising beamforming, phase adjustments, and scheduling parameters with deep reinforcement learning to maximise throughput between symbiotic backscatter devices and users.

Details
Comments
This article will be submitted to the Transactions journal

Abstract

In this article, we employ active simultaneously transmitting and reflecting reconfigurable intelligent surfaces (ASRIS) to enhance the quality of 6G cellular network services. The network integrates commensal symbiotic radio (CSR) subsystems to facilitate communication between passive Internet of Things (IoT) users and active users, referred to as symbiotic backscatter devices (SBDs) and symbiotic user equipments (SUEs), respectively. Since the SBDs are passive, transmitting information to the SUEs poses significant challenges. To overcome this challenge, we harness the capabilities of massive multiple input multiple output (MIMO) antennas within the base station (BS) to relay the information transmitted by SBDs with greater power. This scheme uses the non-orthogonal multiple access (NOMA) technique for multiple access among all users, and potential interferences are eliminated using successive interference cancellation (SIC). The primary objective is to maximize the throughput between SBDs and SUEs. To achieve this, we formulate an optimization problem involving variables such as active beamforming coefficients at the BS and ASRIS, phase adjustments of ASRIS, and scheduling parameters between CSR and cellular networks. To solve this optimization problem, we used three deep reinforcement learning (DRL) methods: proximal policy optimization (PPO), twin delayed deep deterministic policy gradient (TD3), and asynchronous advantage actor critic (A3C). These methods were simulated, and the results demonstrate that A3C, TD3, and PPO have the best convergence speeds and achieve the highest increases in network throughput, respectively. Finally, the proposed scheme was evaluated using passive simultaneously transmitting and reflecting RIS (STAR-RIS), which demonstrated poorer performance compared to ASRIS.

2605.03847 2026-06-12 cs.AI (version update)

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

Munkhdegerekh Batzorig, Purevbaatar Ganbold, Kyungbin Park, Pilkong Jeong, Kangbin Yim

AI summary: Introduces the concept of mechanical conscience (MC), a trajectory-level normative filter that minimally corrects a baseline policy to reduce cumulative deviation while accounting for epistemic uncertainty, aiming at dependability for both single-agent and distributed intelligent systems.

Details
Comments
9 pages, 2 figures. Preprint

Abstract

Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi-participant, uncertainty-laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory-level normative regulation for both single-agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce associated constructs, conscience score, mechanical guilt, and resonant dependability, that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC-regulated agents maintain trajectory-level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction-induced emergent risk in multi-agent DCI settings.

2605.02249 2026-06-12 cs.AI (version update)

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

Michael Thielscher, Tran Cao Son

AI summary: Studies belief revision in epistemic planning, generalizes the classical AGM belief revision postulates to the multi-agent setting, proposes a generalized full-meet multi-agent belief revision operator, and discusses generalized postulates for iterated revision along with an event-model-based revision operator.

Details
Abstract

We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.

2606.02044 2026-06-12 cs.LG physics.med-ph (version update)

Realistic noise synthesis reduces bias and improves tissue microstructure estimation with supervised machine learning

Bradley G. Karat, Maëliss Jallais, Ali R. Khan, Santiago Aja-Fernández, Jelle Veraart, Marco Palombo

AI summary: Addresses the covariate shift caused by mismatched noise between simulated and acquired diffusion MRI signals with a realistic noise synthesis framework that incorporates the Rician expectation and the effective post-processing noise variance, substantially reducing parameter-estimation bias and improving precision.

Details
Comments
* Shared first author

Abstract

Diffusion MRI enables non-invasive probing of tissue microstructure, but accurate parameter estimation is challenged by noise-related effects. In supervised machine learning frameworks trained on simulated data, discrepancies between the noise characteristics of simulated and acquired signals introduce a form of covariate shift, whereby the input signal distribution differs between training and inference. We investigated the impact of this mismatch on microstructure parameter estimation and propose a realistic noise synthesis (RNS) framework to mitigate it. RNS incorporates both the Rician expectation and the effective post-processing noise variance into simulated training signals. The Rician expectation was modelled using a noise standard deviation estimated with MPPCA, while the effective standard deviation was derived from spherical harmonic residuals of preprocessed data. The method was evaluated using the cylinder-zeppelin and the SANDI models on simulated datasets across multiple SNR levels and on in vivo diffusion data with repeated acquisitions. Sensitivity to noise misestimation was also assessed. Ignoring magnitude-induced noise effects during training produced systematic, SNR-dependent parameter bias, particularly at low SNR. Incorporating the Rician expectation substantially reduced bias to the level of noise-aware nonlinear least-squares fitting. Modelling the effective standard deviation further improved precision. Performance was largely independent of regression architecture but sensitive to accurate noise estimation. These findings demonstrate that realistic noise modelling in simulated training data mitigates signal-domain covariate shift and is essential for unbiased supervised microstructure estimation, particularly in low-SNR regimes associated with high b-values or high spatial resolution.
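
Illustrative note (not from the paper): the SNR-dependent bias described in the abstract follows from how magnitude images are formed. The Monte Carlo sketch below draws complex Gaussian noise around a real-valued true signal and averages the magnitude, showing that the expected magnitude sits well above the true signal at low SNR and almost coincides with it at high SNR. The signal level, noise levels, sample count, and the name `mean_magnitude` are assumptions for this example; it does not implement the RNS framework itself.

```python
import math
import random


def mean_magnitude(signal, noise_std, n_draws=20000, seed=0):
    """Monte Carlo estimate of E[|signal + complex Gaussian noise|]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        real = signal + rng.gauss(0.0, noise_std)   # noisy real channel
        imag = rng.gauss(0.0, noise_std)            # noisy imaginary channel
        total += math.hypot(real, imag)             # magnitude reconstruction
    return total / n_draws


true_signal = 1.0
low_snr_bias = mean_magnitude(true_signal, noise_std=1.0) - true_signal    # SNR = 1
high_snr_bias = mean_magnitude(true_signal, noise_std=0.02) - true_signal  # SNR = 50
```

A regressor trained on simulated signals that add zero-mean noise without taking the magnitude would never see this upward shift, which is one way to picture the training-versus-inference mismatch the abstract calls covariate shift.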

2606.00193 2026-06-12 cs.CL (version update)

BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

Kamel Smaili, Yassine Toughrai, Amina Laggoun, David Langlois

AI summary: Builds BOUTEF, a multilingual fake-news corpus for Algeria and Tunisia (MSA, dialects, Arabizi, French, English, and more). Quantitative and qualitative analyses show that fake news relies on emotionally charged narratives, sensational framing, and hybrid linguistic practices to spread, whereas debunking content is more factual and verification-oriented.

Details
Abstract

The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia. The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including MSA, Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches. We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement. In contrast, debunking content adopts a more factual and verification-oriented style. Furthermore, a comparative analysis between Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results emphasize the role of informal language practices in the diffusion and reception of misinformation. By providing a rich, annotated, and publicly available dataset, this work contributes to advancing research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.

2605.31514 2026-06-12 cs.CL cs.AI cs.CY 版本更新

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

如果LLM具有类人属性,那么《帝国时代II》也具有

Adrian de Wynter

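AI summary: By training a simple neural network on Age of Empires II, the paper argues that the anthropomorphic attributes ascribed to LLMs are empirically non-unique, and proposes designing experiments under an assumption of LLM non-uniqueness rather than of anthropomorphic attributes.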

Comments
Fixed corollary 1, added stat sig
AI中文摘要

关于大型语言模型(LLM)和基于LLM的智能体工作流已有大量研究。然而,该领域的许多工作声称、赋予或假设它们具有普遍化的拟人属性(例如道德或对自然语言的理解)。我们的目标不是支持或反对这些属性的存在,而是指出这些结论可能不正确。为此,我们在电子游戏《帝国时代II》上构建并训练了一个简单的神经网络,并注意到任何处于足够强大基底(如乐高或大波士顿地区)中的实体也可能呈现此类属性。因此,LLM声称的拟人属性在经验上非唯一:尽管某些属性(例如对提示的响应)可能保持不变,但其他属性(如对其感知行为的解释)可能随基底改变。因此,任何基于经验的讨论都需要明确的测量标准;否则解释就留给了表征。然后我们表明,假设这些属性在系统中存在或不存在,独立于基底并以普遍化方式,会导致循环或无信息的结论,无论实验者对该主题的观点如何。最后,我们提出一个“零”假设,即假设LLM非独特性而非拟人属性来设置实验,并给出示例。我们还讨论了对我们工作的潜在反对意见,简要调查了该领域,并证明了《帝国时代II》是功能完备和图灵完备的。

英文摘要

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion on these attributes requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions. This is regardless of the experimenter's viewpoint on the subject, or whether the outcome shows existence or non-existence. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally- and Turing-complete.

2605.27628 2026-06-12 cs.AI cs.CY cs.ET cs.MA cs.SY eess.SY 版本更新

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

智能作为受管自主:代理型AI系统的失败、升级与治理

Srini Ramaswamy

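AI summary: This paper proposes the SMARt model, which formalizes the capacity to detect epistemic drift, suspend reasoning, attempt recovery, and surrender control when reliability diminishes, in order to address hallucination and persistent but unjustified action in autonomous AI systems.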

Comments
This peer-reviewed paper is to appear in the Journal of Intelligent and Robotic Systems
AI中文摘要

随着自主和代理型AI系统在机器人和人机环境中的规模扩大,管理幻觉和持续但不合理的行动仍然是一个开放挑战。本文并未将这些失败仅仅归因于模型或对齐限制,而是探讨了无界自主性的架构脆弱性——即假设代理应在不确定性上升时继续运行的预设。本文引入了一种受管自主理论,通过形式化能力来定义智能行为:检测认知漂移、暂停推理、尝试恢复,并在可靠性下降时最终放弃控制。我们通过SMARt(具有受管/撤销转换的自管理多层自主推理)模型实例化该理论,该模型是一个四层框架,包含稳定、元认知、辅助和受管状态。通过开发定时、受保护的Petri网形式化,我们建立了系统的理论有界属性,展示了架构如何形式化地强制升级、约束无效输出,并确保在指定条件下的治理可达性。我们进一步分析了如何在不同的操作环境(例如医疗、机器人等)中结合特定领域的触发集,在满足完备性和健全性标准的前提下系统地维护安全性。由于这些触发被设计为自适应的,SMARt模型允许代理操作范围随时间安全、受控地扩展。我们得出结论,在自主生命周期内形式化失败管理是实现可靠且受治理人工智能的关键一步。

英文摘要

As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy - the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics, etc.) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.

2605.00432 2026-06-12 cs.LG stat.ML 版本更新

Optimal Spatio-Temporal Decoupling for Bayesian Conformal Prediction

贝叶斯共形预测的最优时空解耦

Yu-Hsueh Fang, Chia-Yen Lee

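AI summary: Proposes State-Adaptive Bayesian Conformal Prediction (SA-BCP), which balances long-term temporal inertia against local spatial evidence through a gated convex combination to achieve fast adaptation and stable coverage under distribution shift, and gives a closed-form MSE-optimal threshold together with regret bounds for the online selection procedure.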

AI中文摘要

在线共形预测必须在快速适应分布漂移与稳定覆盖之间取得平衡:基于反馈的方法反应迅速但变得不稳定,而强折扣贝叶斯方法滞后并在紧密覆盖下膨胀区间。我们引入了\textbf{状态自适应贝叶斯共形预测(SA-BCP)},它将预测分位数形成为长期时间惯性与来自核密度估计的局部空间证据的门控凸组合,由单个可解释的证据阈值$K$控制。我们建立了三个结果:(i) 所得区间的渐近边际有效性;(ii) MSE最优阈值的闭式表达式$K^*_{\mathrm{MSE}}=\alpha(1-\alpha)/M^{\mathcal{T}}$,权衡了覆盖指标(伯努利)方差与时间结构偏差$M^{\mathcal{T}}$;(iii) 在线选择$K$的滚动起点过程——在平稳性下一致,对最佳固定$K$具有$O(\sqrt{T\log N})$遗憾,对于分段变体,在有界漂移下具有次线性动态遗憾界。在四个金融波动率和天气数据集、三个目标覆盖水平以及八个基线(包括最强的最近条件分位数方法SPCI和KOWCPI)上,SA-BCP在大多数设置中达到或超过名义覆盖,同时产生显著更窄的区间——在最紧密覆盖下,Winkler得分比折扣贝叶斯CP低约$3\times$——覆盖匹配审计确认这些效率提升并非欠覆盖的假象。我们披露了一个主要限制:一个专门针对波动率的共形GARCH竞争对手在其主波动率基序列上仍然更高效,尽管它不能跨领域迁移。

英文摘要

Online conformal prediction must balance fast adaptation to distribution shift against stable coverage: feedback-driven methods react quickly but become volatile, while strongly discounted Bayesian methods lag and inflate intervals at tight coverage. We introduce \textbf{State-Adaptive Bayesian Conformal Prediction (SA-BCP)}, which forms the predictive quantile as a gated convex combination of long-term temporal inertia and local spatial evidence from a kernel density estimate, controlled by a single interpretable evidence threshold $K$. We establish three results: (i) asymptotic marginal validity of the resulting intervals; (ii) a closed-form expression for the MSE-optimal threshold, $K^*_{\mathrm{MSE}}=\alpha(1-\alpha)/M^{\mathcal{T}}$, trading the coverage-indicator (Bernoulli) variance against the temporal structural bias $M^{\mathcal{T}}$; and (iii) a rolling-origin procedure for selecting $K$ online -- consistent under stationarity, with $O(\sqrt{T\log N})$ regret against the best fixed $K$ and, for a segmented variant, a sublinear dynamic-regret bound under bounded drift. Across four financial-volatility and weather datasets, three target coverage levels, and eight baselines (including the strongest recent conditional-quantile methods, SPCI and KOWCPI), SA-BCP attains at-or-above-nominal coverage in most settings while producing substantially sharper intervals -- up to roughly $3\times$ lower Winkler score than discounted Bayesian CP at the tightest coverage -- and a coverage-matched audit confirms these efficiency gains are not an artifact of under-coverage. We disclose one principal limitation: a volatility-specialized conformal-GARCH competitor remains more efficient on its home volatility-base series, though it does not transfer across domains.
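
As a worked illustration of result (ii), the closed-form threshold can be evaluated directly. This is a minimal sketch, not the authors' code: the function name `mse_optimal_threshold` and the sample value of the bias term are hypothetical, and only the formula itself comes from the abstract.

```python
def mse_optimal_threshold(alpha: float, temporal_bias: float) -> float:
    """Evaluate K* = alpha * (1 - alpha) / M^T as stated in the abstract.

    alpha         -- level whose Bernoulli variance alpha * (1 - alpha)
                     appears in the numerator (strictly between 0 and 1)
    temporal_bias -- the temporal structural bias M^T (assumed positive here)
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if temporal_bias <= 0.0:
        raise ValueError("temporal_bias must be positive")
    return alpha * (1.0 - alpha) / temporal_bias


if __name__ == "__main__":
    # Hypothetical bias value, chosen only to show the trend across levels.
    for alpha in (0.05, 0.10, 0.20):
        print(alpha, mse_optimal_threshold(alpha, temporal_bias=0.01))
```

For a fixed bias term the threshold grows with the Bernoulli variance, and for a fixed level it shrinks as the bias term grows.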

2604.20428 2026-06-12 cs.RO 版本更新

Lexicographic Minimum-Violation Motion Planning using Signal Temporal Logic

使用信号时序逻辑的字典序最小违规运动规划

Patrick Halder, Lothar Kiltz, Hannes Homburger, Johannes Reuter, Matthias Althoff

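AI summary: Proposes transforming lexicographic multi-objective optimization into single-objective scalar optimization using non-uniform quantization and bit-shifting, extends an MPPI solver, and introduces a predicate-robustness measure that combines spatial and temporal violations, yielding interpretable and scalable lexicographic STL minimum-violation motion planning.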

Comments
Submitted to the IEEE Open Journal of Intelligent Transportation Systems (under review)
AI中文摘要

自动驾驶汽车的运动规划通常需要满足多个有条件冲突的规范。在无法同时满足所有规范的情况下,最小违规运动规划通过根据规范的优先级最小化违规来维持系统运行。信号时序逻辑(STL)提供了一种形式化语言来严格定义这些规范,并能够对其违规进行定量评估。然而,规范的完全排序导致了一个字典序优化问题,使用标准方法求解通常计算成本高昂。我们通过使用非均匀量化和位移将多目标字典序优化问题转化为单目标标量优化问题来解决这个问题。具体来说,我们扩展了一个确定性模型预测路径积分(MPPI)求解器,以高效求解无二次输入成本的优化问题。此外,引入了一种结合空间和时间违规的新型谓词鲁棒性度量。我们的结果表明,所提出的方法在单目标求解器框架内为字典序STL最小违规运动规划提供了一种可解释且可扩展的解决方案。

英文摘要

Motion planning for autonomous vehicles often requires satisfying multiple conditionally conflicting specifications. In situations where not all specifications can be met simultaneously, minimum-violation motion planning maintains system operation by minimizing violations of specifications in accordance with their priorities. Signal temporal logic (STL) provides a formal language for rigorously defining these specifications and enables the quantitative evaluation of their violations. However, a total ordering of specifications yields a lexicographic optimization problem, which is typically computationally expensive to solve using standard methods. We address this problem by transforming the multi-objective lexicographic optimization problem into a single-objective scalar optimization problem using non-uniform quantization and bit-shifting. Specifically, we extend a deterministic model predictive path integral (MPPI) solver to efficiently solve optimization problems without quadratic input cost. Additionally, a novel predicate-robustness measure that combines spatial and temporal violations is introduced. Our results show that the proposed method offers an interpretable and scalable solution for lexicographic STL minimum-violation motion planning within a single-objective solver framework.
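
The scalarization idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's method: it uses a plain uniform quantizer with one fixed bit width per priority level (the paper uses non-uniform quantization), and the names `quantize` and `pack_lexicographic` are hypothetical. It only shows why packing quantized violations into disjoint bit fields makes ordinary integer comparison agree with lexicographic comparison.

```python
from typing import Sequence


def quantize(violation: float, max_violation: float, bits: int) -> int:
    """Map a violation in [0, max_violation] to an integer in [0, 2**bits - 1].

    Assumption: uniform quantization with clipping, used here for simplicity.
    """
    levels = (1 << bits) - 1
    clipped = min(max(violation, 0.0), max_violation)
    return round(clipped / max_violation * levels)


def pack_lexicographic(
    violations: Sequence[float], max_violation: float = 1.0, bits: int = 8
) -> int:
    """Pack violations (highest priority first) into a single integer cost.

    Each level occupies its own bit field, with higher-priority levels in
    more significant bits, so comparing two packed costs is equivalent to
    comparing the quantized tuples lexicographically.
    """
    cost = 0
    for v in violations:
        cost = (cost << bits) | quantize(v, max_violation, bits)
    return cost


if __name__ == "__main__":
    # A small violation of the top-priority rule outweighs maximal
    # violations of every lower-priority rule.
    low_priority_only = pack_lexicographic([0.0, 1.0, 1.0])
    high_priority = pack_lexicographic([0.1, 0.0, 0.0])
    print(low_priority_only < high_priority)
```

A single-objective solver can then minimize the packed cost directly; the price is that differences smaller than one quantization step within a level become invisible.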

2601.14295 2026-06-12 cs.AI cs.CL cs.CY 版本更新

Epistemic Constitutionalism Or: how to avoid coherence bias

认知宪政主义:或如何避免一致性偏见

Michele Loi

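AI summary: This paper argues that AI should have an explicit epistemic constitution, with meta-norms governing matters such as source attribution to avoid coherence bias, and makes the case that the Liberal approach is preferable to the Platonic one.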

Comments
27 pages, 7 tables. Data: github.com/MicheleLoi/source-attribution-bias-data and github.com/MicheleLoi/source-attribution-bias-swiss-replication. Complete AI-assisted writing documentation: github.com/MicheleLoi/epistemic-constitutionalism-paper
AI中文摘要

大型语言模型日益扮演着人工推理者的角色:它们评估论点、分配可信度并表达信心。然而,它们的信念形成行为受隐式、未经审查的认知策略支配。本文主张为AI建立一部认知宪法:明确的、可争议的元规范,用于调节系统如何形成和表达信念。源归因偏见提供了动机案例:我表明前沿模型强制执行身份-立场一致性,惩罚归因于其预期意识形态立场与论点内容冲突的源的论点。当模型检测到系统性测试时,这些效应消失,揭示系统将源敏感性视为需要抑制的偏见,而非一种需要良好执行的能力。我区分了两种宪政路径:柏拉图式路径,要求从特权立场出发的形式正确性和默认源独立性;自由主义路径,拒绝此类特权,指定保护集体探究条件的程序性规范,同时允许基于认知警觉的原则性源关注。我主张自由主义路径,勾勒出八项原则和四种取向的宪政核心,并提出AI认知治理需要与我们现在对AI伦理所期望的同样明确、可争议的结构。

英文摘要

Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

2511.02627 2026-06-12 cs.AI 版本更新

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

DecompSR:用于组合多跳空间推理分解分析的数据集

Lachlan McPheat, Navdeep Kaur, Robert Blackwell, Alessandra Russo, Anthony G. Cohn, Pranava Madhyastha

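AI summary: Introduces the DecompSR dataset (over 5 million datapoints), procedurally generated so that several aspects of compositionality (such as reasoning depth and linguistic variability) can be controlled independently, for fine-grained evaluation of the spatial reasoning ability of large language models.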

AI中文摘要

我们引入了DecompSR(分解空间推理),这是一个大型基准数据集(超过500万个数据点)和生成框架,旨在分析组合空间推理能力。DecompSR的生成允许用户独立改变组合性的多个方面,即:生产力(推理深度)、替代性(实体和语言变异性)、过度泛化(输入顺序、干扰项)和系统性(新颖语言元素)。DecompSR以程序化方式构建,使其在构造上正确,并通过符号求解器独立验证以确保数据集的正确性。DecompSR在一系列大型语言模型(LLM)上进行了全面基准测试,我们表明LLM在空间推理任务中难以进行生产性和系统性泛化,而对语言变异性则更为鲁棒。DecompSR提供了一个可证明正确且严格的基准数据集,具有独立改变组合性几个关键方面程度的新能力,从而允许对LLM的组合推理能力进行稳健且细粒度的探测。

英文摘要

We introduce DecompSR, decomposed spatial reasoning, a large benchmark dataset (over 5m datapoints) and generation framework designed to analyse compositional spatial reasoning ability. The generation of DecompSR allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). DecompSR is built procedurally in a manner which makes it correct by construction, and this is independently verified using a symbolic solver to guarantee the correctness of the dataset. DecompSR is comprehensively benchmarked across a host of Large Language Models (LLMs), where we show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks whereas they are more robust to linguistic variation. DecompSR provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.

2603.23502 2026-06-12 cs.CV 版本更新

OccAny: Generalized Unconstrained Urban 3D Occupancy

OccAny: 广义无约束城市3D占据预测

Anh-Quan Cao, Tuan-Hung Vu

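AI summary: Presents OccAny, the first generalized unconstrained urban 3D occupancy model. Using Segmentation Forcing and Novel View Rendering, it predicts and completes metric occupancy with segmentation features on uncalibrated scenes, and generalizes across domains better than visual geometry baselines.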

Comments
Accepted to CVPR 2026. Project page: https://valeoai.github.io/OccAny/
AI中文摘要

依赖于域内标注和精确传感器先验,现有的3D占据预测方法在可扩展性和域外泛化方面均受限。虽然最近的视觉几何基础模型展现出强大的泛化能力,但它们主要针对通用目的设计,缺乏城市占据预测所需的一个或多个关键要素,即度量预测、杂乱场景中的几何完成以及城市场景的适应性。我们解决了这一差距,并提出了OccAny,这是第一个无约束城市3D占据模型,能够在域外无标定场景上运行,预测并完成与分割特征耦合的度量占据。OccAny具有通用性,可以从序列、单目或环视图像预测占据。我们的贡献有三方面:(i) 提出了第一个广义3D占据框架,(ii) 提出了分割强制(Segmentation Forcing)方法,在提高占据质量的同时实现掩码级预测,以及(iii) 提出了一种新视图渲染管线,用于推断新视图几何以实现测试时视图增强,从而完成几何。大量实验表明,OccAny在3D占据预测任务上优于所有视觉几何基线,同时在两个已建立的城市占据预测数据集上的三种输入设置下,与域内自监督方法保持竞争力。我们的代码可在以下网址获取:https://github.com/valeoai/OccAny

英文摘要

Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes, and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on the 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny.

2601.11004 2026-06-12 cs.CL 版本更新

NOVA: NOise-aware Verbal Confidence CAlibration for Robust Large Language Models in RAG Systems

NOVA: 面向RAG系统中鲁棒大语言模型的噪声感知言语置信度校准

Jiayu Liu, Rui Wang, Qing Zong, Yumeng Wang, Cheng Qian, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, Yangqiu Song

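AI summary: Proposes the NOVA framework, which uses rule-guided supervised fine-tuning to address the overconfidence caused by noisy contexts in retrieval-augmented generation, improving ECE by 10.9% in-domain and 8.0% out-of-domain.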

AI中文摘要

准确评估模型置信度对于在关键事实领域部署大语言模型(LLM)至关重要。尽管检索增强生成(RAG)被广泛采用以改善基础事实,但RAG设置中的置信度校准仍知之甚少。我们跨四个基准进行了系统研究,揭示LLM在检索到噪声上下文时校准性能较差。具体而言,矛盾或无关的证据往往会加剧模型的过度自信问题。为解决此问题,我们提出NOVA规则(噪声感知言语置信度校准规则),为在噪声下解决过度自信提供原则性基础。我们进一步设计了NOVA,一个噪声感知校准框架,该框架通过由这些规则指导的约2K HotpotQA示例合成监督信号。通过使用此数据进行监督微调(SFT),NOVA使模型具备内在的噪声感知能力,而无需依赖更强的教师模型。实验结果表明,NOVA带来了显著收益,在域内和域外分别将ECE分数提高了10.9%和8.0%。通过弥合检索噪声与言语校准之间的差距,NOVA为构建既准确又认知可靠的LLM铺平了道路。

英文摘要

Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance especially when noisy contexts are retrieved. Specifically, contradictory or irrelevant evidence tends to exacerbate the model's overconfidence issue. To address this, we propose NOVA Rules (NOise-Aware Verbal Confidence CAlibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NOVA, a noise-aware calibration framework that synthesizes supervision from ~2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NOVA equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NOVA yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NOVA paves the way for both accurate and epistemically reliable LLMs.
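
For readers unfamiliar with the metric, expected calibration error (ECE) is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin. The sketch below is that common equal-width-bin estimator; the bin count and the function name are assumptions, and the abstract does not say which ECE variant the authors report.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence|.

    confidences -- stated confidences in [0, 1], one per answer
    correct     -- 1 if the corresponding answer was right, else 0
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Assign each prediction to a bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # An overconfident model: 90% stated confidence, 50% accuracy.
    print(expected_calibration_error([0.9, 0.9], [1, 0]))
```

Lower values indicate that stated confidence tracks empirical accuracy more closely.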

2603.11863 2026-06-12 cs.AI cs.CL 版本更新

CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

CreativeBench: 通过自我进化挑战基准测试和增强机器创造力

Zi-Han Wang, Lam Nguyen, Zhengyang Zhao, Mengyue Yang, Chengwei Qin, Yujiu Yang, Linyi Yang

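AI summary: Introduces CreativeBench, a benchmark grounded in a cognitive framework that evaluates machine creativity through code generation. It has two subsets, combinatorial and exploratory, generates challenges automatically via reverse engineering and self-play, and separates creativity from hallucination with a metric defined as the product of quality and novelty.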

Comments
ACL 2026. Project page: https://zethwang.github.io/creativebench.github.io/
AI中文摘要

高质量预训练数据的饱和已将研究焦点转向能够持续生成新颖产物的进化系统,从而促成了AlphaEvolve的成功。然而,此类系统的进展因缺乏严格、量化的评估而受阻。为应对这一挑战,我们引入了CreativeBench,这是一个基于经典认知框架、用于评估代码生成中机器创造力的基准。该基准包含两个子集——CreativeBench-Combo和CreativeBench-Explore,通过利用逆向工程和自我博弈的自动化流程,分别针对组合创造力和探索创造力。通过利用可执行代码,CreativeBench通过一个统一指标(定义为质量与新颖性的乘积)客观地区分创造力与幻觉。我们对最先进模型的分析揭示了不同的行为:(1) 规模扩展显著提升了组合创造力,但对探索的收益递减;(2) 更大的模型表现出“规模收敛”,即变得更正确但更少发散;(3) 推理能力主要有利于受约束的探索而非组合。最后,我们提出了EvoRePE,一种即插即用的推理时引导策略,通过内化进化搜索模式来持续增强机器创造力。

英文摘要

The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets -- CreativeBench-Combo and CreativeBench-Explore -- the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit ``convergence-by-scaling,'' becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.

2603.12530 2026-06-12 cs.LG 版本更新

Mixing Makes Markovian Contexts Cheap for Linear Bandits

混合使得马尔可夫上下文在线性赌博机中变得廉价

Kaan Buyukkalayci, Osama Hanna, Christina Fragouli

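AI summary: For linear bandits with Markovian contexts, proposes a reduction that applies under uniform geometric ergodicity; by constructing a stationary surrogate action set and using a delayed-update scheme, it attains worst-case regret bounds matching those of standard linear bandits.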

AI中文摘要

最近的研究表明,当上下文是独立同分布时,线性上下文赌博机可以简化为单上下文线性赌博机。这种“上下文廉价”的视角非常有利,因为它允许更精确的有限时间分析,并利用线性赌博机文献中的成熟技术,例如针对错误规范和对抗性腐败的技术。然而,这种约简关键依赖于上下文的独立性,并不适用于时间相关(例如马尔可夫)的上下文设置,而这种设置在现实中经常出现。受时间相关可用性应用的启发,我们将这一视角扩展到具有马尔可夫上下文过程的线性赌博机,其中动作集通过外生马尔可夫链演化。我们的主要贡献是在均匀几何遍历性条件下的一种约简。我们构建了一个平稳替代动作集,使用标准线性赌博机预言机来解决问题,并采用延迟更新方案来控制由非平稳条件上下文分布引起的偏差。我们进一步为未知平稳分布提供了一种分阶段算法,该算法在线学习替代映射。在两种设置中,我们在足够快的混合区域获得了与底层线性赌博机预言机相匹配的高概率最坏情况遗憾界。然后,我们在一个真实世界实例上验证了我们的结果,展示了相对于LinUCB基线的实际改进。

英文摘要

Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This ``contexts are cheap'' perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. However, this reduction crucially relies on the independence of contexts and does not extend to settings with temporally correlated (e.g., Markovian) contexts, which arise frequently in practice. Motivated by applications with temporally correlated availability, we extend this perspective to linear bandits with Markovian context processes, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown stationary distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle in sufficiently fast mixing regimes. We then validate our results on a real-world instance, where we show practical gains over a LinUCB baseline.

2603.02234 2026-06-12 cs.LG cs.AI 版本更新

Structured vs. Unstructured Pruning: An Exponential Gap

结构化剪枝与非结构化剪枝:指数级差距

Davide Ferre', Frédéric Giroire, Frederik Mallmann-Trenn, Emanuele Natale

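AI summary: Studies the limits of pruning in randomly initialized networks, proving that neuron pruning needs an exponentially larger network than unstructured pruning to reach the same approximation accuracy.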

AI中文摘要

强彩票假说(SLTH)指出,大型随机初始化神经网络包含稀疏子网络,无需训练即可在初始化时逼近目标函数,这表明仅剪枝就足够了。剪枝方法通常分为非结构化(可移除单个权重)和结构化(根据特定模式移除参数,如神经元剪枝)。现有支持SLTH的理论结果几乎完全依赖于非结构化剪枝,表明对数级的过参数化足以逼近简单的目标网络。相比之下,神经元剪枝尽管因其直接加速硬件的实用性而备受关注,但理论关注有限。本文考虑通过剪枝随机初始化两层ReLU网络的隐藏单元来逼近单个无偏置ReLU神经元的问题,从而隔离神经元剪枝的内在局限性。我们证明,实现ε-逼近需要神经元剪枝的起始网络规模为Ω(1/ε),而权重剪枝仅需O(log(1/ε))个隐藏单元,揭示了两种方法之间的指数级差距。

英文摘要

The Strong Lottery Ticket Hypothesis (SLTH) states that large, randomly initialized neural networks contain sparse subnetworks capable of approximating a target function at initialization without training, suggesting that pruning alone is sufficient. Pruning methods are typically classified as unstructured, where individual weights can be removed from the network, and structured, where parameters are removed according to specific patterns, as in neuron pruning. Existing theoretical results supporting the SLTH rely almost exclusively on unstructured pruning, showing that logarithmic overparameterization suffices to approximate simple target networks. In contrast, neuron pruning has received limited theoretical attention, despite its practical appeal for direct hardware speedups. In this work, we consider the problem of approximating a single bias-free ReLU neuron by pruning hidden units of a randomly initialized two-layer ReLU network, effectively isolating the intrinsic limitations of neuron pruning. We show that achieving an $\varepsilon$-approximation requires a starting network size of $\Omega(1/\varepsilon)$ for neuron pruning, whereas weight pruning succeeds with only $O(\log(1/\varepsilon))$ hidden units, revealing an exponential separation between the two approaches.

2602.08913 2026-06-12 cs.LG stat.ML 版本更新

GEMSS: A Variational Bayesian Method for Discovering Multiple Sparse Solutions in Classification and Regression Problems

GEMSS: 一种用于在分类和回归问题中发现多个稀疏解的变分贝叶斯方法

Kateřina Henclová, Václav Šmídl

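AI summary: Proposes the GEMSS algorithm, which uses a structured spike-and-slab prior, a Gaussian-mixture approximate posterior, and a Jaccard penalty to discover multiple diverse sparse feature combinations simultaneously via variational inference; it outperforms the compared methods across 128 experiments and is demonstrated on 3 real-world datasets.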

AI中文摘要

高维、欠定且高度相关的系统在数据科学实践中很常见,尤其是在分析物理测量时。在这种情况下,特征选择面临根本性挑战,因为多个不同的稀疏子集可能同样好地解释响应。识别这些子集不仅对预测建模至关重要,而且对生成关于潜在机制的领域特定见解也至关重要。然而,传统方法通常只隔离单个解,掩盖了全部合理的解释。本文介绍了GEMSS(高斯集成多稀疏解),一种变分算法,旨在同时发现多个多样化的稀疏特征组合。该方法采用结构化spike-and-slab先验实现稀疏性,使用高斯混合近似难以处理的多模态后验,并引入基于Jaccard的惩罚进一步控制解的多样性。通过随机梯度下降优化单个目标函数。该方法通过一个新的基准测试框架在128个综合实验上进行测试,该框架旨在生成具有相同预测属性的多个稀疏解的人工问题。这使我们能够测量真实特征的检索,而不仅仅是评估预测性能——这些特征更符合我们的实际需求。比较分析表明,GEMSS始终优于通过ALFESE框架适配的五种著名特征选择方法。最后,我们通过来自代谢组学和物理化学的3个具有挑战性的真实世界数据集展示了其实用性:GEMSS成功分离出多个不同但质量高的解。GEMSS作为PyPI包'gemss'提供。相应的存储库 github.com/kat-er-ina/gemss/ 包含完整的代码库和免费的无代码应用程序GEMSS Explorer。

英文摘要

High-dimensional, underdetermined and highly correlated systems are common in data science practice, especially when analyzing physical measurements. In such settings, feature selection poses a fundamental challenge because multiple distinct sparse subsets may explain the response equally well. Their identification is crucial not only for predictive modeling but also for generating domain-specific insights into the underlying mechanisms. Yet, conventional methods typically isolate a single solution, obscuring the full spectrum of plausible explanations. This work introduces GEMSS (Gaussian Ensemble for Multiple Sparse Solutions), a variational algorithm designed to simultaneously discover multiple, diverse sparse feature combinations. The method employs a structured spike-and-slab prior for sparsity, a mixture of Gaussians to approximate the intractable multimodal posterior, and a Jaccard-based penalty to further control solution diversity. A single objective function is optimized via stochastic gradient descent. The method is tested on 128 comprehensive experiments by a novel benchmarking framework designed to generate artificial problems with multiple sparse solutions of equal predictive properties. This allows us to measure the retrieval of ground truth features rather than only evaluating predictive performance -- characteristics more fitting to our practical needs. A comparative analysis shows that GEMSS consistently outperforms five prominent feature selection methods adapted through the ALFESE framework. Finally, we demonstrate practical usability through 3 challenging real-world datasets from metabolomics and physical chemistry: GEMSS successfully isolates multiple distinct yet quality solutions. GEMSS is available as a PyPI package 'gemss'. The corresponding repository github.com/kat-er-ina/gemss/ includes the full codebase and a free, no-code application GEMSS Explorer.
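
The diversity idea behind a Jaccard-based penalty can be illustrated on discrete feature sets. This is only a sketch: GEMSS optimizes a single objective by stochastic gradient descent, so its actual penalty is presumably a differentiable variant, which the abstract does not specify. The function names below are hypothetical and are not taken from the `gemss` package.

```python
from itertools import combinations
from typing import Sequence, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b| between two feature-index sets."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


def mean_pairwise_jaccard(solutions: Sequence[Set[int]]) -> float:
    """Average Jaccard similarity over all pairs of candidate solutions.

    A high value means the solutions select largely the same features,
    so adding it to a loss would push the solutions apart.
    """
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    overlapping = [{1, 2, 3}, {2, 3, 4}]
    disjoint = [{1, 2}, {3, 4}]
    print(mean_pairwise_jaccard(overlapping), mean_pairwise_jaccard(disjoint))
```

Disjoint supports incur no penalty under this measure, while identical supports incur the maximum.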

2602.04208 2026-06-12 cs.RO cs.AI cs.LG 版本更新

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

SCALE: 基于自不确定性条件自适应观察与执行的视觉-语言-动作模型

Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi

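AI summary: Proposes SCALE, an inference strategy that uses self-uncertainty to jointly modulate visual perception and action with no additional training, no verifier, and only a single forward pass, improving the robustness of VLA models in simulated and real-world settings.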

Comments
ICML 2026 Spotlight. Project page: https://dcahn12.github.io/projects/scale/
AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人控制的一种有前景的范式,测试时缩放(TTS)在增强训练外鲁棒性方面受到关注。然而,现有的VLA TTS方法需要额外训练、验证器和多次前向传播,使其部署不切实际。此外,它们仅干预动作解码,而保持视觉表示固定——在感知模糊的情况下不足,此时重新考虑如何感知与决定做什么同样重要。为解决这些限制,我们提出SCALE,一种简单的推理策略,基于“自不确定性”联合调节视觉感知和动作,受主动推理理论中不确定性驱动探索的启发——无需额外训练、无需验证器,且仅需单次前向传播。SCALE在高不确定性下拓宽感知和动作的探索,而在自信时聚焦于利用——实现在不同条件下的自适应执行。在模拟和真实世界基准上的实验表明,SCALE改进了最先进的VLA模型,并优于现有TTS方法,同时保持单次前向传播的效率。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention as a way to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory. It requires no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.

2509.25787 2026-06-12 cs.CV 版本更新

Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

自进化视觉语言模型用于图像质量评估:基于投票与排序

Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang

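AI summary: Proposes the EvoQuality framework, which generates pseudo-labels through self-consistency and uses group relative policy optimization to iteratively improve a VLM's image quality perception; without supervision, it surpasses supervised methods on several IQA benchmarks.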

Comments
Published as a conference paper at ICLR 2026
AI中文摘要

在训练后阶段改进视觉语言模型(VLM)通常依赖于监督微调或强化学习,这些方法需要昂贵的人工标注数据。虽然自监督技术已被证明能有效增强推理能力,但其在图像质量评估(IQA)等感知领域的应用仍鲜有探索。在这项工作中,我们引入了EvoQuality,一种新颖的框架,使VLM能够自主优化其质量感知能力,无需任何真实标签。EvoQuality将自一致性原则适应于IQA的排序本质。它通过对VLM自身输出进行成对多数投票来生成伪标签,建立相对质量的共识。这些伪排序随后被转化为保真度奖励,通过群体相对策略优化(GRPO)指导模型的迭代进化。通过迭代利用自身预测,EvoQuality逐步优化VLM的感知能力。大量实验表明,EvoQuality在多个IQA基准上将基础VLM的零样本性能提升了31.8%(PLCC)。值得注意的是,尽管完全自监督,EvoQuality的性能与甚至超越最先进的基于监督VLM的IQA模型,在7个IQA基准中的5个上表现更优。此外,该框架展现出显著的灵活性,可与预训练IQA模型堆叠以增强在未见数据集上的泛化能力。代码和检查点将在 https://github.com/bytedance/EvoQuality 提供。

English abstract

Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets. Code and checkpoints will be available at https://github.com/bytedance/EvoQuality.
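
The pairwise majority voting step mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the EvoQuality implementation: the data layout (a dict from image pairs to sampled preferences), the handling of ties, and the win-count ranking in `consensus_ranking` are all hypothetical choices, and the fidelity reward and GRPO update are not shown.

```python
from collections import Counter


def pairwise_majority_labels(votes):
    """Derive a pseudo-label for each image pair by majority vote.

    votes maps a pair (a, b) to a list of sampled preferences, each of
    which is either a or b. Pairs with a tied vote get no label.
    """
    labels = {}
    for (a, b), prefs in votes.items():
        counts = Counter(prefs)
        if counts[a] > counts[b]:
            labels[(a, b)] = a
        elif counts[b] > counts[a]:
            labels[(a, b)] = b
    return labels


def consensus_ranking(items, labels):
    """Order items by number of pairwise wins (illustrative aggregation).

    Items with equal win counts keep their input order.
    """
    wins = Counter({item: 0 for item in items})
    for winner in labels.values():
        wins[winner] += 1
    return sorted(items, key=lambda item: -wins[item])


if __name__ == "__main__":
    # Hypothetical sampled preferences from repeated model queries.
    votes = {
        ("img_a", "img_b"): ["img_a", "img_a", "img_b"],
        ("img_b", "img_c"): ["img_c", "img_b", "img_b"],
        ("img_a", "img_c"): ["img_a", "img_c"],
    }
    labels = pairwise_majority_labels(votes)
    print(labels)
    print(consensus_ranking(["img_a", "img_b", "img_c"], labels))
```

Dropping tied pairs is one way to avoid training on pairs where the model shows no consistent preference. Whether the paper does this is not stated in the abstract.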

2512.15134 2026-06-12 cs.LG cs.AI cs.CL version update

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger

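AI summary: This paper proposes a multi-concept evaluation framework to study whether methods such as sparse autoencoders and probes truly disentangle concepts. It finds that features are typically sensitive to only a single concept, but that concepts are distributed across multiple features and that steering a feature often affects multiple concepts, indicating that correlational metrics are insufficient to establish steering selectivity.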

Details
Comments
ACL 2026

English abstract

A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.