arXivDaily: daily arXiv digest, updated Monday through Friday
2510.07750 2026-06-11 stat.ML cs.LG (updated version)

Calibrating Decision Robustness via Inverse Conformal Risk Control

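Wenbin Zhou, Shixiang Zhu

AI summary: Proposes an inverse conformal risk control framework that gives robust optimization policies distribution-free, finite-sample guarantees on miscoverage and regret, and traces the Pareto frontier so that decision-makers can calibrate the robustness level to their cost-risk preferences.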

Abstract

Robust optimization safeguards decisions against uncertainty by optimizing against worst-case scenarios, yet its effectiveness hinges on a prespecified robustness level that is often chosen ad hoc, leading to either insufficient protection or overly conservative and costly solutions. Recent approaches using conformal prediction construct data-driven uncertainty sets with finite-sample coverage guarantees, but they still fix coverage targets a priori and offer little guidance for selecting robustness levels. We propose a new framework that provides distribution-free, finite-sample guarantees on both miscoverage and regret for any family of robust predict-then-optimize policies. Our method constructs valid estimators that trace out the miscoverage-regret Pareto frontier, enabling decision-makers to reliably evaluate and calibrate robustness levels according to their cost-risk preferences. The framework is simple to implement, broadly applicable across classical optimization formulations, and achieves sharper finite-sample performance. This paper offers a principled data-driven methodology for guiding robustness selection and empowers practitioners to balance robustness and conservativeness in high-stakes decision-making.
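
As background for the conformal-prediction step mentioned in this abstract, the following is a minimal sketch of standard split conformal calibration. It is not the paper's inverse conformal risk control procedure, which the abstract does not spell out. The helper name `conformal_radius` and the toy residuals are assumptions made for illustration. The returned radius defines an interval uncertainty set around a point prediction, and it grows as the target miscoverage level `alpha` shrinks.

```python
import math

def conformal_radius(residuals, alpha):
    """Split-conformal radius: the ceil((n + 1) * (1 - alpha))-th smallest
    absolute calibration residual (hypothetical helper, standard recipe)."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(abs(r) for r in residuals)[k - 1]

# Toy calibration residuals (made-up numbers, not from the paper).
calib = [0.1, -0.4, 0.3, 0.9, -0.2, 0.5, -0.7, 0.6, 0.05]
radii = {alpha: conformal_radius(calib, alpha) for alpha in (0.5, 0.2, 0.1)}
```

Sweeping `alpha` in this way gives one uncertainty set per robustness level, which is the kind of policy family whose cost and miscoverage a decision-maker would then need to compare.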

2601.22025 2026-06-11 cs.CL cs.AI cs.IR cs.SE (updated version)

When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications

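Daniel Commey

AI summary: Proposes the Minimum Viable Evaluation Suite (MVES); using a structured evaluation framework and locally reproducible experiments, finds that generic prompt additions do not yield monotonic improvements, and argues for evaluation-driven prompt iteration.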

Comments
Technical report. 42 pages, 3 figures. Code, test suites, and result logs: this https URL

Abstract

Evaluating Large Language Model (LLM) applications differs from conventional software testing because outputs are probabilistic, semantically variable, and sensitive to prompt and model changes. This technical report proposes the Minimum Viable Evaluation Suite (MVES), an audit-oriented structure for application-level LLM evaluation. MVES links application categories to failure modes, metrics, required artifacts, and validation evidence across general LLM applications, retrieval-augmented systems, and agentic workflows. We pair the framework with a reproducible local evaluation harness covering structured extraction, RAG citation/content-compliance, and instruction-following checks. Using Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct, we evaluate five prompt conditions over expanded 30-case-per-suite ablations. The results show that, in the tested local conditions, generic prompt additions do not produce monotonic improvements: stronger output-contract prompts improve strict extraction for both models, while RAG citation/content-compliance declines under some generic-rule conditions. The largest observed decline occurs for Qwen 2.5 on RAG when generic rules are appended to the user prompt, from 26/30 to 9/30. These findings support evaluation-driven prompt iteration: prompt changes should be treated as potential regression risks and tested against task-specific suites before deployment. The accompanying repository contains the test suites, prompt variants, evaluation harness, raw result logs, and scripts needed to reproduce the reported local ablations.
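
The recommendation to treat prompt changes as regression risks can be sketched as a simple pass-rate gate. This is an illustrative assumption, not the report's harness: the function name `regressed` and the 5-point tolerance are hypothetical, and only the 26/30 and 9/30 counts come from the abstract.

```python
def regressed(baseline_pass, candidate_pass, total, tolerance=0.05):
    """True when the candidate prompt's pass rate drops by more than
    `tolerance` (an assumed threshold) relative to the baseline prompt."""
    return (baseline_pass - candidate_pass) / total > tolerance

# Qwen 2.5 on the RAG suite: 26/30 before, 9/30 after appending generic rules.
qwen_rag = regressed(baseline_pass=26, candidate_pass=9, total=30)
unchanged = regressed(baseline_pass=26, candidate_pass=26, total=30)
```

A gate like this would be run per task suite, since the abstract reports that the same prompt change helps strict extraction while hurting RAG compliance.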

2601.21817 2026-06-11 stat.ML cs.LG (updated version)

A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth

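Mingyuan Xu, Xinzi Tan, Jiawei Wu, Doudou Zhou

Affiliations: University of Technology Sydney

AI summary: Proposes a judge-aware ranking framework that extends the Bradley-Terry-Luce model with judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability without reference labels; this improves agreement with human preferences, raises data efficiency, and yields calibrated uncertainty quantification.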

Abstract

Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
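
One plausible reading of "judge-specific discrimination parameters" is that judge k prefers model i over model j with probability sigmoid(gamma_k * (theta_i - theta_j)). The abstract does not state the exact likelihood, so this parametrization and every name below are assumptions; the sketch only shows how such a model would down-weight an unreliable judge.

```python
import math

def win_prob(theta_i, theta_j, gamma_k):
    """P(judge k prefers model i over model j) under the assumed form
    sigmoid(gamma_k * (theta_i - theta_j))."""
    return 1.0 / (1.0 + math.exp(-gamma_k * (theta_i - theta_j)))

def log_likelihood(theta, gamma, comparisons):
    """comparisons: iterable of (winner, loser, judge) index triples."""
    return sum(math.log(win_prob(theta[w], theta[l], gamma[k]))
               for w, l, k in comparisons)

theta = [1.0, 0.0]   # latent model qualities (made-up values)
gamma = [2.0, 0.0]   # judge 0 discriminates sharply, judge 1 answers at random
p_sharp = win_prob(theta[0], theta[1], gamma[0])
p_noise = win_prob(theta[0], theta[1], gamma[1])
ll = log_likelihood(theta, gamma, [(0, 1, 0), (1, 0, 1)])
```

Under this form a judge with gamma equal to zero contributes a term that does not depend on theta, so its votes carry no weight when the qualities are estimated, whereas a plain Bradley-Terry-Luce fit would count them equally.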

2601.17717 2026-06-11 cs.AI cs.LG (updated version)

A Survey on Evaluating Quality and Trustworthiness in LLM-Generated Data

Kaituo Zhang, Mingzhi Hu, Hoang Anh Duy Le, Fariha Kabir Torsha, Zhimeng Jiang, Minh Khai Bui, Chia-Yuan Chang, Yu-Neng Chuang, Zhen Xiong, Ying Lin, Guanchu Wang, Na Zou

Affiliations: University of Houston; Worcester Polytechnic Institute; Rice University; Texas A&M University; University of Wisconsin - Madison; University of Southern California; University of North Carolina at Charlotte

AI summary: Proposes the LLM Data Auditor framework, which systematically categorizes evaluation metrics along two dimensions, quality and trustworthiness, analyzes evaluation deficiencies of data generation methods across six modalities, and gives recommendations for improvement.

Comments
Published at TMLR. Title changed in the final version

Abstract

Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acquisition costs of real-world data for model training, evaluation, and system iteration. However, ensuring the high quality of LLM-generated synthetic data remains a critical challenge. Existing research primarily focuses on generation methodologies, with limited direct attention to the quality of the resulting data. Furthermore, most studies are restricted to single modalities, lacking a unified perspective across different data types. To bridge this gap, we propose the LLM Data Auditor framework. In this framework, we first describe how LLMs are utilized to generate data across six distinct modalities. More importantly, we systematically categorize intrinsic metrics for evaluating synthetic data from two dimensions: quality and trustworthiness. This approach shifts the focus from extrinsic evaluation, which relies on downstream task performance, to the inherent properties of the data itself. Using this evaluation system, we analyze the experimental evaluations of representative generation methods for each modality and identify substantial deficiencies in current evaluation practices. Based on these findings, we offer concrete recommendations for the community to improve the evaluation of data generation. Finally, the framework outlines methodologies for the practical application of synthetic data across different modalities.

2601.17360 2026-06-11 cs.LG cs.AI cs.CR (updated version)

Robust Privacy: Inference-Stage Privacy through Certified Robustness

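Jiankai Jin, Xiangzheng Zhang, Zhao Liu, Wenzhuo Xu, Dongdong Yang, Deyue Zhang, Quanchen Zou

Affiliations: University of Science and Technology of China

AI summary: Introduces Robust Privacy (RP), which builds on certified robustness to ensure that predictions are invariant within a neighborhood of the input, thereby limiting inference-stage privacy leakage; experiments show that RP improves the privacy-utility tradeoff under attribute inference and model inversion attacks.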

Abstract

An adversary observing a model's released prediction can infer sensitive attributes of the queried input, or even reconstruct representatives of the model's training data. The inference interface thus acts as a side channel for privacy leakage. We introduce Robust Privacy (RP), an inference-stage privacy notion inspired by certified robustness: if a model's prediction is provably invariant within a radius-R neighborhood around an input x with confidence at least $1-\alpha$, then x enjoys $(R,\alpha)$-Robust Privacy, under which we prove that any adversary observing the released prediction has at most $\alpha/2$ advantage in distinguishing x from any input within distance R of x. Building on RP, we formalize Robust Attribute Privacy (RAP), an attribute-level privacy notion that characterizes the set of sensitive-attribute values that remain compatible with a released prediction. On a classification task, RP increases the median length of the RAP-compatible inference interval from 23.50 to 29.96, reducing attribute-inference precision. Model inversion attacks, often treated as a training-stage threat, in fact rely on fine-grained signals leaked through the inference interface; RP masks these signals at the inference stage, reducing attack success rate (ASR) from 73% to 4% on a black-box inversion attack. This direct targeting of the leakage channel enables RP to dominate DP-SGD and randomized response in the privacy-utility tradeoff space: RP retains 98.4% accuracy at 21% ASR, whereas DP-SGD must drop accuracy to 61.7% to reach a comparable ASR. Across both experiments, increasing the smoothing sample size N strengthens privacy and improves utility together. Finally, we examine model distillation as a scope boundary and show that RP mitigates attribute-level and instance-level inference-stage privacy leakage, but not function-level extraction through model distillation.

2601.14792 2026-06-11 cs.LG (updated version)

Robustness of Mixtures of Experts to Feature Noise

Dong Sun, Rahul Nittala, Rebekka Burkholz

AI summary: Studies the robustness of Mixture of Experts models under feature noise and finds that sparse expert activation acts as a noise filter, giving lower generalization error, stronger robustness, and faster convergence than dense networks.

Comments
ICML 2026

Abstract

Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.

2601.14764 2026-06-11 cs.AI cs.HC cs.LO (updated version)

An XAI View on Explainable ASP: Methods, Systems, and Perspectives

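Thomas Eiter, Tobias Geibinger, Zeynep G. Saribatur

Affiliations: Institute of Logic and Computation, TU Wien, Austria

AI summary: Surveys explanation methods for Answer Set Programming (ASP) from an XAI perspective, categorizes explanation types, assesses how well existing theory and tools cover them, and identifies research gaps and future directions.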

Comments
10 pages

Abstract

Answer Set Programming (ASP) is a popular declarative reasoning and problem-solving approach in symbolic AI. Its rule-based formalism makes it inherently attractive for explainable and interpretive reasoning, which is gaining importance with the surge of Explainable AI (XAI). A number of explanation approaches and tools for ASP have been developed, which often tackle specific explanatory settings and may not cover all scenarios that ASP users encounter. In this survey, we provide, guided by an XAI perspective, an overview of types of ASP explanations in connection with user questions for explanation, and describe their coverage by current theory and tools. Furthermore, we pinpoint gaps in existing ASP explanation approaches and identify research directions for future work.

2601.14031 2026-06-11 stat.ML cs.LG (updated version)

Intermittent time series forecasting: local vs global models

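Stefano Damato, Nicolò Rubattu, Dario Azzimonti, Giorgio Corani

Affiliations: Supplementary Institute of Science and Technology

AI summary: For intermittent time series forecasting, presents the first systematic comparison of probabilistic local models and global models (such as TiDE); finds that TiDE, a simple neural network architecture, beats local models in both accuracy and computational efficiency, and that the Tweedie distribution head gives the best estimates of high quantiles.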

Comments
Submitted to the Journal of the Operational Research Society

Abstract

Forecasting intermittent time series, which contain zeros, is a crucial challenge in supply chains as inventory policies require probabilistic forecasts to establish safety levels. Intermittent time series are commonly forecast using local models, trained individually on each time series. In recent years, global models, trained on a large collection of time series, have become popular for time series forecasting. Global models are often based on neural networks or gradient boosted trees. We carry out the first study comparing state-of-the-art probabilistic local and global models on intermittent time series. For global models we consider three different distribution heads suitable for intermittent time series: negative binomial, hurdle-shifted negative binomial and Tweedie. To the best of our knowledge, this is the first use of the latter two with neural networks. We perform experiments on five datasets comprising more than 40,000 real-world time series in total. Among global models, TiDE, a simple neural network architecture, achieves the best accuracy; it also consistently outperforms local models and has lower computational requirements. Large global models are instead much more computationally demanding and less accurate. Among the distribution heads, the Tweedie provides the best estimates of the highest quantiles.

2601.10774 2026-06-11 cs.LG hep-lat (updated version)

Analytic Bijections for Smooth and Interpretable Normalizing Flows

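Mathis Gerdes, Miranda C. N. Cheng

Affiliations: University of Cambridge

AI summary: Proposes three families of globally smooth, analytically invertible bijections to replace affine transformations or splines in coupling flows, and designs a radial flow architecture that matches coupling-flow quality on radially structured targets with one-thousandth of the parameters.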

Comments
Final ICML 2026 version. 9 + 14 pages, 10 + 11 figures, 3 + 2 tables. New CIFAR-10 and tabular-data results; main text shortened for readability

Abstract

A key challenge in normalizing flows is finding expressive invertible scalar bijections. Existing approaches face trade-offs: affine transformations are smooth and analytically invertible but lack expressivity; monotonic splines offer local control but are only piecewise smooth and act on bounded domains; residual flows achieve smoothness but need numerical inversion. We introduce three families of analytic bijections that are globally smooth ($C^\infty$), defined on all of $\mathbb{R}$, and analytically invertible in closed form, combining the favorable properties of prior approaches. Beyond serving as drop-in replacements in coupling flows, where they match or exceed spline performance, we develop radial flows: a novel architecture using direct parametrization that transforms the radial coordinate while preserving angular direction. Radial flows exhibit exceptional training stability, produce geometrically interpretable transformations, and on targets with radial structure can achieve comparable quality to coupling flows with $1000\times$ fewer parameters. We provide comprehensive evaluation on 1D and 2D benchmarks, and demonstrate applicability to higher-dimensional physics problems through experiments on $\phi^4$ lattice field theory, where our bijections outperform affine baselines and enable problem-specific designs that address mode collapse.
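
To make the three properties named in this abstract concrete (smooth on all of $\mathbb{R}$, invertible in closed form, cheap log-derivative), here is a sketch using the well-known sinh-arcsinh transformation. The abstract does not specify the paper's three families, so this choice and the parameter names `a` and `b` are assumptions; it is an example of the kind of scalar bijection described, not the authors' construction.

```python
import math

def forward(x, a, b):
    """y = sinh(a * asinh(x) + b); a smooth bijection of R when a > 0."""
    return math.sinh(a * math.asinh(x) + b)

def inverse(y, a, b):
    """Closed-form inverse of `forward`; no numerical root finding needed."""
    return math.sinh((math.asinh(y) - b) / a)

def log_abs_det(x, a, b):
    """log |dy/dx| = log(a) + log(cosh(a * asinh(x) + b)) - 0.5 * log(1 + x^2)."""
    return (math.log(a)
            + math.log(math.cosh(a * math.asinh(x) + b))
            - 0.5 * math.log1p(x * x))

# Round trip through the bijection with illustrative parameters.
roundtrip = inverse(forward(0.3, 1.5, -0.2), 1.5, -0.2)
```

In a coupling layer, `a` and `b` would be produced by a conditioner network from the untransformed coordinates, with `a` constrained to be positive.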

2601.08136 2026-06-11 cs.LG eess.SY (updated version)

Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies

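Zeyang Li, Sunbochen Tang, Navid Azizan

AI summary: To address the lack of target samples for diffusion and flow policies in online reinforcement learning, proposes the reverse flow matching framework, which uses posterior mean estimation and Langevin Stein operators to construct control variates, unifies the noise-expectation and gradient-expectation families of methods, and extends to flow policies, improving training efficiency and stability.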

Comments
ICML 2026 (Spotlight); Code: this https URL

Abstract

Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty that distinguishes online RL from standard generative modeling is the lack of direct samples from the target Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which uses a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. However, it remains unclear how these objectives are formally related, or whether they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that share the same expectation. We show that existing noise-expectation and gradient-expectation methods are simply two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and it enables the principled combination of Q-value and Q-gradient information to form an effective estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.

2601.07506 2026-06-11 cs.CL (updated version)

Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

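Dongryeol Lee, Yerin Hwang, Taegwan Kang, Minwoo Lee, Younhyung Chae, Kyomin Jung

Affiliations: Dept. of ECE, Seoul National University; LG AI Research; IPAI, Seoul National University

AI summary: Finds that when LLMs serve as automatic QA judges and the provided reference answer conflicts with the model's parametric knowledge, grading reliability degrades severely; studies the phenomenon systematically with a swapped-reference framework, showing that judges over-rely on parametric knowledge and disregard the reference, and that common prompt-based mitigation strategies do not help.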

Comments
Under review, 21 pages, 11 figures, 7 tables

Abstract

While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-reference QA framework that induces reference-belief conflicts. Specifically, we replace the reference answer with an incorrect entity and construct diverse pairings of original and swapped references with correspondingly aligned candidate answers. Surprisingly, grading reliability drops sharply under swapped references across a broad set of judge models. We empirically show that this vulnerability is driven by judges' over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict. Finally, we find that this failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of LLM-as-a-judge evaluation and motivating reference-based protocols that enforce stronger adherence to the provided reference.

2510.23508 2026-06-11 cs.CL (updated version)

M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset

Jiahui Geng, Jonathan Tonglet, Iryna Gurevych

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Ubiquitous Knowledge Processing Lab; Department of Computer Science, TU Darmstadt; National Research Center for Applied Cybersecurity ATHENE; Department of Electrical Engineering, KU Leuven; Department of Computer Science, KU Leuven

AI summary: To address the small scale, limited language coverage, and narrow task focus of existing fact-checking datasets, introduces M4FC, a multimodal dataset of 4,982 images and 6,980 claims covering six fact-checking tasks, and provides baseline results.

Comments
Preprint under review. Code and data available at: this https URL

Abstract

Existing real-world datasets for multimodal fact-checking have multiple limitations: they contain few instances, cover only one or two languages, focus only on one task, or rely on external news article sets for sourcing true claims. To address these shortcomings, we introduce M4FC, a new real-world dataset comprising 4,982 images paired with 6,980 claims. The images, verified by professional fact-checkers from 22 organizations, represent a diverse range of cultural and geographic contexts. Each claim is available in one or two out of ten languages. M4FC spans six multimodal fact-checking tasks: visual claim extraction, claimant intent prediction, fake image detection, image contextualization, location verification, and verdict prediction. We provide baseline results for all tasks and analyze how combining intermediate tasks affects verdict prediction performance. We make our dataset and code publicly available.

2601.04710 2026-06-11 cs.CL cs.LG (updated version)

Steering the Noise: Turning Random Perturbations into Effective Descent for Memory-Efficient LLM Fine-Tuning

Feihu Jin, Shipeng Cen, Ying Tan

Affiliations: School of Intelligence Science and Technology; Institute for Artificial Intelligence; Peking University; State Key Laboratory of General Artificial Intelligence

AI summary: Proposes a plug-and-play framework that selects or combines perturbations aligned with the optimization objective from a pool of candidates, improving zeroth-order gradient estimation and raising convergence speed and task accuracy in LLM fine-tuning.

Comments
12 pages, 6 figures

Abstract

Fine-tuning large language models (LLMs) achieves strong performance but is often limited by the memory overhead of backpropagation. Zeroth-order (ZO) optimization avoids this overhead by estimating gradients through forward passes alone, yet it typically converges slowly because random Gaussian perturbations yield high-variance gradient estimates in high-dimensional parameter spaces. In this paper, we propose a plug-and-play framework that turns random perturbations into more effective descent directions. The key idea is to draw a small pool of candidate perturbations, evaluate their loss values, and then select or combine those that are best aligned with the optimization objective. We develop two instantiations of this idea: MeZO-GV, which forms a guiding vector from the contrast between low-loss and high-loss perturbation groups, and MeZO-Greedy, which keeps the single best perturbation within a fixed evaluation budget. We theoretically show that both strategies yield a larger per-step reduction in the objective than standard ZO estimation, leading to improved convergence rates. Experiments on LLMs of different scales and architectures confirm that the proposed methods integrate naturally with existing ZO optimizers and consistently improve convergence speed and task accuracy. On OPT-13B, our approach outperforms all ZO baselines across 11 benchmarks and exceeds gradient-based methods on 9 of them, while retaining the memory efficiency of forward-only optimization.
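
The "draw a pool, evaluate, keep the best" idea can be sketched on a toy objective. This is a loose illustration in the spirit of the greedy variant described in the abstract, not the paper's MeZO-Greedy algorithm: the selection rule, the function name `greedy_zo_step`, the pool size, and the step sizes are all assumptions, and a quadratic stands in for the LLM loss.

```python
import random

def loss(theta):
    """Toy quadratic objective standing in for an LLM loss."""
    return sum(t * t for t in theta)

def greedy_zo_step(theta, rng, pool_size=4, eps=1e-3, lr=0.05):
    """Draw a pool of Gaussian perturbations, keep the one with the lowest
    perturbed loss, then take a two-point zeroth-order step along it."""
    pool = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(pool_size)]

    def perturbed(z, sign):
        return loss([t + sign * eps * zi for t, zi in zip(theta, z)])

    best = min(pool, key=lambda z: perturbed(z, +1))
    # Directional-derivative estimate from two forward evaluations only.
    g = (perturbed(best, +1) - perturbed(best, -1)) / (2 * eps)
    return [t - lr * g * zi for t, zi in zip(theta, best)]

rng = random.Random(0)
theta = [1.0, -2.0, 0.5]
start = loss(theta)
for _ in range(200):
    theta = greedy_zo_step(theta, rng)
end = loss(theta)
```

Only forward evaluations of `loss` are used, which is what keeps the memory footprint at inference level; the cost is `pool_size + 2` loss evaluations per step in this sketch.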

2601.04203 2026-06-11 cs.CL cs.CV cs.LG cs.SE (updated version)

FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

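Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen

Affiliations: Meta Superintelligence Labs; University of California, Los Angeles; Duke University

AI summary: Introduces the FronTalk benchmark, which evaluates front-end code generation through multi-turn dialogue and multi-modal feedback (textual and visual instructions); finds that models suffer from forgetting and struggle to interpret visual feedback, and proposes AceCoder, a method that effectively reduces forgetting and improves performance.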

Abstract

We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework that leverages a web agent to simulate users and explore the website, thereby measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are systematically under-explored in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach reduces forgetting to nearly zero and improves performance by up to 9.3 percentage points (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at this https URL

2601.03326 2026-06-11 cs.CV cs.LG (updated version)

Higher order PCA-like rotation-invariant features for detailed shape descriptors modulo rotation

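Jarek Duda

AI summary: Proposes extending PCA to higher-order tensors (such as order-3 central moments) or to polynomial-times-Gaussian descriptions to obtain more precise rotation-invariant shape descriptors, with applications to molecular shape description, object recognition, and shape similarity metrics.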

Comments
5 pages, 4 figures

Abstract

PCA can be used for rotation-invariant features: describing a shape by its covariance matrix $p_{ab}=E[(x_a-E[x_a])(x_b-E[x_b])]$ approximates the shape by an ellipsoid and allows rotation invariants such as the traces of its powers. However, real shapes are usually much more complicated, so this work proposes extending it to, e.g., order-3 or higher tensors $p_{abc}=E[(x_a-E[x_a])(x_b-E[x_b])(x_c-E[x_c])]$ describing central moments, or to a polynomial times a Gaussian, allowing decodable shape descriptors of arbitrarily high accuracy together with their analogous rotation invariants. Practical applications could include rotation-invariant features describing shape modulo rotation, e.g. molecular shape descriptors; object recognition up to rotation in 2D images or 3D scans, and perhaps 3D scene understanding; or a shape similarity metric that allows inexpensive comparison of objects modulo rotation, avoiding costly optimization over rotations.
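To make the idea of moment-based rotation invariants concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the choice of four invariants, and the test point cloud are all hypothetical. It computes the order-2 and order-3 central moment tensors of a point cloud and a few full contractions of them, which are unchanged when the cloud is rotated or translated.

```python
import numpy as np

def central_moments(points):
    """Order-2 (covariance) and order-3 central moment tensors of a point cloud."""
    centered = points - points.mean(axis=0)
    n = len(points)
    p2 = np.einsum("ia,ib->ab", centered, centered) / n
    p3 = np.einsum("ia,ib,ic->abc", centered, centered, centered) / n
    return p2, p3

def rotation_invariants(points):
    """A few scalar contractions that do not change under rotation (illustrative choice)."""
    p2, p3 = central_moments(points)
    return np.array([
        np.trace(p2),                      # Tr(p)
        np.trace(p2 @ p2),                 # Tr(p^2)
        np.einsum("abc,abc->", p3, p3),    # sum_abc p_abc^2
        np.einsum("aab,ccb->", p3, p3),    # |v|^2 with v_b = sum_a p_aab
    ])

def random_rotation(rng, dim=3):
    """Random proper rotation from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]                 # flip one axis so that det(q) = +1
    return q
```

Comparing `rotation_invariants(points)` with `rotation_invariants(points @ R.T + shift)` for a rotation `R` gives matching values up to floating-point error, so two shapes can be compared without searching over rotations.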

2506.08473 2026-06-11 cs.LG 版本更新

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

AsFT:在窄安全盆地内锚定大语言模型微调期间的安全性

Shuo Yang, Qihui Zhang, Yuyang Liu, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 针对微调大语言模型时安全性易受损的问题,提出AsFT方法,通过惩罚与对齐方向正交的更新,将模型约束在窄安全盆地内,在提升任务性能的同时显著降低有害行为。

详情
AI中文摘要

微调大语言模型(LLMs)可提升性能,但引入了关键的安全漏洞:即使极少的有害数据也会严重破坏安全措施。我们观察到,与对齐方向(由对齐(安全)模型与未对齐模型之间的权重差异定义)正交的扰动会迅速损害模型安全性。相反,沿对齐方向的更新则基本保持安全性,揭示了参数空间是一个“窄安全盆地”。为解决此问题,我们提出AsFT(在微调中锚定安全性),通过在微调过程中显式约束更新方向来维持安全性。通过惩罚与对齐方向正交的更新,AsFT有效将模型约束在“窄安全盆地”内,从而保持其固有安全性。在多个数据集和模型上的大量实验表明,AsFT将有害行为降低高达7.60%,任务性能提升3.44%,并在多个任务上持续优于现有方法。

英文摘要

Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the alignment direction - defined by weight differences between aligned (safe) and unaligned models - rapidly compromise model safety. In contrast, updates along the alignment direction largely preserve it, revealing the parameter space as a "narrow safety basin". To address this, we propose AsFT (Anchoring Safety in Fine-Tuning) to maintain safety by explicitly constraining update directions during fine-tuning. By penalizing updates orthogonal to the alignment direction, AsFT effectively constrains the model within the "narrow safety basin," thus preserving its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful behaviors by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks.
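The core geometric idea, penalizing the part of a weight update that is orthogonal to the alignment direction, can be sketched in a few lines of NumPy. This is only a sketch under assumptions: the abstract does not give the exact regularizer, so the squared-norm penalty, the weight `lam`, and the function names below are hypothetical, and weights are treated as flat vectors.

```python
import numpy as np

def orthogonal_penalty(update, alignment_direction):
    """Squared norm of the component of `update` orthogonal to the alignment direction."""
    d = alignment_direction / np.linalg.norm(alignment_direction)
    parallel = np.dot(update, d) * d       # projection onto the alignment direction
    orthogonal = update - parallel         # what remains after removing the projection
    return float(np.dot(orthogonal, orthogonal))

def regularized_loss(task_loss, update, alignment_direction, lam=1.0):
    """Task loss plus an assumed penalty weight `lam` on the orthogonal component."""
    return task_loss + lam * orthogonal_penalty(update, alignment_direction)
```

Here `alignment_direction` stands for the weight difference between the aligned and unaligned models, and `update` for the change that fine-tuning applies. An update lying along the direction incurs no penalty, while a purely orthogonal update is penalized by its full squared norm.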

2511.21594 2026-06-11 cs.LG 版本更新

Visualizing LLM Latent Space Geometry Through Dimensionality Reduction

通过降维可视化LLM潜在空间几何结构

Alex Ning, Vainateya Rangaraju, Yen-Ling Kuo

发表机构 * Department of Computer Science, University of Virginia(计算机科学系,弗吉尼亚大学)

AI总结 通过PCA和UMAP降维,可视化GPT-2和LLaMa中Transformer层的潜在状态几何,发现注意力与MLP输出分离、初始位置高范数及螺旋结构等模式。

详情
Comments
25 pages, 15 figures
AI中文摘要

大型语言模型(LLM)在许多自然语言任务中取得了最先进的结果,但其内部机制仍然难以解释。在这项工作中,我们通过降维提取、处理和可视化基于Transformer的语言模型中的潜在状态几何结构。我们在Transformer块内的多个点捕获逐层激活,并通过主成分分析(PCA)和均匀流形近似与投影(UMAP)实现系统分析。我们在GPT-2和LLaMa模型上进行了实验,发现了潜在空间中有趣的几何模式。值得注意的是,我们识别出中间层中注意力与MLP组件输出之间的清晰分离,据我们所知,这种模式在先前的工作中未被记录。我们还描述了初始序列位置潜在状态的高范数,并可视化了潜在状态的逐层演化。此外,我们展示了GPT-2位置嵌入的高维螺旋结构以及LLaMa中按序列的几何模式。我们在以下网址提供代码:this https URL。相同内容的更好格式的博客文章可在以下网址获取:this https URL。

英文摘要

Large language models (LLMs) achieve state-of-the-art results across many natural language tasks, but their internal mechanisms remain difficult to interpret. In this work, we extract, process, and visualize latent state geometries in Transformer-based language models through dimensionality reduction. We capture layerwise activations at multiple points within Transformer blocks and enable systematic analysis through Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). We demonstrate experiments on GPT-2 and LLaMa models, where we uncover interesting geometric patterns in latent space. Notably, we identify a clear separation between attention and MLP component outputs across intermediate layers, a pattern not documented in prior work to our knowledge. We also characterize the high norm of latent states at the initial sequence position and visualize the layerwise evolution of latent states. Additionally, we demonstrate the high-dimensional helical structure of GPT-2's positional embeddings and the sequence-wise geometric patterns in LLaMa. We make our code available at this https URL. A better formatted blog-post with identical content is available at this https URL.
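As a rough illustration of the PCA step described above, the following sketch projects a matrix of activations (one row per token, one column per hidden dimension) onto its leading principal components using an SVD. The function name and the synthetic input are assumptions made for illustration; the authors' released code may be organized quite differently.

```python
import numpy as np

def pca_project(activations, n_components=2):
    """Project activations onto their top principal components.

    Returns the low-dimensional coordinates and the fraction of variance
    explained by each retained component.
    """
    centered = activations - activations.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return coords, explained[:n_components]
```

In practice `activations` would hold hidden states captured at a chosen point in a Transformer block, and the returned `coords` could be passed directly to a scatter plot.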

2601.00791 2026-06-11 cs.LG cs.AI cs.CL cs.LO 版本更新

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

推理的几何:有效数学推理的谱特征

Valentin Noël

发表机构 * Valentin Noël(瓦伦丁·诺埃尔)

AI总结 通过将注意力矩阵视为加权词图,提取四个无需学习的谱诊断指标(Fiedler值、高频能量比、谱熵和平滑度),有效区分有效推理与模式匹配,在多个模型上达到85-96%的分类准确率。

详情
Comments
30 pages, 13 figures, Accepted at ICML 2026 (main track)
AI中文摘要

验证语言模型是真正推理还是模式匹配仍然是一个开放问题:学习型验证器成本高昂,基于输出的启发式方法脆弱。我们证明,有效的数学推理在Transformer注意力中诱导出可测量的、无需训练的谱特征。通过将每个注意力矩阵视为加权词图,我们提取四个诊断指标:Fiedler值、高频能量比(HFER)、谱熵和平滑度,这些指标无需学习参数。在来自四个架构家族的七个模型上的实验产生了高达Cohen's $d = 3.30$($p < 10^{-116}$)的效应量,实现了$85$--$96\%$的单阈值分类准确率。两个发现加深了理解。首先,“柏拉图式有效性”:谱信号追踪逻辑连贯性而非编译器接受性,因超时或缺失导入而被拒绝的证明被正确分类为有效,这一区别通过人工审核确认($\kappa = 0.82$,$n = 51$)。其次,“架构确定性”:滑动窗口注意力将判别特征从HFER转移到平滑度($d = 2.09$,$p < 10^{-48}$),表明注意力设计决定了哪个谱通道编码推理质量。因果消融证实该特征追踪归纳头电路。该方法泛化到非形式化思维链($d = 0.78$,$p < 10^{-3}$),并且在证明搜索中,HFER重排序将Best-of-16 Pass@1提高了$+4.4$--$6.6\%$,匹配了完全监督探针AUC的$98\%$且无需标签。谱图分析是一种原则性的、架构感知的推理验证原语。

英文摘要

Verifying whether a language model is genuinely reasoning or pattern-matching remains an open problem: learned verifiers are expensive, and output-based heuristics are brittle. We show that valid mathematical reasoning induces a measurable, training-free spectral signature in transformer attention. By treating each attention matrix as a weighted token graph, we extract four diagnostics (Fiedler value, High-Frequency Energy Ratio (HFER), spectral entropy, and smoothness) that require no learned parameters. Experiments across seven models from four architectural families yield effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling $85$--$96\%$ single-threshold classification accuracy. Two findings sharpen the interpretation. First, Platonic validity: the spectral signal tracks logical coherence rather than compiler acceptance, so proofs rejected for timeouts or missing imports are still correctly classified as valid, a distinction confirmed by a manual audit ($\kappa = 0.82$, $n = 51$). Second, architectural determinism: Sliding Window Attention shifts the discriminative feature from HFER to smoothness ($d = 2.09$, $p < 10^{-48}$), showing that attention design governs which spectral channel encodes reasoning quality. Causal ablation confirms the signature traces induction-head circuits. The method generalises to informal chain-of-thought ($d = 0.78$, $p < 10^{-3}$), and in proof search, HFER reranking improves Best-of-16 Pass@1 by $+4.4$--$6.6\%$, matching $98\%$ of the AUC of fully supervised probes with zero labels. Spectral graph analysis is a principled, architecture-aware primitive for reasoning verification.
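The abstract does not spell out how the four diagnostics are defined, so the sketch below uses common spectral-graph conventions as stand-ins. Every definition here is an assumption: the attention matrix is symmetrized, the combinatorial Laplacian $L = D - W$ is used, HFER is taken as the share of signal energy above a hypothetical frequency cutoff, and smoothness is approximated by the normalized Dirichlet energy $x^\top L x / x^\top x$.

```python
import numpy as np

def spectral_diagnostics(attention, signal, cutoff=0.5):
    """Illustrative spectral diagnostics of a token signal on an attention graph."""
    weights = 0.5 * (attention + attention.T)             # undirected token graph
    laplacian = np.diag(weights.sum(axis=1)) - weights    # L = D - W
    eigvals, eigvecs = np.linalg.eigh(laplacian)          # eigenvalues in ascending order
    energy = (eigvecs.T @ signal) ** 2                    # graph Fourier energy per frequency
    total = energy.sum()
    first_high = int(np.ceil(cutoff * len(eigvals)))      # assumed split between low and high
    prob = energy / total
    nonzero = prob[prob > 0]
    return {
        "fiedler": float(eigvals[1]),                     # second-smallest eigenvalue
        "hfer": float(energy[first_high:].sum() / total),
        "spectral_entropy": float(-(nonzero * np.log(nonzero)).sum()),
        "dirichlet_energy": float(signal @ laplacian @ signal / (signal @ signal)),
    }
```

With a uniform causal attention pattern, a constant signal has essentially zero high-frequency energy and zero Dirichlet energy, whereas a signal that alternates from token to token has strictly positive Dirichlet energy.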

2505.00571 2026-06-11 stat.ML cs.LG 版本更新

Discovery and inference beyond linearity for epidemiological data by integrating Bayesian regression, tree ensembles and Shapley values

通过整合贝叶斯回归、树集成和Shapley值对流行病学数据进行线性之外的发现与推断

Giorgio Spadaccini, Marjolein Fokkema, Mark A. van de Wiel

发表机构 * Amsterdam UMC Leiden University(阿姆斯特丹大学医学中心-莱顿大学) Leiden University(莱顿大学) Amsterdam UMC(阿姆斯特丹大学医学中心)

AI总结 提出RuleSHAP框架,结合贝叶斯稀疏回归、改进的树规则生成器和Shapley值,实现非线性与交互效应的检测及个体水平的不确定性量化,应用于流行病学数据发现高胆固醇和血压的影响因素。

详情
AI中文摘要

机器学习在流行病学和医疗健康研究中越来越受欢迎,用于无假设地发现风险和保护因素。机器学习在发现非线性和交互作用方面很强,但这种能力因缺乏可靠的推断而受损。尽管Shapley值提供了特征效应的局部度量,但这些效应通常缺乏有效的不确定性量化,从而排除了统计推断。我们提出RuleSHAP,一个通过结合专用贝叶斯稀疏回归模型、改进的基于树的规则生成器和Shapley值归因来解决这一局限性的框架。RuleSHAP能够检测非线性和交互效应,其关键贡献在于个体水平的不确定性量化。我们推导了一个在该框架内计算边际Shapley值的有效公式。我们将RuleSHAP应用于一个流行病学队列的数据,以检测和推断高胆固醇和血压的几种效应,例如年龄、性别、种族、BMI和血糖水平等特征之间的非线性交互效应。最后,我们在模拟数据上证明了我们框架的有效性。

英文摘要

Machine Learning (ML) is gaining popularity in epidemiology and healthcare studies for hypothesis-free discovery of risk and protective factors. ML is strong at discovering nonlinearities and interactions, but this power is compromised by a lack of reliable inference. Although Shapley values provide local measures of features' effects, valid uncertainty quantification for these effects is typically lacking, thus precluding statistical inference. We propose RuleSHAP, a framework that addresses this limitation by combining a dedicated Bayesian sparse regression model with an improved tree-based rule generator and Shapley value attribution. RuleSHAP provides detection of nonlinear and interaction effects, with uncertainty quantification at the individual level as a key contribution. We derive an efficient formula for computing marginal Shapley values within this framework. We apply RuleSHAP to data from an epidemiological cohort to detect and infer several effects for high cholesterol and blood pressure, such as nonlinear interaction effects between features like age, sex, ethnicity, BMI and glucose level. To conclude, we demonstrate the validity of our framework on simulated data.
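For readers unfamiliar with marginal Shapley values, the brute-force sketch below shows what is being computed. It is not RuleSHAP's efficient formula, which the abstract mentions but does not state; it simply enumerates all feature orderings, and the toy model and function names are hypothetical. Features outside a coalition are replaced by values from background rows, which is the usual "marginal" value function.

```python
from itertools import permutations

def marginal_value(model, x, background, coalition):
    """Mean model output with coalition features fixed to x, the rest taken from background rows."""
    total = 0.0
    for row in background:
        mixed = [x[j] if j in coalition else row[j] for j in range(len(x))]
        total += model(mixed)
    return total / len(background)

def shapley_values(model, x, background):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        coalition = set()
        previous = marginal_value(model, x, background, coalition)
        for j in order:
            coalition.add(j)
            current = marginal_value(model, x, background, coalition)
            phi[j] += current - previous
            previous = current
    return [value / len(orders) for value in phi]

def toy_model(features):
    # Hypothetical model: one additive effect plus one interaction term.
    return 2.0 * features[0] + features[1] * features[2]
```

The enumeration costs factorial time in the number of features, which is why a closed-form or otherwise efficient computation matters for realistic cohorts.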

2512.22219 2026-06-11 cs.DC cs.LG cs.PL 版本更新

MPK: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

MPK:一种用于将张量程序转化为巨型内核的编译器和运行时系统

Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xupeng Miao, Tianqi Chen, Zhihao Jia

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Tsinghua University(清华大学) NVIDIA University of Michigan(密歇根大学) Independent Researcher(独立研究者) Peking University(北京大学)

AI总结 提出MPK,首个自动将多GPU模型推理转化为单个高性能巨型内核的编译器和运行时系统,通过SM级图表示实现跨算子软件流水线和细粒度计算通信重叠,显著降低推理延迟。

详情
Comments
14 pages
AI中文摘要

我们介绍了Mirage Persistent Kernel (MPK),这是首个自动将多GPU模型推理转化为单个高性能巨型内核的编译器和运行时系统。MPK引入了一种SM级图表示,该表示在单个流式多处理器(SM)的粒度上捕获数据依赖关系,从而实现跨算子软件流水线、计算与通信的细粒度重叠,以及在传统每算子内核执行模型下不可行的其他优化。MPK编译器将张量程序降级为优化的SM级任务图,并为每个任务生成快速的CUDA实现,而MPK内核内并行运行时则通过跨SM的分散调度在单个持久巨型内核内执行这些任务。这些组件共同提供了端到端的内核融合,且开发工作量极小,同时保留了现有编程模型的灵活性。我们的评估表明,MPK显著优于现有的每算子内核LLM服务系统,实现了高达1.7倍的端到端推理延迟降低,并将LLM推理性能推近底层硬件的极限。MPK在此https URL公开可用。

英文摘要

We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance mega-kernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, fine-grained overlap of computation and communication, and other optimizations that are infeasible under the conventional kernel-per-operator execution model. The MPK compiler lowers tensor programs into optimized SM-level task graphs and generates fast CUDA implementations for each task, while the MPK in-kernel parallel runtime executes these tasks within a single persistent mega-kernel using decentralized scheduling across SMs. Together, these components provide end-to-end kernel fusion with minimal developer effort, while preserving the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems, achieving up to 1.7$\times$ lower end-to-end inference latency and pushing LLM inference performance close to the limits of the underlying hardware. MPK is publicly available at this https URL.

2512.19245 2026-06-11 eess.SY cs.RO 版本更新

Vision-Aided Relative State Estimation for Approach and Landing on a Moving Platform with Inertial Measurements

基于惯性测量的视觉辅助相对状态估计:面向移动平台的进近与着陆

Tarek Bouazza, Alessandro Melis, Soulaimane Berkane, Robert Mahony, Tarek Hamel

发表机构 * I3S, CNRS, Université Côte d’Azur(I3S、CNRS、蔚蓝海岸大学) Département d’informatique et d’ingénierie, Université du Québec en Outaouais and Department of Electrical Engineering, Lakehead University(魁北克大学乌塔韦分校计算机与工程系及湖首大学电气工程系) Systems Theory and Robotics Group Australian National University(系统理论与机器人组、澳大利亚国立大学) Institut Universitaire de France (IUF)(法兰西大学研究院)

AI总结 提出一种级联观测器,结合SO(3)互补滤波和线性Riccati观测器,利用IMU和单目相机估计无人机与移动平台的相对位姿和速度,在持续激励条件下实现几乎全局渐近稳定。

详情
Comments
13 pages, 4 figures. To appear in proceedings of IFAC World Congress 2026
AI中文摘要

本文解决了在进近和着陆过程中,无人机与经历任意三维运动的平面平台之间的相对位置、姿态和速度的估计问题。该估计依赖于安装在两个系统上的惯性测量单元(IMU)的测量值,假设存在合适的通信信道来交换数据,以及由机载单目相机提供的视觉信息,从中提取平台中心的方位(视线方向)和其平面表面的法向量。我们提出了一种级联观测器,在$\mathbf{SO}(3)$上采用互补滤波器来重构相对姿态,随后使用线性Riccati观测器进行相对位置和速度估计。在持续激励条件下建立了两个观测器的收敛性,并证明了级联是几乎全局渐近和局部指数稳定的。我们进一步将设计扩展到平台旋转限制在其法向轴的情况,并表明可以利用其测量的线性加速度来恢复剩余不可观测的旋转角。提供了该情况下局部指数收敛的充分条件。通过大量仿真验证了所提出的观测器。

英文摘要

This paper tackles the problem of estimating the relative position, orientation, and velocity between a UAV and a planar platform undergoing arbitrary 3D motion during approach and landing. The estimation relies on measurements from Inertial Measurement Units (IMUs) mounted on both systems, assuming there is a suitable communication channel to exchange data, together with visual information provided by an onboard monocular camera, from which the bearing (line-of-sight direction) to the platform's center and the normal vector of its planar surface are extracted. We propose a cascade observer with a complementary filter on $\mathbf{SO}(3)$ to reconstruct the relative attitude, followed by a linear Riccati observer for relative position and velocity estimation. Convergence of both observers is established under persistently exciting conditions, and the cascade is shown to be almost globally asymptotically and locally exponentially stable. We further extend the design to the case where the platform's rotation is restricted to its normal axis and show that its measured linear acceleration can be exploited to recover the remaining unobservable rotation angle. A sufficient condition for local exponential convergence in this setting is provided. The proposed observers are validated through extensive simulations.

2512.14096 2026-06-11 cs.CV 版本更新

RSTR: Reducing SpatioTemporal Redundancy in Diffusion Transformers

RSTR: 减少扩散Transformer中的时空冗余

Ruitong Sun, Tianze Yang, Wei Niu, Jin Sun

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出RSTR框架,通过进化搜索和自适应秩分配联合减少扩散Transformer中的时空冗余,实现50%-70%计算节省并保持或提升生成质量。

详情
Comments
International Conference on Machine Learning (ICML)
AI中文摘要

扩散Transformer(DiTs)在图像生成中取得了显著成功,但其部署受到高计算成本的阻碍。我们识别出两种冗余来源。首先,时间冗余:无分类器引导(CFG)在每个时间步应用昂贵的双重前向传播,然而引导仅在特定步骤重要,且关键步骤的可变尺度可以补偿跳过其他步骤。其次,空间冗余:在可变引导下,不同Transformer块表现出异质性敏感性,但跨所有块的统一校准浪费计算且未能满足其不同需求。我们提出RSTR,这是首个联合减少扩散Transformer中时空冗余的框架。第一阶段通过进化搜索解决时间冗余,发现具有可变尺度的稀疏引导调度。第二阶段通过自适应秩分配解决空间冗余,根据敏感性将校准能力分配给Transformer区域。在DiT-XL/2、PixArt-$\alpha$、FLUX和最先进的Qwen-Image上的实验表明,在保持或提升质量的同时实现了50%-70%的计算节省。在DiT-XL/2上,RSTR实现了57%的节省和15%的FID改进;在Qwen-Image上,实现了3.43倍加速且质量保持不变。

英文摘要

Diffusion Transformers (DiTs) have achieved remarkable success in image generation, yet their deployment is hindered by high computational costs. We identify two sources of redundancy. First, temporal redundancy: Classifier-Free Guidance (CFG) applies costly dual forward passes at every timestep, yet guidance matters only at specific steps, and variable scales at critical steps can compensate for skipping others. Second, spatial redundancy: under variable guidance, different transformer blocks exhibit heterogeneous sensitivity, yet uniform calibration across all blocks wastes computation while failing to address their varying requirements. We present RSTR, the first framework to jointly reduce spatiotemporal redundancy in diffusion transformers. Stage-1 addresses temporal redundancy through evolutionary search, discovering sparse guidance schedules with variable scales. Stage-2 addresses spatial redundancy through adaptive rank allocation, assigning calibration capacities to transformer regions based on their sensitivity. Experiments on DiT-XL/2, PixArt-$\alpha$, FLUX, and state-of-the-art Qwen-Image demonstrate 50%-70% compute savings while maintaining or improving quality. On DiT-XL/2, RSTR achieves 57% savings with 15% FID improvement; on Qwen-Image, 3.43$\times$ speedup with preserved quality.
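The temporal-redundancy argument can be illustrated with a small pure-Python sketch of classifier-free guidance under a sparse schedule. It uses the standard CFG combination `eps = eps_uncond + scale * (eps_cond - eps_uncond)`; everything else is assumed for illustration. The schedule is written by hand (RSTR finds it by evolutionary search), `toy_model` is a hypothetical stand-in for a diffusion transformer, and the latent update between steps is omitted.

```python
def guided_prediction(eps_cond, eps_uncond, scale):
    """Standard classifier-free guidance combination of two noise predictions."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

def run_schedule(model, latent, schedule):
    """Apply guidance only at steps whose scale is not None and count forward passes.

    `schedule` holds one entry per timestep: a guidance scale, or None to skip
    the unconditional pass at that step. The latent is not updated here.
    """
    passes = 0
    outputs = []
    for step, scale in enumerate(schedule):
        eps_cond = model(latent, step, conditional=True)
        passes += 1
        if scale is None:
            eps = eps_cond                      # single pass, no guidance
        else:
            eps_uncond = model(latent, step, conditional=False)
            passes += 1                         # second pass only when guiding
            eps = guided_prediction(eps_cond, eps_uncond, scale)
        outputs.append(eps)
    return outputs, passes

def toy_model(latent, step, conditional):
    # Hypothetical stand-in: the conditional branch shifts every entry by 1.
    return [value + (1.0 if conditional else 0.0) for value in latent]
```

With the schedule `[None, 2.0, None, 3.0]`, four steps cost six forward passes instead of the eight that dense guidance would need, which is the kind of saving a sparse schedule targets.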

2512.13666 2026-06-11 cs.CR cs.DC cs.IT cs.LG 版本更新

SEDULity: A Proof-of-Learning Framework for Distributed and Secure Blockchains with Efficient Useful Work

SEDULity:一种面向分布式安全区块链的高效有用工作证明学习框架

Weihang Cao, Mustafa Doger, Sennur Ulukus

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系)

AI总结 提出一种名为SEDULity的证明学习框架,通过将区块模板编码到训练过程中并设计难解易验的有用函数替代PoW谜题,在保持区块链安全性的同时高效训练机器学习模型。

详情
AI中文摘要

工作量证明(PoW)的安全性和去中心化已在现有区块链系统中得到充分验证,但其巨大的能源浪费引发了可持续性担忧。有用工作证明(PoUW)旨在将无意义的计算重定向到有意义任务(如解决机器学习问题),从而催生了学习证明(PoL)分支。尽管已有研究提出了多种PoL,但它们都在一定程度上存在安全性、去中心化或效率问题。本文提出一种PoL框架,在完全分布式环境中高效训练机器学习模型,同时维护区块链安全性。我们将该框架命名为SEDULity,代表安全、高效、分布式和有用的基于学习的区块链系统。具体而言,我们将区块模板编码到训练过程中,并设计一种难解但相对易验的有用函数,作为PoW谜题的替代。我们证明该框架是分布式、安全的,并能高效训练机器学习模型。进一步展示所提出的PoL框架可扩展到其他类型的有用工作,并设计激励机制以激励任务验证。理论上证明,在精心设计的系统参数下,理性矿工有动机完全诚实地进行训练。最后,通过仿真结果展示框架性能并验证分析。

英文摘要

The security and decentralization of Proof-of-Work (PoW) have been well-tested in existing blockchain systems. However, its tremendous energy waste has raised concerns about sustainability. Proof-of-Useful-Work (PoUW) aims to redirect the meaningless computation to meaningful tasks such as solving machine learning (ML) problems, giving rise to the branch of Proof-of-Learning (PoL). While previous studies have proposed various PoLs, they all, to some degree, suffer from security, decentralization, or efficiency issues. In this paper, we propose a PoL framework that trains ML models efficiently while maintaining blockchain security in a fully distributed manner. We name the framework SEDULity, which stands for a Secure, Efficient, Distributed, and Useful Learning-based blockchain system. Specifically, we encode the template block into the training process and design a useful function that is difficult to solve but relatively easy to verify, as a substitute for the PoW puzzle. We show that our framework is distributed, secure, and efficiently trains ML models. We further demonstrate that the proposed PoL framework can be extended to other types of useful work and design an incentive mechanism to incentivize task verification. We show theoretically that a rational miner is incentivized to train fully honestly with well-designed system parameters. Finally, we present simulation results to demonstrate the performance of our framework and validate our analysis.

2505.15201 2026-06-11 cs.LG cs.AI cs.CL stat.ML 版本更新

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Pass@K 策略优化:解决更困难的强化学习问题

Christian Walder, Deep Karkhanis

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 提出 Pass-at-k 策略优化 (PKPO),通过变换奖励直接优化 pass@k 性能,利用低方差无偏估计器,在训练中退火 k 可同时提升 pass@1 和 pass@k,解决更难问题。

详情
AI中文摘要

强化学习算法对每个问题采样多个 n>1 的解决方案尝试并独立奖励它们。这优化了 pass@1 性能,优先考虑孤立样本的强度,而牺牲了样本集的多样性和集体效用。这未充分利用采样能力,限制了探索和在更难示例上的最终改进。作为修复,我们提出 Pass-at-k 策略优化 (PKPO),一种对最终奖励的变换,导致直接优化 pass@k 性能,从而优化联合考虑时最大化奖励的样本集。我们的贡献是推导出 pass@k 及其梯度在二元和连续奖励设置中的新型低方差无偏估计器。我们展示了使用我们的估计器进行优化简化为标准强化学习,其中奖励经过稳定高效的变换函数联合变换。虽然先前的工作仅限于 k=n,但我们是第一个能够对任意 k ≤ n 实现 pass@k 鲁棒优化的。此外,我们的方法不是以 pass@1 性能换取 pass@k 增益,而是允许在训练中退火 k,同时优化两个指标,通常能在显著 pass@k 增益的同时获得强大的 pass@1 数值。我们在玩具实验上验证了我们的奖励变换,揭示了我们的公式的方差减少特性。我们还使用开源 LLM GEMMA-2 包含了真实世界的例子。我们发现我们的变换有效地优化了目标 k。此外,更高的 k 值能够解决更多和更难的问题,而退火 k 则同时提升了 pass@1 和 pass@k。关键的是,在传统 pass@1 优化停滞的具有挑战性的任务集上,我们的 pass@k 方法解锁了学习,这可能是由于通过优先考虑联合效用而非单个样本的效用实现了更好的探索。

英文摘要

Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
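As background for the metric being optimized, the sketch below implements the widely used unbiased pass@k estimator for binary rewards: with $n$ samples of which $c$ are correct, the estimate is $1 - \binom{n-c}{k} / \binom{n}{k}$. This is the standard evaluation estimator, shown here only for orientation; it is not the paper's new gradient estimators or reward transformation, which the abstract does not specify.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples with c correct ones (binary rewards)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # Probability that a uniformly random size-k subset contains no correct sample.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong
```

The estimator equals the fraction of size-k subsets of the n samples that contain at least one correct sample, so it rewards a set of attempts jointly rather than each attempt in isolation.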

2512.11982 2026-06-11 astro-ph.IM cs.AI cs.CV cs.LG 版本更新

Semantic search for 100M+ galaxy images using AI-generated captions

基于AI生成描述的1亿+星系图像语义搜索

Nolan Koblischke, Liam Parker, Francois Lanusse, Jo Bovy, Irina Espejo, Shirley Ho

发表机构 * New York University(纽约大学) University of Toronto(多伦多大学) Dunlap Institute for Astronomy & Astrophysics(邓拉普天文与天体物理研究所) University of California, Berkeley(加州大学伯克利分校) Center for Data Science(数据科学中心) Lawrence Berkeley National Lab(劳伦斯伯克利国家实验室) Flatiron Institute(熨斗研究所) Université Paris-Saclay(巴黎-萨克雷大学) CEA(法国原子能委员会) CNRS(法国国家科学研究中心) AIM(天体物理、仪器与建模实验室) Princeton University(普林斯顿大学)

AI总结 提出利用视觉语言模型生成星系图像描述,并对比对齐预训练天文学基础模型,构建可搜索嵌入,实现大规模星系图像的语义搜索,在稀有现象发现上取得最先进性能。

详情
Comments
ApJ, in press
AI中文摘要

通过缓慢的手动标注活动寻找科学上有趣的现象严重限制了我们对望远镜产生的数十亿星系图像的探索能力。在这项工作中,我们开发了一个流水线,从完全未标记的图像数据创建语义搜索引擎。我们的方法利用视觉语言模型(VLM)为星系图像生成描述,然后将预训练的天文学基础模型与这些嵌入的描述进行对比对齐,以产生大规模可搜索的嵌入。我们发现当前的VLM提供的描述信息足够丰富,可以训练一个语义搜索模型,该模型优于直接图像相似性搜索。我们的模型AION-Search在寻找稀有现象方面实现了最先进的零样本性能,尽管训练是在随机选择的图像上进行的,没有针对稀有情况进行刻意策划。此外,我们引入了一种基于VLM的重排序方法,该方法在top-100结果中对我们最具挑战性的目标的召回率几乎翻倍。首次,AION-Search实现了对超过1亿张星系图像的灵活语义搜索,使得从以前不可行的搜索中能够发现新现象,包括识别出36个新的河外恒星流候选体。更广泛地说,我们的工作提供了一种方法,使大型、未标记的科学图像档案变得可语义搜索,扩展了从地球观测到显微镜等领域的数据探索能力。代码、数据和应用程序可在以下网址公开获取:this https URL

英文摘要

Finding scientifically interesting phenomena through slow manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore, we introduce a VLM-based re-ranking method that nearly doubles the recall for our most challenging targets in the top-100 results. For the first time, AION-Search enables flexible semantic search for over 100 million galaxy images, enabling discovery from previously infeasible searches, including the identification of 36 new extragalactic stellar stream candidates. More broadly, our work provides an approach for making large, unlabeled scientific image archives semantically searchable, expanding data exploration capabilities in fields from Earth observation to microscopy. The code, data, and app are publicly available at this https URL

2512.11393 2026-06-11 cs.CV 版本更新

The N-Body Problem: Parallel Execution from Single-Person Egocentric Video

N体问题:从单人第一人称视角视频到并行执行

Zhifan Zhu, Yifei Huang, Yoichi Sato, Dima Damen

发表机构 * University of Bristol(布里斯托尔大学) The University of Tokyo(东京大学)

AI总结 提出N体问题,从单人第一人称视角视频预测N人并行执行任务,通过结构化提示策略引导视觉语言模型推理3D环境、物体使用和时间依赖,在EPIC-Kitchens和HD-EPIC数据集上显著提升动作覆盖率并降低冲突。

详情
Comments
project webpage: this https URL
AI中文摘要

人类可以直观地并行化复杂活动,但模型能否通过观察一个人来预测这一点?给定一个第一人称视角视频,我们引入N体问题:预测N个人如何假设性地执行同一组任务。目标是最大化加速,但将视频片段天真地分配给个人往往违反现实世界约束,导致物理上不可能的场景,例如两个人使用同一物体或占据同一空间。为了量化这一点,我们形式化了N体问题,并提出了一套度量标准来评估性能(加速、任务覆盖)和可行性(空间碰撞、物体冲突和因果约束)。作为概念验证,我们引入了一种结构化提示策略,引导视觉语言模型(VLM)推理3D环境、物体使用和时间依赖,从而产生可行的并行执行。在来自EPIC-Kitchens和HD-EPIC的100个视频上,对于N=2,我们的结构化提示相比Gemini 2.5 Pro的基线提示,动作覆盖率提高了45%,同时碰撞率、物体冲突和因果冲突分别降低了51%、52%和55%。

英文摘要

Humans can intuitively parallelise complex activities, but can a model predict this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: predicting how N individuals can hypothetically perform the same set of tasks. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios such as two people using the same object or occupying the same space. To quantify this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). As a proof of concept, we introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies, producing a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, for $N = 2$, our structured prompt improves action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously reducing collision rates, object conflicts and causal conflicts by 51%, 52% and 55%, respectively.

2512.11081 2026-06-11 stat.ML cs.LG stat.ME 版本更新

Provable Recovery of Locally Important Signed Features and Interactions from Random Forest

从随机森林中可证明地恢复局部重要符号特征和交互

Kata Vuk, Nicolas Alexander Ihlo, Merle Behr

发表机构 * Faculty of Informatics and Data Science, University of Regensburg, Germany(德国雷根斯堡大学信息与数据科学学院)

AI总结 提出一种局部、模型特定的特征与交互重要性方法,通过结合全局和局部决策路径模式,在局部尖峰稀疏模型下可证明地恢复真实信号特征及其交互,并识别特征值大小对预测的驱动方向。

详情
AI中文摘要

特征与交互重要性(FII)方法在监督学习中至关重要,用于评估复杂预测模型中输入变量及其交互的相关性。在许多领域,如个性化医疗,通常需要针对单个预测的局部解释,而不是总结整体特征重要性的全局分数。随机森林(RF)在这些场景中被广泛使用,现有的可解释性方法通常利用树结构和分裂统计量来提供模型特定的见解。然而,对RF的局部FII方法的理论理解仍然有限,这使得如何解释单个预测的高重要性分数变得不明确。我们提出了一种新颖的、局部的、模型特定的FII方法,该方法识别特征在决策路径上的频繁共现,将全局模式与特定测试点路径上的模式相结合。我们证明,在局部尖峰稀疏(LSS)模型下,我们的方法一致地恢复真实的局部信号特征及其交互,并识别出大或小的特征值是否驱动预测。通过模拟研究和真实数据示例,我们展示了我们的方法和理论结果的有用性。

英文摘要

Feature and Interaction Importance (FII) methods are essential in supervised learning for assessing the relevance of input variables and their interactions in complex prediction models. In many domains, such as personalized medicine, local interpretations for individual predictions are often required, rather than global scores summarizing overall feature importance. Random Forests (RFs) are widely used in these settings, and existing interpretability methods typically exploit tree structures and split statistics to provide model-specific insights. However, theoretical understanding of local FII methods for RF remains limited, making it unclear how to interpret high importance scores for individual predictions. We propose a novel, local, model-specific FII method that identifies frequent co-occurrences of features along decision paths, combining global patterns with those observed on paths specific to a given test point. We prove that our method consistently recovers the true local signal features and their interactions under a Locally Spike Sparse (LSS) model and also identifies whether large or small feature values drive a prediction. We illustrate the usefulness of our method and theoretical results through simulation studies and a real-world data example.

2512.08211 2026-06-11 cs.LG 版本更新

MobileFineTuner: A Mobile-Native Framework for On-Device LLM Fine-Tuning in Real-World Embedded AI Applications

MobileFineTuner:面向真实世界嵌入式AI应用中设备端大语言模型微调的移动原生框架

Jiaxiang Geng, Lunyu Zhao, Yiyi Lu, Bing Luo

发表机构 * Duke Kunshan University(昆山杜克大学) The University of Hong Kong(香港大学)

AI总结 提出移动原生框架MobileFineTuner,通过C++实现资源感知训练运行时(内存高效注意力、激活检查点等),在商用手机上实现端到端LLM微调,显著降低内存压力并提升可执行性。

详情
Comments
26 pages, 25 figures
AI中文摘要

大语言模型(LLM)正从以云为中心的服务转向设备端嵌入式AI,其中模型与从用户及其物理环境感知的私有、纵向信号进行交互。手机是此类应用的自然平台,因为用户随身携带、连接可穿戴传感器,并深度集成于日常移动应用中。然而,在商用手机上实际进行LLM微调仍然困难。现有微调框架大多基于Python且面向服务器,难以部署到移动应用中。我们提出MobileFineTuner,一个面向移动原生的开源框架,用于在商用手机上实现端到端LLM微调。MobileFineTuner用C++实现,并提供可复用的训练栈。为了在移动资源约束下使微调可行,MobileFineTuner集成了资源感知的训练运行时,包括内存高效注意力、激活检查点、梯度累积、参数分片和能量感知调度。我们在真实手机上使用GPT-2、Gemma 3和Qwen2.5模型,在多个微调任务上评估MobileFineTuner。结果表明,MobileFineTuner再现了标准Full-FT和LoRA微调行为,显著降低了内存压力并提升了在内存受限手机上的可执行性。我们进一步通过一个私有的校园健康代理应用展示了MobileFineTuner,其中本地LLM在用户特定的可穿戴感知记录上进行微调,以提供更个性化的响应,同时将原始记录保留在手机上。这些结果确立了MobileFineTuner作为在嵌入式AI和感知系统中研究和构建设备端LLM微调应用的实用工具包。

英文摘要

Large language models (LLMs) are moving from cloud-centric services toward on-device embedded AI, where models interact with private, longitudinal signals sensed from users and their physical environments. Mobile phones are a natural platform for such applications because they are continuously carried by users, connected to wearable sensors, and deeply integrated with daily mobile applications. However, practical LLM fine-tuning on commodity phones remains difficult. Existing fine-tuning frameworks are largely Python-based and server-oriented, making them hard to deploy inside mobile applications. We present MobileFineTuner, a mobile-native open-source framework for end-to-end LLM fine-tuning on commodity mobile phones. MobileFineTuner is implemented in C++ and provides a reusable training stack. To make fine-tuning feasible under mobile resource constraints, MobileFineTuner integrates a resource-aware training runtime with memory-efficient attention, activation checkpointing, gradient accumulation, parameter sharding, and energy-aware scheduling. We evaluate MobileFineTuner on real mobile phones using GPT-2, Gemma 3, and Qwen2.5 models across multiple fine-tuning tasks. The results show that MobileFineTuner reproduces standard Full-FT and LoRA fine-tuning behavior, substantially reduces memory pressure and improves executability on memory-constrained phones. We further demonstrate MobileFineTuner through a private campus health-agent application, where a local LLM is fine-tuned on user-specific wearable-sensing records to provide more personalized responses while keeping raw records on the phone. These results establish MobileFineTuner as a practical toolkit for studying and building on-device LLM fine-tuning applications in embedded AI and sensing systems.

2512.03077 2026-06-11 cs.CY cs.AI 版本更新

Irresponsible AI: big tech's influence on AI research and associated impacts

不负责任的人工智能:大型科技公司对AI研究的影响及相关影响

Alex Hernandez-Garcia, Alexandra Volokhova, Ezekiel Williams, Dounia Shaaban Kabakibo, Mélisande Teng

发表机构 * Big Tech(大科技公司)

AI总结 本文指出大型科技公司对AI研究的不成比例影响推动了不负责任的AI发展,并加剧了环境和社会负面影响,呼吁研究者通过集体行动加以抵制。

详情
Comments
Presented as a spotlight oral at the International Conference on Machine Learning 2026 (Position Paper Track). First version presented at NeurIPS 2025 Workshop on Algorithmic Collective Action
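AI-generated abstract (translated from Chinese)

The accelerated development, deployment and adoption of AI systems has been enabled by the deepening involvement of big tech in the AI field. This trend has been accompanied by growing ethical concerns and intensified societal and environmental impacts. This position paper argues that irresponsible AI development is largely driven by big tech's influence and involvement in the field. First, we examine big tech's growing and disproportionate influence on AI research and argue that its pursuit of scaling and general-purpose systems is fundamentally at odds with responsible, ethical, and sustainable AI development. Second, we review the main current negative environmental and societal impacts of AI and trace their connections to big tech's influence. Third, we discuss the underlying economic forces that drive big tech's actions. Finally, as a call to action, we invite AI researchers to counter big tech's influence on irresponsible AI development through strategies based on the responsibility of implicated actors and on collective action.

English abstract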

The accelerated development, deployment and adoption of artificial intelligence systems has been fuelled by the increasing presence of big tech in the AI field. This trend has been accompanied by growing ethical concerns and intensified societal and environmental impacts. This position paper argues that irresponsible AI development is strongly driven by big tech's influence and involvement in the field. First, we examine the growing and disproportionate influence of big tech in AI research and argue that its drive for scaling and general-purpose systems is fundamentally at odds with the responsible, ethical, and sustainable development of AI. Second, we review key current environmental and societal negative impacts of AI and trace their connections to big tech's influence. Third, we discuss the underlying economic forces driving big tech's actions. Finally, as a call to action, we invite AI researchers to counter big tech's influence in irresponsible AI development through strategies that build on the responsibility of implicated actors and collective action.

2508.21380 2026-06-11 cs.LG cs.AI Version update

The Algorithm Is Not the Behavior: Learned Priors Override Look-Ahead in a Chess-Playing Neural Network

Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek

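Affiliations * Fraunhofer HHI

AI summary The study finds that the chess neural network Leela Chess Zero computes correct solutions in its intermediate layers, but safety-first priors override them in the final output, producing wrong answers.

Details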
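AI-generated abstract (translated from Chinese)

Recent mechanistic work has revealed learned algorithms inside neural networks, from modular arithmetic to search and planning in game-playing agents. But does algorithmic structure guarantee algorithmic behavior? We study this question in Leela Chess Zero, the strongest neural chess engine, in which prior work has identified learned look-ahead. By extending the logit lens to its move-selecting policy network, we find that correct puzzle solutions, including immediate checkmates, often appear in intermediate layers but are systematically overridden in the final output, a phenomenon we call "forgotten puzzles". Repeating prior analyses on these positions, we find that look-ahead operates normally (future moves of the correct continuation are represented, causally important, and linearly decodable), which rules out a failure of the algorithm itself. Instead, late layers gradually shift toward prioritizing safe play over aggression. To test whether this shift drives the override, we steer the model against these preferences and recover 61.7% of forgotten puzzles, providing causal evidence that safety priors override algorithmically computed solutions. These findings show that algorithmic structure does not guarantee algorithmic behavior: a model can solve a problem internally and still output the wrong answer.

English abstract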

Recent mechanistic work has uncovered learned algorithms within neural networks, from modular arithmetic to search and planning in game-playing agents. But does algorithmic structure guarantee algorithmic behavior? We investigate this in Leela Chess Zero, the strongest neural chess engine, where prior work identified learned look-ahead. By extending the logit lens to its move-selecting policy network, we discover that correct puzzle solutions, including immediate checkmates, often appear in intermediate layers but are systematically overridden in the final output, a phenomenon we term "forgotten puzzles". Replicating prior analyses on these positions, we find that look-ahead operates normally (future moves of the correct continuation are represented, causally important, and linearly decodable), ruling out a failure of the algorithm itself. Instead, late layers increasingly shift toward prioritizing safe play over aggression. To test whether this shift drives the override, we steer the model against these preferences and recover 61.7% of forgotten puzzles, providing causal evidence that safety priors override algorithmically computed solutions. These findings demonstrate that algorithmic structure does not guarantee algorithmic behavior: a model can internally solve a problem and still output the wrong answer.
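
The logit-lens idea in this abstract can be sketched on toy data: decode each layer's hidden state with the final output projection and check whether the correct answer is top-ranked at some intermediate layer but not at the output. Everything below is an assumption for illustration, including the 2-D hidden states, the three candidate moves, and the helper names `logit_lens` and `is_forgotten`. It is not the authors' code, and the abstract does not describe how the lens is adapted to Leela Chess Zero's policy network.

```python
# Toy logit-lens sketch in plain Python; all numbers are made up.

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def logit_lens(hidden_states, unembed):
    """Decode every layer's hidden state with the final output projection."""
    return [argmax(matvec(unembed, h)) for h in hidden_states]


def is_forgotten(per_layer_top, correct):
    """True if the correct answer is top-1 at an intermediate layer
    but is not the top-1 answer at the final layer."""
    return correct in per_layer_top[:-1] and per_layer_top[-1] != correct


# Hypothetical output projection: 3 candidate moves, 2-D hidden state.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
# Hypothetical hidden states after each of 4 layers.
hidden_states = [[0.1, 0.2], [0.9, 0.3], [0.8, 0.7], [0.4, 1.0]]

tops = logit_lens(hidden_states, unembed)
print(tops)                     # [1, 0, 0, 1]
print(is_forgotten(tops, 0))    # True
```

Here move 0 is the top candidate in the two middle layers but loses to move 1 at the output, which is the pattern the abstract calls a forgotten puzzle.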