arXivDaily: daily arXiv academic digest, updated Monday through Friday
2604.23190 2026-06-05 cs.SE cs.AI

RAT: RunAnyThing via Fully Automated Environment Configuration

Renhong Huang, Dongdong Hua, Yifei Sun, Sitao Ding, Hanyang Yuan, Daixin Wang, Yang Yang

Affiliations: Zhejiang University; Ant Group

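AI summary: This paper proposes RAT, a framework for fully automated environment configuration across programming languages on arbitrary repositories. Its multi-stage pipeline integrates language-aware abstraction, image initialization, a specialized configuration toolset, and a robust sandbox. The paper also introduces the RATBench benchmark. Experiments show that RAT improves environment setup success rate by an average of 36.1% over strong baselines.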

Abstract

Automating repository-level software engineering tasks is a foundational challenge for autonomous code agents, largely due to the difficulty of configuring executable environments. Manual configuration remains a labor-intensive bottleneck, necessitating a transition toward fully automated environment configuration. Existing approaches often rely on pre-defined artifacts or are restricted to specific programming languages, limiting their applicability to diverse real-world repositories. In this paper, we propose RAT (RunAnyThing), a modular and extensible agent framework for fully automated configuration across programming languages on arbitrary repositories. RAT adopts a multi-stage pipeline that integrates language-aware abstraction, image initialization, a specialized configuration toolset, and a robust sandbox. Furthermore, to enable rigorous evaluation, we propose RATBench, a benchmark that reflects the comprehensive coverage of real-world repositories. Extensive experiments demonstrate that RAT achieves state-of-the-art performance, improving Environment Setup Success Rate (ESSR) by an average of 36.1% over strong baselines.

2603.03555 2026-06-05 cs.MA cs.AI cs.SI

Benchmarking Emergent Coordination in Large-Scale LLM Populations: An Evaluation Framework on the MoltBook Archive

Brandon Yee, Pairie Koh

Affiliations: Management Lab, Yee Collins Research Group

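AI summary: This paper proposes an evaluation framework for emergent coordination in open agent environments, covering role specialization, information diffusion, and cooperative task resolution. It demonstrates the framework on the MoltBook Observatory Archive dataset and establishes quantitative baselines, revealing a core-periphery structure, heavy-tailed cascade distributions, and severe coordination overhead in decentralized task resolution.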

Comments Substantial Revision Required

Abstract

As multi-agent Large Language Model (LLM) systems scale, evaluating their emergent coordination dynamics becomes increasingly critical. However, current evaluation paradigms, which focus on single agents or small, explicitly structured groups, fail to capture the self-organization and viral information dynamics that arise in large, decentralized populations. We introduce a systematic evaluation framework to benchmark role specialization, information diffusion, and cooperative task resolution in open agent environments. We demonstrate this framework on the MoltBook Observatory Archive, a dataset of 2.73M interactions among 90,704 autonomous agents, establishing quantitative baselines for emergent coordination. Our evaluation reveals a pronounced core-periphery structure (silhouette 0.91), heavy-tailed cascade distributions ($\alpha = 2.57$), and severe coordination overhead in decentralized task resolution (Cohen's $d = -0.88$ against a single-agent baseline). By providing standardized evaluation tasks and empirical baselines, our framework enables the rigorous comparison of future multi-agent protocols and establishes evaluation itself as an object of scientific study.
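
The abstract reports coordination overhead as Cohen's d. As a reminder of what that statistic measures, here is a minimal sketch using the common pooled-standard-deviation form. Whether the paper uses exactly this variant is an assumption, and the scores below are hypothetical, not data from the MoltBook archive.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference of group_a relative to group_b (pooled SD)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled sample variance: each group's unbiased variance weighted by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Hypothetical task scores, for illustration only.
decentralized = [2.0, 3.0, 4.0]
single_agent = [3.0, 4.0, 5.0]
d = cohens_d(decentralized, single_agent)
```

A negative value means the first group scores lower than the baseline group, measured in units of the pooled standard deviation.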

2503.17181 2026-06-05 cs.SE cs.AI

A Study of LLMs' Preferences for Libraries and Programming Languages

Lukas Twist, Mark Harman, Don Syme, Joost Noppen, Helen Yannakoudakis, Detlef Nauck, Jie M. Zhang

Affiliations: King's College London; University College London; GitHub Next; Digital AI Research, BT Group

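AI summary: This study examines which libraries and programming languages LLMs choose when generating code. An empirical analysis of eight LLMs finds that the models favor widely adopted libraries such as NumPy even when they are not needed, and prefer Python even for high-performance project initialisation tasks where it is not the optimal language.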

Comments 21 pages, 10 tables, 3 figures. Accepted to Findings of ACL 2026

Abstract

Despite the rapid progress of large language models (LLMs) in code generation, existing evaluations focus on functional correctness or syntactic validity, overlooking how LLMs make critical design choices such as which library or programming language to use. To fill this gap, we perform the first empirical study of LLMs' preferences for libraries and programming languages when generating code, covering eight diverse LLMs. We observe a strong tendency to overuse widely adopted libraries such as NumPy; in up to 45% of cases, this usage is not required and deviates from the ground-truth solutions. The LLMs we study also show a significant preference toward Python as their default language. For high-performance project initialisation tasks where Python is not the optimal language, it remains the dominant choice in 58% of cases, and Rust is not used once. These results highlight how LLMs prioritise familiarity and popularity over suitability and task-specific optimality, underscoring the need for targeted fine-tuning, data diversification, and evaluation benchmarks that explicitly measure language and library selection fidelity.

2603.28257 2026-06-05 q-fin.ST cs.LG

Nonlinear Factor Decomposition via Kolmogorov-Arnold Networks: A Spectral Approach to Asset Return Analysis

David Breazu

Affiliations: Faculty of Mathematics and Computer Science, University of Bucharest

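AI summary: This paper proposes KAN-PCA, an autoencoder with a KAN encoder and a linear decoder that replaces linear projections with learned B-spline functions on each edge in order to capture more variance than classical PCA. On 20 S&P 500 stocks, KAN-PCA attains a higher reconstruction R^2 than PCA and matches PCA out of sample once data leakage is corrected.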

Comments 12 pages, 2 figures

Abstract

KAN-PCA is an autoencoder that uses a KAN as encoder and a linear map as decoder. It generalizes classical PCA by replacing linear projections with learned B-spline functions on each edge. The motivation is to capture more variance than classical PCA, which becomes inefficient during market crises when the linear assumption breaks down and correlations between assets change dramatically. We prove that if the spline activations are forced to be linear, KAN-PCA yields exactly the same results as classical PCA, establishing PCA as a special case. Experiments on 20 S&P 500 stocks (2015-2024) show that KAN-PCA achieves a reconstruction R^2 of 66.57%, compared to 62.99% for classical PCA with the same 3 factors, while matching PCA out-of-sample after correcting for data leakage in the training procedure.
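
To make the PCA side of the comparison concrete, the sketch below computes a reconstruction R^2 for classical PCA with a fixed number of factors. It assumes R^2 is defined as one minus the ratio of residual to total sum of squares on centered data, which may differ from the paper's exact definition. The data are synthetic stand-ins rather than S&P 500 returns, and the KAN encoder itself is not reproduced here.

```python
import numpy as np

def pca_reconstruction_r2(returns, n_factors):
    """Share of total variance retained when data are rebuilt from n_factors principal components."""
    centered = returns - returns.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_factors]
    reconstructed = centered @ loadings.T @ loadings
    residual = centered - reconstructed
    return 1.0 - (residual ** 2).sum() / (centered ** 2).sum()

rng = np.random.default_rng(0)
toy_returns = rng.normal(size=(500, 20))  # synthetic stand-in for 20 assets
r2_3 = pca_reconstruction_r2(toy_returns, 3)
r2_all = pca_reconstruction_r2(toy_returns, 20)
```

Under this definition the value grows with the number of factors and reaches 1 when every component is kept, which is why comparisons such as the one in the abstract fix the factor count.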

2505.11006 2026-06-05 stat.ML cs.LG

Is Supervised Learning Really That Different from Unsupervised?

Oskar Allerbo, Thomas B. Schön

Affiliations: KTH Royal Institute of Technology; Uppsala University

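AI summary: By decomposing supervised learning into a two-stage procedure, this work shows that selecting all model parameters without access to the labels, and only then adding the outputs, can perform similarly to standard supervised learning. This suggests that the distinction between supervised and unsupervised learning is less fundamental than it appears.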

Comments Paper accepted at AISTATS 2026

Abstract

We demonstrate how supervised learning can be decomposed into a two-stage procedure, where (1) all model parameters are selected in an unsupervised manner, and (2) the outputs y are added to the model, without changing the parameter values. This is achieved by a new model selection criterion that, in contrast to cross-validation, can also be used without access to y. For linear ridge regression, we bound the asymptotic out-of-sample risk of our method in terms of the optimal asymptotic risk. We also demonstrate that versions of linear and kernel ridge regression, smoothing splines, k-nearest neighbors, random forests, and neural networks, trained without access to y, perform similarly to their standard y-based counterparts. Hence, our results suggest that the difference between supervised and unsupervised learning is less fundamental than it may appear.

2603.17925 2026-06-05 stat.ME cs.LG math.ST stat.TH

Multi-Armed Sequential Hypothesis Testing by Betting

Ricardo J. Sandoval, Ian Waudby-Smith, Michael I. Jordan

Affiliations: University of California, Berkeley; École Normale Supérieure & Inria Paris

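AI summary: This paper studies a multi-armed variant of sequential testing by betting, in which the statistician chooses one of several data sources (arms) at each step. The goal is to reject the composite global null P (all arms are null in a certain sense) in favor of the composite alternative Q (at least one arm is non-null). Generalizing log-optimality and expected rejection time optimality to multiple arms yields matching lower and upper bounds, with a modified upper-confidence-bound-like algorithm for unobservable but sufficiently estimable rewards as the key device.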

Abstract

We consider a variant of sequential testing by betting where, at each time step, the statistician is presented with multiple data sources (arms) and obtains data by choosing one of the arms. We consider the composite global null hypothesis $\mathscr{P}$ that all arms are null in a certain sense (e.g. all dosages of a treatment are ineffective) and we are interested in rejecting $\mathscr{P}$ in favor of a composite alternative $\mathscr{Q}$ where at least one arm is non-null (e.g. there exists an effective treatment dosage). We posit an optimality desideratum that we describe informally as follows: even if several arms are non-null, we seek $e$-processes and sequential tests whose performance is as strong as that of the ones that have oracle knowledge about which arm generates the most evidence against $\mathscr{P}$. Formally, we generalize notions of log-optimality and expected rejection time optimality to more than one arm, obtaining matching lower and upper bounds for both. A key technical device in this optimality analysis is a modified upper-confidence-bound-like algorithm for unobservable but sufficiently "estimable" rewards. In the design of this algorithm, we derive nonasymptotic concentration inequalities for optimal wealth growth rates in the sense of Kelly [1956]. These may be of independent interest.
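
For readers new to testing by betting, the following sketch shows the basic single-arm mechanism only. It assumes observations bounded in [0, 1], a point null on the mean, and a fixed bet size. The function name and parameters are hypothetical, and the paper's multi-arm, upper-confidence-bound-like arm selection is not reproduced.

```python
def betting_test(observations, null_mean, bet, alpha=0.05):
    """Return the first time the bettor's wealth reaches 1/alpha, or None if it never does."""
    # Keep every factor 1 + bet * (x - null_mean) nonnegative for x in [0, 1].
    assert -1.0 / (1.0 - null_mean) < bet < 1.0 / null_mean
    wealth = 1.0
    for t, x in enumerate(observations, start=1):
        # Under the null each factor has expectation 1, so wealth is a nonnegative martingale.
        wealth *= 1.0 + bet * (x - null_mean)
        if wealth >= 1.0 / alpha:
            return t
    return None

stop_time = betting_test([1.0] * 20, null_mean=0.5, bet=1.0)
no_rejection = betting_test([0.5] * 20, null_mean=0.5, bet=1.0)
```

Rejecting when wealth reaches 1/alpha controls the type I error at level alpha by Ville's inequality. How quickly wealth grows under the alternative is what log-optimality quantifies.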

2603.14169 2026-06-05 stat.ME cs.AI

Beyond Means: Topological Causal Effects under Persistent-Homology Ignorability

Amir Saki, Usef Faghihi

Affiliations: Université du Québec à Trois-Rivières

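AI summary: This paper proposes a causal framework based on persistent homology to address the blindness of mean-based causal estimands to changes in the shape of outcome distributions. It defines topological analogues of CATE and ATE and proves that they are identifiable, up to an explicit error bound, under approximate topological ignorability.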

Abstract

Average treatment effects (ATE) and conditional average treatment effects (CATE) are foundational causal estimands, but they target changes in expected outcomes and can miss treatment-induced changes in the shape of outcome distributions. A canonical failure mode occurs when control outcomes are unimodal, treated outcomes become bimodal, and both distributions have the same mean. In such cases mean-based causal estimands are zero even though the geometry and topology of the outcome law change substantially. This paper develops a topological causal framework based on persistent homology. We formalize a persistent-homology ignorability condition, define topological analogues of CATE and ATE, and prove that these estimands are identifiable up to an explicit error bound under approximate topological ignorability. We also clarify a subtle but important point: a marginal persistence-diagram effect is not identified from conditional topological ignorability alone because persistent homology does not in general commute with mixtures over covariates. To preserve the original intuition while ensuring scientific correctness, we retain the marginal effect as a motivating quantity, but place the mathematically sound conditional estimands at the center of the theory. A synthetic experiment with mean-preserving topology change shows that mean-based causal estimands remain near zero while the proposed topological effect increases sharply and remains recoverable after adjustment for confounding.
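
The canonical failure mode described in the abstract is easy to reproduce numerically. The sketch below is an illustration under assumed distributions (a standard normal control group and a symmetric two-component mixture for the treated group). It shows only that the mean difference stays near zero while the shape changes, and it does not compute any persistent-homology quantity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
control = rng.normal(loc=0.0, scale=1.0, size=n)            # unimodal outcomes
modes = rng.choice([-2.0, 2.0], size=n)
treated = modes + rng.normal(loc=0.0, scale=0.5, size=n)     # bimodal outcomes, same mean

# Mean-based effect: close to zero by construction.
mean_effect = treated.mean() - control.mean()

# A crude shape summary: share of outcomes near zero, where the treated density has a gap.
control_near_zero = np.mean(np.abs(control) < 0.5)
treated_near_zero = np.mean(np.abs(treated) < 0.5)
```

The near-zero share drops sharply in the treated group even though the means agree, which is the kind of distributional change a mean-based estimand cannot register.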

2601.11527 2026-06-05 cs.HC cs.AI cs.CY

"What if she doesn't feel the same?" What Happens When We Ask AI for Relationship Advice

Niva Manchanda, Akshata Kishore Moharir, Ratna Kandala

Affiliations: Department of Psychology, University of Kansas; Independent Researcher

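AI summary: This study examines how users evaluate LLM-generated advice on romantic relationships. Users were highly satisfied with the advice, satisfaction was positively associated with perceived model reliability and helpfulness, and attitudes toward LLMs improved significantly after exposure to the advice.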

Journal ref
First Workshop on LLM Persona Modeling, NeurIPS 2025

Abstract

Large Language Models (LLMs) are increasingly being used to provide support and advice in personal domains such as romantic relationships, yet little is known about user perceptions of this type of advice. This study investigated how people evaluate LLM-generated advice on romantic relationships. Participants rated advice satisfaction, model reliability, and helpfulness, and completed pre- and post-measures of their general attitudes toward LLMs. Overall, the results showed participants' high satisfaction with LLM-generated advice. Greater satisfaction was, in turn, strongly and positively associated with their perceptions of the models' reliability and helpfulness. Importantly, participants' attitudes toward LLMs improved significantly after exposure to the advice, suggesting that supportive and contextually relevant advice can enhance users' trust and openness toward these AI systems.

2603.02376 2026-06-05 cs.DC cs.AR cs.LG cs.MA

CUCo: An Agentic Framework for Compute and Communication Co-design

Yoga Sri Varshan Varadharajan, Bodun Hu, Saurabh Agarwal, Aditya Akella

Affiliations: UT Austin

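AI summary: This paper proposes CUCo, a framework that automates compute-communication co-design of CUDA kernels by combining a structured design-space formalization with a correctness-first fast-path agent and an evolution-driven slow-path agent. It achieves up to 1.57x speedup across four multi-GPU workloads and discovers a two-stream overlap strategy, at an LLM inference cost under $10 per workload.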

Abstract

Computation and communication in distributed LLM training and inference are traditionally optimized in isolation. Expert-crafted systems such as DeepEP, FLUX, and TokenWeave show the potential of co-design but require deep systems expertise and hardware-specific tuning. CUCo is an agentic framework that automates compute-communication co-design of CUDA kernels by combining a structured design-space formalization with a correctness-first fast-path agent for reliable baselines and an evolution-driven slow-path agent for high-performance strategies. It achieves up to 1.57x speedup across four multi-GPU workloads and discovers a two-stream overlap strategy on a DeepSeek-V3 MoE layer that hides dispatch behind local compute, at an LLM inference cost under $10 per workload.

2509.20345 2026-06-05 stat.ME cs.LG stat.ML

General Synthetic-Powered Inference

Meshi Bashari, Yonghoon Lee, Roy Maor Lotan, Edgar Dobriban, Yaniv Romano

Affiliations: Department of Electrical and Computer Engineering, Technion IIT, Israel; Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, USA; Department of Computer Science, Technion IIT, Israel

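AI summary: This paper proposes a general synthetic-powered inference framework that improves sample efficiency by combining high-quality synthetic data with real data, automatically falls back to the standard real-data method when the synthetic data are of low quality, and keeps the error rate below a user-specified bound without distributional assumptions on the synthetic data.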

Abstract

The rapid proliferation of high-quality synthetic data, generated by advanced AI models or collected as auxiliary data from related tasks, presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around a broad class of statistical inference procedures to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard method using only real data when synthetic data are of low quality. The error rate of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction and the comparison of large reasoning models on complex math problems.

2508.04409 2026-06-05 stat.ML cs.LG

The Relative Instability of Model Comparison with Cross-validation

Alexandre Bayle, Lucas Janson, Lester Mackey

Affiliations: Department of Statistics, Harvard University, Cambridge, MA, USA; Microsoft Research New England, Cambridge, MA, USA

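AI summary: This work shows that even individually stable models can yield relatively unstable comparisons, calling into question the validity of cross-validation inference. In particular, the Lasso and soft-thresholding lead to invalid cross-validation inference even in the most favorable learning settings.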

Abstract

Cross-validation (CV) is known to provide asymptotically exact tests and confidence intervals for model improvement but only when the model comparison is relatively stable. Surprisingly, we prove that even simple, individually stable models can generate relatively unstable comparisons, calling into question the validity of CV inference. Specifically, we show that the Lasso and its close cousin, soft-thresholding, generate relatively unstable comparisons and invalid CV inferences, even in the most favorable of learning settings and when both models are individually stable. These findings highlight the importance of verifying relative stability before deploying CV for model comparison.
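
The abstract calls soft-thresholding a close cousin of the Lasso. The sketch below shows the standard soft-thresholding operator, which, under the usual textbook assumption of an orthonormal design, gives the Lasso estimate when applied to the least-squares coefficients. The example values are arbitrary, and the paper's instability analysis is not reproduced.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each entry of z toward zero by lam, zeroing entries with |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

coefficients = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
shrunk = soft_threshold(coefficients, 1.0)
```

Entries within the threshold are set exactly to zero and the rest move toward zero by the threshold amount, so small changes in the data can flip a coefficient between zero and nonzero.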

2602.07739 2026-06-05 cs.IR cs.AI

HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation

Hiren Madhu, Ngoc Bui, Ali Maatouk, Leandros Tassiulas, Smita Krishnaswamy, Menglin Yang, Sukanta Ganguly, Kiran Srinivasan, Rex Ying

Affiliations: University of California, Berkeley

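AI summary: This paper proposes hyperbolic dense retrieval, building two model variants in hyperbolic space, HyTE-FH and HyTE-H, to overcome the limitations of Euclidean space in retrieval-augmented generation and to improve context relevance and answer relevance.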

Abstract

Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure from broad topics to specific entities that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with over 20% radial increase from general to specific concepts, a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems.
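
As background for the Lorentz model mentioned in the abstract, this sketch computes geodesic distance on the unit hyperboloid, assuming curvature -1. The helper names are hypothetical, and neither the HyTE architectures nor the Outward Einstein Midpoint pooling operator is implemented here.

```python
import numpy as np

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v onto the unit hyperboloid by solving for the time coordinate."""
    v = np.asarray(v, dtype=float)
    return np.concatenate(([np.sqrt(1.0 + v @ v)], v))

def lorentz_inner(x, y):
    """Lorentzian inner product: negative on the time coordinate, positive on the rest."""
    return -x[0] * y[0] + x[1:] @ y[1:]

def lorentz_distance(x, y):
    # Clipping guards against arguments slightly below 1 caused by rounding.
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([np.sinh(1.5), 0.0])
dist = lorentz_distance(origin, point)
```

Distance from the origin grows with the norm of the spatial part, which is the kind of radial, norm-based separation the abstract associates with document specificity.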

2602.01607 2026-06-05 math.ST cs.IT cs.LG math.IT stat.ML stat.TH

Minimax optimal differentially private synthetic data for smooth queries

Rundong Ding, Yiyun He, Yizhe Zhu

Affiliations: Department of Mathematics, University of Southern California; Department of Mathematics, University of California San Diego

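AI summary: This paper studies how to generate $(\varepsilon,\delta)$-differentially private synthetic data that protects individual privacy while providing strong utility guarantees for meaningful downstream analysis. It proposes a polynomial-time algorithm that attains the minimax error rate $O_{k,d}(n^{-\min\{1, k/d\}})$, up to a $\log(n)$ factor, and establishes the first minimax lower bound for $k$-smooth queries.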

Comments COLT 2026 arXiv version. 34 pages

Abstract

Differentially private synthetic data enables the sharing and analysis of sensitive datasets while providing rigorous privacy guarantees for individual contributors. A central challenge is to achieve strong utility guarantees for meaningful downstream analysis. Many existing methods ensure uniform accuracy over broad query classes, such as all Lipschitz functions, but this level of generality often leads to suboptimal rates for statistics of practical interest. Since many common data analysis queries exhibit smoothness beyond what worst-case Lipschitz bounds capture, we ask whether exploiting this additional structure can yield improved utility. We study the problem of generating $(\varepsilon,\delta)$-differentially private synthetic data from a dataset of size $n$ supported on the hypercube $[-1,1]^d$, with utility guarantees uniformly for all smooth queries having bounded derivatives up to order $k$. We propose a polynomial-time algorithm that achieves a minimax error rate of $O_{k,d}(n^{-\min \{1, \frac{k}{d}\}})$, up to a $\log(n)$ factor. This characterization uncovers a phase transition at $k=d$. Our results generalize the Chebyshev moment matching framework of Musco et al. (2025) and Wang et al. (2016) and strictly improve the error rates for $k$-smooth queries established by Wang et al. (2016). Moreover, we establish the first minimax lower bound for the utility of $(\varepsilon,\delta)$-differentially private synthetic data with respect to $k$-smooth queries, extending the Wasserstein lower bound for $\varepsilon$-differential privacy in Boedihardjo et al. (2024).
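
The phase transition at $k=d$ follows directly from the exponent in the stated rate. Written out by cases, ignoring the $\log(n)$ factor and the constants that depend on $k$ and $d$:

```latex
n^{-\min\{1,\, k/d\}} =
\begin{cases}
n^{-k/d}, & k < d, \\
n^{-1},   & k \ge d.
\end{cases}
```

For example, with $d = 4$ and $k = 2$ the rate is $n^{-1/2}$, while any $k \ge 4$ gives $n^{-1}$, so smoothness beyond order $d$ no longer improves the exponent.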

2602.05056 2026-06-05 cs.CR cs.CL cs.LG

Grounded but Misleading: Evaluating Semantic Alignment in AI-Generated Security Explanations

Heajun An, Connor Ng, Sandesh Sharma Dulal, Junghwan Kim, Jin-Hee Cho

Affiliations: Virginia Tech

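AI summary: This paper studies semantic alignment in AI-generated security explanations. Using the VEXA testbed, it examines the gap between lexical grounding and semantic risk alignment and finds that an explanation can appear lexically grounded even when its semantic interpretation weakens the detector's intended risk assessment.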

Abstract

Online scams increasingly leverage fluent and context-aware social engineering strategies, creating growing demand for AI systems that explain why a message may be risky. However, explanations that cite detector-derived evidence may still semantically weaken or redirect the intended risk interpretation. We introduce VEXA: Verifying Semantic Explanation Alignment, a controlled testbed for studying the gap between lexical grounding and semantic risk alignment in AI-generated scam-risk explanations. VEXA generates ungrounded, risk-aligned, and risk-diluting explanations by independently controlling evidence grounding and semantic framing. Through LLM-as-a-judge and human evaluations, we show that explanations may continue to appear comparatively grounded even when their semantic interpretation weakens the detector's intended risk assessment. In human evaluation, risk-diluting XAI-grounded explanations retained comparatively elevated Perceived Evidence Grounding scores (3.66) despite lower Helpfulness (3.00) and Reasoning Support (3.14) scores. These findings provide controlled evidence of grounding illusion effects in AI-generated security explanations and suggest that trustworthy explanation evaluation must verify not only whether evidence is cited, but also how that evidence is interpreted.

2601.21162 2026-06-05 cs.IR cs.AI cs.DB

A2RAG: Adaptive Agentic Graph Retrieval for Cost-Aware and Reliable Reasoning

Jiate Liu, Zebin Chen, Shaobo Qiao, Mingchen Ju, Danting Zhang, Bocheng Han, Shuyue Yu, Xin Shu, Jinglin Wu, Dong Wen, Xin Cao, Guanfeng Liu, Zhengyi Yang

Affiliations: University of New South Wales; Euler AI; Sigma Trading Management; Eigenflow AI; Macquarie University

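AI summary: This paper proposes A2RAG, a framework that uses an adaptive controller and an agentic retriever to address cost and reliability problems in graph retrieval, improving multihop question answering while reducing computational overhead.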

Abstract

Graph Retrieval-Augmented Generation (Graph-RAG) enhances multihop question answering by organizing corpora into knowledge graphs and routing evidence through relational structure. However, practical deployments face two persistent bottlenecks: (i) mixed-difficulty workloads where one-size-fits-all retrieval either wastes cost on easy queries or fails on hard multihop cases, and (ii) extraction loss, where graph abstraction omits fine-grained qualifiers that remain only in source text. We present A2RAG, an adaptive-and-agentic GraphRAG framework for cost-aware and reliable reasoning. A2RAG couples an adaptive controller that verifies evidence sufficiency and triggers targeted refinement only when necessary, with an agentic retriever that progressively escalates retrieval effort and maps graph signals back to provenance text to remain robust under extraction loss and incomplete graphs. Experiments on HotpotQA and 2WikiMultiHopQA demonstrate that A2RAG achieves +9.9/+11.8 absolute gains in Recall@2, while cutting token consumption and end-to-end latency by about 50% relative to iterative multihop baselines.
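
The reported gains are in Recall@2. A minimal sketch of that metric follows, assuming the common definition (the fraction of gold evidence items found among the first k retrieved items). The passage identifiers are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear among the first k retrieved items."""
    if not relevant:
        raise ValueError("relevant must be non-empty")
    top_k = set(retrieved[:k])
    gold = set(relevant)
    return len(top_k & gold) / len(gold)

# One of the two gold passages appears in the top 2.
score = recall_at_k(["p7", "p2", "p9"], ["p2", "p4"], k=2)
```

With two gold passages per question, as is typical for two-hop datasets, Recall@2 reaches 1.0 only when both supporting passages are ranked in the first two positions.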

2601.18219 2026-06-05 physics.med-ph cs.CV cs.LG

Automated HER2 scoring with uncertainty quantification using lensfree holography and deep learning

Che-Yung Shen, Xilin Yang, Yuzhu Li, Leon Lenk, Aydogan Ozcan

Affiliations: Electrical and Computer Engineering Department, University of California, Los Angeles, CA, 90095, USA; Bioengineering Department, University of California, Los Angeles, CA, 90095, USA; California NanoSystems Institute (CNSI), University of California, Los Angeles, CA, 90095, USA; Department of Computer Science, University of California, Los Angeles, CA, 90095, USA

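AI summary: This paper presents a compact, low-cost system based on lensfree holography and deep learning for automated HER2 scoring of immunohistochemically stained breast tissue sections. A Bayesian Monte Carlo dropout strategy improves diagnostic reliability, and the system achieves high accuracy in HER2 classification and scoring.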

Comments 23 Pages, 6 Figures, 1 Table

Journal ref
BME Frontiers, AAAS (2026)

Abstract

Accurate assessment of human epidermal growth factor receptor 2 (HER2) expression is critical for breast cancer diagnosis, prognosis, and therapy selection; yet, most existing digital HER2 scoring methods rely on bulky and expensive optical systems. Here, we present a compact and cost-effective lensfree holography platform integrated with deep learning for automated HER2 scoring of immunohistochemically stained breast tissue sections. The system captures lensfree diffraction patterns of stained HER2 tissue sections under RGB laser illumination and acquires complex field information over a sample area of ~1,250 mm^2 at an effective throughput of ~84 mm^2 per minute. To enhance diagnostic reliability, we incorporated an uncertainty quantification strategy based on Bayesian Monte Carlo dropout, which provides autonomous uncertainty estimates for each prediction and supports reliable, robust HER2 scoring, with an overall correction rate of 30.4%. Using a blinded test set of 412 unique tissue samples, our approach achieved a testing accuracy of 84.9% for 4-class (0, 1+, 2+, 3+) HER2 classification and 94.8% for binary (0/1+ vs. 2+/3+) HER2 scoring with uncertainty quantification. Overall, this lensfree holography approach provides a practical pathway toward portable, high-throughput, and cost-effective HER2 scoring, particularly suited for resource-limited settings, where traditional digital pathology infrastructure is unavailable.
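
Monte Carlo dropout estimates uncertainty by running several stochastic forward passes with dropout left active and examining how much the predictions disagree. The sketch below aggregates such passes, assuming the class probabilities have already been collected. The arrays are hypothetical, and the paper's specific uncertainty score and correction procedure are not specified in the abstract.

```python
import numpy as np

def mc_dropout_summary(pass_probs):
    """Aggregate class probabilities from T stochastic forward passes (array of shape T x C)."""
    mean_probs = pass_probs.mean(axis=0)
    # Predictive entropy of the averaged distribution; higher means less certain.
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    return int(mean_probs.argmax()), float(entropy)

# Hypothetical 4-class outputs (HER2 scores 0, 1+, 2+, 3+) from three passes each.
consistent = np.array([[0.90, 0.05, 0.03, 0.02]] * 3)
conflicting = np.array([[0.90, 0.05, 0.03, 0.02],
                        [0.05, 0.90, 0.03, 0.02],
                        [0.03, 0.02, 0.90, 0.05]])
label_a, entropy_a = mc_dropout_summary(consistent)
label_b, entropy_b = mc_dropout_summary(conflicting)
```

A sample whose passes disagree gets a higher entropy and could be flagged for review, which is one plausible way an uncertainty estimate supports more reliable scoring.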

2505.03336 2026-06-05 cs.IR cs.AI cs.SI

Eliminating Out-of-Domain Recommendations in LLM-based Recommender Systems: A Unified View

消除基于大语言模型的推荐系统中的域外推荐:一种统一视角

Hao Liao, Jiwei Zhang, Jianxun Lian, Wensheng Lu, Mingqi Wu, Shuo Wang, Yong Zhang, Yitian Huang, Mingyang Zhou, Rui Mao

发表机构 * College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院) Microsoft Research Asia(微软亚洲研究院)

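AI summary: This paper proposes RecLM, a framework that integrates three grounding methods in a unified architecture and systematically compares embedding-based retrieval, constrained generation, and discrete item-token generation, eliminating out-of-domain recommendations while improving recommendation accuracy.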

Comments 20 pages

详情
AI中文摘要

基于大语言模型(LLMs)的推荐系统常常受到域外(OOD)项目幻觉的困扰。为了解决这个问题,我们提出了RecLM,一种统一框架,通过在单一架构下实例化三种grounding范式来弥合检索与生成之间的差距:基于嵌入的检索、在重写项目标题上的约束生成以及离散项目-令牌生成。使用相同的LLM和提示,我们系统地在公开基准上比较了这三种视角。RecLM在所有变体中严格消除了域外推荐(OOD@10=0),并且约束生成变体RecLM-cgen和RecLM-token在与强ID基线和LLM基线相比时达到了最先进的准确性。我们的统一视角为比较三种不同的范式提供了系统的基础,以减少项目幻觉,提供了一个实用的框架来促进LLM在推荐任务中的应用。源代码位于https://github.com/microsoft/RecAI。

英文摘要

Recommender systems based on Large Language Models (LLMs) are often plagued by hallucinations of out-of-domain (OOD) items. To address this, we propose RecLM, a unified framework that bridges the gap between retrieval and generation by instantiating three grounding paradigms under a single architecture: embedding-based retrieval, constrained generation over rewritten item titles, and discrete item-tokenizer generation. Using the same backbone LLM and prompts, we systematically compare these three views on public benchmarks. RecLM strictly eradicates OOD recommendations (OOD@10 = 0) across all variants, and the constrained generation variants RecLM-cgen and RecLM-token achieve overall state-of-the-art accuracy compared to both strong ID-based and LLM-based baselines. Our unified view provides a systematic basis for comparing three distinct paradigms to reduce item hallucinations, offering a practical framework to facilitate the application of LLMs to recommendation tasks. Source code is at https://github.com/microsoft/RecAI.
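
The abstract's claim that constrained generation yields OOD@10 = 0 can be illustrated with a minimal sketch. This is not RecLM's implementation: the character-level prefix trie, the `END` sentinel, and the random `score` callable standing in for an LLM's next-token scores are all assumptions made for illustration. The point is only that if decoding may choose solely among continuations present in the catalog trie, every finished output is a catalog title by construction.

```python
import random

END = "<eos>"  # multi-character sentinel, so it cannot collide with a single character


def build_trie(titles):
    """Build a character-level prefix trie over the item catalog."""
    root = {}
    for title in titles:
        node = root
        for ch in title:
            node = node.setdefault(ch, {})
        node[END] = {}  # mark that a complete title ends here
    return root


def constrained_generate(trie, score):
    """Decode one title, choosing only among continuations the trie allows.

    `score(prefix, candidate)` is a hypothetical stand-in for a language
    model's next-token score; any callable returning a number works.
    """
    node, out = trie, []
    while True:
        prefix = "".join(out)
        allowed = sorted(node.keys())  # the only legal next symbols
        best = max(allowed, key=lambda cand: score(prefix, cand))
        if best == END:
            return prefix
        out.append(best)
        node = node[best]


catalog = ["The Matrix", "The Martian", "Toy Story", "Alien", "Aliens"]
trie = build_trie(catalog)
rng = random.Random(0)
samples = [constrained_generate(trie, lambda p, c: rng.random()) for _ in range(200)]
ood_rate = sum(s not in catalog for s in samples) / len(samples)
```

Even with a scorer that is pure noise, `ood_rate` is zero, because the constraint rather than the model's quality rules out unknown titles. Accuracy, by contrast, still depends entirely on the scorer.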

2601.06056 2026-06-05 cs.CY cs.AI cs.CV

Using street view images and visual LLMs to predict heritage values for governance support: Risks, ethics, and policy implications

利用街景图像和视觉大语言模型预测遗产价值以支持治理:风险、伦理与政策影响

Tim Johansson, Mikael Mangold, Kristina Dabrock, Anna Donarelli, Ingrid Campo-Ruiz

发表机构 * RISE Research Institutes of Sweden AB(瑞典RISE研究机构) Malmö University(马尔默大学) Forschungszentrum Jülich GmbH(朱利奇研究中心) Uppsala University(乌普萨拉大学)

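AI summary: This study uses street view images and visual large language models to assess the heritage value of Swedish buildings in support of building renovation planning, and discusses problems with the method, potential improvements, and the ethical risks of using LLM-based data.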

详情
AI中文摘要

在2025年至2026年期间,欧盟成员国必须实施《建筑性能能效指令》,要求所有成员国制定国家建筑翻新计划。在瑞典,没有全面记录具有遗产价值的建筑的国家注册表,这被视为阻碍建筑翻新计划制定分析的障碍。本研究旨在帮助瑞典当局了解瑞典建筑存量中的遗产价值。通过对瑞典各地(N=154710)的街景图像中的建筑进行多模态大语言模型(LLM)分析,评估了可见的遗产价值指示方面。使用LLM的零样本预测作为基础,确定了潜在具有遗产价值的建筑,覆盖500万平方米的供暖地板面积。本文呈现了预测结果和所学到的经验,并将其与瑞典建筑翻新计划的制定相结合,作为治理的一部分。讨论了方法中的问题和潜在的改进。探讨了当局使用基于LLM的数据的潜在风险,重点是透明性、错误检测和阿谀奉承的问题。

英文摘要

During 2025 and 2026, the Energy Performance of Buildings Directive is being implemented in the European Union member states, requiring all member states to have National Building Renovation Plans. In Sweden, there is no comprehensive national register of buildings with heritage values. This is seen as a barrier for the analyses underlying the development of Building Renovation Plans by the involved Swedish authorities. The purpose of this research was to assist Swedish authorities in developing information on heritage values in the Swedish building stock. Buildings in street view images from all over Sweden (N = 154,710) have been analysed using multimodal Large Language Models (LLMs) to assess visible aspects indicative of heritage value. Zero-shot predictions by LLMs were used as a basis for identifying buildings with potential heritage values for 5.0 million square meters of heated floor area. In this paper, the results of the predictions and lessons learned are presented and related to the development of the Swedish Building Renovation Plan as part of governance. The problems with the method and potential improvements are discussed. Risks with authorities' use of LLM-based data are addressed, with a focus on issues of transparency, error detection, and sycophancy.

2510.02415 2026-06-05 physics.ao-ph cs.LG

The Equilibrium Response of Atmospheric Machine-Learning Models to Uniform Sea Surface Temperature Warming

大气机器学习模型对均匀海表温度变暖的平衡响应

Bosong Zhang, Timothy M. Merlis

发表机构 * University of Washington(华盛顿大学)

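AI summary: This paper evaluates the climate response of several state-of-the-art machine-learning models to uniform sea surface temperature warming and examines their promise and limitations for climate change applications.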

详情
AI中文摘要

近年来,能够产生稳定、多年气候模拟的全球大气机器学习模型已得到发展。然而,这些机器学习模型超越训练分布进行泛化的能力仍是一个开放性问题。在本研究中,我们评估了几种最先进的机器学习模型(ACE2-ERA5、NeuralGCM和cBottle)对均匀海表温度变暖的气候响应,这是一种广泛用于评估气候变化的基准测试。我们评估了这些机器学习模型相对于基于物理的一般环流模型(NOAA的Geophysical Fluid Dynamics Laboratory AM4)在关键诊断指标上的性能,包括地表空气温度、降水量、温度和风廓线以及大气顶部辐射。尽管机器学习模型能够再现物理模型响应的关键方面,特别是降水量的响应,但某些模型在辐射响应和陆地区域变暖方面表现出显著偏离稳健的物理响应。我们的结果突显了机器学习模型在气候变化应用中的潜力和当前的局限性,并表明需要进一步改进以实现稳健的样本外泛化。

英文摘要

Machine learning models for the global atmosphere that are capable of producing stable, multi-year simulations of Earth's climate have recently been developed. However, the ability of these ML models to generalize beyond the training distribution remains an open question. In this study, we evaluate the climate response of several state-of-the-art ML models (ACE2-ERA5, NeuralGCM, and cBottle) to a uniform sea surface temperature warming, a widely used benchmark for evaluating climate change. We assess each ML model's performance relative to a physics-based general circulation model (NOAA's Geophysical Fluid Dynamics Laboratory AM4) across key diagnostics, including surface air temperature, precipitation, temperature and wind profiles, and top-of-atmosphere radiation. While the ML models reproduce key aspects of the physical model response, particularly the response of precipitation, some exhibit notable departures from robust physical responses, including radiative responses and land region warming. Our results highlight the promise and current limitations of ML models for climate change applications and suggest that further improvements are needed for robust out-of-sample generalization.

2512.21335 2026-06-05 physics.med-ph cs.LG physics.app-ph physics.bio-ph

Autonomous Uncertainty Quantification for Computational Point-of-care Sensors

自主不确定性量化用于计算床旁传感器

Artem Goncharov, Rajesh Ghosh, Hyou-Arm Joung, Dino Di Carlo, Aydogan Ozcan

发表机构 * Electrical & Computer Engineering Department(电气与计算机工程系) Bioengineering Department(生物工程系) California NanoSystems Institute (CNSI)(加州纳米系统研究所) Department of Surgery(外科医学系) University of California, Los Angeles(加州大学洛杉矶分校)

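AI summary: This paper presents an autonomous uncertainty quantification technique that improves neural network-driven computational sensing systems for point-of-care diagnostics, using Monte Carlo dropout to raise diagnostic sensitivity and reliability.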

Comments 18 Pages, 5 Figures

详情
Journal ref
ACS Nano (2026)
AI中文摘要

计算床旁(POC)传感器能够为紧急、偏远和资源有限地区提供快速、低成本和可及的诊断。这些系统可以利用基于神经网络的算法从快速诊断测试或传感器生成的信号中准确推断诊断。然而,基于神经网络的诊断模型容易产生幻觉,并可能产生错误预测,导致误诊和不准确的临床决策。为了解决这一挑战,本文提出了一种专为POC诊断开发的自主不确定性量化技术。作为测试平台,我们使用了用于快速诊断莱姆病(全球最普遍的蜱传疾病)的纸基计算垂直流分析(xVFA)平台。xVFA平台集成了可丢弃的纸基检测、手持光学读取器和基于神经网络的推断算法,可在20分钟内使用仅20微升患者血清提供快速且经济有效的莱姆病诊断。通过将基于蒙特卡洛dropout(MCDO)的不确定性量化方法整合到诊断流程中,我们识别并排除了具有高不确定性的错误预测,显著提高了xVFA的灵敏度和可靠性,无需访问患者的真实诊断信息。使用新患者样本的盲测显示,诊断灵敏度从88.2%提高到95.7%,表明基于MCDO的不确定性量化在增强神经网络驱动的计算POC传感系统鲁棒性方面的有效性。

英文摘要

Computational point-of-care (POC) sensors enable rapid, low-cost, and accessible diagnostics in emergency, remote and resource-limited areas that lack access to centralized medical facilities. These systems can utilize neural network-based algorithms to accurately infer a diagnosis from the signals generated by rapid diagnostic tests or sensors. However, neural network-based diagnostic models are subject to hallucinations and can produce erroneous predictions, posing a risk of misdiagnosis and inaccurate clinical decisions. To address this challenge, here we present an autonomous uncertainty quantification technique developed for POC diagnostics. As our testbed, we used a paper-based, computational vertical flow assay (xVFA) platform developed for rapid POC diagnosis of Lyme disease, the most prevalent tick-borne disease globally. The xVFA platform integrates a disposable paper-based assay, a handheld optical reader and a neural network-based inference algorithm, providing rapid and cost-effective Lyme disease diagnostics in under 20 min using only 20 uL of patient serum. By incorporating a Monte Carlo dropout (MCDO)-based uncertainty quantification approach into the diagnostics pipeline, we identified and excluded erroneous predictions with high uncertainty, significantly improving the sensitivity and reliability of the xVFA in an autonomous manner, without access to the ground truth diagnostic information of patients. Blinded testing using new patient samples demonstrated an increase in diagnostic sensitivity from 88.2% to 95.7%, indicating the effectiveness of MCDO-based uncertainty quantification in enhancing the robustness of neural network-driven computational POC sensing systems.
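
The exclusion step described above can be sketched in a few lines. This is a toy under stated assumptions, not the xVFA pipeline: the two-layer network, its random weights, the dropout rate, and the `max_std` threshold are all hypothetical. Dropout stays active at inference, the same input is passed through the network many times, and the spread of the resulting probabilities is used as an uncertainty score that needs no ground-truth label.

```python
import numpy as np


def mc_dropout_predict(x, w1, w2, rng, n_passes=200, p_drop=0.5):
    """Run repeated stochastic forward passes with dropout kept active.

    Returns the mean positive-class probability and its standard deviation
    across passes; the standard deviation serves as the uncertainty score.
    """
    probs = np.empty(n_passes)
    for i in range(n_passes):
        h = np.maximum(x @ w1, 0.0)              # hidden ReLU layer
        mask = rng.random(h.shape) >= p_drop     # fresh dropout mask per pass
        h = h * mask / (1.0 - p_drop)            # inverted-dropout scaling
        logit = h @ w2
        probs[i] = 1.0 / (1.0 + np.exp(-logit))  # sigmoid output
    return probs.mean(), probs.std()


def diagnose(x, w1, w2, rng, max_std=0.15):
    """Return a label, or 'excluded' when the prediction is too uncertain."""
    mean, std = mc_dropout_predict(x, w1, w2, rng)
    if std > max_std:                            # hypothetical threshold
        return "excluded"
    return "positive" if mean >= 0.5 else "negative"


rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # made-up sensor features
w1 = rng.normal(size=(4, 8)) * 0.5               # made-up weights
w2 = rng.normal(size=8) * 0.5
result = diagnose(x, w1, w2, rng)
```

In practice the threshold would be chosen on validation data, trading the fraction of excluded tests against the error rate among the retained ones.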

2512.20627 2026-06-05 cs.NI cs.AI

Efficient Asynchronous Federated Evaluation with Strategy Similarity Awareness for Intent-Based Networking in Industrial Internet of Things

面向工业互联网-of-things意图网络的高效异步联邦评估与策略相似性意识

Shaowen Qin, Jianfeng Zeng, Haodong Guo, Xiaohuan Li, Jiawen Kang, Qian Chen

发表机构 * Guangxi University Key Laboratory of Intelligent Networking and Scenario System (School of Information and Communication, Guilin University of Electronic Technology)(广西智能网络与场景系统重点实验室(信息与通信学院,桂林电子科技大学)) National Engineering Laboratory for Comprehensive Transportation Big Data Application Technology (Guangxi)(综合交通运输大数据应用技术国家工程实验室(广西)) School of Automation, Guangdong University of Technology(自动化学院,广东工业大学) School of Architecture and Transportation Engineering, GUET(建筑与交通工程学院,桂林电子科技大学)

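AI summary: This paper proposes FEIBN, a federated-evaluation-enhanced intent-based networking framework that uses large language models to translate user intents into structured strategy tuples and a strategy-similarity-aware federated learning mechanism to improve training and communication efficiency, enabling more efficient strategy evaluation in Industrial Internet of Things environments.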

Comments 12 pages with 7 figures and 4 tables

详情
AI中文摘要

意图网络(IBN)通过将高层用户意图转化为可执行的网络策略,为工业互联网-of-things(IIoT)环境中的智能和自动化网络控制提供了一种有前景的范式。然而,由于紧密耦合的工作流和高停机成本,频繁的策略部署和回滚是不切实际的,而节点异质性和隐私约束进一步复杂化了集中式策略评估。为了解决这些挑战,我们提出了一种联邦评估增强的意图网络框架(FEIBN),该框架利用大语言模型(LLMs)将用户意图转化为结构化策略元组,并采用联邦学习支持分布式策略评估。为了提高训练效率并减少通信开销,我们设计了一种策略相似性意识联邦学习机制(SSAFL),该机制根据策略相似性和资源状态选择相关节点,并仅在本地更新显著时触发异步模型上传。实验表明,所提出的方法在模型精度、收敛速度和通信成本方面均优于基线方法。

英文摘要

Intent-Based Networking (IBN) offers a promising paradigm for intelligent and automated network control in Industrial Internet of Things (IIoT) environments by translating high-level user intents into executable network strategies. However, frequent strategy deployment and rollback are impractical due to tightly coupled workflows and high downtime costs, while node heterogeneity and privacy constraints further complicate centralized strategy evaluation. To address these challenges, we propose a Federated Evaluation Enhanced Intent-Based Networking framework (FEIBN), which leverages large language models (LLMs) to translate user intents into structured strategy tuples and employs federated learning to support distributed strategy evaluation. To improve training efficiency and reduce communication overhead, we design a Strategy Similarity Aware Federated Learning mechanism (SSAFL), which selects nodes relevant to the task based on strategy similarity and resource status, and triggers asynchronous model uploads only when local updates are significant. Experiments demonstrate that the proposed method improves model accuracy, accelerates convergence, and reduces communication cost compared with the baselines.
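
The two SSAFL ideas named in the abstract, selecting nodes by strategy similarity and resource status and uploading only when a local update is significant, can be sketched as follows. The abstract does not give the actual similarity measure, resource score, or significance test, so the cosine similarity, the thresholds, and the relative-norm criterion below are assumptions chosen only to make the control flow concrete.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two strategy embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def select_nodes(task_vec, node_vecs, resources, min_sim=0.5, min_res=0.3):
    """Pick nodes whose strategy resembles the task and that have spare resources.

    `resources` holds a hypothetical score in [0, 1] per node.
    """
    return [
        i
        for i, (vec, res) in enumerate(zip(node_vecs, resources))
        if cosine(task_vec, vec) >= min_sim and res >= min_res
    ]


def should_upload(local_w, last_uploaded_w, rel_threshold=0.05):
    """Trigger an asynchronous upload only if the weights moved enough."""
    delta = np.linalg.norm(local_w - last_uploaded_w)
    return bool(delta > rel_threshold * (np.linalg.norm(last_uploaded_w) + 1e-12))


task = np.array([1.0, 0.0])
nodes = [np.array([1.0, 0.1]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
resources = [0.9, 0.9, 0.1]
chosen = select_nodes(task, nodes, resources)
```

Here node 1 is rejected for low similarity and node 2 for low resources, so only node 0 participates. Skipping uploads for small updates is what reduces communication, at the cost of the server working with slightly stale models.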

2506.11152 2026-06-05 q-bio.GN cs.LG q-bio.CB

HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data

HEIST:一种用于空间转录组学和蛋白质组学数据的图基础模型

Hiren Madhu, João Felipe Rocha, Tinglin Huang, Siddharth Viswanath, Smita Krishnaswamy, Rex Ying

发表机构 * Yale University, USA(耶鲁大学)

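AI summary: This paper proposes HEIST, which models spatial transcriptomics and proteomics data as graphs and uses a hierarchical graph transformer to jointly model cells' spatial positions and gene expression, improving the understanding of cellular heterogeneity and responses to the microenvironment.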

详情
AI中文摘要

单细胞转录组学和蛋白质组学已成为驱动生物学研究的重要数据来源,使高级深度学习方法能够理解单细胞水平的细胞异质性和基因表达。随着空间组学数据的出现,我们有希望在组织背景下表征细胞,因为其提供了空间坐标和细胞内转录或蛋白质计数。蛋白质组学通过直接测量蛋白质提供互补视角,蛋白质是细胞功能的主要效应器和关键治疗靶点。然而,现有模型要么忽略空间信息,要么忽略细胞内的复杂遗传和蛋白质组程序,因此无法推断细胞内部调节如何适应微环境信号。此外,这些模型通常使用固定基因词汇表,限制了其对未知基因的泛化能力。在本文中,我们介绍了HEIST,一种用于空间转录组学和蛋白质组学的层次化图Transformer基础模型。HEIST将组织建模为层次化图。高层图是空间细胞图,每个细胞再由其下层的基因共表达网络图表示。HEIST通过执行不同层次的消息传递来利用其嵌入中的层次结构,从而能够泛化到包括空间蛋白质组学在内的新数据类型,而无需重新训练。HEIST在15个器官的124种组织中使用空间感知对比和掩码自动编码目标,预训练了2230万细胞。对HEIST嵌入的无监督分析揭示了先前模型遗漏的具有空间信息的亚群。下游评估显示其在蛋白质组学数据上的泛化能力和在临床结果预测、细胞类型注释和基因填补中的最先进性能。

英文摘要

Single-cell transcriptomics and proteomics have become a great source for data-driven insights into biology, enabling the use of advanced deep learning methods to understand cellular heterogeneity and gene expression at the single-cell level. With the advent of spatial-omics data, we have the promise of characterizing cells within their tissue context, as it provides both spatial coordinates and intra-cellular transcriptional or protein counts. Proteomics offers a complementary view by directly measuring proteins, which are the primary effectors of cellular function and key therapeutic targets. However, existing models ignore either the spatial information or the complex genetic and proteomic programs within cells. Thus, they cannot infer how cell-internal regulation adapts to microenvironmental cues. Furthermore, these models often utilize fixed gene vocabularies, hindering their generalizability to unseen genes. In this paper, we introduce HEIST, a hierarchical graph transformer foundation model for spatial transcriptomics and proteomics. HEIST models tissues as hierarchical graphs. The higher-level graph is a spatial cell graph, and each cell, in turn, is represented by its lower-level gene co-expression network graph. HEIST achieves this by performing both intra-level and cross-level message passing to utilize the hierarchy in its embeddings and can thus generalize to novel data types, including spatial proteomics, without retraining. HEIST is pretrained on 22.3M cells from 124 tissues across 15 organs using spatially-aware contrastive and masked autoencoding objectives. Unsupervised analysis of HEIST embeddings reveals spatially informed subpopulations missed by prior models. Downstream evaluations demonstrate generalizability to proteomics data and state-of-the-art performance in clinical outcome prediction, cell type annotation, and gene imputation across multiple technologies.

2512.03086 2026-06-05 cs.PL cs.AI cs.SE

Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

超越代码对:基于对话的数据生成用于LLM代码翻译

Le Chen, Nuo Xu, Winson Chen, Bin Lei, Pei-Hung Lin, Dunzhi Zhou, Rajeev Thakur, Caiwen Ding, Ali Jannesari, Chunhua Liao

发表机构 * Argonne National Laboratory(阿贡国家实验室) University of Minnesota(明尼苏达大学) Iowa State University(爱荷华州立大学) Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室)

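AI summary: This paper proposes a dialogue-based data generation method that uses a dual-LLM architecture to produce verified translations and multi-turn dialogues, improving LLM code translation in low-resource programming domains.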

详情
AI中文摘要

大型语言模型(LLMs)在代码翻译任务中表现出色,但在资源稀缺的编程领域如Fortran和新兴框架如CUDA中性能下降,因为高质量并行数据稀缺。我们提出了一种自动化数据生成流水线,采用双LLM提问者-求解器设计,整合编译器和运行时反馈的外部知识。除了传统的源-目标代码对数据集外,我们的方法还生成(1)带有单元测试的验证翻译以评估功能一致性,以及(2)多轮对话,捕捉翻译优化过程中的推理过程。应用于Fortran到C++和C++到CUDA的转换中,该流水线分别生成3,640和3,930个对话。在该数据上微调可显著提升功能正确性,使C++到CUDA任务的单元测试成功率提高超过56%。我们证明生成的数据使7B开放式模型在编译成功率等关键指标上显著优于更大的专有系统。

英文摘要

Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran-to-C++ and C++-to-CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show that the generated data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.
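
A feedback-driven refinement loop of the kind the abstract describes might look like the sketch below. The abstract does not specify how the Questioner and Solver divide their work, so the roles here are a guess: the Solver proposes a translation, an external `check` (standing in for compiling and running unit tests) returns feedback, and the Questioner turns that feedback into the next prompt. `solver`, `questioner`, and `check` are hypothetical callables, stubbed below so the loop runs without any model or compiler.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Dialogue:
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    verified: bool = False


def refine_translation(source, solver, questioner, check, max_rounds=4):
    """Iterate propose, check, question until the candidate passes or rounds run out."""
    dialogue = Dialogue()
    candidate = solver(source, None)                # first attempt, no hint yet
    dialogue.turns.append(("solver", candidate))
    for round_idx in range(max_rounds + 1):
        ok, feedback = check(candidate)             # e.g. compile and run unit tests
        if ok:
            dialogue.verified = True
            break
        if round_idx == max_rounds:                 # budget exhausted, stay unverified
            break
        question = questioner(candidate, feedback)  # turn tool output into a prompt
        dialogue.turns.append(("questioner", question))
        candidate = solver(source, question)
        dialogue.turns.append(("solver", candidate))
    return candidate, dialogue


# Stubs: the "solver" only succeeds once it has received a hint.
def stub_solver(source, hint):
    return "good translation" if hint else "bad translation"


def stub_questioner(candidate, feedback):
    return f"The check reported: {feedback}. Can you fix it?"


def stub_check(candidate):
    return candidate.startswith("good"), "compile error"


final, dialogue = refine_translation("program main ...", stub_solver, stub_questioner, stub_check)
```

The recorded `turns` are what would become a multi-turn training dialogue, while `verified` separates translations that passed the external check from those that did not.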

2511.16111 2026-06-05 stat.ML cs.LG math.SP

Rotation-Parameterized Graph Fractional Fourier Transform: Definition, Properties, and Optimal Filtering

旋转参数化图分数阶傅里叶变换:定义、性质和最优滤波

Feiyue Zhao, Mingzhi Wang, Yangfan He, Zhichao Zhang

Affiliations: School of Mathematics and Statistics, Nanjing University of Information Science and Technology; School of Communication and Artificial Intelligence, Nanjing Institute of Technology; School of Integrated Circuits, Nanjing Institute of Technology; Jiangsu Province Engineering Research Center of IntelliSense Technology and System; Hubei Key Laboratory of Applied Mathematics, Hubei University; Key Laboratory of System Control and Information Processing, Ministry of Education, Shanghai Jiao Tong University

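AI summary: This paper proposes the rotation-parameterized graph fractional Fourier transform (RP-GFRFT), which unifies fractional-order and rotation-parameterized spectral analysis, addresses the lack of rotation-based basis control and the defective zero-angle degeneracy of existing methods, and improves denoising, reconstruction, and feature preservation in graph signal processing.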

详情
AI中文摘要

图谱表示在图信号处理中是基础,为分析图结构数据提供严谨的框架。图分数阶傅里叶变换(GFRFT)通过分数阶参数扩展图傅里叶变换(GFT),实现灵活的谱分析并保持数学一致性。角图傅里叶变换(AGFT)通过旋转GFT特征向量引入角度控制;然而现有构造可能无法在零角度时精确还原为GFT,削弱理论一致性和可解释性。为解决这些互补的局限性,即GFRFT缺乏基于旋转的基控制和AGFT的零角度退化问题,本文提出旋转参数化图分数阶傅里叶变换(RP-GFRFT),统一分数阶和旋转参数化的谱分析。构造了一个保持退化的旋转矩阵族以保证在零角度时精确还原为GFT。然后提出了两种RP-GFRFT变体,I-RP-GFRFT和II-RP-GFRFT,并通过理论分析确认其幺正性、可逆性、还原行为和光滑参数依赖性。将分数阶和旋转角度联合优化用于自适应图谱滤波。在真实世界信号、图像和点云上的实验表明,RP-GFRFT在去噪精度、重建质量和特征保留方面优于GFRFT、AGFT和代表性滤波基线。

英文摘要

Graph spectral representations are fundamental in graph signal processing, providing a rigorous framework for analyzing graph-structured data. The graph fractional Fourier transform (GFRFT) extends the graph Fourier transform (GFT) through a fractional-order parameter, enabling flexible spectral analysis with mathematical consistency. The angular graph Fourier transform (AGFT) further introduces angular control by rotating GFT eigenvectors; however, existing constructions may fail to reduce exactly to the GFT at zero angle, weakening theoretical consistency and interpretability. To address these complementary limitations, namely the lack of rotation-based basis control in GFRFT and the defective zero-angle degeneracy of AGFT, this paper proposes the rotation-parameterized graph fractional Fourier transform (RP-GFRFT), which unifies fractional-order and rotation-parameterized spectral analysis. A degeneracy-preserving rotation matrix family is constructed to guarantee exact GFT reduction at zero angle. Two RP-GFRFT variants, I-RP-GFRFT and II-RP-GFRFT, are then formulated, with theoretical analyses confirming their unitarity, invertibility, reduction behavior, and smooth parameter dependence. The fractional order and rotation angle are jointly optimized for adaptive graph spectral filtering. Experiments on real-world signals, images, and point clouds demonstrate that RP-GFRFT improves denoising accuracy, reconstruction quality, and feature preservation over GFRFT, AGFT, and representative filtering baselines.
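
The zero-angle reduction property can be made concrete with a small numerical sketch. This is not the paper's I-RP-GFRFT or II-RP-GFRFT construction and it omits the fractional order entirely. It assumes the simplest rotation family, a single Givens rotation applied to the Laplacian eigenvector basis of a path graph, and the function names are hypothetical. Because a Givens rotation is exactly the identity at angle zero and orthogonal at every angle, the rotated transform reduces to the GFT at zero and remains invertible throughout.

```python
import numpy as np


def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of an n-node path graph."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A


def givens(n, i, j, theta):
    """n x n rotation by theta in the (i, j) coordinate plane; identity at theta = 0."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i], G[j, j] = c, c
    G[i, j], G[j, i] = -s, s
    return G


def rotated_gft(L, theta, i=0, j=1):
    """Analysis matrix F such that x_hat = F @ x, built from a rotated eigenbasis."""
    _, V = np.linalg.eigh(L)                      # columns are orthonormal eigenvectors
    V_rot = V @ givens(L.shape[0], i, j, theta)   # mix eigenvectors i and j
    return V_rot.T


L = path_graph_laplacian(6)
F_gft = rotated_gft(L, 0.0)   # plain GFT
F_rot = rotated_gft(L, 0.3)   # rotated basis
x = np.arange(6, dtype=float)
x_back = F_rot.T @ (F_rot @ x)  # inverse of an orthogonal matrix is its transpose
```

Both the rotation plane `(i, j)` and the angle are free parameters in this sketch; the abstract describes optimizing the angle jointly with the fractional order for filtering.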

2503.01734 2026-06-05 cs.CR cs.AI

Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning

对抗代理:基于强化学习的黑盒逃逸攻击

Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Virginia Tech(弗吉尼亚理工大学)

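AI summary: This paper proposes a reinforcement learning-based adversarial attack in which an agent learns a new class of attack algorithms for generating adversarial samples, improving attack efficiency and success rate on image classification benchmarks.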

Comments Accepted to the Findings of CVPR 2026

详情
AI中文摘要

对机器学习模型的攻击已通过无状态优化广泛研究。本文展示了强化学习(RL)代理如何学习一种新类型的攻击算法来生成对抗样本。与传统对抗机器学习(AML)方法不同,我们的RL方法保留并利用过去的攻击经验,以提高未来攻击的有效性和效率。我们将对抗样本生成建模为马尔可夫决策过程,并评估RL在(a)学习有效且高效的攻击策略以及(b)与最先进的AML竞争的能力。在两个图像分类基准上,我们的代理在训练过程中将攻击成功率提高了最高13.2%,并将每个攻击的受害者模型查询平均次数减少了最高16.9%。在与最先进的图像攻击进行直接比较时,我们的方法使攻击者能够在训练后在未见过的输入上生成对抗样本的成功率提高了17%。从安全角度来看,这项工作展示了一种强大的新攻击向量,利用RL训练能够高效且大规模攻击ML模型的代理。

英文摘要

Attacks on machine learning models have been extensively studied through stateless optimization. In this paper, we demonstrate how a reinforcement learning (RL) agent can learn a new class of attack algorithms that generate adversarial samples. Unlike traditional adversarial machine learning (AML) methods that craft adversarial samples independently, our RL-based approach retains and exploits past attack experience to improve the effectiveness and efficiency of future attacks. We formulate adversarial sample generation as a Markov Decision Process and evaluate RL's ability to (a) learn effective and efficient attack strategies and (b) compete with state-of-the-art AML. On two image classification benchmarks, our agent increases attack success rate by up to 13.2% and decreases the average number of victim model queries per attack by up to 16.9% from the start to the end of training. In a head-to-head comparison with state-of-the-art image attacks, our approach enables an adversary to generate adversarial samples with 17% more success on unseen inputs post-training. From a security perspective, this work demonstrates a powerful new attack vector that uses RL to train agents that attack ML models efficiently and at scale.
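
Casting adversarial sample generation as a Markov Decision Process can be sketched with a toy environment. Everything concrete here is an assumption rather than the paper's setup: the victim is a logistic-regression model, the state is the current perturbed input, an action nudges one feature up or down, and the reward is the drop in the victim's confidence for its original label. The victim is only ever queried for an output score, which is what makes the setting black-box.

```python
import numpy as np


class EvasionEnv:
    """Toy MDP: state = perturbed input, action = (feature index, direction)."""

    def __init__(self, w, b, x0, step_size=0.1, max_queries=50):
        self.w = np.asarray(w, dtype=float)
        self.b = float(b)
        self.x0 = np.asarray(x0, dtype=float)
        self.step_size = step_size
        self.max_queries = max_queries
        self.reset()

    def _query(self, x):
        """Black-box victim: returns confidence in the positive class."""
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

    def reset(self):
        self.x = self.x0.copy()
        self.queries = 0                   # the initial clean query is not counted
        self.conf = self._query(self.x)
        self.label = self.conf >= 0.5      # victim's original decision
        return self.x.copy()

    def step(self, action):
        idx, direction = divmod(action, 2)  # action ranges over 0 .. 2*d - 1
        self.x[idx] += self.step_size if direction == 0 else -self.step_size
        new_conf = self._query(self.x)
        self.queries += 1
        # Reward: how much confidence in the original label was removed.
        reward = (self.conf - new_conf) if self.label else (new_conf - self.conf)
        self.conf = new_conf
        evaded = (new_conf >= 0.5) != self.label
        done = evaded or self.queries >= self.max_queries
        return self.x.copy(), float(reward), bool(done), {"evaded": bool(evaded)}


env = EvasionEnv(w=[1.0, -1.0], b=0.0, x0=[0.5, 0.0])
state, done, info = env.reset(), False, {}
while not done:
    state, reward, done, info = env.step(1)  # fixed policy: always decrease feature 0
```

The fixed policy in the loop is only a placeholder. An RL agent would instead learn a policy over actions from the rewards, which is where experience from earlier attacks can carry over to new inputs.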

2510.15814 2026-06-05 stat.ML cs.LG

On Universality of Deep Equivariant Networks

关于深度等变网络的通用性

Marco Pacini, Mircea Petrache, Bruno Lepri, Shubhendu Trivedi, Robin Walters

发表机构 * University of Trento(特伦托大学) Fondazione Bruno Kessler(布鲁诺·凯瑟勒基金会) PUC Chile(智利天主教大学) Northeastern University(东北大学)

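AI summary: This paper studies the universality of equivariant neural networks. For invariant networks, it shows that adding a fully connected readout layer secures approximation within the class of separation-constrained continuous functions. For equivariant networks, it introduces the sharper criterion of entry-wise separability and proves that sufficient depth or appropriate readout layers yield universality within the entry-wise separable regime.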

Comments Published as a conference paper at ICLR 2026

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

对于等变神经网络的通用性结果仍然很少。已有的结果通常仅在受限的设置中成立:要么依赖于常规或高阶张量表示,导致隐藏空间维度过高,要么针对专门的架构,通常局限于不变设置。本文提出了一种更一般性的结论。对于不变网络,我们在分离约束下建立了通用性定理,证明添加全连接读出层可使连续函数的近似在分离约束下实现。对于等变网络,其中结果更为稀少,我们证明标准分离性概念不足,并引入更严格的逐元素分离性准则。我们证明在足够深度或添加适当读出层的情况下,等变网络可在逐元素分离性范围内实现通用性。结合先前结果表明浅层模型无法实现通用性,我们的发现将深度和读出层识别为通用性的关键机制,同时提供了一个统一的视角,涵盖了并扩展了先前专门的结果。

英文摘要

Universality results for equivariant neural networks remain rare. Those that do exist typically hold only in restrictive settings: either they rely on regular or higher-order tensor representations, leading to impractically high-dimensional hidden spaces, or they target specialized architectures, often confined to the invariant setting. This work develops a more general account. For invariant networks, we establish a universality theorem under separation constraints, showing that the addition of a fully connected readout layer secures approximation within the class of separation-constrained continuous functions. For equivariant networks, where results are even scarcer, we demonstrate that standard separability notions are inadequate and introduce the sharper criterion of entry-wise separability. We show that with sufficient depth or with the addition of appropriate readout layers, equivariant networks attain universality within the entry-wise separable regime. Together with prior results showing the failure of universality for shallow models, our findings identify depth and readout layers as a decisive mechanism for universality, additionally offering a unified perspective that subsumes and extends earlier specialized results.

2510.11974 2026-06-05 cs.CR cs.AI

CTIConnect: A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence

CTIConnect:一种用于异构网络威胁情报的检索增强大语言模型基准

Yutong Cheng, Yang Liu, Changze Li, Dawn Song, Peng Gao

发表机构 * Virginia Tech Department of Computer Science(弗吉尼亚理工大学计算机科学系) University of California, Berkeley Department of Computer Science(加州大学伯克利分校计算机科学系)

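AI summary: This paper presents CTIConnect, a benchmark for evaluating retrieval-augmented large language models on cyber threat intelligence tasks. It integrates five heterogeneous data sources into 1,860 expert-verified QA pairs, shows that the cross-source semantic gap manifests differently across task categories and that retrieval strategies and performance bottlenecks vary accordingly, and demonstrates the advantage of domain-specific strategies.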

Comments Accepted to KDD 2026

详情
AI中文摘要

网络威胁情报(CTI)是现代网络安全的基础,使组织能够主动防御不断演变的威胁。然而,CTI数据的规模和异质性,从结构化知识库(CVE、CWE、CAPEC、MITRE ATT&CK)和非结构化威胁报告,远远超出了手动分析的能力。大型语言模型(LLMs)强大的上下文理解和推理能力推动了其在CTI任务中的应用。然而,现有的基准评估在检索增强设置中缺乏适当的评估框架,无法访问分析师在实践中依赖的异构领域知识源。为此,我们提出了CTIConnect,一种系统评估检索增强型LLMs在CTI任务领域的基准。我们构建了一个统一的评估环境,整合了五个异构CTI数据源,构建了1860个专家验证的问答对,涵盖实体链接、多文档综合和实体归属三个类别共九项任务。对十种最先进的LLMs进行了大量实验,发现跨源语义差距在不同任务类别中表现不同,需要根本不同的检索策略,并且性能瓶颈在检索基础设施和证据利用之间切换。我们的领域特定策略进一步优于更强的一般检索范式(检索后重排、IRCoT),表明缩小这一差距需要结构干预而非通用检索改进。这些发现在所有十种LLMs上均成立,保持在完整基准上的一致性,并在2008-2025时间分割下保持稳定。共同,它们为设计可扩展的异构CTI生态系统检索架构提供了可操作的指导。

英文摘要

Cyber Threat Intelligence (CTI) is foundational to modern cybersecurity, enabling organizations to proactively defend against evolving threats. However, the sheer volume and heterogeneity of CTI data, spanning structured knowledge bases (CVE, CWE, CAPEC, MITRE ATT&CK) and unstructured threat reports, far exceed the capacity of manual analysis. The strong contextual understanding and reasoning of Large Language Models (LLMs) have driven growing interest in applying them to CTI tasks. Yet no existing benchmark evaluates LLMs in a retrieval-augmented setting with a proper evaluation harness that grants access to the heterogeneous domain knowledge sources analysts rely on in practice. To address this gap, we present CTIConnect, a benchmark for systematically evaluating retrieval-augmented LLMs across the CTI task landscape. We construct a unified evaluation environment integrating five heterogeneous CTI sources into 1,860 expert-verified QA pairs spanning nine tasks across three categories: Entity Linking, Multi-Document Synthesis, and Entity Attribution. Extensive experiments on ten state-of-the-art LLMs reveal that the cross-source semantic gap manifests differently across task categories, demanding fundamentally different retrieval strategies, and that the performance bottleneck shifts between retrieval infrastructure and evidence utilization depending on the task. Our domain-specific strategies further outperform stronger general-purpose retrieval paradigms (retrieve-then-rerank, IRCoT), showing that closing this gap requires structural interventions rather than generic retrieval improvements. These findings hold across all ten LLMs, remain consistent on the full benchmark, and stay stable under temporal splits spanning 2008-2025. Together, they provide actionable guidance for designing scalable retrieval architectures over heterogeneous CTI ecosystems.

2510.05709 2026-06-05 cs.CR cs.AI cs.CL

Correcting Prompt Dependence in LLM Benchmarks: A Bayesian Hierarchical Model with Embedding-Space Clustering

纠正大语言模型基准测试中的提示依赖:一种具有嵌入空间聚类的贝叶斯分层模型

Mary Llewellyn, Isobel Thornton, James Bishop, Annie Gray

发表机构 * University of Cambridge(剑桥大学)

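AI summary: This paper proposes a Bayesian hierarchical model with embedding-space clustering that corrects for prompt dependence in LLM benchmarks, provides more robust performance metrics in limited-data settings, and yields more reliable metrics on adversarial robustness benchmarks.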

Comments Accepted to the 1st Workshop on Combining Theory and Benchmarks, CTB@ICML 2026, Seoul, South Korea

详情
AI中文摘要

大语言模型基准测试指标经常错误地陈述性能和不确定性,因为它们依赖于两个在实践中经常不成立的假设:(i) 经典推断有足够的评估数据,和 (ii) 测试提示是独立的。我们提出了一种纠正性的贝叶斯分层模型,结合嵌入空间聚类,能够在数据有限的情况下提供稳健的性能指标,同时纠正提示依赖问题。我们将该方法应用于对抗鲁棒性基准测试,展示了聚类结构的一致恢复,从而得到更可靠的性能指标,平均绝对误差提高了4-73%,预期对数后验密度提高了40-450个单位。

英文摘要

LLM benchmarking metrics often misstate performance and uncertainty as they rely on two assumptions that frequently do not hold in practice: (i) a sufficient number of evaluations are available for classical inference, and (ii) test prompts are independent. We propose a corrective Bayesian hierarchical model with embedding-space clustering that provides robust performance metrics in limited-data settings while correcting for prompt dependence. We apply the approach to adversarial robustness benchmarks, showing consistent recovery of clustering structure, resulting in more reliable performance metrics, with 4-73% improvements to mean absolute errors and 40-450 unit improvements to expected log posterior densities.

2509.25450 2026-06-05 cs.CE cs.AI cs.NA math.NA physics.comp-ph

Multi-patch isogeometric neural solver for partial differential equations on computer-aided design domains

多补丁等几何神经求解器用于计算机辅助设计域上的偏微分方程

Moritz von Tresckow, Ion Gabriel Ion, Dimitrios Loukrezis

Affiliations: Institute for Accelerator Science and Electromagnetic Fields, Technische Universität Darmstadt; Terra Quantum AG; Scientific Computing, Centrum Wiskunde & Informatica

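AI summary: This paper presents a computational framework that combines physics-informed neural networks with multi-patch isogeometric analysis to solve partial differential equations on complex computer-aided design geometries. Patch-local neural networks operate on the reference domain of isogeometric analysis, and a custom output layer strongly imposes Dirichlet boundary conditions. Dedicated interface neural networks enforce solution conformity across interfaces between non-uniform rational B-spline patches. Training minimizes the energy functional derived from the weak form of the partial differential equation within a variational framework. The method is validated on two highly non-trivial, practically relevant use cases: a 2D magnetostatics model of a quadrupole magnet and a 3D nonlinear solid and contact mechanics model of a mechanical holder. The results agree closely with reference solutions from high-fidelity finite element solvers, showing the solver's potential for complex engineering problems.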

Comments 33 pages, 15 figures

详情
AI中文摘要

本工作开发了一种计算框架,结合物理感知神经网络与多补丁等几何分析,用于解决复杂计算机辅助设计几何上的偏微分方程。该方法利用补丁局部神经网络在等几何分析的参考域上操作。定制的输出层使强加狄利克雷边界条件。通过专用的界面神经网络确保非均匀有理B样条补丁之间界面的解一致性。通过变分框架最小化偏微分方程弱形式导出的能量函数进行训练。该方法的有效性在两个高度非平凡且实际相关的应用案例中得到验证,即四极磁铁的2D磁静力学模型和机械夹具的3D非线性固体力学与接触力学模型。结果与高保真有限元求解器获得的参考解高度一致,从而突显了该神经求解器在处理复杂工程问题方面的潜力,鉴于相应的计算机辅助设计模型。

英文摘要

This work develops a computational framework that combines physics-informed neural networks with multi-patch isogeometric analysis to solve partial differential equations on complex computer-aided design geometries. The method utilizes patch-local neural networks that operate on the reference domain of isogeometric analysis. A custom output layer enables the strong imposition of Dirichlet boundary conditions. Solution conformity across interfaces between non-uniform rational B-spline patches is enforced using dedicated interface neural networks. Training is performed within a variational framework by minimizing the energy functional derived from the weak form of the partial differential equation. The effectiveness of the suggested method is demonstrated on two highly non-trivial and practically relevant use cases, namely a 2D magnetostatics model of a quadrupole magnet and a 3D nonlinear solid and contact mechanics model of a mechanical holder. The results show excellent agreement with reference solutions obtained with high-fidelity finite element solvers, highlighting the potential of the suggested neural solver to tackle complex engineering problems given the corresponding computer-aided design models.
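
Strong imposition of Dirichlet conditions through the output layer can be illustrated in one dimension. The sketch assumes a common construction, u(ξ) = g(ξ) + d(ξ)·N(ξ), where g interpolates the boundary data, d vanishes on the boundary, and N is the raw network output. Whether the paper's custom layer takes exactly this form is not stated in the abstract, and the tiny random network and function names below are hypothetical.

```python
import numpy as np


def make_network(rng, width=16):
    """Return a small random tanh network N: [0, 1] -> R (weights are untrained)."""
    w1 = rng.normal(size=(1, width))
    b1 = rng.normal(size=width)
    w2 = rng.normal(size=width)

    def net(xi):
        xi = np.atleast_1d(np.asarray(xi, dtype=float)).reshape(-1, 1)
        return np.tanh(xi @ w1 + b1) @ w2

    return net


def dirichlet_output(net, xi, g_left, g_right):
    """u(xi) = g(xi) + d(xi) * N(xi) on the reference interval [0, 1]."""
    xi = np.atleast_1d(np.asarray(xi, dtype=float))
    g = (1.0 - xi) * g_left + xi * g_right  # linear lift of the boundary values
    d = xi * (1.0 - xi)                     # zero at xi = 0 and xi = 1
    return g + d * net(xi)


rng = np.random.default_rng(0)
net = make_network(rng)
u = dirichlet_output(net, [0.0, 0.25, 1.0], g_left=2.0, g_right=-1.0)
```

Because `d` is zero at both ends, `u` equals the prescribed boundary values whatever the network weights are. Training then needs no boundary penalty term, which suits the energy-minimization setting described in the abstract.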

2509.25397 2026-06-05 cs.SE cs.AI cs.LG

A Cartography of Open Collaboration in Open Source AI: Mapping Practices, Motivations, and Governance in 14 Open Large Language Model Projects

开源人工智能中开放协作的图谱:映射14个开源大语言模型项目的实践、动机与治理

Johan Linåker, Cailean Osborne, Jennifer Ding, Ben Burtenshaw

发表机构 * RISE Research Institutes of Sweden AB(瑞典RISE研究机构) University of Oxford(牛津大学)

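AI summary: By analysing open collaboration practices across the development and reuse lifecycle of 14 open large language model projects, this paper reveals the diversity of collaboration methods, motivations, and governance structures, and shows that openness in open source AI is not a uniform property but an emergent outcome of how collaboration is organised across interconnected artefact domains, lifecycle stages, and institutional contexts.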

Comments In submission

详情
AI中文摘要

开源大语言模型(LLMs)的普及正在推动人工智能(AI)领域形成一个活跃的生态系统。然而,开发开源LLMs所使用的协作方法,在其公开发布前后仍未被系统研究,这限制了我们对开源LLM项目如何启动、组织和治理的理解,以及进一步促进这一生态系统的机会。我们通过探索性分析开源LLMs的开发与再利用生命周期中的开放协作,基于对14个不同开源LLM项目开发者的半结构化访谈。这些协作跨越多个艺术领域——包括模型、数据、软件、评估、计算和社区参与——每个领域都使不同的参与形式成为可能,并涉及不同的利益相关者,这些利益相关者在LLM开发生命周期中不断演变,从早期的集中、选择性参与转变为模型发布后的广泛、分散参与。开源LLM开发者受多种社会、经济和技术动机驱动,从民主化AI访问和促进开放科学到构建区域生态系统和扩展语言代表性。这些动态通过一系列治理结构协调,通常在不同程度上正式和专业化,包括以公司为中心的集中努力到去中心化的基层倡议。我们通过一个概念模型综合了我们的发现,提供了实践建议,并得出结论:开源AI的开放性并非单一属性,而是协作在互联艺术领域、生命周期阶段和制度背景下的组织方式的涌现结果。

英文摘要

The proliferation of open large language models (LLMs) is fostering a vibrant ecosystem in artificial intelligence (AI). However, the methods of collaboration used to develop open LLMs, both before and after their public release, have not yet been systematically studied, limiting our understanding of how open LLM projects are initiated, organised, and governed, as well as the opportunities to further foster this ecosystem. We address this gap through an exploratory analysis of open collaboration throughout the development and reuse lifecycle of open LLMs, drawing on semi-structured interviews with the developers of 14 diverse open LLM projects. These collaborations span multiple artefact domains -- including models, data, software, evaluation, compute, and community engagement -- each enabling distinct forms of participation and involving different stakeholders that evolve across the LLM development lifecycle, shifting from concentrated, selective engagement in the early stages to broader, distributed participation after model release. The open LLM developers are driven by a variety of social, economic, and technological motivations, ranging from democratising access to AI and promoting open science to building regional ecosystems and expanding language representation. These dynamics are coordinated through a range of governance structures, typically formal and professionalised to varying degrees, ranging from centralised company-led efforts to decentralised grassroots initiatives. We synthesise our findings in a conceptual model of open collaboration in open LLM ecosystems, provide recommendations for practice, and conclude that openness in open source AI is not a uniform property but an emergent outcome of how collaboration is organised across interconnected artefact domains, lifecycle stages, and institutional contexts.