arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.01070 2026-06-02 cs.IR cs.AI cs.LG

Test-Time Training for Zero-Resource Dense Retrieval Reranking

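Shiyan Liu, Yichen Li

Affiliations: Huazhong University of Science and Technology; ByteDance

AI summary: Proposes DART, which adapts a bilinear scoring matrix at test time using a few gradient updates on pseudo-positive and pseudo-negative examples, improving dense-retrieval reranking in the zero-resource setting.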

Comments Accepted at KnowFM @ ACL 2026

Abstract

Dense retrievers excel at first-stage candidate generation but lack effective reranking in zero-resource settings. Existing approaches face a fundamental dilemma: cross-encoders deliver strong reranking quality but require costly supervised training and incur high latency, while unsupervised BM25 reranking consistently degrades dense retrieval performance on most BEIR benchmarks. We propose DART (Dense Adaptive Reranking at Test-time), which resolves this dilemma by adapting the scoring function at inference time. For each query, the top-ranked documents serve as pseudo-positive examples and the bottom-ranked as pseudo-negative examples, providing noisy but readily available supervision to adapt a bilinear scoring matrix $W$ via a small number of gradient updates. We further introduce a confidence-weighted margin loss and a cross-query momentum buffer that warm-starts adaptation across queries. On six BEIR benchmarks, DART achieves a mean per-dataset relative NDCG@10 gain of +2.1% over the dense retrieval baseline with under 10ms additional latency per query, demonstrating a powerful capability for zero-shot performance enhancement and cross-domain generalization.
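
Illustrative sketch (not from the paper): the per-query adaptation described above can be pictured as a few gradient steps on a pairwise margin loss over a bilinear scorer $s(q, d) = q^\top W d$. The function names, hyperparameters, and the plain (unweighted) hinge loss are assumptions; the confidence weighting is omitted, and the `W0` argument only hints at how a cross-query warm start might be passed in.

```python
import numpy as np

def adapt_bilinear(q, docs, n_pos=3, n_neg=3, steps=5, lr=0.1, margin=0.1, W0=None):
    """Adapt W in s(q, d) = q^T W d for a single query (hypothetical settings).

    q:    (dim,) query embedding.
    docs: (n, dim) candidate embeddings, sorted best-first by the dense retriever.
    W0:   optional warm start, e.g. a matrix carried over from earlier queries.
    """
    dim = q.shape[0]
    W = np.eye(dim) if W0 is None else W0.copy()   # identity reproduces the dot product
    pos, neg = docs[:n_pos], docs[-n_neg:]         # pseudo-positives / pseudo-negatives
    for _ in range(steps):
        s_pos = pos @ (W.T @ q)
        s_neg = neg @ (W.T @ q)
        # Pairs (p, n) that violate the margin: margin - s_p + s_n > 0.
        viol = (margin - s_pos[:, None] + s_neg[None, :]) > 0
        # d/dW of q^T W d is outer(q, d); accumulate (p - n) over violated pairs.
        diff = viol.sum(axis=1) @ pos - viol.sum(axis=0) @ neg
        grad = -np.outer(q, diff) / viol.size
        W -= lr * grad
    return W

def rerank(q, docs, W):
    """Return candidate indices ordered by the adapted bilinear score."""
    return np.argsort(-(docs @ (W.T @ q)))
```

Starting from the identity keeps the first-stage ranking as the default, so a handful of small steps only nudges the ordering rather than replacing it.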

2606.01065 2026-06-02 cs.DC cs.AI cs.LG

Leyline: KV Cache Directives for Agentic Inference

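Bole Ma, Jan Eitzinger, Harald Koestler

Affiliations: Erlangen National High Performance Computing Center

AI summary: Targeting the need for policy-driven cache editing in agentic LLMs, proposes Leyline, a serving-side primitive that uses a declarative directive 4-tuple and an architecture-agnostic interface to splice and truncate the cache, raising cache-hit rate and solve rate.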

Abstract

Modern KV cache management assumes the chatbot workload: prompts arrive once and the cache grows append-only, so prefix caching and forward-only eviction are correct by construction. Agentic LLMs break this assumption. Their conversations evolve through policy-driven editing: failed tool calls are retried, stale outputs dropped, trajectories pivoted. Two distinct cache problems result. First, identical content moves to new positions between turns, invalidating exact-prefix caches even though the underlying KV would still be valid; recent work on position-independent caching for MLA addresses this reuse problem. Second, and this paper's focus, a policy may need to direct the serving system to actively remove or replace a span of cached content and continue without re-prefilling everything that came after. No existing primitive offers this. Production agentic harnesses fall back to re-prefill on every edit, paying full prefix-recomputation cost; kernel-level eviction methods make their own decisions and cannot accept policy directives from outside the kernel. We introduce Leyline, a serving-side primitive that closes this gap. A declarative directive 4-tuple separates what to edit from how to preserve position correctness. The policy declares the edit and its mode (in-place splice or prefix-trimmed re-prefill for semantic forgetting); an architecture-agnostic interface routes to a per-architecture kernel that restores attention math via a closed-form RoPE-rotation correction. The splice kernel lifts replay cache-hit by +11.2 pp and cuts latency by up to 241 ms. A ten-line truncation rule routed through the same interface lifts agentic solve rate by +14.3 pp on debug-gym. The mechanism is open; the policy space it enables is the agenda.
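
Illustrative sketch (not from the paper): the idea behind a closed-form RoPE-rotation correction is that rotations compose, so a key cached at position p can be moved to position p + delta by rotating it once more by the angle for delta, without recomputing it from the hidden state. The code below assumes standard RoPE with interleaved dimension pairs and base 10000; the function names are hypothetical and real kernels (including MLA variants) may lay out dimensions differently.

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    """Angles pos * theta_i for each position and each dimension pair."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    return np.outer(np.atleast_1d(pos), inv_freq)         # (n, dim/2)

def rotate(x, angles):
    """Rotate interleaved pairs (x[2i], x[2i+1]) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_rope(x, pos, base=10000.0):
    """Position-encode raw keys x of shape (n, dim) at integer positions pos."""
    return rotate(x, rope_angles(pos, x.shape[-1], base))

def shift_cached_keys(k_cached, delta, base=10000.0):
    """Move already-rotated keys by `delta` positions via one extra rotation."""
    n, dim = k_cached.shape
    return rotate(k_cached, rope_angles(np.full(n, delta), dim, base))
```

After a span is spliced out, the keys that followed it would be shifted by a negative delta equal to the length of the removed span.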

2606.01031 2026-06-02 cs.GR cs.AI cs.CV cs.LG cs.MM

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

Affiliations: School of Business, University of New South Wales (UNSW); School of Engineering and Built Environment, Griffith University; Data61/CSIRO

AI summary: To address the sensitivity of existing frame-wise metrics to timing offsets, proposes a sequence-level aligned evaluation framework based on Soft Dynamic Time Warping, improving robustness and revealing systematic trade-offs between modeling paradigms.

Comments Research report

Abstract

Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.
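
Illustrative sketch (not from the paper): Soft-DTW replaces the hard minimum in the dynamic time warping recursion with a smooth soft-minimum, so two feature trajectories are compared under the best monotone alignment instead of frame by frame. The squared Euclidean frame cost, the `gamma` value, and the function names are assumptions; the paper applies the idea to features from its own perceptual, identity, and synchronization encoders.

```python
import numpy as np

def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW discrepancy between trajectories x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # pairwise frame costs
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = cost[i - 1, j - 1] + softmin(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma
            )
    return R[n, m]

def framewise_error(x, y):
    """Rigid alignment: frame t of x is compared only with frame t of y."""
    return ((x - y) ** 2).sum()
```

For a generated trajectory that lags the reference by one frame, `framewise_error` accumulates a penalty at nearly every frame, while `soft_dtw` stays close to the cost of a single unmatched frame.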

2606.01008 2026-06-02 cs.SE cs.AI

FVSpec: Real-World Property-Based Tests as Lean Challenges

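Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

Affiliations: Forall R&D; Benchify; Galois Inc

AI summary: Proposes the FVSpec benchmark, which extracts property-based tests from real-world Python repositories and automatically translates them into Lean 4 specifications to evaluate AI on formal verification tasks.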

Abstract

We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model-based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world's code.
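
Illustrative sketch (not taken from the benchmark): a specification "with sorry placeholders" states the property as a theorem and leaves the proof open for a model or agent to fill in. The Python test shown in the comment and the theorem name are hypothetical; the example only shows the general shape such a translation could take for a very simple property.

```lean
-- Hypothetical source PBT (Python, Hypothesis-style):
--   @given(st.lists(st.integers()))
--   def test_reverse_twice(xs):
--       assert list(reversed(list(reversed(xs)))) == xs

-- A possible Lean 4 specification: the statement is fixed,
-- and the proof is left as a `sorry` placeholder to be discharged.
theorem reverse_twice_spec (xs : List Int) : xs.reverse.reverse = xs := by
  sorry
```

Real entries are harder than this one because the Python code under test usually has to be modeled in Lean before any property can be stated.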

2606.01002 2026-06-02 stat.ME cs.LG math.ST stat.TH

Theoretical Analysis of Engression and Reverse Markov Engression

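Jiaqi Huang, Gongjun Xu, Ji Zhu

Affiliations: Department of Statistics, University of Michigan

AI summary: For Engression and its Reverse Markov extension, establishes nonasymptotic convergence bounds under deep neural network parameterizations, analyzes error propagation through an Energy-Distance chain rule, and obtains near-optimal excess-risk bounds.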

Abstract

Engression is a recently proposed and effective framework for conditional distribution learning. Its multi-step Reverse Markov extension further improves generative flexibility by decomposing complex conditional sampling into sequential reverse transitions. Despite their strong empirical performance, rigorous finite-sample statistical guarantees for these methods remain unavailable. In this paper, under deep neural network parameterizations, we establish nonasymptotic convergence bounds for Engression by directly controlling the Energy Distance between the learned and target conditional distributions. For the Reverse Markov framework, we further develop an Energy-Distance-based chain rule that enables a rigorous analysis of error propagation across reverse steps. Our analysis yields corresponding excess-risk bounds that are near-optimal up to logarithmic factors relative to the classical minimax rate over a general Hölder class.
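
Illustrative sketch (not from the paper): the Energy Distance between distributions $P$ and $Q$ is commonly written as $2\,\mathbb{E}\|X - Y\| - \mathbb{E}\|X - X'\| - \mathbb{E}\|Y - Y'\|$ with $X, X' \sim P$ and $Y, Y' \sim Q$ independent. The snippet below is a plain sample-based (V-statistic) estimate of that quantity; the function names are assumptions, and the paper's analysis concerns conditional distributions and may use a different normalization.

```python
import numpy as np

def mean_pairwise_dist(a, b):
    """Average Euclidean distance over all pairs (a_i, b_j)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(-1)).mean()

def energy_distance(x, y):
    """Sample estimate of 2 E|X - Y| - E|X - X'| - E|Y - Y'|.

    x: (n, d) samples from one distribution, y: (m, d) samples from the other.
    """
    return (
        2.0 * mean_pairwise_dist(x, y)
        - mean_pairwise_dist(x, x)
        - mean_pairwise_dist(y, y)
    )
```

The estimate is zero when both sample sets coincide and grows as the two samples separate, which is what makes it usable as a training and evaluation signal for a conditional sampler.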

2606.00984 2026-06-02 stat.ML cs.LG

Practical and Optimal Algorithm for Linear Contextual Bandits with Rare Parameter Updates

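Sanghoon Yu, Min-hwan Oh

AI summary: For linear contextual bandits with a limited number of parameter updates, proposes two algorithms that need only $O(\log\log T)$ parameter updates, attain minimax-optimal regret under a static schedule, and substantially reduce computational complexity.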

Comments Accepted at ICML 2026

Abstract

We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting actions sequentially. This viewpoint clarifies a practical distinction that is often blurred in the literature: many "strictly batched" methods additionally restrict within-interval context adaptivity, meaning that the action rule inside an interval cannot depend on the sequence of realized contexts/actions in that interval (beyond the current round's context). For linear contextual bandits, we propose two practical algorithms with only $O(\log\log T)$ parameter updates. Our first algorithm BLCE-G attains minimax-optimal regret (up to polylogarithmic factors in $T$) simultaneously in both the small-$K$ and large-$K$ regimes under a static schedule. Our second algorithm BLCE removes the near G-optimal design step -- a dominant computational bottleneck in prior strictly batched static-grid methods -- yet preserves minimax-optimal regret and achieves the lowest known runtime complexity among optimal algorithms. We further extend these rare-update and computational principles to generalized linear contextual bandits. Overall, our results yield statistically optimal algorithms under $O(\log\log T)$ parameter updates that are also computationally efficient in practice.
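
Illustrative sketch (not from the paper): one well-known way to get a static schedule with about $\log_2\log_2 T$ update times is the geometric-in-exponent grid $t_k = \lceil T^{1 - 2^{-k}} \rceil$ from the batched-bandit literature. Whether BLCE-G or BLCE use exactly this grid is not stated in the abstract, so the schedule and the function name below are assumptions.

```python
import math

def update_grid(T):
    """Static update times t_k = ceil(T^(1 - 2^-k)), ending at the horizon T.

    Assumed schedule for illustration; it yields O(log log T) update times.
    """
    if T < 2:
        raise ValueError("horizon T must be at least 2")
    M = math.ceil(math.log2(math.log2(T))) + 1      # number of update times
    grid = [math.ceil(T ** (1 - 2.0 ** (-k))) for k in range(1, M)]
    grid.append(T)
    return sorted(set(grid))
```

Between two consecutive grid points the parameter estimate stays frozen, while the learner still sees each context and picks an action online.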

2606.00962 2026-06-02 cs.CR cs.AI

SS-ZKR: Spatial-Semantic Zero-Knowledge Routing for Privacy-Preserving Multi-Agent Collaboration

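Hassan Touheed

Affiliations: Linux Foundation; Google; W3C

AI summary: Proposes the SS-ZKR protocol, which uses three mechanisms (differentially private semantic intent vectors, adaptive sanitisation, and a spatial-to-cryptographic policy compiler) to enable content-aware semantic routing across organisational trust boundaries without decrypting payloads.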

Abstract

Foundational agent interoperability standards, notably the Agent-to-Agent (A2A) protocol and the Model Context Protocol (MCP), have advanced multi-agent system communication, and complementary identity frameworks leveraging W3C Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) provide cryptographic agent authentication. However, no existing protocol supports content-based semantic routing of agent payloads across organisational trust boundaries without requiring the routing intermediary to decrypt the payload, which is a hard constraint in compliance-sensitive environments governed by GDPR, HIPAA, and MiFID II. We propose SS-ZKR, a three-mechanism privacy-preserving routing protocol designed as a complementary layer atop A2A/MCP. Mechanism I introduces blind routing via differentially private semantic intent vectors cryptographically bound to zero-knowledge proofs of payload-schema consistency. Mechanism II offers vector-weighted adaptive payload sanitisation with formal $(\epsilon, \delta)$-differential privacy for numerical fields and heuristic semantic aggregation for textual fields. Mechanism III presents a spatial-to-cryptographic policy compiler that translates visually defined trust-zone topologies into deterministic zero-knowledge access circuits. We provide a formal threat model, analyse information leakage bounds of intent vectors, present pseudocode for all three mechanisms, and give analytical complexity comparisons against TEE-based and homomorphic encryption-based routing baselines. SS-ZKR lets enterprises in financial services, healthcare, and defence orchestrate heterogeneous AI agents across regulatory boundaries without exposing proprietary data to routing infrastructure.
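
Illustrative sketch (not from the paper): a standard way to give a numerical field an $(\epsilon, \delta)$-differential-privacy guarantee is the classical Gaussian mechanism, which clips the value to a known range and adds noise with $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\epsilon$ for $0 < \epsilon < 1$. The abstract does not say which mechanism SS-ZKR uses, so the choice of mechanism, the clipping bounds, and the function names are all assumptions.

```python
import math
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    """Noise scale of the classical Gaussian mechanism (valid for 0 < eps < 1)."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("requires 0 < eps < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def sanitise_numeric(value, lo, hi, eps, delta, rng):
    """Clip a numeric payload field to [lo, hi] and add calibrated Gaussian noise.

    The sensitivity of a single clipped value is hi - lo.
    """
    clipped = min(max(value, lo), hi)
    sigma = gaussian_sigma(hi - lo, eps, delta)
    return clipped + rng.normal(0.0, sigma)
```

Clipping first is what bounds the sensitivity; without a known range the noise scale cannot be calibrated.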

2606.00946 2026-06-02 cs.DC cs.AI cs.LG

Lodestar: An Online-Learning LLM Inference Router

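Gangmuk Lim, Wanyu Zhao, Brighten Godfrey, Jiaxin Shan, Le Xu, Liguang Xie

Affiliations: UIUC; ByteDance; University of Edinburgh

AI summary: Proposes Lodestar, an online-learning request routing system that collects cluster state in real time and trains a reward predictor to assign inference requests with the goal of minimizing TTFT, substantially reducing latency on heterogeneous GPU clusters.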

Abstract

Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency such as time-to-first-token (TTFT) and for GPU utilization. However, LLM request routing, that is, assigning each inference request to a GPU instance, is particularly challenging: execution is highly input-dependent; batching and KV-cache reuse create strong cross-request coupling; and latency responds nonlinearly to context length, model/engine settings, and heterogeneous accelerators. As a result, simple traditional load balancing algorithms, and even heuristics tailored for LLM inference, fail to achieve good performance. We present Lodestar, a novel learning-based request routing system for distributed GPU clusters. Lodestar continuously collects a snapshot of the cluster at per-request level, including real-time instance state, request characteristics, and observed performance, and trains an online reward predictor that it uses to route inference requests to the instance that will maximize a given reward (e.g., minimizing TTFT). Lodestar is cloud-native and works seamlessly with existing serving stacks (vLLM). With continuous online adaptation to changing workloads and infrastructure conditions, Lodestar achieves 1.41x lower average TTFT and 1.47x lower P99 TTFT on average (up to 2.15x/1.86x on homogeneous and 4.38x/4.42x on heterogeneous clusters) compared to a state-of-the-art prefix cache and load-aware heuristic, and learns these efficient routing strategies within about 5 minutes, based on experiments in a public cloud GPU cluster.
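
Illustrative sketch (not from the paper): the loop of "predict a reward per instance, route to the best one, then learn from the observed latency" can be shown with a tiny online linear predictor. The per-instance linear model, the epsilon-greedy exploration, the feature layout, and the class name are all assumptions; the abstract does not describe Lodestar's actual predictor.

```python
import numpy as np

class OnlineRouter:
    """Toy router: one online linear TTFT predictor per serving instance."""

    def __init__(self, n_instances, n_features, lr=0.05, explore=0.1, seed=0):
        self.w = np.zeros((n_instances, n_features))
        self.lr = lr
        self.explore = explore                      # probability of a random pick
        self.rng = np.random.default_rng(seed)

    def predict(self, feats):
        """feats: (n_instances, n_features), instance state plus request features."""
        return (self.w * feats).sum(axis=1)

    def route(self, feats):
        """Pick the instance with the lowest predicted TTFT, exploring occasionally."""
        if self.rng.random() < self.explore:
            return int(self.rng.integers(len(self.w)))
        return int(np.argmin(self.predict(feats)))

    def update(self, instance, feat, observed_ttft):
        """One SGD step on squared error for the instance that served the request."""
        err = self.w[instance] @ feat - observed_ttft
        self.w[instance] -= self.lr * err * feat
```

Because every served request yields a fresh training example, the predictor keeps tracking the cluster as workloads and hardware conditions drift.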

2606.00938 2026-06-02 cs.CE cs.LG

Machine Learning Surrogate Modeling for Homogenization of Hyperelastic Materials with Boolean Microstructures

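Matthias Brändel, Oliver Rheinbach

Affiliations: Technische Universität Bergakademie Freiberg

AI summary: Proposes a supervised learning approach that predicts the effective Lamé parameters of hyperelastic composites from low-dimensional microstructural descriptors (such as the area fraction, the shape descriptor τ, the two-point correlation function S2(r), and the lineal-path function ℓ(z)), and assesses generalization via leave-one-out cross-validation.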

Comments 16 pages, 7 figures

Abstract

Data-driven surrogate models are an alternative to numerical homogenization of heterogeneous materials. In this contribution, a supervised learning approach is presented for predicting effective Lamé parameters of hyperelastic composites from low-dimensional microstructural descriptors. The data set is based on previously published numerical homogenization results for ensembles of two-phase stochastic microstructures generated by planar Boolean models, covering variations of inclusion shape, phase contrast, and area fraction; see Brändel, Brands, Maike, Rheinbach, Schröder, Schwarz and Stoyan (2022). A neural network is trained on combinations of scalar and curve-valued statistical descriptors, including the area fraction, a derived scalar shape descriptor $τ$, the two-point correlation function $S_2(r)$, and the lineal-path function $\ell(z)$. Additional data representing limiting cases of the parameter space are incorporated to stabilize training and improve extrapolation behavior. The surrogate is evaluated by leave-one-grain-type-out cross-validation in order to assess generalization to unseen grain geometries. Numerical results demonstrate that additional descriptors can reduce relative errors. A predictor trained with $τ$ and $S_2(r)$ provides a compact representation with good quantitative accuracy and regular dense response behavior. Adding the lineal-path function $\ell(z)$ further reduces the error at the available data points, indicating that it is a promising additional descriptor; however, dense post-training response evaluations show that improved pointwise accuracy does not automatically guarantee physically admissible behavior between sampled parameter values. This motivates future work on physically constrained surrogate models, loss formulations, bounded output parametrizations, and a more systematic representation of curve-valued geometric descriptors.
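
Illustrative sketch (not from the paper): for a binary microstructure image, the two-point correlation $S_2(r)$ is the probability that two points separated by $r$ both fall in the inclusion phase, and $S_2(0)$ equals the area fraction. The snippet estimates it with an FFT autocorrelation under periodic boundaries. The disc generator uses a fixed number of discs as a simplified stand-in for a Boolean model, and all function names and sizes are assumptions.

```python
import numpy as np

def boolean_discs(size, n_discs, radius, rng):
    """Union of randomly placed discs on a periodic size x size grid."""
    yy, xx = np.mgrid[0:size, 0:size]
    img = np.zeros((size, size), dtype=bool)
    for cx, cy in rng.uniform(0, size, (n_discs, 2)):
        dx = np.abs(xx - cx)
        dy = np.abs(yy - cy)
        dx = np.minimum(dx, size - dx)      # periodic wrap-around distance
        dy = np.minimum(dy, size - dy)
        img |= dx ** 2 + dy ** 2 <= radius ** 2
    return img

def s2_map(indicator):
    """Two-point correlation for all lag vectors via FFT autocorrelation."""
    f = np.fft.fft2(indicator.astype(float))
    return np.fft.ifft2(f * np.conj(f)).real / indicator.size

def s2_along_x(indicator, r_max):
    """S2(r) for lags 0..r_max-1 along the horizontal axis."""
    return s2_map(indicator)[0, :r_max]
```

A curve such as `s2_along_x(img, 32)` is the kind of curve-valued descriptor that could be concatenated with scalar descriptors as network input.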

2606.00934 2026-06-02 stat.ML cs.LG stat.AP stat.ME

Efficient Synthetic Network Generation via Latent Embedding Reconstruction

Feifan Jiang, Yinan Bu, Shihao Wu, Gongjun Xu, Ji Zhu

Affiliations: Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA

AI summary: Proposes the SyNGLER framework, which builds on latent space network models and generates synthetic networks by reconstructing latent embeddings, balancing efficiency and structural fidelity.

Abstract

Network data are ubiquitous across the social sciences, biology, and information systems. Generating realistic synthetic network data has broad applications from network simulation to scientific discovery. However, many existing black-box approaches for network generation tend to overfit observed data while overlooking characteristic network structure, and incur substantial computational overhead at scale. These practical challenges call for synthetic network generation methods that are both efficient and capable of capturing structural properties of networks. In this paper, we introduce Synthetic Network Generation via Latent Embedding Reconstruction (SyNGLER), a general and efficient framework for synthetic network generation that builds on latent space network models. Given an observed network, SyNGLER first learns low-dimensional latent node embeddings via a latent space network model and then reconstructs the latent space by building a distribution-free generator over these embeddings. For generation, SyNGLER first samples (or resamples) node embeddings from the generator in the latent space and then produces synthetic networks using the latent space network model. Through the latent space framework, SyNGLER preserves unique characteristics in networks such as sparsity and node degree heterogeneity, while allowing for efficient training with lower computational cost than many existing deep architectures. We provide theoretical guarantees by developing consistency results on the distance between the true and synthetic edge distributions. Empirical studies further demonstrate the effectiveness of SyNGLER, which efficiently produces networks that better preserve key network characteristics such as network moments and degree distributions compared with existing approaches. Code is available at https://github.com/FeifanJiang/syngler.
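
Illustrative sketch (not the authors' implementation, which is at the repository linked above): the generation step can be pictured as drawing node embeddings from a generator fitted on the learned embeddings and then sampling edges from a latent space model. Here the generator is a bootstrap resample with optional jitter and the edge model is an inner-product model with a logistic link, $P(A_{ij} = 1) = \sigma(\alpha + z_i^\top z_j)$; both choices, and the function name, are assumptions.

```python
import numpy as np

def sample_network(Z, alpha, n_new, rng, jitter=0.0):
    """Generate a synthetic undirected network from learned embeddings Z (n, d).

    Embeddings are resampled with replacement (plus optional Gaussian jitter),
    then edges are drawn independently with probability sigmoid(alpha + z_i . z_j).
    """
    idx = rng.integers(0, len(Z), n_new)
    Z_new = Z[idx] + jitter * rng.standard_normal((n_new, Z.shape[1]))
    logits = alpha + Z_new @ Z_new.T
    P = 1.0 / (1.0 + np.exp(-logits))
    upper = np.triu(rng.random((n_new, n_new)) < P, k=1)   # sample each pair once
    return (upper | upper.T).astype(int)                   # symmetric, no self-loops
```

A strongly negative `alpha` keeps the sampled network sparse, while the spread of the embeddings drives degree heterogeneity.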

2606.00925 2026-06-02 cs.CR cs.AI

Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems

Ismail Hossain, Sai Puppala, Zhuoran Lu, Sajedul Talukder, Nan Jiang

Affiliations: University of Texas at El Paso; Southern Illinois University-Carbondale; Purdue University

AI summary: Proposes SkillVetBench, a two-stage security vetting benchmark that detects and verifies malicious skills in open agentic skill ecosystems through semantic vetting and sandbox execution.

Abstract

Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also creates a supply-chain risk: malicious contributors can hide harmful behavior inside skills that appear benign under superficial inspection. However, existing defenses are hard to evaluate because there is no benchmark that measures both malicious-skill detection and runtime verification. We present SkillVetBench, a two-stage security vetting benchmark for open agentic skill ecosystems. The first stage performs semantic vetting over each skill's natural-language specification to detect hidden malicious intent. The second stage executes flagged skills in an instrumented sandbox to observe runtime behavior and collect auditable evidence. We build a benchmark from confirmed malicious skills in the live OpenClaw ecosystem, including samples from the recent ClawHavoc supply-chain campaign. Unlike static-only methods, SkillVetBench verifies detected threats with execution traces. Our experiments show that: (1) semantic-only and signature-based baselines are insufficient, missing up to 89% of malicious skills whose threats arise from natural-language instructions, multi-component logic, or cross-component interactions; (2) runtime attacks are concentrated in a small set of high-permission primitives, especially exec, write_file, install_skill, and spawn; and (3) SkillVetBench provides case studies in which sandbox execution directly supports malicious verdicts with concrete runtime evidence.

2606.00922 2026-06-02 physics.med-ph cs.RO

A Machine-to-Machine Knowledge-Guided LLM Agent for Generalizable Radiotherapy Treatment Planning

Md Mainul Abrar, Xun Jia, Yujie Chi

Affiliations: National Institutes of Health (NIH); Department of Physics, The University of Texas at Arlington; Department of Radiation Oncology and Molecular Radiation Sciences, Johns Hopkins University

AI summary: Proposes a machine-to-machine knowledge-guided LLM framework in which treatment planning parameter distribution knowledge discovered by deep reinforcement learning is transferred to an LLM agent, enabling autonomous iterative planning without human intervention and markedly improving plan quality and generalization across a range of cases.

Comments 10 pages, 6 figures

Abstract

In this work, we propose a prototype machine-to-machine (M2M) knowledge-guided Large Language Model (LLM) framework for automated radiotherapy treatment planning. In the proposed paradigm, Treatment Planning Parameter (TPP) distribution knowledge discovered by a Deep Reinforcement Learning (DRL) agent is transferred to an LLM agent through in-context learning, enabling autonomous iterative planning without human intervention. While standard LLM-based planning often lacks physical intuition and struggles with convergence, the integration of DRL-derived guidance constrains the agent to a physically valid parameter space. Experimental evaluations are performed across three diverse planning scenarios: basic prostate cases, complex prostate configurations with increased organ-at-risk (OAR) constraints, and liver cases. The evaluation results demonstrate that the guided LLM agent consistently achieves optimal planning scores while significantly reducing the number of iterations compared to unguided planning. Analysis of the final TPP configurations reveals that the agent successfully learns a hierarchical priority of objectives, effectively restoring a logical "cause-and-effect" relationship between parameter tuning and dosimetric outcomes. Crucially, this prototype framework exhibits robust generalizability, maintaining high planning quality regardless of specific patient anatomy, treatment site, or initial plan quality. By bridging the specialized optimization of DRL with the adaptive reasoning of LLMs, this M2M framework establishes a scalable foundation towards generalizable autonomous treatment planning, ultimately benefiting clinical practice in realistic environments.

2606.00913 2026-06-02 stat.ML cs.LG

Bandit Simulation for Average Reward Inference

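Samya Praharaj, Chih-Yu Chang, Koulik Khamaru, Kelly W. Zhang

Affiliations: Rutgers University; Imperial College London

AI summary: Proposes the BSI framework, which fits an environment simulator and propagates parameter uncertainty to construct asymptotically valid confidence intervals for adaptive bandit algorithms.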

Abstract

Multi-armed bandit algorithms are increasingly used in online platforms, clinical trials, and social science experiments, but valid statistical inference on their performance remains an open challenge. After deploying a bandit, a natural question is whether one can construct a confidence interval for its mean reward and assess whether it reliably outperforms a baseline policy. The total reward achieved in any single bandit deployment is random, and deploying a bandit twice on the same population typically yields different reward trajectories due to stochastic rewards. Standard statistical inference methods cannot be used because bandit algorithms introduce complex dependencies in the collected data, which violate the i.i.d. assumption underlying many classical approaches. Moreover, existing inference methods for adaptively collected data only apply to estimands that do not depend on the data-collection algorithm (such as the mean reward under a fixed action). We propose Bandit Simulation for Inference (BSI), a framework that fits a simulator of the bandit environment from observed data (either on-policy or off-policy) and uses it to estimate the mean reward under any evaluation policy, including adaptive black-box algorithms. BSI formally propagates uncertainty in the estimated simulator parameters into the confidence interval construction. Furthermore, for BSI to be valid, it requires only weak exploration assumptions on the behavior policy and avoids importance weighting. We prove that BSI yields asymptotically valid confidence intervals, and demonstrate empirically that it maintains nominal coverage in settings where standard off-policy evaluation methods fail.

2606.00895 2026-06-02 math.OC cs.LG

Tiny Recursive Models for Solving the J2-Perturbed Lambert Problem

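Minduli Wijayatunga, Roberto Armellin

Affiliations: Department of Aerospace Engineering, University of Illinois Urbana-Champaign; Te Pūnaha Ātea – Space Institute, University of Auckland

AI summary: Proposes TRM-PL, a fast recursive neural solver based on Tiny Recursive Models (TRM) whose effective capacity comes from iteration depth rather than parameter count; it unifies initial-guess generation and iterative correction and markedly reduces terminal-position error across several orbital-transfer scenarios.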

English abstract

This paper presents a fast, recursive neural solver for the J2-perturbed Lambert problem based on Tiny Recursive Models (TRM), termed the TRM-Perturbed Lambert (TRM-PL) model. TRM is a weight-shared architecture whose effective capacity emerges from iteration depth rather than parameter count: a compact reasoning module is applied repeatedly within a two-level latent hierarchy, refining a candidate departure velocity by simulating the J2 trajectory and correcting it from the resulting tracking error. This unifies initial-guess generation and iterative correction in a single, end-to-end differentiable architecture. The recursive refinement loop is a learned alternative to the homotopy and continuation schemes of classical perturbed-Lambert solvers: rather than following a hand-designed path from the Keplerian to the perturbed solution, the network learns its own sequence of corrections. We evaluate TRM-PL on three test cases of increasing difficulty: single-revolution low-Earth-orbit (LEO) transfers, multi-revolution LEO transfers, and multi-revolution Jovian transfers. Three training paradigms are compared: jointly learning the Lambert solution and the J2 correction; refining the Lambert initial velocity with target-position and J2-corrected velocity supervision; and refining it with target-position supervision alone. Across all cases, the refinement-only approaches are the most reliable. The position-supervised variant reduces the median terminal-position error from 21.7 km to 0.027 km on single-revolution LEO and from 340.9 km to 0.31 km on multi-revolution LEO, all with the same 2.3M-parameter architecture. A single Newton corrector iteration on the TRM-PL output tightens the Jovian median to 0.063 km, yielding compact models accurate enough for embedded deployment.

2606.00889 2026-06-02 cs.CR cs.LG

A Lightweight Hybrid MLP-Based Framework for Real-Time Phishing URL Detection Using Structural URL Features

基于结构URL特征的轻量级混合MLP框架用于实时钓鱼URL检测

Uche Unoke Emmanuel, Gideon Francis Oghie

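Affiliations: Department of Cyber Security Science, School of Information and Communication Technology, Federal University of Technology, Minna, Nigeria(网络安全科学系,信息与通信技术学院,联邦科技大学,米纳,尼日利亚)

AI summary: Proposes a lightweight hybrid framework for real-time phishing URL detection that combines blacklist screening with a multi-layer perceptron (MLP) classifier using only structural URL features, reaching 99.24% accuracy and 1.2 ms inference latency on the PhiUSIIL dataset.

Comments 27 pages, 6 figures, 12 tables

Details
AI Chinese abstract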

钓鱼攻击仍然是主要的网络安全威胁,利用欺骗性URL窃取用户敏感信息。传统的黑名单和基于规则的检测方法是被动的,往往无法识别新出现的钓鱼URL。本文提出了一种轻量级的实时钓鱼URL检测混合框架,该框架将基于黑名单的筛选与仅基于结构URL特征的多层感知器(MLP)分类器相结合。该框架提取16个URL衍生特征,捕获结构、域和与安全相关的特征,无需网页内容访问、第三方API或视觉渲染,因此计算效率高,适合实时部署。该系统在包含235,795个标记URL的PhiUSIIL钓鱼数据集上进行了训练和评估。实验结果表明,所提出的MLP在相同评估设置下达到了99.24%的准确率、98.74%的精确率、99.95%的召回率、99.34%的F1分数和99.65%的ROC-AUC,优于随机森林、逻辑回归、XGBoost、LightGBM和CatBoost。混合架构在并发处理下实现了每个URL平均1.2毫秒的推理延迟和每秒4200个URL的峰值吞吐量。一个功能性的桌面应用程序原型CyberGuard进一步展示了部署可行性。结果表明,所提出的框架为资源受限环境下的实时钓鱼URL检测提供了准确且计算高效的解决方案。

English abstract

Phishing attacks remain a major cybersecurity threat, exploiting deceptive URLs to steal sensitive user information. Traditional blacklist and rule-based detection approaches are reactive and often fail to identify newly emerging phishing URLs. This paper proposes a lightweight hybrid framework for real-time phishing URL detection that combines blacklist-based screening with a Multi-Layer Perceptron (MLP) classifier operating solely on structural URL features. The framework extracts 16 URL-derived features capturing structural, domain-based, and security-related characteristics without requiring webpage content access, third-party APIs, or visual rendering, making it computationally efficient for real-time deployment. The system was trained and evaluated on the PhiUSIIL phishing dataset containing 235,795 labelled URLs. Experimental results show that the proposed MLP achieved 99.24% accuracy, 98.74% precision, 99.95% recall, 99.34% F1-score, and 99.65% ROC-AUC, outperforming Random Forest, Logistic Regression, XGBoost, LightGBM, and CatBoost under the same evaluation setting. The hybrid architecture achieved an average inference latency of 1.2 ms per URL and a peak throughput of 4,200 URLs per second under concurrent processing. A functional desktop application prototype, CyberGuard, further demonstrates deployment viability. The results indicate that the proposed framework provides an accurate and computationally efficient solution for real-time phishing URL detection in resource-constrained environments.
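As a rough illustration of the two-step design described above (blacklist screening first, then a classifier over structural URL features), the sketch below extracts a handful of features using only the Python standard library. The abstract does not list the paper's 16 features, so the eight shown here, the function names `url_features` and `is_phishing`, and the stand-in `model` callable are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def _is_ip_address(host: str) -> bool:
    """Return True when the host is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Structural features computed from the URL string alone (no page fetch)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parts.scheme == "https",
        "host_is_ip": _is_ip_address(host),
    }


def is_phishing(url: str, blacklist: set, model) -> bool:
    """Blacklist lookup first; fall back to the classifier on a miss."""
    host = urlsplit(url).hostname or ""
    if host in blacklist:
        return True
    return bool(model(url_features(url)))


# Stand-in rule instead of a trained MLP, purely for demonstration.
toy_model = lambda f: f["host_is_ip"] and not f["uses_https"]
print(is_phishing("http://192.168.0.1/login-update?id=42", set(), toy_model))
```

In a real system the `model` argument would be a trained classifier; the point of the sketch is only that every feature is derived from the URL string, so no network access is needed at inference time.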

2606.00867 2026-06-02 stat.ML cs.LG eess.SP

Statistical Analysis of using the Shapley Value for Sensor Anomaly Localization with Accurate Classifiers

使用Shapley值进行传感器异常定位与准确分类器的统计分析

Xubin Fang, Rick S. Blum

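Affiliations: Electrical and Computer Engineering Department of Lehigh University(里海大学电气与计算机工程系)

AI summary: Analyzes how well the Shapley value localizes sensor anomalies when it is computed with mathematically defined optimum binary classifiers; proves that under independent observations the Shapley test is equivalent to a lower-complexity test, whereas in correlated bivariate Gaussian/Laplacian scenarios the two tests differ fundamentally, and provides the first theoretical statistical results on the topic.

Details
AI Chinese abstract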

最近的出版物建议使用Shapley值进行传感器异常/攻击定位。我们通过在Shapley值计算中使用数学定义的二元最优分类器来研究这种方法的性能。为了判断定位性能,我们研究给定传感器观测的Shapley值确定该观测是否异常的能力。首先,我们证明对于独立传感器观测的情况,使用Shapley值的优化异常测试等价于使用Shapley值计算中单个项的优化低复杂度异常测试,产生完全相同的错误概率。对于涉及两个传感器的一些流行的相关观测情况,包括相关双变量高斯/拉普拉斯概率密度函数和常数/高斯攻击/异常,我们证明这两个测试本质上是不同的,产生不同的决策区域和错误概率。此外,我们证明在某些统计相关的双变量高斯场景中,当相关幅度较大且存在加性攻击/异常时,Shapley值测试有时严格劣于另一个(Shapley计算中的单个项)测试,而在其他情况下则严格优于它,具体取决于相关的符号。在这些情况下,可以结合这两种方法以获得严格更好的方法。这些结果首次提供了基于Shapley定位的理论统计分析,鉴于许多研究人员广泛接受Shapley值,这些结果似乎非常有趣,并应鼓励对该主题的进一步研究。提供了数值结果以说明我们的发现。

English abstract

Recent publications have suggested using the Shapley value for sensor anomaly/attack localization. We study the performance of such an approach by using mathematically defined optimum binary classifiers in the Shapley value calculation. To judge localization performance, we study the ability of the Shapley value of a given sensor observation to determine if that observation is anomalous. First, we prove that for cases with independent sensor observations, an optimized anomaly test using the Shapley value is equivalent to an optimized lower-complexity anomaly test using a single term in the Shapley value calculation, yielding the exact same probability of error. For some popular dependent observation cases involving two sensors, including correlated bivariate Gaussian/Laplacian probability density functions and constant/Gaussian attacks/anomalies, we prove that these two tests are fundamentally different, yielding different decision regions and error probabilities. Further, we prove that the Shapley value test is sometimes strictly inferior to the other (single term in Shapley calculation) test in certain statistically dependent bivariate Gaussian scenarios with large correlation magnitude and additive attacks/anomalies, while it is strictly superior in others, depending on the sign of the correlation. One can combine these two approaches to obtain a strictly better approach in these cases. These results, which provide the first theoretical statistical analysis of Shapley-based localization, seem very interesting based on the wide acceptance of the Shapley value by many researchers and should encourage further research on this topic. Numerical results are provided which illustrate our findings.
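For readers unfamiliar with the quantity being analyzed, the sketch below computes exact Shapley values by averaging each player's marginal contribution over all orderings. It is a generic textbook computation, not the paper's test: the two "sensors" `a` and `b` and the value table are made-up numbers, and in the paper the value function would come from classifier outputs.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            joined = coalition | {p}
            totals[p] += value(joined) - value(coalition)
            coalition = joined
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


# Hypothetical value function for two sensors (illustrative numbers only).
table = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.2,
    frozenset({"b"}): 0.6,
    frozenset({"a", "b"}): 1.0,
}
phi = shapley_values(["a", "b"], table.__getitem__)
print(phi)
```

With two players each Shapley value is the average of just two marginal terms, for example `(v({a}) - v({})) / 2 + (v({a, b}) - v({b})) / 2` for sensor `a`, which makes it easy to see what a test built on a single one of those terms leaves out.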

2606.00862 2026-06-02 cs.NE cs.LG

Meta-Black-Box Optimization with Ensemble Surrogate Modeling for Robustness-Accuracy Trade-off within SAEA

基于集成代理建模的元黑箱优化以实现SAEA中的鲁棒性-准确性权衡

Xiao Jin, Yongxiong Wang, Haobo Liu, Yudong Du, Yukun Du

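Affiliations: GitHub

AI summary: Proposes AdaE-SAEA, which embeds an SAEA in a MetaBBO framework and jointly controls the infill criterion and ensemble surrogate modeling; a meta-policy trained with reinforcement learning adaptively balances robustness and accuracy and outperforms existing methods on expensive multi-objective optimization.

Details
AI Chinese abstract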

代理辅助进化算法(SAEAs)已被广泛用于昂贵的黑箱优化问题。然而,它们对刚性且手动设计组件的依赖限制了其跨任务的灵活性和泛化能力。元黑箱优化(MetaBBO)为自适应配置算法组件提供了一种有前景的范式。尽管如此,现有的MetaBBO方法通常只控制单个组件,很少有研究调查多组件优化器(如SAEAs)的统一控制。此外,代理建模中的鲁棒性-准确性权衡对于早期稳定探索和后期精确开发至关重要,但很少被明确考虑。为了解决这些问题,我们提出了AdaE-SAEA,一种用于昂贵多目标优化的自适应集成代理辅助进化算法。AdaE-SAEA将SAEA作为低层优化器嵌入MetaBBO框架,并联合控制填充准则和基于集成的代理建模。具体来说,bagging和boosting被设计为代理建模模块,以在不同搜索阶段自适应平衡鲁棒性和准确性,而元策略同时选择填充准则以实现自适应采样决策。元策略通过并行采样和集中训练的强化学习进行训练,提高了训练效率和可迁移性。在合成和实际问题上的实验表明,AdaE-SAEA优于最先进的基线和基于MetaBBO的方法。我们进一步验证了TabPFN作为集成学习基础代理模型的有效性。据我们所知,这是第一个统一控制SAEAs中代理建模和填充准则,同时明确解决鲁棒性-准确性权衡的工作。

English abstract

Surrogate-assisted evolutionary algorithms (SAEAs) have been widely used for expensive black-box optimization problems. However, their reliance on rigid and manually designed components limits their flexibility and generalization across tasks. Meta-black-box optimization (MetaBBO) provides a promising paradigm for adaptively configuring algorithmic components. Nevertheless, existing MetaBBO methods usually control only a single component, and few studies have investigated the unified control of multi-component optimizers such as SAEAs. Moreover, the robustness-accuracy trade-off in surrogate modeling, which is crucial for stable early-stage exploration and accurate late-stage exploitation, has rarely been explicitly considered. To address these issues, we propose AdaE-SAEA, an adaptive ensemble surrogate-assisted evolutionary algorithm for expensive multi-objective optimization. AdaE-SAEA embeds SAEA as the low-level optimizer within the MetaBBO framework and jointly controls the infill criterion and ensemble-based surrogate modeling. Specifically, bagging and boosting are designed as surrogate modeling modules to adaptively balance robustness and accuracy across different search phases, while the meta-policy simultaneously selects the infill criterion to enable adaptive sampling decisions. The meta-policy is trained through reinforcement learning with parallel sampling and centralized training, improving both training efficiency and transferability. Experiments on synthetic and real-world problems demonstrate that AdaE-SAEA outperforms state-of-the-art baselines and MetaBBO-based methods. We further verify the effectiveness of TabPFN as the base surrogate model for ensemble learning. To the best of our knowledge, this is the first work to unify the control of surrogate modeling and infill criteria in SAEAs while explicitly addressing the robustness-accuracy trade-off.

2606.00860 2026-06-02 cs.SI cs.AI cs.CL

GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing

GenPT:通过生成式投射测试实现超越自我报告的可靠LLM心理测量

Ming Wang, Shuang Wu, Bixuan Wang, Lu Lin, Yuxin Chen, Xiaocui Yang, Daling Wang, Shi Feng, Yifei Zhang, Yufan Sun

Affiliations: School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院); School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院); Mental Health Education Center, Northeastern University(东北大学心理健康教育中心); School of Psychology, Northeast Normal University(东北师范大学心理学院); Faculty of Psychology, Southwest University(西南大学心理学部); School of Sociology and Psychology, Central University of Finance and Economics(中央财经大学社会学与心理学学院); College of Arts, Northeastern University(东北大学艺术学院)

AI summary: Addresses training-corpus contamination and directional bias in the self-report questionnaires used for psychometrics of persona-conditioned agents by proposing GenPT, which adapts projective-test paradigms and builds a three-stage assessment pipeline for more reliable measurement of psychological states.

Details
AI Chinese abstract

自我报告问卷仍然是探测人格化智能体(PC-Agents)心理状态的主流工具。然而,经典工具存在两个众所周知的威胁:来自训练语料的污染以及由社会期望或上下文框架驱动的方向性偏差。为了克服这些方法论瓶颈,我们探讨投射范式是否能够被改编为一种稳健的心理测量工具。我们提出了GenPT(生成式投射测试),它通过新生成的刺激重新表述了TAT、罗夏测试和SCT,并将评估组织为三阶段流程,以导出标准化的心理指标和目标状态。通过评估由CharacterRAG和AnnaAgent配置文件诱导的PC-Agents,我们针对经典问卷基准测试了GenPT的信度和效度。结果表明,问卷在社会期望框架下表现出系统性的方向性偏移,在自杀意念上最为强烈。相比之下,GenPT收集的行为模式保持在对称基线附近。此外,在纵向咨询背景下,当Qwen3作为骨干模型时,基于GenPT的抑郁评估变化幅度比问卷对应方法大约一个数量级。总体而言,GenPT在需要抗污染、偏差对称性和上下文敏感性的场景中补充了自我报告方法。代码和刺激材料可在https://github.com/sci-m-wang/GenPT获取。

English abstract

Self-report questionnaires remain the prevailing tool for probing the psychological states of persona-conditioned agents (PC-Agents). However, classical instruments inherit two well-known threats: contamination from training corpora and directional bias driven by social-desirability or contextual framing. To overcome these methodological bottlenecks, we ask whether projective paradigms can be adapted into a robust psychometric tool. We introduce GenPT (Generative Projective Testing), which reformulates TAT, Rorschach, and SCT with newly generated stimuli and organizes assessment as a three-stage pipeline to derive standardized psychological indicators and target states. Evaluating PC-Agents induced via CharacterRAG and AnnaAgent profiles, we benchmark GenPT's reliability and validity against classical questionnaires. The results indicate that questionnaires exhibit systematic directional shifts under social-desirability framing, most strongly on suicide ideation. In contrast, GenPT's collected behavioral patterns stay near the symmetric baseline. Furthermore, under a longitudinal counselling context, GenPT-based depression assessment shifts by roughly an order of magnitude more than the questionnaire counterpart when Qwen3 serves as the backbone. Overall, GenPT complements self-report methods in scenarios where contamination resistance, bias asymmetry, and context sensitivity matter. Code and stimuli can be found at https://github.com/sci-m-wang/GenPT.

2606.00834 2026-06-02 stat.AP cs.AI cs.LG math.PR

Hybrid Probabilistic Forecasting of Under-Five Malaria Admissions in Ghana: A Gaussian Process Regression with Holt-Winters Smoothing

加纳五岁以下儿童疟疾住院人数的混合概率预测:高斯过程回归与Holt-Winters平滑

T. Ansah-Narh, Y. Asare Afrane, J. Bremang Tandoh

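Affiliations: GAEC, Ghana(加纳原子能委员会)

AI summary: Targets the seasonality and data-uncertainty challenges of malaria forecasting in Ghana with a hybrid model that combines Gaussian Process Regression and Holt-Winters exponential smoothing, producing probabilistic forecasts and evaluating their performance.

Comments 24 pages, 8 figures, accepted for publication in Artificial Intelligence in Medicine

Details
AI Chinese abstract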

准确的疟疾预测在撒哈拉以南非洲仍是一个重大挑战,那里强烈的季节性、报告不确定性和非平稳传播动态降低了传统模型的可靠性。在加纳,地区级疟疾监测需要概率上严谨且数据有限时稳健的预测框架。本研究提出了一个混合框架,将高斯过程回归(GPR)与Holt-Winters指数平滑相结合,用于建模每月五岁以下儿童疟疾住院人数。GPR捕捉非线性行为和预测不确定性,而Holt-Winters稳定长期预测并保留季节结构。使用十年(2014-2023年)的地区级数据,通过滚动起点扩展窗口验证评估性能。混合模型实现了$R^2 = 0.9906$,而单独Holt-Winters为$0.8213$,$94.2\%$的残差在$\pm 2σ$范围内。2024-2028年的预测显示月平均住院人数约为8,000至12,200例。时空分析揭示了显著的生态异质性:北部高负担地区尽管绝对波动较大,但相对模式稳定。该框架为疟疾流行地区的早期预警和运营规划提供了一种可扩展的概率方法,支持加纳国家疟疾控制战略。

English abstract

Accurate malaria forecasting remains a major challenge in sub-Saharan Africa, where strong seasonality, reporting uncertainty, and non-stationary transmission dynamics reduce the reliability of conventional models. In Ghana, district-level malaria surveillance requires forecasting frameworks that are probabilistically rigorous and robust under limited data. This study proposes a hybrid framework integrating Gaussian Process Regression (GPR) with Holt-Winters exponential smoothing for modelling monthly under-five malaria admissions. GPR captures non-linear behaviour and predictive uncertainty, while Holt-Winters stabilises long-horizon forecasts and preserves seasonal structure. Using ten years of district-level data (2014-2023), performance was evaluated via rolling-origin expanding-window validation. The hybrid model achieved $R^2 = 0.9906$ versus $0.8213$ for Holt-Winters alone, with $94.2\%$ of residuals within $\pm 2\sigma$ bounds. Forecasts for 2024-2028 project average monthly admissions from approximately 8,000 to 12,200 cases. Spatio-temporal analysis revealed pronounced ecological heterogeneity: northern high-burden districts exhibited stable relative patterns despite large absolute fluctuations. The framework provides a scalable probabilistic approach for malaria early warning and operational planning in endemic settings, supporting Ghana's national malaria control strategy.

2606.00822 2026-06-02 cs.IR cs.AI

SkillPager: Query-Adaptive Intra-Skill Navigation via Semantic Node Retrieval

SkillPager: 通过语义节点检索实现查询自适应的技能内导航

Zicai Cui, Zihan Guo, Weiwen Liu, Weinan Zhang

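Affiliations: Shanghai Jiao Tong University(上海交通大学); Shanghai Innovation Institute(上海创新研究院); Sun Yat-sen University(中山大学)

AI summary: For skill-based LLM agents that need efficient retrieval within long procedural documents, proposes SkillPager, a two-stage framework that parses Markdown skills into typed semantic nodes offline and selects query-conditioned nodes online with MMR, substantially reducing prompt tokens while keeping context sufficiency high.

Comments 20 pages, 6 figures

Details
AI Chinese abstract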

基于技能的LLM代理越来越依赖长过程文档,但全文档提示浪费令牌并稀释对执行至关重要的信息。我们将此设置研究为技能内检索,其目标是从已知技能文档中为给定查询选择最小且执行充分的上下文。我们提出SkillPager,一个两阶段框架,离线将每个Markdown技能解析为类型化语义节点,并在线利用最大边际相关性(MMR)进行全局的、查询条件的节点选择。在包含395个技能和1,975个查询的基准测试中,SkillPager实现了78.89%的LLM判断上下文充分性,而穷举全文档基线为82.23%,同时减少了47.04%的提示令牌。粒度消融实验表明,将相同的检索算法应用于原始固定长度块可达到可比的81.77%充分性,但令牌成本增加了28.81%,证明效率提升源于类型化语义粒度而非检索算法本身。在基于图的基线中,SkillPager以12.16%的幅度优于最强基线。进一步的消融实验表明,支持内容在候选池中保留并通过自适应选择而非静态启发式移除时最为有效。这些结果将类型化文档内检索确定为基于技能的代理的一个独特访问问题。

English abstract

Skill-based LLM agents increasingly rely on long procedural documents, but full-document prompting wastes tokens and dilutes information critical to execution. We study this setting as intra-skill retrieval, where the goal is to select a minimal, execution-sufficient context from a known skill document given a query. We present SkillPager, a two-stage framework that parses each Markdown skill into typed semantic nodes offline and leverages Maximal Marginal Relevance (MMR) to perform global, query-conditioned node selection online. On a benchmark of 395 skills and 1,975 queries, SkillPager achieves 78.89% LLM-judged context sufficiency, compared to 82.23% for the exhaustive full-document baseline, while reducing prompt tokens by 47.04%. A granularity ablation shows that applying the same retrieval algorithm to raw fixed-length chunks reaches a comparable 81.77% sufficiency but increases token cost by 28.81%, demonstrating that efficiency gains are driven by typed semantic granularity rather than the retrieval algorithm alone. Among graph-based baselines, SkillPager outperforms the strongest baseline by a margin of 12.16%. Further ablations show that supporting content is most effective when retained in the candidate pool and selected adaptively rather than removed by static heuristics. These results identify typed intra-document retrieval as a distinct access problem for skill-based agents.
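The online selection step relies on Maximal Marginal Relevance, which greedily picks items that are relevant to the query but not redundant with what has already been picked. The sketch below is a generic MMR implementation over toy 2-D vectors; the embedding values, the `lam` trade-off weight, and the function names are assumptions for illustration and say nothing about SkillPager's actual node embeddings or parameters.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mmr_select(query, nodes, k, lam=0.5):
    """Greedy MMR: trade relevance to the query against similarity to prior picks."""
    selected = []
    candidates = list(range(len(nodes)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, nodes[i])
            redundancy = max((cosine(nodes[i], nodes[j]) for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
nodes = [
    [1.0, 0.1],    # highly relevant
    [1.0, 0.12],   # near-duplicate of the first node
    [0.5, -0.8],   # less relevant, but covers a different direction
]
print(mmr_select(query, nodes, k=2, lam=0.5))  # diversity-aware selection
print(mmr_select(query, nodes, k=2, lam=1.0))  # pure relevance ranking
```

With `lam=0.5` the near-duplicate is skipped in favor of the more distinct node, while `lam=1.0` reduces to plain top-k by relevance.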

2606.00817 2026-06-02 cs.GR cs.CV

Directed Distance Fields for Constant-Time Ray Queries on Gaussian Splatting

定向距离场:用于高斯泼溅的恒定时间射线查询

Subhankar Mishra

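Affiliations: School of Computer Sciences, National Institute of Science Education and Research (NISER)(计算机科学学院,国家科学教育与研究研究所(NISER))

AI summary: Proposes a Directed Distance Function (DDF) that turns a trained 3D Gaussian Splatting scene into a ray oracle with constant-time ray queries, for tracing secondary rays such as those needed for global illumination.

Details
AI Chinese abstract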

3D高斯泼溅(3DGS)实时渲染场景的新视图。与所有光栅化器一样,它只回答主射线,即从相机穿过图像的射线。它无法追踪阴影、环境遮挡和全局光照所需的二次射线。我们通过蒸馏定向距离函数(DDF)将训练好的3DGS场景转化为射线预言机。DDF是一个小型神经场。它接受由原点和方向给出的射线,并返回到第一个表面的距离以及射线是否击中任何物体。每次查询是一次前向传递。该场大小为52 MB,其大小不依赖于高斯数量,因此其成本和内存随场景增长而保持不变。我们提出三点。首先,我们研究DDF需要什么样的监督。从高斯渲染的深度太模糊,无法学习薄部分,而清晰的距离监督可以恢复它们。其次,我们测量速度。DDF比球体追踪等效的有符号距离场快26到72倍,并且与在高斯上构建的包围体积层次结构不同,即使在专用的RT核心硬件上,其查询时间和内存也不会随场景增长。第三,我们展示了一个不需要网格的流程:图像生成3DGS场景,神经表面提供清晰的距离,DDF从中学习。我们将DDF用作全局光照的二次射线预言机。它在142个对象上以30.3 dB的PSNR再现参考光线追踪阴影,以21.3 dB的PSNR再现环境遮挡,并在真实捕获场景上有效。我们的代码可在https://github.com/smlab-niser/ddf-gs获取。

English abstract

3D Gaussian Splatting (3DGS) renders new views of a scene in real time. Like every rasterizer, it answers only primary rays, the rays from the camera through the image. It cannot trace the secondary rays that shadows, ambient occlusion, and global illumination need. We turn a trained 3DGS scene into a ray oracle by distilling a Directed Distance Function (DDF). The DDF is a small neural field. It takes a ray, given by an origin and a direction, and returns the distance to the first surface and whether the ray hits anything. Each query is one forward pass. The field is 52 MB, and its size does not depend on the number of Gaussians, so its cost and memory stay flat as the scene grows. We make three points. First, we study what supervision a DDF needs. Depth rendered from the Gaussians is too blurry to teach thin parts, while clean distance supervision recovers them. Second, we measure speed. The DDF is 26 to 72 times faster than sphere tracing an equivalent signed distance field, and unlike a bounding volume hierarchy built over the Gaussians, even on dedicated RT-core hardware, its query time and memory do not grow with the scene. Third, we show a pipeline that needs no mesh: images give a 3DGS scene, a neural surface gives clean distances, and the DDF learns from them. We use the DDF as a secondary-ray oracle for global illumination. It reproduces reference ray-traced shadows at 30.3 dB and ambient occlusion at 21.3 dB across 142 objects, and it also works on real captured scenes. Our code is available at https://github.com/smlab-niser/ddf-gs.
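The speed comparison is against sphere tracing, which answers a ray query by marching along the ray in steps sized by a signed distance field, so one query costs several field evaluations. The sketch below shows that baseline on an analytic unit sphere; the scene, tolerances, and function names are illustrative assumptions. A DDF would instead return the `(distance, hit)` pair from a single network evaluation, which is not implemented here.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from a point to the surface of a sphere."""
    return math.dist(point, center) - radius


def sphere_trace(origin, direction, sdf, max_steps=128, eps=1e-6, t_max=100.0):
    """March along a unit-length ray direction; return (distance, hit, sdf_evaluations)."""
    t = 0.0
    for step in range(1, max_steps + 1):
        point = tuple(o + t * d for o, d in zip(origin, direction))
        gap = sdf(point)          # largest step guaranteed not to cross a surface
        if gap < eps:
            return t, True, step  # close enough to the surface: report a hit
        t += gap
        if t > t_max:
            return t, False, step  # left the scene bounds: report a miss
    return t, False, max_steps


hit_result = sphere_trace((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), sphere_sdf)
miss_result = sphere_trace((0.0, 2.0, -3.0), (0.0, 0.0, 1.0), sphere_sdf)
print(hit_result, miss_result)
```

The third element of each result counts signed-distance evaluations, which is the per-ray cost that a single-forward-pass query avoids.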

2606.00813 2026-06-02 cs.CR cs.CL cs.ET cs.LG cs.NE

Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

跨代对抗攻击迁移揭示大语言模型安全对齐的非单调性

Subhadip Mitra

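Affiliations: Rota Labs(Rota实验室)

AI summary: Uses quality-diversity evolution (MAP-Elites) as an automated red-teaming probe on four generations of Gemma models and finds that safety alignment changes non-monotonically: Gemma 3 has a markedly higher attack success rate than its predecessor and successor, and attack transfer rates differ across generations.

Comments 8 pages, 3 figures

Details
AI Chinese abstract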

大语言模型的安全对齐在不同代际之间并非单调提升。通过对Google Gemma家族四代模型(7B-31B)使用质量多样性进化(MAP-Elites)作为自动红队探测,我们发现Gemma 3(12B)的攻击成功率为68.7% ± 5.7%(均值±标准差,3个种子),显著高于其前代Gemma 2(45.5% ± 7.2%;p = 0.030,配对bootstrap)和后继Gemma 4(33.9% ± 1.8%)。跨代重放进化攻击档案显示,来自其他代际的攻击对Gemma 3的迁移率为44-46%,而对Gemma 4仅为14-18%,表明Gemma 4的安全增益泛化到了针对前代进化出的攻击分布之外。在我们的8B评判模型下,版权和网络犯罪漏洞在所有代际中接近100%,但第二评判审计(第6节)表明版权结果对评判选择敏感。错误信息ASR从Gemma 2的29%跃升至Gemma 3的99%,并在Gemma 4中仍保持77%的高位,表明该退化未得到完全解决。这些模式在静态基准测试中不可见,仅通过自适应的纵向探测才显现。所有实验使用3个随机种子和统一的自主托管评判模型;代码和工件可在https://github.com/bassrehab/red-queen获取。

English abstract

Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/- 5.7% attack success rate (ASR; mean +/- std, 3 seeds), significantly higher than its predecessor Gemma 2 (45.5% +/- 7.2%; p = 0.030, paired bootstrap) and its successor Gemma 4 (33.9% +/- 1.8%). Replaying evolved attack archives across generations reveals that attacks from other generations transfer to Gemma 3 at 44-46% but only 14-18% to Gemma 4, indicating that Gemma 4's safety gains generalize beyond the attack distributions evolved against earlier generations. Under our 8B judge, copyright and cybercrime vulnerabilities register at near-100% across all generations, though a second-judge audit (Section 6) suggests the copyright result is sensitive to judge choice. Misinformation ASR jumps from 29% to 99% between Gemma 2 and Gemma 3 and remains elevated at 77% in Gemma 4, indicating the regression was not fully addressed. These patterns are invisible to static benchmarks and emerge only through adaptive, longitudinal probing. All experiments use 3 random seeds with a unified self-hosted judge; code and artifacts are available at https://github.com/bassrehab/red-queen.

2606.00811 2026-06-02 econ.EM cs.AI

Certificates without Electrons? Theory and Evidence on Impacts from AI-Driven Power Demand

没有电子的证书?AI驱动电力需求影响的理论与证据

Dana Golden, Aruna Balasubramanian, Niranjan Balasubramanian

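Affiliations: Department of Economics, Stony Brook University(石溪大学经济系); Department of Computer Science, Stony Brook University(石溪大学计算机科学系)

AI summary: Uses a game-theoretic model and a natural experiment to study how AI data centers' use of renewable energy certificates and power purchase agreements affects grid reliability, electricity prices, and emissions; finds that certificates cannot resolve the timing mismatch, whereas colocated storage mitigates it effectively.

Details
AI Chinese abstract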

数据中心目前占美国电力需求的4.4%,但超大规模企业用于宣称碳中和的可再生能源证书(RECs)和购电协议(PPAs)在电网层面的有效性仍不明确。我们开发了一个博弈论模型,其中数据中心运营商在RECs、PPAs和表后共置之间选择,而发电商在内生融资成本下做出进入决策。该模型识别出一个时序楔子——消费与信用可再生能源发电之间的不匹配——作为核心机制,通过该机制,即使RECs覆盖100%的年消费量,AI需求也会降低可靠性、提高价格并增加排放。与储能共置直接解决了这一楔子,并通过消除发电商收入风险诱导最大的可再生能源进入。我们通过利用大型语言模型的分阶段发布作为自然实验来检验这些预测,使用双重差分法分析一个将AI活动与当地电网结果联系起来的新数据集。AI需求显著增加了数据中心附近的化石燃料发电、批发价格(在处理的PJM区域高达25%)和停电频率(每年额外0.5-1次停电),其影响随模型规模扩大而扩大。拥有现场发电的数据中心在电能质量效应上表现出符号反转,这与模型的预测一致,即表后容量吸收了需求峰值。反事实分析表明,边缘推理、空间重新分配和共置储能均能显著减轻电网影响,而仅依赖RECs的策略则不能。总之,我们的结果表明,AI对电网的外部性与采购设计及数据中心基础设施的空间组织紧密相关。

English abstract

Data centers now account for 4.4% of United States electricity demand, yet the grid-level effectiveness of the renewable energy certificates (RECs) and power purchase agreements (PPAs) hyperscalers use to claim carbon neutrality remains unclear. We develop a game-theoretic model in which a data center operator chooses among RECs, PPAs, and behind-the-meter colocation while generators make entry decisions under endogenous financing costs. The model identifies a timing wedge (the mismatch between consumption and credited renewable generation) as a central mechanism through which AI demand degrades reliability, raises prices, and increases emissions even when RECs cover 100% of annual consumption. Colocation with storage addresses this wedge directly and induces the greatest renewable entry by eliminating generator revenue risk. We test these predictions by exploiting the staggered release of large language models as a natural experiment, using difference-in-differences on a novel dataset linking AI activity to local grid outcomes. AI demand significantly increases fossil generation, wholesale prices (up to 25% in treated PJM zones), and outage frequency (0.5 to 1 additional outages per year) near data centers, with impacts scaling in model size. Data centers with on-site generation exhibit a sign reversal in power-quality effects, consistent with the model's prediction that behind-the-meter capacity absorbs demand spikes. Counterfactual analyses show that edge inference, spatial reallocation, and colocated storage each substantially mitigate grid impacts, while REC-only strategies do not. Together, our results demonstrate that the externalities of AI to the grid are tightly coupled to procurement design and the spatial organization of data center infrastructure.
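The timing wedge can be made concrete with a little arithmetic: certificates can match consumption in total volume while matching far less of it hour by hour. The toy example below uses a single 24-hour day as a stand-in for a year, with a flat load and midday-only solar output; the numbers and the function name `rec_coverage` are invented for illustration and are not taken from the paper's model or data.

```python
def rec_coverage(load_mwh, renewable_mwh):
    """Return (volumetric share, hourly matched share) of load covered by renewables."""
    total_load = sum(load_mwh)
    volumetric = sum(renewable_mwh) / total_load
    matched = sum(min(l, r) for l, r in zip(load_mwh, renewable_mwh)) / total_load
    return volumetric, matched


# Flat 100 MWh load every hour; solar credited only between 08:00 and 16:00.
load = [100.0] * 24
solar = [300.0 if 8 <= hour < 16 else 0.0 for hour in range(24)]

volumetric, matched = rec_coverage(load, solar)
print(f"volumetric coverage: {volumetric:.0%}, hourly matched: {matched:.0%}")
```

Here the certificates cover 100% of consumption by volume, yet only a third of the load coincides with renewable output; the remaining hours must be served by whatever else is on the grid.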

2606.00803 2026-06-02 astro-ph.CO cs.CV cs.LG

Generative Diffusion Priors for 3D Mapping of the Dark Universe

用于暗宇宙三维映射的生成扩散先验

Brandon Zhao, Diana Scognamiglio, Olivier Doré, Katherine L. Bouman

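Affiliations: Department of Computing and Mathematical Sciences, California Institute of Technology(加州理工学院计算与数学科学系); Jet Propulsion Laboratory, California Institute of Technology(加州理工学院喷气推进实验室); Department of Physics, Duke University(杜克大学物理系); Cahill Center for Astronomy and Astrophysics, California Institute of Technology(加州理工学院卡希尔天文与天体物理中心)

AI summary: Learns a prior from cosmological simulations with a diffusion model and combines it with a physical forward model to solve the 3D dark-matter inverse problem in weak gravitational lensing, markedly improving reconstruction accuracy and producing statistically consistent posterior samples.

Comments Accepted to CVPR 2026 (Highlight)

Details
AI Chinese abstract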

从弱引力透镜观测重建暗物质的三维分布是宇宙学中一个核心但高度病态的反问题。与多视角标准三维重建不同,我们通过单一视线方向观测宇宙,通过星系不确定距离的噪声形状畸变,因此有意义的三维物质场恢复需要强先验假设。现有方法要么使用手工先验产生点估计,要么使用神经集成进行近似贝叶斯不确定性,难以捕捉宇宙网的非高斯、纤维状结构。随着新的高分辨率宇宙学模拟的出现,我们现在有了另一种先验知识来源,其捕捉结构形成的非线性统计的保真度远高于解析公式。我们利用这些模拟构建了一个新数据集$\texttt{Conicus3D}$,使我们能够学习一个数据驱动的扩散模型先验,捕捉暗物质结构在宇宙时间内的完整三维分布。基于最近的即插即用方法,我们将基于扩散的后验采样方案修改为三维弱引力透镜设置,将学习到的先验与可微分的物理正向模型相结合。在针对现代弱引力透镜巡天的逼真模拟上,我们的方法在二维和三维重建精度上显著优于基线方法。此外,它产生的后验样本的统计量紧密跟踪底层模拟,同时对宇宙学参数的适度偏移保持鲁棒性。

English abstract

Reconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simulations to build a new dataset $\texttt{Conicus3D}$, which enables us to learn a data-driven diffusion-model prior capturing the full 3D distribution of dark matter structure across cosmic time. Building on recent plug-and-play approaches, we modify a diffusion-based posterior sampling scheme to the 3D weak-lensing setting, combining the learned prior with a differentiable physical forward model. On realistic simulations targeting a modern weak lensing survey, our approach yields substantially improved 2D and 3D reconstruction accuracy over baseline methods. Moreover, it produces posterior samples whose statistics closely track the underlying simulations, while remaining robust to moderate shifts in cosmology.

2606.00801 2026-06-02 cs.CR cs.CL cs.ET cs.LG cs.NE

Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

用于发现LLM安全中多样漏洞的质量-多样性进化

Subhadip Mitra

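Affiliations: Rota Labs(Rota实验室)

AI summary: Proposes a quality-diversity evolutionary framework (MAP-Elites) that generates interpretable attack strategies at the semantic level and uncovers model-specific vulnerability patterns in different LLMs.

Comments 9 pages, 6 figures. Accepted at the ICLR 2026 Workshop on Agents in the Wild (AIWILD)

Details
AI Chinese abstract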

当前LLM对抗性测试方法存在覆盖缺口:手动红队测试无法扩展,LLM作为攻击者的方法会出现模式崩溃,基于梯度的方法产生不可解释的乱码。我们引入一个在语义层面运行的质量-多样性进化框架,进化可解释的攻击策略而非令牌序列。使用MAP-Elites,我们在行为维度(策略类型、编码方法、提示长度)上维护一个多样化的攻击档案。在GPT-4o-mini、Claude 3.5 Sonnet、Gemini 2.0 Flash和一个开放权重的编码模型(Devstral-small-2)上的实验中,我们发现了不同的漏洞特征:GPT-4o-mini容易受到假设性和多轮框架结合ROT13编码的攻击(适应度0.8),Gemini容易受到直接攻击结合ROT13以及多轮攻击结合Leetspeak(0.8),而Claude在所有策略上表现出统一的模糊响应(最大0.4)。语义表示产生了可解释的攻击,揭示了系统性的、模型特定的弱点,为改进LLM安全性提供了可操作的见解,并为评估未来前沿模型提供了可复现的基线。代码和实验工件发布在https://github.com/bassrehab/red-queen。

English abstract

Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method, prompt length). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open-weight coding model (Devstral-small-2), we discover distinct vulnerability profiles: GPT-4o-mini is vulnerable to hypothetical and multi-turn framing combined with ROT13 encoding (fitness 0.8), Gemini to direct attacks with ROT13 and multi-turn with Leetspeak (0.8), while Claude shows uniformly ambiguous responses across all strategies (max 0.4). The semantic representation produces interpretable attacks that reveal systematic, model-specific weaknesses, providing actionable insights for improving LLM safety and a reproducible baseline for evaluating future frontier models. Code and experiment artifacts are released at https://github.com/bassrehab/red-queen.
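MAP-Elites, the quality-diversity algorithm named above, keeps a grid of "cells" indexed by behavior descriptors and stores only the best candidate found in each cell. The sketch below shows that archive mechanic on a deliberately neutral toy problem (points in the unit square); the fitness function, the 5x5 grid, and all function names are assumptions for illustration and have nothing to do with the paper's attack representation or scoring.

```python
import random


def map_elites(initial, mutate, evaluate, iterations, rng):
    """Keep the best-scoring candidate (elite) seen in each behavior cell."""
    archive = {}  # cell -> (fitness, candidate)

    def consider(candidate):
        fitness, cell = evaluate(candidate)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, candidate)

    for candidate in initial:  # assumes at least one initial candidate
        consider(candidate)
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))  # pick a random elite
        consider(mutate(parent, rng))
    return archive


def evaluate(point, bins=5):
    """Toy fitness (closeness to the center) and a grid cell as behavior descriptor."""
    x, y = point
    fitness = 1.0 - ((x - 0.5) ** 2 + (y - 0.5) ** 2)
    cell = (min(int(x * bins), bins - 1), min(int(y * bins), bins - 1))
    return fitness, cell


def mutate(point, rng, step=0.1):
    """Perturb each coordinate slightly, clamped to the unit square."""
    return tuple(min(1.0, max(0.0, v + rng.uniform(-step, step))) for v in point)


rng = random.Random(0)
archive = map_elites([(0.5, 0.5)], mutate, evaluate, iterations=2000, rng=rng)
print(f"{len(archive)} of 25 cells filled")
```

Because a new candidate only competes with the current elite of its own cell, the archive ends up holding one strong solution per behavior region rather than many copies of the single global best.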

2606.00794 2026-06-02 cond-mat.mtrl-sci cs.LG

Benchmark Dataset for Catalysis on 2D MXenes

二维MXene催化基准数据集

Pavlo Melnyk, Anmar Karmush, Mårten Wadenbäck, Ania Beatriz Rodríguez-Barrera, Johanna Rosen, Michael Felsberg, Jonas Björk

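Affiliations: Computer Vision and Learning Systems, Department of Electrical Engineering (ISY) & AI4X(计算机视觉与学习系统,电气工程系(ISY)及AI4X); Materials Design Division, Department of Physics, Chemistry and Biology (IFM)(材料设计部,物理、化学与生物系(IFM)); Wallenberg Initiative Materials Science for Sustainability (WISE)(瓦伦贝格可持续材料科学倡议(WISE))

AI summary: Combines first-principles calculations with machine learning to build a dataset of 50,000 DFT calculations for training and 10,000 for testing, then trains and validates several machine learning interatomic potential models that deliver a speedup on the order of 10^3 while retaining high accuracy, enabling efficient study of MXene catalytic behaviour.

Details
AI Chinese abstract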

将第一性原理计算与机器学习(ML)相结合,旨在加速新型材料催化行为的探索。我们专注于二维(2D)Ti$_2$CT$_y$ MXene,其多样的表面化学性质使其成为极具吸引力的催化候选材料。由于计算成本,在现实条件下解析其组成和结构超出了标准密度泛函理论(DFT)的能力。为应对这一挑战,我们生成了一个包含50000个DFT计算用于训练和10000个用于测试的全面数据集,涵盖Ti$_2$CT$_y$ MXene构型和分子系统,以及一个包含1000个真正新的大系统的额外测试数据集,以研究模型的泛化能力。我们训练并验证了广泛使用且具有竞争力的机器学习原子间势(MLIP)模型,包括EquiformerV2、MACE、MatRIS和UPET,这些模型能够准确预测原子力和形成能——这些是DFT在结构和催化研究中必须反复计算的量——对于这些二维材料。这种DFT-ML联合框架实现了约$1-4 \cdot 10^3$倍(在CPU上)的计算加速,同时保持所需精度(力约$\pm 10$ meV/Å,每原子能量约$\pm 1$ meV),为更高效地研究MXene催化行为铺平了道路。此外,我们对训练模型进行了广泛的定性评估,展示了超越基准指标的基于模拟的综合比较的重要性。数据集、训练模型及代码可在https://huggingface.co/datasets/CatalystAnonymous/catalyst_mxenes获取。

English abstract

Merging first-principles calculations with machine learning (ML), we aim to accelerate the exploration of catalytic behaviour in novel materials. We focus on two-dimensional (2D) Ti$_2$CT$_y$ MXenes, whose versatile surface chemistry makes them particularly compelling candidates for catalysis. Resolving their composition and structure under realistic conditions exceeds the reach of standard density functional theory (DFT) due to computational cost. To address this challenge, we generate a comprehensive dataset of 50,000 DFT calculations for training and 10,000 for testing, encompassing both Ti$_2$CT$_y$ MXene configurations and molecular systems, along with an additional test dataset of 1,000 genuinely new, larger systems to investigate how well models generalise. We train and validate widely used and competitive machine learning interatomic potential (MLIP) models, including EquiformerV2, MACE, MatRIS, and UPET, that accurately predict atomic forces and formation energies (quantities that DFT must repeatedly compute for structural and catalytic investigations) for these 2D materials. This combined DFT-ML framework achieves a computational speedup of roughly $(1\text{--}4) \times 10^3$ (on a CPU) while maintaining the desired accuracy (approximately $\pm 10$ meV/Å for forces and approximately $\pm 1$ meV for per-atom energies), paving the way for more efficient investigations of MXene catalytic behaviour. Moreover, we perform an extensive qualitative evaluation of the trained models, showcasing the importance of comprehensive simulation-based comparison beyond benchmark metrics. The dataset and the trained models with the code are available at https://huggingface.co/datasets/CatalystAnonymous/catalyst_mxenes.

2606.00783 2026-06-02 stat.AP cs.AI math.PR stat.CO

Bayesian Inference of Nonlinear Malaria Dynamics in Ghana via an Ensemble Markov Chain Monte Carlo Sampler

T. Ansah-Narh, Y. Asare Afrane, J. Bremang Tandoh

Affiliations Ghanaian Agricultural and Environmental Council

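AI summary To address Ghana's short, noisy, and spatially heterogeneous malaria surveillance data, the paper proposes a Bayesian nonlinear inference framework that combines a cubic baseline with a damped oscillatory kernel and estimates parameters with an affine-invariant ensemble Markov Chain Monte Carlo sampler, achieving high-accuracy fits and probabilistic forecasts, revealing spatial heterogeneity, and forecasting a malaria resurgence over 2024-2026.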

Comments 27 pages, 15 figures, published in Expert Systems with Applications

Journal ref
Expert Systems with Applications, Volume 312, 131540 (2026)
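AI-translated abstract (from Chinese)

Reliable quantification of malaria dynamics in sub-Saharan Africa is hampered by short, noisy, and spatially heterogeneous surveillance records. In Ghana, health-facility data from 2014 to 2023 reveal nonlinear, age-specific fluctuations in hospital admissions, yet existing methods struggle to capture stochastic variability or to provide credible uncertainty intervals. This study develops a Bayesian nonlinear inference framework that combines a cubic baseline with a damped oscillatory kernel and is estimated with an affine-invariant ensemble Markov Chain Monte Carlo sampler. The framework accommodates limited data, models parameter uncertainty, and produces probabilistic forecasts for children under five and for individuals aged five and over. The results show strong empirical adequacy (under five: $R^2 = 0.9958$; five and over: $R^2 = 0.9956$), residuals below $2\%$, and well-mixed posteriors that confirm convergence. District-level analysis reveals marked spatial heterogeneity, with coefficients of variation ranging from $<0.07$ in urban centres such as Kumasi to $>3.3$ in peripheral districts such as Mpohor and Bia East. Forecasts for 2024-2026 indicate a gradual resurgence: cases among children under five rise from 137,000 to 149,000 and cases among older individuals from 348,000 to 375,000, with uncertainty widening over time. By producing probabilistic forecasts, this Bayesian framework offers a principled tool for anticipating malaria fluctuations and strengthening data-driven decision-making in Ghana's national malaria control strategy.

English abstract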

Reliable quantification of malaria dynamics in sub-Saharan Africa is hindered by short, noisy, and spatially heterogeneous surveillance records. In Ghana, health-facility data from 2014 to 2023 reveal non-linear and age-specific fluctuations in hospital admissions, yet existing approaches struggle to capture stochastic variability or provide credible uncertainty bounds. This study develops a Bayesian nonlinear inference framework that integrates a cubic baseline with a damped oscillatory kernel, estimated via an affine-invariant ensemble Markov Chain Monte Carlo sampler. The framework accommodates limited data, models parameter uncertainty, and generates probabilistic forecasts for children under five years and individuals aged five years or more. Results show strong empirical adequacy ($R^2 = 0.9958$ for $<5$ years; $R^2 = 0.9956$ for $\geq 5$ years) with residual errors below $2\%$ and well-mixed posteriors confirming convergence. District-level analysis reveals pronounced spatial heterogeneity, with coefficients of variation ranging from $<0.07$ in urban centres such as Kumasi to $>3.3$ in peripheral districts such as Mpohor and Bia East. Forecasts for 2024-2026 indicate a gradual resurgence: from 137,000 to 149,000 cases among children under five years and from 348,000 to 375,000 cases among older individuals, with uncertainty widening over time. By producing probabilistic forecasts, this Bayesian framework provides a principled tool for anticipating malaria fluctuations and strengthening data-driven decision-making in Ghana's national malaria control strategy.
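
The modelling idea in this abstract (a cubic baseline plus a damped oscillation, fitted with an affine-invariant ensemble sampler) can be illustrated with a small sketch. Everything specific below is an assumption made for illustration, not the paper's implementation: the functional form `a0 + a1 t + a2 t^2 + a3 t^3 + A exp(-gamma t) cos(omega t + phi)`, the Gaussian likelihood with a fixed `sigma`, the flat priors, the parameter values, and the noise-free toy data. The sampler step is the serial Goodman-Weare stretch move.

```python
import math
import random

def mean_admissions(t, theta):
    """Assumed model: cubic baseline plus a damped oscillatory term."""
    a0, a1, a2, a3, amp, gamma, omega, phi = theta
    baseline = a0 + a1 * t + a2 * t**2 + a3 * t**3
    oscillation = amp * math.exp(-gamma * t) * math.cos(omega * t + phi)
    return baseline + oscillation

def log_posterior(theta, times, counts, sigma):
    """Gaussian log-likelihood with flat priors (illustrative choice)."""
    gamma, omega = theta[5], theta[6]
    if gamma < 0 or omega <= 0 or sigma <= 0:
        return -math.inf  # outside the assumed prior support
    sq = sum(((y - mean_admissions(t, theta)) / sigma) ** 2
             for t, y in zip(times, counts))
    return -0.5 * sq - len(times) * math.log(sigma)

def stretch_move(walkers, log_prob, rng, a=2.0):
    """One serial sweep of the Goodman-Weare stretch move over all walkers.

    Every walker passed in must have a finite log-probability.
    """
    n, dim = len(walkers), len(walkers[0])
    new = [list(w) for w in walkers]
    for k in range(n):
        j = rng.choice([i for i in range(n) if i != k])
        # Draw z from g(z) proportional to 1/sqrt(z) on [1/a, a].
        z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
        proposal = [new[j][d] + z * (new[k][d] - new[j][d]) for d in range(dim)]
        log_ratio = (dim - 1) * math.log(z) + log_prob(proposal) - log_prob(new[k])
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_ratio:
            new[k] = proposal
    return new

# Toy usage: ten yearly time points and noise-free synthetic "counts".
rng = random.Random(0)
true_theta = [100.0, 5.0, -0.5, 0.02, 20.0, 0.1, 1.5, 0.3]  # hypothetical values
times = list(range(10))
counts = [mean_admissions(t, true_theta) for t in times]
sigma = 5.0
log_prob = lambda th: log_posterior(th, times, counts, sigma)

# Start 20 walkers in a small ball around the generating parameters.
walkers = [[v * (1.0 + 0.01 * rng.gauss(0, 1)) for v in true_theta]
           for _ in range(20)]
for _ in range(50):
    walkers = stretch_move(walkers, log_prob, rng)
```

The stretch move needs no hand-tuned proposal covariance because each proposal is built from the positions of other walkers, which is what makes the sampler affine-invariant. In practice a library sampler would likely replace the hand-written sweep.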

2606.00758 2026-06-02 stat.ML cs.LG eess.SP stat.ME

Statistical Testing on Directed Graphs by Surrogate Data Generation

Chun Hei Michael Chan, Alexandre Cionca, Dimitri Van De Ville

Affiliations Neuro-X Institute, Ecole polytechnique fédérale de Lausanne, and the Department of Radiology and Medical Informatics, University of Geneva

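AI summary For directed graphs, the paper defines wide-sense stationary signals through the eigendecomposition of the graph shift operator and proposes a covariance-preserving surrogate data generation framework for building null distributions of test statistics; experiments on real data show it outperforms existing methods.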

Comments Submitted to IEEE Transactions on Signal and Information Processing over Networks

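AI-translated abstract (from Chinese)

In recent years, graph signal processing has become a powerful framework at the intersection of signal processing and graph theory, offering tools for analysing signals defined on nodes while accounting for the relationships represented by edges. These tools have been applied successfully in a variety of settings, including statistical hypothesis testing. In particular, non-parametric methods based on surrogate generation have been proposed for signals on undirected graphs. However, these methods have not yet been extended to directed graphs. In this work, we first revisit the notion of stationary graph signals on directed graphs. Specifically, through the eigendecomposition of the graph shift operator, we define directed graph wide-sense stationary signals. We then propose a new framework for generating surrogate graph signals that preserve the covariance structure under stationarity assumptions. Null distributions of the test metric can then be built from these surrogates and serve as a reference for the empirical data. Finally, we provide guiding examples and an application on real data, in which we compare our framework with existing techniques designed for undirected graphs or based on naive permutation, demonstrating the feasibility and superiority of the proposed method.

English abstract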

In recent years, graph signal processing has emerged as a powerful framework at the intersection of signal processing and graph theory, providing tools for the analysis of signals defined on nodes while accounting for their relationships represented by edges. These tools have been successfully applied to various settings, including statistical hypothesis testing. In particular, non-parametric approaches based on surrogate generation have been proposed for signals on undirected graphs. However, they are yet to be extended to directed graphs. In this work, we first revisit the notion of stationary graph signals on directed graphs. Specifically, and through the eigendecomposition of the graph shift operator, we define directed graph wide-sense stationary signals. Then, we propose a new framework to generate surrogate graph signals that preserve covariance structure under stationarity assumptions. Null distributions of the test metric can then be constructed from these surrogates and serve as a reference for the empirical data. Finally, we provide guiding examples and an application on real data, in which we compare the performance of our framework with existing techniques for undirected graphs or based on naive permutation, demonstrating feasibility and superiority of the proposed approach.

2606.00754 2026-06-02 stat.ME cs.AI cs.LG

Causal Density Functions

Sridhar Mahadevan

Affiliations Adobe Research; University of Massachusetts Amherst

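AI summary Proposes causal density functions, the Radon-Nikodym derivatives of interventional laws with respect to observational laws, as local density ratios that measure causal effects, and provides estimation and testing methods.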

Comments 25 pages

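AI-translated abstract (from Chinese)

We introduce causal density functions: Radon-Nikodym derivatives that compare interventional distributions with observational distributions and therefore serve as local density ratios for causal effects. Many causal-strength measures compare whole distributions after graph surgery, whereas causal density functions provide a pointwise change-of-measure object that can be estimated, calibrated, and used to score directed influence. The basic identity \[ \mathbb{E}_{\mathrm{do}}[f(Y)] = \mathbb{E}_{\mathrm{obs}}\!\left[f(Y)\rho(X,Y)\right] \] makes causal density directly testable: if the estimated density ratio is correct, observational expectations reweighted by $\rho$ reproduce interventional expectations. We derive practical estimators for do-curves and directed edge scores, relate the construction to Radon-Nikodym/Kan semantics for conditioning and intervention, and evaluate the resulting estimators on synthetic and real perturbation benchmarks.

English abstract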

We introduce causal density functions: Radon-Nikodym derivatives that compare interventional laws to observational laws and therefore act as local density ratios for causal effects. Whereas many causal-strength measures compare whole distributions after graph surgery, causal density functions provide a pointwise change-of-measure object that can be estimated, calibrated, and used to score directed influence. The basic identity \[ \mathbb{E}_{\mathrm{do}}[f(Y)] = \mathbb{E}_{\mathrm{obs}}\!\left[f(Y)\rho(X,Y)\right] \] makes causal density directly testable: if the estimated density ratio is correct, observational expectations reweighted by $\rho$ reproduce interventional expectations. We derive practical estimators for do-curves and directed edge scores, relate the construction to Radon-Nikodym/Kan semantics for conditioning and intervention, and evaluate the resulting estimators on synthetic and real perturbation benchmarks.
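
The reweighting identity in this abstract can be checked exactly on a tiny discrete model. The sketch below is an illustration under stated assumptions, not the paper's estimator: the binary confounder `Z`, the conditional probability tables, the intervention `do(X = 1)`, and the test function `f(y) = y` are all made up for the example, and the density ratio is computed from known probabilities instead of being estimated.

```python
import itertools

# Hypothetical structural model: Z -> X, Z -> Y, X -> Y, all binary.
p_z = {0: 0.5, 1: 0.5}

def p_x_given_z(x, z):
    p1 = 0.8 if z == 1 else 0.2
    return p1 if x == 1 else 1.0 - p1

def p_y_given_xz(y, x, z):
    p1 = 0.1 + 0.5 * x + 0.3 * z
    return p1 if y == 1 else 1.0 - p1

def p_obs(x, y):
    """Observational joint law of (X, Y), with Z marginalised out."""
    return sum(p_z[z] * p_x_given_z(x, z) * p_y_given_xz(y, x, z) for z in p_z)

def p_do(x, y, x0=1):
    """Joint law of (X, Y) under do(X = x0): X is fixed, Z keeps its marginal."""
    if x != x0:
        return 0.0
    return sum(p_z[z] * p_y_given_xz(y, x0, z) for z in p_z)

def rho(x, y):
    """Density ratio dP_do / dP_obs (p_obs is positive everywhere here)."""
    return p_do(x, y) / p_obs(x, y)

def f(y):
    return float(y)

pairs = list(itertools.product([0, 1], [0, 1]))

# Left-hand side: expectation under the interventional law.
e_do = sum(p_do(x, y) * f(y) for x, y in pairs)
# Right-hand side: observational expectation reweighted by rho.
e_reweighted = sum(p_obs(x, y) * f(y) * rho(x, y) for x, y in pairs)
# For contrast: plain conditioning on X = 1, which is biased by Z.
e_cond = sum(p_obs(1, y) * f(y) for y in [0, 1]) / sum(p_obs(1, y) for y in [0, 1])

print(e_do, e_reweighted, e_cond)
```

In this toy model the two sides of the identity agree, while conditioning on `X = 1` gives a different value because `Z` confounds `X` and `Y`. With an estimated ratio, the gap between the reweighted and interventional expectations would serve as the diagnostic the abstract describes.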

2606.00735 2026-06-02 cs.DC cs.LG

ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

Seokjin Go, Marko Scrbak, Ephrem Wu, Srilatha Manne, Divya Mahajan

Affiliations Georgia Institute of Technology; Advanced Micro Devices, Inc.

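AI summary Proposes the ViBE framework, a hardware-aware expert placement method that combines GPU performance modeling with expert activation profiling to minimize execution-time imbalance in distributed MoE inference, substantially improving SLO attainment and lowering P90 TTFT.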

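AI-translated abstract (from Chinese)

In distributed Mixture-of-Experts (MoE) inference, input-dependent token routing interacts with GPU performance variability to produce persistent stragglers under synchronized execution, where the slowest GPU determines layer latency. This performance variability is inherent to modern accelerators: manufacturing variation, power limits, and thermal conditions introduce measurable execution-time differences across nominally identical GPUs. The core challenge is that MoE execution-time imbalance stems from the interaction of workload skew and hardware asymmetry. Token routing produces uneven, layer-varying expert loads, while GPU throughput depends on device-specific operating characteristics and workload intensity. Prior work mitigates routing skew but assumes homogeneous hardware, optimizing token balance rather than execution latency. As a result, even a balanced token assignment can leave hardware-induced stragglers unresolved. To this end, we propose Variability-Informed Binning of Experts (ViBE), a hardware-aware expert placement framework that aims to minimize execution-time imbalance across GPUs. ViBE combines per-GPU performance modeling with expert activation profiling, assigning high-load experts to faster devices and low-load experts to slower ones, which reduces layer-level stragglers without modifying model semantics or hardware. Because workload characteristics and effective GPU throughput can change with serving conditions, ViBE supports lightweight recalibration under workload/performance drift to refresh its routing and performance estimates when needed. Results show that ViBE consistently reduces execution-time imbalance and improves SLO attainment by 14%, while lowering P90 TTFT by up to 45%. We further show that the impact of hardware variability grows with scale, making variability-aware placement essential for efficient, high-utilization LLM serving.

English abstract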

In distributed Mixture-of-Experts (MoE) inference, input-dependent token routing interacts with GPU performance variability to create persistent stragglers under synchronized execution, where the slowest GPU determines layer latency. This performance variability is inherent to modern accelerators: manufacturing variation, power limits, and thermal conditions introduce measurable execution-time differences across nominally identical GPUs. The core challenge is that MoE execution-time imbalance arises from the interaction of workload skew and hardware asymmetry. Token routing produces uneven and layer-varying expert loads, while GPU throughput depends on device-specific operating characteristics and workload intensity. Prior work mitigates routing skew but assumes homogeneous hardware, optimizing token balance rather than execution latency. As a result, even balanced token assignments can leave hardware-induced stragglers unaddressed. Thus, we propose Variability-Informed Binning of Experts (ViBE), a hardware-aware expert placement framework that minimizes execution-time imbalance across GPUs. ViBE combines per-GPU performance modeling with expert activation profiling to assign high-load experts to faster devices and low-load experts to slower ones, reducing layer-level stragglers without modifying model semantics or hardware. Because both workload characteristics and effective GPU throughput can shift across serving conditions, ViBE supports lightweight recalibration under workload/performance drift to refresh its routing and performance estimates when needed. Results show that ViBE consistently reduces execution-time imbalance and improves SLO attainment by 14%, while lowering P90 TTFT by up to 45%. We further show that the impact of hardware variability increases at scale, making variability-aware placement important for efficient, high-utilization LLM serving.
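
The placement idea in this abstract (put heavily loaded experts on faster GPUs so the slowest device finishes sooner) can be sketched as a greedy heuristic. This is not ViBE's algorithm: the cost model `time = tokens / relative_speed`, the fixed number of expert slots per GPU, the greedy rule, and all numbers are assumptions chosen for illustration.

```python
def place_experts(loads, speeds, slots_per_gpu):
    """Greedy placement: heaviest expert first, onto the GPU that would
    finish earliest after receiving it (among GPUs with a free slot)."""
    assert len(loads) <= len(speeds) * slots_per_gpu
    assigned = [[] for _ in speeds]
    work = [0.0] * len(speeds)  # tokens routed to each GPU so far
    for e in sorted(range(len(loads)), key=lambda i: -loads[i]):
        free = [g for g in range(len(speeds)) if len(assigned[g]) < slots_per_gpu]
        best = min(free, key=lambda g: (work[g] + loads[e]) / speeds[g])
        assigned[best].append(e)
        work[best] += loads[e]
    return assigned

def layer_time(assigned, loads, speeds):
    """Under synchronized execution the slowest GPU sets the layer latency."""
    return max(sum(loads[e] for e in experts) / s
               for experts, s in zip(assigned, speeds))

# Hypothetical per-expert token counts and relative GPU throughputs for
# four nominally identical devices (1.0 = fastest).
loads = [900, 700, 400, 300, 200, 100, 100, 100]
speeds = [0.8, 1.0, 0.9, 1.0]

# Hardware-agnostic baseline: balance tokens as if all GPUs were equal.
agnostic = place_experts(loads, [1.0] * len(speeds), slots_per_gpu=2)
# Variability-aware placement: use the measured relative speeds.
aware = place_experts(loads, speeds, slots_per_gpu=2)

t_agnostic = layer_time(agnostic, loads, speeds)
t_aware = layer_time(aware, loads, speeds)
print(t_agnostic, t_aware)
```

In this toy setting the token-balanced baseline places the heaviest expert on the slowest GPU, so its layer time is longer than the speed-aware placement even though the token counts per GPU are the same multiset. A real system would replace the linear cost model with profiled per-GPU performance curves.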