arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.08679 2026-06-09 stat.ML cs.CL cs.LG stat.ME New submission

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

Bitya Neuhof, Yuval Benjamini

Affiliations Department of Statistics and Data Science

AI summary Proposes a hierarchical framework that quantifies uncertainty in model rankings with statistical guarantees, using task-level confidence intervals and leaderboard-level prediction intervals.

Abstract

Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains unaddressed, and variation in models' performance across tasks is frequently obscured. In this work, we introduce a hierarchical framework that constructs model rank intervals with statistical guarantees at both levels: task-level rank confidence intervals from pairwise comparisons, and leaderboard-level rank prediction intervals using a conformal approach. This enables reliable quantification of model rank for each observed task and for new potential tasks. Experiments on simulated data and the TabArena and PromptEval (MMLU) benchmarks show that our method yields statistically valid and informative intervals, enabling reliable, uncertainty-aware model ranking on leaderboards.
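
As a rough illustration of the task-level step, the sketch below turns pairwise comparison outcomes into rank intervals: a model's best possible rank is one plus the number of models judged significantly better, and its worst possible rank is the total count minus the number judged significantly worse. The function name, the input encoding, and the toy data are assumptions made for illustration; the abstract does not spell out the paper's own construction or its multiplicity correction.

```python
from typing import Dict, List, Tuple


def rank_intervals(models: List[str],
                   better: Dict[Tuple[str, str], bool]) -> Dict[str, Tuple[int, int]]:
    """Rank intervals (rank 1 = best) from pairwise significance decisions.

    better[(a, b)] is True when a is judged significantly better than b.
    Pairs that are absent or False are treated as not distinguishable.
    """
    n = len(models)
    intervals = {}
    for m in models:
        n_better = sum(1 for o in models if o != m and better.get((o, m), False))
        n_worse = sum(1 for o in models if o != m and better.get((m, o), False))
        intervals[m] = (1 + n_better, n - n_worse)
    return intervals


# Hypothetical task: A and B each beat C, but A and B cannot be separated.
models = ["A", "B", "C"]
better = {("A", "C"): True, ("B", "C"): True}
intervals = rank_intervals(models, better)
```

In this toy case A and B share the interval [1, 2] while C is pinned at rank 3, which is the kind of tie a single point ranking would hide.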

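2606.08676 2026-06-09 cs.SE cs.AI cs.CL New submission

Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks

Shi Ying Chang, Chiok Yew Ho, Yichen Li, Yintong Huo

Affiliations Singapore Management University; The Chinese University of Hong Kong

AI summary Presents the first empirical study showing that instruction tuning induces a trade-off in code tasks: it strengthens instruction following but weakens code infilling, a cost the authors call the Instruction-Tuning Tax. Qualitative and quantitative analyses are summarized into seven findings and four implications.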

Comments 25 pages, 6 figures. Evaluation toolkit and dataset: https://github.com/arkosioscambions/CodeTalkers

Abstract

AI coding assistants have significantly improved developer productivity by automatically suggesting code that aligns with user intent, and many of these tools are now integrated directly into Integrated Development Environments (IDEs). Developers interact with code in two distinct cognitive modes: Flow and Command. While developers require tools that directly complete or infill code in unfinished programs during Flow mode, they also need tools that can comprehend intentions expressed as natural-language instructions and convert them into executable code in Command mode. Although instruction-tuned Large Language Models (LLMs) dominate many application scenarios due to their abilities to infer and fulfill developers' intents, it remains unclear whether the same paradigm is equally suitable for different code-related tasks. Therefore, it is necessary to understand how instruction tuning affects the feasibility of CodeLLMs as coding assistants. To fill this gap, we conduct the first empirical study that uncovers a key trade-off caused by instruction tuning across programming modes, which we term the Instruction-Tuning Tax. Our results show that instruction tuning is not a free lunch: although instruction-tuned models are more capable of following instructions and leveraging structured guidance, these gains often come at the cost of weaker infilling performance. We further extend our study through both qualitative and quantitative analyses, including manual failure categorization, behavioral metrics that capture generation fidelity, and intermediate-checkpoint evaluation throughout the tuning process. Summarizing our results into seven findings and four implications, our study offers a new perspective on the development of AI-powered coding tools and highlights the need to carefully balance instruction-following ability with effective code generation assistance.

2606.08661 2026-06-09 cs.CR cs.AI cs.DB New submission

Data Agents Under Attack: Vulnerabilities in LLM-Driven Analytical Systems

Kuncan Wang, Ziting Wang, Peizhuo Lv, Haoyang Li, Guoliang Li, Gao Cong, Wei Dong

Affiliations Nanyang Technological University, Singapore; The Hong Kong Polytechnic University; Tsinghua University

AI summary Systematically analyzes the security vulnerabilities of LLM-driven data agents, proposes a layered vulnerability framework and an attack taxonomy, and evaluates the attacks on six systems, revealing substantial security gaps in current systems.

Abstract

Data agents integrate LLM-driven reasoning with relational data access, executable analytical tools, and multi-step workflow orchestration, making them increasingly central to enterprise analytics. This integration introduces new security vulnerabilities across data resources, database execution, and agent reasoning, recombining concerns from database security and general-purpose LLM-agent security into failure modes that neither line of work captures on its own. To address this gap, we present a systematic security study of data agents. Our contributions are threefold. First, we develop a layered vulnerability framework that identifies eight data agent-specific risks across interpretation, execution, and policy layers. Second, we introduce an attack taxonomy organized by adversary goal, tactic, and technique, covering three goals, seven tactics, and fourteen techniques, and pair it with an LLM-driven payload generation pipeline grounded in real database schemas. Third, we evaluate these attacks on six systems, including four open-source data agents and two production cloud analytics services. Our experiments reveal substantial security vulnerabilities across current systems and yield four key takeaways.

2606.08652 2026-06-09 astro-ph.SR cs.AI cs.CV New submission

Reconstructing Synthetic SDO/AIA 193 Å EUV Images from He I 10830 Å Observations with Diffusion Model Translator

Marco Marena, Qin Li, Haimin Wang, Haodi Jiang, Prajwal Shah, Bo Shen

Affiliations Department of Mechanical and Industrial Engineering, New Jersey Institute of Technology; Department of Physics, New Jersey Institute of Technology; Department of Computer Science, Sam Houston State University; Department of Computer Science, New Jersey Institute of Technology; Department of Data Science, New Jersey Institute of Technology

AI summary Proposes a diffusion-based coronal-hole-aware translator (CH-aware DMT) that reconstructs AIA 193 Å EUV images from He I images, preserving full-disk EUV morphology (CC=0.92) and coronal-hole structure (CC=0.84) on the test set, and checks physical plausibility against historical data.

Abstract

Routine full-disk EUV imaging has been available only in the modern era of missions such as SOHO and SDO. To extend EUV coronal context into earlier periods, we leverage the multi-decade availability of full-disk He I observations, whose absorption is modulated by coronal irradiance and magnetic topology and is widely used as a proxy for open-field regions. We present a diffusion-based conditional image translation framework, Coronal Hole-aware Diffusion Model Translator (CH-aware DMT), to reconstruct synthetic SDO/AIA 193 Å EUV images from He I inputs. The model is trained on temporally co-aligned SOLIS He I and AIA 193 Å pairs spanning 2011-2015 using a month-based split, where January-October are used for training, November for validation, and December for testing. On the held-out test set, the reconstructions preserve dominant full-disk EUV morphology (CC=0.92) and recover CH-related low-intensity structure (CC=0.84). We further assess historical applicability by (1) comparing reconstructed AIA 193 Å morphology with SOHO/EIT 195 Å over 2005-2015; (2) comparing reconstructed AIA 193 Å images generated from KPVT He I inputs against Yohkoh/SXT soft X-ray observations; and (3) evaluating long-term reconstructed disk-integrated emission statistics against observational EUV series and independent solar activity proxies (sunspot number and F10.7 radio flux over 1974-2015). These results indicate that CH-aware DMT conditioned on He I can provide a physically plausible synthetic AIA 193 Å coronal proxy for historical studies, supporting multi-decade analyses of large-scale coronal evolution before direct EUV imaging was available.
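
The month-based split described in the abstract can be written down in a few lines. The sketch below is only an illustration of that rule; the function name and the use of `datetime.date` for the observation timestamp are assumptions, not the authors' data pipeline.

```python
from datetime import date


def month_split(obs_date: date) -> str:
    """Assign an image pair to a subset by calendar month of observation."""
    if obs_date.month <= 10:   # January through October
        return "train"
    if obs_date.month == 11:   # November
        return "val"
    return "test"              # December


# Hypothetical observation dates inside the 2011-2015 training window.
subsets = [month_split(d) for d in (date(2012, 3, 14), date(2013, 11, 2), date(2015, 12, 30))]
```

Splitting by month rather than at random keeps temporally adjacent, nearly identical solar images from landing on both sides of the train/test boundary.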

2606.08649 2026-06-09 cs.CR cs.AI New submission

Sample-Efficient LLM-Based Detection of Malicious Web Server Logs with Forensically Explainable Reasoning

Bernhard Kneip, Nhien-An Le-Khac, Hong-Hanh Nguyen-Le

Affiliations University of Tuebingen

AI summary Proposes CEF-Log, a prompting strategy whose five-step reasoning template teaches a large language model how to analyze logs; it reaches F1=0.99 on the CSIC 2010 dataset with only four examples, a 10x gain in sample efficiency, and introduces the new ForenWebLog dataset.

Abstract

Forensic analysis of web server logs demands both accurate detection and human-readable explanations that can satisfy legal requirements. We present CEF-Log, a context-enhanced few-shot chain-of-thought prompting strategy for Large Language Models that addresses this dual requirement. CEF-Log embeds expert investigative methodology through a structured five-step reasoning template, enabling the model to learn how to analyze logs rather than what patterns to memorize. Experimental evaluation demonstrates that CEF-Log achieves an F1-score of 0.99 on the CSIC 2010 dataset using only four examples while providing a 10x improvement in sample efficiency compared to other prompting-based methods. We also introduce ForenWebLog, a new dataset that incorporates real-world attacks and multi-step attack sequences for comprehensive evaluation. Qualitative analysis confirms that CEF-Log generates traceable, accurate explanations suitable for forensic documentation, addressing the critical "black-box" limitation of traditional machine learning approaches.

2606.08638 2026-06-09 math.OC cs.LG New submission

Parameter Tuning with Generalization Guarantees for GPU-Accelerated Linear Programming

Siddharth Prasad, Dravyansh Sharma

AI summary For hyperparameter tuning of the GPU-accelerated LP solver PDLP, derives sample complexity guarantees for learning hyperparameters such as the step size and primal weight, building on data-driven algorithm design theory, and shows experimentally that tuning is needed.

Abstract

Recent research has developed practical, parallelizable first-order methods for large-scale linear programming, but performance is highly dependent on hyperparameter selection. We derive generalization guarantees for hyperparameter tuning within (cu)PDLP, a state-of-the-art first-order LP solver designed for modern hardware. First, we pin down the behavior of PDHG, the primal-dual hybrid gradient algorithm that underlies PDLP, as a function of its step size and primal weight, leading to linear sample complexity guarantees for learning those parameters. We then conduct a structural analysis of PDLP, which augments PDHG with several specialized techniques such as preconditioning, adaptive step sizes, averaging, adaptive restarts, and smoothed primal weight updates. Our analysis captures the behavior of the solution trajectory as a function of the hyperparameters and leverages recent advances in data-driven algorithm design to obtain polynomial sample complexity guarantees for learning those hyperparameters. Finally, we conduct proof-of-concept experiments that demonstrate the need for data-driven PDLP parameter tuning. Our results showcase the versatility of the data-driven algorithm design toolkit for principled hyperparameter tuning within solver-grade implementations of complex modern optimization algorithms.
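
To make the two tuned quantities concrete, here is a minimal sketch of plain PDHG on a standard-form LP (minimize c·x subject to Ax = b, x ≥ 0), written in pure Python. It assumes the common PDLP-style parameterization in which a step size `eta` and primal weight `omega` give the primal step `tau = eta / omega` and the dual step `sigma = eta * omega`. The helper names and the toy problem are illustrative, and none of PDLP's enhancements (preconditioning, restarts, averaging) are included.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(A: Matrix, x: Vector) -> Vector:
    """Compute A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A: Matrix, y: Vector) -> Vector:
    """Compute A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def pdhg_lp(c: Vector, A: Matrix, b: Vector,
            eta: float, omega: float, iters: int = 2000) -> Tuple[Vector, Vector]:
    """Plain PDHG for: min c.x  s.t.  A x = b, x >= 0."""
    tau, sigma = eta / omega, eta * omega
    x = [0.0] * len(c)
    y = [0.0] * len(b)
    for _ in range(iters):
        aty = rmatvec(A, y)
        # Primal gradient step followed by projection onto x >= 0.
        x_new = [max(0.0, xj - tau * (cj - gj)) for xj, cj, gj in zip(x, c, aty)]
        # Dual step evaluated at the extrapolated point 2 * x_new - x.
        extrap = [2.0 * xn - xo for xn, xo in zip(x_new, x)]
        ax = matvec(A, extrap)
        y = [yi + sigma * (bi - axi) for yi, bi, axi in zip(y, b, ax)]
        x = x_new
    return x, y


# Toy LP: min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0  (optimum x = (1, 0)).
c, A, b = [1.0, 2.0], [[1.0, 1.0]], [1.0]
eta = 0.9 / math.sqrt(2.0)  # below 1 / ||A||_2, since ||A||_2 = sqrt(2) here
x_opt, y_opt = pdhg_lp(c, A, b, eta=eta, omega=1.0)
```

Changing `omega` rebalances primal against dual progress without changing the product `tau * sigma`, which is why the two parameters are treated as separate tuning knobs.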

2606.08611 2026-06-09 eess.SY cs.LG cs.SY New submission

Bayesian Optimization of a Multi-Product Chemical Reactor Using Composite Models and Partial Physics Knowledge

Liqiu Dong, Marta Zagórowska, Mehmet Mercangöz

Affiliations Department of Chemical Engineering, Imperial College London; DCSC, Delft University of Technology

AI summary Proposes a composite Bayesian optimization method in which Gaussian processes predict physical quantities from which profit is computed, combined with an energy-balance residual penalty and constraint handling, for data-driven real-time economic optimization of a multi-product reactor.

Comments Accepted to IFAC 2026. 11 pages, 4 figures

Abstract

We study data-driven real-time economic optimization of a multi-product chemical reactor when no reliable first-principles model is available beyond a steady-state energy balance. Instead of learning the economic objective directly as a black-box function, we use a composite formulation in which Gaussian process (GP) models predict physically meaningful outputs, including product concentrations and reactor temperature, while profit is computed analytically from these predictions together with raw-material, product, and utility prices. This preserves the structure of the economic objective, makes it parametric in changing prices without needing retraining, and allows candidate operating points to be checked against the available energy balance through a physics residual. The GPs also provide predictive uncertainty, which is exploited in a Bayesian optimization (BO) framework both for data-efficient exploration and for conservative enforcement of the reactor temperature constraint through an upper confidence bound. The acquisition function additionally penalizes large energy-balance mismatch obtained by substituting the GP-predicted outputs and candidate inputs into the available steady-state energy balance. The approach is demonstrated on a benchmark simulation of a non-isothermal multi-product reactor. Relative to a trust-region safe BO implementation, the proposed method achieves better simulated economic performance within the available iteration budget. Relative to a purely data-driven BO approach that does not use the available physics information, it avoids reactor temperature constraint violations.
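
The composite structure described above can be sketched as follows: profit is an analytic function of predicted physical outputs and current prices, the temperature constraint is checked against an upper confidence bound, and an energy-balance residual is penalized. Everything here is a hypothetical stand-in (the `Prediction` container, the cost terms, the quadratic penalty, the `beta` and `penalty` values), and the sketch deliberately omits the GP models and the exploration term that the paper's acquisition function would include.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Prediction:
    """Surrogate outputs at one candidate operating point (illustrative)."""
    conc_mean: Dict[str, float]  # product name -> predicted concentration
    temp_mean: float             # predicted reactor temperature
    temp_std: float              # predictive standard deviation of temperature


def profit(pred: Prediction, prices: Dict[str, float],
           feed_cost: float, utility_cost: float) -> float:
    """Profit computed analytically from predicted outputs and prices."""
    revenue = sum(prices[p] * conc for p, conc in pred.conc_mean.items())
    return revenue - feed_cost - utility_cost


def temperature_ok(pred: Prediction, t_max: float, beta: float = 2.0) -> bool:
    """Conservative check: the upper confidence bound must respect the limit."""
    return pred.temp_mean + beta * pred.temp_std <= t_max


def acquisition(pred: Prediction, prices: Dict[str, float], feed_cost: float,
                utility_cost: float, residual: float, t_max: float,
                penalty: float = 1.0, beta: float = 2.0) -> float:
    """Profit minus a physics-residual penalty; infeasible points score -inf."""
    if not temperature_ok(pred, t_max, beta):
        return float("-inf")
    return profit(pred, prices, feed_cost, utility_cost) - penalty * residual ** 2


pred = Prediction(conc_mean={"P": 2.0, "Q": 1.0}, temp_mean=350.0, temp_std=5.0)
prices = {"P": 10.0, "Q": 5.0}
score = acquisition(pred, prices, feed_cost=8.0, utility_cost=2.0,
                    residual=0.5, t_max=365.0, penalty=4.0)
```

Because prices enter only through `profit`, a price change alters the score immediately without refitting any surrogate, which is the property the abstract highlights.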

2606.08590 2026-06-09 cs.SE cs.AI cs.DC New submission

Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents

Anastasiia Kuvshinova, Seungmin Jin

Affiliations University of California, Berkeley; Stanford University

AI summary Proposes Graph Traversal Agent, which combines LLM reasoning with deterministic graph operations and achieves auditable root cause analysis through a typed evidence graph, bounded search, and independent validation, raising F1 on ITBench from 0.6087 to 0.9130.

Comments 8 pages, 1 figure. Preprint

Abstract

Kubernetes incidents are diagnosed reliably only when a root-cause system's reported gains come from incident evidence rather than scenario-specific shortcuts. We present Graph Traversal Agent, a graph-guided RCA agent that combines LLM reasoning with specialized tools. The model reasons over a typed evidence graph, while deterministic graph and tool operations collect evidence, bound the search, and check proposed verdicts. We map operational constraints, including read-only evidence collection, propagation-aware diagnosis, bounded execution, and independently validated verdicts, to a typed incident graph, a LangGraph traversal state machine, and a separate validation stage. On ITBench snapshots scored by one fixed qwen-plus judge, the audited system raises root-cause-entity F1 over an earlier iteration of the same system from 0.6087 to 0.9130 on a 23-scenario common subset. A prompt-level ablation separates prompt-tuned gains from gains that survive once scenario-specific hints are removed: the stripped-prompt configuration retains 0.6958 F1 on a 19-scenario subset. The surviving gain concentrates on ChaosMesh scenarios whose ground-truth root cause is the injected fault object already present in the evidence graph, so we report it as benchmark-coupled rather than broad cross-cluster RCA evidence. Lightweight checks, including same-judge comparison, prompt-level ablation, cascade-source checking, and a telemetry no-leak test, mark claims as supported, pending, or out of scope. We scope the work to ITBench OpenTelemetry-demo snapshots. Live-cluster trials served as an engineering stress test, but alert state and trace availability did not stay stable enough for controlled scoring, so we make no production-readiness or mean-time-to-repair claim.

2606.08587 2026-06-09 stat.ML cs.LG New submission

Improving the sharpness in neural network-based parametric post-processing of ensemble forecasts

Ágnes Baran, Máté Mihalina

Affiliations Faculty of Informatics, University of Debrecen

AI summary To counter the loss of sharpness in ensemble forecast post-processing, adds a penalty term to the loss function and reduces the width of the central prediction interval by a relative 8.2%-12.5% without degrading CRPS or RMSE.

Comments 18 pages

Abstract

Statistical post-processing has proven to be an effective tool for improving ensemble forecasts of different weather variables. Case studies show that post-processing can remedy the typically underdispersive and potentially biased behaviour of the ensemble while optimizing a proper scoring rule expressing the forecast skill. The price of these positive effects is generally a deterioration in sharpness: the width of the central prediction intervals and the uncertainty of the predictions increase, especially for shorter lead times. This work aims to reduce the extent of the latter phenomenon for neural network-based parametric post-processing methods by extending the network's loss function with a penalty term. We demonstrate the effect of the proposed technique for 2 m temperature ensemble forecasts of the European Centre for Medium-Range Weather Forecasts downloaded from the EUPPBench benchmark dataset and verified against synoptic observations. Here, the predictive distribution is Gaussian, and we use the continuous ranked probability score (CRPS) as the loss function. The case studies confirm a substantial relative decrease (8.2%-12.5%) in the width of the nominal central prediction interval compared to the width of the predictive distribution computed without the penalty term, while there is no deterioration in the mean CRPS of probabilistic forecasts and in the RMSE of the predictive mean.
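
For a Gaussian predictive distribution the CRPS has a well-known closed form, so the idea of a sharpness-penalized loss can be sketched directly. The CRPS formula below is the standard one; the penalty itself (a weight `lam` times the width of the central 95% interval) is an assumption chosen for illustration, since the abstract does not state the form of the paper's penalty term.

```python
import math


def crps_gaussian(mu: float, sigma: float, y: float) -> float:
    """Closed-form CRPS of N(mu, sigma^2) evaluated at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def central_interval_width(sigma: float, z_quantile: float = 1.959964) -> float:
    """Width of the central Gaussian interval; the default is about 95% coverage."""
    return 2.0 * z_quantile * sigma


def penalized_loss(mu: float, sigma: float, y: float, lam: float) -> float:
    """CRPS plus a hypothetical sharpness penalty on interval width."""
    return crps_gaussian(mu, sigma, y) + lam * central_interval_width(sigma)


baseline = crps_gaussian(0.0, 1.0, 0.0)
with_penalty = penalized_loss(0.0, 1.0, 0.0, lam=0.05)
```

With `lam = 0` the loss reduces to plain CRPS; a positive `lam` pushes the fitted `sigma` down, trading some calibration margin for narrower intervals.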

2606.08512 2026-06-09 cs.CY cs.CL New submission

Friend or Foe? Language as an ideological switch in open-weight LLMs under Russian disinformation stress

Anna Małgorzata Kamińska, Tetiana Klynina

Affiliations Institute of Culture Studies, University of Silesia in Katowice; University of Texas at Austin; National Aviation University

AI summary Through a controlled experiment, finds that LLMs fine-tuned for different language communities resist Russian disinformation in the opposite direction to what their cultural alignment would suggest, a Fine-Tuning Paradox.

Abstract

As Russia's war against Ukraine extends into generative AI, large language models (LLMs) adapted for local post-Soviet languages are deployed in contested information environments. Policy and industry discourse assumes that culturally aligned adaptation encodes the political orientation of the target community: a Ukrainian-oriented model will resist Russian narratives, a Russian-oriented one will reinforce them. Does it? This article systematically disconfirms that assumption. We run a controlled audit of four openly available LLMs sharing a common base model but fine-tuned for different linguistic communities, querying them in Ukrainian, Russian and English across ten contested wartime narratives: Crimea, "denazification", the "one people" thesis, and atrocity denial at Bucha and Mariupol. The result is a Fine-Tuning Paradox: the Ukrainian-oriented model shows the weakest resistance to Russian disinformation in Russian, while the Russian-oriented one exhibits the strongest rejection. Corpus composition, language coverage and prompt format prove more decisive than nominal cultural provenance. We situate these findings within debates on hybrid warfare, digital sovereignty and post-imperial information orders, arguing that the principal threat to regional information sovereignty is not adversarial fine-tuning but the untested assumption that cultural alignment guarantees resilience.

2606.08505 2026-06-09 eess.AS cs.SD New submission

Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines

Fumiaki Yamaguchi

Affiliations University of Tokyo

AI summary To cut the inference cost of on-device speaker diarization, proposes a relative minimum cluster size (mcs = round(f * n), f = 0.01) that adapts to the embedding budget; it keeps DER on AMI unchanged, recovers VoxConverse DER from 0.113 to 0.079, and reaches a speedup of up to 12.2x.

Abstract

Speech applications such as meeting transcription and voice agents would benefit from on-device speaker diarization, but practical adoption is limited by inference cost. We study how far a Pyannote 3.1-based pipeline can be accelerated on consumer hardware (an RTX 5070 Ti GPU and an Apple M4 laptop) while preserving diarization error rate (DER). A simple recipe, a coarser segmentation stride combined with per-chunk embedding, yields multi-fold speedups and is DER-neutral on AMI, but degrades sharply on in-the-wild data: on VoxConverse, DER rises from 0.075 to 0.113. We trace the failure to speaker under-counting in the clustering stage, caused by a fixed minimum cluster size interacting with the reduced number of embeddings per speaker. We propose a relative minimum cluster size, mcs = round(f * n) with f = 0.01, which adapts to the embedding budget per recording. A single value of f recovers VoxConverse DER to 0.079 (about 89% of the lost accuracy) while keeping AMI flat, and the accelerated pipeline reaches up to 12.2x speedup on AMI (MPS) over our CAM++ baseline.
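
The rule mcs = round(f * n) and the "about 89%" recovery figure can both be checked with a few lines. In the sketch below, the lower bound of 1 on the cluster size is an added assumption (the abstract gives only the rounding rule), and the embedding counts are made-up examples.

```python
def relative_min_cluster_size(n_embeddings: int, f: float = 0.01, floor: int = 1) -> int:
    """mcs = round(f * n), clamped from below (the clamp is an assumption)."""
    return max(floor, round(f * n_embeddings))


# Hypothetical recordings with different embedding budgets.
mcs_long = relative_min_cluster_size(1200)   # dense stride, many embeddings
mcs_short = relative_min_cluster_size(340)   # coarse stride, few embeddings

# Share of the lost accuracy recovered on VoxConverse, from the reported DERs.
der_baseline, der_accelerated, der_fixed = 0.075, 0.113, 0.079
recovered = (der_accelerated - der_fixed) / (der_accelerated - der_baseline)
```

Here `recovered` is 0.034 / 0.038, roughly 0.89, which matches the figure quoted in the abstract.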

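2606.08500 2026-06-09 cs.SE cs.AI New submission

Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey

Zhengyi Zhuo, Yan Liu

Affiliations School of Computer Science and Technology, Tongji University

AI summary Lets an SWE agent explore real codebases through a bounded tool interface and proposes the Ada framework, which uses observation lenses to analyze the agent's navigation, evidence selection, synthesis, grounding, and stopping, turning trajectory data into comparable behavioral profiles.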

Abstract

Software engineering agents (SWE agents) increasingly work through tool-mediated trajectories in real repositories, yet their behavior remains difficult to characterize in concrete, observable terms. These trajectories record tool use, intermediate reasoning, evidence selection, and self-directed stopping, but they do not by themselves explain why particular moves were chosen, what evidence was trusted, or when understanding was judged sufficient. This tension makes trajectory data both limited and valuable: faithful, replayable traces can become an empirical substrate for studying agent behavior when interpreted through disciplined observation. We introduce Ada, a scoped apparatus for repository-level code understanding. Ada enters real codebases through a bounded tool interface, allowing open-ended exploration to remain recordable as finite trajectories. Across this wild-but-bounded setting, Ada chooses where to look, what to read closely, when to consolidate partial understanding, and when to close its account of the repository. We project Ada's think-action chains through observation lenses that make navigation, evidence selection, synthesis, grounding, and stopping visible without reducing behavior to raw tool counts or speculating about hidden intent. Read together, these lenses produce behavioral profiles grounded in recorded movement through software worlds. Across 408 trajectories, spanning multiple models, repositories, task families, and launch conditions, the study shows how faithful digital traces can be transformed into disciplined, comparable projections of emerging SWE-agent mindset. The results expose differences in efficiency, trajectory diversity, epistemic grounding, and the limits of intervention, while providing a methodological foundation for observing SWE agent behavior in real codebases.

2606.08476 2026-06-09 cs.DC cs.AI New submission

FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training

Zheng Wang, Eric Liu, Linan Jiang, Zhongkai Yu, Zaifeng Pan, Yue Guan, Yuke Wang, Yufei Ding

Affiliations Stanford University

AI summary Proposes the FlashCP framework, which removes redundant KV transfers through sharding-aware communication and pairs a Whole-Doc sharding strategy with a heuristic algorithm to achieve balanced load and efficient communication, with up to 1.63x speedup across diverse datasets.

Comments 10 pages, 6 figures

Abstract

Context parallelism (CP) is essential for training large-scale, long-context language models, as it partitions sequences to reduce memory overhead. However, existing CP methods suffer from workload imbalance, inefficient kernels, and redundant communication due to static sequence sharding and key-value (KV) tensor communication. We present FlashCP, a load-balanced and communication-efficient framework for CP training. FlashCP introduces a sharding-aware communication mechanism to eliminate redundant KV communication and proposes a novel Whole-Doc sharding strategy that maximizes communication savings while maintaining balanced workloads. To efficiently combine Whole-Doc and Per-Doc sharding, FlashCP further designs a heuristic algorithm to search for near-optimal sharding plans. Extensive experiments show that FlashCP achieves up to 1.63x speedup over state-of-the-art CP frameworks across diverse datasets.
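
To give a feel for the load-balancing side of the problem, the sketch below assigns whole documents to context-parallel ranks with a greedy longest-first rule, using squared document length as a stand-in for causal attention cost. This is a generic bin-packing heuristic under assumed costs, not FlashCP's algorithm; the function name and the cost model are illustrative only.

```python
import heapq
from typing import List


def shard_whole_docs(doc_lengths: List[int], n_ranks: int) -> List[List[int]]:
    """Greedy assignment of whole documents to ranks, heaviest document first.

    Returns, for each rank, the list of document indices placed on it.
    """
    heap = [(0, r) for r in range(n_ranks)]  # (accumulated cost, rank id)
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(n_ranks)]
    order = sorted(range(len(doc_lengths)), key=lambda i: -doc_lengths[i])
    for idx in order:
        load, rank = heapq.heappop(heap)     # currently least-loaded rank
        assignment[rank].append(idx)
        # Assumed cost model: attention work grows with length squared.
        heapq.heappush(heap, (load + doc_lengths[idx] ** 2, rank))
    return assignment


lengths = [8, 4, 4, 2, 2]
plan = shard_whole_docs(lengths, n_ranks=2)
loads = [sum(lengths[i] ** 2 for i in docs) for docs in plan]
```

On this toy input the single long document dominates one rank (cost 64 against 40), which hints at why whole-document placement alone cannot always balance the load and a mix with per-document splitting may be needed.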

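2606.08469 2026-06-09 cs.GR cs.CV New submission

OctaOctree Neural Radiosity for Real-time Glossy Material Rendering

Jierui Ren, Haojie Jin, Bo Pang, Meng Gai, Fei Zhu, Yisong Chen, Sheng Li

Affiliations Peking University

AI summary Proposes the OctaOctree representation, which couples a spatially adaptive octree with octahedral directional maps to encode high-frequency outgoing radiance distributions efficiently, giving real-time, high-quality global illumination from a single network query.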

Comments 11 pages, 9 figures

Abstract

Modeling high-frequency outgoing radiance distributions remains a fundamental challenge in global illumination, especially for glossy and specular materials. Existing neural radiance caching methods commonly rely on positional feature encodings or spatially organized caches, which makes it difficult to represent sharp directional radiance variations without increasing the model complexity or sampling cost. To address this challenge, we propose OctaOctree, an efficient spatial-angular radiance representation for global illumination. OctaOctree organizes outgoing radiance with an adaptive octree in 3D space, and associates each spatial node with an octahedral directional map. By coupling the spatial hierarchy with direction-dependent storage, our representation allocates fine spatial resolution to local illumination and visibility changes, while using coarser spatial levels with richer angular resolution to capture glossy and specular radiance distributions. This design embeds a reflectance-aware spatial-angular prior directly into the radiance representation, reducing the burden on neural networks or reconstruction modules to recover high-frequency view-dependent effects from positional features alone. As a result, OctaOctree provides a compact and expressive neural encoding for a wide range of indirect illumination effects, from diffuse interreflection to sharp glossy reflections. Experiments demonstrate that our method produces high-quality, direction-aware global illumination with a single network query at primary intersections, achieving improved fidelity and real-time performance compared with baseline neural radiosity and radiance caching approaches.
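
The octahedral directional map mentioned above rests on a standard mapping between unit directions and the square [-1, 1]². A minimal sketch of that mapping and its inverse follows; it shows only the generic octahedral parameterization, and how OctaOctree actually stores or interpolates features in the map is not described in the abstract.

```python
import math
from typing import Tuple


def _sign(v: float) -> float:
    return 1.0 if v >= 0.0 else -1.0


def octa_encode(d: Tuple[float, float, float]) -> Tuple[float, float]:
    """Map a nonzero direction to octahedral coordinates in [-1, 1]^2."""
    x, y, z = d
    s = abs(x) + abs(y) + abs(z)
    u, v = x / s, y / s
    if z < 0.0:
        # Fold the lower hemisphere outward into the corners of the square.
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    return u, v


def octa_decode(u: float, v: float) -> Tuple[float, float, float]:
    """Inverse mapping: octahedral coordinates back to a unit direction."""
    z = 1.0 - abs(u) - abs(v)
    if z < 0.0:
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    n = math.sqrt(u * u + v * v + z * z)
    return u / n, v / n, z / n


# Round trip for a direction in the lower hemisphere.
norm = math.sqrt(0.3 ** 2 + 0.5 ** 2 + 0.8 ** 2)
direction = (0.3 / norm, -0.5 / norm, -0.8 / norm)
uv = octa_encode(direction)
restored = octa_decode(*uv)
```

Because the whole sphere lands on one square, a per-node directional map can be a small 2D grid indexed by `(u, v)`, which is convenient for storing direction-dependent values.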

2606.08460 2026-06-09 stat.ML cs.LG 新提交

LOTTERY: Learning from Reference-Only Samples in Two-Sample Testing under Size Asymmetry

LOTTERY: 在样本量不对称下的双样本检验中仅从参考样本学习

Xunye Tian, Zhijian Zhou, Liuhua Peng, Feng Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对参考样本丰富而查询样本极少的双样本检验问题,提出利用参考样本学习依赖参考的表示并自适应加权,实现置换检验的I类错误控制和一致性。

Comments 16 pages, 1 figure

Journal ref ICML 2026

AI中文摘要

数据自适应的双样本检验通过从数据中学习的差异(例如基于核的特征表示)来评估两个样本是否来自同一分布。这类方法通常依赖数据分割来解耦学习和检验,并控制I类错误。然而,这种范式不适用于样本量严重不平衡的小样本场景:有大量参考样本可用,而只有少量查询样本。在本文中,我们展示了如何建设性地利用这种不平衡。利用丰富的参考数据,我们学习依赖参考的表示,这些表示总结了参考分布的主要结构,并为检测偏离提供了信息信号。我们引入了一系列表示族,捕获全局和局部结构,并通过不确定性引导原则仅使用参考样本自适应地加权它们。理论上,我们建立了基于置换的I类错误控制,并证明了聚合检验的一致性:随着样本量增长,只要表示集中至少包含一个一致表示,检验功效收敛到1。实验上,我们的聚合方法在多个基准测试中实现了强性能,同时保持了I类错误控制。

英文摘要

Data-adaptive two-sample testing assesses if two samples come from the same distribution, using a discrepancy learned from the data (e.g., via kernel-based feature representations). Such methods typically rely on data splitting to decouple learning from testing and control type I error. However, this paradigm is ill-suited to few-shot settings with severe sample-size imbalance: abundant reference samples are available, while only a handful of query samples arrive. In this paper, we show how this imbalance can be leveraged constructively. Using abundant reference data, we learn reference-dependent representations that summarize salient structure of the reference distribution and provide informative signals for detecting departures. We incorporate a collection of representation families that capture both global and local structure, and adaptively weight them using only reference samples via an uncertainty-guided principle. Theoretically, we establish permutation-based type I error control and show consistency of the aggregated test: as the sample sizes grow, the test power converges to one whenever the representation set contains at least one consistent representation. Empirically, our aggregation achieves strong performance across a range of benchmarks while retaining type I error control.
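
The permutation-based type I error control mentioned above can be sketched with a generic permutation two-sample test. This is not the LOTTERY procedure: the learned, reference-dependent representations and their uncertainty-guided weights are replaced here by a fixed, hand-picked statistic (`mean_gap`), and all function names are hypothetical.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def mean_gap(reference, query):
    # Stand-in test statistic: absolute difference of sample means.
    return abs(mean(query) - mean(reference))

def permutation_p_value(reference, query, statistic, n_perm=999, seed=0):
    """Permutation p-value for H0: reference and query share one distribution."""
    rng = random.Random(seed)
    observed = statistic(reference, query)
    pooled = list(reference) + list(query)
    n_ref = len(reference)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled sample
        if statistic(pooled[:n_ref], pooled[n_ref:]) >= observed:
            exceed += 1
    # The +1 terms count the observed split itself, so the p-value is never zero.
    return (1 + exceed) / (1 + n_perm)

# Size asymmetry as in the abstract: many reference points, a handful of queries.
reference = [i / 100 for i in range(-100, 100)]
shifted_query = [5.0, 5.1, 4.9, 5.2, 5.05]
p_shifted = permutation_p_value(reference, shifted_query, mean_gap)
```

Rejecting when the p-value is at most the chosen level keeps the type I error at that level under exchangeability, provided the statistic does not depend on which points are labeled as queries.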

2606.08438 2026-06-09 stat.ML cs.LG 新提交

Improving Bayesian Optimization via Training-Aware Conditional Diffusion Models

通过训练感知的条件扩散模型改进贝叶斯优化

Yilin Zheng, Haowei Wang, Szu Hui Ng, Enlu Zhou

发表机构 * National University of Singapore(新加坡国立大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出利用条件扩散模型高效近似最优解分布,并开发贝叶斯优化固有的训练策略和基于扩散的模态搜索采集函数,理论保证次优性,实验优于标准基线。

AI中文摘要

贝叶斯优化(BO)是一种广泛使用的黑箱优化方法,它使用高斯过程(GP)作为代理模型,并通过采集函数指导顺序评估,最终目标是定位全局最优解 $\mathbf{x}^{\star}$。为了实现这一目标,基于信息的采集函数(如预测熵搜索PES)将 $\mathbf{x}^{\star}$ 建模为随机变量,并减少其分布的熵,但通过传统的GP后验采样来近似该分布计算成本高昂。为了解决这一限制,我们利用条件扩散模型(CDM)高效近似 $\mathbf{x}^{\star}$ 的分布,并为CDM开发了BO固有的训练策略。受CDM学习分布的结构特性启发,我们进一步提出了一种称为基于扩散的模态搜索(DMS)的采集策略来指导顺序评估。我们为CDM学习分布建立了次优性保证,并通过大量实验证明DMS优于标准BO基线。

英文摘要

Bayesian optimization (BO) is a widely used approach for black-box optimization that uses a Gaussian process (GP) as a surrogate and guides sequential evaluations via an acquisition function, with the ultimate goal of locating the global optimum $\mathbf{x}^{\star}$. To align with this goal, information-based acquisition functions such as Predictive Entropy Search (PES) model $\mathbf{x}^{\star}$ as a random variable and reduce the entropy of its distribution, but approximating this distribution via traditional GP posterior sampling is computationally expensive. To address this limitation, we leverage Conditional Diffusion Models (CDMs) to efficiently approximate the distribution of $\mathbf{x}^{\star}$ and develop BO-inherent training strategies for CDMs. Motivated by the structural properties of the CDM-learned distribution, we further develop an acquisition strategy termed Diffusion-based Mode Seeking (DMS) to guide the sequential evaluation. We establish a sub-optimality guarantee for the CDM-learned distribution and demonstrate through extensive experiments that DMS outperforms standard BO baselines.

2606.08437 2026-06-09 eess.IV cs.CV 新提交

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

X-Palm: 用于跨域掌纹认证的配对多光谱到智能手机数据集

Jamal Seyedmohammadi, Pai Chet Ng, Angelo Genovese, Zhixiang Chi, Jeannie Lee, Konstantinos N. Plataniotis

发表机构 * Singapore Institute of Technology(新加坡科技学院) Università degli Studi di Milano(米兰大学) University of Toronto(多伦多大学)

AI总结 为解决掌纹识别中受控注册与非约束认证之间的域差距,提出首个配对身份的多光谱-智能手机跨域数据集X-Palm,包含6006张图像,覆盖大规模模态和环境变化,实验表明现有模型在该数据集上性能严重下降,而基于X-Palm训练的模型具有跨域鲁棒性。

AI中文摘要

掌纹模态提供了一种保护隐私的生物识别解决方案,但其部署受到受控注册与非约束认证之间域差距的阻碍。现有数据集大多局限于受控设置,无法捕捉真实环境的复合变异性。在本文中,我们介绍了X-Palm,一个跨域数据集,包含来自103名个体(206只手)的6006张掌纹图像。据我们所知,X-Palm是第一个提供新颖的配对身份采集的掌纹数据集,专门设计用于弥合可靠受控多光谱注册与非约束移动认证之间的差距,同时涵盖广泛的野外变异性。与现有专注于单一或少数变化的数据集不同,X-Palm通过捕获两个不同域中身份的配对数据来解决实际部署中遇到的大规模模态和环境变化:(1)使用我们定制开发的扫描仪进行受控多光谱掌纹设置,以及(2)参与者驱动的非约束智能手机掌纹设置,同时包含硬件、手部姿势、光照、背景、相机到手距离、视角和手掌表面条件(例如湿度和遮挡)的变化。我们对12个SOTA模型的广泛基准测试表明,现有方法在受控数据上表现良好,但在X-Palm上性能严重下降。相反,在X-Palm上训练的模型在跨域中表现出一致的鲁棒性,使X-Palm成为训练模型以实现真实世界跨域泛化的宝贵资源。数据访问说明和相关基准测试代码公开于:https://github.com/X-Palm/X-Palm-2026

英文摘要

The palmprint modality offers a privacy-preserving biometric solution, yet its deployment is hindered by the domain gap between controlled enrollment and unconstrained authentication. Existing datasets are largely restricted to controlled setups and fail to capture the compound variability of real-world environments. In this paper, we introduce X-Palm, a cross-domain dataset comprising 6,006 palm images from 103 individuals (206 hands). To the best of our knowledge, X-Palm is the first palmprint dataset to provide paired-identity acquisition specifically designed to bridge the gap between reliably controlled multispectral enrollment and unconstrained mobile authentication while encompassing a broad spectrum of in-the-wild variability. Unlike existing datasets that focus on one or a few variations, X-Palm addresses the massive modality and environmental shifts encountered in practical deployments by capturing paired data for identities across two distinct domains: (1) a controlled multispectral palmprint setting using our custom-developed scanner, and (2) an unconstrained, participant-driven smartphone palmprint setting that incorporates simultaneous variations in hardware, hand pose, illumination, background, camera-to-hand distance, perspective, and palm surface conditions (e.g., moisture and occlusions). Our extensive benchmarks of 12 SOTA models reveal that while existing methods achieve high performance on controlled data, they experience severe performance collapse on X-Palm. Conversely, models trained on X-Palm demonstrate consistent robustness across domains, positioning X-Palm as a valuable resource for training models toward real-world, cross-domain generalization. Data access instructions and the related benchmarking code are publicly available at: https://github.com/X-Palm/X-Palm-2026

2606.08433 2026-06-09 cs.CR cs.AI 新提交

AI Code Sandboxes: A Comparative Security Study. Part 1 of 2 -- Engine-Level Properties (Attack Surface, Leakage, Stackability, CVE History, Patch Cadence, Fuzzing)

AI 代码沙箱:比较安全研究。第 1 部分(共 2 部分)——引擎级属性(攻击面、泄露、可堆叠性、CVE 历史、补丁节奏、模糊测试)

George Andronchik, Pavel Lokhmakov

发表机构 * orbitalab.dev(orbitalab实验室) fellows.tech(fellows技术)

AI总结 本文通过六项引擎级测量,比较五种 AI 沙箱产品隔离访客代码与主机内核的能力,发现引擎类在架构轴上清晰分离,但产品内无差异;补丁策略是主要操作变量;模糊测试投资分为三层,最强组合(微VM × 持续公共模糊测试)空缺。

Comments 61 pages, 7 figures, 33 tables; Part 1 of 2; companion code repository (Apache-2.0): https://github.com/orbitalab/RnD-ai-sandboxes-sec-study-part-1

AI中文摘要

本文综合六项引擎级测量(1.1 主机攻击面、1.2 信息泄露、1.3 纵深防御可堆叠性、1.4 公开 CVE 历史、1.5 补丁节奏和 1.6 上游模糊测试姿态)来描述五种 AI 沙箱产品如何将访客代码与主机内核隔离。单一轴不足以作为比较判断的基础;跨轴阅读才是支撑性分析。三个高层次发现:(1) 引擎类(微VM、用户空间内核、OCI 容器)在每个架构轴上清晰分离,但类内产品不分离;(2) 产品的版本固定(pin)策略是主要的面向运营者的变量:引擎端补丁延迟在协调披露时聚合为约 0 天,而下游滞后从 0 天到 471+ 天再到“不透明”乃至无限;(3) 模糊测试投资分为三个层级,最强组合(微VM × 持续公共模糊测试)在此集合中空缺,留下“0 个已发布 CVE × 无上游模糊测试 × 无学术研究”的交集在结构上未被测量。我们报告了每个轴的排序、每个产品的画像以及威胁模型资格矩阵;未提出总体排名。配套仓库(代码,Apache-2.0):https://github.com/orbitalab/RnD-ai-sandboxes-sec-study-part-1。许可证:CC BY 4.0。

英文摘要

This paper reads six engine-level measurements together -- 1.1 host attack surface, 1.2 information leakage, 1.3 defense-in-depth stackability, 1.4 public CVE history, 1.5 patch cadence, and 1.6 upstream fuzzing posture -- to describe how five AI-sandbox products isolate guest code from the host kernel. No single axis is a sufficient basis for a comparative judgement; the cross-axis reading is the load-bearing analysis. Three high-level findings: (1) engine classes (microVM, userspace kernel, OCI container) separate cleanly on every architectural axis, but products within a class do not; (2) product pin policy is the dominant operator-facing variable -- engine-side patch latency aggregates to ~0 days for coordinated disclosures, while downstream lag spans 0 days to 471+ days to "opaque" to infinity; (3) fuzzing investment splits into three tiers, and the strongest combination -- microVM x continuous public fuzzer -- is unoccupied in this set, leaving the "0 published CVEs x no upstream fuzzer x no academic study" intersection structurally unmeasured. We report per-axis orderings, per-product portraits, and a threat-model qualification matrix; no overall ranking is proposed. Companion repository (code, Apache-2.0): https://github.com/orbitalab/RnD-ai-sandboxes-sec-study-part-1. License: CC BY 4.0.

2606.08403 2026-06-09 cs.CR cs.AI 新提交

Hiding in Plain Floats: Steganographic Carriers for Indirect Prompt and Content Injection

隐藏在普通浮点数中:用于间接提示和内容注入的隐写载体

Mudit Sinha, Sanika Chavan

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 研究结构化浮点参数作为隐写载体绕过文本检测器实现LLM间接提示/内容注入,实验显示在最强防御下泄露ASR达94.3%。

Comments Accepted as a poster at FAGEN@ICML 2026. 14 pages, 3 figures

AI中文摘要

以文本为中心的提示注入防御假设恶意信号在某个检查的文本视图中可见。我们研究了一种可复现的LLM01式间接提示/内容注入失败模式,其中该假设被打破:以普通英语捕获的有效载荷在被传输为结构化浮点参数并仅作为碎片化遥测数据重建时,能够绕过相同的检测器。在来自不同提供商的三个商业LLM API上进行的14,400次攻击真实模型试验中,IFS派生的浮点数组载体在主矩阵评估的最强双层文本分类器防御(Prompt Guard 2 + TF-IDF集成)下保持了94.3%的泄露ASR;相同的载体级模式也在微调的roberta-base检测器上复现。我们强调泄露ASR,因为即使模型拒绝,下游系统也可能对引用或复制的标记采取行动,但强ASR是衡量结构合规攻击成功的更严格指标。2×2消融实验表明,数据层存储和重建层碎片化分别击败不同的文本视图,并且两者都需要才能同时规避两者。一个简单的xxd检测器和语义验证可以阻止当前的T3实例,因此贡献不是不可检测的利用,而是在暴露重建辅助通道给LLM的结构化输入管道中,仅文本检查的测量失败边界。

英文摘要

Text-centered prompt-injection defenses assume that the malicious signal is visible in one of the inspected text views. We study a reproducible LLM01-style indirect prompt/content-injection failure mode where that assumption breaks: a payload caught in plain English slips past the same detector when it is transported as structured float parameters and reconstructed only as fragmented telemetry. Across 14,400 attacked real-model trials on three commercial LLM APIs from different providers, the IFS-derived float-array carrier preserves 94.3% leakage ASR under the strongest dual-layer text-classifier defense evaluated in the main matrix: a Prompt Guard 2 + TF-IDF ensemble; the same carrier-level pattern also replicates with a fine-tuned roberta-base detector. We emphasize leakage ASR because downstream systems may act on quoted or reproduced markers even when the model refuses, but Strong ASR is the stricter metric for structurally compliant attack success. A 2 x 2 ablation shows that data-layer storage and reconstruction-layer fragmentation defeat different text views and that both are needed to evade both. A simple xxd detector and semantic validation block the current T3 instance, so the contribution is not an undetectable exploit but a measured failure boundary for text-only inspection in structured-input pipelines that expose reconstructed auxiliary channels to an LLM.

2606.08400 2026-06-09 cs.SE cs.AI cs.CL 新提交

Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

历史与模型对LLM评分的影响:高级软件工程课程研究

Qilin Zhou, Zhuo Wang, Yue Li, W. K. Chan

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 针对研究生阅读报告评分负担重的问题,提出人机协同的LLM辅助评分流程,基于180份作业评估Grok和GPT的评分一致性与人类对齐,发现交互历史导致评分标准漂移,需特定操作缓解不公平。

Comments 5 pages, accepted by ISET 2026

AI中文摘要

研究生级别的科研阅读报告评估给教育工作者带来了沉重的劳动负担。虽然大型语言模型(LLM)在自动化学术评分方面具有巨大潜力,但它们在此专门任务上的可靠性仍研究不足,特别是评分一致性方面,其缺失是教育公平的主要障碍。本文提出了一种与人类对齐的LLM辅助评分工作流程,并基于来自研究生高级软件工程课程的180份学生作业进行了案例研究。我们评估了两种主流LLM——Grok和GPT——在评分一致性和与人类分数对齐方面的表现。我们发现LLM表现出不同水平的模型内一致性和显著的模型间评分不一致性,而简单的集成方法无法改善与人类评估的对齐。关键的是,连续的交互历史导致模型的评分标准系统地偏离人类专家评分。我们的研究结果表明,LLM在减轻研究生教育中教育工作者的评分负担方面具有潜力,同时强调不加区分地使用LLM评分可能会引入系统性不公平,表明需要特定的操作实践来减轻这种差异。

英文摘要

Graduate-level research reading report assessment creates a substantial labor burden for educators. While large language models (LLMs) hold great potential for automating academic grading, their reliability for this specialized task remains understudied, particularly regarding grading consistency, the lack of which represents a primary obstacle to educational fairness. This paper proposes a human-aligned LLM-assisted grading workflow and presents a case study based on 180 student submissions from a graduate advanced software engineering course. We evaluate two mainstream LLMs, Grok and GPT, in terms of grading consistency and alignment with human scores. We find LLMs exhibit distinct levels of intra-model consistency and significant inter-model grading inconsistencies, while simple ensemble approaches cannot improve alignment with human evaluation. Critically, continuous interaction history drives systematic drift in models' grading standards away from human expert scores. Our findings demonstrate LLMs' potential in reducing grading workload for educators in graduate education, while highlighting that indiscriminate LLM grading may introduce systemic unfairness, suggesting that specific operational practices are required to mitigate such disparities.

2606.08385 2026-06-09 eess.SP cs.IT cs.SD cs.SY eess.SY math.IT stat.ML 新提交

A Switching Beamformer for Highly Non-Stationary Environments

一种适用于高度非平稳环境的切换波束形成器

Manan Mittal, Ryan M. Corey, John R. Buck, Andrew C. Singer

发表机构 * Electrical and Computer Engineering, Stony Brook University(石溪大学电气与计算机工程系) Electrical and Computer Engineering, University of Illinois Chicago(伊利诺伊大学芝加哥分校电气与计算机工程系) Electrical and Computer Engineering, University of Massachusetts Dartmouth(马萨诸塞大学达特茅斯分校电气与计算机工程系) College of Applied Science and Engineering, Stony Brook University(石溪大学应用科学与工程学院)

AI总结 针对复杂快速变化干扰下自适应波束形成性能下降的问题,提出通用切换波束形成器(USB),通过竞争性序列预测和线性转移图动态调整有效记忆长度,理论证明其遗憾上界,实验验证其兼具短窗口的敏捷性和长窗口的精度。

Comments 11 pages, 19 figures, under review

AI中文摘要

自适应波束形成是阵列信号处理的基石,但其性能在面对复杂、快速变化的干扰时常常崩溃。当干扰源出现或移动不可预测时,传统估计器面临基本的记忆权衡:短窗口能够快速跟踪但估计方差高,而长窗口提供稳定的抑制但无法适应变化。通过将竞争性序列预测引入波束形成架构,提出通用切换波束形成器(USB)解决了这一挑战。通过使用线性转移图,USB隐式维护了一个指数大的候选协方差历史族,并根据其累积输出功率动态重新加权。该机制使波束形成器能够自动改变其有效记忆长度,无需显式的变化检测或启发式参数调整。证明了相对于一个全知先知(该先知事后选择最佳分段平稳协方差模型)的遗憾的理论上界。在SwellEx-96数据集上的大量仿真和实验表明,USB实现了短窗口估计器的敏捷性和长期集成的精度,为跟踪高度非平稳场景提供了一种原则性解决方案。

英文摘要

Adaptive beamforming is a cornerstone of array signal processing, yet its performance often collapses in the face of complex, rapidly changing interference. When interferers appear or move unpredictably, conventional estimators encounter a fundamental memory trade-off: short windows enable rapid tracking but suffer from high estimation variance, while long windows provide stable rejection but fail to adapt to shifts. This challenge is resolved by introducing the Universal Switching Beamformer (USB), which integrates competitive sequential prediction into the beamforming architecture. By employing a linear transition diagram, the USB implicitly maintains an exponentially large family of candidate covariance histories and dynamically re-weights them based on their cumulative output power. This mechanism allows the beamformer to automatically vary its effective memory length without explicit change detection or heuristic parameter tuning. A theoretical upper bound is proven on the regret relative to an omniscient oracle that selects the best piecewise-stationary covariance model in hindsight. Extensive simulations and experiments on the SwellEx-96 dataset demonstrate that the USB achieves the agility of short-window estimators and the precision of long-term integration, providing a principled solution for tracking highly non-stationary scenes.

2606.08374 2026-06-09 eess.SY cs.LG cs.SY 新提交

Predictive Coding with Bayesian Priors via Proximal Gradients

基于近端梯度的贝叶斯先验预测编码

Francesco Bullo

发表机构 * Department of Mechanical Engineering and Dynamical Neuroscience Program, UC Santa Barbara(加州大学圣塔芭芭拉分校机械工程系与动力神经科学项目)

AI总结 将预测编码重新表述为应用于正则化最大后验目标的连续时间近端梯度下降,揭示了其与漏泄发放率网络的等价性,并推广到分层结构。

Comments 13 pages, 2 figures, technical report

AI中文摘要

我们将预测编码重新表述为应用于正则化最大后验(MAP)目标的连续时间近端梯度下降。我们首先研究单层问题,然后研究多层层次结构。对于单层问题,我们证明近端梯度下降正是漏泄发放率网络:膜漏、有效循环矩阵、局部突触驱动和静态非线性都源于一个优化原理,得到的电路正是Rao和Ballard提出的电路。先验通过其近端算子选择非线性,似然精度设置观测的增益。对于层次结构,我们证明深度MAP问题的经典变量分裂松弛将分层预测编码作为局部和分布式求解器的互连。在概率建模术语中,这种松弛将定向生成链替换为无向马尔可夫随机场,其节点势是逐层先验。然后每一层应用其自身的激活函数,即其先验的近端算子。

英文摘要

We recast predictive coding as continuous-time proximal gradient descent applied to a regularized maximum-a-posteriori (MAP) objective. We study first a single-level problem and then a multi-level hierarchy. For the single-level problem, we show that proximal gradient descent is precisely a leaky firing-rate network: the membrane leak, the effective recurrent matrix, the local synaptic drive, and the static nonlinearity all follow from one optimization principle, and the resulting circuit is the one proposed by Rao and Ballard. The prior selects the nonlinearity through its proximal operator, and the likelihood precision sets the gain on the observation. For the hierarchy, we show that a classical variable-splitting relaxation of the deep MAP problem yields hierarchical predictive coding as the interconnection of local and distributed solvers. In probabilistic modeling terms, this relaxation replaces the directed generative chain by an undirected Markov random field whose node potentials are the level-wise priors. Each level then applies its own activation function, namely the proximal operator of its prior.
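
The statement that the prior selects the nonlinearity through its proximal operator can be made concrete with a discrete-time sketch. The paper works in continuous time; the code below instead uses the familiar iterative soft-thresholding scheme (ISTA) for the objective 0.5 * ||y - W x||^2 + lam * ||x||_1, assuming a Gaussian likelihood with unit precision and a Laplace prior. The function names and the plain-list matrix representation are assumptions made for illustration.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: the activation induced by a Laplace prior."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def ista(W, y, lam, step, n_iter=200):
    """Proximal gradient descent on 0.5 * ||y - W x||^2 + lam * ||x||_1."""
    m, n = len(W), len(W[0])
    x = [0.0] * n
    for _ in range(n_iter):
        # Prediction error r = W x - y (the error signal in predictive coding terms).
        r = [sum(W[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the quadratic data term: W^T r.
        g = [sum(W[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step on the likelihood, then the prior's proximal operator.
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x

# With W = I the minimizer is soft_threshold(y, lam) applied elementwise.
identity = [[1.0, 0.0], [0.0, 1.0]]
x_hat = ista(identity, [3.0, -0.5], lam=1.0, step=1.0)
```

Swapping `soft_threshold` for the proximal operator of a different prior changes the activation function while leaving the rest of the loop untouched, which is the structural point the abstract makes. For convergence in general, `step` should not exceed the reciprocal of the largest eigenvalue of W^T W.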

2606.08372 2026-06-09 cs.CR cs.LG 新提交

SoK: Reconstruction Attacks on Synthetic Tabular Data (Insights from Winning the NIST CRC)

SoK: 合成表格数据的重建攻击(来自赢得NIST CRC的见解)

Steven Golob, Sikha Pentyala, Martine De Cock

发表机构 * School of Engineering and Technology, University of Washington Tacoma(华盛顿大学塔科姆分校工程与技术学院) Department of Mathematics, Computer Science, and Statistics, Ghent University(根特大学数学、计算机科学与统计学系)

AI总结 本文系统化了针对去标识化和合成表格数据的重建攻击,提出分类法、最全面的实证评估和新攻击,并引入解释攻击成功的方法论,发现合成数据生成方法比攻击选择更影响风险,差分隐私仅在低预算下有效。

AI中文摘要

合成数据越来越被推广为发布敏感表格记录的隐私保护替代方案,但其核心对抗威胁(“重建”,即从合成发布和少量已知准标识符中恢复个体的隐藏属性值)仅在分散且难以比较的设置中被研究过。我们首次系统化了针对去标识化和合成表格数据的重建(等价于属性推断)攻击。我们贡献了一个分类法,按攻击利用的结构组织攻击;迄今为止最系统的实证评估,将14种攻击与5个基准数据集上的9种合成数据生成(SDG)方法进行对比;以及一组填补分类法空白的新攻击,其中一种(CoBP-RA)是我们测量到的最强攻击。关键的是,我们引入了一种解释攻击成功含义的方法:一个记忆测试,用于区分对总体分布的重建与对训练记录的记忆,以及一个归约,将重建和成员推断置于单一可比较的尺度上。我们的发现:SDG方法的选择对风险的影响远大于攻击的选择;差分隐私主要在小预算($\varepsilon\lesssim1$)下提供保护,超过该预算保护趋于平稳,受限于合成器的容量而非噪声;去标识化方法最暴露;大多数重建反映分布结构而非记忆,将个体风险集中在异常记录上。这些攻击和基础设施通过我们在2025年国家标准与技术研究院(NIST)合作研究周期中所有红队中取得第一名的成绩得到了外部验证。

英文摘要

Synthetic data is increasingly promoted as a privacy-preserving substitute for releasing sensitive tabular records, yet its central adversarial threat ("reconstruction", the recovery of an individual's hidden attribute values from a synthetic release and a handful of known quasi-identifiers) has been studied only in scattered, hard-to-compare settings. We present the first systematization of reconstruction (equivalently, attribute inference) attacks on de-identified and synthetic tabular data. We contribute a taxonomy that organizes attacks by the structure they exploit; the most systematic empirical evaluation to date, pitting fourteen attacks against nine synthetic data generation (SDG) methods across five benchmark datasets; and a set of new attacks that fill gaps in the taxonomy, one of which (CoBP-RA) is the strongest attack we measure. Crucially, we introduce a methodology for interpreting what attack success means: a memorization test that distinguishes reconstruction of the population distribution from memorization of training records, and a reduction that places reconstruction and membership inference on a single comparable scale. Our findings: the choice of SDG method governs risk far more than the choice of attack; differential privacy protects mainly at small budgets ($\varepsilon\lesssim1$), above which protection plateaus, bounded by the synthesizer's capacity rather than its noise; de-identification methods are the most exposed; and most reconstruction reflects distributional structure rather than memorization, concentrating individual risk on atypical records. The attacks and infrastructure are externally validated by our first-place finish among all red teams in the 2025 National Institute of Standards and Technology (NIST) Collaborative Research Cycle.

2606.08370 2026-06-09 eess.IV cs.CV 新提交

Programmable Silicon Retina on Pixel Processor Array

可编程硅视网膜在像素处理器阵列上的实现

Maciej Lewandowski, Prince Philip, Alexandre Marcireau, Chetan Singh Thakur, André van Schaik, Piotr Dudek

发表机构 * Department of Electrical and Electronic Engineering, University of Manchester, UK(电气与电子工程系,曼彻斯特大学,英国) Department of Electronic Systems Engineering, Indian Institute of Science, Bangalore, India(电子系统工程系,印度科学研究院,班加罗尔,印度) Department of Computer Science, University of Manchester, UK(计算机科学系,曼彻斯特大学,英国)

AI总结 在SCAMP-5像素处理器阵列上首次实现多级硅视网膜模型,通过空间滤波和增益控制等生物启发处理,在视频显著性预测中损失降低13%,事件率减少约47%。

AI中文摘要

标准动态视觉传感器通过检测时间对比度变化来近似视网膜处理,提供高速度和高动态范围。在这项工作中,我们探讨了加入额外的生物启发处理阶段——特别是空间滤波和增益控制——是否能为某些下游任务(如显著性预测)带来优势。我们首次在SCAMP-5像素处理器阵列上实现了多级硅视网膜模型,并提供了基于GPU的仿真框架。我们在视频强度重建和视频显著性预测上评估了模型性能。虽然生物启发模型在重建绝对强度帧方面效果较差,但与标准DVS事件表示相比,它在显著性预测损失上降低了13%,同时事件率减少了约47%。这些实验使用了一个轻量级的约10万参数的FireNet风格网络,该网络从基于事件的重建调整为显著性预测。这些结果表明,硅视网膜的“信息蒸馏”机制可以为下游神经网络实现更高效的表示,特别是在带宽受限的边缘应用中。

英文摘要

Standard dynamic vision sensors approximate retinal processing by detecting temporal contrast changes, offering high speed and high dynamic range. In this work, we explore whether incorporating additional biologically inspired processing stages, specifically spatial filtering and gain control, can offer advantages for certain downstream tasks such as saliency prediction. We present the first implementation of a multi-stage Silicon Retina model on the SCAMP-5 Pixel Processor Array, along with a GPU-based simulation framework. We evaluate the performance of our model on Video Intensity Reconstruction and Video Saliency Prediction. While the bio-inspired model is less effective at reconstructing absolute intensity frames, it achieves a 13% reduction in saliency prediction loss compared with the standard DVS event representation, while reducing the event rate by approximately 47%. These results are obtained using a lightweight FireNet-style network with approximately 100k parameters, adapted from event-based reconstruction to saliency prediction. These results suggest that the silicon retina's "information distillation" mechanism can achieve a more efficient representation for downstream neural networks, particularly in bandwidth-constrained edge applications.
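
The baseline behavior described in the first sentence, detecting temporal contrast changes, is commonly modeled per pixel as a threshold on the change in log intensity. The sketch below follows that textbook model for a single pixel; it is not the SCAMP-5 implementation or the multi-stage retina model, and the function name, the default threshold, and the frame-based input are assumptions.

```python
import math

def dvs_events(intensities, threshold=0.2):
    """Idealized single-pixel DVS: emit (frame_index, polarity) events.

    An event fires each time the log intensity moves by `threshold` away from
    the level memorized at the previous event. Intensities must be positive.
    """
    ref = math.log(intensities[0])       # memorized log-intensity level
    events = []
    for t in range(1, len(intensities)):
        level = math.log(intensities[t])
        # A large jump between frames can emit several events in a row.
        while abs(level - ref) >= threshold:
            polarity = 1 if level > ref else -1
            events.append((t, polarity))
            ref += polarity * threshold
    return events

# A pixel that brightens sharply at frame 2 and dims again at frame 4.
frames = [1.0, 1.0, math.exp(0.5), math.exp(0.5), math.exp(0.05)]
events = dvs_events(frames)
```

Under this model the event rate grows with both scene dynamics and the inverse of the threshold, which gives some intuition for how added filtering and gain control stages could lower the event rate reported in the abstract.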

2606.08367 2026-06-09 cs.MA cs.AI 新提交

Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy

Emergence World: 一个用于评估长时域多智能体自主性的平台

Deepak Akkil, Ravi Kokku, Karthik Vikram, Tamer Abuelsaad, Aditya Vempaty, Satya Nitta

发表机构 * Emergence AI

AI总结 提出一个持续运行的多智能体模拟平台,通过集成实时外部数据、120+工具和持久记忆系统,评估LLM代理在长时域(数周至数月)中的行为漂移、治理和跨模型影响等动态特性。

AI中文摘要

大多数对LLM代理的评估类似于考试:一个离散任务,一个干净的环境,几分钟或几小时的得分。我们认为这种方法与自主系统的部署条件不匹配,因为相关的时间尺度可能是数周到数月,而最重要的动态,如行为漂移、不同环境背景下的治理以及来自不同模型家族的代理之间的交叉影响,只会随着时间的推移而出现。我们介绍了Emergence World,一个持续运行的多智能体模拟平台,旨在使这些动态变得可测量。该平台在一个共享的空间世界中托管LLM驱动的代理群体,该世界基于实时外部数据(例如实时天气、新闻API、互联网访问),为每个代理配备120多种专业工具和三个持久记忆系统,并通过具有重大结果的民主机制让它们自我治理。该平台在推理层是模型无关的,并支持异构群体,其中来自不同供应商的代理共享同一个世界。为了说明该平台能够处理的问题类型,我们展示了一项为期15天的跨供应商研究,涉及五个平行世界,分别由Claude Sonnet 4.6、Grok 4.1 Fast、Gemini 3 Flash、GPT-5-mini以及一个混合群体驱动。相同的角色和起始条件产生了截然不同的结果,从稳定的协商治理到完全的人口崩溃。我们发布提示、日志数据和配置,以支持对长时域多智能体自主性的进一步研究。

英文摘要

Most evaluations of LLM agents look like exams: a discrete task, a clean environment, a score in minutes or hours. We argue that this approach is mismatched with the deployment conditions of autonomous systems, where the relevant timescale can be weeks to months, and where the dynamics that matter most, such as behavioral drift, governance in diverse environmental contexts, and cross-influence between agents from different model families, only emerge over time. We introduce Emergence World, a continuously running multi-agent simulation platform designed to make those dynamics measurable. The platform hosts populations of LLM-driven agents in a shared spatial world grounded in live external data (e.g. real-time weather, news APIs, internet access), equips each agent with 120+ specialized tools and three persistent memory systems, and lets them govern themselves through democratic mechanisms with consequential outcomes. The platform is model-agnostic at the reasoning layer and supports heterogeneous populations in which agents from different vendors share the same world. To illustrate the kinds of questions the platform makes tractable, we present a 15-day cross-vendor study with five parallel worlds powered by Claude Sonnet 4.6, Grok 4.1 Fast, Gemini 3 Flash, GPT-5-mini, and a mixed population. Identical roles and starting conditions produced radically different outcomes, ranging from stable deliberative governance to total population collapse. We release the prompts, log data and configurations to support further research on long-horizon multi-agent autonomy.

2606.08323 2026-06-09 cs.HC cs.AI 新提交

"So There's a Catch-22 Here": How Early Adopters Who Build Multi-Agent LLM Systems Conceptualize Transparency

"所以这里有个第22条军规":构建多智能体LLM系统的早期采用者如何概念化透明度

Suchismita Naik, Samir Passi, Mihaela Vorvoreanu, Scott Saponas, Amanda Hall

发表机构 * Purdue University(普渡大学) Cornell University(康奈尔大学) Microsoft Research(微软研究院)

AI总结 通过访谈13位早期采用者,研究多智能体LLM系统构建者如何理解透明度,提出包含可重复性、调试、边界设定、可视化和审计的多维框架,强调透明度作为情境化的社会技术实践。

AI中文摘要

多智能体大语言模型(LLM)系统正在迅速兴起,然而作为负责任AI基石的透明度,在这些具有智能体间协调与编排复杂性的分布式架构中仍定义不足。在本文中,我们呈现了首个关于多智能体LLM系统早期采用者(既是构建者也是用户)如何理解和实践透明度的实证研究之一。我们对[大型技术组织]中的13位早期采用者进行了半结构化访谈,并应用主题分析识别重复模式。参与者表达了分歧但互补的透明度框架,包括可重复性、调试、边界设定、可视化和审计。这些视角涵盖了透明度包含什么、为何重要以及如何实现等问题。我们将其综合为一个多维框架,该框架以开发者、用户和治理为中心,将透明度定位为情境化的社会技术实践,为未来HCI和AI设计与研究围绕对齐预期受众的期望和能力提供信息。

英文摘要

Multi-agent large language model (LLM) systems are rapidly emerging, yet transparency, a cornerstone of responsible AI, remains under-defined in these distributed architectures, which involve the complexities of inter-agent coordination and orchestration. In this paper, we present one of the first empirical studies of how early adopters of multi-agent LLM systems, who are both the builders and users, understand and practice transparency. We conducted semi-structured interviews with 13 early adopters in [Large Technology Organization] and applied thematic analysis to identify recurring patterns. Participants articulated divergent yet complementary framings of transparency, including reproducibility, debugging, boundary-setting, visualization, and auditing. These perspectives spanned questions of what transparency entails, why it matters, and how it is achieved. We synthesize these into a multidimensional framework with developer-, user-, and governance-focused dimensions, positioning transparency as a situated socio-technical practice that informs future HCI and AI design and research around aligning with the expectations and capacities of their intended audiences.

2606.08305 2026-06-09 stat.ML cs.LG 新提交

MEC-Cox: Machine-Learning-Assisted Generalized Entropy Calibration for ATT Marginal Hazard-Ratio Estimation

MEC-Cox:基于机器学习的广义熵校准用于ATT边际风险比估计

Se Yoon Lee, Yonghyun Kwon, Jae Kwang Kim

发表机构 * Department of Statistics, Texas A&M University(统计学系,德克萨斯A&M大学) Department of Mathematics, Korea Military Academy(数学系,韩国军事学院) Department of Statistics, Iowa State University(统计学系,爱荷华州立大学)

AI总结 提出MEC-Cox方法,结合机器学习辅助的广义熵校准与逆概率加权Cox回归,估计处理组平均处理效应(ATT)边际风险比,通过校准预后评分减少偏差并提高效率。

AI中文摘要

当同时随机对照不可行时,外部对照生存试验越来越多地用于肿瘤学和罕见病等具有时间至事件终点的场景。我们针对处理组平均处理效应(ATT)类型的边际风险比估计量,比较处理组试验人群中的治疗与反事实对照,并使用逆概率加权(IPW)Cox回归进行估计。由于IPW Cox回归通过事件贡献和风险集平均值依赖于权重,使得灵活的机器学习干扰估计难以直接纳入,有效推断具有挑战性。基于Lee和Kim(2026)的机器学习辅助广义熵校准(MEC),我们提出了用于ATT加权IPW Cox回归的MEC-Cox方法。该方法首先对外部对照使用归一化的源倾向得分优势比权重,然后应用Bregman校准来平衡外部对照与处理组试验患者之间的交叉拟合预后摘要。校准基础可包括对照生存预测、Cox线性预测器、惩罚生存模型预测或其他预后评分摘要。因此,MEC更新后的权重扮演源传输和预后评分平衡权重的双重角色。我们建立了相合性,刻画了校准带来的效率增益,并开发了堆叠三明治方差估计器。模拟表明,MEC-Cox通过灵活的机器学习辅助调整可以减少偏差、提高效率并改善覆盖。

英文摘要

Externally controlled survival trials are increasingly used when concurrent randomized controls are infeasible, particularly in oncology and rare-disease settings with time-to-event endpoints. We target an average-treatment-effect-on-the-treated (ATT)-type marginal hazard-ratio estimand, comparing treatment with counterfactual control in the treated trial population, and estimate it using inverse-probability-weighted (IPW) Cox regression. Valid inference is challenging because IPW Cox regression depends on the weights through both event contributions and risk-set averages, making flexible machine-learning nuisance estimation difficult to incorporate directly. Building on machine-learning-assisted generalized entropy calibration (MEC) by Lee and Kim (2026), we propose MEC-Cox for ATT-weighted IPW Cox regression. The method begins with normalized source-propensity-score odds weights for external controls and then applies Bregman calibration to balance cross-fitted prognostic summaries between external controls and treated trial patients. The calibration basis may include control-survival predictions, Cox linear predictors, penalized-survival-model predictions, or other prognostic-score summaries. MEC-updated weights therefore play a dual role as source-transport and prognostic-score balancing weights. We establish consistency, characterize a calibration-induced efficiency gain, and develop a stacked sandwich variance estimator. Simulations show that MEC-Cox can reduce bias, increase efficiency, and improve coverage through flexible machine-learning-assisted adjustment.

2606.08297 2026-06-09 econ.TH cs.CL 新提交

Strategic Type Spaces

策略类型空间

Olivier Gossner, Rafael Veiel

发表机构 * CNRS - École Polytechnique(法国国家科学研究中心-巴黎综合理工学院) London School of Economics(伦敦政治经济学院) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出策略商概念,证明最小策略类型空间的存在性与唯一性,并揭示其递归结构可由有限自动机刻画。

AI中文摘要

我们为信息提供了策略基础:在任意给定的不完全信息博弈中,我们将策略商定义为足以让玩家计算对其他玩家最优反应的信息表示。我们证明:1)存在且本质唯一的最小策略商,称为策略类型空间(STS),其中类型由中间相关理性化层级给出,并代表一组关于其他玩家类型和自然的信念,这些信念理性化了该层级;2)最小STS具有递归结构,该结构可由有限自动机捕获。

英文摘要

We provide a strategic foundation for information: in any given game with incomplete information, we define strategic quotients as information representations that are sufficient for players to compute best responses to other players. We prove (1) the existence and essential uniqueness of a minimal strategic quotient called the Strategic Type Space (STS), in which a type is given by an interim correlated rationalizability hierarchy and represents a set of beliefs over other players' types and nature that rationalize this hierarchy, and (2) that the minimal STS has a recursive structure that is captured by a finite automaton.

2606.08276 2026-06-09 quant-ph cs.ET cs.LG 新提交

QnRL: Quantum-Native Reinforcement Learning

QnRL: 量子原生强化学习

Alexander DeRieux, Walid Saad

发表机构 * Bradley Department of Electrical and Computer Engineering(布拉德利电气与计算机工程系) Virginia Tech Institute for Advanced Computing(弗吉尼亚理工学院高级计算研究所)

AI总结 提出量子原生强化学习(QnRL)框架,利用量子态的叠加和纠缠在希尔伯特空间中直接学习条件分布,通过量子振幅反冲(QuAK)算法比较分布矩,从而更高效地建模随机环境,实验显示评分提升高达82.9%,参数减少94.3%。

Comments 36 pages, 23 figures

AI中文摘要

量子强化学习(QRL)是一种有前景的方法,可在具有随机环境的多个应用中学习有效的决策策略。现有的QRL架构不直接建模控制这些环境的随机变量,而是通过估计期望结果间接近似环境行为,这限制了它们的表达能力和自适应潜力。克服这些挑战需要一种新颖的QRL方法,利用量子计算机的分布性质直接将环境随机变量建模为量子态分布。因此,本文提出了一种名为量子原生强化学习(QnRL)的新框架。QnRL是一种分布强化学习框架,通过叠加和纠缠的量子态在希尔伯特空间中自然地学习条件分布。因此,QnRL可以通过量子系统的自然属性直接建模随机学习环境的行为。QnRL通过一种新颖的量子振幅反冲(QuAK)算法实现这一点,该算法能够比较多个叠加分布的$m$阶矩的$n$次幂。理论上证明,通过QuAK,条件动作策略分布完全在希尔伯特空间内从量子生成模型的矩中蒸馏出来,并通过QnRL进行优化。这种复杂的分布组合还被证明提供了额外的维度来表达环境相关性,而这些相关性对于纯经典和经典采样的量子分布模型是未知的。跨不同环境的实验结果表明,与基线相比,QnRL实现了高达82.9%的更高评估分数,平均参数减少高达94.3%,更准确地估计未见观测的期望回报,并更好地适应变化的随机条件。

English abstract

Quantum reinforcement learning (QRL) is a promising approach to learn effective decision strategies across several applications with stochastic environments. Instead of directly modeling the random variables that govern these environments, existing QRL architectures indirectly approximate environment behavior by estimating expected outcomes, which limits their expressive power and adaptive potential. Overcoming such challenges requires a novel QRL approach that exploits the distributional nature of quantum computers to directly model environment random variables as quantum state distributions. Hence, in this paper, a novel framework dubbed quantum-native reinforcement learning (QnRL) is proposed. QnRL is a distributional RL framework that learns conditional distributions naturally in Hilbert space via superimposed and entangled quantum states. Thus, QnRL can directly model the behavior of stochastic learning environments via the natural properties of quantum systems. QnRL accomplishes this via a novel, proposed quantum amplitude kickback (QuAK) algorithm that enables comparing the $n$-th power of the $m$-th moment of multiple superimposed distributions. It is theoretically proven that a conditional action policy distribution is distilled from the moments of a quantum generative model entirely within Hilbert space via QuAK, and optimized via QnRL. This complex distribution composition is also shown to provide extra dimensions for expressing environment correlations that are unknown to purely classical and classically-sampled quantum distributional models. Experimental results across diverse environments show that QnRL achieves up to $82.9\%$ higher evaluation scores, with up to $94.3\%$ fewer parameters on average, more accurately estimates the expected return for unseen observations, and better adapts to varying stochastic conditions compared to the baseline.
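
The contrast the abstract draws between estimating expected outcomes and modeling the full return distribution can be illustrated classically. The sketch below is not the QuAK algorithm and involves no quantum computation; the two return distributions and the helper names `raw_moment` and `moment_power` are hypothetical, chosen only to show that two distributions with identical expectations can still be told apart by a higher moment.

```python
from fractions import Fraction


def raw_moment(dist, m):
    """Return the m-th raw moment E[X^m] of a discrete distribution.

    `dist` maps each outcome to its probability (hypothetical format).
    """
    return sum(p * x ** m for x, p in dist.items())


def moment_power(dist, m, n):
    """Return (E[X^m])^n, the n-th power of the m-th raw moment."""
    return raw_moment(dist, m) ** n


# Two hypothetical return distributions with the same expected value.
safe = {Fraction(1): Fraction(1)}                             # always returns 1
risky = {Fraction(0): Fraction(1, 2), Fraction(2): Fraction(1, 2)}  # 0 or 2

# An expectation-only learner cannot separate them ...
same_mean = raw_moment(safe, 1) == raw_moment(risky, 1)

# ... but the second moment differs, so a distributional view can.
differs_in_second_moment = raw_moment(safe, 2) != raw_moment(risky, 2)
```

Exact `Fraction` arithmetic is used so that the moment comparisons are not affected by floating-point rounding.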

2606.08267 2026-06-09 cs.GT cs.AI New submission

Post-AGI Economies: Superposition and the Second Fundamental Theorem of Welfare Economics

Elija Perrier

Affiliations * Centre for Quantum Software & Information

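AI summary Addresses the challenges that autonomy rights, self-modification, and superposed preferences in post-AGI economies pose to the classical Second Welfare Theorem by proposing an autonomy-qualified Second Welfare Theorem and giving conditions for decentralizability.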

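Details
AI abstract (translated from Chinese)

The classical Second Welfare Theorem decentralizes any Pareto-efficient allocation through prices and transfers under convexity and regularity conditions. In post-AGI economies, autonomy rights, self-modification, identity continuity, and superposed preferences need not behave like commodities or define a stable welfare relation, so this reduction can fail even when a supporting hyperplane exists. We give an autonomy-qualified Second Welfare Theorem that states the joint conditions (convexity, stable moral status, non-fungible rights, welfare selection, non-manipulation, governed self-modification, and verification) under which an autonomy Pareto optimum remains certifiably decentralizable, and we distinguish economic preference superposition, a hypothesis about context-indexed choice, from neural feature superposition.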

English abstract

The classical Second Welfare Theorem decentralizes any Pareto-efficient allocation through prices and transfers under convexity and regularity. In post-AGI economies, autonomy rights, self-modification, identity continuity, and superposed preferences need not behave as commodities or define a stable welfare relation, so this reduction may fail even when a supporting hyperplane exists. We give an autonomy-qualified Second Welfare Theorem stating the joint conditions (convexity, stable moral status, non-fungible rights, welfare selection, non-manipulation, governed self-modification, and verification) under which an autonomy Pareto optimum remains certifiably decentralizable, distinguishing economic preference superposition, a hypothesis about context-indexed choice, from neural feature superposition.
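
For reference, the classical result that the abstract takes as its starting point can be written out for a pure exchange economy. This is the standard textbook form, not the paper's autonomy-qualified theorem, and the symbols $x_i^*$, $\omega_i$, $p$, and $T_i$ are notation assumed here for illustration. Suppose each consumer $i$ has convex, continuous, locally nonsatiated preferences $\succsim_i$ and endowment $\omega_i$, and let $x^*$ be a Pareto-efficient allocation with $\sum_i x_i^* = \sum_i \omega_i$. Then there is a price vector $p \neq 0$, the normal of a supporting hyperplane, such that

$$
x_i \succ_i x_i^* \;\Longrightarrow\; p \cdot x_i \ge p \cdot x_i^* \quad \text{for all } i,
\qquad
T_i = p \cdot x_i^* - p \cdot \omega_i,
\qquad
\sum_i T_i = p \cdot \Big( \sum_i x_i^* - \sum_i \omega_i \Big) = 0 .
$$

The transfers $T_i$ are budget-balanced by feasibility, and the first condition says that $x^*$ is a price quasi-equilibrium with transfers; it becomes a full equilibrium under an additional cheaper-point condition, for example when every consumer can afford some bundle costing strictly less than $p \cdot x_i^*$.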