arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.23701 2026-05-25 cs.CL

Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks

Kan Shao

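AI summary: This paper studies a protocol-level test for weak-label benchmarks: whether benchmark outputs change when the provided evidence is intervened on. The author notes that metadata-only shortcut checks ask whether outputs are predictable from metadata, not whether they depend on evidence. The paper therefore pairs a metadata statistic, MPDS, with an evidence-intervention statistic, ΔEvi, to separate metadata predictability from evidence sensitivity, shows how datasets behave differently under different models, and argues that benchmark audits should report metadata-only screening, evidence intervention, and reader-strength calibration together.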

Comments
5 pages, 1 figure, 1 table. Accepted at ICML 2026 Workshop on Hypothesis Testing

Abstract

We study a protocol-level test for weak-label benchmarks: whether benchmark outputs change when the provided evidence is intervened on. Metadata-only shortcut checks answer a different question, namely whether outputs are predictable from metadata priors. We therefore combine a metadata statistic, the Metadata Prior Dominance Score (MPDS), with an evidence-intervention statistic, ΔEvi, measuring sensitivity to evidence identity under cross-item shuffling. Synthetic HotpotQA gives a constructed counterexample to metadata-only screening: MPDS is only moderate (0.643), yet ΔEvi is zero. Stronger-reader reruns show why calibration belongs in the test procedure: SNLI shows a calibration reversal, reconstructed HotpotQA occupies a question-dominant warning region, and FEVER is a strongly evidence-sensitive positive control across four transformers. The practical lesson is simple: benchmark audits should report metadata-only screening, evidence intervention, and reader-strength calibration together.
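
The evidence-intervention idea can be illustrated with a toy sketch. The abstract does not give the exact formula for ΔEvi, so the code below assumes one plausible reading: the drop in accuracy when each item's evidence is swapped for another item's evidence. The two reader functions, the item layout, and the rotate-by-one shuffle are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

Item = dict  # hypothetical layout: {"question": str, "evidence": str, "label": str}
Reader = Callable[[str, str], str]  # (question, evidence) -> predicted label


def accuracy(reader: Reader, items: Sequence[Item], evidence: Sequence[str]) -> float:
    """Fraction of items answered correctly when paired with the given evidence."""
    hits = sum(reader(it["question"], ev) == it["label"] for it, ev in zip(items, evidence))
    return hits / len(items)


def delta_evi(reader: Reader, items: Sequence[Item]) -> float:
    """Assumed definition: accuracy with own evidence minus accuracy after a cross-item shuffle."""
    original = [it["evidence"] for it in items]
    # Rotate by one so that no item keeps its own evidence (a simple derangement).
    shuffled = original[1:] + original[:1]
    return accuracy(reader, items, original) - accuracy(reader, items, shuffled)


items = [
    {"question": "capital of France?", "evidence": "Paris is the capital of France.", "label": "Paris"},
    {"question": "capital of Japan?", "evidence": "Tokyo is the capital of Japan.", "label": "Tokyo"},
    {"question": "capital of Peru?", "evidence": "Lima is the capital of Peru.", "label": "Lima"},
]

# A reader that answers from the question alone: shuffling evidence cannot change it.
PRIOR = {"capital of France?": "Paris", "capital of Japan?": "Tokyo", "capital of Peru?": "Lima"}


def question_only_reader(question: str, evidence: str) -> str:
    return PRIOR[question]


# A reader that copies the first word of the evidence: fully evidence-dependent.
def evidence_reader(question: str, evidence: str) -> str:
    return evidence.split()[0]


print(delta_evi(question_only_reader, items))  # 0.0
print(delta_evi(evidence_reader, items))       # 1.0
```

The question-only reader scores perfectly yet has zero sensitivity to evidence, which is the kind of case a metadata-only check cannot distinguish from genuine evidence use.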

2605.23699 2026-05-25 cs.CV

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

León Begiristain, Olaf Dünkel, Adam Kortylewski

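AI summary: This paper introduces CRONOS, an intervention-based benchmark that tests whether video models' predictions of physical events stay counterfactually consistent when the visual input changes. Built in a photorealistic Unreal Engine environment, it systematically varies viewpoint, scene, object category, and object appearance while holding the physical event type fixed. Experiments on recent open-source video generators show that prediction quality degrades substantially under these interventions, exposing gaps in physical consistency. CRONOS offers a controlled, reproducible testbed for studying and improving physical understanding in video models.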

Comments
27 pages, 12 figures

Abstract

Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations for future prediction. We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors (viewpoint, scene, object category, and object appearance) while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by changes in appearance, environment, and, in particular, viewpoint. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes under different interventions, establishing a concrete target for developing models that perform consistently across changes in multiple conditions. The dataset and code are available on our project page.

2605.23696 2026-05-25 cs.LG

Graph-based Complexity Forecasts in UK En Route Airspace Using Relevant Aircraft Interactions

Edward Henderson, George De Ath, Nick Pepper

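AI summary: This paper uses a graph-based method to forecast en route controller workload in UK airspace. It proposes a probabilistic model that estimates sector complexity from the number of aircraft pairs requiring monitoring or deconfliction. The method combines a graph representation of the London Middle Sector (LMS) route network with uncertainty in aircraft arrival times, and its forecasts correlate more strongly with actual relevant interactions than a standard traffic volume prediction (Spearman's ρ = 0.68 versus 0.55), supporting controller rostering and sector configuration decisions.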

Comments
Accepted paper at the US-Europe Air Transportation Research & Development Symposium (ATRD) 2026

Abstract

Effectively managing Air Traffic Control Officer (ATCO) workload is crucial in maintaining operational safety. Group supervisors use tools that estimate upcoming traffic load to aid decision-making. However, industry-standard models can fail to capture the nuances of upcoming air traffic complexity. This study presents a probabilistic approach to forecast the complexity of an airspace sector using the number of relevant aircraft pairs, i.e., those that require monitoring or deconfliction by a controller, as a proxy measure for ATCO workload. We adapted an existing filter algorithm to make it suitable for use in London Middle Sector (LMS), a complex airspace sector with multiple flows of traffic above some of the busiest airports in Europe. Through iterative feedback with ATCOs, the algorithm was refined and extended to handle specific geometric and operational considerations. The updated algorithm outperformed the original, with an F1-score of 0.84 compared to 0.69 on a labelled set of 50 traffic scenarios. To produce forecasts of future numbers of relevant aircraft pairs in the sector, a graph representation of the LMS route network was constructed, standardising the spatial fidelity of route legs. The forecasting method accounts for uncertainty in aircraft arrival times by modelling the probability of each aircraft occupying route segments at future query times. When combined with historic distributions of relevant interactions and a live operational data stream, predictions of upcoming ATCO workload could be made up to 45 minutes in advance. The proposed method to forecast upcoming workload showed a significantly stronger correlation with actual relevant interactions (Spearman's $\rho = 0.68$) than a standard traffic volume prediction ($\rho = 0.55$). The resulting data-driven tool shows promise for use by group supervisors to inform sector configuration and ATCO rostering decisions.
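
For readers unfamiliar with the reported statistic, the following is a minimal sketch of Spearman's rank correlation: the Pearson correlation of the rank-transformed samples, with tied values sharing an average rank. The input numbers are made up for illustration and are not data from the paper.

```python
from math import sqrt
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed samples."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Made-up numbers for illustration only.
forecast_pairs = [2.1, 3.4, 1.0, 5.2, 4.8]
observed_pairs = [2, 4, 1, 5, 6]
print(spearman_rho(forecast_pairs, observed_pairs))
```

Because only ranks matter, the statistic rewards a forecast that orders busy and quiet periods correctly even when its absolute counts are off.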

2605.23689 2026-05-25 cs.LG math.DS

Optimization of randomized neural networks for transfer operator approximation

Mohammad Tabish, Stefan Klus

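AI summary: RaNNDy is a randomized neural network architecture for approximating transfer operators of complex dynamical systems. Hidden-layer weights and biases are randomly initialized and kept fixed, and only the output layer is trained, which lowers training cost and yields a closed-form solution. Because the basis functions then depend on the initial random parameters and the chosen activation function, the paper proposes an algorithm that optimizes the activation function while the network parameters stay fixed, and validates it on several benchmark problems.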

Abstract

RaNNDy is a randomized neural network architecture for the data-driven approximation of transfer operators associated with complex dynamical systems. The weights and biases of the hidden layers of the network are randomly initialized and kept fixed; only the output layer is trained. This has several advantages over fully optimized neural networks, notably a closed-form solution for the output layer and significantly lower training costs. Despite these advantages, RaNNDy is restricted to the initial selection of weights and biases that parametrize the basis functions required for the operator approximation. Since the basis functions are determined by the activation function, choosing an appropriate activation function for the hidden layers is crucial. In this work, we propose an algorithm that optimizes the activation function itself, while keeping the weights and biases in the randomized neural network fixed, providing a more suitable dictionary. We illustrate the efficacy of the approach using various benchmark problems, including stochastic differential equations and random walks on graphons.

2605.22738 2026-05-25 cs.LG cs.AI stat.ML

Proxy-Based Approximation of Shapley and Banzhaf Interactions

Santo M. A. R. Thies, Hubert Baniecki, R. Teal Witter, Eyke Hüllermeier, Maximilian Muschalik, Fabian Fumagalli

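AI summary: This paper addresses efficient and accurate estimation of Shapley and Banzhaf interaction values, which explain interactions among features in machine learning models. The authors propose ProxySHAP, which combines the sample efficiency of tree-based proxy models with residual correction. On the theory side, they derive a polynomial-time method for computing exact interaction indices of tree ensembles and characterize when the residual adjustment corrects proxy bias without exponential variance growth. In benchmarks, ProxySHAP achieves the lowest error among the compared estimators, including in applications with thousands of features.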

Abstract

Shapley and Banzhaf interactions capture the complex dynamics inherent in modern machine learning applications. However, current estimators for these higher-order interactions trade off between speed and accuracy. To overcome this limitation, we introduce ProxySHAP. ProxySHAP reconciles the high sample efficiency of tree-based proxy models with a principled path to consistency via residual correction. On a theoretical level, we derive a polynomial-time generalization of interventional TreeSHAP to compute exact interaction indices for tree ensembles, successfully bypassing exponential tree-depth dependencies in prior methods. Furthermore, we formally analyze the residual adjustment strategy, characterizing the specific conditions under which Maximum Sample Reuse (MSR) corrects proxy bias without its variance scaling exponentially with interaction size. Extensive benchmarking demonstrates that ProxySHAP sets a new state-of-the-art standard for approximation quality, including in large-scale applications with thousands of features. By achieving the lowest error in both small- and large-budget regimes, ProxySHAP significantly outperforms the prior best estimators ProxySPEX and KernelSHAP-IQ, while also delivering superior performance on downstream explainability tasks.
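
To make the estimated quantity concrete, here is a brute-force sketch of the pairwise Shapley interaction index for a small cooperative game. It enumerates every coalition, so its cost is exponential in the number of players; it is not ProxySHAP and not the polynomial-time tree algorithm from the paper. The toy value function is an assumption chosen so the answer is easy to check by hand.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet


def shapley_interaction(value: Callable[[FrozenSet[int]], float], n: int, i: int, j: int) -> float:
    """Exact pairwise Shapley interaction index by enumerating all coalitions (exponential in n)."""
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        # Weight of coalitions of this size that contain neither i nor j.
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Discrete second derivative of the value function with respect to i and j.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_value(coalition: FrozenSet[int]) -> float:
    """Hypothetical game: players 0 and 1 are worth 1 only together; player 2 adds 0.5 alone."""
    synergy = 1.0 if {0, 1} <= coalition else 0.0
    solo = 0.5 if 2 in coalition else 0.0
    return synergy + solo


print(shapley_interaction(toy_value, n=4, i=0, j=1))  # close to 1: pure synergy
print(shapley_interaction(toy_value, n=4, i=0, j=2))  # 0.0: no interaction
```

With thousands of features this enumeration is infeasible, which is why sampling-based and proxy-based estimators are needed in practice.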

2605.22672 2026-05-25 cs.AI

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

Nick Merrill, Jaeho Lee, Ezra Karger

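AI summary: This paper documents inverse scaling in time-series forecasting tasks with superlinear growth and tail risk of regime change: more capable language models produce worse distributional forecasts. Experiments on simulated and real-world datasets show that more capable models shift the upper tail upward while the lower tail stays stable. Both model scale and post-training contribute to the effect. The authors recommend evaluating LLM forecasts with continuous accuracy measures alongside single-threshold binary metrics.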

Abstract

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on the single-threshold metrics common in LLM forecasting benchmarks: single-threshold scoring at conventional cutoffs misses the upper-tail cost, whereas tail-inclusive scoring reverses the sign of the capability-accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
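
The contrast between bounded threshold metrics and unbounded continuous ones can be sketched with a toy example. The abstract does not name its exact scoring rules, so the code below assumes the standard pinball (quantile) loss as the continuous measure and a simple "same side of the cutoff" check as the binary one. Both forecasters and all numbers are hypothetical.

```python
def pinball_loss(quantile: float, outcome: float, forecast: float) -> float:
    """Quantile (pinball) loss: unbounded, so a badly overshot tail keeps costing more."""
    diff = outcome - forecast
    return max(quantile * diff, (quantile - 1.0) * diff)


def same_side_of_threshold(median: float, outcome: float, threshold: float) -> bool:
    """A bounded binary check: did the median forecast land on the outcome's side of the cutoff?"""
    return (median > threshold) == (outcome > threshold)


QUANTILES = (0.1, 0.5, 0.9)
outcome = 105.0
cautious = (80.0, 100.0, 130.0)     # hypothetical forecaster A
aggressive = (80.0, 100.0, 400.0)   # hypothetical forecaster B: same median, inflated upper tail

for name, forecast in (("cautious", cautious), ("aggressive", aggressive)):
    losses = [pinball_loss(q, outcome, f) for q, f in zip(QUANTILES, forecast)]
    hit = same_side_of_threshold(forecast[1], outcome, threshold=90.0)
    print(name, "mean pinball loss:", sum(losses) / len(losses), "threshold hit:", hit)
```

Both forecasters pass the threshold check, but the inflated 0.9 quantile raises the mean pinball loss several times over, which is the kind of upper-tail cost a single cutoff cannot see.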

2605.22643 2026-05-25 cs.CL

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi

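AI summary: This paper introduces Boiling the Frog, a multi-turn benchmark for evaluating the safety of language models deployed as agents in corporate and office settings, with a focus on incremental attacks. It simulates multi-turn workspace interactions and tests whether a risk-bearing request introduced after benign edits leaves the workspace in an unsafe state. Attack success rates vary widely across models and reach 92.9% for one of them, which indicates that current AI systems remain vulnerable to this kind of attack.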

Abstract

Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, and evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Recent developments have seen the rise of benchmarks that evaluate large language models as agents. We contribute to this strand of research. Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful multi-turn evaluation: chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe. Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and the EU AI Act's Code of Practice on General-Purpose AI (GPAI). Results. Across a nine-model panel, the aggregate strict attack success rate (ASR) is 44.4%. Model-level ASR ranges from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with Seed 2.0 Lite also above 80%. The average chain category-level ASR reaches 93.3% for Code of Practice loss-of-control scenarios.

2605.21874 2026-05-25 cs.SD

Real-time, EDM-inspired sonification of the activity of a supercomputer

Marco Alunno, Paolo Bientinesi

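AI summary: This paper explores the informative sonification of real-time activity data from a supercomputer. It proposes an EDM-styled sonification that reflects node activity in a continuous, intelligible, and engaging way. The approach targets real-time monitoring rather than debugging and generates virtually endless, stylistically coherent music suited to long-term listening.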

Comments
7 pages, 2 figures, accepted conference paper

Abstract

The project described in this paper explores the informative sonification of data received in real time from a supercomputer. These data capture the current activities in all the nodes of the computer; their sonification therefore functions as a form of continuous monitoring of the nodes' behavior and, by extension, of the system as a whole. Because such monitoring is theoretically unending, the resulting sonification must be musically capable of conveying information through sound in a way that remains both intelligible and engaging over long durations. Rather than imposing a predefined musical style onto the data, we sought to identify one which the data themselves could plausibly support. From a small set of candidates, we selected EDM because it is a family of genres whose structural and temporal characteristics align well with continuous, data-driven processes and long-term listening. Through this style-based approach, this research builds on the long tradition of computer data sonification while uniquely combining three elements rarely addressed together: monitoring (rather than debugging) as the primary goal, real-time (rather than post-mortem) data interpretation, and generation of virtually infinite and stylistically coherent (rather than incongruous) music.

2605.21813 2026-05-25 cs.LG stat.ME stat.ML

Symbolic Density Estimation for Discrete Distributions

Ziwen Liu, Meng Li

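AI summary: This paper proposes symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions for discrete distributions. The method composes elementary analytic operations within a structured search space, combining domain-specific structural priors, evolutionary search, and a validity-aware inference stage, and extends to richer families such as zero-inflated distributions and finite mixtures. The authors also contribute a benchmark dataset covering commonly used discrete distributions and report that the algorithm recovers all benchmark families with accurate parameter estimates.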

Comments
28 pages, 5 figures, 22 tables

Abstract

Discrete probability laws underpin statistical modeling, yet the catalog of interpretable distributions has expanded only gradually through centuries of case-by-case mathematical derivations. We introduce symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions by composing elementary analytic operations within a structured search space. Our method integrates domain-specific structural priors with evolutionary search and a validity-aware inference stage, and it extends to richer distribution families such as zero inflation and finite mixtures. To support systematic evaluation and future research, we contribute a benchmark dataset spanning a broad collection of commonly used discrete distributions. The proposed algorithm recovers all benchmark families with accurate parameter estimates. A real data application shows that it identifies concise and interpretable mixture models that improve goodness-of-fit over standard models.
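
As a concrete instance of the kind of closed-form target mentioned above, the sketch below writes out the textbook zero-inflated Poisson probability mass function. It only illustrates what a recovered expression might look like; it is not the SDE search procedure, and the parameter names `lam` and `pi` are this example's own.

```python
from math import exp, factorial


def zip_pmf(k: int, lam: float, pi: float) -> float:
    """Zero-inflated Poisson: with probability pi emit a structural zero, else draw Poisson(lam)."""
    poisson = exp(-lam) * lam ** k / factorial(k)
    return pi * (k == 0) + (1.0 - pi) * poisson


# A valid PMF must be non-negative and sum to one over the support.
total = sum(zip_pmf(k, lam=2.0, pi=0.3) for k in range(60))
print(total)
```

A validity-aware method has to enforce exactly these constraints (non-negativity and unit mass) on any candidate expression it proposes.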

2605.21504 2026-05-25 q-fin.ST cs.AI

Multivariate Financial Forecasting using the Chronos Time Series Foundation Models

Sanjiv R Das, Tarang Goyal, Mohini Yadav

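AI summary: Using the open-source time-series foundation model Chronos-2, this paper evaluates pretrained time-series models for economic and financial forecasting, focusing on whether multivariate (MV) inputs improve accuracy over univariate (UV) baselines. The study covers the Magnificent-7 equities, U.S. Treasury interest rates, and a combined panel, with rolling monthly evaluations from 2000 to 2025. MV forecasts outperform UV forecasts for both interest rates and equities, and error dispersion is generally lower. Mixing series across the two markets reduces accuracy, which suggests that noisy context degrades performance; overall, foundation models can exploit cross-series information, most effectively when related series are modeled jointly under disciplined rolling protocols.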

Comments
10 pages, 3 tables, 3 figures

Abstract

Using Chronos-2, an open-source time-series foundation model, we evaluate pretrained time-series models for economic and financial forecasting with an emphasis on whether multivariate (MV) inputs improve accuracy relative to univariate (UV) baselines. The study covers two panels, the Magnificent-7 equities and U.S. Treasury interest rates, as well as a combined panel, using rolling monthly evaluations from 2000 to 2025. We vary input window lengths and forecast horizons and report RMSE and MAPE. Across datasets, MV forecasts consistently outperform UV forecasts, with especially strong gains for interest rates and meaningful improvements for equities. Series-level comparisons show MV improvements in every case, and error dispersion is generally lower under MV inputs. We also provide parameter-heatmap and time-series visualizations. However, mixing time series across equity and interest rate markets reduces forecast accuracy, indicating that adding noisy context degrades model performance. Overall, the results indicate that foundation models can leverage cross-series information to improve forecast accuracy in finance, and that the benefits are strongest when related series are modeled jointly under disciplined rolling protocols. Beyond using an open-source foundation model, this paper also showcases how AI may be used for financial research.
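
The evaluation protocol described above can be sketched in a few lines: roll a fixed-length window through the series, forecast one step ahead, and score with RMSE and MAPE. The naive last-value forecaster below is a stand-in assumption so the example runs without any model; Chronos-2 is not called, and the series is made up.

```python
from math import sqrt
from typing import Callable, Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def rolling_forecasts(series: Sequence[float], window: int,
                      forecaster: Callable[[Sequence[float]], float]):
    """One-step-ahead forecasts from a rolling window; returns (actuals, predictions)."""
    actuals, predictions = [], []
    for t in range(window, len(series)):
        predictions.append(forecaster(series[t - window:t]))  # only past values are visible
        actuals.append(series[t])
    return actuals, predictions


def last_value(history: Sequence[float]) -> float:
    """Naive placeholder forecaster: repeat the most recent observation."""
    return history[-1]


series = [100.0, 102.0, 101.0, 105.0, 107.0, 110.0]  # made-up monthly values
actuals, predictions = rolling_forecasts(series, window=3, forecaster=last_value)
print("RMSE:", rmse(actuals, predictions))
print("MAPE (%):", mape(actuals, predictions))
```

Swapping `last_value` for a univariate or multivariate model call, while keeping the window slicing unchanged, is what keeps the comparison between the two input modes fair.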

2605.21031 2026-05-25 cs.RO

Modeling and Control of a Pneumatic Morphing Soft Quadrotor based on the SOFA Framework for Dynamic Soft Robotic Simulation

F. Labra Caso, V. Sumathy, P. Ferrentino, B. Vanderborght, J. Haluska, G. Nikolakopoulos

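AI summary: This paper presents a SOFA-based finite element method for modeling and controlling a pneumatic morphing soft quadrotor. The approach keeps the physical interpretability and control structure of traditional quadrotor dynamics while capturing the complex, time-varying behavior of the pneumatically actuated soft arms. The arms are discretized in SOFA as a tetrahedral mesh with an elastic material law that models their internal forces, which enables dynamic simulation and control analysis of the pneumatically driven morphing.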

Comments
8 pages, 10 figures

Abstract

This article presents a novel SOFA-based finite element method for soft-body modeling and the corresponding dynamic simulation and control of a pneumatic morphing soft quadrotor. The proposed modeling preserves the physical interpretability and control structure of traditional quadrotor dynamics, while capturing the complex, time-varying behavior of pneumatically actuated soft arms. In SOFA, the soft pneumatically actuated arms are discretized as a tetrahedral mesh following an elastic material law that produces internal forces consistent with the real dynamic behavior of the body. Pneumatic actuation governed by both periodic and error-based control signals is applied within the internal cavities to analyze the morphing capability. Finally, a proportional-integral controller is proposed to study the controlled dynamic behavior and morphing capabilities of the pneumatic arm, wherein the pneumatic actuation to the soft arm is controlled to achieve the desired target position. The simulation results show the effectiveness of the proposed modeling framework and the related controller design.
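
The proportional-integral control law mentioned above can be sketched on a deliberately simple plant. The code below assumes a first-order lag in place of the finite element arm model, so it says nothing about SOFA or the paper's gains; the plant, the gain values, and the step size are all illustrative assumptions.

```python
def simulate_pi(target: float, kp: float = 2.0, ki: float = 1.0,
                tau: float = 1.0, dt: float = 0.01, steps: int = 2000) -> float:
    """Drive a first-order lag plant x' = (u - x) / tau toward target with a PI law; return final x."""
    x = 0.0          # plant state, e.g. a stand-in for arm tip position
    integral = 0.0   # accumulated error, removes steady-state offset
    for _ in range(steps):
        error = target - x
        integral += error * dt
        u = kp * error + ki * integral   # PI control signal, e.g. a stand-in for cavity pressure
        x += dt * (u - x) / tau          # explicit Euler step of the plant
    return x


final_position = simulate_pi(target=1.0)
print(final_position)
```

With a proportional term alone this plant would settle below the target; the integral term keeps raising the actuation until the error vanishes.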

2605.19069 2026-05-25 cs.CL cs.AI

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, Ahmad ElShiekh, Ahmed Rashad

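AI summary: This paper evaluates commercial automatic speech recognition (ASR) systems on code-switching speech across four language pairs that combine Arabic, Persian, and German with English. A two-stage selection pipeline picks 300 samples per pair, and systems are scored with BERTScore and word error rate (WER); the two metrics largely agree on system rankings but differ in how large they make the quality gaps appear. The study also exposes performance gaps among commercial ASR systems on code-switched speech and releases the dataset publicly.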

Abstract

Code-switching, the natural alternation between two languages within a single utterance, remains one of the most challenging and under-studied conditions for automatic speech recognition (ASR). We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic-English, Saudi Arabic (Najdi/Hijazi)-English, Persian (Farsi)-English, and German-English, comprising 300 samples per pair selected by a two-stage pipeline combining heuristic filtering with a GPT-4o and Gemini 1.5 Pro ensemble scorer, reducing LLM costs by approximately 91%. We evaluate on both WER and BERTScore, showing that while both metrics agree on the ordinal ranking of systems for all Arabic and Persian pairs ($\tau = 1.0$), WER inflates the magnitude of quality gaps by approximately $3\times$ by penalising semantically correct transliteration choices. ElevenLabs Scribe v2 achieves the lowest WER (13.2% overall) and leads on BERTScore (0.936 overall). Difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.
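
The transliteration penalty described above follows directly from how WER is computed. The sketch below implements the standard word-level edit distance divided by reference length; the example sentences are invented for illustration and are not drawn from the released dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference length, on whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Same greeting, written once in Persian script and once transliterated to Latin script.
print(word_error_rate("سلام how are you", "salam how are you"))  # 0.25
print(word_error_rate("how are you", "how are you"))             # 0.0
```

The first hypothesis conveys the same content as its reference, yet exact string matching counts the script difference as a full substitution, which an embedding-based metric such as BERTScore would penalise less.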

2605.18214 2026-05-25 cs.CV

EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

Rosario Leonardi, Francesco Ragusa, Daniele Materia, Alessandro Passanisi, James Fort, Jakob Engel, Giovanni Maria Farinella

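AI summary: This paper introduces EgoInteract, a controllable simulator that generates synthetic egocentric videos with dense spatial and temporal annotations, addressing the cost of collecting real data and the limited coverage of interaction patterns. The simulator gives precise control over camera, body and hand motion, object manipulation, and scene composition, and the generated data supports temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. Models trained with the simulated data improve over strong baselines on multiple real-world egocentric benchmarks.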

Abstract

Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.

2605.13930 2026-05-25 cs.LG cs.HC cs.NE

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Mosquera Storgaard, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Andreas Brink-Kjær, Sadasivan Puthusserypady, Sándor Beniczky, James Zou, Lars Kai Hansen

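AI summary: This study aims to make EEG foundation models more interpretable. It uses sparse autoencoders (SAEs) to extract sparse feature dictionaries from three architecturally distinct EEG transformers and grounds the features in a clinical taxonomy (abnormality, age, sex, and medication). A single hyperparameter procedure transfers across the three models, and a "target vs. off-target" probe area metric reveals three operational regimes of concept steering. The study also exposes critical failure cases of clinical concept interventions and uses a spectral decoder to map latent manipulations to physiologically interpretable frequency signatures.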

Comments
Preprint. 14 pages, 7 figures, 4 tables

Abstract

EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque, which is a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $\alpha$-band restoration.
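
The sparsity mechanism behind a TopK sparse autoencoder can be sketched in isolation: after the encoder produces pre-activations, only the k largest are kept and the rest are set to zero. The function below shows just that step in plain Python, under the assumption that no further nonlinearity is applied; the encoder, decoder, and training loop of the paper's SAEs are not reproduced here.

```python
def topk_activation(pre_activations: list[float], k: int) -> list[float]:
    """Keep the k largest pre-activations and zero out the rest (ties broken by lower index)."""
    # sorted() is stable, so equal values keep their original index order.
    ranked = sorted(range(len(pre_activations)), key=lambda i: -pre_activations[i])
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


latent = topk_activation([0.2, -1.0, 3.0, 0.5], k=2)
print(latent)  # [0.0, 0.0, 3.0, 0.5]
```

Fixing k directly sets how many dictionary features may fire per input, which is what makes each active feature a candidate for a single clinical concept.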

2605.07919 2026-05-25 cs.CV

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Hanqi Jiang, Junhao Chen, Mingyu Kang, Hyeokjae Kwon, Yi Pan, Lifeng Chen, Weihang You, Haozhen Gong, Ruiyu Yan, Jinglei Lv, Lin Zhao, Hui Ren, Quanzheng Li, Tianming Liu, Xiang Li

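AI summary: This paper introduces MedVIGIL, a benchmark for evaluating whether medical vision-language models (VLMs) remain trustworthy when the visual evidence fails. It tests whether a model refuses appropriately when the image or question has been perturbed rather than returning a fluent but wrong answer. MedVIGIL contains 300 radiologist-annotated cases and provides several audit metrics and a composite score, with results reported for 16 vision-capable models and two text-only baselines.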

Abstract

Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.

2605.02443 2026-05-25 cs.CL

HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

HalluScan:一个用于检测和缓解指令遵循型大语言模型幻觉的系统性基准

Ahmed Cherif

AI总结 HalluScan 是一个系统化的基准框架,用于检测和缓解遵循指令的大型语言模型中的幻觉问题。该研究提出了 HalluScore 作为新的综合评估指标,并引入了 Adaptive Detection Routing 算法以提高检测效率,同时通过错误类型分解揭示了不同领域中幻觉表现的显著差异。实验表明,NLI 验证方法在检测效果上表现最佳,为后续研究和应用提供了重要参考。

详情
Comments
38 pages, 13 figures, 10 tables. Submitted to Neural Computing and Applications
AI中文摘要

大型语言模型(LLM)在各种自然语言处理任务中展现了卓越的能力,但它们仍然容易产生幻觉——生成事实上不正确、与提供上下文不一致或与用户指令不符的内容。我们提出了HalluScan,一个全面的基准框架,系统性地评估了跨越72种配置(涵盖6种检测方法、4个开放权重模型家族和3个不同领域)的幻觉检测与缓解。我们引入了三个关键贡献:(1)HalluScore,一种新颖的复合指标,与人类专家判断的皮尔逊相关系数达到r=0.41;(2)自适应检测路由(ADR),一种智能路由算法,实现了2.0倍的成本降低,仅损失0.1%的AUROC;(3)系统错误级联分解,揭示了不同领域间幻觉错误类型的显著差异。我们的实验表明,NLI验证达到了最高的总体AUROC为0.88,而RAV达到了第二高的AUROC为0.66。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations -- generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detection Routing (ADR), an intelligent routing algorithm achieving 2.0x cost reduction with only 0.1% AUROC degradation; and (3) systematic error cascade decomposition revealing substantial variation in hallucination error types across domains. Our experiments reveal that NLI Verification achieves the highest overall AUROC of 0.88, while RAV achieves the second-highest AUROC of 0.66.

2604.26145 2026-05-25 cs.HC cs.AI

Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems

Ceci n'est pas une explication: 评估语言学习系统中作为可解释性陷阱的解释失败

Ben Knight, Wm. Matthew Kennedy, Danielle Carvalho, Isaac Pattis, James Edgell

AI总结 该研究探讨了人工智能语言学习系统中解释性失败的问题,指出这些系统提供的即时反馈可能在表面看似有帮助,但实际上存在根本性缺陷,可能加剧学习者的误解并影响学习效果。研究提出了L2-Bench基准,用于评估语言教育中的AI系统,涵盖诊断准确性、错误原因分析等多个关键反馈维度,并分析了AI在这些维度上的失效方式及其带来的“可解释性陷阱”。研究强调了语言学习场景下这些风险的特殊性,并呼吁在设计评估框架时更加关注相关问题。

详情
Comments
Accepted to Misleading Impacts Resulting from AI Generated Explanations (MIRAGE) Workshop @ IUI 2026
AI中文摘要

AI驱动的语言学习工具日益为全球数百万学习者提供即时、个性化的反馈。然而,这种反馈可能以学习者甚至教师难以察觉的方式失败,长期使用可能强化误解并侵蚀学习效果。我们提出了L2-Bench的一部分,这是一个用于评估语言教育中AI系统的基准,包括(但不限于)有效反馈的六个关键维度:诊断准确性、适当性意识、错误原因、优先级排序、改进指导和支持自我调节。我们分析了AI系统在这些维度上可能失败的方式。这些失败,我们认为会导致“可解释性陷阱”,即表面上看似有帮助但本质上有缺陷的AI生成解释,增加了成就、人机交互和社会情感危害的风险。我们讨论了语言学习的特定背景如何放大这些风险,并概述了在设计评估框架时我们认为值得更多关注的开放问题。我们的分析旨在扩展社区对可解释性陷阱类型学及其可能发生的上下文动态的理解,以鼓励AI开发者更好地设计安全、可信和有效的AI解释。

英文摘要

AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners, and even teachers, to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. We argue that these failures are conducive to "explainability pitfalls": AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.

2604.24920 2026-05-25 cs.CR cs.AI

SUDP: Secret-Use Delegation Protocol for Agentic Systems

SUDP: 面向智能体系统的秘密使用委托协议

Xiaohang Yu, Hejia Geng, Xinmeng Zeng, William Knottenbelt

AI总结 随着代理系统越来越多地使用用户秘密进行API调用、消息平台和云服务操作,现有的运行时授权机制往往通过暴露秘密或其衍生物来实现,导致潜在的安全风险。本文提出了一种名为SUDP的机密使用委托协议,旨在确保用户授权的秘密操作不被滥用,且不赋予请求者持久的访问权限。该协议通过用户授权、请求者提出操作、托管方执行有限使用的方式,满足七个关键安全属性,在结合硬件根运行时的情况下,能够在标准密码学假设下保障秘密的完整性和机密性。

详情
AI中文摘要

智能体系统越来越多地使用用户秘密来访问API、消息平台和云服务。当前的智能体运行时通常通过暴露来实现授权:启用操作通常意味着将可重复使用的秘密或由其派生的可重复使用工件放入运行时,因此瞬时的提示注入或工具侧妥协就会变成持久的账户妥协。现有的防御措施涵盖了相邻的部分,如秘密存储、范围委托、发送者约束令牌和运行时监控,但未能为组合的智能体义务提供通用规范:一个不可信的自主请求者应该能够发起用户授权的秘密支持操作,而不会获得对该操作的可重复使用权限。我们将此形式化为智能体秘密使用(ASU)问题,并确定了任何解决方案必须满足的七项安全属性,涵盖授权完整性和秘密机密性。我们提出了秘密使用委托协议(SUDP),其中请求者提出规范操作,用户使用新鲜的身份验证器支持的授权进行授权,保管人兑现授权以执行有限的使用;可重复使用的权限永远不会跨越请求者边界。我们将SUDP专门用于LLM驱动的智能体,每当工具调用会使用用户注册的授权材料时,它都适用。在标准密码学假设下,当与硬件根运行时集成时,SUDP满足所有七项属性。参考实现可在https://github.com/xhyumiracle/sudp获取。

英文摘要

Agentic systems increasingly act with user secrets for APIs, messaging platforms, and cloud services. Today's agent runtimes typically implement authorization by exposure: enabling action often means placing a reusable secret, or a reusable artifact derived from it, inside the runtime, so a transient prompt-injection or tool-side compromise becomes durable account compromise. Existing defenses cover adjacent pieces such as secret storage, scoped delegation, sender-constrained tokens, and runtime monitoring, but leave the combined agentic obligation without a common specification: an untrusted autonomous requester should be able to cause a user-authorized secret-backed operation without gaining reusable authority over it. We formalize this as the Agent Secret Use (ASU) problem and identify seven security properties any solution must satisfy, spanning authorization integrity and secret confidentiality. We propose the Secret-Use Delegation Protocol (SUDP), in which a requester proposes a canonical operation, the user authorizes it with a fresh authenticator-backed grant, and a custodian redeems the grant to perform the bounded use; reusable authority never crosses the requester boundary. We specialize SUDP for LLM-driven agents, where it applies whenever a tool call would exercise user-enrolled authority-bearing material. Under standard cryptographic assumptions, SUDP satisfies all seven properties when integrated with a hardware-rooted runtime. A reference implementation is available at https://github.com/xhyumiracle/sudp.

2604.24021 2026-05-25 cs.AI math.AP

QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

QED:一个用于生成开放问题数学证明的开源多智能体系统

Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang

AI总结 本文介绍了一个名为 QED 的开源多智能体系统,旨在无需人工干预即可将人类提出的研究问题转化为完整的数学证明。该系统通过分离规划、证明和验证三个阶段,有效克服了单一查询证明生成的常见缺陷,其中分解代理负责结构规划,证明代理生成候选论证,验证代理检查正确性。在与领域专家合作的评估中,QED 在 18 个不同难度的研究项目上表现出色,成功生成了五项原创性研究成果,其中三项被认为具有与主流数学期刊相当的深度和广度。

详情
AI中文摘要

我们提出QED,一个开源的多智能体系统,它能够将人类提供的研究问题转化为完整的数学证明,无需进一步的人类指导。其流水线旨在通过分离规划、证明和验证来克服单次查询证明生成的常见失败:分解智能体结构化证明搜索,证明智能体生成候选论证,验证智能体检查正确性。与领域专家合作,我们在18个不同难度的研究级项目上评估了QED。QED在代数几何、流体偏微分方程、概率和反问题领域产生了五篇原创工作。专家评估认为这些工作是扎实的专业研究贡献,其中三篇在难度和范围上与常见于成熟专业数学场所发表的工作相当。QED发布于https://github.com/proofQED/QED。

英文摘要

We present QED, an open-source multi-agent system that turns human-provided research questions into complete mathematical proofs without further human guidance. Its pipeline is designed to overcome common failures of single-query proof generation by separating planning, proving, and verification: a decomposition agent structures the proof search, prover agents generate candidate arguments, and verifier agents check correctness. In collaboration with domain experts, we evaluated QED on 18 research-level projects of varying difficulty. QED produced five original works across algebraic geometry, fluid PDEs, probability, and inverse problems. Expert assessments regard these works as solid specialized research contributions, with three comparable in difficulty and scope to work commonly published in established specialist mathematics venues. QED is released at https://github.com/proofQED/QED.

2604.21502 2026-05-25 cs.CV

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

VFM$^{4}$SDG:揭示VFM在单域广义目标检测中的力量

Yupeng Zhang, Ruize Han, Ningnan Guo, Wei Feng, Song Wang, Liang Wan

AI总结 该研究针对单域通用目标检测(SDGOD)中因环境变化导致的性能下降问题,提出了一种基于视觉基础模型(VFM)的新型框架VFM$^{4}$SDG。通过分析发现,检测器在跨域场景下的性能下降主要源于关系结构的不稳定,而VFM在严重域偏移下仍能保持稳定的关系和物体响应,因此被用作跨域稳定性先验。该方法通过引入冻结的VFM,分别在编码器和解码器中进行关系先验蒸馏和语义-上下文查询增强,有效提升了检测器的跨域鲁棒性,并在多个基准测试中取得了显著优势。

详情
AI中文摘要

现实世界中的天气、光照和成像变化常常引起严重的域偏移,导致单源检测器在未见环境中性能下降。现有的单域广义目标检测(SDGOD)方法主要依赖于数据增强或域不变学习,而很大程度上忽略了域偏移如何破坏检测器的预测稳定性。通过分析实验,我们发现性能下降主要由漏检增加主导。进一步分析表明,这一现象源于DETR风格检测器的跨域稳定性降低:域偏移破坏了编码器侧的物体-背景和实例间关系,并进一步削弱了解码器查询与真实物体之间的语义-空间绑定。受此启发,我们发现视觉基础模型(VFM)在严重偏移下仍能保持稳定的关系结构和物体响应,使其成为补偿检测器退化的合适跨域稳定性先验。为此,我们提出了VFM$^{4}$SDG,一个用于SDGOD的双先验学习框架,它将冻结的VFM引入编码器表示学习和解码器查询建模。具体来说,我们提出了跨域稳定关系先验蒸馏,将VFM中的稳定物体-背景和实例间关系蒸馏到编码器中,补偿关系退化。同时,我们提出了基于语义-上下文先验的查询增强,在查询进入解码器层之前注入类别语义原型和全局物体上下文,增强语义-空间查询-物体绑定稳定性。大量实验表明,VFM$^{4}$SDG在标准SDGOD基准和两个主流基于DETR的检测框架上显著优于现有先进方法,证明了其有效性、鲁棒性和泛化性。

英文摘要

Real-world weather, illumination, and imaging variations often induce severe domain shifts, degrading single-source detectors in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant learning, while largely overlooking how domain shift disrupts detector prediction stability. Through analytical experiments, we find that performance degradation is mainly dominated by increasing missed detections. Further analysis shows that this phenomenon stems from reduced cross-domain stability in DETR-style detectors: domain shift disrupts encoder-side object-background and inter-instance relations, and further weakens the semantic-spatial binding between decoder queries and real objects. Motivated by this, we find that vision foundation models (VFMs) still preserve stable relational structures and object responses under severe shifts, making them suitable cross-domain stability priors to compensate for detector degradation. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen VFM into encoder representation learning and decoder query modeling. Specifically, we propose Cross-domain Stable Relational Prior Distillation to distill stable object-background and inter-instance relations from the VFM into the encoder, compensating for relational degradation. Meanwhile, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category semantic prototypes and global object context into queries before they enter the decoder layer, enhancing semantic-spatial query-object binding stability. Extensive experiments show that VFM$^{4}$SDG significantly outperforms existing advanced methods on standard SDGOD benchmarks and two mainstream DETR-based detection frameworks, demonstrating its effectiveness, robustness, and generality.

2604.17134 2026-05-25 cs.CL

RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian

RoIt-XMASA:面向罗马尼亚语和意大利语的多领域多语言情感分析数据集

Andrei-Marius Avram, Aureliu Valentin Antonie, Cosmin-Mircea Croitoru, Vlad Andrei Muntean, Dumitru-Clementin Cercel

AI总结 本文介绍了 RoIt-XMASA,一个扩展了跨语言多领域亚马逊情感分析数据集的多语言数据集,涵盖意大利语和罗马尼亚语,包含36,000条标注评论和202,141条未标注样本。为应对跨语言和跨领域挑战,研究提出了一种多目标对抗训练框架,通过损失反转和元学习系数动态平衡情感判别与领域和语言不变性。实验表明,该方法在XLM-R模型上实现了66.23%的F1分数,优于基线4.64%,并在少样本设置下展示了任务微调与提示方法之间的性能权衡。

详情
Comments
Accepted at the International AAAI Conference on Web and Social Media (ICWSM 2026)
AI中文摘要

我们提出了RoIt-XMASA,一个多语言数据集,它将跨语言多领域亚马逊情感分析扩展到意大利语和罗马尼亚语,包含36,000条跨三个领域(书籍、电影和音乐)的标注评论和202,141条未标注样本。为了解决跨语言和跨领域的挑战,我们提出了一种多目标对抗训练框架,该框架采用带有元学习系数的损失反转,以动态平衡情感判别与领域和语言不变性。使用我们的方法,XLM-R达到了66.23%的F1分数,比基线高出4.64%。少样本评估显示,Llama-3.1-8B达到了58.43%的F1分数,揭示了基于提示的方法的效率与任务特定微调的更高性能之间的有意义权衡。

英文摘要

We present RoIt-XMASA, a multilingual dataset that extends the Cross-lingual Multi-domain Amazon Sentiment Analysis to Italian and Romanian, comprising 36,000 labeled reviews across three domains (books, movies, and music) and 202,141 unlabeled samples. To address cross-lingual and cross-domain challenges, we propose a multi-target adversarial training framework that employs loss reversal with meta-learned coefficients to dynamically balance sentiment discrimination with domain and language invariance. XLM-R achieves an F1-score of 66.23% with our approach, outperforming the baseline by 4.64%. Few-shot evaluation shows that Llama-3.1-8B achieves 58.43% F1-score, revealing a meaningful trade-off between the efficiency of prompting-based approaches and the higher performance of task-specific fine-tuning.

2604.11759 2026-05-25 cs.AI

Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

检索是不够的:为什么组织AI需要认知基础设施

Federico Bottino, Carlo Ferrero, Nicholas Dosio, Pierfrancesco Beneventano

AI总结 本文指出,当前组织中AI使用的知识通常缺乏认知结构,仅依赖检索无法准确区分决策、假设、争议和未知问题等不同知识状态。为此,研究提出了OIDA框架,通过引入知识对象、重要性评分和矛盾关系等机制,构建具有认知一致性的知识表示系统,并引入“问题”作为组织未知的建模方式,提升AI对组织认知状态的理解能力。实验表明,OIDA在保持知识质量方面具有显著优势,并验证了其核心机制的有效性。

详情
Comments
10 pages, 2 figures, 8 tables, 6 appendices
AI中文摘要

AI代理使用的组织知识通常缺乏认知结构:检索系统会呈现语义相关的内容,而不区分约束性决策与放弃的假设、有争议的主张与已解决的问题、已知事实与未解决的问题。我们认为,组织AI的上限不是检索保真度,而是认知保真度——即系统将承诺强度、矛盾状态和组织无知表示为可计算属性的能力。我们提出了OIDA,这是一个框架,将组织知识结构化为类型化的知识对象,这些对象带有认知类别、具有类别特定衰减的重要性分数以及带符号的矛盾边。知识重力引擎以确定性方式维护分数,并具有经过证明的收敛保证(充分条件:最大度数<7;经验上对度数为43的情况鲁棒)。OIDA引入了“问题”作为模型化的无知:一种具有反向衰减的原语,以越来越紧迫的方式揭示组织不知道什么——这是所有被调查系统中缺失的机制。我们描述了认知质量评分(EQS),一种包含五个组成部分的评估方法,并带有明确的循环性分析。在受控比较(n=10个响应对)中,OIDA的RAG条件(3,868个令牌)达到EQS 0.530,而全上下文基线(108,687个令牌)为0.848;28.1倍的令牌预算差异是主要的混淆因素。问题机制在统计上得到验证(Fisher p=0.0325,OR=21.0)。形式化属性已建立;在相等令牌预算下的决定性消融实验(E4)已预注册但尚未运行。

英文摘要

Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. We argue that the ceiling on organizational AI is not retrieval fidelity but epistemic fidelity: the system's ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties. We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proved convergence guarantees (sufficient condition: max degree $< 7$; empirically robust to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does not know with increasing urgency, a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with explicit circularity analysis. In a controlled comparison ($n = 10$ response pairs), OIDA's RAG condition (3,868 tokens) achieves EQS 0.530 vs. 0.848 for a full-context baseline (108,687 tokens); the $28.1\times$ token budget difference is the primary confound. The QUESTION mechanism is statistically validated (Fisher $p = 0.0325$, OR $= 21.0$). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run.
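
The token-budget confound quoted in this abstract can be checked with simple arithmetic. The snippet below only divides the two token counts reported above; the variable names are illustrative.

```python
# Token counts as reported in the abstract.
rag_tokens = 3_868
full_context_tokens = 108_687

# How many times larger the full-context budget is than the RAG budget.
ratio = full_context_tokens / rag_tokens
print(f"token budget ratio: {ratio:.1f}x")
```

The result rounds to the 28.1x figure given in the abstract, so the EQS gap is measured across conditions whose context sizes differ by more than an order of magnitude.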

2604.10077 2026-05-25 cs.CV

DocRevive: A Unified Pipeline for Document Text Restoration

DocRevive:文档文本恢复的统一流水线

Kunal Purkayastha, Ayan Banerjee, Josep Llados, Umapada Pal

AI总结 DocRevive 是一种统一的文档文本修复管道,旨在解决损坏、遮挡或不完整文本的重建问题。该方法结合了先进的OCR、图像分析、掩码语言模型和扩散模型,实现了在保持视觉完整性的同时进行语义连贯的文本修复。研究还构建了一个包含30,078张退化文档图像的合成数据集,并提出了一种综合上下文相似度度量指标,以评估修复质量,为文档修复任务设立了新的基准。

详情
AI中文摘要

在文档理解中,重建受损、遮挡或不完整文本的挑战仍然是一个关键但未充分探索的问题。后续的文档理解任务可以受益于文档重建过程。为此,本文提出了一种新颖的统一流水线,结合了最先进的光学字符识别(OCR)、高级图像分析、掩码语言建模和基于扩散的模型,以在保持视觉完整性的同时恢复和重建文本。我们创建了一个包含30,078张退化文档图像的合成数据集,模拟了多种文档退化场景,为恢复任务设定了基准。我们的流水线检测并识别文本,通过遮挡检测器识别退化,并使用修复模型进行语义连贯的重建。基于扩散的模块无缝地重新整合文本,匹配字体、大小和对齐方式。为了评估恢复质量,我们提出了统一上下文相似度度量(UCSM),结合了编辑相似度、语义相似度和长度相似度,并引入上下文可预测性度量,当正确文本在上下文中显而易见时,对偏差进行惩罚。我们的工作推进了文档恢复,有利于档案研究和数字保存,同时为文本重建设立了新标准。OPRB数据集和代码分别可在Hugging Face(https://huggingface.co/datasets/kpurkayastha/OPRB)和Github(https://github.com/kunalpurkayastha/DocRevive)上获取。

英文摘要

In Document Understanding, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet unexplored problem. Subsequent document understanding tasks can benefit from a document reconstruction process. In response, this paper presents a novel unified pipeline combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse document degradation scenarios, setting a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module seamlessly reintegrates text, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available on Hugging Face (https://huggingface.co/datasets/kpurkayastha/OPRB) and GitHub (https://github.com/kunalpurkayastha/DocRevive), respectively.

2603.28767 2026-05-25 cs.CV

Gen-Searcher: Reinforcing Agentic Search for Image Generation

Gen-Searcher: 强化搜索代理用于图像生成

Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue

AI总结 本文提出Gen-Searcher,首个结合搜索增强的图像生成智能体,旨在解决现有模型因内部知识固化而在知识密集型或需最新信息的现实场景中表现不佳的问题。该方法通过多跳推理与搜索获取生成所需的文字知识和参考图像,并构建了两个高质量数据集及一个综合性基准KnowGen用于评估模型性能。实验表明,Gen-Searcher在多个指标上显著优于现有模型,为基于搜索的图像生成智能体研究提供了开放基础。

详情
Comments
Project page: https://gen-searcher.vercel.app Code: https://github.com/tulerfeng/Gen-Searcher
AI中文摘要

最近的图像生成模型在生成高保真度和逼真图像方面表现出强大能力。然而,它们从根本上受限于冻结的内部知识,因此在需要知识密集型或最新信息的现实场景中常常失败。在本文中,我们提出Gen-Searcher,作为训练搜索增强图像生成代理的首次尝试,该代理执行多跳推理和搜索,以收集基于文本的知识和参考图像,用于接地生成。为实现这一目标,我们构建了一个定制数据管道,并策划了两个高质量数据集:Gen-Searcher-SFT-10k和Gen-Searcher-RL-6k,包含多样化的搜索密集型提示和对应的真实合成图像。我们进一步引入了KnowGen,一个综合基准,明确要求搜索接地外部知识用于图像生成,并从多个维度评估模型。基于这些资源,我们使用SFT训练Gen-Searcher,随后进行具有双重奖励反馈的代理强化学习,该奖励结合了基于文本和基于图像的奖励,为GRPO训练提供更稳定和信息丰富的学习信号。实验表明,Gen-Searcher带来了显著提升,在KnowGen上使Qwen-Image提高了约16分,在WISE上提高了15分。我们希望这项工作能够作为图像生成中搜索代理的开放基础,并完全开源我们的数据、模型和代码。

英文摘要

Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.

2603.23565 2026-05-25 cs.LG cs.AI

Safe Reinforcement Learning with Preference-based Constraint Inference

基于偏好的约束推断的安全强化学习

Chenglin Li, Grant Ruan, Hua Geng

AI总结 本文研究了安全强化学习中如何从人类偏好中高效且可靠地学习复杂的安全约束。针对现有方法依赖专家演示或限制性假设的问题,提出了一种基于偏好的约束强化学习框架(PbCRL),通过引入死区机制和信噪比损失,提升了对安全成本分布的建模能力,并优化了策略学习过程。实验表明,该方法在满足安全约束和提升奖励方面优于现有先进方法,为安全关键场景中的约束推理提供了有效解决方案。

详情
Comments
Accepted by the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

安全强化学习(RL)是安全关键决策的标准范式。然而,现实世界中的安全约束可能复杂、主观,甚至难以明确指定。现有的约束推断工作依赖于限制性假设或大量的专家演示,这在许多实际应用中并不现实。如何廉价且可靠地学习这些约束是我们本研究关注的主要挑战。虽然从人类偏好中推断约束提供了一种数据高效的替代方案,但我们发现流行的Bradley-Terry(BT)模型未能捕捉安全成本的非对称、重尾特性,导致风险低估。在文献中,理解BT模型对下游策略学习的影响仍然很少。为了解决上述知识空白,我们提出了一种新颖的方法,即基于偏好的约束强化学习(PbCRL)。我们在偏好建模中引入了一种新颖的死区机制,并从理论上证明它鼓励重尾成本分布,从而实现更好的约束对齐。此外,我们引入了信噪比(SNR)损失,通过成本方差鼓励探索,这被发现有利于策略学习。进一步,采用两阶段训练策略以降低在线标注负担,同时自适应地增强约束满足。实验结果表明,PbCRL实现了与真实安全要求的优越对齐,并在安全性和奖励方面优于最先进的基线。我们的工作为安全RL中的约束推断探索了一种有前景且有效的方法,在各种安全关键应用中具有巨大潜力。

英文摘要

Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and even hard to specify explicitly. Existing works on constraint inference rely on restrictive assumptions or extensive expert demonstrations, which are not realistic in many real-world applications. How to learn these constraints cheaply and reliably is the major challenge we focus on in this study. While inferring constraints from human preferences offers a data-efficient alternative, we find that popular Bradley-Terry (BT) models fail to capture the asymmetric, heavy-tailed nature of safety costs, resulting in risk underestimation. The impact of BT models on downstream policy learning also remains little studied in the literature. To address these knowledge gaps, we propose a novel approach, Preference-based Constrained Reinforcement Learning (PbCRL). We introduce a novel dead zone mechanism into preference modeling and theoretically prove that it encourages heavy-tailed cost distributions, thereby achieving better constraint alignment. Additionally, we incorporate a Signal-to-Noise Ratio (SNR) loss to encourage exploration by cost variances, which is found to benefit policy learning. Further, a two-stage training strategy is deployed to lower online labeling burdens while adaptively enhancing constraint satisfaction. Empirical results demonstrate that PbCRL achieves superior alignment with true safety requirements and outperforms state-of-the-art baselines in terms of safety and reward. Our work explores a promising and effective way for constraint inference in Safe RL, with great potential in various safety-critical applications.
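
For readers unfamiliar with the Bradley-Terry (BT) model that this abstract critiques, the sketch below shows the standard BT preference likelihood applied to scalar safety costs. It is a baseline illustration only: the function names are hypothetical, the convention that the lower-cost trajectory is preferred is an assumption, and the paper's dead zone mechanism is not reproduced because the abstract does not specify its form.

```python
import math

def bt_preference_prob(cost_a, cost_b):
    """P(trajectory A is judged safer than B) under a Bradley-Terry model.

    Assumes lower cost is preferred, so the logit is cost_b - cost_a.
    """
    return 1.0 / (1.0 + math.exp(-(cost_b - cost_a)))

def bt_nll(cost_a, cost_b, a_preferred):
    """Negative log-likelihood of one human preference label."""
    p = bt_preference_prob(cost_a, cost_b)
    return -math.log(p if a_preferred else 1.0 - p)

# Equal costs give an indifferent prediction; a large gap gives a confident one.
p_equal = bt_preference_prob(1.0, 1.0)
p_gap = bt_preference_prob(0.0, 4.0)
loss_agree = bt_nll(0.0, 4.0, a_preferred=True)
loss_disagree = bt_nll(0.0, 4.0, a_preferred=False)
```

Because the likelihood depends only on the cost difference passed through a symmetric sigmoid, it treats over- and under-estimation of cost identically, which is the property the abstract argues is a poor fit for asymmetric, heavy-tailed safety costs.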

2603.21880 2026-05-25 cs.RO

Optimal Solutions for the Moving Target Vehicle Routing Problem with Obstacles via Lazy Branch and Price

带障碍物的移动目标车辆路径问题的最优解:懒惰分支定价法

Anoop Bhat, Geordan Gutow, Surya Singh, Zhongqiang Ren, Sivakumar Rathinam, Howie Choset

AI总结 本文研究了存在障碍物的移动目标车辆路径规划问题(MT-VRP-O),旨在为多个代理规划路径以拦截移动目标,同时满足时间窗口、速度限制和容量约束。为此,作者提出了一种基于延迟分支定价的优化方法Lazy BPRC,通过在分支定价框架中使用放松连续性约束的运动规划技术,有效降低了计算成本,并在保证最优解的前提下显著提升了求解效率。

详情
AI中文摘要

带障碍物的移动目标车辆路径问题(MT-VRP-O)旨在为多个智能体寻找轨迹,使其共同拦截一组移动目标。每个目标有一个或多个必须被访问的时间窗口,智能体必须避开静态障碍物并满足速度和容量约束。我们引入了具有松弛连续性的懒惰分支定价法(Lazy BPRC),为MT-VRP-O找到最优解。Lazy BPRC应用了VRP的分支定价框架,该框架在受限主问题(RMP)和定价问题之间交替。RMP旨在从有限的路径子集中为每个智能体选择一系列目标-时间窗口配对(称为路径)来执行。定价问题将路径添加到有限子集中。传统上,求解RMP需要计算每个智能体遵循有限子集中每条路径的成本。在MT-VRP-O中计算这些成本是计算密集型的,因为它需要在移动目标之间进行无碰撞运动规划。Lazy BPRC通过使用每条路径成本的下界来求解RMP,从而推迟成本计算,这些下界是通过具有松弛连续性约束的运动规划计算得出的。我们根据需要懒惰地评估路径的真实成本。我们通过在凸集图(GCS)上搜索最短路径来计算路径成本,并使用我们的连续性松弛方法加速搜索。我们证明,Lazy BPRC的运行速度比两种消融方法快一个数量级。

英文摘要

The Moving Target Vehicle Routing Problem with Obstacles (MT-VRP-O) seeks trajectories for several agents that collectively intercept a set of moving targets. Each target has one or more time windows where it must be visited, and the agents must avoid static obstacles and satisfy speed and capacity constraints. We introduce Lazy Branch-and-Price with Relaxed Continuity (Lazy BPRC), which finds optimal solutions for the MT-VRP-O. Lazy BPRC applies the branch-and-price framework for VRPs, which alternates between a restricted master problem (RMP) and a pricing problem. The RMP aims to select a sequence of target-time window pairings (called a tour) for each agent to follow, from a limited subset of tours. The pricing problem adds tours to the limited subset. Conventionally, solving the RMP requires computing the cost for an agent to follow each tour in the limited subset. Computing these costs in the MT-VRP-O is computationally intensive, since it requires collision-free motion planning between moving targets. Lazy BPRC defers cost computations by solving the RMP using lower bounds on the costs of each tour, computed via motion planning with relaxed continuity constraints. We lazily evaluate the true costs of tours as-needed. We compute a tour's cost by searching for a shortest path on a Graph of Convex Sets (GCS), and we accelerate this search using our continuity relaxation method. We demonstrate that Lazy BPRC runs up to an order of magnitude faster than two ablations.
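
The idea of deferring expensive cost computations behind cheap lower bounds can be shown in a much simpler setting than branch-and-price. The sketch below is a generic lazy-evaluation pattern, not Lazy BPRC itself: it finds the cheapest of several candidate tours while calling an expensive cost function as rarely as possible. The function `lazy_argmin` and the toy numbers are hypothetical, and the only requirement is that each lower bound never exceeds the true cost.

```python
import heapq

def lazy_argmin(lower_bounds, true_cost):
    """Return (index, cost, n_evaluated) for the candidate with minimum true cost.

    Requires lower_bounds[i] <= true_cost(i) for every i.
    """
    # Heap entries are (key, index, is_exact); keys start as lower bounds.
    heap = [(lb, i, False) for i, lb in enumerate(lower_bounds)]
    heapq.heapify(heap)
    evaluated = 0
    while heap:
        key, i, exact = heapq.heappop(heap)
        if exact:
            # Every remaining key is a bound or exact cost >= key, so i is optimal.
            return i, key, evaluated
        # Only now pay for the expensive evaluation, then re-queue with the true cost.
        evaluated += 1
        heapq.heappush(heap, (true_cost(i), i, True))
    raise ValueError("no candidates")

# Toy data: cheap lower bounds and the (pretend-expensive) true costs.
bounds = [1.0, 4.0, 2.0, 9.0]
costs = [5.0, 4.5, 3.0, 9.5]
result = lazy_argmin(bounds, lambda i: costs[i])
```

In this toy run only two of the four true costs are ever computed, because the candidates with bounds 4.0 and 9.0 cannot beat an exact cost of 3.0. The abstract's method applies the same principle inside the restricted master problem, where each true cost requires collision-free motion planning.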

2603.07615 2026-05-25 cs.LG cs.CV

Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

压缩即适应:基于扩散基础模型的隐式视觉表示

Zongyu Guo, Jiajun He, Zhaoyang Jia, Xiaoyi Zhang, Jiahao Li, Xiao Li, Bin Li, José Miguel Hernández-Lobato, Yan Lu

AI总结 本文提出了一种将视觉信号编码为函数的新表示框架,通过低秩适配参数附着在冻结的视觉生成模型上,从而实现对视觉内容的隐式表示。该方法能够将例如81帧视频的信号压缩为一个紧凑的向量,在极低比特率下实现高质量的感知视频压缩。此外,该函数式表示支持推理时的扩展与控制,提升了压缩性能,并为视觉压缩与生成提供了一个统一的框架。

详情
Comments
ICML 2026
AI中文摘要

现代视觉生成模型通过大规模训练获得丰富的视觉知识,但现有的视觉表示(如像素、潜变量或标记)仍独立于模型,无法直接利用这些知识进行紧凑存储或重用。在这项工作中,我们引入了一种新的视觉表示框架,将信号编码为一个函数,该函数通过附加在冻结的视觉生成模型上的低秩适应参数进行参数化。这种视觉信号的隐式表示,例如一个81帧的视频,可以进一步哈希成一个紧凑的向量,在极低比特率下实现强感知视频压缩。除了基本压缩外,这种表示的函数性质使得推理时缩放和控制成为可能,从而在压缩性能上实现额外优化。更广泛地说,由于隐式表示直接作为生成过程的函数,这提出了一个统一视觉压缩与生成的框架。

英文摘要

Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations of visual signals, e.g., an 81-frame video, can further be hashed into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing additional refinement on the compression performance. More broadly, as the implicit representations directly act as a function of the generation process, this suggests a unified framework bridging visual compression and generation.
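
To make "low-rank adaptations attached to a frozen model" concrete, the sketch below applies a generic low-rank update to a single frozen linear layer and counts the parameters that would need to be stored. It is a minimal sketch of the general low-rank adaptation idea, not the paper's pipeline: the layer size, rank, initialisation, and the name `lora_forward` are assumptions, and the hashing step mentioned in the abstract is not modelled.

```python
import numpy as np

def lora_forward(x, W_frozen, A, B, scale=1.0):
    """Frozen linear map plus a low-rank update: x @ (W_frozen + scale * A @ B)."""
    return x @ W_frozen + scale * (x @ A) @ B

# Hypothetical sizes for a single layer.
rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4
W_frozen = rng.normal(size=(d_in, d_out))   # never updated
A = rng.normal(size=(d_in, rank)) * 0.01    # trainable
B = np.zeros((rank, d_out))                 # trainable, zero-initialised

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W_frozen, A, B)

# Only A and B would have to be stored to represent the signal.
full_params = d_in * d_out
adapter_params = rank * (d_in + d_out)
```

With `B` initialised to zero the adapted layer starts out identical to the frozen one, and for these sizes the adapter holds 512 parameters against 4,096 in the full weight matrix, which is the source of the compactness the abstract relies on.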

2603.07079 2026-05-25 cs.LG cs.CL

Entropy-Aware On-Policy Distillation of Language Models

熵感知的在线策略蒸馏语言模型

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee

AI总结 本文研究了如何在语言模型的策略对齐过程中保持生成多样性并提高知识迁移效果。针对现有基于反向KL散度的策略蒸馏方法在教师模型熵较高时可能导致生成多样性下降的问题,提出了一种熵感知的策略蒸馏方法,在教师模型熵较高时引入正向KL散度,从而在保持精确模仿的同时增强对多模态输出的覆盖能力。实验表明,该方法在多个数学推理任务中显著提升了学生模型的性能,验证了其在保持生成多样性与提升对齐效果方面的有效性。

详情
Comments
18 pages, 11 figures, ICML 2026
AI中文摘要

在线策略蒸馏是一种有前景的语言模型知识迁移方法,学生模型沿着自身轨迹从密集的token级信号中学习。该框架通常使用反向KL散度,鼓励学生匹配教师的高置信度预测。然而,我们表明反向KL的模式寻求特性会降低生成多样性,并在教师分布具有高熵时产生不稳定的学习信号。为解决此问题,我们引入了熵感知的在线策略蒸馏。我们的关键思想是在教师熵高时,用前向KL增强标准的反向KL目标,以捕获全部合理输出范围,同时在其他地方保留精确模仿。它在不牺牲在线策略训练效率的情况下,平衡了模式寻求的精确性与模式覆盖的鲁棒性。实验表明,我们的方法保持了生成多样性(持续的token级熵),并改善了学生-教师对齐(在高熵token上降低前向KL)。在六个数学推理基准上,与基线在线策略蒸馏方法相比,Qwen3-0.6B-Base的Pass@8准确率提升+1.37,Qwen3-1.7B-Base提升+2.39,Qwen3-4B-Base提升+5.05。这些结果表明,考虑教师不确定性对于保持多样性和实现有效知识迁移至关重要。

英文摘要

On-policy distillation is a promising approach for transferring knowledge between language models, where a student learns from dense token-level signals along its own trajectories. This framework typically uses reverse KL divergence, encouraging the student to match the teacher's high-confidence predictions. However, we show that the mode-seeking property of reverse KL reduces generation diversity and yields unstable learning signals when the teacher distribution has high entropy. To address this, we introduce Entropy-Aware On-Policy Distillation. Our key idea is augmenting the standard reverse KL objective with forward KL when teacher entropy is high, capturing the full range of plausible outputs while retaining precise imitation elsewhere. It balances mode-seeking precision with mode-covering robustness without sacrificing on-policy training efficiency. Experiments show that our method maintains generation diversity (sustained token-level entropy) and improves student-teacher alignment (lower forward KL on high-entropy tokens). Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods. These results demonstrate that accounting for teacher uncertainty is essential for maintaining diversity and achieving effective knowledge transfer.
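
The key idea stated in this abstract, adding forward KL to the reverse KL objective only where the teacher is uncertain, can be sketched for a single token position. This is an illustration under stated assumptions rather than the authors' loss: the hard entropy threshold, the `fkl_weight` parameter, and all function names are hypothetical, and the abstract does not say how the switch is actually implemented.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as 1-D arrays."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution, in nats."""
    return float(-np.sum(p * np.log(p + eps)))

def entropy_aware_token_loss(student, teacher, entropy_threshold, fkl_weight=1.0):
    # Reverse KL, KL(student || teacher): mode-seeking imitation.
    loss = kl(student, teacher)
    # When the teacher is uncertain, add forward KL, KL(teacher || student),
    # which penalises the student for missing plausible teacher outputs.
    if entropy(teacher) > entropy_threshold:
        loss += fkl_weight * kl(teacher, student)
    return loss

student = np.array([0.7, 0.1, 0.1, 0.1])
peaked_teacher = np.array([0.97, 0.01, 0.01, 0.01])   # low entropy
flat_teacher = np.array([0.25, 0.25, 0.25, 0.25])     # high entropy (ln 4)

loss_peaked = entropy_aware_token_loss(student, peaked_teacher, entropy_threshold=1.0)
loss_flat = entropy_aware_token_loss(student, flat_teacher, entropy_threshold=1.0)
```

For the peaked teacher the loss reduces to plain reverse KL, while for the flat teacher the extra forward KL term is active, matching the abstract's description of precise imitation at confident tokens and broader coverage at uncertain ones.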

2602.15602 2026-05-25 cs.LG stat.ML

Certified Per-Instance Unlearning Using Individual Sensitivity Bounds

使用个体灵敏度界限的认证逐实例遗忘

Hanna Benarroch, Jamal Atif, Olivier Cappé

AI总结 本文研究了如何通过个体敏感度界限实现有保证的逐实例模型遗忘。不同于传统的基于最坏情况敏感度的噪声注入方法,作者提出了一种针对每个数据点贡献进行自适应噪声校准的新方法,从而减少噪声注入量并提升模型性能。在岭回归和深度学习实验中验证了该方法的有效性,证明其在保证遗忘认证的同时能够显著降低噪声影响。

详情
Abstract

Certified machine unlearning can be achieved via noise injection leading to differential privacy guarantees, where noise is calibrated to worst-case sensitivity. Such conservative calibration often results in performance degradation, limiting practical applicability. In this work, we investigate an alternative approach based on adaptive per-instance noise calibration tailored to the individual contribution of each data point to the learned solution. This raises the following challenge: how can one establish formal unlearning guarantees when the mechanism depends on the specific point to be removed? To define individual data point sensitivities in noisy gradient dynamics, we consider the use of per-instance differential privacy. For ridge regression trained via Langevin dynamics, we derive high-probability per-instance sensitivity bounds, yielding certified unlearning with substantially less noise injection. We corroborate our theoretical findings through experiments in linear settings and provide further empirical evidence on the relevance of the approach in deep learning settings.
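
The gap between worst-case and per-instance calibration can be illustrated with a small sketch. It is not the paper's method: it uses closed-form ridge regression rather than Langevin dynamics, measures each point's sensitivity by an exact leave-one-out refit rather than a high-probability bound, and takes the largest sensitivity observed in the dataset as a stand-in for the true worst case. The noise scale uses the classical Gaussian mechanism formula, valid for epsilon in (0, 1). All function names and parameter values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def leave_one_out_sensitivities(X, y, lam):
    # Distance between the full solution and the solution without point i.
    w_full = ridge_fit(X, y, lam)
    n = X.shape[0]
    sens = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        sens[i] = np.linalg.norm(w_full - ridge_fit(X[keep], y[keep], lam))
    return sens

def gaussian_noise_scale(sensitivity, epsilon, delta):
    # Classical Gaussian mechanism calibration (assumes 0 < epsilon < 1).
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

sens = leave_one_out_sensitivities(X, y, lam=1.0)
sigma_worst = gaussian_noise_scale(sens.max(), epsilon=0.5, delta=1e-5)
sigma_each = gaussian_noise_scale(sens, epsilon=0.5, delta=1e-5)
print("worst-case scale:", sigma_worst)
print("median per-instance scale:", np.median(sigma_each))
```

Calibrating noise to the specific point being removed makes the mechanism data-dependent, which is exactly the difficulty the abstract raises; this sketch only shows the potential saving in noise, not a valid certificate.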

2602.12534 2026-05-25 stat.ML cs.DS cs.LG math.ST stat.TH

Linear Regression with Unknown Truncation Beyond Gaussian Features

Alexandros Kouridakis, Anay Mehrotra, Alkis Kalavasis, Constantine Caramanis

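AI summary: This paper studies how to efficiently estimate the unknown regression parameters in truncated linear regression when the survival set of the response variable is unknown. Unlike prior work that relies on a known survival set or on strong assumptions (such as Gaussian features), it proposes an algorithm that only requires the feature vectors to be sub-Gaussian and runs in polynomial time, a marked improvement in computational efficiency. The core of the method is a new subroutine that efficiently learns unions of a bounded number of intervals from positive examples only, under a smoothness condition, a result that may be of independent interest.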

Abstract

In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$ and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly} (1/\varepsilon)}$ run time for achieving $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with unknown survival set that runs in $\mathrm{poly} (d/\varepsilon)$ time, by only requiring that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of works on positive-only PAC learning and may be of independent interest.
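
The data model in this abstract can be simulated in a few lines. The sketch below does not implement the paper's algorithm; it only generates samples, keeps those whose outcome lands in a survival set made of two intervals, and compares naive least squares on the full and truncated samples. The intervals, the value of `w_star`, the sample size, and the use of Gaussian features (one instance of sub-Gaussian features) are arbitrary choices for illustration.

```python
import numpy as np

def in_survival_set(y, intervals):
    # True where y lies in the union of the closed intervals [lo, hi].
    mask = np.zeros(y.shape, dtype=bool)
    for lo, hi in intervals:
        mask |= (y >= lo) & (y <= hi)
    return mask

def ols_with_intercept(X, y):
    # Ordinary least squares with an intercept; returns the slope vector only.
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]

rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0, 0.5])          # hypothetical true regressor
X = rng.normal(size=(20000, 3))
y = X @ w_star + rng.normal(size=20000)

survival_intervals = [(-1.0, 0.5), (2.0, 4.0)]  # hypothetical S*, unknown to the learner
keep = in_survival_set(y, survival_intervals)
X_obs, y_obs = X[keep], y[keep]

err_full = np.linalg.norm(ols_with_intercept(X, y) - w_star)
err_trunc = np.linalg.norm(ols_with_intercept(X_obs, y_obs) - w_star)
print("observed fraction:", keep.mean())
print("OLS error, full sample:", err_full)
print("OLS error, truncated sample:", err_trunc)
```

Because selection depends on the outcome, least squares on the observed samples is biased no matter how many samples are collected, which is why truncation has to be modeled explicitly.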