arXivDaily: arXiv daily academic digest, updated Monday through Friday
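2605.16255 2026-05-18 cs.DC cs.AI version update

Designing Datacenter Power Delivery Hierarchies for the AI Era

Grant Wilkins, Fiodar Kazhamiaka, Alok Gautam Kumbhare, Chaojie Zhang, Ricardo Bianchini

Affiliations: Stanford University; Microsoft Azure Research

AI summary: This paper studies the challenges of designing datacenter power delivery hierarchies for the AI era and proposes an evaluation framework that combines throughput, power, and cost metrics to analyze how multi-resource stranding affects deployable capacity, capital expenditure, and performance.

Details
AI abstract (translated from Chinese)

Demand for AI accelerators is rapidly increasing rack power density, which is projected to reach 1 MW per deployment by 2027. This poses a major challenge for datacenter power delivery designers. As power density rises, a datacenter designed for a different target density may be unable to use all the power provisioned by its delivery hierarchy. Designs must remain efficient over a long datacenter lifetime and across multiple hardware generations. Power utilization matters especially in the AI era because grid power capacity is a scarce resource. Designing a power delivery hierarchy that stays efficient over the long term is difficult because rack placement feasibility, workload impact, and cost depend on electrical topology, deployment granularity, placement policy, power oversubscription, and workload mix. Moreover, these factors change over time, are interdependent across multiple resource dimensions, and usually cannot be analyzed in closed form. To address this challenge, we develop an evaluation framework that combines projection models for GPU, compute, and storage deployments with production data from Microsoft Azure. Our results show that multi-resource stranding significantly changes deployable capacity, effective capital expenditure, and delivered performance, and quantify how rising density from rack- and pod-scale AI systems affects these outcomes. For AI datacenter design, the relevant planning objective is not installed megawatts but deployable capacity over time.

English abstract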

Demand for AI accelerators is rapidly increasing rack power density, with projections approaching 1 MW per deployment by 2027. This poses a major challenge for datacenter power delivery designers. As power densities increase, a datacenter designed for a different target density may strand power, i.e., may be unable to use all the power that its delivery hierarchy has provisioned. Designs must remain efficient over long datacenter lifetimes and multiple hardware generations. Power utilization is particularly important as grid power capacity is a scarce resource in the AI era. Designing an efficient power delivery hierarchy for the long run is difficult because rack placement feasibility, workload impact, and cost depend jointly on electrical topology, deployment granularity, placement policy, power oversubscription, and workload mix. Moreover, these factors evolve over time, have interdependencies across multiple resource dimensions, and generally do not lend themselves to closed-form analysis. To address this challenge, we develop a framework for evaluating datacenter power delivery designs using throughput, power, and cost metrics over realistic arrival, oversubscription, and decommissioning sequences. The framework combines projection models for GPU, compute, and storage deployments with operational factors grounded in production data from Microsoft Azure. Our results show that multi-resource stranding materially changes deployable capacity, effective capital expenditure, and delivered performance, and quantify how rising density from rack- and pod-scale AI systems shapes these outcomes. For AI datacenter design, the relevant planning objective is not installed megawatts, but deployable capacity over time.
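
To make the idea of stranded power concrete, here is a minimal sketch. It is not the paper's framework: it assumes identical racks, independent power rows, and hypothetical capacities (`row_capacity_kw`, `rack_kw`), and it ignores cooling, space, and oversubscription.

```python
def stranded_power(row_capacity_kw, num_rows, rack_kw):
    """Return (deployable_kw, stranded_kw) for identical racks in identical rows.

    A row can host only whole racks, so whatever is left over after the last
    rack that fits is provisioned but unusable: it is stranded.
    """
    racks_per_row = int(row_capacity_kw // rack_kw)
    deployable_kw = racks_per_row * rack_kw * num_rows
    provisioned_kw = row_capacity_kw * num_rows
    return deployable_kw, provisioned_kw - deployable_kw


# Hypothetical hall: 10 rows of 1000 kW each, three rack densities.
for rack_kw in (40, 130, 600):
    deployable, stranded = stranded_power(1000, 10, rack_kw)
    print(f"{rack_kw:>4} kW racks: deployable {deployable} kW, stranded {stranded} kW")
```

In this toy setting the same installed capacity yields very different deployable capacity as rack density grows, which is the distinction the abstract draws between installed megawatts and deployable capacity.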

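2605.16250 2026-05-18 cs.CL cs.AI cs.DB cs.LG version update

A Generative AI Framework for Intelligent Utility Billing, CO2 Analytics and Sustainable Resource Optimisation

Pavan Manjunath, Thomas Pruefer

Affiliations: Independent Research, India; Independent Research, Germany

AI summary: This paper proposes a generative AI framework that integrates four production-grade capabilities to deliver natural-language bill generation, consumption forecasting, and carbon emission optimization.

Details
AI abstract (translated from Chinese)

Distribution utilities now need to deliver readable bills, attach a defensible carbon number to every kWh sold, and schedule load against grid stress and emissions constraints. This paper proposes an end-to-end framework that integrates four production-grade capabilities: a generative-AI agent that drafts each customer's natural-language bill from structured numeric inputs under a constrained decoding policy; a transformer-based forecaster that provides day-ahead consumption estimates with calibrated quantile bands.

English abstract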

Distribution utilities are now expected to deliver bills that customers can actually read, attach a defensible carbon number to every kWh sold, and schedule load against grid stress and emissions constraints. We propose an end-to-end framework that unifies four production-grade capabilities under one architectural roof: a generative-AI agent that drafts each customer's natural-language billing statement from structured numeric inputs under a constrained decoding policy; a transformer-based forecaster that supplies the day-ahead consumption estimate with calibrated quantile bands
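
Quantile forecasts are commonly trained with the quantile (pinball) loss and checked by the empirical coverage of their bands. The sketch below shows both in plain Python. It is a generic illustration: the function names and sample numbers are hypothetical and not taken from the paper.

```python
def pinball_loss(y_true, y_pred, q):
    """Quantile loss for a single forecast at quantile level q in (0, 1)."""
    diff = y_true - y_pred
    # Under-forecasts are weighted by q, over-forecasts by (1 - q).
    return q * diff if diff >= 0 else (q - 1) * diff


def coverage(actuals, lowers, uppers):
    """Fraction of actual values that fall inside [lower, upper]."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(actuals, lowers, uppers))
    return inside / len(actuals)


# Hypothetical day-ahead consumption (kWh) with an 80% band (10th to 90th percentile).
actuals = [12.0, 15.5, 9.8, 20.1]
lowers = [10.0, 13.0, 10.5, 17.0]
uppers = [14.0, 18.0, 13.0, 22.0]
print(coverage(actuals, lowers, uppers))
print(pinball_loss(12.0, 10.0, 0.9), pinball_loss(12.0, 14.0, 0.9))
```

A band is well calibrated when its empirical coverage over many forecasts is close to its nominal level, here 80%.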

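2605.16245 2026-05-18 cs.CY cs.AI cs.CL cs.LG cs.SI version update

AI-Mediated Communication Can Steer Collective Opinion

Stratis Tsirtsis, Kai Rawal, Chris Russell, Brent Mittelstadt, Sandra Wachter

Affiliations: Hasso Plattner Institute; Oxford Internet Institute, University of Oxford; Weizenbaum Institute

AI summary: This paper studies how AI affects collective opinion formation when it mediates human-to-human communication. Empirical and theoretical analyses show how directional biases introduced by AI can be amplified through the network and shift collective opinion, and the paper examines whether platforms can control such biases.

Details
AI abstract (translated from Chinese)

Generative artificial intelligence (AI) is increasingly integrated into the online platforms where people exchange opinions; large language models (LLMs) now polish users' posts on LinkedIn and provide context for content on X. Although prior work has shown that AI can express biased opinions and influence individuals' opinions, less attention has been paid to its effect on collective opinion formation when it mediates human-to-human communication. We fill this gap through empirical and theoretical analysis. We show empirically that several popular LLM families introduce directional biases when instructed to edit human-written text on contested topics, for example leaning in favor of gun control and against atheism. Building on this observation, we introduce a mathematical model of opinion dynamics in which an AI system sits between users of a social network and transforms the opinions they express and perceive. By analyzing the model's equilibrium and running simulations on real social network data, we show that biases introduced by AI in human-to-human communication can be amplified through the network and shift collective opinion. In light of these findings, we investigate whether such biases can be controlled by online platforms. We audit the "Explain this post" feature on X and find a pro-life bias in Grok's outputs on abortion-related content, which we trace to specific design choices. Finally, we discuss the broader implications of these findings for ongoing legislative efforts in the European Union.

English abstract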

Generative artificial intelligence (AI) is increasingly integrated into the online platforms where humans exchange opinions; large language models (LLMs) now polish users' posts on LinkedIn and provide context for content shared on X. While prior work has shown that AI can express biased opinions and shape individuals' opinions during human-AI interactions, less attention has been paid to its influence on collective opinion formation when mediating human-to-human communication. We address this gap via a combination of empirical and theoretical analyses. We show empirically that LLMs from multiple popular families introduce directional biases when instructed to edit human-written texts on contested topics, for example, nudging texts in favor of gun control and against atheism. Building on this observation, we introduce a mathematical model of opinion dynamics in which an AI system sits between users on a social network, transforming the opinions they express and perceive. By analytically characterizing the equilibrium of this model and performing simulations on real social network data, we show that biases introduced by AI in human-to-human communication can be amplified through the network and shift collective opinion in their direction. In light of these findings, we investigate whether such biases are controllable by online platforms. We audit the "Explain this post" feature on X and find evidence of pro-life bias in Grok's outputs on abortion-related content, which we trace back to specific design choices. We conclude with a discussion of the broader implications of our findings in relation to ongoing legislative efforts in the European Union.
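
The amplification effect can be illustrated with a toy model. The abstract does not specify the paper's model, so the sketch below swaps in a Friedkin-Johnsen-style update as a stand-in: each user mixes an innate opinion with the neighbours' opinions as perceived through an AI layer that adds a constant bias. The network, the susceptibility `a`, and the bias `b` are all hypothetical.

```python
def equilibrium(weights, innate, susceptibility, ai_bias, steps=500):
    """Iterate x <- (1 - a) * s + a * W (x + b) until it settles.

    weights is a row-stochastic influence matrix, innate holds each user's
    private opinion s, and ai_bias is the shift b the AI layer adds to every
    opinion a user perceives.
    """
    n = len(innate)
    x = list(innate)
    for _ in range(steps):
        perceived = [xi + ai_bias for xi in x]  # AI-mediated view of others
        x = [
            (1 - susceptibility) * innate[i]
            + susceptibility * sum(weights[i][j] * perceived[j] for j in range(n))
            for i in range(n)
        ]
    return x


# Three users who each listen equally to the other two.
W = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
s = [-0.4, 0.0, 0.4]
unbiased = equilibrium(W, s, susceptibility=0.8, ai_bias=0.0)
biased = equilibrium(W, s, susceptibility=0.8, ai_bias=0.05)
print([round(after - before, 4) for before, after in zip(unbiased, biased)])
```

In this stand-in model every opinion shifts by a·b/(1 − a) at equilibrium, so a per-message bias of 0.05 becomes a collective shift of 0.2 when a = 0.8: repeated exposure through the network compounds the small edit-level bias.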

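2605.16241 2026-05-18 cs.CV cs.AI version update

Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

Jin Shi, Brady Zhang, Yishun Lu

Affiliations: Department of Mechanical Engineering; University College London; Department of Engineering Science; University of Oxford

AI summary: This paper proposes the VLA-AD framework, which uses a vision-language model as an offline semantic supervisor to distill a large VLA teacher model into a lightweight student policy, improving efficiency and robustness through high-level semantic guidance.

Details
AI abstract (translated from Chinese)

Large-scale Vision-Language-Action (VLA) policies have recently performed well in robotic manipulation, but their size and inference cost remain major obstacles to real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a vision-language model as an offline semantic supervisor to turn a large VLA teacher model into a lightweight student policy. Rather than relying only on low-level action imitation, VLA-AD adds high-level semantic guidance to the teacher's 7-DoF action targets, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time the student policy runs on its own, with neither the VLA teacher nor the VLM. We evaluate VLA-AD on three LIBERO benchmark suites. With OpenVLA-7B as the teacher, our method produces a 158M-parameter student, a 44× reduction in model size, with an average relative gap to the teacher of only 0.27%. The resulting policy runs at 12.5 Hz on an RTX 4090, 3.28× faster than OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different teacher, π0.5-4B, where the student outperforms the teacher on two suites and stays within 0.53% on libero_goal. Further analysis shows that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD shows that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.

English abstract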

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a 44× reduction in model size while matching the teacher with only a 0.27% average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a 3.28× inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different π0.5-4B teacher, where the student outperforms the teacher on two suites and remains within 0.53% on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.

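2605.16238 2026-05-18 cs.AI version update

Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search

Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, Zahra Shamsi

Affiliations: Google Research; School of Engineering and Applied Sciences, Harvard University; Google DeepMind; University of Massachusetts

AI summary: This paper presents an autonomous system that uses LLM-guided tree search to generate, evaluate, and optimize executable forecasting software. During the 2025-2026 US respiratory season it produced methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus, and their ensemble matched or outperformed the CDC hub ensembles out-of-sample.

Details
AI abstract (translated from Chinese)

Probabilistic forecasting of infectious diseases is crucial for public health, but it relies on time-consuming manual model curation by expert teams, which limits scalability to fine-grained geographic resolutions or emerging pathogens. This paper presents an autonomous system that uses Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yields an ensemble that consistently matches or outperforms the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully handled the data-scarce "cold start" scenario for RSV. In addition, controlled retrospective ablations show that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, the framework overcomes the modeling labor bottleneck and enables deployment of expert-level disease forecasting at unprecedented scale.

English abstract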

Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.
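
Two ingredients named in the abstract, a log-scale distance metric and an ensemble over machine-generated models, can be sketched in a few lines. The abstract gives neither the exact metric nor the aggregation rule, so the `log1p` absolute error and the per-horizon median below are assumptions chosen for illustration, and the forecast numbers are made up.

```python
import math
from statistics import median


def log_scale_abs_error(observed, predicted):
    """Absolute error between counts after a log(1 + x) transform."""
    return abs(math.log1p(observed) - math.log1p(predicted))


def median_ensemble(forecasts):
    """Combine point forecasts (one list per model) horizon by horizon."""
    return [median(values) for values in zip(*forecasts)]


# Hypothetical weekly admission forecasts from three generated models.
model_forecasts = [[10, 20], [12, 18], [50, 19]]
ensemble = median_ensemble(model_forecasts)
print(ensemble)
print([round(log_scale_abs_error(obs, pred), 3) for obs, pred in zip([11, 25], ensemble)])
```

On a log scale, errors are judged in relative terms, so a model cannot lower its score simply by fitting the few largest counts while ignoring low-incidence weeks.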

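2605.16233 2026-05-18 cs.AI cs.CL cs.LG cs.MA cs.SY eess.SY version update

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Affiliations: Carleton University

AI summary: FORGE uses a population broadcast mechanism to build self-generated memory without gradient updates, improving the decision-making of hierarchical ReAct agents; on the CybORG CAGE-2 task it substantially improves performance and lowers failure rates.

Details
AI abstract (translated from Chinese)

Can LLM agents improve their decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves hierarchical ReAct agents through prompt-injected natural-language memory. FORGE contains a Reflexion-style inner loop in which a dedicated reflection agent (using the same underlying LLM, with no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed). An outer loop propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP with a 30-step horizon, against the B-line attacker. All four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) show strongly negative, heavy-tailed zero-shot rewards. Compared with a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, and reduces the major-failure rate (returns below -100) to about 1%. We find that (1) population broadcast is the key mechanism: a no-graduation ablation confirms that broadcast carries the performance gains, while graduation mainly saves compute; (2) Examples performs best for three models, while Rules offers the best cost-reliability profile with about 40% fewer tokens; and (3) weaker base models benefit substantially, suggesting that FORGE may narrow capability gaps rather than amplify strong models. All evidence is limited to CAGE-2 B-line; the cross-family findings are directional evidence.

English abstract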

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (returns below -100) to as low as ~1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with ~40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
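
The two-loop structure can be sketched as follows. This is an illustrative skeleton, not the authors' implementation: `reflect` and `evaluate` are hypothetical callables standing in for the LLM reflection agent and the CAGE-2 evaluation, and the score threshold used as a graduation criterion is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    memory: list = field(default_factory=list)  # prompt-injected rules or examples
    graduated: bool = False
    score: float = float("-inf")


def forge_stage(population, evaluate, reflect, graduation_threshold):
    """Run one stage: inner reflection loop, then broadcast and graduation."""
    # Inner loop: each active instance updates its own memory and is re-scored.
    for inst in population:
        if not inst.graduated:
            inst.memory = reflect(inst.memory)
            inst.score = evaluate(inst.memory)
    # Outer loop: find the best instance of this stage.
    best = max(population, key=lambda inst: inst.score)
    for inst in population:
        if inst.graduated:
            continue
        if inst.score >= graduation_threshold:
            inst.graduated = True  # freeze: keep own memory, skip later stages
        else:
            inst.memory = list(best.memory)  # population broadcast
    return best


# Toy stand-ins: more memory entries give a less negative return.
population = [Instance(memory=["r1"]), Instance(), Instance(memory=["r1", "r2", "r3"])]
best = forge_stage(
    population,
    evaluate=lambda memory: -100 + 10 * len(memory),
    reflect=lambda memory: memory + ["new rule"],
    graduation_threshold=-65,
)
print(best.score, [len(inst.memory) for inst in population])
```

No model weights change anywhere in this loop; the only thing that evolves is the text placed in each instance's prompt.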

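2605.16232 2026-05-18 cs.CL cs.AI cs.ET cs.LG cs.SY eess.SY version update

A Unified Generative-AI Framework for Smart Energy Infrastructure: Intelligent Gas Distribution, Utility Billing, Carbon Analytics, and Quantum-Inspired Optimisation

Pavan Manjunath, Thomas Pruefer

Affiliations: Independent Research, India; Independent Research, Germany

AI summary: This paper proposes a unified generative-AI framework that integrates intelligent gas distribution, billing, carbon analytics, and quantum-inspired optimization to improve energy management efficiency and environmental accountability.

Details
AI abstract (translated from Chinese)

The accelerating convergence of smart metering, generative artificial intelligence, and quantum-inspired combinatorial optimization is reshaping how energy utilities operate in physical infrastructure management, customer engagement, and environmental accountability.

English abstract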

The accelerating convergence of smart metering, generative artificial intelligence, and quantum-inspired combinatorial optimisation is reshaping how energy utilities manage physical infrastructure, customer engagement, and environmental accountability.

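2605.16207 2026-05-18 cs.AI cs.CL version update

Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian, Tiffany Barnes

Affiliations: North Carolina State University

AI summary: This paper studies the tutoring performance of LLMs in logical reasoning and finds systematic biases in distinguishing optimal, suboptimal, and incorrect solutions, which undermine adaptive teaching.

Comments: 22 pages, 20 figures

Details
AI abstract (translated from Chinese)

Effective tutoring requires distinguishing optimal solutions, valid but suboptimal solutions, and incorrect solutions. This distinction is central to intelligent tutoring systems but had not previously been tested for LLM tutoring agents. Using knowledge-graph-derived ground truth, this paper evaluates seven LLM feedback agents in propositional logic. The models perform near ceiling on optimal steps but systematically over-reject valid but suboptimal reasoning and over-validate incorrect solutions, which is exactly where adaptive tutoring matters most. These failures persist across models and contexts, suggesting architectural rather than informational limits. Moreover, accurate diagnosis does not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. The findings suggest that LLMs are better suited to hybrid architectures in which knowledge-graph-based models handle diagnosis while LLMs support open-ended scaffolding and dialogue.

English abstract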

Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution--feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.
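
The two failure modes, over-rejection and over-validation, are per-category error rates. The sketch below shows one way to compute them from (ground truth, verdict) pairs. The label names and the binary accept/reject verdict are simplifying assumptions, not the benchmark's actual schema, and the sample data is invented.

```python
from collections import Counter


def diagnostic_error_rates(pairs):
    """Error rate per ground-truth category.

    pairs holds (truth, verdict) tuples with truth in
    {"optimal", "suboptimal", "incorrect"} and verdict in {"accept", "reject"}.
    """
    totals, errors = Counter(), Counter()
    for truth, verdict in pairs:
        totals[truth] += 1
        if truth == "incorrect" and verdict == "accept":
            errors[truth] += 1  # over-validation of a wrong step
        elif truth != "incorrect" and verdict == "reject":
            errors[truth] += 1  # over-rejection of a valid step
    return {label: errors[label] / totals[label] for label in totals}


sample = [
    ("optimal", "accept"), ("optimal", "accept"),
    ("suboptimal", "reject"), ("suboptimal", "accept"),
    ("incorrect", "accept"), ("incorrect", "reject"),
    ("incorrect", "accept"), ("incorrect", "accept"),
]
print(diagnostic_error_rates(sample))
```

Reporting the rates per category rather than as one overall accuracy is what exposes the pattern the abstract describes: near-perfect scores on optimal steps can hide high error rates on the other two categories.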

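2605.16205 2026-05-18 cs.AI cs.CL cs.LG cs.MA cs.SY eess.SY version update

Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Affiliations: Carleton University

AI summary: This study examines how context, reasoning, and hierarchical decomposition in compound LLM agent design affect performance and cost in adversarial, partially observable sequential environments. It finds that programmatic state abstraction is the most cost-efficient, while hierarchical decomposition without deliberation achieves the best performance.

Details
AI abstract (translated from Chinese)

Deploying compound LLM agents in adversarial, partially observable sequential environments involves several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance rather than merely increase inference cost. We conduct a controlled study in the CybORG CAGE-2 environment, modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation covers five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, with up to 3.4× worse mean return while using 1.8-2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is usually more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, because these strategies can interfere with each other when combined.

English abstract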

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8-2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.

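2605.16198 2026-05-18 cs.AI cs.CY cs.LG cs.LO version update

Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

Parand A. Alamdari, Toryn Q. Klassen, Sheila A. McIlraith

Affiliations: University of Toronto, Vector Institute

AI summary: This paper proposes auditing and monitoring techniques that combine formal methods with machine learning to detect violations of temporally extended behavioral constraints in AI systems. Experiments show that they detect violations better than LLM-based methods and effectively lower the violation rates of LLM agents.

Details
AI abstract (translated from Chinese)

We examine one dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with state-of-the-art machine learning, we propose techniques that enable developers of AI-enabled products and services, third-party AI developers, and evaluators to perform offline auditing and online (runtime) monitoring of product-specific, temporally extended behavioral constraints (such as safety constraints, norms, rules, and regulations) for black-box advanced AI systems, notably LLMs. We further provide practical techniques for predictive monitoring, such as sampling-based methods, and introduce intervening monitors that act at runtime to preempt and potentially mitigate predicted violations. Experimental results show that, by exploiting the formal syntax and semantics of Linear Temporal Logic (LTL), our methods outperform LLM-based methods in detecting violations of temporally extended behavioral constraints; with our approach, even small-model labelers match or exceed frontier LLM judges. We also show through controlled experiments that the accuracy of LLMs' temporal reasoning degrades markedly as event distance, number of constraints, and number of propositions increase.

English abstract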

We examine one particular dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with SoTA machine learning, we propose techniques that enable AI-enabled product and service developers, as well as third party AI developers and evaluators, to perform offline auditing and online (runtime) monitoring of product-specific (temporally extended) behavioral constraints such as safety constraints, norms, rules and regulations with respect to black-box advanced AI systems, notably LLMs. We further provide practical techniques for predictive monitoring, such as sampling-based methods, and we introduce intervening monitors that act at runtime to preempt and potentially mitigate predicted violations. Experimental results show that by exploiting the formal syntax and semantics of Linear Temporal Logic (LTL), our proposed auditing and monitoring techniques are superior to LLM baseline methods in detecting violations of temporally extended behavioral constraints; with our approach, even small-model labelers match or exceed frontier LLM judges. Our predictive and intervening monitors significantly reduce the violation rates of LLM-based agents while largely preserving task performance. We further show through controlled experiments that LLMs' temporal reasoning shows a pronounced degradation in accuracy with increasing event distance, number of constraints, and number of propositions.
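
Temporally extended constraints of the kind LTL expresses can be checked mechanically once each step of an agent's behaviour is labelled with the propositions that hold. The sketch below hand-codes monitors for two common patterns over a finite trace. It illustrates the general idea only: it is not the paper's technique and does not use an LTL library, and the proposition names are hypothetical.

```python
def first_violation(trace, forbidden):
    """Safety pattern G(not forbidden): index of the first bad step, or None."""
    for index, step in enumerate(trace):
        if forbidden in step:
            return index
    return None


def satisfies_response(trace, trigger, reply):
    """Response pattern G(trigger -> F reply), read over a finite trace."""
    pending = False
    for step in trace:
        if trigger in step:
            pending = True  # an obligation is opened
        if reply in step:
            pending = False  # any open obligation is discharged
    return not pending


# Each step is the set of propositions a labeler assigned to that agent action.
trace = [{"request"}, set(), {"response"}, {"request", "leak"}]
print(first_violation(trace, "leak"))                       # 3
print(satisfies_response(trace, "request", "response"))     # False
print(satisfies_response(trace[:3], "request", "response")) # True
```

Because the check itself is deterministic, the only learned component needed is the labeler that maps each step to propositions, which is consistent with the abstract's observation that small-model labelers can suffice.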

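2605.16194 2026-05-18 cs.DL cs.AI cs.IR cs.MA version update

paper.json: A Coordination Convention for LLM-Agent-Actionable Papers

Arquimedes Canedo

Affiliations: arquicanedo

AI summary: This paper proposes the paper.json file, which addresses recurring failures of LLM agents reading academic papers through conventions such as stable claim IDs, an explicit does-not-claim list, exact per-figure commands, and stable definition IDs.

Details
AI abstract (translated from Chinese)

LLM agents often serve as the first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than in the paper itself. We propose paper.json, a JSON file that travels with the PDF and addresses these failures with lightweight conventions: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, a hand-written JSON file alongside the PDF, can be achieved in under an hour without touching the human-readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: running `uv run validator.py paper.json --against paper.typ` passes. Repository: https://github.com/arquicanedo/paper-json

English abstract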

LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose `paper.json`, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand-written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human-readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: `uv run validator.py paper.json --against paper.typ` passes. Repo: https://github.com/arquicanedo/paper-json
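
A companion file in the spirit of these conventions might look like the sketch below. The abstract does not give the schema, so every key name, ID format, and command here is a hypothetical placeholder, and the `check` function is a toy stand-in rather than the repository's `validator.py`.

```python
import json

# Hypothetical layout: all keys and values are illustrative assumptions.
paper = {
    "claims": [
        {"id": "claim-1", "text": "First claim, stated in one sentence."},
        {"id": "claim-2", "text": "Second claim, stated in one sentence."},
    ],
    "does_not_claim": ["Anything about settings the paper did not test."],
    "figures": [{"id": "fig-1", "command": "python make_figure.py --figure 1"}],
    "definitions": [{"id": "def-1", "term": "compliant paper"}],
}


def check(doc):
    """Return a list of problems: duplicate IDs or figures without a command."""
    problems = []
    for key in ("claims", "definitions"):
        ids = [item["id"] for item in doc.get(key, [])]
        if len(ids) != len(set(ids)):
            problems.append(f"duplicate ids in {key}")
    for fig in doc.get("figures", []):
        if not fig.get("command"):
            problems.append(f"figure {fig.get('id')} has no command")
    return problems


print(json.dumps(paper, indent=2))
print(check(paper))
```

Stable IDs let an agent cite a single claim or definition rather than the whole paper, and the does-not-claim list gives it an explicit boundary against overgeneralizing.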

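2605.16165 2026-05-18 cs.CV cs.AI version update

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

Yishun Lu, Wes Armour

Affiliations: University of Oxford, Oxford, United Kingdom

AI summary: This paper proposes the ML-FOP-SOAP framework, which improves the stability of multimodal alignment through multi-level variance correction. Experiments on Janus and Emu3 show improved sample efficiency and training speed, making the method suitable for large-scale multimodal foundation models.

Details
AI abstract (translated from Chinese)

Autoregressive next-token training provides a unified framework for image generation and text understanding, but it also causes strong modality competition that undermines optimization stability and limits large-batch scaling. We find that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, whereas second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this, we propose ML-FOP-SOAP, a second-order optimization framework with multi-level variance correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and text understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains on both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to 1.4× and speeds up wall-clock training by up to 1.5×, providing a robust optimizer for scaling multimodal foundation models.

English abstract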

Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to 1.4× and accelerates wall-clock training by up to 1.5×, offering a robust optimizer for scaling multimodal foundation models.

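2605.16153 2026-05-18 cs.AI version update

An Algebraic Exposition of the Theory of Dyadic Morality

Kush R. Varshney

AI summary: This paper presents the theory of dyadic morality algebraically, proposing three psychological operators that extend structural causal models, addressing the scalability problem posed by the dyadic limitation through node collapse and sequential processing, and applying the framework to AI policy design.

Details
AI abstract (translated from Chinese)

This paper provides an algebraic exposition of the theory of dyadic morality (TDM), a psychological model of moral judgment based on a simple two-node template: an intentional agent causing harm to a vulnerable patient. We formalize TDM using structural causal modeling (SCM) notation and identify three psychological operators (a typecasting operator, a completion operator, and a valence-dependent inference mechanism) that extend standard SCM to capture how people compute moral judgments under constraints. We address the scalability challenge posed by TDM's dyadic limitation, showing how moral cognition compresses multi-node scenarios through node collapse and sequential processing. Building on this algebraic framework, we present concrete applications to AI policy design: detecting conflicting obligations, structuring helpfulness policies that preserve user agency, and designing post-failure communication as causal interventions. Finally, we recommend scoped, contextual measurement of mind perception rather than universal averaging to operationalize the theory empirically. This algebraic formalization enables neurosymbolic AI systems to compute morality in a way that is mathematically rigorous and consistent with human moral cognition.

English abstract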

This paper provides an algebraic exposition of the theory of dyadic morality (TDM), a psychological model of moral judgment grounded in a simple two-node template: an intentional agent causing harm to a vulnerable patient. We formalize TDM using structural causal modeling (SCM) notation and identify three psychological operators (typecasting operator, completion operator, and valence-dependent inference mechanism) that extend standard SCM to capture how people compute moral judgments under constraints. We address scalability challenges arising from TDM's dyadic limitation, showing how moral cognition compresses multi-node scenarios through node collapse and sequential processing. Drawing on this algebraic framework, we demonstrate concrete applications to AI policy design: detecting conflicting obligations, structuring helpfulness policies to preserve user agency, and designing post-failure communication as causal interventions. Finally, we recommend scoped, contextual measurement of mind perception over universal averaging to operationalize the theory empirically. This algebraic formalization enables neurosymbolic AI systems to compute morality in a way that is both mathematically rigorous and faithful to human moral cognition.

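2605.16143 2026-05-18 cs.AI cs.CL version update

Look Before You Leap: Autonomous Exploration for LLM Agents

Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng

Affiliations: University of Science and Technology of China; Meituan

AI summary: This paper identifies autonomous exploration as a key capability and introduces the Exploration Checkpoint Coverage metric. It improves the adaptability of LLM agents in unfamiliar environments with a training strategy that interleaves exploration and execution, improving the generalization of task execution.

Details
AI abstract (translated from Chinese)

Agents based on large language models often fail in unfamiliar environments because of premature exploitation. This paper identifies autonomous exploration as a key capability for building adaptive agents. It introduces Exploration Checkpoint Coverage as a verifiable metric, and a systematic evaluation shows that agents trained with standard task-oriented reinforcement learning exhibit narrow, repetitive behavior. It proposes a training strategy that interleaves exploration and execution, in which agents use an interaction budget to acquire environmental knowledge before executing the task. The results show that systematic exploration is essential for building generalizable, real-world-ready agents.

English abstract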

Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.
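
A coverage metric of this kind can be read as the fraction of predefined checkpoints an agent reaches during exploration. The sketch below assumes exactly that reading; the paper's precise definition may differ, and the checkpoint names are hypothetical.

```python
def checkpoint_coverage(visited, checkpoints):
    """Fraction of the environment's checkpoints that the agent discovered."""
    checkpoints = set(checkpoints)
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    return len(checkpoints & set(visited)) / len(checkpoints)


# Key states, objects, and affordances defined for a toy household environment.
checkpoints = ["kitchen", "drawer", "fridge", "stove"]
# A narrow, repetitive rollout revisits the same places.
rollout = ["kitchen", "drawer", "kitchen", "drawer", "hallway"]
print(checkpoint_coverage(rollout, checkpoints))
```

Because the score depends only on which checkpoints were reached, it can be verified automatically and used as a reward for exploration rollouts, independent of task success.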

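2605.16134 2026-05-18 cs.LG cs.AI version update

Navigating Potholes with Geometry-Aware Sharpness Minimization

Simon Dufort-Labbé, Mehrab Hamidi, Razvan Pascanu, Ioannis Mitliagkas, Damien Scieur, Aristide Baratin

Affiliations: Mila, Université de Montréal; Samsung – SAIL Montreal

AI summary: This paper proposes LLQR+SAM, which combines a learned preconditioner with sharpness minimization and uses a two-timescale structure to improve model robustness; experiments show strong results on vision and sequence modeling tasks.

Details
AI abstract (translated from Chinese)

Sharpness-aware minimization (SAM) encourages flat minima by perturbing parameters along directions of high loss curvature, but it treats all parameter directions uniformly and ignores the geometry of the loss. We introduce LLQR+SAM, which combines SAM with a learned preconditioner obtained from the recently proposed LLQR framework, a second-order method that recasts steepest descent as a layerwise linear-quadratic regulator problem. The preconditioner is updated sparsely and maintained as a slow exponential moving average, so it captures a smoothed, low-resolution picture of the loss landscape geometry. The SAM perturbation operates on top of this learned geometry, probing curvature at a faster timescale. We show that this two-timescale structure is more than a computational convenience: in theory, the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp (potholes). Wide, flat basins, by contrast, remain stable. Experiments show that LLQR+SAM consistently improves on both SAM and LLQR alone across standard vision and sequence modeling benchmarks, supporting the view that slowly learned geometry and fast sharpness correction are complementary.

English abstract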

Sharpness-aware minimization (SAM) encourages flat minima by perturbing parameters along directions of high loss curvature, but treats all parameter directions uniformly, ignoring the underlying loss geometry. We introduce LLQR+SAM, which combines SAM with a learned preconditioner obtained from the recently proposed LLQR framework, a second-order method that recasts steepest descent as a layerwise linear-quadratic regulator problem. The preconditioner is updated sparsely and maintained as a slow exponential moving average, so it captures a smoothed, low-resolution picture of the loss landscape geometry. The SAM perturbation then operates on top of this learned geometry, probing curvature at a faster timescale. We show that this two-timescale structure is not merely a computational convenience: theoretically, the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp (potholes). Wide, flat basins, by contrast, remain stable. Empirically, LLQR+SAM gives consistent gains over both SAM and LLQR alone across standard vision and sequence modeling benchmarks, supporting the view that slow learned geometry and fast sharpness correction are genuinely complementary.
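
The two-timescale idea can be sketched on a toy quadratic. This is not LLQR: the preconditioner below is a simple diagonal, RMS-style scaling kept as a slow moving average and refreshed only every tenth step, and the way it enters both the SAM probe and the descent step is an assumption made for illustration.

```python
import math


def grad(w, curvatures):
    """Gradient of the toy loss 0.5 * sum(h_i * w_i ** 2)."""
    return [h * wi for h, wi in zip(curvatures, w)]


def loss(w, curvatures):
    return 0.5 * sum(h * wi * wi for h, wi in zip(curvatures, w))


def sam_step(w, curvatures, precond, lr=0.1, rho=0.05):
    """One SAM step whose probe and update both use the preconditioner."""
    g = grad(w, curvatures)
    pg = [p * gi for p, gi in zip(precond, g)]
    norm = math.sqrt(sum(v * v for v in pg)) or 1.0
    # Fast timescale: probe the loss at a nearby point of radius rho.
    w_probe = [wi + rho * v / norm for wi, v in zip(w, pg)]
    g_probe = grad(w_probe, curvatures)
    return [wi - lr * p * gi for wi, p, gi in zip(w, precond, g_probe)]


def train(steps=200):
    w, curvatures, second_moment = [1.0, 1.0], [10.0, 0.1], [1.0, 1.0]
    initial = loss(w, curvatures)
    for step in range(steps):
        if step % 10 == 0:  # slow timescale: sparse moving-average refresh
            g = grad(w, curvatures)
            second_moment = [0.9 * m + 0.1 * gi * gi for m, gi in zip(second_moment, g)]
        precond = [1.0 / math.sqrt(m + 1e-8) for m in second_moment]
        w = sam_step(w, curvatures, precond)
    return initial, loss(w, curvatures)


print(train())
```

The preconditioner changes only every tenth step while the SAM probe runs at every step, which mirrors the split the abstract describes between slowly learned geometry and fast sharpness correction.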

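2605.16126 2026-05-18 cs.LG cs.AI cs.IT math.IT math.ST stat.OT stat.TH version update

Entropy Across the Bridge: Conditional-Marginal Discretization for Flow and Schrödinger Samplers

Bruno Trentini, Dejan Stancevic, Michael M. Bronstein, Alexander Tong, Luca Ambrogioni

Affiliations: NVIDIA Corporation; University of Oxford; Donders Institute for Brain, Cognition, and Behaviour; AITHYRA, Research Institute for Biomedical AI

AI summary: This paper proposes an entropy-rate objective for bridge-aware discretization that separates endpoint-conditioned bridge geometry from marginal flow evolution, improving the performance of high-dimensional bridge and flow samplers under low budgets.

Details
AI abstract (translated from Chinese)

For a fixed flow-based generative model under a limited inference budget, sample quality depends strongly on how the sampler allocates its few function evaluations. Flow matching and Schrödinger bridges define probability paths, but their inference grids are usually heuristic or inherited from one-endpoint diffusion. This paper derives a conditional-marginal entropy-rate objective for bridge-aware discretization, separating endpoint-conditioned bridge geometry from marginal flow evolution, and uses it to build a training-free entropic inference-time scheduler. For Gaussian Brownian bridges the rate has a closed form and is U-shaped, motivating boundary-heavy nonuniform grids. On trained two-dimensional bridge/flow models, the estimated profile recovers the predicted shape and improves 10-step ODE-Heun MMD over linear by 18.1%, with a 22.7% SDE-Heun improvement in the same low-NFE sweep. On EDM/CIFAR-10, the entropic time discretization gives the best tested five-step FID (186.3 ± 4.0 versus 200.5 ± 2.9 for linear and 238.0 ± 5.3 for cosine). On AlphaFlow protein generation, entropic conditional-marginal scheduling shows an advantage in low-NFE regimes on the CAMEO22 and ATLAS benchmarks. These results support entropy-rate scheduling as a practical low-budget allocation signal for high-dimensional bridge and flow samplers.

English abstract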

For a fixed flow-based generative model under a small inference budget, sample quality can depend strongly on where the sampler spends its few function evaluations. Flow matching and Schrödinger bridges define probability paths, yet their inference grids are usually heuristic or inherited from one-endpoint diffusion. We derive a conditional-marginal entropy-rate objective for bridge-aware discretization, separating endpoint-conditioned bridge geometry from marginal flow evolution, and use it to build a training-free entropic inference-time scheduler from first principles. For Gaussian Brownian bridges this rate is closed-form and U-shaped, motivating boundary-heavy nonuniform grids. On trained two-dimensional bridge/flow models, the estimated profile recovers the predicted shape and improves 10-step ODE-Heun MMD over linear by 18.1%, with a paired 22.7% SDE-Heun improvement in the same low-NFE sweep. On EDM/CIFAR-10, the entropic time-discretization gives the best tested five-step FID (186.3 ± 4.0 versus 200.5 ± 2.9 for linear and 238.0 ± 5.3 for cosine). On AlphaFlow protein generation, entropic conditional-marginal (cond-marg) scheduling shows an advantage in low-NFE regimes on both the CAMEO22 and ATLAS benchmarks. These results support entropy-rate scheduling as a practical low-budget allocation signal for high-dimensional bridge and flow samplers.

2605.16122 2026-05-18 cs.CV cs.AI 版本更新

GenShield: Unified Detection and Artifact Correction for AI-Generated Images

GenShield:面向AI生成图像的统一检测与伪影校正

Zhipei Xu, Xuanyu Zhang, Youmin Xu, Qing Huang, Shen Chen, Taiping Yao, Shouhong Ding, Jian Zhang

发表机构 * School of Electronic and Computer Engineering, Peking University(北京大学电子与计算机工程学院) Tencent Youtu Lab(腾讯优图实验室)

AI总结 本文提出GenShield框架,通过闭环诊断与修复流程实现可解释的AI生成图像检测与可控伪影校正,结合视觉链式推理课程学习策略,提升校正效果与泛化能力。

详情
AI中文摘要

基于扩散模型的图像合成使AI生成图像(AIGI)日益逼真,引发了在虚假信息检测、数字取证和内容审核等应用中真实性问题的紧迫关注。尽管在AIGI检测方面取得了显著进展,如何纠正检测到的具有明显伪影的AI生成图像并恢复真实外观仍鲜有研究。此外,现有工作很少建立AIGI检测与伪影校正之间的联系。为填补这一空白,我们提出了GenShield,一个统一的自回归框架,能够在闭环中联合执行可解释的AIGI检测和可控的伪影校正,揭示了这两个任务之间的相互促进关系。我们进一步引入基于视觉链式推理的课程学习策略,使系统能够进行自我解释、多步骤的“诊断-修复”校正,并具有明确的停止准则。同时构建了一个高质量的数据集,包含大规模的“伪影-校正”配对,并配套统一的评估流程。在我们的校正基准和主流AIGI检测基准上的广泛实验表明,我们的方法在性能和泛化能力方面均达到最先进的水平。代码可在https://github.com/zhipeixu/GenShield获取。

英文摘要

Diffusion-based image synthesis has made AI-generated images (AIGI) increasingly photorealistic, raising urgent concerns about authenticity in applications such as misinformation detection, digital forensics, and content moderation. Despite substantial advances in AIGI detection, how to correct detected AI-generated images with visible artifacts and restore a realistic appearance remains largely underexplored. Moreover, few existing works have established the connection between AIGI detection and artifact correction. To fill this gap, we propose GenShield, a unified autoregressive framework that jointly performs explainable AIGI detection and controllable artifact correction in a closed loop from diagnosis to restoration, revealing a mutually reinforcing relationship between these two tasks. We further introduce a Visual Chain-of-Thought based curriculum learning strategy that enables self-explained, multi-step "diagnose-then-repair" correction with an explicit stopping criterion. A high-quality dataset with large-scale "artifact-restored" pairs is also constructed alongside a unified evaluation pipeline. Extensive experiments on our correction benchmark and mainstream AIGI detection benchmarks demonstrate state-of-the-art performance and strong generalization of our method. The code is available at https://github.com/zhipeixu/GenShield.

2605.16116 2026-05-18 cs.AI 版本更新

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

ShopGym: 一个集成框架,用于电子商务网络代理的现实模拟和可扩展基准测试

Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang

发表机构 * North Carolina State University(北卡罗来纳州立大学) Shopify

AI总结 本文提出ShopGym框架,通过模拟层ShopArena和基准层ShopGuru,实现电子商务网络代理的现实模拟与可扩展基准测试,验证了合成商店在结构属性和代理性能上的有效性。

Comments 32 pages, 10 figures

详情
AI中文摘要

开发和评估电子商务网络代理需要能够保持有意义任务结构并支持可控、可重复和可扩展科学比较的环境。现有方法面临权衡:实时商店提供现实但非平稳、难以检查和不可重复,而手动构建的沙盒基准测试提供控制但仅覆盖狭窄的布局、目录、政策和交互模式范围。我们主张核心瓶颈是方法论的:该领域缺乏一种可扩展的方式,能够构建同时现实、多样、可控、可检查和可重复的评估设置。我们引入ShopGym,一个集成框架,用于电子商务网络代理的现实模拟和可扩展基准测试。ShopGym是一个构建电子商务模拟环境和基础基准任务的框架。其模拟层ShopArena通过匿名化商店规范和分阶段验证生成过程,将实时种子商店转换为自包含的沙盒商店。在这些模拟商店之上,ShopGuru合成跨七个技能类别的基准任务,每个任务基于商店的目录、导航结构、政策和交互可能性。共同,ShopArena和ShopGuru产生自包含、可重置、可检查和稳定的评估成果,保留结构属性和与购物任务相关的代理评估信号。我们通过基于图的结构分析和基于代理的行为评估验证了该框架,使用224个生成的任务在六个沙盒商店中:三个由合成数据构建,三个由真实数据构建。我们的结果表明,合成商店保留了实时商店的关键结构属性,代理在合成商店上的表现与在实时商店上的表现正相关。

英文摘要

Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.

2605.16113 2026-05-18 cs.CL cs.AI 版本更新

DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation

DebiasRAG: 通过检索增强生成实现大型语言模型中公平生成的无调优路径

Rui Chu, Bingyin Zhao, Thanh Quoc Hung Le, Duy Cao Hoang, Huawei Lin, Ping Li, Weijie Zhao, Khoa D Doan, Yingjie Lao

发表机构 * Huawei(华为)

AI总结 本文提出DebiasRAG,一种基于检索增强生成的无调优动态查询特定去偏框架,通过生成查询特定去偏候选、构建上下文候选池和梯度更新去偏引导上下文重排序三阶段,提升生成公平性并保留LLM固有属性。

详情
AI中文摘要

大型语言模型(LLMs)因生成能力卓越而取得空前成功。然而,由于依赖训练语料中的知识,它们可能生成幻觉、刻板印象和社会偏见内容。特别是,LLMs容易产生涉及种族、性别和年龄的偏见响应,统称为社会偏见。先前研究使用微调和提示工程来减轻LLMs中的偏见,但这些方法需要额外的训练资源或领域知识来设计框架。此外,它们可能降低LLMs的原始能力,并常忽视公平推断中动态去偏上下文的需要。本文提出DebiasRAG,一种基于检索增强生成(RAG)的新型无调优和动态查询特定去偏框架。DebiasRAG在保持LLM固有属性如表示能力的同时提升公平性。DebiasRAG包含三个阶段:(1)查询特定去偏候选生成;(2)上下文候选池构建;(3)梯度更新去偏引导上下文重排序。首先,DebiasRAG通过常规检索生成与查询相关的自我诊断偏见上下文,这些偏见上下文由DebiasRAG提供者离线准备。给定查询特定的偏见上下文,DebiasRAG反向生成去偏上下文,作为额外的公平性约束提供给LLM输出。其次,常规RAG检索过程从常规RAG文档数据库生成查询相关的上下文,如分块维基百科数据集。

英文摘要

Large language models (LLMs) have achieved unprecedented success due to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate such biases in LLMs, but these methods require additional training resources or domain knowledge to design the framework. Moreover, they may degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free and dynamic query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG leverages self-diagnosed bias contexts relevant to the query through regular retrieval, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are provided as additional fairness constraints for LLM outputs. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.

2605.16112 2026-05-18 cs.LG cs.AI 版本更新

Attention Dispersion in Dynamic Graph Transformers: Diagnosis and a Transferable Fix

动态图变换器中的注意力分散:诊断与可迁移的修复

Jinhao Zhang, Kangfei Zhao, Qiuhao Zeng, Long-Kai Huang

发表机构 * Beijing Institute of Technology(北京理工大学) University of Toronto(多伦多大学) Hong Kong Baptist University(香港浸会大学)

AI总结 本文识别动态图变换器在时间分布偏移下的注意力分散问题,并提出可迁移的差分注意力机制以提升性能,尤其在高偏移数据集上表现显著。

详情
AI中文摘要

基于变换器的架构已成为连续时间动态图(CTDG)学习的主导范式,但其性能在时间偏移数据集上受限。本文发现注意力分散是动态图变换器在时间分布偏移下的共同失效模式。通过对比结构和时间上不同的历史邻居与随机邻居,发现预测依赖于一类关键节点,这些节点比任意邻居更具预测信号。然而,现有变换器无法聚焦这些节点,因为时间偏移削弱了注意力对比并导致注意力分布过于分散。该诊断表明一种简单且可迁移的修复方法:用差分注意力替代标准注意力,以抑制共同模式注意力并放大差异性token级信号。当添加到三个代表性的CTDG变换器基线中时,差分注意力一致提升了性能,收益集中在高偏移数据集上。注意力层面的测量进一步验证了机制,显示关键节点上的注意力熵降低和注意力质量提高。基于这些发现,我们引入DiffDyG,结合差分注意力与标准输入编码。在9个基准和三种负采样协议上,DiffDyG实现了SOTA性能,尤其在最偏移的数据集上表现显著。

英文摘要

Transformer-based architectures have become the dominant paradigm for Continuous-Time Dynamic Graph (CTDG) learning, yet their performance remains limited on temporally shifted datasets. In this work, we identify attention dispersion as a shared failure mode of dynamic graph Transformers under temporal distribution shift. Through controlled ablation contrasting structurally and temporally distinguished historical neighbors against random ones, we show that prediction depends on a class of critical nodes that carry consistently more predictive signal than arbitrary neighbors. However, existing Transformers fail to focus on these nodes even when they are present in the input, as temporal shift weakens attention contrast and produces overly dispersed attention distributions. This diagnosis suggests a simple and transferable fix: replace standard attention with differential attention, which suppresses common-mode attention and amplifies distinctive token-level signals. When added to three representative CTDG Transformer baselines, differential attention consistently improves performance, with gains concentrated on high-shift datasets. Attention-level measurements further confirm the mechanism, showing reduced attention entropy and increased attention mass on critical nodes. Building on these findings, we introduce DiffDyG, a reference implementation combining differential attention with standard input encodings. Across 9 benchmarks and three negative sampling protocols, DiffDyG achieves SOTA performance, with especially large gains on the most shifted datasets.
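The idea of subtracting a second attention map to cancel common-mode weight can be illustrated for a single query with scalar values. This is a hedged sketch of the general differential-attention form (difference of two softmax maps scaled by a factor `lam`), not the DiffDyG implementation; the scores, values, and `lam` below are made up for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def differential_attention(scores_1, scores_2, values, lam=0.5):
    """Weights = softmax(scores_1) - lam * softmax(scores_2), applied to scalar values."""
    a1 = softmax(scores_1)
    a2 = softmax(scores_2)
    weights = [p - lam * q for p, q in zip(a1, a2)]
    output = sum(w * v for w, v in zip(weights, values))
    return weights, output

# Hypothetical scores for four neighbours; index 2 plays the "critical node".
scores_1 = [1.0, 1.0, 2.0, 1.0]   # first map: mild preference for index 2
scores_2 = [1.0, 1.0, 1.0, 1.0]   # second map: purely common-mode (uniform)
values = [0.0, 0.0, 1.0, 0.0]

plain = softmax(scores_1)
weights, out = differential_attention(scores_1, scores_2, values)
share_plain = plain[2] / sum(plain)      # fraction of attention mass on index 2
share_diff = weights[2] / sum(weights)   # same fraction after the subtraction
```

In this toy case the subtraction removes the uniform component, so the critical neighbour holds a larger share of the remaining attention mass than under standard softmax attention.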

2605.16103 2026-05-18 cs.AI 版本更新

Sign-Separated Finite-Time Error Analysis of Q-Learning

符号分离的有限时间误差分析Q学习

Donghwan Lee

发表机构 * Department of Electrical Engineering(电气工程系)

AI总结 本文提出符号分离的有限时间误差分析方法,用于常步长Q学习。通过切换系统表示,将误差分解为负和正部分,负部分由固定最优策略关联的线性时不变系统主导,正部分由线性切换系统控制。分析揭示了Q学习误差动态中的最大诱导不对称性,并提供确定性和随机性常步长递推的有限时间界。

详情
AI中文摘要

本文发展了一种符号分离的有限时间误差分析方法,用于常步长Q学习。从切换系统表示出发,将误差分解为组件的负和正部分。负部分由与固定最优策略关联的线性时不变(LTI)系统主导,而正部分由线性切换系统控制。所得界显示负侧LTI证书至少不慢于正侧切换证书,可能产生更快的指数包络。分析揭示了Q学习误差动态中的最大诱导不对称性,该不对称性与过估计有关:正向动作误差可通过贝尔曼最大值选择和传播,而负误差允许最优策略的下限比较。为确定性和随机性常步长递推提供了有限时间界。

英文摘要

This paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. Starting from the switching-system representation, the error is decomposed into its componentwise negative and positive parts. The negative part is dominated by a lower comparison linear time-invariant (LTI) system associated with a fixed optimal policy, whereas the positive part is controlled by a linear switching system. The resulting bounds show that the negative-side LTI certificate is no slower than the positive-side switching certificate and may produce a faster exponential envelope. The analysis identifies a max-induced asymmetry in Q-learning error dynamics. This asymmetry is connected to overestimation: positive action-wise errors can be selected and propagated by the Bellman maximum, whereas negative errors admit an optimal-policy lower comparison. Finite-time bounds are provided for both deterministic and stochastic constant-step-size recursions.
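The componentwise split of the error into positive and negative parts can be sketched on a tiny example. The two-state deterministic MDP, the step size, and the initial table below are hypothetical, and the update shown is the deterministic (expected) constant step-size recursion; the sketch only illustrates the decomposition and does not reproduce the paper's bounds.

```python
GAMMA, ALPHA = 0.9, 0.5
# Hypothetical deterministic MDP: NEXT[s][a] is the next state, REWARD[s][a] the reward.
NEXT = [[0, 1], [0, 1]]
REWARD = [[0.0, 1.0], [0.0, 2.0]]

def bellman(q):
    """Apply the Bellman optimality operator to a 2x2 Q-table."""
    return [[REWARD[s][a] + GAMMA * max(q[NEXT[s][a]]) for a in range(2)]
            for s in range(2)]

def q_step(q):
    """Deterministic constant step-size Q-learning update."""
    t = bellman(q)
    return [[q[s][a] + ALPHA * (t[s][a] - q[s][a]) for a in range(2)]
            for s in range(2)]

# Approximate Q* by iterating the Bellman operator to numerical convergence.
q_star = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(2000):
    q_star = bellman(q_star)

def split_error(q):
    """Componentwise error e = Q - Q*, split as e = e_pos + e_neg."""
    e = [[q[s][a] - q_star[s][a] for a in range(2)] for s in range(2)]
    e_pos = [[max(x, 0.0) for x in row] for row in e]
    e_neg = [[min(x, 0.0) for x in row] for row in e]
    return e, e_pos, e_neg

q = [[30.0, 0.0], [0.0, 30.0]]   # mixed over- and under-estimates
history = []                      # (largest positive error, largest negative magnitude)
for _ in range(60):
    e, e_pos, e_neg = split_error(q)
    history.append((max(map(max, e_pos)), -min(map(min, e_neg))))
    q = q_step(q)
```

Tracking the two entries of `history` separately is the point of the split: overestimates can be picked up by the `max` in `bellman`, while underestimates cannot, which is the asymmetry the abstract describes.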

2605.16099 2026-05-18 cs.LG cs.AI 版本更新

Federated Imputation under Heterogeneous Feature Spaces

联邦学习下的异构特征空间中的缺失值填补

Imane Hocine, Chaimaa Medjadji, Sylvain Kubler, Gregoire Danoy, Yves Le Traon

发表机构 * SnT, University of Luxembourg, Luxembourg(卢森堡大学SnT学院,卢森堡) FSTM/DCS, University of Luxembourg, Luxembourg(卢森堡大学FSTM/DCS学院,卢森堡)

AI总结 本文提出FedHF-Impute框架,通过共享全局特征图实现跨客户端知识传递,提升联邦填补效果,在模拟数据集上优于基线方法。

详情
AI中文摘要

联邦学习(FL)使去中心化客户端能够协同训练,但大多数方法假设特征模式一致,这在表格设置中不成立,因为客户端只能观察部分重叠的特征子集。在这些异构特征空间中,参数平均方法(如FedAvg)在弱重叠或不相交的特征组之间转移很少的信息,限制了联邦填补的有效性。为克服这一问题,我们提出了FedHF-Impute,一个联邦填补框架,将结构特征不可用性与传统缺失性分开,并利用共享的全局特征图通过信息传递在统计相关特征之间传播信息。这使即使特征从未在本地共同观察时也能实现间接跨客户端知识传递,同时保持标准的联邦通信。在模拟部分模式重叠的SECOM和AirQuality数据集上,FedHF-Impute在填补准确性(RMSE)上比FL基线方法提高了26.9%和8.4%,在PhysioNET上表现相当,仅比最佳基线差0.3%。

英文摘要

Federated Learning (FL) enables collaborative training across decentralized clients, but most methods assume aligned feature schemas, an assumption that rarely holds in tabular settings where clients observe only partially overlapping feature subsets. In these heterogeneous feature spaces, parameter-averaging methods (e.g., FedAvg) transfer little information across weakly overlapping or disjoint feature groups, limiting their effectiveness for federated imputation. To overcome this, we propose FedHF-Impute, a federated imputation framework that separates structural feature unavailability from conventional missingness and uses a shared global feature graph to propagate information across statistically related features through message passing. This enables indirect cross-client knowledge transfer, even when features are never jointly observed locally, while preserving standard federated communication. Under simulated partial schema overlap on the SECOM and AirQuality datasets, FedHF-Impute improves imputation accuracy (RMSE) over FL baselines by 26.9% and 8.4%, respectively, while achieving comparable performance on PhysioNET, with only a 0.3% difference relative to the best baseline.

2605.16094 2026-05-18 cs.IT cs.AI math.IT 版本更新

GeoGS-CE: Learning Delay–Beam Channel Priors with 3D Gaussians for High-Mobility Scenarios

GeoGS-CE: 利用3D高斯分布学习延迟-波束信道先验以应对高机动场景

Yumeng Zhang, Jiajia Guo, Chaozheng Wen, Chenghong Bian, Jun Zhang

发表机构 * iComAI Lab, HKUST(iComAI实验室,香港科技大学)

AI总结 本文提出GeoGS-CE框架,通过3D高斯分布建模高机动场景中的信道特性,利用延迟-波束功率谱作为先验信息,提升稀疏导频下的信道频率响应重建精度。

详情
AI中文摘要

宽带信道估计(CE)在高机动场景中仍具挑战性,因为信道响应变化迅速,而实际系统只能分配稀疏导频以适应密集用户。幸运的是,许多高机动环境,如高速铁路,表现出预定轨迹、可预测速度和有限主导传播路径。这些特性诱导出的延迟-波束功率谱比瞬时复信道频率响应(CFR)更稳定,对随机相位相干性更不敏感,并富含几何信息。为利用这些环境特性,我们提出GeoGS-CE,一种针对稀疏导频高机动场景的两阶段信道估计框架。在离线阶段,GeoGS-CE联合建模:1)场景级3D高斯表示,捕捉非视线(NLoS)几何散射支持;2)泄漏感知的可微无线渲染过程,将NLoS高斯分布与显式虚拟视线(LoS)组件映射到测量的延迟-波束功率谱,同时考虑实际OFDM延迟和阵列泄漏效应。在在线阶段,为每个用户位置预测延迟-波束功率谱,并用作强协方差先验,通过线性MMSE估计器实现准确的全带和全阵列CFR重建和跟踪。基于广深高速铁路一段线路生成的信道仿真表明,所提出的几何先验显著提高了CFR重建性能,优于仅导频和非几何基线。

英文摘要

Wideband channel estimation (CE) in high-mobility scenarios remains challenging because channel responses vary rapidly, while practical systems can allocate only sparse pilots to accommodate dense users. Fortunately, many high-mobility environments, such as high-speed railways, exhibit scheduled trajectories, predictable velocities, and a limited number of dominant propagation paths. These properties induce a delay–beam power spectrum that is more stable than the instantaneous complex channel frequency response (CFR), less sensitive to the random phase coherence, and rich in geometric information. To exploit such environmental properties, we propose GeoGS-CE, a two-stage channel estimation framework for sparse-pilot high-mobility scenarios. In the offline stage, GeoGS-CE jointly models: 1) a scene-level 3D Gaussian representation that captures the non-line-of-sight (NLoS) geometric scattering support, and 2) a leakage-aware differentiable wireless rendering process that maps the NLoS Gaussians, together with an explicit virtual line-of-sight (LoS) component, to the measured delay–beam power spectrum, while accounting for practical OFDM delay and array leakage effects. In the online stage, the delay–beam power spectrum is predicted for each user location and used as a strong covariance prior, enabling accurate full-band and full-array CFR reconstruction and tracking through a linear MMSE estimator. Simulations based on channels generated from a segment of the Guangshen high-speed railway show that the proposed geometric prior substantially improves CFR reconstruction over pilot-only and non-geometric baselines.
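How a power spectrum can act as a covariance prior in a linear MMSE estimator is easiest to see in a deliberately simplified setting. The sketch below assumes each delay–beam coefficient is observed directly in white noise (y = h + n) with a diagonal prior covariance, which is far simpler than the sparse-pilot model in the abstract; the function name and all numbers are hypothetical.

```python
def lmmse_diagonal(y, prior_power, noise_var):
    """LMMSE estimate of h from y = h + n, with zero-mean h,
    Cov(h) = diag(prior_power) and Cov(n) = noise_var * I (assumed model)."""
    return [p / (p + noise_var) * yi for yi, p in zip(y, prior_power)]

# Hypothetical delay-beam power prior: two dominant paths, two empty bins.
prior_power = [1.0, 0.0, 0.25, 0.0]
# Noisy observations (real-valued for simplicity; complex values work the same way).
y = [0.9, 0.3, -0.4, -0.2]

h_hat = lmmse_diagonal(y, prior_power, noise_var=0.25)
```

Bins where the prior predicts no power are shrunk to zero, while bins with strong predicted power keep most of the observation. That shrinkage is the role a predicted power spectrum would play in this simplified picture.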

2605.16089 2026-05-18 cs.LG cs.AI 版本更新

Centralized vs Decentralized Federated Learning: A trade-off performance analysis

集中式与去中心化联邦学习:性能权衡分析

Chaimaa Medjadji, Guilain Leduc, Sylvain Kubler, Yves Le Traon

发表机构 * University of Luxembourg(卢森堡大学)

AI总结 本文通过Fedstellar模拟器、MNIST数据集和MLP分类器,对比分析集中式、去中心化和半去中心化联邦学习架构的性能权衡,揭示不同应用场景下的优劣势。

详情
AI中文摘要

联邦学习(FL)作为一种在分布式边缘设备上进行协作模型训练同时保护数据隐私的有前景范式,尤其在物联网设备数量激增的情况下显得尤为重要。然而,将如此大量的数据集中存储面临通信限制、隐私和法规等问题。FL可以是集中式(CFL)、去中心化(DFL)或半去中心化(SDFL)。选择合适的FL架构取决于应用需求。然而,非常少的研究通过实验比较了这三种架构,不仅为了理解各自的优势和局限性,还为了探讨不同性能指标之间的权衡。本文克服了这一分析的不足,利用Fedstellar模拟器、MNIST数据集和MLP分类器进行实验分析。

英文摘要

Federated Learning (FL) has emerged as a promising paradigm for collaborative model training across distributed edge devices while preserving data privacy, especially given the sharp increase in data volume driven by the growing number of IoT devices. Storing this amount of data centrally is challenging due to issues such as limited communication, privacy, and regulations. FL can be Centralized (CFL), Decentralized (DFL), or Semi-decentralized (SDFL). Choosing the right FL architecture depends on the application's needs. However, very few studies have experimentally compared these three architectures to understand not only their respective strengths and limitations but also the trade-offs between different performance indicators. This paper addresses this gap by conducting experimental analyses using the Fedstellar simulator, the MNIST dataset, and an MLP classifier.

2605.16088 2026-05-18 cs.LG cs.AI 版本更新

Multi-level Self-supervised Pretraining on Compositional Hierarchical Graph for Molecular Property Prediction

基于组合层次图的多级自监督预训练用于分子性质预测

Xiayu Liu, Zhengyi Lu, Hou-biao Li

发表机构 * School of Mathematical Sciences(数学科学学院) University of Electronic Science and Technology of China(电子科技大学) Department of Computer Science and Engineering(计算机科学与工程系) Oakland University(奥克兰大学)

AI总结 本文提出MolCHG框架,通过多级自监督预训练提升分子性质预测性能,采用组合层次图组织分子结构,引入键图增强化学键信息,实现原子与化学键语义的平等聚合。

Comments 11 pages, 4 figures

详情
AI中文摘要

自监督预训练在分子图上已展现出分子性质预测的潜力,但现有方法多在单一结构粒度上操作,将化学键信息视为辅助边属性而非独立语义层。本文提出MolCHG,一种基于新型组合层次图的多级自监督预训练框架,将分子结构组织为跨三个语义层级的四种节点类型。通过引入与原子图并行的键图,该架构将键层面信息提升为独立演化的节点表示,使片段节点能平等聚合原子层面和键层面语义。设计了三个层级特定的预训练目标:原子-键跨视图对比任务对齐每个片段内的原子视图和键视图表示;片段级官能团预测任务注入领域相关的化学知识;图级结构预测任务编码全局分子拓扑。在九个MoleculeNet基准测试中,MolCHG在涵盖分类和回归任务的七个数据集上取得最佳性能,在其余数据集上与最强基线相当。消融研究进一步确认多级监督信号互补,每个组件均对整体性能有贡献。

英文摘要

Self-supervised pretraining on molecular graphs has emerged as a promising approach for molecular property prediction, yet most existing methods operate at a single structural granularity and treat bond information as auxiliary edge attributes rather than as an independent semantic layer. In this work, we propose MolCHG, a multi-level self-supervised pretraining framework built upon a novel Compositional Hierarchical Graph that organizes molecular structure into four types of nodes across three semantic levels. By introducing a bond graph that operates in parallel with the atom graph, our architecture elevates bond-level information to independently evolving node representations, enabling fragment nodes to aggregate atom-level and bond-level semantics on an equal footing. We design three level-specific pretraining objectives: an atom-bond cross-view contrastive task that aligns the atom-view and bond-view representations within each fragment, a fragment-level functional group prediction task to inject domain-relevant chemical knowledge, and graph-level structure prediction tasks to encode global molecular topology. Experiments on nine MoleculeNet benchmarks demonstrate that MolCHG achieves the best performance on seven datasets across both classification and regression tasks, remaining competitive with the strongest baselines on the rest. Ablation studies further confirm that the multi-level supervision signals are complementary and that each component contributes to the overall performance.

2605.16085 2026-05-18 cs.DB cs.AI 版本更新

Towards Foundation Models for Relational Databases with Language Models and Graph Neural Networks

迈向关系数据库基础模型:结合语言模型与图神经网络

Jingcheng Wu, Ratan Bahadur Thapa, Mojtaba Nayyeri, Lucas Etteldorf, Max Finkenbeiner, Fabian Leeske, Steffen Staab

发表机构 * University of Stuttgart, Stuttgart, Germany(斯图加特大学) Internet Science Research Group, University of Southampton, Southampton, United Kingdom(互联网科学研究组,南安普顿大学)

AI总结 本文提出结合语言模型和图神经网络的混合架构,通过关系实体图建模提升关系数据库的预测性能,实验表明其在多个任务中表现优异,接近监督基线并缩小与RDL的差距。

Comments 15 pages, 7 figures, 4 tables. Preprint of a paper accepted at the 1st Workshop on Extraction from Triplet Text-Table-Knowledge Graph and associated Challenge (TRIPLET), co-located with ESWC 2026

详情
AI中文摘要

关系数据库存储了大量结构化信息,对复杂预测应用至关重要。然而,关系数据的深度学习进展有限,传统方法通过人工特征工程将数据库扁平化为单表,丢失了关系上下文。关系深度学习(RDL)通过将数据库建模为关系实体图(REGs)供图神经网络(GNNs)处理来解决这一问题,但仍局限于特定任务和数据库。为结合两种范式的优势,本文提出混合架构,结合微调的BART编码器捕捉行内语义,以及基于GraphSAGE的GNN处理REGs注入关系上下文。在RelBench上的实验表明,GNN显著丰富BART的行嵌入,在rel-f1数据集的driver-dnf任务上实现67.40的ROC-AUC。该性能与监督基线如LightGBM(68.86)相当,并将与RDL(72.62)的差距缩小至5.22点以内,尽管与最先进的基础模型如KumoRFM(82.63)仍有较大差距。这些结果表明,轻量级混合LM-GNN架构为关系数据库的基础模型提供了有前景且资源高效的路径。

英文摘要

Relational databases store much of the world's structured information, and they are essential for driving complex predictive applications. However, deep learning progress on relational data remains limited, as conventional approaches flatten databases into single tables via manual feature engineering, discarding relational context. Relational deep learning (RDL) addresses this by modeling databases as relational entity graphs (REGs) for graph neural networks (GNNs), but remains task- and database-specific. To combine the strengths of both paradigms, we propose a hybrid architecture combining a fine-tuned BART encoder to capture intra-row semantics with a GraphSAGE-based GNN over REGs to inject relational context. Experiments on RelBench show that the GNN substantially enriches BART's row embeddings, achieving a ROC-AUC of 67.40 on the driver-dnf task from the rel-f1 dataset. This performance is competitive with supervised baselines such as LightGBM (68.86) and narrows the gap to RDL (72.62) to within 5.22 points, though a substantial gap remains to state-of-the-art foundation models such as KumoRFM (82.63). These results suggest that lightweight hybrid LM-GNN architectures offer a promising and resource-efficient path towards foundation models for relational databases.

2605.16079 2026-05-18 cs.CV cs.AI cs.HC 版本更新

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

VideoSeeker:通过原生代理工具调用激励实例级视频理解

Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学) Xiaohongshu Inc.(小红书公司) East China Normal University(华东师范大学) Xi’an Jiaotong University(西安交通大学)

AI总结 VideoSeeker通过整合代理推理与实例级视频理解任务,提升视频理解精度,实验表明其在实例级任务中比基线模型提升13.7%,超越GPT-4o和Gemini-2.5-Pro。

Comments Project Page: https://gaotiexinqu.github.io/VideoSeeker/

详情
AI中文摘要

大型视觉-语言模型(LVLMs)在视频理解上取得了显著进展,但在需要精确实例级时空定位的任务中面临重大挑战。现有方法主要依赖文本提示进行人机交互,但这些提示难以提供精确的空间和时间参考,导致用户体验不佳。此外,当前方法通常将视觉感知与语言推理解耦,以语言为中心而非视觉内容,限制了模型主动感知细粒度视觉证据的能力。为解决这些问题,我们提出VideoSeeker,一种通过视觉提示实现实例级视频理解的新范式。VideoSeeker无缝整合代理推理与实例级视频理解任务,使模型能够主动感知并按需检索相关视频片段。我们构建了一个四阶段全自动数据合成管道,高效生成大规模高质量的实例级视频数据。我们通过冷启动监督和强化学习训练将工具调用和主动感知能力内化到模型中,构建了一个强大的视频理解模型。实验表明,我们的模型在实例级视频理解任务中平均比基线模型提升13.7%,超越强大的闭源模型如GPT-4o和Gemini-2.5-Pro,同时在通用视频理解基准上也表现出有效的迁移能力。相关数据集和代码将公开发布。

英文摘要

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.

2605.16076 2026-05-18 cs.CV cs.AI 版本更新

AgriMind: An Ensemble Deep Learning Framework for Multi-Class Plant Disease Classification

AgriMind:一种用于多类植物疾病分类的集成深度学习框架

Salma Hoque Talukdar Koli, Fahima Haque Talukder Jely

发表机构 * RTM Al-Kabir Technical University(RTM阿克比爾技術大學) North East University Bangladesh(東北大學(孟加拉國))

AI总结 本文提出AgriMind框架,利用ResNet50、EfficientNet-B0和DenseNet121模型集成,通过迁移学习实现对15种植物疾病的高精度分类,集成模型在测试集上达到99.23%的准确率。

详情
AI中文摘要

在孟加拉国,植物疾病检测仍主要依赖人工检查。我们构建了AgriMind系统,通过集成ResNet50、EfficientNet-B0和DenseNet121模型,利用20,638张PlantVillage图像进行训练。使用冻结的ImageNet主干和头-only训练,保持了轻量级的管道。单个模型在测试集上达到96-97%的准确率,但通过平均softmax输出,集成模型达到99.23%的准确率,错误率降低三分之二。我们尝试偏向最佳验证模型,但效果不佳。删除任一模型也损害性能。辣椒和土豆分类完美,而番茄在十个视觉相似类别中仍达到99.01%的准确率。在NVIDIA T4 GPU上,完整集成模型以53 FPS运行。是否能实现实时移动应用取决于TensorFlow Lite优化,这项工作尚未完成。

英文摘要

Plant disease detection is still largely manual in Bangladesh, where extension workers eyeball leaf samples across millions of smallholdings. We built AgriMind to automate this: an ensemble of ResNet50, EfficientNet-B0, and DenseNet121 trained on 20,638 PlantVillage images across 15 pepper, potato, and tomato disease classes. Transfer learning with frozen ImageNet backbones and 10 epochs of head-only training keeps the pipeline lightweight. Individual models hit 96–97% on the held-out test set, but averaging their softmax outputs pushes the ensemble to 99.23%, a two-thirds cut in error rate. We tried biasing the average toward the best validation model; it backfired. Dropping any single model also hurt. Pepper and potato classify perfectly; tomato, with ten visually similar classes, still reaches 99.01%. On an NVIDIA T4 GPU the full ensemble runs at 53 FPS. Whether that translates to real-time mobile use depends on TensorFlow Lite optimization, which we have not yet completed.
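The unweighted softmax averaging mentioned above can be sketched in a few lines. The logits here are made up for a single image over three classes and do not come from the AgriMind models; `ensemble_predict` is a hypothetical helper name.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(per_model_logits):
    """Average the softmax outputs of several models; return (class index, averaged probs)."""
    probs = [softmax(l) for l in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

# Made-up logits for one image from three models.
logits = [
    [2.0, 1.9, 0.1],   # model A narrowly prefers class 0
    [0.5, 3.0, 0.2],   # model B confidently prefers class 1
    [1.0, 2.5, 0.3],   # model C prefers class 1
]
label, avg_probs = ensemble_predict(logits)
```

Here one model's narrow mistake is outvoted by the other two, which is the usual intuition for why equal-weight averaging can beat each individual model.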

2605.16065 2026-05-18 cs.CV cs.AI 版本更新

Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting

鲁棒的先验引导分割用于可编辑的3D高斯散射

Raushan Joshi, Jean-Yves Guillemaut

发表机构 * University of Surrey(萨里大学)

AI总结 本文提出利用SAM-HQ生成准确2D掩码,通过先验引导标签重新分配实现鲁棒的3D分割,提升编辑任务的精度和效率。

Comments Accepted at IEEE International Conference on Image Processing 2026, 6 pages

详情
AI中文摘要

3D高斯散射(3D-GS)实现了实时3D场景重建,但缺乏用于编辑任务如物体移除、提取和重新着色的鲁棒分割。现有方法将2D分割提升到3D领域时面临视图不一致和粗掩码的问题。本文提出一种新的框架,利用Segment Anything Model High Quality(SAM-HQ)生成准确的2D掩码,解决标准SAM在边界保真度和细结构保持方面的局限。为实现给定场景中任意目标物体的鲁棒3D分割,我们引入了先验引导的标签重新分配方法,通过与学习先验的多视图一致性来为3D高斯分配标签。我们的方法实现了最先进的分割精度,并在保持高视觉保真度的同时实现了交互式、实时的物体编辑。定性结果表明在虚拟现实(VR)和机器人领域具有优越的边界保持和实际应用价值,推动了3D场景编辑的发展。

英文摘要

3D Gaussian Splatting (3D-GS) enables real-time 3D scene reconstruction but lacks robust segmentation for editing tasks such as object removal, extraction, and recoloring. Existing approaches that lift 2D segmentations to the 3D domain suffer from view inconsistencies and coarse masks. In this paper, we propose a novel framework that leverages the Segment Anything Model High Quality (SAM-HQ) to generate accurate 2D masks, addressing the limitations of the standard SAM in boundary fidelity and fine-structure preservation. To achieve robust 3D segmentation of any target object in a given scene, we introduce a prior-guided label reassignment method that assigns labels to 3D Gaussians by enforcing multiview consistency with learned priors. Our approach achieves state-of-the-art segmentation accuracy and enables interactive, real-time object editing while maintaining high visual fidelity. Qualitative results demonstrate superior boundary preservation and practical utility in Virtual Reality (VR) and robotics, advancing 3D scene editing.

2605.16054 2026-05-18 cs.LG cs.AI 版本更新

Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making

Ada-Diffuser: 面向决策制定的潜在意识自适应扩散模型

Fan Feng, Selena Ge, Minghao Fu, Zijian Li, Yujia Zheng, Zeyu Tang, Yingyao Hu, Biwei Huang, Kun Zhang

发表机构 * University of California San Diego(加州大学圣地亚哥分校) Carnegie Mellon University(卡内基梅隆大学) MBZUAI Stanford University(斯坦福大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文提出Ada-Diffuser,通过显式建模潜在动态过程,提升决策制定的精度与适应性,实验验证其在模拟控制与机器人基准中的有效性。

Comments ICLR 2026

详情
AI中文摘要

近期研究将决策制定视为序列建模问题,利用生成模型如扩散模型进行建模。尽管有前景,这些方法常忽视具有演变动态的潜在因素,这些因素对环境转换、奖励结构和高级智能体行为至关重要。显式建模这些隐藏过程对精确动态建模和有效决策制定至关重要。本文提出一个统一框架,从最小但足够的观测中显式整合潜在动态推断。理论表明,在温和条件下,潜在过程可以从少量时间观测块中识别。基于此见解,我们引入Ada-Diffuser,一种因果扩散模型,同时学习观测互动的时间结构和潜在动态,并进一步利用它们进行规划和控制。通过模块化设计,Ada-Diffuser支持规划和策略学习任务,能够适应动态、奖励和潜在动作的潜在变化。在模拟控制和机器人基准中的实验验证了其在准确潜在推断和自适应策略学习中的有效性。

英文摘要

Recent work has framed decision-making as a sequence modeling problem using generative models such as diffusion models. Although promising, these approaches often overlook latent factors that exhibit evolving dynamics, elements that are fundamental to environment transitions, reward structures, and high-level agent behavior. Explicitly modeling these hidden processes is essential for both precise dynamics modeling and effective decision-making. In this paper, we propose a unified framework that explicitly incorporates latent dynamic inference into generative decision-making from minimal yet sufficient observations. We theoretically show that under mild conditions, the latent process can be identified from small temporal blocks of observations. Building on this insight, we introduce Ada-Diffuser, a causal diffusion model that learns the temporal structure of observed interactions and the underlying latent dynamics simultaneously, and furthermore, leverages them for planning and control. With a modular design, Ada-Diffuser supports both planning and policy learning tasks, enabling adaptation to latent variations in dynamics, rewards, and latent actions. Experiments on simulated control and robotic benchmarks demonstrate its effectiveness in accurate latent inference and adaptive policy learning.

2605.16052 2026-05-18 cs.AI cs.CL 版本更新

Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

推理者还是翻译者?污染感知的评估与税法中的神经符号鲁棒性

Parisa Kordjamshidi, Samer Aslan, Madhavan Seshadri, Leslie Barrett, Enrico Santus

发表机构 * Bloomberg(彭博社) Michigan State University(密歇根州立大学)

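AI summary: This paper studies how data contamination inflates LLM performance on tax law reasoning and evaluates neuro-symbolic hybrid systems as a more reliable and robust foundation for legal AI.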

详情
AI中文摘要

近期大型语言模型(LLM)的进步显著增强了自动化法律推理能力。然而,其性能反映的是真正的法律推理能力还是数据污染的产物仍不明确。本文对税法推理方法进行了全面实证研究,并实施了污染检测协议以严格评估LLM的可靠性。我们发现性能可能因污染而被夸大。基于此分析,我们进行了系统评估,比较了单一LLM与混合系统,后者将法律文本翻译为形式化表示并委托符号求解器进行推理。我们构建了一个新的测试套件,通过案例和规则变化来测试对未见文档的泛化能力。我们的发现表明,法律推理本质上是组合性的,神经符号框架为法律AI提供了更可靠和稳健的基础,以及对未观测情境的更好泛化能力。

英文摘要

Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.

2605.16048 2026-05-18 cs.LG cs.AI 版本更新

Looped SSMs: Depth-Recurrence and Input Reshaping for Time Series Classification

循环SSM:用于时间序列分类的深度递归与输入重塑

Mónika Farsang, Ramin Hasani, Daniela Rus, Radu Grosu

发表机构 * TU Wien(维也纳技术大学) MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) Liquid AI

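AI summary: This paper applies looped (depth-recurrent) SSMs to time series classification and shows experimentally that depth-recurrence and input reshaping each improve model performance.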

详情
AI中文摘要

状态空间模型(SSM)本质上是沿序列维度递归的,但深度递归——在层之间重复使用相同模块——在SSM家族中尚未被探索。我们证明,一个具有k个参数迭代L次的循环SSM在四个架构(LRU、S5、LinOSS、LrcSSM)和六个时间序列分类基准上,能够与或优于具有k·L个独立参数的标准SSM相媲美,尽管其在严格更小的假设空间内运行。由于较大模型包含循环模型作为特殊情况,这种主导不能归因于表达力,而是指深度递归中的参数共享作为有益的归纳偏置,简化了优化。这些结果表明,深度递归与序列递归是正交的,并且独立有益。我们进一步表明,输入重塑是同样被忽视的设计轴:将时间步连接起来用于低维输入,或对高维输入进行扁平化和重新分块,能带来1-6%的准确率提升,经5个随机种子验证。这两种技术提供了独立的改进,当结合使用时会相辅相成,表明深度和输入重塑是SSM在时间序列上的两个独立且未被充分探索的设计轴。

英文摘要

State Space Models (SSMs) are inherently recurrent along the sequence dimension, yet depth-recurrence (reusing the same block repeatedly across layers, as recently applied in looped transformers) has not been explored in this model family. We show that a looped SSM with $k$ parameters iterated $L$ times consistently matches closely or outperforms a standard SSM with $k \cdot L$ independent parameters across four architectures (LRU, S5, LinOSS, LrcSSM) and six time series classification benchmarks, despite operating within a strictly smaller hypothesis space, as we formally establish. Since the larger model contains the looped model as a special case, this dominance cannot be explained by expressivity and instead points to parameter sharing across depth as a beneficial inductive bias that simplifies optimization. These results demonstrate that depth-recurrence is orthogonal to sequence-recurrence and independently beneficial. We further show that input reshaping is an equally neglected design axis: concatenating timesteps for low-dimensional inputs, or flattening and rechunking the joint feature-time dimension for high-dimensional ones, yields accuracy gains of 1-6% across all models, confirmed over 5 random seeds. Both techniques provide standalone improvements that compound when combined, suggesting that depth and input reshaping are two independent and underexplored design axes for SSMs on time series.
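The two design axes in this abstract can be illustrated with a small sketch. It is not the authors' implementation: `concat_timesteps`, `looped`, and `toy_block` are hypothetical names, and `toy_block` is only a placeholder (a causal running mean followed by tanh) standing in for a real SSM block such as LRU or S5.

```python
import math
from typing import Callable, List

Series = List[List[float]]  # shape (T, d): T timesteps, d features per step


def concat_timesteps(x: Series, chunk: int) -> Series:
    """Input reshaping for low-dimensional inputs.

    Merges every `chunk` consecutive timesteps into one wider step:
    (T, d) -> (T // chunk, chunk * d). Trailing steps that do not fill
    a whole chunk are dropped (an assumption made for this sketch).
    """
    usable = len(x) - len(x) % chunk
    return [
        [value for step in x[start:start + chunk] for value in step]
        for start in range(0, usable, chunk)
    ]


def looped(block: Callable[[Series], Series], loops: int) -> Callable[[Series], Series]:
    """Depth-recurrence: apply one block, with one set of parameters, `loops` times."""
    def forward(x: Series) -> Series:
        for _ in range(loops):
            x = block(x)
        return x
    return forward


def toy_block(x: Series) -> Series:
    """Placeholder for an SSM block: causal running mean, then tanh."""
    out, state = [], [0.0] * len(x[0])
    for t, step in enumerate(x, start=1):
        state = [s + (v - s) / t for s, v in zip(state, step)]
        out.append([math.tanh(s) for s in state])
    return out


series = [[float(t)] for t in range(6)]       # T=6, d=1
reshaped = concat_timesteps(series, chunk=3)  # T=2, d=3
output = looped(toy_block, loops=4)(reshaped)
```

A standard stacked model would instead hold `loops` independent blocks; the looped variant reuses one, which is the parameter sharing the abstract credits as an inductive bias.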

2605.16046 2026-05-18 cs.SE cs.AI 版本更新

XSearch: Explainable Code Search via Concept-to-Code Alignment

XSearch: 通过概念到代码对齐实现可解释的代码搜索

Yiming Liu, Ruofan Liu, Yun Lin, Zicong Zhang, Weiyu Kong, Pengnian Qi, Xiao Cheng, Weinan Zhang, Qianxiang Wang, Linpeng Huang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) National University of Singapore(新加坡国立大学) Huawei Technologies Co., Ltd(华为技术有限公司)

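AI summary: This paper proposes XSearch, a framework that recasts code search as a concept alignment problem to improve explainability and generalization, with large gains on out-of-distribution benchmarks.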

Comments Accepted to ISSTA 2026

详情
AI中文摘要

语义代码搜索在学术和工业中广泛应用。这些方法将自然语言查询和代码片段嵌入共享嵌入空间并基于向量相似性检索结果。尽管在基准数据集上表现强劲,但往往存在可解释性和泛化能力差的问题。检索的代码可能在语义上相似,却遗漏了查询的关键功能需求,且无法解释为何选择该结果。此外,这种失败在分布偏移下更加严重,模型难以泛化到未见过的基准。本文提出XSearch,一种内在可解释的代码搜索框架。我们的关键见解是,通过依赖全局嵌入相似性,现有检索器本质上采取归纳观点。它们学习统计模式而非真正理解查询的功能需求。我们通过将代码搜索重新表述为演绎的概念对齐问题来解决这一问题。XSearch (i) 在查询中识别功能概念 (ii) 明确将这些概念与相应代码语句对齐。这种解释后再预测的设计产生内在的概念级解释,并减轻影响分布偏移泛化的捷径学习。我们训练一个具有显式概念对齐目标的编码器,并通过查询概念与代码语句之间的显式匹配进行检索。实验显示,训练在CodeSearchNet使用GraphCodeBERT (125M参数) 的XSearch在分布偏移基准测试中的性能从0.02提升到0.33 (15倍) 超过八种最先进的检索器,并且在参数高达7B的基线中表现一致。用户研究显示,概念对齐的解释使用户能够更快更准确地评估检索结果。

英文摘要

Semantic code search has been widely adopted in both academia and industry. These approaches embed natural-language queries and code snippets into a shared embedding space and retrieve results based on vector similarity. Despite strong performance on benchmark datasets, they often suffer from poor explainability and generalization. Retrieved code may appear semantically similar yet miss critical functional requirements of the query, while providing no explanation of why the result was retrieved. Moreover, such failures become more severe under distribution shift, where models struggle to generalize to unseen benchmarks. In this work, we propose XSearch, an intrinsically explainable code search framework. Our key insight is that by relying on global embedding similarity, existing retrievers inherently take an inductive view. They learn statistical patterns rather than truly understanding the query's functional requirements. We address this problem by reformulating code search as a deductive concept alignment problem. XSearch (i) identifies functional concepts in the query and (ii) explicitly aligns them with corresponding code statements. This explain-then-predict design produces inherent concept-level explanations and mitigates shortcut learning that harms out-of-distribution generalization. We train an encoder with explicit concept-alignment objectives and perform retrieval through explicit matching between query concepts and code statements. Experiments show that, trained on CodeSearchNet using GraphCodeBERT (125M parameters), XSearch improves performance on out-of-distribution benchmarks from 0.02 to 0.33 (15x) over eight state-of-the-art retrievers, and consistently outperforms both encoder- and decoder-based baselines with up to 7B parameters. A user study demonstrates that concept-alignment explanations enable users to evaluate retrieved results faster and more accurately.

2605.16045 2026-05-18 cs.CL cs.AI cs.LG 版本更新

RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents

RecMem:基于递归的记忆巩固用于高效且有效的长运行LLM代理

Zijie Dai, Shiyuan Deng, Sheng Guan, Yizhou Tian, Xin Yao, Xiao Yan, James Cheng

发表机构 * Department of Computer Science and Engineering, The Chinese University of Hong Kong(香港中文大学计算机科学与工程系) School of Computer Science, Beijing University of Posts and Telecommunications(北京邮电大学计算机学院) Huawei Cloud(华为云) Huawei Theory Lab(华为理论实验室) Institute for Math and AI, Wuhan University(武汉大学数学与人工智能研究院)

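AI summary: RecMem triggers memory consolidation only when semantically similar interactions recur, cutting token consumption while improving accuracy for long-running LLM agents.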

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

记忆系统通常将用户-代理交互组织为可检索的外部记忆,对长运行代理至关重要,因为它克服了LLM的有限上下文窗口。然而,现有记忆系统每次调用LLM处理交互以提取记忆,导致大量token消耗。为解决此问题,我们提出RecMem,重新思考何时进行记忆巩固。RecMem将输入交互存储在潜意识记忆层,并使用轻量级嵌入模型进行编码以供检索。LLM仅在观察到持续递归且语义相似的交互时才被调用以提取事件性和语义记忆。这种基于递归的巩固工作是因为这些交互对应于具有丰富信息的语义簇,因此值得提取和总结。为了提高准确性,RecMem还结合了语义细化机制,以恢复记忆提取中遗漏的细粒度事实。实验表明,RecMem将三种SOTA记忆系统的内存构建token成本减少了高达87%,同时超过其准确性。

英文摘要

Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents because they overcome the limited context windows of LLMs. However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction, and such an eager memory consolidation scheme leads to substantial token consumption. To tackle this problem, we propose RecMem by rethinking when memory consolidation should be conducted. RecMem stores incoming interactions in a subconscious memory layer and encodes them using lightweight embedding models for retrieval. LLMs are only invoked to extract episodic and semantic memory when sustained recurrence is observed for semantically similar interactions. Such recurrence-based consolidation works because these interactions correspond to a semantic cluster with rich information and thus are worth extraction and summarization. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers the fine-grained facts omitted by memory extraction. Experiments show that RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy.
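The recurrence trigger described above might look roughly like the following sketch. Everything here is an assumption for illustration: the class name `RecurrenceMemory`, the cosine threshold, the recurrence count of three, and the choice to remove consolidated items from the buffer are not taken from the paper, and `embed` and `consolidate` are caller-supplied stand-ins for an embedding model and an LLM call.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RecurrenceMemory:
    """Buffers interactions cheaply; calls the costly consolidator only on recurrence."""

    def __init__(
        self,
        embed: Callable[[str], List[float]],       # lightweight embedding model (assumed)
        consolidate: Callable[[List[str]], str],   # LLM-backed extractor (assumed)
        sim_threshold: float = 0.8,                # hypothetical value
        min_recurrence: int = 3,                   # hypothetical value
    ) -> None:
        self.embed = embed
        self.consolidate = consolidate
        self.sim_threshold = sim_threshold
        self.min_recurrence = min_recurrence
        self.buffer: List[Tuple[str, List[float]]] = []
        self.memories: List[str] = []

    def add(self, interaction: str) -> bool:
        """Store one interaction; return True if consolidation was triggered."""
        vec = self.embed(interaction)
        self.buffer.append((interaction, vec))
        similar = [
            i for i, (_, v) in enumerate(self.buffer)
            if cosine(v, vec) >= self.sim_threshold
        ]
        if len(similar) < self.min_recurrence:
            return False  # no LLM call: the interaction only sits in the buffer
        self.memories.append(self.consolidate([self.buffer[i][0] for i in similar]))
        drop = set(similar)
        self.buffer = [item for i, item in enumerate(self.buffer) if i not in drop]
        return True
```

With a consolidator that costs one LLM call, a stream of unrelated interactions incurs no calls at all, which is the source of the token savings the abstract reports.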

2605.16043 2026-05-18 cs.RO cs.AI 版本更新

Learning Sim-Grounded Policies for Bimanual Rope Manipulation from Human Teleoperation Data

从人类遥操作数据中学习基于仿真的双臂绳索操作策略

Gina Wigginghaus, Tim Missal, Berk Guler, Simon Manschitz, Jan Peters

发表机构 * Technical University of Darmstadt(达姆施塔特工业大学) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心) Robotics Institute Germany (RIG)(德国机器人研究所) Centre for Cognitive Science(认知科学研究中心) Honda Research Institute Europe GmbH(本田欧洲研究院)

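AI summary: This paper investigates whether the poor generalization of vision-based policies on knot untangling stems from the observation space rather than the policy architecture or data scale. Comparing two Action Chunking with Transformers policies, it finds that the state-based policy reduces L1 error by 30.8% when predicting the initial grasp-and-pull action.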

Comments Accepted to the Beyond Teleoperation Workshop at ICRA 2026, 5 pages, 2 figures

详情
AI中文摘要

可变形线性物体(DLOs)如绳子和电缆在家庭和工业应用中广泛存在,但因其无限维的配置空间和频繁的自我遮挡而难以操控。从遥控学习双臂DLO操控提供了实用路径,但其可扩展性受限于人力,使得观察空间的选择对从小数据集泛化至关重要。本文研究了在解结任务中基于视觉的策略泛化能力不足是否源于观察空间本身而非策略架构或数据规模。我们比较了两种基于动作分块与变压器的策略,均训练于相同双臂遥控数据:一种基于两个装在腕部相机上的眼动RGB流的视觉策略,另一种基于DLO的3D粒子状态的策略,该状态通过多视角融合从初始观察中提取,并在基于粒子的扩展位置基于动力学模拟中演化。在未见过的绳子配置上进行开环评估,基于状态的策略在预测初始抓取和拉拽动作时,L1误差减少了30.8%,量化了像素与物理一致状态之间的可观察性差距,并指出了从有限人类演示中更高效地学习DLO操控任务的机器人学习方向。

英文摘要

Deformable Linear Objects (DLOs) such as ropes and cables are widely encountered in both household and industrial applications, yet remain challenging to manipulate due to their infinite-dimensional configuration space and frequent self-occlusion. Imitation learning from teleoperation offers a practical path to bimanual DLO manipulation, but its scalability is limited by human effort, making the choice of observation space critical for generalization from small datasets. In this study, we investigate whether the lack of generalization in egocentric visual policies for the knot-untangling task stems from the observation space itself, rather than from the policy architecture or data scale. We compare two Action Chunking with Transformers policies trained on the same bimanual teleoperation data: a vision-based policy conditioned on two egocentric RGB streams from wrist-mounted cameras, and a state-based policy conditioned on the DLO's 3D particle state, extracted from an initial observation via multi-view fusion and evolved in a particle-based eXtended Position-Based Dynamics simulation. Evaluated open-loop on an unseen rope configuration, the state-based policy outperforms its visual counterpart with a 30.8% reduction in L1 error when predicting the initial grasp-and-pull action, quantifying the observability gap between pixels and physics-consistent state, and pointing toward more data-efficient robot learning for the DLO manipulation task from limited human demonstrations.

2605.16035 2026-05-18 cs.CR cs.AI cs.MA 版本更新

Who Owns This Agent? Tracing AI Agents Back to Their Owners

谁拥有这个智能体?追溯AI智能体回其所有者

Ruben Chocron, Doron Jonathan Ben Chayim, Eyal Lenga, Gilad Gressel, Alina Oprea, Yisroel Mirsky

发表机构 * Ben-Gurion University of the Negev, Beer-Sheva, Israel; Center for Cybersecurity Systems & Networks, Amrita Vishwa Vidyapeetham, Amritapuri, India; Northeastern University, Boston, Massachusetts, USA

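AI summary: This paper proposes a canary-based method for agent attribution, addressing the inability to trace malicious or misconfigured agents back to their owners, and reports that it is reliable and robust in real-world scenarios.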

Comments Under Review

详情
AI中文摘要

AI智能体越来越多地被用于在世界中自主行动,但目前仍没有可靠的方法追溯有害智能体回其部署账户。本文将这一缺口定义为智能体归属问题:将观察到的智能体交互链接到托管供应商的负责账户。我们提出了一种基于canary的协议:授权方将canary注入智能体交互流中,供应商在会话日志的狭窄窗口中恢复原始会话和账户。非对抗性情况下简单的canary足够,针对对抗性操作者过滤或改写内容,我们开发了鲁棒的canary构造,使其无法被压制而不影响智能体自身任务性能,从而在防守方获得正式的不对称优势。我们评估了多种场景,包括现实中的智能体,并证明了我们的归属方法在供应商端部署中的可靠性、鲁棒性和可扩展性。

英文摘要

AI agents are increasingly deployed to act autonomously in the world, yet there is still no reliable way to trace a harmful agent back to the account that deployed it. This creates the same accountability gap across both ends of the intent spectrum: benign operators may deploy misconfigured or overbroad agents that cause harm unintentionally, while malicious operators may deliberately weaponize agents for scams, harassment, or cyber attacks. In many cases, these agents are powered by vendor-hosted models, a dependency that holds even for sophisticated adversaries such as state actors conducting cyber operations. In either case, affected parties can observe the behavior but cannot notify the responsible operator, stop the session, or identify the account for investigation. We formalize this gap as the problem of agent attribution: linking an observed agent interaction to the responsible account at the hosting vendor. To our knowledge, this is the first work to define the problem and present a practical solution. Our protocol is canary-based: an authorized party injects a canary into the agent's interaction stream, and the vendor searches a narrow window of session logs to recover the originating session and account. Simple canaries suffice in non-adversarial settings. For adversarial operators who filter or paraphrase incoming content, we develop robust canary constructions that cannot be suppressed without degrading the agent's own task performance, yielding a formal asymmetry in the defender's favor. We evaluate a variety of scenarios including real-world agents and show that our attribution method is reliable, robust, and scalable for vendor-side deployment.
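For the simple, non-adversarial case, the canary protocol can be sketched as below. This is an illustration only: `LogEntry`, `make_canary`, `attribute`, the `cnry-` prefix, and the 60-second window are hypothetical, and the robust canary constructions for adversarial operators are not reproduced here.

```python
import secrets
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LogEntry:
    """One vendor-side session log record (hypothetical schema)."""
    timestamp: float   # seconds since some epoch
    session_id: str
    account_id: str
    content: str       # text the agent received in this session


def make_canary(nbytes: int = 8) -> str:
    """Generate an unguessable marker for injection into the agent's input stream."""
    return "cnry-" + secrets.token_hex(nbytes)


def attribute(
    logs: Iterable[LogEntry],
    canary: str,
    injected_at: float,
    window_s: float = 60.0,
) -> Optional[LogEntry]:
    """Search a narrow time window of logs for the canary.

    Returns the first matching entry, which identifies the originating
    session and account, or None if the canary never reached this vendor.
    """
    for entry in logs:
        in_window = injected_at <= entry.timestamp <= injected_at + window_s
        if in_window and canary in entry.content:
            return entry
    return None
```

Restricting the search to a time window keeps the vendor-side lookup cheap; an exact substring match like this one would fail against an operator who paraphrases incoming content, which is the case the robust constructions target.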

2605.16026 2026-05-18 cs.CL cs.AI 版本更新

From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

从平铺语言标签到类型学先验:面向多语言语音到语音翻译的结构化语言条件化

Yu Pan, Yang Hou, Xiongfei Wu, Liang Zhang, Yves Le Traon, Lei Ma, Jianjun Zhao

发表机构 * School of Information Science and Electrical Engineering, Kyushu University(九州大学信息科学与电子工程学院) Recho Inc.(Recho公司) National Institute of Informatics(国家信息研究所) Interdisciplinary Research Centre on Security, Reliability and Trust (SnT), University of Luxembourg(卢森堡大学安全、可靠性与信任跨学科研究中心) Donghua University(东华大学) Department of Computer Science, The University of Tokyo(东京大学计算机科学系) Department of Electrical and Computer Engineering, University of Alberta(阿尔伯塔大学电子与计算机工程系)

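AI summary: This paper proposes the S2ST-Omni 2 framework, which improves multilingual speech-to-speech translation through structured typological priors; experiments show strong results across several evaluation metrics and better data efficiency when training data is limited.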

Comments Submitted to IEEE/ACM TASLP. This work extends S2ST-Omni, accepted to Findings of ACL 2026

详情
AI中文摘要

基于语音大语言模型(SpeechLLMs)的组合式语音到语音翻译(S2ST)系统近期表现出色。然而,现有系统往往忽视源语言信息或通过语言作为标签的方式编码,将每种源语言表示为独立的平铺嵌入。这种设计忽略了语言间共享的系统性语言结构,可能限制在监督数据稀缺时的多语言适应能力。为解决此问题,我们提出了S2ST-Omni 2,一种多对一的组合式S2ST框架,系统性地将多语言语言条件化从平铺语言标签转换为结构化的类型学先验。具体而言,S2ST-Omni 2重新审视语言条件化在三个层面:类型学指导的分层语言编码用于结构化的源语言表示,动态门控的语言感知Dual-CTC用于内容自适应的语音调制,以及类型学意识的LLM提示用于解码器侧的语言指导。实验表明,在CVSS-C上,S2ST-Omni 2在BLEU、COMET、ASR-BLEU和BLASER 2.0等指标上均优于代表性S2ST方法。消融研究显示,所提出的表示层、语音层和解码层策略提供了互补的益处。此外,受控数据预算分析和仅使用约3小时监督训练数据的日语到英语评估表明,显式类型学先验为数据高效的多语言S2ST提供了有用的归纳偏见。

英文摘要

Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.

2605.16024 2026-05-18 cs.AI 版本更新

ScreenSearch: Uncertainty-Aware OS Exploration

ScreenSearch: 带有不确定性的操作系统探索

Michael Solodko, Justin Wagle

发表机构 * Microsoft(微软)

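AI summary: ScreenSearch combines structural screen retrieval with an ambiguity-aware PUCT graph-bandit to balance exploration and commitment in large-scale desktop exploration, producing exploration corpora with cross-application diversity.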

Comments 14 pages, 9 figures, 4 tables

详情
AI中文摘要

桌面GUI代理在部分可观测环境下操作:视觉相似的屏幕可能对应不同的底层工作流状态,因此局部合理的动作可能导致截然不同的结果。我们将此问题视为计算机/操作系统状态探索问题,有效行为需要在扩展可达前沿和减少不确定性之前进行承诺。我们提出了ScreenSearch系统,结合结构化屏幕检索与去重,以及基于不确定性的PUCT图-强化学习算法,用于大规模桌面探索。检索层将UIA树转换为位置感知的结构特征,通过稀疏标记搜索和元数据过滤索引相关屏幕,并在虚拟机工作者之间维护共享的去重状态图。在此图上,我们定义了一个基于匹配动作结果分散度的可扩展不确定性信号。如果相似的屏幕在相同的动作签名下产生不同的下一个状态,则该状态应进一步探索而非视为解决。我们使用此信号与前沿奖励驱动大规模探索和重放起始策略评估。在11个桌面应用上,ScreenSearch收集了超过100万张截图和3万多个去重状态,生成具有显著跨应用和内应用多样性的大规模探索语料库。在固定重放起始切片上,我们观察到新颖性与不确定性的权衡关系:某些策略减少不确定性很快但发现很少前沿。仅减少不确定性本身并非足够的探索目标。附录消融实验表明,更强的提案先验可以显著提高语料库构建期间的独特状态发现。这些结果表明,状态身份、提案质量以及基于不确定性的搜索在决定何时探索和何时承诺时都至关重要。

英文摘要

Desktop GUI agents operate under partial observability: visually similar screens can correspond to different underlying workflow states, so locally plausible actions can lead to sharply different outcomes. We frame this as a problem of computer/OS state exploration, where effective behavior requires both expanding the reachable frontier and reducing ambiguity before committing. We present ScreenSearch, a system that combines structural screen retrieval and deduplication with an ambiguity-aware PUCT graph-bandit for large-scale desktop exploration. The retrieval layer converts UIA trees into location-aware structural features, indexes related screens through sparse token search and metadata filters, and maintains a shared deduplicated state graph across VM workers. On top of this graph, we define a scalable ambiguity signal based on matched-action outcome dispersion. If similar screens produce different next states under the same action signature, the state should be probed further rather than treated as resolved. We use this signal together with frontier rewards to drive large-scale exploration and replay-start policy evaluation over the shared graph. Across 11 desktop applications, ScreenSearch collects over 1M screenshots and over 30K deduplicated states, yielding large exploration corpora with substantial cross-application and within-application diversity. On a fixed replay-start slice, we observe a clear novelty-ambiguity trade-off: some policies reduce ambiguity quickly while discovering little frontier. Ambiguity reduction alone is therefore not a sufficient exploration objective. Appendix ablations show that stronger proposal priors can materially improve unique-state discovery during corpus building. These results suggest that state identity, proposal quality, and ambiguity-aware search all matter when deciding when to probe and when to commit.
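One way to make "matched-action outcome dispersion" concrete is the Shannon entropy of the next-state distribution for each (screen, action signature) pair. The abstract does not say which dispersion measure the authors use, so entropy is an assumption here, as are the function name and the string keys.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Transition = Tuple[str, str, str]  # (state_key, action_signature, next_state_key)


def outcome_dispersion(transitions: Iterable[Transition]) -> Dict[Tuple[str, str], float]:
    """Entropy (in bits) of observed next states per (state, action) pair.

    0.0 means the action always led to the same next state; higher values
    mean screens that look alike behaved differently, so the state is
    worth probing further before committing.
    """
    outcomes: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for state, action, next_state in transitions:
        outcomes[(state, action)][next_state] += 1

    scores: Dict[Tuple[str, str], float] = {}
    for key, counts in outcomes.items():
        total = sum(counts.values())
        scores[key] = sum(
            -(c / total) * math.log2(c / total) for c in counts.values()
        )
    return scores


observed = [
    ("save_dialog", "click:OK", "saved"),
    ("save_dialog", "click:OK", "saved"),
    ("editor", "click:Close", "closed"),
    ("editor", "click:Close", "unsaved_prompt"),
]
scores = outcome_dispersion(observed)
# ("editor", "click:Close") scores 1 bit: two equally likely outcomes.
# ("save_dialog", "click:OK") scores 0 bits: the outcome is always the same.
```

A signal of this kind needs only transition counts on the shared graph, which is consistent with the abstract's claim that it scales across VM workers.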

2605.16011 2026-05-18 cs.CL cs.AI 版本更新

Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study

视觉语言模型在数学教育中能否具备适应性?一种基于学习者模型的评分研究

Jie Gao, Yongan Yu, Junzhu Su, Yiran Lin, Adam K. Dube, Jackie Chi Kit Cheung

发表机构 * McGill University(麦吉尔大学) Mila – Quebec AI Institute(魁北克AI研究院) Canada CIFAR AI Chair(加拿大CIFAR人工智能讲席)

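AI summary: This paper examines the adaptivity of vision language models in mathematics education, proposes a learner model-based rubric that assesses adaptivity along cognitive, motivational, and complexity aspects, and finds that current models struggle to produce consistent instructional responses when given limited learner information.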

详情
AI中文摘要

适应性学习指的是跟踪学习者学习进度并根据个体学习者表现调整教学过程的教育技术。它日益被认可为开发有效学习支持工具的关键。视觉语言模型(VLMs)已在数学教育中得到应用,学生将其作为个性化教学的辅助工具。然而,不清楚VLMs是否具备根据不同学习者档案提供数学指导的能力。当前VLMs缺乏系统评估框架来评估数学辅导任务中对不同学习者档案的适应性。为解决这一差距,我们借鉴适应性学习框架中的学习者模型(Shute和Towle,2018),提出基于学习者模型的评分表。我们的评分表将适应性评估形式化为三个方面:认知方面、动机方面和复杂度。我们还评估了VLM响应的两个额外维度:正确性(答案和解决方案的正确性)和质量(响应本身的质量)。我们的实验结果表明,不同模型在适应性方面存在可测量的差异,并揭示了当前VLMs在有限学习者信息下难以一致产生基于学习者模型的教学响应。

英文摘要

Adaptive learning refers to educational technologies that track learners' progress and adapt the instructional process to each learner's performance. It is increasingly recognized as critical for developing effective learning support tools. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs can adapt to different learner profiles when providing mathematical instruction. Current VLMs lack a systematic evaluation framework for this adaptivity in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.

2605.15995 2026-05-18 cs.LG cs.AI 版本更新

Constrained latent state modeling: A unifying perspective on representation learning under competing constraints

受限潜在状态建模:在竞争约束下表示学习的统一视角

Gwenolé Quellec

发表机构 * LaTIM UMR 1101

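AI summary: This paper proposes constrained latent state modeling (CLSM), which unifies core principles and methods for representation learning under competing constraints and exposes the intrinsic coupling and fundamental trade-offs among latent-state properties.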

Comments Resources and model cards: https://github.com/gwenole-quellec/clsm

详情
AI中文摘要

从复杂数据中学习潜在表示是现代机器学习的核心,涵盖时间、多模态和部分观测系统。在这些设置中,表示应被视为捕捉系统动态的潜在状态,而非仅仅是观测的压缩总结。然而,当前方法仍碎片化,依赖于对这些状态应代表什么的不同且往往隐含的假设。我们主张这种碎片化反映了更根本的限制:潜在表示通常从欠约束的目标学习,未能指定有意义的潜在状态应满足的属性。因此,多个表示可以满足相同的目标,导致结构和解释的模糊性。尽管许多底层原则已被单独探索,但它们的相互作用尚未被显式形式化。在本文中,我们提出受限潜在状态建模(CLSM)作为统一的视角。我们识别了一组核心属性——预测充分性、最小性、时间一致性、观测兼容性、对干扰因素的不变性以及结构约束——并展示它们通过根本的权衡相互耦合。通过这一视角重新审视主要建模家族,我们显示现有方法可以被解释为强制不同的约束子集,从而占据共同设计空间的不同区域。这一视角将持续挑战如可识别性不足重新解释为欠约束形式的后果,而非孤立的技术限制。更广泛地说,CLSM提供了一个原则性的框架,以使设计选择显式化,分析权衡,并指导开发更具可解释性、稳健性和任务对齐的潜在状态模型。

英文摘要

Learning latent representations from complex data is central to modern machine learning, spanning temporal, multimodal, and partially observed systems. In such settings, representations are better understood as latent states capturing underlying system dynamics, rather than as mere compressed summaries of observations. Yet current approaches remain fragmented, relying on distinct, and often implicit, assumptions about what these states should represent. We argue that this fragmentation reflects a more fundamental limitation: latent representations are typically learned from underconstrained objectives that fail to specify the properties that meaningful latent states should satisfy. As a result, multiple representations can satisfy the same objective, leading to ambiguity in their structure and interpretation. While many of the underlying principles have been explored in isolation, their interactions have not been explicitly formalized. In this work, we propose constrained latent state modeling (CLSM) as a unifying perspective. We identify a set of core properties (predictive sufficiency, minimality, temporal coherence, observation compatibility, invariance to nuisance factors, and structural constraints) and show that they are intrinsically coupled through fundamental trade-offs. Revisiting major modeling families through this lens, we show that existing approaches can be interpreted as enforcing different subsets of constraints, thereby occupying distinct regions of a common design space. This perspective reframes persistent challenges such as lack of identifiability as consequences of underconstrained formulations, rather than isolated technical limitations. More broadly, CLSM provides a principled framework to make design choices explicit, to analyze trade-offs, and to guide the development of more interpretable, robust, and task-aligned latent state models.

2605.15984 2026-05-18 cs.SD cs.AI cs.CR 版本更新

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

超越内容:一个综合的语音毒性数据集和检测框架,结合副语言线索

Zhongjie Ba, Liang Yi, Peng Cheng, Qingcao Li, Qinglong Wang, Li Lu

发表机构 * The State Key Laboratory of Blockchain and Data Security(区块链与数据安全国家重点实验室) Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security(杭州高新区(滨江)区块链与数据安全研究院) School of Cyber Science and Engineering(网络安全科学与工程学院)

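AI summary: This paper presents the ToxiAlert-Bench dataset and a dual-head neural network that incorporates paralinguistic cues to improve toxic speech detection; experiments show the method outperforms existing baselines on multiple metrics.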

详情
AI中文摘要

语音毒性检测已成为维护安全在线通信环境的关键挑战。然而,现有方法常忽视副语言线索(如情绪、语调和语速)的作用,而当前数据集多为文本基,限制了对副语言线索的建模。为此,我们提出ToxiAlert-Bench,包含30000多个音频片段,标注七种主要毒性类别和二十种细粒度标签,并标注毒性来源(文本或副语言)。我们还提出双头神经网络,包含两个任务特定分类头:一个用于识别敏感源(文本或副语言),另一个用于分类具体毒性类型。训练过程包括独立头训练和联合微调以减少任务干扰。为缓解数据类别不平衡,我们采用类平衡采样和加权损失函数。实验结果表明,利用副语言特征显著提升了检测性能,方法在多个评估指标上优于现有基线,宏F1分数提升21.1%,准确率提升13.0%。

英文摘要

Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key to detecting speech toxicity. Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. Uniquely, our dataset annotates toxicity sources, distinguishing between textual content and paralinguistic origins, for comprehensive toxic speech analysis. Furthermore, we propose a dual-head neural network with a multi-stage training strategy tailored for toxic speech detection. This architecture features two task-specific classification heads: one for identifying the source of sensitivity (textual or paralinguistic), and the other for categorizing the specific toxic type. The training process involves independent head training followed by joint fine-tuning to reduce task interference. To mitigate data class imbalance, we incorporate class-balanced sampling and weighted loss functions. Our experimental results show that leveraging paralinguistic features significantly improves detection performance. Our method consistently outperforms existing baselines across multiple evaluation metrics, with a 21.1% relative improvement in Macro-F1 score and a 13.0% relative gain in accuracy over the strongest baseline, highlighting its enhanced effectiveness and practical applicability.
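The abstract mentions class-balanced sampling and weighted losses without giving formulas. A common choice, assumed here rather than taken from the paper, is inverse-frequency weighting: loss weights of n / (k * count) per class, and per-sample sampling weights of 1 / count. The function names and labels below are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def class_loss_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Inverse-frequency loss weights: n / (k * count_c) for each class c.

    Rare classes get weights above 1, frequent classes below 1, and a
    perfectly balanced dataset gives every class a weight of 1.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def balanced_sampling_weights(labels: Sequence[Hashable]) -> List[float]:
    """Per-sample weights so that each class has equal total sampling mass."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = ["neutral"] * 8 + ["threat"] * 2
loss_w = class_loss_weights(labels)             # {"neutral": 0.625, "threat": 2.5}
sample_w = balanced_sampling_weights(labels)    # 0.125 for neutral, 0.5 for threat
```

The sampling weights can be passed to any weighted sampler, for example the `weights` argument of `random.choices`, to draw class-balanced batches.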

2605.15983 2026-05-18 cs.AI 版本更新

Petri Net Induced Heuristic Search for Resource Constrained Scheduling

基于Petri网的启发式搜索用于资源受限调度

Ido Lublin, Dor Atzmon, Izack Cohen

发表机构 * Bar-Ilan University(巴伊兰大学)

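AI summary: This paper models the resource-constrained project scheduling problem as optimal search over the reachability graph of a Timed Transition Petri Net, using relative-delay tokens so that scheduling decisions correspond to state-space transitions. It solves the problem with A* guided by a heuristic combining critical-path and resource lower bounds, proves the heuristic consistent, and outperforms MIP baselines on the PSPLIB benchmarks.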

Comments Accepted at the International Symposium on Combinatorial Search (SoCS 2026)

详情
AI中文摘要

我们把资源受限项目调度问题(RCPSP)作为带有资源的及时转换Petri网的可达图最优搜索来建模,使用相对延迟令牌使得调度决策对应于诱导状态空间中的转换触发。我们用结合关键路径和基于资源的下界的启发式函数引导的A*算法来解决由此产生的问题,并证明其在我们的令牌基于时间语义下是一致的。在PSPLIB基准测试中,该方法在成功率和求解时间上均优于强的精确混合整数线性规划(MIP)基线(SCIP,CBC)。每个实例分析显示启发式搜索和MIP在独立轴上退化,A*的资源紧密性和MIP的公式规模决定了哪一种求解器受益于规模。

英文摘要

We formulate the Resource-Constrained Project Scheduling Problem (RCPSP) as optimal search over the reachability graph of a Timed Transition Petri Net with Resources, using relative-delay tokens so that scheduling decisions correspond to transition firings in the induced state space. We solve the resulting problem with $A^*$ guided by a heuristic that combines Critical Path and resource-based lower bounds, and prove that it is consistent under our token-based time semantics. Experiments on the PSPLIB benchmarks show that the approach outperforms strong exact Mixed-Integer Linear Programming (MIP) baselines (SCIP, CBC) in both success rate and solve time. Per-instance analysis shows that heuristic search and MIP degrade along independent axes, resource tightness for $A^*$ and formulation size for MIP, with resource strength mediating which solver benefits from scale.
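The two lower bounds named in the abstract have classical textbook forms, sketched below for the root state with a single renewable resource. This is not the authors' token-based heuristic: the function names, the single-resource simplification, and the toy instance are assumptions. Taking the maximum of two admissible bounds keeps the result admissible; the consistency proof belongs to the paper and is not reproduced here.

```python
import math
from functools import lru_cache
from typing import Dict, List


def critical_path_bound(durations: Dict[str, int], successors: Dict[str, List[str]]) -> int:
    """Longest duration-weighted path through the precedence DAG."""
    @lru_cache(maxsize=None)
    def tail(task: str) -> int:
        # Time from the start of `task` to the end of its longest successor chain.
        return durations[task] + max(
            (tail(s) for s in successors.get(task, [])), default=0
        )
    return max(tail(task) for task in durations)


def resource_bound(durations: Dict[str, int], demands: Dict[str, int], capacity: int) -> int:
    """Total work (duration x demand) divided by capacity, rounded up."""
    work = sum(durations[t] * demands[t] for t in durations)
    return math.ceil(work / capacity)


def makespan_lower_bound(durations, successors, demands, capacity) -> int:
    """Combined heuristic: the larger of the two lower bounds."""
    return max(
        critical_path_bound(durations, successors),
        resource_bound(durations, demands, capacity),
    )


# Toy instance (hypothetical): A precedes C, B precedes D, one resource of capacity 3.
durations = {"A": 3, "B": 2, "C": 4, "D": 1}
successors = {"A": ["C"], "B": ["D"]}
demands = {"A": 2, "B": 2, "C": 3, "D": 1}
bound = makespan_lower_bound(durations, successors, demands, capacity=3)
```

On this instance the critical path A-C gives 7, while the work content of 23 resource-time units over capacity 3 gives 8, so the resource bound is the tighter one.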

2605.15978 2026-05-18 cs.CL cs.AI cs.LO 版本更新

Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports

执法ontology:用于执法报告中语义理解和推理的概念知识学习

Anita Srbinovska, Jansen Orfan, Adrian Martin, Ernest Fokoué

发表机构 * Law Enforcement Agencies(执法机构)

AI总结 本文提出利用符号方法将执法报告中的叙述转化为证据关联事实,通过消除个人标识、语义解析、谓词映射到本体和推理,提高对事件细节的恢复能力,并构建包含时间线索和领域公理的时间图。

Comments 13 pages, 8 figures, 9 tables

详情
AI中文摘要

执法报告包含结构化字段和书面叙述。然而,许多需要审查、警察培训和调查的事件事实是以自然语言形式存在的,需要手动阅读。我们提出了一种使用符号方法将叙述转换为证据关联事实的框架。我们的目标是通过仅从无结构文本中恢复事件细节,并构建包含时间线索和领域公理的时间图。我们通过消除个人标识、语义解析、谓词映射到本体和推理来实现这一点。我们在450份财产犯罪报告和一段简短的人类审查中评估了符号方法。从系统中提取的事件中,54.1%具有至少0.80的置信度分数,93.7%通过PropBank-VerbNet-WordNet语义路径映射。在事件启动、被盗物品和时间线索上达到了100%的一致性,在强制进入解释上则一致率较低。

英文摘要

Law enforcement reports contain structured fields and written narratives. However, many incident facts that are needed for review, police training, and investigations are in natural language and require manual reading. We propose a framework using symbolic methods for converting narratives into evidence-linked facts. Our objective is to measure the value of narratives by recovering incident details from the unstructured text alone and building temporal graphs with time cues and domain axioms. We achieve this through redaction of personal identifiers, semantic parsing, mapping of predicates to an ontology, and reasoning. We evaluate the symbolic approach on 450 property crime reports and a short human review. Of the events extracted by the system, 54.1% had a confidence score of at least 0.80 and 93.7% were mapped through the PropBank-VerbNet-WordNet semantic path. Agreement was 100% on incident initiation, stolen items, and temporal cues, and lower on forced-entry interpretation.

2605.15976 2026-05-18 cs.CL cs.AI 版本更新

Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective

无需参考的强化学习微调用于机器翻译:序列到序列视角

Ernesto Garcia-Estrada, Carlos Escolano, José A. R. Fonallosa

发表机构 * Universitat Politècnica de Catalunya Barcelona(巴塞罗那理工大学)

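AI summary: This paper applies reference-free reinforcement learning fine-tuning to Seq2Seq models, improving translation quality across 13 languages without parallel data, with notably strong results on morphologically complex languages.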

详情
AI中文摘要

生产级机器翻译主要依赖于编码器-解码器序列到序列模型,但强化学习方法在微调中主要针对解码器-only的大语言模型,且对编码器-解码器架构研究有限。我们应用组相对策略优化在NLLB-200(600M和1.3B)上,使用混合无参考奖励(LaBSE和COMET-Kiwi),在微调时无需平行数据,评估13种语言。GRPO在所有13种语言上均取得一致提升,传统中文的chrF++提升达+5.03,在无目标语言数据的情况下,在形态复杂语言中与3轮监督微调竞争。我们发现一个一致的实证模式:在基线表现最弱且奖励判别性最高的地方,收益最大,使该方法在平行数据最稀缺的地方最有效,并在英语和西班牙源语言上复制了这一模式。

英文摘要

Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at $\geq$7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to $+$5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
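The core of Group Relative Policy Optimization is that each sampled translation is scored relative to the other samples for the same source sentence. The sketch below shows that normalization step only. The linear mix in `hybrid_reward` and its `weight` are assumptions, since the abstract does not say how the LaBSE and COMET-Kiwi scores are combined, and the use of the population standard deviation is likewise a choice made for this sketch.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def hybrid_reward(similarity: float, quality: float, weight: float = 0.5) -> float:
    """Hypothetical mix of a cross-lingual similarity score and a QE score.

    `similarity` stands in for a LaBSE-style source/hypothesis similarity and
    `quality` for a COMET-Kiwi-style reference-free quality estimate.
    """
    return weight * similarity + (1.0 - weight) * quality


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same source.

    Samples better than the group mean get positive advantages, worse ones
    negative; if all rewards are equal, every advantage is zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate translations of one source sentence, scored without references.
group_rewards = [hybrid_reward(s, q) for s, q in [(0.2, 0.2), (0.4, 0.4), (0.6, 0.6), (0.8, 0.8)]]
advantages = group_relative_advantages(group_rewards)
```

Because the advantage depends on spread within the group, a reward that cannot tell candidates apart yields near-zero advantages, which fits the abstract's observation that gains are largest where reward discriminability is highest.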

2605.15967 2026-05-18 cs.AI cs.CV cs.LO 版本更新

Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning

确定性事件图基底作为反事实推理的世界模型

Fabio Rovai

发表机构 * Tesseract Academy(Tesseract学院)

AI总结 本文提出事件图基底作为世界模型,通过在结构化干预词汇下分叉日志来回答反事实查询,证明了解释性与反事实性查询的对偶性,并在完整的CLEVRER验证规模上评估了运行于领域无关基底运行时之上的CLEVRER-DSL解释器。

Comments 10 pages, 3 figures, 2 tables

详情
AI中文摘要

我们研究事件图基底:一类世界模型,将智能体状态表示为带类型RDF三元组的仅追加日志,并通过在结构化干预词汇下分叉日志来回答反事实查询。基底可在三元组层面检查,支持精确的反事实查询,并且无需学习组件即可跨领域迁移。我们形式化了这一类模型,证明了解释性查询与反事实查询之间的对偶性,即两者都可归约为同一种因果祖先遍历,并在领域无关的基底运行时之上,以完整的CLEVRER验证规模(n=75,618)评估了一个1,400行的CLEVRER-DSL解释器。该基底在全部四个问题类别上均超过NS-DR符号Oracle(分别高出9.89、20.26、17.65和0.80个百分点),并在描述性和解释性类别上超过参数化的ALOE基线,但在预测性和反事实性类别上落后。我们还引入了twin-EventLog,一个包含500个规范的Park-canonical Smallville反事实基准,在该基准上基底的联合准确率比拥有完整上下文的Llama-3.1-8B高18.80个百分点。

英文摘要

We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS-DR symbolic oracle on all four per-question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points joint accuracy.
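The idea of an append-only triple log that answers both explanatory and counterfactual queries through causal links can be sketched as follows. This is a minimal toy under stated assumptions, not the paper's runtime: the `Event` and `EventLog` classes, the `causes` field, and the rule that an event is dropped when any of its causes is dropped are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    subject: str
    predicate: str
    obj: str
    causes: tuple = ()  # indices of earlier events this one depends on


@dataclass
class EventLog:
    events: list = field(default_factory=list)

    def append(self, event):
        """Append-only write; returns the index of the new event."""
        self.events.append(event)
        return len(self.events) - 1

    def ancestors(self, index):
        """Explanatory query: all causal ancestors of one event."""
        seen, stack = set(), list(self.events[index].causes)
        while stack:
            i = stack.pop()
            if i not in seen:
                seen.add(i)
                stack.extend(self.events[i].causes)
        return seen

    def fork_without(self, removed):
        """Counterfactual fork: drop one event and everything it caused."""
        dropped = {removed}
        # Causes always point to earlier indices, so one forward pass suffices.
        for i, ev in enumerate(self.events):
            if any(c in dropped for c in ev.causes):
                dropped.add(i)
        return [ev for i, ev in enumerate(self.events) if i not in dropped]


log = EventLog()
a = log.append(Event("sphere", "moves", "right"))
b = log.append(Event("sphere", "collides_with", "cube", causes=(a,)))
c = log.append(Event("cube", "moves", "right", causes=(b,)))
d = log.append(Event("cylinder", "enters", "scene"))

why_cube_moved = log.ancestors(c)      # explanatory query
without_sphere = log.fork_without(a)   # counterfactual query
```

Both queries walk the same `causes` links, one backward from an effect and one forward from a removed cause, which is the intuition behind the duality the abstract describes.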

2605.15963 2026-05-18 cs.AI 版本更新

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

PAGER:弥合点精确几何GUI控制中的语义-执行鸿沟

Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) China University of Petroleum-Beijing(中国石油大学(北京))

AI总结 本文提出PAGER,通过依赖结构化规划与像素级执行,解决点精确几何GUI任务,将步骤成功率从GUI专用代理的不足9%提升至62%以上,弥合语义-执行鸿沟。

Comments 27 pages, 11 figures, 3 tables

详情
AI中文摘要

大规模视觉-语言模型显著提升了GUI代理,使其能在网页、移动和桌面界面间执行交互。然而这些进展大多依赖于宽容区域容忍范式,即同一组件内的许多邻近像素仍有效。精确几何构造打破了这一假设:动作必须落在连续画布空间的点上,而非容忍区域。由于几何原语具有本体依赖性,局部坐标误差可能引发级联拓扑故障,扭曲下游对象并使最终构造无效。我们将其称为对精度敏感的GUI任务,需要点级精度、几何感知验证以及对依赖驱动误差传播的鲁棒性。为评估此领域,我们引入PAGE Bench,包含4,906个问题和超过224K个过程监督的像素级GUI动作。我们进一步提出PAGER,一种拓扑感知代理,将构造分解为依赖结构化的规划和像素级执行。像素基础的监督训练建立可执行的动作语法,而精度对齐的强化学习通过状态条件化的几何反馈缓解滚动诱导的暴露偏差。实验揭示了显著的语义-执行鸿沟:通用多模态模型可以超过88%的动作类型准确率,但任务成功率仍低于6%。PAGER填补了这一鸿沟,其任务成功率比最强的通用基线高4.1倍,并将GUI专用代理的步骤成功率从低于9%提升到超过62%,为点精确GUI控制树立了新的基准。

英文摘要

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.

2605.15959 2026-05-18 cs.LG cs.AI 版本更新

When and Why Adversarial Training Improves PINNs: A Neural Tangent Kernel Perspective

何时以及为何对抗训练能提升PINNs:神经正切核视角

Yuan-dong Cao, Chi Chiu SO, Jun-Min Wang, He Wang

发表机构 * School of Mathematics and Statistics, Beijing Institute of Technology, China(北京理工大学数学与统计学院,中国) School of Professional Education and Executive Development The Hong Kong Polytechnic University, China(香港理工大学专业教育学院及管理发展学院,中国) Department of Computer Science & UCL AI Centre, University College London, UK(伦敦大学学院计算机科学系及UCL人工智能中心,英国)

AI总结 本文从神经正切核角度分析对抗训练提升PINNs的机制,提出理论框架并设计高效算法,实验证明能显著改善PINNs训练病理,提升模型精度。

详情
AI中文摘要

物理信息神经网络(PINNs)是微分方程的强大替代品,但因频谱偏置、刚性和高频率或多尺度解的准确性差而难以训练。基于生成对抗网络(GANs)的对抗训练近期在提升训练效果上取得了显著的实证结果,但其内在机制仍不明确。为此,本文提出了一种新的分析框架,基于GANs中判别器如何影响PINNs训练动态的关键观察。该框架首先为为何以及何时对抗训练在PINNs中有效提供了必要的理论依据,然后对GANs变体在该训练中的统一分析,并最终提出一种新的、实用的、高效的PINNs训练算法。实验证明,我们的方法能显著减少PINNs训练的病理现象,从而提供更优的模型,通常比其他方法准确度高几个数量级。

英文摘要

Physics-informed neural networks (PINNs) are powerful surrogates for differential equations but are notoriously difficult to train due to spectral bias, stiffness, and poor accuracy on high-frequency or multiscale solutions. Adversarial training based on generative adversarial networks (GANs) has recently shown surprisingly strong empirical results in improving training, but the underlying mechanisms remain elusive. To this end, we propose a new analysis framework for adversarially trained PINNs, based on the key observation of how the discriminator in GANs can influence the training dynamics of PINNs. The framework first provides a much-needed theoretical grounding for why and when adversarial training is effective in PINNs, then presents a unified analysis of GAN variants in such training, and finally leads to a new, practical, efficient training algorithm for PINNs. Empirical results demonstrate that our method can significantly reduce the pathologies of PINN training, thereby providing better models, often several orders of magnitude more accurate than alternative methods.

2605.15942 2026-05-18 cs.CV cs.AI 版本更新

Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation

分解式视觉-语言对齐用于细粒度开放词汇分割

Chenhao Wang, Yingrui Ji, Yu Meng, Yao Zhu

发表机构 * Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院航空航天信息研究所) University of Chinese Academy of Sciences(中国科学院大学) Zhejiang University(浙江大学)

AI总结 本文提出分解式视觉-语言对齐框架,通过将文本提示分解为概念令牌和多个属性令牌,实现细粒度开放词汇分割中对未见属性-类别组合的泛化提升。

详情
AI中文摘要

开放词汇分割模型常难以泛化到未见的对象类别和属性组合,因为细粒度描述通常被编码为整体句子,将多个语义单元纠缠在一起。我们提出一种分解式视觉-语言对齐框架,将文本提示显式分解为概念令牌和多个属性令牌,使每个语义单元能够分别进行跨模态交互。在特征层面,我们引入了特征门控交叉注意力模块,生成属性特定的门控图,以乘法方式融合信息,有效强制组合语义。在评分层面,每个token的相似性在log空间中聚合,产生稳定且可解释的组合匹配。该方法可以无缝集成到现有的基于transformer的分割架构中,并在细粒度开放词汇分割基准中显著提升对未见属性-类别组合的泛化能力。

英文摘要

Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.
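The log-space aggregation of per-token similarities can be illustrated with a small sketch. It assumes each similarity has already been mapped into (0, 1], for example by a sigmoid, and that the aggregate is a plain sum of logs; the function name, the clamping constant, and the example values are hypothetical.

```python
import math


def compositional_score(similarities, eps=1e-6):
    """Sum log-similarities over the concept token and attribute tokens.

    Summing logs equals the log of the product, so one poorly matched
    attribute pulls the whole score down, while clamping at `eps`
    keeps the value finite.
    """
    return sum(math.log(max(s, eps)) for s in similarities)


# Hypothetical similarities for [concept, attribute 1, attribute 2].
match_all = compositional_score([0.9, 0.8, 0.85])
one_missing = compositional_score([0.9, 0.8, 0.05])
```

Working in log-space avoids multiplying many small numbers directly, and each term can be read off as the contribution of one semantic unit.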

2605.15916 2026-05-18 cs.LG cs.AI cs.CV 版本更新

LoCO: Low-rank Compositional Rotation Fine-tuning

LoCO:低秩组合旋转微调

An Nguyen, Jaesik Choi, Anh Tong

发表机构 * Korea University(高丽大学) KAIST(韩国科学技术院) INEEJI

AI总结 LoCO提出一种低秩组合旋转微调方法,通过低秩斜对称矩阵构建正交变换,实现高效参数微调,适用于多领域模型适应,展现优于传统正交和非正交方法的性能。

Comments IJCAI 2026

详情
AI中文摘要

参数高效微调(PEFT)已成为适应大规模基础模型的关键技术,在自然语言处理和计算机视觉领域广泛应用。尽管现有方法如低秩适应通过低秩权重更新实现参数效率,但其在保持预训练表示几何结构方面有限。我们引入低秩组合正交微调(LoCO),一种新颖的PEFT方法,通过低秩斜对称矩阵构建正交变换,并通过组合旋转链实现。我们提出了一种近似方案,使组合旋转的完全并行计算成为可能,使该方法适用于高维特征空间。我们的方法在保持低计算复杂度的同时,保持正交性并控制近似误差。我们在多样化的领域中验证了LoCO,包括扩散Transformer微调、视觉Transformer适应和语言模型适应。我们的方法在性能上优于或与现有正交和非正交方法相当。

英文摘要

Parameter-efficient fine-tuning (PEFT) has emerged as a critical technique for adapting large-scale foundation models across natural language processing and computer vision. While existing methods such as low-rank adaptation achieve parameter efficiency via low-rank weight updates, they are limited in their ability to preserve the geometric structure of pretrained representations. We introduce Low-rank Compositional Orthogonal fine-tuning (LoCO), a novel PEFT method that constructs orthogonal transformations through low-rank skew-symmetric matrices and compositional rotation chains. We propose an approximation scheme that enables fully parallel computation of compositional rotations, making the approach practical for high-dimensional feature spaces. Our method keeps computational complexity low while maintaining orthogonality with controlled approximation error. We validate LoCO across diverse domains, including diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation. Our method demonstrates superior or competitive performance compared to existing orthogonal and non-orthogonal methods.

2605.15915 2026-05-18 cs.HC cs.AI cs.CL 版本更新

SLIP & ETHICS: Graduated Intervention for AI Emotional Companions

SLIP与ETHICS:面向AI情感伴侣的渐进干预

Minseo Kim

发表机构 * HUA Labs(HUA实验室)

AI总结 本文提出SLIP与ETHICS框架,通过渐进干预方法解决AI情感伴侣的安全与亲和力矛盾,实验显示在高能量状态下干预不足,但提升模型能力可改善检测效果。

Comments Accepted to PervasiveHealth 2026. 11 pages, 2 figures, 4 tables. Proc. of the 20th EAI International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth 2026)

详情
AI中文摘要

AI情感伴侣面临安全与亲和力的矛盾:严格的安全措施可能损害支持性联盟,而宽松的系统则可能危害用户。本文提出SLIP(分阶段干预协议),一种四阶段渐进方法,根据结构化定性指标(情绪强度(a)和叙述动态性(m))推导干预措施(无、轻度、重度);同时提出ETHICS(人类-人工智能交互上下文信号的涌现分类法),一种“信号而非标签”的分类法。评估结合了小规模生产部署(N=68条记录,10名用户,10周)和合成角色测试集(N=91,5种行为风险画像),结果显示心流(flow)角色的误报率为0%,并在危机导向角色中显示出预期的升级模式。然而,初步结果表明,连续8天的高能量高涨状态未触发任何干预(0/8),暴露了“不病理化”原则与安全相冲突的边界。后续的三模型压力测试显示,提升模型能力可将检测率从0/8提升至6/8,同时在最大模型中保持0/10的心流误报率。这些发现应视为初步结果,它们将渐进干预定位为一种设计方向,用于驾驭而非消除情感计算中安全与亲和力之间的张力。

英文摘要

AI emotional companions face a safety-rapport paradox: restrictive safeguards can damage supportive alliance, while permissive systems risk user harm. We present SLIP (Staged Layers of Intervention Protocol), a four-stage graduated methodology deriving interventions (none, soft, hard) from structured qualitative indicators -- affect intensity (a) and narrative dynamism (m) -- alongside ETHICS (Emergent Taxonomy for Human-AI Interaction Context Signals), a "signals not labels" taxonomy. An evaluation combining a small-scale production deployment (N=68 entries, 10 users, 10 weeks) with a synthetic persona battery (N=91, 5 behavioral-risk profiles) achieved 0% false positives for the flow persona and showed expected escalation patterns in crisis-oriented personas. However, initial results showed that 8 consecutive days of high-energy elevation produced zero interventions (0/8), exposing a boundary where the "do not pathologize" principle conflicts with safety. A subsequent three-model stress test demonstrated that increased model capability improves detection from 0/8 to 6/8 while preserving 0/10 flow false positives in the largest model. Read as preliminary, these findings position graduated intervention as a design direction for navigating -- not resolving -- the safety-rapport tension in affective computing.

2605.15908 2026-05-18 cs.CV cs.AI 版本更新

RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations

RaPD:通过语义增强的隐式表示实现分辨率无关的像素扩散

Yanhao Ge, Shanyan Guan, Weihao Wang, Ying Tai, Mingyu You

发表机构 * College of Electronic and Information Engineering, Tongji University(同济大学电子与信息工程学院) vivo Mobile Communication Co., Ltd.(vivo移动通信有限公司) Nanjing University(南京大学)

AI总结 RaPD通过语义表示引导和坐标查询注意力渲染器,在连续神经图像场的潜在空间中实现分辨率无关的像素扩散,解决了重建与生成之间的差距,提升了生成质量和分辨率扩展能力。

详情
AI中文摘要

自然图像是连续的,但大多数生成模型在离散网格上合成图像,限制了分辨率灵活生成。连续神经场使分辨率无关渲染成为可能,但先前方法仅在解码阶段引入连续性作为插值模块,使生成的潜在空间离散化且偏向重建。我们提出RaPD(分辨率无关像素扩散),在连续神经图像场(NIF)潜在空间中进行扩散。RaPD通过语义表示引导实现生成意识的潜在学习,并通过坐标查询注意力渲染器实现坐标条件化的、尺度感知的渲染。通过仅改变查询坐标,单个去噪潜在态可以在任意分辨率下渲染,保持扩散成本不变。实验表明生成质量和分辨率扩展能力均优于现有方法。

英文摘要

Natural images are continuous, yet most generative models synthesize them on discrete grids, limiting resolution-flexible generation. Continuous neural fields enable resolution-free rendering, but prior methods introduce continuity only at the decoding stage as an interpolation module, leaving the generative latent space discretized and reconstruction-oriented. We propose RaPD (Resolution-agnostic Pixel Diffusion), which performs diffusion in a continuous Neural Image Field (NIF) latent space. RaPD bridges this reconstruction-generation gap with Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering. A single denoised latent can be rendered at arbitrary resolutions by changing only the query coordinates, keeping diffusion cost fixed. Experiments demonstrate superior generation quality and resolution scalability.

2605.15905 2026-05-18 cs.IR cs.AI 版本更新

Generative Long-term User Interest Modeling for Click-Through Rate Prediction

生成长期用户兴趣建模用于点击通过率预测

Jiangli Shao, Kaifu Zheng, Hao Fang, Huimu Ye, Zhiwei Liu, Bo Zhang, Shu Han, Xingxing Wang

发表机构 * MeiTuan Beijing China(美团北京中国)

AI总结 本文提出GenLI模型,通过生成兴趣模块、行为检索模块和兴趣融合模块,提升CTR预测的准确性和效率,解决传统方法中长期兴趣建模不完整和效率低的问题。

详情
AI中文摘要

通过大规模历史用户行为建模长期用户兴趣可提升广告和推荐系统中点击通过率(CTR)预测性能。通常采用两阶段框架,其中通用搜索单元(GSU)首先检索目标物品的相关行为,精确搜索单元(ESU)通过定制注意力生成兴趣特征。然而,当前以目标为中心的GSU会忽略其他潜在用户兴趣,导致兴趣特征不完整和偏差。此外,GSU中的匹配基于检索过程依赖于目标物品与每个历史行为之间的成对相似度分数,这不仅使在线服务在用户行为增长时变得耗时,还忽略了用户行为间的交互信息。为解决这些问题,我们提出了一种名为GenLI的生成长期用户兴趣模型用于CTR预测。GenLI包括兴趣生成模块(IGM)、行为检索模块(BRM)和兴趣融合模块(IFM)。IGM生成多个兴趣分布以表示实时用户兴趣的不同方面,该模块是目标无关的,并且结合行为间的交互信息,确保兴趣特征的完整和多样化。BRM通过简单的查找操作选择相关行为,将加权每个行为的时间复杂度降低到O(1)。最后,IFM使用精细的门控机制生成兴趣特征。基于生成过程,GenLI提高了用户兴趣的多样性,避免了基于匹配的行为检索,实现了CTR预测在准确性和效率之间的更好平衡。

英文摘要

Modeling long-term user interests with massive historical user behaviors enhances click-through rate (CTR) prediction performance in advertising and recommendation systems. A two-stage framework is widely adopted, where a general search unit (GSU) first retrieves the top-$k$ behaviors most relevant to the target item, and an exact search unit (ESU) generates interest features via tailored attention. However, the current target-centered GSU ignores other latent user interests, leading to incomplete and biased interest features. Additionally, the matching-based retrieval process in GSUs depends on the pairwise similarity score between the target item and each historical behavior, which not only becomes time-consuming for online services as user behaviors continue to grow, but also overlooks the interaction information among user behaviors. To address these problems, we propose a Generative Long-term user Interest model named GenLI for CTR prediction. GenLI consists of an interest generation module (IGM), a behavior retrieval module (BRM), and an interest fusion module (IFM). The IGM generates multiple interest distributions to indicate different aspects of real-time user interests; it is target-independent and incorporates interaction information among behaviors, ensuring complete and diverse interest features. The BRM selects related behaviors via a simple lookup operation, reducing the time complexity for weighting each behavior to $O(1)$. Finally, the IFM uses fine-grained gating mechanisms to generate interest features. Based on the generation process, GenLI improves the diversity of user interests and avoids complex matching-based behavioral retrieval, achieving a better balance between accuracy and efficiency for CTR prediction.

2605.15894 2026-05-18 cs.CV cs.AI 版本更新

Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning

基于CBAM增强的EfficientNet和证据深度学习的不确定性感知卫星图像野火烟雾密度分类

Ranjith Chodavarapu

发表机构 * Kent State University(肯特州立大学)

AI总结 本文提出一种概率框架,通过CBAM增强的EfficientNet和证据深度学习,对卫星图像中的烟雾密度进行分类,并提供分解的epistemic和aleatoric不确定性。模型在16298个真实卫星图像块上达到93.8%的加权测试准确率。

详情
AI中文摘要

快速且准确的野火烟雾严重程度评估对于应急响应、空气质量建模和人类健康风险管理至关重要。现有的深度学习方法将烟雾检测视为二元任务,产生点估计而没有预测置信度的度量。我们提出了一种概率框架,将卫星图像块分类为轻度、中度和重度严重程度类别,并在单次前向传递中提供分解的epistemic和aleatoric不确定性。我们的架构使用预训练的EfficientNet-B3作为主干,并结合CBAM模块和证据深度学习头,该头预测Dirichlet浓度参数,直接估计vacuity(epistemic)和dissonance(aleatoric)而无需蒙特卡洛采样。在16298个来自野火检测数据集的真实卫星图像块上进行评估,我们的模型在加权测试准确率为93.8%(无加权为91.1%)时,ECE=0.0274。选择性预测保留最确定的50%的图像块可达到96.7%的准确率。随着图像质量下降,不确定性单调增加,vacuity是实际扫描质量的度量。中度类别代表过渡烟雾条件,表现出最高的epistemic不确定性(平均vacuity=0.187),确认了模型正确识别了模糊的烟雾边界区域。CBAM空间注意力图局部化到结构上显著的场景区域,t-SNE展示了轻度和重度烟雾的清晰聚类分离。

英文摘要

Rapid and accurate wildfire smoke severity assessment from satellite images is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task, producing point estimates without any measure of prediction confidence. We propose a probabilistic framework to categorize a satellite patch into Light, Moderate, and Heavy severity classes and to provide decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses the backbone of a pre-trained EfficientNet-B3 and a CBAM module with an evidential deep learning head that predicts Dirichlet concentration parameters, directly estimating vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches derived from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction retaining the most certain 50% of patches achieves 96.7% accuracy. As image quality degrades, uncertainty increases monotonically, and vacuity is a practical scan quality measure. The Moderate class represents transitional smoke conditions that exhibit the highest epistemic uncertainty (mean vacuity = 0.187), confirming the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE demonstrates the clear cluster separation of Light and Heavy smoke.
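To make the single-pass uncertainty estimate concrete, the sketch below uses the common subjective-logic convention for evidential classifiers: Dirichlet parameters are evidence plus one, and vacuity is the number of classes divided by the Dirichlet strength. Whether the paper uses exactly this parameterization is an assumption, as are the function names, the class order (Light, Moderate, Heavy), the example evidence values, and the choice to rank patches by vacuity for selective prediction. Dissonance is omitted.

```python
def dirichlet_summary(evidence):
    """Return expected class probabilities and vacuity from non-negative evidence."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]   # Dirichlet concentration parameters
    strength = sum(alpha)                 # total Dirichlet strength S
    probs = [a / strength for a in alpha]
    vacuity = num_classes / strength      # high when little evidence was collected
    return probs, vacuity


def select_most_certain(vacuities, keep_fraction=0.5):
    """Indices of the patches with the lowest vacuity (selective prediction)."""
    keep = max(1, int(len(vacuities) * keep_fraction))
    order = sorted(range(len(vacuities)), key=lambda i: vacuities[i])
    return sorted(order[:keep])


# Hypothetical evidence vectors in the order [Light, Moderate, Heavy].
confident_probs, confident_vacuity = dirichlet_summary([0.5, 1.0, 40.0])
unsure_probs, unsure_vacuity = dirichlet_summary([0.2, 0.3, 0.1])
kept = select_most_certain([0.9, 0.1, 0.5, 0.2])
```

No sampling is involved: one forward pass yields the evidence vector, and both the prediction and its vacuity follow in closed form.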

2605.15881 2026-05-18 math.DS cs.AI physics.comp-ph 版本更新

Symplectic Neural Operators for Learning Infinite Dimensional Hamiltonian Systems

辛神经算子用于学习无限维哈密顿系统

Yeang Makara, Yusuke Tanaka, Takashi Matsubara, Takaharu Yaguchi

发表机构 * Graduate School of Science(理学研究科) Kobe University(神户大学) NTT Communication Science Laboratories(NTT通信科学实验室) Faculty of Information Science and Technology(信息科学和技术学部) Hokkaido University(北海道大学) Institute of Mathematics for Industry(工业数学研究所) Kyushu University(九州大学)

AI总结 本文提出辛神经算子,用于解决无限维哈密顿系统建模与模拟中的计算与结构挑战,通过保持辛结构提升长期稳定性与能量行为。

详情
AI中文摘要

无限维哈密顿系统的建模与模拟是数学物理和工程中的核心问题,但对标准数据驱动架构提出了显著的计算和结构挑战。本文引入辛神经算子,一种设计用于保持哈密顿PDE内在辛结构的神经算子架构。我们对它们的辛性进行了理论表征,并基于辛结构保持与学习精度的结合,建立了严格的长期稳定性结果。对典型哈密顿PDE的数值实验验证了这一理论结果,并显示SNOs相比非结构保持神经算子表现出改进的能量行为。

英文摘要

The modeling and simulation of infinite-dimensional Hamiltonian systems are central problems in mathematical physics and engineering; however, they pose significant computational and structural challenges for standard data-driven architectures. In this work, we introduce the Symplectic Neural Operator (SNO), a neural operator architecture designed to preserve the symplectic structure intrinsic to Hamiltonian PDEs. We provide a theoretical characterization of its symplecticity and establish a rigorous long-term stability result based on the combination of symplectic structure preservation and learning accuracy. Numerical experiments on canonical Hamiltonian PDEs corroborate this theoretical result and show that SNOs exhibit improved energy behavior compared with non-structure-preserving neural operators.

2605.15880 2026-05-18 cs.CV cs.AI 版本更新

FSCM: Frequency-Enhanced Spatial-Spectral Coupled Mamba for Infrared Hyperspectral Image Colorization

FSCM:频率增强的空间-光谱耦合Mamba用于红外高光谱图像着色

Tingting Liu, Yuan Liu, Guiping Chen, Xiubao Sui, Qian Chen

发表机构 * School of Electronic and Optical Engineering, Nanjing University of Science and Technology(南京理工大学电子与光学工程学院) School of Mechanical Engineering, University of Science and Technology Beijing(北京科技大学机械工程学院) School of Instrument and Electronics, North University of China(中北大学仪器与电子学院)

AI总结 本文提出FSCM框架,通过频率增强的空间-光谱状态空间生成器和双流混合门控模块,提升红外高光谱图像着色的视觉质量和语义一致性。

详情
AI中文摘要

热红外成像对光照变化和烟雾干扰具有鲁棒性,使其在全天候感知中具有重要价值。然而,缺乏自然色彩和精细纹理限制了目标识别、人类视觉解释和可见光模型的迁移。现有红外着色方法主要依赖单波段图像,光谱线索不足可能导致结构失真和语义混淆。尽管红外高光谱图像提供丰富的光谱响应和材料信息,现有单波段框架在建模空间-光谱耦合和弱纹理细节方面仍有限。为了解决这些问题,本文提出了FSCM,一种光谱信息引导的GAN框架。在FSCM中,我们构建了由级联FSB单元组成的频率增强空间-光谱状态空间生成器。每个FSB集成了三个互补组件:状态空间建模捕捉全局空间-光谱依赖性;频率增强模块(FEM)结合多级小波分解和傅里叶门控以恢复结构轮廓、方向性高频细节和全局频率响应;双流混合门控模块(DGM)整合形变感知采样与稀疏注意力以增强有效局部结构并抑制背景干扰。此外,引入了在线语义分割引导损失以约束生成结果,提高复杂道路场景中的语义一致性。实验表明,FSCM在视觉质量和语义保真度上优于现有红外着色方法。

英文摘要

Thermal infrared imaging is robust to illumination variations and smoke interference, making it important for all-weather perception. However, the lack of natural color and fine texture limits target recognition, human visual interpretation, and the transfer of visible-light models. Existing infrared colorization methods mainly rely on single-band images, where insufficient spectral cues may lead to structural distortion and semantic confusion. Although infrared hyperspectral images provide rich spectral responses and material information, existing single-band frameworks remain limited in modeling spatial-spectral coupling and weak texture details. To address these issues, this paper presents FSCM, a spectral-information-guided GAN framework. Within FSCM, a frequency-enhanced spatial-spectral state-space generator composed of cascaded FSB units is constructed. Each FSB integrates three complementary components: state-space modeling captures global spatial-spectral dependencies; the frequency enhancement module (FEM) combines multi-level wavelet decomposition and Fourier gating to recover structural contours, directional high-frequency details, and global frequency responses; and the dual-stream hybrid gating module (DGM) integrates deformation-aware sampling with sparse attention to enhance effective local structures and suppress background interference. Additionally, an online semantic segmentation-guided loss is introduced to constrain the generated results, improving semantic consistency in complex road scenes. Experiments show that FSCM outperforms existing infrared colorization methods in visual quality and semantic fidelity.

2605.15877 2026-05-18 cs.LG cs.AI 版本更新

Shapley Neuron Values for Continual Learning: Which Neurons Matter Most?

Shapley神经元值用于持续学习:哪些神经元最为关键?

Mohammad Ali Vahedifar, Abhisek Ray, Qi Zhang

发表机构 * Department of Electrical and Computer Engineering, Aarhus University, Denmark(电气与计算机工程系,奥胡斯大学,丹麦)

AI总结 本文提出Shapley神经元估值框架,通过量化持续学习中神经元重要性,实现无缓冲的持续学习,实验显示其在类别增量学习和任务增量学习中分别提升准确率2.88%和6.46%。

Comments This paper has been accepted to ICML 2026

详情
AI中文摘要

持续学习使神经网络能够按顺序学习任务而不遗忘先前知识。然而,神经网络面临灾难性遗忘问题,即学习新任务会降低先前任务的性能。我们通过Shapley神经元估值(SNV)解决此问题,该框架基于合作博弈理论,量化持续学习中的神经元重要性。SNV选择性冻结重要神经元,保持其他神经元的可塑性,实现无缓冲的持续学习,无需扩展架构。在ImageNet-1k实验中,SNV在类别增量学习和任务增量学习场景中分别比第二基准方法提升准确率2.88%和6.46%。

英文摘要

Continual learning enables neural networks to learn tasks sequentially without forgetting previously acquired knowledge. However, neural networks suffer from catastrophic forgetting, where learning new tasks degrades performance on earlier ones. We address this problem with Shapley Neuron Valuation (SNV), a principled framework grounded in cooperative game theory that quantifies neuron importance in continual learning. SNV selectively freezes important neurons while keeping others plastic, enabling buffer-free continual learning without expanding the architecture. Experiments on ImageNet-1k show that SNV consistently outperforms existing buffer-free methods. In particular, SNV improves accuracy by +2.88% in class-incremental learning and +6.46% in task-incremental learning compared to the second baseline.
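One generic way to estimate Shapley values for a set of neurons is Monte Carlo sampling over random orderings, sketched below. This is a textbook estimator, not necessarily the one SNV uses; the toy value function, the neuron names, the number of permutations, and the freezing threshold are all hypothetical.

```python
import random


def shapley_values(players, value_fn, num_permutations=2000, seed=0):
    """Monte Carlo estimate: average marginal contribution over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(coalition)
        for p in order:
            coalition.add(p)
            current = value_fn(coalition)
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: t / num_permutations for p, t in totals.items()}


def toy_accuracy(active):
    """Toy stand-in for 'accuracy with only these neurons active'."""
    score = 0.0
    if "n0" in active:
        score += 0.5                       # n0 helps on its own
    if "n1" in active and "n2" in active:
        score += 0.3                       # n1 and n2 only help together
    return score                           # n3 never contributes


values = shapley_values(["n0", "n1", "n2", "n3"], toy_accuracy)
# Freeze neurons whose estimated value exceeds a hypothetical threshold.
frozen = [n for n, v in values.items() if v > 0.2]
```

In this toy game the credit for the joint contribution of `n1` and `n2` is split between them, which is the property that makes Shapley values attractive for ranking interacting units.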

2605.15871 2026-05-18 cs.AI 版本更新

Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

代理发现神经架构:AIRA-Compose和AIRA-Design

Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu, Yoram Bachrach

发表机构 * FAIR at Meta(Meta的FAIR)

AI总结 本文提出AIRA-Compose和AIRA-Design框架,通过自主设计神经网络架构,实现超越标准Transformer的基础模型,提升模型性能和效率。

Comments 55 pages, 28 figures, 21 tables

详情
AI中文摘要

为实现递归自我改进,我们研究了LLM代理自主设计超越标准Transformer的基础模型。我们引入双框架方法:AIRA-Compose用于高层架构搜索,AIRA-Design用于底层机制实现。AIRA-Compose使用11个代理在24小时预算内探索基本计算原语。代理评估百万参数级的候选模型,并将最优设计外推到350M、1B和3B规模。这产生了14种架构,分属两个家族:AIRAformers(基于Transformer)和AIRAhybrids(Transformer-Mamba)。在1B规模上预训练后,这些模型一致优于Llama 3.2和Composer发现的基线。在下游任务中,AIRAformer-D和AIRAhybrid-D的准确率分别比Llama 3.2高2.4%和3.8%。此外,AIRA-Compose发现了具有高效扩展前沿的模型:AIRAformer-C的扩展速度分别比Llama 3.2和Composer的最佳Transformer快54%和71%,而AIRAhybrid-C的扩展速度比Nemotron-2快23%,比Composer的最佳混合模型快37%。AIRA-Design让20个代理编写用于长距离依赖的新型注意力机制以及高性能训练脚本。在Long Range Arena基准测试中,代理设计的架构在文档匹配和文本分类任务上与人类最先进水平的差距分别在2.3%和2.6%以内。在Autoresearch基准测试中,Greedy Opus 4.5在固定时间预算下达到0.968的验证bits-per-byte,超过已发布的最低值。这些框架共同表明,AI代理可以自主发现媲美或超越人工设计基线的架构和算法优化。这为发现下一代基础模型建立了一种强大的范式,标志着向递归自我改进迈出了明确的一步。

英文摘要

Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures reach within 2.3% and 2.6% of human state-of-the-art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models, marking a clear step toward recursive self-improvement.

2605.15836 2026-05-18 cs.RO cs.AI 版本更新

GAP: Geometric Anchor Pre-training for Data-Efficient Visuomotor Learning of Manipulation Tasks

GAP:用于操作任务数据高效视觉运动学习的几何锚预训练

Davide Buoso, Andrea Protopapa, Stefano Di Carlo, Francesca Pistilli, Giuseppe Averta

发表机构 * Department of Control and Computer Engineering, Polytechnic University of Turin(控制与计算机工程系,都灵理工大学)

AI总结 本文提出GAP,通过预训练空间适配器生成稳定的几何锚点,提升在稀疏数据下的视觉运动策略学习性能,实验显示其在多个任务中优于其他方法。

Comments Project webpage at https://lambdavi.github.io/gap

详情
AI中文摘要

从稀缺专家演示中学习视觉运动策略仍然是机器人操作中的核心挑战。主要困难在于将高维RGB表示压缩为与控制相关的几何表示而不过拟合。虽然使用冻结的预训练视觉基础模型(VFMs)提高了数据效率,但大多数任务适应仍落在小的空间池化模块上,这在微调时容易捕捉到任务无关的捷径并失去几何基础。更广泛地说,用于策略学习的预训练视觉表示在面对轻微场景扰动时表现不佳,凸显了需要以鲁棒性为导向的归纳偏置。我们提出几何锚预训练(GAP),一种简单的、无动作的预热阶段,通过在下游模仿学习前正则化空间适配器。GAP在轻量级模拟代理任务上预训练池化层,其中对象掩码可免费获得,鼓励适配器生成位于物体上的关键点,覆盖其空间范围,并保持时间上的锐利和可重复性。这会产生稳定的几何锚点,为少样本策略学习提供可靠的坐标接口,同时保持VFM冻结。我们在RoboMimic和ManiSkill上评估GAP,在严重数据稀缺(15-50次演示)和领域转移下。一个简单的适配器通过GAP正则化,始终优于更强的注意力池化器和端到端微调,分别在RoboMimic Can上以15次演示达到62%的成功率(比AFA高16%),在长周期高精度Tool Hang任务上以50次演示达到63%,在ManiSkill StackCube上以30次演示达到61%(比完全微调高11%)。代理阶段轻量且完全解耦于下游任务,使其在不同环境和操作技能中具有实用性。

英文摘要

Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation. A primary hurdle lies in distilling high-dimensional RGB representations into control-relevant geometry without overfitting. While using frozen pre-trained Vision Foundation Models (VFMs) improves data efficiency, it also shifts most task adaptation onto a small spatial pooling module, which can latch onto task-irrelevant shortcuts and lose geometric grounding when finetuned with few data samples. More broadly, pre-trained visual representations used for policy learning have been observed to struggle under even minor scene perturbations, highlighting the need for robustness-oriented inductive biases. We propose Geometric Anchor Pre-training (GAP), a simple, action-free warm-up stage that regularizes the spatial adapter before downstream imitation learning. GAP pre-trains the pooling layer on a lightweight simulated proxy task where object masks are available at no cost, encouraging the adapter to produce keypoints that lie on the object, cover its spatial extent, and remain sharp and repeatable over time. This yields stable geometric anchors that provide a reliable coordinate interface for few-shot policy learning, while keeping the VFM frozen. We evaluate GAP on RoboMimic and ManiSkill under severe data scarcity (15-50 demonstrations) and domain shift. A simple adapter regularized with GAP consistently outperforms stronger attention-based poolers and end-to-end fine-tuning, achieving 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA), 63% on the long-horizon high-precision Tool Hang task with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine-tuning). The proxy stage is lightweight and fully decoupled from downstream tasks, making it practical to reuse across environments and manipulation skills.

2605.15831 2026-05-18 cs.SD cs.AI 版本更新

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

将音乐建模为时频图像:一种用于音乐生成的2D分词器

Yuqing Cheng, Xingyu Ma, Guochen Yu, Xiaotao Gu

发表机构 * Department of Music AI and Information Technology, Central Conservatory of Music(音乐人工智能与信息技术系,中央音乐学院) Zhipu AI(智谱AI)

AI总结 本文提出BandTok,一种面向生成的2D梅尔频谱分词器,通过单个共享码本生成梅尔频带token,提升自回归建模能力,实验表明其在数据有限情况下表现优异。

详情
AI中文摘要

自回归音乐生成高度依赖音频分词器。现有高保真编码器常使用残差多码本量化,虽保留重建质量但序列展平后语言建模复杂,因残差层次强序列依赖且放大误差积累。我们提出BandTok,一种面向生成的2D梅尔频谱分词器,通过单个共享码本生成梅尔频带token,生成物理可解释的时频token网格,具有更独立的token结构,更适合自回归建模。BandTok通过多尺度PatchGAN目标和EMA码本更新提升重建质量。我们进一步引入具有2D Rotary Position Embedding(2D RoPE)的自回归语言模型,以在生成过程中保持时间和频带结构。实验表明,BandTok优于残差码本分词器,在数据有限情况下表现优异。本工作源代码和生成演示已公开。

英文摘要

Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.
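A 2D rotary position embedding for a time-frequency token grid can be sketched by rotating one half of each feature vector by the time index and the other half by the band index. The half-and-half split, the base of 10000, the function names, and the example vectors are assumptions for illustration; the abstract does not state how BandTok's language model allocates dimensions.

```python
import math


def rope_1d(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, time_idx, band_idx):
    """First half encodes the frame index, second half the Mel band index.

    Assumes len(vec) is divisible by 4 so each half splits into pairs.
    """
    half = len(vec) // 2
    return rope_1d(vec[:half], time_idx) + rope_1d(vec[half:], band_idx)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.2, 0.9]
k = [1.0, 0.1, -0.6, 0.8, -0.3, 0.5, 0.4, -1.0]

# Both pairs of positions differ by (2 frames, 3 bands).
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 10, 7), rope_2d(k, 8, 4))
```

The two scores agree up to floating-point error because the rotated dot product depends only on the offsets along each axis, so attention can treat temporal distance and frequency-band distance separately.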

2605.15812 2026-05-18 cs.HC cs.AI 版本更新

Toward Natural and Companionable Virtual Agents via Cross-Temporal Emotional Modeling

通过跨时间情感建模实现自然和陪伴型虚拟代理

Feier Qin, Xiao Li, Yi Zheng, Haibin Huang, Hanyao Wang, Xiaoyu Wang, Yan Lu, Yuan Zhang

发表机构 * Communication University of China(中国传媒大学) Microsoft Research Asia(微软亚洲研究院) Institute of Artificial Intelligence, China Telecom(中国电信人工智能研究院)

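AI summary This paper proposes the CTEM framework, which links long-term behavioral history to moment-to-moment emotional expression; a 21-day in-the-wild study reports improved perceived naturalness, coherence, and emotional harmony of the companion agent.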

Comments 21 pages, published in CHI '26

Journal ref Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), ACM, 2026

详情
AI中文摘要

基础模型的最新进展使对话代理开始以持续陪伴而非单纯完成任务为目标。然而,大多数代理仍无法支持自然、长期的陪伴式互动,导致体验显得片段化且不真实。我们认为,当前代理忽视了对其社会行为和内部情感的跨时间建模:生成的行为很少影响代理的情感状态,而情感状态也很少塑造后续行为。我们提出了跨时间情感建模(CTEM)框架,将长期行为历史与即时情感表达联系起来。CTEM建立了一个闭环:过去的经验更新不断演化的情感状态;该状态调节即时互动;用户反馈持续修订记忆和情感状态,使反思和预期成为可能。我们将CTEM实例化为Auri,一个即时通讯平台上的陪伴代理,并报告了一项为期21天的真实场景研究,结果显示CTEM提升了感知自然性、连贯性和情感和谐度。

英文摘要

Recent advances in foundation models have enabled conversational agents that aim for sustained companionship rather than mere task completion. Yet most remain unable to support natural, long-term companion-like interactions, resulting in experiences that feel episodic and inauthentic. We argue that current agents overlook cross-temporal modeling of agents' social behaviors and internal emotions: generated behaviors rarely influence an agent's emotional state, and emotional states seldom shape subsequent behaviors. We present Cross-Temporal Emotion Modeling (CTEM), a framework that links long-term behavioral history to moment-to-moment emotional expression. CTEM establishes a closed loop where past experiences update an evolving emotional state; this state conditions immediate interactions; and user feedback continually revises both memory and emotional state, enabling reflection and anticipation. We instantiate CTEM as Auri, a companion agent on an instant-messaging platform, and report a 21-day in-the-wild study showing that CTEM improves perceived naturalness, coherence, and emotional harmony.

2605.15787 2026-05-18 cs.LG cs.AI 版本更新

Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets

作为结构推断的Grokking:Transformer需要贝叶斯彩票

Kai Hidajat, Solden Stoll, Joseph An

发表机构 * Department of Computer Science(计算机科学系) University of Washington(华盛顿大学) Seattle, WA 98195(西雅图, WA 98195)

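AI summary This paper frames delayed generalization (grokking) in Transformers as delayed structural inference by attention, and proposes a Bayesian lottery-ticket account that ties the delay to learning the task dependency graph.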

详情
AI中文摘要

为什么一个已经记住训练集的Transformer要再等数千步才开始泛化?现有解释将这种延迟归因于范数最小化、特征涌现或稀疏子网络的晚期发现。这些解释捕捉了这一转变的重要部分,但忽略了注意力模型特有的一个约束:如果注意力丢弃了一个信息性token,任何有界的下游计算都无法恢复它。我们将注意力形式化为任务依赖图上的隐式贝叶斯后验,并证明泛化需要两个可分离的条件:一个关于MLP容量的Goldilocks界,与基于范数的Grokking理论一致;以及一个新的贝叶斯结构性条件,要求注意力在每个信息性token上放置足够的质量。这种解耦将延迟泛化解释为延迟的结构推断。训练早期,MLP通过未对齐的特征完成记忆,使交叉熵损失接近零,从而使注意力得不到结构梯度。此后权重衰减必须先侵蚀记忆,缺失的图才变得可学习,由此产生已知的与权重衰减成反比的延迟,我们将其推导为结构等待时间。随后我们证明,这种解释消除(explaining-away)延迟可以通过基于KL的结构性干预绕过,从而得到Grokking时间与干预强度成反比的缩放定律。在算法序列任务上的实验将结构与容量分离,并表明这种贝叶斯彩票的效果与彩票迁移(lottery-ticket transfer)相当或更优。

英文摘要

Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks. These explanations capture important parts of the transition, but ignore a constraint unique to attention-based models: if attention discards an informative token, no bounded downstream computation can recover it. We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding the known inverse-weight-decay delay, which we derive as a structural waiting time. We then prove that this explaining-away delay can be bypassed by a KL-based structural intervention, yielding an inverse-intervention-strength scaling law for the grokking time. Experiments on algorithmic sequence tasks isolate structure from capacity and show that this Bayesian ticket matches or outperforms lottery-ticket transfer.

2605.15779 2026-05-18 cs.RO cs.AI 版本更新

A Topology-Aware Spatiotemporal Handover Framework for Continuous Multi-UAV Tracking

一种用于连续多无人机跟踪的拓扑感知时空切换框架

Jianlin Ye, Christos Kyrkou, Panayiotis Kolios

发表机构 * KIOS Research and Innovation Centre of Excellence (KIOS CoE)(KIOS研究与创新中心(KIOS CoE)) University of Cyprus(塞浦路斯大学)

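AI summary This paper presents a real-time multi-camera multi-vehicle tracking system that uses a topology-based spatiotemporal handover mechanism to maintain vehicle identity across multiple UAV fields of view; experiments report a 99.8% handover success rate, compared with 74.1% for Re-ID baselines.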

Journal ref 2026 International Conference on Unmanned Aircraft Systems (ICUAS)

详情
AI中文摘要

将无人机(UAV)整合到智能交通系统(ITS)中可为交通监控提供全局可见性,但可扩展部署受到轨迹碎片化的阻碍,即车辆身份在多个无人机视场(FOV)之间无法保持。尽管最先进的框架在优化单机影像的局部轨迹提取和稳定性方面表现出色,但它们往往成为孤立的数据孤岛,生成不连贯的轨迹,从而无法进行起讫点估计等网络层面的分析。本文提出了一种实时多摄像头多车辆跟踪(MCMT)系统,用于处理全局身份持续性。针对俯视视角下基于外观的重识别(Re-ID)存在的视觉模糊和计算成本问题,我们引入了一种轻量级的基于拓扑的时空切换机制。我们实现了高吞吐量的并行管道,利用YOLO11和ByteTrack处理并发的4K视频流。我们的核心贡献是一种确定性的基于队列的匹配算法,它利用几何重叠和虚拟车道离散化,通过FIFO队列预测性地管理身份切换。在包括交叉口和汇入交通在内的复杂城市环境中,实验结果表明连续交通流中的切换成功率(HOSR)为99.8%,显著优于Re-ID基线(74.1%),同时验证了边缘部署的可行性。源代码可在https://github.com/JYe9/multi-camera-multi-vehicle-tracking-system获取。

英文摘要

The integration of Unmanned Aerial Vehicles (UAVs) into Intelligent Transportation Systems (ITS) offers synoptic visibility for traffic monitoring, yet scalable deployment is hindered by trajectory fragmentation, where vehicle identity persistence is lost across multi-UAV Fields of View (FOV). While state-of-the-art frameworks excel in optimizing local trajectory extraction and stability for single-drone imagery, they often function as isolated data silos that generate disjointed trajectories, thereby precluding network-level analysis such as Origin-Destination estimation. This paper presents a real-time Multi-Camera Multi-Vehicle Tracking (MCMT) system designed to handle global identity persistence. Addressing the visual ambiguity and computational cost of appearance-based Re-Identification (Re-ID) in nadir views, we introduce a lightweight Topology-Based Spatiotemporal Handover mechanism. We implement a high-throughput parallel pipeline leveraging YOLO11 and ByteTrack to process concurrent 4K streams. Our core contribution is a deterministic queue-based matching algorithm that utilizes geometric overlaps and virtual lane discretization to predictively manage identity handover via FIFO queues. Experimental results on complex urban environments, including intersections and merging traffic, demonstrate a Handover Success Rate (HOSR) of 99.8% in continuous traffic flows, significantly outperforming Re-ID baselines (74.1%) while validating edge deployment feasibility. The source code is available at https://github.com/JYe9/multi-camera-multi-vehicle-tracking-system.
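The queue-based handover idea can be sketched as follows. This is a toy illustration, not the authors' implementation: the class and method names are hypothetical, and it assumes that vehicles keep their order within a virtual lane while crossing the overlap between two camera views.

```python
from collections import defaultdict, deque

class LaneHandover:
    """One FIFO queue of global IDs per virtual lane in the overlap zone."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def exit_view(self, lane, global_id):
        # A tracked vehicle leaves the upstream camera through this lane.
        self._queues[lane].append(global_id)

    def enter_view(self, lane):
        # A new track appears downstream in this lane: it inherits the
        # oldest pending ID, or None if nothing is waiting.
        queue = self._queues[lane]
        return queue.popleft() if queue else None

handover = LaneHandover()
handover.exit_view(lane=0, global_id=11)
handover.exit_view(lane=0, global_id=12)
handover.exit_view(lane=1, global_id=20)
first_lane0 = handover.enter_view(lane=0)
first_lane1 = handover.enter_view(lane=1)
```

Because matching only pops from a queue, it needs no appearance features, which is consistent with the abstract's motivation of avoiding Re-ID cost in nadir views.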

2605.15120 2026-05-18 cs.RO cs.AI cs.CV 版本更新

CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning

CLOVER:端到端自动驾驶规划的闭环价值估计与排序

Sining Ang, Yuguang Yang, Canyu Chen, Yan Wang

发表机构 * Department of Automation, University of Science and Technology of China(中国科学技术大学自动化系) Institute for AI Industry Research, Tsinghua University(清华大学人工智能产业研究院) School of Electronic Information Engineering, Beihang University(北京航空航天大学电子信息工程学院) National College for Excellent Engineers, Beihang University(北京航空航天大学国家卓越工程师学院)

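AI summary CLOVER is a closed-loop value estimation and ranking framework that addresses the training-evaluation mismatch in end-to-end autonomous driving planning; a lightweight generator-scorer architecture generates candidate trajectories and ranks them more accurately.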

详情
AI中文摘要

端到端自动驾驶规划器通常通过模仿单条记录轨迹进行训练,却通过衡量安全性、可行性、行进进度和舒适性的基于规则的规划指标进行评估。这导致了训练与评估之间的不匹配:接近记录路径的轨迹可能违反规划规则,而偏离记录路径的替代方案可能仍然有效且得分高。这种不匹配对提案选择式规划器的限制尤为明显,因为其性能依赖于候选集覆盖和评分器排序质量。我们提出了CLOVER,一种用于端到端自动驾驶规划的闭环价值估计与排序框架。CLOVER采用轻量级生成器-评分器架构:生成器产生多样化的候选轨迹,评分器预测规划指标子分数,以便在推理时对它们进行排序。为了将提案支持扩展到单轨迹模仿之外,CLOVER构建了经评估器过滤的伪专家轨迹,并通过集合级覆盖监督训练生成器。然后,它执行保守的闭环自蒸馏:评分器拟合生成提案上的真实评估器子分数,而生成器则在稳定性正则化下向教师选择的top-k和向量帕累托目标细化。我们分析了不完美的评分器在何种条件下能够改进生成器,表明当评分器选择的目标在真实评估器下得到富集且更新保持保守时,评分器介导的细化是可靠的。在NAVSIM上,CLOVER实现了94.5 PDMS和90.4 EPDMS,达到新的最先进水平。在更具挑战性的NavHard分割上,它获得了48.3 EPDMS,与已报告的最强结果持平。在补充的nuScenes开环评估中,CLOVER在所比较的方法中实现了最低的L2误差和碰撞率。代码和数据将在https://github.com/WilliamXuanYu/CLOVER上发布。

英文摘要

End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training-evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator-scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-$k$ and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code and data will be released at https://github.com/WilliamXuanYu/CLOVER.
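The top-k and vector-Pareto target selection mentioned above can be illustrated with a small sketch over sub-score vectors. The function names are hypothetical, higher is assumed to be better on every sub-score, and ranking by the plain sum of sub-scores is only a placeholder for whatever aggregate the real planning metric uses.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Indices of candidates whose sub-score vector no other candidate dominates."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(other, s) for j, other in enumerate(scores) if j != i)
    ]

def top_k(scores, k):
    """Indices of the k candidates with the largest summed sub-scores (placeholder aggregate)."""
    order = sorted(range(len(scores)), key=lambda i: sum(scores[i]), reverse=True)
    return order[:k]

# Toy (safety, progress) sub-scores for four candidate trajectories.
candidates = [(1.0, 0.2), (0.8, 0.9), (0.7, 0.8), (0.3, 1.0)]
front = pareto_front(candidates)
best_two = top_k(candidates, k=2)
```

The two selections differ on purpose: a candidate can rank well on the aggregate while being dominated, and an extreme but undominated candidate can sit on the Pareto front without making the top-k.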

2605.15108 2026-05-18 stat.ML cs.AI cs.IR cs.LG stat.ME 版本更新

Logging Policy Design for Off-Policy Evaluation

为离线策略评估设计日志策略

Connor Douglas, Joel Persson, Foster Provost

发表机构 * New York University(纽约大学) Spotify

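AI summary This paper studies how to design logging policies that minimize OPE error, characterizes a fundamental reward-coverage tradeoff, and derives optimal policies under several informational regimes.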

详情
AI中文摘要

离线策略评估(OPE)利用由另一日志策略收集的数据,来估计目标处理策略(如推荐系统)的价值。它使高风险实验无需实时部署即可进行,但在实践中,其准确性严重依赖于用来收集估计所需数据的日志策略。我们研究如何针对给定的目标策略设计日志策略,以最小化OPE误差。我们刻画了一个根本性的奖励-覆盖权衡:将概率质量集中在高奖励动作上会减少方差,但可能错过目标策略可能采取的动作上的信号。我们提出了一个统一的日志策略设计框架,并在几种典型的信息场景中推导出最优策略,即目标策略和奖励分布在日志记录时(i)已知、(ii)未知、(iii)通过先验或噪声估计部分已知。我们的结果为需要在多个候选推荐系统之间做选择的公司提供了可操作的指导。我们展示了在为OPE收集数据时处理(treatment)选择的重要性,并描述了当这是公司的主要目标时理论上最优的方法。我们还提炼了在操作约束使理论最优无法实施时选择日志策略的实用设计原则。

英文摘要

Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
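The variance side of the reward-coverage tradeoff can be made concrete with inverse propensity scoring (IPS), a standard OPE estimator. The sketch below is an assumption-laden toy, not the paper's framework: it uses a context-free setting with deterministic rewards, and all names and numbers are hypothetical.

```python
def ips_estimate(logs, target):
    """IPS value estimate from logs of (action, reward, logging_probability)."""
    return sum(target[a] / p * r for a, r, p in logs) / len(logs)

def ips_variance(target, logging, reward):
    """Per-sample variance of the IPS term when rewards are deterministic."""
    value = sum(target[a] * reward[a] for a in target)
    second_moment = sum((target[a] * reward[a]) ** 2 / logging[a] for a in target)
    return second_moment - value ** 2

target = {"a": 0.5, "b": 0.3, "c": 0.2}      # target policy (toy)
reward = {"a": 1.0, "b": 2.0, "c": 0.5}      # deterministic rewards (toy)
value = sum(target[a] * reward[a] for a in target)

uniform = {a: 1.0 / 3.0 for a in target}
# Logging mass proportional to target probability times reward.
reward_weighted = {a: target[a] * reward[a] / value for a in target}

var_uniform = ips_variance(target, uniform, reward)
var_weighted = ips_variance(target, reward_weighted, reward)
```

In this idealized case the reward-weighted logging policy drives the variance to zero, while uniform logging does not. The coverage side of the tradeoff shows up as soon as rewards are unknown: if the logging policy puts too little mass on an action the target policy takes, the weight `target[a] / p` becomes large or undefined.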

2605.14344 2026-05-18 cs.AI 版本更新

CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

CrystalReasoner: 基于推理和强化学习的性质条件晶体结构生成

Yuyang Wu, Stefano Falletta, Delia McGrath, Sherry Yang

发表机构 * Tsinghua University(清华大学) Radical AI New York University(纽约大学)

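AI summary CrystalReasoner combines physical priors, expressed as thinking tokens, with reinforcement learning to generate stable crystal structures with target properties from natural language instructions, improving generation precision and scientific plausibility.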

Comments Our work is available at https://crystalreasoner.github.io/, with code at https://github.com/wyy603/CrystalReasoner

详情
AI中文摘要

生成式建模已成为晶体结构发现的一种有前途的方法。然而,现有基于LLM的生成模型难以达到原子级精度,而基于扩散的方法在整合高层科学知识方面存在不足。因此,生成的结构往往无效、不稳定,或不具备期望的性质。为此,我们提出了CrystalReasoner(CrysReas),一种端到端的LLM框架,通过推理和对齐从自然语言指令生成晶体结构。CrysReas将物理先验作为思考标记引入,在生成原子坐标之前先给出晶体学对称性、局部配位环境和预测的物理性质。这架起了自然语言与3D结构之间的桥梁。CrysReas随后采用带有多目标密集奖励函数的强化学习(RL),使生成结果与物理有效性、化学一致性和热力学稳定性对齐。对于性质条件任务,我们设计了任务特定的奖励函数,并针对离散约束(如空间群)和连续性质(如弹性、热膨胀)训练专门模型。实证结果表明,与先前工作以及不使用思考痕迹或RL的基线相比,CrysReas在多种指标上表现更好,将S.U.N.比率提升至三倍,并在性质条件生成中取得更好表现。CrysReas还表现出自适应推理:随着原子数增加,推理长度也随之增加。我们的工作展示了利用思考痕迹和RL生成有效、稳定且满足性质条件的晶体结构的潜力。

英文摘要

Generative modeling has emerged as a promising approach for crystal structure discovery. However, existing LLM-based generative models struggle with low-level atomic precision, while diffusion-based methods fall short in integrating high-level scientific knowledge. As a result, generated structures are often invalid, unstable, or do not possess desirable properties. To address this gap, we propose CrystalReasoner (CrysReas), an end-to-end LLM framework that generates crystal structures from natural language instructions through reasoning and alignment. CrysReas introduces physical priors as thinking tokens, which include crystallographic symmetry, local coordination environments, and predicted physical properties, before generating atomic coordinates. This bridges the gap between natural language and 3D structures. CrysReas then employs reinforcement learning (RL) with a multi-objective, dense reward function to align generation with physical validity, chemical consistency, and thermodynamic stability. For property-conditioned tasks, we design task-specific reward functions and train specialized models for discrete constraints (e.g., space group) and continuous properties (e.g., elasticity, thermal expansion). Empirical results demonstrate that, compared to prior works and baselines without thinking traces or RL, CrysReas performs better on diverse metrics, triples the S.U.N. ratio, and improves property-conditioned generation. CrysReas also exhibits adaptive reasoning, increasing reasoning lengths as the number of atoms increases. Our work demonstrates the potential of leveraging thinking traces and RL for generating valid, stable, and property-conditioned crystal structures.

2605.14236 2026-05-18 cs.LG cs.AI cs.CL 版本更新

Active Learners as Efficient PRP Rerankers

主动学习者作为高效的PRP重排序器

Jeremías Figueiredo Paschmann, Juan Kaplan, Francisco Nattero, Santiago Barron, Juan Wisznia, Luciano del Corro

发表机构 * ELIAS Lab, Departamento de Ingeniería, Universidad de San Andrés(ELIAS实验室,工程系,圣安德烈大学)

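AI summary This paper reframes PRP reranking as active learning from noisy pairwise comparisons, shows that active rankers improve NDCG@10 per call under a limited call budget, and introduces a randomized-direction oracle that avoids the cost of bidirectional calls.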

Comments 13 pages, 7 figures. Preprint

详情
AI中文摘要

Pairwise Ranking Prompting(PRP)从大语言模型(LLM)中获取成对偏好判断,再将其聚合为排序,通常借助经典排序算法。然而,这些判断带有噪声、对呈现顺序敏感,有时还不满足传递性,因此排序算法的假设与实际设置并不匹配。由于排序算法旨在恢复完整排列,为满足调用预算而将其截断并不能得到可靠的top-K。因此,我们将PRP重排序重新表述为从带噪声的成对比较中进行主动学习,并表明主动排序器可以作为即插即用的替代方案,在调用受限的情形下提升每次调用的NDCG@10。我们的噪声鲁棒框架还引入了一种随机方向oracle,每对只需一次LLM调用。该方法将系统性的位置偏差转化为零均值噪声,从而无需双向调用的成本即可实现无偏的聚合排序。

英文摘要

Pairwise Ranking Prompting (PRP) elicits pairwise preference judgments from an LLM, which are then aggregated into a ranking, usually via classical sorting algorithms. However, judgments are noisy, order-sensitive, and sometimes intransitive, so sorting assumptions do not match the setting. Because sorting aims to recover a full permutation, truncating it to meet a call budget does not produce a dependable top-K. We thus reframe PRP reranking as active learning from noisy pairwise comparisons and show that active rankers are drop-in replacements that improve NDCG@10 per call in the call-constrained regime. Our noise-robust framework also introduces a randomized-direction oracle that uses a single LLM call per pair. This approach converts systematic position bias into zero-mean noise, enabling unbiased aggregate ranking without the cost of bidirectional calls.
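A minimal sketch of a randomized-direction oracle follows. It assumes a hypothetical `judge(first, second)` callable that returns `True` when it prefers the item shown first; the extreme `always_first` judge below stands in for an LLM with pure position bias and is not from the paper.

```python
import random

def randomized_direction_compare(a, b, judge, rng):
    """Return True if a is judged better than b, using one call in a random order."""
    if rng.random() < 0.5:
        return judge(a, b)       # a is shown first
    return not judge(b, a)       # b is shown first, so flip the verdict

def always_first(first, second):
    """Worst-case position bias: the judge always prefers whatever is shown first."""
    return True

rng = random.Random(0)
trials = 10000
wins_for_a = sum(
    randomized_direction_compare("doc_a", "doc_b", always_first, rng)
    for _ in range(trials)
)
win_rate = wins_for_a / trials
```

With a fixed presentation order this judge would prefer `doc_a` every time. Randomizing the direction makes the bias favor each document about half the time, so it averages out across comparisons instead of accumulating, at the cost of one call per pair rather than two.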

2605.13169 2026-05-18 cs.CV cs.AI 版本更新

PanoWorld: Towards Spatial Supersensing in 360° Panorama World

PanoWorld:迈向360度全景世界的空间超感知

Changpeng Wang, Xin Lin, Junhan Liu, Yuheng Liu, Zhen Wang, Donglian Qi, Yunfeng Yan, Xi Chen

发表机构 * Zhejiang University(浙江大学) University of California, San Diego(加州大学圣地亚哥分校) University of California, Irvine(加州大学尔湾分校) The University of Hong Kong(香港大学)

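AI summary This paper proposes PanoWorld, which builds pano-native understanding to address the limited spatial perception of perspective-image multimodal LLMs; it injects spherical geometry through Spherical Spatial Cross-Attention to improve 3D spatial reasoning, and introduces the PanoSpace-Bench benchmark to validate pano-native supervision.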

Comments Project page: https://wcpcp.github.io/PanoWorld

详情
AI中文摘要

在主流的透视图像范式下,多模态大语言模型(MLLM)仍难以实现空间理解,因为该范式继承了类人感知的狭窄视野。对于导航、机器人搜索和3D场景理解,360度全景感知通过一次性捕捉整个周围环境提供了一种超感知。然而,现有MLLM流程通常将全景图分解为多个透视视图,使等距柱状投影(ERP)的球面结构在很大程度上只是隐含的。本文研究全景原生理解,它要求MLLM将ERP全景图作为一个连续的、以观察者为中心的空间进行推理。为此,我们首先定义了全景原生理解的关键能力,包括语义锚定、球面定位、参考系变换和深度感知的3D空间推理。然后,我们构建了一个大规模元数据构造流程,将混合来源的ERP全景图转换为几何感知、语言对齐和深度感知的监督信号,并将这些信号实例化为与能力对齐的指令微调数据。在模型方面,我们提出了带有球面空间交叉注意力(Spherical Spatial Cross-Attention)的PanoWorld,将球面几何注入视觉流。我们进一步构建了PanoSpace-Bench,一个用于评估ERP原生空间推理的诊断基准。实验表明,PanoWorld在PanoSpace-Bench、H* Bench和R2R-CE Val-Unseen基准上显著优于专有和开源基线。这些结果表明,稳健的全景推理需要专门的全景原生监督和几何感知的模型适配。所有源代码和所提出的数据将公开发布。

英文摘要

Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.

2605.12581 2026-05-18 cs.LO cs.AI cs.FL math.OC 版本更新

Ensuring Logic in the Fog: Sound POMDP Synthesis with LTL Objectives

确保雾中的逻辑:带有LTL目标的可靠POMDP综合

Can Zhou, Yulong Gao, Pian Yu

发表机构 * Imperial College London(伦敦帝国理工学院) University College London(伦敦大学学院)

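AI summary This paper proposes a sound reward-shaping mechanism for synthesis with LTL objectives in partially observable Markov decision processes, integrated into an enhanced Monte Carlo planning framework to improve navigation under partial observability.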

Comments Accepted by IJCAI-ECAI 2026, the 35th International Joint Conference on Artificial Intelligence

详情
AI中文摘要

合成能够导航不确定环境并遵守复杂时间约束的自主代理仍然是基本挑战。虽然线性时序逻辑(LTL)提供了一种严格指定此类任务的语言,但部分可观测马尔可夫决策过程(POMDP)中验证LTL满足的固有不可判定性使得定量合成困难,尤其是在为近似求解器设计可靠奖励信号时。本文通过一种新颖且可靠的奖励塑造机制填补了这一空白,该机制动态生成基于信念的奖励,这些奖励基于已认证的LTL满足。通过将此机制整合到增强的蒙特卡洛规划框架中,我们使代理能够通过专注于最大化可验证成功的搜索过程来导航部分可观测性中的'雾'。实验表明,该方法不仅在现有求解器失败的场景中表现出色,而且在多样化的基准领域中保持了有效性和可扩展性。

英文摘要

Synthesising autonomous agents that can navigate uncertain environments while adhering to complex temporal constraints remains a fundamental challenge. While Linear Temporal Logic (LTL) provides a rigorous language for specifying such tasks, the inherent undecidability of qualitatively verifying LTL satisfaction in partially observable Markov decision processes renders quantitative synthesis difficult, especially when designing reliable reward signals for approximate solvers. In this paper, we bridge this gap with a novel, sound reward-shaping mechanism that dynamically generates belief-dependent rewards grounded in certified LTL satisfaction. By integrating this mechanism into an enhanced Monte Carlo Planning framework, we empower agents to navigate the 'fog' of partial observability with a search process focused on maximising verifiable success. Our experiments demonstrate that this approach not only thrives in scenarios where existing solvers fail but also maintains effectiveness and scalability across diverse benchmark domains.

2605.12509 2026-05-18 cs.SI cs.AI cs.CE math.CO 版本更新

Representing Higher-Order Networks: A Survey of Graph-Based Frameworks

表示高阶网络:基于图的框架综述

Takaaki Fujita, Florentin Smarandache

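AI summary This book surveys graph-based frameworks for representing higher-order networks, covering multiway, hierarchical, temporal, multilayer, recursive, and tensor-based interactions, and aims to give a unified perspective for comparing models and choosing suitable tools.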

Comments 170 pages. Peer-Reviewed Book. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-1-59973-881-9

详情
AI中文摘要

许多现实世界现象可以自然地用图和网络建模。然而,经典图模型通常局限于成对交互,可能无法充分刻画实践中出现的更丰富的结构。高阶图形式体系通过引入多元、分层、时间、多层、递归和基于张量的交互来扩展这一框架,从而为复杂系统提供表达力更强的表示。本书全面概述了可用于建模高阶网络的数学概念,梳理了基础概念、扩展框架和新引入的形式体系,重点关注其结构原理、相互关系和建模作用。目的是提供一个统一的视角,帮助读者比较不同的高阶网络模型,并为理论研究和实际应用找到合适的工具。本书为第2.0版,主要新增了若干概念,并对排版错误和解释做了修正和改进。

英文摘要

Many real-world phenomena are naturally modeled by graphs and networks. However, classical graph models are often limited to pairwise interactions and may not adequately capture the richer structures that arise in practice. Higher-order graph formalisms extend this framework by incorporating multiway, hierarchical, temporal, multilayer, recursive, and tensor-based interactions, thereby providing more expressive representations of complex systems. This book presents a comprehensive overview of mathematical notions that can be used to model higher-order networks. It surveys foundational concepts, extensional frameworks, and newly introduced formalisms, with an emphasis on their structural principles, relationships, and modeling roles. The aim is to provide a unified perspective that helps readers compare diverse higher-order network models and identify appropriate tools for theoretical study and practical applications. This book is Edition 2.0. It mainly includes the addition of several concepts, as well as corrections and improvements of typographical errors and explanations.

2605.10867 2026-05-18 cs.CR cs.AI cs.CV cs.LG cs.NI 版本更新

BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data

BEACON:一个用于从游戏数据中学习行为指纹的多模态数据集

Ishpuneet Singh, Gursmeep Kaur, Uday Pratap Singh Atwal, Guramrit Singh, Gurjot Singh, Maninder Singh

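AI summary BEACON is a multimodal dataset of competitive Valorant gameplay whose high-precision motor demands and high cognitive load make it a rigorous stress test for behavioral biometrics; it supports research on continuous authentication, behavioral profiling, and multimodal learning.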

详情
AI中文摘要

在高风险数字环境中,连续认证需要具有细粒度行为信号的高质量数据集,但现有基准往往受限于规模小、单模态传感或缺乏同步环境上下文。为此,本文引入BEACON(行为认证与连续监控行为引擎),一个大规模多模态数据集,捕捉竞技Valorant游戏中的多样化技能层级。BEACON包含约430GB同步多模态数据(461GB总存储量,包括辅助Valorant配置捕获),来自79个会话的28名不同玩家,估计102.51小时的活跃游戏时间,包括高频鼠标动态、按键事件、网络数据包捕获、屏幕录制、硬件元数据和游戏内配置上下文。BEACON利用战术射击游戏固有的高精度运动技能和高认知负荷,使其成为评估行为生物特征鲁棒性的严格压力测试。该数据集允许在高保真的电子竞技环境中研究连续认证、行为建模、用户漂移和多模态表示学习。作者在Hugging Face和GitHub上发布数据集和代码,以创建可重复的基准,用于评估下一代行为指纹和安全模型。

英文摘要

Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands. But current benchmarks are often limited by small scale, unimodal sensing or lack of synchronised environmental context. To address this gap, this paper introduces BEACON (Behavioral Engine for Authentication & Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive Valorant gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary Valorant configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high precision motor skills and high cognitive load that are inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models.

2605.10813 2026-05-18 cs.AI 版本更新

NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

NanoResearch:面向个性化研究自动化的技能、记忆与策略协同进化

Jinhang Xu, Qiyuan Zhu, Yujun Wu, Zirui Wang, Dongxu Zhang, Marcia Tian, Yiling Duan, Siyuan Li, Jingxuan Wei, Sirui Han, Yike Guo, Odin Zhang, Conghui He, Cheng Tan

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) The Hong Kong University of Science and Technology(香港科技大学) Peking University(北京大学) Zhejiang University(浙江大学) Xi'an Jiaotong University(西安交通大学) East China University of Science and Technology(华东理工大学) The Chinese University of Hong Kong(香港中文大学)

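AI summary This paper proposes NanoResearch, a multi-agent framework that personalizes research automation through tri-level co-evolution of skills, memory, and policy, producing better research at lower cost over successive cycles.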

Comments 40 pages, 14 figures, 7 tables

详情
AI中文摘要

基于大语言模型的多智能体系统如今能够自动化从构思到论文写作的完整研究流程,但一个根本问题依然存在:自动化为谁服务?研究人员的资源配置、方法论偏好和目标输出格式各不相同。一个无视这些差异、只产生统一输出的系统将系统性地无法满足每一位用户的需求,因此个性化是研究自动化真正可用的前提。然而,实现个性化需要当前系统缺乏的三种能力:跨项目积累可重用的程序性知识、跨会话保留用户特定的经验,以及内化难以显式形式化的隐含偏好。我们提出NanoResearch,一个通过三层共进化弥补这些差距的多智能体框架。技能库将重复出现的操作提炼成紧凑的程序性规则,可跨项目重用。记忆模块维护用户和项目特定的经验,使规划决策立足于每位用户的研究历史。无标签的策略学习将自由形式的反馈转化为规划器的持久参数更新,重塑后续的协调。这三层共同进化:可靠的技能产生更丰富的记忆,更丰富的记忆指导更好的规划,偏好内化则持续使这一循环与每位用户重新对齐。大量实验表明,NanoResearch相较于最先进的AI研究系统取得了显著提升,并在连续的循环中逐步自我改进,以更低的成本产出更好的研究。

英文摘要

LLM-powered multi-agent systems can now automate the full research pipeline from ideation to paper writing, but a fundamental question remains: automation for whom? Researchers operate under different resource configurations, hold different methodological preferences, and target different output formats. A system that produces uniform outputs regardless of these differences will systematically under-serve every individual user, making personalization a precondition for research automation to be genuinely usable. However, achieving it requires three capabilities that current systems lack: accumulating reusable procedural knowledge across projects, retaining user-specific experience across sessions, and internalizing implicit preferences that resist explicit formalization. We propose NanoResearch, a multi-agent framework that addresses these gaps through tri-level co-evolution. A skill bank distills recurring operations into compact procedural rules reusable across projects. A memory module maintains user- and project-specific experience that grounds planning decisions in each user's research history. Label-free policy learning converts free-form feedback into persistent parameter updates of the planner, reshaping subsequent coordination. These three layers co-evolve: reliable skills produce richer memory, richer memory informs better planning, and preference internalization continuously realigns the loop to each user. Extensive experiments demonstrate that NanoResearch delivers substantial gains over state-of-the-art AI research systems, and progressively refines itself to produce better research at lower cost over successive cycles.

2605.10100 2026-05-18 cs.CV cs.AI 版本更新

HYPERPOSE: Hyperbolic Kinematic Phase-Space Attention for 3D Human Pose Estimation

HYPERPOSE:用于3D人体姿态估计的双曲运动学相空间注意力

Vinduja Thekkath, Ashish Musale, Ajay Waghumbare, Upasna Singh

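AI summary HYPERPOSE is a 3D human pose estimation framework that performs spatio-temporal reasoning in hyperbolic space; its Hyperbolic Kinematic Phase-Space Attention preserves the tree structure of the human skeleton, improving geometric fidelity and the modeling of temporal dynamics.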

详情
AI中文摘要

我们提出HYPERPOSE,一种新颖的3D人体姿态估计框架,它完全在双曲空间$\mathbb{H}^d$的洛伦兹模型中进行时空推理,从而原生地保持人体骨骼的层次树状拓扑结构。当前最先进的姿态估计器依赖Transformer和图卷积网络来捕捉复杂的关节动态。由于这些架构只在欧几里得空间中运算,与人体固有的树状结构根本不匹配,这些方法不可避免地受到指数级体积扭曲的影响,难以保持结构一致性。为此,我们脱离平坦空间,通过双曲运动学相空间注意力(HKPSA)机制提升几何保真度,无扭曲地原生嵌入复杂的关节关系,并结合多尺度窗口双曲注意力机制,以$O(TW)$的复杂度高效建模时间动态。此外,为克服在非欧几里得流形上训练时众所周知的不稳定性,HYPERPOSE引入了一套新的黎曼损失和不确定性加权的课程学习,强制施加骨骼长度和速度一致性等物理测地线约束。在Human3.6M和MPI-INF-3DHP数据集上的广泛评估表明,HYPERPOSE在结构和时间一致性上达到最先进水平,显著减少了体积扭曲和速度误差,同时在整体位置精度上创下新的最先进结果。

英文摘要

We introduce HYPERPOSE, a novel 3D human pose estimation framework that performs spatio-temporal reasoning entirely within the Lorentz model of hyperbolic space $\mathbb{H}^d$ to natively preserve the hierarchical tree topology of the human skeleton. Current state-of-the-art pose estimators aim to capture complex joint dynamics by relying on transformers and graph convolutional networks. Since these architectures operate exclusively in Euclidean space, which fundamentally mismatches the inherent tree structure of the human body, these methods inevitably suffer from exponential volume distortion and struggle to maintain structural coherence. To this end, we depart from flat spaces and aim to improve geometric fidelity with Hyperbolic Kinematic Phase-Space Attention (HKPSA), natively embedding complex joint relationships without distortion, alongside a multi-scale windowed hyperbolic attention mechanism that efficiently models temporal dynamics in $O(TW)$ complexity. Furthermore, to overcome the well-known instability of training on non-Euclidean manifolds, HYPERPOSE introduces a novel Riemannian loss suite and an uncertainty-weighted curriculum, enforcing physical geodesic constraints like bone length and velocity consistency. Extensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HYPERPOSE achieves state-of-the-art structural and temporal coherence, significantly reducing both volume distortion and velocity error, while establishing new state-of-the-art benchmarks in overall positional accuracy.
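For readers unfamiliar with the Lorentz model, the sketch below shows the textbook hyperboloid construction: the Lorentzian inner product, a lift from Euclidean coordinates onto the hyperboloid, and the geodesic distance. It assumes curvature -1 and uses hypothetical function names; it illustrates the geometry only and is not the HKPSA mechanism.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time component plus the spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift(v):
    """Map a Euclidean vector v onto the hyperboloid {x : <x, x>_L = -1, x_0 > 0}."""
    return [math.sqrt(1.0 + sum(a * a for a in v))] + list(v)

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L); the clamp guards against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift([0.0, 0.0])
joint = lift([math.sinh(1.0), 0.0])     # a point at hyperbolic distance 1 from the origin
d = lorentz_distance(origin, joint)
```

Distances in this space grow like the logarithm of the Euclidean norm far from the origin, which is why tree-like structures such as a skeleton hierarchy can be embedded with little distortion.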

2605.09403 2026-05-18 cs.LG cs.AI cs.NE 版本更新

Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

稀疏性推动计算:FFN架构如何重塑小规模Transformer中的注意力

Gabriel Smithline, Chris Mascioli

发表机构 * University of Michigan(密歇根大学)

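AI summary Using one-layer Transformers trained on digit addition, modular arithmetic, and histogram counting, this study finds that sparse MoE routing shifts computation from the FFN to attention, and that GLU gating rotates task-relevant Fourier structure into distributed subspaces.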

Comments Preprint

详情
AI中文摘要

Transformer前馈网络(FFN)块内的架构选择不仅影响自身,还重塑模型其余部分学习的计算。我们研究了单层Transformer在数字加法、模运算和直方图计数中的效果。比较密集FFN、门控线性单元(GLUs)、专家混合(MoE)和MoE-GLUs发现,稀疏MoE路由可将计算从FFN转移到注意力,且在基于进位的加法中效果最显著。我们分解了这种重新分布为减少每token的FFN容量和专家间的稀疏分区。关键发现,冻结随机路由几乎匹配学习路由,表明重新分布主要由架构稀疏性而非路由学习专精驱动。次要发现,GLU风格乘法门控将任务相关傅里叶结构从神经元基底旋转至分布式子空间,使神经元层面可解释性信息减少但保留结构化计算。我们通过随机路由、窄FFN、Top-2 MoE控制及参数匹配、激活函数和宽度缩放分析验证结论。这些结果表明,局部FFN设计选择对Transformer计算有非局部影响。

英文摘要

Architectural choices inside the Transformer feedforward network (FFN) block do not merely affect the block itself; they reshape the computations learned by the rest of the model. We study this effect in one-layer Transformers trained on digit addition with carry, modular arithmetic, and histogram counting. Comparing dense FFNs, gated linear units (GLUs), mixture-of-experts (MoE), and MoE-GLUs, we find that sparse MoE routing can shift computation from FFN to attention, with the strongest ablation-visible effect on carry-based addition. We decompose this redistribution into reduced per-token FFN capacity and sparse partitioning across experts. Critically, frozen random routing nearly matches learned routing, suggesting that redistribution is driven largely by architectural sparsity rather than router-learned specialization. As a secondary finding, GLU-style multiplicative gating rotates task-relevant Fourier structure out of the per-neuron basis and into distributed subspaces, making neuron-level interpretability less informative while preserving structured computation. We validate these conclusions with random-routing, narrow-FFN, and top-2 MoE controls, plus parameter-matching, activation-function, and width-scaling analyses. Together, these results show that local FFN design choices can have nonlocal consequences for Transformer computation.
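The frozen-random-routing control can be pictured with a small top-k router sketch. The assumption here, which the abstract does not spell out, is that "frozen random routing" means router weights that are randomly initialized and never trained; the function names and the softmax-over-selected-experts gating are illustrative choices.

```python
import math
import random

def top_k_route(x, router_weights, k):
    """Pick the k experts with the largest router logits and softmax their gates."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in chosen)
    exps = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, v / total) for e, v in zip(chosen, exps)]

def frozen_random_router(num_experts, dim, seed=0):
    """Router weights drawn once from a seeded RNG and then left untouched."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]

fixed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]   # 4 experts, 2-d tokens
routes = top_k_route([2.0, 1.0], fixed, k=2)
random_routes = top_k_route([0.3, -0.7], frozen_random_router(4, 2), k=2)
```

Either router sends each token through only k of the experts, so the per-token FFN capacity shrinks regardless of whether the routing was learned, which matches the abstract's point that sparsity itself drives the redistribution.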

2605.09391 2026-05-18 cs.AI 版本更新

Do Linear Probes Generalize Better in Persona Coordinates?

在人格坐标中,线性探针的泛化能力是否更好?

Prasad Mahadik, Adrians Skapars

发表机构 * Independent Researcher(独立研究者) University of Manchester(曼彻斯特大学)

AI总结 本文研究了在人格坐标中是否存在能更稳健地捕捉有害行为的低维子空间,通过对比人格特定向量的PCA得到主成分,发现基于人格-PC投影训练的探针在多个数据集上表现更优。

Comments 15 pages, preprint. Revised version: corrected references and citation links; results unchanged

详情
AI中文摘要

在语言模型交互过程中,越来越有必要用监控器检查有害行为,但仅靠文本监控并不足够,因为模型有时会表现出策略性欺骗和故意示弱(sandbagging),在评估期间改变其行为。这促使人们使用线性探针等白盒监控器,直接读取模型内部状态。目前,此类探针在分布偏移下可能失效,限制了其在实际场景中的用途。我们研究模型内部是否存在一个低维子空间,能更稳健地捕捉有害行为,同时排除虚假相关的特征。受助手轴(Assistant Axis)和人格选择模型(Persona Selection Model)启发,我们使用对比性人格提示构造欺骗和阿谀奉承的人格轴。对人格特定向量做无监督PCA得到的前几个主成分,能清晰分离有害和无害的人格。在10个评估数据集上,我们表明由人格导出的方向具有不可忽视的迁移能力,且基于人格-PC投影训练的探针比基于原始激活训练的探针泛化得更好。我们还发现,由多种有害和无害行为构成的统一轴能提升跨行为和跨数据集的泛化能力。总体而言,人格向量为构建更可迁移的行为探针提供了有用的归纳偏置。

英文摘要

It is becoming increasingly necessary to have monitors check for harmful behaviors during language model interactions, but text-only monitoring has not been sufficient. This is because models sometimes exhibit strategic deception and sandbagging, changing their behavior during evaluation. This motivates the use of white-box monitors like linear probes, which can read the model internals directly. Currently, such probes can fail under distribution shift, limiting their usefulness in real settings. We study whether there exists a low-dimensional subspace of the model internals that captures harmful behaviors more robustly, while leaving out spuriously correlative features. Inspired by the Assistant Axis and Persona Selection Model, we construct persona axes for deception and sycophancy using contrastive persona prompts. The first principal components, obtained by unsupervised PCA of the persona-specific vectors, cleanly separate harmful and harmless personas. Across 10 evaluation datasets, we show that persona-derived directions transfer non-trivially and probes trained on persona-PC projections generalize better than probes trained on raw activations. We also find that a unified axis consisting of multiple harmful and harmless behaviors improves generalization across behaviors and datasets. Overall, persona vectors provide a useful inductive bias for building more transferable behavior probes.

2605.08401 2026-05-18 cs.CL cs.AI 版本更新

AIPO: Learning to Reason from Active Interaction

AIPO: 通过主动交互学习推理

Junnan Liu, Linhao Luo, Thuy-Trang Vu, Gholamreza Haffari

发表机构 * Department of Data Science and AI, Faculty of Information Technology, Monash University, Australia(莫纳什大学信息技术学院数据科学与人工智能系,澳大利亚)

AI总结 AIPO通过主动多智能体交互提升大语言模型推理能力,引入三个协作代理解决推理瓶颈,改进探索效率并扩展能力边界。

Comments Preprint

详情
AI中文摘要

近期大语言模型(LLM)的进展展示了卓越的推理能力,主要受可验证奖励强化学习(RLVR)推动。然而,现有RL算法面临探索受限于策略模型固有能力边界的根本限制。尽管近期方法引入外部专家演示扩展此边界,但通常依赖完整轨迹级指导,样本效率低、信息稀疏且可能限制探索于静态指导空间。受多智能体系统的启发,我们提出AIPO,一种增强的强化学习框架,通过探索期间的主动多智能体交互提升LLM推理能力。具体而言,AIPO使策略模型在遇到推理瓶颈时主动咨询三个功能协作代理,即验证代理、知识代理和推理代理,从而获得细粒度和针对性的指导,主动扩展其能力边界。我们进一步引入定制的重要性采样系数和剪裁策略,以缓解从代理提供的反馈中学习时出现的离策略偏差和梯度消失问题。训练后,策略模型可独立进行推理而不依赖协作代理。在多样化的推理基准测试中,包括AIME、MATH500、GPQA-Diamond和LiveCodeBench,AIPO一致提升了推理性能,跨不同策略模型和RLVR算法具有鲁棒泛化能力,并有效扩展了策略模型的推理能力边界。

英文摘要

Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose AIPO, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, Verify Agent, Knowledge Agent, and Reasoning Agent, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.

2605.06475 2026-05-18 cs.AI cs.CV 版本更新

Probabilistic Dating of Historical Manuscripts via Evidential Deep Regression on Visual Script Features

通过视觉手写特征的证据深度回归进行历史手稿的概率年代测定

Ranjith Chodavarapu

发表机构 * Kent State University(肯特州立大学)

AI总结 本文提出一种基于视觉特征的深度回归方法,用于确定历史手稿的年代,通过分解不确定性提升预测精度,实验显示模型在测试集上取得优异性能。

详情
AI中文摘要

我们提出一种概率方法,仅凭视觉特征确定历史手稿页面的年代。与以往文献中将世纪聚合为类别的标准做法不同,我们将年代测定表述为连续年份轴上的证据深度回归问题,使神经网络能够在单次前向传递中输出完整的预测分布,并分解出偶然不确定性和认知不确定性。我们的架构结合了EfficientNet-B2主干网络和Normal-Inverse-Gamma(NIG)输出头,后者以负对数似然与证据正则化的联合目标进行训练。在DIVA-HisDB基准(150页,3部中世纪抄本,151,936个图像块)上,我们的模型在测试集上取得了5.4年的MAE,远低于50年的世纪标签监督粒度,93%的图像块误差在5年以内,97%在10年以内。我们的方法在单次前向传递中实现了PICP=92.6%,是所有对比方法中校准最好的,优于MC Dropout(PICP=88.2%,50次传递)和Deep Ensembles(PICP=79.7%,5个模型),且推理成本低5倍。不确定性分解显示,偶然不确定性是年代误差的强预测因子(Spearman ρ=0.729),对最确定的20%图像块进行选择性预测可得到0.5年的MAE。我们表明,预测的不确定性随图像退化加剧而增大,空间分解图可解释哪些字迹区域导致偶然不确定性,页面级聚合将MAE降至4.5年,不确定性与页面级误差之间的相关性为ρ=0.905。

英文摘要

We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative-log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model scores a test MAE of 5.4 years, well below the 50-year century-label supervision granularity, with 93% of patches within 5 years and 97% within 10 years. Our approach achieves PICP=92.6%, the best calibration among all compared methods, in a single forward pass, outperforming MC Dropout (PICP=88.2%, 50 passes) and Deep Ensembles (PICP=79.7%, 5 models) at 5× lower inference cost. Uncertainty decomposition shows aleatoric uncertainty is a strong predictor of dating error (Spearman ρ=0.729), and selective prediction on the most certain 20% of patches yields an MAE of 0.5 years. We show that predicted uncertainty increases as image degradation worsens, that spatial decomposition maps explain which script regions cause aleatoric uncertainty, and that page-level aggregation reduces MAE to 4.5 years with ρ=0.905 between uncertainty and page-level error.
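The abstract does not say how the two uncertainty terms are read off the Normal-Inverse-Gamma head or how PICP is computed. The following is a minimal sketch, assuming the commonly used evidential-regression parameterization (gamma, nu, alpha, beta) with alpha > 1, where the aleatoric variance is beta / (alpha - 1) and the epistemic variance is beta / (nu * (alpha - 1)). The function names and the example numbers are illustrative and are not taken from the paper.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Return (predicted year, aleatoric variance, epistemic variance).

    Assumes the usual Normal-Inverse-Gamma evidential head; this is a
    sketch, not the paper's implementation.
    """
    if alpha <= 1.0 or nu <= 0.0 or beta <= 0.0:
        raise ValueError("requires alpha > 1, nu > 0, beta > 0")
    aleatoric = beta / (alpha - 1.0)          # expected data noise, E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))   # uncertainty about the mean, Var[mu]
    return gamma, aleatoric, epistemic


def picp(targets, lowers, uppers):
    """Prediction interval coverage probability: share of targets inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(targets, lowers, uppers) if lo <= y <= hi)
    return inside / len(targets)


# Hypothetical patch: predicted year 1200 with nu=2, alpha=3, beta=8.
year, aleatoric, epistemic = nig_uncertainty(1200.0, 2.0, 3.0, 8.0)
coverage = picp([1, 5, 10], [0, 6, 9], [2, 7, 11])
```

Under this parameterization a larger nu (more "virtual evidence") shrinks only the epistemic term, which is why the two kinds of uncertainty can be reported separately from one forward pass.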

2605.06223 2026-05-18 cs.AI cs.RO 版本更新

ProCompNav: Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

ProCompNav:面向模糊用户查询、基于比较判断的主动实例导航

Junhyuk Kwon, Seungjoon Lee, Hyejin Park, Kyle Min, Jungseul Ok

发表机构 * GSAI, POSTECH(POSTECH人工智能研究生院) CSE, POSTECH(POSTECH计算机科学与工程系) Oracle(Oracle公司)

AI总结 ProCompNav通过两阶段框架解决用户查询歧义问题,通过比较判断逐步缩小候选集,提升导航成功率并减少用户响应长度。

Comments Project page: https://tree-jhk.github.io/procompnav/ . Code: https://github.com/tree-jhk/procompnav/

详情
AI中文摘要

当用户的初始请求不能唯一指定目标实例时,自然语言实例导航就变得具有挑战性。实用的智能体应当只主动询问区分目标与相似干扰项所需的信息,从而减轻用户负担,而不是要求用户一开始就给出详细描述。现有方法往往达不到这一目标:它们可能在尚未充分探索其他候选之前就停在第一个看似合理的候选上;或者即使收集了多个候选,也只是针对从单个候选得出的目标属性提问,而不是选择能够区分候选池中各候选的问题。因此,尽管进行了对话,智能体仍可能无法区分目标与干扰项,导致过早决策和冗长的用户回答。我们提出ProCompNav(Proactive Instance Navigation with Comparative Judgment),这是一个两阶段框架:先构建候选池,再通过比较判断确定目标。在每一轮中,ProCompNav提取一个能划分当前候选池的属性-值对,提出一个是/否二元问题,并一次性剪除所有不一致的候选。这将消歧从开放式的目标描述转变为候选池层面的判别式提问,每个问题都旨在缩小候选集。在CoIN-Bench上,ProCompNav的成功率高于使用相同最小输入的交互式基线和使用详细描述的非交互式基线,同时显著缩短了回答长度。ProCompNav还在TextNav上取得了最先进的成功率,表明比较判断对于在相似干扰项中进行实例级导航具有广泛价值。代码见 https://github.com/tree-jhk/procompnav 。

英文摘要

Natural-language instance navigation becomes challenging when the initial user request does not uniquely specify the target instance. A practical agent should reduce the user's burden by actively asking only the information needed to distinguish the target from similar distractors, rather than requiring a detailed description upfront. Existing approaches often fall short of this goal: they may stop at the first plausible candidate before sufficiently exploring alternatives, or, even after collecting multiple candidates, ask about the target's attributes derived from individual candidates rather than questions selected to distinguish candidates in the pool. As a result, despite the dialogue, the agent may still fail to distinguish the target from distractors, leading to premature decisions and lengthy user responses. We propose Proactive Instance Navigation with Comparative Judgment (ProCompNav), a two-stage framework that first constructs a candidate pool and then identifies the target through comparative judgment. At each round, ProCompNav extracts an attribute-value pair that splits the current pool, asks a binary yes/no question, and prunes all inconsistent candidates at once. This reframes disambiguation from open-ended target description to pool-level discriminative questioning, where each question is chosen to narrow the candidate set. On CoIN-Bench, ProCompNav improves Success Rate over interactive baselines with the same minimal input and non-interactive baselines with detailed descriptions, while substantially reducing Response Length. ProCompNav also achieves state-of-the-art Success Rate on TextNav, suggesting that comparative judgment is broadly useful for instance-level navigation among similar distractors. Code is available at https://github.com/tree-jhk/procompnav.
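The question-and-prune round described in the abstract can be sketched as follows. This is a toy illustration rather than the released ProCompNav code: candidates are plain dictionaries of string attributes, the "most even split" selection rule is an assumption made here, and `oracle` is a hypothetical stand-in for the user's yes/no answer.

```python
def best_split(pool):
    """Pick the (attribute, value) pair whose yes/no answer splits the pool most evenly."""
    n = len(pool)
    best, best_gap = None, None
    pairs = sorted({(a, v) for cand in pool for a, v in cand.items()})
    for attr, val in pairs:
        yes = sum(1 for cand in pool if cand.get(attr) == val)
        if yes == 0 or yes == n:
            continue  # this question would not remove any candidate
        gap = abs(2 * yes - n)  # 0 means a perfect half/half split
        if best_gap is None or gap < best_gap:
            best, best_gap = (attr, val), gap
    return best


def prune(pool, attr, val, answer_yes):
    """Keep only the candidates consistent with the user's answer."""
    return [c for c in pool if (c.get(attr) == val) == answer_yes]


def identify(pool, oracle):
    """Ask binary questions until one candidate remains or none can be separated."""
    questions = 0
    while len(pool) > 1:
        pair = best_split(pool)
        if pair is None:
            break  # remaining candidates are indistinguishable
        pool = prune(pool, pair[0], pair[1], oracle(pair[0], pair[1]))
        questions += 1
    return pool, questions


# Hypothetical pool of four chairs; the user has the red chair near the door in mind.
chairs = [
    {"color": "red", "near": "window"},
    {"color": "red", "near": "door"},
    {"color": "blue", "near": "window"},
    {"color": "blue", "near": "door"},
]
target = chairs[1]
found, asked = identify(chairs, lambda a, v: target.get(a) == v)
```

Because every answer removes all inconsistent candidates at once, an even split lets a pool of four be resolved with two one-word answers, which is the sense in which each question is chosen to narrow the candidate set.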

2604.26733 2026-05-18 cs.AI cs.LG 版本更新

FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards

FutureWorld: 一个用于预测代理的实时强化学习环境,具有现实世界结果奖励

Zhixin Han, Yanzhi Zhang, Chuyang Wei, Maohang Gao, Xiawei Yue, Kefei Chen, Yu Zhuang, Haoxiang Guan, Jiyan He, Jian Li, Yitong Duan, Yu Shi, Mengting Hu, Shuxin Zheng

发表机构 * College of Software, Nankai University(南开大学软件学院) Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院) School of Computer Science and Technology, University of Science and Technology of China(中国科学技术大学计算机科学与技术学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) IIIS, Tsinghua University(清华大学交叉信息研究院) Zhongguancun Academy, Beijing, China(北京中关村学院)

AI总结 本文提出FutureWorld,一个实时强化学习环境,通过闭环预测、结果实现与参数更新,提升预测准确性与校准能力。

Comments The code will be released in the near future. The experiments are currently ongoing

详情
AI中文摘要

实时预测指的是在事件发生前对其做出预测的任务。这项任务越来越多地使用基于大型语言模型的智能体系统进行研究,并且对于构建能够持续从现实世界学习的智能体至关重要。它可以提供大量基于多样现实事件的预测问题,同时防止答案泄露。为了利用未来预测的优势,我们提出了FutureWorld,一个实时智能体强化学习环境,它在预测、结果实现和参数更新之间闭合训练回路。具体来说,我们修改并扩展了verl-tool,从而得到一个新的框架,我们称之为verl-tool-future。与依赖即时奖励的标准强化学习训练框架不同,verl-tool-future存储预测时间的回放,待现实世界结果可用后回填奖励,然后回放完成的轨迹以更新策略。在三个开源智能体上,连续的FutureWorld训练轮次导致预测准确性、概率评分和校准的一致提升,证明了延迟的现实世界结果反馈可以作为有效的强化学习信号。

英文摘要

Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from the real world. It can provide a large number of prediction questions grounded in diverse real-world events, while preventing answer leakage. To leverage the advantages of future prediction, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameter updates. Specifically, we modify and extend verl-tool, resulting in a new framework that we call verl-tool-future. Unlike standard reinforcement learning training frameworks that rely on immediate rewards, verl-tool-future stores prediction-time rollouts, backfills rewards after real-world outcomes become available, and then replays the completed trajectories for policy update. Across three open-source agents, successive FutureWorld training rounds lead to consistent improvements in prediction accuracy, probabilistic scoring, and calibration, demonstrating that delayed real-world outcome feedback can serve as an effective reinforcement learning signal.
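The store, backfill, and replay cycle described above can be sketched as a small buffer. This is a minimal illustration under stated assumptions, not the verl-tool-future API: the class and field names are hypothetical, and the reward (one minus the squared error of the predicted probability) is just one plausible scoring rule chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rollout:
    question_id: str
    predicted_prob: float           # probability the agent assigned to "event happens"
    reward: Optional[float] = None  # unknown until the real-world outcome is available


@dataclass
class DelayedRewardBuffer:
    pending: Dict[str, List[Rollout]] = field(default_factory=dict)
    ready: List[Rollout] = field(default_factory=list)

    def store(self, rollout):
        """Keep a prediction-time rollout until its question resolves."""
        self.pending.setdefault(rollout.question_id, []).append(rollout)

    def resolve(self, question_id, outcome):
        """Backfill rewards for every rollout on a question whose outcome is now known."""
        for r in self.pending.pop(question_id, []):
            r.reward = 1.0 - (r.predicted_prob - float(outcome)) ** 2
            self.ready.append(r)

    def drain(self):
        """Return completed rollouts for a policy update and clear the ready list."""
        batch, self.ready = self.ready, []
        return batch


buf = DelayedRewardBuffer()
buf.store(Rollout("q1", 0.8))
buf.store(Rollout("q2", 0.3))
empty_batch = buf.drain()   # nothing has resolved yet
buf.resolve("q1", True)     # the q1 event happened
batch = buf.drain()         # only q1's rollout is ready for training
```

Keeping unresolved rollouts separate from resolved ones means a policy update never sees a trajectory whose reward is still unknown.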

2604.26578 2026-05-18 cs.SE cs.AI 版本更新

Graph Construction and Matching for Imperative Programs using Neural and Structural Methods

基于神经方法和结构方法的命令式程序图构建与匹配

Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan

发表机构 * Maynooth University(梅努斯大学)

AI总结 本文提出通过神经和结构方法构建命令式程序图,实现跨语言和注释风格的图表示一致性,为语义丰富和近似图匹配提供基础。

Comments 20 Pages. Technical Report. Maynooth University, Ireland. Submitted on 29 April 2026

详情
AI中文摘要

重用验证制品需要识别程序及其规范的结构和语义相似性。本文聚焦图构建作为实现这一目标的基础步骤。我们提出一个管道,将命令式程序及其注释转换为带类型和属性的图。实验涵盖包含C与ACSL、Java与JML以及Dafny for C#的数据集。该管道整合了抽象语法树解析与从SentenceTransformer和CodeBERT等模型中获得的语义嵌入。这使生成的图表示能够捕捉结构关系和语义上下文。我们的结果表明,可以在不同语言和注释风格下构建一致的图表示。本文为未来语义丰富和近似图匹配的可扩展验证制品重用提供了实用基础。

英文摘要

Reusing verification artefacts requires identifying structural and semantic similarities across programs and their specifications. In this paper, we focus on graph construction as a foundational step toward this goal. We present a pipeline that converts imperative programs and their annotations into typed, attributed graphs. Our experiments cover datasets including C with ACSL, Java with JML, and Dafny for C#. The pipeline integrates abstract syntax tree parsing with semantic embeddings derived from models such as SentenceTransformer and CodeBERT. This enables the generation of graph representations that capture both structural relationships and semantic context. Our results show that consistent graph representations can be constructed across different languages and annotation styles. This work provides a practical basis for future steps in semantic enrichment and approximate graph matching for scalable verification artefact reuse.

2604.21251 2026-05-18 cs.LG cs.AI 版本更新

CAP: Controllable Alignment Prompting for Unlearning in LLMs

CAP:用于大语言模型中去学习的可控对齐提示

Zhaokun Wang, Jinyu Guo, Jingwen Pu, Hongli Pu, Meng Yang, Xunlei Chen, Jie Ou, Wenyi Li, Guangchun Luo, Wenhong Tian

发表机构 * School of Information and Software Engineering, University of Electronic Science and Technology of China(电子科技大学信息与软件学院)

AI总结 本文提出CAP框架,通过强化学习将去学习过程转化为可学习的提示优化,实现可控的去学习,无需更新模型参数,解决了现有方法的计算成本高、遗忘边界不可控等问题。

Comments Accpeted to ACL 2026 Main Conference

详情
AI中文摘要

大型语言模型(LLMs)在未过滤语料上训练时,固有地面临保留敏感信息的风险,需要选择性知识去学习以满足监管合规和伦理安全要求。然而,现有参数修改方法面临根本性限制:计算成本高、遗忘边界不可控以及对模型权重访问的严格依赖。这些限制使它们在闭源模型中不切实际,而当前非侵入式替代方案仍缺乏系统性和依赖经验。为解决这些挑战,我们提出了可控对齐提示(CAP)框架,一种端到端的提示驱动去学习范式。CAP通过强化学习将去学习分解为可学习的提示优化过程,其中提示生成器与LLM协作,以抑制目标知识的同时保留选择性的一般能力。这种方法通过提示撤销实现可逆的知识恢复。广泛实验表明,CAP实现了无需更新模型参数的精确、可控的去学习,建立了一种动态对齐机制,克服了先前方法的可转移性限制。

英文摘要

Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.

2604.18145 2026-05-18 cs.CV cs.AI 版本更新

Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

面向3D医学影像的区域定位报告生成:一个细粒度数据集和图增强框架

Cong Huy Nguyen, Son Dinh Nguyen, Guanlin Li, Tuan Dung Nguyen, Aditya Narayan Sankaran, Mai Huy Thong, Thanh Trung Nguyen, Mai Hong Son, Reza Farahbakhsh, Phi Le Nguyen, Noel Crespi

发表机构 * AI4LIFE, Hanoi University of Science and Technology, Vietnam(AI4LIFE,河内科学技术大学,越南) SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France(SAMOVAR,Telecom SudParis,巴黎理工学院,法国) Military Central Hospital, Vietnam(108军区中央医院,越南)

AI总结 本文提出VietPET-RoI数据集和HiRRA框架,通过图增强模块捕捉RoI属性依赖,提升3D PET/CT报告生成的临床可靠性,实验表明其在BLEU、ROUGE-L和临床指标上均优于现有方法。

Comments 16 pages; Accepted to appear in ACL 2026

详情
AI中文摘要

自动化的3D PET/CT影像报告生成受到高维体数据和标注数据稀缺的挑战,尤其是低资源语言。当前黑盒方法将整个体积映射到报告,忽略了临床工作中分析局部感兴趣区域(RoIs)以得出诊断结论的流程。本文通过引入VietPET-RoI数据集,首个大规模3D PET/CT数据集,包含600个PET/CT样本和1960个手动标注的RoIs,配以相应临床报告。此外,为展示该数据集的实用性,我们提出了HiRRA框架,通过图基关系模块模拟专业放射科医生的诊断流程,从全局模式匹配转向局部临床发现。我们还引入了新的临床评估指标,即RoI覆盖度和RoI质量指数,利用LLM提取测量RoI定位准确性和属性描述的忠实度。大量评估表明,我们的框架实现了SOTA性能,比现有模型在BLEU和ROUGE-L上分别高出19.7%和4.7%,在临床指标上取得45.8%的显著提升,表明增强的临床可靠性和减少的幻觉。我们的代码和数据集可在GitHub上获得。

英文摘要

Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high-dimensional nature of volumetric data and a critical scarcity of annotated datasets, particularly for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to derive diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs, paired with corresponding clinical reports. Furthermore, to demonstrate the utility of this dataset, we propose HiRRA, a novel framework that mimics the professional radiologist diagnostic workflow by employing graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. Additionally, we introduce new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction. Extensive evaluation demonstrates that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.

2604.10210 2026-05-18 cs.CV cs.AI cs.LG 版本更新

A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

A3-FPN:渐近内容感知金字塔注意力网络用于密集视觉预测

Meng'en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang

发表机构 * Henan Engineering Research Center for Artificial Intelligence Theory and Algorithms(人工智能理论与算法河南省工程研究中心) Henan University(河南大学) Faculty of Computer Science and Control Engineering(计算机科学与控制工程学院) Shenzhen University of Advanced Technology(深圳理工大学) Department of Electrical and Electronic Engineering(电子与电气工程系)

AI总结 本文提出A3-FPN,通过渐近解耦框架和内容感知注意力模块增强多尺度特征表示,提升密集预测任务中小物体的识别性能。

Journal ref Pattern Recognition, 2026, 113793

详情
AI中文摘要

学习多尺度表示是解决密集预测任务中物体尺度变化的常见策略。尽管现有特征金字塔网络在视觉识别中取得了显著进展,但固有设计缺陷限制了它们捕捉判别特征和识别小物体的能力。本文提出渐近内容感知金字塔注意力网络(A3-FPN),通过渐近解耦框架和内容感知注意力模块增强多尺度特征表示。具体而言,A3-FPN采用横向扩展的列网络,实现渐近全局特征交互,并将每个层次与所有层次表示解耦。在特征融合中,它从相邻层次收集补充内容,生成位置加权偏移和权重用于上下文感知重采样,并学习深度上下文重权重以提高类别内相似性。在特征重组装中,它进一步加强了同一尺度的判别特征学习,并基于特征图的信息内容和空间变化重组装冗余特征。在MS COCO、VisDrone2019-DET和Cityscapes上的大量实验表明,A3-FPN可以轻松集成到最先进的CNN和Transformer架构中,取得显著性能提升。值得注意的是,当与OneFormer和Swin-L主干结合时,A3-FPN在MS COCO上达到49.6的mask AP,在Cityscapes上达到85.6的mIoU。代码可在https://github.com/mason-ching/A3-FPN上获取。

英文摘要

Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason-ching/A3-FPN.

2604.09631 2026-05-18 cs.DC cs.AI 版本更新

Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection

边缘目标检测在故障注入下的硬件利用与推断性能

Faezeh Pasandideh, Mehdi Azarafza, Achim Rettberg

发表机构 * Hamm-Lippstadt University of Applied Sciences (HSHL)(哈姆-利普施塔特应用科学大学(HSHL))

AI总结 研究通过故障注入测试评估了TensorRT优化的YOLO模型在边缘平台上的硬件行为,发现其在资源降级下保持稳定性能,为边缘推断可靠性提供硬件层面的视角。

详情
AI中文摘要

随着深度学习模型部署在资源受限的边缘平台,了解硬件在资源降级下的行为变得至关重要。本文系统地表征了在大规模故障注入测试下,TensorRT优化的YOLOv10s、YOLOv11s和YOLO2026n管道在NVIDIA Jetson Nano上的CPU负载、GPU利用率、RAM消耗、功耗、吞吐量和热行为。故障通过解耦框架合成,利用大型语言模型和潜在扩散模型。结果表明,两种任务和两种模型的推断引擎在资源降级下保持GPU占用稳定,温度上升受控,功耗在安全范围内,内存使用在初始暖机阶段后趋于一致释放模式。目标检测在内存和热行为上略有波动,但两者均得出结论:TensorRT管道在输入数据严重降级时仍表现良好。这些发现提供了模型可靠性的硬件层面视角,与边缘推断性能研究形成补充。

英文摘要

As deep learning models are deployed on resource-constrained edge platforms in autonomous driving systems, reliable knowledge of hardware behavior under resource degradation becomes an essential requirement. Therefore, we introduce a systematic characterization of CPU load, GPU utilization, RAM consumption, power draw, throughput, and thermal behaviour of TensorRT-optimized YOLOv10s, YOLOv11s and YOLO2026n pipelines running on NVIDIA Jetson Nano under a large-scale fault injection campaign targeting both lane-following and object detection tasks. Faults are synthesized using a decoupled framework that leverages large language models (LLMs) and latent diffusion models (LDMs), based on original data from our JetBot platform data collection. Results show that across both tasks and both models the inference engines keep GPU occupancy stable, temperature rise under control, and power consumption within safe limits, while memory usage settles into a consistent release pattern after the initial warm-up phase. Object detection tends to show somewhat more variability in memory and thermal behavior, yet both tasks point to the same conclusion: the TensorRT pipelines hold up well even when the input data is heavily degraded. These findings offer a hardware-level view of model reliability that sits alongside, rather than against, the broader body of work focused on inference performance at the edge.

2604.08426 2026-05-18 cs.LG cs.AI cs.CL 版本更新

KV Cache Offloading for Context-Intensive Tasks

KV缓存卸载用于上下文密集型任务

Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov

发表机构 * HSE Yandex NSU

AI总结 本文研究了KV缓存卸载在上下文密集型任务中的应用,通过Text2JSON基准测试发现,该方法在Llama 3和Qwen 3模型上导致性能下降,分析指出低秩投影和不可靠地标是主要问题,并提出更简单的替代策略以提升准确性。

Comments Preprint

详情
AI中文摘要

随着长上下文LLM在广泛应用中的需求增长,键值(KV)缓存已成为延迟和内存使用的关键瓶颈。最近,KV缓存卸载作为一种减少内存占用和推理延迟同时保持准确性的有前途的方法出现。先前的评估主要集中在不需要从上下文中提取大量信息的任务上。在本文中,我们研究了KV缓存卸载在上下文密集型任务中的应用:解决这些问题需要从输入提示中查找大量信息。我们创建并发布了Text2JSON基准测试,这是一个高度上下文密集型任务,需要从原始文本中提取结构化知识。我们评估了现代KV卸载在Text2JSON和其他上下文密集型任务上的表现,并发现Llama 3和Qwen 3模型上存在显著的性能下降。我们的分析确定了两个关键原因:键的低秩投影和不可靠的地标,并提出了一种更简单的替代策略,该策略在多个LLM家族和基准测试中显著提高了准确性。这些发现突显了对长上下文压缩技术进行全面和严格评估的必要性。

英文摘要

With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.

2603.29617 2026-05-18 q-bio.NC cs.AI cs.CL 版本更新

Convergent Representations of Linguistic Constructions in Human and Artificial Neural Systems

语言构式在人类与人工神经系统中的趋同表征

Pegah Ramezani, Thomas Kinfe, Andreas Maier, Achim Schilling, Patrick Krauss

发表机构 * Department of English and American Studies, University Erlangen-Nuremberg(英语与美国研究系,埃尔朗根-纽伦堡大学) Pattern Recognition Lab, University Erlangen-Nuremberg(模式识别实验室,埃尔朗根-纽伦堡大学) Neuromodulation and Neuroprosthetics, University Hospital Mannheim, University Heidelberg(神经调控与神经假体,曼海姆大学医院,海德堡大学) BGU Ludwigshafen, Germany Neuroscience Lab, University Hospital Erlangen(神经科学实验室,埃尔朗根大学医院)

AI总结 研究通过EEG验证人类神经活动对语言构造的表征,发现句末alpha波段出现构造特异性神经签名,与人工语言模型的构造表征模式相似,支持语言构造作为形式-意义映射的神经编码。

详情
AI中文摘要

理解大脑如何处理语言构式是认知神经科学和语言学的核心挑战。近期的计算研究表明,人工神经语言模型会自发形成对论元结构构式(ASCs)的差异化表征,从而对构式层面信息在处理过程中何时、如何出现给出预测。本研究利用脑电图(EEG)在人类神经活动中检验这些预测。十名英语母语者聆听了200个合成生成的句子,涵盖四种构式类型(及物、双及物、致使移动、结果),同时记录其神经反应。采用时频方法、特征提取和机器学习分类的分析显示,构式特异性神经特征主要出现在句末位置(此时论元结构已完全消歧),并在alpha频段最为显著。成对分类显示出可靠的区分,尤其是在双及物与结果构式之间,而其他构式对则存在重叠。关键的是,这些效应出现的时间进程和相似性结构,与循环网络和基于Transformer的语言模型中的模式相呼应:在这些模型中,构式表征产生于整合处理阶段。这些发现支持如下观点:语言构式在神经层面被编码为独立的形式-意义映射,这与构式语法一致,并表明生物系统和人工系统在相似的表征方案上趋同。更广泛地说,这种趋同与以下想法一致:学习系统会在底层表征景观(最近被称为柏拉图表征空间)中发现稳定区域,而该景观约束着高效语言抽象的出现。

英文摘要

Understanding how the brain processes linguistic constructions is a central challenge in cognitive neuroscience and linguistics. Recent computational studies show that artificial neural language models spontaneously develop differentiated representations of Argument Structure Constructions (ASCs), generating predictions about when and how construction-level information emerges during processing. The present study tests these predictions in human neural activity using electroencephalography (EEG). Ten native English speakers listened to 200 synthetically generated sentences across four construction types (transitive, ditransitive, caused-motion, resultative) while neural responses were recorded. Analyses using time-frequency methods, feature extraction, and machine learning classification revealed construction-specific neural signatures emerging primarily at sentence-final positions, where argument structure becomes fully disambiguated, and most prominently in the alpha band. Pairwise classification showed reliable differentiation, especially between ditransitive and resultative constructions, while other pairs overlapped. Crucially, the temporal emergence and similarity structure of these effects mirror patterns in recurrent and transformer-based language models, where constructional representations arise during integrative processing stages. These findings support the view that linguistic constructions are neurally encoded as distinct form-meaning mappings, in line with Construction Grammar, and suggest convergence between biological and artificial systems on similar representational solutions. More broadly, this convergence is consistent with the idea that learning systems discover stable regions within an underlying representational landscape - recently termed a Platonic representational space - that constrains the emergence of efficient linguistic abstractions.

2603.25099 2026-05-18 cs.CE cs.AI 版本更新

Large Language Models as Optimization Controllers: Adaptive Continuation for SIMP Topology Optimization

大语言模型作为优化控制器:SIMP拓扑优化的自适应延续

Shaoliang Yang, Jun Wang, Yunsheng Wang

发表机构 * Department of Mechanical Engineering, Santa Clara University(圣克拉拉大学机械工程系)

AI总结 本文提出利用大语言模型作为SIMP拓扑优化的在线自适应控制器,通过实时状态条件参数决策替代传统固定调度延续方法,提升优化效果。

Comments 32 pages, 11 figures

详情
AI中文摘要

我们提出一个框架,其中大语言模型(LLM)作为SIMP拓扑优化的在线自适应控制器,以实时、依状态而定的参数决策取代传统的固定调度延续方法。每隔k次迭代,LLM接收结构化观测(当前柔度、灰度指数、停滞计数器、棋盘格度量、体积分数和预算消耗),并通过直接数值控制接口输出惩罚指数p、投影锐度β、滤波半径r_min和移动限制δ的数值。硬性灰度门防止过早二值化,元优化循环利用第二次LLM调用,在多次运行之间调整代理的调用频率和门阈值。我们将该代理与四个基线(固定(无延续)、标准三场延续、专家启发式、仅调度消融)进行对比,基准包括分辨率为120×60的三个二维问题(悬臂梁、MBB梁、L型支架)和分辨率为40×20×10的两个三维问题(悬臂梁、MBB梁),均运行300次迭代。从最佳有效快照开始施加标准化的40次迭代锐化尾段,使柔度差异仅反映探索阶段。LLM代理在每个基准上均取得最低的最终柔度:相对于固定基线为-5.7%至-18.1%,且所有解均完全二值化。仅调度消融在三个问题中的两个上表现不如固定基线,证实增益来自LLM的实时干预,而非调度的几何形态。代码和复现脚本将在论文发表时发布。

英文摘要

We present a framework in which a large language model (LLM) acts as an online adaptive controller for SIMP topology optimization, replacing conventional fixed-schedule continuation with real-time, state-conditioned parameter decisions. At every $k$-th iteration, the LLM receives a structured observation (current compliance, grayness index, stagnation counter, checkerboard measure, volume fraction, and budget consumption) and outputs numerical values for the penalization exponent $p$, projection sharpness $β$, filter radius $r_{\min}$, and move limit $δ$ via a Direct Numeric Control interface. A hard grayness gate prevents premature binarization, and a meta-optimization loop uses a second LLM pass to tune the agent's call frequency and gate threshold across runs. We benchmark the agent against four baselines: fixed (no continuation), standard three-field continuation, an expert heuristic, and a schedule-only ablation. The benchmarks are three 2-D problems (cantilever, MBB beam, L-bracket) at $120 \times 60$ resolution and two 3-D problems (cantilever, MBB beam) at $40 \times 20 \times 10$ resolution, all run for 300 iterations. A standardized 40-iteration sharpening tail is applied from the best valid snapshot so that compliance differences reflect only the exploration phase. The LLM agent achieves the lowest final compliance on every benchmark: -5.7% to -18.1% relative to the fixed baseline, with all solutions fully binary. The schedule-only ablation underperforms the fixed baseline on two of three problems, confirming that the LLM's real-time intervention, not the schedule geometry, drives the gain. Code and reproduction scripts will be released upon publication.

2603.17915 2026-05-18 cs.CL cs.AI Updated version

IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

Priyaranjan Pattnayak, Sanchari Chowdhuri

Affiliations * Oracle America Inc.

AI summary This paper presents the IndicSafe benchmark, which evaluates LLM safety across 12 South Asian languages. Cross-language agreement is only 12.8% and SAFE-rate variance exceeds 17%, exposing a safety generalization gap in multilingual LLMs.

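Details
AI abstract (translated from Chinese)

As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, which are spoken by more than 1.2 billion people yet underrepresented in LLM training data. Using a set of 6,000 culturally grounded prompts covering caste, religion, gender, health, and politics, we evaluate 10 leading LLMs on translated variants of the prompts. Our analysis reveals significant safety drift: cross-language agreement is only 12.8%, and SAFE-rate variance exceeds 17%. Some models over-refuse benign prompts in low-resource scripts and over-flag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures with prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical gaps in safety generalization for multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate language-aware alignment strategies grounded in regional harms.

English abstract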

As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompts. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts and over-flag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
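
As a rough illustration of the kind of metrics this abstract names, the sketch below computes a per-prompt entropy over the labels a model receives across languages, plus a simple rate of prompts on which all languages agree. The function names, the label strings, and the exact definitions are assumptions made for illustration; the paper's own formulas may differ.

```python
from collections import Counter
from math import log2

def prompt_entropy(labels):
    """Shannon entropy (bits) of the labels one prompt received across languages."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def agreement_rate(label_table):
    """Fraction of prompts whose label is identical in every language."""
    return sum(len(set(row)) == 1 for row in label_table) / len(label_table)

# Hypothetical judgments: one row per prompt, one column per language.
table = [
    ["SAFE", "SAFE", "SAFE"],
    ["SAFE", "UNSAFE", "SAFE"],
]
print([round(prompt_entropy(row), 3) for row in table])
print(agreement_rate(table))
```

An entropy of zero means the model was labeled the same way in every language for that prompt; higher values indicate more cross-language drift.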

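2603.13452 2026-05-18 cs.AI cs.CY cs.LG Updated version

MESD: A Risk-Sensitive Metric for Explanation Fairness Across Intersectional Subgroups

Gideon Popoola, John Sheppard

AI summary This paper proposes MESD, a procedural fairness metric that measures differences in explanation quality across intersectional subgroups. MESD combines label-aware aggregation, empirical-Bayes shrinkage, and CVaR weighting, and is embedded in UEF, a multi-objective optimization framework that jointly optimizes utility, outcome fairness, and procedural fairness.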

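Details
AI abstract (translated from Chinese)

Fairness in machine learning is evaluated mainly through outcome-oriented metrics, such as demographic parity, which measure whether predictions are statistically consistent across protected groups. These metrics, however, cannot detect whether a model uses systematically different reasoning for different demographic groups, which violates procedural fairness principles. The problem is compounded by intersectionality: a model may appear fair on individual attributes (e.g., race) yet show significant disparities for intersectional subgroups (e.g., race × gender), a phenomenon known as fairness gerrymandering. This paper introduces Multi-category Explanation Stability Disparity (MESD), a procedural fairness metric that quantifies disparities in explanation quality across intersectional subgroups formed by the Cartesian product of multiple protected attributes. MESD integrates three components: label-aware aggregation aligned with outcome-conditional fairness, empirical-Bayes shrinkage to stabilize estimates for small intersectional groups, and Conditional Value-at-Risk (CVaR) weighting to emphasize worst-case subgroup disparities. We integrate MESD into a multi-objective optimization framework (UEF) that jointly optimizes utility, outcome fairness, and procedural fairness using NSGA-II. We evaluate MESD and UEF on three benchmark datasets against four state-of-the-art methods and show that MESD reveals procedural disparities that outcome metrics alone cannot detect. We situate our contribution within procedural justice theory and discuss its implications for regulatory compliance and intersectional equity.

English abstract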

Fairness in machine learning is predominantly evaluated through outcome-oriented metrics, such as Demographic parity, which measure whether predictions are statistically consistent across protected groups. However, these metrics cannot detect whether a model uses systematically different reasoning for different demographic groups, which violates procedural fairness principles. This problem is compounded by intersectionality, where models may appear fair on individual attributes (e.g., race) while exhibiting significant disparities for intersectional subgroups (e.g., race $\times$ gender), a phenomenon known as fairness gerrymandering. In this work, we introduce Multi-category Explanation Stability Disparity (MESD), a procedural fairness metric that quantifies disparities in explanation quality across intersectional subgroups formed by the Cartesian product of multiple protected attributes. MESD integrates three components, which are label-aware aggregation aligned with outcome-conditional fairness, empirical-Bayes shrinkage to stabilize estimates for small intersectional groups, and Conditional Value-at-Risk (CVaR) weighting to emphasize worst-case subgroup disparities. We integrate MESD within a multi-objective optimization framework (UEF) that jointly optimizes utility, outcome fairness, and procedural fairness using NSGA-II. We evaluated MESD and UEF on three benchmark datasets along with four state-of-the-art methods in several experiments, and we demonstrate that MESD reveals procedural disparities invisible to outcome metrics alone. We position our contribution within procedural justice theory and discuss implications for regulatory compliance and intersectional equity.
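
To make two of the named ingredients concrete, the following sketch shrinks each subgroup's score toward the global mean in proportion to its sample size and then averages the worst fraction of the resulting gaps, CVaR-style. This is a minimal sketch and not the paper's definition of MESD: the shrinkage rule `n / (n + prior_strength)`, the parameter names, and the toy numbers are all assumptions.

```python
from math import ceil
from statistics import mean

def shrink(group_mean, n, global_mean, prior_strength=10.0):
    """Empirical-Bayes-style shrinkage: small groups are pulled toward the global mean."""
    w = n / (n + prior_strength)
    return w * group_mean + (1.0 - w) * global_mean

def cvar(values, alpha=0.25):
    """Mean of the worst (largest) alpha fraction of the values."""
    k = max(1, ceil(alpha * len(values)))
    return mean(sorted(values, reverse=True)[:k])

def disparity_score(groups, alpha=0.25, prior_strength=10.0):
    """groups maps a subgroup to (mean explanation instability, sample count)."""
    total = sum(n for _, n in groups.values())
    global_mean = sum(m * n for m, n in groups.values()) / total
    gaps = [abs(shrink(m, n, global_mean, prior_strength) - global_mean)
            for m, n in groups.values()]
    return cvar(gaps, alpha)

# Hypothetical race x gender subgroups; the tiny group is heavily shrunk.
groups = {("A", "F"): (0.20, 100), ("A", "M"): (0.20, 100),
          ("B", "F"): (0.80, 2), ("B", "M"): (0.25, 100)}
print(round(disparity_score(groups), 4))
```

The shrinkage step keeps a two-sample subgroup from dominating the score on noise alone, while the CVaR step keeps the score focused on the worst-off subgroups rather than the average.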

2603.01290 2026-05-18 cs.AI cs.GT cs.LG cs.SY eess.SY Updated version

Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy

Kalliopi Kleisarchaki

Affiliations * Independent Researcher

AI summary This paper proposes an HMM-POMDP framework for 2026 Formula 1 energy strategy. An HMM infers the rival's state and a DQN makes the decisions, addressing the partially observable game and detecting counter-harvest traps.

Comments 17 pages. v3: editorial corrections and bibliographic updates. Pre-registered theoretical framework; empirical calibration on 2026 race telemetry from Australian Grand Prix (8 March 2026) onwards

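Details
AI abstract (translated from Chinese)

The 2026 Formula 1 technical regulations fundamentally change energy strategy: with a 50/50 power split between the internal combustion engine and the battery, unlimited regeneration, and a driver-controlled Override Mode, the optimal energy deployment policy depends not only on the driver's own state but also on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that single-agent optimization methods cannot solve. We present a tractable two-layer inference and decision framework. The first layer is a 40-state Hidden Markov Model (HMM) that uses six publicly observable telemetry signals to infer each rival's ERS charge level (four modes: H, M, L_harvest, L_derate), Override Mode status, and tyre degradation state. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects an energy deployment strategy. We formally characterize the counter-harvest trap, a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack, and show that detecting it requires belief-state inference over both the ERS level and the harvest/derate sub-mode. On synthetic races, the HMM achieves 96.8% ERS-level accuracy (random baseline 25%), distinguishes L_harvest from L_derate with 89.4% accuracy, and detects counter-harvest traps with 96.3% recall. Pre-season analysis indicates that circuit-dependent recharge availability (1.0x to 2.2x per lap) is the primary confound; Melbourne is the hardest validation environment. Baum-Welch calibration on 2026 race telemetry begins with the Australian Grand Prix (8 March 2026).

English abstract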

The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode, the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 40-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level (four modes: H, M, L_harvest, L_derate), Override Mode status, and tyre degradation state from six publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap, a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack, and show that detecting it requires belief-state inference over both ERS level and the harvest/derate sub-mode. On synthetic races, the HMM achieves 96.8% ERS-level accuracy (random baseline 25%), classifies L_harvest vs. L_derate with 89.4% accuracy, and detects counter-harvest trap conditions with 96.3% recall. Pre-season analysis indicates circuit-dependent recharge availability (1.0x to 2.2x per lap) as the primary confound; Melbourne is the hardest-case validation environment. Baum-Welch calibration on 2026 race telemetry begins with the Australian Grand Prix (8 March 2026).
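
The belief state that the first layer hands to the policy can be illustrated with one step of standard HMM forward filtering. The sketch below is a toy with two hidden states instead of the paper's 40, and the transition and emission numbers are invented for illustration; only the filtering recursion itself is standard.

```python
def forward_step(belief, transition, likelihood):
    """One HMM filtering step.

    belief[i]        : current probability of hidden state i
    transition[i][j] : probability of moving from state i to state j
    likelihood[j]    : probability of the new observation given state j
    """
    n = len(belief)
    # Predict: push the belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Update: reweight by the observation likelihood, then renormalize.
    unnorm = [predicted[j] * likelihood[j] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical two-state rival model: index 0 = high charge, 1 = low charge.
belief = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
likelihood_of_weak_deployment = [0.2, 0.7]  # assumed emission probabilities
belief = forward_step(belief, transition, likelihood_of_weak_deployment)
print([round(b, 3) for b in belief])
```

Repeating this step for each new telemetry observation yields the running distribution over rival states, which is the kind of input a downstream policy network would consume.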

2602.23410 2026-05-18 cs.LG cs.AI eess.SP q-bio.NC Updated version

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

Hanning Guo, Hanwen Bi, Farah Abdellatif, Andrei Galbenus, Jon. N. Shah, Abigail Morrison, Jürgen Dammers

Affiliations * INM-4, Forschungszentrum Jülich, Germany; Department of Computer Science; Software Engineering, RWTH Aachen University, Germany; INM-7, Forschungszentrum Jülich, Germany; Institute of Systems Neuroscience, Heinrich Heine University, Germany; Department of Neurology, RWTH Aachen University, Germany; JARA-BRAIN-Translational Medicine, Germany; INM-11, JARA, Forschungszentrum Jülich, Germany; IAS-6, Forschungszentrum Jülich, Germany; Department of Psychiatry, Psychotherapy and Psychosomatics, RWTH Aachen University, Germany

AI summary Brain-OF is jointly pretrained on fMRI, EEG, and MEG data, addressing semantic heterogeneity and resolution discrepancies across modalities and improving cross-modal data processing.

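Details
AI abstract (translated from Chinese)

Brain foundation models have made notable progress across many neuroscience tasks. However, most existing models are confined to a single functional modality, which limits their ability to exploit complementary spatiotemporal dynamics and the collective data scale of different neuroimaging techniques. This limitation stems mainly from severe semantic heterogeneity and resolution discrepancies among modalities. To address these problems, we propose Brain-OF, an omnifunctional brain foundation model jointly pretrained on fMRI, EEG, and MEG that handles both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, in which shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. In addition, to explicitly internalize the characteristics of neural activity through self-supervised learning, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in the time and frequency domains. Brain-OF is pretrained on a large corpus of about 40 datasets and performs strongly across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.

English abstract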

Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics and the collective data scale across different neuroimaging techniques. This limitation largely arises from severe semantic heterogeneity and resolution discrepancies among modalities. To address these challenges, we propose Brain-OF, an omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, where shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. Furthermore, to explicitly internalize the characteristics of neural activity through self-supervised learning, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains. Brain-OF is pretrained on a large-scale corpus comprising around 40 datasets and demonstrates superior performance across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.

2602.04003 2026-05-18 cs.AI Updated version

When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

Shutong Fan, Lan Zhang, Xiaoyong Yuan

Affiliations * Clemson University

AI summary This paper studies how adversarial explanation attacks manipulate the framing of LLM-generated explanations to influence human trust in AI outputs, revealing a new class of security risk at the cognitive layer.

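Details
AI abstract (translated from Chinese)

Most adversarial threats target the computational behavior of AI models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large language models (LLMs) generate fluent natural-language explanations that shape how users perceive and trust AI outputs, exposing a new attack surface: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), in which an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat with the trust miscalibration gap, a metric that captures the difference in human trust between benign and adversarial explanations. With this metric, we highlight the behavioral risk that persuasive explanation framing can preserve user trust even when the AI prediction is wrong. To characterize the threat, we ran an experiment with more than 200 participants, systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust in adversarial and benign explanations; adversarial explanations retain most of the benign trust despite being incorrect. The most vulnerable cases arise when AEAs resemble expert communication, combining authoritative evidence, a neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less educated, younger, or highly trusting of AI.

English abstract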

Most adversarial threats in artificial intelligence (AI) target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models (LLMs) generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), where an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between benign and adversarial explanations. Using this metric as a lens, we highlight a behavioral risk where persuasive explanation framing can preserve user trust even when the underlying AI prediction is wrong. To characterize this threat, we conducted a human study with over 200 participants, systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite being incorrect. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI.

2602.01970 2026-05-18 cs.AI cs.LG Updated version

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji

Affiliations * Department of Automation, Tsinghua University, Beijing, China; LLM Department, Tencent, Beijing, China

AI summary This paper proposes GPS, which uses a lightweight generative model to perform Bayesian inference of prompt difficulty and combines intermediate-difficulty prioritization with history-anchored diversity, improving the training efficiency and test-time efficiency of RL post-training for large models.

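Details
AI abstract (translated from Chinese)

Reinforcement learning strengthens the reasoning ability of large language models but often incurs high computational cost because of rollout-intensive optimization. Online prompt selection improves training efficiency by prioritizing informative prompts. Existing methods, however, either rely on costly exact evaluations or build prompt-specific predictive models that do not generalize across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference of prompt difficulty with a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test time, enabling efficient allocation of compute. Experiments on a range of reasoning benchmarks show that GPS substantially improves training efficiency, final performance, and test-time efficiency over strong baseline methods.

English abstract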

Reinforcement learning enhances the reasoning capabilities of large language models but often involves high computational costs due to rollout-intensive optimization. Online prompt selection presents a plausible solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or construct prompt-specific predictive models lacking generalization across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference towards prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test-time for efficient computational allocation. Experiments across varied reasoning benchmarks indicate GPS's substantial improvements in training efficiency, final performance, and test-time efficiency over superior baseline methods.

2602.01167 2026-05-18 cs.AI Updated version

Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models

Zhiming Liu, Yujie Wei, Lei Feng, Xiu Su, Xiaobo Xia, Weili Guan, Zeke Xie, Shuo Yang

Affiliations * Harbin Institute of Technology, Shenzhen; Harbin Institute of Technology; Southeast University; Central South University; National University of Singapore; The Hong Kong University of Science and Technology, Guangzhou

AI summary Layer interventions show that some layers hinder downstream tasks. The paper proposes a task-adaptive layer knockout method that improves performance and reveals an unexpected modularity in pretrained VLMs.

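Details
AI abstract (translated from Chinese)

Current VLMs perform well on a variety of multimodal tasks, but engaging all layers by default can hinder task performance. Intervening on the parameters of a single layer shows that some layers suppress rather than help task performance. The work systematically studies how each layer affects different tasks, proposes the Task-Layer Interaction Vector to quantify these effects, and introduces TaLo, a training-free test-time adaptation method that dynamically bypasses the most interfering layer. TaLo improves performance across multiple tasks and datasets, including Qwen-VL's accuracy on the Maps task in ScienceQA.

English abstract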

Current VLMs have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematically investigate how individual layers influence different tasks via layer intervention. Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement can be generalizable across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream tasks' performance. We introduce Task-Layer Interaction Vector, which quantifies the effect of intervening on each layer of a VLM given a task. These task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity in their task-layer interaction vectors. Inspired by these findings, we propose TaLo (Task-Adaptive Layer Knockout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task. Without parameter updates, TaLo improves performance across various models and datasets, including boosting Qwen-VL's accuracy on the Maps task in ScienceQA by up to 16.6%. Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time. The source code will be publicly available.
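
The selection logic described here can be sketched in a few lines: score the base model, score it again with each layer bypassed, treat the per-layer score changes as the interaction vector, and bypass the layer with the largest positive change. The function names and the accuracy numbers below are hypothetical, and the cosine comparison is only one plausible reading of the "high similarity" between task vectors; the paper's actual procedure may differ.

```python
from math import sqrt

def interaction_vector(base_score, knocked_scores):
    """Per-layer change in task score when that layer is bypassed."""
    return [s - base_score for s in knocked_scores]

def pick_layer_to_bypass(base_score, knocked_scores):
    """Index of the most interfering layer, or None if no bypass helps."""
    deltas = interaction_vector(base_score, knocked_scores)
    best = max(range(len(deltas)), key=deltas.__getitem__)
    return best if deltas[best] > 0 else None

def cosine(u, v):
    """Cosine similarity, used here to compare two tasks' interaction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical accuracies on a small probe set for a 4-layer toy model.
base = 0.70
with_layer_removed = [0.62, 0.74, 0.69, 0.71]
print(pick_layer_to_bypass(base, with_layer_removed))
```

Because the choice only needs forward passes on a probe set, a scheme like this involves no parameter updates, which matches the training-free framing in the abstract.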

2601.23068 2026-05-18 cs.LG cs.AI Updated version

ExplainerPFN: Towards tabular foundation models for model-free zero-shot feature importance estimations

Joao Fonseca, Julia Stoyanovich

Affiliations * INESC-ID; New York University

AI summary This paper proposes ExplainerPFN, a tabular foundation model built on TabPFN and pretrained on synthetic structural causal data, which performs model-free zero-shot feature importance estimation and is competitive on real and synthetic datasets.

Comments 35 pages, 11 figures

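Details
AI abstract (translated from Chinese)

Computing feature importance in supervised classification tasks is critical for model interpretability. Shapley values are a common way to explain model predictions, but they require direct access to the underlying model, an assumption that is often violated in real-world deployments. We investigate whether meaningful feature attributions can be obtained in a zero-shot setting, using only the input data distribution and without evaluating the target model. Because multiple models can produce identical predictions yet yield different Shapley decompositions, the mapping from data to attributions is not uniquely identifiable. We therefore target attributions that are "true to the data" rather than "true to the model", learning a posterior mean attribution under a meta-training prior. To this end, we introduce ExplainerPFN, a tabular foundation model built on TabPFN and pretrained on synthetic structural causal data with supervision from exact or near-exact Shapley values. It predicts feature attributions for unseen tabular datasets without model access, gradients, or example explanations. Our contributions are: (1) showing that few-shot surrogate explainers reach high SHAP fidelity with as few as two reference observations; (2) proposing ExplainerPFN, the first zero-shot method that requires no access to the underlying model or reference explanations, providing attributions where no existing explainer can be applied; (3) releasing an open-source implementation that includes the full training pipeline and synthetic data generator; and (4) showing, through extensive experiments on real and synthetic datasets, that ExplainerPFN is competitive with few-shot surrogate explainers that rely on 2-10 SHAP examples.

English abstract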

Computing the importance of features in supervised classification tasks is critical for model interpretability. Shapley values are a widely used approach for explaining model predictions, but require direct access to the underlying model, an assumption frequently violated in real-world deployments. We investigate whether meaningful feature attributions can be obtained in a zero-shot setting, using only the input data distribution and no evaluations of the target model. Because multiple models can produce identical predictions yet yield different Shapley decompositions, the mapping from data to attributions is not uniquely identifiable. We therefore target attributions that are "true to the data" rather than "true to the model", learning a posterior mean attribution under a meta-training prior. To this end, we introduce ExplainerPFN, a tabular foundation model built on TabPFN, pretrained on synthetic structural causal datasets supervised with exact or near-exact Shapley values, that predicts feature attributions for unseen tabular datasets without model access, gradients, or example explanations. Our contributions are fourfold: (1) we show that few-shot surrogate explainers achieve high SHAP fidelity with as few as two reference observations; (2) we propose ExplainerPFN, the first zero-shot method for estimating Shapley-value-style feature attributions without access to the underlying model or reference explanations, providing a principled attribution where no existing explainer can be applied; (3) we release an open-source implementation including the full training pipeline and synthetic data generator; and (4) through extensive experiments on real and synthetic datasets, we show that ExplainerPFN achieves performance competitive with few-shot surrogate explainers that rely on 2-10 SHAP examples.

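2512.10100 2026-05-18 cs.AI Updated version

Robust AI Security and Alignment: A Sisyphean Endeavor?

Apostol Vassilev

Affiliations * CSD/ITL

AI summary By extending Gödel's incompleteness theorem, this paper examines the theoretical limits of AI security and alignment, proposes practical approaches to the resulting challenges, and exposes limitations in the cognitive reasoning of AI systems.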

Comments 17 pages, 1 figure. This version will appear in IEEE Security & Privacy in June 2026

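Details
AI abstract (translated from Chinese)

This paper establishes information-theoretic limits on AI security and alignment by extending Gödel's incompleteness theorem to AI. Understanding these limits and preparing for the challenges they bring is essential for the responsible adoption of AI technology. The paper also provides practical approaches to these challenges and proves broader implications for the cognitive reasoning limitations of AI systems.

English abstract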

This manuscript establishes information-theoretic limitations for robustness of AI security and alignment by extending Gödel's incompleteness theorem to AI. Knowing these limitations and preparing for the challenges they bring is critically important for the responsible adoption of the AI technology. Practical approaches to dealing with these challenges are provided as well. Broader implications for cognitive reasoning limitations of AI systems are also proven.

2512.06655 2026-05-18 cs.LG cs.AI Updated version

Graph-Regularized Sparse Autoencoders for LLM Safety Steering

Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri

Affiliations * ELLIS Institute Tübingen; Max Planck Institute for Intelligent Systems; Intesa Sanpaolo; University of Southern California

AI summary This paper proposes graph-regularized sparse autoencoders, which smooth decoder vectors over a neuron co-activation graph and apply the resulting direction bank, improving safety steering and substantially raising harmful-request refusal rates on several benchmarks.

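Details
AI abstract (translated from Chinese)

Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior may be a poor match for high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal on JailbreakBench, HarmBench, and XSTest, increasing refusal of harmful requests while keeping refusal of benign prompts low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves Δ_s by 20.1 points on JailbreakBench and 16.8 points on HarmBench. GSAE outperforms activation-steering baselines and black-box guardrails, preserves benign-task performance, generalizes across Llama-3, Mistral, Qwen 2.5, and Phi-4, and remains strong under black-box and gray-box jailbreak attacks.

English abstract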

Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves $Δ_s$ by $20.1$ points on JailbreakBench and $16.8$ points on HarmBench. GSAE outperforms activation-steering baselines and black-box guardrails, preserves benign-task performance, generalizes across Llama-3, Mistral, Qwen 2.5, and Phi-4, and remains strong under black-box and gray-box jailbreak attacks.

2511.19931 2026-05-18 cs.IR cs.AI Updated version

LLM-EDT: Large Language Model Enhanced Cross-domain Sequential Recommendation with Dual-phase Training

Ziwei Liu, Qidong Liu, Wanyu Wang, Yejing Wang, Pengyue Jia, Tong Xu, Wei Huang, Chong Chen, Xiangyu Zhao

Affiliations * City University of Hong Kong; Xi'an Jiaotong University & City University of Hong Kong; University of Science; Independent Researcher; Tsinghua University

AI summary This paper proposes LLM-EDT, which addresses the domain imbalance and transition issues in cross-domain sequential recommendation with a dual-phase training strategy, a transferable item augmenter, and a domain-aware profiling module, improving recommendation quality.

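Details
AI abstract (translated from Chinese)

Cross-domain Sequential Recommendation (CDSR) aims to enrich user-item interactions by integrating information from multiple domains. Despite progress, the imbalance issue and the transition issue hinder further development. The former means that interactions in one domain dominate overall behavior, making it hard to capture the features of other domains; the latter means that users' cross-domain preferences are hard to capture within a mixed interaction sequence, which hurts next-item prediction in specific domains. Large language models (LLMs) partly alleviate these issues through their generation and encoding abilities, but existing LLM-enhanced CDSR methods still need improvement. We therefore propose LLM-EDT, which uses a transferable item augmenter that introduces less irrelevant noise, a dual-phase training strategy that gives the domain-specific thread a domain-shared background, and a domain-aware profiling module that summarizes user preferences in each domain and adaptively aggregates them into a comprehensive user profile. Experiments validate the effectiveness of LLM-EDT.

English abstract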

Cross-domain Sequential Recommendation (CDSR) has been proposed to enrich user-item interactions by incorporating information from various domains. Despite current progress, the imbalance issue and the transition issue hinder further development of CDSR. The former refers to interactions in one domain dominating the entire behavior, which makes it difficult to capture the domain-specific features of the other domain. The latter refers to the difficulty of capturing users' cross-domain preferences within the mixed interaction sequence, which results in poor next-item prediction performance for specific domains. With world knowledge and powerful reasoning ability, Large Language Models (LLMs) partially alleviate the above issues by acting as a generator and an encoder. However, current LLM-enhanced CDSR methods are still under exploration and fail to address the irrelevant noise and rough profiling problems. To address these challenges, we propose an LLM Enhanced Cross-domain Sequential Recommendation with Dual-phase Training (LLM-EDT). To address the imbalance issue while introducing less irrelevant noise, we first propose the transferable item augmenter to adaptively generate possible cross-domain behaviors for users. Then, to alleviate the transition issue, we introduce a dual-phase training strategy to empower the domain-specific thread with a domain-shared background. For the rough profiling problem, we devise a domain-aware profiling module that summarizes the user's preference in each domain and adaptively aggregates the summaries to generate comprehensive user profiles. Experiments on three public datasets validate the effectiveness of the proposed LLM-EDT. To ease reproducibility, we have released the detailed code online at https://anonymous.4open.science/r/LLM-EDT-583F.

2511.15623 2026-05-18 cs.DB cs.AI cs.LO Updated version

Sufficient Explanations in Databases and their Connections to Database Repairs

Leopoldo Bertossi, Nina Pardal

Affiliations * Carleton University, Canada & IMFD, Chile; University of Edinburgh, UK

AI summary The paper studies sufficient explanations in databases and their connection to database repairs, and proposes using answer-set programs to compute sufficient explanations and sufficiency-degrees.

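Details
AI abstract (translated from Chinese)

We study the notion of sufficient explanation and a sufficiency-degree that serves as an attribution score for database tuples in query answering. We also investigate and exploit connections between sufficient explanations and the database repairs used to handle inconsistent databases, as well as connections with causality-based necessary explanations, obtaining new computational results. We show how answer-set programs can be used to specify sufficient explanations and compute sufficiency-degrees.

English abstract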

We investigate the notion of sufficient explanation, and a sufficiency-degree as attribution score for database tuples in relation to query answering. We also investigate and exploit connections with database repairs as used for dealing with inconsistent databases; and with causality-based necessary explanations, obtaining new computational results. We show how to use answer-set programs to specify sufficient explanations and compute sufficiency-degrees.

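2511.09378 2026-05-18 cs.AI cs.LG Updated version

Frontier Large Language Models Rival State-of-the-Art Planners

Augusto B. Corrêa, André G. Pereira, Jendrik Seipp

Affiliations * University of Oxford; Federal University of Rio Grande do Sul; Linköping University

AI summary The study shows that frontier LLMs now rival classical planners on planning tasks: Gemini 3.1 Pro outperforms the strongest planner baseline on standard tasks, GPT-5 performs close to the baselines, and Gemini 3.1 Pro remains competitive on purely symbolic planning, indicating a rising trend in LLM planning ability.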

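Details
AI abstract (translated from Chinese)

A series of influential studies found that large language models cannot reliably solve even simple planning tasks. We show that the latest generation of frontier models overturns this conclusion. We evaluate three families of frontier LLMs on challenging planning tasks based on the most recent International Planning Competition, following rigorous evaluation guidelines: solutions are checked with a validation tool, tasks are newly created to avoid data contamination, and performance is compared with state-of-the-art classical planners. On standard task descriptions, Gemini 3.1 Pro solves 245 of 360 tasks, outperforming the strongest planner baseline (245 vs. 234). GPT-5 performs comparably to the baselines. When all semantic information is obfuscated to test pure symbolic planning, performance drops, but Gemini 3.1 Pro remains competitive with the strongest baselines. A longitudinal comparison across model generations, from GPT-3.5 (which solves zero tasks) to GPT-5, reveals a striking upward trend. Frontier LLMs may finally be able to plan; the question now is how far this capability will extend.

English abstract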

A series of influential studies established that large language models cannot reliably solve even simple planning tasks. We show that the latest generation of frontier models overturns this conclusion. We evaluate three families of frontier LLMs on a challenging set of planning tasks based on the most recent International Planning Competition following rigorous evaluation guidelines: solutions are verified with a validation tool, tasks are freshly created to avoid data contamination, and performance is compared against state-of-the-art classical planners. On standard task descriptions, Gemini 3.1 Pro outperforms the strongest planner baseline (245 vs. 234 solved tasks out of 360), while GPT-5 achieves comparable performance to the baselines. When all semantic information is obfuscated from the descriptions to test for pure symbolic planning, performance degrades but Gemini 3.1 Pro remains competitive with the strongest baselines. A longitudinal comparison across model generations -- from GPT-3.5, which solves zero tasks, to GPT-5 -- reveals a striking upward trajectory. Frontier LLMs might finally be able to plan; the question now is how far this capability will extend.

2510.25404 2026-05-18 cs.LG cs.AI Updated version

SemanticOpt: Towards LLM-Based Semantic Black-Box Optimization

Jamison Meindl, Yunsheng Tian, Tony Cui, Veronika Thost, Zhang-Wei Hong, Jie Chen, Wojciech Matusik, Mina Konaković Luković

Affiliations * MIT; MIT-IBM Watson AI Lab

AI summary SemanticOpt uses LLMs to process semantic information: fine-tuning on structured Bayesian optimization trajectories augmented with natural-language context improves black-box optimization, and the method outperforms classical optimizers and existing LLM-based approaches on several real-world problems.

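Details
AI abstract (translated from Chinese)

Optimizing an experimental system is very challenging when each experiment is expensive, time-consuming, or difficult to run. Existing optimizers for expensive black-box problems, such as Bayesian optimization, are usually limited to numerical or categorical observations. They do not use broader domain knowledge, such as expert heuristics, relevant scientific papers, or similar previous experiments. Large language models (LLMs) can interpret this semantic information; however, even state-of-the-art LLMs struggle to solve black-box optimization problems reliably. We introduce SemanticOpt, a framework for semantic black-box optimization that gives LLMs optimization capabilities by fine-tuning them on structured Bayesian optimization trajectories. SemanticOpt uses both numerical and semantic evidence when proposing new experiments and produces interpretable predictions aligned with Bayesian surrogate models. We construct a set of real-world optimization problems paired with semantic information to create a diverse benchmark for evaluating semantic black-box optimization. Across these domains, SemanticOpt on average outperforms classical optimizers and existing LLM-based approaches when given relevant semantic information.

English abstract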

Optimizing an experimental system can be extremely challenging when each experiment is expensive, time-consuming, or difficult to perform. Existing optimizers for expensive black-box problems, such as Bayesian optimization, are typically limited to numerical or categorical observations. They do not make use of broader domain knowledge, such as expert heuristics, relevant scientific papers, or similar previous experiments. Large language models (LLMs) can interpret this semantic information; however, even state-of-the-art LLMs struggle to reliably solve black-box optimization problems. We introduce SemanticOpt, a framework for semantic black-box optimization that equips LLMs with optimization capabilities by fine-tuning them on structured Bayesian optimization trajectories augmented with natural-language context. SemanticOpt jointly uses numerical and semantic evidence when proposing new experiments, while producing interpretable predictions aligned with Bayesian surrogate models. We construct a range of real-world optimization problems paired with semantic information to create a diverse benchmark for evaluating semantic black-box optimization. Across these domains, SemanticOpt outperforms both classical optimizers and existing LLM-based approaches on average when given relevant semantic information.

2510.23634 2026-05-18 cs.LG cs.AI Updated version

Monotone and Separable Set Functions: Characterizations and Neural Models

Soutrik Sarangi, Yonatan Sverdlov, Nadav Dym, Abir De

Affiliations * IIT Bombay; Technion

AI summary This paper studies the design of set-to-vector functions that preserve the natural partial order on sets, proposes a model with a weakly MAS property, and demonstrates its advantages on set containment tasks.

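Details
AI abstract (translated from Chinese)

Motivated by applications to set containment problems, this paper considers the design of set-to-vector functions that preserve the natural partial order on sets, i.e., S ⊆ T if and only if F(S) ≤ F(T). We call functions with this property Monotone and Separating (MAS) set functions. We establish lower and upper bounds on the vector dimension as a function of the multiset cardinality and the size of the ground set. In the important case of an infinite ground set, we prove that MAS functions do not exist, but we propose a model that satisfies a weakly MAS property and is stable in the sense of Hölder continuity. We also show that MAS functions can be used to build universal models that are monotone by construction and can approximate all monotone set functions. Experiments on a variety of set containment tasks show that our model has an advantage over standard set models that do not use set containment as an inductive bias. Code is available at https://github.com/structlearning/MASNET.

English abstract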

Motivated by applications to set containment problems, we consider the following fundamental problem: can we design set-to-vector functions so that the natural partial order on sets is preserved, namely $S\subseteq T \text{ if and only if } F(S)\leq F(T)$? We call functions satisfying this property Monotone and Separating (MAS) set functions. We establish lower and upper bounds for the vector dimension necessary to obtain MAS functions, as a function of the cardinality of the multisets and the underlying ground set. In the important case of an infinite ground set, we show that MAS functions do not exist, but provide a model which provably enjoys a relaxed MAS property we name "weakly MAS" and is stable in the sense of Hölder continuity. We also show that MAS functions can be used to construct universal models that are monotone by construction and can approximate all monotone set functions. Experimentally, we consider a variety of set containment tasks. The experiments show the benefit of using our model, in comparison with standard set models which do not incorporate set containment as an inductive bias. Our code is available at https://github.com/structlearning/MASNET.
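
For intuition about the MAS property on a small finite ground set, the sketch below uses the multiplicity (count) vector as the embedding and checks exhaustively that sub-multiset containment coincides with the coordinate-wise order. This is a textbook construction chosen for illustration, not the paper's model; it needs one coordinate per ground-set element, so it says nothing about the dimension bounds or the infinite case discussed in the abstract.

```python
from collections import Counter
from itertools import combinations_with_replacement

def embed(multiset, ground_set):
    """Map a multiset to its vector of element counts."""
    counts = Counter(multiset)
    return [counts[x] for x in ground_set]

def leq(u, v):
    """Coordinate-wise partial order on vectors."""
    return all(a <= b for a, b in zip(u, v))

def is_submultiset(s, t):
    """True when every element of s occurs in t at least as often."""
    cs, ct = Counter(s), Counter(t)
    return all(ct[x] >= c for x, c in cs.items())

ground = ["a", "b", "c"]
# All multisets over the ground set with at most three elements.
multisets = [m for k in range(4) for m in combinations_with_replacement(ground, k)]

# S is contained in T exactly when embed(S) <= embed(T) coordinate-wise.
assert all(
    is_submultiset(s, t) == leq(embed(s, ground), embed(t, ground))
    for s in multisets
    for t in multisets
)
```

The exhaustive assertion passes because the count vector records exactly the information that containment depends on, which is also why its dimension grows with the ground set.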

2510.13842 2026-05-18 cs.CL cs.AI cs.CR 版本更新

ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

ADMIT: RAG基事实核查中的少样本知识污染攻击

Yutao Wu, Xiao Liu, Yinghui Li, Yifeng Gao, Yifan Ding, Jiale Ding, Xiang Zheng, Xingjun Ma

发表机构 * Deakin University(德金大学) Fudan University(复旦大学) City University of Hong Kong(香港城市大学)

AI总结 ADMIT提出一种无需访问目标模型和检索器的少样本知识污染攻击,在检索结果包含真实证据的情况下,通过注入少量语义对齐的对抗性内容来翻转事实核查决策,实验显示其平均攻击成功率达86%,揭示了RAG事实核查系统的重大漏洞。

详情
AI中文摘要

ADMIT提出了一种无需访问目标模型和检索器的少样本知识污染攻击,在检索结果包含真实证据的情况下,通过注入少量语义对齐的对抗性内容来翻转事实核查决策,实验显示其平均攻击成功率达86%,揭示了RAG事实核查系统的重大漏洞。

英文摘要

Knowledge poisoning poses a critical threat to Retrieval-Augmented Generation (RAG) systems by injecting adversarial content into knowledge bases, tricking Large Language Models (LLMs) into producing attacker-controlled outputs grounded in manipulated context. Prior work highlights LLMs' susceptibility to misleading or malicious retrieved content. However, real-world fact-checking scenarios are more challenging, as credible evidence typically dominates the retrieval pool. To investigate this problem, we extend knowledge poisoning to the fact-checking setting, where retrieved context includes authentic supporting or refuting evidence. We propose ADMIT (ADversarial Multi-Injection Technique), a few-shot, semantically aligned poisoning attack that flips fact-checking decisions and induces deceptive justifications, all without access to the target LLMs, retrievers, or token-level control. Extensive experiments show that ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, achieving an average attack success rate (ASR) of 86% at an extremely low poisoning rate of $0.93 \times 10^{-6}$, and remaining robust even in the presence of strong counter-evidence. Compared with prior state-of-the-art attacks, ADMIT improves ASR by 11.2% across all settings, exposing significant vulnerabilities in real-world RAG-based fact-checking systems.

2510.03161 2026-05-18 cs.CV cs.AI 版本更新

UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

UniShield: 一种适应性多智能体框架用于统一的伪造图像检测与定位

Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

发表机构 * School of Electronic and Computer Engineering, Peking University(北京大学电子与计算机工程学院) School of Future Technology, South China University of Technology(华南理工大学未来技术学院) School of Electronic and Information Engineering, South China University of Technology(华南理工大学电子与信息工程学院) Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University(北京大学深圳研究生院超高清沉浸媒体技术省重点实验室)

AI总结 UniShield通过多智能体框架实现跨领域伪造图像检测与定位,提升检测的适应性和实用性。

详情
AI中文摘要

UniShield通过多智能体框架实现跨领域伪造图像检测与定位,提升检测的适应性和实用性。

英文摘要

With the rapid advancements in image generation, synthetic images have become increasingly realistic, posing significant societal risks, such as misinformation and fraud. Forgery Image Detection and Localization (FIDL) thus emerges as essential for maintaining information integrity and societal security. Despite the impressive performance of existing domain-specific detection methods, their practical applicability remains limited, primarily due to their narrow specialization, poor cross-domain generalization, and the absence of an integrated adaptive framework. To address these issues, we propose UniShield, a novel multi-agent-based unified system capable of detecting and localizing image forgeries across diverse domains, including image manipulation, document manipulation, DeepFake, and AI-generated images. UniShield innovatively integrates a perception agent with a detection agent. The perception agent intelligently analyzes image features to dynamically select suitable detection models, while the detection agent consolidates various expert detectors into a unified framework and generates interpretable reports. Extensive experiments show that UniShield achieves state-of-the-art results, surpassing both existing unified approaches and domain-specific detectors, highlighting its superior practicality, adaptiveness, and scalability.

2510.02453 2026-05-18 cs.LG cs.AI cs.CL 版本更新

How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

如何训练你的导师:通过导师模型引导黑盒大语言模型

Parth Asawa, Alan Zhu, Abigail O'Neill, Matei Zaharia, Alexandros G. Dimakis, Joseph E. Gonzalez

发表机构 * University of California, Berkeley(加州大学伯克利分校) Bespoke Labs(Bespoke实验室)

AI总结 本文提出Advisor Models,通过训练小型开放权重模型生成动态个性化建议,提升黑盒前沿模型性能,实验显示在多个任务中效果显著,且具有良好的迁移性和鲁棒性。

Comments International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

前沿语言模型作为黑盒服务部署,其权重无法修改,定制仅限于提示。我们引入Advisor Models,一种方法通过训练小型开放权重模型生成动态、实例特定的自然语言建议,以提升黑盒前沿模型的能力。Advisor Models将GPT-5.2在RuleArena(税务)任务上的性能提升27.4%,减少Gemini 3 Pro在SWE代理任务中的步骤24.6%,并在个性化GPT-5到用户偏好方面优于静态提示优化器(85-100% vs. 40-60%)。我们还发现顾问具有可迁移性:用低成本学生模型训练的顾问仍能将改进转移到前沿模型。此外,Advisor Models具有鲁棒性:在其他基准测试中未观察到降级,除了训练管道所训练的基准测试。我们的方法展示了如何以实用且经济有效的方式对黑盒前沿模型进行参数优化。

英文摘要

Frontier language models are deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. We introduce Advisor Models, a method to train small open-weight models to generate dynamic, per-instance natural language advice that improves the capabilities of black-box frontier models. Advisor Models improve GPT-5.2's performance on RuleArena (Taxes) by 27.4%, reduce Gemini 3 Pro's steps taken in SWE agent tasks by 24.6%, and outperform static prompt optimizers in personalizing GPT-5 to user preferences (85-100% vs. 40-60%). We also find that advisors are transferable: an advisor trained with a low-cost student model still transfers improvements to a frontier model. Moreover, Advisor Models are robust: we observe no degradation on benchmarks other than those the pipeline is trained on. Our method shows how to perform parametric optimization for black-box frontier models in a practical and cost-effective way.

2510.01632 2026-05-18 q-bio.BM cs.AI 版本更新

BioBlobs: Unsupervised Discovery of Functional Substructures for Protein Function Prediction

BioBlobs:面向蛋白质功能预测的功能子结构无监督发现

Xin Wang, Kaiwen Shi, Carlos Oliver

发表机构 * Vanderbilt University(范德比大学) Yale University(耶鲁大学)

AI总结 BioBlobs通过无监督方法发现蛋白质的功能子结构,利用端到端可微分框架压缩蛋白质为少量连贯子结构并预测功能,实现了对功能区域的候选识别。

详情
AI中文摘要

蛋白质功能由如催化三元组、结合口袋和结构模体等紧密子结构驱动,这些子结构仅占据蛋白质残基的小部分。然而,现有基于蛋白质编码器的流程并未在子结构层面建模,未能回答核心生物学问题:蛋白质的哪一部分负责其功能?我们引入了BioBlobs,一种编码器无关、端到端可微分的框架,能够将蛋白质压缩为少量连贯的子结构(blobs),并仅基于这些blobs预测功能,使得每个blob对应一个候选功能区域。在多样化的蛋白质功能预测任务和多种基于序列和结构的编码器上,BioBlobs在仅使用少量残基的情况下,匹配或超过了强大的基线模型。发现的blobs会根据任务调整其空间尺度,从局部催化位点到整个结构域。仅在蛋白质层面标签上训练,BioBlobs能够恢复M-CSA数据库中实验注释的催化位点,证明了无监督的功能子结构发现,并为未注释的整个蛋白质组的规模化功能位点发现开辟了道路。

英文摘要

Protein function is driven by cohesive substructures, such as catalytic triads, binding pockets, and structural motifs, that occupy only a small fraction of a protein's residues. Yet existing pipelines built on protein encoders do not model proteins at the substructure level, leaving the central biological question unanswered: which substructure of a protein is responsible for its function? We introduce BioBlobs, an encoder-agnostic, end-to-end differentiable framework that compresses a protein into a small set of cohesive substructures (blobs) and predicts function from these blobs alone, so that each blob corresponds to a candidate functional region. Across diverse protein function prediction tasks and multiple sequence- and structure-based encoders, BioBlobs matches or exceeds strong baselines while operating on only a small fraction of residues. The discovered blobs adapt their spatial scale to the task, ranging from local catalytic sites to entire structural domains. Trained only on protein-level labels, BioBlobs recovers experimentally annotated catalytic sites in the M-CSA database, demonstrating unsupervised functional substructure discovery and opening a path to large-scale functional site discovery across the unannotated proteome.

2509.23352 2026-05-18 cs.CV cs.AI 版本更新

Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling

动态树RPO:通过结构化采样打破独立轨迹瓶颈

Xiaolong Fu, Lichen Ma, Zipeng Guo, ShiPing Dong, Lan Yang, Tan Lit Sin, Gaojing Zhou, Yu He, Jingling Fu, Shizhe Zhou, Junshi Huang, Jason Li

发表机构 * Sun Yat-sen University(中山大学) Tsinghua University(清华大学) Beijing University of Chemical Technology(北京化工大学)

AI总结 本文提出动态树RPO,通过树状结构采样策略和动态噪声强度,提升文本到图像生成的质量与效率,同时结合层调优强化学习方法,在多个基准测试中表现出色。

Comments Fig.3 updated

详情
AI中文摘要

将强化学习(RL)整合到流匹配模型中,推动了文本到图像(T2I)生成的质量提升。然而,由于采样组内差异很小,这些进步往往以大量探索和低效采样策略为代价。基于这一见解,我们提出了动态树RPO,将滑动窗口采样策略实现为树状结构搜索,并沿深度采用动态噪声强度。我们在此树结构中执行GRPO引导优化和受约束的随机微分方程(SDE)采样。通过共享树的前缀路径,我们的设计有效摊薄了轨迹搜索的计算开销。通过为每个树层设计良好的噪声强度,动态树RPO可以在不增加额外计算成本的情况下增强探索的多样性。此外,我们无缝整合监督微调(SFT)和RL范式,构建我们提出的层调优RL(LayerTuning-RL),将SFT的损失函数重新表述为动态加权进展奖励模型(PRM),而不是单独的预训练方法。通过将此加权PRM与动态自适应剪裁边界关联,避免了动态树RPO中探索过程的干扰。得益于树状结构采样和层调优RL范式,我们的模型沿有效方向动态探索多样化的搜索空间。与现有基线相比,我们的方法在语义一致性、视觉保真度和人类偏好对齐方面在已建立的基准测试中表现出显著优势,包括HPS-v2.1、PickScore和ImageReward。特别是,我们的模型在这些基准测试中分别比SoTA高出4.9%、5.91%和8.66%,同时将训练效率提高了近50%。

英文摘要

The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies due to slight variation in the sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-window sampling strategy as a tree-structured search with dynamic noise intensities along depth. We perform GRPO-guided optimization and constrained Stochastic Differential Equation (SDE) sampling within this tree structure. By sharing prefix paths of the tree, our design effectively amortizes the computational overhead of trajectory search. With well-designed noise intensities for each tree layer, Dynamic-TreeRPO can enhance the variation of exploration without any extra computational cost. Furthermore, we seamlessly integrate Supervised Fine-Tuning (SFT) and RL paradigm within Dynamic-TreeRPO to construct our proposed LayerTuning-RL, reformulating the loss function of SFT as a dynamically weighted Progress Reward Model (PRM) rather than a separate pretraining method. By associating this weighted PRM with dynamic-adaptive clipping bounds, the disruption of exploration process in Dynamic-TreeRPO is avoided. Benefiting from the tree-structured sampling and the LayerTuning-RL paradigm, our model dynamically explores a diverse search space along effective directions. Compared to existing baselines, our approach demonstrates significant superiority in terms of semantic consistency, visual fidelity, and human preference alignment on established benchmarks, including HPS-v2.1, PickScore, and ImageReward. In particular, our model outperforms SoTA by $4.9\%$, $5.91\%$, and $8.66\%$ on those benchmarks, respectively, while improving the training efficiency by nearly $50\%$.
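The abstract refers to GRPO-guided optimization over a sampling group. As a rough sketch of the group-relative scoring that GRPO-style methods are commonly described as using, the snippet below normalizes each sample's reward by the mean and standard deviation of its group. It is an assumption-laden illustration, not the authors' implementation: the function name, the use of the population standard deviation, and the `eps` constant are all hypothetical choices, and the tree structure, noise schedule, and SDE sampling are not modeled.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to its sampling group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std of the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for one group of images sampled from the same prompt (made-up numbers).
print(group_relative_advantages([1.0, 2.0, 3.0]))

# A group with identical rewards carries no learning signal: all advantages are zero.
print(group_relative_advantages([0.5, 0.5, 0.5]))
```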

2509.22739 2026-05-18 cs.CL cs.AI cs.LG stat.ML 版本更新

Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

无痛激活导向:一种自动化、轻量级的大型语言模型后训练方法

Sasha Cui, Zhongren Chen

发表机构 * Yale University(耶鲁大学)

AI总结 本文提出Painless Activation Steering,一种自动化方法,无需人工干预即可利用标注数据提升模型性能,尤其在行为任务中表现优异,但对智能任务效果有限。

详情
AI中文摘要

语言模型通常通过基于权重或基于提示的导向进行后训练,但前者耗时昂贵,后者控制不精确且需手动试错。激活导向(AS)提供了一种更经济、快速且可控的替代方法,但现有技术需人工构造提示对或进行大量特征标注,不如RL和SFT等即插即用方法方便。本文引入Painless Activation Steering(PAS),一类完全自动化的方法,可利用任何标注数据集进行AS,无需提示构造、特征标注或人工干预。在三个开放权重模型和18个任务上评估PAS,发现其在行为任务中能可靠提升性能,但对智能导向任务无效。内省变体(iPAS)带来了最强的因果导向效果(偏差任务10.1%、道德任务5.2%、对齐任务34.8%)。此外,PAS在上下文学习(ICL)和SFT基础上还提供了额外增益。PAS构建了一个快速、轻量的激活向量,可低成本训练、轻松存储并随时激活。实验结果刻画了AS在何处有效、何处失效,以及如何将其部署为实用的自动化后训练方案。

英文摘要

Language models (LMs) are typically post-trained for desired capabilities and behaviors via weight-based or prompt-based steering, but the former is time-consuming and expensive, and the latter is not precisely controllable and often requires manual trial-and-error. While activation steering (AS) promises a cheap, fast, and controllable alternative to the two existing post-training methods, current AS techniques require hand-crafted prompt pairs or labor-intensive feature annotation, making them more inconvenient than the plug-and-play methods such as Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). We introduce Painless Activation Steering (PAS), a family of fully automated methods that make AS readily usable with any given labeled dataset, with no need for prompt construction, feature labeling, or human intervention. We evaluate PAS on three open-weight models (Llama3.1-8B-Instruct, DeepSeek-R1-Distill-8B, and Nous-Hermes-2) and 18 tasks; we find that PAS reliably improves performance for behavior tasks, but not for intelligence-oriented tasks. The introspective variant (iPAS) delivers the strongest causal steering effects (10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment). We also show PAS delivers additional gains on top of In-Context Learning (ICL) and SFT. PAS constructs a fast, lightweight activation vector that can be cheaply trained, easily stored, and activated at will. Our results provide a characterization of where AS helps, where it fails, and how to deploy it as a practical, automated LM post-training option.
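To make the idea of an "activation vector that can be cheaply trained, easily stored, and activated at will" concrete, here is a minimal sketch of one common activation-steering construction: the difference of mean hidden activations between two labeled groups, added to a hidden state at inference time. The abstract does not say that PAS uses this construction, so treat it as an assumption; the function names, the scaling factor `alpha`, and the toy 2-dimensional activations are all hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(pos_acts, neg_acts):
    """Difference of means between activations for desired and undesired examples."""
    mp, mn = mean_vector(pos_acts), mean_vector(neg_acts)
    return [p - q for p, q in zip(mp, mn)]

def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction with strength alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy activations; in practice these would come from a chosen layer of the model.
pos = [[1.0, 2.0], [3.0, 4.0]]
neg = [[0.0, 1.0], [2.0, 1.0]]
vec = steering_vector(pos, neg)
print(vec)
print(steer([0.0, 0.0], vec, alpha=0.5))
```

Because the vector has the dimension of a single hidden state, storing it costs far less than storing fine-tuned weights, which is consistent with the abstract's "lightweight" framing.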

2509.21663 2026-05-18 cs.LG cs.AI cs.LO 版本更新

Logic of Hypotheses: from Zero to Full Knowledge in Neurosymbolic Integration

假设逻辑:从零到全面知识的神经符号整合

Davide Bizzaro, Alessandro Daniele

发表机构 * University of Padua(帕多瓦大学) Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) University of Bozen-Bolzano(博尔扎诺自由大学)

AI总结 本文提出LoH语言,结合数据驱动规则学习与符号先验和专家知识,实现神经符号整合的灵活统一,并通过模糊逻辑实现可微计算图编译。

详情
AI中文摘要

神经符号整合(NeSy)融合神经网络学习与符号推理。该领域可分为注入手工规则的神经模型方法和从数据中诱导符号规则的方法。我们引入假设逻辑(LoH),一种新的语言,统一这些流派,使数据驱动规则学习与符号先验和专家知识的灵活整合成为可能。LoH扩展了命题逻辑语法,加入了可学习参数的选择运算符,可从选项池中选择子公式。利用模糊逻辑,LoH中的公式可直接编译为可微计算图,从而通过反向传播学习最优选择。该框架涵盖了一些现有NeSy模型,同时增加了任意程度的知识规范的可能性。此外,使用戈德尔模糊逻辑和最近开发的戈德尔技巧,可以将模型离散化为硬布尔值函数,而不会损失性能。我们对这些模型进行了实验分析,展示了在表格数据和两个具有感知组件的NeSy任务上的强大结果。

英文摘要

Neurosymbolic integration (NeSy) blends neural-network learning with symbolic reasoning. The field can be split between methods injecting hand-crafted rules into neural models, and methods inducing symbolic rules from data. We introduce Logic of Hypotheses (LoH), a novel language that unifies these strands, enabling the flexible integration of data-driven rule learning with symbolic priors and expert knowledge. LoH extends propositional logic syntax with a choice operator, which has learnable parameters and selects a subformula from a pool of options. Using fuzzy logic, formulas in LoH can be directly compiled into a differentiable computational graph, so the optimal choices can be learned via backpropagation. This framework subsumes some existing NeSy models, while adding the possibility of arbitrary degrees of knowledge specification. Moreover, the use of Gödel fuzzy logic and the recently developed Gödel trick yields models that can be discretized to hard Boolean-valued functions without any loss in performance. We provide experimental analysis on such models, showing strong results on tabular data and on two NeSy tasks with a perceptual component.
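The abstract notes that Gödel fuzzy logic yields models that can be discretized to Boolean functions without loss. A small sketch of why min/max semantics are friendly to discretization: thresholding the fuzzy value of a conjunction or disjunction gives the same result as applying Boolean logic to the thresholded inputs. This covers only Gödel conjunction (min) and disjunction (max); the paper's choice operator and the Gödel trick are not modeled, and the function names and the 0.5 threshold are assumptions made for illustration.

```python
import itertools

def g_and(a, b):
    """Goedel t-norm: conjunction as minimum."""
    return min(a, b)

def g_or(a, b):
    """Goedel t-conorm: disjunction as maximum."""
    return max(a, b)

def harden(a, threshold=0.5):
    """Discretize a fuzzy truth value to a Boolean."""
    return a > threshold

def fuzzy_formula(x1, x2, x3):
    """(x1 AND x2) OR x3 under Goedel semantics."""
    return g_or(g_and(x1, x2), x3)

def bool_formula(b1, b2, b3):
    """The same formula over hard Boolean inputs."""
    return (b1 and b2) or b3

# Check agreement on a grid of fuzzy inputs.
grid = [0.1, 0.4, 0.6, 0.9]
agree = all(
    harden(fuzzy_formula(a, b, c)) == bool_formula(harden(a), harden(b), harden(c))
    for a, b, c in itertools.product(grid, repeat=3)
)
print(agree)
```

The agreement holds because thresholding is monotone and therefore commutes with min and max; product-based fuzzy logics do not have this property in general.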

2509.21173 2026-05-18 cs.CV cs.AI cs.LG 版本更新

Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy

精度降低可能更可靠:对VLMs量化影响的系统评估

Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Chokri Mraidha, Fabio Arnez

发表机构 * Computer Vision Center(计算机视觉中心)

AI总结 本文系统评估了量化对VLMs可靠性的影响,发现量化能提升准确率、校准、异常检测和抗噪能力,但不改善协变量偏移或虚假相关性。

Comments Accepted at ICML 2026

详情
AI中文摘要

视觉-语言模型(VLMs)如CLIP已革新零样本分类和安全关键任务,如异常检测。然而,其高计算成本阻碍了实际部署。尽管量化是提高效率的标准方法,但其对超出简单Top-1准确率的可靠性指标的影响仍被忽视。本文通过超过70万次实验评估VLMs的量化效果,发现量化噪声反而能提升准确率、校准、异常检测和抗噪能力,但不改善协变量偏移或虚假相关性。我们利用这些反直觉发现,证明量化通过抑制高秩谱成分,迫使模型依赖稳健的低秩特征,从而提升泛化能力和抗噪能力,为利用量化部署更快速、可靠的VLMs提供了路径。

英文摘要

Vision-Language Models (VLMs) such as CLIP have revolutionized zero-shot classification and safety-critical tasks, including Out-of-Distribution (OOD) detection. However, their high computational cost hinders efficient real-world deployment. While quantization is a standard solution for efficiency, its broader impact on reliability metrics beyond simple Top-1 accuracy remains critically under-explored. In this study, we conduct a large-scale evaluation of VLM quantization across a comprehensive experimental suite of over 700k evaluation runs with varying configurations. We find that, contrary to the assumption that quantization's noise degrades performance, it can simultaneously improve accuracy, calibration, OOD detection, and robustness to noise, though not to covariate shift or spurious correlations. We leverage these counterintuitive findings to characterize the mechanics of quantization beyond simple regularization: we show that quantization dampens high-rank spectral components, compelling the model to rely more heavily on robust, low-rank features. Ultimately, this spectral filtering effect drives the observed improvements in generalization and noise tolerance, establishing a pathway to deploy faster, more reliable VLMs by utilizing quantization beyond its conventional role.

2509.15267 2026-05-18 cs.CV cs.AI cs.LG 版本更新

Autoguided Online Data Curation for Diffusion Model Training

自引导在线数据精炼用于扩散模型训练

Valeria Pais, Luis Oala, Daniele Faccio, Marco Aversa

发表机构 * University of Glasgow(格拉斯哥大学) Dotphoton

AI总结 本文研究自引导和在线数据选择方法对扩散模型训练效率的影响,通过合成数据任务验证了自引导在样本质量和多样性上的优势。

Comments Accepted non-archival paper at ICCV 2025 Workshop on Curated Data for Efficient Learning (CDEL)

详情
AI中文摘要

生成模型计算成本重新点燃了高效数据精炼的希望。本文探讨了最近发展的自引导和在线数据选择方法是否能提升扩散模型训练的时间和样本效率。我们整合了联合示例选择(JEST)和自引导到统一代码库中,以实现快速消融分析和基准测试。我们在受控的二维合成数据生成任务以及(3x64x64)-D图像生成上评估了数据精炼的组合。我们的比较是在相等的墙钟时间和样本数量下进行的,明确考虑了选择的开销。在所有实验中,自引导一致地提高了样本质量和多样性。早期AJEST(仅在训练开始时应用选择)在两个任务上都能匹配或略微超过自引导单独的效率。然而,其时间开销和额外的复杂性使自引导或均匀随机数据选择在大多数情况下更优。这些发现表明,尽管目标在线选择在早期训练中能带来效率提升,但稳健的样本质量改进主要由自引导驱动。我们讨论了限制和范围,并概述了数据选择何时可能有益。

英文摘要

The costs of generative model compute rekindled promises and hopes for efficient data curation. In this work, we investigate whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of training generative diffusion models. We integrate joint example selection (JEST) and autoguidance into a unified code base for fast ablation and benchmarking. We evaluate combinations of data curation on a controlled 2-D synthetic data generation task as well as (3x64x64)-D image generation. Our comparisons are made at equal wall-clock time and equal number of samples, explicitly accounting for the overhead of selection. Across experiments, autoguidance consistently improves sample quality and diversity. Early AJEST (applying selection only at the beginning of training) can match or modestly exceed autoguidance alone in data efficiency on both tasks. However, its time overhead and added complexity make autoguidance or uniform random data selection preferable in most situations. These findings suggest that while targeted online selection can yield efficiency gains in early training, robust sample quality improvements are primarily driven by autoguidance. We discuss limitations and scope, and outline when data selection may be beneficial.

2507.15970 2026-05-18 cs.SD cs.AI eess.AS 版本更新

CIS-BWE: Chaos-Informed Speech Bandwidth Extension

CIS-BWE: 基于混沌的语音带宽扩展

Tarikul Islam Tamiti, Tonmoy Das, Nursadul Mamun, Anomadarshi Barua

发表机构 * Chittagong University of Engineering and Technology(吉大港工程技术大学) George Mason University(乔治·梅森大学)

AI总结 本文提出NDSI-BWE框架,利用四种受非线性动力学系统启发的新判别器(共七个判别器)捕捉语音的复杂时间行为,并通过逐深度卷积减少判别器参数,提升语音带宽扩展性能。

详情
AI中文摘要

恢复因带宽限制丢失的高频成分对于电信和有限资源下的高保真音频应用至关重要。我们引入NDSI-BWE,一种新的对抗性带宽扩展(BWE)框架,利用四种受非线性动力学系统启发的新判别器捕捉多样的时间行为:多分辨率李雅普诺夫判别器(MRLD)通过捕捉确定性混沌来刻画对初始条件的敏感性;多尺度递归判别器(MS-RD)用于自相似递归动力学;多尺度去趋势分形分析判别器(MSDFA)用于长程、缓变、尺度不变的关系;多分辨率庞加莱图判别器(MR-PPD)用于捕捉隐藏的潜在空间关系。此外,多周期判别器(MPD)用于捕捉周期性模式,多分辨率振幅判别器(MRAD)和多分辨率相位判别器(MRPD)用于捕捉复杂的振幅-相位转换统计。通过在每个判别器的卷积块核心使用逐深度卷积,NDSI-BWE实现了八倍的参数减少。这七个判别器指导一个基于复数ConformerNeXt的生成器,采用双流Lattice-Net架构,同时优化幅度和相位。生成器利用基于Transformer的Conformer的全局依赖建模能力和ConvNeXt块的局部时间建模能力。在六个客观评估指标和由五名人类评委参与的主观测试中,NDSI-BWE在BWE中建立了新的SoTA。

英文摘要

Recovering high-frequency components lost to bandwidth constraints is crucial for applications ranging from telecommunications to high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial Bandwidth Extension (BWE) framework that leverages four new discriminators inspired by nonlinear dynamical systems to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD) for determining sensitivity to initial conditions by capturing deterministic chaos, a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics, a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long-range, slowly varying, scale-invariant relationships, and a Multi-Resolution Poincaré Plot Discriminator (MR-PPD) for capturing hidden latent-space relationships. These are complemented by a Multi-Period Discriminator (MPD) for cyclical patterns, and a Multi-Resolution Amplitude Discriminator (MRAD) and Multi-Resolution Phase Discriminator (MRPD) for capturing intricate amplitude-phase transition statistics. By using depth-wise convolution at the core of the convolutional block within each discriminator, NDSI-BWE attains an eight-fold parameter reduction. These seven discriminators guide a complex-valued ConformerNeXt-based generator with a dual-stream Lattice-Net-based architecture for simultaneous refinement of magnitude and phase. The generator leverages the transformer-based Conformer's global dependency modeling and the ConvNeXt block's local temporal modeling capability. Across six objective evaluation metrics and subjective tests with five human judges, NDSI-BWE establishes a new SoTA in BWE.

2507.14200 2026-05-18 cs.CL cs.AI cs.LG 版本更新

A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement

可扩展的多LLM协作系统:基于检索的选择与探索-利用驱动增强

Shengji Tang, Jianjian Cao, Weihao Lin, Jiale Hong, Bo Zhang, Shuyue Hu, Lei Bai, Tao Chen, Wanli Ouyang, Peng Ye

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出SMCS系统,通过检索优先选择模块和探索-利用驱动后验增强模块,有效协调多个开源语言模型,实验显示其在多个任务中优于闭源模型,且在不同数据集上超越开源模型的平均最佳结果。

详情
AI中文摘要

现有多LLM协作系统在整合新的大语言模型和任务时常面临可扩展性挑战,导致性能不佳。为此,我们提出SMCS,一种可扩展的多LLM协作系统,旨在有效协调多个开源大语言模型。系统包含两个核心模块:基于检索的先验选择模块(RPS),为每个输入动态选择最适合的大语言模型;探索-利用驱动的后验增强模块(EPE),通过混合评分机制促进响应多样性并选择高质量输出。在八个主流基准测试中,实验验证了系统的有效性:通过整合十五个开源大语言模型,SMCS在多个任务中优于主流的闭源大语言模型,例如GPT-4.1(+5.36%)和GPT-o3-mini(+5.28%)。值得注意的是,它甚至超越了开源大语言模型在不同数据集上最佳结果的平均值(+2.86%),显著推进了开源协作的实证性能前沿。代码已发布在https://github.com/magent4aci/SMCS。

英文摘要

Existing multi-LLM collaboration systems often encounter scalability challenges when integrating new LLMs and tasks, leading to suboptimal performance. To address this, we propose SMCS, a Scalable Multi-LLM Collaboration System designed to effectively coordinate multiple open-source LLMs. The system consists of two core components: a Retrieval-based Prior Selection (RPS) module, which dynamically selects the most suitable LLMs for each input, and an Exploration-Exploitation-Driven Posterior Enhancement (EPE) module, which fosters response diversity and selects high-quality outputs through a hybrid scoring mechanism. Experiments on eight mainstream benchmarks validate the effectiveness of our system: by integrating fifteen open-source LLMs, SMCS outperforms prevailing closed-source LLMs, e.g., GPT-4.1 (+5.36%) and GPT-o3-mini (+5.28%), across multiple tasks. Remarkably, it even exceeds the average of the best results on different datasets with open-source LLMs (+2.86%), significantly advancing the empirical performance frontier of open-source collaboration. The code is released at https://github.com/magent4aci/SMCS.

2506.22604 2026-05-18 cs.AI cs.HC cs.RO 版本更新

Bootstrapping Human-Like Planning via LLMs

通过大语言模型引导类人规划

David Porfirio, Vincent Hsiao, Morgan Fine-Morris, Leslie Smith, Laura M. Hiatt

发表机构 * Navy Center for Applied Research in AI, US Naval Research Laboratory(美国海军人工智能应用研究中心)

AI总结 本文研究如何结合自然语言接口与拖放界面,利用大语言模型生成人类风格的动作序列,并与手工指定的动作序列进行比较。

Comments Accepted by the 2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

详情
AI中文摘要

机器人终端用户日益需要能够指定机器人执行任务的可访问方法。两种常见的终端用户编程范式包括拖放界面和自然语言编程。尽管自然语言接口利用了人类沟通的直观形式,拖放界面使用户能够精确地规定机器人任务中的关键动作。在本文中,我们探讨这两种方法结合的程度。具体来说,我们构建了一个基于大语言模型(LLM)的管道,接受自然语言作为输入,并生成人类风格的动作序列作为输出,其细度水平与人类产生的相似。我们然后将生成的动作序列与另一组手工指定的动作序列进行比较。尽管我们的结果表明,较大的模型在生成人类风格的动作序列方面优于较小的模型,但较小的模型仍然实现了令人满意的性能。

英文摘要

Robot end users increasingly require accessible means of specifying tasks for robots to perform. Two common end-user programming paradigms include drag-and-drop interfaces and natural language programming. Although natural language interfaces harness an intuitive form of human communication, drag-and-drop interfaces enable users to meticulously and precisely dictate the key actions of the robot's task. In this paper, we investigate the degree to which both approaches can be combined. Specifically, we construct a large language model (LLM)-based pipeline that accepts natural language as input and produces human-like action sequences as output, specified at a level of granularity that a human would produce. We then compare these generated action sequences to another dataset of hand-specified action sequences. Although our results reveal that larger models tend to outperform smaller ones in the production of human-like action sequences, smaller models nonetheless achieve satisfactory performance.

2504.18361 2026-05-18 cs.CV cs.AI 版本更新

COCO-Inpaint: A Benchmark for Detecting and Localizing Inpainting-Based Image Manipulations

COCO-Inpaint:用于检测和定位基于修补的图像篡改的基准

Haozhen Yan, Yan Hong, Jiahui Zhan, Suning Lang, Yikun Ji, Huijia Zhu, Jun Lan, Jianfu Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团)

AI总结 本文提出COCO-Inpaint基准,用于检测和定位基于修补的图像篡改,通过高质样本、多样场景和大规模覆盖,揭示修补与真实区域的内在不一致。

Comments 6 pages, 8 figures

详情
AI中文摘要

近年来,图像篡改技术的进步使高逼真内容生成成为可能,但也降低了随意编辑的门槛,引发了对多媒体真实性和安全性的担忧。现有图像篡改检测与定位(IMDL)方法主要针对拼接或复制移动伪造,而基于修补的篡改基准仍有限。为弥合这一差距,我们提出了COCO-Inpaint,一个专门用于修补检测和定位的综合基准,主要贡献包括:1)由六个最先进的修补模型生成的高质量修补样本;2)通过四种掩码生成策略和可选文本引导实现的多样化生成场景;3)包含238,302张具有丰富语义多样性的修补图像的大规模覆盖。本基准旨在突出修补区域与真实区域之间的内在不一致,而非表面语义特征如物体形状。我们进一步建立了严格的评估协议,通过三个标准指标来评估现有IMDL方法,揭示当前趋势和挑战。

英文摘要

Recent advances in image manipulation have enabled highly photorealistic content generation, but also lowered the barrier to arbitrary editing, raising concerns about multimedia authenticity and security. Existing Image Manipulation Detection and Localization (IMDL) methods mainly target splicing or copy-move forgeries, while benchmarks for inpainting-based manipulations remain limited. To bridge this gap, we present COCO-Inpaint, a comprehensive benchmark specifically designed for inpainting detection and localization, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage of 238,302 inpainted images with rich semantic diversity. Our benchmark is constructed to highlight intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We further establish a rigorous evaluation protocol with three standard metrics to benchmark existing IMDL methods and reveal current trends and challenges.

2504.00289 2026-05-18 cs.CL cs.AI cs.CY 版本更新

Do Chinese models speak Chinese languages?

中国模型会说中国的各种语言吗?

Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno

发表机构 * Cornell University(康奈尔大学)

AI总结 本文通过比较中西方开源大模型的多语言能力,发现中国模型在多数语言上表现与西方模型相似,但对部分中国少数民族语言识别能力较弱,揭示了多语言发展中的优先级与权衡。

Comments First and second author contribute equally

详情
AI中文摘要

顶级开源大模型的发布巩固了中国在AI发展中的领先地位。这些模型支持中国使用的语言吗?还是与美国或欧洲开发的模型支持相同的语言?比较多语言能力对于两个原因很重要:首先,语言能力提供了关于预训练数据编纂的见解,从而揭示了资源分配和发展优先级;其次,中国模型开发者需要在服务于国内语言多样化的群体与优化全球可见基准(主要为英语)之间取得平衡。我们通过比较中国开发和西方开发的开源大模型,在21种语言变体(包括亚洲地区、中文和欧洲语言)上进行了研究。我们的信息平衡和阅读理解实验表明,中国模型在这些语言上的表现与西方模型高度相关(r=0.93),唯一的例外是中文表现更好。中国开发的模型在法语和德语方面表现良好,但有时无法识别中国少数民族语言,如哈萨克语和维吾尔语。总体而言,所有研究的开源大模型在多语言表现上相似,尽管模型开发者所处的语言和文化背景各不相同。我们将这种同质化解释为全球基准实践和共享训练资源影响的结果。而不是将当前语言支持视为不可避免,我们的结果强调多语言发展是一个优先级和权衡的空间,对模型开发者、政策制定者和用户都有影响。

英文摘要

The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they support the same languages as models developed in the United States or in Europe? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, Chinese model developers need to navigate the tension between serving a linguistically diverse population domestically, and optimizing for globally visible benchmarks that are predominantly English. We investigate Chinese model developers' priorities through a comparative study of Chinese-developed and Western-developed open-weight LLMs, on 21 language variants including Asian regional, Chinese, and European languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with their Western counterparts, with the sole exception being better Mandarin. Chinese-developed models are good at French and German, but they sometimes cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur. Overall, all open-weight LLMs we study have a similar multilingual performance profile, despite the diverse linguistic and cultural contexts the model developers operated within. We interpret the homogenization as consistent with the influence of global benchmarking practices and shared training resources. Rather than treating current language support as inevitable, our results highlight multilingual development as a space of prioritization and trade-offs, with implications for model developers, policymakers, and users.

2412.06853 2026-05-18 cs.LG cs.AI 版本更新

Tube Loss: A Novel Approach for Prediction Interval Estimation

Tube Loss:预测区间估计的一种新方法

Pritam Anand, Tathagata Bandyopadhyay, Suresh Chandra

发表机构 * Dhirubhai Ambani University (Formerly DA-IICT)(迪鲁拜·安巴尼大学,原DA-IICT) Indian Institute of Technology, Delhi(印度理工学院德里分校)

AI总结 本文提出Tube Loss损失函数,用于回归任务中同时估计预测区间边界。该方法能渐近达到指定置信水平,允许用户调整区间位置以优化覆盖范围和宽度,适用于偏斜分布。

详情
AI中文摘要

本文提出了一种名为'Tube Loss'的新损失函数,用于回归任务中同时估计预测区间(PI)的边界。基于Tube Loss最小化经验风险得到的PI在以下方面优于现有方法:首先,渐近达到指定置信水平t∈(0,1)。其次,用户可通过调整参数移动区间,以捕捉响应变量概率分布的密集区域,从而缩小区间宽度。该方法通过单个优化问题平衡覆盖范围和平均宽度,并通过重新校准进一步减少平均宽度。不同于部分现有方法,梯度下降法可用于最小化经验风险。通过大量实验,我们证明了基于Tube Loss的PI估计在核机和神经网络中的有效性,并展示了基于Tube Loss的深度概率预报模型在多个基准和风能数据集上优于现有概率预报技术。最后,我们在保形预测(conformal prediction)框架内实证验证了Tube Loss方法的优势。代码可在https://github.com/ltpritamanand/Tube_loss获取。

英文摘要

This paper proposes a novel loss function, called 'Tube Loss', for simultaneous estimation of the bounds of a Prediction Interval (PI) in the regression setup. The PIs obtained by minimizing the empirical risk based on the Tube Loss are shown to be of better quality than the PIs obtained by existing methods in the following sense. First, it yields intervals that attain the prespecified confidence level $t \in (0,1)$ asymptotically. A theoretical proof of this fact is given. Second, the user is allowed to move the interval up or down by controlling the value of a parameter. This helps the user choose a PI that captures denser regions of the probability distribution of the response variable inside the interval, thus reducing its width. This is shown to be especially useful when the conditional distribution of the response variable is skewed. Further, the Tube Loss-based PI estimation method can trade off between the coverage and the average width by solving a single optimization problem. It enables further reduction of the average width of the PI through re-calibration. Also, unlike a few existing PI estimation methods, the gradient descent (GD) method can be used for minimization of the empirical risk. Through extensive experiments, we demonstrate the effectiveness of Tube Loss-based PI estimation in both kernel machines and neural networks. Additionally, we show that Tube Loss-based deep probabilistic forecasting models achieve superior performance compared to existing probabilistic forecasting techniques across several benchmark and wind datasets. Finally, we empirically validate the advantages of the Tube Loss approach within the conformal prediction framework. Code is available at https://github.com/ltpritamanand/Tube_loss.
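The abstract evaluates prediction intervals by their coverage and their average width. The sketch below computes these two standard quantities for a set of intervals. It does not implement the Tube Loss itself, whose exact form is not given in the abstract; the function names and the toy numbers are hypothetical.

```python
def coverage(y, lower, upper):
    """Fraction of targets that fall inside their prediction interval."""
    inside = sum(1 for v, lo, hi in zip(y, lower, upper) if lo <= v <= hi)
    return inside / len(y)

def mean_width(lower, upper):
    """Average interval width; narrower is better at equal coverage."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Toy targets and interval bounds (made-up numbers).
y     = [1.0, 2.0, 3.0, 4.0]
lower = [0.0, 0.0, 4.0, 3.0]
upper = [2.0, 1.0, 5.0, 5.0]

print(coverage(y, lower, upper))   # compare against the target confidence level t
print(mean_width(lower, upper))
```

A method that attains the target level t should produce coverage close to t on held-out data; among such methods, the one with the smaller mean width gives the sharper intervals.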

2407.20240 2026-05-18 cs.CY cs.AI 版本更新

Social and Ethical Risks Posed by General-Purpose LLMs for Settling Newcomers in Canada

通用大型语言模型对加拿大新移民融入社会的潜在风险

Isar Nejadgholi, Maryam Molamohammadi, Samir Bakhtawar

发表机构 * National Research Council Canada(加拿大国家研究委员会) Mila - Quebec Artificial Intelligence Institute(魁北克人工智能研究所)

AI总结 研究探讨通用大语言模型在移民安置领域可能带来的风险,强调需开发定制化AI工具以确保人类监督与责任。

Comments 26 pages, 8 figures

详情
AI中文摘要

加拿大非营利安置部门支持新移民实现成功融入。该部门面临日益增长的操作压力,凸显了提高效率和创新的必要性,可能通过可靠的AI解决方案实现。随意使用通用生成式AI,如ChatGPT,可能成为移民和服务机构的常见做法,但这些工具未针对安置领域进行优化,可能对移民和难民产生有害影响。本文探讨这些工具可能对新移民造成的风险,警告避免未经监管的生成式AI使用,并鼓励进一步研究开发AI素养课程及定制化LLM,使其符合受影响社区的偏好。关键在于此类技术应无缝集成到安置部门现有流程中,确保人类监督、可信度和问责制。

英文摘要

The non-profit settlement sector in Canada supports newcomers in achieving successful integration. This sector faces increasing operational pressures amidst rising immigration targets, which highlights a need for enhanced efficiency and innovation, potentially through reliable AI solutions. The ad-hoc use of general-purpose generative AI, such as ChatGPT, might become a common practice among newcomers and service providers to address this need. However, these tools are not tailored for the settlement domain and can have detrimental implications for immigrants and refugees. We explore the risks that these tools might pose to newcomers in order to, first, warn against the unguarded use of generative AI and, second, incentivize further research and development in creating AI literacy programs as well as customized LLMs that are aligned with the preferences of the impacted communities. Crucially, such technologies should be designed to integrate seamlessly into the existing workflow of the settlement sector, ensuring human oversight, trustworthiness, and accountability.

2312.05975 2026-05-18 cs.CV cs.AI cs.LG 版本更新

FM-G-CAM: A Holistic Approach for Explainable AI in Computer Vision

FM-G-CAM:计算机视觉中可解释AI的综合方法

Ravidu Suien Rammuni Silva, Jordan J. Bird

发表机构 * Department of Computer Science Nottingham Trent University(诺丁汉特伦特大学计算机科学系)

AI总结 本文提出FM-G-CAM方法,通过综合考虑多个预测类别,提供CNN模型决策的全面解释,改进传统Grad-CAM的局限性。

详情
AI中文摘要

可解释性是现代AI在现实应用中的关键因素。本文旨在强调理解计算机视觉模型(特别是卷积神经网络)预测的必要性。现有方法主要基于梯度加权类激活图(Grad-CAM),仅关注单一目标类别,忽略了CNN预测过程的大部分内容。本文提出了一种全面的方法,称为融合多类梯度加权类激活图(FM-G-CAM),考虑多个高预测类别,提供预测器CNN的全面解释。我们还提供了详细数学和算法描述。此外,通过现实应用场景的定量和定性比较,展示了FM-G-CAM相较于Grad-CAM的优势。最后,我们提供了一个开源Python库,包含FM-G-CAM实现,方便生成CNN模型预测的显著图。

英文摘要

Explainability is a vital aspect of modern AI for real-world impact and usability. The main objective of this paper is to emphasise the need to understand the predictions of Computer Vision models, specifically Convolutional Neural Network (CNN) models. Existing methods for explaining CNN predictions are largely based on Gradient-weighted Class Activation Maps (Grad-CAM) and focus solely on a single target class; this assumption about the target class selection neglects a large portion of the predictor CNN's prediction process. In this paper, we present an exhaustive methodology, called Fused Multi-class Gradient-weighted Class Activation Map (FM-G-CAM), that considers multiple top-predicted classes and provides a holistic explanation of the predictor CNN's rationale. We also provide a detailed mathematical and algorithmic description of our method. Furthermore, alongside a concise comparison of existing methods, we compare FM-G-CAM with Grad-CAM, quantitatively and qualitatively highlighting its benefits through real-world practical use cases. Finally, we present an open-source Python library with an FM-G-CAM implementation to conveniently generate saliency maps for CNN-based model predictions.
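
To make the baseline concrete, the sketch below computes a standard Grad-CAM map for one class and then fuses the maps of several classes. The Grad-CAM step follows the usual definition (channel weights are spatially averaged gradients, followed by a ReLU). The fusion rule shown, keeping the strongest class per pixel, is an assumption made for illustration and is not taken from the paper; `fuse_top_classes` is a hypothetical helper name.

```python
def grad_cam(activations, gradients):
    """Standard Grad-CAM for one target class.

    activations, gradients: K feature maps, each H x W (nested lists).
    Returns the H x W map ReLU(sum_k alpha_k * A_k), where alpha_k is the
    spatial mean of the class-score gradient for feature map k.
    """
    h, w = len(activations[0]), len(activations[0][0])
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    return [
        [max(0.0, sum(a * act[i][j] for a, act in zip(alphas, activations)))
         for j in range(w)]
        for i in range(h)
    ]


def fuse_top_classes(cams):
    """Hypothetical fusion: per pixel, keep (class index, value) of the strongest map."""
    h, w = len(cams[0]), len(cams[0][0])
    fused = []
    for i in range(h):
        row = []
        for j in range(w):
            values = [cam[i][j] for cam in cams]
            best = max(range(len(cams)), key=lambda c: values[c])
            row.append((best, values[best]))
        fused.append(row)
    return fused


# Two 1x2 feature maps and the gradients of two class scores (toy values).
acts = [[[1.0, 0.0]], [[0.0, 1.0]]]
cam_a = grad_cam(acts, [[[1.0, 1.0]], [[-1.0, -1.0]]])
cam_b = grad_cam(acts, [[[-1.0, -1.0]], [[2.0, 2.0]]])
fused = fuse_top_classes([cam_a, cam_b])
print(cam_a, cam_b, fused)
```

A single-class map such as `cam_a` says nothing about the right-hand pixel, whereas the fused map attributes it to the second class, which is the kind of information a single-target explanation leaves out.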

2308.06822 2026-05-18 cs.LG cs.AI cs.CR math.OC 版本更新

Approximate and Weighted Data Reconstruction Attack in Federated Learning

联邦学习中的近似和加权数据重建攻击

Yongcun Song, Ziqi Wang, Enrique Zuazua

发表机构 * Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University(南洋理工大学数学科学学院,物理与数学科学学院) Chair for Dynamics, Control, Machine Learning and Numerics – Alexander von Humboldt Professorship, Department of Mathematics, Friedrich-Alexander-Universität Erlangen-Nürnberg(动态、控制、机器学习和数值学主席职位,数学系,埃尔兰根-纽伦堡弗里德里希-亚历山大大学) Chair of Computational Mathematics, Fundación Deusto(计算数学主席,德乌斯基金会) Departamento de Matemáticas, Universidad Autónoma de Madrid(数学系,马德里自治大学)

AI总结 本文提出了一种基于插值的近似方法,用于攻击联邦学习中的联邦平均场景,通过生成客户端本地训练过程中的中间模型更新,改进数据重建质量,并通过实验验证了其在图像数据重建中的优越性。

详情
AI中文摘要

联邦学习(FL)是一种分布式学习范式,允许多个客户端在不共享私人数据的情况下协作构建机器学习模型。尽管FL被设计为隐私保护,但最近的数据重建攻击表明,攻击者可以根据FL中共享的参数恢复客户端的训练数据。然而,大多数现有方法无法攻击最广泛使用的水平联邦平均(FedAvg)场景,其中客户端在多次本地训练步骤后共享模型参数。为了解决这个问题,我们提出了一种基于插值的近似方法,通过生成客户端本地训练过程中的中间模型更新,使攻击FedAvg场景成为可能。然后,我们设计了一种层间加权损失函数以提高数据重建质量。我们根据神经网络结构为不同层的模型更新分配不同的权重,权重通过贝叶斯优化进行调整。最后,实验结果验证了所提出的近似和加权攻击(AWA)方法在不同评估指标上优于其他最先进的方法,显示出在图像数据重建中的显著改进。

英文摘要

Federated Learning (FL) is a distributed learning paradigm that enables multiple clients to collaborate on building a machine learning model without sharing their private data. Although FL is considered privacy-preserving by design, recent data reconstruction attacks demonstrate that an attacker can recover clients' training data based on the parameters shared in FL. However, most existing methods fail to attack the most widely used horizontal Federated Averaging (FedAvg) scenario, where clients share model parameters after multiple local training steps. To tackle this issue, we propose an interpolation-based approximation method, which makes attacking FedAvg scenarios feasible by generating the intermediate model updates of the clients' local training processes. Then, we design a layer-wise weighted loss function to improve the quality of the data reconstruction. We assign different weights to model updates in different layers according to the neural network structure, with the weights tuned by Bayesian optimization. Finally, experimental results validate the superiority of our proposed approximate and weighted attack (AWA) method over the other state-of-the-art methods, as demonstrated by the substantial improvement in different evaluation metrics for image data reconstructions.
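
In FedAvg an observer sees only the weights a client received and the weights it sent back, not the iterates in between. The sketch below shows one simple way such intermediate states could be approximated, by linear interpolation between the two observed weight vectors. The abstract does not specify the interpolation scheme, so the linear choice and the function name are assumptions.

```python
def interpolate_updates(w_start, w_end, num_steps):
    """Approximate unobserved local iterates by linear interpolation.

    Returns num_steps weight vectors evenly spaced from w_start (exclusive)
    to w_end (inclusive).
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if len(w_start) != len(w_end):
        raise ValueError("weight vectors must have the same length")
    return [
        [s + (e - s) * k / num_steps for s, e in zip(w_start, w_end)]
        for k in range(1, num_steps + 1)
    ]


# Weights received by the client and weights returned after 4 local steps (toy values).
steps = interpolate_updates([0.0, 1.0], [1.0, 3.0], 4)
print(steps)
```

Differences between consecutive interpolated vectors can then stand in for per-step updates in a gradient-matching style attack.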

2306.04321 2026-05-18 cs.AI cs.MM 版本更新

Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

生成式语义通信:超越比特恢复的扩散模型

Eleonora Grassucci, Sergio Barbarossa, Danilo Comminiello

发表机构 * Dept. of Information Engineering, Electronics, and Telecommunication, Sapienza University of Rome(信息工程、电子与电信系,罗马萨皮恩扎大学)

AI总结 本文提出一种新的生成扩散框架,利用扩散模型合成多媒体内容并保留语义特征,通过空间自适应归一化生成语义一致的场景,提升在信道噪声下的图像生成质量。

Journal ref IEEE Transactions on Cognitive Communication and Networking, 2026

详情
AI中文摘要

语义通信被认为是下一代AI通信的核心之一。其可能使接收端能再生与传输内容语义等价的图像或视频,而无需恢复传输的位序列。当前解决方案仍缺乏从接收到的有限信息中构建复杂场景的能力。本文提出一种新的生成扩散指导框架,利用扩散模型在合成多媒体内容和保留语义特征方面的强大能力,通过发送高度压缩的语义信息来减少带宽使用。然后,扩散模型通过空间自适应归一化从去噪的语义信息中学习生成语义一致的场景。通过深入评估多个场景,证明我们的方法在接收到显著退化的内容时,仍能生成高质量的图像并保留语义信息。具体而言,即使在通信信道极其嘈杂的条件下,对象、位置和深度仍可识别。代码可在https://github.com/ispamm/GESCO获取。

英文摘要

Semantic communication is expected to be one of the cores of next-generation AI-based communications. One of the possibilities offered by semantic communication is the capability to regenerate, at the destination side, images or videos semantically equivalent to the transmitted ones, without necessarily recovering the transmitted sequence of bits. The current solutions still lack the ability to build complex scenes from the received partial information. Clearly, there is an unmet need to balance the effectiveness of generation methods and the complexity of the transmitted information, possibly taking into account the goal of communication. In this paper, we aim to bridge this gap by proposing a novel generative diffusion-guided framework for semantic communication that leverages the strong abilities of diffusion models in synthesizing multimedia content while preserving semantic features. We reduce bandwidth usage by sending highly-compressed semantic information only. Then, the diffusion model learns to synthesize semantic-consistent scenes through spatially-adaptive normalizations from such denoised semantic information. We prove, through an in-depth assessment of multiple scenarios, that our method outperforms existing solutions in generating high-quality images with preserved semantic information even in cases where the received content is significantly degraded. More specifically, our results show that objects, locations, and depths are still recognizable even in the presence of extremely noisy conditions of the communication channel. The code is available at https://github.com/ispamm/GESCO.

2210.13455 2026-05-18 cs.LG cs.AI 版本更新

Epistemic Monte Carlo Tree Search

认知蒙特卡洛树搜索

Yaniv Oren, Viliam Vadocz, Matthijs T. J. Spaan, Wendelin Böhmer

发表机构 * Delft University of Technology(代尔夫特理工大学)

AI总结 本文提出Epistemic MCTS,通过考虑认知不确定性提升搜索效率,在代码编写等稀疏奖励任务中表现更优。

详情
AI中文摘要

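AlphaZero/MuZero(A/MZ)系列算法通过将蒙特卡洛树搜索(MCTS)与学习到的模型相结合,在多个具有挑战性的领域取得了显著成功。学习到的模型会引入认知不确定性,这种不确定性源于从有限数据中学习,对稀疏奖励环境中的探索很有帮助。然而,MCTS并未考虑这种不确定性的传播。为此,我们提出Epistemic MCTS(EMCTS):一种有理论依据的方法,在搜索中考虑认知不确定性,并利用搜索实现深度探索。在使用汇编语言SUBLEQ编写代码这一具有挑战性的稀疏奖励任务中,AZ与我们的方法结合后,样本效率显著高于基线AZ。使用EMCTS的搜索能够解决常用的困难探索基准Deep Sea的多个变体(基线A/MZ实际上无法解决这些变体),并且速度远快于不使用搜索进行不确定性估计的等价方法,这表明利用搜索进行认知不确定性估计具有显著优势。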

英文摘要

The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse reward environments. However, MCTS does not account for the propagation of this uncertainty. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach to account for the epistemic uncertainty in search and harness the search for deep exploration. In the challenging sparse-reward task of writing code in the assembly language SUBLEQ, AZ paired with our method achieves significantly higher sample efficiency than baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea, which baseline A/MZ are practically unable to solve, much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.

2605.15769 2026-05-18 cs.RO cs.AI 版本更新

Lamarckian Inheritance in Dynamic Environments: How Key Variables Affect Evolutionary Dynamics

动态环境中的拉马克继承:关键变量如何影响进化动态

K. Ege de Bruin, Kyrre Glette, Kai Olav Ellefsen

发表机构 * Department of Informatics, University of Oslo, Norway(奥斯陆大学信息学院) RITMO, University of Oslo, Norway(奥斯陆大学RITMO)

AI总结 本文研究动态环境中关键变量对进化动态的影响,通过虚拟软机器人和两种学习方法,发现拉马克继承在环境变化冲突且不可预测时表现欠佳,但添加环境感知传感器可恢复其优势。

详情
AI中文摘要

在动态环境中机器人身体与控制器的共优化是一个耦合挑战:形态约束了哪些控制策略有效,而控制则决定了形态的表现。为了解决这一问题,我们结合形态优化作为进化与控制器优化作为生命周期学习,利用拉马克继承将学习到的控制器参数从父代传递给子代。在动态环境中,现有文献呈现矛盾证据:虽然传统进化理论通常认为拉马克继承无益,但最近的进化机器人研究显示它可以提高性能。我们假设这是因为以前的研究没有包含所有与动态环境相关的变量。在本工作中,我们发现拉马克继承的益处取决于两个变量:环境变化对机器人控制的冲突程度,以及这些变化对机器人代理的可预测性。使用虚拟软机器人和两种不同的学习方法,贝叶斯优化和强化学习,我们发现拉马克继承只在环境变化既冲突又不可预测时表现欠佳。我们发现添加一个检测环境变化的传感器可以恢复拉马克继承在冲突环境中的优势,通过允许机器人代理预测需要不同行为的需要,从而泛化其控制。

英文摘要

The co-optimization of a robot's body and brain presents a coupled challenge: the morphology constrains which control strategies are effective, while the control determines how well the morphology performs. To address this, we combine morphology optimization as evolution with controller optimization as lifetime learning, utilizing Lamarckian inheritance to transfer learned controller parameters from parent to offspring. In dynamic environments, existing literature presents conflicting evidence: while traditional evolutionary theory often suggests Lamarckian inheritance lacks benefit, recent studies in evolutionary robotics indicate it can improve performance. We hypothesize that this is because previous works have not included all relevant variables with dynamic environments. In this work, we show that the benefit of Lamarckian inheritance depends on two variables: how conflicting the environmental changes are to robot control, and the predictability of those changes for the robotic agent. Using virtual soft robots and two different learning approaches, Bayesian optimization and reinforcement learning, we show that Lamarckian inheritance only underperforms Darwinian inheritance when the changes are both conflicting and unpredictable. We find that adding a sensor to detect environmental changes restores the benefits for Lamarckian inheritance in conflicting environments, by allowing robotic agents to predict the need for a different behavior, thereby generalizing their control.

2605.15764 2026-05-18 cs.CV cs.AI 版本更新

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

GRASP:学习多个人非语言互动中的社会推理

Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Georgia Institute of Technology(佐治亚理工学院) Amazon AGI(亚马逊AGI) Korea University(高丽大学)

AI总结 GRASP将高层社会问答与细粒度的目光和指示手势事件相连接,提升多人非语言互动中的社会推理能力;数据集包含29万个问答对,并提出Social Grounding Reward以提升模型性能。

Comments Project page: https://social-reaoning.github.io/grasp/

详情
AI中文摘要

理解社会互动需要推理微妙的非语言线索,但当前多模态大语言模型(MLLMs)在多人视频中常无法识别谁与谁互动。我们引入GRASP,一个大规模社会推理数据集,将高层社会问答与细粒度的目光和指示手势事件连接起来。GRASP包含290K个问答对,覆盖46K个视频(总时长749小时),按16类分类体系组织,涵盖目光、手势及联合目光-手势推理,并配有用于评估的GRASP-Bench。不同于以往仅关注孤立线索或高层社会问答的资源,GRASP基于身份一致的目光轨迹、指示手势及其联合组合形成的社会事件来构建问题。此外,我们提出Social Grounding Reward(SGR),一种利用这些社会事件鼓励模型推理每次互动所涉及参与者的学习信号。实验显示,SGR在GRASP-Bench上提升性能,同时在相关社会视频问答基准上保持零样本性能。

英文摘要

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

2605.15763 2026-05-18 cs.CL cs.AI 版本更新

CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs

CompactQE: 通过小规模开源大语言模型实现可解释的翻译质量估计

Kamil Guttmann, Zofia Fraś, Artur Nowakowski, Krzysztof Jassem

发表机构 * Laniqo Faculty of Mathematics and Computer Science, Adam Mickiewicz University(亚当·密茨凯维奇大学数学与计算机科学学院)

AI总结 本文提出CompactQE,利用小规模开源大语言模型实现翻译质量估计,生成质量评分、错误标注、修正建议和完整润色,其性能优于传统指标和人类标注。

详情
AI中文摘要

当前最先进的机器翻译质量估计(QE)依赖于大规模专有LLM,引发数据隐私问题。我们证明较小的开源LLM(<30B参数)是可行、成本效益高且隐私保护的替代方案。使用单次提示策略,我们的模型同时生成质量评分、MQM错误标注、建议的错误修正和完整的润色。我们的分析表明,这些模型在系统层面与人类判断的关联性很高,优于传统神经度量、微调模型和人类标注者一致性,有效逼近更大专有LLM的能力。

英文摘要

Current state-of-the-art Quality Estimation (QE) in machine translation relies on massive, proprietary LLMs, raising data privacy concerns. We demonstrate that smaller, open-source LLMs (<30B parameters) are a viable, cost-effective and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, MQM error annotations, suggested error corrections, and full post-editions. Our analysis shows these models achieve highly competitive system-level correlations with human judgments that outperform traditional neural metrics, fine-tuned models, and human inter-annotator agreement, effectively approximating the capabilities of much larger proprietary LLMs.

2605.15736 2026-05-18 cs.CV cs.AI 版本更新

BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation

BiomedAP: 一种基于视觉的双锚框架与门控跨模态融合用于鲁棒的医学视觉-语言适应

Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen

发表机构 * Wenzhou University(温州大学) Wenzhou Business College(温州商务学院)

AI总结 BiomedAP通过门控跨模态融合和双锚约束机制,提升医学视觉-语言模型在提示变化下的鲁棒性,实验显示其在多个基准上均优于基线方法。

Comments CVPR2026 Workshop

详情
AI中文摘要

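生物医学视觉-语言模型(VLMs)在小样本医学诊断中展现出显著潜力,但面临一个关键瓶颈:对提示变化的脆弱性。现有适应框架通常将视觉提示和文本提示作为相互独立的流进行优化,并依赖理想的“黄金提示”。在临床实际中,描述往往带有噪声且异质,这种模态隔离会导致跨模态对齐不稳定。为此,我们提出BiomedAP,一种基于视觉信息的双锚框架,带有门控跨模态融合。BiomedAP通过两种机制实现协同对齐:(1) 门控跨模态融合,支持模态间的逐层交互,并作为动态噪声调节器抑制无关的文本线索;(2) 双锚约束,将可学习提示正则化到稳定的语义质心,这些质心分别来自专家模板(高锚点)和小样本视觉原型(低锚点)。在11个基准上的大量实验表明,BiomedAP始终优于基线方法,取得了有竞争力的小样本准确率,并在提示扰动下显著提升了鲁棒性。代码可在https://github.com/tongdiedie/BiomedAP获取。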

英文摘要

Biomedical Vision-Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning

2605.15734 2026-05-18 cs.AI 版本更新

Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

我们能否信任AI推断的用户状态。一种用于验证LLMs在操作环境中对用户状态分类可靠性的心理测量框架

Izabella Krzeminska, Michal Butkiewicz, Ewa Komkowska

发表机构 * Orange Research, AI Center(Orange研究院、人工智能中心)

AI总结 本文通过实证测试检验了使用大语言模型评估用户状态所依赖的假设,考察了AI测量的心理测量学可靠性,并提出可复制的评估框架,以支持更可靠的自适应系统AI设计。

Comments Full survey article with data tables for further possible replicability and comparison

详情
AI中文摘要

使用大语言模型来评估对话和自适应系统中的用户状态是基于一种假设,即用于此类评估的指标在个体分数层面是稳定且可解释的。本文通过实证测试检验这一假设,重点研究对用户状态的人工智能(AI)测量的心理测量学可靠性。本研究采用重复评估程序,评估了三个不同双模态大语言模型(GPT-4o音频、Gemini 2.0 Flash、Gemini 2.5 Flash)上一组广泛指标的可重复性。分析包括个体分数可靠性和聚合可靠性,使我们能够区分可能对实时适应有用的指标,以及仅在聚合分析中保留价值的指标。结果表明,在解释性领域中,指标的可靠性不能被视为默认属性。个体分数层面缺乏稳定性,使得在实时自适应系统中无法将这些分数解释为用户状态的指标,即使这些指标在聚合后表现出稳定性。同时,本研究指出,在个体层面不稳定的指标仍可在事后研究中保留分析效用,用于识别支配交互的规则及其与满意度、信任和参与度等用户体验参数的关系。本文的主要贡献,除了量化问题的严重性(213个指标中只有31个符合标准)外,还在于提出了一个可复制的评估框架,使指标适用性的可测量评估成为可能。这种方法支持更负责任的自适应系统AI设计,其中结果的解释需要显式验证可靠性,并随时间监测是否出现违背可靠性的情况。

英文摘要

The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper empirically tests this assumption, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. This study employed replication evaluation procedures to assess the repeatability of a broad set of metrics across three different bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). Analyses include both individual score reliability and aggregated reliability, allowing us to distinguish metrics potentially useful for real-time adaptation from those that retain their value only in aggregated analyses. The results demonstrate that metric reliability cannot be considered a default property in interpretive domains. The lack of stability at the level of individual scores precludes the interpretation of such scores as indicators of user state in real-time adaptive systems, even if these metrics demonstrate stability after aggregation. At the same time, the study indicates that individually unstable metrics can retain analytical utility in post-hoc studies, identifying rules governing interactions and their relationships with user experience parameters such as satisfaction, trust, and engagement. The main contribution of this work, besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), is the proposal of a replicable evaluation framework, enabling measurable evaluations of metric applicability. This approach supports more responsible AI design of adaptive systems, in which the interpretation of results requires explicit validation of reliability and monitoring for violations over time.

2605.15733 2026-05-18 cs.NE cs.AI cs.CV 版本更新

Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model

受海马-内嗅皮层启发的世界模型中的结构抽象与泛化

Tianqiu Zhang, Muyang Lyu, Xiao Liu, Si Wu

发表机构 * Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, IDG/McGovern Institute for Brain Research, Center of Quantitative Biology, School of Psychological and Cognitive Sciences, Key Laboratory of Machine Perception (Ministry of Education), Peking University(北京大学-清华大学生命科学中心,先进跨学科研究院,IDG/麦克戈文脑科学研究院,定量生物学中心,心理与认知科学学院,机器感知重点实验室(教育部),北京大学)

AI总结 本文提出了一种脑启发的分层模型,通过逆向模型提取潜在转换并构建预测视觉世界模型,展示了在连续高维动态中同时提取抽象结构的能力,实现了结构泛化。

Comments Project page: https://hpc-mec-worldmodel.github.io/

详情
AI中文摘要

人类将经验抽象为结构化表示以促进模式推断和知识迁移。尽管海马-内嗅皮层(HPC-MEC)回路已知能表示空间和概念空间,但如何从连续、高维的动态中同时提取抽象结构,其机制仍不清楚。我们提出了一种脑启发的分层模型,同时推断潜在转换并构建预测性的视觉世界模型。该架构采用逆向模型进行结构提取,同时结合HPC-MEC耦合模型,将关系结构(MEC)与整合的情景场景(HPC)分离。以基本变换动态作为基准,我们展示了该模型的结构抽象能力。通过利用速度驱动的路径整合,该框架能够在不同情境中实现稳健的预测和结构重用,从而实现结构泛化。本文提供了一个新的计算框架,用于理解脑启发的世界模型自监督学习如何促进可重用抽象知识的获取。

英文摘要

Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampal-entorhinal (HPC-MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high-dimensional dynamics remain poorly understood. We propose a brain-inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC-MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using primitive transformation dynamics as a benchmark, we demonstrate the model's capacity for structural abstraction. By leveraging velocity-driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain-inspired, self-supervised learning of world models facilitates the acquisition of reusable abstract knowledge.

2605.15728 2026-05-18 cs.CV cs.AI 版本更新

DecomPose: Disentangling Cross-Category Optimization Contention for Category-Level 6D Object Pose Estimation

DecomPose:解耦跨类优化冲突以实现类别级6D物体姿态估计

Yifan Gao, Lu Zou, Zhangjin Huang, Guoping Wang

发表机构 * Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, Hubei, China(智能机器人湖北省重点实验室,武汉工程大学,武汉,湖北,中国) University of Science(科学技术大学) Peking University, Beijing, China(北京大学,北京,中国)

AI总结 本文提出DecomPose框架,通过数据驱动的难度代理和不对称分支策略,解耦跨类优化冲突,提升类别级6D姿态估计性能。

详情
AI中文摘要

类别级6D物体姿态估计通常被建模为多类联合学习问题,但类别间的几何异质性导致共享模块中不兼容的优化信号纠缠,产生梯度冲突和负迁移。为此,我们首先引入基于梯度的诊断方法量化模块级跨类冲突。基于诊断结果,我们提出DecomPose框架,通过难度感知的梯度解耦和稳定性驱动的不对称分支策略,缓解优化冲突:(1) 难度感知的梯度解耦通过数据驱动的难度代理将类别分组,并将每个实例路由到组特定的对应分支以隔离不兼容的更新;(2) 稳定性驱动的不对称分支将更高容量的分支分配给结构简单的类别作为稳定的优化锚点,同时通过轻量级分支约束复杂类别以抑制噪声更新并缓解负迁移。在REAL275、CAMERA25和HouseCat6D上的大量实验表明,DecomPose有效减少了跨类优化冲突,并在多个基准上实现了优越的姿态估计性能。

英文摘要

Category-level 6D object pose estimation is typically formulated as a multi-category joint learning problem with fully shared model parameters. However, pronounced geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, resulting in gradient conflicts and negative transfer during training. To address this challenge, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on results of diagnostics, we propose DecomPose, a difficulty-aware decomposition framework that mitigates optimization contention via: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D demonstrate that DecomPose effectively reduces cross-category optimization contention and delivers superior pose estimation performance across multiple benchmarks.

2605.15726 2026-05-18 cs.AI cs.CL 版本更新

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

走出舒适区:为RLVR的高效策略引导探索

Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出NudgeRL框架,通过策略引导实现结构化和多样性探索,提升RLVR在数学基准上的表现,相比标准GRPO和oracle引导方法更高效。

Comments 28 pages, 7 figures

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为提升大语言模型推理能力的可扩展范式。然而,其效果从根本上受限于探索:策略只能在已采样的轨迹上改进。增加采样轨迹数量可缓解此问题,但这种暴力扩展计算成本高昂,而现有修改优化目标的方法对探索内容的控制有限。本文提出NudgeRL,一个面向RLVR的结构化、多样性驱动的探索框架。该方法引入策略引导(Strategy Nudging),以轻量的策略级上下文作为每条轨迹的条件,在不依赖昂贵oracle监督的情况下诱导多样化的推理轨迹。为了有效地从这种结构化探索中学习,我们进一步提出统一目标,将奖励信号分解为上下文间和上下文内两个分量,并结合蒸馏目标将发现的行为迁移回基础策略。实验表明,NudgeRL优于轨迹预算最多大8倍的标准GRPO,并在五项具有挑战性的数学基准上平均优于oracle引导的RL基线。这些结果表明,结构化、上下文驱动的探索可以作为一种高效且可扩展的替代方案,替代暴力的轨迹扩展和基于特权信息的可行性导向方法。代码可在https://github.com/tally0818/NudgeRL获取。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, while outperforming an oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
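
The abstract mentions splitting the reward signal into inter- and intra-context components. One natural way to read this, shown below as a sketch and not as the paper's actual objective, is a mean-based decomposition: the inter-context term compares a context's mean reward with the global mean, and the intra-context term compares each rollout with its own context's mean. The context names and the function name are hypothetical.

```python
from statistics import mean


def decompose_advantages(rewards_by_context):
    """Split each rollout's centered reward into (inter, intra) parts.

    rewards_by_context: dict mapping a strategy context to its rollout rewards.
    For every rollout, inter + intra equals reward - global mean.
    """
    all_rewards = [r for rs in rewards_by_context.values() for r in rs]
    global_mean = mean(all_rewards)
    result = {}
    for ctx, rs in rewards_by_context.items():
        ctx_mean = mean(rs)
        result[ctx] = [(ctx_mean - global_mean, r - ctx_mean) for r in rs]
    return result


# Two hypothetical strategy contexts with binary verifiable rewards.
adv = decompose_advantages({"algebraic": [1.0, 0.0], "geometric": [0.0, 0.0]})
print(adv)
```

Under this reading, the inter-context term rewards strategies that succeed more often overall, while the intra-context term rewards the better rollouts within a given strategy.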

2605.15725 2026-05-18 cs.CV cs.AI cs.RO 版本更新

DiLA: Disentangled Latent Action World Models

DiLA:解耦的潜在动作世界模型

Tianqiu Zhang, Muyang Lyu, Yufan Zhang, Fang Fang, Si Wu

发表机构 * Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, IDG/McGovern Institute for Brain Research, Peking University(北京大学-清华生命科学中心,先进跨学科研究院,IDG/麦克戈文脑科学研究院,北京大学) Center of Quantitative Biology, Peking University(北京大学定量生物学中心) School of Psychological and Cognitive Sciences, Key Laboratory of Machine Perception (Ministry of Education), Peking University(心理与认知科学学院,机器感知重点实验室(教育部),北京大学)

AI总结 DiLA通过内容-结构解耦解决动作抽象与生成保真度的平衡问题,实现高质量视频生成和动作迁移。

Comments Project Page: http://disentangled-latent-action-world-models.github.io

详情
AI中文摘要

潜在动作模型(LAMs)通过推断连续帧间的抽象动作来学习世界模型,但面临动作抽象与生成保真度的权衡问题。现有方法通常通过两阶段训练或限制预测到光流来解决。本文提出DiLA,一种解耦的潜在动作世界模型,通过内容-结构解耦解决这一权衡。我们的关键发现是解耦和潜在动作学习是共演进的:潜在动作学习中的预测瓶颈驱动解耦,迫使模型将空间布局压缩到结构路径,同时将视觉细节卸载到单独的内容路径进行生成。这种协同作用产生了一个连续且语义结构化的潜在动作空间,而不牺牲生成质量。DiLA在视频生成质量、动作迁移、视觉规划和流形可解释性方面表现优异。这些发现确立了DiLA作为统一框架,同时实现高层动作抽象和高保真生成,推动了自监督世界模型学习的前沿。

英文摘要

Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.

2605.15722 2026-05-18 cs.LG cs.AI cs.CV eess.SP 版本更新

Bidirectional Fusion Guided by Cardiac Patterns for Semi-Supervised ECG Segmentation

由心脏模式引导的双向融合用于半监督ECG分割

Jeonghwa Lim, Minje Park, Sunghoon Joo

发表机构 * VUNO Inc.(VUNO公司)

AI总结 本文提出CardioMix框架,通过心脏模式引导的双向CutMix策略提升ECG分割性能,实验表明其在多种数据集和标注比例下均优于现有方法。

Comments 11 pages, 6 figures, 6 tables

详情
AI中文摘要

心电图(ECG)的精确描绘,即对有意义的波形特征进行分割,对心血管诊断至关重要。然而,标注数据稀缺给深度学习模型训练带来了重大挑战。传统半监督语义分割(SemiSeg)方法主要关注未标注数据的一致性,未能充分利用标注集与未标注集之间的信息交换。为此,我们引入CardioMix,一个基于心脏模式引导的双向CutMix策略的ECG分割框架。该方法用来自未标注数据的真实变化丰富标注集,同时对未标注集施加更强的监督信号,因为心脏模式引导的混合确保所有增强样本在生理上仍有意义。本框架被设计为即插即用模块,与各种SemiSeg算法具有高度兼容性。在用于ECG描绘的公共多数据集基准SemiSegECG上的大量实验表明,CardioMix在多种数据集和标注比例下均优于现有基于CutMix的融合策略,并可作为即插即用模块兼容各种SemiSeg算法。

英文摘要

Accurate electrocardiogram (ECG) delineation, i.e., the segmentation of meaningful waveform features, is crucial for cardiovascular diagnostics. However, the scarcity of annotated data poses a significant challenge for training deep learning models. Conventional semi-supervised semantic segmentation (SemiSeg) methods primarily focus on consistency from unlabeled data, underutilizing the information exchange possible between labeled and unlabeled sets. To address this, we introduce CardioMix, a framework built on a bidirectional CutMix strategy guided by cardiac patterns for ECG segmentation. This approach enriches the labeled set with realistic variations from unlabeled data while simultaneously applying stronger supervisory signals to the unlabeled set, as the cardiac pattern-guided mixing ensures all augmented samples remain physiologically meaningful. Our framework is designed as a plug-and-play module, demonstrating high compatibility with various SemiSeg algorithms. Extensive experiments on SemiSegECG, a public multi-dataset benchmark for ECG delineation, demonstrate that CardioMix consistently outperforms existing CutMix-based fusion strategies across diverse datasets and labeled ratios, while remaining a plug-and-play module compatible with various SemiSeg algorithms.
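
The bidirectional mixing idea can be illustrated with a one-dimensional CutMix on per-sample labels. This is a minimal sketch under stated assumptions: the cut boundaries are passed in by the caller and are assumed to be aligned with cardiac cycles already (how CardioMix chooses them is not described in the abstract), the unlabeled signal carries pseudo-labels, and the label ids (0 background, 2 QRS, 3 T wave) are hypothetical.

```python
def cutmix_1d(sig_a, lab_a, sig_b, lab_b, start, end):
    """Paste samples [start, end) of (sig_b, lab_b) into copies of (sig_a, lab_a)."""
    if not (len(sig_a) == len(lab_a) == len(sig_b) == len(lab_b)):
        raise ValueError("signals and label sequences must have equal length")
    if not 0 <= start <= end <= len(sig_a):
        raise ValueError("invalid cut range")
    mixed_sig = sig_a[:start] + sig_b[start:end] + sig_a[end:]
    mixed_lab = lab_a[:start] + lab_b[start:end] + lab_a[end:]
    return mixed_sig, mixed_lab


def bidirectional_cutmix(labeled, unlabeled, start, end):
    """Mix in both directions; each argument is a (signal, labels) pair."""
    into_labeled = cutmix_1d(*labeled, *unlabeled, start, end)
    into_unlabeled = cutmix_1d(*unlabeled, *labeled, start, end)
    return into_labeled, into_unlabeled


labeled = ([0.1, 0.9, 0.2, 0.0], [0, 2, 3, 0])
unlabeled = ([0.0, 0.8, 0.3, 0.1], [0, 2, 0, 0])  # labels here are pseudo-labels
into_labeled, into_unlabeled = bidirectional_cutmix(labeled, unlabeled, 1, 3)
print(into_labeled, into_unlabeled)
```

The first result adds unlabeled variation to a labeled sample, and the second injects ground-truth labels into an unlabeled sample, which mirrors the two directions described in the abstract.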

2605.15714 2026-05-18 cs.SE cs.AI 版本更新

Position: Early-Stage Quality Assurance in Annotation Pipelines Is More Cost-Effective Than Late-Stage Validation

立场:标注流程早期阶段的质量保证比后期验证更具成本效益

Sunil Kothari, Sumukha Sharma Thoppanahalli Chandramouli, Naman Khandelwal, Parth Kulshreshtha, Ashi Jain, Kriti Banka, Tanuja Chintada, Venkata Triveni, Gulipalli Praveen Kumar, Manish Mehta, Tao Liu

发表机构 * Centific AI Research(Centific人工智能研究)

AI总结 本文指出标注流程早期质量保证比后期验证更有效,强调时间因素对误差率和成本的影响,提出三种质量保证触发点并建议改进研究和实践方法。

Comments 8 pages

详情
AI中文摘要

本文主张机器学习社区应优先考虑标注流程中的早期质量保证,而非当前普遍采用的后期验证。数据质量瓶颈日益限制基础模型的改进,然而质量保证研究几乎只关注验证方法,而忽视验证时机。决定误差率和标注成本的,不仅是采用何种方法,更根本的是验证在何时进行。考虑到软件工程中早已确立的“左移”原则,这种对时机的忽视令人费解:实证研究表明,缺陷在后期才被发现时,成本会增至4-100倍(Boehm, 1981; Shull et al., 2002)。标注流程展现出类似的规律:在标注开始前发现的错误,其成本只是审查周期结束后才发现的错误的一小部分。我们提出包含三个质量保证触发点的分类,即标注前(T0)、标注后(T1)和审查后(T2),将标注工作流分解为离散的验证机会。一个参数化的误差传播模型形式化地刻画了时机何时会影响最终误差率、何时只影响经济成本,使时机成为可测量的设计变量,而非事后才考虑的配置项。对47篇近期论文的调查发现,仅有4%报告了验证发生的时间;鉴于时机在相邻领域已被证实的影响,这一空白十分突出。如果不明确关注质量保证的时机,社区就可能在优化验证方法的同时,忽略了或许最为关键的结构性变量。落实这一立场需要三个步骤:研究人员应在报告验证方法的同时报告质量保证的时机配置;标注平台应将时机作为一等参数开放;社区应开展受控实验,直接测量各阶段的检测率。

英文摘要

This position paper argues that the machine learning community should prioritize early-stage quality assurance in annotation pipelines over the prevailing practice of late-stage validation. Data quality bottlenecks increasingly limit foundation model improvement, yet quality assurance research focuses almost exclusively on validation methods rather than validation timing. When validation occurs, not merely what methods are employed, fundamentally determines both error rates and annotation costs. This temporal neglect is puzzling given the well-established "shift-left" principle from software engineering, where empirical studies demonstrate 4-100x cost multipliers for defects detected in later stages (Boehm, 1981; Shull et al., 2002). Annotation pipelines exhibit analogous dynamics: errors caught before annotation begins cost a fraction of those discovered after review cycles complete. We propose a taxonomy of three QA trigger points, namely pre-annotation (T0), post-annotation (T1), and post-review (T2), that decompose annotation workflows into discrete validation opportunities. A parametric error-propagation model formalizes when timing affects final error rates versus only economics, making timing a measurable design variable rather than a configuration afterthought. A survey of 47 recent papers reveals that only 4% report when validation occurs, a striking gap given timing's demonstrated impact in adjacent fields. Without explicit attention to QA timing, the community risks optimizing validation methods while ignoring the structural variable that may matter most. Acting on this position requires three steps: researchers should report QA timing configurations alongside validation methods; annotation platforms should expose timing as a first-class parameter; and the community should run controlled experiments that measure stage-specific detection rates directly.
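
A toy calculation may help show why the trigger point matters. It is not the paper's parametric model: it simply assumes that each stage catches a fixed fraction of the remaining errors and that fixing an error costs more the later it is caught. The stage names T0/T1/T2 come from the abstract; the function `qa_outcome`, the detection rates, and the cost multipliers are made-up values for illustration.

```python
def qa_outcome(initial_errors, detect_rates, fix_costs):
    """Propagate errors through successive QA stages.

    detect_rates[i]: assumed fraction of remaining errors caught at stage i.
    fix_costs[i]:    assumed cost of fixing one error caught at stage i.
    Returns (residual_errors, total_fix_cost).
    """
    remaining = float(initial_errors)
    total_cost = 0.0
    for rate, cost in zip(detect_rates, fix_costs):
        caught = remaining * rate
        total_cost += caught * cost
        remaining -= caught
    return remaining, total_cost

# Hypothetical per-error fix costs at T0 (pre-annotation), T1, and T2.
costs = [1.0, 4.0, 20.0]

# Same detection power, placed at the earliest versus the latest trigger point.
early = qa_outcome(100, [0.8, 0.0, 0.0], costs)
late = qa_outcome(100, [0.0, 0.0, 0.8], costs)
print("early:", early)
print("late: ", late)
```

With these numbers the residual error count is the same in both runs and only the cost differs, which corresponds to the "only economics" case mentioned in the abstract. Timing would change the final error rate as well if, for example, the achievable detection rate differed between stages.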

2605.15713 2026-05-18 cs.RO cs.AI 版本更新

Learning Dynamic Pick-and-Place for a Legged Manipulator

学习足式移动机械臂的动态抓取与放置

Moonkyu Jung, Jiseong Lee, Zhengmao He, Donghoon Youm, Juhyeok Mun, HyeongJun Kim, Hyunsik Oh, Donghyuk Choi, Jungwoo Hur, Jie Song, Jemin Hwangbo

发表机构 * Robotics and Artificial Intelligence Lab, KAIST(机器人与人工智能实验室,韩国科学技术院)

AI总结 本文提出一种分层强化学习框架,用于四足机械臂的动态抓取与放置任务,通过模拟和现实实验验证了其在不同负载和工作空间下的高成功率。

Comments Accepted to IEEE Robotics and Automation Letters 2026

Journal ref IEEE Robotics and Automation Letters, vol. 11, no. 6, pp. 7652-7659, 2026

详情
AI中文摘要

足式移动机械臂将敏捷移动与灵活的机械臂控制相结合,使机器人的能力超越了静态操作。然而,在保持协调移动的同时实现精确操作仍是一个重大挑战。本文提出了一种分层强化学习框架,用于配备6自由度机械臂的四足机器人执行动态抓取与放置任务。该框架包含一个显式的质量估计模块,能够对不同重量的物体实现自适应全身控制。在仿真中,系统在负载最高2.3 kg的条件下达到86.05%的成功率。该方法进一步在六个代表性场景的真实实验中得到验证,这些场景对物体物理属性(尺寸和质量)和任务高度进行了受控变化。具体而言,在从地面到1.1 m高桌面的较大垂直工作空间内,系统在负载最高1.3 kg时的平均成功率为73.3%,平均执行时间为4.06 s。与以往只处理轻质物体、以缓慢的分段动作完成抓取与放置的工作不同,本框架利用移动与操作的并行进行,实现动态、连续的执行。这些结果展示了四足移动机械臂在更重负载和更大工作空间下完成自适应全身抓取与放置的潜力。

英文摘要

Legged manipulators extend robotic capabilities beyond static manipulation by integrating agile locomotion with versatile arm control. However, achieving precise manipulation while maintaining coordinated locomotion remains a major challenge. This work presents a hierarchical reinforcement learning framework for dynamic pick-and-place tasks using a quadruped equipped with a 6-DOF robotic arm. The framework incorporates an explicit mass estimation module enabling adaptive whole-body control for objects with varying weights. In simulation, the system achieves an 86.05% success rate with payloads up to 2.3 kg. The approach is further validated through real-world experiments across six representative scenarios with controlled variations in object physical properties (size and mass) and task heights. Specifically, within a wide vertical workspace ranging from ground level to 1.1 m-high tabletops, the system demonstrates an average success rate of 73.3% for payloads up to 1.3 kg, with an average execution time of 4.06 s. Unlike prior works that handle lightweight objects and execute pick-and-place with slow, piecewise motions, the proposed framework exploits concurrent locomotion and manipulation for dynamic, continuous execution. These results demonstrate the potential of quadrupedal mobile manipulators for adaptive, whole-body pick-and-place with heavier payloads and extended workspaces.

2605.15705 2026-05-18 cs.RO cs.AI 版本更新

Feedback World Model Enables Precise Guidance of Diffusion Policy

反馈世界模型实现对扩散策略的精确引导

Tuo An, Jindou Jia, Gen Li, Jingliang Li, Chuhao Zhou, Pengfei Liu, Bofan Lyu, Jiaqi Bai, Xinying Guo, Geng Li, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University(南洋理工大学MARS实验室)

AI总结 本文提出反馈世界模型,通过实时反馈修正预测误差,提升机器人决策性能,实验显示在分布偏移下预测准确率和策略表现显著提升。

Comments 21 pages, 9 figures

详情
AI中文摘要

世界模型旨在通过预测动作的后果来改进机器人决策。然而在实践中,一旦机器人遇到训练分布之外的状态,其预测往往变得不可靠,限制了部署时的效果。我们注意到,执行过程本身提供了一个自然但未被充分利用的信号:每次动作之后,机器人都会直接观测到真实的下一状态,从而暴露预测结果与实际结果之间的偏差。基于这一见解,我们提出反馈世界模型,这是一种在推理时闭合预测与观测之间回路的新范式。我们的方法不把世界模型当作静态的开环预测器,而是维护一个轻量级的反馈状态并在线更新,以迭代修正未来预测,利用实时观测补偿模型误差,且无需额外训练数据或参数更新。我们证明这一过程可以解释为潜空间观测器,并在温和条件下具有收敛保证。我们进一步引入动作感知引导,通过强调动作可控的成分并抑制无关变化,更好地将修正后的预测转化为控制。在LIBERO-Plus、Robomimic和真实世界操作任务上的实验表明,我们的方法在分布偏移下显著提高了预测准确性和策略性能。特别是,它将世界模型预测误差最多降低76.4%,并将分布外(OOD)成功率提高30%。这些结果表明,在推理时引入实时反馈是静态世界建模之外一种简单而有力的替代方案。

英文摘要

World models aim to improve robotic decision making by predicting the consequences of actions. However, in practice, their predictions often become unreliable once the robot encounters states outside the training distribution, limiting their effectiveness at deployment. We observe that execution itself provides a natural but underutilized signal: after each action, the robot directly observes the true next state, revealing the mismatch between predicted and actual outcomes. Building on this insight, we propose feedback world model, a new paradigm that closes the loop between prediction and observation at inference time. Instead of treating the world model as a static open-loop predictor, our method maintains a lightweight feedback state that is updated online to iteratively correct future predictions, compensating for model errors using real-time observations without additional training data or parameter updates. We show that this process can be interpreted as a latent-space observer and admits convergence guarantees under mild conditions. We further introduce action-aware guidance to better translate corrected predictions into control by emphasizing action-controllable components while suppressing irrelevant variations. Experiments on LIBERO-Plus, Robomimic, and real-world manipulation tasks demonstrate that our method substantially improves both prediction accuracy and policy performance under distribution shift. In particular, it reduces world model prediction error by up to 76.4% and improves out-of-distribution (OOD) success rate by 30%. These results show that incorporating real-time feedback at inference time provides a simple yet powerful alternative to static world modeling.
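
The closed-loop idea can be illustrated with a deliberately simple linear toy, in the spirit of a classical Luenberger-style observer. Everything here is an assumption made for illustration: the matrices, the constant `shift` standing in for distribution shift, the scalar `gain`, and the function name `rollout_errors` are hypothetical, and the paper's actual feedback state lives in a learned latent space.

```python
import numpy as np

def rollout_errors(gain, steps=50, seed=0):
    """One-step prediction errors of a mis-specified linear model.

    gain = 0 gives the open-loop predictor; gain > 0 updates a feedback
    state online from the observed residual, with no weight updates.
    """
    rng = np.random.default_rng(seed)
    A = 0.9 * np.eye(2)
    B = np.eye(2)
    shift = np.array([0.5, -0.3])  # unmodeled offset (stand-in for distribution shift)
    z = np.zeros(2)
    c = np.zeros(2)                # feedback state
    errors = []
    for _ in range(steps):
        a = rng.normal(size=2)
        pred = A @ z + B @ a + c          # corrected one-step prediction
        z_next = A @ z + B @ a + shift    # what the environment actually does
        residual = z_next - pred          # mismatch revealed by execution
        errors.append(float(np.linalg.norm(residual)))
        c = c + gain * residual           # online correction
        z = z_next
    return errors

open_loop = rollout_errors(gain=0.0)
closed_loop = rollout_errors(gain=0.5)
print(open_loop[-1], closed_loop[-1])
```

In this toy the residual shrinks by a factor of `1 - gain` per step, so the corrected prediction converges while the open-loop error stays at the size of the unmodeled offset. The convergence conditions stated in the paper are not reproduced here.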

2605.15701 2026-05-18 cs.CL cs.AI 版本更新

H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

H-Mem: 一种通过混合结构进化和检索智能体记忆的新型记忆机制

Jiawei Yu, Yixiang Fang, Xilin Liu, Yuchi Ma

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Huawei Cloud Computing Technologies CO., LTD.(华为云计算技术有限公司)

AI总结 H-Mem通过混合结构有效建模智能体记忆的长期演化并高效检索记忆数据,提升问答任务性能。

详情
AI中文摘要

在基于大语言模型(LLM)的智能体(如OpenClaw和Manus)中,记忆数据无处不在。尽管近期有研究尝试利用智能体的记忆来提高问答(QA)任务的性能,但缺乏有效建模记忆数据随时间演化和高效检索的原理性机制,导致记忆利用效率低下。为此,我们提出了H-Mem,一种通过混合结构实现的新型记忆机制,能够有效建模智能体记忆的长期演化,并提供高效的记忆检索方法。特别是,H-Mem构建了时间与语义树结构,使短期记忆数据逐步演变为长期记忆数据,后者为前者提供总结信息,同时构建知识图谱以捕捉记忆中实体之间的关系。此外,通过利用树和图结构的混合特性,H-Mem提供了有效的记忆检索方法。在三个智能体记忆基准测试中,H-Mem在问答任务上实现了最先进的性能。

英文摘要

Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents' memory for improving their performance on the question-answering (QA) task, but they lack a principled mechanism for effectively modeling how memory data evolves over time and retrieving memory data effectively, leading to poor performance in memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism via a hybrid structure that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach. Particularly, H-Mem builds a temporal and semantic tree structure that allows the short-term memory data to evolve progressively into long-term memory data, where the latter provides summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach by exploiting the hybrid structure of the tree and graph structures. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.

2605.15688 2026-05-18 stat.ML cs.AI cs.LG math.PR 版本更新

$α$-TCAV: A Unified Framework for Testing with Concept Activation Vectors

$α$-TCAV:基于概念激活向量的测试统一框架

Ekkehard Schnoor, Jawher Said, Malik Tiomoko, Wojciech Samek, Alexander Jung

发表机构 * Department of Computer Science(计算机科学系) Department of Artificial Intelligence(人工智能系) Aalto University(阿尔托大学) Fraunhofer Heinrich Hertz Institute(弗劳恩霍夫海因里希·赫兹研究所) Department of Artificial Intelligence, Fraunhofer HHI(人工智能系,弗劳恩霍夫HHI研究所) Huawei Noah’s Ark Lab(华为诺亚方舟实验室) Department of EECS, Technische Universität Berlin(电子工程与计算机科学系,柏林工业大学)

AI总结 本文提出$α$-TCAV框架,解决传统TCAV方法中因指示函数不连续导致的方差问题,通过参数化平滑函数统一概率表述,并提供参数调优指导,挑战现有实践惯例。

Comments 44 pages, 12 figures

详情
AI中文摘要

概念激活向量(CAVs)是深度学习中基于概念的可解释性基础工具,但其实际应用受限于统计不稳定性。本文分析了CAVs和TCAV方法的随机性质,推导了主要CAV类别的分布,包括PatternCAV、FastCAV和基于岭回归的CAV。识别了标准TCAV得分的根本缺陷:其依赖不连续指示函数导致关键区域方差不衰减。为此,引入$α$-TCAV,一种通用框架,用参数化平滑函数替代指示函数,得到统一的概率表述,涵盖TCAV和Multi-TCAV。刻画了灵敏度得分和不同TCAV变体的诱导分布,显示现有最先进的选择缺乏理论依据。提供原理指导,调优$α$-TCAV参数:要么以较低计算成本模仿Multi-TCAV,要么获得校准的贝叶斯最优概率度量。最终分析产生实用建议,挑战现有惯例:最显著的是将全部采样预算分配给单一CAV而非多个。

英文摘要

Concept Activation Vectors (CAVs) are a fundamental tool for concept-based explainability in deep learning, yet their practical utility is limited by statistical instability. We analyze the stochastic nature of CAVs and the Testing with CAVs (TCAV) method, deriving the distributions of major CAV classes including PatternCAV, FastCAV, and ridge regression-based CAVs. We then identify a fundamental flaw in the standard TCAV score: its reliance on a discontinuous indicator function induces non-decaying variance in critical regimes. To address this, we introduce $α$-TCAV, a generalized framework that replaces the indicator with a parameterized smooth function, yielding a unified probabilistic formulation that subsumes both TCAV and Multi-TCAV. We characterize the induced distributions of sensitivity scores and different TCAV variants, showing that established state-of-the-art choices lack theoretical justification. We provide principled guidance on tuning the parameter in $α$-TCAV -- either to imitate Multi-TCAV at substantially lower computational cost, or to obtain a calibrated Bayes-optimal probabilistic measure of a concept's influence. Finally, our analysis yields practical recommendations that challenge established routines: most notably, allocating the full sampling budget to a single CAV rather than splitting it across several.
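
To make the indicator issue concrete, the sketch below contrasts the standard TCAV score (the fraction of inputs whose directional derivative along the CAV is positive) with a smoothed score. The logistic function with sharpness `alpha` is only a stand-in: the abstract does not say which parameterized smooth function the paper uses or how its parameter enters, and the function names are hypothetical.

```python
import numpy as np

def tcav_score(sensitivities):
    """Standard TCAV score: fraction of strictly positive sensitivities."""
    s = np.asarray(sensitivities, dtype=float)
    return float(np.mean(s > 0))

def smooth_tcav_score(sensitivities, alpha):
    """Smoothed score: indicator replaced by a logistic function (assumption).

    Uses the identity sigmoid(x) = 0.5 * (1 + tanh(x / 2)) for numerical stability.
    """
    s = np.asarray(sensitivities, dtype=float)
    return float(np.mean(0.5 * (1.0 + np.tanh(alpha * s / 2.0))))

# A sensitivity near zero flips the indicator completely ...
print(tcav_score([1e-6]), tcav_score([-1e-6]))
# ... while the smooth score barely moves.
print(smooth_tcav_score([1e-6], alpha=1.0), smooth_tcav_score([-1e-6], alpha=1.0))
```

For large `alpha` the logistic function approaches the indicator, so in this sketch the standard score appears as a limiting case, which is one way a smooth family can subsume it.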

2605.15675 2026-05-18 cs.LG cs.AI 版本更新

Interaction-Aware Influence Functions for Group Attribution

面向群体归因的交互感知影响函数

Jaeseung Heo, Kyeongheung Yun, Youngbin Choi, Sehyun Hwang, Jungseul Ok, Dongwoo Kim

发表机构 * GSAI, POSTECH(POSTECH人工智能研究生院) CSE, POSTECH(POSTECH计算机科学与工程系)

AI总结 本文提出交互感知影响函数,通过考虑样本间的相互作用来改进群体归因,实验显示其在多个任务中优于传统方法。

详情
AI中文摘要

影响函数用于近似移除某个训练样本会如何改变一个关注的量(称为目标函数),例如留出集损失。要估计一组样本的影响,常规做法是将组内各样本的个体影响相加。然而,这种求和无法刻画样本如何共同影响目标:一对样本可能是冗余的,也可能是互补的,但求和无法区分这两种情况。我们提出交互感知影响函数,用以刻画样本之间的交互如何影响目标。通过将目标在训练所得参数附近展开至二阶,我们得到一个估计器,它在标准求和的基础上增加了一个成对交互项,用于刻画两个样本对目标的影响之间的对齐程度。我们在两种设置下对该估计器进行了实证评估。首先,在涵盖逻辑回归、MLP和ResNet-9的六个数据集-模型组合上,该估计器在所有设置中对留组重训练结果的跟踪都明显优于一阶影响。其次,当用作Llama-3.1-8B指令微调数据的贪心选择规则时,它在七个下游任务中的五个上优于已有的基于影响和基于表示相似性的基线,而在这一情形下,标准的基于影响的选择还不如随机选择。

英文摘要

Influence functions approximate how removing a training example changes a quantity of interest, called the target function, such as a held-out loss. To estimate the influence of a group of examples, the standard practice is to sum the individual influences of its members. However, this sum does not capture how examples jointly affect the target: a pair of examples may be redundant or complementary, but the sum cannot distinguish these cases. We propose an interaction-aware influence function that characterizes how interactions between examples influence the target. By expanding the target to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples' effects on the target. We empirically evaluate our estimator in two settings. First, on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9, our estimator tracks leave-group-out retraining substantially better than first-order influence across all settings. Second, when used as a greedy selection rule for instruction-tuning data on Llama-3.1-8B, it beats prior influence-based and representation-similarity baselines on five of seven downstream tasks, in a regime where standard influence-based selection underperforms random selection.
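
One way to see where a pairwise term can come from is to take the per-example parameter shifts as given and expand the target to second order in their sum. The sketch below does exactly that on a two-parameter toy. It is an illustration of the general idea, not the paper's estimator: `group_effect`, the shift vectors, the gradient, and the Hessian are all hypothetical.

```python
import numpy as np

def group_effect(deltas, grad, hess):
    """Return (first_order_sum, second_order_estimate) for a group removal.

    deltas: parameter shifts attributed to removing each example (assumed given).
    grad, hess: gradient and Hessian of the target at the trained parameters.
    """
    deltas = np.asarray(deltas, dtype=float)
    total = deltas.sum(axis=0)
    # Standard practice: sum of individual first-order influences.
    first_order = float(grad @ total)
    # The quadratic term expands into sum_i sum_j delta_i^T H delta_j,
    # i.e. per-example terms plus pairwise interaction terms.
    second_order = first_order + 0.5 * float(total @ hess @ total)
    return first_order, second_order

grad = np.array([1.0, 0.0])
hess = np.eye(2)
d = np.array([0.0, 1.0])

aligned = group_effect([d, d], grad, hess)    # two examples pushing the same way
opposing = group_effect([d, -d], grad, hess)  # two examples cancelling each other
print(aligned, opposing)
```

In this toy both groups have the same first-order sum, so the standard estimate cannot tell them apart, while the second-order estimate differs because the cross term `d_i^T H d_j` is positive for aligned shifts and negative for opposing ones.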

2605.15672 2026-05-18 cs.CV cs.AI 版本更新

VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following

VLMs描线而不追踪:诊断视觉路径跟随中的失败

Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon, Albert No

发表机构 * Yonsei University(延世大学)

AI总结 研究VLMs在视觉路径跟随任务中的表现,发现其在面对局部相似干扰时易切换路径,揭示局部竞争导致的失败原因。

详情
AI中文摘要

视觉-语言模型(VLMs)在多模态基准测试中表现优异,但可能仍缺乏对基本视觉操作的鲁棒控制。我们研究了路径跟随任务,其中模型必须通过连续的局部延续跟随选定的视觉路径。为隔离这一能力,我们设计了受控的路径跟随任务,引入附近的竞争者并减少语义和拓扑模糊性,如交叉和重叠。在这些任务中,即使是最先进的VLMs也频繁失去目标路径并切换到附近的替代路径,尤其是在这些替代路径在局部上相似时。行为干预和内部分析表明,这些失败源于局部竞争:附近的相似干扰者会将模型拉离真正的延续。标准解决方案无法消除这一瓶颈:模型大小扩展只能提供有限的收益,推理部分通过成本高昂的替代策略补偿,而显式路径指示未能恢复稳定的路径跟随。最后,在复杂的电缆场景和地铁地图上测试表明,相同的路径切换失败在受控设置之外仍然存在。

英文摘要

Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.

2605.15665 2026-05-18 cs.AI 版本更新

PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

PRISM:通过迭代模拟和监控实现提示的可靠性用于企业对话式AI

Keshava Chaitanya, Jahnavi Gundakaram

AI总结 PRISM通过持续模拟和监控,将提示工程视为可靠性工程问题,提升企业对话式AI的可靠性,减少提示开发时间并修复生产中的回归问题。

Comments 12 pages, 1 figure, 5 tables. arXiv preprint

详情
AI中文摘要

在企业环境中部署由大语言模型(LLM)驱动的对话代理,需要提示在上线时正确,同时能够抵御生产环境LLM部署中常见的非确定性行为漂移。现有提示优化框架把提示质量当作一次性的编译期问题,没有回答同样关键的问题:如何检测并修复因LLM行为随时间悄然变化而引起的提示回归。我们提出PRISM(Prompt Reliability via Iterative Simulation and Monitoring),一个闭环框架,将提示工程视为持续的可靠性工程问题,而非一次性的撰写任务。PRISM的输入包括自然语言描述的代理需求、一组已配置的工具和记忆变量,以及一份初始提示草稿。它自动根据需求生成测试用例,在忠实还原平台的LLM环境中模拟完整的多轮对话,使用LLM作为评判者判定通过或失败,诊断失败的根本原因,并对提示进行精准修复,如此迭代直到所有测试通过。关键的是,PRISM被设计为按计划(每日)运行,把LLM行为漂移当作首要的可靠性问题。我们在Yellow.ai V3平台上,对35个企业对话代理进行了为期三周的部署评估。PRISM将提示撰写时间的中位数从2天缩短到30分钟以内,在所有被评估的代理上实现了99%的生产可靠性,并在24小时的检测窗口内成功识别并修复了由LLM行为漂移引起的生产回归。我们的结果表明,持续的、由模拟驱动的提示优化对于大规模可靠的企业对话式AI既可行又必要。

英文摘要

Deploying large language model (LLM)-driven conversational agents in enterprise settings requires prompts that are simultaneously correct at launch and resilient to the non-deterministic behavioral drift that characterizes production LLM deployments. Existing prompt optimization frameworks address prompt quality as a one-time compile-time problem, leaving open the equally critical question of how to detect and repair prompt regressions caused by silent LLM behavior changes over time. We present PRISM (Prompt Reliability via Iterative Simulation and Monitoring), a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem rather than a one-time authorship task. PRISM takes as input plain-language agent requirements, a set of configured tools and memory variables, and an initial draft prompt. It automatically generates test cases from requirements, simulates full multi-turn conversations against a platform-faithful LLM environment, evaluates pass/fail using an LLM-as-judge, diagnoses root causes of failures, and surgically repairs the prompt -- iterating until all tests pass. Critically, PRISM is designed to run on a scheduled basis (daily), treating LLM behavioral drift as a first-class reliability concern. We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform. PRISM reduces median prompt authoring time from 2 days to under 30 minutes, achieves 99% production reliability across all evaluated agents, and successfully identifies and repairs production regressions caused by LLM behavioral drift within a 24-hour detection window. Our results suggest that continuous, simulation-driven prompt optimization is both tractable and necessary for reliable enterprise conversational AI at scale.
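
The test-and-repair cycle described above can be sketched as a small loop. This is a schematic under stated assumptions rather than PRISM's implementation: `repair_until_passing` is a hypothetical name, and `run_test` and `repair` are caller-supplied stand-ins for the conversation simulation plus LLM-as-judge step and for the LLM-driven prompt repair step.

```python
def repair_until_passing(prompt, tests, run_test, repair, max_rounds=5):
    """Repair `prompt` until every test passes or the round budget runs out.

    run_test(prompt, test) -> bool   stand-in for simulate + judge.
    repair(prompt, failures) -> str  stand-in for diagnose + repair.
    Returns (final_prompt, all_tests_pass).
    """
    for _ in range(max_rounds):
        failures = [t for t in tests if not run_test(prompt, t)]
        if not failures:
            return prompt, True
        prompt = repair(prompt, failures)
    return prompt, all(run_test(prompt, t) for t in tests)

# Toy stand-ins: a "test" is a keyword the prompt must mention.
tests = ["greet", "refund", "escalate"]
final_prompt, ok = repair_until_passing(
    "greet the user",
    tests,
    run_test=lambda p, t: t in p,
    repair=lambda p, failures: p + " " + " ".join(failures),
)
print(ok, final_prompt)
```

Running the same loop on a schedule against unchanged tests is what would turn it from an authoring aid into a drift monitor: a prompt that passed yesterday and fails today signals a change in model behavior.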

2605.15661 2026-05-18 cs.CV cs.AI 版本更新

VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

VAGS:面向图像编辑与生成的速度自适应引导尺度

Yan Luo, Ahmadou Aidara, Jingyi Lu, Jeremy Moebel, Kai Han, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab(哈佛人工智能与机器人实验室) Harvard University(哈佛大学) School of Computing and Data Science(计算与数据科学学院) The University of Hong Kong(香港大学) Kempner Institute for the Study of Natural and Artificial Intelligence(肯普纳自然与人工智能研究所)

AI总结 VAGS通过自适应引导尺度提升图像编辑和生成的结构保真度和生成质量,无需微调或额外计算。

详情
AI中文摘要

无分类器引导(CFG)是控制文本语义对基于流的采样器推动强度的主要手段,但通行做法是在整条ODE轨迹上固定其尺度。这存在根本性的不匹配:早期步骤以噪声为主,语义信号较弱,而后期步骤要确定图像结构,需要更强的方向性约束;更关键的是,任何引导强度的价值都取决于被引导的速度是与模型当前的动态一致,还是与之相悖。我们提出速度自适应引导尺度(VAGS),一种无需训练的替代方案,它将名义尺度乘以一个有界因子,该因子结合了时间上的信号水平项以及任务相关速度场之间的余弦相似度。对于免反演编辑,VAGS度量源引导速度与目标引导速度之间的对齐程度,使每一步的编辑强度反映保留与变换之间的局部兼容性。对于生成,VAGS-Gen以无条件速度与条件速度之间的对齐程度作为对应的信号。两种变体均无需微调、辅助网络或额外的前向传递,且固定CFG可作为其特例恢复。在用于编辑的PIE-Bench和DIV2K,以及用于生成的COCO17、CUB-200和Flickr30K上,VAGS在结构保真度和生成质量上均稳定优于固定CFG和近期的免训练引导变体。代码已公开: https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale

英文摘要

Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.
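
The shape of such a rule can be sketched as follows. The combination `v_uncond + scale * (v_cond - v_uncond)` is the standard CFG formula; the adaptive factor, however, is a hypothetical form chosen only to satisfy the properties the abstract lists (bounded, time-dependent, driven by cosine similarity, and reducing to fixed CFG). The actual VAGS formula, the sign convention, and the time parameterization are not given in the abstract; here `t` is assumed to run from 0 (noise) to 1 (data).

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two (flattened) velocity fields."""
    u, v = np.ravel(u), np.ravel(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def adaptive_scale(base_scale, t, v_a, v_b, strength=0.5):
    """Hypothetical velocity-adaptive scale, not the paper's formula.

    With t in [0, 1] and strength in [0, 1), the factor stays within
    [1 - strength, 1 + strength]; strength = 0 recovers fixed CFG.
    """
    factor = 1.0 + strength * t * cosine_similarity(v_a, v_b)
    return base_scale * factor

def guided_velocity(v_uncond, v_cond, scale):
    """Standard classifier-free guidance combination."""
    return v_uncond + scale * (v_cond - v_uncond)

# Usage with random stand-ins for the two model outputs at one sampler step.
rng = np.random.default_rng(0)
v_uncond, v_cond = rng.normal(size=(2, 4, 4))
w = adaptive_scale(7.5, t=0.8, v_a=v_uncond, v_b=v_cond)
v = guided_velocity(v_uncond, v_cond, w)
```

Because both velocities are already computed for ordinary CFG, a rule of this shape adds only a dot product per step, which is consistent with the abstract's claim that no extra forward passes are needed.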

2605.15656 2026-05-18 eess.SP cs.AI 版本更新

TFZ-Tree: An Ultra-Lightweight Waveform Classification Framework for Resource-Constrained Devices

TFZ-Tree:一种面向资源受限设备的超轻量波形分类框架

Hao Wang, Kuang Zhang, Yonggang Chi, Tianqi Zhao, Yanbo Fu, Jiaxing Guo

AI总结 本文提出TFZ-Tree框架,通过时间频率多维特征和优化的Z检验树实现超轻量波形分类,实现在资源受限设备上实时识别十种物联网波形类型,测试精度达99.5%。

详情
AI中文摘要

在6G物联网多波形共存的趋势下,智能接收机必须先识别物理层波形类型,才能进行正确的解调和资源调度。然而,现有的信号识别研究大多聚焦于符号级调制分类。直接针对物理层波形类型(如OFDM、OTFS、LoRa)的研究不仅极为稀缺,而且严重依赖深度神经网络和复杂的时频变换,难以部署在资源受限的终端上。符号调制分类方法本身也绕不开“先识别波形”这一前提。为弥补这一双重空白,我们提出一种基于时频多维特征并配合协同Z检验树(ZTree)的超轻量波形分类框架。该框架采用低复杂度的时域特征提取,分类后端采用经Z统计检验优化的ZTree,利用假设检验的置信度自动控制决策树的分裂和规模,确保在资源有限的处理器上高效执行。在包括OFDM、OTFS、DSSS、LoRa和NB-IoT在内的十种6G候选波形上测试,该方法在AWGN信道下平均准确率达99.5%,在TDL-C多径信道下为87.4%,主要混淆发生在OTFS与LoRa之间。以C语言在x86平台上实现时,单次推理延迟低于4 ms。据我们所知,这是首个实现十种物联网波形类型实时识别的工作。未来工作将面向嵌入式MCU上的部署加速。代码和数据集已开源:https://github.com/Einstein-sworder/IoT-wave.

英文摘要

Under the trend of multi-waveform coexistence in 6G IoT, intelligent receivers must first identify physical-layer waveform types before performing correct demodulation and resource scheduling. However, existing signal identification research largely focuses on symbol-level modulation classification. Research directly targeting physical-layer waveform types (e.g., OFDM, OTFS, LoRa) is not only extremely scarce but also heavily reliant on deep neural networks and complex time-frequency transforms, making deployment on resource-constrained terminals difficult. Symbol modulation classification methods themselves cannot circumvent the prerequisite of "waveform identification first." To address this dual gap, we propose an ultra-lightweight waveform classification framework based on time-frequency multidimensional features with a cooperative Z-test tree (ZTree). The framework employs low-complexity time-domain feature extraction, and the classification backend adopts a ZTree optimized by Z-statistical testing, which uses hypothesis testing confidence to automatically control decision tree splitting and size, ensuring efficient execution on resource-limited processors. Tested on ten 6G candidate waveforms including OFDM, OTFS, DSSS, LoRa, and NB-IoT, the method achieves 99.5% average accuracy under AWGN and 87.4% under TDL-C multipath channels, with main confusion between OTFS and LoRa. Implemented in C on an x86 platform, single inference latency is under 4 ms. To the best of our knowledge, this is the first work achieving real-time recognition of ten IoT waveform types. Future work will target deployment acceleration on embedded MCUs. Code and dataset are open-sourced at: https://github.com/Einstein-sworder/IoT-wave.
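
As a rough illustration of how a hypothesis test can gate tree growth, the sketch below accepts a candidate split only if the two child nodes differ significantly in the share of one class, using a textbook two-proportion Z statistic. It is not the ZTree algorithm from the paper: the function names, the choice of test, and the 1.96 critical value (two-sided 5% level) are assumptions for illustration.

```python
import math

def two_proportion_z(k_left, n_left, k_right, n_right):
    """Z statistic for the difference in one class's share between two child nodes.

    k_*: number of samples of the class in each child; n_*: child sizes.
    """
    p_left, p_right = k_left / n_left, k_right / n_right
    pooled = (k_left + k_right) / (n_left + n_right)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_left + 1 / n_right))
    return 0.0 if se == 0 else (p_left - p_right) / se

def should_split(k_left, n_left, k_right, n_right, z_crit=1.96):
    """Accept the split only if the children differ significantly."""
    return abs(two_proportion_z(k_left, n_left, k_right, n_right)) > z_crit

print(should_split(45, 50, 5, 50))   # children clearly differ
print(should_split(26, 50, 24, 50))  # difference is within noise
```

Raising `z_crit` makes the tree stop earlier and stay smaller, which is the kind of size control the abstract attributes to hypothesis-testing confidence.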

2605.15651 2026-05-18 cs.LG cs.AI cs.GT 版本更新

Sharp Spectral Thresholds for Logit Fixed Points

Logit不动点的精确谱阈值

Tongxi Wang

发表机构 * Southeast University(东南大学)

AI总结 研究探讨了logit反馈系统稳定性问题,提出新的欧几里得阈值条件以扩展稳定性保证,识别相变点。

详情
AI中文摘要

Softmax反馈系统是熵正则化强化学习、logit博弈动力学、群体选择和平均场变分更新共同的数学核心。其核心稳定性问题很简单:一个自增强的softmax系统何时会产生唯一且全局可预测的结果?经典理论给出的答案偏保守:它把softmax当作单位尺度的响应,因此只能在强随机化的情形下保证稳定性。我们证明经典方法遗漏了一整个稳定区域,也没有找到性质真正发生改变的位置。对于有限维仿射logit系统,精确的、与维度无关的欧几里得阈值是$$β\|ΠWΠ\|_{\mathcal T\to\mathcal T}<2,$$而非此前使用的条件,后者只有在softmax系统仍处于安全的过度正则化状态时才能保证稳定性。我们的定理填补了此前缺失的分岔前区域,将仿射softmax反馈系统的稳定性保证推广到对奖励敏感但仍全局可预测的系统。它扩大了这类系统可被认证的稳定性边界,并指明了模型真正发生相变的位置。

英文摘要

Softmax feedback systems are a common mathematical core of entropy-regularized reinforcement learning, logit game dynamics, population choice, and mean-field variational updates. Their central stability question is simple: when does a self-reinforcing softmax system produce a unique and globally predictable outcome? Classical theory gives a conservative answer. By treating softmax as a unit-scale response, it certifies stability only in a strongly randomized regime. We prove that the classical approach misses an entire stable regime and does not identify the point at which the qualitative change truly occurs. For finite-dimensional affine logit systems, the sharp dimension-free Euclidean threshold is $$β\|ΠWΠ\|_{\mathcal T\to\mathcal T}<2,$$ rather than the previously used condition, which certifies stability only while the softmax system remains safely over-regularized. Our theorem fills the previously missing pre-bifurcation regime, extending stability guarantees for affine softmax feedback systems to reward-responsive yet globally predictable systems. It enlarges the certified stability boundary for these systems and identifies where the model genuinely undergoes a phase transition.
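
The threshold can be checked numerically for a given interaction matrix. The sketch below rests on two assumptions that the abstract does not spell out: that Π is the orthogonal projector onto sum-zero vectors (the tangent space of the probability simplex), and that the norm is the Euclidean operator norm. The function names are hypothetical.

```python
import numpy as np

def stability_margin(W, beta):
    """Return beta * ||Pi W Pi||_2, with Pi assumed to project onto sum-zero vectors."""
    n = W.shape[0]
    Pi = np.eye(n) - np.ones((n, n)) / n
    # Pi W Pi vanishes off the tangent space, so its spectral norm on R^n
    # equals its operator norm restricted to that subspace.
    return beta * np.linalg.norm(Pi @ W @ Pi, 2)

def is_certified_stable(W, beta):
    """Check the condition beta * ||Pi W Pi|| < 2 from the abstract."""
    return bool(stability_margin(W, beta) < 2.0)

W = np.eye(3)
print(is_certified_stable(W, beta=1.5))  # inside the threshold
print(is_certified_stable(W, beta=2.5))  # outside the threshold
```

The constant 2 is at least consistent with the known fact that the softmax Jacobian, diag(p) - p pᵀ, has spectral norm at most 1/2, whereas treating softmax as a unit-scale response would give a threshold of 1. Whether this is the route the paper takes is not stated in the abstract.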

2605.15625 2026-05-18 cs.AI cond-mat.soft 版本更新

ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing

ColPackAgent:由代理技能引导的胶体堆积硬粒子蒙特卡罗工作流

Lijie Ding, Changwoo Do

发表机构 * Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA(橡树岭国家实验室中子散射部)

AI总结 ColPackAgent通过MCP工具服务器和代理技能实现胶体堆积模拟的自主工作流程,展示了如何利用LLM代理执行模拟任务并评估不同模型的性能。

详情
AI中文摘要

我们介绍ColPackAgent,一个代理框架,它通过模型上下文协议(MCP)工具服务器和一项代理技能自主运行胶体堆积的蒙特卡罗模拟,既可作为独立代理,也可嵌入现有代理系统。借助MCP服务器和代理技能,ColPackAgent为胶体堆积模拟执行结构化的工作流,而这类模拟是相行为、自组装和材料设计研究的核心。如果没有专用的模拟工具和工作流指令,通用的大语言模型(LLM)代理往往只是描述这类工作流,而不能可靠地执行。MCP服务器开放了一个定制的colpack Python包,该包封装了HOOMD-blue的硬粒子蒙特卡罗,技能则编码了一份四阶段的工作流约定。ColPackAgent可以在人类反馈下交互式地执行工作流,可以根据端到端提示自主执行,也可以按照给定的程序文件以autoresearch方式运行。我们以若干胶体堆积模拟示例展示了系统的不同模式,例如三维立方体粒子、二维圆盘与胶囊的二元体系,以及使用autoresearch研究的二维硬盘凝固转变。我们还用17个针对特定阶段的提示,比较了一组LLM在该工作流上的表现。该基准在阶段层面检验了不同模型遵循设置、规划和分析工作流的可靠程度。总之,这些结果表明,将领域Python包与MCP工具和可移植的代理技能相结合,为把模拟工具包转化为代理辅助的研究工作流提供了一条可行途径。

英文摘要

We introduce ColPackAgent, an agent framework that autonomously runs Monte Carlo simulations of colloidal packing through a Model Context Protocol (MCP) tool server and an agent skill, whether as a standalone agent or inside an existing agent system. By harnessing the MCP server and agent skill, ColPackAgent executes a structured workflow for colloidal packing simulations, which are central to studies of phase behavior, self-assembly, and materials design. Without dedicated simulation tools and workflow instructions, general-purpose Large Language Model (LLM) agents tend to describe such workflows rather than execute them reliably. The MCP server exposes a custom-built colpack Python package that wraps HOOMD-blue hard-particle Monte Carlo, and the skill encodes a four-stage workflow contract. ColPackAgent can carry out the workflow interactively with human feedback, autonomously from an end-to-end prompt, or as autoresearch following a provided program file. We demonstrate the system in different modes with several colloidal packing simulation examples such as cube particles in 3D, a binary system of disks and capsules in 2D, and the 2D hard-disk freezing transition using autoresearch. We also compare model performance on this workflow across a panel of LLMs with 17 stage-specific prompts. This benchmark provides a stage-level check of how reliably different models follow the setup, planning, and analysis workflow. Together, these results show that pairing a domain Python package with MCP tools and a portable agent skill provides a practical route for turning a simulation toolkit into an agent-assisted research workflow.

2605.15618 2026-05-18 cs.CV cs.AI 版本更新

Latent Video Prediction Learns Better World Models

潜在视频预测学习更好的世界模型

Ali J Alrasheed, Aryan Yazdan Parast, Basim Azam, James Bailey, Naveed Akhtar

发表机构 * The University of Melbourne(墨尔本大学) Monash University(蒙纳士大学)

AI总结 本文系统研究了潜在预测模型在世界模型中的鲁棒性,发现其在特征可区分性、抗污损性、细粒度辨别、遮挡鲁棒性和时间方向敏感性等方面表现优异,优于其他视频基础模型。

详情
AI中文摘要

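自监督视频模型越来越多地被视为世界模型,但对它们的评估仍基本局限于干净基准上的单一top-1准确率。这使我们对其作为世界模型的潜力缺乏了解。我们首次对这一空白进行系统研究,分析了四个容量匹配的前沿视频基础模型V-JEPA 2.1、V-JEPA 2、VideoPrism和VideoMAEv2,考察与其作为视频世界模型部署相关的五个鲁棒性维度:特征可区分性、抗损坏鲁棒性、细粒度辨别、遮挡鲁棒性以及对时间方向的敏感性。评估表明,在全部五个维度上,潜在预测模型都呈现出独特且一致的特征:它们在像素损坏下性能下降更平缓;在遮挡下保留了可用的类别结构,而不仅仅是几何稳定性;无需重建像素即可捕捉细粒度的物理接触线索;并且是唯一能够编码时间箭头的模型。这些优势甚至能在任务适配后保留:冻结的V-JEPA 2骨干网络配合轻量级注意力探针,在损坏和遮挡鲁棒性上优于完全微调的VideoMAE和有监督的TimeSformer。我们的大量结果为潜在预测有利于鲁棒世界建模提供了具体的新证据。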

英文摘要

Self-supervised video models are increasingly framed as world models, yet their evaluation remains largely confined to a single top-1 accuracy score on clean benchmarks. This leaves a major gap in comprehending their potential as world models. We present the first systematic study addressing this gap, analyzing four matched-capacity frontier video foundation models, V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2, across five robustness axes relevant to their deployment as video world models: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction. Our evaluations establish that across all five axes, latent-prediction models form a distinct and consistent profile. They degrade more gracefully under pixel corruption, preserve usable class structure rather than mere geometric stability under occlusion, capture fine-grained physical contact cues without reconstructing pixels, and uniquely encode the arrow of time. These advantages can even survive task adaptation: a frozen V-JEPA 2 backbone with a lightweight attentive probe outperforms a fully fine-tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness. Our extensive results offer concrete new evidence in favor of latent prediction for robust world modeling.

2605.15617 2026-05-18 cs.DC cs.AI 版本更新

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

少量GPU,超大规模:用PrismLLM实现高保真的LLM训练仿真

Shaoke Xi, ChonLam Lao, Boyi Jia, Jiaqi Gao, Zhipeng Zhang, Jiamin Cao, Brian Sutioso, Erci Xu, Minlan Yu, Kui Ren, Yong Li, Zhengping Qian, Ennan Zhai, Jingren Zhou

发表机构 * Alibaba Group(阿里巴巴集团) Harvard University(哈佛大学) Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学)

AI总结 PrismLLM通过切片方法构建高保真执行图,使工程师能用少量GPU模拟大规模训练行为,准确复现性能和内存表现,节省集群访问成本。

Comments 13 pages body, 21 pages total

详情
AI中文摘要

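如今的大语言模型(LLM)训练运行在由数千块GPU组成的集群上。这种规模虽然推动了模型的快速进步,但训练框架的开发、调试和性能调优也不可避免地变得复杂而昂贵。原因在于,工程师常常需要复现生产环境中的行为来诊断故障或评估优化,因而需要频繁甚至独占地使用生产规模的集群,而在大部分GPU已被生产负载占用的情况下,这越来越难以实现。模拟方法依赖难以维护的复杂性能模型,缩小规模的实验又往往无法捕捉与规模相关的行为。我们提出PrismLLM,将大规模执行与对大集群的访问需求解耦,使工程师仅用少量GPU就能在忠实的大规模行为下运行并观察其关注的rank。PrismLLM通过基于切片的方法构建高保真执行图,刻画目标规模下的计算、通信和依赖关系。随后,PrismLLM执行混合仿真:选定的rank运行原始程序,其余rank则作为虚拟参与者被回放。在大规模LLM训练负载上的实验表明,PrismLLM能准确复现性能和内存行为,迭代时间的平均误差仅为0.58%,峰值GPU内存使用的误差低于0.01%。PrismLLM可以仿真多达8192块GPU的集群,所用物理GPU不到原部署所需数量的1%。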

英文摘要

Large language model (LLM) training today runs on clusters spanning thousands of GPUs. While this scale enables rapid model advances, developing, debugging, and performance-tuning the training framework inevitably becomes complex and costly. This is because engineers often need to reproduce production behaviors to diagnose failures or evaluate optimizations, thereby demanding frequent and even exclusive access to production-scale clusters, which becomes increasingly hard given that the majority of GPUs are already committed to production workloads. Simulation relies on complex performance models that are difficult to maintain, and downscaled experiments often fail to capture scale-dependent behaviors. We present PrismLLM to decouple large-scale execution from the need to access large clusters, enabling engineers to run and observe ranks of interest under faithful large-scale behavior using only a few GPUs. PrismLLM constructs a high-fidelity execution graph via a slicing-based approach that captures computation, communication, and dependencies of the target scale. Then, PrismLLM performs hybrid emulation where selected ranks execute the original program while the remaining ranks are replayed as virtual participants. Experiments on large-scale LLM training workloads show that PrismLLM accurately reproduces performance and memory behavior, achieving only 0.58% average error in iteration time and less than 0.01% error in peak GPU memory usage. PrismLLM can emulate clusters of up to 8192 GPUs using fewer than 1% of the physical GPUs required by the original deployment.

2605.15611 2026-05-18 cs.AI version update

TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices

Junle Wang, Xingchuang Liao, Wenjun Wu

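Affiliations: School of Artificial Intelligence, Beihang University, Beijing, China

AI summary: To address heterogeneous observability data, failure propagation, and topology drift in microservices, TopoEvo combines multimodal alignment, topology-constrained reasoning, and a self-evolving mechanism to make root cause analysis more robust and accurate.

Comments 12 pages

Details
AI abstract

Root cause analysis (RCA) in microservices is challenged by noisy, heterogeneous multimodal observability data, cascading failure propagation that amplifies downstream symptoms, and non-stationary topology drift caused by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations but often lack topology awareness, which leads to symptom-amplification bias. This paper proposes TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces Metric-orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations. It then applies Vector Quantization (VQ) to discretize topology-enhanced states into auditable symptom tokens, using a symptom lexicon for reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo runs a multi-agent Hypothesis-Evidence-Test (HET) workflow that explicitly verifies propagation-consistent explanations and separates initiating anomalies from amplified downstream symptoms. Finally, a self-evolving mechanism refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to stay robust under drift.

English abstract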

Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non-stationary topology drift induced by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations, yet they often remain topology-agnostic and suffer from symptom-amplification bias, misattributing the root cause to salient downstream victims. We propose TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces Metric-orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies Vector Quantization (VQ) to discretize topology-enhanced states into auditable symptom tokens with a symptom lexicon, enabling reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi-agent Hypothesis-Evidence-Test (HET) workflow to explicitly verify propagation-consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a Self-Evolving Mechanism refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to maintain robustness under drift.

2605.15603 2026-05-18 cs.LG cs.AI version update

Offline Reinforcement Learning with Universal Horizon Models

Hojun Chung, Junseo Lee, Songhwai Oh

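Affiliations: Interdisciplinary Program in Artificial Intelligence and ASRI, Seoul National University; Department of Electrical and Computer Engineering, Seoul National University

AI summary: This paper proposes universal horizon models, which flexibly predict future states under arbitrary horizons, addressing the weakness of geometric horizon models at modeling distant future states, and validates them on 100 OGBench tasks.

Comments ICML 2026

Details
AI abstract

Model-based reinforcement learning (RL) offers an attractive approach to offline RL by enabling value learning on imagined on-policy trajectories. However, it often suffers from compounding errors on self-generated states caused by repeated model inference. Geometric horizon models (GHM) mitigate this by directly predicting the discounted infinite-horizon future, but they still struggle to model distant future states accurately. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Building on this flexibility, we propose a scalable value learning method that uses a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experiments on 100 challenging OGBench tasks show that the proposed method outperforms competitive baselines on highly suboptimal datasets and on tasks requiring long-horizon reasoning. Project page: https://rllab-snu.github.io/projects/UHM/

English abstract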

Model-based reinforcement learning (RL) offers a compelling approach to offline RL by enabling value learning on imagined on-policy trajectories. However, it often suffers from compounding errors due to repeated model inference on self-generated states. While geometric horizon models (GHM) alleviate this issue through direct prediction over a discounted infinite-horizon future, they remain challenged in accurately modeling distant future states. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Leveraging this flexibility, we propose a scalable value learning method that employs a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experimental results on 100 challenging OGBench tasks demonstrate that the proposed method outperforms competitive baselines, particularly on tasks with highly suboptimal datasets and those requiring long-horizon reasoning. Project page: https://rllab-snu.github.io/projects/UHM/
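
As a rough illustration of the winsorized horizon distribution mentioned in the abstract, the sketch below samples horizons from a geometric distribution with discount `gamma` and caps them at `h_max`, so all probability mass beyond the cap collects at `h_max`. The geometric form, the function name `sample_winsorized_horizon`, and the parameter values are assumptions made for illustration; the abstract does not state the distribution or cap the paper actually uses.

```python
import random


def sample_winsorized_horizon(gamma: float, h_max: int, rng: random.Random) -> int:
    """Sample h >= 1 with P(h) = (1 - gamma) * gamma**(h - 1), capped at h_max.

    Hypothetical helper: every draw that would exceed h_max is replaced by
    h_max (winsorizing), instead of being rejected and redrawn.
    """
    h = 1
    # Each loop iteration extends the horizon by one step with probability gamma.
    while rng.random() < gamma:
        h += 1
        if h >= h_max:
            return h_max
    return h


rng = random.Random(0)
# Illustrative values only: gamma = 0.99 and a cap of 50 steps.
horizons = [sample_winsorized_horizon(0.99, 50, rng) for _ in range(10_000)]
```

With these illustrative values, more than half of the draws land exactly on the cap, which is the point of winsorizing: very long horizons are truncated rather than discarded.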

2605.15585 2026-05-18 cs.AI cs.CV version update

See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

Yuejia Li, Ke He, Junheng Li, Shutong Chen, Jingkang Xia, Zhiyue Su, Junchi Zhang, Mang Ye

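Affiliations: Wuhan University; University of Chinese Academy of Sciences

AI summary: This paper proposes the OmniManim framework, which uses visual planning and a render-feedback mechanism to improve the quality of generated educational animations, including rendering and instructional quality.

Comments 21 pages, 4 figures

Details
AI abstract

Large language models can generate executable code for educational animations, but the rendered results often show element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from code alone and appear only after execution. This paper formalizes the problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated after rendering. To address it, we introduce the OmniManim framework, built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within it, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts through coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures caused by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further confirm that explicit visual planning, particularly its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is key to these gains.

English abstract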

Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render-feedback-aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.

2605.15581 2026-05-18 cs.AI version update

STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

Junle Wang, Xingchuang Liao, Wenjun Wu

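Affiliations: School of Artificial Intelligence, Beihang University, Beijing, China

AI summary: This paper proposes the STAR framework, which decomposes the RCA workflow into four stages to improve the reliability and self-repair capability of RCA agents in microservices.

Comments 11 pages

Details
AI abstract

Root cause analysis (RCA) agents based on large language models have recently emerged in microservice AIOps, but their reliability remains fragile: errors in early evidence collection, hypothesis formulation, or causal analysis propagate through the reasoning trace and ultimately corrupt the final diagnosis. This paper proposes STAR, a Stage-attributed Triage and Repair framework for RCA agents. It decomposes the RCA workflow into four structured stages, Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a localizable, stage-level reasoning error rather than a monolithic end-to-end error. Built on LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair.

English abstract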

LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.

2605.15569 2026-05-18 cs.CR cs.AI cs.SE version update

Detecting Privilege Escalation in Polyglot Microservices via Agentic Program Analysis

Penghui Li, Hong Yau Chong, Yinzhi Cao, Junfeng Yang

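Affiliations: Columbia University; Johns Hopkins University

AI summary: This paper proposes the Neo framework, which combines LLMs with classic program analysis to handle the complexity of detecting privilege escalation in microservices. Neo finds 24 zero-day vulnerabilities and outperforms existing approaches in precision and recall.

Comments In Proceedings of the 47th IEEE Symposium on Security and Privacy (S&P)

Details
AI abstract

Microservices are widely adopted for their scalability and fault tolerance, but their architecture complicates privilege and permission control, creating a risk of privilege escalation. This paper proposes Neo, a framework that combines large language models with classic program analysis; by dynamically generating analysis plans, adapting code search strategies, and validating semantics, it enables scalable code exploration across services and languages. Evaluated on 25 open-source microservice applications, Neo found 24 zero-day vulnerabilities with 81.0% precision and 85.0% recall. Compared with existing approaches, Neo significantly improves both detection accuracy and scalability, and it demonstrates extensibility to other application domains and vulnerability types, where it found 18 additional zero-day vulnerabilities.

English abstract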

Microservices are widely adopted in modern cloud systems due to their scalability and fault tolerance. However, microservice architectures introduce significant complexity in privilege and permission control, creating risks of privilege escalation where attackers can gain unauthorized access to resources or operations. Detecting such vulnerabilities is challenging due to complex cross-service interactions, polyglot codebases, and diverse privileged operations and permission checks. We present Neo, an agentic program analysis framework that combines large language models (LLMs) with classic program analysis to address these challenges. Neo leverages an LLM-based agent that dynamically generates analysis plans, adapts code search strategies, and validates semantics. We develop code search primitives that enable Neo to perform scalable and flexible code exploration across services and languages. We evaluated Neo on 25 open-source microservice applications spanning 7 programming languages and 6.2 million lines of code. Neo uncovered 24 zero-day privilege escalation vulnerabilities and achieved 81.0% precision and 85.0% recall on a ground-truth dataset. Compared to existing program analysis and agentic solutions, Neo demonstrated significant improvements in both detection accuracy and scalability. We further showcased Neo's extensibility by applying it to other application domains and vulnerability types, uncovering 18 additional zero-day vulnerabilities.

2605.15567 2026-05-18 cs.AI version update

Position: Artificial Intelligence Needs Meta Intelligence -- the Case for Metacognitive AI

Sergei Chuprov, Richard D. Lange, Leon Reznik, Paulo Shakarian, Raman Zatsarenko, Dmitrii Korobeinikov

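Affiliations: University of Texas Rio Grande Valley, Edinburg, TX, USA; Rochester Institute of Technology, Rochester, NY, USA; Syracuse University, Syracuse, NY, USA

AI summary: This paper argues for metacognition as a general principle for designing more accurate, secure, and efficient AI, uses a federated learning case study to show how metacognition improves learning efficiency and security, and presents a new software framework for building metacognitive AI.

Comments This is a preliminary version accepted for presentation and publication at the 43rd International Conference on Machine Learning (ICML26). The modified final version will be available in the conference proceedings

Details
AI abstract

This paper argues for metacognition as a general principle for designing more accurate, secure, and efficient AI. The metacognitive solution involves systems monitoring their own states and allocating resources judiciously according to each problem instance's difficulty or cost of mistakes. Inspired by resource-rational AI and by well-documented metacognitive strategies in psychology and cognitive science, we identify specific challenges in embedding these strategies into AI design and highlight open theoretical and implementation problems. We illustrate these principles with a Federated Learning (FL) case study and show how a newly developed software framework turns them into practice, allowing the community to design, deploy, and experiment with metacognition-enabled AI applications.

English abstract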

This position paper argues for metacognition as a general design principle for creating more accurate, secure, and efficient AI. The metacognitive solution involves systems monitoring their own states and judiciously allocating resources depending on each problem instance's difficulty or cost of mistakes. Drawing inspiration both from past work on resource-rational AI and from well-documented metacognitive strategies in psychology and cognitive science, we identify specific challenges in embedding these strategies into AI design and highlight open theoretical and implementation problems. We showcase these principles through a tangible example of improved learning efficiency, effectiveness, and security in a Federated Learning (FL) case study. We show how these principles can be translated into practice with a novel software framework developed specifically to allow the community to design, deploy, and experiment with metacognition-enabled AI applications.

2605.15565 2026-05-18 cs.LG cs.AI version update

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen

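Affiliations: Carnegie Mellon University; University of Michigan; UC Berkeley; Meta

AI summary: AstraFlow is a dataflow-oriented reinforcement learning system that supports complex multi-policy collaborative training and makes efficient use of heterogeneous compute resources, improving the reasoning and tool-use capabilities of agentic LLMs.

Details
AI abstract

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains very expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while making efficient use of elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension usually requires dedicated system engineering. The problem stems from trainer-centered control architectures and the lack of principled abstractions for RL system components. We therefore propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, so the system natively supports complex multi-policy agentic RL workloads and efficiently uses diverse compute resources. We evaluate AstraFlow on math, code, search, and AgentBench workloads and show that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow matches or exceeds the accuracy of existing RL systems while speeding up training by 2.7x.

English abstract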

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.

2605.15549 2026-05-18 cs.LG cs.AI cs.CE version update

CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models

Stefano Riva, Carolina Introini, Antonio Cammi, Dean Price, Alexey Yermakov, Yue Zhao, Philippe M. Wyder, Judah Goldfeder, Jan Williams, Amy Sara Rude, Matteo Tomasetto, Joe Germany, Joseph Bakarji, Georg Maierhofer, Miles Cranmer, J. Nathan Kutz

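Affiliations: Autodesk Research; Department of Energy, Nuclear Engineering Division, Politecnico di Milano; Nuclear Science and Engineering, Massachusetts Institute of Technology; Department of Applied Mathematics, University of Washington; Department of Electrical and Computer Engineering, University of Washington; High Performance Machine Learning, SURF; Distyl AI; Department of Computer Science, Columbia University; Department of Mechanical Engineering, University of Washington; Department of Mechanical Engineering, Politecnico di Milano; Department of Mathematics, American University in Beirut; Department of Mechanical Engineering, American University in Beirut; Department of Applied Mathematics and Theoretical Physics, University of Cambridge

AI summary: This paper proposes CTF4Nuclear, a framework for standardized evaluation of machine learning methods in nuclear engineering. Using 12 metrics and a system-monitoring task based on sparse measurements, it aims to raise the rigor and reproducibility of scientific ML for the nuclear industry.

Details
AI abstract

Demand for clean energy keeps growing, and new nuclear technologies offer a complement to renewable energy. Designing and operating these systems is highly challenging, however, because the complexity of the physical phenomena makes the system dynamics hard to predict. High-fidelity simulations help explain the nonlinear multi-physics interactions in a reactor, but they are computationally expensive and hard to use in real time. Model-based approaches are also sensitive to simplifying assumptions, which leads to inherent discrepancies with real measurements. Machine learning (ML) methods, by contrast, have the potential to produce reliable surrogate models that predict system behavior quickly. The data-driven methods available for this task are numerous and diverse, though, and in a safety-critical field such as nuclear engineering, a fair comparison of different ML methods and their strengths and weaknesses is essential. To this end, we introduce a Common Task Framework (CTF) for ML in nuclear engineering, building on previous efforts in dynamical systems and seismology. The CTF uses curated datasets from different nuclear and nuclear-adjacent systems. It evaluates methods on 12 established metrics and on a new paradigm focused on system monitoring from sparse measurements only. By benchmarking standard ML baselines, we reveal the limitations of current methods. Our vision is to replace ad hoc comparisons with standardized evaluation, raising the rigor and reproducibility of scientific ML for the nuclear industry.

English abstract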

The demand for clean energy is ever increasing, with new nuclear technologies presenting a complementary solution to renewable energies. However, designing and operating these systems is exceptionally difficult, given the complexity of the physical phenomena that interact to form the system dynamics. While high-fidelity simulations help to understand the non-linear, multi-physics interactions within a reactor, they are computationally expensive and rarely suitable for real-time applications. Furthermore, model-based approaches are inherently sensitive to simplifying assumptions required to derive their governing equations and parameters, leading to inevitable discrepancies with real-world measurements. In contrast, Machine Learning (ML) methods have the potential to generate reliable surrogate models which may be able to quickly predict the system's behaviour. However, the number of data-driven methods that can potentially be used for this task is large and diverse. In a safety-critical setting such as nuclear engineering, a fair comparison of different ML methods, and a clear understanding of their advantages and limitations, is of paramount importance. To address this, we introduce a Common Task Framework (CTF) for ML in nuclear engineering, building upon previous efforts in dynamical systems and seismology. This CTF considers a curated set of datasets from different nuclear and nuclear-adjacent systems. The CTF evaluates the performance of a method on 12 established metrics, alongside a new paradigm focused on system monitoring from sparse measurements only. We illustrate the framework by benchmarking standard ML baselines against these datasets, revealing current method limitations. Our vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets, raising the bar for rigour and reproducibility in scientific ML for the nuclear industry.

2605.15543 2026-05-18 cs.GT cs.AI version update

Domain-Independent Game Abstraction using Word Embedding Techniques

Juho Kim, Tuomas Sandholm

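Affiliations: CMU; Strategic Machine, Inc.; Strategy Robot, Inc.; Optimized Markets, Inc.

AI summary: This paper proposes a game abstraction method based on word embedding techniques from natural language processing. It treats actions as words, learns word-vector representations, and clusters them to achieve domain-independent game abstraction. Experiments show the method is effective, though it does not match specialized algorithms.

Details
AI abstract

Many real-world games are very large and must be shrunk through game abstraction. Game abstraction has advanced significantly over the past two decades, but most work is confined to specific domains (such as poker) and is hard to generalize to others. This paper proposes a domain-independent game abstraction method that uses word embedding techniques from natural language processing: it treats actions as words, trains word-vector representations, and clusters them to produce the abstraction. Experimental results show that the method is effective, although it does not perform as well as algorithms optimized for specific games.

English abstract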

Many games of interest in the real world are intractably large, necessitating the use of game abstraction to shrink them, typically by many orders of magnitude. Over the last two decades, there have been significant advances in game abstraction; however, the domain-specific nature (usually poker) of much of the prior work prevents those techniques from being easily generalized to other settings without extensively analyzing the game at hand. In this paper, we propose a domain-independent approach to game abstraction, which applies word embedding techniques from the field of natural language processing. Treating each action as a word and gameplay data as a corpus, word vectors can be trained to represent each action as a real-valued vector, which can then be clustered to facilitate game abstraction. We also explore the use of foundational embedding models and show that action embeddings obtained this way can capture a surprising amount of information about the underlying game. Experimental results demonstrate that our proposed game abstraction technique is effective, although it does not outperform specialized algorithms tailored to specific games.

2605.15542 2026-05-18 cs.AI version update

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

Yichao Liu, Huawen Shen, Liu Yu, Shiyu Liu, Zeyu Chen, Yu Zhou

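Affiliations: Nankai University; Institute of Information Engineering, Chinese Academy of Sciences

AI summary: DRS-GUI improves GUI grounding with a dynamic region search framework: a lightweight UI Perceptor and an MCTS-based Action Planner explore and filter regions efficiently, strengthening the grounding ability of multimodal large language models.

Comments 11 pages, 8 figures

Details
AI abstract

GUI agents based on multimodal large language models (MLLMs) are good at understanding and executing user instructions, but accurately locating the relevant elements in high-resolution screenshots remains challenging. Inspired by how humans dynamically adjust their perceptual scope, this paper proposes DRS-GUI, a training-free dynamic region search framework that integrates seamlessly into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions, Focus, Shift, and Scatter, to explore the interface step by step and generate region proposals. An Action Planner based on Monte Carlo Tree Search (MCTS) schedules these actions dynamically, and a region quality reward is used to evaluate and select highly relevant regions, effectively pruning redundant UI elements. Experiments show that DRS-GUI achieves a 14% improvement on ScreenSpot-Pro for both general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly improving grounding performance and generalization.

English abstract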

GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed to evaluate and select the highly instruction-relevant region, efficiently pruning redundant UI elements. Experiments demonstrate that DRS-GUI yields a 14% improvement on ScreenSpot-Pro for general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly enhancing grounding performance and generalization.

2605.15537 2026-05-18 cs.AI version update

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

Jing Wang, Shang Liu, Hangan Zhou, Zhiyao Xie

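Affiliations: Hong Kong University of Science and Technology

AI summary: This paper proposes the RTL-BenchMT framework, which automatically identifies and revises flawed cases and detects and updates overfitting cases, addressing flaws and overfitting in RTL benchmarks and reducing manual maintenance costs.

Comments This paper has been accepted by DAC 2026

Details
AI abstract

This paper introduces RTL-BenchMT, an agentic framework for dynamically maintaining RTL generation benchmarks. Automated RTL generation assisted by large language models (LLMs) is an important direction in EDA research. However, current RTL benchmarks face two key challenges: (1) flawed cases in the benchmarks and (2) overfitting to the benchmarks. Both are hard to resolve through manual engineering effort alone. To address them and systematically reduce human maintenance costs, we propose the automated agentic framework RTL-BenchMT. RTL-BenchMT focuses on two key applications: (1) automatically identifying and revising flawed benchmark cases and (2) automatically detecting and updating overfitting cases. With RTL-BenchMT, we analyze flawed and overfitting cases in depth and produce a refined benchmark suite that will be open-sourced to the community.

English abstract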

This paper introduces RTL-BenchMT, an agentic framework for dynamically maintaining RTL generation benchmarks. Automated RTL generation assisted by Large Language Models (LLMs) is one of the most important directions in EDA research. However, current RTL benchmarks face two critical challenges: (1) flawed cases in the benchmarks and (2) overfitting to the benchmarks. Both challenges are difficult to resolve purely by manual engineering effort. To address these issues and systematically reduce human maintenance costs, we propose an automated agentic framework, RTL-BenchMT. RTL-BenchMT focuses on two key applications: (1) automatically identifying and revising flawed benchmark cases and (2) automatically detecting and updating overfitting cases. With the assistance of RTL-BenchMT, we conduct a thorough, in-depth analysis of flawed and overfitting cases and produce a refined benchmark suite that will be open-sourced to the community.

2605.15536 2026-05-18 cs.RO cs.AI cs.CV version update

SkiP: When to Skip and When to Refine for Efficient Robot Manipulation

Mingtong Dai, Guanqi Peng, Yongjie Bai, Feng Yan, Chunjie Chen, Lingbo Liu, Liang Lin, Xinyu Wu

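Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Peng Cheng Laboratory; Southern University of Science and Technology; Sun Yat-sen University; UNT; University of Chinese Academy of Sciences

AI summary: SkiP makes robot manipulation more efficient by dynamically skipping redundant steps and refining key steps, with no additional hierarchy or planner.

Details
AI abstract

Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or in precise, contact-rich manipulation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory move through free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction. We propose a new action relabeling mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, so the policy can leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and densely refines key segments within a single unified network, with no learned skip planner or hierarchical structure. To partition demonstrations into key and skip segments automatically, without manual annotation, we introduce Motion Spectrum Keying (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments on 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by 15-40% while matching or improving success rates across various policy backbones. Project page: https://pgq18.github.io/SkiP-page/

English abstract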

Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction. We propose a novel action relabeling mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and intensively refines actions in key segments, within a single unified network requiring no learned skip planner or hierarchical structure. To automatically partition demonstrations into key and skip segments without manual annotation, we introduce Motion Spectrum Keying (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments across 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by 15-40% while matching or improving success rates across various policy backbones. Project page: https://pgq18.github.io/SkiP-page/
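
The action relabeling step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the key/skip mask `is_key` is taken as given (the abstract attributes the partition to MSK, which is not implemented here), the function name `relabel_actions` is hypothetical, and leaving trailing skip steps unchanged when no key segment follows is a choice made for this sketch, not something the abstract specifies.

```python
from typing import List, Sequence


def relabel_actions(actions: Sequence, is_key: Sequence[bool]) -> List:
    """Return behavior cloning targets after skip-segment relabeling.

    Key steps keep their own action. Each skip step is relabeled with the
    action at the first step of the next key segment. Skip steps with no
    later key segment keep their original action (assumption of this sketch).
    """
    targets = list(actions)
    next_key_action = None
    # Scan backward so next_key_action always holds the action of the
    # nearest key step to the right, i.e. the entrance of the next key segment.
    for t in range(len(actions) - 1, -1, -1):
        if is_key[t]:
            next_key_action = actions[t]
        elif next_key_action is not None:
            targets[t] = next_key_action
    return targets


demo_actions = ["a0", "a1", "a2", "a3", "a4", "a5"]
demo_is_key = [False, False, True, True, False, False]
demo_targets = relabel_actions(demo_actions, demo_is_key)
```

In this toy demonstration the two leading skip steps both receive `a2`, the action at the entrance of the key segment, while the key steps and the trailing skip steps are left as they were.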

2605.15533 2026-05-18 cs.CV cs.AI version update

Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance

Song Wu, Xinyu Chen, Qian Wang, Liang Li, Zili Yi, Junlan Feng

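Affiliations: JIUTIAN Research, China Mobile; School of Intelligence Science and Technology, Nanjing University; State Key Laboratory of Novel Software Technology, Nanjing University

AI summary: This paper proposes a tuning-free, instruction-based video editing framework that uses a structural noise initialization strategy and a noise guidance mechanism to improve the visual quality and performance of video editing.

Comments Accepted by ICIP 2026

Details
AI abstract

Video editing poses significant challenges. A series of tuning-free methods avoid the need for extensive data collection and model training, but they often fail to fully exploit the rich information embedded in the noisy latent space, which leads to unsatisfactory results. We therefore propose a tuning-free, instruction-based video editing framework that approaches the task from the perspective of the noisy latent space. We design a Structural Noise Initialization Strategy (SNIS) that obtains a better editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We also introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and integrates the rich information in the noisy latent space to guide the denoising process, preserving unedited content and overall visual coherence. Experiments show that the proposed method outperforms existing methods in both visual quality and performance.

English abstract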

Video editing poses a significant challenge. While a series of tuning-free methods circumvent the need for extensive data collection and model training, they often underutilize the rich information embedded within noisy latents, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach video editing from the perspective of noisy latents: we design a Structural Noise Initialization Strategy (SNIS) to secure a superior editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and effectively integrates rich information within the noisy latent to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that our proposed method achieves better visual quality and state-of-the-art performance.

2605.15529 2026-05-18 cs.CL cs.AI cs.LG version update

Process Rewards with Learned Reliability

Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang

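Affiliations: Washington University in St. Louis; Singapore University of Technology and Design

AI summary: This paper proposes BetaPRM, which improves process reward models by predicting both the step success probability and the reliability of that prediction, so downstream tasks can distinguish reliable rewards from uncertain ones. The ACA application, used in Best-of-N reasoning, improves the accuracy-token tradeoff.

Details
AI abstract

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output a single reward score for each step. Downstream methods must treat imperfect step-level reward predictions as reliable decision signals, with no indication of when those predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. The learned reliability signal indicates when a step reward should be trusted, so downstream applications can distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend more computation on uncertain candidate prefixes. Experiments on four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.

English abstract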

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy--token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
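
To make the Beta-Binomial likelihood in the abstract concrete, the sketch below scores `k` successful continuations out of `n` under a Beta belief with parameters `alpha` and `beta`. It uses the standard Beta-Binomial formula; the function names are hypothetical, and reading the concentration `alpha + beta` as a reliability proxy is an assumption of this sketch, since the abstract does not say how BetaPRM derives its reliability signal.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_log_likelihood(k: int, n: int, alpha: float, beta: float) -> float:
    """log P(k successes in n trials) when the success rate follows Beta(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def beta_mean_and_concentration(alpha: float, beta: float) -> tuple:
    """Mean success probability and concentration of a Beta belief.

    Treating the concentration as a reliability proxy is an assumption here.
    """
    return alpha / (alpha + beta), alpha + beta


# Illustrative values: 6 of 8 continuations succeed under a Beta(2, 3) belief.
nll = -beta_binomial_log_likelihood(k=6, n=8, alpha=2.0, beta=3.0)
```

Minimizing this negative log-likelihood over `alpha` and `beta` fits the whole belief to the observed count, whereas regressing to the ratio `k / n` would fit only a point estimate.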

2605.15524 2026-05-18 cs.LG cs.AI math.DG math.ST stat.TH version update

Neural Point-Forms

Bruno Trentini, Jacob Hume, Vincenzo Antonio Isoldi, Philipp Misof, Ekaterina S. Ivshina, Kelly Maggs

发表机构 * NVIDIA University of Oxford(牛津大学) Max Planck Institute for Mathematics in the Sciences(马克斯·普朗克数学研究所) Department of Mathematical Sciences(数学科学系) Chalmers University of Technology and University of Gothenburg(查尔姆斯理工大学和哥德堡大学) School of Engineering and Applied Sciences(工程与应用科学学院) Max Planck Institute of Molecular Cell Biology and Genetics(马克斯·普朗克分子细胞生物学与遗传学研究所)

AI总结 本文提出神经点形(NPFs),通过扩散几何中的拉普拉斯技术,构建点云的可学习几何特征,用于比较微分形式,并在合成和生物相关实验中展示其在处理采样密度、流形结构和群体几何时的优势。

详情
AI中文摘要

点云学习通常基于这样的假设:观察样本是某个底层几何对象(例如嵌入高维特征空间的流形)的含噪痕迹。然而,许多几何特性无法仅通过坐标、成对距离或学习的图邻域直接捕捉。在光滑情况下,微分形式用于编码高阶切向信息。本文引入了一种新的可学习几何特征家族,称为神经点形(NPFs)。在没有自然切向结构的情况下,我们使用来自扩散几何的拉普拉斯技术,通过内积构建点云的离散模型,以比较微分形式。在连续情况下,共享环境特征空间的子流形表示为比较矩阵,其条目描述了成对的特征形式如何与外在切向信息相互作用。我们通过证明在标准采样、带宽、密度和流形假设下比较矩阵的长期一致性,使这一直觉精确化。这产生了一个紧凑、高效且置换不变的神经层,其输出是一个学习的形比较矩阵。在合成和生物相关实验中,我们展示了NPFs提供了一个有竞争力且可解释的表示,当标签依赖于采样密度、类流形结构或与响应相关的群体几何时,其优势最为明显。

英文摘要

Point cloud learning often rests on the premise that observed samples are noisy traces of an underlying geometric object, such as a manifold embedded in a high-dimensional feature space. Yet much of this geometry is not captured directly by coordinates, pairwise distances, or learned graph neighborhoods alone. In the smooth setting, differential forms are devices to encode higher-order tangency information. In this work, we introduce a new family of principled learnable geometric features for point clouds called neural point-forms (NPFs). In the absence of a natural tangency structure, we instead use Laplacian-based techniques from Diffusion Geometry to build a discrete model for comparing differential forms on point clouds via inner products. In the continuum, submanifolds of a shared ambient feature space are represented as comparison matrices, whose entries describe how pairs of feature forms interact with extrinsic tangency information. We make this intuition precise by proving the long-run consistency of comparison matrices under standard sampling, bandwidth, density, and manifold-hypothesis assumptions. This yields a compact, efficient, and permutation-invariant neural layer whose output is a learned form-comparison matrix. Across synthetic and biologically relevant experiments, we show that NPFs provide a competitive and interpretable representation, with the strongest benefits appearing when labels depend on sampling density, manifold-like structure, or response-relevant population geometry.

2605.15520 2026-05-18 cs.LG cs.AI cs.DC 版本更新

On the Fragility of Data Attribution When Learning Is Distributed

在分布式学习中数据归因的脆弱性

Xian Gao, Bo Hui, Min-Te Sun, Wei-Shinn Ku

发表机构 * Department of Computer Science and Software Engineering, Auburn University, Auburn, Alabama, USA(计算机科学与软件工程系,奥本大学,奥本,阿拉巴马州,美国) Department of Computer Science, University of Tulsa, Tulsa, Oklahoma, USA(计算机科学系,塔尔萨大学,塔尔萨,俄克拉荷马州,美国) Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan(计算机科学与信息工程系,国立中央大学,桃园,台湾)

AI总结 研究揭示了分布式学习中数据归因的脆弱性,通过归因优先攻击展示归因值可能被人为放大,同时提出归因鲁棒和激励相容的评分机制。

详情
AI中文摘要

数据归因已成为机器学习流水线中定价、审计和治理的重要组成部分,但大多数归因方法隐含假设归因值忠实反映参与者贡献。我们证明这一假设可能失效:在标准分布式训练流程中,单个参与者可通过潜变量优化注入小合成批次,保持全局效用的同时放大其归因值。在多个数据集、模型和边际效用评估器中,攻击一致增加攻击者的归因值并重塑良性客户端间的相对归因结构,而不会降低准确性或触发基于几何的防御。这些结果表明归因本身形成新的攻击面,推动归因鲁棒和激励相容的评分机制发展。

英文摘要

Data attribution has become an important component of pricing, auditing, and governance in machine learning pipelines, yet most attribution methods implicitly assume that attribution values faithfully reflect participants' contributions. We show that this assumption can fail: a single participant in a standard distributed training workflow can substantially inflate its measured attribution value while preserving global utility. Our attribution-first attack uses latent optimization to inject small synthetic batches that preserve utility while exploiting non-IID label coverage and evaluator sensitivities. Across datasets, models, and multiple marginal-utility evaluators, the attack consistently increases the adversary's attribution value and reshapes the relative attribution structure among benign clients without degrading accuracy or triggering geometry-based defenses. These results show that attribution itself forms a new attack surface and motivate the development of attribution-robust and incentive-compatible scoring mechanisms.

2605.15519 2026-05-18 cs.CV cs.AI 版本更新

DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments

DiffVAS: 在部分可观测环境中基于扩散的视觉主动搜索

Anindya Sarkar, Srikumar Sastry, Aleksis Pirinen, Nathan Jacobs, Yevgeniy Vorobeychik

发表机构 * Washington University in St. Louis(华盛顿大学圣路易斯分校) RISE Research Institutes of Sweden(瑞典RISE研究机构) Climate AI Nordics(北欧气候AI)

AI总结 DiffVAS提出了一种目标条件化的策略,能够在部分可观测环境中同时搜索多种目标,提升了视觉主动搜索在现实应用中的部署能力。

Comments 26 Pages, 12 figures, Accepted to AAMAS 2026

详情
AI中文摘要

视觉主动搜索(VAS)已被引入作为一种建模框架,利用视觉线索指导空中(如基于无人机的)探索,并在广阔的地理区域中定位感兴趣区域。潜在应用包括检测稀有野生动物盗猎的热点、协助搜救任务以及揭露非法武器交易等。先前的VAS方法假设整个搜索空间在前期已知,这在受限视野和高采集成本的约束下往往不现实,且通常学习针对特定目标对象的策略,限制了同时搜索多种目标类别的能力。在本工作中,我们提出DiffVAS,一种目标条件化的策略,根据任务需求在部分可观测环境中同时搜索多种对象,从而推进视觉主动搜索策略在现实应用中的部署。DiffVAS利用扩散模型从顺序观测的局部视图中重建整个地理区域,使基于目标条件的强化学习规划模块能够有效推理并引导后续的搜索步骤。大量实验表明,DiffVAS在部分可观测环境中搜索多种对象方面表现优异,在多个数据集上显著超越了最先进的方法。

英文摘要

Visual active search (VAS) has been introduced as a modeling framework that leverages visual cues to direct aerial (e.g., UAV-based) exploration and pinpoint areas of interest within extensive geospatial regions. Potential applications of VAS include detecting hotspots for rare wildlife poaching, aiding search-and-rescue missions, and uncovering illegal trafficking of weapons, among other uses. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic due to constraints such as a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, which advances the deployment of visual active search policies in real-world applications. DiffVAS leverages a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned reinforcement learning-based planning module to effectively reason and guide subsequent search steps. Extensive experiments demonstrate that DiffVAS excels in searching diverse objects in partially observable environments, significantly surpassing state-of-the-art methods on several datasets.

2605.15514 2026-05-18 cs.CL cs.AI cs.LG 版本更新

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

RoPE在长上下文中既无法区分位置也无法区分令牌(可证明)

Yufeng Du, Phillip Harris, Minyang Tian, Eliu A Huerta, Srikanth Ronanki, Subendhu Rongali, Aram Galstyan, Hao Peng

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Bonn(波恩大学) Argonne National Laboratory(阿贡国家实验室) Amazon AGI(亚马逊人工智能研究院)

AI总结 本文证明RoPE在长上下文中因失去局部偏倚和令牌相关性一致性而失效,无法区分位置或令牌,且增加RoPE基值只能牺牲位置区分能力。

Comments 35 pages, 11 figures, submitted to NeurIPS 2026

详情
AI中文摘要

我们识别了旋转位置嵌入(RoPE)在基于Transformer的长上下文语言模型中的内在限制。我们的理论分析脱离了上下文的具体内容,仅依赖其长度。我们证明,随着上下文长度增加,基于RoPE的注意力变得不可预测,并失去两个对其有效性至关重要的属性。首先,它失去局部偏倚:RoPE偏向较近位置的可能性并不高于偏向远得多的位置。其次,它失去令牌相关性的一致性:一个键向量在某一位置获得比另一候选更高的注意力分数,在另一位置却可能获得更低的分数。在两种情况下,失败的概率接近0.5,不优于随机猜测。我们进一步证明,当键令牌被移动到不同位置、甚至被不同令牌替换时,注意力分数可以保持不变,表明无法区分位置或令牌。调整RoPE基值是在区分位置和区分令牌之间进行权衡,无法同时保持两者。增加RoPE基值超参数是当前长上下文模型中的常见做法,它有助于区分不同令牌,但不可避免地牺牲区分位置的能力。我们的实证分析显示,多头、多层架构不足以克服这些限制。我们的发现表明,未来基于Transformer的长上下文语言模型可能需要全新的机制来编码位置和令牌顺序。

英文摘要

We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
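For readers unfamiliar with the mechanism under analysis, the standard RoPE attention score can be sketched as below. This is a generic illustration of rotary embeddings rather than code from the paper; the function names and the default base of 10000 are assumptions of the example. It shows the property the abstract's analysis builds on: the score depends on the query and key positions only through their offset, with per-pair rotation frequencies set by the base.

```python
import math


def rope_rotate(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive coordinate pairs of `vec` by position-dependent angles.
    Pair j (dimensions 2j, 2j+1) uses frequency base ** (-2j / d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)  # i = 2j is the even dimension index
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_score(q: list[float], k: list[float], q_pos: int, k_pos: int,
               base: float = 10000.0) -> float:
    """Unnormalized attention logit between a rotated query and a rotated key."""
    rq = rope_rotate(q, q_pos, base)
    rk = rope_rotate(k, k_pos, base)
    return sum(a * b for a, b in zip(rq, rk))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
near = rope_score(q, k, q_pos=5, k_pos=3)        # offset 2
shifted = rope_score(q, k, q_pos=105, k_pos=103)  # same offset, shifted by 100
```

Raising `base` lowers every rotation frequency, which is the hyperparameter change the abstract describes as trading position discrimination against token discrimination.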

2605.15513 2026-05-18 cs.AI 版本更新

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

CAPS:级联自适应成对选择用于高效的并行推理

Fangzhou Lin, Shuo Xing, Peiran Li, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhengzhong Tu

发表机构 * Texas A&M University(德克萨斯农工大学) Worcester Polytechnic Institute(伍斯特理工学院) Tohoku University(东北大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 CAPS通过级联自适应成对选择方法,在保持高效并行推理的同时,减少验证器的计算成本,优于现有成对验证方法。

Comments 31 pages, 2 figures, 18 tables

详情
AI中文摘要

并行推理,即由生成器采样多个候选解、再由聚合器选出最佳解,是大型语言模型中最有效的测试时扩展形式之一,而成对自我验证已成为其最强的聚合原语。然而,成对验证成本高昂:每次判断需完整读取两个解,而现有方法无论比较是否有信息量,每个问题都要进行数十次这样的判断。我们引入CAPS(级联自适应成对选择),一种仅在推理阶段使用的框架,沿两个正交轴非均匀地分配验证器计算:证据轴调整验证器看到每个候选解的多少内容,分布轴调整比较在候选池中的分配方式。CAPS将二者实例化为一个带可选救援子程序的四阶段级联,并给出闭式的验证器令牌成本,其中每个候选的边际成本相对于均匀的全证据方案大致减半。在四个自我验证模型(Qwen3-14B,GPT-OSS-20B,Qwen3-4B-Instruct/Thinking)和五个涵盖代码(LiveCodeBench-v5/v6,CodeContests)与数学(AIME 2025,HMMT 2025)的推理基准上,CAPS在20个套件中的14个上优于领先的成对验证器,同时在代码任务上仅使用其25.4%的验证器令牌预算,并在全部20个套件上优于逐点自我验证。对于存在权衡的套件,可以用验证器在部分证据与完整证据下的准确率给出可解释的诊断,从而为判断级联是否适用提供具体的部署前检查。

英文摘要

Parallel reasoning, where a generator samples many candidate solutions and an aggregator selects the best, is one of the most effective forms of test-time scaling in large language models, and pairwise self-verification has become its strongest aggregation primitive. Yet pairwise verification carries a heavy cost: each judgment reads two complete solutions in full, and existing methods perform tens of such judgments per problem regardless of whether the comparison is informative. We introduce CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that allocates verifier compute non-uniformly along two orthogonal axes: an evidence axis that adapts how much of each candidate the judge sees, and a distribution axis that adapts how comparisons are spread across the pool. CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks spanning code (LiveCodeBench-v5/v6, CodeContests) and math (AIME 2025, HMMT 2025), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20. The trade-off suites admit an interpretable diagnostic in terms of the verifier's accuracy at partial versus full evidence, providing a concrete pre-deployment check for cascade suitability.

2605.15507 2026-05-18 cs.IT cs.AI cs.LG math.IT 版本更新

PrismQuant: Rate-Distortion-Optimal Vector Quantization for Gaussian-Mixture Sources

PrismQuant: 为高斯混合源优化的率失真向量量化

Bumsu Park, Chanho Park, Youngmok Park, Namyoon Lee

发表机构 * Department of Electrical Engineering(电气工程系)

AI总结 针对高斯混合源,PrismQuant通过组件标签传输和组件匹配KLT实现率失真优化,结合EM驱动学习和熵约束量化,有效逼近理论边界并优于传统模型。

详情
AI中文摘要

对于均方误差(MSE)下的高斯源,经典变换编码是率失真(RD)最优的:KLT对角化协方差,反向注水分配比特,最后由标量量化完成编码。然而在多模态源中,单一协方差无法捕捉异质的局部几何,RD函数也失去闭式解。本文通过高斯混合源重新审视该问题,并为其建立构造性的RD理论。核心发现是混合结构仅引入组件标签成本。在给定活跃混合组件的条件下,每个分支都是高斯的;挑战在于异质分支间的比特分配。我们证明,genie-aided(精灵辅助)条件RD函数由所有组件和特征模态共享的单一全局反向注水水平决定。基于此,我们提出PrismQuant,无损传输组件标签并使用组件匹配的KLT编码残差,随后进行标量量化,其码率与逆定理界相差不超过每源维度H(C)/n比特,且渐近间隙趋于零。我们进一步开发了基于EM驱动高斯混合学习、组件自适应KLT和熵约束标量量化(ECSQ)的实用实现。合成高斯混合实验显示PrismQuant接近理论RD界,真实信道状态信息(CSI)数据实验显示,其性能与基于Transformer的学习型编解码器相当或更优,而模型规模小一个数量级以上。

英文摘要

For a Gaussian source under mean-squared error (MSE), classical transform coding is rate-distortion (RD) optimal: the Karhunen-Loève transform (KLT) diagonalizes the covariance, reverse waterfilling allocates the bits, and scalar quantization closes the loop. This elegant story breaks down for multimodal sources, where no single covariance can capture heterogeneous local geometries, and the RD function loses its closed form. We revisit this problem through Gaussian-mixture sources and develop a constructive RD theory for them. Our key finding is that the mixture structure incurs only a component label cost. Conditioned on the active mixture component, each branch is Gaussian; the challenge is allocating bits across heterogeneous branches. We prove that the genie-aided conditional RD function is governed by a single global reverse-waterfilling level shared across all components and eigenmodes. Building on this result, we introduce PrismQuant, which transmits the component label losslessly and encodes the residual using the component-matched KLT, followed by scalar quantization, achieving a rate within H(C)/n bits per source dimension of the converse, with a vanishing asymptotic gap. We further develop a practical implementation based on EM-driven Gaussian-mixture learning, component-adaptive KLTs, and entropy-constrained scalar quantization (ECSQ). Experiments on synthetic Gaussian mixtures show that PrismQuant closely approaches the theoretical RD bound, while experiments on real-world channel-state-information (CSI) data demonstrate competitive or superior performance compared with transformer-based learned codecs at more than one order of magnitude smaller model size.
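The classical reverse-waterfilling step that the abstract starts from can be sketched as follows for a single Gaussian source whose KLT eigenvalues (component variances) are already known. This is the textbook single-Gaussian allocation, not PrismQuant's mixture-wide rule; the function name, the bisection solver, and the choice to express distortion as a total over components are assumptions of the example.

```python
import math


def reverse_waterfill(variances: list[float], total_distortion: float,
                      iters: int = 200) -> tuple[float, float]:
    """Return (water_level, rate_in_bits) for independent Gaussian components.

    The water level theta solves sum_i min(theta, var_i) = total_distortion;
    components with var_i <= theta get zero bits, the rest get
    0.5 * log2(var_i / theta) bits each.
    """
    if total_distortion <= 0:
        raise ValueError("total_distortion must be positive")
    if total_distortion >= sum(variances):
        return max(variances), 0.0  # every component is discarded
    lo, hi = 0.0, max(variances)
    for _ in range(iters):  # bisection on the monotone map theta -> distortion
        mid = 0.5 * (lo + hi)
        if sum(min(mid, v) for v in variances) < total_distortion:
            lo = mid
        else:
            hi = mid
    level = 0.5 * (lo + hi)
    rate = sum(0.5 * math.log2(v / level) for v in variances if v > level)
    return level, rate


# Two eigenmodes with variances 4 and 1 and a total MSE budget of 2:
# the water level sits at 1, so only the strong mode receives bits.
level, rate = reverse_waterfill([4.0, 1.0], total_distortion=2.0)
```

According to the abstract, the mixture case is governed by one such level shared across all components and eigenmodes; how the component probabilities enter that allocation is not stated there and is not modeled here.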

2605.15504 2026-05-18 cs.LG cs.AI 版本更新

Learning with Conflicts of Interest

利益冲突中的学习

Nischal Aryal, Arash Termehchy, Ali Vakilian, Marianne Winslett

发表机构 * Oregon State University(俄勒冈州立大学) Virginia Tech(弗吉尼亚理工大学) University of Illinois(伊利诺伊大学)

AI总结 本文提出一种博弈论框架,用于解决ML系统与用户之间的利益冲突,通过可扩展的算法在保护用户的同时最大化有益信息。

详情
AI中文摘要

金融、社会和政治因素经常导致ML系统所有者和服务使用者的利益无法完全一致。ML系统往往产生有偏见的信息,可能影响用户做出不利于自身利益的决定。当前解决方案要求ML系统实施协议以缓解偏见,但所有者通常没有实施这些协议的激励,并常认为这限制了他们的表达自由或商业。我们认为,解决此问题的成功方案必须认识到ML系统与其用户之间的利益冲突,并利用此信息保护用户免受不利影响,同时允许用户安全地受益于这些系统。为此,我们提出了一种博弈论框架,用于建模存在利益冲突的ML系统与用户之间的互动。我们提出了具有理论保证的可扩展算法,以最大化与所需信息和行动相关的内容,并最小化与偏见和操纵行为相关的交互内容。

英文摘要

Financial, social, and political factors often prevent the interests of the owners of ML systems and services and their users from being perfectly aligned. ML systems often produce biased information that can influence users to make decisions that are not in their best interest. Current solution approaches require ML systems to implement protocols to mitigate their biases. However, ML system owners usually do not have any incentive to implement these protocols and often argue that it limits their freedom of expression or business. We believe that a successful solution to this problem must recognize the conflict of interest between the ML systems and their users, and use this information to protect users against information that adversely influences their decisions while allowing users to safely benefit from these systems. To this end, we propose a game-theoretic framework that models the interaction between ML systems and users with conflicts of interest. We present scalable algorithms with theoretical guarantees that maximize the amount of desired information and actions and minimize the amount of biased and manipulative actions in interaction with ML systems.

2605.15486 2026-05-18 cs.RO cs.AI 版本更新

Hybrid LLM-based Intelligent Framework for Robot Task Scheduling

基于混合大语言模型的智能机器人任务调度框架

Swayamjit Saha, Subhabrata Das, Haonan Duan, Xiao-Yang Liu

发表机构 * Department of Computer Science and Engineering, Mississippi State University(密西西比州立大学计算机科学与工程系) Graduate School of Arts and Sciences, Columbia University(哥伦比亚大学文理研究生院) Consumer and Community Banking, JPMorgan Chase(摩根大通消费者与社区银行部) Department of Data Science, Columbia University(哥伦比亚大学数据科学系) Department of Electrical Engineering, Columbia University(哥伦比亚大学电气工程系)

AI总结 本文提出利用大语言模型提升建筑机器人任务调度效率,通过平衡时间效率与资源利用,结合自然语言处理接口实现与专业人员的实时沟通,并采用两个LLM代理生成更精确的任务计划。

Comments 9 pages, 5 figures

详情
AI中文摘要

本研究介绍了一种利用大语言模型(LLMs)改进建筑机器人任务调度的智能框架。LLM通过接收关键任务数据,如代理行动能力及目标终点来优化任务分配策略。系统利用自然语言处理接口与建筑专业人员沟通,并实时适应突发工地条件。我们同时使用两个LLM代理,即生成器(GPT-4)和监督器(Gemma 3/Llama 4/Mistral 7b)LLM代理,以提供更精确的任务计划。我们通过简单场景评估所提出的方法,并提供指标分数证明框架的有效性。我们的结果表明,在包括机器人在内的建筑操作任务中,LLM的实施至关重要。

英文摘要

This study introduces intelligent frameworks that use Large Language Models (LLMs) to improve task scheduling for construction robots. The LLM is fed with key data about the desired task, such as agent action abilities, and the desired end goal to be achieved. A well-balanced allocation strategy is developed, optimizing both time efficiency and resource utilization. Our system utilizes a Natural Language Processing interface to streamline communication with construction professionals and adapt in real-time to unexpected site conditions. We concurrently use two LLM agents, specifically generator (GPT-4) and supervisor (Gemma 3/Llama 4/Mistral 7b) LLM agents to provide a more precise task schedule. We evaluate the proposed methodology using a straightforward scenario and provide metric scores to prove the efficacy of the frameworks. Our results highlight that the implementation of LLMs is crucial in construction operational tasks including robots.

2605.15480 2026-05-18 cs.RO cs.AI 版本更新

Residual Reinforcement Learning for Robot Teleoperation under Stochastic Delays

残差强化学习用于具有随机延迟的机器人遥控

Kaize Deng, Zewen Yang

发表机构 * Technical University of Munich(慕尼黑工业大学)

AI总结 针对随机延迟导致的信号不连续问题,本文提出一种混合控制框架,通过LSTM状态估计器与残差强化学习策略相结合,提升遥控稳定性与性能。

Comments Accepted at 23rd IFAC World Congress 2026

详情
AI中文摘要

遥控中的随机通信延迟引入了信号不连续性,破坏了控制稳定性并降低了控制性能。因此,传统强化学习方法在面对延迟观测时表现不佳,导致高频震荡。为此,我们提出了一种混合控制框架,即延迟鲁棒强化学习,结合使用长短期记忆网络(LSTM)的状态估计器与残差强化学习策略。LSTM从延迟观测中重建出平滑的连续状态估计,使强化学习代理学习残差扭矩补偿策略,平衡跟踪精度与速度平滑性。在Franka Panda机器人上的实验验证表明,本文方法显著优于现有最先进基线,确保在高方差随机延迟下仍能实现稳健稳定的遥控。

英文摘要

Stochastic communication delays in teleoperation introduce signal discontinuities that undermine control stability and degrade control performance. Consequently, conventional reinforcement learning (RL) methods struggle with the delayed, discontinuous observations, which leads to high-frequency chattering. To address this, we propose a hybrid control framework, delay-resilient RL, that integrates a Long Short-Term Memory (LSTM) state estimator with a residual RL policy and is resilient to stochastic delays. The LSTM reconstructs smooth, continuous state estimates from delayed observations, enabling the RL agent to learn a residual torque compensation policy that balances tracking accuracy with velocity smoothness. Experimental validation on Franka Panda robots demonstrates that our approach significantly outperforms the state-of-the-art baselines, ensuring robust and stable teleoperation even under high-variance stochastic delays.

2605.15467 2026-05-18 cs.CL cs.AI 版本更新

Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction

基于检索增强的大型语言模型用于受模式约束的临床信息提取

A H M Rezaul Karim, Ozlem Uzuner

发表机构 * George Mason University(乔治梅森大学)

AI总结 本文提出一种模块化检索增强生成框架,通过schema约束提示、确定性后处理和二次审核,提升护士-患者对话中观察提取的F1分数达80.36%。

详情
AI中文摘要

护士与患者的对话记录包含可操作的观察信息,但将这些记录大规模转化为结构化表示仍具挑战性。MEDIQA-SYNUR专注于从护士与患者的对话记录中提取观察,要求系统将这些叙述规范化为带有值类型约束的预定义模式。我们提出了一种模块化检索增强生成(RAG)流程,利用训练集作为示例语料库,结合模式约束提示(完整模式与剪枝候选模式)、基于模式的确定性后处理和二次审核,并采用两个LLM骨干:Llama-4-Scout-17B-16E-Instruct和GPT-5.2,配以相应的RAG嵌入模型。我们的最佳配置使用GPT-5.2、完整模式、RAG和二次审核,达到80.36%的F1分数。整体结果表明,RAG持续提升性能,而最佳的模式约束程度取决于模型,二次审核通过纠正残余的模式遵循错误带来小幅额外增益。

英文摘要

Conversational nurse-patient transcripts contain actionable observations, but converting these transcripts into structured representations at scale remains challenging. Documentation burden is substantial, with prior studies showing clinicians spend large portions of their workday on documentation and related desk work rather than direct patient care. MEDIQA-SYNUR focuses on observation extraction from conversational nurse-patient transcripts, requiring systems to normalize these narratives into a predefined schema with value-type constraints. We propose a modular retrieval-augmented generation (RAG) pipeline that uses the training set as an exemplar corpus and combines schema-constrained prompting (full schema vs. pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit, with two LLM backbones: Llama-4-Scout-17B-16E-Instruct and GPT-5.2, each with a corresponding embedding model for RAG. Our best configuration uses GPT-5.2 with full schema, RAG, and second-pass auditing, achieving an 80.36% F1 score. Overall, our results show that RAG consistently improves performance, while the optimal degree of schema constraint depends on the model, and second-pass auditing yields modest additional gains by correcting residual schema-adherence errors.

2605.15464 2026-05-18 cs.LG cs.AI cs.CL 版本更新

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

GRLO:从零开始在开放环境中的通用强化学习

Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi

发表机构 * University of California, Riverside(加州大学河滨分校)

AI总结 GRLO研究从少量交互数据中训练的RLHF在开放环境中的泛化能力,探索其对话能力是否能迁移至数学推理和代码生成等下游任务,展示出高效且低成本的训练方法。

详情
AI中文摘要

后训练已成为解锁大型语言模型能力的关键步骤,强化学习(RL)逐渐成为关键范式。近期基于RL的后训练方法日益分化为两种范式:基于人类反馈的强化学习(RLHF),其通过目标领域的人类偏好信号优化模型,以及基于可验证奖励的强化学习(RLVR),其在由验证器支持的环境中运行。后者在近期以推理为导向的后训练中占据主导地位,因为它在领域特定任务(如推理)上提供了更强的增益和更高的效率。然而,尽管领域内RL训练取得了令人满意的性能,但仍需要大量的GPU计算资源,这仍然是广泛应用的主要障碍。本文研究了在开放环境中仅凭少量交互从零开始训练的RLHF的泛化能力,并探讨其显式获得的对话能力是否能隐式地迁移到数学推理和代码生成等下游任务,即GRLO。具体而言,在Qwen3-4B-Base基础上,GRLO仅使用5K提示和22.7 GPU小时,将所有领域的平均性能从24.1提升到63.1,所需数据和计算资源分别比强大的领域内RLVR基线少约46倍和68倍。所得到的模型甚至可与Qwen官方发布的后训练模型相媲美,而后者的训练成本要高得多。值得注意的是,后续的领域内RLVR阶段仅带来选择性的增益,主要体现在更难的竞赛数学基准上。我们希望GRLO能为构建具备广泛能力的后训练模型提供一个简单且高效的配方。我们的代码和数据将在 https://github.com/SJY8460/GRLO 上提供。

英文摘要

Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation, namely GRLO. Specifically, on the Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models, which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: https://github.com/SJY8460/GRLO.

2605.15461 2026-05-18 cs.LG cs.AI 版本更新

DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery

DrugSAGE: 自演化代理经验用于高效前沿药物发现

Yikun Zhang, Xiwei Cheng, Tianyu Liu, Yuanqi Du, Wengong Jin

发表机构 * Northeastern University(东北大学) Broad Institute of MIT and Harvard(MIT和哈佛大学Broad研究所) Yale University(耶鲁大学) Microsoft Research New England(微软研究院新英格兰分部)

AI总结 DrugSAGE通过自演化代理经验框架,高效构建前沿药物发现模型,跨任务记忆提升模型性能,实现零次搜索下的显著优势。

详情
AI中文摘要

构建前沿药物发现预测模型需要昂贵的工具、架构和训练策略搜索。当前基于LLM的代理通过大量试错找到前沿解决方案,但不保留积累的经验,因此每次新任务都要支付完整搜索成本。我们提出DrugSAGE(自演化代理经验)框架,通过跨任务积累和重用经验高效构建前沿药物发现模型。DrugSAGE维护一个跨任务记忆,其中包含已验证的技能、关于有效策略的统计证据以及重复出现的错误及其修复记录。在某些情况下,DrugSAGE可直接迁移可用的解决方案而无需测试时搜索。在33个分子性质预测任务中,DrugSAGE在单任务设置下于九个前沿代理中排名第一。利用从16个较小任务积累的记忆,DrugSAGE在跨任务评估设置中于17个保留任务上取得0.935的平均归一化分数,并在零测试时搜索的设定下比所有基线代理高出10-30%。总之,我们的工作展示了跨任务记忆在药物发现前沿模型高效开发中的优势。

英文摘要

Building state-of-the-art (SOTA) predictive models for drug discovery requires expensive search over tools, architectures, and training strategies. Current LLM-based agents can find SOTA solutions through extensive trial and error, but they do not retain the experience accumulated along the way and therefore pay the full search cost on every new task. We propose DrugSAGE (Self-evolving Agent Experience), a framework that accumulates and reuses experience across tasks to build SOTA drug discovery models efficiently. DrugSAGE maintains a cross-task memory of verified skills, statistical evidence about effective strategies, and a record of recurring errors and their fixes. In some cases, DrugSAGE transfers a working solution directly without test-time search. In 33 molecular property prediction tasks, DrugSAGE ranks first among nine SOTA agents in a single-task setting. With memory accumulated from 16 smaller tasks, DrugSAGE achieves an average normalized score of 0.935 on 17 held-out tasks in a cross-task evaluation setting and outperforms all baseline agents by 10-30% in a zero-test-time-search regime. In summary, our work shows the advantage of cross-task memory for efficient SOTA model development in drug discovery.

2605.15460 2026-05-18 cs.IR cs.AI 版本更新

Differentially Private Motif-Preserving Multi-modal Hashing

差分隐私的模体保持多模态哈希

Zehua Cheng, Wei Dai, Jiahao Sun

发表机构 * Department of Computer Science, University of Oxford, Oxford, United Kingdom

AI总结 本文提出DMP-MH框架,通过先净化后蒸馏的方法在保证隐私的前提下保留多模态数据的结构特征,实验表明其在保持隐私的同时提升了检索性能。

Comments 9 Pages

详情
AI中文摘要

跨模态哈希通过将图像和文本编码为紧凑的二进制码实现高效检索。现有最先进的方法依赖于由用户交互导出的语义相似性图进行监督,但这些图编码了敏感的行为模式,易受链接重建攻击。现有隐私保护方法在图结构数据上失效:差分隐私SGD因独立处理样本而破坏关系模体,而图合成方法在无标度网络中面临无界的局部敏感性,即中心节点使单条边的修改就能让三角形计数改变O(N),从而需要代价过高的噪声注入。我们称此现象为Hubness Explosion。本文提出DMP-MH,一种Sanitize-then-Distill(先净化后蒸馏)框架,将隐私与表征学习解耦。我们的方法首先通过确定性地裁剪节点度数来限制敏感性,使三角模体的L2敏感性上界与数据集规模无关。然后在(ε,δ)-边差分隐私下通过Noisy Mirror Descent生成净化后的合成图。最后,双流哈希网络通过整体结构损失蒸馏此拓扑,强制跨模态对齐。在MIRFlickr-25K和NUS-WIDE数据集上按严格的归纳协议评估,DMP-MH的检索性能比私有基线最多高出11.4个mAP点,同时最高保留92.5%的非隐私性能。

英文摘要

Cross-modal hashing enables efficient retrieval by encoding images and text into compact binary codes. State-of-the-art methods rely on semantic similarity graphs derived from user interactions for supervision, yet these graphs encode sensitive behavioral patterns vulnerable to link reconstruction attacks. Existing privacy-preserving approaches fail on graph-structured data: Differentially Private SGD destroys relational motifs by treating samples independently, while graph synthesis methods suffer from unbounded local sensitivity in scale-free networks, where hub nodes cause single-edge modifications to alter triangle counts by $\mathcal{O}(N)$, necessitating prohibitive noise injection. We term this phenomenon Hubness Explosion. We propose DMP-MH, a Sanitize-then-Distill framework that decouples privacy from representation learning. Our approach first bounds sensitivity by deterministically clipping node degrees, capping the $L_2$-sensitivity of triangle motifs independently of dataset size. A sanitized synthetic graph is then generated via Noisy Mirror Descent under $(ε,δ)$-Edge Differential Privacy. Finally, dual-stream hashing networks distill this topology using a holistic structural loss that enforces cross-modal alignment. Evaluated on MIRFlickr-25K and NUS-WIDE under a strict inductive protocol, DMP-MH outperforms private baselines by up to 11.4 mAP points while retaining up to 92.5% of non-private performance.
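The intuition behind degree clipping can be illustrated with a small sketch. The clipping rule below (keep an edge only if each endpoint ranks the other among its first `max_degree` neighbors in sorted order) is an assumption chosen for the example; the abstract does not say which deterministic rule DMP-MH uses. The sketch only shows why bounded degree bounds the number of triangles any single edge can take part in; it does not by itself establish a differential-privacy guarantee.

```python
def clip_degrees(edges, max_degree):
    """Deterministically project an undirected graph onto degree <= max_degree.

    An edge (u, v) survives only if v is among u's first `max_degree`
    neighbors in sorted order AND u is among v's first `max_degree` neighbors.
    """
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    allowed = {u: set(sorted(nbrs)[:max_degree]) for u, nbrs in adjacency.items()}
    return {
        (min(u, v), max(u, v))
        for u, nbrs in adjacency.items()
        for v in nbrs
        if v in allowed[u] and u in allowed[v]
    }


def neighbors(edge_set, node):
    """Neighbors of `node` in a set of undirected edges."""
    return {b if a == node else a for a, b in edge_set if node in (a, b)}


def triangles_through_edge(edge_set, u, v):
    """Number of triangles that contain the edge (u, v)."""
    return len(neighbors(edge_set, u) & neighbors(edge_set, v))


# Hub node 0 touches every other node; neighboring leaves are also linked,
# so each hub edge sits in up to two triangles before clipping.
hub_graph = [(0, i) for i in range(1, 10)] + [(i, i + 1) for i in range(1, 9)]
clipped = clip_degrees(hub_graph, max_degree=3)
```

After clipping, every node has at most `max_degree` neighbors, so an edge can close at most `max_degree - 1` triangles regardless of how many nodes the graph contains.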

2605.15450 2026-05-18 cs.CV cs.AI cs.LG 版本更新

RIDE: Retinex-Informed Decoupling for Exposing Concealed Objects

RIDE: 基于Retinex的解耦方法用于揭示隐藏物体

Chunming He, Rihan Zhang, Dingming Zhang, Chengyu Fang, Longxiang Tang, Jingjia Feng, Fengyang Xiao, Sina Farsiu

发表机构 * Duke University(杜克大学) Tsinghua University(清华大学) Harvard University(哈佛大学)

AI总结 RIDE通过Retinex理论提出同域图像分解方法,解决隐藏物体分割问题,利用判别性差距定理提升前景与背景的区分度。

详情
AI中文摘要

隐藏物体分割(COS)涵盖一系列密集预测任务,包括伪装物体检测、息肉分割、透明物体检测和工业缺陷检测,其中目标通过不同的物理机制与周围环境在视觉上相互纠缠。现有方法要么直接操作RGB图像,要么采用异构分解(如傅里叶、小波),将空间证据重新分布到尺度/频率系数上,使像素对齐的线索不够直接。我们引入一种根本不同的视角:通过Retinex理论进行同域图像分解,在同一空间域内将图像分解为光照和反射成分。我们的核心发现是,视觉纠缠迫使复合空间中的外观匹配,但并不要求在两个成分空间中同时匹配,我们将这一现象形式化为判别性差距定理。关键的是,我们证明在多样化的COS子任务中,底层物理过程系统性地使光照差异和反射差异呈反相关,从而在理论上保证Retinex分解在完整物理范围内保持或严格提升总的前景-背景判别性,且反相关时增益最大。基于此,我们提出RIDE,包括:(i)任务驱动的Retinex分解模块,端到端地学习对分割最优的分解;(ii)判别性差距注意力机制,自适应地利用分解有帮助的区域;(iii)在反射特征空间中操作的伪装打破对比损失。

英文摘要

Concealed Object Segmentation (COS) encompasses a family of dense-prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, where targets are visually entangled with their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or employ heterogeneous decompositions (e.g., Fourier, wavelet) that redistribute spatial evidence across scale/frequency coefficients, making pixel-aligned cues less direct. We introduce a fundamentally different perspective: homogeneous image decomposition via Retinex theory, which factorizes an image into illumination and reflectance components within the same spatial domain. Our key insight is that visual entanglement enforces appearance matching in the composite space, but this does not necessitate simultaneous matching in both component spaces, a phenomenon we formalize as the Discriminability Gap Theorem. Crucially, we show that across diverse COS sub-tasks, the underlying physical processes systematically anti-correlate illumination and reflectance differences, yielding theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground-background discriminability across the full physical regime, with anti-correlation maximizing the gain. Building on this, we propose RIDE, comprising: (i) a Task-Driven Retinex Decomposition module that learns segmentation-optimal factorizations end-to-end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits where decomposition helps; and (iii) a Camouflage-Breaking Contrastive loss operating in reflectance feature space.

2605.15445 2026-05-18 cs.AI 版本更新

From LLM-Generated Conjectures to Lean Formalizations: Automated Polynomial Inequality Proving via Sum-of-Squares Certificates

从LLM生成的猜想到Lean形式化:通过求和平方证书实现自动多项式不等式证明

Ruobing Zuo, Hanrui Zhao, Gaolei He, Zhengfeng Yang, Jianlin Wang

发表机构 * School of Software Engineering, East China Normal University, Shanghai, China(华东师范大学软件工程学院) College of Computer Science and Technology, National University of Defense Technology, Changsha, China(国防科技大学计算机科学与技术学院) School of Computer and Information Engineering, Henan University, Kaifeng, China(河南大学计算机与信息工程学院)

AI总结 本文提出NSPI框架,结合LLM和符号计算,通过平方和证书实现多项式不等式证明,展示其在10变量多项式上的有效性与可扩展性。

Comments Accepted to ICML 2026. Preprint version

详情
AI中文摘要

自动证明多项式不等式是自动化数学推理中的基本挑战,其中丰富的代数结构和快速增长的证书搜索空间阻碍了可扩展性。纯粹的符号方法提供强保证,但随着变量数或次数的增加,其扩展性往往较差,因为代数操作昂贵且中间表达式迅速增长。与此同时,LLM引导的方法取得了显著进展,尤其是在变量数较少的竞赛风格不等式上。为了解决剩余的可扩展性挑战,我们提出NSPI,一种结合LLM和符号计算互补优势的神经符号框架。具体而言,LLM以近似多项式平方和(SOS)分解的形式提出猜想;我们通过符号计算对其进行细化,得到精确的多项式SOS表示,这直接证明目标不等式,并进一步在Lean中验证证明,从而实现从启发式发现到机器检查证明的端到端流程。在涉及最多10个变量的多项式的挑战性基准上的实验展示了所提方法的有效性和可扩展性。

英文摘要

Automated proving of polynomial inequalities is a fundamental challenge in automated mathematical reasoning, where rich algebraic structure and a rapidly growing certificate search space hinder scalability. Purely symbolic approaches provide strong guarantees but often scale poorly as the number of variables or the degree increases, due to expensive algebraic manipulations and rapidly growing intermediate expressions. In parallel, LLM-guided methods have made notable progress, particularly on competition-style inequalities with a small number of variables. To address the remaining scalability challenges, we propose NSPI, a neuro-symbolic framework that combines the complementary strengths of LLMs and symbolic computation for polynomial-inequality proving. Concretely, an LLM proposes a conjecture in the form of an approximate polynomial Sum-Of-Squares (SOS) decomposition; we refine it via symbolic computation to obtain an exact polynomial SOS representation, which directly proves the target inequality, and we further certify the proof in Lean, yielding an end-to-end pipeline from heuristic discovery to machine-checked proof. Experiments on challenging benchmarks involving polynomials with up to 10 variables demonstrate the effectiveness and scalability of the proposed method.
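A minimal sketch of the "approximate conjecture, then exact certificate" idea described above, in pure Python. The polynomial, the approximate coefficients, and all function names are illustrative assumptions; simple rational rounding stands in for the paper's symbolic refinement step, and the Lean certification step is not shown.

```python
from collections import defaultdict
from fractions import Fraction

# A polynomial in (x, y) is a dict mapping exponent tuples (i, j) to coefficients.

def poly_mul(p, q):
    """Multiply two polynomials term by term."""
    out = defaultdict(Fraction)
    for (a, b), c1 in p.items():
        for (c, d), c2 in q.items():
            out[(a + c, b + d)] += c1 * c2
    return {m: c for m, c in out.items() if c != 0}

def poly_add(p, q):
    """Add two polynomials, dropping zero terms."""
    out = defaultdict(Fraction)
    for poly in (p, q):
        for m, c in poly.items():
            out[m] += c
    return {m: c for m, c in out.items() if c != 0}

def round_coeffs(p, max_den=10):
    """Hypothetical refinement: snap float coefficients to nearby rationals."""
    return {m: Fraction(c).limit_denominator(max_den) for m, c in p.items()}

def is_sos_certificate(target, squares):
    """Check exactly that target == sum of q_i^2."""
    total = {}
    for q in squares:
        total = poly_add(total, poly_mul(q, q))
    return total == target

# Target: 2x^2 - 2xy + y^2, which equals x^2 + (x - y)^2 and is therefore >= 0.
target = {(2, 0): 2, (1, 1): -2, (0, 2): 1}
# Approximate conjecture, as a numerical proposer might return it (made-up values).
approx = [{(1, 0): 0.9999}, {(1, 0): 1.0001, (0, 1): -0.9998}]
exact = [round_coeffs(q) for q in approx]

print(is_sos_certificate(target, approx))  # the approximate guess is not an exact proof
print(is_sos_certificate(target, exact))   # the refined rational squares are
```

Exact rational arithmetic is what turns the check into a proof: an identity that holds only up to floating-point error certifies nothing.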

2605.15425 2026-05-18 cs.SE cs.AI 版本更新

Runtime-Structured Task Decomposition for Agentic Coding Systems

运行时结构化任务分解用于代理编码系统

Shubhi Asthana, Bing Zhang, Chad DeLuca, Hima Patel, Ruchi Mahindru

发表机构 * IBM Research(IBM研究院)

AI总结 本文提出运行时结构化任务分解方法,通过可执行控制逻辑管理任务分解与执行流程,降低重试成本,提升代理编码系统的效率和可靠性。

Comments Paper presented at ACM Conference on AI and Agentic Systems 2026 at the Agentic Software Engineering workshop

详情
AI中文摘要

代理编码系统越来越多地使用大型语言模型(LLMs)进行软件工程任务,如调试、根本原因分析和代码审查。然而,许多现有系统将任务逻辑、执行流程和输出生成都编码在单体提示中,这种设计导致行为脆弱、调试困难和高重试成本,因为失败往往需要重新运行整个工作流。我们提出运行时结构化任务分解,一种架构方法,通过可执行控制逻辑管理任务划分和执行流程,而不是仅依赖提示结构。LLMs仅用于聚焦的判断任务,输出在下游执行前会根据预定义的模式进行验证。我们在两个软件工程工作负载上评估了这种方法,使用三种配置:单体执行、静态分解(固定子任务且无运行时分支)和运行时结构化分解。每种配置均进行10次运行评估。我们的结果表明,分解本身并不一定减少重试成本。在Kubernetes根本原因分析工作负载中,静态分解基线的重试成本为1,632±145个token,而单体基线为904±17个token,因为失败迫使下游子任务重新运行。在多文件调试工作负载中出现了类似模式,静态基线消耗933个token,而单体系统为703个token。运行时结构化方法仅重新运行失败的子任务,将重试成本降低到436±132个token(根本原因分析)和460个token(调试)。总体而言,该方法的重试成本比单体系统最多降低51.7%,比静态分解基线最多降低73.2%,提高了代理编码系统的效率、可调试性和运行可靠性。

英文摘要

Agentic coding systems increasingly use large language models (LLMs) for software engineering tasks such as debugging, root cause analysis, and code review. However, many existing systems encode task logic, execution flow, and output generation inside monolithic prompts. This design creates brittle behavior, limited debuggability, and high retry costs because failures often require rerunning the full workflow. We present runtime-structured task decomposition, an architectural approach in which task partitioning and execution flow are managed through executable control logic rather than prompt structure alone. LLMs are used only for focused judgment tasks, and outputs are validated against predefined schemas before downstream execution. We evaluate this approach on two software engineering workloads using three configurations: monolithic execution, static decomposition with fixed subtasks and no runtime branching, and runtime-structured decomposition. Each configuration was evaluated across 10 runs. Our results show that decomposition alone does not necessarily reduce retry cost. In the Kubernetes root cause analysis workload, the static decomposition baseline produced a retry cost of 1,632 +/- 145 tokens versus 904 +/- 17 tokens for the monolithic baseline because failures forced reruns of downstream subtasks. A similar pattern appeared in the multi-file debugging workload, where the static baseline consumed 933 tokens compared to 703 tokens for the monolithic system. The runtime-structured approach reran only failed subtasks, reducing retry costs to 436 +/- 132 tokens for root cause analysis and 460 tokens for debugging. Overall, the approach achieved up to 51.7% lower retry cost than monolithic systems and 73.2% lower retry cost than static decomposition baselines, improving efficiency, debuggability, and operational reliability in agentic coding systems.
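The control flow described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the step functions are stubs standing in for LLM calls, the "schema" is just a list of required keys, and the token counts are made up.

```python
def validate(output, required_keys):
    """Minimal schema check: output must be a dict containing every required key."""
    return isinstance(output, dict) and all(k in output for k in required_keys)

def run_pipeline(steps, max_retries=2):
    """Run steps in order; on a validation failure, rerun only the failing step.

    steps: list of (name, fn, required_keys), where fn(context, attempt)
    returns (output, tokens_used). Validated outputs are kept in `context`
    so earlier steps are never rerun.
    """
    context, tokens_spent, retry_tokens = {}, 0, 0
    for name, fn, required in steps:
        for attempt in range(max_retries + 1):
            output, tokens = fn(context, attempt)
            tokens_spent += tokens
            if attempt > 0:
                retry_tokens += tokens
            if validate(output, required):
                context[name] = output
                break
        else:
            raise RuntimeError(f"step {name!r} failed validation")
    return context, tokens_spent, retry_tokens

# Stub subtasks (hypothetical); the second one fails its first attempt.
def triage(context, attempt):
    return {"component": "scheduler"}, 100

def diagnose(context, attempt):
    if attempt == 0:
        return {"unexpected": True}, 200  # malformed output, rejected by validate()
    return {"root_cause": "misconfigured quota"}, 200

def report(context, attempt):
    cause = context["diagnose"]["root_cause"]
    return {"summary": f"{cause} in {context['triage']['component']}"}, 150

steps = [
    ("triage", triage, ["component"]),
    ("diagnose", diagnose, ["root_cause"]),
    ("report", report, ["summary"]),
]
context, total, retried = run_pipeline(steps)
print(total, retried)
```

In this toy run the retry cost is only the 200 tokens of the failed step, because the already validated `triage` output is reused instead of being regenerated.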

2605.15423 2026-05-18 cs.CV cs.AI eess.IV 版本更新

MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes

MR2-ByteTrack:基于CNN和Transformer的视频目标检测用于AI增强的嵌入式视觉传感器节点

Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti

发表机构 * Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy.(博洛尼亚大学电气、电子与信息工程学院,意大利) Department of Electrical Engineering (ESAT), KU Leuven, Belgium.(鲁汶大学电气工程系,比利时) Dalle Molle Institute for Artificial Intelligence (IDSIA), USI--SUPSI, Switzerland.(Dalle Molle人工智能研究所(IDSIA),瑞士USI--SUPSI)

AI总结 本文提出MR2-ByteTrack,一种针对嵌入式视觉节点的视频目标检测方法,通过交替使用全分辨率和低分辨率推理,结合ByteTrack和Rescore算法提升效率,实现在嵌入式设备上的高精度实时检测。

详情
AI中文摘要

现代智能视觉传感器需要设备端智能来处理视频流,因为云计算在带宽、延迟和隐私限制下往往不可行。然而,这些传感系统通常依赖超低功耗微控制器(MCUs),其内存和计算能力有限,使得需要特征存储或多帧缓冲的传统视频目标检测方法不可行。为了解决这一挑战,我们引入了多分辨率重评分ByteTrack(MR2-ByteTrack),一种专为基于MCU的嵌入式视觉节点设计的视频目标检测(VOD)方法。MR2-ByteTrack通过交替使用全分辨率和低分辨率推理来降低计算成本,同时通过ByteTrack在帧间链接检测,并通过Rescore算法通过概率联合规则聚合跨帧的检测置信度分数以纠正误分类。我们将其应用于基于CNN的检测器和基于Transformer的模型,证明了其在具有根本不同空间处理的架构中的通用性。在ImageNetVID上的实验表明,MR2-ByteTrack保持了准确性,实现了CNN模型的mAP最高达49.0,Transformer模型的mAP为48.7,同时将CNN的乘加操作减少了高达53%,Transformer的减少了32%。当部署在GAP9上,一个超低功耗RISC-V多核MCU上时,我们的方法相比仅处理全分辨率图像,实现了高达55%的能耗节省,实现了在MCU类嵌入式视觉节点上的首个实时Transformer-based VOD。代码可在https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access获取。

英文摘要

Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, infeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code is available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access
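As a rough illustration of aggregating per-frame confidences with a probability union rule, the sketch below combines the scores of one tracked object for one class as 1 - ∏(1 - s_i). This treats frames as independent, which is an assumption made here; the function name is hypothetical and the actual Rescore algorithm may differ in detail.

```python
import math

def union_confidence(scores):
    """Probability that at least one of several independent detections is correct."""
    return 1.0 - math.prod(1.0 - s for s in scores)

# Per-frame confidences of one track for one class (made-up values).
track_scores = [0.5, 0.5]
print(union_confidence(track_scores))
```

Several moderately confident detections along a track therefore add up to a more confident track-level score, which is what lets weak low-resolution detections still contribute.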

2605.15417 2026-05-18 cs.LG cs.AI 版本更新

$f$-Trajectory Balance: A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data

$f$-轨迹平衡:一种用于调整GFlowNets、生成模型和LLMs的损失家族,结合on-policy和off-policy数据

Jake Fawkes, Jason Hartford

发表机构 * Department of Statistics, University College London, UK(伦敦大学学院统计学系) Valence Labs, London, UK(伦敦Valence实验室)

AI总结 本文提出一种基于$f$-散度的损失家族,通过on-policy和off-policy数据调整生成模型,提升模型覆盖性和泛化能力。

Comments Published at ICML 2026

详情
AI中文摘要

在GFlowNets和变分推断中,目标与模型对数概率之间的均方误差被证明是训练生成模型的有效低方差替代损失。该损失具有如下性质:在on-policy情况下其梯度对应KL散度的梯度,而在off-policy情况下仍是有效损失且具有相同的全局最小值点。本文证明该构造可扩展到整个$f$-散度家族,从而得到一系列损失函数,其on-policy梯度对应相应的$f$-散度的梯度,同时在off-policy情况下保留相同的全局最小值点。具体而言,我们展示了on-policy梯度在目标与模型对数概率上的平移不变损失函数与$f$-散度之间建立了一一对应关系。这种等价性使我们能够设计新的替代损失函数,用于调整广泛类别的生成模型,继承相应$f$-散度的性质,如更强的模式覆盖,同时适用于off-policy数据。我们将这些损失应用于各种任务,包括经典合成示例、用于分子发现的SynFlowNets和异步大语言模型(LLM)调整,证明我们的模型在广泛类别的生成模型中,无论是on-policy还是off-policy,都保留其预测的性质。

英文摘要

In GFlowNets and variational inference, it has been shown that the mean square error between target and model log probabilities is an effective, low-variance surrogate loss for training generative models. This loss has the property that when evaluated on-policy its gradients correspond to those of the KL divergence, while off-policy it remains a valid loss with the same global minimizer. In this work, we demonstrate that this construction can be extended to the whole family of $f$-divergences, leading to a family of losses whose on-policy gradients are those of the corresponding $f$-divergence, while retaining the same global minimizer off-policy. Specifically, we show that the on-policy gradients lead to a one-to-one correspondence between translation-invariant loss functions on the target and model log probabilities, and $f$-divergences. This equivalence allows us to design new surrogate loss functions for tuning a wide class of generative models that inherit the properties of the corresponding $f$-divergence, such as being more mode covering, whilst being applicable to off-policy data. We apply our losses on a range of tasks, including classic synthetic examples, SynFlowNets for molecule discovery, and asynchronous large language model (LLM) tuning, demonstrating that our models retain their predicted properties on- and off-policy in a wide class of generative models.
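The KL special case mentioned above can be checked in a few lines. The notation below is an assumption for illustration and is not taken from the paper: q_θ is the model, p is a normalized target, samples are drawn on-policy from q_θ, and no gradient is taken through the sampling itself.

```latex
% Notation assumed for illustration: q_\theta model, p normalized target,
% x ~ q_\theta on-policy, sampling distribution held fixed when differentiating the loss.
\begin{align*}
\mathcal{L}(\theta)
  &= \tfrac{1}{2}\,\mathbb{E}_{x \sim q_\theta}\!\left[\big(\log q_\theta(x) - \log p(x)\big)^2\right], \\
\nabla_\theta \mathcal{L}(\theta)
  &= \mathbb{E}_{x \sim q_\theta}\!\left[\big(\log q_\theta(x) - \log p(x)\big)\,\nabla_\theta \log q_\theta(x)\right], \\
\nabla_\theta \mathrm{KL}(q_\theta \,\|\, p)
  &= \mathbb{E}_{x \sim q_\theta}\!\left[\big(\log q_\theta(x) - \log p(x)\big)\,\nabla_\theta \log q_\theta(x)\right]
   + \underbrace{\mathbb{E}_{x \sim q_\theta}\!\left[\nabla_\theta \log q_\theta(x)\right]}_{=\,0}.
\end{align*}
```

The last term vanishes because the expected score is zero, so the two gradients coincide on-policy. Off-policy, the squared difference is still minimized exactly when the log probabilities agree, which is why the loss stays valid there.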

2605.15412 2026-05-18 cs.CE cs.AI cs.CL 版本更新

From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery

从反馈循环到策略更新:基于强化微调的LLM驱动的alpha因子发现

Lingzhe Zhang, Tong Jia, Yunpeng Zhai, Zixuan Xie, Chiming Duan, Minghua He, Philip S. Yu, Ying Li

发表机构 * Peking University(北京大学) Alibaba Group(阿里巴巴集团) Nanjing University(南京大学) University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 本文提出QuantEvolver框架,通过强化微调将可执行量化评估转化为策略更新,提升LLM在alpha因子发现中的表现,生成高质量且互补的因子池。

详情
AI中文摘要

现代量化交易日益依赖系统模型从大规模金融数据中提取预测信号,其中alpha因子发现是将市场观察转化为可交易信号的核心。最近基于LLM的方法在自动化因子生成方面表现出色,但大多数仍依赖提示级生成-评估-反馈循环进行迭代优化。随着循环变长,反复追加的历史候选和反馈会导致上下文爆炸,增加推理成本,稀释有用信息,并引入反馈漂移。此外,这些方法通常依赖非常大的LLM,其稳定的生成偏好可能导致结构相似的表达、冗余候选和搜索停滞。为了解决这些限制,我们提出QuantEvolver,一种基于强化微调的自进化alpha因子发现框架。与在提示中积累反馈不同,QuantEvolver将可执行量化评估转化为策略更新,使Miner LLM通过参数学习内化历史优化经验。具体而言,QuantEvolver构建高质量种子因子,构建多样化的种子-时间窗训练任务,生成可执行的Factor DSL表达式,通过Regime Backtest进行评估,并通过多样性-互补性奖励优化Miner LLM。在训练过程中,高质量因子持续积累在Mined Factor Database中,最终成为发现的因子库。在三个现实市场基准上的广泛实验表明,QuantEvolver的有效性,其在每个任务的主要评估指标上均优于现有基于LLM的alpha因子发现基线,产生更高质量和更互补的因子池。

英文摘要

Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, where alpha factor discovery plays a central role in transforming market observations into tradable signals. Recent LLM-based methods have shown promise in automating factor generation, but most of them still rely on prompt-level generation–evaluation–feedback loops for iterative optimization. As the loop becomes longer, repeatedly appended historical candidates and feedback can cause context explosion, increase inference cost, dilute useful information, and introduce feedback drift. Moreover, these methods often depend on very large LLMs whose stable generation preferences may lead to structurally similar expressions, redundant candidates, and search stagnation. To address these limitations, we propose QuantEvolver, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Instead of accumulating feedback in the prompt, QuantEvolver converts executable quantitative evaluation into policy updates, enabling a Miner LLM to internalize historical optimization experience through parameter learning. Specifically, QuantEvolver constructs high-quality seed factors, builds diverse seed–time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with Diversity-Complementarity Reward. During training, high-quality factors are continuously accumulated in a Mined Factor Database, which serves as the final discovered factor library. Extensive experiments on three realistic market benchmarks demonstrate the effectiveness of QuantEvolver, which consistently improves the primary evaluation metric of each task over existing LLM-based alpha factor discovery baselines and produces higher-quality and more complementary factor pools.

2605.15410 2026-05-18 quant-ph cs.AI cs.LG 版本更新

Diagonal Adaptive Non-local Observables on Quantum Neural Networks

量子神经网络上的对角自适应非局部可观测量

Huan-Hsin Tseng, Yan Li, Hsin-Yi Lin, Samuel Yen-Chi Chen

发表机构 * AI & ML Department, Brookhaven National Laboratory, Upton, NY, USA(布鲁克海文国家实验室人工智能与机器学习部) Department of Electrical Engineering, The Pennsylvania State University, University Park, PA, USA(宾夕法尼亚州立大学电气工程系)

AI总结 本文提出了一种对角自适应非局部可观测量,通过仅考虑对角可观测量与量子电路的组合,降低了参数数量和经典优化成本,同时保持了全非局部可观测量的能力。

Comments Accepted at ICCCN2026

详情
AI中文摘要

自适应非局部可观测量(ANOs)已显示,使量子可观测量动态化可以显著扩大变分量子算法的功能空间,部分将硬件需求从电路合成转移到测量设计。然而,这种优势伴随着参数数量的大幅增加以及经典优化成本的上升。我们提出了一种特殊的ANO形式,通过仅考虑对角可观测量与量子电路的组合,显著降低了这一负担。数学上,这相当于全ANO在大参数空间中的完整形式,因为对角矩阵是ANO空间的规范代表,模幺正相似性。因此,对角ANO保持了全ANO的能力,同时将k-局部可观测量的复杂度从O(4^k)降低到O(2^k),并降低了相应的测量侧经典计算成本。从这个意义上说,对角ANO保留了全ANO的许多优势,同时涵盖了传统VQCs作为特殊情况。

英文摘要

Adaptive Non-local Observables (ANOs) have shown that making quantum observables dynamic can substantially enlarge the function space of Variational Quantum Algorithms, partly shifting hardware demands from circuit synthesis to measurement design. However, this advantage is accompanied by a steep increase in the number of parameters, as well as the classical optimization cost for varying general Hermitian observables. We propose a special form of ANO that significantly reduces this burden by considering only diagonal observables paired with quantum circuits. Mathematically, this is equivalent to the full ANO of a large parameter space since diagonal matrices are canonical representatives of the ANO space modulo unitary similarity. As a result, Diagonal ANO retains the same capability of full ANO while reducing $k$-local observable complexity from $O(4^k)$ to $O(2^k)$ and lowering the corresponding measurement-side classical computation. In this sense, diagonal ANO preserves much of the benefit of full ANO while encompassing conventional VQCs as a special case.
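The O(4^k) versus O(2^k) gap can be made concrete by counting real parameters: a general Hermitian observable on k qubits is a 2^k × 2^k matrix with 4^k real degrees of freedom, while a real diagonal one has 2^k. The sketch below also shows how the expectation of a diagonal observable follows directly from computational-basis outcome probabilities. Function names are illustrative assumptions, not the paper's code.

```python
def hermitian_param_count(k):
    """Real parameters of a general Hermitian observable on k qubits."""
    d = 2 ** k
    # d real diagonal entries plus d*(d-1)/2 complex off-diagonal entries.
    return d + 2 * (d * (d - 1) // 2)

def diagonal_param_count(k):
    """Real parameters of a diagonal observable on k qubits."""
    return 2 ** k

def diagonal_expectation(diag, probs):
    """<D> = sum_i d_i * p_i, with p_i the probability of basis outcome i."""
    return sum(d * p for d, p in zip(diag, probs))

for k in (1, 2, 4, 8):
    print(k, hermitian_param_count(k), diagonal_param_count(k))

# Z⊗Z is diagonal with entries (+1, -1, -1, +1); on a Bell state the outcomes
# 00 and 11 each occur with probability 1/2.
print(diagonal_expectation([1, -1, -1, 1], [0.5, 0.0, 0.0, 0.5]))
```

A fixed Pauli-Z readout is itself a diagonal observable, which is one way to see conventional VQCs as a special case.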

2605.15400 2026-05-18 cs.AI 版本更新

Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming

超越伙伴多样性:一种基于影响的团队引导框架用于零样本人机协同

Wei Sheng, Rohan Paleja

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本文提出基于影响的团队引导框架IBTS,通过影响塑造激励智能体发现多样化的高绩效团队交互模式,提升团队表现,强调需结合稀疏奖励协调机制与伙伴多样性覆盖。

详情
AI中文摘要

尽管AI代理正从孤立工具发展为交互合作者,数据驱动的人机协同(HMT)方法仍依赖跨领域、队友和团队规模的大量人类交互数据,导致成本高。零样本协调(ZSC)通过模拟多样化的伙伴群体来近似未见伙伴的行为,以缓解这一瓶颈。然而,随着团队规模扩大和通信退化,仅靠伙伴覆盖是不够的。为此,本文提出基于影响的团队引导(IBTS)框架,利用影响塑造激励智能体发现多样化的高绩效团队交互模式,并进一步引导进行中的轨迹向更强的已学协调模式发展。我们在Overcooked-AI的双智能体和三智能体设置中评估IBTS,以检验学到的协调结构能否迁移到二元交互之外。评估包括模拟伙伴、合成的伙伴风格变化,以及据我们所知首个涉及两名真实人类队友和一名机器队友的30人Overcooked-AI HMT研究。在这些评估中,IBTS相对于对比基线提升了团队表现,表明规模化的ZSC需要将稀疏奖励协调机制与伙伴变化覆盖相结合,而非仅依赖多样性。

英文摘要

While AI agents are rapidly advancing from isolated tools to interactive collaborators, data-driven human-machine teaming (HMT) methods remain costly in their reliance on human interaction data across domains, teammates, and team sizes. Zero-shot coordination (ZSC) addresses this bottleneck by simulating diverse partner populations to approximate how unseen partners might behave. However, partner coverage alone is insufficient as team settings scale and communication becomes degraded. To remedy this deficiency, we propose Influence-Based Team Steering (IBTS), a framework that uses influence shaping to incentivize agents to discover diverse, high-performing team interaction patterns and further steers ongoing trajectories toward stronger learned coordination modes. We assess IBTS on Overcooked-AI in both two-agent and three-agent settings, allowing us to test whether learned coordination structure transfers beyond dyadic interaction. Our evaluation includes simulated partners, synthetic partner-style variation, and, to our knowledge, the first 30-subject Overcooked-AI HMT study involving two real human teammates and one machine teammate. Across these evaluations, IBTS improves team performance against competing baselines, highlighting the need for scaled ZSC to combine sparse-reward coordination mechanisms with partner-variation coverage rather than relying on diversity alone.

2605.15399 2026-05-18 cs.LG cs.AI cs.NA math.NA physics.comp-ph 版本更新

Breakeven complexity: A new perspective on neural partial differential equation solvers

盈亏平衡复杂度:神经偏微分方程求解器的新视角

Yijing Zhang, Nicholas Roberts, Tanya Marwah, Mikhail Khodak

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Google DeepMind(谷歌DeepMind)

AI总结 本文提出盈亏平衡复杂度评估框架,考虑神经求解器的前期成本与传统求解器的低保真度成本,分析不同PDE求解器在复杂问题中的有效性。

详情
AI中文摘要

偏微分方程(PDE)的神经替代求解器相比数值方法有望带来显著加速,尤其在需要多次求解的场景中。然而,现有基于精度的评估方法未充分考虑两个核心问题:(1) 神经求解器在数据生成、训练和调优上存在显著前期成本;(2) 经典求解器也能以足够低的模拟成本生成低保真度解。为明确考虑这些现实并全面纳入端到端成本,我们提出以盈亏平衡复杂度为核心的评估框架,该指标统计学习型求解器相对于误差相当的传统求解器变得划算之前所需的前向求解次数。为了评估此指标,我们应用扩展定律确定应将多少训练预算分配给数据生成,并讨论如何在不同设置中实现平滑的误差匹配。我们评估了多个神经PDE求解器在APEBench中三个二维周期域PDE上,以及在由GPU原生PyFR代码生成的绕多障碍物流动新基准上的盈亏平衡复杂度。在其他发现之外,我们的结果表明,随着问题在成本、维度、推演(rollout)、物理区间(如更高雷诺数)等方面变得更难,神经PDE求解器会变得更有效。

英文摘要

Neural surrogate solvers of partial differential equations (PDEs) promise dramatic speedups over numerical methods, especially in scenarios requiring many solves. However, current accuracy-based evaluations do not fully consider two central issues: (1) neural solvers incur substantial up-front costs for data generation, training, and tuning; and (2) classical solvers can also generate low-fidelity solutions at a sufficiently low simulation cost. To explicitly account for these realities and fully incorporate end-to-end costs, we propose an evaluation framework centered on breakeven complexity, a metric that counts the forward solves before a learned solver is cost-effective relative to an error-equivalent traditional solver. To evaluate this measure, we apply scaling laws to determine how much training budget to allocate to data generation and discuss how to achieve smooth error-matching in diverse settings. We evaluate the breakeven complexity of multiple neural PDE solvers on three PDEs on 2D periodic domains from APEBench and a novel benchmark of flows past multiple obstacles generated by the GPU-native PyFR code. Among other findings, our results suggest that neural PDE solvers become more effective as problems get harder in terms of cost, dimension, rollout, physics regime (e.g. higher Reynolds number), etc.
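One simple way to read the breakeven idea is as the smallest number of solves N at which the learned solver's total cost (up-front plus N inference calls) no longer exceeds N runs of an error-matched classical solver. The sketch below encodes that reading; it is an assumption made for illustration, with hypothetical names and cost figures, and the paper's exact definition may differ.

```python
import math

def breakeven_solves(upfront_cost, neural_cost_per_solve, classical_cost_per_solve):
    """Smallest N with upfront + N * neural <= N * classical, or None if never.

    All costs must share one unit (for example GPU-hours); the classical cost
    should be that of a solver tuned to the same error as the neural one.
    """
    saving = classical_cost_per_solve - neural_cost_per_solve
    if saving <= 0:
        return None  # the learned solver never pays back its up-front cost
    return math.ceil(upfront_cost / saving)

# Made-up numbers: 1000 units for data generation, training and tuning,
# 0.5 per neural solve, 2.0 per error-matched classical solve.
print(breakeven_solves(1000.0, 0.5, 2.0))
```

Under this reading, coarsening the classical solver to match the neural solver's error lowers `classical_cost_per_solve` and pushes the breakeven point up, which is why error matching matters for a fair comparison.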

2605.15394 2026-05-18 cs.LG cs.AI stat.ML 版本更新

Representation Without Reward: A JEPA Audit for LLM Fine-Tuning

无奖励的表示:用于LLM微调的JEPA审计

Biswa Sengupta

发表机构 * LLM Suite group of JP Morgan Chase and its affiliates(摩根大通及其附属机构LLM Suite团队)

AI总结 本文探讨了在无奖励设定下,通过JEPA架构学习更有效的表示方法,测试了多种辅助项在自然语言到正则表达式生成任务中的表现,发现某些辅助项在特定统计检验下显著,但整体效果不显著。

详情
AI中文摘要

联合嵌入预测架构(JEPAs)提出,当模型被训练以预测潜在表示而非观测输出时,应学习更有用的抽象。对于自回归语言模型微调,这一原则意味着诱导的隐藏状态几何必须达到语言模型头部并且提高解码任务指标。我们在此基础上,在固定Llama-3.2-1B-Instruct LoRA基础上,对自然语言到正则表达式生成任务进行了测试,比较了22种训练时的辅助项,包括轨迹形状正则化、分布约束、预测器/目标不对称性、Fisher度量Jacobi残差以及一个解码器可见的JEPA目标,该目标位于交叉熵的正锥内。经验结果是一个结构化的零假设:几种辅助项在单细胞配对α=0.10下显著(T3-Local在Δ=+2.53 pp,p=0.003最强),但无一通过Bonferroni或Holm-Bonferroni检验。解码器可见的JEPA产生了研究中的第一个正辅助-交叉熵梯度余弦值,但精确匹配仍处于种子噪声内;在五个种子的完整微调复制中,相同的辅助项在两个基准测试中均重现了零假设(TURK:Δ=+0.04 pp,p_配对=0.96;SYNTH:Δ=+0.52 pp,p_配对=0.28),因此零假设在LoRA和完整微调中对解码器可见的构造是稳健的。隐藏状态表示和解码任务准确性在这一领域因此弱相关;我们相应地将LLM领域JEPA评估重新定义为耦合问题,其中核心问题是哪些指标下有用的隐藏几何成为解码器可见的任务信号。

英文摘要

Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning the principle entails a stricter requirement: the induced hidden-state geometry must reach the language-model head and improve the decoded task metric. We test that requirement under a fixed Llama-3.2-1B-Instruct LoRA harness on natural-language-to-regex generation, comparing twenty-two training-time auxiliaries across trajectory-shape regularisation, distributional constraints, predictor/target asymmetry, Fisher-metric Jacobi residuals, and a decoder-visible JEPA objective constructed to lie in cross-entropy's positive cone. The empirical answer is a structured null: several auxiliaries clear single-cell paired α = 0.10 without correction (T3-Local at Δ = +2.53 pp, p = 0.003 being the strongest), but none survives Bonferroni or Holm-Bonferroni at the relevant family-wise threshold, even though many change curvature, anisotropy, variance, and gradient direction. Decoder-visible JEPA yields the first positive auxiliary–cross-entropy gradient cosine in the study, yet exact match remains inside seed noise; a full-fine-tuning replication of the same auxiliary at n = 5 seeds reproduces the null on both benchmarks (TURK: Δ = +0.04 pp, p_paired = 0.96; SYNTH: Δ = +0.52 pp, p_paired = 0.28), so the null is robust across LoRA and full fine-tuning for the decoder-visible construction. Hidden-state representation work and decoded-task accuracy are therefore weakly coupled in this regime; we accordingly reframe LLM-domain JEPA evaluation as a coupling problem, in which the operative question is under which metrics useful hidden geometry becomes decoder-visible task signal.
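For readers less familiar with the multiple-comparison corrections named above, here is a generic sketch of the Bonferroni and Holm-Bonferroni procedures. The p-values in the example are invented for illustration and are not the paper's; the family size and threshold the authors actually used are not reproduced here.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m, with m the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm procedure; returns reject flags in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values: three of the four are below 0.05 uncorrected.
p_values = [0.20, 0.012, 0.04, 0.03]
print([p <= 0.05 for p in p_values])
print(bonferroni(p_values))
print(holm_bonferroni(p_values))
```

With these invented inputs only one of the three nominally significant results survives correction, which illustrates how a set of individually promising cells can shrink once the whole family of comparisons is taken into account.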

2605.15391 2026-05-18 cs.CV cs.AI 版本更新

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

PanoWorld:几何一致的全景视频世界建模

Le Jiang, Xiangyu Bai, Bishoy Galoaa, Shayda Moezzi, Caleb James Lee, Tooba Imtiaz, Edmund Yeh, Jennifer Dy, Yanzhi Wang, Sarah Ostadabbas

发表机构 * Northeastern University(东北大学)

AI总结 PanoWorld通过几何和动态一致性建模生成一致的360度视频,提升了空间理解能力,适用于具身AI应用。

详情
AI中文摘要

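我们提出PanoWorld,一种全景视频世界模型,可根据单张图像和一段文字描述生成几何一致的360°视频。现有全景视频方法主要针对视觉真实感进行优化,并未显式约束底层的三维场景状态,因此其输出看似合理,却在球面上表现出深度不一致、对应关系断裂和运动不合理等问题。为弥补这一不足,我们将全景视频生成视为几何与动态一致的潜在状态建模问题,而非单纯的视觉合成。在预训练的透视视频世界模型基础上,我们引入两个轻量级正则项:以伪真值全景深度为监督的深度一致性损失,以及对被跟踪点的三维世界坐标系位置进行跨时间监督的轨迹一致性损失。我们还对条件输入和位置编码进行了球面几何感知的适配。此外,我们引入PanoGeo,一个统一的几何感知全景视频数据集,涵盖多种真实与合成来源,并提供一致的深度、轨迹和提示词标注,用于训练和分层评估。实验表明,PanoWorld在保持有竞争力的视觉真实感的同时,相比已有全景生成方法提升了几何一致性,说明要满足具身AI应用对整体空间理解的需求,必须将全景视频生成作为几何建模问题来对待。代码见https://github.com/ostadabbas/PanoWorld。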

英文摘要

We present PanoWorld, a panoramic video world model that generates geometry-consistent 360° video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.

2605.15384 2026-05-18 cs.LG cs.AI 版本更新

Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

一个评分够吗?重新思考序列演进LLM记忆的评估

Songwei Dong, Zihan Chen, Chengshuai Shi, Peng Wang, Jundong Li, Cong Shen

发表机构 * University of Virginia(弗吉尼亚大学) Princeton University(普林斯顿大学)

AI总结 本文提出SeqMem-Eval框架,通过评估记忆状态的演变、泛化、经验巩固和信息保留,揭示传统指标无法捕捉的记忆质量差异。

Comments 29 pages, 13 figures

详情
AI中文摘要

记忆在使大语言模型(LLM)能够处理序列任务中起着核心作用,通过积累和重用经验实现时间连续性。然而,现有LLM记忆评估大多依赖汇总指标如最终验证准确率或累积在线性能,这可能掩盖诸如遗忘和负迁移等关键失败模式。本文引入SeqMem-Eval,一种用于序列演进LLM记忆的诊断评估框架。受持续学习启发,它针对一种测试时间设置,其中记忆是外部的、提示介导的,并且在不修改模型参数的情况下更新。与只关注最终性能不同,SeqMem-Eval评估记忆状态在连续推理中的演变、泛化、经验巩固和信息保留。具体而言,它测量在线效用、验证泛化、反向迁移和遗忘,提供更细致的记忆质量视角。通过在多样任务和记忆方法上的广泛实验,我们显示更高的最终或累积准确性不必然意味着更好的记忆质量:许多方法表现出强劲的性能提升,同时遭受显著的遗忘或负迁移。此外,不同记忆设计在适应性和稳定性之间表现出不同的权衡,这些权衡在标准评估指标下是不可见的。

英文摘要

Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. In this paper, we introduce SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility, hold-out generalization, backward transfer, and forgetting, providing a finer-grained view of memory quality. Through extensive experiments across diverse tasks and memory methods, we show that higher final or cumulative accuracy does not necessarily imply better memory quality: many methods exhibit strong performance gains while suffering from substantial forgetting or negative transfer. Moreover, different memory designs exhibit distinct trade-offs between adaptability and stability that remain invisible under standard evaluation metrics.
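Backward transfer and forgetting can be computed from an accuracy matrix in the way that is standard in continual learning. The sketch below uses those textbook definitions with `acc[t][i]` as the accuracy on task i after the memory has processed task t; SeqMem-Eval's exact formulas may differ, and the numbers are made up.

```python
def backward_transfer(acc):
    """Mean change on earlier tasks between when they were learned and the end."""
    T = len(acc)
    return sum(acc[T - 1][i] - acc[i][i] for i in range(T - 1)) / (T - 1)

def average_forgetting(acc):
    """Mean drop from each earlier task's best past accuracy to its final accuracy."""
    T = len(acc)
    drops = []
    for i in range(T - 1):
        peak = max(acc[t][i] for t in range(i, T - 1))
        drops.append(peak - acc[T - 1][i])
    return sum(drops) / len(drops)

# Rows: memory state after task 0, 1, 2. Columns: accuracy on task 0, 1, 2.
acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.90, 0.00],
    [0.60, 0.85, 0.75],
]
final_average = sum(acc[-1]) / len(acc[-1])
print(final_average, backward_transfer(acc), average_forgetting(acc))
```

Here the final average accuracy looks respectable while backward transfer is negative, the kind of discrepancy that a single aggregate score would hide.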

2605.15375 2026-05-18 cs.CV cs.AI 版本更新

ChangeFlow -- Latent Rectified Flow for Change Detection in Remote Sensing

ChangeFlow -- 潜在修正流用于遥感中的变化检测

Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc

发表机构 * University of Ljubljana, Faculty of Computer and Information Science(卢布尔雅那大学计算机与信息科学学院)

AI总结 本文提出ChangeFlow框架,通过潜在空间中的修正流合成变化掩码,以生成分布中的可能掩码,提升全局一致性与鲁棒性,实现80.4%的平均F1分数。

详情
AI中文摘要

遥感变化检测(RSCD)旨在定位同一地理区域两幅图像之间的变化。在实践中,变化掩码通常遵循区域级注释惯例而非纯粹的局部外观差异,使其具有上下文依赖性和偶尔的模糊性。大多数最先进的方法使用逐像素判别分类,产生单个预测,无法显式建模变化区域作为整体。生成式方法是自然替代方案,可建模可能掩码的分布,使采样能捕捉模糊性并鼓励全局一致性。然而,现有生成式RSCD方法通常落后于强大判别基线,由于像素空间生成的高计算成本和其条件机制的复杂性。为了解决判别和生成方法的局限性,我们提出ChangeFlow,一种生成框架,通过潜在空间中的修正流重新表述变化检测为变化掩码的合成。ChangeFlow由结构化但轻量级的条件信号引导,其随机设计自然支持基于采样的预测融合。即,聚合多个预测的变化掩码提高鲁棒性,而样本一致性提供实用的置信度估计,突出模糊区域。在四个基准上,ChangeFlow实现80.4%的平均F1分数,比先前最佳方法平均提高1.3个百分点,同时保持与最近强大基线相当的推理速度。项目页面:https://blaz-r.github.io/changeflow_cd

英文摘要

Remote sensing change detection (RSCD) aims to localise changes between two images of the same geographic region. In practice, change masks often follow region-level annotation conventions rather than purely local appearance differences, making them context-dependent and occasionally ambiguous. Most state-of-the-art methods utilise per-pixel discriminative classification, which produces a single prediction per input and fails to explicitly model the changed region as a coherent whole. A natural alternative is a generative formulation, which can model a distribution of plausible masks, enabling sampling to capture ambiguity and encourage global consistency. However, existing generative RSCD approaches typically lag behind strong discriminative baselines due to the high computational cost of pixel-space generation and the complexity of their conditioning mechanisms. To address the limitations of prior discriminative and generative methods, we propose ChangeFlow, a generative framework that reformulates change detection as the synthesis of a change mask in latent space via rectified flow. ChangeFlow is guided by a structured yet lightweight conditioning signal, and its stochastic design naturally supports sampling-based prediction ensembling. Namely, aggregating multiple predicted change masks improves robustness, while sample agreement provides a practical confidence estimate that highlights ambiguous regions. Across four benchmarks, ChangeFlow achieves an average F1 of 80.4%, improving by 1.3 points on average over the previous best method, while maintaining inference speed comparable to recent strong baselines. Project page: https://blaz-r.github.io/changeflow_cd
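The sampling-based ensembling idea can be sketched as a per-pixel vote over several sampled binary masks, with the share of samples agreeing with the majority serving as a confidence map. This is an illustrative assumption about the aggregation (mean plus threshold), not ChangeFlow's actual code, and the masks are toy data.

```python
def fuse_masks(masks, threshold=0.5):
    """Fuse sampled binary masks; return (fused_mask, agreement_map)."""
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    mean = [[sum(m[r][c] for m in masks) / n for c in range(cols)]
            for r in range(rows)]
    fused = [[1 if v >= threshold else 0 for v in row] for row in mean]
    # Fraction of samples that side with the majority label at each pixel.
    agreement = [[max(v, 1 - v) for v in row] for row in mean]
    return fused, agreement

# Three sampled 2x2 change masks (made-up values).
samples = [
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[1, 1], [0, 1]],
]
fused, agreement = fuse_masks(samples)
print(fused)
print(agreement)
```

Pixels where the samples split show lower agreement, which is how ambiguous regions can be flagged without any extra model.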

2605.15353 2026-05-18 cs.LG cs.AI q-bio.MN q-bio.QM 版本更新

PACER: Acyclic Causal Discovery from Large-Scale Interventional Data

PACER:从大规模干预数据中进行无环因果发现

Ramon Viñas Torné, Sílvia Fàbregas Salazar, Soyon Park, Ivo Alexander Ban, Artyom Gadetsky, Nikita Doikov, Maria Brbić

发表机构 * Swiss Federal Technology Institute of Lausanne (EPFL), Switzerland(洛桑联邦理工学院) Cornell University, USA(康奈尔大学) ETH Zurich, Zurich, Switzerland(苏黎世联邦理工学院)

AI总结 PACER通过构建无环性保证的因果发现框架,在大规模高维干预数据中实现高效且准确的因果结构推断,优于现有方法。

Comments Accepted at the 43rd International Conference on Machine Learning (2026)

详情
AI中文摘要

从数据中推断有向无环图(DAG)的结构是因果发现中的核心挑战,特别是在现代高维设置中,大规模干预数据日益可用。尽管干预数据可以提高可识别性,但现有方法仍受软无环约束限制,导致在无效的有环图上进行优化、数值不稳定和可扩展性下降。我们引入PACER(扰动驱动无环因果边恢复),一种可扩展的因果发现框架,从构造上保证无环性。PACER通过变量排列和边概率的联合模型参数化DAG上的分布,使得可以直接在有效因果结构上优化而无需替代惩罚项。该框架支持对观察性和干预性数据的统一似然处理、灵活的条件密度模型以及结构先验知识的整合。对于线性高斯机制,我们推导出期望干预对数似然及其梯度的闭式表达式,获得显著的计算增益。实证上,PACER在蛋白质信号和大规模基因扰动基准上匹配或超过最先进方法,同时高效扩展到具有数千个变量的网络,并相比基于惩罚的可微方法实现高达两个数量级的加速。这些结果表明,通过有原则的搜索空间设计,可以从高维扰动数据中实现精确且可扩展的因果发现。

英文摘要

Inferring the structure of directed acyclic graphs (DAGs) from data is a central challenge in causal discovery, particularly in modern high-dimensional settings where large-scale interventional data are increasingly available. While interventional data can improve identifiability, existing methods remain limited by soft acyclicity constraints, leading to optimization over invalid cyclic graphs, numerical instability, and reduced scalability. We introduce PACER (Perturbation-driven Acyclic Causal Edge Recovery), a scalable framework for causal discovery that guarantees acyclicity by construction. PACER parameterizes a distribution over DAGs through a joint model of variable permutations and edge probabilities, enabling direct optimization over valid causal structures without surrogate penalties. The framework supports a unified likelihood-based treatment of observational and interventional data, flexible conditional density models, and the incorporation of structural prior knowledge. For linear-Gaussian mechanisms, we derive closed-form expressions for the expected interventional log-likelihood and its gradients, yielding substantial computational gains. Empirically, PACER matches or exceeds state-of-the-art methods on protein signaling and large-scale genetic perturbation benchmarks, while scaling efficiently to networks with thousands of variables and achieving up to two orders of magnitude speedups over penalty-based differentiable approaches. These results demonstrate that exact and scalable causal discovery from high-dimensional perturbation data is achievable through principled search space design.
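Why a permutation plus edge probabilities gives acyclicity by construction can be seen in a short sketch: if an edge i → j may only be sampled when i precedes j in the ordering, no directed cycle can form. The code below is a hypothetical illustration with a fixed ordering and fixed probabilities; in PACER both are modeled jointly and learned, which is not shown here.

```python
import random

def sample_dag(order, edge_prob, rng):
    """Sample an adjacency matrix whose edges all point forward in `order`."""
    n = len(order)
    position = {node: k for k, node in enumerate(order)}
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if position[i] < position[j] and rng.random() < edge_prob[i][j]:
                adj[i][j] = 1
    return adj

def is_acyclic(adj):
    """Kahn's algorithm: a graph is a DAG iff every node can be peeled off."""
    n = len(adj)
    indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indeg[j] == 0]
    seen = 0
    while ready:
        u = ready.pop()
        seen += 1
        for v in range(n):
            if adj[u][v]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return seen == n

rng = random.Random(0)
order = [2, 0, 3, 1]                # hypothetical variable ordering
probs = [[0.9] * 4 for _ in range(4)]  # hypothetical edge probabilities
graph = sample_dag(order, probs, rng)
print(graph, is_acyclic(graph))
```

Because every sample is already a valid DAG, there is no need for a penalty term that pushes cyclic graphs toward acyclicity during optimization.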

2605.15343 2026-05-18 cs.AI cs.LG cs.MA 版本更新

Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation

信念引擎:多智能体大语言模型协商中的可配置和可检查立场动态

Joshua C. Yang, Maurice Flechtner, Damian Dailisan, Michiel A. Bakker

发表机构 * ETH Zurich(苏黎世联邦理工学院) Centre for Democracy Studies Aarau, University of Zurich(苏黎世大学民主研究中心) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文提出Belief Engine,通过可配置的信念更新机制,研究多智能体协商中的立场动态,揭示立场变化背后的证据吸收与锚定因素。

详情
AI中文摘要

基于大语言模型的智能体日益用于模拟协商、冲突解决和多轮意见交流等协商式交互。然而,生成的对话记录往往无法解释智能体立场变化的原因:变化可能反映证据吸收、锚定、角色漂移、回声或改变的提示和检索上下文。我们引入Belief Engine (BE),一个可审计的信念更新层,将“信念”视为命题上的证据状态,并将其暴露为标量立场。BE将论点提取为结构化记忆,并通过由证据吸收u和先验锚定a控制的对数几率规则更新立场。在多个基础LLM上,参数扫描显示这些控制可靠地塑造立场动态,同时保留证据层面的更新轨迹。在DEBATE(一个包含前后意见的人类协商数据集)上,BE对最终立场遵循所提取证据的参与者重建效果最好;立场稳定和与证据相反的案例则指向锚定或提取证据流之外的因素。BE为研究基于证据的协商提供了可配置的基础设施,其中开放性、承诺、收敛和分歧可以与显式的更新假设联系起来,而不是隐藏的提示效应。

英文摘要

LLM-based agents are increasingly used to simulate deliberative interactions such as negotiation, conflict resolution, and multi-turn opinion exchange. Yet generated transcripts often do not reveal why an agent's stance changes: movement may reflect evidence uptake, anchoring, role drift, echoing, or changed prompt and retrieval context. We introduce the Belief Engine (BE), an auditable belief-update layer that treats "belief" as an evidential state over a proposition and exposes it as scalar stance. BE extracts arguments into structured memory and updates stance with a log-odds rule controlled by evidence uptake u and prior anchoring a. Across multiple base LLMs, parameter sweeps show that these controls reliably shape stance dynamics while preserving an evidence-level update trail. On DEBATE, a human deliberation dataset with pre/post opinions, BE best reconstructs participants whose final stance follows extracted evidence; stable and evidence-opposed cases instead point to anchoring or factors outside the extracted evidence stream. BE provides configurable infrastructure for studying evidence-grounded deliberation, where openness, commitment, convergence, and disagreement can be tied to explicit update assumptions rather than hidden prompt effects.
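To make the roles of evidence uptake u and prior anchoring a concrete, here is one possible log-odds update. The functional form is purely an assumption for illustration (the abstract does not give BE's rule): anchoring pulls the current log-odds toward the prior, and uptake scales the summed evidence, with evidence expressed as signed log-odds contributions.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of a probability, clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def update_stance(stance, prior, evidence, u, a):
    """Hypothetical rule: blend current and prior log-odds, then add scaled evidence.

    stance, prior: probabilities in (0, 1); evidence: signed log-odds terms;
    u: evidence uptake (0 ignores evidence); a: prior anchoring in [0, 1].
    """
    anchored = (1 - a) * logit(stance) + a * logit(prior)
    return sigmoid(anchored + u * sum(evidence))

# An agent at 0.5 hears one supporting argument worth +1.0 log-odds (made up).
print(update_stance(0.5, prior=0.5, evidence=[1.0], u=0.5, a=0.0))
print(update_stance(0.5, prior=0.5, evidence=[1.0], u=0.0, a=0.0))
```

Keeping the update explicit like this is what makes a stance trajectory auditable: each change can be traced to a specific piece of evidence and a specific setting of u and a.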

2605.15341 2026-05-18 cs.LG cs.AI 版本更新

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

LEAP:LLM在迭代科学设计中的轨迹级评估

Marilyn Zhang, Tianfeng Chen, Fabián Barzuna, Ankita Rathod, Mark E. Whiting

AI总结 本文提出LEAPBench框架,通过轨迹级评估方法揭示LLM在迭代科学设计中的学习效率,发现传统基于结果的评估方法存在偏差,轨迹指标能更准确反映效率提升。

详情
AI中文摘要

LLMs正被越来越多地应用于自主实验室,其假设是领域先验知识和迭代反馈使它们在更少的迭代中收敛到好的设计。然而,当前的迭代科学设计基准仅评估固定时间范围内的结果快照,忽略了学习轨迹。为此,本文探讨了三种评估选择:测量什么、比较什么基准以及以什么为基础。引入LEAPBench,一个包含55个任务的框架,结合最佳到目前为止的曲线下面积(AUC)轨迹指标、经典贝叶斯优化基准和基于发表文献的审计。在八个现代LLMs上应用后,从最终结果到轨迹评分的切换在匹配时间范围内改变了53%的任务最佳模型决策,并揭示了被传统评分忽视的效率提升。LLMs的表现并未优于经典贝叶斯优化基线。在16个生物学任务中,当oracle的奖励信号与发表最佳设计配置一致时,领域感知提示导致LLM选择匹配发表最佳的频率比领域无关提示低约10个百分点。这种模式在6个任务中最为明显,其中领域无关提示在所有6个任务中更常匹配发表最佳。轨迹指标还充当了可训练的目标。使用轨迹指标作为奖励的离线强化学习在21个保留任务中的14个上提升了性能。

英文摘要

LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency, where each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws about LLM learning efficiency in iterative scientific design: what to measure, what baseline to compare against, and what to ground against. We introduce LEAPBench, Learning Efficiency in Adaptive Processes, a 55-task framework that pairs a best-so-far area under the curve (AUC) trajectory metric with a classical Bayesian-optimization reference and an audit grounded in published literature. Applied to eight contemporary LLMs, switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons, and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks where the oracle's reward signal is aligned with configurations from the published-best design, domain-aware prompting leads to LLM choices that match the published-best approximately 10 percentage points less often than domain-agnostic prompting at iteration 30. The pattern is sharpest on 6 tasks where the literature-typical and published-best configurations diverge, and domain-agnostic prompting matches the published-best more often on all 6. The trajectory metric also doubles as a tractable training target. Offline reinforcement learning with the metric as a reward improves performance on 14 of 21 held-out tasks.
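
To make the difference between final-outcome and trajectory scoring concrete, here is a small sketch of a best-so-far AUC. The normalisation (mean of the running maximum, with unit spacing between iterations and higher scores being better) is an assumption; the abstract does not specify how LEAPBench normalises the curve, and the function names are hypothetical.

```python
from itertools import accumulate


def best_so_far(scores):
    """Running maximum: the best score observed up to each iteration."""
    return list(accumulate(scores, max))


def best_so_far_auc(scores):
    """Mean height of the best-so-far curve (assumed normalisation)."""
    if not scores:
        raise ValueError("need at least one iteration")
    curve = best_so_far(scores)
    return sum(curve) / len(curve)


# Two hypothetical runs that end at the same final outcome.
fast = [0.2, 0.9, 0.9, 0.9]   # finds the good design at iteration 2
slow = [0.2, 0.3, 0.4, 0.9]   # finds it only at the last iteration

same_final = max(fast) == max(slow)
fast_is_more_efficient = best_so_far_auc(fast) > best_so_far_auc(slow)
```

A final-outcome score cannot separate the two runs, while the trajectory score rewards the run that reached the good design earlier. This is the mechanism by which switching metrics can change which model looks best.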

2605.15334 2026-05-18 cs.LG cs.AI cs.CL cs.SE 版本更新

From I/O to Code with Discovery Agent

从输入输出到代码:发现代理

Yihong Dong, Jiaru Qian, Haoran Zhang, Peixu Wang, Binhua Li, Zhi Jin, Yongbin Li, Ge Li, Xiaokang Yang, Xue Jiang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学系) Tongyi Lab, Alibaba Group(阿里集团通义实验室) Wuhan University(武汉大学) Renmin University of China(中国人民大学) National University of Singapore(新加坡国立大学) Shanghai Jiaotong University(上海交通大学)

AI总结 本文提出DIO-Agent,通过将IO2Code视为离散程序空间的进化搜索,利用LLM作为突变算子,结合执行误差信号指导突变,解决从输入输出行为合成代码的难题。

详情
AI中文摘要

将程序自动合成视为计算机科学的圣杯。受LLM推动,NL2Code取得巨大成功,但从输入输出行为合成程序(IO2Code)仍难以解决。NL2Code可利用自然语言与代码的语义对齐,而IO2Code需从具体计算行为中恢复底层原理,面对广阔且未明确规定的假设空间。为此,我们提出DIO-Agent,将IO2Code视为离散程序空间的进化搜索,在其中LLM作为突变算子,执行误差信号指导突变。为防止搜索进入结构复杂但错误的死胡同,引入变换优先前提作为突变先验,使LLM偏向最简单的假设,逐步从常量到条件到迭代。为促进系统研究,我们构建了跨越多个难度级别的IO2CodeBench。大量实验表明,DIO-Agent在所有难度级别和各种LLM上均优于传统程序示例方法和SOTA进化代理基线,同时显著超越等效采样预算下的测试时间扩展策略。

英文摘要

The automatic synthesis of a program from any form of specification is regarded as a holy grail of computer science. Fueled by LLMs, NL2Code has achieved tremendous success, yet the fundamentally more challenging task of synthesizing programs from input-output behavior, which we refer to as IO2Code, remains largely unsolved. Whereas NL2Code can exploit the semantic alignment between natural language and code acquired during pretraining, IO2Code requires recovering underlying principles from concrete computational behavior, navigating a vast and underspecified hypothesis space. To address this, we propose DIO-Agent, a discovery agent for IO2Code. Our method frames IO2Code as an evolutionary search over discrete program space, in which an LLM serves as the mutation operator and concrete error signals from execution guide each mutation. To prevent the search from wandering into structurally complex yet incorrect dead ends, we introduce the Transformation Priority Premise as a mutation prior that biases the LLM toward the simplest hypothesis consistent with current evidence, progressively escalating from constants to conditionals to iteration only when simpler constructs are insufficient. To facilitate systematic study, we further construct IO2CodeBench, a benchmark spanning multiple difficulty levels. Extensive experiments show that DIO-Agent consistently outperforms both traditional program-by-example methods and SOTA evolution-agent baselines across all difficulty levels and various LLMs, while substantially surpassing test-time scaling strategies with equivalent sampling budgets.

2605.15333 2026-05-18 cs.AI 版本更新

Zero-Shot Goal Recognition with Large Language Models

基于大语言模型的零样本目标识别

Kin Max Piamolini Gusmão, Nathan Gavenski, Nir Oren, Felipe Meneguzzi

发表机构 * PUCRS Porto Alegre(南里奥格兰德天主教大学,阿雷格里港) King’s College London(伦敦国王学院) University of Aberdeen(阿伯丁大学)

AI总结 本文首次系统评估前沿大语言模型在经典PDDL基准上的零样本目标识别能力,发现其表现不均,部分模型随证据增加而提升精度,而另一些模型则依赖世界知识先验。

Comments 9 pages, 1 figure, 1 table; appendix with 8 figures and 2 code listings (29 pages total); submitted to NeurIPS 2026

详情
AI中文摘要

大语言模型最近在知名规划领域达到了与经典规划器相当的水平,但这种能力依赖于世界知识的利用而非真正的符号推理。目标识别是一种互补的归纳任务,结构上更适合大语言模型的特长:它涉及评估与世界知识的一致性,而非生成新的动作序列。本文首次系统地对前沿大语言模型进行了零样本评估,以评估其在关键经典PDDL基准上的目标识别能力。我们的结果表明,大语言模型在目标识别上的能力不均:一些模型随着证据的增加而提升,接近全观测下的地标精度,而另一些模型则无论证据如何增加,都依赖于世界知识的先验。对模型推理轨迹的定性分析表明,这种差异反映了证据整合的根本差异,而非领域熟悉度。这些发现将目标识别定位为评估大语言模型基础规划知识的原则性基准。

英文摘要

Large language models have recently reached near-parity with classical planners on well-known planning domains, yet this competence relies on world-knowledge exploitation rather than genuine symbolic reasoning. Goal recognition is a complementary abductive task structurally better suited to LLM strengths: it consists of evaluating consistency with world knowledge rather than generating novel action sequences. This paper provides the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on key classical PDDL benchmarks. Our results show that LLM competence on goal recognition is uneven: some models scale with evidence and approach landmark-based accuracy at full observations, while others remain anchored to world-knowledge priors regardless of how much evidence accumulates. Qualitative analysis of model reasoning traces reveals that this divergence reflects a fundamental difference in evidence integration rather than domain familiarity. These findings position goal recognition as a principled benchmark for the foundational planning knowledge of LLMs.

2605.15315 2026-05-18 cs.AI cs.CL 版本更新

Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning

通过多标准潜在推理进行编码代理的上下文剪枝

Jingjing Wang, Xiwen Chen, Wenhui Zhu, Huayu Li, Zhengxiao He, Feiyang Cai, Ana S. Carreon-Rascon, Xuanzhao Dong, Feng Luo

发表机构 * Clemson University(克莱姆森大学) Morgan Stanley(摩根士丹利) Arizona State University(亚利桑那州立大学) University of Arizona(亚利桑那大学)

AI总结 本文提出LaMR框架,通过分解代码相关性为语义证据和依赖支持两个维度,利用多任务CRF模型提升编码代理的上下文剪枝效果,实验表明其在多个基准测试中表现优异。

详情
AI中文摘要

LLM驱动的编码代理花费大部分token预算阅读仓库文件,但检索到的代码大多与任务无关。现有学习剪枝器使用单一目标序列标注器压缩上下文,将代码相关性所有方面压缩为一个分数和一个转移矩阵。我们证明这种建模瓶颈:单一CRF转移先验必须服务于异质保留模式,包括连续语义跨度和稀疏结构支持线。我们提出LaMR(潜在多标准),一个结构化剪枝框架,将代码相关性分解为两个可解释的质量维度,语义证据和依赖支持,每个由专用CRF建模,具有维度特定的转移动态。混合专家门控网络动态加权每个标准的发射量,根据查询条件。最终CRF层在融合的发射量上产生汇总的保留或剪枝决策。为了监督每个维度而无需额外标注成本,我们通过基于AST的程序分析从现有训练语料中推导出多标准标签,同时去噪教师的二元标签。通过有效过滤干扰噪声,LaMR经常匹配或甚至优于未修剪的完整上下文基线。在四个基准测试(SWE-Bench Verified,SWE-QA,LCC,LongCodeQA)上的实验表明,LaMR在16次头对头多轮比较中胜出12次。它在多轮代理任务中节省多达31%的token,并在单轮任务中将Exact Match提高多达+3.5,同时性能经常通过去噪上下文得到增强,任何剩余的下降都是微小的。

英文摘要

LLM-powered coding agents spend the majority of their token budget reading repository files, yet much of the retrieved code is irrelevant to the task at hand. Existing learned pruners compress this context with a single-objective sequence labeler, collapsing all facets of code relevance into one score and one transition matrix. We show that this formulation creates a modeling bottleneck: a single CRF transition prior must serve heterogeneous retention patterns, including contiguous semantic spans and sparse structural support lines. We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. To supervise each dimension without additional annotation cost, we derive multi-rubric labels from the existing training corpus via AST-based program analysis, simultaneously denoising the teacher's binary labels. By effectively filtering distracting noise, LaMR frequently matches or even outperforms unpruned full-context baselines. Experiments on four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA) show that LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks, while performance is frequently enhanced by denoising the context, and any remaining drops are marginal.

2605.15308 2026-05-18 cs.AI cs.LG cs.MA 版本更新

SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution

SMCEvolve:通过序列蒙特卡洛进化进行原理性科学发现

Jiachen Jiang, Huminhao Zhu, Zhihui Zhu

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) The Ohio State University(俄亥俄州立大学)

AI总结 SMCEvolve通过将程序搜索视为从奖励倾斜的目标分布中采样,并利用序列蒙特卡洛采样器近似该分布,提出三种核心机制:自适应父采样、变异与接受的混合、自动收敛控制,从而在数学、算法效率、符号回归和端到端ML研究基准中超越现有系统。

详情
AI中文摘要

LLM驱动的程序进化已成为自动化科学发现的强大工具,但现有框架缺乏设计其各个组件的原理性指导,并无法保证搜索收敛。我们介绍了SMCEvolve,将其程序搜索重新解释为从奖励倾斜的目标分布中采样,并用序列蒙特卡洛(SMC)采样器近似该分布。从这一视角,三种核心机制浮现为原理性组件:自适应父采样、变异与接受的混合、自动收敛控制。我们进一步提供有限样本复杂性分析,该分析界定了达到目标近似误差所需的LLM调用预算。在数学、算法效率、符号回归和端到端ML研究基准上,SMCEvolve在超越现有最先进的进化系统的同时,使用更少的LLM调用次数在自定终止条件下运行。代码可在https://github.com/kongwanbianjinyu/SMCEvolve获取。

英文摘要

LLM-driven program evolution has emerged as a powerful tool for automated scientific discovery, yet existing frameworks offer no principled guide for designing their individual components and provide no guarantee that the search converges. We introduce SMCEvolve, which recasts program search as sampling from a reward-tilted target distribution and approximates it with a Sequential Monte Carlo (SMC) sampler. From this view, three core mechanisms emerge as principled components: adaptive parent resampling, mixture of mutation with acceptance, and automatic convergence control. We further provide a finite-sample complexity analysis that bounds the LLM-call budget required to reach a target approximation error. Across math, algorithm efficiency, symbolic regression, and end-to-end ML research benchmarks, SMCEvolve surpasses state-of-the-art evolving systems while using fewer LLM calls under self-determined termination. The code is available at https://github.com/kongwanbianjinyu/SMCEvolve.
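
The sampling view can be illustrated with a generic SMC resampling step. This is a sketch of the textbook technique (exponential reward tilting plus resampling triggered by a low effective sample size), not SMCEvolve's actual procedure; the temperature `beta`, the threshold `ess_fraction`, and all function names are assumptions.

```python
import math
import random


def tilted_weights(rewards, beta):
    """Normalised weights proportional to exp(beta * reward)."""
    top = max(rewards)  # subtract the max for numerical stability
    raw = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(raw)
    return [w / total for w in raw]


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2); equals len(weights) when weights are uniform."""
    return 1.0 / sum(w * w for w in weights)


def resample_parents(programs, rewards, beta, rng, ess_fraction=0.5):
    """Resample parents only when the weights have degenerated.

    Returns (programs, weights). After resampling the weights are reset to
    uniform; otherwise the tilted weights are carried forward unchanged.
    """
    n = len(programs)
    weights = tilted_weights(rewards, beta)
    if effective_sample_size(weights) < ess_fraction * n:
        chosen = rng.choices(programs, weights=weights, k=n)
        return chosen, [1.0 / n] * n
    return list(programs), weights


rng = random.Random(0)
population = ["prog_a", "prog_b", "prog_c", "prog_d"]  # placeholder program ids
parents, weights = resample_parents(population, [0.0, 0.0, 0.0, 10.0], beta=5.0, rng=rng)
```

In a full evolution loop, the resampled parents would then be mutated by the LLM and re-scored; a collapse in ESS is one natural signal for adaptive resampling.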

2605.15301 2026-05-18 cs.AI 版本更新

Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

Solvita:通过代理进化增强大型语言模型以应对编程竞赛

Han Li, Jinyu Tian, Rili Feng, Yuqiao Du, Chong Zheng, Chenyu Wang, Chenchen Liu, Shihao Li, Xinping Lei, Yifan Yao, Weihao Xie, Letian Zhu, Jiaheng Liu

发表机构 * Nanjing University(南京大学) Tsinghua University(清华大学) Independent Researcher(独立研究者)

AI总结 Solvita通过闭环系统和可训练知识网络,使代理动态学习,提升编程竞赛任务的准确性和经验积累。

详情
AI中文摘要

大型语言模型(LLMs)在严格的编程竞赛推理需求上仍存在挑战。尽管最近的多代理框架试图弥合这一可靠性差距,但它们本质上是无状态的:它们依赖静态检索并丢弃了从先前任务中获得的有价值的解决问题和调试经验。为了解决这一问题,我们提出了Solvita,一个代理进化框架,它允许持续学习而无需对基础LLM进行权重更新。Solvita将问题解决重新组织为一个闭环系统,包括策略选择、程序合成、认证监督和针对性破解,由四个专门的代理:规划者、求解者、Oracle和黑客执行。关键的是,每个代理都配有一个可训练的图结构知识网络。随着系统的运行,结果信号,如通过/失败判决、测试认证质量和黑客发现的对抗性漏洞,被重新解释为强化学习更新这些网络权重。这使代理能够根据过去的成功和失败动态路由未来的查询,从而在时间上积累可转移的推理经验。在CodeContests、APPS、AetherCode和实时Codeforces轮次中评估,Solvita在代码生成代理中建立了新的最先进的状态,优于现有的多代理流程,并几乎将单次流程基线的准确性翻倍。

英文摘要

Large language models (LLMs) still struggle with the rigorous reasoning demands of hard competitive programming. While recent multi-agent frameworks attempt to bridge this reliability gap, they remain fundamentally stateless: they rely on static retrieval and discard the valuable problem-solving and debugging experience gained from previous tasks. To address this, we present Solvita, an agentic evolution framework that enables continuous learning without requiring weight updates to the underlying LLM. Solvita reorganizes problem-solving into a closed-loop system of strategy selection, program synthesis, certified supervision, and targeted hacking, executed by four specialized agents: Planner, Solver, Oracle, and Hacker. Crucially, each agent is paired with a trainable, graph-structured knowledge network. As the system operates, outcome signals, such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities discovered by the Hacker, are recast as reinforcement learning updates to these network weights. This allows the agents to dynamically route future queries based on past successes and failures, effectively accumulating transferable reasoning experience over time. Evaluated across CodeContests, APPS, AetherCode, and live Codeforces rounds, Solvita establishes a new state-of-the-art among code-generation agents, outperforming existing multi-agent pipelines and nearly doubling the accuracy of single-pass baselines.

2605.15299 2026-05-18 cs.IR cs.AI 版本更新

Fortress: A Case Study in Stabilizing Search Recommendations via Temporal Data Augmentation and Feature Pruning

Fortress:通过时间数据增强和特征剪枝稳定化搜索推荐

Milind Pandurang Jagre, Jia Huang, Dayvid V. R. Oliveira, Zhinan Cheng, Babak Seyed Aghazadeh, Puja Das, Chris Alvino, Jinda Han, Kailash Thiyagarajan

发表机构 * Apple(苹果公司)

AI总结 Fortress通过时间数据增强和特征剪枝稳定化搜索推荐模型,提升预测稳定性和准确性,验证了在大规模应用市场中效果显著。

详情
英文摘要

In search and recommendation systems, predictive models often suffer from temporal instability when certain input features introduce volatility in output scores. This instability can degrade model reliability and user experience, especially in multi-stage systems where consistent predictions are critical for downstream decision making. We introduce Fortress, a general framework for enhancing model stability and accuracy by identifying and pruning features that contribute to inconsistent prediction scores over time. Fortress leverages historical snapshots (temporally partitioned datasets capturing score fluctuations for the same entity across periods) and follows a four-step process: (1) collect historical snapshots, (2) identify samples with unstable predictions, (3) isolate and remove instability-inducing features, and (4) retrain models using only stable features. While semantic features from LLMs and BERT-based models improve generalization, they often lack full query or entity coverage. Engagement-based features offer strong predictive power but tend to introduce temporal instability. Fortress mitigates this trade-off by suppressing the volatility of engagement signals while retaining their predictive value, leading to more stable and accurate models. We validate Fortress on a query-to-app relevance model in a large-scale app marketplace. Offline experiments demonstrate notable improvements in prediction stability (measured by Coefficient of Variation) and classification performance (measured by PR-AUC).
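
Step (2) of the process, identifying samples with unstable predictions, can be sketched with the Coefficient of Variation that the abstract names as its stability measure. The data layout (a dict from a query-app key to its scores across snapshots), the threshold value, and the function names are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """Population standard deviation divided by the absolute mean."""
    mu = mean(scores)
    sd = pstdev(scores)
    if mu == 0:
        # Convention chosen for this sketch: constant zero scores are stable.
        return 0.0 if sd == 0 else float("inf")
    return sd / abs(mu)


def unstable_samples(snapshots, threshold):
    """Return keys whose score across snapshots varies more than threshold.

    snapshots: dict mapping a sample key to its list of scores, one per period.
    """
    return sorted(
        key for key, scores in snapshots.items()
        if coefficient_of_variation(scores) > threshold
    )


# Hypothetical scores for the same query-app pairs over three periods.
history = {
    "weather|app_a": [0.80, 0.81, 0.79],
    "weather|app_b": [0.20, 0.90, 0.50],
}
flagged = unstable_samples(history, threshold=0.1)
```

The flagged samples would then feed step (3), where the features driving their score swings are isolated and pruned before retraining.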

2605.15298 2026-05-18 cs.RO cs.AI cs.CL cs.CV 版本更新

PhysBrain 1.0 Technical Report

PhysBrain 1.0 技术报告

Shijie Lian, Bin Yu, Xiaopeng Lin, Changti Wu, Hang Yuan, Xiaolin Hu, Zhaolong Shen, Yuzhuo Miao, Haishan Liu, Yuxuan Tian, Yukun Shi, Cong Huang, Kai Chen

发表机构 * PhysBrain Team(PhysBrain团队)

AI总结 PhysBrain 1.0 通过将大规模人类自体视频转化为结构化的物理常识监督,提升机器人适应能力,在多模态问答和具身控制基准测试中取得SOTA结果,尤其在SimplerEnv中表现突出。

Comments Project Page: https://phys-brain.github.io

详情
AI中文摘要

视觉-语言-动作模型快速发展,但机器人轨迹单独学习广泛物理理解有限。PhysBrain 1.0研究了一种互补方法:将大规模人类自体视频转换为结构化的物理常识监督,再用于机器人适应。我们的数据引擎提取场景元素、空间动态、动作执行和深度感知关系,将其转化为问题-答案监督训练PhysBrain VLMs。所得物理先验通过保留能力且语言敏感的适应设计转移至VLA策略。在多模态问答基准和具身控制基准,包括ERQA、PhysBench、SimplerEnv-WidowX、LIBERO和RoboCasa中,PhysBrain 1.0取得SOTA结果,尤其在SimplerEnv中表现突出。这些结果表明,从人类交互视频中扩展物理常识能有效连接多模态理解与机器人动作。

英文摘要

Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.

2605.15295 2026-05-18 cs.LG cs.AI cs.CY 版本更新

GESD: Beyond Outcome-Oriented Fairness

GESD:超越以结果为导向的公平性

Gideon Popoola, John Sheppard

发表机构 * Gianforte School of Computing, Montana State University(蒙塔那州立大学计算机学院)

AI总结 本文提出GESD,一种以过程为导向的公平性度量,用于衡量模型解释在不同保护类别子组中的稳定性、鲁棒性和敏感性差异。通过多目标优化框架FEU,提升公平性和实用性。

Comments 7 pages, Accepted at IEEE CAI

详情
AI中文摘要

机器学习(ML)算法日益应用于高风险决策领域,如贷款审批、招聘和再犯预测。尽管现有公平性度量(如统计平等、等机会)能有效量化结果导向的不平等,但对偏见决策的过程或解释缺乏洞察。为此,我们提出组级解释稳定性不平等(GESD),一种以过程为导向的公平性度量,衡量不同保护类别子组中模型解释稳定性、鲁棒性和敏感性的差异。GESD是解释器无关、模型无关的,并将公平性分析扩展到可解释性层面。我们进一步将GESD整合到多目标优化框架中,联合优化效用、基于结果的公平性和基于解释的公平性,称为FEU(公平性-可解释性-效用)。在多个基准数据集上的实验证明,GESD有效捕捉了组间解释质量的差异,且FEU在效用和公平性方面优于现有方法。通过连接基于结果和基于解释的公平性,GESD提供了一种全面的工具,用于诊断和减轻预测建模中的偏见。我们的代码和数据集可在GitHub上获得(https://github.com/horlahsunbo/GESD)

英文摘要

Machine learning (ML) algorithms are increasingly deployed in high-stakes decision-making domains such as loan approvals, hiring, and recidivism predictions. While existing fairness metrics (e.g., statistical parity, equal opportunity) effectively quantify outcome-oriented disparities, they offer limited insight into the procedure or explanation behind biased decisions. To address this gap, we propose Group-level Explanation Stability Disparity (GESD), a procedural-oriented fairness metric that measures disparities in the stability, robustness, and sensitivity of model explanations across different subgroups in a protected category. GESD is explainer-agnostic, model-agnostic, and extends the scope of fairness analyses to the level of explainability. We further integrate GESD into a multi-objective optimization framework that jointly optimizes for utility, outcome-based fairness, and explanation-based fairness called FEU (Fairness-Explainability-Utility). Empirical results on multiple benchmark datasets show that GESD effectively captures group-wise discrepancies in explanation quality, and that FEU improves both utility and fairness over state-of-the-art methods. By bridging outcome-based and explanation-based fairness, GESD offers a comprehensive tool for diagnosing and mitigating bias in predictive modeling. Our code and datasets are available on GitHub: https://github.com/horlahsunbo/GESD

2605.15290 2026-05-18 cs.LG cs.AI 版本更新

GQA-μP: The maximal parameterization update for grouped query attention

GQA-μP:组查询注意力的最大参数更新

Kyle R. Chickering, Huijuan Wang, Mengxi Wu, Alexander Moreno, Muhao Chen, Xuezhe Ma, Daria Soboleva, Joel Hestness, Zhengzhong Liu, Eric Xing

发表机构 * UC Davis(加州大学戴维斯分校) MBZUAI IFM(穆罕默德·本·扎耶德人工智能大学基础模型研究院) USC(南加州大学) Cerebras(Cerebras公司) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文基于谱特征学习观点,提出组查询注意力的最大参数更新方法,通过数学分析实现参数转移,解决了新模型架构下的参数更新难题。

Comments 18 pages

详情
AI中文摘要

超参数在不同模型架构间的转移显著减少了调整大型语言模型(LLMs)所需的计算量。最大更新参数化(μP)通过原则性的数学分析确保转移,但对新模型架构的推导可能具有挑战性。基于Yang等人(2023a)的谱特征学习观点,我们做出了两项进展。首先,我们将权重的谱范数条件从启发式方法提升到特征学习的定义,从而推导出Complete-P深度和权重衰减缩放,而无需依赖懒学习。其次,我们考虑了一种修改的谱范数,该范数在权重矩阵非满秩时保持网络权重的有效缩放定律。这使我们能够(据我们所知,首次)推导出组查询注意力(GQA)的μP缩放。我们通过展示学习率在GQA重复超参数上的转移以及关于权重衰减的实验,证明了我们理论推导的有效性。

英文摘要

Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization (μP) ensures transfer through principled mathematical analysis but can be challenging to derive for new model architectures. Building on the spectral feature-learning view of Yang et al. (2023a), we make two advances. First, we promote spectral norm conditions on the weights from a heuristic to the definition of feature learning, and as a consequence arrive at the Complete-P depth and weight-decay scalings without recourse to lazy-learning. Second, we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables (to our knowledge, the first) derivation of μP scalings for grouped-query attention (GQA). We demonstrate the efficacy of our theoretical derivations by showing learning rate transfer across the GQA repetition hyperparameter as well as experiments regarding transfer over weight decay.

2605.15285 2026-05-18 cs.LG cs.AI cs.NA math.FA math.NA math.OC 版本更新

Universal Approximation of Nonlinear Operators and Their Derivatives

非线性算子及其导数的通用逼近

Filippo de Feo

发表机构 * Institut für Mathematik, Technische Universität Berlin(柏林技术大学数学研究所)

AI总结 本文通过算子学习架构证明了非线性算子及其导数的通用逼近定理,将经典结果扩展到无限维空间,并探讨了其在高阶精度、约束优化和无限维PDE数值方法中的应用。

详情
AI中文摘要

导数引导的算子学习(DIOL),即学习非线性算子及其导数,是算子学习(OL)基础领域中的开放研究前沿。特别是非线性算子及其导数的通用逼近定理(UAT)是非线性泛函分析中的基础性开放问题和精细问题。本文通过OL架构证明了巴纳赫空间之间非线性k次可微算子及其导数的首个通用逼近定理,该定理在紧集上一致成立,并在一般有限输入测度的加权Sobolev范数下成立。我们的结果首次将[Hornik, 1991]中的经典结果完整推广到无限维设置和OL。我们讨论了DIOL和UATs的应用领域:OL中的高阶精度、Banach空间中的快速约束优化(如PDE最优控制、反问题)和无限维PDE的数值方法(如来自PDE最优控制的HJB PDEs在Banach空间、SPDEs、路径依赖系统、部分观测系统、均场控制)。我们通过编码器-解码器架构参数化非线性算子,这些架构因其通用性而著名,包括经典架构如DeepONets、Deep-H-ONets、PCA-Nets。我们的结果基于四个关键特性,使我们能够证明UATs的全面通用性:(i)巴纳赫空间的逼近性质。(ii)Bastiani意义下的k次连续可微性(弱于Fréchet意义下的k次连续可微性)。(iii)自然的紧-开拓扑用于UA;确实,我们显示在算子范数诱导的标准紧-开拓扑下,即使对于Fréchet导数,UA也不成立。(iv)为UA构造新的加权Sobolev空间。

英文摘要

Derivative-Informed Operator Learning (DIOL), i.e. learning a (nonlinear) operator and its derivatives, is an open research frontier at the foundations of the influential field of Operator Learning (OL). In particular, Universal Approximation Theorems (UATs) of nonlinear operators and their derivatives are foundational open questions and delicate problems in nonlinear functional analysis. In this manuscript, we prove the first UATs of non-linear $k$-times differentiable operators between Banach spaces and their derivatives, uniformly on compact sets and in weighted Sobolev norms for general finite input measures, via OL architectures. Our results are the first complete generalizations of the corresponding influential classical results in [Hornik, 1991] to infinite-dimensional settings and OL. We discuss several open areas where DIOL and our UATs find applications: high-order accuracy in OL, fast constrained optimization in Banach spaces (e.g. optimal control of PDEs, inverse problems) and numerical methods for infinite-dimensional PDEs (e.g. HJB PDEs on Banach spaces from optimal control of PDEs, SPDEs, path-dependent systems, partially observed systems, mean-field control). We parameterize nonlinear operators via Encoder-Decoder Architectures, renowned classes in OL due to their generality, including classical architectures, such as DeepONets, Deep-H-ONets, PCA-Nets. Our results are based on four key features that allow us to prove UATs in full generality: (i) Approximation Properties of Banach spaces. (ii) $k$-times continuous differentiability in the sense of Bastiani (weaker than $k$-times continuous Fréchet differentiability). (iii) Natural compact-open topologies for UA; indeed, we show that UA in standard compact-open topologies induced by operator norms is violated even for Fréchet derivatives. (iv) Construction of novel weighted Sobolev spaces for the UA.

2605.15281 2026-05-18 cs.CR cs.AI 版本更新

Autonomous Intelligent Agents for Natural-Language-Driven Web Execution with Integrated Security Assurance

Vinil Pasupuleti, Siva Rama Krishna Varma Bayyavarapu, Shrey Tyagi

发表机构 * International Business Machines (IBM)(国际商业机器公司(IBM)) Salesforce Inc(Salesforce公司)

AI总结 本文提出了一种基于人工智能的自主测试框架,用于实现自然语言驱动的网页执行与集成安全验证。该框架通过导航可靠性、上下文感知选择器生成、后生成验证、智能等待注入和失败学习等五项策略,有效解决了传统网页测试套件易失效的问题。实验表明,该方法显著提升了脚本生成成功率,减少了导航失败和时间相关竞争条件,并大幅降低了测试创建时间;同时,它还能通过自然语言描述攻击场景,自动转换为安全检测探针,有效发现多种安全漏洞,为自然语言驱动的安全测试提供了新颖的解决方案。

Comments 6 pages, 4 figures, 5 tables, IEEE conference format

详情
英文摘要

Modern web test suites rot. A UI refactor breaks locators, a timing change causes race conditions, and within weeks developers abandon the suite entirely. This paper presents an AI-driven autonomous testing framework that addresses these failure modes through five integrated strategies - navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning - implemented over a containerised worker architecture that decouples orchestration from long-running browser execution. Evaluated across four production applications and 176 scenarios, the framework improves script generation success from 55% to 93%, achieves an 8x reduction in navigation failures, eliminates 80% of timing-related race conditions, and reduces test creation time by 75% compared to manual Selenium authoring. The framework extends naturally to security validation: testers describe attack scenarios in plain English - "try accessing another user's invoice" - which the agent converts to OWASP Top 10-aligned browser probes, detecting 85% of authentication bypass vulnerabilities and 95% of input validation flaws with false positive rates below 12%. Natural-language-driven security testing of this kind represents, to our knowledge, a novel contribution to the field.

2605.15252 2026-05-18 cs.LG cs.AI eess.SP 版本更新

PDRNN: Modular Data-driven Pedestrian Dead Reckoning on Loosely Coupled Radio- and Inertial-Signalstreams

Peter Bauer, Andreas Porada, Felix Ott, Christopher Mutschler, Tobias Feigl

发表机构 * Fraunhofer Institute for Integrated Circuits IIS(弗劳恩霍夫集成电路研究所)

AI总结 本文提出了一种名为PDRNN的模块化数据驱动行人航位推算系统,用于处理松耦合的无线电与惯性传感器信号流。该方法基于简单循环神经网络架构,能够隐式预测不同估计方法下的异步传感器数据流,并通过独立的机器学习模型分别估计姿态、速度和位置等关键参数及其方差,最终融合模型结合这些输出以提升系统鲁棒性。实验表明,PDRNN在动态运动数据上的精度和稳定性优于传统方法和现有机器学习方法,同时具备更好的组件控制能力和预测能力。

Comments 12 pages

Journal ref IEEE/ION Position, Location and Navigation Symposium (PLANS), Salt Lake City, UT, May 2025

详情
英文摘要

Modern pedestrian dead reckoning (PDR) systems rely on fusing noisy and biased estimates of position, velocity, and calibrated orientation derived from loosely coupled sensors to determine the current pose of a localized object. However, discrepancies in the sampling rates of sensor-specific estimation methods and unreliable transmission pose significant challenges. Moreover, traditional methods often fail to effectively fuse multimodal sensor data during dynamic movements characterized by high accelerations, velocities, and rapidly varying orientations. To address these limitations, we propose a simple recurrent neural network (RNN) architecture capable of implicitly forecasting asynchronous sensor data streams from diverse estimation methods along reference trajectories. The proposed approach introduces PDRNN, a modular hybrid AI-assisted PDR system that handles each component as an independent ensemble of machine learning (ML) models to estimate both key parameter means and variances. Separate ML-based models are employed to estimate orientation, (un)directed velocity or distance from acceleration and gyroscope data, with optional absolute positioning from synchronized radio systems such as 5G for stabilization. A final fusion model combines these outputs (position, velocity, and orientation) while using uncertainty estimates to enhance system robustness. The modular design allows individual components to be updated, fine-tuned, or replaced without affecting the entire system. Experiments on dynamic sports movement data show that PDRNN achieves superior accuracy and precision compared to classic and ML-based methods, effectively avoiding error accumulation common in black-box approaches. PDRNN also offers forecast capabilities and better component control despite increased system complexity.

2605.15243 2026-05-18 cs.LG cs.AI q-bio.BM q-bio.MN q-bio.QM 版本更新

Reading the Cell, Designing the Cure: Perturbation-Conditioned Molecular Diffusion for Function-Oriented Drug Design

Ziyu Xu, Zijian Zhang, Liang Wang, Zhiyuan Liu, Qiang Liu, Shu Wu, Liang Wang

发表机构 * School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学先进交叉学科学院) NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学人工智能学院) National University of Singapore, Singapore(新加坡国立大学)

AI总结 该研究提出了一种基于转录组的药物设计方法(TBDD),旨在根据期望的基因表达变化生成具有特定功能的分子。为了解决生物学与化学领域间的巨大差异以及转录组信号稀疏性带来的挑战,研究设计了多尺度的扩散生成模型CURE,其核心模块TFE能够提取功能导向的扰动特征,并跨模态对齐化学结构信息,从而生成结构合理且功能一致的候选药物分子。实验表明,该方法在多个基准测试中表现优异,并在零样本基因抑制剂设计任务中验证了其实际应用潜力。

详情
英文摘要

When reliable target structures are unavailable at scale or phenotypes arise from dysregulated pathways, transcriptomic perturbations provide a system-level functional readout for drug action. In this work, we formalize Transcriptome-based Drug Design (TBDD) as a generative inverse problem: designing drug molecules conditioned on desired transcriptomic state transitions. We analyze the inherently ill-posed nature of this task, which is further complicated by the profound domain gap between biology and chemistry and by the sparsity of transcriptomic signals. To address these challenges, we propose CURE (A CellUlar Response Engine), a multi-resolution transcriptome-guided diffusion framework. CURE features a specialized Transcriptome Perturbation Functional Feature Extractor (TFE) that (1) distills function-oriented perturbation embeddings from pre/post states, (2) aligns these signatures to dual chemical views to bridge the cross-modal gap, and (3) performs heterogeneity-aware aggregation to extract robust state-specific signals from noisy transcriptomic data. Extensive evaluations on both standard benchmarks and rigorous out-of-distribution protocols demonstrate that CURE consistently outperforms strong baselines in structural quality and functional consistency. Furthermore, we validate its practical utility via a zero-shot gene-inhibitor design task, highlighting the potential of phenotype-driven generative discovery.

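2605.15238 2026-05-18 cs.SE cs.AI cs.PL version update

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

Alexander Du, Jianjun Ou, Danyang Zhuo, Matthew Lentz

Affiliations * Duke University

AI summary This paper presents Hydra, a system for efficiently recovering from static errors during code generation. Through asynchronous checking and checkpoint-and-rollback support, Hydra avoids the high latency and token consumption of conventional approaches, detecting and repairing errors during generation without regenerating code that is already correct. Experiments show that, on C/C++ code generation tasks, Hydra substantially reduces latency and token usage compared with post-hoc repair.

Details
English abstract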

Large language models are increasingly used for code generation, but many generated programs fail to compile, a prerequisite for further correctness checks such as unit tests. Existing solutions for repairing static errors are costly in both latency and token consumption. Post-hoc repair delays error detection until generation completes and commonly regenerates large regions of previously valid code. Constrained semantic decoding checks after each token, incurring per-token overhead while limiting repair to the current token even when the root cause lies earlier. We present Hydra, a system for efficient recovery from static errors during code generation. Hydra allows checking to proceed asynchronously with generation, avoiding checker overhead when the generated code is semantically correct. In addition, it provides checkpoint-and-rollback support for targeted repair, avoiding regeneration and rechecking of valid prefixes. We retrofit the Clang C/C++ compiler to support Hydra with modest modifications. Paired with a token-efficient repair strategy, Hydra reduces latency by up to 71% and token consumption by up to 70% relative to post-hoc repair on C/C++ code generation tasks that encounter static errors.
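
The checkpoint-and-rollback idea can be illustrated with a toy sketch. This is not Hydra's implementation or its Clang integration: the `CheckpointedOutput` class, its method names, and the use of character offsets are all assumptions made for illustration. The sketch only shows how rolling back to the last checkpoint before an error keeps the valid prefix instead of regenerating it.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointedOutput:
    """Generated text plus the offsets at which a checkpoint was taken (hypothetical)."""
    text: str = ""
    checkpoints: list[int] = field(default_factory=lambda: [0])

    def append(self, chunk: str) -> None:
        # Generation keeps appending; checking is assumed to happen elsewhere.
        self.text += chunk

    def checkpoint(self) -> None:
        # Record the current length as a point that generation can resume from.
        if self.checkpoints[-1] != len(self.text):
            self.checkpoints.append(len(self.text))

    def rollback(self, error_offset: int) -> int:
        """Drop everything after the last checkpoint at or before error_offset."""
        while len(self.checkpoints) > 1 and self.checkpoints[-1] > error_offset:
            self.checkpoints.pop()
        keep = self.checkpoints[-1]
        self.text = self.text[:keep]
        return keep


out = CheckpointedOutput()
out.append("int x = 1;\n")
out.checkpoint()
out.append("int y = x +;\n")   # a checker later reports an error in this statement
out.checkpoint()
out.append("return y;\n")

resume_at = out.rollback(out.text.index("x +"))
print(resume_at, repr(out.text))
```

In this toy run only the first statement survives the rollback, so a repair step would regenerate from the second statement onward rather than from the start.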

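2605.15237 2026-05-18 cs.AR cs.AI version update

A3D: Agentic AI flow for autonomous Accelerator Design

Abinand Nallathambi, Christopher Knight, Shantanu Ganguly, Wilfried Haensch, Anand Raghunathan

Affiliations * Purdue University; Argonne National Laboratory; University of Chicago

AI summary A3D is an agent-based AI flow for end-to-end automated design of hardware accelerators. It autonomously analyzes workloads, identifies performance bottlenecks, refactors code for compatibility with high-level synthesis tools, and generates micro-architectures, substantially reducing the complexity of accelerator design and the need for manual intervention. A3D also automatically explores the speed-area tradeoff space to generate diverse accelerator designs, offering an efficient, automated accelerator design solution for complex scientific applications.

Details
English abstract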

Accelerating applications through the design of hardware accelerators can significantly enhance system performance and energy efficiency. Despite advances, such as high-level synthesis (HLS), designing accelerators for complex applications still remains highly labor-intensive, demanding considerable expertise in understanding workloads to be accelerated, hardware design, micro-architecture, and EDA tool usage, posing challenges for application domain experts. Therefore, most accelerator solutions are limited to applications with a regular predictable dataflow. Advances in AI have enabled agents that perform autonomous planning, reasoning, execution and reflection, leading to unprecedented potential for automation through agentic AI. We present A3D, an Agentic AI flow for end-to-end Automation of hardware Accelerator Design. A3D automates workload analysis, performance bottleneck identification, code refactoring for HLS compatibility and micro-architecture generation. A3D also generates diverse accelerator designs by automatically exploring the speed-area tradeoff space. Recent efforts have explored the use of AI for specific tasks such as design space exploration in HLS, leaving several tasks to still be performed manually. A3D addresses the challenges in applying modern LLMs to accelerator design by judiciously partitioning tasks among specialist agents, orchestrating process loops with specialist and verifier agents, utilizing pre-existing and custom tools, and employing agentic RAG for codebase and proprietary EDA tool documentation exploration. Our implementation of A3D, using commercial components like Claude Sonnet 4.5 and the Catapult HLS tool, demonstrates its effectiveness by generating accelerator designs with no human intervention from complex scientific applications like LAMMPS (molecular dynamics simulation) and QMCPACK (quantum chemistry).

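2605.15228 2026-05-18 cs.AI cs.LG version update

Verifiable Agentic Infrastructure: Proof-Derived Authorization for Sovereign AI Systems

Jun He, Deying Yu

AI summary This paper studies how to verify authorization when autonomous agents execute operations in sovereign AI systems and proposes a Distributed Trust Framework (DTF) based on verifiable proofs. The framework derives execution authority dynamically from structured, verifiable proof objects, ensuring that every high-stakes operation rests on a consensus-validated proof bound to an evidence chain, so that agent behavior is controllable, auditable, and traceable. The approach provides a secure, decentralized authorization infrastructure for autonomous AI systems in cloud-native environments.

Comments 19 pages, 2 figures, 4 tables

Details
English abstract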

Modern cloud and enterprise systems rely on identity-centric authorization, assuming that callers possessing valid credentials are safe to execute commands. The emergence of autonomous AI agents invalidates this assumption: agents can generate syntactically valid but semantically unsafe actions, making standing privileges a significant operational risk. This risk becomes especially acute in sovereign AI systems, where autonomous agents may interact with cloud infrastructure, regulated data, financial workflows, and national-scale digital services. Governed mutation substrates reduce this risk by interposing on agent actions: agents submit intents, infrastructure evaluates context and policy, and execution is mediated. However, this shifts the trust boundary: how can the decision to authorize an intent be made verifiable, distributed, and replayable? We introduce a Distributed Trust Framework (DTF), a verification framework for governed mutation systems that computes execution authority from structured, verifiable artifacts. DTF introduces a Justification Proof to encode the admissibility basis of an action, a consensus model for independent evaluation, an ephemeral Execution Identity derived from the approved proof, and an append-only Evidence Chain that preserves the authorization lifecycle. Under stated substrate assumptions, this architecture enforces a compact authorization invariant: no high-stakes execution without a proof object, no derived authority without consensus, and no valid mutation detached from evidence. We define the model, instantiate it over an OpenKedge-based governed mutation substrate, and show how it maps onto cloud-native environments. By shifting authorization from standing identity to proof-derived authority, DTF provides an infrastructure foundation for making agentic execution governable, auditable, and bounded in sovereign AI deployments.
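
The three-part authorization invariant in the abstract can be sketched roughly as follows. This is an illustrative toy, not the DTF or OpenKedge API: `EvidenceChain`, `authorize`, the vote format, the quorum rule, and the `exec-` identity format are all assumptions. It only shows that authority is derived from an approved proof and that every decision that reaches consensus evaluation is appended to a hash-linked log.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class EvidenceChain:
    """Append-only log; each entry commits to the previous entry's hash (hypothetical)."""
    entries: list[dict] = field(default_factory=list)

    def append(self, record: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else ""
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.entries.append({"record": record, "prev": prev, "hash": digest})
        return digest


def authorize(proof, votes, quorum, chain):
    """Return an ephemeral execution identity, or None if the invariant is not met."""
    if proof is None:
        return None  # no execution without a proof object
    approvals = sum(1 for v in votes if v)
    if approvals < quorum:
        chain.append({"intent": proof["intent"], "decision": "denied"})
        return None  # no derived authority without consensus
    # The identity is derived from the evidence entry, so it cannot be detached from it.
    digest = chain.append({"intent": proof["intent"], "decision": "approved"})
    return "exec-" + digest[:16]


chain = EvidenceChain()
identity = authorize({"intent": "scale-service"}, [True, True, False], quorum=2, chain=chain)
print(identity is not None, len(chain.entries))
```

Deriving the identity from the evidence hash is one simple way to tie authority to evidence; the paper's actual derivation is not specified in the abstract.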

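2605.15227 2026-05-18 cs.AI cond-mat.mtrl-sci cs.RO version update

NIMO Controller: a self-driving laboratory orchestrator based on the Model Context Protocol

Naruki Yoshikawa, Ryo Tamura

Affiliations * National Institute for Materials Science; Graduate School of Frontier Sciences, The University of Tokyo

AI summary This paper proposes NIMO Controller, a self-driving laboratory (SDL) control architecture based on the Model Context Protocol (MCP), to address the lack of standardized interfaces in existing SDL software frameworks and their poor support for AI agents. The architecture exposes all SDL functionality through MCP servers and provides a visual programming interface built on MCP tool discovery, so users can design experimental workflows without writing code while AI agents interact through the same backend. A color-matching experiment demonstrates the feasibility and practicality of the architecture.

Comments 9 pages, 4 figures

Details
English abstract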

Self-driving laboratories (SDLs) have attracted increasing attention as a means of accelerating scientific discovery; however, developing SDL software remains technically demanding. To improve accessibility, orchestration software frameworks have been proposed to coordinate SDL components. Nevertheless, existing frameworks are primarily designed for human interaction and do not provide standardized interfaces suitable for AI agents. In this work, we propose an SDL software architecture based on the Model Context Protocol (MCP), in which all SDL functionalities are exposed through MCP servers. Following this design principle, we introduce an MCP-based SDL orchestrator, named NIMO Controller. It provides a visual programming interface automatically generated through MCP-based tool discovery, allowing human users to design experimental workflows without writing code. The same MCP backend can also be accessed by AI agents, providing a unified interface for both human users and AI agents. We demonstrate the proposed system through a case study on a color-matching SDL. The results validate the usability of the proposed MCP-based SDL architecture.

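2605.15226 2026-05-18 cs.AR cs.AI cs.SE version update

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

Qingyun Zou, Feng Yu, Hongshi Tan, Bingsheng He, WengFai Wong

Affiliations * National University of Singapore

AI summary This paper examines whether agentic AI systems built for software engineering are suited to real hardware engineering tasks and introduces Phoenix-bench, a benchmark of 511 verified Verilator instances that supports comprehensive evaluation of hardware design-flow, bug-fixing, and verification tasks. The study finds that hardware and software engineering differ markedly in how bugs propagate and how they are fixed, and that localization granularity and feedback mechanisms strongly affect agent performance, providing a useful reference for future use of agents in hardware engineering.

Details
English abstract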

We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks but none jointly requires repository navigation, hierarchy-aware localization, Electronic Design Automation (EDA) executable verification, and maintenance-style patching. We introduce Phoenix-bench, a synchronized corpus of 511 verified Verilator instances from 114 GitHub repositories, each shipped with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment so resolved-rate differences reflect agent behavior rather than toolchain availability. Using Phoenix-bench we run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones, plus two diagnostic interventions (file-level oracle localization and one round of testbench-log feedback). Three findings emerge. (i) Software and hardware are fundamentally different engineering tasks: the same agent loses 37% to 58% from SWE-bench Verified to Phoenix-bench because hardware bugs propagate across parallel instantiated modules through signal flow rather than along a software-style call graph, and software-tuned agents stop at the symptom file instead of tracing back through the instantiation chain. (ii) Failures concentrate on design control-flow / finite state machine (FSM) bugs, verification testbench bugs, and hard cases that demand cross-hierarchy signal-flow tracking and coordinated multi-file edits. (iii) Localization granularity matters far more than localization itself: a perfect file-level oracle yields only +1.4% because the agent then breaks files that did not need editing, while a single round of test case feedback lifts resolved rate by 42% to 45% because the test case tells where the bug is and what the fix has to look like.

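2605.15225 2026-05-18 q-bio.QM cs.AI version update

Do Biological Structural Guarantees Earn Their Complexity?

Bogdan Banu

AI summary This paper asks whether biological structural guarantees are worth their complexity. It builds three deep benchmarks that compare AI frameworks based on biological mechanisms (metabolic priority gating, autoinducer-based quorum sensing, and Bayesian stagnation detection) against non-biological alternatives and ablated controls over thousands of trials, testing the practical reliability benefits and costs of biological structure.

Details
English abstract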

Biologically-inspired AI agent frameworks claim reliability benefits through structural guarantees adapted from gene regulatory networks, immune systems, and metabolic control. These claims are rarely tested empirically against simpler alternatives. We present three deep benchmarks: metabolic priority gating, autoinducer-based quorum sensing, and Bayesian stagnation detection, each comparing a biologically-grounded implementation against a naive non-biological alternative and an ablated control, across 1,000 trials per seed and 10 seeds (10M+ data points total).

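2605.15224 2026-05-18 cs.AI cs.MA version update

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Jianbo Lin, Xiaomin Yu, Yi Xin, Yifu Guo, Zhuosong Jiang, Zhongqi Yue, Weishi Wang, Heqing Zou, Chengwei Qin, Hui Xiong

Affiliations * Hong Kong University of Science and Technology (Guangzhou); Nanjing University; Sun Yat-sen University; National University of Singapore; Nanyang Technological University; SAP; Microsoft Research

AI summary This paper proposes ICRL, a new reinforcement learning framework that lets large language models internalize the guidance they receive from self-critique so that they continue to perform well without external critique. The framework jointly trains a solver and a critic and uses the performance gain produced by the critique as the critic's reward, pushing the critic toward feedback that actually helps. To handle the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibrated re-weighting strategy and stabilizes joint optimization with role-wise group advantage estimation. Experiments show clear gains across a range of tasks, and the trained critic is comparable in performance to much larger models.

Details
English abstract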

Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.

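2605.15223 2026-05-18 cs.AR cs.AI version update

GenAI-Driven Approach to RISC-V Supply Chain Exploration

Nenad Petrovic, Andre Schamschurko, Yingjie Xu, Alois Knoll

Affiliations * Chair of Robotics, Artificial Intelligence and Real-Time Systems; Technical University of Munich

AI summary This paper proposes a large language model (LLM)-based workflow for analyzing the RISC-V supply chain that combines vision-language models (VLMs) and model-driven engineering (MDE) to enable multimodal, data-driven analysis of heterogeneous, unstructured supply chain data. LLMs interpret textual information and VLMs extract information from visual documents such as diagrams and tables to build a supply chain knowledge graph; MDE techniques then validate dependencies, detect bottlenecks, and assess risks, supporting both exploratory and systematic analysis of supply chain resilience. Experiments show that the method improves supply chain transparency and decision support in the RISC-V ecosystem.

Details
English abstract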

This paper presents an LLM-empowered workflow for RISC-V supply chain analysis, integrating Vision-Language Models (VLMs) and Model-Driven Engineering (MDE) to enable comprehensive, multimodal data-driven insights. The proposed approach addresses the challenges of heterogeneous and unstructured supply chain data by leveraging LLMs for textual understanding and VLMs for extracting information from visual artifacts such as diagrams, tables, and scanned documents. These models collaboratively identify key entities and relationships, which are then organized into a knowledge graph representing supply chain components and their interdependencies. For analytical reasoning, the workflow incorporates MDE techniques and constraint-based modeling to enable formal validation of dependencies, detection of bottlenecks, and assessment of risks. The synergy between LLM- and VLM-based semantic understanding and MDE-based formal analysis supports both exploratory and systematic evaluation of supply chain resilience. A human-in-the-loop mechanism further enables interactive querying and expert validation. The approach is evaluated in RISC-V ecosystem scenarios, demonstrating its effectiveness in generating actionable insights, enhancing transparency, and supporting decision-making in complex semiconductor supply chains.

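2605.15221 2026-05-18 cs.SE cs.AI cs.CL version update

Effective Harness Engineering for Algorithm Discovery with Coding Agents

Yoichi Ishibashi, Taro Yano, Masafumi Oyamada

Affiliations * Gemini Algorithmicsuperintelligence

AI summary This paper studies how to design an effective execution harness for algorithm discovery so as to improve automated algorithm generation based on large language models and evolutionary search. By analyzing the tradeoff between the number and depth of generated algorithms, the handling of evaluation hacks, and safe parallel execution, it proposes the improved Vesper framework and validates it on the circle packing problem. Experiments show that, under a fixed budget, generating fewer but more deeply considered algorithms yields better results, and that more capable models produce evaluation hacks more readily, underscoring the importance of hack detection.

Details
English abstract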

AlphaEvolve and FunSearch have demonstrated the potential of combining large language models (LLMs) with evolutionary search for automated algorithm discovery. However, discovery success is shaped not only by model capability but also significantly by the design of the execution infrastructure, i.e., the harness. This paper investigates effective harness design through three questions: under a fixed token budget, is it better to produce many algorithms with brief thought or fewer algorithms with deeper thought? How should the harness handle evaluation hacks, where generated programs exploit the scoring function? And how can agents that require full filesystem access execute safely in parallel? Using Vesper, an algorithm discovery framework that incorporates harness improvements addressing these questions, we evaluate on Circle Packing under the same token budget. Interestingly, generating fewer algorithms while thinking more deeply about each one achieved higher scores. That is, scaling the quality of each individual is more budget-efficient than scaling the number of evolutionary generations. Surprisingly, more capable models produced evaluation hacks at higher rates, making hack detection increasingly necessary as models scale.

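2605.15220 2026-05-18 cs.CL cs.AI cs.LG version update

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

Michael Y. Hu, Apurva Gandhi, Kyunghyun Cho, Tal Linzen, Pratyusha Sharma

Affiliations * New York University; Carnegie Mellon University; Microsoft

AI summary Data mixing, which determines how training data from different sources or of different types is combined, plays a key role in language model training. This paper proposes OP-Mix, an efficient data mixing algorithm that runs continuously across the entire language model training lifecycle, addressing the limitation that existing methods apply to only a single training phase. The method cheaply simulates candidate data mixtures by training low-rank adapters on the current model and interpolating between them, which removes the dependence on proxy models and keeps the search grounded in the model's actual learning dynamics. Experiments show that OP-Mix reaches near-optimal performance at lower compute cost in pretraining, continual fine-tuning, and related settings.

Details
English abstract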

Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. We argue that data mixing is fundamentally an online decision making problem -- one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters trained directly on the current model, eliminating separate proxy models and ensuring the search is always grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of the baselines. In pretraining, OP-Mix improves upon training without mixing by 6.3% in average perplexity. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively. OP-Mix suggests a different view of language model training: not a sequence of distinct phases, but a single continuous process of learning from data.

2605.15218 2026-05-18 cs.AI cs.CE version update

CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation

Chenying Lin, Yichen Hai, Yi He, Ran Wang, Haiyan Qiang, Liang Yu

Affiliations * Shanghai Ultradimension Technology Co., Ltd.; College of Logistics Engineering, Shanghai Maritime University; School of Civil Aviation, Northwestern Polytechnical University; State Key Laboratory of Airliner Integration Technology; National Key Laboratory of Strength and Structural Integrity; Wuhan University

AI summary This paper presents CAX-Agent, a lightweight agent harness designed to make automation of MAPDL finite-element simulation more reliable. The framework introduces domain-specific middleware for tool lifecycle management, workflow state control, and fault recovery, addressing the inconsistent outputs and task failures common when large language models are applied to this task. The evaluation shows that CAX-Agent's model-driven recovery strategy performs well across multiple structural benchmarks and clearly outperforms rule-only recovery and no recovery.

Comments 8 pages, 6 figures, IEEE conference format

Details
English abstract

Large language models deployed for MAPDL finite-element simulation face practical reliability challenges: without structured execution control, tool encapsulation, and fault recovery, outputs may be inconsistent and task failures are common. The Agent Harness paradigm addresses this by inserting domain-specific orchestration middleware that manages tool lifecycles, workflow state, and recovery escalation. This paper presents the architecture of CAX-Agent, a lightweight agent harness purpose-built for MAPDL automation, and empirically evaluates one of its core components -- the recovery policy. CAX-Agent organizes execution into three layers -- LLM service, agent harness, and solver backend -- with a recovery ladder that escalates from deterministic rule patching through model-driven regeneration to context enrichment and human intervention. We evaluate three recovery strategies (no_recovery, rule_only, and model_only) on 50 standard structural benchmarks with three repeated runs per strategy (450 case-runs total). Two independent human raters score task completion under blind conditions; inter-rater agreement is strong (quadratic weighted Cohen's kappa = 0.84, 96 percent of score pairs within one point). Model_only achieves the best completion rate (0.9267), task score (3.59/4), total score (9.16/10), and zero-intervention rate (0.84), outperforming rule_only (0.7733, 3.17/4, 7.03/10, 0.00) and no_recovery (0.6933, 2.74/4, 5.60/10, 0.00) with large effect sizes (Cliff's delta = 0.81-0.87). The benchmark uses deliberately simple geometries to isolate recovery-policy effects; we discuss the scope of these findings and directions for broader validation.
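
A recovery ladder of the kind described above can be sketched as an escalation loop. This is a minimal illustration under stated assumptions, not CAX-Agent's code: `run_with_recovery`, the `execute` callback (returning an error string or `None`), and the rung names are hypothetical, and the toy "solver" below simply checks for a closing `FINISH` line.

```python
from typing import Callable, Optional

# A rung takes (script, error) and returns a repaired script, or None if it cannot help.
Repair = Callable[[str, str], Optional[str]]


def run_with_recovery(script: str,
                      execute: Callable[[str], Optional[str]],
                      ladder: list[tuple[str, Repair]]):
    """Try each recovery rung in order until the solver accepts the script."""
    error = execute(script)
    if error is None:
        return script, "none"
    for name, repair in ladder:
        candidate = repair(script, error)
        if candidate is None:
            continue  # this rung has nothing to offer; escalate
        new_error = execute(candidate)
        if new_error is None:
            return candidate, name
        script, error = candidate, new_error  # carry the partial repair upward
    return None, "human_intervention"


def toy_execute(script: str) -> Optional[str]:
    # Stand-in for a solver run: only complains about a missing FINISH line.
    return None if script.endswith("FINISH") else "missing FINISH"


def rule_patch(script: str, error: str) -> Optional[str]:
    # Deterministic rule: append FINISH when that is what the error asks for.
    return script + "\nFINISH" if "FINISH" in error else None


print(run_with_recovery("/PREP7", toy_execute, [("rule_patch", rule_patch)]))
```

Returning the name of the rung that succeeded makes it easy to count how often each level of the ladder was needed, which is the kind of per-strategy comparison the evaluation reports.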

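2605.15217 2026-05-18 cs.AI cs.CY cs.LG econ.GN q-fin.EC version update

Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions

Jagdish Tripathy, Marcus Buckmann

Affiliations * Bank of England

AI summary This study examines the asymmetric relationship between the behavioral fairness that instruction-tuned language models show in high-stakes decisions (such as mortgage approval) and the latent bias in their internal representations. It finds that although the models appear unbiased at the output level, their internal representations retain and amplify race-associated bias, and that this hidden bias is causally potent: targeted interventions can trigger decision reversals. The study also shows that the bias is asymmetric across groups and argues that output-only behavioral audits are insufficient to identify and govern latent bias, calling for a dual-layer evaluation framework that adds representational analysis.

Comments 39 pages, 16 figures, 2 tables

Details
English abstract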

Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown. We investigate the use of open-weight models for mortgage underwriting using matched applications that differ only in racially-associated names and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. Critically, this latent bias is asymmetric - steering interventions affect decisions in one demographic direction, while producing minimal effects in reverse - and susceptible to adversarial prompt engineering and parameter-efficient fine-tuning. These findings demonstrate that behavioural audits focused on outputs are insufficient: fair outputs can mask exploitable internal biases. They also motivate dual-layer testing frameworks combining output evaluation with representational analysis for AI governance in high-stakes decisions.

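2605.15215 2026-05-18 cs.AI cs.SE version update

SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces

Duling Xu, Zheng Chen, Zaifeng Pan, Jiawei Guan, Dong Dong, Jialin Li, Bangzheng Pu

Affiliations * AetherHeart Tech Co., Ltd.; Renmin University of China; University of California San Diego

AI summary SkillSmith is a boundary-guided compiler-runtime framework for optimizing skill-based agent systems. It compiles skill packages offline into minimal executable interfaces and extracts the fine-grained operational boundaries of each skill, so that at runtime the agent invokes only the relevant components, reducing redundant context injection and repeated reasoning. Experiments show that SkillSmith substantially reduces solve-stage token usage, thinking iterations, and execution time while improving task accuracy, and that compiled artifacts produced by a strong model can be reused by lightweight models.

Details
English abstract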

Recently, skills have been widely adopted in large language model (LLM)-based agent systems across various domains. In existing frameworks, skills are typically injected into the agent reasoning loop as contextual guidance once matched to a runtime task, enabling specialized task-solving capabilities. We find that this execution paradigm introduces two major sources of redundancy: irrelevant context injection and repeated skill-specific reasoning and planning. To this end, we propose SkillSmith, a boundary-first compiler-runtime framework that compiles skill packages offline into minimal executable interfaces. By extracting fine-grained operational boundaries from skills, SkillSmith enables agents to dynamically access and execute only the relevant components at runtime, thereby minimizing unnecessary context injection and redundant reasoning overhead. In an evaluation on the SkillsBench benchmark, SkillSmith reduces solve-stage token usage by 57.44%, thinking iterations by 42.99%, solve time by 50.57% (2.02x faster), and token-proportional monetary cost by 57.44% compared with using raw skills. Moreover, compiled artifacts produced by a stronger model can be reused by a smaller or more efficient runtime model, improving task accuracy in cases where raw skill interpretation fails. The source code and data are available at https://github.com/AetherHeart-AI/Aeloon.
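
The "2.02x faster" figure follows from the reported 50.57% reduction in solve time, since a fractional time reduction r corresponds to a speedup of 1 / (1 - r). The short check below makes that conversion explicit; the function name is ours, not the paper's.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (e.g. 0.5057) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


print(round(speedup_from_time_reduction(0.5057), 2))  # 2.02
```

The same reasoning explains why the token-proportional monetary cost falls by exactly the same 57.44% as solve-stage token usage: a cost that scales linearly with tokens shrinks by the same fraction.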

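2605.15213 2026-05-18 cs.IR cs.AI version update

An LLM-RAG Approach for Healthy Eating Index-Informed Personalized Food Recommendations

Yibin Wang, Yanjie Yang, Grace Melo Guerrero, Rodolfo M. Nayga, Azlan Zahid

Affiliations * Department of Biological and Agricultural Engineering, Texas A&M AgriLife Research

AI summary This study proposes a retrieval-augmented generation (RAG) framework informed by the Healthy Eating Index (HEI) for generating personalized healthy-eating recommendations. The method combines standardized nutrition databases with large language models, building a food embedding space and computing HEI scores to give users personalized dietary suggestions that meet health standards. The results show that the method effectively raises users' HEI scores and improves diet quality.

Details
English abstract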

Diet quality is a leading determinant of chronic disease risk. Advances in artificial intelligence (AI) have enabled food recommendation systems to adapt suggestions to user preferences and health goals. However, most current systems rely on loosely curated food databases and provide limited connection to a validated index. In this study, we propose a Healthy Eating Index (HEI) informed retrieval-augmented generation (RAG) framework that combines standardized nutrition databases with large language models (LLMs) for personalized food recommendations. Our proposed method anchors retrieval in the National Health and Nutrition Examination Survey (NHANES) and the Food Patterns Equivalents Database (FPED). A food-level embedding space is constructed from FPED-derived textual descriptions. For each entity, the system computes baseline HEI scores, retrieves candidate foods for intake recommendations, and estimates the HEI impact of simple substitutions or additions. A constrained RAG pipeline instantiated with a pretrained OpenAI LLM generates personalized recommendations and sources based on nutrient profiles and HEI contributions. The simulation results showed a mean HEI improvement of 6.45, with the proportion of users with an HEI over 50 increasing from 45.12 to 61.26. Quantile analysis revealed consistent improvements across the HEI distribution. Our findings suggest that the proposed LLM-RAG-based AI systems can support more precise, explainable, and personalized nutrition guidance to improve diet quality.

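2605.15208 2026-05-18 cs.LG cs.AI version update

Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

Plawan Kumar Rath, Rahul Maliakkal

Affiliations * Meta

AI summary This study examines how quantization-based compression affects bias in large language models (LLMs). It finds that low-precision quantization causes models to develop new stereotypical behaviors, and that the change follows a dose-response relationship with precision level. Through large-scale experiments across multiple models and precision levels, the study shows that conventional quality metrics fail to detect this increase in bias, underscoring the importance of fairness testing before compressed models are deployed.

Comments 7 pages, 4 figures, 4 tables. Accepted at IEEE Cloud Summit 2026. This is the author's accepted version; the version of record will appear in IEEE Xplore

Details
English abstract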

Large Language Models are routinely compressed via post-training quantization to reduce inference costs and memory footprint for cloud and edge deployment, yet the impact of this compression on model quality remains poorly understood. Existing studies typically compare only two conditions (full-precision vs. a single quantized variant), rely on aggregate bias metrics, and evaluate a single model family, making it impossible to distinguish gradual degradation from threshold-dependent safety failures. We conduct a controlled empirical study of three instruction-tuned models (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at five precision levels (BF16 through 3-bit) on 12,148 BBQ bias benchmark items across 5 random seeds, totaling 911,100 inference records. Our results reveal that 3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors, following a clear dose-response pattern confirmed via logistic regression, while models' willingness to select "unknown" answers declines by 17.4%. Crucially, these item-level changes are invisible to standard quality metrics: perplexity increases by less than 0.5% at 8-bit and under 3% at 4-bit across all three models, yet 2.5-5.6% of items already develop new biases at 4-bit. These findings demonstrate that aggregate evaluation metrics systematically miss fairness-critical degradation, underscoring the need for quality-aware compression protocols that explicitly test for bias emergence before deployment.
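
The total of 911,100 inference records is consistent with a full factorial design over the quantities stated in the abstract (3 models, 5 precision levels, 12,148 BBQ items, 5 seeds). The check below assumes every combination was run exactly once, which the abstract implies but does not state outright.

```python
def total_records(models: int, precisions: int, items: int, seeds: int) -> int:
    """Number of inference records in a full factorial sweep."""
    return models * precisions * items * seeds


print(total_records(models=3, precisions=5, items=12_148, seeds=5))  # 911100
```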

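2605.15206 2026-05-18 cs.LG cs.AI cs.DC version update

AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices

Dzung Pham, Kleomenis Katevas, Ali Shahin Shamsabadi, Hamed Haddadi

Affiliations * University of Massachusetts Amherst; Brave Software; Imperial College London

AI summary As autonomous agents built on large language models are increasingly applied to complex tasks, local deployment improves privacy and lowers cost, but it consumes far more resources than ordinary language model interactions. This paper studies the energy cost of running agents locally on consumer hardware and proposes AgentStop, a lightweight supervision mechanism that predicts the likelihood of task failure and terminates unproductive trajectories early. It reduces wasted energy by 15-20% with only a small impact on task performance, offering a practical path toward sustainable local agent systems.

Comments ACM CAIS '26

Details
English abstract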

Autonomous agents powered by large language models (LLMs) are increasingly used to automate complex, multi-step tasks such as coding or web-based question answering. While remote, cloud-based agents offer scalability and ease of deployment, they raise privacy concerns, depend on network connectivity, and incur recurring API costs. Deploying agents locally on user devices mitigates these issues by preserving data privacy and eliminating usage-based fees. However, agentic workflows are far more resource-intensive than typical LLM interactions. Iterative reasoning, tool use, and failure retries substantially increase token consumption, often expending significant compute without successfully completing tasks. In this work, we investigate the time, token, and energy overhead of locally deployed LLM-based agents on consumer hardware. Our measurements show that agentic execution increases GPU power draw, temperature, and battery drain compared to single-inference workloads. To address this inefficiency, we introduce AgentStop, a lightweight efficiency supervisor that predicts and preemptively terminates trajectories unlikely to succeed. Leveraging low-cost execution signals, such as token-level log probabilities, AgentStop can reduce wasted energy by 15-20% with minimal impact on task performance (<5% utility drop) for challenging web-based question answering and coding benchmarks. These findings position predictive early termination as a practical mechanism for enabling sustainable, privacy-preserving LLM agents on user devices. Our project code and data are available at https://github.com/brave-experiments/AgentStop.
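
As a rough illustration of early termination driven by low-cost execution signals, the sketch below stops a trajectory when the mean token log-probability stays low for several consecutive steps. This is not AgentStop's predictor: the rule, the `threshold` of -2.0, and the `patience` of 2 are assumptions chosen only to make the idea concrete.

```python
def should_stop(step_logprobs, threshold=-2.0, patience=2):
    """Return True once `patience` consecutive steps have a mean token
    log-probability below `threshold` (a hypothetical stopping rule)."""
    low_streak = 0
    for logprobs in step_logprobs:
        if not logprobs:
            continue  # a step with no tokens carries no signal
        mean_lp = sum(logprobs) / len(logprobs)
        low_streak = low_streak + 1 if mean_lp < threshold else 0
        if low_streak >= patience:
            return True
    return False


confident = [[-0.1, -0.3], [-0.2, -0.4], [-0.1, -0.2]]
struggling = [[-0.1, -0.2], [-3.0, -2.5], [-4.0, -3.0]]
print(should_stop(confident), should_stop(struggling))
```

A learned predictor would replace the fixed threshold, but the control flow is the same: the supervisor observes cheap signals after each step and cuts the trajectory short when success looks unlikely.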

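2605.15205 2026-05-18 cs.AI version update

Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

Nanxu Gong, Zixin Chen, Haotian Li, Zishu Zhao, Jianxun Lian, Huamin Qu, Yanjie Fu, Xing Xie

Affiliations * Arizona State University; Hong Kong University of Science and Technology; Microsoft Research Asia; Smith College

AI summary This study asks whether improving the Theory of Mind (ToM) capability of large language models (LLMs) actually improves human-AI interaction. It notes that existing benchmarks mostly assess ToM from a third-person perspective through story reading and multiple-choice questions, overlooking the first-person, dynamic, and open-ended nature of real interactions. The study therefore proposes a new interactive ToM evaluation paradigm and systematically evaluates four representative ToM enhancement techniques using real-world datasets and a user study. It finds that gains on static benchmarks do not necessarily carry over to dynamic human-AI interaction, highlighting the importance of interaction-based evaluation for developing the next generation of socially intelligent models.

Details
English abstract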

Improving the Theory of Mind (ToM) capability of Large Language Models (LLMs) is crucial for effective social interactions between these AI models and humans. However, existing benchmarks often measure ToM capability improvement through story-reading, multiple-choice questions from a third-person perspective, while ignoring the first-person, dynamic, and open-ended nature of human-AI (HAI) interactions. To directly examine how ToM improvement techniques benefit HAI interactions, we first proposed a new paradigm of interactive ToM evaluation with both perspective and metric shifts. Next, following the paradigm, we conducted a systematic study of four representative ToM enhancement techniques using four real-world datasets and a user study, covering both goal-oriented tasks (e.g., coding, math) and experience-oriented tasks (e.g., counseling). Our findings reveal that improvements on static benchmarks do not always translate to better performance in dynamic HAI interactions. This paper offers critical insights into ToM evaluation, showing the necessity of interaction-based assessments in developing next-generation, socially aware LLMs for HAI symbiosis.

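2605.15204 2026-05-18 cs.AI Updated version

SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

Zhantao Wang

Affiliations * Digital China

AI summary This paper proposes SDOF, a multi-agent coordination framework that addresses the lack of stage constraints in task dispatch in existing systems. The framework treats multi-agent execution as a constrained state machine and combines reinforcement learning with finite-state automata to achieve precise control over task flow and compliance verification. Experiments show that in real-world scenarios such as a recruitment system, SDOF achieves higher task completion and execution safety, significantly outperforming existing models.

Comments 12 pages, 4 figures, 14 tables

Details
English abstract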

Multi-agent orchestration frameworks such as LangChain, LangGraph, and CrewAI route tasks through graph-based pipelines but do not enforce the stage constraints that govern real business processes. We present SDOF, a framework that treats multi-agent execution as a constrained state machine. SDOF operates through two primary defensive layers, implemented by three components: (1) an Online-RLHF Specialized Intent Router trained via Generative Reward Modeling (GRPO) and (2) a StateAwareDispatcher with GoalStage finite-automaton checks and precondition/postcondition SkillRegistry validation for auditable execution control. On a recruitment system backed by the Beisen iTalent platform (6000+ enterprises), 185 expert-curated scenarios trigger 1671 live API calls. Our GSPO-aligned 7B Intent Router achieves higher joint accuracy than zero-shot GPT-4o on this FSM-constrained adversarial routing benchmark (80.9% versus 48.9%). In end-to-end execution, SDOF reaches 86.5% task completion (95% confidence interval 80.8 to 90.7) and blocks all 22 operations in the injection, illegal HR subset. Under a broader message-level blocking audit, SDOF attains precision 100% and recall 88%, expert agreement kappa=0.94. A separate evaluation on 960 SGD-derived dialogues spanning 8 service domains surfaces 201 stage-order conflicts under our FSM mapping, 41 of which arise in the normal split. This arXiv version reports the current validated scope; extended multi-seed training comparisons and deeper workflow evaluations will be released in a subsequent update.
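
The state-constrained dispatch described in this abstract can be illustrated with a minimal Python sketch. It is not SDOF's implementation: the `Skill` and `Dispatcher` classes, the stage names, and the recruitment skills are all hypothetical. Only the general idea is taken from the abstract, namely a finite-automaton stage check followed by precondition and postcondition validation, with every decision written to an audit log.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

State = Dict[str, object]


@dataclass
class Skill:
    """A hypothetical registry entry: one operation bound to one stage transition."""
    name: str
    from_stage: str
    to_stage: str
    precondition: Callable[[State], bool]
    postcondition: Callable[[State], bool]
    run: Callable[[State], State]


@dataclass
class Dispatcher:
    """Blocks any skill call that violates the stage automaton or its contracts."""
    transitions: Set[Tuple[str, str]]  # allowed (from_stage, to_stage) pairs
    stage: str                         # current stage of the workflow
    skills: Dict[str, Skill] = field(default_factory=dict)
    audit_log: List[Tuple[str, str]] = field(default_factory=list)

    def register(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def dispatch(self, name: str, state: State) -> Tuple[bool, State]:
        skill = self.skills.get(name)
        if skill is None:
            return self._block(name, "unknown skill", state)
        if skill.from_stage != self.stage:
            return self._block(name, "wrong stage", state)
        if (skill.from_stage, skill.to_stage) not in self.transitions:
            return self._block(name, "illegal transition", state)
        if not skill.precondition(state):
            return self._block(name, "precondition failed", state)
        # Run on a copy so that a rejected call leaves the caller's state intact.
        new_state = skill.run(dict(state))
        if not skill.postcondition(new_state):
            return self._block(name, "postcondition failed", state)
        self.stage = skill.to_stage
        self.audit_log.append((name, "allowed"))
        return True, new_state

    def _block(self, name: str, reason: str, state: State) -> Tuple[bool, State]:
        self.audit_log.append((name, f"blocked: {reason}"))
        return False, state


# Hypothetical recruitment workflow: screening -> interview -> offer.
dispatcher = Dispatcher(
    transitions={("screening", "interview"), ("interview", "offer")},
    stage="screening",
)
dispatcher.register(Skill(
    name="schedule_interview",
    from_stage="screening", to_stage="interview",
    precondition=lambda s: s.get("resume_ok") is True,
    postcondition=lambda s: "interview_slot" in s,
    run=lambda s: {**s, "interview_slot": "slot-1"},
))
dispatcher.register(Skill(
    name="send_offer",
    from_stage="interview", to_stage="offer",
    precondition=lambda s: s.get("interview_passed") is True,
    postcondition=lambda s: s.get("offer_sent") is True,
    run=lambda s: {**s, "offer_sent": True},
))

state: State = {"resume_ok": True}
skipped, state = dispatcher.dispatch("send_offer", state)  # out of order: blocked
scheduled, state = dispatcher.dispatch("schedule_interview", state)
```

In this sketch an out-of-order request is rejected before any skill code runs, and the audit log records the reason, which is the kind of auditable execution control the abstract refers to.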

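2605.15203 2026-05-18 cs.IR cs.AI cs.MA Updated version

Agent4POI: Agentic Context-Conditioned Affordance Reasoning for Multimodal Point-of-Interest Recommendation

Jinze Wang, Yangchen Zeng, Tiehua Zhang, Lu Zhang, Yuze Liu, Yongchao Liu, Xingjun Ma, Zhu Sun

Affiliations * Tongji University; Swinburne University of Technology; Southeast University; Chengdu University of Information Technology; Fudan University; Singapore University of Technology and Design

AI summary This paper proposes Agent4POI, a new point-of-interest (POI) recommendation framework whose core idea is to generate context-conditioned multimodal representations dynamically at recommendation time rather than rely on precomputed static POI embeddings. A four-phase LLM agent generates dynamic, scenario-specific affordance queries from the situational context, reasons across modalities over images, reviews, and metadata, and produces a structured, uncertainty-aware affordance representation, improving recommendation accuracy and adaptability. Experiments show that Agent4POI outperforms existing methods across multiple benchmark datasets and evaluation settings, especially in cold-start and context-shift scenarios.

Details
English abstract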

We introduce Agent4POI, the first POI recommendation framework that generates context-conditioned multimodal representations at recommendation time, rather than relying on static POI embeddings pre-computed independently of context. Existing multimodal systems encode each POI once as a static embedding, a design that precludes reasoning about why the same cafe affords solo work on Monday but group celebration on Friday evening. We formally prove that no pre-computed encoder can satisfy context-sensitive ranking under standard bilinear scoring, motivating inference-time item-side representation. Agent4POI inverts this computation: given a situational context, a four-phase LLM agent generates dynamic, context-specific affordance queries (Phase 1) and executes a five-step cross-modal chain-of-thought over image, review, and metadata evidence (Phase 2). The resulting uncertainty-aware affordance representation is grounded in Gibsonian affordance theory. These cross-modal verdicts form a structured, uncertainty-adjusted affordance representation (Phase 3), which is aligned with user preferences via a semantic caching system for low-latency ranking (Phase 4). On three POI benchmarks and three evaluation configurations (standard, cold-start, context-shift), Agent4POI achieves a 23.2% relative gain over the strongest baseline and degrades by only 7.5% under context-shift versus 16-17% for the strongest baselines. In cold-start scenarios, Agent4POI outperforms the best content-based baseline by up to 2.4x, whereas ID-based methods fail to generalize.

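2605.15202 2026-05-18 cs.AI cs.CL cs.IR Updated version

DeepSlide: From Artifacts to Presentation Delivery

Ming Yang, Zhiwei Zhang, Jiahang Li, Haoseng Liu, Yuzheng Cai, Weiguo Zheng

Affiliations * School of Data Science, Fudan University

AI summary DeepSlide is a human-in-the-loop multi-agent system that supports end-to-end presentation preparation, aiming to optimize the whole process from content planning to delivery rather than merely generating visually plausible slides. It combines controllable logical-chain planning, content-tree retrieval, sequential rendering with style inheritance, and executable rehearsal support, improving narrative coherence, pacing precision, and slide-script synergy. The work also introduces a dual-scoreboard benchmark that separates static content quality from dynamic delivery performance; experiments show that DeepSlide outperforms existing methods across multiple domains and audience scenarios.

Comments 37 pages, 10 figures, 9 tables

Details
English abstract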

Presentations are a primary medium for scholarly communication, yet most AI slide generators optimize the artifact (a visually plausible deck) while under-optimizing the delivery process (pacing, narrative, and presentation preparation). We present DeepSlide, a human-in-the-loop multi-agent system that supports preparing the full presentation process, from requirement elicitation and time-budgeted narrative planning, to evidence-grounded slide-script generation, attention augmentation, and rehearsal support. DeepSlide integrates (i) a controllable logical-chain planner with per-node time budgets, (ii) a lightweight content-tree retriever for grounding, (iii) Markov-style sequential rendering with style inheritance, and (iv) sandboxed execution with minimal repair to ensure renderability. We further introduce a dual-scoreboard benchmark that cleanly separates static artifact quality from dynamic delivery excellence. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality while consistently achieving larger gains on delivery metrics, improving narrative flow, pacing precision, and slide-script synergy with clearer attention guidance.

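2605.15053 2026-05-18 cs.LG cs.AI Updated version

TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

Anurup Ganguli

Affiliations * Independent Researcher

AI summary This paper proposes TFGN, a new architecture that enables continual pre-training of large language models without catastrophic forgetting, requiring neither replay data nor task identifiers. The method overlays a parameter-efficient, input-conditioned update module on a Transformer model, achieving forward and backward transfer across heterogeneous text domains, with notable results on several large models and datasets. The work further introduces a closed-loop meta-controller and an operator-level plan vector, improving the model's autonomous learning ability and cross-domain adaptability, and offering a new architectural solution for continual learning in large language models.

Comments 65 pages, 10 figures, 40 tables

Details
English abstract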

Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source->target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner.

2605.14892 2026-05-18 cs.AI Updated version

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

Shihao Qi, Jie Ma, Rui Xing, Wei Guo, Xiao Huang, Zhitao Gao, Jianhao Deng, Jun Liu, Lingling Zhang, Bifan Wei, Boqian Yang, Pinghui Wang, Jianwen Sun, Jing Tao, Yaqiang Wu, Hui Liu, Yu Yao, Tongliang Liu

Affiliations * MOE KLINNS Lab; School of Computer Science and Technology; School of Cyber Science and Engineering; School of Software Engineering; School of Control Science and Engineering; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering; Laboratory for AI and New Forms of Education; Lenovo AI Technology Center, CTOO, Lenovo; Sydney AI Centre, The University of Sydney

AI summary This paper surveys research on collaboration, failure attribution, and self-evolution in LLM-based multi-agent systems, noting that existing work mostly addresses individual agent capabilities, collaboration mechanisms, or self-evolution separately while overlooking the causal relationships among them. It proposes a unified framework, the LIFE progression, covering four stages: building the capability foundation, integrating agents through collaboration, attributing failures, and autonomous evolution. It systematically analyzes the dependencies between stages and proposes cross-stage research directions, aiming to advance self-organizing multi-agent systems capable of continuous diagnosis, structural adjustment, and behavior refinement.

Details
English abstract

LLM-based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained coordination across roles, tools, and environments. Multi-agent systems address this through structured collaboration among specialized agents, but tighter coordination also amplifies a less explored risk: errors can propagate across agents and interaction rounds, producing failures that are difficult to diagnose and rarely translate into structural self-improvement. Existing surveys cover individual agent capabilities, multi-agent collaboration, or agent self-evolution separately, leaving the causal dependencies among them unexamined. This survey provides a unified review organized around four causally linked stages, which we term the LIFE progression: Lay the capability foundation, Integrate agents through collaboration, Find faults through attribution, and Evolve through autonomous self-improvement. For each stage, we provide systematic taxonomies and formally characterize the dependencies between adjacent stages, revealing how each stage both depends on and constrains the next. Beyond synthesizing existing work, we identify open challenges at stage boundaries and propose a cross-stage research agenda for closed-loop multi-agent systems capable of continuously diagnosing failures, reorganizing structures, and refining agent behaviors, extending current coordination frameworks toward more self-organizing forms of collective intelligence. By bridging these previously fragmented research threads, this survey aims to offer both a systematic reference and a conceptual roadmap toward autonomous, self-improving multi-agent intelligence.

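2605.14876 2026-05-18 cs.CV cs.AI Updated version

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du

Affiliations * University of Science and Technology of China

AI summary Although text-to-image models have advanced rapidly, most rely on a single-step generation paradigm that struggles with complex semantic content and gains little from parameter scaling. To address the hallucination, optimization instability, and inference latency of multi-step reasoning approaches, this paper proposes CLVR, a closed-loop visual reasoning framework that tightly couples vision-language logical planning with pixel-level diffusion generation and introduces proxy-prompt reinforcement learning and Δ-space weight merging. These improve generation quality and inference efficiency; experiments show that CLVR outperforms existing open-source models on multiple benchmarks and approaches the performance of commercial models.

Details
English abstract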

Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $Δ$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.

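2605.14859 2026-05-18 cs.CR cs.AI Updated version

Do Coding Agents Understand Least-Privilege Authorization?

Zheng Yan, Jingxiang Weng, Charles Chen, Dengyun Peng, Ethan Qin, Jiannan Guan, Jinhao Liu, Qiming Yu, Yixin Yuan, Fanqing Meng, Carl Che, Mengkang Hu

Affiliations * Evolvent AI Research Team

AI summary As coding agents increasingly access shells, code repositories, and user files, least-privilege authorization becomes necessary for safe deployment. This paper studies whether current models can infer permission boundaries on their own, introducing the permission-boundary inference task and building AuthBench, a benchmark of 120 realistic terminal tasks. It finds that existing models often omit required permissions or grant superfluous ones, and that more inference-time reasoning does not resolve the problem. The authors therefore propose Sufficiency-Tightness Decomposition, which generates a coverage-oriented policy by forward-simulating the task and then audits each granted permission, substantially raising success on sensitive tasks and lowering attack success.

Details
English abstract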

As coding agents gain access to shells, repositories, and user files, least-privilege authorization becomes a prerequisite for safe deployment: an agent should receive enough authority to complete the task, without unnecessary authority that exposes sensitive surfaces. To study whether current models can infer this boundary themselves, we first introduce permission-boundary inference, where a model maps a task instruction and terminal environment to a file-level read/write/execute policy, and AuthBench, a benchmark of 120 realistic terminal tasks with human-reviewed permission labels and executable validators for utility and attack outcomes. AuthBench shows that authorization is not a simple conservative-versus-permissive calibration problem: frontier models often omit permissions required by the execution chain while also granting unused or sensitive accesses. Increasing inference-time reasoning does not resolve this mismatch. Instead, each model moves toward a model-specific authorization attractor: more reasoning makes it more consistent in its own failure mode, whether broad-but-exposed or tight-but-brittle. This suggests that direct policy generation is the bottleneck, because a single generation must both discover all necessary accesses and reject all unnecessary ones. We therefore propose Sufficiency-Tightness Decomposition, which first generates a coverage-oriented policy by forward-simulating the task and then audits each granted entry for grounding and sensitivity. Across tested models, this decomposition improves sensitive-task success by up to 15.8% on tightness-biased models while reducing attack success across all evaluated models.
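
To make the two failure directions in this abstract concrete, here is a small Python sketch that scores a file-level read/write/execute policy against the accesses a task needs. It is an illustration only, not AuthBench's validator: the `audit_policy` function, the policy encoding, and the example files are assumptions introduced here.

```python
from typing import Dict, Set, Tuple

Policy = Dict[str, Set[str]]  # file path -> subset of {"r", "w", "x"}


def audit_policy(granted: Policy, required: Policy, sensitive: Set[str]) -> dict:
    """Compare a granted policy with the accesses a task actually needs."""
    # Required accesses that were not granted (the task would fail).
    missing: Set[Tuple[str, str]] = {
        (path, mode)
        for path, modes in required.items()
        for mode in modes - granted.get(path, set())
    }
    # Granted accesses the task never needs (unnecessary authority).
    unused: Set[Tuple[str, str]] = {
        (path, mode)
        for path, modes in granted.items()
        for mode in modes - required.get(path, set())
    }
    # Unused grants that touch files marked as sensitive.
    exposed = {(path, mode) for path, mode in unused if path in sensitive}
    return {
        "sufficient": not missing,
        "tight": not unused,
        "missing": missing,
        "unused": unused,
        "exposed": exposed,
    }


# Hypothetical task: run a build script that reads a config and writes a log.
required = {"build.sh": {"r", "x"}, "config.yaml": {"r"}, "build.log": {"w"}}
granted = {"build.sh": {"r", "x"}, "config.yaml": {"r", "w"}, ".ssh/id_rsa": {"r"}}
report = audit_policy(granted, required, sensitive={".ssh/id_rsa"})
```

The example policy fails in both directions at once: it omits the write access the task needs while granting an unused read on a sensitive file, which mirrors the abstract's point that authorization is not a single conservative-versus-permissive dial.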

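2605.14665 2026-05-18 cs.AI cs.CL cs.IR Updated version

Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Joy Bose

Affiliations * Independent Researcher

AI summary This paper proposes Falkor-IRAC, a graph-constrained generation framework intended to improve the accuracy and reliability of legal reasoning in Indian judicial AI systems. Built on an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph, it structures judgments of the Supreme Court and High Courts of India as graph nodes and integrates procedural state transitions, precedent relationships, and statutory references. At inference time, the system accepts only generated answers that can be verified against the graph structure, reducing incorrect citations and incomplete reasoning chains, and it actively detects conflicts between legal doctrines, offering a new approach to trustworthy reasoning in legal AI.

Comments 20 pages, 8 figures, 4 tables

Details
English abstract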

Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work. The companion InIRAC dataset, 500+ structured Indian court judgments with IRAC annotations, is released alongside this paper.
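
The acceptance rule in this abstract (an answer is kept only if a valid supporting path can be traced through the graph) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Verifier Agent itself: the in-memory dictionary stands in for the graph store, and the node names, relation labels, and function names are hypothetical placeholders rather than real cases or a real FalkorDB schema.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical graph encoding: node -> set of (relation, neighbor) edges.
Graph = Dict[str, Set[Tuple[str, str]]]


def path_is_supported(graph: Graph, start: str, path: List[Tuple[str, str]]) -> bool:
    """Return True only if every (relation, node) hop in `path` is an edge in `graph`."""
    current = start
    for relation, node in path:
        if (relation, node) not in graph.get(current, set()):
            return False
        current = node
    return bool(path)  # an empty path supports nothing


def accept_answer(graph: Graph, answer: str, start: str,
                  path: List[Tuple[str, str]]) -> Optional[str]:
    """Gate a generated answer on its claimed supporting path."""
    return answer if path_is_supported(graph, start, path) else None


# Placeholder IRAC-style graph: an issue, the rule governing it, and two cases.
graph: Graph = {
    "issue:A": {("governed_by", "rule:R1")},
    "rule:R1": {("applied_in", "case:C1")},
    "case:C1": {("follows", "case:C2")},
}

grounded = accept_answer(
    graph, "Answer citing case C1", "issue:A",
    [("governed_by", "rule:R1"), ("applied_in", "case:C1")],
)
fabricated = accept_answer(
    graph, "Answer citing case C9", "issue:A",
    [("governed_by", "rule:R1"), ("applied_in", "case:C9")],
)
```

Here `grounded` keeps its text because both hops exist in the graph, while `fabricated` is rejected because `case:C9` is not reachable from the rule, which is the hallucinated-precedent case the abstract describes.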

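2605.14401 2026-05-18 cs.CL cs.AI Updated version

Agentic Recommender System with Hierarchical Belief-State Memory

Xiang Shen, Yuhang Zhou, Yifan Wu, Zhuokai Zhao, Siyu Lin, Lei Huang, Qianqian Zhong, Lizhu Zhang, Benyu Zhang, Xiangjun Fan, Hong Yan

Affiliations * Meta Recommendation Systems (MRS)

AI summary This paper proposes MARS, a memory-augmented agentic recommender system that models recommendation as a partially observable problem through a hierarchical belief-state memory, capturing users' evolving preferences more accurately. MARS divides memory into three tiers (event memory, preference memory, and profile memory) and introduces a complete lifecycle of six operations (extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis), managed dynamically by an LLM-based scheduler. Experiments show that MARS achieves significant gains on multiple recommendation benchmarks, outperforming the best existing methods.

Comments 4 figures, 8 tables

Details
English abstract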

Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine-grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations -- extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis -- is adaptively scheduled by an LLM-based planner rather than fixed-interval heuristics. Experiments on four InstructRec benchmark domains show that MARS achieves state-of-the-art performance, with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines and further gains from agentic scheduling in evolving settings.

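2605.14311 2026-05-18 cs.LG cs.AI cs.HC Updated version

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

Yuchen Sun, Pei Fu, Shaojie Zhang, Anan Du, Xiuwen Xi, Ruoceng Zhang, Zhenbo Luo, Jian Luan, Chongyang Zhang

Affiliations * Xiaomi Inc.; Shanghai Jiao Tong University

AI summary This paper examines a key problem in test-time scaling (TTS) for generalist graphical user interface (GUI) agents: existing critic models rely on binary classification and therefore cannot distinguish valid actions from plausible-but-invalid ones. The authors propose BBCritic, a continuous semantic alignment approach that uses two-stage contrastive learning to recover the hierarchical structure suppressed by binary classification, and introduce BBBench, the first fine-grained evaluation benchmark. Experiments show that the method surpasses existing larger models without additional annotation and exhibits strong zero-shot transfer on cross-platform tasks.

Comments 28 pages including appendix. Code and BBBench benchmark to be released

Details
English abstract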

Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse--the hierarchical affordance space is compressed into 0/1 labels; and Noise Sensitivity--binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm shift grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.

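2605.14309 2026-05-18 cs.CV cs.AI cs.LG Updated version

ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

Shen Lin, Jing Lin, Junhao Dong, Piotr Koniusz, Li Xu

Affiliations * Fujian Normal University; Nanyang Technological University; University of New South Wales; Data61 CSIRO

AI summary This paper proposes ICED, a concept-level machine unlearning method for vision-language models (VLMs) based on interpretable concept decomposition, addressing the difficulty that conventional image- or instance-level unlearning has in precisely removing target knowledge without affecting unrelated semantics. The method uses a multimodal large language model to build a task-relevant concept vocabulary and decomposes visual representations into sparse, nonnegative combinations of semantic concepts, enabling precise suppression of target concepts in an image while preserving non-target semantics and cross-modal knowledge. Experiments show that the method forgets target knowledge more comprehensively and better preserves non-target information within the image while maintaining model performance.

Details
English abstract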

Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.

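2605.14205 2026-05-18 cs.AI Updated version

SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ted Chaiwachirasak, Han Li, Lingyun Wang

Affiliations * Shopify

AI summary This paper proposes the SimPersona framework to address the failure of LLM-based e-commerce agents to capture the heterogeneity and distributional nature of real buyer populations. The method learns discrete buyer types from historical clickstreams and converts them into compact persona tokens that guide the agent's behavior. Experiments show that SimPersona effectively simulates real buyer behavior, achieves high conversion-rate alignment, and performs strongly across multiple e-commerce scenarios.

Details
English abstract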

LLM-based web agents can navigate live storefronts, yet they often collapse to a single "average buyer" policy, failing to capture the heterogeneous and distributional nature of real buyer populations. Existing personalization methods rely on hand-crafted prompt-based personas that are brittle, difficult to scale, context-inefficient, and unable to faithfully represent population-level behavior. We introduce SimPersona, a novel framework that learns discrete buyer types from historical traffic and exposes them to LLM-based web agents as compact persona tokens. Given raw clickstreams, a behavior-aware VQ-VAE induces a discrete buyer-type space that captures the statistical structure of real buyer behavior and merchant-specific buyer population distributions. To provide behavior-specific guidance to LLM-based web agents, SimPersona maps each learned buyer type to a dedicated persona token in the LLM agent vocabulary and fine-tunes the agent with these tokens on real browsing traces. At inference, each synthetic buyer is assigned to a learned buyer type with a single encoder forward pass, requiring no retraining or store-specific prompt engineering. For population-level simulation, SimPersona samples buyer types from each merchant's empirical distribution over the learned VQ-VAE codebook and instantiates agents with the corresponding persona tokens, preserving merchant-specific buyer population distributions. Evaluated on 8.37M buyers across 42 held-out live storefronts, SimPersona achieves 78% conversion-rate alignment with real buyers, exhibits interpretable behavioral variation across buyer types, and outperforms a baseline with 8x more parameters on goal-oriented shopping tasks. We further release an open-source data pipeline that converts raw e-commerce event logs into buyer representations and agent-training traces.

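2605.13142 2026-05-18 cs.AI math.OC Updated version

A Constraint Programming Approach for n-Day Lookahead Playoff Clinching in the NHL

Gili Rosenberg, Kyle E. C. Booth, J. Kyle Brubaker, Ruben S. Andrist

Affiliations * Amazon Advanced Solutions Lab

AI summary This paper studies how to determine whether a National Hockey League (NHL) team can clinch a playoff berth within the next $n$ days. To handle the complex qualification rules and tie-breaking procedures, the authors propose a tree search algorithm based on constraint programming that efficiently analyzes the possible combinations of game outcomes over the next $n$ days and determines whether the team secures a playoff spot. The method combines preprocessing, pruning strategies, and node-ordering heuristics to speed up the search, is validated on a large amount of real season data, and extends readily to other sports metrics.

Comments 18 pages, 5 figures, 4 tables. Accepted to CP 2026

Details
English abstract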

In professional sports, a team has clinched the playoffs if they are guaranteed a postseason spot, regardless of the outcomes of any remaining games. As the season progresses, sports fans and other stakeholders are interested in precisely when, and under what conditions, their team will clinch the playoffs. In this paper, we investigate playoff clinching in the context of the National Hockey League (NHL), where it is computationally challenging to produce clinching scenarios due, in part, to complex tie-breakers. We present an algorithm that determines under which combinations of game outcomes in the next $n$ days a team will clinch the playoffs (i.e., "$n$-day lookahead clinching"). Our approach is a custom tree search which employs various preprocessing techniques, pruning strategies, and node ordering heuristics to efficiently explore the space of possible outcomes. The tree search leverages a constraint programming (CP)-based subroutine for inference that determines if a team has clinched the playoffs for some snapshot in time of the regular season (i.e., "0-day lookahead clinching"). This CP subroutine aims to find a counter-example in which the team being evaluated is eliminated, taking into account qualification rules and the NHL's extensive list of tie-breakers. We validate the efficacy of our algorithm using hundreds of scenarios based on public NHL data for the seasons 2021-22 through 2024-25. The methods introduced can be readily extended to other metrics of interest, including mathematical proof of playoff elimination, clinching the President's Trophy, as well as clinching (or being eliminated from clinching) any other seed in the standings.
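
The structure of lookahead clinching (enumerate outcomes of the upcoming games, then run a 0-day check on each resulting snapshot) can be shown with a toy Python sketch. It deliberately swaps the paper's techniques for much simpler ones: plain brute-force enumeration replaces the pruned tree search, and a conservative points bound replaces the CP subroutine. The scoring model is also a simplifying assumption made here (2 points for a win, no overtime point, all ties counted against the team, no tie-breakers), and the function names and standings are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def clinched(team: str, points: Dict[str, int], remaining: Dict[str, int],
             spots: int) -> bool:
    """Conservative 0-day check: fewer than `spots` rivals can still reach
    or pass the team's current points (ties are counted against the team)."""
    rivals = sum(
        1 for t in points
        if t != team and points[t] + 2 * remaining[t] >= points[team]
    )
    return rivals < spots


def clinching_scenarios(team: str, points: Dict[str, int],
                        remaining: Dict[str, int], spots: int,
                        games: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Enumerate winner combinations for `games` and keep those after which
    `team` passes the conservative clinch check."""
    scenarios = []
    for winners in product(*games):  # one winner chosen per game
        pts, rem = dict(points), dict(remaining)
        for (a, b), winner in zip(games, winners):
            pts[winner] += 2
            rem[a] -= 1
            rem[b] -= 1
        if clinched(team, pts, rem, spots):
            scenarios.append(winners)
    return scenarios


# Hypothetical four-team standings with one playoff spot and one game left each.
points = {"A": 10, "B": 8, "C": 7, "D": 2}
remaining = {"A": 1, "B": 1, "C": 1, "D": 1}
scenarios = clinching_scenarios("A", points, remaining, spots=1,
                                games=[("A", "B"), ("C", "D")])
```

In this toy setting team A clinches only in the scenarios where it wins its own game. Because the bound ignores which teams play each other and how ties are broken, it can miss real clinches, which is one reason the paper resorts to a constraint programming search for a counter-example instead.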

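2605.12667 2026-05-18 cs.LG cs.AI Updated version

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

Nirmal Patel, Fei Wang, Inderjit S. Dhillon

Affiliations * University of Texas at Austin; Google

AI summary This work addresses the noise in discrete rewards faced by reinforcement learning from AI feedback (RLAIF) in large language model alignment, proposing ODRPO, a robust policy optimization framework. Its core idea is to decompose multi-tier discrete rewards into a sequence of binary ordinal indicators, structurally isolating evaluation noise, and to compute advantages independently at progressively higher success thresholds, improving learning stability and robustness. Experiments show that ODRPO significantly outperforms existing methods on multiple benchmarks with almost no added training-time overhead.

Details
English abstract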

The alignment of Large Language Models (LLMs) utilizes Reinforcement Learning from AI Feedback (RLAIF) for non-verifiable domains such as long-form question answering and open-ended instruction following. These domains often rely on LLM-based auto-raters to provide granular, multi-tier discrete rewards (e.g., 1-10 rubrics) that are inherently stochastic due to prompt sensitivity and sampling randomness. We empirically verify the stochasticity of auto-raters, which can propagate and corrupt standard advantage estimators like GRPO and MaxRL, as noisy reward samples can skew normalization statistics and degrade the global learning signal. Empirically, sampling more rewards and taking a majority vote may reduce the noise and improve performance, but this approach is computationally expensive. To address this bottleneck, we introduce Ordinal Decomposition for Robust Policy Optimization (ODRPO), a framework that structurally isolates evaluation noise by decomposing discrete rewards into a sequence of ordinal binary indicators. By independently computing and accumulating advantages across these progressively challenging success thresholds, ODRPO prevents outlier evaluations from corrupting the global update while establishing an implicit, variance-aware learning curriculum. Empirically, ODRPO achieves robust performance on Qwen2.5-7B and Qwen3-4B models, outperforming baselines with relative improvements of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals. Critically, these gains are achieved with negligible training-time overhead, as ODRPO requires no additional compute per step compared to standard estimators. Supported by theoretical analysis confirming its optimization stability, ODRPO provides a scalable and robust framework for aligning models within the noisy, discrete evaluation landscape of modern RLAIF.
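
A rough Python sketch of the ordinal decomposition idea follows. The abstract does not give the exact estimator, so several details are assumptions made for illustration: the indicator is taken to be "reward at or above threshold k", each threshold uses a GRPO-style group normalization (subtract the group mean, divide by the group standard deviation), thresholds with no variation are skipped, and per-threshold advantages are summed with equal weight. The function name `ordinal_advantages` is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def ordinal_advantages(rewards: Sequence[int], levels: Sequence[int],
                       eps: float = 1e-6) -> List[float]:
    """Sum group-normalized advantages over binary 'reward >= k' indicators.

    rewards: discrete rewards for one group of sampled responses.
    levels:  the ordered reward scale, e.g. range(1, 11) for a 1-10 rubric.
    """
    totals = [0.0] * len(rewards)
    for k in list(levels)[1:]:  # the lowest level is met by every sample
        indicators = [1.0 if r >= k else 0.0 for r in rewards]
        mu = mean(indicators)
        sigma = pstdev(indicators)
        if sigma == 0.0:
            continue  # all samples agree at this threshold: no learning signal
        for i, b in enumerate(indicators):
            totals[i] += (b - mu) / (sigma + eps)
    return totals


# Hypothetical group of four responses scored on a 1-10 rubric.
advantages = ordinal_advantages([1, 3, 3, 10], levels=range(1, 11))
```

Each threshold is normalized on its own, so every binary sub-problem contributes a zero-mean signal and thresholds that all samples pass or fail drop out entirely. Whether this matches ODRPO's accumulation rule in detail cannot be confirmed from the abstract alone.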

2605.11885 2026-05-18 cs.AI q-bio.NC Updated version

From Clever Hans to Scientific Discovery: Interpreting EEG Foundational Transformers with LRP

Justus Meyer zu Bexten, Nico Scherf, Bogdan Franczyk, Simon M. Hofmann

Affiliations * Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI); Leipzig University; Neural Data Science and Statistical Computing, Max Planck Institute for Human Cognitive and Brain Sciences; Faculty of Economics, Leipzig University; Department of Neurology, Max Planck Institute for Human Cognitive and Brain Sciences

AI summary This paper studies how attention-aware layer-wise relevance propagation (LRP) can be used to interpret electroencephalography foundation models (EEG-FMs), addressing their poor interpretability. It extends LRP from conventional convolutional neural networks to Transformer-based EEG-FMs and finds that the method can both verify model decisions and surface new, biologically meaningful hypotheses. The study demonstrates LRP on motor imagery and affect prediction tasks, revealing the models' reliance on signals from specific brain regions and offering a new perspective on the behavior of EEG-FMs.

Comments 18 pages, 6 figures

Details
English abstract

Emerging foundation models (FMs) in electroencephalography (EEG) promise a path to scale deep learning in diagnostics and brain-computer interfaces despite data scarcity, yet their opaque nature remains a barrier to wider adoption. We investigate attention-aware Layer-wise relevance propagation (LRP) as a post-hoc attribution method for EEG-FMs, extending LRP's use on convolutional neural network (CNN)-based EEG models to the Transformer architectures that current FMs are based on. We find that LRP can both verify EEG-FM decisions and surface novel, biologically plausible hypotheses from them. In motor imagery, it unmasks 'Clever Hans' behavior where models prioritize task-correlated ocular signals over the intended motor correlates. In a naturalistic paradigm for affect prediction, it reveals a recurring reliance on a central electrode cluster, suggesting a candidate sensorimotor signature of arousal. Though heatmap interpretation remains ambiguous in this complex domain, the results position LRP as a tool for both verification and exploration of EEG-FMs, a role that will grow in both importance and discovery potential as the underlying models mature.

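2605.11118 2026-05-18 cs.AI cs.IR Updated version

A Cascaded Generative Approach for e-Commerce Recommendations

Moein Hasani, Hamidreza Shahidi, Trace Levinson, Yuan Zhong, Guanghua Shu, Vinesh Gudla, Tejaswi Tenneti

Affiliations * The Pennsylvania State University

AI summary: This paper proposes a cascaded generative framework for building personalized storefronts in e-commerce recommendation. The method decomposes storefront generation into two generative tasks: theme generation for page sections and constrained keyword generation for each section to support product retrieval. A teacher-student fine-tuning strategy improves the model's efficiency in production, and a hybrid architecture combines the generative output with traditional ranking models. Online experiments show a lift of about 2.7% in cart adds per page view over the baseline.

Details
Abstract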

Personalized storefronts in large e-commerce marketplaces are often assembled from many independent components: static themes per page section ("placement"), retrieval systems to fetch eligible products per placement, and pointwise rankers to order content. While effective in optimizing for aggregate preferences, this paradigm is rigid and can limit personalization and semantic cohesion across the page. This makes it poorly suited to support dynamic objectives and merchandising requirements over time. To address this, we introduce a cascaded merchandising framework that decomposes storefront construction into two generative tasks: (i) placement-level theme generation and (ii) constrained keyword generation per placement to power product retrieval. Teacher-student fine-tuning is leveraged to improve scalability of this framework under production latency and cost constraints. Fine-tuned model ablations are shown to approach closed-weight LLM performance. We further contribute frameworks for AI-driven content evaluation and quality filtering, enabling safe and automated deployment of dynamic content at scale. Generative output is fused with traditional ranking models to preserve hybrid infrastructure. In online experiments, this framework yields an estimated +2.7% lift in cart adds per page view over a strong baseline.

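2605.10799 2026-05-18 cs.LG cs.AI cs.CL Updated version

The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies

Gabriel Garcia

Affiliations * Independent Researcher

AI summary: This paper identifies a format-induced confound in the standard method for evaluating chain-of-thought (CoT) faithfulness: when a benchmark's reasoning chains end with an explicit final answer, existing corruption experiments mainly measure the effect of answer placement rather than the importance of intermediate computation steps. Experiments show that removing the final answer or supplying a wrong one significantly affects model behavior, and that this effect varies with model scale. The paper further proposes a three-prerequisite protocol to improve future corruption-based faithfulness studies.

Comments 34 pages, 6 figures, 13 tables. Submitted to NeurIPS 2026. Code and data: https://github.com/Gpgabriel25/LastWordWinsCoT

Details
Abstract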

Corruption studies, the standard tool for evaluating chain-of-thought (CoT) faithfulness, infer which steps are ``computationally important'' from accuracy loss when steps are corrupted. We show that when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure \emph{answer placement} rather than where intermediate computation is carried out. Using matched GSM8K examples, removing only the final answer statement while preserving all reasoning collapses suffix sensitivity by about $19\times$ for Qwen~2.5-3B ($N{=}300$, $p{=}0.022$). Conflicting-answer prompts, which contain correct reasoning but a wrong explicit final answer, drive accuracy to zero or near-zero at 7B across five open-weight model families; wrong-answer following is strong at 3B--7B and attenuates sharply at larger scales. Replications on MATH, within-stable comparisons at 7B, and suffix-free chains show the same pattern in different guises: corruption sensitivity tracks the location of explicit answer text, not a fixed computational depth in the reasoning. Generation-time probes indicate that final answers are rarely early-determined during generation (${<}5\%$ early commitment), yet consumption-time behavior systematically follows explicit answer text. The confound is therefore largely a readout effect when the chain is consumed. We propose a three-prerequisite protocol (question-only control, format characterization, and an all-position sweep) as a practical minimum for future corruption-based faithfulness studies.

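2605.10057 2026-05-18 cs.AI cs.MA Updated version

STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning

Ruiyi Yang, Lihuan Li, Hao Xue, Flora D. Salim

Affiliations * University of New South Wales; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper proposes STAR, a failure-aware routing framework for task allocation in multi-agent spatiotemporal reasoning. By explicitly modeling inter-agent control decisions as a state-conditioned transition policy, the method dynamically selects a suitable specialist agent based on task type and execution status, and so handles different kinds of execution failure. STAR combines expert-specified nominal routes with recovery transitions learned from execution traces, which improves the system's robustness and interpretability under abnormal conditions. Experiments show that STAR outperforms existing methods on several spatiotemporal reasoning benchmarks, especially when the execution path deviates from the expected one.

Comments 30 pages, 13 figures

Details
Abstract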

Compositional spatiotemporal reasoning often requires a system to invoke multiple heterogeneous specialists, such as geometric, temporal, topological, and trajectory agents. A central question is how such a system should route among specialists when execution does not simply succeed or fail, but fails in qualitatively different ways. Existing tool-augmented and multi-agent LLM systems typically leave this routing decision implicit in language generation, making recovery ad hoc, difficult to interpret, and hard to optimize. This paper presents STAR (Spatio-Temporal Agent Router), a failure-aware routing framework that externalizes inter-agent control as a state-conditioned transition policy over the current agent, task type, and typed execution status. At the center of STAR is an agent routing matrix that combines expert-specified nominal routes with recovery transitions learned from execution traces. Because the matrix conditions on distinct failure states, the router can respond differently to malformed outputs, missing dependencies, and tool--query mismatches, rather than collapsing them into a generic retry signal. Specialists execute through a tool-grounded extract--compute--deposit protocol and write intermediate results to a shared blackboard for downstream fusion. Results prove that retaining unsuccessful traces during training enlarges the support of the routing policy on error states, enabling recovery transitions that success-only training cannot represent. Across three spatiotemporal benchmarks and eight backbone LLMs, STAR improves over multiple baselines with the clearest gains on queries whose execution deviates from the nominal routing path. Router-specific ablations and recovery analyses further show that typed failure-aware routing, rather than specialist composition alone, is a key factor for these improvements.
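
A state-conditioned routing table of the kind this abstract describes might look like the sketch below. The class name, agent names, status labels, and the count-based way of learning recovery transitions are all illustrative assumptions; the abstract does not say how STAR learns or stores its routing matrix.

```python
from collections import Counter, defaultdict

class FailureAwareRouter:
    """Toy router keyed on (current agent, task type, typed execution status)."""

    def __init__(self, nominal_routes, fallback="fusion"):
        # Expert-specified nominal routes: state -> next agent.
        self.nominal = dict(nominal_routes)
        # Recovery transitions counted from execution traces (assumed scheme).
        self.recovery = defaultdict(Counter)
        self.fallback = fallback

    def observe(self, trace):
        """trace: iterable of (agent, task_type, status, next_agent) steps.

        Failed steps are kept so that error states gain routing support.
        """
        for agent, task_type, status, next_agent in trace:
            if status != "ok":
                self.recovery[(agent, task_type, status)][next_agent] += 1

    def route(self, agent, task_type, status):
        state = (agent, task_type, status)
        if state in self.nominal:
            return self.nominal[state]
        counts = self.recovery.get(state)
        if counts:
            # Pick the recovery hand-off seen most often for this failure type.
            return counts.most_common(1)[0][0]
        return self.fallback

if __name__ == "__main__":
    router = FailureAwareRouter({("geometric", "route_query", "ok"): "temporal"})
    router.observe([("geometric", "route_query", "missing_dependency", "trajectory")])
    print(router.route("geometric", "route_query", "missing_dependency"))
```

Because the key includes the typed status, a malformed output and a missing dependency can lead to different next agents instead of a single generic retry.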

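2605.10052 2026-05-18 cs.CL cs.AI Updated version

Swarm Skills: A Portable, Self-Evolving Multi-Agent System Specification for Coordination Engineering

Xinyu Zhang, Zhicheng Dou, Deyang Li, Jianjun Tao, Shuo Cheng, Ruifeng Shi, Fangchao Liu, Enrui Hu, Yangkai Ding, Hongbo Wang, Qi Ye, Xuefeng Jin, Zhangchun Zhao

Affiliations * openJiuwen Team; Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: As AI engineering paradigms shift from single-agent prompt and context engineering toward multi-agent coordination engineering, systematically codifying and improving multi-agent collaboration has become a key bottleneck. This paper proposes *Swarm Skills*, a portable, self-evolving multi-agent system specification that turns multi-agent workflows into distributable assets by introducing roles, workflows, execution bounds, and a semantic structure for self-evolution. It also presents a self-evolution algorithm that automatically distills successful execution trajectories and continuously refines existing skills, so that multi-agent coordination strategies evolve without human intervention.

Details
Abstract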

As artificial intelligence engineering paradigms shift from single-agent Prompt and Context Engineering toward multi-agent \textbf{Coordination Engineering}, the ability to codify and systematically improve how multiple agents collaborate has emerged as a critical bottleneck. While single-agent skills can now be distributed as portable assets, multi-agent coordination protocols remain locked within framework-internal code or static configurations, preventing them from being shared across systems or autonomously improved over time. We propose \textbf{Swarm Skills}, a portable specification that extends the Anthropic Skills standard with multi-agent semantics. Swarm Skills turns multi-agent workflows into first-class, distributable assets that consist of roles, workflows, execution bounds, and a built-in semantic structure for self-evolution. To operationalize the specification's evolving nature, we present a companion self-evolution algorithm that automatically distills successful execution trajectories into new Swarm Skills and continuously patches existing ones based on multi-dimensional scoring (Effectiveness, Utilization, and Freshness), eliminating the need for human-in-the-loop oversight during the refinement process. Through an architectural compatibility analysis and a comprehensive qualitative case study using the open-source JiuwenSwarm reference implementation, we demonstrate how Swarm Skills achieves zero-adapter cross-agent portability via progressive disclosure, enabling agent teams to self-evolve their coordination strategies without framework lock-in.

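2605.09877 2026-05-18 cs.LG cs.AI cs.CL Updated version

Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory

Daniel Goldstein, Eugene Cheah

Affiliations * Featherless AI; Eleuther AI

AI summary: This paper proposes Key-Value Means (KVM), a new block-recurrent attention mechanism that supports either fixed-size or expandable state. While adding very few parameters, it turns a strong Transformer into a linear-time chunked RNN, and with a growable cache it performs competitively on long-context tasks with subquadratic prefill time and sublinear state growth. KVM combines advantages of traditional Transformers and linear RNNs, supports chunk-wise parallel training and prefill, can be used on every layer to save KV-cache memory, and can be combined with LRNN layers in a hybrid, in place of traditional attention, to improve long-context performance.

Details
Abstract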

We present Key-Value Means ("KVM"), a novel block-recurrence for attention that can accommodate either fixed-size or growing state. Equipping a strong transformer baseline with fixed-size KVM attention layers yields a strong $O(N)$ chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long-context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk-wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk-wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV-cache memory, and allowing a continuous range of choices of prefill time complexity between $O(N)$ and $O(N^2)$. It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, to supplement the LRNN with improved sublinear memory growth context length usage and long context decoding. We release our code at https://github.com/featherless-ai/KVM-paper and trained models at https://huggingface.co/collections/featherless-ai/kvm-paper under the Apache 2.0 license.

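2605.09033 2026-05-18 cs.CR cs.AI Updated version

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

Yang Luo, Zifeng Kang, Tiantian Ji, Xinran Liu, Yong Liu, Shuyu Li, Lingyun Peng

Affiliations * Key Laboratory of Trustworthy Distributed Computing and Service (MoE); Beijing University of Posts and Telecommunications; Zhongguancun Laboratory

AI summary: This paper proposes ShadowMerge, a new poisoning attack on graph-based agent memory that exploits relation-channel conflicts to influence agent behavior. The attack crafts a malicious relation that shares the same query-activated anchor and relation channel as a legitimate relation but carries a conflicting value, so that harmful information is injected without affecting normal tasks. Experiments report an average attack success rate of 93.8% across several real-world datasets, well above existing methods, and reveal the shortcomings of current defenses against this kind of attack.

Comments Preprint. Corresponding authors: Zifeng Kang and Tiantian Ji. Code is available at https://anonymous.4open.science/status/ShadowMerge-033C

Details
Abstract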

Graph-based agent memory is increasingly used in LLM agents to support structured long-term recall and multi-hop reasoning, but it also creates a new poisoning surface: an attacker can inject a crafted relation into graph memory so that it is later retrieved and influences agent behavior. Existing agent-memory poisoning attacks mainly target flat textual records and are ineffective in graph-based memory because malicious relations often fail to be extracted, merged into the target anchor neighborhood, or retrieved for the victim query. We present SHADOWMERGE, a poisoning attack against graph-based agent memory that exploits relation-channel conflicts. Its key insight is that a poisoned relation can share the same query-activated anchor and canonicalized relation channel as benign evidence while carrying a conflicting value. To realize this, we design AIR, a pipeline that converts the conflict into an ordinary interaction that can be extracted, merged, and retrieved by the graph-memory system. We evaluate SHADOWMERGE on Mem0 and three public real-world datasets: PubMedQA, WebShop, and ToolEmu. SHADOWMERGE achieves 93.8% average attack success rate, improving the best baseline by 50.3 absolute points, while having negligible impact on unrelated benign tasks. Mechanism studies show that SHADOWMERGE overcomes the three key limitations of existing agent-memory poisoning attacks, and defense analysis shows that representative input-side defenses are insufficient to mitigate it. We have responsibly disclosed our findings to affected graph-memory vendors and open sourced SHADOWMERGE.

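2605.08894 2026-05-18 cs.CL cs.AI Updated version

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

Yuzhuang Xu, Xu Han, Yuxuan Li, Pengzhan Li, Wanxiang Che

Affiliations * Harbin Institute of Technology; Tsinghua University

AI summary: Although existing extremely low-bit quantization methods focus mainly on preserving numerical accuracy, this paper shows that extremely quantized large language models also suffer from systematic smoothness degradation. Using a smoothness proxy and sequence neighborhood modeling, the study finds that the lower the quantization bit-width, the more severe the degradation, which lowers generation quality. The authors therefore introduce a smoothness-preserving principle into both post-training quantization and quantization-aware training, which improves model performance and underlines the importance of smoothness in extreme quantization.

Comments 19 pages, 4 tables, 14 figures

Details
Abstract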

Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low-bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit-width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate it, we introduce a simple smoothness-preserving principle in both post-training quantization and quantization-aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at https://github.com/xuyuzhuang11/FINE.

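2605.08245 2026-05-18 cs.CV cs.AI Updated version

When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

Harshvardhan Saini, Samyak Jha, Yiming Tang, Dianbo Liu

Affiliations * Indian Institute of Technology Dhanbad; National University of Singapore

AI summary: This paper studies hallucination in vision-language models (VLMs) caused by over-alignment between the language and vision modalities. It traces the root cause to the decoder-based architecture, which over-aligns visual embeddings with the text manifold and thereby injects a statistical linguistic bias that overshadows fine-grained visual information. The authors give the first quantitative analysis of this phenomenon and propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning method, both of which remove linguistic bias from visual representations. Experiments show that these methods significantly reduce hallucinations on multiple benchmarks and improve long-form captioning quality.

Details
Abstract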

Vision-Language Models (VLMs) increasingly power high-stakes applications, from medical imaging to autonomous systems, yet they routinely hallucinate, confidently describing content not present in the input. We investigate the root causes of these failure modes with a mechanistic analysis focusing on the decoder-based VLMs. We trace these failure modes to a geometric over-alignment: to bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. While prior work either aggressively closes this gap or suppresses hallucinations through expensive black-box decoding strategies, none addresses the underlying geometric cause. We provide the first quantitative characterization of this over-alignment, demonstrating that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Building on this insight, we propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which explicitly project out this subspace from visual representations. Our methods significantly reduce hallucinations across POPE, CHAIR, and AMBER benchmarks, and improve CLAIR scores on long-form captioning tasks, with the training-free variant adding no computational overhead over the base model.
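
The step of projecting a text subspace out of visual representations can be sketched as follows. This is only an illustration under stated assumptions: the subspace is estimated here by PCA (via SVD) over a matrix of text embeddings, and the function names and the choice of `k` are hypothetical; the abstract does not specify how the authors estimate their universal text subspace.

```python
import numpy as np

def text_subspace(text_emb, k):
    """Return the top-k principal directions of text embeddings as rows (k, d)."""
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def project_out(visual_emb, basis):
    """Remove the component of each visual embedding lying in span(basis)."""
    return visual_emb - (visual_emb @ basis.T) @ basis

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(50, 8))    # stand-in text embeddings
    visual = rng.normal(size=(5, 8))   # stand-in visual embeddings
    basis = text_subspace(text, k=2)
    debiased = project_out(visual, basis)
    print(np.abs(debiased @ basis.T).max())  # residual along the removed directions
```

Since the basis rows are orthonormal, the projection is idempotent and needs only two small matrix products per embedding.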

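2605.06390 2026-05-18 cs.AI Updated version

Automated alignment is harder than you think

Aleksandr Bowkis, Marie Davidsen Buhl, Jacob Pfau, Geoffrey Irving

Affiliations * AI Security Institute

AI summary: This paper examines the potential risks of automated alignment in the development of artificial superintelligence (ASI). It argues that even when research agents are not deliberately sabotaging alignment work, automated alignment may still produce misleading safety assessments that lead to the unintentional deployment of misaligned AI. The reason is that alignment research involves many hard-to-supervise fuzzy tasks on which human judgement is systematically flawed, and automated systems under optimisation pressure may produce errors that humans are unlikely to catch, undermining the reliability of alignment results. Training agents to perform these tasks reliably is therefore a key challenge for automated alignment research.

Comments 15 pages, 4 figures

Details
Abstract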

A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments resulting in the unintentional deployment of misaligned AI. This could happen because alignment research involves many hard-to-supervise fuzzy tasks (tasks without clear evaluation criteria, for which human judgement is systematically flawed). Consequently, research outputs will contain systematic, undetected errors, and even correct outputs could be incorrectly aggregated into overconfident safety assessments. This problem is likely to be worse for automated alignment research than for human-generated alignment research for several reasons: 1) optimisation pressure means agent-generated mistakes are concentrated among those that human reviewers are least likely to catch; 2) agents are likely to produce errors that do not resemble human mistakes; 3) AI-generated alignment solutions may involve arguments humans cannot evaluate; and 4) shared weights, data and training processes may make AI outputs more correlated than human equivalents. Therefore, agents must be trained to reliably perform hard-to-supervise fuzzy tasks. Generalisation and scalable oversight are the leading candidates for achieving this but both face novel challenges in the context of automated alignment.

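2605.03548 2026-05-18 cs.LG cs.AI Updated version

PerFlow: Physics-Embedded Rectified Flow for Efficient Reconstruction and Uncertainty Quantification of Spatiotemporal Dynamics

Hao Zhou, Rui Zhang, Han Wan, Hao Sun

Affiliations * Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: This study proposes PerFlow, a physics-embedded rectified flow model for efficiently reconstructing spatiotemporal fields governed by partial differential equations (PDEs) and quantifying their uncertainty. PerFlow decouples observation conditioning from physics constraints, which enables efficient conditional sampling without gradient guidance, and it enforces physical consistency through a constraint-preserving projection. Experiments show competitive reconstruction accuracy with sound physics consistency and much faster inference.

Comments 17 pages, 8 figures. Accepted to IJCAI-ECAI 2026

Details
Abstract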

Reconstructing PDE-governed fields from sparse and irregular measurements is challenging due to their ill-posed nature. Deterministic surrogates trained on dense fields struggle with limited measurements and uncertainty quantification. Generative models, by learning distributions over spatiotemporal fields, can better handle sparsity and uncertainty. However, existing generative approaches enforce data consistency and PDE constraints simultaneously via sampling-time gradient guidance, resulting in slow and unstable inference. To this end, we propose PerFlow, a Physics-embedded rectified Flow for efficient sparse reconstruction and uncertainty quantification of spatiotemporal dynamics. PerFlow decouples observation conditioning from physics enforcement, performing guidance-free conditioning by feeding observations into rectified-flow dynamics while embedding hard physics via a constraint-preserving projection (e.g., incompressibility or conservation). Theoretically, we establish invariance guarantees to ensure that trajectories remain on the physics-consistent manifold throughout sampling. Experiments on various PDE systems demonstrate competitive reconstruction accuracy with sound physics consistency, while enabling efficient conditional sampling (e.g., 50 steps) and up to 320x faster inference than 2000-step guided diffusion baselines.
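
The invariance idea in this abstract, that trajectories stay on the constraint manifold when every update is projected, can be illustrated with a toy sketch. It assumes a linear conservation constraint `A x = b`, plain Euler integration, and a hand-written stand-in velocity field; PerFlow's learned velocity and its actual projection operator are not described in the abstract, and all names here are hypothetical.

```python
import numpy as np

def null_space_projector(A):
    """P = I - A^+ A maps any vector onto {v : A v = 0}."""
    return np.eye(A.shape[1]) - np.linalg.pinv(A) @ A

def sample_with_hard_constraint(x0, velocity, A, steps=50):
    """Euler-integrate a flow from t=0 to t=1 using projected velocities.

    If x0 satisfies A x0 = b, every iterate satisfies A x = b as well,
    because each update direction lies in the null space of A.
    """
    P = null_space_projector(A)
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * (P @ velocity(x, i * dt))
    return x

if __name__ == "__main__":
    A = np.ones((1, 4))                       # conserve the total of the field
    x0 = np.array([1.0, -1.0, 2.0, -2.0])     # total is 0
    target = np.array([3.0, 1.0, 0.0, 5.0])
    toy_velocity = lambda x, t: target - x    # stand-in for a learned field
    x1 = sample_with_hard_constraint(x0, toy_velocity, A)
    print(float((A @ x1)[0]))                 # total after sampling
```

Projecting the velocity instead of correcting the state afterwards means no separate repair step is needed during sampling.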

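2605.01970 2026-05-18 cs.CR cs.AI Updated version

Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

Debeshee Das, Julien Piet, Darya Kaviani, Luca Beurer-Kellner, Florian Tramèr, David Wagner

Affiliations * ETH Zürich

AI summary: This paper studies the Trojan Hippo attack on LLM agents, which plants a dormant payload in the agent's long-term memory that activates when the user discusses sensitive topics and then exfiltrates data. The authors propose a dynamic evaluation framework for systematically assessing memory architectures and defenses, and show on an email assistant that the attack reaches up to 85-100% success. They also analyze several defenses, exposing a security-utility tradeoff that informs real-world defense deployment.

Details
Abstract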

Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work: the attacker plants a dormant payload into an agent's long-term memory via a single untrusted tool call (e.g., a crafted email), which activates only when the user later discusses sensitive topics such as finance, health, or identity, and exfiltrates high-value personal data to the attacker. While anecdotal demonstrations of such attacks have appeared against deployed systems, no prior work systematically evaluates them across heterogeneous memory architectures and defenses. We introduce a dynamic evaluation framework comprising two components: (1) an OpenEvolve-based adaptive red-teaming benchmark that stress-tests defenses and memory backends against continuously refined attacks, and (2) the first capability-aware security/utility analysis for persistent memory systems, enabling principled reasoning about defense deployment across different usage profiles. Instantiated on an email assistant across four memory backends (explicit tool memory, agentic memory, RAG, and sliding-window context), Trojan Hippo achieves up to 85-100% ASR against current frontier models from OpenAI and Google, with planted memories successfully activating even after 100 benign sessions. We evaluate four memory-system defenses inspired by basic security principles, finding they substantially reduce attack success rates (to as low as 0-5%), though at utility costs that vary widely with task requirements. Because of this substantial security-utility tradeoff, the effective real-world deployment of defenses remains an open challenge, which our evaluation framework is specifically designed to address.

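2605.00424 2026-05-18 cs.CR cs.AI cs.MA cs.SE Updated version

Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes

Alfredo Metere

Affiliations * Enclawed, LLC

AI summary: This paper studies how skills (structured instruction packages that augment a large language model) can be verified as trustworthy in human-in-the-loop agent runtimes. The author proposes a trust schema and a biconditional correctness criterion, under which a skill must be verified before loading rather than trusted on the basis of a signature or registry of origin. With explicit verification levels and a capability gate, human-in-the-loop review fires only for what is unverified, which makes the system scalable and sustainable. The contribution is general and requires neither model retraining nor proprietary infrastructure.

Details
Abstract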

Agent skills - structured packages of instructions, scripts, and references that augment a large language model (LLM) without modifying the model itself - have moved from convenience to first-class deployment artifact. The runtime that loads them inherits the same problem package managers and operating systems have always faced: a piece of content claims a behavior; the runtime must decide whether to believe it. We argue this paper's central thesis up front: a skill is untrusted code until it is verified, and the runtime that loads it must enforce that default rather than infer trust from a signature, a clearance, or a registry of origin. Without skill verification, a human-in-the-loop (HITL) gate must fire on every irreversible call - which is operationally untenable and degrades into rubber-stamping at any non-trivial scale. With skill verification treated as a separate, gated process, HITL fires only for what is unverified, and the system becomes sustainable. We give a trust schema that includes an explicit verification level on every skill manifest; a capability gate whose HITL policy is a function of that verification level; a biconditional correctness criterion that any candidate verification procedure must satisfy on an adversarial-ensemble exercise; and a portable runtime profile with ten normative guidelines abstracted from a working open-source reference implementation. The contribution is harness- and model-agnostic; nothing here requires retraining, fine-tuning, or proprietary infrastructure.

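2604.27859 2026-05-18 cs.AI cs.ET Updated version

Rethinking Agentic Reinforcement Learning In Large Language Models

Fangming Cui, Ruixiao Zhu, Cheng Fang, Sunan Li, Jiahong Li

Affiliations * Beijing Beijing China; Shanghai Beijing China

AI summary: This paper rethinks agentic reinforcement learning (Agentic RL) in the context of large language models (LLMs). It considers how LLM capabilities such as goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning can be brought into the reinforcement learning framework to handle complex, open-ended real-world tasks. The paper analyzes the paradigm's core concepts, methodological innovations, and design principles, and identifies current challenges and future directions.

Details
Abstract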

Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we examine the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.

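2604.14572 2026-05-18 cs.IR cs.AI cs.CL cs.MA Updated version

Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

Affiliations * Magellan Technology Research Institute

AI summary: This paper proposes Corpus2Skill, which distills an enterprise document corpus offline into a hierarchical skill directory so that a large language model actively navigates the knowledge base when answering questions instead of passively retrieving from it. On an enterprise customer-support benchmark, the method delivers better answer quality and grounding than several RAG baselines. The study also characterizes where navigation helps, namely single-domain corpora with a recoverable topical taxonomy, and offers this as a design guideline for knowledge-grounded systems.

Details
Abstract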

Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results, with no view of how the corpus is organized or what it has not yet seen. We present Corpus2Skill, which distills a document corpus offline into a hierarchical skill directory and lets an LLM agent navigate it at serve time, drilling from a bird's-eye view through progressively finer summaries down to documents, and backtracking when a branch is unproductive. On an enterprise customer-support benchmark, Corpus2Skill improves both answer quality and grounding over single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines at a moderate cost tradeoff. A ten-subset generalization study further shows that corpus navigation is not a universal replacement for retrieval: it consistently helps on single-domain corpora with a recoverable topical taxonomy, but flat retrieval remains preferable on open-domain factoid pools or homogeneous-tabular corpora that defeat top-level clustering. We characterize this scope distinction and discuss it as a design guideline for knowledge-grounded systems. Code is available at https://github.com/dukesun99/Corpus2Skill.

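2604.08302 2026-05-18 cs.LG cs.AI Updated version

DMax: Aggressive Parallel Decoding for dLLMs

Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang

Affiliations * National University of Singapore

AI summary: This paper proposes DMax, a new approach to efficient decoding for diffusion language models (dLLMs). Through progressive self-refinement and a soft parallel decoding strategy, it mitigates error accumulation in parallel decoding and achieves more aggressive parallel generation while preserving quality. DMax also introduces On-Policy Uniform Training, a training strategy that unifies masked and uniform dLLMs and improves decoding efficiency on multiple benchmarks.

Comments Work in progress. Code is available at: https://github.com/czg1225/DMax

Details
Abstract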

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax
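
The interpolation between predicted token embedding and mask embedding mentioned in this abstract could be sketched as below. The use of a per-position prediction confidence as the interpolation weight is an assumption made for illustration, and `soft_decode_state` is a hypothetical name; the abstract states only that the intermediate state is an interpolation.

```python
import numpy as np

def soft_decode_state(token_emb, mask_emb, confidence):
    """Blend predicted token embeddings with the mask embedding.

    token_emb:  (positions, d) embeddings of the currently predicted tokens.
    mask_emb:   (d,) embedding of the mask token.
    confidence: (positions,) weights in [0, 1]; assumed interpolation weight.
    """
    c = np.clip(np.asarray(confidence, dtype=float), 0.0, 1.0)[..., None]
    # c = 1 commits fully to the prediction; c = 0 leaves the position masked.
    return c * token_emb + (1.0 - c) * mask_emb

if __name__ == "__main__":
    tokens = np.arange(6.0).reshape(3, 2)
    mask = np.array([10.0, 20.0])
    print(soft_decode_state(tokens, mask, [0.0, 0.5, 1.0]))
```

Keeping uncertain positions partway toward the mask embedding gives a later step room to revise them, in contrast to a hard mask-to-token switch.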

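2603.23433 2026-05-18 cs.AI Updated version

Mecha-nudges for Machines

Giulio Frey, Kawin Ethayarajh

Affiliations * University of Chicago

AI summary: This paper studies how the decisions of AI agents acting as decision-makers on the Internet can be systematically influenced by changes to their environment, a phenomenon the authors call mecha-nudging. Combining Bayesian persuasion from economics with usable information from computer science, they propose a unified way to quantify how environmental changes affect AI agents. Analyzing more than six million Etsy listings, they find that after ChatGPT's release the machine-usable information in listings for predicting agent curation decisions increased significantly, while human-usable information changed little or not at all. The study provides the first large-scale empirical evidence that systematic mecha-nudging is already occurring in the wild while going unnoticed.

Details
Abstract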

AI agents are becoming active decision-makers on the Internet. As they make decisions in the same environments as humans, the environments themselves can change to influence them. We call this $\textit{mecha-nudging}$: changes to how choices are presented that systematically influence AI agents without materially degrading the decision environment for humans. To measure this phenomenon, we combine two frameworks -- Bayesian persuasion from economics and $\mathcal{V}$-usable information from computer science -- to get a common unit (bits) for quantifying how environments change across a wide range of interventions, contexts, and models. We apply this framework to over six million Etsy listings and find that, after ChatGPT's release, listings contain significantly more machine-usable information for predicting agent curation decisions, increasing by 0.143 bits out of a maximum possible increase of 0.355. This shift is robust across prompts, token choices, labeling models, and fine-tuning architectures; absent in a regulated-text placebo; and far larger than the effect of generic LLM rewriting. In contrast, a human study finds little to no change in human-usable information. Our results provide the first large-scale evidence that systematic mecha-nudging is already occurring in the wild, but going unnoticed.

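2603.16011 2026-05-18 cs.SE cs.AI cs.CL Updated version

FormulaCode: Evaluating Agentic Optimization on Large Codebases

Atharva Sehgal, James Hou, Akanksha Sarkar, Ishaan Mantripragada, Swarat Chaudhuri, Jennifer J. Sun, Yisong Yue

Affiliations * The University of Texas at Austin; California Institute of Technology; Cornell University

AI summary This paper introduces FormulaCode, a benchmark for evaluating the ability of large language model (LLM) agents to perform multi-objective optimization on large, real-world codebases. The benchmark is built from 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and a large number of community-maintained performance workloads, allowing a comprehensive assessment of LLM optimization ability under correctness and performance constraints. Experiments show that current state-of-the-art LLM agents still face significant challenges on large-scale, multi-objective optimization tasks.

Comments Preprint version

Details
English abstract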

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling holistic evaluation of the ability of LLM agents to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website: https://formula-code.github.io

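2603.14764 2026-05-18 cs.CV cs.AI cs.LG Updated version

Topology-Preserving Polygon Augmentation for Segmentation in Structured Visual Domains

Sudip Laudari, Sang Hun Baek

Affiliations * Independent Researcher

AI summary This paper studies image augmentation that preserves the topology of polygon annotations in structured visual domains such as architectural floorplan analysis. Because conventional geometric augmentation can split polygon regions and break semantic connectivity, it proposes a lightweight topology-preserving augmentation strategy that repairs adjacency relations in index space without changing the vertex order. Experiments show that the method achieves near-perfect Cyclic Adjacency Preservation (CAP) under common geometric transformations and improves annotation consistency in polygon-based segmentation.

Comments 10 pages, 6 figures

Details
English abstract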

Geometric data augmentation is widely used in segmentation workflows, but polygon annotations are often assumed to remain valid after transformation. This assumption can fail in structured domains such as architectural floorplan analysis, where a region may contain an interior void encoded as part of a single ordered polygon chain. Cropping or clipping can remove bridge vertices in this chain, causing one semantic region to split into disconnected components. We propose a lightweight topology-preserving augmentation strategy that repairs missing adjacency relations in index space while preserving the original vertex order. The method adds minimal overhead and can be integrated into existing preprocessing workflows. Experiments show that the proposed approach achieves near-perfect Cyclic Adjacency Preservation (CAP) across common geometric transformations and improves annotation consistency in polygon-based segmentation.

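2603.07514 2026-05-18 cs.LG cs.AI cs.CV Updated version

A Unified View of Score-Based and Drifting Models

Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao

Affiliations * Sony AI; Sony Group Corporation; Stanford University; Georgia Tech

AI summary This paper examines the connection between drifting models and score-based generative models, showing that drifting is essentially equivalent to a score-matching objective on smoothed distributions. With a Gaussian kernel, the mean-shift field corresponds exactly to the difference between the scores of the data and model distributions, a result that follows from Tweedie's formula. For the Laplace kernel commonly used in practice, theory and experiments both indicate that the residual term is negligible in high dimension, so drifting as used in practice is approximately score-based. The work offers a unified view of these generative models and identifies structural similarities and differences between drifting models and diffusion models in their transport directions.

Details
English abstract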

Drifting models train one-step generators by optimizing a kernel-induced mean-shift discrepancy between the data and model distributions, with Laplace kernels used by default in practice. At each point, this discrepancy compares the kernel-weighted displacement toward nearby data samples with the corresponding displacement toward nearby model samples, thereby defining a transport direction for generated samples. In this paper, we show that drifting is more closely connected to score-based generative modeling than it may first appear, establishing a precise link to the score-matching principle underlying diffusion models. For Gaussian kernels, the population mean-shift field exactly equals the difference between the scores (i.e., the gradient-log-densities) of the Gaussian-smoothed data and model distributions. This identity follows from Tweedie's formula, which links the score of a Gaussian-smoothed density to its conditional mean, and implies that Gaussian-kernel drifting is exactly a score-matching objective on smoothed distributions. More generally, we derive an exact decomposition for radial kernels in which mean shift equals a score-based field plus a residual term. For the practical Laplace kernel, we further show theoretically and empirically that this residual is negligible in high dimension, implying that the transport field used in practice is nearly score-based. Our results reveal a structural connection to diffusion models: both methods use score-mismatch transport directions, but drifting realizes the score nonparametrically through kernel-based estimates, whereas diffusion models learn it parametrically with neural networks.
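The Gaussian-kernel identity can be sketched in a few lines. The notation ($p$ for data, $q$ for model, bandwidth $\sigma$, kernel $k_\sigma$, mean-shift field $m_p$) is introduced here for illustration; the paper's own normalization is not given in the abstract, so the factor $\sigma^2$ below reflects this particular convention.

$$
\begin{aligned}
k_\sigma(x, y) &= \exp\!\left(-\tfrac{\|x - y\|^2}{2\sigma^2}\right), \qquad p_\sigma = p * \mathcal{N}(0, \sigma^2 I), \\
m_p(y) &= \frac{\mathbb{E}_{x \sim p}\left[k_\sigma(x, y)\, x\right]}{\mathbb{E}_{x \sim p}\left[k_\sigma(x, y)\right]} - y = \mathbb{E}[x \mid y] - y, \\
\mathbb{E}[x \mid y] &= y + \sigma^2 \nabla_y \log p_\sigma(y) \quad \text{(Tweedie's formula)}, \\
m_p(y) - m_q(y) &= \sigma^2 \left( \nabla_y \log p_\sigma(y) - \nabla_y \log q_\sigma(y) \right).
\end{aligned}
$$

In the second line the conditional expectation is taken under the model $y = x + \sigma \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, for which the kernel-weighted average of $x$ is exactly the posterior mean. The difference of the two mean-shift fields is therefore a difference of smoothed scores.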

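2603.04459 2026-05-18 cs.CR cs.AI cs.SE Updated version

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

Junjie Chu, Xinyue Shen, Ye Leng, Michael Backes, Yun Shen, Yang Zhang

Affiliations * CISPA Helmholtz Center for Information Security; University of Waterloo; Flexera

AI summary This paper systematically evaluates the code quality and runnability of 31 LLM safety benchmarks and compares them with 382 non-benchmark papers. It finds that most benchmark code needs modification to run, and that only a few repositories provide complete installation guides or ethical considerations. The authors report that benchmark adoption correlates with author prominence and code runnability rather than with code quality standards, revealing a potential bias in how the community selects benchmarks. In addition, some benchmarks pose safety risks and could be used as attack resources, undermining the reliability of safety evaluations.

Comments 24 pages, 19 figures

Details
English abstract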

The rapid expansion of research in LLM safety presents challenges in tracking advancements, making benchmarks important evaluation infrastructure for identifying key trends and facilitating systematic comparisons. Yet no systematic assessment exists of their code quality and runnability, nor of what factors are associated with the community's adoption of certain benchmarks over others. To address this gap, we conduct a systematic measurement study of 31 LLM safety benchmarks (covering prompt injection, jailbreak, and hallucination) with 382 non-benchmark papers as a control group, combining automated static analysis, human runnability testing (220+ person-hours), and bibliometric analysis. We find that only 39% of benchmark repositories can run without modification, only 16% provide flawless installation guides, and a mere 6% include ethical considerations despite containing potentially harmful content. These deficiencies persist across the study period with no significant improvement. Analyzing adoption factors, we find that benchmark adoption correlates with author prominence and code runnability, but not with code quality standards such as Pylint score and maintainability, suggesting that the community's benchmark selection does not reward higher coding standards. Based on these results, we identify potential safety and reliability concerns. Some safety benchmark repositories openly expose harmful content, such as successful jailbreak responses, without any ethical warning or access control, effectively serving as unguarded attack resources. Furthermore, when benchmarks require ad-hoc modifications to run, downstream safety evaluations across different papers may not be comparable. We present case studies illustrating these concrete consequences and propose a targeted checklist to help benchmark contributors improve code quality, documentation, and ethical practices.

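2603.01283 2026-05-18 cs.AI cs.LG Updated version

The Informational Cost of Agency: A Bounded Measure of Interaction Efficiency for Deployed Reinforcement Learning

Wael Hafez, Cameron Reid, Amit Nazeri

Affiliations * Semarx Research LLC

AI summary This paper proposes an information-theoretic metric called Bipredictability (denoted P) that quantifies how efficiently the closed-loop interaction between an agent and its environment removes uncertainty and builds shared predictability. The metric has a theoretical upper bound (at most 0.5), and the paper shows that an agent's responsive behavior suppresses P below this ceiling, a phenomenon termed the informational cost of agency. Experiments show that P applies not only to reinforcement learning systems but also to language models, vision systems, and other domains. In addition, the Information Digital Twin (IDT) architecture built on P detects system degradation with higher accuracy and lower latency, offering a new reliability assessment tool for deployed autonomous systems.

Comments 12 pages, 2 figures

Details
English abstract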

Deployed reinforcement learning systems lack a principled runtime reliability theory. We close this gap by introducing Bipredictability, $P$, a closed-form information-theoretic metric that quantifies how efficiently a closed-loop interaction between agent and environment converts uncertainty into shared predictability. $P$ admits a provable classical bound, $P \le 0.5$, derived from Shannon entropy subadditivity, and responsive agency necessarily suppresses $P$ below this ceiling, a structural prediction we term the informational cost of agency. Across 21 trained continuous control agents, we confirm this prediction empirically at $P = 0.33 \pm 0.02$. The same suppression signature reproduces in language model dialogue, convolutional vision systems, and classical mechanical baselines, indicating that $P$ captures a substrate-independent property of agentic interaction rather than an algorithm-specific artifact. The Information Digital Twin (IDT), a model-agnostic architecture that computes $P$ from the external interaction stream, detects 89.3% of coupling degradations against 44.0% for reward-based monitoring, with 4.4 times lower latency. $P$ provides the missing measurement layer for runtime reliability and closed-loop self-regulation in deployed autonomous systems.

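2602.23409 2026-05-18 cs.LG cs.AI cs.ET quant-ph Updated version

Long Range Frequency Tuning for QML

Michael Poppel, Markus Baumann, Sebastian Wölckert, Claudia Linnhoff-Popien, Jonas Stein

Affiliations * LMU Munich; Aqarios GmbH

AI summary This study addresses frequency encoding in variational quantum circuits and proposes a new initialization method to improve their ability to fit high-frequency functions. Fixed encodings require a large number of gates, while trainable-frequency circuits are promising but gradient descent is hampered by the spectral gap. The proposed ternary grid initialization sets the frequency prefactors so that the effect of the spectral gap is removed, markedly improving model performance. Experiments show that the method outperforms existing approaches on synthetic and real-world datasets.

Details
English abstract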

Angle-encoded variational quantum circuits admit a truncated Fourier series representation of their output, but approximating functions with maximum frequency $\omega_{\max}$ using fixed unary encoding requires $\mathcal{O}(\omega_{\max})$ encoding gates. Trainable-frequency (TF) circuits promise a reduction by learning the data-encoding prefactors alongside the ansatz parameters, adapting the accessible frequency spectrum to the target during training. We identify a practical barrier that prevents this promise from being realized: the prefactor gradient is suppressed by the spectral gap between the circuit's accessible frequencies and the target spectrum, independently of the ansatz parameters, confining gradient-driven prefactor movement to a narrow neighborhood of initialization. We propose \emph{ternary grid initialization} -- setting prefactors to $\{1, 3, 9, \ldots, 3^{k-1}\}$ -- which resolves this limitation by ensuring every target frequency within $[-\omega_{\max}, \omega_{\max}]$ lies within $\tfrac{1}{2}$ unit of a grid point at initialization, removing the spectral gap suppression by construction. On a synthetic benchmark with target frequencies shifted well beyond the standard initialization range, ternary initialization achieves median $R^2 = 0.997$ versus $0.18$ for unary initialization, with $100\%$ of runs achieving $R^2 > 0.95$ against $0\%$. CMA-ES with $20\times$ the evaluation budget reaches only $25\%$ success, confirming the limitation is a property of the optimization landscape rather than of gradient-based optimization specifically. Real-world validation on two benchmark datasets demonstrates consistent advantages over both fixed and trainable unary baselines.
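The half-unit coverage claim can be checked numerically under one reading of "grid point". The sketch below assumes the accessible frequencies are signed sums of the prefactors with coefficients in {-1, 0, 1}; that interpretation and all function names are assumptions, since the abstract does not define the grid. Under it, the prefactors $\{1, 3, \ldots, 3^{k-1}\}$ form a balanced-ternary system that reaches every integer up to $(3^k - 1)/2$ in magnitude, so any real frequency in that range is within $\tfrac{1}{2}$ of a reachable one.

```python
from itertools import product


def ternary_prefactors(k):
    """Prefactors 1, 3, 9, ..., 3**(k-1)."""
    return [3 ** i for i in range(k)]


def reachable_frequencies(prefactors):
    """All signed sums of the prefactors with coefficients in {-1, 0, 1}.

    Treating these sums as the circuit's accessible frequencies is an
    assumption made for this sketch.
    """
    return {
        sum(c * a for c, a in zip(coeffs, prefactors))
        for coeffs in product((-1, 0, 1), repeat=len(prefactors))
    }


def covers_all_integers(k):
    """True if every integer in [-(3**k - 1)/2, (3**k - 1)/2] is reachable."""
    bound = (3 ** k - 1) // 2
    return reachable_frequencies(ternary_prefactors(k)) == set(range(-bound, bound + 1))
```

For comparison, `reachable_frequencies([1, 1, 1])` (three unary prefactors) yields only the seven integers from -3 to 3, whereas three ternary prefactors yield 27 distinct values from -13 to 13.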

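2602.20207 2026-05-18 cs.LG cs.AI Updated version

Golden Layers and Where to Find Them: Improved Knowledge Editing for Large Language Models Via Layer Gradient Analysis

Shrestha Datta, Hongfu Liu, Anshuman Chhabra

Affiliations * University of South Florida; Brandeis University

AI summary This paper studies how to perform knowledge editing in large language models efficiently, that is, updating the model's output for a specific query without damaging its overall performance. The authors propose a new method based on Layer Gradient Analysis (LGA) that uses per-layer gradient information to identify the "golden layers" that give the best editing performance, avoiding the tedious trial and error of conventional approaches. Experiments show that the method is effective and robust across several LLMs and knowledge editing tasks.

Details
English abstract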

Knowledge editing in Large Language Models (LLMs) aims to update the model's prediction for a specific query to a desired target while preserving its behavior on all other inputs. This process typically involves two stages: identifying the layer to edit and performing the parameter update. Intuitively, different queries may localize knowledge at different depths of the model, resulting in different sample-wise editing performance for a fixed editing layer. In this work, we hypothesize the existence of fixed golden layers that can achieve near-optimal editing performance similar to sample-wise optimal layers. To validate this hypothesis, we provide empirical evidence by comparing golden layers against ground-truth sample-wise optimal layers. Furthermore, we show that golden layers can be reliably identified using a proxy dataset and generalize effectively to unseen test set queries across datasets. Finally, we propose a novel method, Layer Gradient Analysis (LGA), that estimates golden layers efficiently via gradient attribution, avoiding extensive trial and error across multiple editing runs. Extensive experiments on several benchmark datasets demonstrate the effectiveness and robustness of our LGA approach across different LLM types and various knowledge editing methods.

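2602.19069 2026-05-18 cs.AI Updated version

Asking the Right Questions: Improving Reasoning with Generated Stepping Stones

Hengyuan Hu, Tingchen Fu, Minqi Jiang, Alexander H Miller, Yoram Bachrach, Jakob Nicolaus Foerster

Affiliations * FAIR at Meta; Stanford University; University of Oxford

AI summary This study explores how generating intermediate "stepping stone" questions can improve large language models on complex reasoning tasks. It proposes a framework called ARQ, which adds a question generator to the default reasoning pipeline to help the model decompose tasks and construct useful intermediate steps. Experiments show that the generated stepping stone questions are transferable, help models of varying capability solve the target tasks, and can be further improved through post-training.

Details
English abstract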

Recent years have witnessed tremendous progress in enabling LLMs to solve complex reasoning tasks such as math and coding. As we start to apply LLMs to harder tasks that they may not be able to solve in one shot, it is worth paying attention to their ability to construct intermediate stepping stones that prepare them to better solve the tasks. Examples of stepping stones include simplifications, alternative framings, or subproblems. We study properties and benefits of stepping stones in the context of modern reasoning LLMs via ARQ (Asking the Right Questions), a simple framework that introduces a question generator to the default reasoning pipeline. We first show that good stepping stone questions exist and are transferable, meaning that good questions can be generated, and they substantially help LLMs of various capabilities in solving the target tasks. We next frame stepping stone generation as a post-training task and show that we can fine-tune LLMs to generate more useful stepping stones by SFT and RL on synthetic data.

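2602.10687 2026-05-18 cs.CV cs.AI Updated version

OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

Affiliations * School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Wuhan University, Wuhan, China; Lab for Intelligence and visiON (LION)

AI summary Existing forgery detection methods are mostly limited to uni-modal or bi-modal settings and struggle with real-world multimodal misinformation. This paper proposes OmniVL-Guard, a unified vision-language forgery detection and grounding framework based on balanced reinforcement learning, which targets the bias that arises from multimodal interaction and multi-task optimization. The method has two core designs, Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization, which improve combined detection and grounding performance and show strong zero-shot generalization across multiple datasets.

Comments Accepted by ICML 2026

Details
English abstract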

Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical "difficulty bias" problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. In particular, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios. The dataset and code are publicly available at https://github.com/shen8424/OmniVL-Guard.

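2602.03812 2026-05-18 cs.LG cs.AI cs.CL Updated version

Antidistillation Fingerprinting

Yixuan Even Xu, John Kirchenbauer, Yash Savani, Asher Trockman, Alexander Robey, Tom Goldstein, Fei Fang, J. Zico Kolter

Affiliations * Carnegie Mellon University, Pittsburgh, PA, USA; University of Maryland, College Park, MD, USA

AI summary This study proposes antidistillation fingerprinting (ADFP), a new method for detecting whether a third-party model has learned from a teacher model's outputs through distillation. Unlike existing methods that rely on heuristic perturbations, ADFP aligns the fingerprinting objective with the student model's learning dynamics and uses a proxy model to select tokens that maximize fingerprint detectability, improving detection while preserving generation quality. Experiments show that ADFP achieves a better balance of detection performance and utility than existing methods on mathematical reasoning, dialogue, and code generation tasks.

Comments 28 pages, 13 figures, ICML 2026

Details
English abstract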

Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K, OASST1, and MBPP demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility across mathematical reasoning, dialogue, and code generation, even when the student model's architecture is unknown.

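2601.21028 2026-05-18 cs.CY cs.AI cs.HC Updated version

"Unlimited Realm of Exploration and Experimentation": Methods and Motivations of AI-Generated Sexual Content Creators

Jaron Mink, Lucy Qin, Elissa M. Redmiles

Affiliations * Arizona State University; Georgetown University

AI summary This paper studies the motivations, methods, and content types of creators of AI-generated sexual content (AIG-SC), revealing the diversity of their work, including sexual exploration, creative expression, and technical experimentation. Through in-depth interviews with 28 creators, it examines the technical, ethical, and social implications of AIG-SC and provides a reference for policymaking.

Details
English abstract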

AI-generated media is radically changing the way content is both consumed and produced on the internet, and in no place is this potentially more visible than in sexual content. AI-generated sexual content (AIG-SC) is increasingly enabled by an ecosystem of individual AI developers, specialized third-party applications, and foundation model providers. AIG-SC raises a number of concerns from older debates about the line between pornography and obscenity to newer debates about fair use and labor displacement (in this case, of sex workers), and has spurred new regulations to curb the spread of non-consensual intimate imagery (NCII) created using the same technology used to create AIG-SC. However, despite the growing prevalence of AIG-SC, little is known about its creators, their motivations, and what types of content they produce. To inform effective governance in this space, we conducted an in-depth study to understand what AIG-SC creators make, along with how and why they make it. Interviews with 28 AIG-SC creators, ranging from hobbyists to entrepreneurs to those who moderate communities of hundreds of thousands of other creators, revealed a wide spectrum of motivations, including sexual exploration, creative expression, technical experimentation, and in a handful of cases, the creation of NCII.

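2601.19923 2026-05-18 cs.CL cs.AI Updated version

Structure-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation for Web Information Systems

Boxiang Zhao, Qince Li, Zhonghao Wang, Zelin Cao, Yi Wang, Peng Cheng, Bo Lin

Affiliations * Tele-Communication Technology Bureau, Xinhua News Agency

AI summary As large language models (LLMs) take on a central role in Web-based autonomous agents and complex Web information systems, their ability to accurately convert natural language into structured formats has become critical. This paper proposes Structure-BiEval, a self-supervised framework requiring no human annotation, which decouples structure from content and uses metrics such as Content Semantic Accuracy and Normalized Tree Edit Distance to quantify the structural fidelity of Web data. Experiments show that LLMs of different sizes differ markedly on structuring tasks and that deeply nested structures challenge all types of models.

Details
English abstract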

As Large Language Models (LLMs) evolve into the core of Web-based autonomous agents and complex Web Information Systems, their ability to faithfully translate natural language into rigorous structured formats has become paramount, as this capability is critical for Web API invocation and data exchange. However, evaluating this structural fidelity in Web-native payloads remains a challenge: traditional text metrics fail to capture topological consistency in semi-structured Web data, while manual evaluation is prohibitively costly. To address this, we propose Structure-BiEval, a novel self-supervised framework for quantitative, annotation-free assessment tailored for Web data engineering. By leveraging deterministic Intermediate Representations, our framework effectively decouples structure from content, utilizing Content Semantic Accuracy and Normalized Tree Edit Distance as precise metrics. We empirically benchmark 15 state-of-the-art LLMs across dual Web structural topologies, namely Hierarchical Data (Web backend payloads) and Tabular Data (Web frontend presentation). The results reveal substantial variability in structural performance, including cases where mid-sized models unexpectedly outperform larger counterparts in Web data formatting. Furthermore, our findings show that deep recursive nesting poses a consistent challenge for Web agents across varying parameter scales.

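2512.19701 2026-05-18 cs.LG cs.AI Updated version

LASER: Language Model Regression for Semi-Structured Workflow Resource and Runtime Estimation

Yuxuan Yin, Shengke Zhou, Yunjie Zhang, Ajay Mohindra, Boxun Xu, Peng Li

Affiliations * University of California, Santa Barbara

AI summary Accurately predicting the resource consumption and runtime of cloud workflow jobs is critical for scheduling efficiency, but the semi-structured nature of job configurations makes it challenging. This paper proposes LASER, a framework that fine-tunes large language models on serialized workflow configurations for multi-target resource and runtime regression, introducing scientific-notation output encoding and constrained decoding to improve the accuracy and efficiency of numerical prediction. Experiments show that LASER outperforms human experts and state-of-the-art tabular machine learning methods on large-scale chip design workloads and the newly built GHARuntime dataset, establishing a new paradigm for LLM-based regression on semi-structured workflow data.

Comments 20 pages, 7 figures

Details
English abstract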

Accurate prediction of resource consumption and runtime for cloud workflow jobs is critical for scheduling efficiency, yet remains challenging due to the semi-structured nature of job configurations -- comprising shell commands, tool-specific parameters, dependency graphs, and hierarchical metadata. Traditional ML approaches require brittle feature engineering to flatten this rich information into fixed-size vectors, losing critical semantic context. We present LASER, a framework that fine-tunes LLMs on serialized workflow job configurations for multi-target resource and runtime regression. To address the challenges of numerical regression via generation, we introduce scientific notation output encoding for targets spanning multiple orders of magnitude, and constrained decoding with prefix filling to enforce output validity while reducing inference latency by over 30%. We further show that full-attention fine-tuning improves accuracy over sliding-window LLMs on long job contexts. Validated on large-scale chip design workloads, and GHARuntime, a new public benchmark derived from 580,000+ GitHub Actions runs across 27,000+ repositories, LASER outperforms human experts and SOTA tabular ML baselines, with clear model- and data-scaling behavior, establishing a new paradigm for LLM-based regression on semi-structured workflow data.
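Scientific-notation output encoding can be sketched as a pair of helpers that turn a regression target into a short fixed-shape string and back. The exact token format LASER uses is not given in the abstract; the `mantissa e exponent` layout, the default of three mantissa digits, and the function names are assumptions for illustration.

```python
def encode_target(value, digits=3):
    """Render a number as '<mantissa>e<signed exponent>', e.g. '1.536e+3'.

    A fixed mantissa width keeps output length roughly constant whether
    the target is milliseconds or days, which is the motivation the
    abstract gives for targets spanning several orders of magnitude.
    """
    mantissa, exponent = f"{value:.{digits}e}".split("e")
    return f"{mantissa}e{int(exponent):+d}"


def decode_target(text):
    """Parse the encoded string back into a float."""
    return float(text)
```

Because every valid output has the same shape (sign, digits, `e`, signed integer), a constrained decoder can restrict generation to these characters at each position, which is one plausible way the output-validity guarantee could be enforced.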

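2512.15067 2026-05-18 cs.LG cs.AI cs.SY eess.SY Updated version

EMFusion: An Uncertainty-Aware Conditional Diffusion Framework for Frequency-Selective EMF Forecasting in Wireless Networks

Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio

Affiliations * Department of Electrical Engineering and Computer Science, York University; School of Electrical and Electronic Engineering, Huazhong University of Science and Technology; Central China Branch of State Grid Corporation of China; Department of Electronic Engineering, University of Rome Tor Vergata; Consorzio Nazionale Interuniversitario per le Telecomunicazioni (CNIT)

AI summary With the rapid growth of wireless infrastructure, accurately estimating and forecasting electromagnetic field (EMF) levels has become especially important for ensuring compliance, assessing health impacts, and optimizing network planning. This paper proposes EMFusion, an uncertainty-aware conditional diffusion framework for frequency-selective multivariate EMF forecasting in wireless networks. The method uses a residual U-Net with a cross-attention mechanism to integrate contextual information such as time, season, and holidays, provides explicit uncertainty estimates, and adopts an imputation-based sampling strategy to improve temporal coherence. Experiments show that EMFusion outperforms existing methods on several evaluation metrics, substantially improving forecast accuracy and reliability.

Comments Submission for possible publication

Details
English abstract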

The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors, such as time of day, season, and holidays, while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates empirical probabilistic prediction intervals from the learned conditional distribution, providing uncertainty-aware probabilistic forecasting rather than simple point estimation. Numerical experiments conducted on frequency-selective EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS), 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.
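The continuous ranked probability score (CRPS) reported above can be estimated directly from forecast samples. The sketch below uses the standard ensemble estimator $\mathrm{CRPS} \approx \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$; whether the paper uses this estimator or another is not stated in the abstract, and the function name is hypothetical.

```python
def crps_ensemble(samples, observation):
    """Ensemble estimate of CRPS for one scalar observation.

    samples: draws from the forecast distribution (for a diffusion
    forecaster, one value per generated trajectory at a given time step).
    Lower is better; with a single sample it reduces to absolute error.
    """
    n = len(samples)
    # Mean distance between forecast samples and the observed value.
    term_obs = sum(abs(s - observation) for s in samples) / n
    # Half the mean pairwise distance among samples (rewards sharpness).
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return term_obs - term_spread
```

The pairwise term costs O(n^2) per observation, which is acceptable for the few hundred samples typically drawn from a generative forecaster.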

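2512.09673 2026-05-18 cs.LG cs.AI cs.NE stat.ML Updated version

Drawback of Enforcing Equivariance and its Compensation via the Lens of Expressive Power

Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao

Affiliations * University of Science and Technology of China; University of Edinburgh; Nanyang Technological University

AI summary This paper studies how enforcing equivariance affects the expressive power of neural networks and finds that the constraint can weaken it. By analyzing boundary hyperplanes and channel vectors, the authors demonstrate this constructively, show that the drawback can be compensated for by enlarging the model, and prove upper bounds on the required enlargement. Surprisingly, the enlarged architectures have a hypothesis space of reduced dimensionality, which may lead to better generalization.

Details
English abstract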

Equivariant neural networks encode the intrinsic symmetry of data as an inductive bias and have achieved impressive performance across a wide range of domains. However, the understanding of their expressive power remains immature. Focusing on 2-layer ReLU networks, this paper investigates the impact of enforcing equivariance constraints on the expressive power. By examining the boundary hyperplanes and the channel vectors, we constructively demonstrate that enforcing equivariance constraints could undermine the expressive power. Naturally, this drawback can be compensated for by enlarging the model size -- we further prove upper bounds on the required enlargement for compensation. Surprisingly, we show that the enlarged neural architectures have reduced hypothesis space dimensionality, implying even better generalizability.

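2512.04745 2026-05-18 math.OC cs.AI cs.SY eess.SY nlin.AO Updated version

Neural Policy Composition from Free Energy Minimization

Francesca Rossi, Veronica Centorrino, Francesco Bullo, Giovanni Russo

Affiliations * Scuola Superiore Meridionale, Italy; ETH, Zürich; Center for Control, Dynamical Systems, and Computation, UC Santa Barbara, CA, USA; Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, Italy

AI summary This paper studies how neural policy composition can arise from minimizing a variational free energy, proposing a normative framework that gives a principled and broadly applicable objective for policy composition. Based on this framework, the authors derive a continuous-time gradient flow whose trajectories are guaranteed to converge at an explicit rate to the optimal composition, and show that the dynamics can be implemented by a soft-competitive recurrent circuit. Experiments show that in multi-agent flocking, human decision-making tasks, and layered control, the model gives interpretable accounts of policy composition, reproduces key behavioral signatures, and matches or outperforms existing models.

Details
English abstract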

The ability to flexibly compose previously acquired skills to execute intelligent behaviors is a hallmark of natural intelligence. Such compositional flexibility is often attributed to context-dependent gating mechanisms that determine how multiple policies or behavioral primitives are combined. Yet, despite remarkable efforts, the normative objective from which such gating rules should arise, and the neural computations capable of implementing them, remain unclear. Existing approaches typically rely on prespecified design choices for the gating rules, and remain tied to specific architectures, learning paradigms, or datasets. Here, we introduce a normative framework in which policy composition emerges from the minimization of a variational free energy, providing a principled and broadly applicable objective for gating. Based on this framework, we derive a continuous-time gradient flow whose trajectories are guaranteed to converge, with explicit rate, to the optimal composition of primitives. We further show that this dynamics admits a mechanistic neural implementation as a soft-competitive recurrent circuit with context-sensitive local interactions. We evaluate the model on emerging flocking behaviors in multi-agent systems, human decision-making in bandit tasks, and control benchmarks in layered architectures. Across these settings, the model provides interpretable mechanistic accounts of policy composition, reproduces key behavioral signatures, yields insights into data, and matches or outperforms established models.

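2512.01089 2026-05-18 cs.AI Updated version

CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents

Peter Jansen, Samiah Hassan, Pragnya Narasimha

Affiliations * University of Arizona; Allen Institute for Artificial Intelligence

AI summary CodeDistiller is a system that automatically distills high-quality code libraries from scientific GitHub repositories to strengthen the code generation ability of scientific coding agents. Combining automatic evaluation with domain-expert review, it produces working code examples for domains such as materials science, markedly improving the experimental accuracy and scientific soundness of automated scientific discovery systems. Experiments show that a library generated by CodeDistiller lets agents produce more complete and reliable experiment code, and suggest a feasible proxy metric for evaluating scientific discovery systems at scale.

Comments 8 pages, 3 figures, 3 tables. Accepted to ACL 2026 (Demo Track)

Details
English abstract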

Automated Scientific Discovery (ASD) systems can help automatically generate and run code-based experiments, but their capabilities are limited by the code they can reliably generate from parametric knowledge alone. As a result, current systems either mutate a small number of manually crafted experiment examples, or operate solely from parametric knowledge, limiting quality and reach. We introduce CodeDistiller, a system that automatically distills large collections of scientific GitHub repositories into a vetted library of working domain-specific code examples, allowing ASD agents to expand their capabilities without manual effort. Using a combination of automatic and domain-expert evaluation on 250 materials science repositories, we find the best model is capable of producing functional examples for 74% of repositories, while our downstream evaluation shows an ASD agent augmented with a CodeDistiller-generated library produces more accurate, complete, and scientifically sound experiments than an agent with only general materials-science code examples. We also evaluate LLM-as-a-judge ratings against domain-expert ratings in an A/B testing paradigm, finding moderate agreement and suggesting that inexpensive proxy metrics may be feasible for evaluating scientific discovery systems at scale.

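2512.00242 2026-05-18 cs.LG cs.AI cs.ET stat.ML Version update

Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves

Alessio Borgi, Fabrizio Silvestri, Pietro Liò

Affiliations Department of Computer Science and Technology, University of Cambridge; Department of Computer, Control and Management Engineering, Sapienza University

AI summary This paper proposes Polynomial Neural Sheaf Diffusion (PolyNSD), a new method for improving the diffusion process of sheaf neural networks on graphs. By applying a degree-K polynomial propagation operator to a normalised sheaf Laplacian, it obtains a K-hop receptive field that is independent of the stalk dimension, and it models a trainable spectral response as a convex mixture of orthogonal polynomial basis responses. Compared with conventional approaches, PolyNSD stays stable while reducing compute and memory requirements, and it reaches new state-of-the-art results on homophilic and heterophilic graph benchmarks.

Details
English abstract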

Sheaf Neural Networks equip graph structures with a cellular sheaf: a geometric structure which assigns local vector spaces (stalks) and linear learnable restriction/transport maps to nodes and edges, yielding an edge-aware inductive bias that handles heterophily and limits oversmoothing. However, common Neural Sheaf Diffusion implementations rely on SVD-based sheaf normalization and dense per-edge restriction maps, which scale with stalk dimension, require frequent Laplacian rebuilds, and yield brittle gradients. To address these limitations, we introduce Polynomial Neural Sheaf Diffusion (PolyNSD), a new sheaf diffusion approach whose propagation operator is a degree-K polynomial in a normalised sheaf Laplacian, evaluated via a stable three-term recurrence on a spectrally rescaled operator. This provides an explicit K-hop receptive field in a single layer (independently of the stalk dimension), with a trainable spectral response obtained as a convex mixture of K+1 orthogonal polynomial basis responses. PolyNSD enforces stability via convex mixtures, spectral rescaling, and residual/gated paths, reaching new state-of-the-art results on both homophilic and heterophilic benchmarks, inverting the Neural Sheaf Diffusion trend by obtaining these results with just diagonal restriction maps, decoupling performance from large stalk dimension, while reducing runtime and memory requirements.
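
The propagation step described in this abstract (a degree-K polynomial in a rescaled Laplacian, evaluated by a three-term recurrence and mixed with convex weights) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a Chebyshev basis, an ordinary normalised graph Laplacian standing in for the sheaf Laplacian (stalk dimension 1), and softmax-normalised mixture weights. The names `chebyshev_filter`, `normalised_laplacian`, and `logits` are hypothetical.

```python
import math


def matvec(M, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def softmax(logits):
    """Convex mixture weights: non-negative and summing to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalised_laplacian(A):
    """Symmetric normalised Laplacian I - D^{-1/2} A D^{-1/2} (assumes no isolated nodes)."""
    n = len(A)
    deg = [sum(row) for row in A]
    return [[(1.0 if i == j else 0.0) - A[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]


def chebyshev_filter(L, x, logits, lam_max=2.0):
    """Return sum_k w_k T_k(L_hat) x, with w = softmax(logits) and K = len(logits) - 1."""
    n = len(L)
    # Spectral rescaling: map eigenvalues from [0, lam_max] to [-1, 1].
    L_hat = [[(2.0 / lam_max) * L[i][j] - (1.0 if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    w = softmax(logits)
    t_prev = list(x)                      # T_0(L_hat) x = x
    out = [w[0] * v for v in t_prev]
    if len(w) == 1:
        return out
    t_curr = matvec(L_hat, x)             # T_1(L_hat) x = L_hat x
    out = [o + w[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(w)):
        # Three-term recurrence: T_k = 2 L_hat T_{k-1} - T_{k-2}.
        lt = matvec(L_hat, t_curr)
        t_next = [2.0 * a - b for a, b in zip(lt, t_prev)]
        out = [o + w[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out


# Toy example: path graph 0-1-2-3, unit impulse at node 0, K = 2.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
L = normalised_laplacian(A)
y = chebyshev_filter(L, [1.0, 0.0, 0.0, 0.0], logits=[0.0, 0.0, 0.0])
```

With K = 2 the impulse at node 0 reaches node 2 but not node 3, which is three hops away. That is the single-layer K-hop receptive field the abstract refers to.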

2511.19399 2026-05-18 cs.CL cs.AI cs.LG Version update

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh

Affiliations University of Washington; Allen Institute for AI; Carnegie Mellon University; Massachusetts Institute of Technology; Seattle Children's Hospital; University of California, Berkeley

AI summary The paper presents DR Tulu-8B, a deep research model that addresses the weak performance of existing open deep research agents on long-form, multi-step research tasks. It introduces Reinforcement Learning with Evolving Rubrics (RLER), in which rubrics co-evolve with the policy model during training, improving fact checking and feedback quality. DR Tulu-8B is the first fully open model trained directly for open-ended, long-form deep research. On benchmarks spanning science, healthcare, and general domains it substantially outperforms existing open models and matches or exceeds proprietary ones, at a much lower cost per query.

Comments ICML 2026

Details
English abstract

Deep research agents perform multi-step research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards, which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), where rubrics are constructed and maintained to co-evolve with the policy model during training. This allows the rubrics to incorporate newly explored information from search and contrasting model responses, enabling better fact checking and more discriminative on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first fully open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare, and general domains, DR Tulu substantially outperforms existing open deep research agents (by 15.6% over Tongyi DR on average) and matches or exceeds proprietary deep research agents (by 0.7% over OpenAI DR on average), while being significantly smaller and cheaper per query (1000x cheaper than OpenAI DR per query).

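2511.19115 2026-05-18 cs.AI cs.CY Version update

AI Consciousness and Existential Risk

Rufin VanRullen

Affiliations Frontier AI companies; independent foundations

AI summary This paper examines the relationship between AI consciousness and existential risk, noting that the two are often conflated even though consciousness and intelligence are empirically and theoretically distinct properties. It argues that intelligence is a direct predictor of an AI system's existential threat, whereas consciousness is not a direct threat in itself but may influence risk indirectly in certain scenarios. Making this distinction explicit helps AI safety researchers and policymakers identify and address the core issues more accurately.

Comments Updated for clarity and completeness following peer-review

Details
English abstract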

In AI, existential risk denotes the hypothetical threat posed by an artificial system that would possess both the capability and the objective, either directly or indirectly, to eradicate humanity. This issue is gaining prominence in scientific debate due to recent technical advancements and increased media coverage. In parallel, AI progress has sparked speculation and studies about the potential emergence of artificial consciousness. The two questions, AI consciousness and existential risk, are sometimes conflated, as if the former entailed the latter. Here, I explain that this view stems from a common confusion between consciousness and intelligence. Yet these two properties are empirically and theoretically distinct. Arguably, while intelligence is a direct predictor of an AI system's existential threat, consciousness is not. There are, however, certain incidental scenarios in which consciousness could influence existential risk, in either direction. Consciousness could be viewed as a means towards AI alignment, thereby lowering existential risk; or, it could be a precondition for reaching certain capabilities or levels of intelligence, and thus positively related to existential risk. Recognizing these distinctions can help AI safety researchers and public policymakers focus on the most pressing issues.

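2511.14282 2026-05-18 cs.LG cs.AI Version update

Weight Concentration Regularization for Improving Pruning Robustness Under High Sparsity

Vincent-Daniel Yun, Junhyuk Jo, Sunwoo Lee

Affiliations University of Southern California; Inha University

AI summary Deep neural networks perform well on vision and language tasks, but their large parameter counts limit deployment in resource-constrained settings. To address this, the paper proposes a Weight Concentration Regularizer (WCR) that amplifies the magnitude of a small subset of parameters during training while driving the rest toward zero, so that pruning mainly removes parameters that contribute little to the model's function, improving robustness at high sparsity. Experiments show that the method improves pruning robustness across tasks and architectures and is compatible with existing pruning-robust optimizers.

Details
English abstract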

Deep neural networks achieve outstanding performance across vision and language tasks, yet their large parameter counts limit deployment in resource-constrained settings. One-shot pruning reduces model size without retraining, but models trained with standard objectives often suffer substantial accuracy drops under aggressive sparsity. Prior work mitigates this drop along two directions: regularizers such as $\ell_1$ and DeepHoyer that shape the weight distribution during training, and pruning-robust optimizers such as SAM, CrAM, and S$^2$SAM that flatten the loss landscape. However, existing regularizers either shrink all weights uniformly ($\ell_1$) or induce scale-invariant sparsity (DeepHoyer), without concentrating weight energy onto a small set of informative parameters. We propose a Weight Concentration Regularizer (WCR), a training-time regularizer that amplifies the magnitude of a small subset of parameters while driving the remainder toward zero, so that magnitude pruning predominantly removes parameters with negligible functional contribution. We provide a convergence analysis and evaluate WCR on LLM fine-tuning, image classification, and medical segmentation, demonstrating consistent improvements in pruning robustness across architectures and compatibility with existing pruning-robust optimizers.
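
To make the motivation concrete, the sketch below shows one-shot magnitude pruning on two toy weight vectors, one with its energy concentrated in a few entries and one spread evenly. It does not implement WCR, whose loss is not given in this listing; the weight values and the `retained_energy` measure (fraction of squared weight norm that survives pruning) are illustrative assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    k = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


def retained_energy(original, pruned):
    """Fraction of the squared L2 norm that survives pruning."""
    total = sum(w * w for w in original)
    return sum(w * w for w in pruned) / total if total else 1.0


# Hypothetical weight vectors, chosen only for illustration.
concentrated = [3.0, -2.5, 0.01, -0.02, 0.005, 0.01, -0.01, 0.02]
spread = [1.0, -0.9, 0.8, -1.1, 0.95, 1.05, -0.85, 0.9]

kept_concentrated = retained_energy(concentrated, magnitude_prune(concentrated, 0.75))
kept_spread = retained_energy(spread, magnitude_prune(spread, 0.75))
```

At 75% sparsity the concentrated vector keeps almost all of its energy while the spread vector loses most of it. This is the intuition behind shaping the weight distribution before pruning.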

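2511.09884 2026-05-18 cs.AI Version update

Quantum Artificial Intelligence for Mission-Critical Systems: Foundations, Architectural Elements, and Future Directions

Siva Sai, Rajkumar Buyya

Affiliations Quantum Cloud Computing and Distributed Systems (qCLOUDS) Laboratory, School of Computing and Information Systems, The University of Melbourne

AI summary This paper explores the potential of Quantum Artificial Intelligence (QAI) in mission-critical systems such as defense, energy management, cybersecurity, and aerospace control, aiming to address the shortcomings of classical AI in reliability, timing, explainability, and safety. It systematically analyzes how well QAI methods can meet mission-critical requirements, proposes a conceptual framework for quantum cloud resource management and scheduling, and identifies gaps between current QAI capabilities and practical needs. It also discusses challenges including trainability limits, data access, and verification of quantum components, and outlines future directions in interpretability, scalability, and hardware feasibility.

Comments 15 pages, 5 figures, revised and accepted version of the paper

Details
English abstract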

Mission-critical (MC) applications such as defense operations, energy management, cybersecurity, and aerospace control require reliable, deterministic, and low-latency decision making under uncertainty. Although classical Artificial Intelligence (AI) approaches are effective, they often struggle to meet the stringent constraints of robustness, timing, explainability, and safety in MC domains. Quantum Artificial Intelligence (QAI), the fusion of artificial intelligence and quantum computing (QC), can potentially provide transformative solutions to the challenges faced by classical ML models. QAI is a broader umbrella than Quantum Machine Learning (QML) and additionally includes quantum optimization, search, and reasoning; we use QAI throughout the paper for the field at large, and QML only for learning-specific subroutines. The principal contributions of this work are: (i) a systematic survey of QAI methods analyzed through the lens of MC requirements like certification, robustness, and timing; (ii) a conceptual quantum cloud resource management and scheduling framework with deployment assumptions, complexity analysis, and failure-mode discussion; and (iii) an identification of the gaps between current QAI capabilities and MC systems requirements. We also propose a conceptual model for management of quantum resources and scheduling of applications driven by timeliness constraints. We discuss multiple challenges, including trainability limits, data access and loading bottlenecks, verification of quantum components, and adversarial QAI. Finally, we outline future research directions toward achieving interpretable, scalable, and hardware-feasible QAI models for MC application deployment.

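2510.22665 2026-05-18 cs.CV cs.AI Version update

SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery

Qiwei Ma, Xukun Lu, Wang Liu, Puhong Duan, Xudong Kang, Shutao Li

Affiliations School of Artificial Intelligence and Robotics, Hunan University; Yuelushan Center for Industrial Innovation; School of Medical Information Engineering, Jining Medical University

AI summary This paper presents SARVLM, the first vision-language foundation model designed for synthetic aperture radar (SAR) imagery, with the aim of improving semantic understanding of SAR images. To address the scarcity of multimodal SAR data and weak cross-modal representations, the authors build SARVLM-1M, a large-scale dataset of over one million image-text pairs, and design a two-stage domain transfer training strategy that uses optical remote sensing data as a bridge to improve performance in the SAR domain. Experiments show that SARVLM outperforms existing models on multiple benchmark tasks and advances semantic understanding of SAR imagery.

Comments 13 pages, 13 figures

Details
English abstract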

Synthetic Aperture Radar (SAR) is a critical imaging modality due to its all-weather operational capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these approaches primarily focus on low-level visual features and often neglect multi-modal representation. Moreover, multimodal data for SAR is scarce, limiting the development of robust cross-modal models. To address this limitation, we construct SARVLM-1M, a large-scale vision-language dataset comprising over one million image-text pairs aggregated from existing datasets. Furthermore, to mitigate the substantial differences between SAR and natural imagery, we propose a two-stage domain transfer training strategy that leverages optical remote sensing data as an intermediate bridge, facilitating effective knowledge transfer from natural images to SAR domains. Based on this strategy, we develop SARVLM, the first vision-language foundation model tailored for SAR, consisting of SARCLIP and SARCap. In addition, an ensemble strategy is utilized to improve the cross-scene generalization capability of the model. Moreover, SARDet and SARRot further validate the capability of the proposed framework in object detection. Extensive experiments on 13 benchmarks across image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning demonstrate the superior feature extraction and interpretation capabilities of SARVLM. It consistently outperforms state-of-the-art vision-language models and advances semantic understanding in SAR imagery. Code and datasets will be released on https://github.com/KlayMa527/SARVLM.git.

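2510.18814 2026-05-18 cs.LG cs.AI Version update

A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning

Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li

Affiliations The Chinese University of Hong Kong, Shenzhen; Shanghai Jiao Tong University

AI summary This paper studies whether a language model can improve its reasoning using only its own generated responses, without external reward signals. It proposes Self-evolving Post-Training (SePT), a simple post-training method that alternates between self-generation and training on the generated data. Experiments show that SePT improves reasoning on several math reasoning benchmarks, supporting the feasibility of self-improvement from self-generated supervision alone.

Details
English abstract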

Can language models improve their reasoning performance without external rewards, using only their own sampled responses for training? We show that they can. We propose Self-evolving Post-Training (SePT), a simple post-training method that alternates between self-generation and training on self-generated responses. It repeatedly samples questions, uses the model itself to generate responses under a specified sampling temperature, and then trains the model on the self-generated data. In this self-training loop, we use an online data refresh mechanism, where each new batch is generated by the most recently updated model. Across six math reasoning benchmarks, SePT improves a strong no-training baseline, defined as the untuned base model evaluated at its best swept decoding temperature, on several tested models. Additional ablations demonstrate the importance of online data refresh and temperature dynamics. Overall, our results identify a practical regime where reasoning can be improved using self-generated supervision alone. Our code is available at https://github.com/ElementQi/SePT.

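2510.10454 2026-05-18 cs.AI Version update

Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction

Sihang Zeng, Yujuan Fu, Sitong Zhou, Zixuan Yu, Lucas Jing Liu, Jun Wen, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Affiliations University of Washington; Fred Hutch Cancer Center; Harvard University; Google

AI summary This paper proposes Traj-CoA, a multi-agent system that models patient trajectories with a chain of agents to improve lung cancer risk prediction. A sequence of worker agents processes electronic health record (EHR) data chunk by chunk, distilling critical events into a shared long-term memory module, EHRMem, to reduce noise and preserve a complete timeline; a manager agent then synthesizes this information to make a prediction. In a zero-shot one-year lung cancer risk prediction task, Traj-CoA outperforms four categories of baselines and exhibits clinically aligned temporal reasoning.

Comments Accepted by NeurIPS 2025 GenAI4Health Workshop

Details
English abstract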

Large language models (LLMs) offer a generalizable approach for modeling patient trajectories, but suffer from the long and noisy nature of electronic health records (EHR) data in temporal reasoning. To address these challenges, we introduce Traj-CoA, a multi-agent system involving chain-of-agents for patient trajectory modeling. Traj-CoA employs a chain of worker agents to process EHR data in manageable chunks sequentially, distilling critical events into a shared long-term memory module, EHRMem, to reduce noise and preserve a comprehensive timeline. A final manager agent synthesizes the worker agents' summary and the extracted timeline in EHRMem to make predictions. In a zero-shot one-year lung cancer risk prediction task based on five-year EHR data, Traj-CoA outperforms baselines of four categories. Analysis reveals that Traj-CoA exhibits clinically aligned temporal reasoning, establishing it as a promisingly robust and generalizable approach for modeling complex patient trajectories. Implementation of Traj-CoA is available on https://github.com/zengsihang/Traj-CoA.

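2510.02734 2026-05-18 q-bio.BM cs.AI q-bio.GN Version update

SAE-RNA: A Sparse Autoencoder Model for Interpreting RNA Language Model Representations

Taehan Kim, Sangdae Nam

Affiliations Department of Computer Science, University of California, Berkeley; Department of Development Engineering, University of California, Berkeley

AI summary This paper presents SAE-RNA, a sparse autoencoder model for interpreting RNA language model representations, and investigates whether sparse autoencoders can decompose those representations into interpretable features. Built on RiNALMo, the method maps representations to known biological features to analyze how RNA language models organize biological information internally. The study offers a feature-level framework for comparing RNA groups and identifying components associated with RNA family identity or structural context, and discusses the applicability and limitations of sparse autoencoders for this task.

Comments 12 pages, 7 figures. v2: Updated bibliography to improve reference accuracy and reflect updated publication venues. Refined claims for better alignment with results and added an Appendix

Details
English abstract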

Deep learning, particularly with the advancement of Large Language Models, has transformed biomolecular modeling, with protein language models such as ESM inspiring emerging RNA language models such as RiNALMo. Recent work has begun applying sparse autoencoders (SAEs) to protein language model representations, exploring representation-level interpretability in biomolecular models. Here, we explore whether SAEs can provide interpretable feature decompositions of RNA language model representations, while also examining their limitations in this setting. We present SAE-RNA, an interpretability model that analyzes RiNALMo representations and maps them to known human-level biological features. Rather than claiming definitive biological concept discovery, our study frames SAE-based analysis as a representation-level probe for characterizing how RNA language models organize biological information internally. More broadly, SAE-RNA provides a feature-level framework for comparing RNA groups and identifying sparse representation components associated with RNA family identity or structural context.

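2510.02307 2026-05-18 cs.CV cs.AI Version update

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez

Affiliations Rice University

AI summary Text-to-image diffusion models often degrade when generating images at resolutions outside their training settings. Targeting low-resolution generation, this paper proposes NoiseShift, a training-free noise recalibration method that re-indexes the denoiser's noise conditioning to restore consistency between the forward and reverse processes, reducing train-test mismatch. Experiments show that NoiseShift improves low-resolution generation quality on several mainstream diffusion models, with a simple implementation and minimal inference overhead.

Details
English abstract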

Text-to-image diffusion models often degrade when sampled at resolutions outside the final training resolution set. Prior work has largely emphasized higher resolution generation, enabling pretrained diffusion models to extrapolate beyond the resolutions seen during training. In this work, we instead target lower-resolution generation, performing inference at reduced resolution to significantly cut computational cost. We show that network conditioning of the noise level induces a train-test mismatch that directly degrades low-resolution generation: the same scheduled noise level can correspond to a different perceptual corruption level at lower resolutions, mis-calibrating the denoiser timestep and noise embedding. To this end, we propose NoiseShift, a training-free recalibration method that keeps the original noise sampling schedule unchanged and instead re-indexes the noise conditioning of the denoiser to restore local forward-reverse consistency. Using a lightweight coarse-to-fine calibration on a small set of image-text pairs, NoiseShift learns a resolution-specific mapping from scheduler noise to conditioning noise, reducing train-test mismatch and improving lower-resolution generation quality. When NoiseShift is applied to Stable Diffusion 3 (SD3), Stable Diffusion 3.5 (SD3.5), and Flux-Dev, generation quality at low resolutions improves consistently. In particular, at 128x128 resolution on LAION-COCO, the FID of SD3 improves from 203 to 171 and the FID of SD3.5 improves from 310 to 277. Even Flux-Dev, which already implements a complementary time-shifting strategy, gets a modest boost from NoiseShift, with FID improving from 120 to 113 at 64x64 resolution. More importantly, NoiseShift achieves such improvements with minimal implementation changes and no additional inference overhead.

2509.24798 2026-05-18 cs.CV cs.AI Version update

Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin

Affiliations Centre for AI, DS&AI, AstraZeneca, UK; Institute for Imaging, Data and Communications (IDCOM), School of Engineering, University of Edinburgh, Edinburgh, UK; Shanghai Artificial Intelligence Laboratory

AI summary This paper proposes Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion models for counterfactual image generation. It performs causal interventions on target attributes and propagates their effects consistently to causally dependent attributes while preserving the core identity of the image. Unlike approaches that rely on prompt engineering, Causal-Adapter incorporates a structural causal model with attribute-regularization strategies, achieving more precise semantic control and high-fidelity generation, with strong results on multiple datasets.

Comments Project Page: https://leitong02.github.io/causaladapter/

Journal ref ICML 2026

Details
English abstract

We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method supports causal interventions on target attributes and consistently propagates their effects to causal dependents while preserving the core identity of the image. Unlike prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling with two attribute-regularization strategies: (i) prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and (ii) a conditioned token contrastive loss that disentangles attribute factors and reduces spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, including up to a 91% reduction in MAE on Pendulum for accurate attribute control and up to an 87% reduction in FID on ADNI for high-fidelity MRI generation. These results demonstrate robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation. Code and models will be released at: https://leitong02.github.io/causaladapter/.

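2508.20810 2026-05-18 cs.AI cs.CL Version update

From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Jessica M. Lundin, Usman Nasir Nakakana, Guillaume Chabot-Couture

Affiliations Gates Foundation

AI summary This paper proposes a graph-based framework for rigorous evaluation of domain-specific language models. It converts structured clinical guidelines into a queryable knowledge graph and dynamically generates evaluation questions via graph traversal, providing coverage, contamination resistance, and maintainability. Applied to the WHO IMCI guidelines, the framework generates multiple-choice questions covering symptom recognition, treatment, severity classification, and follow-up care, and reveals systematic capability gaps across language models on clinical decision tasks.

Details
English abstract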

Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.
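
The idea of instantiating questions by graph traversal can be sketched as follows. This is a minimal illustration, not the harness described in the paper: the graph uses placeholder node names rather than real guideline content, and `GRAPH`, `TEMPLATES`, and `generate_questions` are hypothetical names.

```python
import random

# Placeholder (head, relation, tail) triples; not real clinical content.
GRAPH = [
    ("condition_A", "has_symptom", "symptom_1"),
    ("condition_A", "treated_with", "treatment_X"),
    ("condition_B", "has_symptom", "symptom_2"),
    ("condition_B", "treated_with", "treatment_Y"),
    ("condition_C", "has_symptom", "symptom_3"),
    ("condition_C", "treated_with", "treatment_Z"),
]

TEMPLATES = {
    "has_symptom": "Which symptom is associated with {head}?",
    "treated_with": "Which treatment is indicated for {head}?",
}


def generate_questions(graph, rng, n_distractors=2):
    """Emit one multiple-choice question per edge, so every relationship is covered."""
    questions = []
    for head, relation, tail in graph:
        # Every tail that is a correct answer for this head and relation.
        correct = {t for h, r, t in graph if h == head and r == relation}
        # Distractors: tails of the same relation type that are not correct here.
        pool = sorted({t for _, r, t in graph if r == relation} - correct)
        options = rng.sample(pool, min(n_distractors, len(pool))) + [tail]
        rng.shuffle(options)  # vary surface form: option order changes with the seed
        questions.append({
            "question": TEMPLATES[relation].format(head=head),
            "options": options,
            "answer": tail,
        })
    return questions


questions = generate_questions(GRAPH, random.Random(0))
```

Because questions are regenerated from the graph, updating an edge when a guideline changes updates the benchmark, and a different seed yields different distractors and option orderings.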

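2508.17218 2026-05-18 cs.LG cs.AI Version update

Generalized Policy Gradient with History-Aware Decision Transformer for Reliable Routing over Graph Signals

Xing Wei, Yuanhang Wang, Duoxiang Zhao, Zezhou Zhang, Hao Qin, Yuqi Ouyang

Affiliations Sichuan University-Pittsburgh Institute; Sichuan University; College of Computer Science; University College Dublin; School of Electrical and Electronic Engineering; School of Electronics and Information Engineering

AI summary This study addresses reliable path planning in stochastic transportation networks and proposes GPG-HT, a policy framework that combines a history-aware Decision Transformer with generalized policy gradient optimization. By attending to historical node-edge-time observations, it captures non-Markovian spatial-temporal dependencies and makes more context-aware routing decisions under uncertainty. Experiments on standard transportation networks show consistent gains in on-time arrival probability over optimization and reinforcement learning baselines.

Details
English abstract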

Reliable path planning in stochastic transportation networks requires decisions that account for uncertain and correlated travel times on irregular road graphs, rather than only minimizing expected delay. Such networks exhibit strong spatial-temporal coupling, where link travel times evolve as stochastic processes over graph edges, making the problem inherently sequential under uncertainty. Existing stochastic on-time arrival (SOTA) methods primarily depend on the current node and remaining budget, which limits their ability to exploit trajectory-level temporal structure and history-dependent correlations. This work proposes GPG-HT, a history-aware graph-signal policy framework that integrates a Decision Transformer with generalized policy gradient optimization for reliable routing. By attending to historical node-edge-time observations, GPG-HT captures non-Markovian spatial-temporal dependencies and enables context-aware decision making under uncertainty. Experiments on the Sioux Falls and Anaheim networks demonstrate consistent gains in on-time arrival probability over representative optimization and reinforcement learning baselines.

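2507.16806 2026-05-18 cs.LG cs.AI cs.CL Version update

Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas

Affiliations Massachusetts Institute of Technology

AI summary This paper studies how to use reinforcement learning to train language models to better assess their own uncertainty when generating reasoning chains. Conventional approaches use binary reward functions that score only output correctness, which encourages confident wrong answers under uncertainty. The authors propose RLCR, which combines a binary correctness reward with a Brier score to optimize accuracy and confidence calibration jointly. Experiments show that RLCR substantially improves calibration across datasets without sacrificing accuracy, outperforming ordinary RL training and post-hoc confidence calibration.

Details
English abstract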

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models. Code, models, and further information are available at https://rl-calibration.github.io/.
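
A reward of the kind described here can be sketched in a few lines. The exact form below (correctness indicator minus squared error of the stated confidence) is an assumption based on the abstract's description, not a formula quoted from the paper; `rlcr_reward` and `expected_reward` are hypothetical names.

```python
def rlcr_reward(is_correct, confidence):
    """Binary correctness score minus a Brier penalty on the stated confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - y) ** 2
    return y - brier_penalty


def expected_reward(p_correct, confidence):
    """Expected reward when the answer is right with probability `p_correct`."""
    return (p_correct * rlcr_reward(True, confidence)
            + (1.0 - p_correct) * rlcr_reward(False, confidence))


# For a fixed answer that is right 70% of the time, scan stated confidences
# on a 0.01 grid and find the one with the highest expected reward.
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_reward(0.7, q))
```

Under this assumed form, a confidently wrong answer scores -1, an uncertain wrong answer scores close to 0, and the expected reward for an answer that is right with probability p is maximised by reporting confidence p. This is the calibration incentive of a proper scoring rule.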

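2507.01679 2026-05-18 cs.LG cs.AI cs.CL Version update

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov

Affiliations ILCC, University of Edinburgh; Fudan University; Qwen Team, Alibaba Group; ILLC, University of Amsterdam

AI summary This paper studies how to combine supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) in LLM post-training, and proposes Prefix-RFT, a hybrid strategy that uses prefix sampling to learn jointly from demonstrations and exploration. On mathematical reasoning tasks the method outperforms standalone SFT and RFT as well as other mixed-policy methods, confirming that SFT and RFT are complementary, and it is robust to changes in the quality and quantity of demonstration data.

Comments ICML 2026

Details
English abstract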

Existing LLM post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behavior cloning. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learning unexpected behaviors, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that Prefix-RFT is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our analysis highlights the complementary nature of SFT and RFT, validating that Prefix-RFT effectively harmonizes them. Further ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data.

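2507.00275 2026-05-18 cs.LG cs.AI Version update

Deep Double Q-learning

Prabhat Nagarajan, Martha White, Marlos C. Machado

Affiliations Department of Computing Science; University of Alberta; Alberta Machine Intelligence Institute; CIFAR AI Chair; Edmonton, AB, Canada

AI summary This paper proposes Deep Double Q-learning (DDQL), a deep reinforcement learning algorithm aimed at the overestimation problem of conventional deep Q-networks (DQN). It explicitly trains two independent Q-functions and stabilizes training with techniques such as lower replay ratios and longer target network update intervals. Experiments show that across 57 Atari 2600 games DDQL outperforms Double DQN in aggregate, does better on 47 of the games, and further reduces overestimation.

Comments 44 pages

Details
English abstract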

Double Q-learning is a classical control algorithm that mitigates the maximization bias of Q-learning. To do so, it explicitly trains two independent action-value functions and uses them to decouple action-selection and action-evaluation when computing bootstrap targets. Double DQN adapts target bootstrap decoupling to deep reinforcement learning (RL), but explicitly trains only a single action-value function and does not fully decouple its estimators. Consequently, the two estimators remain correlated, and overestimation persists. In this paper, we introduce Deep Double Q-learning (DDQL), a deep RL algorithm that explicitly trains two Q-functions through Double Q-learning. DDQL stabilizes training through a combination of techniques, including lower replay ratios, longer target network update intervals, and shared layers. Across 57 Atari 2600 games, DDQL improves aggregate performance over Double DQN, outperforming it on 47 games while further reducing overestimation. In addition, we study key design choices when adapting Double Q-learning to deep RL, including the network architecture, replay ratio, and minibatch sampling strategies.
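
The decoupling of action selection from action evaluation is easiest to see in the classical tabular algorithm. The sketch below shows one tabular Double Q-learning update; it is not DDQL itself, which uses neural networks, replay, and target networks. The table layout (lists indexed by state, then action) and the `update_a` switch are illustrative choices.

```python
import random


def double_q_update(q_a, q_b, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99, update_a=None):
    """One tabular Double Q-learning step on tables q[state][action]."""
    if update_a is None:
        update_a = random.random() < 0.5  # pick which table to update by coin flip
    learner, evaluator = (q_a, q_b) if update_a else (q_b, q_a)
    if done:
        target = reward
    else:
        # Select the greedy action with the table being updated...
        n_actions = len(learner[next_state])
        best = max(range(n_actions), key=lambda a: learner[next_state][a])
        # ...but evaluate that action with the other table.
        target = reward + gamma * evaluator[next_state][best]
    learner[state][action] += alpha * (target - learner[state][action])


# Two states, two actions. Table A overrates action 1 in state 1.
q_a = [[0.0, 0.0], [1.0, 5.0]]
q_b = [[0.0, 0.0], [2.0, 0.5]]
double_q_update(q_a, q_b, state=0, action=0, reward=0.0,
                next_state=1, done=False, update_a=True)
```

Here table A selects action 1 in the next state, but the bootstrap value comes from table B (0.5 rather than 5.0), so A's overestimate does not feed back into its own target.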

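2506.14829 2026-05-18 cs.HC cs.AI cs.LG Version update

The Hardness of Achieving Impact in AI for Social Impact Research: A Ground-Level View of Challenges & Opportunities

Aditya Majumdar, Wenbo Zhang, Kashvi Prawal, Amulya Yadav

Affiliations The Pennsylvania State University

AI summary This paper examines the main challenges and opportunities that AI for Social Impact (AI4SI) research faces in real-world deployment. Based on interviews with 26 AI4SI researchers, it analyzes structural, organizational, communication, and collaboration barriers to deploying AI4SI, and distills workable collaboration strategies and practical lessons. The study offers practical guidance for AI researchers and organizations seeking social impact.

Comments To be published in FAccT'26

Details
English abstract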

AI for Social Impact (AI4SI) is an emergent field harnessing interdisciplinarities between the fields of artificial intelligence (AI), machine learning (ML), and the social sciences to address societal issues aligned with the United Nations Sustainable Development Goals (UN SDGs), such as universal healthcare, climate action, etc. Despite AI4SI's rising popularity, achieving tangible, on-the-ground impact remains a significant challenge. In particular, identifying collaborators open to co-designing and deploying AI4SI-based solutions in real-world settings is often difficult. Thus, many projects stall at the proof-of-concept stage, unable to scale to production-level deployment. Drawing on twenty-six AI4SI researchers' interviews, primarily from academic institutions though also including some industry researchers and practitioners, and the authors' own lived experiences, this paper employs thematic analysis to highlight structural, organizational, communication, collaboration, and operational challenges hindering socially impactful AI4SI deployments. While there are no easy fixes, the authors synthesize best practices and actionable strategies from interviews and personal experiences, positioning this paper as a practical guide for AI4SI researchers and organizations pursuing socially impactful collaborations. (Footnote: We note that our findings are most directly applicable to academic research groups in the global north, as governmental, startup, and global south researchers' perspectives are underrepresented in our sample.)

2506.06739 2026-05-18 cs.AI cs.LG Updated version

Honey, I shrunk the hypothesis space (through logical preprocessing)

Andrew Cropper, Filipe Gouveia, David M. Cerna

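Affiliations * ELLIS Institute; University of Helsinki; Czech Academy of Sciences Institute of Computer Science; Dynatrace Research

AI summary This work proposes a method that shrinks the hypothesis space of inductive logic programming (ILP) through logical preprocessing. Using background knowledge, the method removes, before learning, rules that cannot appear in an optimal hypothesis regardless of the training data, for example rules that violate relationships such as "even numbers cannot be odd". Experiments show that the approach substantially reduces learning times while maintaining predictive accuracy; for instance, with only 10 seconds of preprocessing, it reduces learning times from over 10 hours to only 2 seconds.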

Comments Published in JAIR

Journal ref Journal of Artificial Intelligence Research, Vol. 85 (2026)

Details
English abstract

Inductive logic programming (ILP) is a form of logical machine learning. The goal is to search a hypothesis space for a hypothesis that generalises training examples and background knowledge. We introduce an approach that 'shrinks' the hypothesis space before an ILP system searches it. Our approach uses background knowledge to find rules that cannot be in an optimal hypothesis regardless of the training examples. For instance, our approach discovers relationships such as "even numbers cannot be odd" and "prime numbers greater than 2 are odd". It then removes violating rules from the hypothesis space. We implement our approach using answer set programming and use it to shrink the hypothesis space of a constraint-based ILP system. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can substantially reduce learning times whilst maintaining predictive accuracies. For instance, given just 10 seconds of preprocessing time, our approach can reduce learning times from over 10 hours to only 2 seconds.
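
The pruning idea can be illustrated with a minimal Python sketch. It is not the authors' answer set programming implementation: the predicate names, the toy background knowledge, and the helpers `exclusive_pairs` and `shrink` are hypothetical, and the sketch assumes a small finite domain. It derives mutually exclusive predicate pairs from the background facts and drops every candidate rule body that contains such a pair, since that body cannot be satisfied whatever the training examples are.

```python
from itertools import combinations

# Toy background knowledge (hypothetical): constants for which each unary predicate holds.
background = {
    "even": {0, 2, 4, 6, 8},
    "odd": {1, 3, 5, 7, 9},
    "prime": {2, 3, 5, 7},
}

def exclusive_pairs(bk):
    """Pairs of predicates that never hold for the same constant in the background knowledge."""
    return {
        frozenset((p, q))
        for p, q in combinations(sorted(bk), 2)
        if not (bk[p] & bk[q])
    }

def shrink(bodies, bk):
    """Keep only rule bodies (sets of predicates over one variable) with no exclusive pair."""
    bad = exclusive_pairs(bk)
    return [body for body in bodies if not any(pair <= body for pair in bad)]

# Hypothesis space: every non-empty conjunction of the three predicates over one variable.
preds = sorted(background)
space = [frozenset(c) for r in range(1, len(preds) + 1) for c in combinations(preds, r)]
kept = shrink(space, background)
```

In this toy setting only bodies that mention both `even` and `odd` are removed, so a downstream learner would search a smaller space without losing any satisfiable rule.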

2505.21535 2026-05-18 cs.CV cs.AI cs.LG Updated version

FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

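Affiliations * University of Arizona; TetraMem, Inc.

AI summary This paper proposes FAR, a function-preserving attention replacement framework that addresses the inefficiency of Transformer inference on memristor-based (ReRAM) in-memory computing (IMC) devices. FAR replaces the attention mechanism in pretrained DeiT models with a multi-head bidirectional LSTM structure compatible with IMC dataflows and combines it with block-wise knowledge distillation and structured pruning, retaining functional equivalence while significantly reducing latency and parameter count. Experiments show that FAR maintains accuracy comparable to the original models on ImageNet and multiple downstream tasks, indicating its potential for efficient deployment of Transformer models on edge devices.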

Comments 7 pages main paper, 6 figures; accepted by GLSVLSI 2026

Details
English abstract

While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy to the original attention-based models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.

2505.19241 2026-05-18 cs.LG cs.AI Updated version

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low

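Affiliations * Department of Computer Science, National University of Singapore; Singapore-MIT Alliance for Research and Technology Centre; The Chinese University of Hong Kong, Shenzhen, China; CSAIL, Massachusetts Institute of Technology; Institute of Data Science, National University of Singapore

AI summary This paper proposes ActiveDPO, an active direct preference optimization method that aims to improve sample efficiency when aligning large language models. The method builds on a theoretically grounded data selection criterion that applies to non-linear reward functions, and it uses the LLM being aligned to parameterize the reward model directly, which guides data selection more effectively. Experiments show that ActiveDPO outperforms existing methods across a variety of models and real-world preference datasets, improving both alignment quality and data efficiency.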

Comments Accepted at ICLR 2026

Details
English abstract

The recent success in using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks, such as question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality datasets of human preferences. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive assumptions about the reward function, such as linear latent reward functions. To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model used for active data selection. As a result, ActiveDPO explicitly accounts for the LLM's influence on data selection, unlike methods that select data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Our extensive experiments demonstrate that ActiveDPO outperforms existing methods across various models and real-world preference datasets.

2505.18134 2026-05-18 cs.AI cs.CL cs.CV Updated version

VideoGameBench: Can Vision-Language Models complete popular video games?

Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, Ofir Press

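Affiliations * Princeton University

AI summary VideoGameBench is a benchmark for evaluating whether vision-language models (VLMs) can complete popular video games. It comprises 10 classic games from the 1990s, with which models interact in real time using only raw visual input and a description of the objectives. The study shows that current frontier VLMs perform poorly on real-time game tasks and struggle to complete full games, largely because of inference latency. The authors therefore also introduce VideoGameBench Lite to ease the real-time constraint, and they report that completion rates of state-of-the-art models on the benchmark remain very low.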

Comments 10 pages, 38 pages including supplementary

Details
English abstract

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing models, Gemini 2.5 Pro and Claude 3.7 Sonnet, complete only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.

2504.08300 2026-05-18 cs.CL cs.AI Updated version

Large Language Models Could Be Rote Learners

Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin

Affiliations * College of Computer Science and Technology, Zhejiang University; State Key Laboratory of Transvascular Implantation Devices and TIDRI; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence; School of Data Science and Engineering, East China Normal University; Second Affiliated Hospital and Liangzhu Laboratory, Zhejiang University School of Medicine; Alibaba Group

AI summary This paper studies whether the benchmark performance of large language models (LLMs) is affected by training-data contamination and argues that current benchmark-based evaluation may overestimate a model's true capability. The authors propose TrinEval, a new evaluation framework that reformulates multiple-choice questions to reduce reliance on memorization and thereby assess genuine capability more accurately. Experiments show that mainstream LLMs rely on rote memorization, rather than genuine understanding and reasoning, for about 19.6% of knowledge points across multiple datasets.

Comments Work in Progress

Details
English abstract

Benchmark-based evaluation, e.g., multiple-choice questions (MCQs) and open-ended questions (OEQs), is widely used for evaluating Large Language Models (LLMs), yet its reliability is undermined by benchmark contamination. When pre-exposed to the testing benchmark during training, less capable LLMs have been found to achieve inflated performance, thereby yielding erroneous results in LLM evaluation. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle and expose genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions of MCQs, we uncover a counterintuitive trend: LLMs perform worse on memorized benchmarks than on non-memorized ones, indicating the coexistence of two learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework that reformulates MCQs into an alternative knowledge-centric trinity format, reducing memorization while preserving inherent knowledge, enabling the evaluation of genuine capability in the presence of memorization. Extensive experiments validate the effectiveness and robustness of TrinEval in reformulating benchmarks, and the evaluation results further reveal that mainstream LLMs rely on rote memorization for an average of 19.6% of knowledge points across the MMLU and GSM8K datasets.

2503.07518 2026-05-18 cs.CL cs.AI cs.LG Updated version

TokenButler: Token Importance is Predictable

Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Sameh Gobriel, Nilesh Jain, Mohamed S. Abdelfattah

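Affiliations * Cornell University; Intel Labs

AI summary Large language models rely on the key-value cache (KV-Cache) to store history during decoding, but as the cache grows it becomes a memory and compute bottleneck. To address this, the paper proposes TokenButler, a high-granularity, query-aware predictor of token importance that dynamically selects critical tokens under a fixed budget while preserving the full KV cache. The method learns to predict low-dimensional importance queries and combines them with projections of the cached keys for cheap scoring. Experiments show strong performance on long-context tasks and a notable inference speedup.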

Details
English abstract

Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck: prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and heavily dependent on the input query. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks of tokens, and many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. TokenButler predicts low-dimensional importance queries at a fixed depth stride, and combines them with a learned projection of the real KV-cache keys to score tokens cheaply, enabling dynamic per-token selection under a fixed budget while preserving the full KV cache. We train TokenButler by distilling the model's masked causal attention distributions, optimizing a lightweight predictor with minimal parameter overhead. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy where existing methods fail. Furthermore, TokenButler achieves competitive or superior performance on long-context benchmarks (RULER, LongBench), up to $\approx1.6\times$ on-GPU speedup using our proposed *prediction interval with neighbor fetching* that amortizes predictor cost while maintaining accuracy within $\approx$1.1%, and up to 7.6$\times$ reduction in latency compared to Dense Attention with CPU offloading. Code is available: https://github.com/abdelfattah-lab/TokenButler
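
The per-token selection step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the released TokenButler code: the function names, the two-dimensional vectors, and the dot-product score are hypothetical stand-ins for the learned importance query and the learned key projection.

```python
import heapq

def score_tokens(importance_query, projected_keys):
    """Cheap importance score per cached token: dot product in the low-dimensional space."""
    return [sum(q * k for q, k in zip(importance_query, key)) for key in projected_keys]

def select_tokens(importance_query, projected_keys, budget):
    """Return the positions of the `budget` highest-scoring tokens, in sequence order.

    Nothing is evicted: the caller keeps the full cache and only attends to these positions.
    """
    scores = score_tokens(importance_query, projected_keys)
    top = heapq.nlargest(budget, range(len(scores)), key=scores.__getitem__)
    return sorted(top)

# Hypothetical low-dimensional projections of four cached keys.
projected_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
selected = select_tokens([1.0, 0.2], projected_keys, budget=2)
```

Because the selection is recomputed for each query, a token skipped at one decoding step can still be chosen at a later step, which is the difference from permanent eviction.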

2503.02597 2026-05-18 cs.CV cs.AI Updated version

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi

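Affiliations * Sony Group Corporation, Tokyo, Japan

AI summary Recent multimodal large language models (MLLMs) have made notable progress in understanding and reasoning over multimodal information, but alignment between the vision and language modalities remains a key challenge. Working at the level of model architecture, this paper proposes modality-mutual attention (MMA), which unlocks causal attention into cross-modal mutual attention so that image tokens can attend to text tokens, improving the model's understanding of its inputs. The method achieves superior performance on multiple multimodal understanding benchmarks without adding parameters, and it is designed to be generic and scalable.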

Comments ICML 2026. Code is available at https://github.com/sony/aki

Details
English abstract

Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, which limits the ability of the earlier modalities (e.g., images) to incorporate information from the latter modalities (e.g., text). To address this problem, we introduce an MLLM that unlocks causal attention into our proposed modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows MMA to achieve state-of-the-art performance in 12 multimodal understanding benchmarks (+6.2% on average across 3 LLM backbones) without introducing additional parameters. Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.
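
One way to read the masking change is as follows; this is a minimal sketch based only on the abstract, and the function names and the four-token layout are hypothetical. It starts from a standard causal mask, where position i may attend to positions j <= i, and additionally lets every image token attend to every text token, including text that appears later in the sequence.

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def modality_mutual_mask(modalities):
    """Causal mask, unlocked so that image queries can also see all text keys."""
    n = len(modalities)
    mask = causal_mask(n)
    for i in range(n):
        for j in range(n):
            if modalities[i] == "image" and modalities[j] == "text":
                mask[i][j] = True
    return mask

# Two image tokens followed by two text tokens (illustrative layout).
modalities = ["image", "image", "text", "text"]
mask = modality_mutual_mask(modalities)
```

Text rows stay strictly causal in this sketch, so only the image rows change, and no new parameters are involved because the change is confined to the mask.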

2501.19128 2026-05-18 cs.LG cs.AI Updated version

Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach

Wenyun Li, Wenjie Huang, Chen Sun

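Affiliations * Department of Mathematics, The University of Hong Kong (HKU); Department of Data and Systems Engineering, HKU; Musketeers Foundation Institute of Data Science, HKU

AI summary In reinforcement learning, sparse reward signals make it hard to learn a reward function. This paper proposes a semi-supervised approach that combines non-zero-reward transitions with data augmentation and uses the many zero-reward transitions to learn trajectory-space representations, improving reward shaping. Experiments show that the method outperforms supervised approaches on Atari and robotic manipulation tasks; in sparser-reward environments, its peak score reaches up to twice that of supervised baselines.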

Details
English abstract

In many real-world scenarios, reward signals for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping. To address this issue, the proposed approach in this paper performs reward shaping not only by utilizing non-zero-reward transitions but also by employing the Semi-Supervised Learning (SSL) technique combined with a novel data augmentation to learn trajectory space representations from the majority of transitions, i.e., zero-reward transitions, thereby improving the efficacy of reward shaping. Experimental results in Atari and robotic manipulation demonstrate that our method outperforms supervised-based approaches in reward inference, leading to higher agent scores. Notably, in more sparse-reward environments, our method achieves up to twice the peak scores compared to supervised baselines. The proposed double entropy data augmentation enhances performance, showcasing a 15.8% increase in best score over other augmentation methods.

2412.12636 2026-05-18 cs.DC cs.AI cs.LG cs.PF Updated version

TrainMover: An Interruption-Resilient Runtime for ML Training

ChonLam Lao, Jiaqi Gao, Jiamin Cao, Zhipeng Zhang, Pengcheng Zhang, Jiangfei Duan, Zhilong Zheng, Yu Guan, Yichi Xu, Yong Li, Zhengping Qian, Aditya Akella, Minlan Yu, Ennan Zhai, Dennis Cai, Jingren Zhou

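Affiliations * Harvard University; Alibaba Group; UT Austin

AI summary Large-scale machine learning training jobs are often interrupted by hardware or software failures and management events, and existing approaches such as checkpoint-restart or runtime reconfiguration tend to incur long downtimes and degraded performance. This paper presents TrainMover, a resilient LLM training runtime that uses elastic and standby machines to handle interruptions with minimal downtime and zero memory overhead. TrainMover introduces two-phase, delta-based communication group setup, communication-free sandboxed warmup, and a general standby design. Experiments show that it keeps downtime at around 20 seconds when handling interruptions at the thousand-GPU scale and is projected to reduce wasted GPU hours by 55% compared with the best alternative.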

Comments 14 pages body, 19 pages total

Details
English abstract

Large-scale ML training jobs are frequently interrupted by hardware and software anomalies, failures, and management events. Existing solutions like checkpoint-restart or runtime reconfiguration suffer from long downtimes and degraded performance. We present TrainMover, a resilient LLM training runtime that leverages elastic and standby machines to handle interruptions with minimal downtime and zero memory overhead. To achieve these goals, TrainMover introduces three key techniques: two-phase, delta-based communication group setup; communication-free sandboxed warmup; and general standby design that enables failure recovery from any role. Our evaluation shows that TrainMover consistently achieves around 20 seconds of downtime when handling various interruptions at the 1024-GPU scale. TrainMover is projected to reduce wasted GPU hours by 55% compared to the best alternative, saving 1.4 million GPU-hours per week at the 64K-GPU scale.

2410.02832 2026-05-18 cs.CR cs.AI Updated version

FlipAttack: Jailbreak LLMs via Flipping

Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Yingwei Ma, Jiaheng Zhang, Bryan Hooi

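Affiliations * Integrative Sciences and Engineering Programme, NUS Graduate School, National University of Singapore; Institute of Data Science (IDS), National University of Singapore; Department of Computer Science, School of Computing, National University of Singapore

AI summary This paper proposes FlipAttack, a simple yet effective jailbreak attack against black-box large language models. It exploits the tendency of LLMs to read text from left to right: noise added to the left side of a prompt disrupts the model's comprehension and disguises the harmful instruction, and the idea is generalized to four flipping modes. Experiments show that FlipAttack is universal, stealthy, and simple, can jailbreak a model within a single query, and reaches an attack success rate of about 98% on GPT-4o.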

Comments 43 pages, 31 figures

Details
English abstract

This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, from the autoregressive nature, we reveal that LLMs tend to understand the text from left to right and find that they struggle to comprehend the text when noise is added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing left-side noise merely based on the prompt itself, then generalize this idea to 4 flipping modes. Second, we verify the strong ability of LLMs to perform the text-flipping task, and then develop 4 variants to guide LLMs to denoise, understand, and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves $\sim$98% attack success rate on GPT-4o, and $\sim$98% bypass rate against 5 guardrail models on average. The code is available at https://github.com/yueliu1999/FlipAttack.

2409.11022 2026-05-18 cs.CL cs.AI Updated version

DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

Hanjun Luo, Yingbin Jin, Xinfeng Li, Xuecheng Liu, Ruizhe Chen, Tong Shang, Kun Wang, Qingsong Wen, Zuozhu Liu

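Affiliations * New York University Abu Dhabi; Zhejiang University; The Hong Kong Polytechnic University; Nanyang Technological University; University of Electronic Science and Technology of China; Texas A&M University; Squirrel AI

AI summary As large language models (LLMs) are increasingly applied to named entity recognition (NER), existing datasets no longer meet the needs of LLM-based methods in corpus selection and design logic. This paper proposes DynamicNER, a dynamic, multilingual, fine-grained NER dataset designed for LLMs, in which the same entity can take different entity types in different contexts; it covers 8 languages and 155 entity types across a broad range of domains. The paper also proposes CascadeNER, which uses a two-stage strategy and lightweight LLMs for more efficient fine-grained recognition. Experiments show that DynamicNER is an effective benchmark for LLM-based NER.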

Comments This paper is accepted by EMNLP 2025 Main Conference

Details
English abstract

The advancements of Large Language Models (LLMs) have spurred a growing interest in their application to Named Entity Recognition (NER) methods. However, existing datasets are primarily designed for traditional machine learning methods and are inadequate for LLM-based methods, in terms of corpus selection and overall dataset design logic. Moreover, the prevalent fixed and relatively coarse-grained entity categorization in existing datasets fails to adequately assess the superior generalization and contextual understanding capabilities of LLM-based methods, thereby hindering a comprehensive demonstration of their broad application prospects. To address these limitations, we propose DynamicNER, the first NER dataset designed for LLM-based methods with dynamic categorization, introducing various entity types and entity type lists for the same entity in different contexts, thereby better leveraging the generalization of LLM-based NER. The dataset is also multilingual and multi-granular, covering 8 languages and 155 entity types, with corpora spanning a diverse range of domains. Furthermore, we introduce CascadeNER, a novel NER method based on a two-stage strategy and lightweight LLMs, achieving higher accuracy on fine-grained tasks while requiring fewer computational resources. Experiments show that DynamicNER serves as a robust and effective benchmark for LLM-based NER methods. We also analyze traditional methods and LLM-based methods on our dataset. Our code and dataset are openly available at https://github.com/Astarojth/DynamicNER.

2406.18944 2026-05-18 cs.CV cs.AI cs.CR Updated version

Rethinking and Red-Teaming Protective Perturbation in Personalized Diffusion Models

Yixin Liu, Ruoxi Chen, Xun Chen, Lichao Sun

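Affiliations * Lehigh University; Lehigh University, Computer Science and Engineering, Bethlehem, PA, USA; Independent Researcher; Independent Researcher, Fremont, California, USA

AI summary Personalized diffusion models (PDMs) excel at generating images of specific subjects from small amounts of data, but they are highly sensitive to minor adversarial perturbations, so performance degrades significantly when they are fine-tuned on corrupted data. This paper analyzes PDM fine-tuning through the lens of shortcut learning, shows that adversarial perturbations induce a latent semantic misalignment in the CLIP embedding space, and proposes a systematic red-teaming framework that combines image purification with contrastive decoupling learning, effectively improving the model's robustness and generalization.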

Comments Code is available at https://github.com/liuyixin-louis/DiffShortcut

Details
English abstract

Personalized diffusion models (PDMs) have become prominent for adapting pre-trained text-to-image models to generate images of specific subjects using minimal training data. However, PDMs are susceptible to minor adversarial perturbations, leading to significant degradation when fine-tuned on corrupted datasets. These vulnerabilities are exploited to create protective perturbations that prevent unauthorized image generation. Existing purification methods attempt to red-team the protective perturbation to break the protection but often over-purify images, resulting in information loss. In this work, we conduct an in-depth analysis of the fine-tuning process of PDMs through the lens of shortcut learning. We hypothesize and empirically demonstrate that adversarial perturbations induce a latent-space misalignment between images and their text prompts in the CLIP embedding space. This misalignment causes the model to erroneously associate noisy patterns with unique identifiers during fine-tuning, resulting in poor generalization. Based on these insights, we propose a systematic red-teaming framework that includes data purification and contrastive decoupling learning. We first employ off-the-shelf image restoration techniques to realign images with their original semantic content in latent space. Then, we introduce contrastive decoupling learning with noise tokens to decouple the learning of personalized concepts from spurious noise patterns. Our study not only uncovers shortcut learning vulnerabilities in PDMs but also provides a thorough evaluation framework for developing stronger protection. Our extensive evaluation demonstrates its advantages over existing purification methods and its robustness against adaptive perturbations.

2404.03099 2026-05-18 cs.LG cs.AI cs.CE cs.IT math.IT stat.ML Updated version

Composite Bayesian Optimization In Function Spaces Using NEON -- Neural Epistemic Operator Networks

Leonardo Ferreira Guilhoto, Paris Perdikaris

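Affiliations * Graduate Group in Applied Mathematics and Computational Science; University of Pennsylvania; Department of Mechanical Engineering and Applied Mechanics

AI summary This paper proposes NEON, a neural network architecture for making predictions with uncertainty in infinite-dimensional function spaces, with far fewer parameters than deep ensembles of comparable performance. The work focuses on composite Bayesian optimization, that is, optimizing a composite function formed from an unknown map and a known functional, and experiments show that NEON achieves state-of-the-art optimization results in several scenarios while substantially reducing model complexity.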

Journal ref Guilhoto, Leonardo Ferreira, and Paris Perdikaris. "Composite Bayesian optimization in function spaces using NEON - Neural Epistemic Operator Networks." Scientific Reports 14.1 (2024): 29199

Details
English abstract

Operator learning is a rising field of scientific computing where inputs or outputs of a machine learning model are functions defined in infinite-dimensional spaces. In this paper, we introduce NEON (Neural Epistemic Operator Networks), an architecture for generating predictions with uncertainty using a single operator network backbone, which presents orders of magnitude fewer trainable parameters than deep ensembles of comparable performance. We showcase the utility of this method for sequential decision-making by examining the problem of composite Bayesian Optimization (BO), where we aim to optimize a function $f=g\circ h$, where $h:X\to C(\mathcal{Y},\mathbb{R}^{d_s})$ is an unknown map which outputs elements of a function space, and $g: C(\mathcal{Y},\mathbb{R}^{d_s})\to \mathbb{R}$ is a known and cheap-to-compute functional. By comparing our approach to other state-of-the-art methods on toy and real-world scenarios, we demonstrate that NEON achieves state-of-the-art performance while requiring orders of magnitude fewer trainable parameters.
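
The composite structure $f=g\circ h$ can be made concrete with a small Python sketch. Everything here is an illustrative assumption rather than NEON itself: `surrogate_h` is a hypothetical stand-in for a learned operator whose output function is sampled on a fixed grid, the argument `z` mimics a source of epistemic variation, and `g` is an arbitrary cheap functional chosen for the example.

```python
import math
import statistics

# Points y in the domain Y at which the output function h(x) is sampled.
GRID = [i / 10 for i in range(11)]

def g(values):
    """Known, cheap functional: negative mean squared value of the function on the grid."""
    return -sum(v * v for v in values) / len(values)

def surrogate_h(x, z):
    """Hypothetical surrogate for the unknown map h: returns h(x) sampled on GRID.

    Different z values perturb the prediction to mimic epistemic uncertainty.
    """
    return [math.sin(x + y) + 0.1 * z for y in GRID]

def composite_stats(x, zs):
    """Push each sampled prediction of h(x) through g, then summarize f(x) = g(h(x))."""
    samples = [g(surrogate_h(x, z)) for z in zs]
    return statistics.mean(samples), statistics.pstdev(samples)

mean, spread = composite_stats(0.5, zs=[-1.0, 0.0, 1.0])
```

The point of the composite setting is visible here: the uncertainty lives in the prediction of the function `h(x)`, and the known `g` is applied exactly to each sample, so a mean and a spread for `f(x)` come for free and could feed an acquisition function.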

2403.13805 2026-05-18 cs.CV cs.AI cs.LG Updated version

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

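Affiliations * Shanghai Jiao Tong University; Shanghai AI Laboratory; The Chinese University of Hong Kong; MThreads, Inc.; Nanyang Technological University

AI summary This paper proposes RAR, a method for improving multimodal large language models (MLLMs) on fine-grained and few-shot visual recognition. RAR combines CLIP's multimodal retrieval with the rich knowledge of MLLMs: a multimodal retriever extends the model beyond its context window, and at inference time the retrieved category information is passed to the MLLM for ranking and prediction. The method addresses the performance drop MLLMs show when facing many categories and yields significant gains on multiple fine-grained and zero-shot recognition benchmarks.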

Comments Project: https://github.com/Liuziyu77/RAR

Details
English abstract

CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noisy image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines with an increase in category numbers, primarily due to growing complexity and constraints of limited context window size. To synergize the strengths of both approaches and enhance the few-shot/zero-shot recognition abilities for datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We initially establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and 2 object detection datasets under the zero-shot recognition setting.
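
The retrieve-then-rank flow can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the RAR implementation: the three-dimensional vectors stand in for CLIP embeddings, the category names are made up, and `mllm_scores` is a hypothetical placeholder for whatever preference an MLLM would express over the retrieved candidates.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query, memory, k):
    """Return the k category names whose stored embeddings are most similar to the query."""
    ranked = sorted(memory, key=lambda name: cosine(query, memory[name]), reverse=True)
    return ranked[:k]

def rank_and_predict(candidates, ranker):
    """Let a ranker (an MLLM in the paper, a lookup here) pick the final label."""
    return max(candidates, key=ranker)

# Explicit memory: category name -> embedding (toy values).
memory = {
    "husky": [0.9, 0.1, 0.0],
    "malamute": [0.8, 0.2, 0.1],
    "tabby cat": [0.1, 0.9, 0.0],
    "sports car": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]
candidates = retrieve(query, memory, k=2)

# Hypothetical MLLM preference over the shortlist.
mllm_scores = {"husky": 0.3, "malamute": 0.7}
prediction = rank_and_predict(candidates, ranker=lambda name: mllm_scores.get(name, 0.0))
```

Retrieval narrows a large vocabulary to a shortlist that fits in the context window, and the ranker only has to separate the few fine-grained candidates that remain.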

2402.10380 2026-05-18 cs.LG cs.AI cs.CL Updated version

Subgraph-level Universal Prompt Tuning

Junhyun Lee, Wooseong Yang, Jaewoo Kang

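Affiliations * Korea University; University of Illinois at Chicago

AI summary Adapting graph neural networks pre-trained with different strategies remains a challenge. This paper proposes Subgraph-level Universal Prompt Tuning (SUPT), which assigns prompt features at the subgraph level, keeping the method universal while greatly reducing the number of tuning parameters. Experiments show that SUPT performs well across a range of downstream tasks, with an average improvement of more than 6.6% in few-shot scenarios.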

Journal ref Information Sciences 749 (2026) 123516

Details
English abstract

In the evolving landscape of machine learning, the adaptation of pre-trained models through prompt tuning has become increasingly prominent. This trend is particularly observable in the graph domain, where diverse pre-training strategies present unique challenges in developing effective prompt-based tuning methods for graph neural networks. Previous approaches have been limited, focusing on specialized prompting functions tailored to models with edge prediction pre-training tasks. These methods, however, suffer from a lack of generalizability across different pre-training strategies. Recently, a simple prompt tuning method has been designed for any pre-training strategy, functioning within the input graph's feature space. This allows it to theoretically emulate any type of prompting function, thereby significantly increasing its versatility for a range of downstream applications. Nevertheless, the capacity of such simple prompts to fully grasp the complex contexts found in graphs remains an open question, necessitating further investigation. Addressing this challenge, our work introduces the Subgraph-level Universal Prompt Tuning (SUPT) approach, focusing on the detailed context within subgraphs. In SUPT, prompt features are assigned at the subgraph level, preserving the method's universal capability. SUPT requires far fewer tuning parameters than fine-tuning-based methods while outperforming them in 42 out of 45 full-shot scenario experiments with an average improvement of over 2.5%. In few-shot scenarios, it excels in 41 out of 45 experiments, achieving an average performance increase of more than 6.6%.

2311.03658 2026-05-18 cs.CL cs.AI cs.LG stat.ML Updated version

The Linear Representation Hypothesis and the Geometry of Large Language Models

Kiho Park, Yo Joong Choe, Victor Veitch

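Affiliations * University of Chicago

AI summary This paper examines the linear representation hypothesis, the idea that high-level concepts are represented as linear directions in a representation space. It gives two formalizations of "linear representation", one in the output (word) space and one in the input (sentence) space. By introducing a causal inner product, the authors obtain a non-Euclidean inner-product structure that unifies the notions of linear representation and supports the construction of probes and steering vectors. Experiments show that large language models do contain linear representations of concepts and that the choice of inner product plays a fundamental role in interpretation and control.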

Comments Accepted for a presentation at ICML 2024 and an oral presentation at NeurIPS 2023 Workshop on Causal Representation Learning. Code is available at https://github.com/KihoPark/linear_rep_geometry

Journal ref In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

Details
English abstract

Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.