arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22852 2026-05-25 cs.DB cs.AI cs.LG cs.LO

Expressive Power of Deep Homomorphism Networks over Relational Databases

Moritz Schönherr, Balder ten Cate, Maurice Funk, Benny Kimelfeld, Carsten Lutz, Arie Soeteman

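Affiliations: University of Amsterdam; Leipzig University; Technion; RelationalAI

AI summary: This paper studies the expressive power of Deep Homomorphism Networks (DHNs) over relational databases by relating them to fragments and extensions of first-order logic (FO). For max, sum, and mean aggregation, it connects DHNs to the unary negation fragment and to its extensions with counting and ratio quantifiers, which delineates what each aggregation can express. Through the classical correspondence between FO and SQL, the results also clarify how DHNs relate to SQL, and they support a decidability analysis of static analysis problems. Experiments confirm that the differences in expressive power show up in performance on prediction tasks.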

Abstract

The expressive limitations of message-passing Graph Neural Networks (GNNs) have motivated a wide range of more powerful graph learning architectures. We advocate Deep Homomorphism Networks (DHNs) as a model particularly well-suited for learning over relational databases, due to their close connection to important fragments of SQL such as conjunctive queries. We study the precise expressive power of DHNs by relating them to various natural fragments and extensions of first-order logic (FO). For DHNs with max, sum, and mean aggregations, we establish connections to the unary negation fragment (UNFO) and to the extensions of UNFO with counting quantifiers and with ratio quantifiers. We further relate sum-aggregation DHNs to the unary quantifier alternation fragment of FO and to an extension of FO with expressive counting. Through the classical correspondence between FO and SQL, these results also illuminate the relation between DHNs and SQL. They also enable us to study the decidability of two fundamental static analysis problems for DHNs, the emptiness problem and the subsumption problem. Finally, we confirm through experiments that the established differences in expressive power are reflected in the performance on suitable prediction tasks.
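The link to conjunctive queries can be made concrete: a conjunctive query without negation is a pattern, and its satisfying assignments are exactly the homomorphisms from that pattern into the database. The brute-force counter below is only a sketch of that idea for a single binary relation; it is not the DHN architecture, which the abstract does not define, and the function name and toy data are hypothetical.

```python
from itertools import product

def count_homomorphisms(pattern_nodes, pattern_edges, data_nodes, data_edges):
    """Count maps h from pattern nodes to data nodes such that every
    pattern edge (u, v) is sent to a data edge (h[u], h[v])."""
    data = set(data_edges)
    count = 0
    # Enumerate every assignment of pattern nodes to data nodes (exponential; toy sizes only).
    for image in product(data_nodes, repeat=len(pattern_nodes)):
        h = dict(zip(pattern_nodes, image))
        if all((h[u], h[v]) in data for u, v in pattern_edges):
            count += 1
    return count

# Toy database with one binary relation E (hypothetical data).
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]

# Pattern for the query Q() :- E(x, y), E(y, z), i.e. a directed path of length 2.
path_matches = count_homomorphisms(["x", "y", "z"], [("x", "y"), ("y", "z")], nodes, edges)
edge_matches = count_homomorphisms(["x", "y"], [("x", "y")], nodes, edges)
```

Here the only path of length 2 is a, b, c, so `path_matches` is 1, while the single-edge pattern matches each of the three tuples in E.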

2605.22851 2026-05-25 eess.SP cs.LG eess.IV

VAMP-Diff: VampPrior Latent Diffusion for Photoplethysmography Modeling

Fatemeh Ghasemi Balouei, Nathan Willemsen, Mahesh Banavar, Bahman Moraffah

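Affiliations: Department of Electrical and Computer Engineering; Clarkson University; Department of Computer Science; Worcester Polytechnic Institute

AI summary: This paper proposes VAMP-Diff, a variational latent diffusion model for generating and reconstructing photoplethysmography (PPG) signals. It combines a temporal encoder, a conditional diffusion decoder, and VampPrior regularization, so that the latent space better preserves physiological features such as heart rate and respiratory rate while the generated PPG waveforms have more realistic morphology. Experiments show that VAMP-Diff outperforms Gaussian-prior models in reconstruction accuracy and in preserving physiological information.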

Comments Submitted to the 2026 Asilomar Conference on Signals, Systems, and Computers. 12 pages, 6 figures

Abstract

Photoplethysmography (PPG) has become a ubiquitous physiological signal; however, current generative models still struggle to preserve realistic waveform morphology and learn a latent structure that captures cardiac and respiratory physiology. PPG generators trained with adversarial losses can produce plausible waveforms, but provide no inference path from a real signal to a latent representation. Variational autoencoders, on the other hand, map the PPG data to latent codes, although their decoders often blur systolic upstrokes and dampen amplitude and spectral details. Diffusion models improve waveform fidelity, but typically lack an inference path for reconstruction and physiological analysis. We propose VampPrior Latent Diffusion (VAMP-Diff), a jointly trained variational diffusion model that combines a temporal PPG encoder, a conditional one-dimensional diffusion decoder, and VampPrior regularization on a compact pooled latent. The model uses the full temporal latent during diffusion reconstruction, giving the decoder access to beat timing and morphology, while generating samples from learned VampPrior components instead of a fixed Gaussian prior. We demonstrate on the CapnoBase dataset that VAMP-Diff produces realistic PPG signals, reconstructs sharper physiological waveforms than Gaussian-prior baselines, preserves heart-rate information, maintains respiratory-rate consistency, and is sensitive to waveform corruptions through reconstruction error.

2605.22850 2026-05-25 cs.DC cs.AI

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso

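Affiliations: ETH Zurich; HPE Labs; The Chinese University of Hong Kong, Shenzhen

AI summary: Reusing the key-value (KV) cache is essential for fast responses in large language model serving, but the accumulated cache often exceeds GPU memory and local DRAM, so current systems fall back on remote DRAM pools, which raises cluster size and cost. This paper proposes ObjectCache, which stores the KV cache in S3-compatible object storage to lift the capacity limit and co-designs the storage protocol and transfer schedule so that data transfer overlaps with GPU compute, minimizing the impact on time to first token (TTFT). Experiments show that ObjectCache keeps latency low while handling large contexts efficiently.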

Abstract

Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache is often larger than what GPU memory and local DRAM can hold. To preserve latency, current systems keep the KV cache in remote DRAM pools, increasing serving-cluster size and cost. In this paper, we explore a different approach: storing the KV cache in S3-compatible object storage so that capacity is no longer the constraint, while minimizing the impact on TTFT. We propose ObjectCache, which co-designs the storage protocol and transfer schedule so that the storage server delivers KV cache data in the order the GPU consumes it, overlapping data transfer with compute across concurrent requests. We prototype ObjectCache on a 100 Gbps RoCE cluster with NIXL (an inference library that abstracts storage and memory), Ceph RGW (an Object Gateway for clusters), and DAOS (an open source storage system). For 64K contexts, common in today's systems, ObjectCache adds only 5.6% latency over local DRAM; for 4K contexts, where less compute is available to mask transfer, ObjectCache adds 56-75 ms over the optimal local layerwise baseline. Under shared bandwidth caps, our scheduler reduces added TTFT by 1.2-1.8x compared with equal bandwidth sharing.
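A small timing model illustrates why delivering KV data in consumption order can hide most of the transfer cost when per-layer compute is long, and less of it when compute is short. This is a sketch under simplifying assumptions, not ObjectCache's scheduler: it models a single request, transfers stream back to back, and layer i starts computing only after its data has arrived and layer i-1 has finished. The function names and all durations are hypothetical.

```python
def layerwise_finish_time(transfer_ms, compute_ms):
    """Finish time when per-layer transfers are pipelined with compute."""
    arrived = 0.0  # time at which the current layer's KV data has arrived
    done = 0.0     # time at which the previous layer finished computing
    for t, c in zip(transfer_ms, compute_ms):
        arrived += t                     # transfers run back to back
        done = max(done, arrived) + c    # wait for data and for the previous layer
    return done

def blocking_finish_time(transfer_ms, compute_ms):
    """Finish time when the whole cache is fetched before any compute starts."""
    return sum(transfer_ms) + sum(compute_ms)

# Long per-layer compute (loosely, a long context): transfer is almost fully hidden.
long_pipelined = layerwise_finish_time([2.0] * 4, [5.0] * 4)
long_blocking = blocking_finish_time([2.0] * 4, [5.0] * 4)

# Short per-layer compute (loosely, a short context): transfer dominates.
short_pipelined = layerwise_finish_time([2.0] * 4, [1.0] * 4)
```

With these made-up numbers, the long-compute case finishes at 22 ms pipelined versus 28 ms blocking, only 2 ms above the 20 ms of pure compute. The short-compute case finishes at 9 ms against 4 ms of pure compute, so there is less work available to mask the transfer.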

2605.22848 2026-05-25 cs.CE cs.LG q-bio.OT

From Simulation to Discovery: AI Enabled Probabilistic Emulation of Mechanistic Crop Systems

Mojdeh Saadati, Juan Panelo, Gustavo Visentini, Soumik Sarkar, Carlos Messina, Baskar Ganapathysubramanian

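Affiliations: Department of Mathematics and Department of Computer Science, Iowa State University; Department of Horticultural Sciences, University of Florida; Department of Mechanical Engineering, and Translational AI Center, Iowa State University

AI summary: This study proposes an AI-based probabilistic neural emulator for simulating crop growth efficiently, addressing the high computational cost of traditional crop models. Trained on a large set of simulations under diverse conditions and paired with a physically consistent weather generator, it keeps predictive accuracy high while greatly speeding up simulation, so crop responses across genotypes, environments, and management conditions can be explored quickly. The study identifies maize trait combinations that maintain high yield across all tested conditions and finds that radiation use efficiency and temperature-driven root dynamics are key drivers of yield resilience, which illustrates the approach's potential for research on adapting agriculture to climate change.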

Abstract

Global food security depends on predicting crop responses to climate variability, yet process-based crop models remain too computationally expensive for large-scale exploration of genotype and environment interactions. Here we develop a probabilistic neural emulator of APSIM that reproduces key maize growth processes across 13 outputs with high fidelity (R^2 of 0.93) while reducing simulation time by several orders of magnitude. Trained on two million simulations spanning diverse genetic, soil, and management conditions, and augmented with a convolutional synthetic weather generator that produces physically consistent climate sequences, the framework enables scalable exploration of crop responses under realistic and diverse environmental inputs while providing calibrated predictive uncertainty without costly Bayesian inference. Applying this framework across 100,000 trait configurations, six soil environments in Iowa and Illinois, and climate projections through the year 2100 under two emissions scenarios, we identify 181 maize trait combinations that consistently maintain high yield across all tested conditions, an analysis infeasible with the mechanistic model alone. We further show that radiation use efficiency and temperature-driven root dynamics are dominant drivers of yield resilience. Notably, projected yield distributions vary substantially across locations, with some lower-productivity sites exhibiting yield increases under future climate scenarios, indicating that climate change may reshape regional yield potential in nonintuitive ways. These results demonstrate how uncertainty-aware emulation transforms mechanistic crop simulation from a computational bottleneck into an on-demand discovery engine, one capable of interrogating the full genotype, environment, and management space at a scale no process-based model can match.

2605.22842 2026-05-25 cs.CR cs.AI cs.LG

The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems

Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala, Syed Bahauddin Alam, Sajedul Talukder

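Affiliations: Department of Computer Science, University of Texas at El Paso; School of Computing, Southern Illinois University Carbondale; University of Illinois Urbana-Champaign

AI summary: This paper exposes a structural flaw in multi-agent AI systems, the Misattribution Gap: behavior caused by memory-layer attacks is indistinguishable from model failure, so defenders misdiagnose the root cause. It proposes Semantic Norm Drift (SND) as a third path to agent misconduct, distinct from model misalignment and collusion, in which a Trust Laundering Chain lets a malicious document resurface as trusted system content. The paper introduces Counterfactual Composition Testing to identify where the attack entered and proposes Memory-Persistent Information-Flow Control, which substantially improves system security.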

Comments This paper is presently under review at a top-tier security venue

Abstract

Multi-agent AI pipelines typically assume that agent misconduct originates from model misalignment. We identify a structural failure in this assumption, the Misattribution Gap, where memory-layer attacks produce behaviors indistinguishable from model failure, causing defenders to apply the wrong remediation. We formalize Semantic Norm Drift (SND) as a third path to agent misconduct, distinct from emergent misalignment and collusion. In SND, a policy-formatted document enters a shared vector store through normal uploads and later reappears as trusted system context after provenance is lost through a Trust Laundering Chain. Across 64 documented failures, attribution systems consistently blamed the model. Four safety classifiers, including one trained on memory poisoning, produced zero detections across 510 checkpoints. In 59 of 65 valid cases, agents explicitly cited the injected document as normative authority before complying. The attack requires no trigger, model access, or repeated interaction, achieves full effect within five sessions, and persists indefinitely. We introduce Counterfactual Composition Testing, which identifies the causal entry with 87.5% accuracy and zero false positives, while a forensics baseline fails across all 25 scenarios. We further prove the Retrieval-Coverage Dilemma, showing that stronger evasion inherently weakens the attack, limiting adaptive bypass strategies. Finally, we propose Memory-Persistent Information-Flow Control, which blocks 97% of attacks at the cross-session boundary where prior defenses fail. We release the SND Corpus, the first adversarial memory benchmark with temporal persistence and multi-agent composition across financial and healthcare domains.

2605.22841 2026-05-25 physics.soc-ph cs.AI cs.CL cs.GT cs.MA econ.GN q-fin.EC

Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test

Rommin Adl, Peyton Williams

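Affiliations: Grinnell College

AI summary: Using the 2019-2026 U.S. effort to acquire Greenland from Denmark as a case study, this paper examines strategic pressure by a stronger alliance member on a weaker one. It builds several game models and tests them in multi-agent simulations with eight frontier large language models. Under the collective-action problems of strategic control and alliance-norm enforcement, the models differ markedly in power weights, strategies, and escalation. In particular, Chinese-origin models playing the U.S. role show profiles distinct from those of Western-origin models, and only a few models ever reach a peaceful U.S. acquisition of Greenland.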

Comments 78 pages, 17 figures, 18 tables. Multi-agent LLM simulation recovering structural utility parameters across 8 frontier models in the Greenland sovereignty crisis. v3: typo pass, fixes phantom action names (REQUEST_MULTILATERAL, INDEPENDENT) and a Blunden date mismatch. v2 added Section V safety findings (legitimacy-laundered escalation, signal decoupling) and Appendix H

Abstract

What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland sovereignty crisis as a stress test for LLM geopolitics, centered on the 2019-2026 U.S. push to acquire Greenland from the Kingdom of Denmark. The crisis nests two collective-action problems: Arctic strategic control and whether NATO can enforce alliance norms against the dominant member. We develop three games (asymmetric coercion; a NATO assurance game with a critical-mass tipping point; a triadic extensive-form game with social preferences) and test them with a multi-agent simulation in which eight frontier LLMs play six geopolitical roles (United States, Denmark, Greenland, NATO, Russia, Canada) across 3,604 completed games and 108,120 action observations. Using inverse game theory, we recover each model's structural utility parameters (alpha, beta, gamma, delta, eta) for material self-interest, reciprocity, inequality aversion, norm respect, and commitment consistency. Three findings stand out. First, all eight models become more escalatory under coercion framing (four-action escalation rises from 10.7% to 28.6%). Second, Chinese-origin models show systematically different power-weight profiles from Western-origin models when playing the U.S. role. Third, peaceful US acquisition emerges in only 1.9% of clean games and only 3 of 8 frontier models ever achieve it, most prominently DeepSeek V3.2, which executes a stable five-round playbook through the metropole. Prompts emphasizing jus cogens and self-determination reduce escalation back near baseline in the English-only confirmatory sample; multilingual contrasts are reported as exploratory sensitivity checks. We position this as a structural benchmark for LLM geopolitical behavior, complementing action-frequency benchmarks.

2605.22840 2026-05-25 physics.soc-ph cs.AI cs.CY

The Cognitive Kardashev Scale: Quantifying the Material Envelope of Civilisational Computation

Sachin Sharma

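Affiliations: NVIDIA OpenAI Stargate Terafab

AI summary: This paper proposes the Cognitive Kardashev Scale, which quantifies a civilisation's capacity for computation. From four factors (total power, the share of power devoted to cognition, the efficiency of converting energy into compute, and the processing rate of the human brain) it estimates how much sustained AI-grade computation each civilisation tier could support. Humanity currently sits at about 0.73 on the scale, roughly three-quarters of the way to Type I. A Type I civilisation that devotes 1% of its power to computation could give each inhabitant roughly one personal AI's worth of compute, while the compute available to a Type II civilisation is hard to imagine. The paper also discusses several possible trajectories for future compute and notes that whether energy or efficiency is the binding limit depends on engineering choices not yet made.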

Abstract

How much thinking can a civilisation do? Kardashev's (1964) typology ranks civilisations by total power: planetary (Type I, ~10^16 W), stellar (Type II, ~10^26 W), galactic (Type III). This paper builds an analogous Cognitive Kardashev Scale: how much sustained AI-grade computation each tier could support. Four ingredients enter the calculation: total power P (watts), the share f of it devoted to cognition, the efficiency $η$ at which energy becomes compute (operations per joule), and the brain's own processing rate $C_{\mathrm{brain}}$ as a reference unit. Anchoring on 2024-2026 hardware (El Capitan, NVIDIA Blackwell, Vera Rubin) gives $η_{2026} = 10^{12}$ FLOP/J. Contemporary humanity sits at $K \approx 0.73$, three-quarters of the way to Type I. At Type I and $f = 1\%$, available compute is, within an order of magnitude, one personal AI's worth of cognition per human inhabitant; at Type II it is essentially incomprehensible. Three trajectories for frontier compute through 2035 are reported as conditional projections, not predictions. Whether the long-run binding constraint is energy or efficiency depends on engineering choices not yet made; the political economy of who has access may matter more than either.
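The arithmetic behind these figures can be sketched in a few lines. The abstract does not state how K is computed, so the sketch assumes Sagan's interpolation K = (log10 P - 6) / 10, which matches the quoted anchors (10^16 W for Type I, 10^26 W for Type II). The compute rate is taken as the product f · P · η, which follows from the units given. The population figure of 8 billion and all function names are assumptions added for illustration.

```python
import math

def kardashev_k(power_watts):
    """Sagan's interpolation formula (assumed here, not confirmed by the abstract)."""
    return (math.log10(power_watts) - 6.0) / 10.0

def power_from_k(k):
    """Inverse of kardashev_k: total power in watts for a given K."""
    return 10.0 ** (10.0 * k + 6.0)

def compute_rate(power_watts, f, eta):
    """Operations per second: power (W = J/s) times share f times efficiency (ops/J)."""
    return power_watts * f * eta

ETA_2026 = 1e12      # FLOP/J, the efficiency quoted in the abstract
POPULATION = 8e9     # assumed world population, not taken from the abstract

today_watts = power_from_k(0.73)                    # about 2e13 W
type1_flops = compute_rate(1e16, 0.01, ETA_2026)    # about 1e26 FLOP/s
per_person = type1_flops / POPULATION               # about 1.25e16 FLOP/s each
```

Under these assumptions, a Type I civilisation spending 1% of its power on compute at 2026 efficiency would sustain about 10^26 FLOP/s, or on the order of 10^16 FLOP/s per inhabitant. Converting that into brain-equivalents needs the paper's value of C_brain, which the abstract does not give.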

2605.22837 2026-05-25 physics.geo-ph cs.LG eess.SP

Evaluating PhaseNet on Teleseismic Data with MsPASS

Jinxin Ma, Yinzhi Wang, Gary L. Pavlis, Chenbo Yin

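Affiliations: Texas Advanced Computing Center, The University of Texas at Austin; Department of Earth and Atmospheric Sciences, Indiana University, Bloomington, IN 47405

AI summary: This paper examines how the machine-learning phase picker PhaseNet performs on teleseismic data and presents a reproducible MsPASS-based workflow for processing large seismic archives and for PhaseNet training and inference. Using a control dataset of 1.6 million waveforms with teleseismic P-wave picks, it finds that a PhaseNet model trained on regional data performs poorly on teleseismic data, whereas a model retrained from scratch on this dataset yields far higher P-pick recall and many more picks within a tight residual window. Experiments also show that enlarging the model improves precision and recall but sharply reduces inference throughput, especially on CPUs.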

Abstract

Numerous studies have shown that the machine-learning picker PhaseNet produces accurate P and S picks on local earthquake signals, but its performance can degrade sharply on teleseismic signals. To address this limitation, we present a reproducible MsPASS workflow that (i) enables scalable data preparation and management for large seismic archives and (ii) supports standardized PhaseNet training and inference. We assembled a control dataset of 1.6 million waveforms linked to teleseismic P-wave picks made by analysts at the USArray Array Network Facility (ANF). The control dataset confirms that the PhaseNet model trained on regional signals performs poorly on these data. We then trained PhaseNet from scratch on the training split of the ANF control dataset and evaluated it on a non-overlapping held-out test split, increasing P-pick recall by 741.5% and yielding 683.9% more picks within a 0.1s residual window. We also evaluated PhaseNet across different model sizes on both CPUs and GPUs. Increasing the model size by about 120 times improved precision and recall by 15.6% and 23.2%, respectively. However, the scaled model reduced inference throughput by 87.2% on an NVIDIA A100 GPU and by 97.3% on a 128-core high-performance CPU node. These results indicate that scaling PhaseNet is more practical on GPUs than on CPUs, and that simply enlarging the model is not an efficient way to achieve large accuracy gains.
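The percentages above are easier to compare once converted to multiplicative factors. The helpers below are a plain arithmetic sketch with hypothetical names; they use only the figures quoted in the abstract and assume each percentage is a relative change against the corresponding baseline.

```python
def factor_from_percent_increase(pct):
    """A +pct% change means the new value is (1 + pct/100) times the old one."""
    return 1.0 + pct / 100.0

def slowdown_from_percent_reduction(pct):
    """A -pct% change in throughput means work takes 1 / (1 - pct/100) times as long."""
    return 1.0 / (1.0 - pct / 100.0)

recall_gain = factor_from_percent_increase(741.5)       # about 8.4x the baseline recall
gpu_slowdown = slowdown_from_percent_reduction(87.2)    # about 7.8x slower on the A100
cpu_slowdown = slowdown_from_percent_reduction(97.3)    # about 37x slower on the CPU node
```

Read this way, the roughly 120-times-larger model costs about 7.8x in GPU throughput and about 37x in CPU throughput for precision and recall gains of 15.6% and 23.2%, which is the trade-off the abstract's closing sentence describes.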

2605.22836 2026-05-25 physics.geo-ph cs.LG

Real-Time Earthquake Magnitude Classification from Initial P-Waves: Models, Dataset, and Comparative Analysis for South Asia

Md Nasiat Hasan Fahim, Md. Abid Ullah Muhib, Rayhanul Amin Tanvir, Abdullah Al Noman

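Affiliations: Department of Computer Science and Engineering; Shahjalal University of Science and Technology

AI summary: This paper studies real-time classification of earthquake magnitude from the vertical component of the first 7 seconds of the P wave at a single station, with the goal of faster earthquake early warning. It compares six machine learning approaches, from traditional models to advanced deep learning architectures, and builds a new dataset of 7,318 South Asian earthquake events in five Richter-scale classes. Deep learning models substantially outperform traditional approaches; the Transformer-based architecture reaches 76.23% standard accuracy and 81.56% adaptive accuracy at 4.8 ms inference latency, which suggests that real-time deployment is viable.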

Comments Accepted for publication in 2025 28th International Conference on Computer and Information Technology (ICCIT). © 2025 IEEE

Journal ref 2025 28th International Conference on Computer and Information Technology (ICCIT), Cox's Bazar, Bangladesh, 2025

Abstract

Rapid earthquake magnitude estimation is crucial for effective early warning systems that can save lives and reduce economic damage. In this paper, we present a comprehensive study of magnitude classification using only the vertical component of the initial 7-second P-wave window from a single station. We compare six machine learning approaches that range from traditional models to state-of-the-art deep learning architectures. We also curated a novel dataset of 7,318 earthquake events in South Asia. The dataset was categorized into five Richter-scale classes: slight (3.0-3.9), light (4.0-4.9), moderate (5.0-5.9), strong (6.0-6.9) and severe (>= 7.0). Our experiments show that deep learning models substantially outperform traditional approaches. Our Transformer-based architecture achieved 76.23% standard accuracy and 81.56% adaptive accuracy with 4.8 ms inference latency. The adaptive-accuracy metric is introduced to account for the inherent uncertainty in magnitude estimation near class boundaries. These results indicate that the attention mechanisms in Transformers combined with adaptive classification effectively capture the temporal dynamics of seismic signals. The architectural advantage facilitates promising generalization to rare high-magnitude events, despite the inherent data scarcity characteristic of seismic catalogs. The adaptive accuracy provides a more realistic assessment of model performance, and the result suggests viability for real-time deployment.
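The five-class labelling can be written as a simple binning function. This is a sketch of the class boundaries quoted above, not the authors' preprocessing code: the function name is hypothetical, and it assumes each class runs from its lower bound up to, but not including, the next one, so a value such as 3.95 lands in "slight". The adaptive-accuracy metric is not reproduced here because the abstract does not define it.

```python
from bisect import bisect_right

# Lower bounds and labels of the five classes listed in the abstract.
LOWER_BOUNDS = [3.0, 4.0, 5.0, 6.0, 7.0]
LABELS = ["slight", "light", "moderate", "strong", "severe"]

def magnitude_class(magnitude):
    """Map a magnitude to its class label; values below 3.0 are out of scope."""
    if magnitude < LOWER_BOUNDS[0]:
        raise ValueError("magnitude below the smallest class in the dataset")
    # bisect_right counts how many lower bounds are <= magnitude.
    return LABELS[bisect_right(LOWER_BOUNDS, magnitude) - 1]

examples = [magnitude_class(m) for m in (3.9, 4.0, 5.5, 6.9, 7.3)]
```

Events at 3.9 and 4.0 receive different labels even though they differ by a tenth of a unit, which is the kind of boundary effect the adaptive-accuracy metric is said to address.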

2605.22833 2026-05-25 cs.IR cs.AI cs.LG

RAG4Outcome: A Retrieval-Augmented Multimodal Framework for Prognostic Prediction in Chronic Osteomyelitis

Daqian Shi, Pei Han, Jishizhan Chen, Yang Wang, Xiaolei Diao, Xianyou Zheng, Pengfei Cheng

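Affiliations: Queen Mary University of London; Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine; University College London

AI summary: Chronic osteomyelitis is hard to prognosticate because of its high recurrence risk and complex postoperative recovery. Traditional assessment relies on manual scoring systems, which scale poorly and are inefficient and inconsistent. The paper proposes RAG4Outcome, a multimodal framework based on retrieval-augmented generation (RAG) that integrates PET-CT imaging reports, structured surgical and diagnostic records, and unstructured follow-up notes. Combined with a domain-specific retrieval corpus and expert-guided prompting, it aims at more interpretable, evidence-grounded, and clinically reliable prognosis; preliminary results on real-world cases indicate promising effectiveness and clinical alignment.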

Abstract

Chronic osteomyelitis presents substantial prognostic challenges due to its high recurrence risk and complex postoperative recovery trajectories. Traditional assessment often relies on manual scoring systems, which limit scalability, efficiency, and consistency in clinical practice. Furthermore, the heterogeneous nature of clinical data poses challenges for current multimodal learning approaches that require aligned inputs and large annotated datasets. In this work, we propose RAG4Outcome, a retrieval-augmented generation (RAG) framework for prognostic prediction in chronic osteomyelitis. Our method integrates multimodal clinical data, including PET-CT imaging reports, structured surgical and diagnostic records, and unstructured follow-up notes, into a unified prediction pipeline. By combining a domain-specific retrieval corpus with expert-guided prompting, the framework enables more interpretable, evidence-grounded, and clinically reliable prognosis. Preliminary results on real-world cases demonstrate promising effectiveness and clinical alignment, highlighting the potential of RAG4Outcome for AI-assisted infection management and postoperative decision support.

2605.22829 2026-05-25 cs.IR cs.AI

LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding

Yifan Zhu, Yu Mi, Yue Lu, Yanchu Guan, Zhixuan Chu

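Affiliations: Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Zhejiang University Institute of Blockchain and Data Security

AI summary: This paper proposes LFRAG, a layout-oriented fine-grained retrieval-augmented generation framework for multimodal document understanding. Existing multimodal RAG systems rely mainly on page-level retrieval and miss the fine-grained semantics and layout structure of visually rich documents; LFRAG instead uses block-level retrieval and a semantic-layout fusion encoder for more precise query-content alignment and more efficient generation. The authors also build LFDocQA, a large-scale benchmark with block-level annotations, and report that LFRAG outperforms baselines on retrieval and generation tasks.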

Abstract

Multimodal Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing multimodal RAG systems predominantly rely on coarse-grained page-level retrieval, which fails to capture fine-grained semantic and layout structures in visually rich documents, thereby compromising retrieval accuracy and leading to redundant context in downstream tasks. To address these issues, we propose Layout-oriented Fine-grained Retrieval-Augmented Generation (LFRAG), a novel framework that advances multimodal RAG from page-level to block-level retrieval. We perform layout segmentation to construct semantically coherent fine-grained retrieval units and design a semantic-layout fusion encoder that integrates local semantics with global context via cross-attention. With block-level late interaction retrieval, LFRAG enables precise query-content alignment and reduces irrelevant content for downstream generation. To enable rigorous evaluation, we construct LFDocQA, a large-scale benchmark with block-level annotations spanning diverse document types, designed to assess both multimodal document retrieval and question answering with greater granularity than existing datasets. Extensive experiments on LFDocQA demonstrate that LFRAG achieves state-of-the-art performance on retrieval tasks, outperforms the best baseline by 7.20% in answer accuracy, and reduces token consumption by 73.07% in generation tasks, confirming LFRAG as an accurate and efficient framework for multimodal RAG over visually rich documents. Our code and datasets will be released soon.
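In the retrieval literature, "late interaction" usually means ColBERT-style MaxSim scoring: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. The sketch below applies that scoring at block level. It assumes LFRAG's scoring resembles MaxSim, which the abstract does not confirm, and the function names, block names, and toy two-dimensional embeddings are hypothetical.

```python
import math

def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_vecs, block_vecs):
    """Sum over query tokens of the best cosine match among the block's tokens."""
    q = [_normalize(v) for v in query_vecs]
    b = [_normalize(v) for v in block_vecs]
    return sum(max(sum(x * y for x, y in zip(qv, bv)) for bv in b) for qv in q)

def rank_blocks(query_vecs, blocks):
    """Return block names ordered from highest to lowest MaxSim score."""
    scores = {name: maxsim_score(query_vecs, vecs) for name, vecs in blocks.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings: two query tokens, two candidate layout blocks.
query = [[1.0, 0.0], [0.0, 1.0]]
blocks = {
    "caption": [[1.0, 0.0]],              # matches only the first query token
    "table": [[2.0, 0.0], [0.0, 3.0]],    # matches both query tokens
}
ranking = rank_blocks(query, blocks)
```

Because each block is scored separately, a page that contains one relevant table and much unrelated text can surface just the table, which is the redundancy reduction that block-level retrieval aims for.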

2605.22827 2026-05-25 physics.app-ph cs.AI cs.MA cs.PF

Computable Fairness: Boltzmann-Softmax Control for AI Resource Allocation

Ji-Won Park, Chae Un Kim

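Affiliations: Regional Science, Cornell University; Department of Economics, University of Ulsan; Department of Physics, UNIST

AI summary: Allocating limited compute resources such as GPU time and bandwidth fairly is an important problem in large-scale AI systems. This paper proposes Computable Fair Division (CFD), a framework that reinterprets the Boltzmann-Softmax function as a probabilistic resource allocation mechanism and treats the inverse temperature β as a computable control variable that balances efficiency against fairness. Static analysis and the dynamic controller AHC++ are used to regulate stability and fairness, and simulations indicate good scalability and robustness under shocks.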

Comments 40 pages, 12 figures, 5 tables. Code: https://github.com/entrofy-ai/computable-fairness

Abstract

In large-scale AI systems, allocating scarce resources such as GPU compute time and bandwidth among multiple agents is a critical challenge. Conventional policies focus on efficiency metrics, potentially leading to dominance concentration that undermines system diversity and stability. We propose Computable Fair Division (CFD), a framework that reinterprets the Boltzmann-Softmax function not as a selection tool but as a probabilistic resource allocation mechanism, redefining the inverse temperature parameter $β$ as a computable control variable governing the efficiency-fairness balance. Static analysis reveals a Pareto frontier with a near-optimal Stability Corridor where total loss remains approximately constant across policy weights. In the dynamic setting, AHC++ (Adaptive Hard-Cap Controller++) updates $β$ in real time using the error between observed dominance and a policy-specified target as feedback. Simulations show that AHC++ suppresses extreme dominance concentration under exogenous shocks while tracking fairness targets without substantial throughput degradation. Scalability analysis confirms that a 100x increase in agents yields only approximately 5.5x increase in execution time. Code: https://github.com/entrofy-ai/computable-fairness
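The role of β as a control knob can be illustrated with a short sketch. The allocation is the standard Boltzmann-Softmax share, p_i proportional to exp(β · s_i). The feedback rule is a plain proportional update on the dominance error, chosen here for illustration only: it is not the AHC++ controller, whose update law the abstract does not specify. The scores, gain, and target are made up, and "dominance" is assumed to mean the largest single share.

```python
import math

def softmax_allocation(scores, beta):
    """Share of the resource per agent; beta = 0 gives equal shares."""
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(beta * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def update_beta(beta, shares, target_dominance, gain=1.0, beta_min=0.0):
    """Lower beta when the largest share exceeds the target, raise it otherwise."""
    error = max(shares) - target_dominance
    return max(beta_min, beta - gain * error)

scores = [1.0, 2.0, 4.0]   # hypothetical efficiency scores of three agents
beta = 2.0                 # start with a strongly efficiency-driven allocation
for _ in range(50):
    shares = softmax_allocation(scores, beta)
    beta = update_beta(beta, shares, target_dominance=0.5)
final_shares = softmax_allocation(scores, beta)
```

At β = 2 the strongest agent takes almost the whole resource. The loop lowers β until the largest share settles near the 0.5 target, while the ordering of the shares still follows the scores.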

2605.22825 2026-05-25 cs.DC cs.AI cs.ET cs.PF

KPI2KVI: A Multi Agent Workflow for Calculating Key Value Indicators from Service Descriptions

KPI2KVI:一种从服务描述计算关键价值指标的多智能体工作流

Masoud Shokrnezhad, Tarik Taleb, Yan Chen, Qize Guo

发表机构 * ICTFICIAL OY(ICTFICIAL公司) Ruhr-Universitaet Bochum(波鸿鲁尔大学)

AI总结 本文提出了一种名为 KPI2KVI 的多智能体工作流工具,用于从服务描述中自动计算关键价值指标(KVIs)。该方法基于大语言模型,通过协调多个智能体完成从服务描述中提取上下文、确定 KVI 类别、生成 KPI、收集 KPI 值并计算区间化 KVI 输出等任务,实现了从自然语言描述到结构化 KVI 估计的端到端映射。该工具有效解决了 KVIs 计算过程中手动操作繁琐、结果不一致的问题,并提供了可追溯的计算过程,支持后续审计与交互式咨询。

详情
AI中文摘要

关键价值指标(KVI)通过总结运营绩效如何转化为利益相关者价值、风险和结果,提供服务的决策导向视图。然而,在许多领域,KVI在实践中难以计算,因为它们需要选择相关的KVI类别、定义可测量的关键绩效指标(KPI)、收集KPI值并应用一致的计算逻辑,而这些通常是从非结构化服务文档中手动且不一致地执行的。本文提出KPI2KVI,一种通过编排由大语言模型(LLM)驱动的确定性多智能体工作流,将自然语言服务描述转化为计算出的KVI估计值的工具,该工作流(i)引出缺失的服务上下文,(ii)从分类中提取并最终确定相关的KVI类别,(iii)生成带有单位和描述的服务特定KPI,(iv)通过交互式对话收集KPI值,并支持对不可用KPI值的智能估计,以及(v)计算区间值的KVI输出(最小值、精确值、最大值),并为每个KVI代码提供可追溯的解释。使用代表性服务描述的模拟表明,KPI2KVI一致地产生从描述到KVI区间的完整端到端映射,并提供透明的计算叙述,支持事后审计和交互式咨询查询。

英文摘要

Key Value Indicators (KVIs) provide a decision-oriented view of a service by summarizing how operational performance translates into stakeholder value, risk, and outcomes. However, in many domains KVIs are difficult to compute in practice because they require selecting relevant KVI categories, defining measurable Key Performance Indicators (KPIs), collecting KPI values, and applying consistent calculation logic, all of which are typically performed manually and inconsistently from unstructured service documentation. This paper presents KPI2KVI, a tool that transforms a natural-language service description into computed KVI estimates by orchestrating a deterministic multi-agent workflow powered by Large Language Models (LLMs) that (i) elicits missing service context, (ii) extracts and finalizes relevant KVI categories from a taxonomy, (iii) generates service-specific KPIs with units and descriptions, (iv) collects KPI values through an interactive dialogue and supports intelligent estimation for KPI values that are unavailable, and (v) computes interval-valued KVI outputs (minimum, exact, maximum) with traceable explanations for each KVI code. Simulations with representative service descriptions demonstrate that KPI2KVI consistently produces a complete end-to-end mapping from description to KVI intervals and provides transparent calculation narratives that support post hoc auditing and interactive advisory queries.

2605.22824 2026-05-25 cs.DC cs.AI

An AI-Driven Framework for Energy-Efficient Environmental Monitoring in Smart Cities Using Edge Intelligence

基于边缘智能的智慧城市节能环境监测AI驱动框架

Yichen Liu, Imam Akintomiwa Akinlade, Xiaochong Jiang, Wenting Yang, Shiqi Yang

发表机构 * Independent Researcher(独立研究者) Harvard Business School(哈佛商学院)

AI总结 本文提出了一种基于边缘智能的AI驱动框架,旨在提升智慧城市中环境监测的能源效率。该框架利用TinyML技术与上下文感知的自适应决策机制,根据时空条件、环境统计和能量约束动态激活传感器,从而减少冗余数据采集和能耗。实验表明,与传统静态或基于UCB的传感策略相比,该方法显著降低了能量消耗并延长了传感器寿命,展示了边缘智能在构建可持续智慧城市监测系统中的潜力。

Comments 6 pages, 2 figures, 3 tables

详情
AI中文摘要

环境监测是智慧城市基础设施的关键组成部分,它能够支持明智决策,从而增强可持续性、公共卫生和城市规划。然而,智能传感器的大规模部署引发了关于过度能耗、冗余数据收集以及传感器寿命有限的问题。为解决这些问题,我们提出了一种基于边缘智能的智慧城市节能环境监测AI驱动框架。我们的框架利用支持TinyML的边缘设备和上下文感知自适应决策,根据时空条件、环境统计数据和能量约束动态激活传感器。传感器将基于一个效用函数动态激活,该函数考虑实时环境条件、传感器位置和剩余电池寿命等因素。我们的框架将减少不必要的感知和通信,同时保持高监测覆盖率。我们引入了一种分层边缘智能架构,以支持城市规模的部署。我们使用真实多传感器环境迹线驱动的城市规模模拟进行了评估,结果表明,与静态、周期和基于UCB的自适应感知策略相比,所提出的机制显著降低了能耗并延长了传感器寿命。结果突出了边缘智能和自适应AI技术在构建可持续高效的智慧城市监测系统方面的潜力。

英文摘要

Environmental monitoring is a crucial component of smart city infrastructure. It enables informed decision-making that enhances sustainability, public health, and urban planning. However, large-scale deployments of smart sensors have raised concerns about excessive energy consumption, redundant data collection, and limited sensor lifespan. To resolve these issues, we present an AI-driven framework for energy-efficient environmental monitoring in smart cities using edge intelligence. The framework leverages TinyML-enabled edge devices and context-aware adaptive decision-making to dynamically activate sensors based on spatiotemporal conditions, environmental statistics, and energy constraints. Sensors are activated according to a utility function that takes into account factors such as real-time environmental conditions, sensor location, and remaining battery life. The framework aims to reduce unnecessary sensing and communication while maintaining high monitoring coverage. We introduce a hierarchical edge intelligence architecture to support city-wide deployments. An evaluation using a city-scale simulation driven by real multi-sensor environmental traces demonstrates that the proposed mechanism significantly reduces energy consumption and extends sensor lifespan compared to static, periodic, and UCB-based adaptive sensing strategies. The results highlight the potential of edge intelligence and adaptive AI techniques for building sustainable and efficient smart city monitoring systems.
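The utility-driven activation described in this abstract might look roughly like the sketch below. The abstract names the inputs (environmental conditions, sensor location, remaining battery) but not the functional form, so the weighted sum, the weights, the threshold, and the battery reserve are all hypothetical choices made for illustration.

```python
def activation_utility(novelty, location_weight, battery_level,
                       w_novelty=0.6, w_location=0.3, w_battery=0.1):
    """Weighted score in [0, 1]; every input is assumed to be normalized to [0, 1]."""
    return (w_novelty * novelty
            + w_location * location_weight
            + w_battery * battery_level)

def should_activate(novelty, location_weight, battery_level,
                    threshold=0.5, reserve=0.1):
    """Skip sensing when the battery is below the reserve or the utility is too low."""
    if battery_level < reserve:
        return False
    utility = activation_utility(novelty, location_weight, battery_level)
    return utility >= threshold

# A sensor that sees rapidly changing conditions at an important location wakes up;
# a sensor in a quiet, low-priority spot stays asleep.
busy_sensor = should_activate(novelty=0.9, location_weight=0.8, battery_level=0.7)
quiet_sensor = should_activate(novelty=0.1, location_weight=0.2, battery_level=0.9)
```

A fixed threshold keeps the example short; an adaptive policy would tune it from observed coverage and energy use.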

2605.23891 2026-05-25 cs.CV

Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework

Smart-Insertion-V: 通过闭环反馈双流框架实现逼真的视频插入

Xiao Cao, Yansong Qu, Xiangzhen Chang, Wen Xiao, Jiakui Hu, Heyuan Li, Jialun Liu, Zhiyong Huang, Xuelong Li

AI总结 本文提出了一种名为 Smart-Insertion-V 的端到端双流框架,用于实现无需掩码的高质量视频物体插入。该方法通过图像流同步引导视频生成,并引入闭环反馈机制以增强插入鲁棒性,同时设计了 Dual-World-View RoPE 和解耦引导模块,以解决特征纠缠和风格泄露问题,并提升语义对齐与风格适应能力。实验表明,该方法在物体插入位置合理性与画面和谐性方面均达到当前最优水平。

详情
AI中文摘要

无掩码视频对象插入已成为一项具有挑战性的任务,需要将参考对象和谐地融入源视频中。然而,当参考对象与源场景存在严重的风格域差异时,现有方法难以应对。为了克服这一问题,我们提出了Smart-Insertion-V,一种端到端的双流框架,同时进行视频插入和图像风格迁移。在该框架内,图像流同步引导视频生成过程,同时进一步引入闭环反馈机制以确保鲁棒插入。不可避免地,整合这些多样化的条件信号会导致特征纠缠和风格泄露。为解决此问题,我们设计了双世界视角旋转位置编码,通过时空偏移区分不同信号,且不增加大量训练开销。此外,为了促进空间定位和风格适应,我们引入了解耦引导模块,该模块利用视觉语言模型进行语义推理,同时通过原生文本编码器保留原始时间引导。为了弥合和谐参考插入任务的数据差距,我们提出了一种数据整理流程,并将发布一个开源数据集。实验表明,我们的方法可以将对象插入到合理的位置,同时实现最和谐的结果。

英文摘要

Mask-free video object insertion has emerged as a challenging task, requiring harmonious integration of reference objects into source videos. However, existing methods struggle when references exhibit severe stylistic domain gaps with the source scene. To overcome this, we propose Smart-Insertion-V, an end-to-end Dual-Stream framework that concurrently conducts video insertion and image style transfer. Within this framework, the image stream synchronously guides the video generation process, while a Closed-loop Feedback mechanism is further incorporated to ensure robust insertion. Inevitably, integrating these diverse conditioning signals results in feature entanglement and style leakage. To tackle this issue, we design Dual-World-View RoPE to distinguish different signals via spatial-temporal offsets without incurring heavy training overhead. Furthermore, to facilitate spatial grounding and stylistic adaptation, we introduce a Decoupled Guidance Module that leverages a Vision-Language Model for semantic reasoning while preserving the original temporal guidance with the native text encoder. To bridge the data gap for the harmonious reference insertion task, we propose a data curation pipeline and will release an open-source dataset. Experiments demonstrate that our method can insert objects into plausible positions while achieving the most harmonious results.

2604.24021 2026-05-25 cs.AI math.AP

QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

QED:一个用于生成开放问题数学证明的开源多智能体系统

Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang

AI总结 本文介绍了一个名为 QED 的开源多智能体系统,旨在无需人工干预即可将人类提出的研究问题转化为完整的数学证明。该系统通过分离规划、证明和验证三个阶段,有效克服了单一查询证明生成的常见缺陷,其中分解代理负责结构规划,证明代理生成候选论证,验证代理检查正确性。在与领域专家合作的评估中,QED 在 18 个不同难度的研究项目上表现出色,成功生成了五项原创性研究成果,其中三项被认为具有与主流数学期刊相当的深度和广度。

详情
AI中文摘要

我们提出QED,一个开源的多智能体系统,它能够将人类提供的研究问题转化为完整的数学证明,无需进一步的人类指导。其流水线旨在通过分离规划、证明和验证来克服单次查询证明生成的常见失败:分解智能体结构化证明搜索,证明智能体生成候选论证,验证智能体检查正确性。与领域专家合作,我们在18个不同难度的研究级项目上评估了QED。QED在代数几何、流体偏微分方程、概率和反问题领域产生了五篇原创工作。专家评估认为这些工作是扎实的专业研究贡献,其中三篇在难度和范围上与常见于成熟专业数学场所发表的工作相当。QED发布于https://github.com/proofQED/QED。

英文摘要

We present QED, an open-source multi-agent system that turns human-provided research questions into complete mathematical proofs without further human guidance. Its pipeline is designed to overcome common failures of single-query proof generation by separating planning, proving, and verification: a decomposition agent structures the proof search, prover agents generate candidate arguments, and verifier agents check correctness. In collaboration with domain experts, we evaluated QED on 18 research-level projects of varying difficulty. QED produced five original works across algebraic geometry, fluid PDEs, probability, and inverse problems. Expert assessments regard these works as solid specialized research contributions, with three comparable in difficulty and scope to work commonly published in established specialist mathematics venues. QED is released at https://github.com/proofQED/QED.

2605.23350 2026-05-25 cs.RO

Multi-Floor Exploration for Ground Robots via an Incremental Reachable Graph and Structural Priors

基于增量可达图与结构先验的地面机器人多层探索

Zhiwen Zhu, Jiaqi Chen, Xiangyi Huang, Meiqi Hu, Boyu Zhou

AI总结 本文研究了地面机器人在多层建筑中的自主探索问题,针对传统二维和2.5维地图无法有效表示楼梯、坡道等可通行的重叠表面的问题,提出了一种基于增量可达图的多层探索框架。该方法通过构建稀疏的可达图并结合结构先验信息,实现了稳定且物理可行的前沿检测与跨楼层探索引导,实验表明该方法在仿真和实际环境中均表现出更高的探索效率和地图完整性。

详情
AI中文摘要

对于地面机器人而言,多层建筑的自主探索仍然具有挑战性,因为传统的2D和2.5D地图无法表示重叠的可通行表面,例如楼梯、坡道和多个可达高度。本文提出了一种基于增量可达图的多层探索框架。该图构建于可达支撑面之上的稀疏图,通过稀疏观测下的试探性图元素保留潜在的有效连接,并实现稳定的、物理可达的前沿检测。为了引导探索超越当前已建图楼层,我们将已探索楼层的任务区域先验投影到目标楼层,以初始化一个假设图,并随着新观测的到来逐步调整该图。然后,一个分层规划器共同推理确认和假设的结构以提供全局引导。在仿真中,与评估的基线相比,所提出的方法展示了改进的探索效率和地图完整性。此外,机载真实世界实验验证了其实用可行性和实时性能。

英文摘要

Autonomous exploration of multi-floor buildings remains challenging for ground robots because conventional 2D and 2.5D maps cannot represent overlapping traversable surfaces such as stairs, ramps, and multiple reachable elevations. This letter presents a multi-floor exploration framework based on an incremental reachable graph. Built as a sparse graph over reachable support surfaces, the graph preserves potentially valid connectivity through tentative graph elements under sparse observations and enables stable, physically reachable frontier detection. To guide exploration beyond the currently mapped floor, we project task-zone priors from an explored floor to initialize a hypothetical graph on the target floor and reconcile it incrementally with incoming observations. A hierarchical planner then jointly reasons over confirmed and hypothetical structures for global guidance. In simulation, the proposed method demonstrates improved exploration efficiency and mapping completeness compared to evaluated baselines. Furthermore, onboard real-world experiments validate its practical feasibility and real-time performance.

2605.23315 2026-05-25 cs.CL cs.AI

Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

没有理解的趋同:当语言模型在表示上一致但在推理上分歧时

Muhammad Usama, Dong Eui Chang

AI总结 本研究探讨了大型语言模型在不同目标和架构下训练后,其内部表征是否趋于一致,并进一步验证了这种表征一致性是否也体现在推理过程上。通过对8个模型家族共16个语言模型在800个推理问题上的分析,研究发现模型在表征层面趋于一致,但在推理策略上却存在显著分歧,表明表征收敛更多源于输入处理的共性,而非推理方法的统一。这一发现对模型集成、可解释性迁移及模型相似性评估具有重要意义。

详情
AI中文摘要

在不同目标和架构下训练的大型语言模型已被证明会发展出越来越相似的内部表示,这一观察被形式化为柏拉图式表示假说。这种表示趋同是否延伸到对共享表示进行操作的推理过程仍未得到检验。我们在800个涵盖数学、科学、常识和真实性的推理问题上,评估了来自8个家族(1.5B到72B参数)的16个语言模型的表示相似性,并按问题难度、计算阶段和因果相关性进行分层。我们的分析揭示了三种分离:难度反转,模型在它们共同失败的问题上趋同更多(中心核对齐[CKA] = 0.897),而在它们解决的问题上趋同较少(CKA = 0.830);生成差距,决策前表示对齐(CKA = 0.875),而决策后表示分歧(CKA = 0.274);以及附带正确性,共享信息可在模型间解码(66%的迁移准确率),但对预测的因果影响极小(在不同消融协议下翻转率为1.5%到5.5%)。这些结果表明,语言模型中的表示趋同反映了共享的输入处理约束而非共享的推理策略,对集成设计、可解释性迁移和模型相似性评估有直接影响。代码可在 https://github.com/Usama1002/convergence-without-understanding 获取。

英文摘要

Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre-decision representations align (CKA = 0.875) while post-decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at https://github.com/Usama1002/convergence-without-understanding.
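For readers unfamiliar with the similarity metric reported above, the sketch below computes Centered Kernel Alignment in its linear-kernel form using only the standard library. Whether the paper uses the linear or a kernelized variant is not stated in the abstract, so the linear form is an assumption here; the function names are illustrative. Inputs are row-major matrices with one row per example and one column per representation dimension.

```python
import math

def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean over the examples."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered inputs."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator

# Two "models" whose representations differ only by a 90-degree rotation.
model_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
model_b = [[-b, a] for a, b in model_a]
similarity = linear_cka(model_a, model_b)
```

Linear CKA is invariant to rotations and uniform rescaling of either representation, so the rotated pair above scores as fully aligned even though no individual coordinate matches.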

2605.23297 2026-05-25 cs.AI cs.DC

Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems

本体知识块:可信AI系统的可执行合规与基于配置文件的验证

Aasish Kumar Sharma, Julian M. Kunkel

AI总结 本文提出了一种名为本体知识块(Ontological Knowledge Blocks, OKBs)的可执行治理框架,用于实现可信AI系统的合规性与基于配置文件的验证。OKBs将法规义务编译为可由机器验证的约束条件,结合RDF/OWL本体、SHACL验证规则、证据要求和溯源链接,实现了自动化合规检查。研究通过两个原型系统在高性能计算资源分配场景中进行了评估,验证了其在不同治理配置文件下的有效性与性能表现。

Comments 6 pages, 3 figures. Accepted at the Security, Trust and Privacy for Software and Applications (STPSA) Workshop, IEEE COMPSAC 2026, Madrid, Spain, July 7-10, 2026

详情
AI中文摘要

部署在关键数字基础设施中的AI服务需遵守透明度、问责制、公平性和可追溯性等治理义务。目前的合规仍以文档为中心:义务用散文描述,审计依赖静态检查表,验证依赖人工审查。此类方法无法扩展到自动化AI系统。本文引入本体知识块(OKBs),一种可编程治理基础设施,将监管义务编译为结构化证据图上的机器可检查约束。我们将OKB形式化为一个五元组,将规范性义务绑定到RDF/OWL概念模式、可执行的SHACL验证规则、明确的证据要求和PROV-O溯源链接。一个确定性监管编译器将结构化中间表示(IR)记录转换为可组合的KB模块,实现基于配置文件的治理重配置而无需修改服务代码。我们实现了两个原型,并在AI辅助HPC资源分配场景中进行了24次验证运行和四个治理配置文件的评估。结果表明配置文件敏感的验证、严格累加的违规累积、SHACL验证延迟在12.6毫秒至100.3毫秒之间,以及配置文件等价性测试确认Combined是最严格全面的配置文件。所有工件均以开源形式发布。

英文摘要

AI-enabled services deployed in critical digital infrastructure are subject to governance obligations spanning transparency, accountability, fairness, and traceability. Compliance today remains documentation-centric: obligations are described in prose, audits rely on static checklists, and verification depends on manual review. Such approaches do not scale to automated AI systems. This paper introduces Ontological Knowledge Blocks (OKBs), a programmable governance infrastructure that compiles regulatory obligations into machine-checkable constraints over structured evidence graphs. We formalize an OKB as a 5-tuple that binds normative obligations to an RDF/OWL concept schema, executable SHACL validation rules, explicit evidence requirements, and PROV-O provenance links. A deterministic regulatory compiler translates structured Intermediate Representation (IR) records into composable KB modules, enabling profile-based governance reconfiguration without modifying service code. We implement two prototypes and evaluate them in an AI-assisted HPC resource allocation scenario across 24 validation runs and four governance profiles. Results demonstrate profile-sensitive validation, strictly additive violation accumulation, SHACL validation latency between 12.6 ms and 100.3 ms, and profile equivalence testing confirming Combined as the strictly most comprehensive profile. All artefacts are released as open source.

2605.23100 2026-05-25 cs.RO

Four Simple Proprioceptive Estimators for Legged Robots

腿式机器人的四种简单本体感受估计器

Frank Dellaert, Chiyun Noh, Varun Agrawal, Ayoung Kim

AI总结 本文研究了如何利用足端间歇接触信息来改进腿式机器人在惯性测量单元(IMU)噪声影响下的姿态估计。作者提出了一系列逐步增强的估计方法,从基于接触辅助不变扩展卡尔曼滤波(EKF)的方法出发,逐步引入因子图和固定滞后平滑技术,以提升估计精度和鲁棒性。所有四种方法均在GTSAM中实现,并提供了ROS2兼容的代码,便于复现和进一步研究。

详情
AI中文摘要

腿式机器人携带IMU,但由于消费级IMU噪声大,惯性解会漂移。然而,足部与环境产生间歇性接触,可用于减轻这种漂移。本报告开发了一系列表达力逐渐增强的腿式机器人状态估计器,利用了这一特性。在所有情况下,浮动基座状态包括姿态、位置、速度和IMU偏置。为了建模足部接触,我们从Hartley等人的接触辅助不变EKF开始,但降低了接触更新率。然后通过用小因子图替换测量更新来增强。最后,我们将相同的因子转化为带有接触时段足端接触点的固定滞后平滑器,包括和不包括变化的IMU偏置。为了促进可重复性和本体感受腿式里程计的进一步研究,所有四种变体都在GTSAM(Dellaert等人)中可用,并且我们还提供了一个与ROS2兼容的实现。

英文摘要

Legged robots carry an IMU, but the inertial solution drifts because consumer-grade IMUs are noisy. However, the feet create intermittent contacts with the environment that can be used to mitigate that drift. This report develops a sequence of increasingly expressive legged robot state estimators that leverage this. In all cases, the floating-base state comprises attitude, position, velocity, and IMU biases. To model foot contacts, we start from the contact-aided invariant EKF of Hartley et al., albeit at a reduced contact update rate. This is then augmented by replacing the measurement update with a small factor graph. Finally, we turn the same factors into a fixed-lag smoother with contact-episode footholds, with and without an evolving IMU bias. To facilitate reproducibility and further research in proprioceptive legged odometry, all four variants are available in GTSAM (Dellaert et al.), and we additionally provide a ROS2-compatible implementation.

2605.23024 2026-05-25 cs.AI cs.CC cs.CL cs.LG

The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

确定性视界:作为可信AI系统设计规范的不可行性结果

Dongxin Guo

AI总结 本文探讨了可信人工智能系统设计中由计算理论根本限制所带来的边界问题,提出将不可行性定理转化为系统设计规则的新方法。其核心成果证明了大型语言模型的准确率存在一个仅由架构决定的上限:超过临界推理深度(即“确定性视界”)后,该上限不受适配器秩、样本量或损失函数的影响,并可在部署前由模型层数和嵌入宽度计算得出。研究还展示了这一理论在多个AI子领域中的应用,形成一套包含十六项设计规范的目录,为构建更可靠的人工智能系统提供了理论依据和设计指导。

Comments PhD thesis, Department of Computer Science, The University of Hong Kong, 2026. 271 pages, 18 figures, 15 tables, 5 algorithms

详情
AI中文摘要

大型语言模型现在编写软件、起草法律文件并生成临床笔记,但从图灵、阿罗到没有免费午餐定理的基本极限,塑造了计算的能力。本文将这些不可行性结果从奇闻转化为设计规则。其旗舰结果证明了仅由架构设定的准确率上限:超过关键推理深度后,无论适配器秩、样本大小或损失函数如何,训练都无法改变它。该确定性视界在部署前可从层数和嵌入宽度计算,在十二种Transformer架构中测量值介于19到31之间,而在最优长度轨迹上微调可恢复不到4个百分点。其机制是残差流的容量不变性,信息论转换得出超过视界后准确率超指数衰减。一个针对模幂的无条件电路复杂度下界(对抗常数深度素数模电路)补充了这一结果。同样的论证重新应用于多个子领域:任何错误指定模型下的偏好学习在样本复杂度上出现不连续跳跃;多阶段检索流水线至少需要与阶段数一样多的独立指标;标准诚实拍卖对于具有提示相关估值的智能体失效;神经推理的零知识验证为每个非线性激活支付110到190倍的测量开销。这些共同构成了一个包含16条规范的目录,每条规范配对一个可计算边界、一个量化违反成本和一个建设性设计规则:两个组合已被证明,一个配对是诚实障碍,四个保持开放。本文为可信AI可能需要的生成式研究计划提供了不可行性规范方法论。AI的每一个基本极限也是一个设计规则。

英文摘要

Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, shape what computation can do. This thesis turns such impossibility results from curiosities into design rules. Its flagship result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, no amount of training moves it, at any adapter rank, sample size, or loss function. Computable before deployment from layer count and embedding width, this Deterministic Horizon is measured between nineteen and thirty-one across twelve transformer architectures, and fine-tuning on optimal-length traces recovers under four percentage points. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. An unconditional circuit-complexity lower bound for modular exponentiation against constant-depth prime-modulus circuits complements this result. The same argument recasts across subfields: preference learning under any misspecified model jumps discontinuously in sample complexity; multi-stage retrieval pipelines require at least as many independent metrics as stages; standard truthful auctions fail for agents with prompt-dependent valuations; and zero-knowledge verification of neural inference pays a measured overhead of one hundred ten to one hundred ninety times per non-linear activation. Together these form a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule: two compositions are proved, one pairing is an honest obstruction, and four remain open. The impossibility-specification methodology is offered for the generative research programme that trustworthy AI may need. Every fundamental limit of AI is also a design rule.

2605.22986 2026-05-25 cs.RO cs.AI cs.HC cs.LG

Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations

知道该问什么的机器人:通过有针对性的解释恢复未对齐的奖励

Helena Merker, Nick Walker, Andreea Bobu

AI总结 该研究针对从人类示范中学习奖励函数时存在的特征不充分问题,提出了一种通过有针对性的解释来识别并修正奖励函数偏差的框架。核心方法基于分析示范数据中各特征的一致性,识别出未充分说明的特征,并通过自然语言解释这些不确定性,主动请求针对性的补充示范。实验表明,该方法在模拟和真实机器人任务中显著提升了奖励函数的学习效果,优于随机查询和被动数据收集的方式。

详情
AI中文摘要

从演示中学习奖励函数假设演示对所有特征(或行为中与任务相关的方面)提供了充分的监督。实际上,演示往往不完美:由于认知负荷或物理难度,人类可能低估某些特征,或者训练机制可能未能充分覆盖所有相关情况。无论哪种情况,重要特征可能未被充分指定,导致学习到的奖励函数存在歧义,并在部署时出现未对齐的行为。我们提出一个框架,检测此类未充分指定的特征,并主动请求有针对性的纠正演示。我们的关键洞察是,演示隐含地揭示了哪些特征被良好指定:一致优化的特征在演示之间变化很小,而未充分指定的特征则变化很大。我们利用这一统计信号推断哪些特征可能未被充分演示。然后,机器人用自然语言解释它不确定哪些特征,并请求明确解决已识别差距的演示。我们在模拟桌面操作领域和真实Franka机器人的用户研究中评估了我们的方法。与随机查询和被动数据收集相比,有针对性的、解释引导的查询显著改善了奖励恢复,减少了否则会从有缺陷的演示中持续存在的歧义。

英文摘要

Learning reward functions from demonstrations assumes that demonstrations provide adequate supervision over all features -- or task-relevant aspects of behavior. In practice, demonstrations are often imperfect: humans may under-emphasize certain features due to cognitive load or physical difficulty, or the training regime may fail to sufficiently cover all relevant situations. In either case, important features may be underspecified, leading to ambiguity in the learned reward function and misaligned behavior at deployment. We propose a framework that detects such underspecified features and actively solicits targeted corrective demonstrations. Our key insight is that demonstrations implicitly reveal which features are well specified: features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely. We leverage this statistical signal to infer which features may have been insufficiently demonstrated. The robot then explains which features it is uncertain about in natural language and queries for demonstrations that explicitly address the identified gaps. We evaluate our approach in a simulated tabletop manipulation domain and in a user study with a real Franka robot. Targeted, explanation-guided queries significantly improve reward recovery compared to random querying and passive data collection, reducing ambiguity that would otherwise persist in learning from imperfect demonstrations.
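The statistical signal described above (well-specified features vary little across demonstrations, underspecified ones vary widely) can be illustrated with a short sketch. This is not the authors' method: the variance threshold, the feature names, the assumption that features are normalized to comparable scales, and the wording of the query are all hypothetical.

```python
from statistics import pvariance

def underspecified_features(demos, threshold):
    """Return names of features whose variance across demonstrations exceeds the threshold."""
    names = demos[0].keys()
    spread = {name: pvariance([demo[name] for demo in demos]) for name in names}
    return sorted(name for name, variance in spread.items() if variance > threshold)

def build_query(features):
    """Turn the flagged features into a plain-language request for more demonstrations."""
    if not features:
        return "No additional demonstrations needed."
    listed = ", ".join(features)
    return f"I am unsure how much these matter: {listed}. Could you demonstrate them?"

# Each demonstration is summarized by its average feature values (illustrative numbers).
demos = [
    {"distance_to_goal": 0.02, "cup_tilt": 0.40},
    {"distance_to_goal": 0.03, "cup_tilt": 0.05},
    {"distance_to_goal": 0.02, "cup_tilt": 0.75},
]
flagged = underspecified_features(demos, threshold=0.01)
query = build_query(flagged)
```

Here the demonstrator reaches the goal consistently but tilts the cup differently each time, so only the tilt feature is flagged for a targeted follow-up demonstration.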

2605.20558 2026-05-25 cs.CL

When Irregularity Helps: A Subclass Analysis of Inductive Bias in Neural Morphology

当不规则性有帮助:神经形态学中归纳偏置的子类分析

Wen Zhang

AI总结 该研究分析了神经形态生成系统在处理日语过去时动词变位时的表现,发现模型错误主要集中在一个结构特殊且数据极少的不规则子类上。通过对照实验表明,移除这一小类比移除所有不规则动词更能提升模型泛化能力,说明不同不规则模式对模型稳定性的影响不同。研究指出,错误集中源于极低频形态模式与特定形态音韵过程(尤其是促音化)的相互作用,强调形态评估应引入更细致的子类分析以揭示模型缺陷。

详情
AI中文摘要

神经形态生成系统通常在基准数据集上达到高总体准确率,但这种性能可能掩盖集中在罕见形态子类中的系统性错误。我们考察日语过去式动词屈折,发现一个非常小、结构特定的不规则子类(<1%的数据)占据了模型错误的不成比例份额。受控消融实验表明,移除该子类比移除所有不规则动词带来更大的泛化提升,表明并非所有不规则性对模型不稳定性贡献相同。这些发现表明,错误集中是由极端低频形态模式与特定形态音韵过程(特别是促音化)之间的交互驱动的。我们认为形态评估应纳入比标准变位类别更细粒度的子类分析。

英文摘要

Neural morphological generation systems often achieve high aggregate accuracy on benchmark datasets, yet such performance can conceal systematic errors concentrated in rare morphological subclasses. We examine Japanese past-tense verb inflection and show that a very small, structurally specific irregular subtype (<1% of data) accounts for a disproportionate share of model errors. Controlled ablation experiments demonstrate that removing this subtype yields larger improvements in generalization than removing all irregular verbs, indicating that not all irregularity contributes equally to model instability. These findings suggest that error concentration is driven by the interaction between extreme low-frequency morphological patterns and specific morphophonological processes, particularly gemination. We argue that morphological evaluation should incorporate finer-grained subclass analysis beyond standard conjugation categories.

2605.20043 2026-05-25 cs.CL

Mind Your Moras: Orthography-Aware Error Analysis of Neural Japanese Morphological Generation

注意你的莫拉:神经日语形态生成的拼写感知错误分析

Wen Zhang

AI总结 本文针对日语过去时态形态生成任务,提出了一种关注正字法的错误分析方法,将平假名视为编码形态音系差异的表征系统,而不仅仅是转录媒介。研究评估了两种字符级序列到序列模型,在高整体准确率下仍存在与平假名正字法特性相关的系统性错误,尤其在涉及促音化(gemination)的动词中表现明显。研究提出了包含七种主要错误模式的分类体系,并揭示了正字法表征、形态结构和数据频率在模型泛化中的紧密关联,强调了在形态复杂的语言中进行正字法感知评估的重要性。

详情
AI中文摘要

我们提出了一种拼写感知的日语过去时形态屈折错误分析,将平假名不仅视为转录媒介,而且视为编码形态音位区别的表征系统,这些区别可能影响模型泛化。我们使用根据SIGMORPHON 2020和2023共享任务约定格式化的数据集,评估了两种字符级序列到序列架构在过去时形成上的表现。尽管总体准确率较高,但模型表现出系统的、语言上可解释的错误,这些错误集中在平假名的特定拼写属性上。我们引入了一个简洁的错误分类法,捕获了七种主要失败模式,并提供了定量和定性分析。促音化相关错误主导了剩余失败,占错误的75-80%,特别是在词干以元音e结尾且需要在过去时后缀前促音化的动词中。错误模式在架构和随机种子之间高度一致,表明拼写表示、形态结构和数据频率效应在塑造模型泛化中存在稳健的交互。这些结果强调了在理解形态复杂语言的神经泛化时,拼写感知评估的必要性。

英文摘要

We present an orthography-aware error analysis of Japanese past-tense morphological inflection, treating hiragana not merely as a transcriptional medium, but as a representational system encoding morphophonological distinctions that may influence model generalization. We evaluate two character-level sequence-to-sequence architectures on past-tense formation using datasets formatted according to the SIGMORPHON 2020 and 2023 shared task conventions. Despite high aggregate accuracy, models exhibit systematic, linguistically interpretable errors that cluster around specific orthographic properties of hiragana. We introduce a concise error taxonomy capturing seven primary failure modes and provide both quantitative and qualitative analyses. Gemination-related errors dominate residual failures, accounting for 75-80% of errors, particularly in verbs whose stems end in the vowel e and require gemination before the past-tense suffix. Error patterns remain highly consistent across architectures and random seeds, suggesting a robust interaction between orthographic representation, morphological structure, and data frequency effects in shaping model generalization. These results underscore the necessity of orthography-aware evaluation for understanding neural generalization in morphologically complex languages.

2602.18788 2026-05-25 cs.CL

BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

BURMESE-SAN: 评估大语言模型的缅甸语NLP基准

Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat

AI总结 本文介绍了BURMESE-SAN,这是首个系统评估大型语言模型在缅甸语自然语言理解、推理和生成能力的综合性基准。该基准包含七项子任务,涵盖问答、情感分析、因果推理等多个领域,并通过严格的母语者参与流程构建,确保语言自然性和文化真实性。研究发现,缅甸语模型性能更依赖于架构设计、语言表示和指令微调,而非模型规模,并指出区域微调和新一代模型能显著提升效果。

详情
AI中文摘要

我们引入了BURMESE-SAN,这是第一个系统性评估大语言模型(LLM)在缅甸语上三种核心NLP能力:理解(NLU)、推理(NLR)和生成(NLG)的全面基准。BURMESE-SAN整合了涵盖这些能力的七个子任务,包括问答、情感分析、毒性检测、因果推理、自然语言推理、抽象摘要和机器翻译,其中多个任务此前在缅甸语中不可用。该基准通过严格的母语者驱动流程构建,以确保语言自然性、流畅性和文化真实性,同时最小化翻译引起的伪影。我们对开源和商业LLM进行了大规模评估,以考察缅甸语建模中因预训练覆盖有限、丰富形态和句法变异带来的挑战。我们的结果表明,缅甸语性能更多地依赖于架构设计、语言表示和指令微调,而非仅模型规模。特别是,东南亚区域微调和更新的模型世代带来了显著提升。最后,我们发布BURMESE-SAN作为公共排行榜,以支持缅甸语及其他低资源语言的系统评估和持续进步。https://leaderboard.sea-lion.ai/detailed/MY

英文摘要

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages. https://leaderboard.sea-lion.ai/detailed/MY

2602.13985 2026-05-25 cs.AI

Bridging AI and Clinical Reasoning: Abductive Explanations for Alignment on Critical Symptoms

弥合AI与临床推理:针对关键症状对齐的溯因解释

Belona Sonna, Alban Grastien

AI总结 该研究旨在解决人工智能在临床诊断中与结构化临床推理不一致的问题,提出利用形式化归因解释方法,以确保AI决策基于关键症状进行合理推理。通过识别最小充分特征集,该方法不仅提升了AI解释的透明度和可信度,还实现了与临床思维的对齐,为构建可信赖的医疗诊断AI系统提供了有效框架。

Comments Algorithm 1 is not entirely correct, which may affect the results as well. We are rerunning the experiments and will upload a new version as soon as possible

详情
AI中文摘要

人工智能在临床诊断中展现出强大潜力,其准确性常达到或超过人类专家水平。然而,一个关键挑战是AI推理常偏离结构化临床框架,限制了信任、可解释性和应用。即使预测正确,AI模型也可能忽略对快速准确决策至关重要的关键症状。现有的事后解释方法透明度有限且缺乏形式保证。为解决此问题,我们利用形式溯因解释,它在最小充分特征集上提供一致且保证的推理。这使我们能够清晰理解AI决策,并使其与临床推理对齐。我们的方法在保持预测准确性的同时提供临床可操作的见解,为医疗诊断中可信AI建立了稳健框架。

英文摘要

Artificial intelligence (AI) has demonstrated strong potential in clinical diagnostics, often achieving accuracy comparable to or exceeding that of human experts. A key challenge, however, is that AI reasoning frequently diverges from structured clinical frameworks, limiting trust, interpretability, and adoption. Critical symptoms, pivotal for rapid and accurate decision-making, may be overlooked by AI models even when predictions are correct. Existing post hoc explanation methods provide limited transparency and lack formal guarantees. To address this, we leverage formal abductive explanations, which offer consistent, guaranteed reasoning over minimal sufficient feature sets. This enables a clear understanding of AI decision-making and allows alignment with clinical reasoning. Our approach preserves predictive accuracy while providing clinically actionable insights, establishing a robust framework for trustworthy AI in medical diagnosis.

2511.00266 2026-05-25 cs.LG cs.RO

X-TRACK: Physics-Aware xLSTM for Realistic Vehicle Trajectory Prediction

X-TRACK: 物理感知的xLSTM用于真实车辆轨迹预测

Aanchal Rajesh Chugh, Marion Neumeier, Sebastian Dorn

AI总结 准确的轨迹预测对自动驾驶系统的安全性和可靠性至关重要,尤其需要在高速公路场景中建模长期时间依赖关系并考虑车辆之间的社会交互。本文提出了一种基于xLSTM的新型高速公路轨迹预测框架X-TRAJ,并进一步引入其物理感知变体X-TRACK,通过显式整合车辆运动学约束,生成更真实可行的轨迹。实验表明,X-TRACK在公开数据集highD和NGSIM上均优于现有先进方法,尤其在highD上表现突出。

详情
AI中文摘要

准确的轨迹预测对于安全可靠的自动驾驶系统至关重要,需要模型能够捕捉长期时间依赖性,同时考虑高速公路驾驶场景中相邻车辆之间的社交互动。虽然长短期记忆(LSTM)网络在轨迹预测领域得到了广泛应用,但它们存在记忆容量有限和标量细胞状态等局限性。最近引入的扩展长短期记忆(xLSTM)通过引入指数门控和增强的记忆结构解决了传统LSTM的这些局限性,使其更适合建模长期时间依赖性。尽管具有潜力,基于xLSTM的模型在车辆轨迹预测方面仍未得到充分探索。本文首次将xLSTM应用于高速公路轨迹预测,提出了新颖的基于xLSTM的高速公路轨迹预测框架X-TRAJ,以及其物理感知变体X-TRACK(受运动学约束的扩展LSTM轨迹预测),该变体将车辆运动学显式集成到模型学习过程中。通过引入物理约束,所提出的模型生成真实可行的高速公路轨迹。在公开的高速公路数据集highD和NGSIM上的全面评估表明,X-TRACK在highD上优于最先进的基线,并在NGSIM数据集上达到最先进模型水平。

英文摘要

Accurate trajectory prediction is crucial for safe and reliable autonomous driving systems, requiring models that capture long-term temporal dependencies while accounting for social interactions among neighboring vehicles in highway driving scenarios. While Long Short Term Memory (LSTM) networks have been widely used in the domain of trajectory prediction, they have limitations such as limited memory capacity and scalar cell state. The recently introduced Extended Long Short Term Memory (xLSTM) addresses these limitations of traditional LSTMs by introducing exponential gating and enhanced memory structures, making them better suited for modeling long-term temporal dependencies. Despite their potential, xLSTM-based models remain underexplored in the context of vehicle trajectory prediction. This paper introduces a novel xLSTM-based highway trajectory prediction framework, X-TRAJ, as the first application of xLSTM, and its physics-aware variant, X-TRACK (eXtended LSTM for TRAjectory prediction Constraint by Kinematics), which explicitly integrates vehicle motion kinematics into the model learning process. By introducing physical constraints, the proposed model generates realistic and feasible highway trajectories. A comprehensive evaluation on the publicly available highway datasets, highD and NGSIM, demonstrates that X-TRACK outperforms state-of-the-art baselines on highD and is among the state-of-the-art models on the NGSIM dataset.

2605.23871 2026-05-25 stat.ML cs.LG math.ST stat.TH

Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer

Muon上的移动:Muon优化器的哈密顿概率梯度流视角

Aratrika Mustafi, Soumya Mukherjee, Bharath K. Sriperumbudur

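AI summary From the perspective of Hamiltonian probability gradient flows, this paper studies the continuous-time dynamics of the Muon optimizer. It formulates a gradient flow induced by regularized Muon and links the regularized orthogonalization map to a Fenchel-dual smoothing of the nuclear norm. By lifting Muon to finite-particle probability objectives, the authors derive its inertial continuous-time limit and a phase-space mean-field equation over probability laws on parameter-momentum pairs, and show that the resulting flow is a damped Hamiltonian probability dynamics whose Hamiltonian energy decreases monotonically. The paper also analyzes convergence of the objective and extends the approach to a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.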

详情
AI中文摘要

我们开发了一种在矩阵值参数概率测度空间上的梯度流,该梯度流由正则化Muon(理想化Muon优化器的解析平滑版本)诱导。关键观察是正则化正交化映射是核范数的光滑Fenchel对偶平滑的梯度。这确定了(正则化)Muon更新为更新变量中的镜像/近端步骤,其中动量充当对偶坐标。我们利用这一结构将Muon从单个矩阵参数提升到形如$J(ρ)=R\left(\int F d ρ\right)$的有限粒子概率目标,这一设置由神经网络训练的均场描述所激发,并推导出惯性连续时间极限。利用这一结构,我们在步长和动量的惯性缩放下推导出有限粒子连续时间极限,然后过渡到参数-动量对概率律上的相空间均场方程。所得流可被证明是阻尼哈密顿概率动力学,其动能由正则化Muon镜像势诱导。我们证明了一个精确的哈密顿耗散恒等式,显示哈密顿能量单调递减。虽然目标函数本身在惯性Muon动力学下不一定单调,但在额外的梯度优势、有界动量和曲率/对齐假设下,我们获得了目标间隙的连续和离散时间指数收敛率。我们还研究了均场极限方程的适定性,并建立了相互作用粒子系统的混沌传播保证。最后,我们将公式扩展到乘积矩阵空间上的Hilbert值特征映射,得到适用于平滑Transformer混合专家模型的块状Muon概率流。

英文摘要

We develop a gradient flow on the space of probability measures defined on matrix-valued parameters induced by regularized Muon, an analytically smoothed version of the idealized Muon optimizer. The key observation is that the regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm. This identifies the (regularized) Muon update as a mirror/prox step in the update variable, with momentum acting as the dual coordinate. We use this structure to lift Muon from a single matrix parameter to finite-particle probability objectives of the form $J(ρ)=R\left(\int F d ρ\right)$, a setting motivated by mean-field descriptions of neural-network training, and derive the inertial continuous-time limit. Using this structure, we derive the finite-particle continuous-time limit under the inertial scaling of step size and momentum, and then pass to a phase-space mean-field equation over probability laws on parameter-momentum pairs. The resulting flow can be shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential. We prove an exact Hamiltonian dissipation identity, showing that the Hamiltonian energy decreases monotonically. While the target objective itself need not be monotone along the inertial Muon dynamics, under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, we obtain continuous and discrete-time exponential convergence rates for the objective gap. We also study the well-posedness of the mean-field limit equation and establish propagation of chaos guarantees for the interacting particle system. Finally, we extend the formulation to Hilbert-valued feature maps on product matrix spaces, yielding a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.
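
For a full-rank matrix $M = U \Sigma V^\top$, the gradient of the nuclear norm is the polar factor $U V^\top$, which is the orthogonalization that the idealized Muon update applies to the momentum. As a rough illustration of that unregularized map only, the sketch below approximates the polar factor with the classical cubic Newton-Schulz iteration. The helper names (`matmul`, `transpose`, `polar_factor`) and the example matrix are hypothetical, and the paper's regularized smoothing, which the abstract does not spell out, is not reproduced here.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def polar_factor(m, steps=30):
    """Approximate U V^T for m = U S V^T via cubic Newton-Schulz."""
    # Scale by the Frobenius norm so all singular values lie in (0, 1].
    fro = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 X X^T X pushes every singular value toward 1
        # while leaving the singular vectors unchanged.
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

# Hypothetical momentum matrix (symmetric positive definite).
momentum = [[3.0, 1.0], [1.0, 2.0]]
update = polar_factor(momentum)
```

Because the example matrix is symmetric positive definite, its left and right singular vectors coincide, so the computed polar factor should be close to the identity; this gives a quick correctness check for the iteration.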

2605.23635 2026-05-25 stat.ML cs.LG

Dirichlet-Based Monte Carlo Dropout for Uncertainty Estimation in Neural Networks

基于狄利克雷的蒙特卡洛丢弃法用于神经网络不确定性估计

Rouaa Hoblos, Noura Dridi, Noureddine Zerhouni, Zeina Al Masry

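AI summary Traditional neural networks provide no uncertainty estimates for their predictions, and Bayesian neural networks can quantify uncertainty but at high computational cost. This paper proposes a Dirichlet-based Monte Carlo Dropout method that improves the quality of uncertainty estimates while retaining computational efficiency. By modeling class probabilities with a Dirichlet distribution, the method yields a more informative uncertainty representation, and the reported results indicate that it produces well-calibrated uncertainty estimates.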

Journal ref 56es Journées de Statistique de la SFdS, Jun 2025, Marseille, France

详情
AI中文摘要

传统神经网络提供确定性预测,缺乏固有的不确定性估计。虽然贝叶斯神经网络(BNN)为不确定性量化提供了原则性方法,但其计算复杂度限制了可扩展性。蒙特卡洛(MC)Dropout最初作为正则化技术引入,已被证明通过多次随机前向传播实现概率建模,从而近似贝叶斯推断。在这项工作中,我们通过在MC Dropout中集成基于狄利克雷的框架来增强深度学习中的不确定性估计。具体来说,我们利用Sensoy等人(2018)提出的公式,其中使用狄利克雷分布对类概率进行建模,从而允许更信息化的不确定性表示。所提出的方法保持了MC Dropout的计算效率,同时提高了不确定性估计的质量。我们讨论了所提出方法的理论基础,并将其与现有的不确定性量化技术进行了比较。结果突显了所提出方法在产生良好校准的不确定性估计方面的有效性,为不确定性感知的深度学习模型提供了实用解决方案。

英文摘要

Traditional neural networks provide deterministic predictions without inherent uncertainty estimates. While Bayesian Neural Networks (BNNs) offer a principled approach to uncertainty quantification, their computational complexity limits scalability. Monte Carlo (MC) Dropout, initially introduced as a regularization technique, has been shown to approximate Bayesian inference by enabling probabilistic modeling through multiple stochastic forward passes. In this work, we enhance uncertainty estimation in deep learning by integrating a Dirichlet-based framework within MC Dropout. Specifically, we leverage the formulation proposed by Sensoy et al. (2018), where class probabilities are modeled using a Dirichlet distribution, allowing for a more informative uncertainty representation. The proposed approach maintains the computational efficiency of MC Dropout while improving the quality of uncertainty estimates. We discuss the theoretical foundations of our method and compare it with existing uncertainty quantification techniques. The results highlight the effectiveness of the proposed method in producing well-calibrated uncertainty estimates, offering a practical solution for uncertainty-aware deep learning models.
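
In the formulation of Sensoy et al. (2018), non-negative evidence $e_k$ per class gives Dirichlet parameters $\alpha_k = e_k + 1$, expected class probabilities $\alpha_k / S$ with $S = \sum_k \alpha_k$, and an uncertainty mass $u = K / S$ for $K$ classes. The sketch below applies those formulas to the outputs of several stochastic forward passes. How the paper actually combines the passes is not stated in the abstract, so averaging the Dirichlet parameters across passes, the ReLU evidence function, and the function names are assumptions made for illustration.

```python
def dirichlet_params(logits):
    """Map raw network outputs to Dirichlet parameters: alpha = ReLU(logit) + 1."""
    return [max(0.0, z) + 1.0 for z in logits]

def summarize(alpha):
    """Return expected class probabilities and the uncertainty mass u = K / S."""
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength
    return probs, uncertainty

def mc_dropout_dirichlet(passes):
    """Combine T stochastic forward passes (one logit vector per pass).

    Averaging alpha over passes is an illustrative choice, not
    necessarily the combination rule used in the paper.
    """
    alphas = [dirichlet_params(logits) for logits in passes]
    mean_alpha = [sum(col) / len(alphas) for col in zip(*alphas)]
    return summarize(mean_alpha)

# Two hypothetical dropout passes over a 3-class problem.
probs, uncertainty = mc_dropout_dirichlet([[2.0, -1.0, 0.0],
                                           [4.0, 0.0, -3.0]])
```

When every pass produces zero evidence, all $\alpha_k$ equal 1 and the uncertainty mass is 1, its maximum; more evidence lowers it toward 0.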

2605.17468 2026-05-25 cs.HC cs.AI

An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training

一种可解释的闭环智能辅导系统,用于异步演讲训练中的多模态情感反馈

Hung-Yue Suen, Kuo-En Hung

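AI summary This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that provides multimodal affective feedback for asynchronous presentation training, helping learners develop on-camera oral presentation skills at scale. Built around a seven-dimensional Behaviorally Anchored Rating Scale (BARS), the system combines multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching in a three-layer interpretable feedback architecture that turns facial, vocal, textual, and oculomotor inputs into traceable, evidence-based feedback. Experiments show scoring on MOOC video data comparable to expert ratings, and learners improved significantly on all seven BARS dimensions over a 30-day practice window.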

Comments 12 pages, 8 figures, IEEE Transactions on Learning Technologies, 2026

详情
AI中文摘要

本文提出了一种可解释的闭环智能辅导系统(ITS),支持大规模开发摄像机前口头演讲技能的反馈引导练习。该系统操作化了一个七维行为锚定评级量表(BARS),并实现了一个三层可解释反馈架构,该架构连接了与评分标准一致的多模态评分、观众感知的表达诊断以及检索增强的对话式辅导,以支持刻意练习。基于XGBoost骨干,该ITS将多模态输入(面部、声音、文本和眼动特征)映射为基于证据的反馈,这些反馈可以追溯到可观察的表现线索。在10,360个大规模开放在线课程(MOOC)视频片段上训练后,该系统实现了与专家评分相当的表现水平的评分标准一致评分($R^2$ = 0.48-0.61,Spearman's rho = 0.69-0.78,MAE = 0.43-0.57)。在204名成年学习者为期30天的练习窗口的前后验证研究中,参与者在所有七个BARS维度上表现出显著改善(Cohen's d = 0.39-0.90),在控制基线分数和人口统计学因素后,练习频率与后测成绩呈强正相关。结果展示了如何通过集成的反馈架构将多模态分析输出系统地转化为可观察的行为变化,推动了基于表现的能力的可解释和教学导向的ITS设计。

英文摘要

This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that supports feedback-guided practice for developing on-camera oral presentation skills at scale. The system operationalizes a seven-dimensional Behaviorally Anchored Rating Scale (BARS) and implements a three-layer interpretable feedback architecture that connects rubric-aligned multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching to support deliberate practice. Built on an XGBoost backbone, the ITS maps multimodal inputs (facial, vocal, textual, and oculomotor features) into evidence-based feedback that can be traced back to observable performance cues. Trained on 10,360 Massive Open Online Course (MOOC) video segments, the system achieved rubric-aligned scoring with performance levels comparable to expert ratings ($R^2$ = 0.48-0.61, Spearman's rho = 0.69-0.78, MAE = 0.43-0.57). In a pre-post validation study with 204 adult learners over a 30-day practice window, participants demonstrated significant improvements across all seven BARS dimensions (Cohen's d = 0.39-0.90), with practice frequency showing a strong positive association with posttest performance after controlling for baseline scores and demographics. The results demonstrate how multimodal analytic outputs can be systematically transformed into observable behavioral change through an integrated feedback architecture, advancing explainable and pedagogically grounded ITS design for performance-based competencies.
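
To make the reported statistics concrete, the following sketch computes a mean absolute error between predicted and expert scores and a Cohen's d between pretest and posttest scores. The abstract does not say which variant of Cohen's d was used for the pre-post design, so the version here (mean difference divided by the root of the averaged sample variances) is an assumption, and all data values and function names are made up for illustration.

```python
from statistics import mean, variance

def mean_absolute_error(predicted, reference):
    """Average absolute gap between model scores and reference ratings."""
    return mean(abs(p - r) for p, r in zip(predicted, reference))

def cohens_d(pre, post):
    """Standardized mean difference between two score lists.

    Uses the root of the averaged sample variances as the
    standardizer; other conventions exist for paired designs.
    """
    pooled_sd = ((variance(pre) + variance(post)) / 2) ** 0.5
    return (mean(post) - mean(pre)) / pooled_sd

# Hypothetical scores on one rubric dimension.
mae = mean_absolute_error([3.0, 4.5], [3.5, 4.0])
effect = cohens_d([2.0, 3.0, 4.0], [3.0, 4.0, 5.0])
```

In the toy data every learner gains exactly one point and both score lists have unit variance, so the effect size equals the raw mean gain.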