arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.25388 2026-05-26 cs.LG q-bio.QM

ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks

Dongxin Ye, Fang Hu, Han Hu, Shu Hu, Yang Tan, Wanli Ouyang, Stan Z. Li, Jie Cui, Nanqing Dong

Affiliations Shanghai Innovation Institute, Shanghai, China; University of Electronic Science; Fudan University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Institute of Infection Health, Fudan University, Shanghai, China; Shanghai Sci-Tech Inno Center for Infection & Immunity, Shanghai, China; Shanghai Jiao Tong University, Shanghai, China; Shenzhen Loop Area Institute, Shenzhen, China; Chinese University of Hong Kong, Hong Kong, China; Westlake University, Hangzhou, China

AI summary Presents ViroBench, the first comprehensive benchmark for viral genomics, which evaluates 66 nucleotide foundation models on biological understanding and latent biosecurity risk. Performance degrades under phylogenetic and temporal shifts, statistical likelihood decouples from biological functional validity in generation tasks, and taxonomic diversity in pretraining data matters more than parameter scale.

Comments 42 pages, 15 figures

Abstract

Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we introduce ViroBench, the first comprehensive and large-scale benchmark specifically designed for NFMs in viral settings. ViroBench evaluates models across two critical dimensions: biological understanding and latent biosecurity risk, covering 18 diverse scenarios within 4 task types. Extensive evaluation of 66 NFMs across diverse architectures yields three critical conclusions. Firstly, NFMs exhibit a performance degradation in biological understanding under phylogenetic and temporal shifts, indicating weak extrapolation capabilities. Secondly, generation tasks reveal a decoupling between statistical likelihood and biological functional validity, posing latent biosecurity risks. Thirdly, controlled ablation studies reveal that taxonomic diversity in pretraining data outweighs parameter scale. Specifically, a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. Overall, ViroBench provides interpretable, diagnostic evaluations and a reproducible measurement framework for future research on viral nucleotide foundation models. The datasets and code are publicly available at https://github.com/QIANJINYDX/ViroBench.

2605.25385 2026-05-26 cs.CV cs.AI

Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask Guidance

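Xia Li, Xinran Liu, Lin Qi, Junyu Dong

Affiliations School of Computer Science and Technology, Ocean University of China, Qingdao 266100, China

AI summary Proposes MGNet, which is trained on pseudo-labels generated by SAM and combines a cascaded mask decoder, a context enhancement module, and a mask-guided feature aggregation module for weakly supervised camouflaged object detection, with performance competitive with current state-of-the-art methods.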

Comments 18 pages

Abstract

Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module (CEM) to reduce missed detections, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. For the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, BoxSAM provides high-quality pixel-level pseudo-labels for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.

2605.25384 2026-05-26 cs.CL

GeoMathCode: Understanding Interleaved Math-Code Reasoning for Geometry Problem Solving

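Yingji Zhang, Yong Dai, André Freitas

Affiliations Idiap Research Institute; X-Humanoid; Department of Computer Science, University of Manchester; Cancer Biomarker Centre, CRUK Manchester Institute

AI summary Proposes GeoMathCode, which uses programmatic representations as intermediate visual outputs to analyze reasoning and code generation by multimodal large language models on geometry problems. Reasoning and code steps can be disentangled in the latent space, supervised fine-tuning makes the reasoning manifold more structured, and hierarchical code structures carry more mathematical symbolic information.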

Abstract

Mathematical reasoning is a hallmark of human intelligence, requiring logical deduction, symbolic manipulation, and abstract thinking. Recent multimodal large language models (MLLMs) have demonstrated strong performance on geometry problems through multi-step reasoning. To better emulate human problem-solving, intermediate steps can incorporate auxiliary visual constructions, such as additional lines or points, which improve geometric interpretation and educational clarity. In this work, we introduce GeoMathCode, where programmatic representations serve as intermediate visual outputs. We further conduct an in-depth analysis of the underlying reasoning geometry. Experimental results show that reasoning and code generation steps can be disentangled in the latent space, while supervised fine-tuning (SFT) makes the reasoning manifold more structured and informative. Moreover, hierarchical syntactic code structures emerge as disentangled latent subspaces and contain more mathematical symbolic information than visual representations.

2605.25381 2026-05-26 cs.LG

Not only where, But when: Temporal Scheduling for RLVR

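Jinghao Zhang, Ruilin Li, Feng Zhao, Jiaqi Wang

Affiliations University of Science and Technology of China; Shanghai Innovation Institute; Wuhan University

AI summary Addresses the neglect of heterogeneous policy behaviors in reinforcement learning with verifiable rewards (RLVR) by proposing temporal scheduling, which adjusts the credit allocation criterion over the course of training; experiments show more stable and efficient training.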

Comments Github: https://github.com/Jinghaoleven/RLVR-Schedule

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core technique for post-training of Large Language Models (LLMs). While policy optimization is driven by all sampled tokens under a globally broadcast scalar reward, the heterogeneous policy behaviors exhibited along trajectories are largely overlooked and left undifferentiated. Existing works address this through credit allocation, including token-level advantage reweighting and selective token optimization; however, the allocation criteria remain largely static throughout training, limiting resilient policy evolution. In this work, we argue that when learning signals are scheduled can be as important as where they are allocated across tokens, and we introduce a temporal dimension: scheduling the credit allocation criteria over the course of RLVR optimization. We find that prioritizing targeted tokens associated with specific policy behaviors, and gradually attenuating toward general optimization, leads to more stable and efficient learning dynamics. Furthermore, we show that simple trajectory percentiles provide a natural perspective for distinguishing policy behaviors and work effectively with temporal scheduling. Our analysis reveals that standard optimization substantially sacrifices policy entropy when simultaneously accommodating heterogeneous behaviors, whereas temporal scheduling yields healthier policy evolution dynamics. Experiments across mathematical and general reasoning benchmarks demonstrate consistent improvements, suggesting that temporal scheduling constitutes a promising optimization dimension.
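The idea of starting with targeted tokens and gradually attenuating toward general optimization can be illustrated with a toy weighting schedule. This is a minimal sketch under stated assumptions, not the authors' method: the linear ramp, the reading of "trajectory percentiles" as relative token position, and the names `ramp`, `token_weights`, and `scheduled_advantages` are all hypothetical.

```python
def ramp(step: int, total_steps: int) -> float:
    """Linear schedule coefficient: 0.0 at the start of training, 1.0 at the end.

    Assumption: a linear ramp; the abstract does not specify the schedule shape.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def token_weights(num_tokens, step, total_steps, lo_pct=0.0, hi_pct=0.25):
    """Per-token weights for a single trajectory.

    Tokens whose relative position falls in [lo_pct, hi_pct) are treated as
    targeted and always get weight 1.0. All other tokens get the schedule
    coefficient, so the weighting becomes uniform as training progresses.
    The percentile band is a hypothetical choice for illustration.
    """
    alpha = ramp(step, total_steps)
    weights = []
    for i in range(num_tokens):
        position = i / num_tokens  # relative position in [0, 1)
        targeted = lo_pct <= position < hi_pct
        weights.append(1.0 if targeted else alpha)
    return weights


def scheduled_advantages(advantage, num_tokens, step, total_steps):
    """Broadcast one scalar advantage to all tokens, scaled by the schedule."""
    return [advantage * w for w in token_weights(num_tokens, step, total_steps)]


if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, token_weights(8, step, 100))
```

Early in training only the targeted band contributes to the update; by the final step every token carries the full scalar advantage, which recovers the standard globally broadcast reward.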

2605.25379 2026-05-26 cs.CL

EfficientGraph-RAG: Structured Retrieval-State Management for Cross-Task Retrieval-Augmented Generation

Miaohe Niu, Lianlei Shan, Zhengtao Yu, Jingbo Zhu, Tong Xiao

Affiliations School of Computer Science and Engineering, Northeastern University, China; Tsinghua University, China; Kunming University of Science and Technology, China; NiuTrans Research, Shenyang, China

AI summary Proposes the EfficientGraph-RAG framework, which makes the retrieval state explicit through three mechanisms (TAM, MARS, and SMP) for structured state management, improving answer quality on multiple benchmarks while reducing large-model token usage.

Comments 19 pages, 5 figures, 14 tables

Abstract

Retrieval-augmented generation (RAG) has become the standard way to ground large language models in external knowledge, but many systems still organize evidence as flat chunks and retrieve it through largely unstructured search. This weak structure becomes a bottleneck for complex retrieval: the system must decide where to search, how to move from coarse topics to entity-relation evidence, which evidence has been verified, and which intermediate artifacts can be reused. We define these intermediate variables as a retrieval state and study RAG as structured state management. EfficientGraph-RAG makes this state explicit through three coupled mechanisms: TAM defines a typed hierarchical state space over evidence, MARS updates and verifies the state through role-specialized agents, and SMP stores reusable state under hierarchy-aware access control. Using one shared framework configuration, EfficientGraph-RAG ranks first on the reported answer-quality metrics averaged over the three evaluated LongBench retrieval-style subsets, matches the strongest agentic baseline on HotpotQA EM while reducing large-model token usage by $3.51\times$, and provides a low-token DocVQA result among retrieval-organizing cross-modal methods. Component analysis shows role-specific mechanisms: MARS is the main answer-quality driver, TAM supplies the typed traversal state and Adaptive Routing signal, and SMP enables corpus-dependent reuse, with cross-query cache hit rates ranging from 3.77% to 23.18%.

2605.25377 2026-05-26 cs.CV cs.AI

Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation

Ruoxi Cheng, Haoxuan Ma, Zhengfei Hai, Yiyan Huang, Ranjie Duan, Tianle Zhang, Xu Yang, Ziyi Ye, Xingjun Ma

Affiliations Fudan University; Tencent; Nanjing University; Southeast University; Great Bay University; TeleAI, China Telecom

AI summary Proposes the Adversarial Orthogonal Disentanglement (AOD) framework, which learns a hallucination-related direction through a minimax objective and uses it in a training-free dual-forward-pass contrastive decoding strategy to mitigate hallucinations in large vision-language models (LVLMs).

Abstract

Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where generated content conflicts with visual facts. Existing mitigation methods either rely on costly external interventions, such as instruction tuning and retrieval, or use internal mechanisms that remain limited by flawed attention weights and entangled hidden representations. We propose Adversarial Orthogonal Disentanglement (AOD), a latent geometric framework for mitigating LVLM hallucinations. AOD learns a hallucination-related direction through a minimax objective: a classifier concentrates hallucination signals into the projected component, while an adversary removes them from the orthogonal residual space via a Gradient Reversal Layer. The learned direction enables a training-free dual-forward-pass contrastive decoding strategy that suppresses hallucinations while preserving general capabilities. Experiments on three LVLMs across four hallucination and four utility benchmarks show that AOD consistently outperforms strong baselines. It improves POPE accuracy by over 6% on average, boosts AMBER by 6%, and maintains strong performance on utility tasks such as MMMU. Further analysis shows robust transfer across datasets, suggesting that AOD captures general hallucination-related biases rather than dataset-specific artifacts. Our source code and datasets are available at https://github.com/Hunter-Wrynn/AOD.
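The split of a hidden representation into a component projected onto a direction and an orthogonal residual can be illustrated with plain vector arithmetic. This is a minimal sketch, not the authors' implementation: the function names and toy vectors are hypothetical, the direction is hand-picked rather than learned, and the classifier, adversary, and decoding steps are left out.

```python
import math


def normalize(vector):
    """Return the unit-length version of a non-zero vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose(hidden, direction):
    """Split `hidden` into (projected, residual) with respect to `direction`.

    projected = <hidden, d> * d for the unit vector d along `direction`;
    residual  = hidden - projected, which is orthogonal to d.
    """
    d = normalize(direction)
    coefficient = dot(hidden, d)
    projected = [coefficient * x for x in d]
    residual = [h - p for h, p in zip(hidden, projected)]
    return projected, residual


if __name__ == "__main__":
    projected, residual = decompose([3.0, 4.0], [2.0, 0.0])
    print(projected, residual)
```

In the setting the abstract describes, a classifier would read the projected component while an adversary operates on the residual; here the two parts are only computed, and they sum back to the original vector.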

2605.25373 2026-05-26 cs.CV

Physics-Aware 3D Gaussian Editing for Driving Scene Generation

Feng Zhou, Jian Zhang, Yuhang Sun, He Wang, Qiong Wen, Debao Kong, Tieru Wu, Rui Ma

Affiliations School of Artificial Intelligence, Jilin University; National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University; China FAW Group Co., Ltd.

AI summary Proposes RoVES, a system that combines single-image-driven road geometry insertion with a 4-DOF half-car dynamics model for physics-aware driving scene editing and vehicle pose correction.

Abstract

3D Gaussian Splatting (3DGS) has shown great potential in autonomous driving simulation and data generation, enabling photorealistic reconstruction and flexible scene manipulation. However, existing 3DGS scene editing methods have limited support for road geometry editing (e.g., inserting speed humps or sunken roads), and generally do not couple such edits with plausible vehicle-road interaction dynamics. Such editing is essential for generating training data under extreme driving scenarios or evaluating system reliability under these road irregularities. Moreover, many optimization-based methods require minutes of per-edit refinement, while existing efficient alternatives mainly focus on appearance-level or object-level manipulation rather than physics-aware road irregularity editing. To address these limitations, we propose RoVES, a Road-and-Vehicle Editing System for physics-aware 3D Gaussian editing in driving scenes. RoVES enables single-image-driven road geometry insertion and couples the edited road profile with a 4-DOF half-car vehicle dynamics model to achieve physics-aware vehicle pose correction in vertical displacement and pitch. RoVES inserts road elements in a one-shot, optimization-free pipeline (1.84s), and the full pipeline (including color transfer and vehicle-dynamics-based pose correction) completes in 6.24s; it edits dynamic vehicles via pose editing and corrects poses frame-by-frame to approximate dynamics-consistent vertical displacement and pitch responses. Experiments on the Waymo dataset show that RoVES provides practical efficiency and competitive visual consistency for physics-aware driving scene generation.

2605.25364 2026-05-26 cs.CV

Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

Longteng Guo, Yifan Wang, Pengkang Huo, Tailai Chen, Yuze Wu, Jing Liu, Xinxin Zhu

Affiliations Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary Proposes the VisReason benchmark of 1,505 questions on everyday scenarios to evaluate multimodal large language models on vision-centric reasoning, revealing a substantial gap between humans and models.

Comments Accepted by ACL 2026 Findings, resources released at https://github.com/CASIA-IVA-Lab/VisReason

Abstract

Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce VisReason, a benchmark for vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing substantial gaps between humans and current MLLMs and revealing limited benefits from test-time reasoning strategies. VisReason offers a focused diagnostic for evaluating vision-centric reasoning beyond language.

2605.25363 2026-05-26 cs.CV

MARVEL: Universal Murray's Law-informed Vessel Tree Segmentation and Topology Estimation

Yi Zhou, Thiara Sana Ahmed, Jacqueline Chua, Meng Wang, Qinrong Zhang, Alejandro F. Frangi, Huazhu Fu, Jun Cheng, Leopold Schmetterer, Bingyao Tan

Affiliations Singapore Eye Research Institute; Singapore National Eye Centre; Ophthalmology & Visual Sciences Academic

AI summary Proposes MARVEL, a backbone-agnostic framework that regularizes training with differentiable Murray's law constraints, improving the physiological plausibility and topological consistency of vessel segmentation and significantly outperforming baseline models on hypertension classification.

Comments 10 pages, 18 figures

Abstract

Vascular circulation follows fundamental biophysical principles that optimize mass transport and metabolic energy expenditure, which can be effectively modeled by Murray's law. However, contemporary deep learning methods for vascular segmentation often neglect these biophysical constraints. This leads to physiologically implausible branching and misclassified vascular trees, rendering these automated segmentation results unreliable for downstream clinical tasks such as blood flow simulation or disease quantification. In this paper, we introduce MARVEL (Universal MurrAy's law-infoRmed Vessel sEgmentation and topoLogy estimation), a backbone-agnostic framework that integrates biophysical priors into vascular tree extraction. MARVEL combines per-pixel supervision with explicit radius predictions to enforce local bifurcation constraints derived from an empirical width-exponent mapping. We implement these constraints as differentiable regularizers during training to guide models toward physiologically consistent reconstructions. We evaluate MARVEL on eight public datasets across multiple vascular modalities and segmentation backbones. Results demonstrate MARVEL's superior performance in segmentation accuracy, topological consistency, and physiological plausibility. By converting segmented masks into graph-based hemodynamic simulations, we demonstrate that MARVEL preserves the subtle pathological narrowing and topological connectivity required to distinguish hypertensive from normotensive eyes. Results show that MARVEL significantly improves the classification of hypertension via arteriovenous pressure differences in the eye (p < 0.001), outperforming baseline models in both topological consistency and clinical predictive value.
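In its classical form, Murray's law states that at a bifurcation the cube of the parent vessel radius equals the sum of the cubes of the daughter radii. The sketch below shows how a bifurcation penalty of that kind could be computed. It is an illustration under assumptions, not the MARVEL regularizer: the relative-residual form, the mean-squared aggregation, and the function names are hypothetical, and the exponent is exposed as a parameter only because the abstract mentions an empirical width-exponent mapping without giving it.

```python
def murray_residual(parent_radius, child_radii, exponent=3.0):
    """Relative deviation from r_parent**k == sum(r_child**k).

    Returns 0.0 when the bifurcation satisfies the law exactly, a positive
    value when the daughters are too thin, and a negative value when they
    are too thick. exponent=3.0 is the classical Murray exponent.
    """
    if parent_radius <= 0 or any(r <= 0 for r in child_radii):
        raise ValueError("radii must be positive")
    parent_term = parent_radius ** exponent
    child_term = sum(r ** exponent for r in child_radii)
    return (parent_term - child_term) / parent_term


def murray_penalty(bifurcations, exponent=3.0):
    """Mean squared relative residual over (parent, [children]) pairs."""
    if not bifurcations:
        return 0.0
    residuals = [murray_residual(p, c, exponent) for p, c in bifurcations]
    return sum(r * r for r in residuals) / len(residuals)


if __name__ == "__main__":
    symmetric_child = 2.0 ** (-1.0 / 3.0)  # satisfies the law for parent radius 1
    print(murray_residual(1.0, [symmetric_child, symmetric_child]))
    print(murray_penalty([(2.0, [1.0, 1.0])]))
```

Plain Python floats are not differentiable; to use such a term as a training regularizer, the same expression would be written in an automatic-differentiation framework over predicted radii.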

2605.25362 2026-05-26 cs.RO

Prior Policy Guided Dual-Agent Coordinated Manipulation Planning of Spacecraft-Manipulator System

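Yuhui Hu, Dong Zhou, Kaihong Ouyang, Zhongliang Yu, Jianfeng Lv, Xiangyu Shao

Affiliations School of Astronautics; School of Automation; School of Information Science and Engineering

AI summary Targets the attitude stability problem caused by strong coupling between a space manipulator and its base, proposing a prior-policy-guided dual-agent coordinated manipulation planning framework whose timestep-level expert switching mechanism improves deep reinforcement learning efficiency and achieves high-precision end-effector reaching together with base attitude stabilization.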

Comments 36 pages, 13 figures, 6 tables. Under review

Abstract

The strong dynamic coupling between the manipulator and the base poses a significant challenge to maintaining spacecraft attitude stability, potentially compromising mission safety. In this paper, we propose a Dual-Agent Coordinated Manipulation Planning (DACMP) framework that simultaneously achieves high-precision end-effector pose reaching for a 6-DoF space manipulator and attitude stabilization of the base spacecraft. To enhance learning efficiency, we present a prior policy-guided Deep Reinforcement Learning algorithm incorporating the Timestep-level Expert Switching Guidance (TESG) mechanism, thereby promoting global convergence and improving task success rates. Extensive experiments demonstrate that DACMP significantly outperforms baseline DRL algorithms in terms of task success rate and control precision. Furthermore, the robustness of DACMP is validated under various challenging scenarios, including system constraints, environmental disturbances, and perception uncertainties. The code and simulation configurations are available on GitHub: https://github.com/HIT-YuhuiHu/DACMP.

2605.25360 2026-05-26 cs.CL

Learning to Route Languages for Multilingual Policy Optimization

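Geyang Guo, Hiromi Wakaki, Yuki Mitsufuji, Alan Ritter, Wei Xu

Affiliations Georgia Institute of Technology; Sony Group Corporation

AI summary Proposes the language-routed policy optimization (LRPO) framework, which treats language as a selectable variable and uses online policy optimization with a trainable language router (a multi-armed bandit) to choose languages adaptively, increasing the diversity and informativeness of multilingual training signals under a fixed budget and consistently improving multilingual performance.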

Comments Accepted at ICML 2026

Abstract

Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation for training. We release all the resources at https://github.com/Guochry/LRPO.
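A multi-armed bandit over languages can be sketched with the standard UCB1 rule, which adds an exploration bonus to each arm's mean reward. This is a generic illustration and not the LRPO router: the abstract does not say which bandit algorithm is used, and the class name, the reward signal, and the exploration constant here are hypothetical.

```python
import math


class UCB1LanguageRouter:
    """Pick a response language with the UCB1 bandit rule (illustrative only)."""

    def __init__(self, languages, exploration=2.0):
        if not languages:
            raise ValueError("at least one language is required")
        self.languages = list(languages)
        self.exploration = exploration
        self.counts = {lang: 0 for lang in self.languages}
        self.mean_reward = {lang: 0.0 for lang in self.languages}
        self.total = 0

    def select(self):
        """Return the next language to try."""
        # Try every language once before applying the confidence bound.
        for lang in self.languages:
            if self.counts[lang] == 0:
                return lang

        def score(lang):
            bonus = math.sqrt(
                self.exploration * math.log(self.total) / self.counts[lang]
            )
            return self.mean_reward[lang] + bonus

        return max(self.languages, key=score)

    def update(self, lang, reward):
        """Record an observed reward with an incremental mean update."""
        self.counts[lang] += 1
        self.total += 1
        n = self.counts[lang]
        self.mean_reward[lang] += (reward - self.mean_reward[lang]) / n


if __name__ == "__main__":
    router = UCB1LanguageRouter(["en", "ja", "sw"])
    for reward in (1.0, 0.0, 0.0):
        lang = router.select()
        router.update(lang, reward)
    print(router.select())
```

The bonus term shrinks as a language is sampled more often, so rarely used languages keep being revisited while languages with higher observed reward are favored overall.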

2605.25358 2026-05-26 cs.CL cs.AI cs.CY

AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

Thomas Stephan Juzek

Affiliations Florida State University

AI summary Analyzes news corpora in 34 languages with a GPT-4.1 continuation diagnostic and finds that AI-overused words converge semantically across languages and that their prevalence increased after ChatGPT's release.

Comments 19 pages (9-page main body, plus references and appendices), 3 figures; ACL ARR reviewed, committed to EMNLP 2026

Abstract

AI-associated lexical shifts have been documented mainly in Scientific English. We extend this work to 34 languages in the WMT News Crawl corpus, refining a split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text. For each language, we derive ranked AI-overused lemmas using log prevalence ratios. We find substantial cross-lingual semantic convergence: semantically related concepts recur across typologically diverse languages, with 'emphasize'-type verbs appearing in 24 of 34 languages. Embedding-based and manual analyses support this pattern. We also examine diachronic uptake in news writing before and after ChatGPT's release. Tracking each language's top 20 AI-overused items, we find prevalence increases in 26 of 34 languages from 2020-2021 to 2023-2024, with a mean change of +15.1%, whilst matched baseline words show no comparable increase (-4.5%). In 10 languages with longer historical coverage, longitudinal analyses show post-2022 increases that exceed the modest shifts observed in earlier periods, though with smaller effect sizes than in Scientific English. We validate our approach extensively, including across seeds, model variants, data sizes, model families, and more. Our findings are consistent with the view that AI-associated lexical preferences extend beyond English and may exert cross-lingual homogenising pressure on global language use.
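Ranking lemmas by a log prevalence ratio can be illustrated on toy data. The sketch below is an assumption-laden example rather than the paper's procedure: it takes prevalence to mean the share of documents containing a lemma, adds a smoothing constant to avoid division by zero, and uses hypothetical function names and documents.

```python
import math


def smoothed_prevalence(lemma, documents, smoothing=0.5):
    """Share of documents containing `lemma`, with additive smoothing.

    Each document is a set of lemmas. The smoothing constant is an
    assumption made here so that unseen lemmas do not yield a zero.
    """
    if not documents:
        raise ValueError("documents must be non-empty")
    hits = sum(1 for doc in documents if lemma in doc)
    return (hits + smoothing) / (len(documents) + 2 * smoothing)


def log_prevalence_ratio(lemma, ai_docs, human_docs, smoothing=0.5):
    """log(prevalence in AI continuations / prevalence in human text)."""
    ai_prev = smoothed_prevalence(lemma, ai_docs, smoothing)
    human_prev = smoothed_prevalence(lemma, human_docs, smoothing)
    return math.log(ai_prev / human_prev)


def rank_overused(ai_docs, human_docs, top_k=20):
    """Return the top_k (lemma, ratio) pairs, most AI-overused first."""
    vocabulary = set().union(*ai_docs, *human_docs)
    scored = [
        (lemma, log_prevalence_ratio(lemma, ai_docs, human_docs))
        for lemma in vocabulary
    ]
    # Sort by descending ratio, breaking ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


if __name__ == "__main__":
    ai_docs = [{"emphasize", "report"}, {"emphasize", "say"}]
    human_docs = [{"say", "report"}, {"say"}]
    for lemma, ratio in rank_overused(ai_docs, human_docs):
        print(lemma, round(ratio, 3))
```

A positive ratio marks a lemma that appears in a larger share of model continuations than of matched human text; a ratio near zero marks comparable use.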

2605.25357 2026-05-26 cs.CV cs.MA

Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent Collaboration

Xiaotian Hu, Mingxuan Liu, Junwei Huang, Kasidit Anmahapong, Yifei Chen, Yiming Huang, Xuguang Bai, Zihan Li, Hongjia Yang, Yingqi Hao, Hong Xu, Yu Jiang, Tian Tian, Yi Liao, Haibo Qu, Qiyuan Tian

Affiliations Tsinghua University; University of California San Diego; West China Second University Hospital, Sichuan University

AI summary Proposes FetUSAgents, a multi-agent system that integrates visual tools with clinical reasoning through collaborative LLM agents and Dual-Path Evidence Arbitration (DPEA), exceeding the strongest baseline by more than 25 percent in VQA accuracy.

Abstract

Automated fetal ultrasound interpretation requires a workflow from visual perception, including plane recognition and anatomical segmentation, to clinical understanding, including biometric measurement and diagnostic reporting. However, the prevailing "one-task, one-model" paradigm limits systematic integration of evidence across this multi-step process. Although multimodal large language models (MLLMs) show promising visual understanding, their limited domain-specific grounding and hallucination risks restrict reliability in fetal ultrasound analysis. To address these limitations, we propose FetUSAgents, a tool-augmented multi-agent system for comprehensive fetal ultrasound interpretation, supporting visual question answering (VQA), report generation, image captioning, and video summarization. FetUSAgents coordinates task-specific visual tools through collaborative LLM agents and decomposes clinical queries into subtasks that progress from anatomical recognition to quantitative measurement. We further introduce Dual-Path Evidence Arbitration (DPEA), which integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools. A retrieval-enhanced evidence bank consolidates intermediate findings to support traceable and clinically grounded conclusions. In addition, we construct FetUS-VQA, a dedicated VQA benchmark for fetal ultrasound, comprising 1,892 images and 3,205 question-answer pairs across 10 clinical tasks. Extensive out-of-distribution experiments show that FetUSAgents outperforms general and medical MLLMs, exceeding the strongest baseline by more than 25 percent in VQA accuracy. These results suggest a scalable route toward evidence-driven clinical assistants for prenatal imaging. Code is available.

2605.25354 2026-05-26 cs.AI

Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis

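Hongbo Jin, Mingnan Zhu, Jingqi Tian, Xu Jiang, Zhongjing Du, Haoran Tang, Siyi Xie, Qiaoman Zhang, Jiayu Ding

Affiliations Peking University; Xiamen University; Tsinghua University

AI summary Addresses the weak context learning ability of large language models, that is, dynamically extracting and applying new knowledge, by proposing Context-CoT, which synthesizes high-quality reasoning chains to enhance context learning and significantly improves performance on CL-Bench.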

Abstract

While LLMs excel at reasoning over prompts using static pretrained knowledge, they struggle significantly with context learning: the ability to dynamically extract, internalize, and apply new knowledge from complex, task-specific contexts. Recent evaluations on CL-Bench reveal a critical capability gap: frontier models solve only 17.2% of context-dependent tasks on average.

2605.25352 2026-05-26 cs.LG cs.AI

Certified Robustness from Approximate Gaussian Mixture Structures in Pretrained Latent Spaces

基于预训练潜在空间中近似高斯混合结构的认证鲁棒性

Konstantinos Emmanouilidis, Tianjiao Ding, Nghia Nguyen, Nicolas Loizou, René Vidal

Affiliations: CS & MINDS, Johns Hopkins University; CIS, University of Pennsylvania; AMS & MINDS, Johns Hopkins University; ESE, Radiology & IDEAS, University of Pennsylvania

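AI summary: Proposes a framework in which a pretrained encoder maps inputs to a latent distribution that is approximately a Gaussian mixture. A theoretical analysis bounds the resulting loss in robustness, yielding certifiably robust classifiers with state-of-the-art or competitive certified accuracy on CIFAR-10 and ImageNet.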

详情
AI中文摘要

深度学习模型易受对抗扰动影响,这对安全关键部署提出了重要关切。经验性防御在实践中可以实现强鲁棒性,但缺乏形式化保证,这推动了可认证鲁棒分类器的需求。虽然认证方法提供了形式化保证,但由于无法利用复杂数据分布中的结构,它们通常产生过于保守的边界。在这项工作中,我们提出了一个设计可认证鲁棒分类器的框架,该框架利用数据表示中的潜在结构。我们首先分析高斯混合设置,推导出鲁棒分类器存在的必要和充分条件,并构建了一个具有闭式鲁棒性证书和泛化保证的分类器。我们的主要贡献是证明精确结构并非必需:我们证明,如果预训练编码器将输入映射到一个与高斯混合分布$\varepsilon$-接近(在KL散度下)的潜在分布,那么认证准确率会优雅地退化,并给出了一个显式边界,关联真实分布和近似分布下的鲁棒性。这一结果使得直接使用预训练模型成为可能,而无需精确的分布假设。实验上,我们的方法在CIFAR-10和ImageNet上实现了最先进或具有竞争力的认证准确率,同时保持了强大的干净性能和低计算开销。总体而言,我们的工作将近似潜在结构确立为通往可认证鲁棒性的一条实用且有原则的路径。

英文摘要

Deep learning models are vulnerable to adversarial perturbations, raising important concerns for safety-critical deployment. Empirical defenses can achieve strong robustness in practice, but lack formal guarantees, motivating the need for certifiably robust classifiers. While certified methods provide formal guarantees, they often yield overly conservative bounds due to their inability to exploit structure in complex data distributions. In this work, we propose a framework for designing certifiably robust classifiers that leverages latent structure in data representations. We first analyze the Gaussian mixture setting, deriving necessary and sufficient conditions for the existence of robust classifiers and constructing a classifier with a closed-form robustness certificate and generalization guarantees. Our main contribution is to show that exact structure is not required: we prove that if a pretrained encoder maps inputs to a latent distribution that is $\varepsilon$-close (in KL divergence) to a Gaussian mixture, then certified accuracy degrades gracefully, with an explicit bound relating robustness under the true and approximate distributions. This result enables the direct use of pretrained models without requiring exact distributional assumptions. Empirically, our method achieves state-of-the-art or competitive certified accuracy on CIFAR-10 and ImageNet, while maintaining strong clean performance and low computational overhead. Overall, our work establishes approximate latent structure as a practical and principled route to certifiable robustness.
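To make the idea of a closed-form robustness certificate concrete, here is a minimal sketch for the simplest case only: two Gaussian classes with equal priors and a shared isotropic covariance, where the Bayes classifier is linear. The function name `linear_certificate` and the toy means are illustrative assumptions, and this is not claimed to be the certificate derived in the paper. For a linear classifier, the certified L2 radius at a point is simply its distance to the decision hyperplane.

```python
import numpy as np

def linear_certificate(x, mu0, mu1):
    """Nearest-mean (linear) classifier with an exact L2 robustness radius.

    Assumes two classes with equal priors and shared isotropic covariance,
    so the decision boundary is the hyperplane w @ x + b = 0.
    """
    w = mu1 - mu0
    b = -(mu1 @ mu1 - mu0 @ mu0) / 2.0
    margin = w @ x + b
    label = int(margin > 0)
    # No perturbation with L2 norm below this radius can change the label.
    radius = abs(margin) / np.linalg.norm(w)
    return label, radius

# Hypothetical latent means; the boundary is the line x[0] = 2.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([4.0, 0.0])
label, radius = linear_certificate(np.array([3.0, 1.0]), mu0, mu1)
print(label, radius)
```

In this toy setting the query point lies one unit from the boundary, so the certificate equals that distance; handling latent distributions that are only approximately Gaussian mixtures is the part the paper addresses and this sketch does not.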

2605.25347 2026-05-26 cs.CV cs.LG

ERNIE-Image Technical Report

ERNIE-Image 技术报告

Jiaxiang Liu, Zhida Feng, Pengyu Zou, Zhenyu Qian, Tianrui Zhu, Jun Xia, Yuehu Dong, Yanzheng Lin, Honglin Xiong, Anqi Chen, Yunpeng Ding, Jinghui Duan, Lin Gao, Chao Han, Tiechao He, Jiakang Hu, Ranjun Hua, Xueming Jiang, Qingli Kong, Yuting Lei, Tianyu Li, Yunlin Liu, Changling Liu, Yaxin Liu, Yi Liu, Xuguang Liu, Xiaolong Ma, Yan Pan, Yiran Ren, Nan Sheng, Yu Sun, Siyang Sun, Yixiang Tu, Yang Wan, Huanai Wang, Siqi Wang, Yang Wu, Youzhi Yang, Xiaowen Yang, Jianwen Yang, Yehua Yang, Quanwen Zhang, Xinmin Zhang, Haoxin Zhang, Xiang Zhang, Jun Zhang, Qian Zhang, Qiao Zhao, Qi Zhou

发表机构 * ERNIE Team, Baidu(百度ERNIE团队)

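AI summary: Presents ERNIE-Image, an open-source text-to-image generation model built on an 8B single-stream DiT architecture. It combines bottom-up pre-training data construction, top-down post-training data construction, a stabilized DPO strategy, and MT-DMD distillation, and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality.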

详情
AI中文摘要

我们介绍了ERNIE-Image,一个基于8B单流DiT架构构建的开源文本到图像生成模型。ERNIE-Image旨在通过更有效地挖掘大规模预训练数据并在整个训练过程中提高监督质量,来弥合当前开源模型与领先闭源系统之间的差距。在预训练阶段,我们采用自底向上的数据构建流程,结合细粒度图像分类、丰富的标题注释、美学评估和分层采样。该策略在保留长尾概念和详细真实世界知识的同时减少数据噪声,为复杂生成任务提供了更坚实的基础。在后训练阶段,我们针对高需求场景使用自顶向下的数据构建流程,多样化提示注释以更好地匹配真实用户输入,并应用稳定的DPO策略使模型与人类美学偏好对齐。我们进一步训练ERNIE-Image-Turbo以实现高效的8-NFE生成,并提出MT-DMD以减轻蒸馏过程中的能力漂移。为了使模型在实际场景中更易于使用,我们为其配备了一个轻量级的提示增强器,将简洁的用户意图扩展为结构化的视觉描述。此外,我们开发了工业级美学模型ERNIE-Image-Aes,以及用于真实美学评估的人工标注基准ERNIE-Image-Aes-1K。大量的定性和定量实验表明,ERNIE-Image在开源模型中实现了领先性能,并在指令遵循、文本渲染和美学质量方面接近顶级商业模型。我们发布训练好的模型和美学资源,以促进AIGC社区的进一步学术研究和技术进步。

英文摘要

We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.

2605.25346 2026-05-26 cs.RO cs.AI cs.LG cs.SY eess.SY math.OC

Parallel Differentiable Reachability for Learning and Planning with Certified Neural Dynamics and Controllers

用于学习和规划的并行可微可达性:带认证的神经动力学与控制器

Keyi Shen, Glen Chou

发表机构 * MIT(麻省理工学院)

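AI summary: Proposes a parallel, differentiable reachability framework in JAX that combines Taylor-model flowpipe construction with CROWN linear bound propagation and supports GPU batching and automatic differentiation. It is used for certified training and reachability-aware MPC, enabling online planning with certified reachable-set over-approximations under bounded uncertainty on non-prehensile manipulation and quadrotor tasks.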

Comments Robotics: Science and Systems XXII (RSS 2026)

详情
AI中文摘要

神经网络动力学模型和控制策略在机器人领域取得了强大性能,但在不确定性下提供可靠保证仍然困难,尤其是对于闭环神经网络系统。现有的可达性工具提供了形式化的过近似,但通常不可微、过于保守或对于现代学习和在线规划流程来说太慢。为了解决这个问题,我们提出了一个在JAX中可并行化、可微的可达性框架,适用于连续和离散时间系统,具有解析和基于神经网络的动力学和控制器。我们的框架通过统一表示结合了泰勒模型流形构建和CROWN风格的线性界传播,该表示在支持GPU批处理计算和自动微分的同时保留了仿射依赖。基于这个可达性基元,我们开发了(i)一种认证训练方法,鼓励生成对可达性友好的动力学模型和控制器,以及(ii)一种具有基于梯度细化的可达性感知采样MPC方案。在非抓取操作和四旋翼任务上的实验,包括硬件和更高维度的评估(高达72维),展示了在实际在线规划中保持有界不确定性下认证可达集过近似的可行性。

英文摘要

Neural network (NN) dynamics models and control policies achieve strong performance in robotics, but providing sound guarantees under uncertainty remains difficult, especially for closed-loop NN systems. Existing reachability tools provide formal over-approximations, yet are often non-differentiable, overly conservative, or too slow for modern learning and online planning pipelines. To address this, we present a parallelizable, differentiable reachability framework in JAX for continuous- and discrete-time systems with analytical and NN-based dynamics and controllers. Our framework combines Taylor-model flowpipe construction with CROWN-style linear bound propagation through a unified representation that preserves affine dependencies while supporting GPU-batched computation and automatic differentiation. Building on this reachability primitive, we develop (i) a certified training method that encourages reachability-friendly dynamics models and controllers, and (ii) a reachability-aware sampling-based MPC scheme with gradient-based refinement. Experiments on non-prehensile manipulation and quadrotor tasks, including hardware and higher-dimensional evaluations (up to 72D), demonstrate practical online planning while maintaining certified reachable-set over-approximations under bounded uncertainty.
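As a rough illustration of what a reachable-set over-approximation for a neural closed-loop system looks like, the sketch below propagates an axis-aligned box through a small ReLU network for several steps. It uses plain interval arithmetic in NumPy, which is a deliberately simpler and more conservative stand-in for the Taylor-model and CROWN-style bounds described in the abstract; the network weights, `reach`, and the other names are hypothetical.

```python
import numpy as np

def affine_bounds(lo, hi, W, b):
    """Sound bounds on W @ x + b for every x in the box [lo, hi]."""
    center, radius = (lo + hi) / 2.0, (hi - lo) / 2.0
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def relu_bounds(lo, hi):
    """ReLU is monotone, so bounds map through elementwise."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def step(x, W1, b1, W2, b2):
    """Concrete one-step closed-loop dynamics (toy two-layer network)."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def reach(lo, hi, params, horizon):
    """Boxes that contain the state at each step, starting from [lo, hi]."""
    W1, b1, W2, b2 = params
    boxes = [(lo, hi)]
    for _ in range(horizon):
        lo, hi = affine_bounds(lo, hi, W1, b1)
        lo, hi = relu_bounds(lo, hi)
        lo, hi = affine_bounds(lo, hi, W2, b2)
        boxes.append((lo, hi))
    return boxes

rng = np.random.default_rng(0)
params = (0.5 * rng.standard_normal((4, 2)), np.zeros(4),
          0.5 * rng.standard_normal((2, 4)), np.zeros(2))
lo0, hi0 = np.array([-0.1, -0.1]), np.array([0.1, 0.1])
boxes = reach(lo0, hi0, params, horizon=3)
```

Every operation here is a batched array expression, which is why bounds of this kind lend themselves to GPU batching and automatic differentiation; boxes discard the affine dependencies that the paper's unified representation is said to preserve, so they grow looser over long horizons.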

2605.25344 2026-05-26 cs.CL cs.AI cs.LG quant-ph

A general tensor-structured compression scheme for efficient large language models

一种用于高效大语言模型的通用张量结构压缩方案

Ying Lu, Peng-Fei Zhou, Qi-Xuan Fang, Pan Zhang, Shi-Ju Ran, Gang Su

发表机构 * School of Physical Sciences, University of Chinese Academy of Sciences(中国科学院大学物理科学学院) Kavli Institute for Theoretical Sciences, University of Chinese Academy of Sciences(中国科学院大学理论科学研究院) Center for Quantum Physics and Intelligent Sciences, Department of Physics, Capital Normal University(首都师范大学量子物理与智能科学中心) Institute of Theoretical Physics, Chinese Academy of Sciences(中国科学院理论物理研究所)

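AI summary: Proposes Tensor Mixture (MixT), which replaces dense linear layers with mixtures of tensor operators, substantially reducing parameters, FLOPs, and memory while largely preserving MMLU accuracy.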

Comments 12 pages, 4 figures

详情
AI中文摘要

大语言模型(LLMs)主要由密集线性变换主导,其存储、内存和计算开销阻碍了高效的适配和部署,同时掩盖了结构简化对功能的影响。本文提出张量混合(MixT),一种通用的张量结构压缩方案,将目标密集线性层替换为可原生执行的张量算子混合体。MixT直接作用于通用线性投影而非模型特定组件,因此可能适用于基于Transformer的LLMs及其他密集神经映射。我们在统一的恢复协议下对Qwen3-8B和LLaMA2-7B评估MixT,识别出一个广泛的压缩区域,在该区域内MMLU准确率基本保持不变,直到模型特定边界处出现突变。该突变与输出熵、预测熵和层间几何的协同变化同时发生。在LLaMA2-7B的突变边界处,MixT将全模型参数减少47.5%,推理FLOPs减少37.1%,训练FLOPs减少52.1%,峰值推理内存减少60.4%,展示了其在低成本LLM压缩中的实际潜力。

英文摘要

Large language models (LLMs) are dominated by dense linear transformations, whose storage, memory and computational overheads hinder efficient adaptation and deployment while masking the functional impacts of structural simplification. Here we present Tensor Mixture (MixT), a general tensor-structured compression scheme that replaces targeted dense linear layers with natively executable mixtures of tensor operators. Operating directly on generic linear projections instead of model-specific components, MixT is potentially applicable across Transformer-based LLMs and other dense neural mappings. We evaluate MixT on Qwen3-8B and LLaMA2-7B under a unified recovery protocol, identifying a broad compressible regime in which MMLU accuracy is largely preserved before an abrupt transition at model-specific boundaries. This transition coincides with coordinated shifts in output entropy, prediction entropy and inter-layer geometry. At the LLaMA2-7B transition boundary, MixT reduces full-model parameters by 47.5%, inference FLOPs by 37.1%, training FLOPs by 52.1% and peak inference memory by 60.4%, demonstrating its practical potential for lower-cost LLM compression.
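The abstract does not spell out MixT's tensor operators, so the sketch below uses a single Kronecker-factored layer as a generic stand-in to show why tensor-structured layers save parameters and compute. The factor shapes and the name `kron_linear` are assumptions for illustration, not the MixT construction. The structured layer is applied without ever materializing the dense matrix.

```python
import numpy as np

def kron_linear(x, A, B):
    """Compute np.kron(A, B) @ x without forming the dense matrix.

    With x reshaped row-major to X of shape (A.shape[1], B.shape[1]),
    the product equals (A @ X @ B.T) flattened row-major.
    """
    X = x.reshape(A.shape[1], B.shape[1])
    return (A @ X @ B.T).reshape(-1)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))     # small factor
B = rng.standard_normal((16, 16))   # small factor
x = rng.standard_normal(128)

y = kron_linear(x, A, B)
dense_params = 128 * 128            # parameters of the equivalent dense layer
factored_params = A.size + B.size   # parameters actually stored
print(dense_params, factored_params)
```

A single Kronecker factor is far more restrictive than a dense layer, which is presumably why a mixture of several tensor operators and a recovery protocol are needed to retain accuracy.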

2605.25343 2026-05-26 cs.CV

Toward Native Multimodal Modeling: A Roadmap

迈向原生多模态建模:路线图

Siyu An, Junru Lu, Junnan Dong, Qiufeng Wang, Yinghui Li, Weizhi Fei, Zichao Yu, Zheng Yuan, Biao Liu, Haopeng Wang, Renzhao Liang, Yixuan Yang, Yunhang Shen, Bo Ke, Keyu Chen, Linhao Luo, Difan Zou, Xiao Huang, Di Yin, Ruizhi Qiao, Xing Sun

Affiliations: Tencent Youtu Lab; Tsinghua University; The University of Hong Kong; University of Warwick; Monash University; The Hong Kong Polytechnic University

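AI summary: Presents a formal roadmap for the transition from non-native multimodal paradigms to native multimodal modeling (NMM). It categorizes existing models through input-output duality and systematically covers a full-stack, industrial-grade pipeline: architectural coordination, data curation, training and inference, and evaluation.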

Comments 52 pages, 5 figures, 3 tables, ~300 references

详情
AI中文摘要

多模态建模是从模态无关推理迈向世界建模的关键一步。早期方法主要依赖后期融合,即组装编码器、冻结语言骨干网络和输出头;而近期研究已将范式转向原生多模态建模(NMM),通过模态的内在集成实现卓越的多模态性能。尽管潜力巨大,原生架构的设计空间仍缺乏明确定义。本文向社区呈现了这一过渡的正式路线图。具体而言,我们正式定义了架构原生性,将中期融合和早期融合与非原生范式区分开来。我们进一步通过输入-输出二元性的视角将现有原生模型组织为三类:(i) 多到文本,用于仅输出文本的跨模态理解;(ii) 多到目标,用于面向场景的生成,例如图像、音频和视频生成;(iii) 多到多,用于对称输入-输出的统一建模。我们对迈向最终NMM框架的过渡进行了全面且工业级的调查,在该框架中,理解和生成在统一的Transformer范式中无缝共存。我们从工业视角系统地拆解了端到端流水线,包括架构协调、大规模数据整理、全栈训练配方、推理与部署,以及真正原生建模的综合评估。

英文摘要

Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. While early approaches predominantly rely on late fusion, which assembles encoders and frozen language backbones with output heads, recent efforts have shifted the paradigm toward native multimodal modeling (NMM), which integrates modalities intrinsically for superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize the existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text for cross-modal comprehension with text-only output; (ii) Multi-to-Target for scenario-oriented generation, e.g., image, audio and video generation; and (iii) Multi-to-Multi for unified modeling with symmetric input-output. We deliver a comprehensive and industrial-grade investigation into the transition toward the definitive NMM framework, where understanding and generation seamlessly coexist within a unified transformer paradigm. We systematically unpack the end-to-end pipeline from an industrial perspective, covering architectural coordination, massive data curation, full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.

2605.25342 2026-05-26 cs.CL

MATO: Multi-objective Personalized Alignment with Test-time Optimization for Large Language Models

MATO: 面向大语言模型的多目标个性化对齐与测试时优化

Linhao Luo, Thuy-Trang Vu, Van-Anh Nguyen, Junae Kim, Gholamreza Haffari, Dinh Phung

Affiliations: Monash University; Defence Science and Technology Group, Australia

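AI summary: Proposes MATO, a framework that uses test-time optimization to adjust multi-objective weights dynamically during decoding, aligning large language models with diverse user preferences without training or external reward models.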

Comments Preprint

详情
AI中文摘要

将大语言模型与多样且多方面的用户偏好对齐是个性化AI系统的基本挑战。现有的多目标对齐方法要么依赖昂贵的训练,要么需要为每个偏好预训练奖励模型,这使得它们难以适应不断变化的偏好。基于提示的个性化提供了一种无需训练的替代方案,但仅靠提示通常提供有限的可操控性,因为大语言模型可能过度强调或忽略某些偏好,并且在冲突出现时无法让用户可靠地控制不同目标的相对重要性,导致对齐效果欠佳。在本文中,我们介绍了MATO,一种无需训练的多目标个性化对齐与测试时优化框架。MATO将个性化表述为一个测试时优化问题,在解码过程中通过可控权重引导多个目标的相对重要性,无需修改模型参数或需要外部奖励模型。具体来说,奖励发现模块直接从骨干大语言模型中恢复针对自然语言指定的多种目标的偏好奖励,而权重优化模块根据用户的初始偏好和部分生成的响应动态调整目标权重,以在生成过程中平衡相互竞争的目标。得到的奖励和权重共同指导对令牌分布的在线优化过程,从而更好地与目标对齐。在多个数据集和骨干大语言模型上的大量实验表明,MATO始终优于强基线,实现了帕累托改进的多目标对齐和更强的可操控性。这些结果凸显了测试时优化作为可扩展、可控且模型无关的个性化对齐的一个有前景的方向。

英文摘要

Aligning large language models (LLMs) with diverse and multifaceted user preferences is a fundamental challenge in personalized AI systems. Existing multi-objective alignment methods either rely on costly training or require pre-trained reward models for each preference, making it difficult for them to adapt to evolving preferences. Prompt-based personalization offers a training-free alternative, but prompting alone often provides limited steerability, as LLMs may overemphasize or overlook certain preferences and fail to give users reliable control over the relative importance of different objectives when conflicts arise, leading to suboptimal alignment. In this paper, we introduce MATO, a training-free framework for Multi-objective personalized Alignment with Test-time Optimization. MATO formulates personalization as a test-time optimization problem that steers the relative importance of multiple objectives through controllable weights during decoding, without modifying model parameters or requiring external reward models. Specifically, a reward discovery module recovers preference rewards directly from the backbone LLM for diverse objectives specified in natural language, while a weight optimization module dynamically adjusts objective weights based on the user's initial preferences and the partially generated response to balance competing objectives during generation. The resulting rewards and weights jointly guide an online optimization procedure over the token distribution, enabling better alignment with the target objectives. Extensive experiments across multiple datasets and backbone LLMs show that MATO consistently outperforms strong baselines, achieving Pareto-improving multi-objective alignment and stronger steerability. These results highlight test-time optimization as a promising direction for scalable, controllable, and model-agnostic personalized alignment.
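One simple way to picture weight-controlled steering at decoding time is an exponential tilt of the next-token distribution by a weighted sum of per-objective rewards. The sketch below assumes that form, along with a hypothetical `steer_distribution` function, a temperature-like `beta`, and hand-written reward vectors; the abstract does not state MATO's actual update rule, reward discovery, or weight optimization, so none of this should be read as the authors' procedure.

```python
import numpy as np

def steer_distribution(logits, rewards, weights, beta=1.0):
    """Reweight next-token probabilities as p(t) * exp(sum_i w_i * r_i(t) / beta).

    logits:  (vocab,) base model logits
    rewards: (num_objectives, vocab) per-token reward for each objective
    weights: (num_objectives,) user-controllable objective weights
    """
    tilted = logits + (weights @ rewards) / beta
    tilted = tilted - tilted.max()          # subtract max for numerical stability
    probs = np.exp(tilted)
    return probs / probs.sum()

# Toy vocabulary of three tokens and two objectives (illustrative values).
logits = np.zeros(3)
rewards = np.array([[1.0, 0.0, 0.0],    # objective A prefers token 0
                    [0.0, 0.0, 1.0]])   # objective B prefers token 2
p_a = steer_distribution(logits, rewards, np.array([1.0, 0.0]))
p_b = steer_distribution(logits, rewards, np.array([0.0, 1.0]))
print(p_a, p_b)
```

Changing only the weight vector moves probability mass between the tokens each objective favors, with no change to model parameters, which is the kind of steerability the abstract describes.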

2605.25338 2026-05-26 cs.LG cs.AI

CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures

CausalFlow: LLM Agent 失败的因果归因与反事实修复

Akash Bonagiri, Devang Borkar, Gerard Janno Anderias, Setareh Rafatirad, Houman Homayoun

发表机构 * Department of Computer Science University of California, Davis(计算机科学系加州大学戴维斯分校)

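AI summary: Proposes CausalFlow, a framework that computes step-level Causal Responsibility Scores via counterfactual intervention, identifies failure-inducing steps, and generates minimally edited repairs for test-time repair and training-time supervision. Across four benchmarks it outperforms heuristic refinement in complex retrieval settings.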

详情
AI中文摘要

大型语言模型(LLM)代理在涉及推理、工具使用和环境交互的多步任务中经常失败。虽然此类失败通常被记录或通过启发式重试处理,但它们包含了关于执行中断位置的结构化信号。我们提出了CausalFlow,一个干预框架,将失败的代理轨迹转换为最小的反事实修复和可重用的监督。CausalFlow将执行轨迹建模为依赖步骤的顺序链,并通过步骤级反事实干预计算因果责任分数(CRS)来识别导致失败的步骤。对于这些步骤,我们生成最小编辑修复,将最终结果翻转为成功,产生形式为(错误步骤,修正步骤)的验证对比对。CausalFlow支持两种互补用途:具有最小行为漂移的针对性测试时修复,以及适用于离线偏好优化或奖励建模的训练时监督。在涵盖数学推理、代码生成、问答和医学浏览的四个基准测试中,CausalFlow将失败执行转换为具有高最小性和因果一致性分数的验证最小修复,并证明因果归因对于跨不同代理任务的可靠改进是必要的,在复杂检索设置中优于启发式细化,同时产生更局部的修复。这些结果表明,对结构化执行轨迹的干预分析提供了一种原则性和可扩展的机制,将代理失败转化为可靠性提升和可学习的监督。

英文摘要

Large language model (LLM) agents frequently fail on multi-step tasks involving reasoning, tool use, and environment interaction. While such failures are typically logged or retried heuristically, they contain structured signals about where execution broke down. We introduce CausalFlow, an interventional framework that converts failed agent traces into minimal counterfactual repairs and reusable supervision. CausalFlow models execution traces as sequential chains of dependent steps and computes Causal Responsibility Scores (CRS) via step-level counterfactual intervention to identify failure-inducing steps. For these steps, we generate minimally edited repairs that flip the final outcome to success, producing validated contrastive pairs of the form (wrong step, corrected step). CausalFlow supports two complementary uses: targeted test-time repair that recovers from failures with minimal behavioral drift, and training-time supervision suitable for offline preference optimization or reward modeling. Across four benchmarks spanning mathematical reasoning, code generation, question answering, and medical browsing, CausalFlow converts failed executions into validated minimal repairs with high minimality and causal-consensus scores, and demonstrates that causal attribution is necessary for reliable improvement across diverse agent tasks, outperforming heuristic refinement in complex retrieval settings while producing more localized repairs throughout. These results demonstrate that interventional analysis over structured execution traces provides a principled and scalable mechanism for transforming agent failures into reliability gains and learning-ready supervision.
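The step-level intervention idea can be sketched on a toy trace: replace one step at a time with a candidate repair, re-run the chain, and record whether the final outcome flips to success. The binary score, the `responsibility` function, and the arithmetic "agent" below are illustrative assumptions; the paper's actual CRS definition and repair generator are not given in the abstract.

```python
from typing import Callable, List

Step = Callable[[int], int]

def run(trace: List[Step], x: int) -> int:
    """Execute a sequential chain of dependent steps."""
    for step in trace:
        x = step(x)
    return x

def responsibility(trace: List[Step], repairs: List[Step], x: int,
                   is_success: Callable[[int], bool]) -> List[float]:
    """Score 1.0 for a step if swapping in its repair alone makes the run succeed."""
    scores = []
    for i, repair in enumerate(repairs):
        patched = trace[:i] + [repair] + trace[i + 1:]
        scores.append(1.0 if is_success(run(patched, x)) else 0.0)
    return scores

# Intended computation: (x + 2) * 2 - 1. The second step is buggy (multiplies by 3).
trace = [lambda v: v + 2, lambda v: v * 3, lambda v: v - 1]
# Hypothetical candidate edits, one per step; in practice an LLM would propose them.
repairs = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 2]

target = 9  # expected result for x = 3
scores = responsibility(trace, repairs, 3, lambda out: out == target)
print(scores)
```

The step whose single-step repair flips the outcome yields a (wrong step, corrected step) pair of the kind the abstract describes as reusable supervision.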

2605.25334 2026-05-26 cs.CV

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

双路径几何感知多模态大语言模型用于空间智能

Yufei Zheng, Xuhan Zhu, Zide Liu, Chunpeng Zhou, Chenfeng Wang, Yongchao Xu, Yunnan Wang, Jiawei Liu, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha

Affiliations: University of Science and Technology of China; Li Auto Inc.; Shanghai Jiao Tong University

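AI summary: Proposes GAMSI, a multimodal large language model that takes only RGB images as input and jointly perceives 3D structure and metric scale through dual-pathway queries and expert-guided visual alignment, achieving state-of-the-art performance on seven spatial intelligence benchmarks.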

详情
AI中文摘要

从2D视觉输入理解物理世界的空间能力依赖于两种互补的几何知识:整体3D结构感知和细粒度度量尺度估计。现有的多模态大语言模型通常只处理其中一个方面,将深度图或点云作为额外模型输入,这带来了大量计算开销并继承了上游预测模型的泛化局限性。我们提出GAMSI,一种双路径几何感知多模态大语言模型用于空间智能,仅以RGB图像为输入,同时在统一的自回归骨干网络内内化两种几何先验。具体地,我们引入度量-结构解耦查询,使用两组可学习查询分别从共享视觉上下文中提取密集度量信号和稀疏结构线索,并通过任务解耦注意力掩码防止两条路径相互污染。在此基础上,专家引导视觉定位模块将聚合的线索投影回帧级视觉特征,并与视觉基础模型对齐,这些模型仅作为训练时的监督,而非模型输入。我们进一步构建了一个多任务空间指令微调数据集,包含152,776个样本,涵盖13种任务类型和三种视觉模态,整合自六个公共数据集。通过两阶段课程训练,GAMSI在七个空间智能基准上达到了最先进的性能。

英文摘要

Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large language models (MLLMs) typically address only one facet, ingesting either depth maps or point clouds as additional model inputs, which incurs substantial computational overhead and inherits the generalization limitations of upstream prediction models. We propose GAMSI, a dual-pathway Geometry-Aware MLLM for Spatial Intelligence that takes only RGB images as input while internalizing both forms of geometric prior within a unified autoregressive backbone. Specifically, we introduce Metric-Structure Decoupled Queries (MSDQ), which employ two groups of learnable queries to respectively extract dense metric signals and sparse structural cues from the shared visual context, with a task-decoupled attention mask further preventing the two pathways from contaminating each other. Building on this, an Expert-Guided Visual Grounding (EVG) module projects the aggregated cues back to frame-level visual features and aligns them with vision foundation models, which serve purely as training-time supervision, rather than as model inputs. We further build a multi-task spatial instruction-tuning dataset (MTS) comprising 152,776 samples spanning 13 task types and three visual modalities, consolidated from six public datasets. Trained with a two-stage curriculum, GAMSI achieves state-of-the-art performance on seven spatial intelligence benchmarks.
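A task-decoupled attention mask of the kind mentioned above can be sketched as a boolean matrix in which both query groups read the shared visual tokens but never attend to each other. The token layout (visual tokens first, then metric queries, then structure queries), the function name, and the omission of causal masking are all assumptions made for illustration; the abstract does not give the exact masking scheme.

```python
import numpy as np

def decoupled_mask(n_visual, n_metric, n_struct):
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Assumed layout: [visual tokens | metric queries | structure queries].
    """
    n = n_visual + n_metric + n_struct
    mask = np.zeros((n, n), dtype=bool)
    v = slice(0, n_visual)
    m = slice(n_visual, n_visual + n_metric)
    s = slice(n_visual + n_metric, n)
    mask[:, v] = True   # every token may read the shared visual context
    mask[m, m] = True   # metric queries attend within their own group
    mask[s, s] = True   # structure queries attend within their own group
    return mask

mask = decoupled_mask(n_visual=4, n_metric=2, n_struct=3)
print(mask.astype(int))
```

The two off-diagonal query blocks stay False, so information can only flow between the pathways through the shared visual tokens.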

2605.25333 2026-05-26 cs.CV

Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution

教会视频生成器记忆:为不可见状态演化引出动态记忆

Tianshuo Xu, Yichen Xie, Depu Meng, Chensheng Peng, Quentin Herau, Bo Jiang, Yihan Hu, Wei Zhan

发表机构 * Applied Intuition(应用直觉) University of California, Berkeley(加州大学伯克利分校)

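AI summary: Addresses the tendency of video generation models to freeze state when observation is interrupted. Proposes ReMind, a framework that uses memory-oriented data construction, event-aware training, and cache adaptation to turn the KV-cache mechanism into dynamic memory, achieving the best overall scores on STEVO-Bench and recovery tasks.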

详情
AI中文摘要

视频世界模型应在证据未被观测时维持演化状态,但当前生成器在中断时往往冻结隐藏状态。这不仅仅是容量问题:预训练的视频扩散Transformer已经具备能够进行非局部检索的KV缓存机制,但很少被训练用作动态记忆。我们引入ReMind,一个通过面向记忆的数据、事件感知训练和缓存适配来引出动态记忆行为的框架。围绕100多种动态事件的分类,我们构建了一个带相机标注的训练混合集,结合了VLM过滤的真实视频、生成的硬动态、合成相机循环和记忆中断增强。每个片段被转换为带有保护锚点、退化区间和显式时间间隙的帧图。节点结构化的课程,包括节点丢弃、噪声记忆、前沿延续和参考缓存训练,迫使模型在中断时检索相关的过去状态,而不是仅依赖局部连续性。PM-RoPE,一种优雅的相机相位RoPE扩展,以单注意力成本解锁了时空检索,同时保留了预训练路径。ReMind在STEVO-Bench和恢复任务上取得了最佳总体分数。此外,通用图像到视频评估证实该课程避免了灾难性遗忘。我们将开源代码、数据和模型。

英文摘要

Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework eliciting dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. Organized around a taxonomy of 100+ dynamic events, we build a camera-annotated training mixture combining VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, an elegant camera-phase RoPE extension, unlocks spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.

2605.25328 2026-05-26 cs.CV cs.MM

DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement

DIVA: 利用统一多模态模型中的表示差异实现相互增强

Renjie Lu, Xulong Zhang, Xiaoyang Qu, Shangfei Wang, Jianzong Wang

发表机构 * Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China(平安科技(深圳)有限公司,深圳,中国) University of Science and Technology of China(中国科学技术大学)

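AI summary: Addresses the mutual interference between understanding and generation in unified multimodal models caused by differing supervision signals. Proposes DIVA, which factorizes visual representations into shared and unique components and uses mutual information estimation to obtain internal synergy, improving understanding and generation by 7.82% and 8.46%, respectively.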

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

基于单一架构构建的统一多模态模型(UMMs)在理解和生成任务中均展现出令人印象深刻的表现。我们识别出一个基本挑战,即由不同监督信号引起的归纳偏差:生成分支偏好能够重建的高保真、细粒度表示,而理解分支则偏好对任务无关因素保持不变的语义判别性嵌入。因此,在单一骨干网络中优化这些互补但不等价的目标会导致相互损害而非增强。在本文中,我们首先分析了统一骨干网络中这种干扰的根本原因,并揭示了其内部表示中的互补结构。受此观察启发,我们提出了DIVA,一个自我改进的训练后框架,将表示差异转化为内部协同。通过基于两条互补信息流将视觉表示显式分解为共享和独有成分,DIVA使得理解和生成分支都能实现有益的迁移,同时通过互信息估计保护独有信息免受跨流干扰的完整性。尽管具有通用性,我们的方法在视觉理解(+7.82%)和生成(+8.46%)任务上均取得了一致的改进。官方代码见:https://github.com/Jayyy-H/DIVA。

英文摘要

Unified multimodal models (UMMs) built on a single architecture have shown impressive performance in both understanding and generation. We identify a fundamental challenge arising from the inductive biases induced by distinct supervision signals: the generation branch prefers high-fidelity, fine-grained representations capable of reconstruction, while the understanding branch favours semantically discriminative embeddings that remain invariant to task-irrelevant factors. Consequently, optimizing these complementary but non-equivalent objectives within a monolithic backbone leads to mutual impairment instead of enhancement. In this paper, we first analyze the root cause of this interference in unified backbones and reveal a complementary structure in their internal representations. Motivated by this observation, we propose DIVA, a self-improved post-training framework that transforms the representation divergence into internal synergy. By explicitly factorizing the visual representation into shared and unique components based on two complementary information flows, DIVA enables beneficial transfer between the understanding and generation branches while using mutual information estimation to protect the unique information from cross-flow interference. Despite its generality, our method consistently improves both visual understanding (+7.82%) and generation (+8.46%). The official code is available at: https://github.com/Jayyy-H/DIVA.

2605.25326 2026-05-26 cs.CV

Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation

感知-然后-规划:以布局为策略的单目3D场景布局估计

Junwei Zhou, Yu-Wing Tai

发表机构 * Department of Computer Science(计算机科学系) Dartmouth College(达特茅斯学院)

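AI summary: Proposes the Perceive-then-Plan framework, which uses vision-language models to recast monocular 3D layout estimation as perception followed by iterative planning. Layout-as-Policy (LaP) learns action sequences that progressively refine the scene hypothesis, producing 3D layouts that are more physically consistent and better aligned with observations.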

Comments 21 pages

详情
AI中文摘要

从单张图像构建结构化的3D场景布局需要协调视觉观察与物理和空间约束,这一挑战难以仅通过直接预测来解决。在这项工作中,我们将单目3D布局估计形式化为一个带有视觉语言模型的感知-然后-规划问题,其中感知器首先定位3D对象,然后规划器通过动作迭代优化场景假设,这些动作在保持与输入图像一致性的同时提高物理合理性。我们提出布局为策略(LaP),将规划阶段视为策略学习问题:3D布局表示为结构化状态,并通过离散动作(如平移、旋转和缩放)进行优化。从几何增强感知器的观测对齐初始化开始,LaP规划器被训练生成逐步解决几何不一致性并强制实现现实空间关系的动作序列。为了实现有效学习,我们将监督轨迹初始化与基于偏好的优化相结合,使模型能够在无需显式奖励工程的情况下学习纠正行为。这种公式将布局估计从一次性预测任务转变为迭代优化过程,从而更好地处理全局约束和复杂的对象交互。实验表明,我们的方法生成的布局在物理上更连贯,与视觉观察更一致,同时自然支持场景编辑和操作等下游任务。

英文摘要

Building structured 3D scene layouts from a single image requires reconciling visual observations with physical and spatial constraints, a challenge that is difficult to address with direct prediction alone. In this work, we formulate monocular 3D layout estimation as a perceive-then-plan problem with vision-language models, where a Perceiver first grounds the 3D objects and then a Planner iteratively refines the scene hypothesis through actions that improve physical plausibility while preserving consistency with the input image. We propose Layout-as-Policy (LaP), which casts the planning stage as a policy learning problem: 3D layouts are represented as structured states, and refined via discrete actions such as translation, rotation, and rescaling. Starting from an observation-aligned initialization with the geometry-enhanced Perceiver, the LaP Planner is trained to produce action sequences that progressively resolve geometric inconsistencies and enforce realistic spatial relations. To enable effective learning, we combine supervised trajectory initialization with preference-based optimization, allowing the model to learn corrective behaviors without requiring explicit reward engineering. This formulation transforms layout estimation from a one-shot prediction task into an iterative refinement process, enabling better handling of global constraints and complex object interactions. Experiments demonstrate that our approach produces layouts that are more physically coherent and better aligned with visual observations, while naturally supporting downstream tasks such as scene editing and manipulation.
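The state-and-action formulation can be illustrated with a toy layout state and the three discrete action types named above. The `Box` fields, the `apply` function, and the hand-picked corrective action are hypothetical; in the paper the actions come from a learned planner rather than a fixed rule, and the physical checks are richer than the single floor test used here.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Box:
    """One object in the layout: center (x, y, z), yaw in radians, size (w, d, h)."""
    x: float
    y: float
    z: float
    yaw: float
    w: float
    d: float
    h: float

def apply(box: Box, action) -> Box:
    """Apply one discrete refinement action and return the new state."""
    kind, value = action
    if kind == "translate":
        dx, dy, dz = value
        return replace(box, x=box.x + dx, y=box.y + dy, z=box.z + dz)
    if kind == "rotate":
        return replace(box, yaw=box.yaw + value)
    if kind == "rescale":
        return replace(box, w=box.w * value, d=box.d * value, h=box.h * value)
    raise ValueError(f"unknown action: {kind}")

def floor_penetration(box: Box) -> float:
    """How far the bottom face sinks below the floor plane z = 0."""
    return max(0.0, -(box.z - box.h / 2.0))

chair = Box(x=0.0, y=0.0, z=0.25, yaw=0.0, w=0.5, d=0.5, h=1.0)
before = floor_penetration(chair)
fixed = apply(chair, ("translate", (0.0, 0.0, before)))
after = floor_penetration(fixed)
print(before, after)
```

Treating the layout as an immutable state keeps each refinement step explicit, which also makes action sequences easy to log as trajectories for supervised initialization or preference comparison.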

2605.25313 2026-05-26 cs.LG cs.AI cs.RO stat.ML

UWM-JEPA: Predictive World Models That Imagine in Belief Space

UWM-JEPA:在信念空间中进行想象的世界预测模型

Santosh Kumar Radha, Oktay Goktas

发表机构 * AgentField AI

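AI summary: For partially observed environments, proposes UWM-JEPA, which uses a density-matrix latent and a unitary predictor to preserve the joint-state spectrum in belief space, so that represented uncertainty is retained during blind rollout; it clearly outperforms vector-latent baselines.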

Comments 14 pages, 6 figures, 7 tables. Code and data: https://github.com/santoshkumarradha/uwm-jepa

详情
AI中文摘要

部分可观测环境下的世界模型必须想象多个兼容的隐藏未来,并在反事实动作下引导它们。联合嵌入预测架构(JEPAs)在潜在空间中实现这一点,但向量值潜变量没有内部结构来承载盲推演过程中隐藏连续性的信念。我们引入了酉世界模型JEPA(UWM-JEPA),这是一种JEPA世界模型,具有在联合系统-环境空间上的密度矩阵潜变量和学习的酉预测器。该结构在推演过程中精确保持联合状态谱,因此预测器本身不会耗散表示的不确定性。在一个需要根据给定动作序列进行五步前向模拟且目标观测被掩蔽的隐藏速度指示任务中,UWM-JEPA达到0.77的准确率,并且随着动作被扰动而单调下降;而参数匹配的LSTM-JEPA在相同的反事实目标目标和动作头训练下,在所有动作条件下都崩溃为多数类准确率(0.53)。在盲推演下,UWM-JEPA在短时域上损失不到十个点的探针R^2,而向量潜变量基线损失四十一个和六十八个点;两者在保留的上下文探针上表现相当,表明差异在于预测器而非编码器。动作敏感性本身需要针对反事实而非教师强制目标进行训练,这一发现适用于酉参数化之外。对于JEPA世界模型在部分可观测性下进行想象,潜变量几何和预测器动力学至关重要,而不仅仅是冻结的上下文编码能力。

英文摘要

World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactual actions. Joint Embedding Predictive Architectures (JEPAs) do this in latent space, but a vector-valued latent has no internal structure for carrying the belief over hidden continuations through blind rollout. We introduce the Unitary World Model JEPA (UWM-JEPA), a JEPA world model with a density-matrix latent on a joint system-environment space and a learned unitary predictor. The construction preserves the joint-state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. On a hidden-velocity indicator task requiring five-step forward simulation under a given action sequence with the target observation masked, UWM-JEPA reaches 0.77 accuracy and degrades monotonically as actions are perturbed; a parameter-matched LSTM-JEPA trained under the same counterfactual-target objective and action head collapses to majority-class accuracy (0.53) under every action condition. Under blind rollout, UWM-JEPA loses fewer than ten points of probe R^2 at short horizons while vector-latent baselines lose forty-one and sixty-eight; both nevertheless tie on a held-out context probe, locating the separation in the predictor rather than the encoder. Action sensitivity itself requires training against counterfactual rather than teacher-forced targets, a finding that applies beyond the unitary parameterisation. For JEPA world models to imagine under partial observability, latent geometry and predictor dynamics matter, not frozen context-encoding capacity alone.
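The claim that a unitary predictor cannot dissipate represented uncertainty follows from a standard linear-algebra fact: conjugating a density matrix by a unitary, rho -> U rho U^†, leaves its eigenvalues unchanged. The sketch below checks this numerically with a random unitary; the dimensions, the random construction, and the helper names are illustrative and unrelated to the paper's learned predictor.

```python
import numpy as np

def random_unitary(n, rng):
    """Unitary matrix from the QR decomposition of a complex Gaussian matrix."""
    z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    q, _ = np.linalg.qr(z)
    return q

def rollout(rho, U, steps):
    """Blind rollout: apply rho -> U rho U^dagger repeatedly, with no observations."""
    for _ in range(steps):
        rho = U @ rho @ U.conj().T
    return rho

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.2, 0.0])      # belief over four hidden hypotheses
rho0 = np.diag(probs).astype(complex)       # initial density matrix
U = random_unitary(4, rng)

rho5 = rollout(rho0, U, steps=5)
spectrum = np.sort(np.linalg.eigvalsh(rho5))
print(spectrum, np.trace(rho5).real)
```

Because the spectrum is invariant, so is any quantity computed from it, such as the von Neumann entropy, whereas a generic vector-latent predictor has no comparable constraint.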

2605.25310 2026-05-26 cs.CL

Tool-Call Dependency Structure is Linearly Decodable in LLM Agent Residual Streams

工具调用依赖结构在LLM智能体残差流中是线性可解码的

Tianda Sun, Dimitar Kazakov

Affiliations: Department of Computer Science, University of York, Heslington, York

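AI summary: Uses a low-capacity edge probe to decode the tool-call dependency graph from the residual stream of Qwen3-32B, finding that the representation tracks abstract topology rather than identifier values and that the result replicates across models and tasks.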

Comments 16 pages, 7 figures

详情
AI中文摘要

使用工具的LLM智能体产生的轨迹中,调用形成有向依赖图:早期工具输出为后续调用提供参数。这种执行结构是否在模型内部表示尚不清楚;先前的结构探针针对静态代码或思维链文本,而非智能体的运行时调用图。在Qwen3-32B残差流上的低容量边探针解码工具调用依赖图,显著高于Hewitt-Liang随机标签控制和位置基线。反事实对比(值破坏与结构扰动)表明信号追踪抽象拓扑而非标识符值,并在独立的非子串预言机下可复制。非位置成分在另外三个交互式多跳基准上可复制,并在调用顺序本身成为依赖的充分代理时衰减,在单次规划中消失。逐层激活修补在后续非修补边界移动探针,表明表示传播而非被动读出,尽管实际工具调用未移动。据我们所知,这是首个对LLM智能体运行时工具调用依赖图的结构探针。我们的主张涉及表示而非行为控制,涵盖两个模型系列和一个主要领域。

英文摘要

Tool-using LLM agents produce trajectories whose calls form a directed dependency graph: earlier tool outputs supply arguments to later calls. Whether this execution structure is represented inside the model is unknown; prior structural probes have targeted static code or chain-of-thought text, not an agent's run-time call graph. A low-capacity edge probe on the residual stream of Qwen3-32B decodes the tool-call dependency graph well above both a Hewitt-Liang random-label control and a positional baseline. A counterfactual contrast between value corruption and structural perturbation indicates the signal tracks abstract topology rather than identifier values, and replicates under an independent, non-substring oracle. The non-positional component replicates on three further interactive multi-hop benchmarks and attenuates as call order alone becomes a sufficient proxy for dependency, vanishing in single-shot planning. Per-layer activation patching shifts the probe at a later, non-patched boundary, evidence that the representation propagates rather than passively reads out, though the realised tool call does not move. To our knowledge this is the first structural probe of an LLM agent's runtime tool-call dependency graph. Our claims concern representation, not behavioural control, and span two model families and one primary domain.

2605.25308 2026-05-26 cs.CV

Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

Xiaoyang Lyu, Muxin Liu, Xiaoshan Wu, Ruicheng Wang, Yi-Hua Huang, Yang-Tian Sun, Shaoshuai Shi, Xiaojuan Qi

Affiliations The University of Hong Kong; USTC; Voyager Research, Didi Chuxing

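AI summary To address the temporal inconsistency of monocular geometry models on streaming RGB input, which shows up mainly as scale-shift drift, the paper proposes DyFN, a lightweight causal recurrent module that dynamically modulates feature statistics for stable geometry estimation and reaches state-of-the-art temporal stability while fine-tuning only 2% of parameters.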

Comments 16 pages, 9 figures, project page: https://shawlyu.github.io/DyFN

Abstract

Consistent 3D geometry estimation from streaming RGB input is crucial for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction. While modern monocular geometry foundation models achieve strong single-image accuracy, they exhibit severe temporal inconsistency on continuous input, notably dominated by scale-shift drifting. Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth's scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. We adapt powerful pretrained monocular geometry models for streaming by finetuning only DyFN, a mere 2% additional parameters, while keeping the backbone frozen, thereby achieving temporal consistency without compromising single-image accuracy. Extensive experiments across four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter, and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14% and even outperforming heavier non-causal video baselines. Project Page: https://shawlyu.github.io/DyFN
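
The idea of stabilizing per-frame feature statistics causally can be illustrated with a toy sketch. This is not the DyFN module: the abstract describes a learned recurrent module, whereas the code below substitutes a plain exponential moving average, and the class name `CausalRunningNorm`, the `momentum` value, and the synthetic frames are all assumptions made for illustration only.

```python
import statistics


class CausalRunningNorm:
    """Toy stand-in (assumption): EMA over per-frame mean/std, past and current frames only."""

    def __init__(self, momentum=0.9, eps=1e-6):
        self.momentum = momentum
        self.eps = eps
        self.mean = None  # smoothed feature mean
        self.std = None   # smoothed feature standard deviation

    def __call__(self, feats):
        m = statistics.fmean(feats)
        s = statistics.pstdev(feats)
        if self.mean is None:
            # First frame: initialise the running statistics.
            self.mean, self.std = m, s
        else:
            a = self.momentum
            self.mean = a * self.mean + (1 - a) * m
            self.std = a * self.std + (1 - a) * s
        # Standardise the current frame, then impose the smoothed statistics.
        return [(x - m) / (s + self.eps) * self.std + self.mean for x in feats]


# Synthetic stream: the same features with a drifting gain and bias per frame.
base = [0.0, 1.0, 2.0, 3.0]
gains = [1.0, 1.3, 0.8, 1.2]
biases = [0.0, 0.5, -0.4, 0.3]
frames = [[g * x + b for x in base] for g, b in zip(gains, biases)]

norm = CausalRunningNorm()
stable = [norm(f) for f in frames]

raw_means = [statistics.fmean(f) for f in frames]
stable_means = [statistics.fmean(f) for f in stable]
raw_spread = max(raw_means) - min(raw_means)
stable_spread = max(stable_means) - min(stable_means)
```

In this toy stream the frame-to-frame spread of the feature mean shrinks after normalization (`stable_spread` is smaller than `raw_spread`), which is the kind of drift reduction the abstract attributes to modulating feature statistics.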

2605.25307 2026-05-26 cs.CV

Recursive Class Connectivity Classification (R3C) Applied to Binary Image Segmentation for Improved Infant Fingerprint Enhancement

Joao Leonardo Harres Dall Agnol, Luiz Fernando Puttow Southier, Jefferson Tales Oliva, Marcelo Teixeira, Rodrigo Mineto, Marcelo Filipa, Dalcimar Casanova, Erick Oliveira Rodrigues

Affiliations Infant.ID Ltda; Graduate Program in Production and Systems Engineering (PPGEPS), Federal University of Technology-Paraná (UTFPR)

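AI summary Proposes the Recursive Class Connectivity Classification (R3C) framework, which iteratively extends ridge structures to refine the binary segmentation output of existing enhancement methods and raises infant fingerprint recognition rates without requiring training data.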

Journal ref IEEE Access 2025

Abstract

Image enhancement plays a crucial role in infant fingerprint matching, as child-specific characteristics such as smaller finger dimensions and thinner ridge structures often degrade image quality during acquisition. To address these limitations, enrollment typically depends on specialized high-resolution scanners, which most existing enhancement methods are not designed to support. Consequently, identification rates for children remain significantly lower than those achieved with adult fingerprints. This study introduces Recursive Class Connectivity Classification (R3C), a novel framework that iteratively refines binary segmentation outputs from existing enhancement methods by extending ridge structures. R3C does not require modifications to the underlying classifier and operates without training data, which is not currently available for infant fingerprints. Instead, the method improves segmentation by repeatedly feeding the classified image back into the classification process, while combining each intermediate segmentation with the original input image. Experiments conducted on three fingerprint datasets using four different enhancement classifiers show that R3C can increase the True Acceptance Rate (TAR) by up to 4% for children and over 40% for newborns, compared to using the enhancement methods alone. A qualitative analysis further demonstrates that R3C reconnects fragmented ridge patterns, improving the visual quality of segmentation. Because it functions independently of the enhancement method used, R3C provides a flexible and broadly applicable solution for improving binary segmentation.
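
The feedback loop described in the abstract (classify, combine the result with the original image, classify again) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the threshold classifier, the fusion rule in `combine` (boosting pixels that touch an already-classified ridge pixel), and the `boost` value are all hypothetical choices, since the abstract does not specify how the intermediate segmentation and the input are combined.

```python
def threshold_classifier(image, thr=0.5):
    """Hypothetical stand-in for an enhancement classifier: 1 = ridge, 0 = background."""
    return [[1 if px >= thr else 0 for px in row] for row in image]


def combine(image, mask, boost=0.2):
    """Assumed fusion rule: raise original pixels that touch a ridge pixel (4-neighbourhood)."""
    h, w = len(image), len(image[0])
    fused = []
    for r in range(h):
        row = []
        for c in range(w):
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            touches_ridge = any(
                0 <= rr < h and 0 <= cc < w and mask[rr][cc] == 1
                for rr, cc in neighbours
            )
            row.append(min(1.0, image[r][c] + (boost if touches_ridge else 0.0)))
        fused.append(row)
    return fused


def refine(image, classify=threshold_classifier, max_iters=10):
    """Feed the segmentation back into the unchanged classifier until it stops changing."""
    mask = classify(image)
    for _ in range(max_iters):
        new_mask = classify(combine(image, mask))
        if new_mask == mask:
            break
        mask = new_mask
    return mask


# A faint ridge (0.4) trailing off from one confident pixel (0.9).
image = [[0.9, 0.4, 0.4, 0.4, 0.1]]
single_pass = threshold_classifier(image)  # [[1, 0, 0, 0, 0]]
refined = refine(image)                    # [[1, 1, 1, 1, 0]]
```

The classifier itself is never modified; only its input changes between passes, which mirrors the abstract's claim that the approach works independently of the underlying enhancement method.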

2605.25305 2026-05-26 cs.LG

Electricity Consumption Forecasting: An Approach Using Cooperative Ensemble Learning with SHapley Additive exPlanations

Eduardo Luiz Alba, Gilson Adamczuk Oliveira, Matheus Henrique Dal Molin Ribeiro, Érick Oliveira Rodrigues

Affiliations Industrial & Systems Engineering Graduate Program (PPGEPS), Federal University of Technology-Paraná (UTFPR)

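AI summary Proposes a cooperative ensemble learning approach named Weaker Separator Booster (WSB) that combines LSTM, RF, SVR, and XGBoost models, uses SHAP for feature selection and a genetic algorithm and particle swarm optimization for hyperparameter tuning, and forecasts electricity consumption 12 months ahead for two campuses of a Brazilian federal institute with low error.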

Journal ref Forecasting 2024

Abstract

Electricity expense management presents significant challenges, as this resource is susceptible to various influencing factors. In universities, the demand for this resource is rapidly growing with institutional expansion and has a significant environmental impact. In this study, the machine learning models long short-term memory (LSTM), random forest (RF), support vector regression (SVR), and extreme gradient boosting (XGBoost) were trained with historical consumption data from the Federal Institute of Paraná (IFPR) over the last seven years and climatic variables to forecast electricity consumption 12 months ahead. Datasets from two campuses were adopted. To improve model performance, feature selection was performed using Shapley additive explanations (SHAP), and hyperparameter optimization was carried out using genetic algorithm (GA) and particle swarm optimization (PSO). The results indicate that the proposed cooperative ensemble learning approach named Weaker Separator Booster (WSB) exhibited the best performance on both datasets. Specifically, it achieved an sMAPE of 13.90% and MAE of 1990.87 kWh for the IFPR-Palmas Campus and an sMAPE of 18.72% and MAE of 465.02 kWh for the Coronel Vivida Campus. The SHAP analysis revealed distinct feature importance patterns across the two IFPR campuses. A commonality that emerged was the strong influence of lagged time-series values and a minimal influence of climatic variables.
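
For readers unfamiliar with the two error metrics quoted above, here is a short sketch of how MAE and sMAPE are commonly computed. Several sMAPE variants exist and the abstract does not say which one the paper uses, so the definition below (mean of |a - f| divided by the average of |a| and |f|, times 100) is an assumption; the sample numbers are made up and are not the paper's data.

```python
def mae(actual, forecast):
    """Mean absolute error, in the same unit as the data (e.g. kWh)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    """Symmetric MAPE in percent (assumed variant: denominator is (|a| + |f|) / 2)."""
    terms = []
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        # Treat a 0/0 term as zero error rather than dividing by zero.
        terms.append(0.0 if denom == 0 else abs(a - f) / denom)
    return 100 * sum(terms) / len(terms)


# Made-up monthly consumption values in kWh, for illustration only.
actual = [100.0, 200.0]
forecast = [110.0, 180.0]

example_mae = mae(actual, forecast)      # 15.0 kWh
example_smape = smape(actual, forecast)  # about 10.03 %
```

Because MAE is scale-dependent while sMAPE is relative, a campus with lower consumption can show a smaller MAE together with a larger sMAPE, as in the figures reported for the two campuses.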