arXivDaily: arXiv daily academic digest, updated Monday to Friday
2601.06997 2026-06-10 cs.RO cs.CV Version update

ObjSplat: Geometry-Aware Gaussian Surfels for Active Object Reconstruction

Yuetao Li, Zhizhou Jia, Yu Zhang, Qun Hao, Shaohui Zhang

Affiliations * School of Optics and Photonics, Beijing Institute of Technology; School of Optoelectronic Engineering, Changchun University of Science and Technology

AI summary: Proposes ObjSplat, a framework that uses Gaussian surfels as a unified representation and combines geometry-aware viewpoint evaluation with a next-best-path planner to achieve efficient, high-fidelity active object reconstruction.

Details
Comments
Accepted to IEEE T-ASE. Code: https://github.com/Li-Yuetao/ObjSplat , Project Page: https://li-yuetao.github.io/ObjSplat-page/
Abstract

Autonomous high-fidelity object reconstruction is fundamental for creating digital assets and bridging the simulation-to-reality gap in robotics. We present ObjSplat, an active reconstruction framework that leverages Gaussian surfels as a unified representation to progressively reconstruct unknown objects with both photorealistic appearance and accurate geometry. Addressing the limitations of conventional opacity or depth-based cues, we introduce a geometry-aware viewpoint evaluation pipeline that explicitly models back-face visibility and occlusion-aware multi-view covisibility, reliably identifying under-reconstructed regions even on geometrically complex objects. Furthermore, to overcome the limitations of greedy planning strategies, ObjSplat employs a next-best-path (NBP) planner that performs multi-step lookahead on a dynamically constructed spatial graph. By jointly optimizing information gain and movement cost, this planner generates globally efficient trajectories. Extensive experiments in simulation and on real-world cultural artifacts demonstrate that ObjSplat produces physically consistent models within minutes, achieving superior reconstruction fidelity and surface completeness while significantly reducing scan time and path length compared to state-of-the-art approaches. Project page: https://li-yuetao.github.io/ObjSplat-page/ .
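
The contrast between greedy selection and multi-step lookahead can be illustrated with a toy sketch. This is not ObjSplat's planner: the viewpoint graph, the per-view gains, the trade-off weight `lam`, and the function name `best_path` are all hypothetical, and the search is a plain exhaustive depth-limited traversal.

```python
def best_path(graph, gain, start, horizon, lam=1.0):
    """Depth-limited search for the path maximising sum(gain) - lam * travel cost."""
    best = (float("-inf"), [start])

    def dfs(node, path, score, depth):
        nonlocal best
        if len(path) > 1 and score > best[0]:
            best = (score, list(path))
        if depth == horizon:
            return
        for nxt, cost in graph[node].items():
            if nxt in path:  # visit each candidate viewpoint at most once
                continue
            path.append(nxt)
            dfs(nxt, path, score + gain[nxt] - lam * cost, depth + 1)
            path.pop()

    dfs(start, [start], 0.0, 0)
    return best

# Hypothetical viewpoint graph: edge weights are movement costs.
GRAPH = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 4.0, "B": 1.0},
}
GAIN = {"A": 0.0, "B": 1.5, "C": 5.0}  # assumed information gain per viewpoint

greedy = best_path(GRAPH, GAIN, "A", horizon=1)     # one-step choice
lookahead = best_path(GRAPH, GAIN, "A", horizon=2)  # two-step lookahead
```

With these numbers the one-step choice jumps straight to the high-gain but distant view C, while the two-step search reaches C through B at a lower total cost.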

2512.17629 2026-06-10 cs.LG cs.AI Version update

SCOPE: Sequential Causal Optimization of Process Interventions

Jakob De Moor, Hans Weytjens, Johannes De Smedt, Jochen De Weerdt

Affiliations * Research Centre for Information Systems Engineering (LIRIS), KU Leuven, Leuven, Belgium; School of Computation, Information and Technology, Technical University of Munich (TUM), Munich, Germany

AI summary: Proposes SCOPE, which uses backward induction and causal learners to work directly from observational data and optimize the KPI of sequential interventions in business processes, outperforming existing methods.

Details
Abstract

Prescriptive Process Monitoring (PresPM) recommends interventions during running business processes to optimize key performance indicators (KPIs). In realistic settings, interventions are rarely isolated: organizations need to align sequences of interventions to jointly steer the outcome of a case. Existing PresPM approaches only partially address this challenge. Many focus on a single intervention decision, while others treat multiple interventions independently, ignoring how they interact over time. Methods that do address these dependencies depend either on simulation or data augmentation to approximate the process to train a Reinforcement Learning (RL) agent, which may create a reality gap and introduce bias. We introduce SCOPE (Sequential Causal Optimization of Process Interventions), a PresPM approach that learns aligned sequential intervention recommendations. SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. By leveraging causal learners, our method can utilize observational data directly, unlike methods that require constructing process approximations for RL. Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing the KPI. The novel semi-synthetic setup, based on a real-life event log, is provided as a reusable benchmark for future work on sequential PresPM.
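
Backward induction itself can be sketched on a toy two-decision process. Everything here is an assumption for illustration: the `OUTCOME` table stands in for the estimates a causal learner would provide, the action names are made up, and the process is deterministic, which SCOPE does not require.

```python
from itertools import product

def backward_induction(outcome, actions, n_steps):
    """Propagate the best achievable KPI from the last decision point to the first."""
    value = dict(outcome)  # complete action histories map to estimated KPIs
    policy = {}
    for t in range(n_steps - 1, -1, -1):          # final decision point first
        for history in product(actions, repeat=t):
            best = max(actions, key=lambda a: value[history + (a,)])
            policy[history] = best
            value[history] = value[history + (best,)]
    return value[()], policy

ACTIONS = ("wait", "call")  # hypothetical interventions
OUTCOME = {                 # hypothetical estimated KPI per action sequence
    ("wait", "wait"): 0.2,
    ("wait", "call"): 0.5,
    ("call", "wait"): 0.6,
    ("call", "call"): 0.4,
}
best_kpi, policy = backward_induction(OUTCOME, ACTIONS, n_steps=2)
```

The resulting policy depends on the history: after "wait" the second decision is "call", but after "call" it is "wait", which is the kind of interaction between interventions that independent per-step decisions would miss.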

2601.04776 2026-06-10 cs.CV Version update

Segmentation-Driven Monocular Shape from Polarization based on Physical Model

Jinyu Zhang, Xu Ma, Weili Chen

Affiliations * Key Laboratory of Photoelectronic Imaging Technology and System of Ministry of Education of China, School of Optics and Photonics, Beijing Institute of Technology; National Key Laboratory of Scattering and Radiation, Beijing Institute of Environmental Features

AI summary: Proposes a segmentation-driven monocular shape-from-polarization framework that segments convex sub-regions with polarization-aided adaptive region growing and adds a multi-scale fusion convexity prior constraint, resolving azimuth ambiguity and improving reconstruction accuracy and geometric fidelity.

Details
Comments
23 pages, 10 figures, submitted to Elsevier Pattern Recognition
Abstract

Monocular shape-from-polarization (SfP) leverages the intrinsic relationship between light polarization properties and surface geometry to recover surface normals from single-view polarized images, providing a compact and robust approach for three-dimensional (3D) reconstruction. Despite its potential, existing monocular SfP methods suffer from azimuth angle ambiguity, an inherent limitation of polarization analysis, that severely compromises reconstruction accuracy and stability. This paper introduces a novel segmentation-driven monocular SfP (SMSfP) framework that reformulates global shape recovery into a set of local reconstructions over adaptively segmented convex sub-regions. Specifically, a polarization-aided adaptive region growing (PARG) segmentation strategy is proposed to decompose the global convexity assumption into locally convex regions, effectively suppressing azimuth ambiguities and preserving surface continuity. Furthermore, a multi-scale fusion convexity prior (MFCP) constraint is developed to ensure local surface consistency and enhance the recovery of fine textural and structural details. Extensive experiments on both synthetic and real-world datasets validate the proposed approach, showing significant improvements in disambiguation accuracy and geometric fidelity compared with existing physics-based monocular SfP techniques.
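
The azimuth ambiguity mentioned above follows from the standard sinusoidal model of intensity behind a rotating linear polarizer, which depends on the phase angle only through `cos(2 * (theta - phi))`. The sketch below uses that textbook model with made-up intensity values; the abstract does not give the paper's exact formulation, so treat the function and its parameters as assumptions.

```python
import numpy as np

def polarizer_intensity(theta, i_max, i_min, phi):
    """Intensity observed through a linear polarizer at angle theta (radians)."""
    mean = 0.5 * (i_max + i_min)
    amplitude = 0.5 * (i_max - i_min)
    return mean + amplitude * np.cos(2.0 * (theta - phi))

theta = np.linspace(0.0, np.pi, 8, endpoint=False)  # polarizer angles
phi = 0.4                                           # assumed phase angle

# phi and phi + pi produce the same measurements, so they cannot be told apart.
ambiguous = np.allclose(
    polarizer_intensity(theta, 1.0, 0.2, phi),
    polarizer_intensity(theta, 1.0, 0.2, phi + np.pi),
)
```

Because both candidates fit the measurements equally well, an extra cue such as a convexity prior is needed to choose between them.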

2510.04514 2026-06-10 cs.AI cs.CE cs.CL cs.CV stat.ME Version update

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso

Affiliations * J.P. Morgan AI Research

AI summary: Proposes ChartAgent, a framework that iteratively decomposes queries into visual subtasks and reasons in the chart's spatial domain with chart-specific vision tools (e.g., drawing annotations, cropping regions), achieving state-of-the-art performance on ChartBench and ChartX, with especially large gains on unannotated charts.

Details
Comments
Accepted at ACL 2026 (Main Conference). Also presented as an oral paper at the NeurIPS 2025 Multimodal Algorithmic Reasoning Workshop (https://marworkshop.github.io/neurips25/)
Abstract

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, that is, those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.

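2601.03093 2026-06-10 cs.LG cs.CL Version update

ATLAS: Verifier-Guided Adaptive Latent Activation Steering for Efficient LLM Reasoning

Tuc Nguyen, Thai Le

Affiliations * Indiana University Bloomington

AI summary: Proposes ATLAS, a framework in which a lightweight verifier dynamically adjusts the latent steering policy at inference time for per-step adaptive control, improving accuracy on mathematical and coding reasoning tasks while reducing test-time token usage.

Details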
Comments
21 pages, 6 figures
Abstract

Recent work on activation and latent steering has demonstrated that modifying internal representations can effectively guide large language models (LLMs) toward improved reasoning and efficiency without updating model parameters. However, most existing approaches rely on fixed steering policies and static intervention strengths, which limit their robustness across problem instances and often result in over- or under-steering. We propose Adaptive Test-time Latent Steering (ATLAS), a lightweight framework that dynamically controls steering decisions at inference time using a trained, lightweight verifier over the latent states. Given intermediate hidden states, the verifier predicts the quality of ongoing reasoning and adaptively selects which steering action to apply, enabling per-example and per-step adjustment with minimal overhead. ATLAS provides a unified framework for combining learned latent verification with test-time activation steering, enabling adaptive reasoning control without additional LLM decoding or inference-time process reward model calls. Experiments on multiple mathematical and coding reasoning benchmarks show that ATLAS consistently outperforms both vanilla decoding and fixed steering baselines, achieving higher accuracy while substantially reducing test-time token usage. These results demonstrate that verifier-guided latent adaptation provides an effective and scalable mechanism for controlling reasoning efficiency without sacrificing solution quality. All source code will be publicly available.
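
A minimal sketch of verifier-gated steering is given below. It is not the ATLAS implementation: the linear verifier, the set of candidate strengths, and the rule "lower predicted quality means stronger steering" are assumptions chosen only to show how a per-step decision can replace a fixed intervention strength.

```python
import numpy as np

def steer(hidden, direction, verifier_w, strengths=(0.0, 0.5, 1.0)):
    """Add a steering vector whose strength is picked from a verifier score."""
    # Hypothetical linear verifier: predicted reasoning quality in (0, 1).
    score = 1.0 / (1.0 + np.exp(-float(hidden @ verifier_w)))
    # Assumed rule: the lower the predicted quality, the stronger the intervention.
    idx = min(int((1.0 - score) * len(strengths)), len(strengths) - 1)
    alpha = strengths[idx]
    return hidden + alpha * direction, alpha

direction = np.array([0.0, 1.0])   # hypothetical steering direction
verifier_w = np.array([3.0, 0.0])  # hypothetical verifier weights

_, alpha_good = steer(np.array([2.0, 0.0]), direction, verifier_w)
steered, alpha_bad = steer(np.array([-2.0, 0.0]), direction, verifier_w)
```

A hidden state the verifier rates highly is left untouched, while a poorly rated one receives the full steering vector, so the intervention strength varies per step instead of being fixed in advance.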

2510.14836 2026-06-10 cs.CV cs.RO Version update

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao

Affiliations * School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Beijing Zhongke Huiling Robot Technology Co.

AI summary: Proposes QDepth-VLA, a framework that strengthens the spatial perception and reasoning of VLA models through an auxiliary depth prediction task, improving manipulation performance in simulation and real-world tasks.

Details
Abstract

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
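
The quantized latent tokens that serve as prediction targets come from a vector-quantization step, which can be sketched as a nearest-codebook lookup. The codebook and latent values below are made up, and the paper's VQ-VAE architecture and training losses are not reproduced here.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    # latents: (N, D), codebook: (K, D) -> squared distances of shape (N, K)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])    # hypothetical codes
latents = np.array([[0.1, -0.1], [3.5, 4.2], [0.9, 1.2]])    # hypothetical encoder outputs
tokens = quantize(latents, codebook)
```

The resulting integer tokens are discrete targets, so an auxiliary head can be trained against them with an ordinary classification loss.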

2512.18531 2026-06-10 physics.chem-ph cs.LG Version update

Pushing the limits of one-dimensional NMR spectroscopy for automated structure elucidation using artificial intelligence

Frank Hu, Jonathan M. Tubb, Dimitris Argyropoulos, Sergey Golotvin, Mikhail Elyashberg, Grant M. Rotskoff, Matthew W. Kanan, Thomas E. Markland

Affiliations * Department of Chemistry, Stanford University; ACD/Labs

AI summary: Proposes a transformer-based deep learning framework that, using only one-dimensional $^1$H and $^{13}$C NMR spectra, finds the correct structure within the first 15 predictions with 60.4% accuracy for organic molecules with up to 40 non-hydrogen atoms, overcoming the combinatorial explosion of chemical space.

Details
Abstract

One-dimensional NMR spectroscopy is one of the most widely used techniques for the characterization of organic compounds and natural products. For molecules with up to 36 non-hydrogen atoms, the number of possible structures has been estimated to range from $10^{20}$ to $10^{60}$. The task of determining the structure (formula and connectivity) of a molecule of this size using only its one-dimensional $^1$H and/or $^{13}$C NMR spectrum, i.e., de novo structure generation, thus appears completely intractable. Here we show how it is possible to achieve this task for systems with up to 40 non-hydrogen atoms across the full elemental coverage typically encountered in organic chemistry (C, N, O, H, P, S, Si, B, and the halogens) using a deep learning framework, thus covering a vast portion of the drug-like chemical space. Leveraging insights from natural language processing, we show that our transformer-based architecture predicts the correct molecule with 60.4% accuracy within the first 15 predictions using only the $^1$H and $^{13}$C NMR spectra, thus overcoming the combinatorial growth of the chemical space while also being extensible to experimental data via fine-tuning.
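
The "correct molecule within the first 15 predictions" figure is a top-k accuracy. A minimal sketch of that metric follows, assuming each prediction and target is a string that has already been canonicalized so that exact string equality means the same structure; the toy strings and the function name are illustrative only.

```python
def top_k_accuracy(ranked_predictions, targets, k=15):
    """Fraction of targets that appear among the first k ranked predictions."""
    hits = sum(
        target in predictions[:k]
        for predictions, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)

# Hypothetical ranked candidate lists (best first) and ground-truth structures.
ranked = [["CCO", "CO"], ["C", "CC"], ["O"]]
truth = ["CO", "CCC", "O"]

acc_top1 = top_k_accuracy(ranked, truth, k=1)
acc_top2 = top_k_accuracy(ranked, truth, k=2)
```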

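2512.14617 2026-06-10 cs.LG cs.AI Version update

Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes

Alessandro Trapasso, Luca Iocchi, Fabio Patrizi

Affiliations * Fondazione Bruno Kessler; Sapienza University of Rome

AI summary: Proposes QR-MAX, an algorithm that uses reward machines to factorize Markovian transition learning from non-Markovian reward handling, obtaining for the first time PAC convergence to ε-optimal policies with polynomial sample complexity in discrete NMRDPs, and extends it to continuous state spaces.

Details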
Comments
Accepted at IJCAI-ECAI 2026. 19 pages, 32 figures, includes appendix
Abstract

Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.
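
SimHash-style discretization can be sketched with random hyperplanes: each bit records on which side of a hyperplane the continuous state falls, and the bit pattern is the bucket id. This is the generic random-projection scheme; whether Bucket-QR-MAX uses exactly this construction is an assumption, as are the dimensions and the seed.

```python
import numpy as np

def make_simhash(dim, n_bits, seed=0):
    """Return a function mapping a continuous state to an integer bucket id."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, dim))  # one random hyperplane per bit

    def bucket(state):
        bits = planes @ np.asarray(state, dtype=float) >= 0.0
        return int("".join("1" if b else "0" for b in bits), 2)

    return bucket

bucket = make_simhash(dim=3, n_bits=8)
state = [0.3, -1.2, 0.7]             # hypothetical continuous state
b1 = bucket(state)
b2 = bucket([0.6, -2.4, 1.4])        # same direction, twice the magnitude
```

Because only the signs of the projections matter, the number of buckets is fixed by `n_bits` and no grid has to be designed by hand; a tabular learner can then index its statistics by bucket id.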

2512.14614 2026-06-10 cs.CV cs.GR Version update

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo

Affiliations * University of Science and Technology of China

AI summary: Proposes WorldPlay, a streaming video diffusion model that uses a dual action representation, reconstituted context memory, and context forcing distillation to achieve real-time interactive world modeling with long-term geometric consistency, generating long 720p video at 24 FPS.

Details
Comments
project page: https://3d-models.hunyuan.tencent.com/world/, demo: https://3d.hunyuan.tencent.com/sceneTo3D, code: https://github.com/Tencent-Hunyuan/HY-WorldPlay
Abstract

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key ingredients. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.

2512.11995 2026-06-10 cs.CV cs.AI cs.LG Version update

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Chenrui Fan, Yijun Liang, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou

Affiliations * University of Maryland, College Park; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes the V-REX benchmark, which uses a Chain-of-Questions to decompose multi-step exploratory reasoning into planning and following abilities and evaluates vision-language models on complex open-ended tasks.

Details
Comments
28 pages
Abstract

While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification, like an AI detective, but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, "Visual Reasoning with multi-step EXploration (V-REX)", which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capabilities into (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-source VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.

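2512.08180 2026-06-10 cs.CV Version update

GeoLoom: High-quality Geometric Diagram Generation from Textual Input

Xiaojing Wei, Ting Zhang, Wei He, Jingdong Wang, Hua Huang

AI summary: Proposes GeoLoom, a framework that converts natural-language geometric descriptions into high-quality diagrams through an autoformalization module and a coordinate solver, and introduces a constraint-based evaluation metric, significantly outperforming existing methods.

Details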
Abstract

High-quality geometric diagram generation presents both a challenge and an opportunity: it demands strict spatial accuracy while offering well-defined constraints to guide generation. Inspired by recent advances in geometry problem solving that employ formal languages and symbolic solvers for enhanced correctness and interpretability, we propose GeoLoom, a novel framework for text-to-diagram generation in geometric domains. GeoLoom comprises two core components: an autoformalization module that translates natural language into a specifically designed generation-oriented formal language GeoLingua, and a coordinate solver that maps formal constraints to precise coordinates using the efficient Monte Carlo optimization. To support this framework, we introduce GeoNF, a dataset aligning natural language geometric descriptions with formal GeoLingua descriptions. We further propose a constraint-based evaluation metric that quantifies structural deviation, offering mathematically grounded supervision for iterative refinement. Empirical results demonstrate that GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity, providing a principled foundation for interpretable and scalable diagram generation.
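
Mapping formal constraints to coordinates by Monte Carlo optimization can be sketched as a random search that keeps only proposals reducing the total constraint violation. The constraints below are plain Python callables rather than GeoLingua statements, and the proposal schedule, iteration count, and function name are assumptions, not GeoLoom's solver.

```python
import math
import random

def monte_carlo_solve(constraints, init, iters=20000, step=1.0, seed=0):
    """Random-search minimisation of the summed squared constraint residuals."""
    rng = random.Random(seed)

    def loss(point):
        return sum(c(point) ** 2 for c in constraints)

    best, best_loss = list(init), loss(init)
    for i in range(iters):
        scale = step * (1 - i / iters) + 1e-3    # shrink proposals over time
        cand = [x + rng.gauss(0.0, scale) for x in best]
        cand_loss = loss(cand)
        if cand_loss < best_loss:                # keep only improvements
            best, best_loss = cand, cand_loss
    return best, best_loss

# Hypothetical task: place point C so that |AC| = 3 and |BC| = 4.
A, B = (0.0, 0.0), (5.0, 0.0)
constraints = [
    lambda p: math.dist(p, A) - 3.0,
    lambda p: math.dist(p, B) - 4.0,
]
C, residual = monte_carlo_solve(constraints, init=[1.0, 1.0])
```

The final `residual` is also a natural constraint-based score: it measures how far a produced diagram deviates from the stated geometric relations.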

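2512.02240 2026-06-10 cs.CL Version update

Lightweight Latent Reasoning for Narrative Tasks

Alexander Gurung, Esmeralda S. Whitammer, Mirella Lapata

Affiliations * School of Informatics, University of Edinburgh; CIFAR Fellow

AI summary: Proposes LiteReason, which generates continuous latent tokens with a lightweight Reasoning Projector and switches dynamically between latent and discrete reasoning during reinforcement learning, reducing reasoning length by 77-92% while staying close to the performance of non-latent RL.

Details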
Abstract

Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model 'skip' reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.

2404.09101 2026-06-10 cs.LG cs.AI cs.NA math.NA stat.ML Version update

Mixtures of Neural Operators Reduce Active Complexity in Operator Learning

Anastasis Kratsios, Takashi Furuya, Jose Antonio Lara Benitez, Matti Lassas, Maarten de Hoop

Affiliations * McMaster University and Vector Institute; Shimane University; Rice University; University of Helsinki

AI summary: Through a comparison between routed mixtures of neural operators (MoNOs) and a fixed single-neural-operator construction, shows that the active expert of a MoNO has better depth, width, and rank scaling, and that for Lipschitz targets these quantities are bounded by $\mathcal{O}(\varepsilon^{-1})$.

Details
Abstract

Operator-learning systems are not governed solely by total parameter count; for one query, the relevant bottleneck can be the model that must be loaded and evaluated. We study this distinction for classical neural operators on compact Sobolev subsets through a constructive comparison between routed mixtures of neural operators (MoNOs) and a fixed single-neural-operator construction. The comparison concerns expert-active complexity relative to that baseline, with total stored size and routing search accounted separately. A MoNO routes each input function through a tree to one expert. Our main theorem shows that every scalar uniformly continuous nonlinear operator with bounded output Sobolev radius on the approximation set admits a MoNO approximation whose active expert has smaller depth, width, and rank scaling than the analyzed single-neural-operator construction; for Lipschitz targets these expert quantities are bounded by $\mathcal{O}(\varepsilon^{-1})$. The theorem turns localization into an operator-level accounting of active expert size, routing depth, and number of experts. We also prove a quantitative universal approximation theorem for the underlying neural-operator architecture, with explicit dependence on compact-set diameter and modulus of continuity.

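2511.22331 2026-06-10 math.OC cs.AI cs.LG Version update

On the Condition Number Dependency in Bilevel Optimization

Lesi Chen, Jingzhao Zhang

Affiliations * IIIS, Tsinghua University

AI summary: Establishes lower bounds on the condition-number dependency for bilevel optimization with a nonconvex upper level and a strongly convex lower level, revealing the first provable gap in condition-number dependency between bilevel and minimax optimization.

Details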
Comments
This new version improves deterministic lower bounds in v1
Abstract

Bilevel optimization minimizes an objective function, defined by an upper-level problem whose feasible region is the solution of a lower-level problem. We study the oracle complexity of finding an $\epsilon$-stationary point with first-order methods when the upper-level problem is nonconvex, and the lower-level problem is strongly convex. Recent works (Ji et al., ICML 2021; Arbel and Mairal, ICLR 2022; Chen et al., JMLR 2025) achieve a $\tilde{\mathcal{O}}(\bar \kappa_y^4 \epsilon^{-2})$ upper bound that is near-optimal in $\epsilon$, which can be reduced to $\tilde{\mathcal{O}}(\bar \kappa_y^{7/2} \epsilon^{-2})$ by a naive application of Nesterov acceleration in the inner loop, where $\bar \kappa_y$ is the global condition number. However, the optimal dependency on the condition number is unknown. In this work, we establish a new $\Omega(\kappa_y^{5/2} \epsilon^{-2})$ lower bound, where $\kappa_y < \bar \kappa_y$ is the lower-level condition number that is of the same order as $\bar \kappa_y$ when the smoothness constants are $\mathcal{O}(1)$. Our lower bound establishes the first provable gap in terms of condition number dependency between bilevel problems and minimax problems in this setup. Our lower bounds can be extended to various settings, including high-order smooth functions, stochastic oracles, and convex hyper-objectives: (1) For second-order and arbitrarily smooth problems, we show lower bounds of $\Omega(\kappa_y^{31/14} \epsilon^{-12/7})$ and $\Omega(\kappa_y^{21/10} \epsilon^{-8/5})$, respectively. (2) For convex-strongly-convex problems, we improve the previously best lower bound (Ji and Liang, JMLR 2022) from $\Omega(\kappa_y / \sqrt{\epsilon})$ to $\Omega(\kappa_y^{3/2} / \sqrt{\epsilon})$. (3) For smooth stochastic problems, we also show a lower bound of $\Omega(\kappa_y^4 \epsilon^{-4})$.
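
For orientation, the setting can be written out in the standard form below. The symbols $f$, $g$, $\varphi$, $L_g$, and $\mu$ are notation assumed here and may differ from the paper's; in particular, reading $\kappa_y$ as the ratio of the lower-level smoothness constant to its strong-convexity constant is an assumption, since the abstract does not define it explicitly.

```latex
% Bilevel problem: nonconvex upper level f, lower level g strongly convex in y
\min_{x \in \mathbb{R}^{d_x}} \; \varphi(x) := f\bigl(x, y^*(x)\bigr),
\qquad
y^*(x) = \arg\min_{y \in \mathbb{R}^{d_y}} g(x, y).

% Hypergradient via the implicit function theorem (all terms evaluated at (x, y^*(x)))
\nabla \varphi(x) = \nabla_x f - \nabla^2_{xy} g \, \bigl[\nabla^2_{yy} g\bigr]^{-1} \nabla_y f.

% Target: an epsilon-stationary point; assumed definition of the lower-level condition number
\|\nabla \varphi(x)\| \le \epsilon,
\qquad
\kappa_y = L_g / \mu .
```

The inverse Hessian in the hypergradient is where the lower-level conditioning enters, which gives some intuition for why bounds in this setting carry powers of $\kappa_y$.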

2511.10234 2026-06-10 cs.LG cs.AI 版本更新

Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners

迷失在序列化中:LLM图推理器的不变性与泛化能力

Daniel Herbst, Lea Karbevska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca

发表机构 * arXiv.org University of Cambridge(剑桥大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 研究LLM图推理器对图表示对称性的缺乏不变性,通过分解序列化因素并评估微调影响,发现大模型更鲁棒,微调降低节点重标敏感但增加结构和格式敏感,且不保证泛化。

详情
Comments
ICML 2026 Workshop on Graph Foundation Models
AI中文摘要

尽管前景广阔,基于大型语言模型(LLM)的图推理器缺乏对图表示中对称性的内置不变性。在顺序图序列化上操作时,LLM在节点重索引、边重排序或格式变化下可能产生不同输出,引发鲁棒性问题。我们系统分析了这些影响,研究了微调如何影响编码敏感性以及在未见任务上的泛化能力。我们提出了一种将图序列化分解为节点标记、边编码和语法的原则性方法,并在一个全面的基准测试套件上评估了LLM对每个因素变化的鲁棒性。我们还贡献了一组新的谱任务,以进一步评估微调推理器的泛化能力。结果表明,较大的(未微调)模型更鲁棒。微调降低了对节点重标的敏感性,但可能增加对结构和格式变化的敏感性,同时并未一致地提高在未见任务上的性能。

英文摘要

While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well as generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.
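
The symmetry issue described above can be made concrete with a small sketch: the same graph yields different text under node relabeling or edge reordering, while a graph property such as the degree sequence is unchanged. The `serialize` format and the helper names are hypothetical and are not taken from the paper's benchmark.

```python
import random

def serialize(edges, relabel=None, shuffle_seed=None):
    """Render an undirected edge list as text, optionally relabeling nodes and reordering edges."""
    relabel = relabel or {}
    mapped = [(relabel.get(u, u), relabel.get(v, v)) for u, v in edges]
    if shuffle_seed is not None:
        random.Random(shuffle_seed).shuffle(mapped)
    return ", ".join(f"{u}-{v}" for u, v in mapped)

def degree_sequence(edges, relabel=None):
    """Sorted node degrees: a quantity that is invariant to relabeling and edge order."""
    relabel = relabel or {}
    degree = {}
    for u, v in edges:
        for node in (relabel.get(u, u), relabel.get(v, v)):
            degree[node] = degree.get(node, 0) + 1
    return sorted(degree.values())

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
permutation = {0: 3, 1: 2, 2: 1, 3: 0}  # a node reindexing of the same graph

original = serialize(edges)
relabeled = serialize(edges, relabel=permutation)
reordered = serialize(edges, shuffle_seed=7)
```

An LLM that reads `original` and `relabeled` sees two different token sequences even though the underlying graph is the same, which is the sensitivity the benchmark measures.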

2511.05349 2026-06-10 cs.SD 版本更新

Passive Acoustic-based Composite Indices for Reef Health Monitoring in Noisy Tropical waters

基于被动声学的复合指数用于嘈杂热带水域的珊瑚礁健康监测

Hari Vishnu, Yuen Min Too, Mandar Chitre, Danwei Huang, Teong Beng Koay, Sudhanshi S. Jain

发表机构 * University of Technology, Sydney(悉尼科技大学) Nanyang Technological University(南洋理工大学) National Institute of Oceanography and Environmental Physics(国家海洋与环境物理研究所) Institute of Marine and Coastal Sciences, University of Connecticut(康乃狄克大学海洋与海岸科学研究所) Indian Institute of Technology, Bombay(印度理工学院孟买分校)

AI总结 提出使用卷积神经网络去噪器处理低频噪声,结合声压级、声学复杂度指数和虾鸣率等声学指标,实现与潜水评估一致的珊瑚礁健康监测。

详情
AI中文摘要

被动声学监测为珊瑚礁的长期、空间广泛评估提供了潜力。为探索这种方法,我们在新加坡水域的十个珊瑚礁站点部署了水下声学记录仪,持续两年。为减轻持续的人为和流致噪声对低频礁声景的掩蔽,我们训练了一个卷积神经网络去噪器。声学数据分析揭示了明显的晨昏合唱。尽管在噪声记录的低频部分,与环境变量的相关性被掩盖,但去噪后的数据显示声学活动指数(如声压级和声学复杂度指数)与基于潜水员的珊瑚礁健康评估(如活珊瑚丰富度和覆盖率、藻类覆盖率)之间存在相关性。此外,从高频声带计算的虾鸣率在时间和空间上与珊瑚礁参数稳健相关。本研究证明,只要有效去噪和解释数据,被动声学包含有助于珊瑚礁监测的有价值信息。该方法可推广到其他因持续噪声而阻碍声学监测的海洋环境。

英文摘要

Passive acoustic monitoring offers the potential to enable long-term, spatially extensive assessments of coral reefs. To explore this approach, we deployed underwater acoustic recorders at ten coral reef sites around Singapore waters over two years. To mitigate the persistent anthropogenic and current-induced noise masking the low-frequency reef soundscape, we trained a convolutional neural network denoiser. Analysis of the acoustic data reveals distinct morning and evening choruses. Though the correlation with environmental variates was obscured in the low-frequency part of the noisy recordings, the denoised data showed correlations of acoustic activity indices such as sound pressure level and acoustic complexity index with diver-based assessments of reef health such as live coral richness and cover, and algal cover. Furthermore, the shrimp snap rate, computed from the high-frequency acoustic band, is robustly correlated with the reef parameters, both temporally and spatially. This study demonstrates that passive acoustics holds valuable information that can help with reef monitoring, provided the data is effectively denoised and interpreted. This methodology can be extended to other marine environments where acoustic monitoring is hindered by persistent noise.
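
Two of the indices named above have simple textbook definitions, sketched here in plain Python. The 1 µPa reference pressure is the usual underwater convention, and the acoustic complexity index follows one common formulation (summed absolute intensity change per frequency bin, normalized by the bin's total intensity). Whether the study uses exactly these variants is an assumption.

```python
import math

def spl_db(pressure_pa, p_ref_pa=1e-6):
    """Sound pressure level in dB re 1 uPa from pressure samples given in pascals."""
    rms = math.sqrt(sum(p * p for p in pressure_pa) / len(pressure_pa))
    return 20.0 * math.log10(rms / p_ref_pa)

def acoustic_complexity_index(spectrogram):
    """ACI over a spectrogram given as one list of intensities over time per frequency bin."""
    total = 0.0
    for intensities in spectrogram:
        energy = sum(intensities)
        if energy > 0:
            # Sum of absolute differences between adjacent time frames in this bin.
            change = sum(abs(a - b) for a, b in zip(intensities, intensities[1:]))
            total += change / energy
    return total
```

A constant-intensity bin contributes zero to the index, so steady background noise adds little while fluctuating biological sound adds more.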

2507.22017 2026-06-10 eess.IV cs.CV 版本更新

Cyst-X: A Multi-Center MRI Benchmark and Federated Learning Framework for Malignancy-Risk Stratification of Pancreatic Cystic Neoplasm

Cyst-X:用于胰腺囊性肿瘤恶性风险分层的多中心MRI基准与联邦学习框架

Hongyi Pan, Gorkem Durak, Elif Keles, Ziliang Hong, Deniz Seyithanoglu, Zheyuan Zhang, Alpay Medetalibeyoglu, Halil Ertugrul Aktas, Andrea Mia Bejar, Yavuz Taktak, Gulbiz Dagoglu Kartal, Mehmet Sukru Erturk, Timurhan Cebeci, Yury Velichko, Lili Zhao, Emil Agarunov, Federica Proietto Salanitri, Concetto Spampinato, Pallavi Tiwari, Ziyue Xu, Sachin Jambawalikar, Ivo G. Schoots, Marco J. Bruno, Chenchan Huang, Candice W. Bolan, Tamas Gonda, Frank H. Miller, Rajesh N. Keswani, Michael B. Wallace, Ulas Bagci

发表机构 * Machine & Hybrid Intelligence Lab, Department of Radiology, Northwestern University(机器与混合智能实验室,放射科,西北大学) Istanbul Faculty of Medicine, Istanbul University(伊斯坦布尔大学医学学院) Department of Biomedical Engineering and Radiology, University of Wisconsin-Madison(生物医学工程与放射科,威斯康星大学麦迪逊分校) Department of Preventive Medicine, Northwestern University(预防医学系,西北大学) Division of Gastroenterology and Hepatology, New York University(消化内科与肝病科,纽约大学) Department of Electrical, Electronic and Computer Engineering, University of Catania(电气、电子和计算机工程系,卡塔尼亚大学) NVIDIA Department of Radiology, Columbia University(放射科,哥伦比亚大学) Department of Radiology and Nuclear Medicine, Erasmus Medical Center(放射科与核医学科,伊拉斯姆斯医学中心) Department of Gastroenterology and Hepatology, Erasmus Medical Center(消化内科与肝病科,伊拉斯姆斯医学中心) Department of Radiology, New York University(放射科,纽约大学) Division of Gastroenterology and Hepatology, Mayo Clinic Florida(消化内科与肝病科,梅奥诊所佛罗里达院区) Department of Gastroenterology and Hepatology, Northwestern University(消化内科与肝病科,西北大学)

AI总结 提出Cyst-X,一个多中心MRI基准和联邦学习框架,用于IPMN恶性风险分层,结合PanSegNet分割器和3D DenseNet-121分类器,在内部交叉验证中达到0.85的AUC,性能与放射科医生相当。

详情
AI中文摘要

预计到2030年,胰腺癌将成为第二大致命癌症,因此早期检测至关重要。导管内乳头状黏液性肿瘤(IPMN)是关键的癌前病变,目前指南在恶性风险分层方面存在困难,导致不必要的手术或漏诊。在此,我们介绍Cyst-X,一个用于IPMN恶性风险分层的多中心MRI基准和联邦学习框架。该数据集包含来自七个国际中心764名患者的1,461次腹部MRI扫描,具有基于组织病理学或三年影像随访的三级恶性标签和专家胰腺分割。该流程将PanSegNet胰腺分割器与3D DenseNet-121分类器以及并行放射组学预测器相结合。在内部交叉验证中,深度学习分类器在T2加权MRI上对高风险与低风险或无风险鉴别达到了平均受试者工作特征曲线下面积(AUC)0.85(95%置信区间0.84-0.86),平均精确度从患病率基线0.23提高到0.64。当训练分布在多个机构之间且不交换原始患者图像时,该性能得以保持(AUC 0.85,FedProx)。在仅基于影像条件下评估的629例读者子集上,与三位盲法放射科医生相比,该分类器在特异性相当的情况下达到或超过了敏感性。为了加速早期胰腺癌检测研究,我们公开发布Cyst-X数据集、分割掩膜和训练模型,作为首个用于胰腺囊性肿瘤分析的大规模多中心MRI资源。

英文摘要

Pancreatic cancer is projected to be the second-deadliest cancer by 2030, making early detection critical. Intraductal papillary mucinous neoplasms (IPMNs), key cancer precursors, present a clinical dilemma, as current guidelines struggle to stratify malignancy risk, leading to unnecessary surgeries or missed diagnoses. Here, we introduce Cyst-X, a multi-center MRI benchmark and a federated learning framework for IPMN malignancy-risk stratification. The dataset comprises 1,461 abdominal MRI scans from 764 patients at seven international centers, with three-tier malignancy labels anchored in histopathology or three-year imaging follow-up and expert pancreas segmentations. The pipeline couples the PanSegNet pancreas segmenter with a 3D DenseNet-121 classifier and a parallel radiomics predictor. On internal cross-validation, the deep learning classifier reached a mean area under the receiver operating characteristic curve (AUC) of 0.85 (95% confidence interval 0.84-0.86) on T2-weighted MRI for high-risk versus low- or no-risk discrimination, with the average precision rising from a prevalence baseline of 0.23 to 0.64. This performance was preserved (AUC 0.85, FedProx) when training was distributed across institutions without exchange of raw patient images. Benchmarked against three blinded radiologists on a 629-case reader subset evaluated under imaging-only conditions, the classifier matched or exceeded sensitivity at comparable specificity. To accelerate research in early pancreatic cancer detection, we publicly release the Cyst-X dataset, segmentation masks, and trained models as the first large-scale, multi-center MRI resource for pancreatic cystic neoplasm analysis.
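
As a reading aid for the reported metrics, the sketch below computes AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, and the prevalence that serves as the chance-level reference for average precision. The scores and labels in the usage note are made-up toy values, not study data.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def prevalence(labels):
    """Fraction of positive cases: the expected average precision of a random ranking."""
    return sum(labels) / len(labels)
```

For example, `roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])` ranks three of the four positive-negative pairs correctly. A prevalence of 0.23 means a random ranking would score about 0.23 in average precision, the baseline against which the reported 0.64 is compared.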

2510.09801 2026-06-10 cs.AI 版本更新

How can we assess human-agent interactions? Case studies in software agent design

如何评估人机交互?软件代理设计案例研究

Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出PULSE框架,通过用户反馈和模型预测结合评估人机交互,在15k用户实验中验证其能减少40%置信区间,并揭示基准测试与真实结果的差异。

详情
Comments
ICML 2026
AI中文摘要

虽然基准测试衡量了基于LLM的代理的准确性,但它们大多假设完全自动化,未能代表现实用例的协作性质。在本文中,我们朝着严格评估人机交互迈出了两大步。首先,我们提出了PULSE,一个用于更高效地以人为中心评估代理设计的框架,包括收集用户反馈、训练ML模型预测用户满意度,以及通过结合人类满意度评分与模型生成的伪标签来计算结果。其次,我们在软件工程——LLM代理最高影响、最真实的领域之一——中部署了PULSE,通过一个围绕开源代理OpenHands构建的大规模网络平台。在15k用户中,我们评估了三个代理设计决策如何影响开发者满意度率。我们还展示了PULSE如何能导致关于代理设计的更稳健结论,与标准A/B测试相比,将置信区间减少了40%。最后,我们发现了野外结果与基准性能之间的显著差异(例如,claude-sonnet-4和gpt-5之间的反相关性),强调了基准驱动评估的局限性。我们的框架PULSE为未来评估提供了指导,我们的发现识别了改进软件代理设计的机会。

英文摘要

While benchmarks measure the accuracy of LLM-powered agents, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we make two major steps towards the rigorous assessment of human-agent interactions. First, we propose PULSE, a framework for more efficient human-centric evaluation of agent designs, which comprises collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model-generated pseudo-labels. Second, we deploy PULSE in software engineering -- one of the highest-impact, real-world domains for LLM agents -- via a large-scale web platform built around the open-source agent OpenHands. Across 15k users, we evaluate how three agent design decisions impact developer satisfaction rates. We also show how PULSE can lead to more robust conclusions about agent design, reducing confidence intervals by 40% compared to a standard A/B test. Finally, we find substantial discrepancies between in-the-wild results and benchmark performance (e.g., the anti-correlation between claude-sonnet-4 and gpt-5), underscoring the limitations of benchmark-driven evaluation. Our framework PULSE provides guidance for future evaluations, and our findings identify opportunities for better software agent designs.
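
One generic way to combine a small set of human ratings with many model-generated pseudo-labels is a bias-corrected mean, sketched below. This is an illustrative estimator in the style of prediction-powered inference and is an assumption here: the abstract does not say which estimator PULSE uses, and the function and argument names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def combined_satisfaction(human_labels, preds_on_labeled, preds_on_unlabeled):
    """Pseudo-label mean on unrated sessions, corrected by the model's average error on rated ones."""
    # Rectifier: how far the model's predictions are from human ratings, on average.
    rectifier = mean([h - p for h, p in zip(human_labels, preds_on_labeled)])
    return mean(preds_on_unlabeled) + rectifier
```

The correction term keeps the estimate anchored to human ratings, while the larger pseudo-labeled pool is what can shrink the confidence interval relative to using the human ratings alone.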

2511.01927 2026-06-10 cs.LG cs.AI cs.NA math.NA 版本更新

Learning-Guided Integration Contours Construction for Fast Large-Scale Generalized Eigensolvers

学习引导的积分轮廓构建用于快速大规模广义特征值求解器

Yeqiu Chen, Ziyan Liu, Hong Wang, Lei Liu

发表机构 * arXiv.org

AI总结 提出Deepcontour混合框架,结合深度学习谱预测器与核密度估计自动构建优化积分轮廓,加速大规模广义特征值求解,实现最高5.63倍加速并保持数值精度。

详情
AI中文摘要

解决大规模广义特征值问题(GEPs)是科学与工程中一项基本但计算上极为困难的任务。作为一种有前景的方向,轮廓积分(CI)方法提供了高效且可并行化的框架。然而,其性能关键依赖于积分轮廓的选择——在没有可靠先验知识的情况下,不当选择可能导致显著的计算开销并损害数值精度。为应对这一挑战,我们提出Deepcontour,一种新颖的混合框架,它将基于深度学习的谱预测器与核密度估计(KDE)相结合,用于原则性的轮廓设计。具体而言,Deepcontour利用其专用的特征神经算子(ENO)提供快速的谱分布先验,驱动KDE模块自动构建优化的积分轮廓,从而引导CI求解器高效地找到所需特征值。Deepcontour在多种科学数据集上实现了高达5.63倍的加速,同时保持严格的数值精度。通过融合深度学习的预测能力与经典求解器的数值严谨性,这项工作为解决大规模GEPs建立了一种高效且稳健的范式。

英文摘要

Solving large-scale Generalized Eigenvalue Problems (GEPs) is a fundamental yet computationally prohibitive task in science and engineering. As a promising direction, contour integral (CI) methods offer an efficient and parallelizable framework. However, their performance is critically dependent on the selection of integration contours -- improper selection without reliable prior knowledge of eigenvalue distribution can incur significant computational overhead and compromise numerical accuracy. To address this challenge, we propose Deepcontour, a novel hybrid framework that integrates a deep learning-based spectral predictor with Kernel Density Estimation (KDE) for principled contour design. Specifically, Deepcontour utilizes its specialized Eigen-Neural-Operator (ENO) to provide rapid spectral distribution priors, driving a KDE module to automatically construct the optimized integration contours, which guide the CI solver to efficiently find the desired eigenvalues. Deepcontour achieves up to a 5.63x speedup across diverse scientific datasets while maintaining strict numerical rigor. By merging the predictive power of deep learning with the numerical rigor of classical solvers, this work establishes an efficient and robust paradigm for solving large-scale GEPs.
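
The idea of turning a predicted spectral distribution into contours can be sketched in one dimension: estimate a density over predicted eigenvalues with a Gaussian KDE, keep the regions where the density is high, and enclose each region in a circle. This is a simplified illustration assuming real eigenvalues and a fixed bandwidth and threshold; it is not the Deepcontour implementation, and all names are hypothetical.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate of the samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def dense_intervals(samples, bandwidth, threshold, grid_points=200):
    """Intervals on the real axis where the estimated density is at least the threshold."""
    density = gaussian_kde(samples, bandwidth)
    lo = min(samples) - 3.0 * bandwidth
    hi = max(samples) + 3.0 * bandwidth
    step = (hi - lo) / (grid_points - 1)
    intervals, start, end = [], None, None
    for i in range(grid_points):
        x = lo + i * step
        if density(x) >= threshold:
            if start is None:
                start = x
            end = x
        elif start is not None:
            intervals.append((start, end))
            start = None
    if start is not None:
        intervals.append((start, end))
    return intervals

def contour_circles(intervals):
    """One circular contour (center, radius) per dense interval."""
    return [((a + b) / 2.0, (b - a) / 2.0) for a, b in intervals]
```

With predicted eigenvalues clustered near 0 and near 10, this yields two small circles instead of one large contour spanning the empty gap. Avoiding that kind of wasted area is the motivation for data-driven contour design.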

2503.19158 2026-06-10 cs.LG q-bio.QM 版本更新

Integrating Biological-Informed Recurrent Neural Networks for Glucose-Insulin Dynamics Modeling

整合生物信息递归神经网络用于葡萄糖-胰岛素动态建模

Stefano De Carli, Nicola Licini, Davide Previtali, Fabio Previdi, Antonio Ferramosca

发表机构 * Department of Management, Information and Production Engineering, University of Bergamo(管理、信息与生产工程系,贝加莫大学)

AI总结 本文提出生物信息递归神经网络框架,用于更准确地建模葡萄糖-胰岛素动态,以提高人工胰腺系统的个性化血糖调节能力。

详情
Journal ref
IFAC-PapersOnLine, 59(2), 2025, pp. 91-96
Comments
Accepted for publication in the proceedings of the Engineering Diabetes Technologies (EDT 2025). 7 pages, 2 figures and 1 table
AI中文摘要

1型糖尿病管理由于多种变异性因素而复杂。人工胰腺系统通过先进控制算法自动化胰岛素输送,减轻了患者负担。然而,这些系统的有效性依赖于对葡萄糖-胰岛素动态的准确建模,而传统数学模型往往无法捕捉到患者特异性变化。本文引入了生物信息递归神经网络(BIRNN)框架,该框架利用门控递归单元(GRU)架构,并辅以包含生理约束的物理信息损失函数,确保预测准确性和生物原理的一致性。该框架通过商业UVA/Padova模拟器验证,其在葡萄糖预测准确性和未测量状态重构方面优于传统线性模型,即使在胰岛素敏感性昼夜变化下也表现优异。结果表明,BIRNN在人工胰腺系统的个性化葡萄糖调节和未来自适应控制策略中具有潜力。

英文摘要

Type 1 Diabetes (T1D) management is a complex task due to many variability factors. Artificial Pancreas (AP) systems have alleviated patient burden by automating insulin delivery through advanced control algorithms. However, the effectiveness of these systems depends on accurate modeling of glucose-insulin dynamics, which traditional mathematical models often fail to capture due to their inability to adapt to patient-specific variations. This study introduces a Biological-Informed Recurrent Neural Network (BIRNN) framework to address these limitations. The BIRNN leverages a Gated Recurrent Unit (GRU) architecture augmented with physics-informed loss functions that embed physiological constraints, ensuring a balance between predictive accuracy and consistency with biological principles. The framework is validated using the commercial UVA/Padova simulator, outperforming traditional linear models in glucose prediction accuracy and reconstruction of unmeasured states, even under circadian variations in insulin sensitivity. The results demonstrate the potential of BIRNN for personalized glucose regulation and future adaptive control strategies in AP systems.

2509.05913 2026-06-10 cs.CV 版本更新

A fine-grained attention and geometric correspondence model for musculoskeletal risk classification in athletes using multimodal visual and skeletal features

基于多模态视觉和骨骼特征的运动员肌肉骨骼风险分类的细粒度注意力与几何对应模型

Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Tamanna Shermin, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam

发表机构 * Department of Computer Science and Engineering, United International University(计算机科学与工程系,国际联合大学) Department of Data Science and Artificial Intelligence, Monash University(数据科学与人工智能系,莫纳什大学) Faculty of Science and Technology, Charles Darwin University(科学与技术学院,查尔斯达尔文大学) Applied Artificial Intelligence and Intelligent Systems (AAIINS) Laboratory, Dhaka(应用人工智能与智能系统实验室,达卡)

AI总结 提出ViSK-GAT多模态框架,融合图像和骨骼坐标特征,通过细粒度注意力模块和几何对应模块实现运动员肌肉骨骼风险八级分类,关键指标超93%。

详情
Journal ref
Computers and Electrical Engineering, Vol. 138, 111281, 2026
Comments
Published in Computers and Electrical Engineering
AI中文摘要

肌肉骨骼疾病对运动员构成重大风险,早期风险评估对于预防至关重要。然而,现有方法大多针对受控环境设计,由于依赖单一数据类型,无法在复杂环境中可靠地评估风险。本研究引入了ViSK-GAT(视觉-骨骼几何注意力变换器),一种新颖的多模态深度学习框架,利用视觉和基于骨骼坐标的特征对肌肉骨骼风险进行分类。通过结合图像和骨骼坐标创建了自定义多模态数据集(MusDis-Sports),每个样本根据快速全身评估(REBA)系统标记为八个风险类别。ViSK-GAT集成了两个创新模块:细粒度注意力模块(FGAM),在融合前通过自注意力细化模态内特征;以及多模态几何对应模块(MGCM),增强图像特征与坐标之间的跨模态对齐。该模型取得了稳健的性能,所有关键指标均超过93%。概率分布误差指标也显示出较低的均方根误差(RMSE)为0.1205和平均绝对误差(MAE)为0.0156。ViSK-GAT持续优于最先进的深度学习骨干网络,展示了其在推动人工智能驱动的肌肉骨骼风险评估和实现运动领域及时干预方面的潜力。

英文摘要

Musculoskeletal disorders pose significant risks to athletes, and early risk assessment is essential for prevention. However, most existing methods are designed for controlled settings and fail to reliably assess risk in complex environments due to their reliance on a single type of data. This research introduces ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a novel multimodal deep learning framework that classifies musculoskeletal risk using both visual and skeletal coordinate-based features. A custom multimodal dataset (MusDis-Sports) was created by combining images and skeletal coordinates, with each sample labeled into eight risk categories based on the Rapid Entire Body Assessment (REBA) system. ViSK-GAT integrates two innovative modules: the Fine-Grained Attention Module (FGAM), which refines intra-modal features through self-attention before fusion, and the Multimodal Geometric Correspondence Module (MGCM), which enhances cross-modal alignment between image features and coordinates. The model achieved robust performance, with all key metrics exceeding 93%. Probability distribution error metrics also showed a low Root Mean Squared Error (RMSE) of 0.1205 and a Mean Absolute Error (MAE) of 0.0156. ViSK-GAT consistently outperformed state-of-the-art (SOTA) deep learning backbones and showed its potential to advance artificial intelligence-driven musculoskeletal risk assessment and enable timely interventions in sports.

2502.01272 2026-06-10 cs.LG 版本更新

Boosting Graph Robustness Against Backdoor Attacks: An Over-Similarity Perspective

提升图神经网络对后门攻击的鲁棒性:过度相似性视角

Chang Liu, Hai Huang, Yujie Xing, Xingquan Zuo

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 针对图后门攻击中触发器与干净节点难以区分的问题,提出基于过度相似性检测的防御方法SimGuard,利用对比学习训练检测器分离触发器,在保持干净节点性能的同时有效防御多种后门攻击。

详情
Comments
After discussions with one of the co-authors, it was decided that this version should not be made public at this time. To respect the co-author's perspective and ensure alignment among all authors, I am requesting the withdrawal of this article
AI中文摘要

图神经网络(GNN)在社交网络和交通网络等任务中取得了显著成功。然而,最近的研究强调了GNN易受后门攻击的脆弱性,引发了对其在实际应用中可靠性的重大担忧。尽管已有初步努力来防御特定的图后门攻击,但现有防御方法面临两个主要挑战:要么无法在触发器和干净节点之间建立明确区分,导致许多干净节点被移除;要么未能消除触发器的影响,使得难以将目标节点恢复到攻击前的状态。通过对各种现有图后门攻击的实证分析,我们观察到这些方法生成的触发器在特征和结构上都表现出过度相似性。基于这一观察,我们提出了一种新颖的图后门防御方法SimGuard。我们首先利用基于相似性的度量来检测触发器,然后采用对比学习训练一个后门检测器,生成能够将触发器与干净节点分离的嵌入,从而提高检测效率。在真实数据集上进行的大量实验表明,我们提出的方法在保持干净节点性能的同时,有效防御了各种图后门攻击。代码将在接收后发布。

英文摘要

Graph Neural Networks (GNNs) have achieved notable success in tasks on social and transportation networks. However, recent studies have highlighted the vulnerability of GNNs to backdoor attacks, raising significant concerns about their reliability in real-world applications. Despite initial efforts to defend against specific graph backdoor attacks, existing defense methods face two main challenges: either the inability to establish a clear distinction between triggers and clean nodes, resulting in the removal of many clean nodes, or the failure to eliminate the impact of triggers, making it challenging to restore the target nodes to their pre-attack state. Through empirical analysis of various existing graph backdoor attacks, we observe that the triggers generated by these methods exhibit over-similarity in both features and structure. Based on this observation, we propose a novel graph backdoor defense method, SimGuard. We first use a similarity-based metric to detect triggers and then employ contrastive learning to train a backdoor detector that generates embeddings capable of separating triggers from clean nodes, thereby improving detection efficiency. Extensive experiments conducted on real-world datasets demonstrate that our proposed method effectively defends against various graph backdoor attacks while preserving performance on clean nodes. The code will be released upon acceptance.
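
The over-similarity observation suggests a simple first-pass filter: flag edges whose endpoint features are almost identical. The sketch below uses cosine similarity with a fixed threshold. Both choices are assumptions for illustration; the abstract does not specify SimGuard's actual metric, and the contrastive detector it describes is not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is the zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def over_similar_edges(features, edges, threshold=0.99):
    """Edges whose endpoints have suspiciously similar features (hypothetical threshold)."""
    return [(u, v) for u, v in edges if cosine(features[u], features[v]) > threshold]
```

A fixed threshold like this would also flag genuinely similar clean neighbors, which is presumably why the abstract pairs the similarity metric with a learned detector.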

2510.12071 2026-06-10 cs.LG 版本更新

Influence Dynamics and Stagewise Data Attribution

影响动力学与分阶段数据归因

Jin Hwa Lee, Matthew Smith, Maxwell Adam, Jesse Hoogland

发表机构 * University College London(伦敦大学学院) Independent(独立) University of Melbourne(墨尔本大学) Timaeus

AI总结 针对神经网络训练中样本影响动态变化的问题,基于奇异学习理论提出分阶段数据归因框架,预测影响非单调变化(符号翻转、尖峰),并在玩具模型和语言模型中验证与模型学习阶段的对应。

详情
Comments
28 pages, 15 figures
AI中文摘要

当前的训练数据归因(TDA)方法将样本对另一个样本的影响视为静态的,但神经网络在表现出不同影响模式的独特阶段中学习。在这项工作中,我们引入了一个基于奇异学习理论的分阶段数据归因框架。我们预测影响可以非单调地变化,包括符号翻转和发展转变处的尖锐峰值。我们首先在玩具模型中通过分析和实验验证这些预测,表明影响的动态变化直接映射到模型对语义层次结构的逐步学习。最后,我们在语言模型中大规模展示了这些现象,其中令牌级别的影响变化与已知的发展阶段一致。

英文摘要

Current training data attribution (TDA) methods treat the influence one sample has on another as static, but neural networks learn in distinct stages that exhibit changing patterns of influence. In this work, we introduce a framework for stagewise data attribution grounded in singular learning theory. We predict that influence can change non-monotonically, including sign flips and sharp peaks at developmental transitions. We first validate these predictions analytically and empirically in a toy model, showing that dynamic shifts in influence directly map to the model's progressive learning of a semantic hierarchy. Finally, we demonstrate these phenomena at scale in language models, where token-level influence changes align with known developmental stages.

2510.08622 2026-06-10 cs.CL cs.SE 版本更新

Automated Alignment between Elicitation Interviews and Requirements

启发式访谈与需求之间的自动对齐

Francesco Dente, Fabiano Dalpiaz, Paolo Papotti

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 提出将访谈转录与用户故事需求自动对齐的任务,定义忠实度和覆盖率两个度量,利用大语言模型和嵌入模型实现自动评估,在四个数据集上达到0.86 macro-F1。

详情
Comments
8 pages
AI中文摘要

软件需求来源于多种启发式技术,其中许多具有对话性质,如访谈。然而,评估这些衍生需求是否忠实反映利益相关者的需求仍然是一项具有挑战性的手工任务。在本文中,我们形式化了将访谈转录与以用户故事表示的需求集合对齐的任务。我们提出了两种启发式对齐度量,称为(i)需求忠实度:转录支持的故事比例,以及(ii)访谈覆盖率:至少被一个故事支持的转录比例。然后,我们使用大语言模型和嵌入模型进行实验,评估自动计算这些度量的能力。在四个数据集上的实验表明,基于LLM的解决方案在手动标注的块-故事对上达到了0.86的宏F1分数。我们还展示了如何将嵌入模型用作阻断器,使方法更具可扩展性。这项工作为更多关于连接对话制品与需求的研究铺平了道路。形式化框架和自动匹配技术是基本组件,可用于新兴任务,如将需求追溯到访谈以及从对话生成需求。

英文摘要

Software requirements are derived from a variety of elicitation techniques, many of which have a conversational nature, like interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders' needs remains a challenging manual task. In this paper, we formalize the task of aligning the transcript of an interview with a collection of requirements represented as user stories. We propose two heuristic metrics for alignment, called (i) requirements faithfulness: the proportion of stories supported by the transcript, and (ii) interview coverage: the proportion of transcript supported by at least one story. Then, we run experiments with large language models and embedding models to assess their ability to evaluate these metrics automatically. Experiments over four datasets show that an LLM-based solution achieves 0.86 macro-F1 on manually labeled chunk-story pairs. We also show how embedding models can be used as blockers to make the approach more scalable. This work paves the way for more research on linking conversational artifacts with requirements. The formal framework and the automated matching techniques are basic components that can be used for emerging tasks such as tracing requirements to interviews and generating requirements from conversations.
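
The two metrics defined above reduce to simple counts once chunk-story support decisions are available. The sketch assumes a boolean matrix `support[i][j]` that is true when transcript chunk `i` supports story `j`. How that matrix is produced (for example by an LLM judge) is outside the sketch, and the names are hypothetical.

```python
def alignment_metrics(support):
    """Return (requirements faithfulness, interview coverage) from a chunk-by-story support matrix."""
    n_chunks = len(support)
    n_stories = len(support[0])
    # A story is faithful if at least one transcript chunk supports it.
    faithful = sum(
        any(support[i][j] for i in range(n_chunks)) for j in range(n_stories)
    )
    # A chunk is covered if it is matched to at least one story.
    covered = sum(any(row) for row in support)
    return faithful / n_stories, covered / n_chunks
```

With three chunks and two stories where only the first story is supported, by the first and third chunks, faithfulness is 1/2 and coverage is 2/3.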

2510.07061 2026-06-10 cs.CL 版本更新

Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

重新审视印度语言机器翻译和摘要细粒度评估的度量可靠性

Amir Hossein Yari, Kalmit Kulkarni, Ahmad Raza Khan, Fajri Koto

发表机构 * Sharif University of Technology(谢里夫理工学院) Vellore Institute of Technology(韦洛雷理工学院) IIT Kharagpur(印度理工学院克勒格布尔分校) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 针对印度语言评估不足的问题,提出ITEM基准,系统评估29种自动度量与人工判断的对齐,发现基于LLM的评估器表现最佳,并揭示了异常值影响、任务差异及扰动鲁棒性等关键发现。

详情
Comments
18 pages, 14 figures
AI中文摘要

虽然自动度量推动了机器翻译(MT)和文本摘要(TS)的发展,但现有度量几乎完全针对英语和其他高资源语言开发和验证。这种狭隘的关注使得超过15亿人使用的印度语言在很大程度上被忽视,对当前评估实践的普遍性提出了质疑。为弥补这一空白,我们引入了ITEM,一个大规模基准,系统评估了29种自动度量与六种主要印度语言人工判断的对齐,并丰富了细粒度注释。我们的广泛评估涵盖了与人工判断的一致性、对异常值的敏感性、语言特定可靠性、度量间相关性以及对受控扰动的鲁棒性,揭示了四个核心发现:(1)基于LLM的评估器在段落和系统级别上与人工判断的对齐最强;(2)异常值对度量-人工一致性有显著影响;(3)在TS中,度量在捕捉内容保真度方面更有效,而在MT中,它们更好地反映流畅性;(4)度量在受到不同扰动时,其鲁棒性和敏感性有所不同。总体而言,这些发现为推进印度语言的度量设计和评估提供了关键指导。

英文摘要

While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 29 automatic metrics with human judgments across six major Indian languages, enriched with fine-grained annotations. Our extensive evaluation, covering agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations, reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.
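
Agreement between a metric and human judgments is typically summarized with a rank correlation. The sketch below implements Kendall's tau (the tau-a variant, which ignores tie corrections) in plain Python. The abstract does not state which correlation coefficient ITEM reports, so this choice is an assumption.

```python
from itertools import combinations

def kendall_tau(metric_scores, human_scores):
    """Kendall tau-a: (concordant - discordant) pairs divided by the total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        sign = (metric_scores[i] - metric_scores[j]) * (human_scores[i] - human_scores[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(metric_scores) * (len(metric_scores) - 1) // 2
    return (concordant - discordant) / n_pairs
```

With only a few items, changing the rank of a single extreme one can flip many pairs at once, which is one way outliers can distort metric-human agreement.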

2510.04195 2026-06-10 cs.AI 版本更新

Constructing coherent spatial memory in LLM agents through graph rectification

通过图修正构建LLM智能体中的连贯空间记忆

Puzhen Zhang, Xuyang Chen, Yu Feng, Yuhan Jiang, Liqiu Meng

发表机构 * Chair of Cartography and Visual Analytics(制图学与视觉分析教授会)

AI总结 提出LLM-MapRepair框架,通过版本控制和边影响评分检测并修正增量构建的导航图中的结构不一致性,在多个基准上显著提升节点和边召回率。

详情
AI中文摘要

给定通过全局遍历导航指令的地图描述,LLM通常能够推断隐式空间布局并通过提供最短路径来回答用户查询。然而,随着环境变大,这种依赖于上下文的查询变得不可行,这促使需要增量地图构建,即从逐步观察中构建完整的拓扑图。我们提出LLM-MapRepair,一个用于LLM驱动的地图构建和修复的框架,旨在检测、定位和修正增量构建的导航图中的结构不一致性。我们的贡献包括:用于图构建的版本控制机制、用于修复优先级的边影响评分,以及为LLM驱动的地图构建和修复量身定制的MANGO基准的清理变体。我们在四个评估设置上评估该框架:合成逐组件消融(gpt-4.1,每个单元n=20个种子)、跨供应商扫描(覆盖OpenAI、Anthropic和Google的七个LLM,在合成和TextWorld程序生成的文本冒险游戏上)、修复阶段评估(在所有42个清理后的MANGO游戏上,具有非零剩余冲突,共534个冲突;三个供应商×三种模式加上两个非LLM参考),以及在《红楼梦》第16-17章上的端到端自然文本部署。在DRC部署中,LLM-MapRepair使用GPT-4.1实现了94.3%的节点召回率(比直接LLM映射高8.6个百分点)和88.2%的边召回率(高55.8个百分点);召回率的提升伴随着预测节点和边数量约为真实值的4倍(表4),这反映了我们在局限性中讨论的离散化驱动的过度生成权衡。

英文摘要

Given a map description through global traversal navigation instructions, an LLM can often infer the implicit spatial layout and answer user queries by providing shortest paths. However, such context-dependent querying becomes infeasible as environments grow larger, motivating the need for incremental map construction that builds a complete topological graph from stepwise observations. We propose LLM-MapRepair, a framework for LLM-driven map construction and repair, designed to detect, localize, and correct structural inconsistencies in incrementally constructed navigation graphs. Our contributions include a Version Control mechanism for graph construction, an Edge Impact Score for repair prioritization, and a cleaned variant of the MANGO benchmark tailored for LLM-driven map construction and repair. We evaluate the framework in four settings: a synthetic per-component ablation (gpt-4.1, n=20 seeds per cell), a cross-vendor sweep over seven LLMs from OpenAI, Anthropic, and Google on both synthetic and TextWorld procedurally-generated text-adventure games, a repair-stage evaluation on all 42 cleaned-MANGO games with non-zero residual conflicts (534 conflicts; three vendors x three modes plus two non-LLM references), and an end-to-end natural-text deployment on Chapters 16-17 of Dream of the Red Chamber (DRC). On the DRC deployment, LLM-MapRepair achieves 94.3% node recall (+8.6 pp over direct LLM mapping) and 88.2% edge recall (+55.8 pp), using GPT-4.1; the recall improvements come with predicted node and edge counts that are roughly 4x the ground-truth counts (Table 4), reflecting the discretization-driven over-generation trade-off we discuss in the Limitations.
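
The recall figures and the over-generation caveat can be read together with a small sketch: recall only counts how much of the ground truth is recovered, so predicting many extra nodes or edges does not lower it, whereas precision does drop. The place names below are toy values, not data from the paper.

```python
def recall(predicted, truth):
    """Fraction of ground-truth items that appear in the prediction."""
    truth = set(truth)
    return len(set(predicted) & truth) / len(truth)

def precision(predicted, truth):
    """Fraction of predicted items that are in the ground truth."""
    predicted = set(predicted)
    return len(predicted & set(truth)) / len(predicted)

truth_nodes = {"hall", "garden", "study"}
predicted_nodes = {"hall", "garden", "gate", "pond"}
node_recall = recall(predicted_nodes, truth_nodes)
node_precision = precision(predicted_nodes, truth_nodes)
```

The same functions apply to edges if each edge is represented as a hashable tuple such as `(source, direction, target)`. That tuple layout is an assumption about how navigation edges might be encoded.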

2507.14725 2026-06-10 cs.LG cs.AI 版本更新

GRID: Scaling Task-Agnostic Inference in Continual Prompt Tuning

GRID:持续提示调优中任务无关推理的规模化

Anushka Tiwari, Sayantan Pal, Rohini K. Srihari, Kaiyi Ji

发表机构 * State University of New York at Buffalo(纽约州立大学布法罗分校) Department of Computer Science and Engineering(计算机科学与工程系) Institute for Artificial Intelligence and Data Science(人工智能与数据科学研究院)

AI总结 提出GRID框架,通过输出空间感知解码和梯度引导提示选择,解决持续学习中任务无关推理的性能退化与可扩展性问题,在长序列和负迁移基准上提升后向迁移并减少提示内存。

详情
AI中文摘要

基于提示的持续学习提供了一种参数高效的方式,使大型语言模型能够适应任务序列。然而,现有方法通常依赖任务感知推理,并维护不断扩展的任务特定提示集,导致(1)当推理时任务标识符不可用于提示选择时,早期任务性能严重下降;(2)随着任务序列增长,可扩展性受限。我们提出GRID,一个统一的框架来解决这些挑战。GRID包含一个输出空间感知解码机制,通过利用代表性输入和自动标签语义归一化来增强后向迁移,以及一个梯度引导的提示选择策略,将信息量较少的提示压缩为单个聚合表示,以实现可扩展、内存高效的持续学习。在长序列和负迁移基准上的大量实验表明,GRID改善了后向迁移,实现了有竞争力的前向迁移,并显著减少了编码器-解码器和仅解码器架构(包括T5、Qwen和LLaMA)中的提示内存。源代码可从此https URL获取。

英文摘要

Prompt-based continual learning (CL) offers a parameter-efficient way to adapt large language models (LLMs) across task sequences. However, existing methods often rely on task-aware inference and maintain an expanding set of task-specific prompts, leading to (1) severe performance degradation on earlier tasks when task identifiers are unavailable for prompt selection at inference time, and (2) limited scalability as the task sequence grows. We propose GRID, a unified framework designed to address these challenges. GRID incorporates an output-space-aware decoding mechanism that enhances backward transfer by leveraging representative inputs and automatic label semantic normalization, alongside a gradient-guided prompt selection strategy that compresses less informative prompts into a single aggregated representation for scalable, memory-efficient continual learning. Extensive experiments on long-sequence and negative-transfer benchmarks show that GRID improves backward transfer, achieves competitive forward transfer, and substantially reduces prompt memory across encoder-decoder and decoder-only architectures, including T5, Qwen, and LLaMA. Source code is available at https://github.com/AnushkaTi/GRID.

2509.25760 2026-06-10 cs.CL cs.AI cs.LG 版本更新

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

TruthRL: 通过强化学习激励诚实的LLM

Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Jingxiang Chen, Mohammad Kachuee, Teja Gollapudi, Yiwei Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出TruthRL框架,使用GRPO和三值奖励直接优化LLM的诚实性,减少幻觉并允许不确定时弃权,在知识密集型基准上显著提升诚实性。

详情
Comments
ICML 2026. Code: https://github.com/facebookresearch/TruthRL
AI中文摘要

虽然大型语言模型(LLM)在事实性问题回答上表现出色,但它们仍然容易产生幻觉和不真实的回答,特别是当任务需要其参数知识之外的信息时。事实上,诚实性需要的不仅仅是准确性——模型还必须识别不确定性,并在不确定时弃权以避免幻觉。这对现有方法提出了根本性挑战:优化准确性的方法往往会放大幻觉,而鼓励弃权的方法可能变得过于保守,牺牲正确答案。两种极端最终都损害了诚实性。在这项工作中,我们提出了TruthRL,一个通用的强化学习(RL)框架,直接优化LLM的诚实性。具体来说,我们使用GRPO实现TruthRL,并采用一个简单而有效的三值奖励,区分正确答案、幻觉和弃权。它激励模型不仅通过提供正确回答来减少幻觉,还通过在不确定时启用弃权来提高诚实性。在四个知识密集型基准上的大量实验表明,TruthRL显著减少了幻觉(例如,43.5% → 19.4%)并提高了诚实性(例如,5.3% → 37.2%),在各种骨干模型上均有一致的提升。分析表明,TruthRL的改进源于LLM识别其知识边界的能力增强,从而避免了像基线那样过于保守。

英文摘要

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that TruthRL significantly reduces hallucinations (e.g., 43.5% $\rightarrow$ 19.4%) and improves truthfulness (e.g., 5.3% $\rightarrow$ 37.2%), with consistent gains across various backbone models. Analysis shows that the improvement of TruthRL arises from enhanced capability of LLMs to recognize their knowledge boundary, hence avoiding being overly conservative as the baselines are.
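
A ternary reward of the kind described above, followed by the group-relative normalization that GRPO applies to rewards within a group of sampled responses, might look as follows. The specific values (+1, 0, -1), the exact-match check, and the abstention phrase are assumptions for illustration; the abstract only says that correct answers, abstentions, and hallucinations are distinguished.

```python
import math

ABSTAIN = "i don't know"  # hypothetical abstention phrase

def ternary_reward(answer, gold):
    """+1 for a correct answer, 0 for abstaining, -1 for a wrong (hallucinated) answer."""
    normalized = answer.strip().lower()
    if normalized == ABSTAIN:
        return 0.0
    return 1.0 if normalized == gold.strip().lower() else -1.0

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses (zero mean, unit scale)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

Under this scheme abstaining scores above a wrong answer but below a correct one, so the policy is pushed to answer when it can and to abstain otherwise. A binary correct-or-not reward cannot express that ordering.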

2509.25017 2026-06-10 cs.LG cs.CV Version update

Uncertainty-Aware Deep Learning for Wildfire Danger Forecasting

Spyros Kondylatos, Nikolas Papadopoulos, Gustau Camps-Valls, Ioannis Papoutsis

Affiliations: Aix-Marseille University; University of Cambridge; University of Malaga; University of Crete

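AI summary: Proposes an uncertainty-aware deep learning framework that jointly captures epistemic and aleatoric uncertainty, improving the accuracy and reliability of short-term wildfire danger forecasting; the F1 score improves by 2.3% and the Expected Calibration Error decreases by 2.1%.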

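Details
AI Chinese abstract (in English translation)

Wildfires are among the most severe natural hazards, posing a major threat to humans and natural ecosystems. Growing wildfire risk increases the need for forecasting models that are not only accurate but also reliable. Deep learning has shown promise for forecasting wildfire danger; however, its adoption is hindered by concerns about the reliability of its predictions, which stem in part from a lack of uncertainty quantification. To address this challenge, we propose an uncertainty-aware deep learning framework that jointly captures epistemic (model) and aleatoric (data) uncertainty to improve short-term wildfire danger forecasting. For next-day forecasting, our best-performing model improves the F1 score by 2.3% and reduces the Expected Calibration Error by 2.1% relative to a deterministic baseline, improving both predictive skill and calibration. Our experiments confirm the reliability of the uncertainty estimates and demonstrate their practical utility for decision support, including identifying uncertainty thresholds for rejecting low-confidence predictions and producing well-calibrated wildfire danger maps accompanied by uncertainty layers. Extending the forecast horizon to ten days, we observe that aleatoric uncertainty increases over time, indicating greater variability in environmental conditions, while epistemic uncertainty remains stable. Finally, we show that although the two uncertainty types may be redundant in low-uncertainty cases, they provide complementary insights under more challenging conditions, underscoring the value of joint modeling for robust wildfire danger forecasting. In summary, our approach significantly improves the accuracy and reliability of wildfire danger forecasting and advances the development of trustworthy deep learning systems for wildfires.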

English abstract

Wildfires are among the most severe natural hazards, posing a significant threat to both humans and natural ecosystems. The growing risk of wildfires increases the demand for forecasting models that are not only accurate but also reliable. Deep Learning (DL) has shown promise in predicting wildfire danger; however, its adoption is hindered by concerns over the reliability of its predictions, some of which stem from the lack of uncertainty quantification. To address this challenge, we present an uncertainty-aware DL framework that jointly captures epistemic (model) and aleatoric (data) uncertainty to enhance short-term wildfire danger forecasting. In the next-day forecasting, our best-performing model improves the F1 Score by 2.3% and reduces the Expected Calibration Error by 2.1% compared to a deterministic baseline, enhancing both predictive skill and calibration. Our experiments confirm the reliability of the uncertainty estimates and illustrate their practical utility for decision support, including the identification of uncertainty thresholds for rejecting low-confidence predictions and the generation of well-calibrated wildfire danger maps with accompanying uncertainty layers. Extending the forecast horizon up to ten days, we observe that aleatoric uncertainty increases with time, showing greater variability in environmental conditions, while epistemic uncertainty remains stable. Finally, we show that although the two uncertainty types may be redundant in low-uncertainty cases, they provide complementary insights under more challenging conditions, underscoring the value of their joint modeling for robust wildfire danger prediction. In summary, our approach significantly improves the accuracy and reliability of wildfire danger forecasting, advancing the development of trustworthy wildfire DL systems.
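
Since the abstract reports calibration through the Expected Calibration Error (ECE), a short sketch of the standard equal-width-bin ECE may help in reading the 2.1% figure. This is the textbook definition, not code from the paper: the bin count of 10, the function name `expected_calibration_error`, and the toy inputs are assumptions for illustration. Each prediction contributes a confidence in [0, 1] and a flag saying whether it was correct; ECE is the sample-weighted average gap between accuracy and mean confidence per bin.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Toy data: four predictions at 0.9 confidence, three of them correct.
    confs = [0.9, 0.9, 0.9, 0.9]
    hits = [True, True, True, False]
    print(expected_calibration_error(confs, hits))
```

In the toy data the model claims 90% confidence but is right 75% of the time, so the single occupied bin has a gap of about 0.15. A lower ECE means stated confidence tracks observed accuracy more closely.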

2509.24710 2026-06-10 stat.ML cs.LG cs.NA math.NA Version update

MAD: Manifold Attracted Diffusion

Dennis Elbrächter, Giovanni S. Alberti, Matteo Santacesaria

Affiliations: Department of Mathematics, University of Vienna; MaLGa Center, Department of Mathematics, University of Genoa

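AI summary: Proposes Manifold Attracted Diffusion, which exploits the manifold hypothesis and uses an extended score function to remove noise at inference time and generate noiseless samples; effectiveness is validated on toy problems, synthetic data, and real data.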

Details
Journal ref
Forty-third International Conference on Machine Learning, 2026
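AI Chinese abstract (in English translation)

Score-based diffusion models are an efficient way to generate samples from a distribution of images. We consider the case where the training data comes from a noisy version of the target distribution, and propose an efficiently implementable modification of the inference procedure that generates noiseless samples. Our approach is inspired by the manifold hypothesis, which holds that meaningful data concentrates around a low-dimensional manifold in a high-dimensional ambient space. The core idea is that noise appears as low-magnitude variation in off-manifold directions, whereas the relevant variation of the target distribution is mostly confined to on-manifold directions. We introduce the notion of an extended score and show that, in a simplified setting, it can reduce small variations to zero while leaving large variations essentially unchanged. We describe how an approximation of it can be computed efficiently from an approximation of the standard score, and demonstrate its effectiveness on toy problems, synthetic data, and real data.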

English abstract

Score-based diffusion models are a highly effective method for generating samples from a distribution of images. We consider scenarios where the training data comes from a noisy version of the target distribution, and present an efficiently implementable modification of the inference procedure to generate noiseless samples. Our approach is motivated by the manifold hypothesis, according to which meaningful data is concentrated around some low-dimensional manifold of a high-dimensional ambient space. The central idea is that noise manifests as low magnitude variation in off-manifold directions in contrast to the relevant variation of the desired distribution which is mostly confined to on-manifold directions. We introduce the notion of an extended score and show that, in a simplified setting, it can be used to reduce small variations to zero, while leaving large variations mostly unchanged. We describe how its approximation can be computed efficiently from an approximation to the standard score and demonstrate its efficacy on toy problems, synthetic data, and real data.