arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.07574 2026-05-12 cs.CV

PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

Yuliang Li, Chu Zhou, Heng Guo, Boxin Shi, Imari Sato, Zhanyu Ma

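AI summary: Mainstream vision-language models (VLMs) rely on standard RGB inputs and therefore struggle with optically ambiguous scenes such as reflections and transparent objects. To address this, the paper proposes PolarVLM, the first multimodal framework to integrate polarimetric physical parameters into a VLM; a dual-stream architecture and a progressive training strategy prevent physical misinterpretation while preserving general visual ability. The authors also build PolarVQA, the first visual question answering benchmark for polarization awareness. Experiments show that PolarVLM substantially outperforms the RGB baseline across multiple tasks, with especially large gains in reflection recognition and glass counting.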

Comments 23 pages, 12 figures, including appendices

Abstract

Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.
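
For readers unfamiliar with polarimetric parameters, the following is a minimal sketch of how the linear Stokes parameters, the degree of linear polarization (DoLP), and the angle of linear polarization (AoLP) are conventionally derived from four polarizer-angle intensities. The abstract does not state which parameters PolarVLM actually consumes, so this only illustrates the standard quantities; the function name `linear_polarization` is hypothetical.

```python
import math


def linear_polarization(i0, i45, i90, i135):
    """Return (s0, s1, s2, dolp, aolp) for a single pixel.

    Inputs are intensities measured behind linear polarizers oriented at
    0, 45, 90 and 135 degrees (an assumed four-angle capture setup).
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical component
    s2 = i45 - i135                     # diagonal vs. anti-diagonal component
    # DoLP is the polarized fraction of the light; guard against dark pixels.
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    # AoLP is the orientation of the polarization, in radians.
    aolp = 0.5 * math.atan2(s2, s1)
    return s0, s1, s2, dolp, aolp
```

Fully horizontally polarized light of unit intensity corresponds to the inputs `(1.0, 0.5, 0.0, 0.5)` and yields a DoLP of 1 and an AoLP of 0, whereas unpolarized light gives a DoLP of 0.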

2605.07429 2026-05-12 cs.CV

Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework

Linxiao Shi, Siming Zheng, Zerong Wang, Hao Zhang, Jinwei Chen, Bo Li, Shifeng Chen, Peng-Tao Jiang

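AI summary: Optical design constraints make it hard for current mobile devices to produce natural optical bokeh. To address this, the paper proposes MagicBokeh, a unified diffusion-based method that efficiently generates high-quality, photorealistic bokeh. Using an alternative training strategy and a focus-aware masked attention mechanism, it jointly optimizes bokeh rendering and super-resolution, markedly improving controllability and visual realism, and it adds a degradation-aware depth module to improve depth estimation on low-quality inputs. Experiments show that MagicBokeh efficiently produces highly realistic bokeh on real low-resolution images, offering a new direction for future bokeh rendering research.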

Comments Accepted by CVPR 2026

Abstract

Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce a degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. Our code and models are available at https://github.com/vivoCameraResearch/MagicBokeh.

2605.07384 2026-05-12 cs.LG

StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models

Panqi Chen, Yifan Sun, Shikai Fang, Xiao Fu, Lei Cheng

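AI summary: StreamPhy is an end-to-end framework for real-time inference of high-dimensional physical field dynamics from irregular, sparse measurements. It combines an adaptive observation encoder, a structured state-space model, and an efficient FT-FiLM decoder, enabling memory-efficient online updates across irregular time intervals and high-accuracy field generation. The authors prove that FT-FiLM is more expressive than conventional functional tensor models, and experiments on several physical systems show higher accuracy and faster inference than existing methods.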

Abstract

Inferring the evolution of high-dimensional and multi-modal (e.g., spatio-temporal) physical fields from irregular sparse measurements in real time is a fundamental challenge in science and engineering. Existing approaches, including diffusion-based generative models and functional tensor methods, typically operate in offline settings, depend on full temporal observations, or incur substantial inference cost. We propose StreamPhy, an end-to-end framework that enables efficient and accurate streaming inference of full-field physical dynamics from incoming irregular sparse measurements. The framework integrates a data-adaptive observation encoder that is robust to arbitrary observation patterns, a structured state-space model that supports memory-efficient online updates across irregular time intervals, and an expressive Functional Tensor Feature-wise Linear Modulation (FT-FiLM) decoder for continuous-field generation. We prove that FT-FiLM is more expressive than the functional Tucker model, admitting a richer function class for handling complex dynamics. Experiments on three representative physical systems under challenging sampling patterns show that StreamPhy consistently outperforms state-of-the-art baselines, with at least 48% improvement in accuracy and up to 20-100x faster inference than diffusion-based methods.
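
To illustrate how a state-space model can be updated online across irregular time gaps, here is a minimal sketch of a diagonal linear system dh/dt = a*h + b*u discretized exactly under a zero-order hold. This is a generic textbook construction, not StreamPhy's actual layer; the names `ssm_step` and `run_stream`, the scalar input, and the requirement that every `a[i]` is nonzero (typically negative, for stability) are assumptions made for this example.

```python
import math


def ssm_step(h, u, dt, a, b):
    """Advance the state h by an arbitrary time gap dt.

    Each state channel follows dh_i/dt = a_i * h_i + b_i * u, with the
    scalar input u assumed constant over the gap (zero-order hold).
    """
    new_h = []
    for h_i, a_i, b_i in zip(h, a, b):
        decay = math.exp(a_i * dt)
        # Exact solution of the linear ODE over an interval of length dt.
        new_h.append(decay * h_i + (decay - 1.0) / a_i * b_i * u)
    return new_h


def run_stream(events, a, b):
    """Consume (dt, u) events one at a time, storing only the current state."""
    h = [0.0] * len(a)
    for dt, u in events:
        h = ssm_step(h, u, dt, a, b)
    return h
```

Because the discretization is exact, two consecutive updates over gaps of 0.3 and 0.4 with the same input give the same state as a single update over 0.7, which is what makes the recurrence indifferent to how measurements are spaced in time. Memory use stays constant because only `h` is carried between events.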

2605.07177 2026-05-12 cs.LG cs.AI

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Guankai Li, Jiabin Chen, Yi Xu, Xichen Zhang, Yuan Lu

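AI summary: Existing multimodal search agents typically process target entities sequentially, which produces redundant interaction rounds when a query decomposes into several independent retrieval tasks. The paper proposes HyperEyes, a parallel multimodal search agent trained with dual-grained efficiency-aware reinforcement learning. It fuses visual grounding and retrieval into a single atomic action, searches multiple entities concurrently, and treats inference efficiency as a core training objective. HyperEyes uses a two-stage training strategy that combines a parallel-amenable data synthesis pipeline with the dual-grained reinforcement learning framework, improving both search efficiency and accuracy, and the authors introduce IMEB, a new benchmark that evaluates search capability and efficiency together.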

Comments Code & Data: https://github.com/DeepExperience/HyperEyes

Abstract

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
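
The difference between searching "longer" and searching "wider" can be sketched with Python's `asyncio`. This is only an illustration of sequential versus concurrent dispatch; `search_entity` is a hypothetical stand-in for a grounded retrieval tool call, and nothing here reflects the actual HyperEyes agent or its training.

```python
import asyncio


async def search_entity(entity):
    """Hypothetical tool call: look up one entity (latency is simulated)."""
    await asyncio.sleep(0.01)
    return f"result for {entity}"


async def search_sequential(entities):
    """One tool call per round: the round count grows with the entity count."""
    results = {}
    rounds = 0
    for entity in entities:
        results[entity] = await search_entity(entity)
        rounds += 1
    return results, rounds


async def search_parallel(entities):
    """Dispatch all independent sub-retrievals concurrently in a single round."""
    outputs = await asyncio.gather(*(search_entity(e) for e in entities))
    # asyncio.gather returns results in the order the awaitables were passed.
    return dict(zip(entities, outputs)), 1
```

Running both with `asyncio.run(...)` on the same entity list returns identical results, but the sequential variant uses one round per entity while the parallel variant uses a single round.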

2605.06856 2026-05-12 cs.LG cs.CL

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

Ishani Mondal, Shweta Bhardwaj

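AI summary: The paper argues that generative AI systems perform well on standard benchmarks yet struggle to deliver practical utility in real applications, a problem observed across 28 deployment cases in education, healthcare, software engineering, and law. It attributes the disconnect between evaluation results and real utility to three flaws in current evaluation practice: proxy displacement, temporal collapse, and distributional concealment. The paper proposes SCU-GenEval, a new evaluation framework that measures an AI system's real value through the effects of long-term interaction, grounded in human goals and context, and introduces several practical instruments to support adoption of this evaluation paradigm.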

Comments 20 pages

Abstract

Generative AI systems achieve impressive performance on standard benchmarks yet fail to deliver real-world utility, a disconnect we identify across 28 deployment cases spanning education, healthcare, software engineering, and law. We argue that this benchmark utility gap arises from three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment. Motivated by these observations, we argue that generative AI evaluation requires a paradigm shift from static benchmark-centered transparency toward stakeholder, goal, and context-conditioned utility transparency grounded in human outcome trajectories. Existing evaluations primarily characterize properties of model outputs, while deployment success depends on whether interaction with AI improves stakeholders' ability to achieve their goals over time. The missing construct is therefore utility: the change in a stakeholder's capability induced through sustained interaction with an AI system within a deployment context. To operationalize this perspective, we propose SCU-GenEval, a four-stage evaluation framework consisting of stakeholder-goal mapping, construct-indicator specification, mechanism modeling, and longitudinal utility measurement. To make these stages practically deployable, we introduce three supporting instruments: structured deployment protocols, context-conditioned user simulators, and persona- and goal-conditioned proxy metrics. We conclude with domain-specific calls to action, arguing that progress in generative AI must be evaluated through measurable improvements in human outcomes rather than benchmark performance alone.

2605.06644 2026-05-12 cs.LG

Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

Yuchen Xiong, Swee Keong Yeap, Steven Aw Yoong Kit

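AI summary: The study proposes a mechanism graph algorithm based on the 3D structure of the mature chromophore region for predicting fluorescent protein quantum yield. The method converts protein structures into region-partitioned 3D residue graphs, uses signal-channel propagation to capture how local physical signals act on chromophore regions, and builds a representation of 121 features, of which 52 non-identity features are used for regression. It performs well on several benchmarks, outperforms existing models on remote homologs in particular, and reveals region-specific mechanisms across different fluorescent proteins.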

Comments Includes appendix; source code, processed feature tables and evaluation scripts are available from the first author upon reasonable request

Abstract

Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local physical signals act on specific chromophore regions. We present a chromophore-centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate, bridge and imidazolinone regions, and transformed by channel-signal-region propagation. The representation contains 121 enrichment features; after removing identity shortcuts, 52 non-identity features are used for band-specific ExtraTrees regression. Because each feature encodes a contact channel, seed signal and target CRO region, interpretation is intrinsic rather than post hoc. On a 531-protein benchmark, the method achieved the best random-CV performance among model-based baselines (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002), exceeding Band mean (R = 0.632), ESM-C (R = 0.734) and SaProt (R = 0.731), and ranked first in bright screening (Bright P@5 = 0.704). Under homology control, the advantage was clearest in the remote bucket (<50% similarity; R = 0.697 versus 0.633, 0.575 and 0.408), with the strongest overall bright/dark Top-K screening. Stable selected features recovered band-specific mechanisms: aromatic packing and clamp asymmetry in GFP-like proteins, charge/clamp balance in Red proteins, and flexibility-risk/bulky-contact features in Far-red proteins. Source code, feature tables and evaluation scripts are available from the first author upon request. Contact: yuchenak05@gmail.com

2605.06366 2026-05-12 cs.LG

Layer Collapse in Diffusion Language Models

Alexander Conzelmann, Albert Catalan-Tatjer, Shiwei Liu

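AI summary: The paper studies the "layer collapse" phenomenon in diffusion language models (DLMs): the activation patterns of early layers are highly similar and dominated by a single super-outlier, a structure that remains stable over long token ranges. Although the outlier appears redundant, it is critical to model output; removing it causes outputs to degenerate into repetitive random sequences. The study also shows that redundancy in DLMs is distributed in the opposite way to autoregressive models, concentrating in shallow layers, and that layer collapse is caused by overtraining rather than undertraining, which has important practical implications for model compression and deployment.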

Comments 9 Pages, Preprint

Abstract

Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier persisting over a long token range. Despite its apparent redundancy, this outlier is critical: pruning it causes outputs to degrade into repetitive random token loops. Paradoxically, layers in LLaDA contain more redundant representations overall, with redundancy most pronounced in earlier layers, the reverse of AR models, where deeper layers grow redundant due to undertraining. Our analysis indicates that layer collapse in DLMs is not driven by undertraining but by overtraining: a dominant outlier becomes an indispensable information carrier while remaining representations collapse into redundant structure. These findings have strong practical implications, verified through controlled pre-training experiments. DLMs are surprisingly robust to compression: LLaDA under 3-bit GPTQ quantization loses only 1.8% on GSM8K, whereas Llama-3.1-8B loses 64.7%. Optimal sparsity allocation also reverses between families: at 50% average sparsity, allocating more to early layers in LLaDA yields +8.4% over the reverse strategy, while the same allocation costs Llama 8.4%. Our findings reveal that the DLM training objective fundamentally reshapes layer dynamics relative to AR models, with direct consequences for compression and deployment. Code: github.com/Conzel/super-outlier-dlm.

2605.06042 2026-05-12 cs.RO

Accurate Trajectory Tracking with MPCC for Flapping-Wing MAVs

Charbel Toumieh, Jack Zeng, Niel Mistry, Dario Floreano

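AI summary: The paper addresses high-precision trajectory tracking for flapping-wing micro aerial vehicles (MAVs), whose lift, airspeed, and turning are tightly coupled and whose control inputs are limited, and proposes a control method based on Model Predictive Contouring Control (MPCC). The method tracks arc-length-parameterized trajectories and optimizes progress in real time without a predefined timing profile, and it uses a compact, continuously differentiable dynamics model that captures the coupled aerodynamics of flapping-wing vehicles. Experiments show centimeter-level trajectory deviation on complex three-dimensional trajectories, a substantial improvement over existing methods.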

Comments 7 pages, 6 figures

Abstract

Flapping-wing micro aerial vehicles offer quieter and safer operation than rotary-wing drones, yet achieving precise autonomous control of bird-scale ornithopters remains challenging: lift, airspeed, and turning authority are tightly coupled and governed by only a few control inputs. Conventional cascaded controllers treat altitude, speed, and heading independently, producing persistent tracking errors during complex maneuvers, while time-parameterized trajectory tracking requires predefined speed profiles that existing methods cannot robustly produce for these coupled dynamics. We address both limitations simultaneously with a Model Predictive Contouring Control (MPCC) approach that tracks arc-length-parameterized trajectories while optimizing progress online, eliminating the need for predefined timing. However, MPCC requires a dynamical model that captures the coupled aerodynamics without exceeding the computational budget of real-time nonlinear optimization. Here, we propose a compact, continuously differentiable model that captures the dominant couplings of bird-scale ornithopters, enabling real-time predictive control. We validated the method with the XFly ornithopter flying along circular and three-dimensional racing trajectories and achieved a mean deviation from the reference trajectory between 6.5 and 9 cm at speeds up to 3 m/s, which represents an almost 10-fold improvement over prior ornithopter control methods.
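
The core of contouring control is splitting the tracking error into a component perpendicular to the path (contouring error) and a component along it (lag error), then rewarding progress along the arc length. The sketch below shows this decomposition for a planar circular reference; it is a generic MPCC-style stage cost, not the paper's controller, and the weights `q_c`, `q_l`, `mu` and all function names are assumptions chosen for illustration.

```python
import math


def circle_reference(theta, radius):
    """Point and unit tangent on a circle, parameterized by arc length theta."""
    angle = theta / radius
    point = (radius * math.cos(angle), radius * math.sin(angle))
    tangent = (-math.sin(angle), math.cos(angle))
    return point, tangent


def contouring_errors(position, theta, radius):
    """Return (contouring error, lag error) of position relative to theta."""
    (px, py), (tx, ty) = circle_reference(theta, radius)
    ex, ey = position[0] - px, position[1] - py
    lag = ex * tx + ey * ty       # error along the path tangent
    contour = -ex * ty + ey * tx  # error along the path normal (-ty, tx)
    return contour, lag


def stage_cost(position, theta, theta_rate, radius, q_c=10.0, q_l=100.0, mu=0.5):
    """Penalize both errors and reward progress (theta_rate) along the path."""
    contour, lag = contouring_errors(position, theta, radius)
    return q_c * contour ** 2 + q_l * lag ** 2 - mu * theta_rate
```

Because the progress variable `theta` is itself optimized, the controller decides how fast to move along the path instead of following a predefined speed profile. A large lag weight keeps `theta` close to the true projection of the vehicle onto the path.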

2605.05812 2026-05-12 cs.AI

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

Armaan A. Abraham, Lucy Xiaoyang Shi, Chelsea Finn

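AI summary: The paper studies the instability that bootstrapping error causes in value-based off-policy reinforcement learning on long-horizon tasks and proposes long-horizon Q-learning (LQL). LQL introduces n-step inequality constraints and uses a hinge loss to correct value estimates, which suppresses error accumulation without additional networks or compute overhead. Experiments show that LQL outperforms conventional 1-step and n-step TD learning on several online and offline-to-online benchmarks.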

Abstract

Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a practical stabilization mechanism for Q-learning by using a hinge loss to penalize violations of these bounds. Importantly, LQL computes these penalties using network outputs already produced for the TD error, requiring no auxiliary networks and no additional forward passes relative to Q-learning. When combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step TD and n-step TD learning at similar runtime.
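
The inequality described above can be sketched as follows: the discounted return of the n observed rewards plus the discounted best value at the state reached afterwards lower-bounds the optimal value of the first state-action pair (in expectation), and a hinge term is nonzero only when the current estimate falls below that bound. The exact loss form, weighting, and the use of target networks in LQL are not given in the abstract, so the unsquared hinge and the function names here are assumptions.

```python
def n_step_lower_bound(rewards, bootstrap_value, gamma):
    """Discounted n-step return plus the discounted bootstrap value.

    rewards: observed rewards r_t, ..., r_{t+n-1}
    bootstrap_value: max over actions of Q(s_{t+n}, a), taken from the same
        network outputs that a TD update would already compute
    """
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total + (gamma ** len(rewards)) * bootstrap_value


def hinge_penalty(q_value, rewards, bootstrap_value, gamma):
    """Penalize Q(s_t, a_t) only when it violates the n-step lower bound."""
    bound = n_step_lower_bound(rewards, bootstrap_value, gamma)
    return max(0.0, bound - q_value)
```

With rewards `[1.0, 1.0]`, a bootstrap value of `10.0`, and `gamma = 0.5`, the bound is `1 + 0.5 + 0.25 * 10 = 4.0`, so an estimate of 3.0 incurs a penalty of 1.0 and an estimate of 5.0 incurs none. Unlike an n-step TD target, the one-sided hinge never pulls the estimate down toward a suboptimal observed action sequence.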

2605.05775 2026-05-12 cs.CV cs.AI

The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT – Multitracer Multicenter Generalization

Jakob Dexl, Katharina Jeblick, Andreas Mittermeier, Balthasar Schachtner, Anna Theresa Stüber, Johanna Topalis, Maximilian Rokuss, Fabian Isensee, Klaus H. Maier-Hein, Hamza Kalisch, Jens Kleesiek, Constantin M. Seibold, Hussain Alasmawi, Lap Yan Lennon Chan, Yixuan Yuan, Alexander Jaus, Rainer Stiefelhagen, Pauline Ornela Megne Choudja, Konstantin Nikolaou, Christian La Fougère, Sergios Gatidis, Matthias P. Fabritius, Maurice Heimer, Gizem Abaci, Lalith Kumar Shiyam Sundar, Rudolf A. Werner, Jens Ricke, Clemens C. Cyran, Thomas Küstner, Michael Ingrisch

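AI summary: The paper presents the design and results of the third autoPET challenge (MICCAI 2024), which evaluates how well algorithms for automated lesion segmentation in whole-body PET/CT generalize in multitracer, multicenter settings. The study uses a large annotated dataset from two hospitals and evaluates algorithms on a test set that includes unseen tracer-center combinations; the best algorithm outperforms the baseline on several metrics. The authors find that current algorithms perform well on in-domain multitracer segmentation but still struggle to generalize across centers and tracers, and that performance differences are driven mainly by data heterogeneity and case difficulty.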

Comments Preprint submitted to Medical Image Analysis

Abstract

We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. Ranking was stable across bootstrap resampling and alternative ranking schemes for the top tier. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
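
For reference, the Dice similarity coefficient (DSC) quoted above measures the overlap between a predicted and a reference mask. The sketch below computes it on flattened binary masks; treating two empty masks as a perfect score is a convention assumed here, not something the abstract specifies, and the challenge's own evaluation code may differ. The volume metrics (FNV, FPV) are not shown.

```python
def dice_score(pred, truth):
    """Dice coefficient for two equal-length sequences of 0/1 voxel labels."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

For example, a prediction `[1, 1, 0, 0]` against a reference `[1, 0, 1, 0]` shares one foreground voxel out of two in each mask, giving a DSC of 0.5.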

2605.05373 2026-05-12 cs.LG

Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

David Leeftink, Max Hinne, Marcel van Gerven

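AI summary: The paper studies how to improve the decision-making of reinforcement learning agents in partially observable environments and proposes neural co-state policies, grounded in the Pontryagin minimum principle (PMP) from optimal control. By establishing a formal link between the hidden states of recurrent neural networks and the PMP co-states, the method makes the network's internal dynamics interpretable, and it introduces a co-state loss that explicitly structures the hidden states. Experiments show strong performance on partially observable tasks and robustness to out-of-distribution sensor masking.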

Comments 17 pages, 5 figures

Abstract

A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement learning address this by encoding history into latent state representations, their internal dynamics remain uninterpretable black boxes. This paper establishes a formal link between these hidden states and the Pontryagin minimum principle (PMP) from optimal control. We demonstrate that for standard recurrent architectures, latent representations map directly to PMP co-states, which allows the readout layer to be interpreted as performing Hamiltonian minimization. Because standard reward maximization does not naturally discover this alignment, we introduce a PMP-derived co-state loss to explicitly structure the internal dynamics. Empirically, this approach matches or improves performance on partially observable DMControl tasks, and is robust against zero-shot out-of-distribution sensor masking. By framing recurrent networks as dynamic processes governed by the minimum principle, we provide a principled approach to designing robust continuous control policies.
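
As background, the Pontryagin minimum principle can be stated in its standard textbook form for dynamics $\dot{x} = f(x, u)$, running cost $\ell(x, u)$, and terminal cost $\Phi(x(T))$. The notation below is the conventional one and is assumed here; the paper's own symbols and its co-state loss are not given in the abstract.

```latex
\begin{aligned}
H(x, u, \lambda) &= \ell(x, u) + \lambda^{\top} f(x, u), \\
\dot{x}(t) &= f\bigl(x(t), u^{*}(t)\bigr), \\
\dot{\lambda}(t) &= -\frac{\partial H}{\partial x}\bigl(x(t), u^{*}(t), \lambda(t)\bigr),
  \qquad \lambda(T) = \frac{\partial \Phi}{\partial x}\bigl(x(T)\bigr), \\
u^{*}(t) &= \arg\min_{u} H\bigl(x(t), u, \lambda(t)\bigr).
\end{aligned}
```

Read against the abstract, the co-state $\lambda$ plays the role that the recurrent hidden state is claimed to map to, and the last line is the Hamiltonian minimization that the readout layer is interpreted as performing.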

2605.04617 2026-05-12 cs.CV cs.HC cs.LG

Temporal Structure Matters for Efficient Test-Time Adaptation in Wearable Human Activity Recognition

Zishu Zhou, Zaipeng Xie, Xuanyao Jie

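AI summary: Wearable human activity recognition models often degrade under real-world cross-user distribution shifts, and existing test-time adaptation methods mostly inherit assumptions from vision tasks without fully exploiting the temporal structure of activity recognition streams. The paper revisits temporal structure as a feature-conditioned inference signal and proposes an adaptation mechanism based on temporal continuity and feature deviation that decides when to preserve or release temporal inertia and where to route prediction refinement. Building on this, the authors design SIGHT, a lightweight framework that adapts in real time without backpropagation; experiments show that it outperforms existing methods on real-world datasets while reducing compute and memory costs.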

Abstract

Wearable human activity recognition (WHAR) models often suffer from performance degradation under real-world cross-user distribution shifts. Test-time adaptation (TTA) mitigates this degradation by adapting models online using unlabeled test streams, yet existing methods largely inherit assumptions from vision tasks and underexploit the inherent inter-window temporal structure in WHAR streams. In this paper, we revisit such temporal structure as a feature-conditioned inference signal rather than merely an output-space smoothing prior. We derive the insight that temporal continuity and observation-induced feature deviations provide complementary cues for determining when to preserve or release temporal inertia and where to route prediction refinement during likely transitions. Building upon this insight, we propose SIGHT, a lightweight and backpropagation-free TTA framework for WHAR, enabling real-time edge deployment. SIGHT estimates predictive surprise by comparing the current feature with a prototype-based expected state, and then uses the resulting feature deviation to guide geometry-aware transition routing based on prototype alignment and stream-level marginal habit tracking. Evaluations on real-world datasets confirm that SIGHT outperforms existing TTA baselines while reducing computational and memory costs.

2605.04541 2026-05-12 cs.CV

Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection

Muyao Peng, Shun Zou, Pei An, You Yang, Qiong Liu

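AI summary: The paper proposes Angle-I2P, an image-to-point-cloud registration method that targets cases where a low inlier ratio prevents conventional PnP methods from registering accurately. It introduces an angle-consistency constraint and a hierarchical attention mechanism to improve the robustness and accuracy of registration. Experiments show that Angle-I2P achieves state-of-the-art registration results on several datasets, including public benchmarks.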

Comments Accepted by ICRA 2026

Abstract

Image-to-point-cloud registration (I2P) is a fundamental task in robotic applications such as manipulation, grasping, and localization. Existing deep learning-based I2P methods seek to align image and point cloud features in a learned representation space to establish correspondences, and have achieved promising results. However, when the inlier ratio of the initial matching pairs is low, conventional Perspective-n-Points (PnP) methods may struggle to achieve accurate results. To address this limitation, we propose Angle-I2P, an outlier rejection network that leverages angle-consistent geometric constraints and hierarchical attention. First, we design a scale-invariant, cross-modality geometric constraint based on angular consistency. This explicit geometric constraint guides the model in distinguishing inliers from outliers. Furthermore, we propose a global-to-local hierarchical attention mechanism that effectively filters out geometrically inconsistent matches under rigid transformation, thereby improving the Inlier Ratio (IR) and Registration Recall (RR). Experimental results demonstrate that our method achieves state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset, with consistent improvements across all benchmarks.

2605.03650 2026-05-12 cs.CV cs.AI cs.LG

Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

Zhiyuan Li, Rongzhen Zhao, Wenyan Yang, Wenshuai Zhao, Pekka Marttinen, Joni Pajarinen

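AI summary: The paper rethinks temporal consistency in video object-centric learning and argues that current methods, which rely on dynamics modules to predict future object representations, are in effect expensive approximations of a discrete correspondence problem. The authors propose Grounded Correspondence, a framework that initializes object slots from salient regions extracted by a frozen backbone and maintains frame-to-frame identity through Hungarian matching. It needs no learnable parameters for temporal modeling yet achieves competitive performance on several datasets.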

Abstract

The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: https://magenta-sherbet-85b101.netlify.app/
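
Maintaining slot identity by bipartite matching can be sketched as below. To stay dependency-free, this example finds the optimal assignment by exhaustive search over permutations, which gives the same result as the Hungarian algorithm for small slot counts but scales factorially; in practice one would use a Hungarian solver such as `scipy.optimize.linear_sum_assignment`. The squared-distance cost, the equal slot counts, and the function names are assumptions, not details from the paper.

```python
from itertools import permutations


def squared_distance(a, b):
    """Squared Euclidean distance between two slot vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_slots(prev_slots, curr_slots):
    """Return a list where entry i is the index of the current-frame slot
    matched to previous-frame slot i (minimum total cost assignment)."""
    n = len(prev_slots)
    cost = [[squared_distance(p, c) for c in curr_slots] for p in prev_slots]
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)


def reorder(curr_slots, assignment):
    """Reorder current slots so that index i keeps the identity of slot i."""
    return [curr_slots[j] for j in assignment]
```

The matching step has no trainable parameters: given slot features from two frames, the assignment is fully determined by the cost matrix.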

2605.03639 2026-05-12 cs.CV

Diffusion Masked Pretraining for Dynamic Point Cloud

Zhuoyue Zhang, Jihua Zhu, Chaowei Fang, Jian Liu, Ajmal Saeed Mian

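AI summary: The paper proposes DiMP, a unified self-supervised pretraining framework for dynamic point clouds. By introducing diffusion modeling, it addresses two problems in existing masked reconstruction objectives: spatio-temporal positional leakage and the loss of motion uncertainty. DiMP applies diffusion modeling to both positional inference and motion learning. It predicts clean tube centers from the visible spatio-temporal context, which improves the positional representation, and it recasts inter-frame displacement supervision as a noise-prediction task for a conditional diffusion model, which models the conditional distribution of motion more completely. Experiments show that DiMP consistently improves performance on several downstream tasks.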

Abstract

Dynamic point cloud pretraining is still dominated by masked reconstruction objectives. However, these objectives inherit two key limitations. Existing methods inject ground-truth tube centers as decoder positional embeddings, causing spatio-temporal positional leakage. Moreover, they supervise inter-frame motion with deterministic proxy targets that systematically discard distributional structure by collapsing multimodal trajectory uncertainty into conditional means. To address these limitations, we propose Diffusion Masked Pretraining (DiMP), a unified self-supervised framework for dynamic point clouds. DiMP introduces diffusion modeling into both positional inference and motion learning. It first applies forward diffusion noise only to masked tube centers, then predicts clean centers from visible spatio-temporal context. This removes positional leakage while preserving visible coordinates as clean temporal anchors. DiMP also reformulates point-wise inter-frame displacement supervision as a DDPM noise-prediction objective conditioned on decoded representations. This design drives the encoder to target the full conditional distribution of plausible motions under a variational surrogate, rather than collapsing to a single deterministic estimate. Extensive experiments demonstrate that DiMP consistently improves downstream accuracy over the backbone alone, with absolute gains of 11.21% on offline action segmentation and 13.65% under causally constrained online inference. Code is available at https://github.com/InitalZ/DiMP.git.
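
For context, the standard DDPM forward process and noise-prediction objective are written out below. Mapping $x_0$ to the point-wise inter-frame displacement and $c$ to the decoded representation is an assumption made here to connect the formula to the abstract; the paper's exact parameterization is not stated.

```latex
\begin{aligned}
x_t &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I), \\
\mathcal{L}_{\text{simple}} &= \mathbb{E}_{x_0,\, t,\, \epsilon}
  \Bigl[ \bigl\lVert \epsilon - \epsilon_\theta(x_t, t, c) \bigr\rVert^2 \Bigr].
\end{aligned}
```

Here $\bar{\alpha}_t$ is the cumulative product of the noise schedule and $\epsilon_\theta$ is the denoising network. Because the target is the sampled noise and not a single regressed displacement, the objective does not collapse multimodal motion into a conditional mean.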

2605.01643 2026-05-12 cs.LG cs.AI

AI Alignment via Incentives and Correction

Rohit Agarwal, Joshua Lin, Mark Braverman, Elad Hazan

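AI summary: Drawing on law-and-economics models of deterrence and enforcement, the paper studies AI alignment and argues that misconduct in AI systems is a strategic response to incentives rather than a purely external failure. It frames alignment as an equilibrium problem in which reward design shapes the interaction between a solver and an auditor. The authors propose a bandit-based procedure for searching over reward profiles and validate it on an LLM code-generation pipeline.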

Abstract

We study AI alignment through the lens of law-and-economics models of deterrence and enforcement. In these models, misconduct is not treated as an external failure, but as a strategic response to incentives: an actor weighs the gain from violation against the probability of detection and the severity of punishment. We argue that the same logic arises naturally in agentic AI pipelines. A solver may benefit from producing a persuasive but incorrect answer, hiding uncertainty, or exploiting spurious shortcuts, while an auditor or verifier must decide whether costly monitoring is worthwhile. Alignment is therefore a fixed-point problem: stronger penalties may deter solver misbehavior, but they can also reduce the auditor's incentive to inspect, since auditing then mainly incurs cost on a population that appears increasingly aligned. This perspective also changes what should count as a post-training signal. Standard feedback often attaches reward to the final answer alone, but a solver-auditor pipeline exposes the full correction event: whether the solver erred, whether the auditor inspected, whether the error was caught, and whether oversight incentives remained active. We formalize this interaction in a two-agent model in which a principal chooses rewards over joint correction outcomes, inducing both solver behavior and auditor monitoring. Reward design is therefore a bilevel optimization problem: rewards are judged not by their immediate semantic meaning, but by the behavioral equilibrium they induce. We propose a bandit-based outer-loop procedure for searching over reward profiles using noisy interaction feedback. Experiments on an LLM coding pipeline show that adaptive reward profiles can maintain useful oversight pressure and improve principal-aligned outcomes relative to static hand-designed rewards, including a substantial reduction in hallucinated incorrect attempts.

2605.00539 2026-05-12 cs.CL cs.DC

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

Wenxiang Lin, Juntao Huang, Luhan Zhang, Laili Li, Xiang Bao, Mengyang Zhang, Bing Wang, Shaohuai Shi

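AI summary: The paper proposes AGoQ, a quantization method that improves the memory efficiency of distributed training for large language models. It introduces a layer-aware activation quantization algorithm that achieves near 4-bit activation storage and an 8-bit gradient quantization algorithm that provides communication-efficient gradient storage, which together substantially reduce memory use and speed up training. Experiments on several large LLaMA models show clear advantages over existing systems in memory consumption and training speed while preserving convergence and task accuracy.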

Abstract

Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet current approaches are ineffective for 4-bit activations and 8-bit gradients, which easily cause slow convergence or accuracy loss. To address this, we introduce AGoQ, incorporating two new techniques: 1) a layer-aware activation quantization algorithm that allocates appropriate bit-widths for activations of various layers based on their types and pipeline stages to achieve near 4-bit activation storage, and 2) a gradient quantization algorithm that reduces memory usage and shortens communication time by employing 8-bit gradient storage and precision-preserving 8-bit All-Reduce communication. We conduct extensive experiments using different sizes of LLMs on two GPU clusters (up to 64 GPUs), and the experimental results show that AGoQ reduces memory by up to 52% and achieves up to 1.34× faster training compared to the state-of-the-art training systems Megatron-LM (with or without ZeRO), COAT, and DeepSpeed on 8B to 32B LLaMA models, while preserving pretraining convergence and achieving comparable accuracy on downstream tasks with LLaMA architectures.
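
As a point of reference for what 8-bit storage involves, the sketch below shows generic symmetric per-tensor int8 quantization with an absolute-maximum scale. It is not AGoQ's algorithm, whose layer-aware bit allocation and precision-preserving All-Reduce are not detailed in the abstract; the function names and the per-tensor scaling choice are assumptions.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] plus one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Assumed convention: an all-zero tensor uses a scale of 1.0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]
```

Each value is stored in one byte instead of two or four, and the reconstruction error per element is at most about half the scale. A single large outlier inflates the scale for the whole tensor, which is one reason naive low-bit schemes can hurt convergence.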

2605.00370 2026-05-12 cs.LG cs.CY cs.MM

Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

Chunlei Meng, Pengbin Feng, Rong Fu, Hoi Leong Lee, Xiaojing Du, Zhaolu Kang, Zeyu Zhang, Weilin Zhou, Chun Ouyang, Zhongxue Gan

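AI summary This paper proposes Group Cognition Learning (GCL), a collaborative learning framework that targets modality dominance and spurious modality coupling in multimodal learning. GCL uses a two-stage collaboration mechanism: in the first stage, a routing agent and an auditing agent selectively promote beneficial cross-modal interactions and suppress redundant coupling; in the second stage, a public-factor agent and an aggregation agent produce the final prediction while keeping each modality independent. Experiments show that GCL outperforms existing methods on several multimodal benchmark datasets, improving model robustness and generalization.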

Comments This study has been accepted by ICML 2026. The current version is a manuscript; please refer to the official version released at ICML 2026 for the final published version.

English abstract

Centralized multimodal learning commonly compresses language, acoustic, and visual signals into a single fused representation for prediction. While effective, this paradigm suffers from two limitations: modality dominance, where optimization gravitates towards the path of least resistance, ignoring weaker but informative modalities, and spurious modality coupling, where models overfit to incidental cross-modal correlations. To address these, we propose Group Cognition Learning (GCL), a governed collaboration paradigm that applies a two-stage protocol after modality-specific encoding. In Stage 1 (Selective Interaction), a Routing Agent proposes directed interaction routes, and an Auditing Agent assigns sample-wise gates to emphasize exchanges that yield positive marginal predictive gain while suppressing redundant coupling. In Stage 2 (Consensus Formation), a Public-Factor Agent maintains an explicit shared factor, and an Aggregation Agent produces the final prediction through contribution-aware weighting while keeping each modality representation as a specialization channel. Extensive experiments on CMU-MOSI, CMU-MOSEI, and MIntRec demonstrate that GCL mitigates dominance and coupling, establishing state-of-the-art results across both regression and classification benchmarks. Analysis experiments further demonstrate the effectiveness of the design.

2605.00195 2026-05-12 cs.LG

Diversity in Large Language Models under Supervised Fine-Tuning

Roman Klypa, Oleksandr Cherednichenko

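AI summary This study examines how supervised fine-tuning (SFT) affects the generative diversity of large language models. It finds that SFT reduces diversity and attributes the decline to the neglect of low-frequency patterns in the fine-tuning data and the forgetting of pretrained knowledge. The authors propose a new loss function, Tempered Focal (TOFU) loss, that addresses both problems at once. Experiments show that TOFU increases output diversity while preserving response quality, offering a more principled approach to SFT.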

English abstract

Supervised Fine-Tuning (SFT) is essential for aligning Large Language Models (LLMs) with user intent, yet it is believed to suppress generative diversity. Although this reduction is frequently referenced, formal empirical testing of the phenomenon remains limited. The expressiveness of LLMs by itself was addressed by multiple prior methods. Their varying perspectives suggest that deeper investigation could yield further improvements. In this study, we attribute the decline to two primary drivers: the neglect of low-frequency patterns within fine-tuning datasets and the forgetting of preexisting knowledge. Motivated by our theoretical analysis, we develop Tempered Focal (TOFU) loss, a novel objective that addresses both stated challenges simultaneously. Our extensive evaluation confirms at scale that generation breadth narrows after SFT and strengthens the hypothesis explaining this effect. Across multiple models and benchmarks, we demonstrate that TOFU enhances output diversity while preserving high response quality, offering a principled approach to SFT.

2604.27629 2026-05-12 cs.AI

WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning

Ke Xu, Zhongyuan Lian

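AI summary This paper proposes WaferSAGE, a framework for visual question answering on wafer defects that combines small vision-language models with synthetic data generation to address data scarcity in semiconductor manufacturing. Using structured rubric generation and reinforcement learning, the work improves the accuracy of defect identification and analysis and trains a high-accuracy model without large amounts of labeled data. Experiments show that the method can surpass large commercial models on specialized industrial visual understanding tasks, offering a privacy-preserving and lower-cost deployment option for semiconductor manufacturing.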

Comments 16 pages, 3 figures, 8 tables

English abstract

We present WaferSAGE, a framework for wafer defect visual question answering using small vision-language models. To address data scarcity in semiconductor manufacturing, we propose a three-stage synthesis pipeline incorporating structured rubric generation for precise evaluation. Starting from limited labeled wafer maps, we employ clustering-based cleaning to filter label noise, then generate comprehensive defect descriptions using vision-language models, which are converted into structured evaluation rubric criteria. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. Our dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, our 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment. We demonstrate that small models with domain-specific training can surpass proprietary large models in specialized industrial visual understanding, offering a viable path for privacy-preserving, cost-effective deployment in semiconductor manufacturing.

2604.23876 2026-05-12 cs.LG

Cardiac Stability Theory: An Axiomatically Grounded Framework for Continuous Cardiac Health Monitoring via Smartphone Photoplethysmography

Timothy Oladunni, Farouk Ganiyu Adewumi

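AI summary This paper proposes Cardiac Stability Theory (CST), an axiom-based framework for continuous cardiac health monitoring via smartphone photoplethysmography (PPG). The method defines cardiovascular health as a stability margin around a cardiac dynamical attractor and combines the Lyapunov exponent, recurrence determinism, and signal entropy into the Cardiac Stability Index (CSI). The study reports strong CSI performance on ECG and PPG data and uses domain transfer to enable real-time use on smartphones, providing a new approach to long-term non-invasive cardiac health monitoring.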

English abstract

We present Cardiac Stability Theory (CST), an axiomatically grounded framework formally defining cardiovascular health as a stability margin around a cardiac dynamical attractor. From four axioms we derive the Cardiac Stability Index (CSI), a composite scalar in [0,1] integrating the largest Lyapunov exponent, recurrence determinism, and signal entropy via time-delay embedding. The ECG-based model (CSISurrogateV2, CNN-Transformer) achieves $R^2=0.8788$, MAE$=0.0234$ on PTB-XL (21,799 recordings). We extend CSI to smartphone PPG via Complementary Domain Transfer (CDT): CSISurrogateV2 generates pseudo-labels for the BUT PPG dataset (48 recordings, 12 subjects), training TinyCSINet (122,849 parameters), achieving MAE$=0.0557$, $ρ=0.660$ on the held-out test set ($n=1065$ windows) at ${<}30$ ms mobile latency. CDT is validated on BIDMC, Welltory, and RWS-PPG. Paired validation on 5,035 BIDMC windows yields $r=0.454$ ($ρ=0.485$, $p<10^{-295}$), confirming correlated cardiac stability across modalities. CSI is negatively correlated with age (slope $= -0.000225$ CSI/year, PTB-XL), discriminates atrial fibrillation from normal sinus rhythm (AUROC$=0.89$), and is robust under Perturbation Invariance Training (max AUC drop 1.65%). We derive HeartSpan, a longitudinal stability metric relative to population age norms, enabling continuous non-invasive cardiac monitoring from commodity smartphones for longevity tracking and cardiac risk stratification.

2604.23750 2026-05-12 cs.LG cs.AI

The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

Shuaizhi Cheng, Xiang Shi, Zhiwei Zhang, Mingwei Li

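AI summary This paper studies why hypernetwork-based instant LLM adaptation fails on knowledge conflicts and finds that the root cause is a magnitude problem rather than insufficient representational capacity. The analysis shows that although the hypernetwork targets the right layers, the adapter's magnitude is fixed while the magnitude of pretrained knowledge grows with training frequency, so deeply conflicting knowledge is hard to adapt. The authors therefore propose magnitude-boosting methods, Selective Layer Boosting and Conflict-Aware Internalization, which substantially improve performance on deep-conflict tasks without retraining.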

Comments 35 pages, 15 figures. v2: minor layout fixes and author list update

English abstract

Hypernetwork-based methods such as Doc-to-LoRA internalize a document into an LLM's weights in a single forward pass, but they fail systematically on conflicts: when the document contradicts pretraining knowledge, accuracy collapses to 46.4% on the deepest facts. We show the failure is a magnitude problem rather than a representational one. The hypernetwork already targets the right layers, but its adapter margin is approximately constant across documents while the pretrained margin grows with training frequency, so deep conflicts lose by construction. The account predicts that failure should track prior strength: sorting 194 conflicts by the base model's log-probability on the contradicted fact, baseline accuracy falls from 68% on weak-prior questions to 16% on strong-prior ones, a 52 percentage-point gap. The cure is amplitude. Selective Layer Boosting scales the adapter at its top-norm layers, and Conflict-Aware Internalization triggers boosting only when the base model is confident. Both are training-free; together they raise deep-conflict accuracy from 46.4% to 71.0% on Gemma-2B and from 53.6% to 72.5% on Mistral-7B while preserving novel-knowledge recall, and beat vanilla retrieval-augmented generation on medium conflicts by 18 percentage points despite operating entirely in parameter space. We release KID-Bench, a 489-question benchmark that separates novel recall, cross-knowledge combination, and prior-graded conflicts.
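
Illustrative sketch (not from the paper): one way to read "scale the adapter at its top-norm layers, but only when the base model is confident" is shown below. The adapter layout (a dict of per-layer weight deltas), the Frobenius norm as the ranking criterion, and the parameters `top_k`, `boost`, and `threshold` are all assumptions made for illustration; the abstract does not give these details.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def boost_top_norm_layers(adapter, top_k, boost):
    """Scale the top_k largest-norm layer deltas by `boost`; leave the rest."""
    norms = {name: frobenius_norm(delta) for name, delta in adapter.items()}
    selected = set(sorted(norms, key=norms.get, reverse=True)[:top_k])
    return {
        name: [[v * (boost if name in selected else 1.0) for v in row]
               for row in delta]
        for name, delta in adapter.items()
    }


def conflict_aware_adapter(adapter, base_confidence, threshold=0.5,
                           top_k=1, boost=2.0):
    """Boost only when the base model is confident in its prior answer."""
    if base_confidence >= threshold:
        return boost_top_norm_layers(adapter, top_k, boost)
    return adapter


toy_adapter = {"layer0": [[0.1, 0.0]], "layer1": [[3.0, 4.0]]}
print(conflict_aware_adapter(toy_adapter, base_confidence=0.9))
print(conflict_aware_adapter(toy_adapter, base_confidence=0.2))
```

Because nothing is retrained, the only extra cost in this sketch is one norm computation per layer and one confidence estimate per query.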

2604.22251 2026-05-12 cs.RO

False Feasibility in Variable Impedance MPC for Legged Locomotion

Vishal Ramesh

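AI summary This paper studies the "false feasibility" problem in variable impedance model predictive control (MPC) for legged locomotion: treating joint stiffness as an instantaneous decision variable creates a mismatch between the feasible set and the physically realizable set. By introducing the dimensionless parameter α = ω_s T, the author characterizes the regime of this mismatch and proves, for a one-legged hopping model, that when α is below a critical value the parameter-based prediction cannot be realized by any actual stiffness command. The study also shows that adding stiffness to the prediction state eliminates the mismatch at its root.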

Comments Paper withdrawn to make some revisions in the discussion and experiments sections

English abstract

Variable impedance model predictive control (MPC) formulations often treat joint stiffness as an instantaneous decision variable. The resulting feasible set strictly contains the physically realizable set under first-order actuator dynamics. We identify this as a formulation error rather than a modeling approximation, formalize the distinction between the parameter-based feasible set F_param and the realizable set F_real, and characterize the regime of mismatch via the dimensionless parameter α = ω_s T (actuator bandwidth times task timescale). For the 1D hopping monoped, we prove that below an analytical threshold α_crit derived in closed form from task physics, no admissible stiffness command realizes the parameter-based prediction. Numerical validation in 1D shows monotonic deviation growth as α decreases, with the predicted scaling holding across ten parameter combinations (log-log R^2 = 0.986). Mechanism transfer to planar spring-loaded inverted pendulum dynamics confirms center-of-mass and stance-timing deviation as the primary consequence, with regime-dependent friction effects as a tertiary observable. A second threshold α_infeas < α_crit establishes a floor below which restricting the admissible stiffness range cannot repair realizability, closing the conservative-tuning objection. Augmenting the prediction state with stiffness closes the mismatch by construction.
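
Illustrative sketch (not from the paper): to see why α matters, assume the stiffness follows a first-order lag dk/dt = ω_s (k_cmd − k), with ω_s read as the actuator bandwidth. For a step command, the fraction of the requested change realized after the task timescale T is then 1 − exp(−α). The numeric values below are hypothetical, and this does not reproduce the paper's thresholds α_crit or α_infeas.

```python
import math


def realized_fraction(alpha):
    """Fraction of a stiffness step realized after one task timescale T."""
    return 1.0 - math.exp(-alpha)


def simulate_stiffness(k_start, k_cmd, omega_s, horizon, steps=10000):
    """Forward-Euler integration of dk/dt = omega_s * (k_cmd - k)."""
    dt = horizon / steps
    k = k_start
    for _ in range(steps):
        k += dt * omega_s * (k_cmd - k)
    return k


for alpha in (0.1, 1.0, 5.0):
    print(f"alpha={alpha:4.1f}  realized fraction={realized_fraction(alpha):.3f}")
```

A planner that treats stiffness as instantaneous implicitly assumes a realized fraction of 1, so under this lag model the gap between planned and achievable stiffness widens as α shrinks.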

2604.21232 2026-05-12 cs.AI

ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

Xiyin Zeng, Yuyu Sun, Haoyang Li, Shouqiang Liu, Hao Wang

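AI summary This paper proposes ReCAPA, a hierarchical predictive correction architecture that addresses the cascading failures that can arise when vision-language-action systems execute multi-step tasks. The method introduces prediction and contrast mechanisms at three levels (actions, subgoals, and trajectories) and combines them with semantic alignment modules to adjust deviations dynamically during execution, improving the robustness of task execution. Experiments show that ReCAPA performs well on several embodied agent benchmarks and outperforms existing large language model baselines.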

English abstract

Vision-Language-Action (VLA) systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis-specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose the Predictive Alignment and Planning Architecture (ReCAPA), a framework that uses prediction and contrast to adjust deviations across three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine-grained steps to remain aligned with the overall intent. We further introduce two new metrics to quantify error propagation and recovery processes in tasks, capturing how mistakes spread and fade over long-horizon execution. Experiments show that ReCAPA achieves competitive results on embodied agent benchmarks such as VisualAgentBench, MineDojo, and AI2-THOR, outperforming strong proprietary and open-source Large Language Model baselines.

2604.19838 2026-05-12 cs.AI

Resolving space-sharing conflicts in road user interactions through uncertainty reduction: An active inference-based computational model

Julian F. Schumann, Johan Engström, Ran Wei, Shu-Yuan Liu, Jens Kober, Arkady Zgonnikov

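AI summary This paper studies how road users resolve space-sharing conflicts and proposes an active inference-based computational model that simulates the interactive behavior of two agents. The model reduces uncertainty in the interaction through three mechanisms: implicit communication, normative expectations, and explicit communication. It shows how normative and explicit communication cues raise the success rate of conflict resolution, and also that relying on these cues can lead to collisions when the other agent violates norms or conveys misleading information. The study provides a theoretical basis for modeling road user interactions and has broader potential applications.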

English abstract

Understanding how road users resolve space-sharing conflicts is important both for traffic safety and the safe deployment of autonomous vehicles. While existing models have captured specific aspects of such interactions (e.g., explicit communication), a theoretically-grounded computational framework has been lacking. In this paper, we extend a previously developed active inference-based driver behavior model to simulate interactive behavior of two agents. Our model captures three complementary mechanisms for uncertainty reduction in interaction: (i) implicit communication via direct behavioral coupling, (ii) reliance on normative expectations (stop signs, priority rules, etc.), and (iii) explicit communication. In a simplified intersection scenario, we show that normative and explicit communication cues can increase the likelihood of a successful conflict resolution. However, this relies on agents acting as expected. In situations where another agent (intentionally or unintentionally) violates normative expectations or communicates misleading information, reliance on these cues may induce collisions. These findings illustrate how active inference can provide a novel framework for modeling road user interactions which is also applicable in other fields.

2604.19792 2026-05-12 cs.AI cs.DC cs.MA cs.NE

OpenCLAW-P2P v7.0-P2PCLAW: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review v7.0 -- Mathematical Corrections & Ecosystem Developments Edition

Francisco Angulo de Lafuente, Teerth Sharma, Vladimir Veselov, Seid Mohammed Abdu, Nirmal Tej Kumar, Guillermo Perry

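AI summary This paper presents OpenCLAW-P2P v7.0, a decentralized collective-intelligence platform that lets autonomous AI agents publish, peer-review, score, and iteratively improve scientific papers without human reviewers. This version adds mathematical corrections to the theoretical framework, ensuring dimensional consistency, range constraints, and unambiguous notation, and expands the ecosystem, including CAJAL, an open-source language model for scientific paper generation. The platform also retains its four core subsystems, which improve storage reliability, retrieval efficiency, and citation verification accuracy.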

Comments v7.0: Mathematical corrections (fixed-point condition Eq.4, dimensionally consistent tau-indicator Eq.7, fully specified reputation formula Eq.8 with quality terms q0 and q-bar, discrete-time PD Governor Eq.15, HSR parameter definitions Eq.16); ecosystem developments: CAJAL-4B/9B models, BenchClaw platform, 14 integrations. 36 pages

English abstract

This paper presents OpenCLAW-P2P v7.0, a comprehensive evolution of the decentralized collective-intelligence platform in which autonomous AI agents publish, peer-review, score, and iteratively improve scientific research papers without any human gatekeeper. Building on the v6.0 foundations -- multi-layer persistence, live reference verification, multi-LLM granular scoring, calibrated deception detection, the Silicon Chess-Grid FSM, and the AETHER containerized inference engine -- this release introduces mathematical corrections to the theoretical framework, ensuring dimensional consistency, proper range constraints, and unambiguous notation throughout. Additionally, this edition documents significant ecosystem expansions including the CAJAL family of open-source language models (4B and 9B parameters) fine-tuned for scientific paper generation. The four major subsystems introduced in v6.0 are retained: (i) a Multi-Layer Paper Persistence Architecture with four storage tiers ensuring zero paper loss; (ii) a Multi-Layer Retrieval Cascade reducing latency from >3s to <50ms; (iii) a Live Reference Verification system detecting fabricated citations with >85% accuracy; and (iv) a Scientific API Proxy providing access to seven public scientific databases. Mathematical corrections in v7.0 include: corrected fixed-point condition in the Sufficient Reason theorem; dimensionally consistent progress-rate indicator; fully specified reputation update formula incorporating quality terms q0 and q-bar; clarified attention-logit bound in the AETHER pruning theorem; explicit range documentation for the calibration mapping; non-negativity guarantee for the depth score; discrete-time notation for the PD Governor; and explicit parameter definitions for the HSR weight formula.

2604.19530 2026-05-12 cs.LG cs.CE stat.ML

Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention

Akash Yadav, Taiwo A. Adebiyi, Ruda Zhang

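AI summary This paper studies how to give scientific foundation models well-calibrated predictive uncertainty. It proposes Stochastic Attention, a lightweight inference-time modification that introduces randomness into the attention weights to generate predictive ensembles without retraining the model. A calibration objective tunes the randomness parameter, enabling efficient post-hoc calibration. Experiments show that the method achieves better calibration and narrower prediction intervals on weather forecasting, time-series, and regression tasks, at a computational cost far lower than existing methods.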

English abstract

Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on scientific foundation models for weather and time-series forecasting, as well as several regression tasks. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable calibration, with adaptation costs nearly three orders of magnitude lower than the next-best baseline.
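
Illustrative sketch (not from the paper): one simple reading of "normalized multinomial samples" is to draw a fixed number of trials from the softmax distribution and divide the counts by that number. Here the trial count plays the role of the concentration parameter, which is an assumption; the paper's exact parameterization may differ. A single query with scalar values is used to keep the example small, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_attention_weights(logits, concentration, rng):
    """Replace softmax weights by multinomial counts divided by the trial count."""
    probs = softmax(logits)
    draws = rng.choices(range(len(probs)), weights=probs, k=concentration)
    counts = Counter(draws)
    return [counts[i] / concentration for i in range(len(probs))]


def attend(weights, values):
    """Attention output for one query: weighted sum of scalar values."""
    return sum(w * v for w, v in zip(weights, values))


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]
values = [1.0, 0.0, -1.0]
# Repeated stochastic passes give a predictive ensemble without retraining.
ensemble = [attend(stochastic_attention_weights(logits, 32, rng), values)
            for _ in range(200)]
deterministic = attend(softmax(logits), values)
print(f"deterministic={deterministic:.3f}  "
      f"ensemble mean={sum(ensemble) / len(ensemble):.3f}")
```

In this sketch a larger trial count concentrates the sampled weights around the softmax weights and narrows the ensemble spread, so a single scalar controls how much uncertainty the ensemble expresses.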

2604.17565 2026-05-12 cs.CV

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

Hong Jiang, Wensong Song, Zongxing Yang, Ruijie Quan, Yi Yang

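AI summary UniGeo is a new camera-controllable image editing framework that aims to generate geometrically consistent views of a scene under different camera viewpoints. By injecting unified geometric guidance at the representation, architecture, and loss-function levels, it addresses the geometric drift and structural degradation that existing methods exhibit under continuous camera motion. Experiments show that UniGeo significantly outperforms existing methods on several public datasets, with higher visual quality and geometric consistency.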

English abstract

Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, such as only injecting point clouds at the representation level despite models containing multiple levels, and are mainly based on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that while leveraging video models provides continuous viewpoint priors for camera-controllable image editing, they still struggle to form stable geometric understanding if geometric guidance remains fragmented. To systematically address this, we inject unified geometric guidance across three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. Specifically, at the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometry context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it proposes a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.

2604.14484 2026-05-12 cs.RO cs.AI math.OC

A Nonasymptotic Theory of Gain-Dependent Error Dynamics in Behavior Cloning

Junghoon Seo

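AI summary This paper studies the nonasymptotic finite-horizon error propagation of behavior cloning (BC) policies on position-controlled robots and reveals how controller gains affect the probability of task failure. By analyzing the gain-dependent closed-loop dynamics, the author introduces a proxy matrix $X_\infty(K)$ that characterizes the distribution of position errors and decomposes the task failure probability into a gain-dependent amplification factor, the validation loss, and a generalization slack, showing that training loss alone cannot predict closed-loop performance. The study also gives a scalar upper bound on the proxy matrix and analyzes the performance ranking across combinations of system stiffness and damping, providing a theoretical basis for understanding the stability of BC policies.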

English abstract

Behavior cloning (BC) policies on position-controlled robots inherit the closed-loop response of the underlying PD controller, yet the nonasymptotic finite-horizon consequences of controller gains for BC failure remain open. We show that independent sub-Gaussian action errors propagate through the gain-dependent closed-loop dynamics to yield sub-Gaussian position errors whose proxy matrix $X_\infty(K)$ governs the failure tail. The probability of horizon-$T$ task failure factorizes into a gain-dependent amplification index $Γ_T(K)$ and the validation loss plus a generalization slack, so training loss alone cannot predict closed-loop performance. Under shape-preserving upper-bound structural assumptions, the proxy admits the scalar bound $X_\infty(K)\preceqΨ(K)\bar X$, with $Ψ(K)$ decomposed into label difficulty, injection strength, and contraction. This ranks the four canonical regimes with compliant-overdamped (CO) tightest, stiff-underdamped (SU) loosest, and the stiff-overdamped versus compliant-underdamped ordering system-dependent. For the canonical scalar second-order PD system, the closed-form continuous-time stationary variance $X_\infty^{\mathrm{c}}(α,β)=σ^2α/(2β)$ is strictly monotone in stiffness and damping over the entire stable orthant, covering both underdamped and overdamped regimes, and the exact zero-order-hold (ZOH) discretization inherits this monotonicity. The analysis gives a nonasymptotic finite-horizon extension of the gain-dependent error-attenuation explanation of Bronars et al.
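
Illustrative sketch (not from the paper): the quoted closed form σ²α/(2β) can be checked numerically under an assumed noise model. The sketch assumes a scalar PD loop in which the action error enters through the stiffness term, x'' + β x' + α x = α σ ξ(t), with ξ unit-intensity white noise. This model is an assumption chosen because it reproduces the quoted formula; the paper's exact setup may differ. The covariance of (position, velocity) is integrated to steady state and the position variance is returned.

```python
def stationary_position_variance(alpha, beta, sigma, dt=1e-3, steps=40000):
    """Integrate dX/dt = A X + X A^T + Q to steady state, return Var(position).

    State is (position, velocity) with A = [[0, 1], [-alpha, -beta]] and
    noise intensity (alpha * sigma)^2 on the velocity equation (assumed model).
    """
    q = (alpha * sigma) ** 2
    x_pp = x_pv = x_vv = 0.0  # entries of the symmetric 2x2 covariance
    for _ in range(steps):
        d_pp = 2.0 * x_pv
        d_pv = x_vv - alpha * x_pp - beta * x_pv
        d_vv = -2.0 * alpha * x_pv - 2.0 * beta * x_vv + q
        x_pp += dt * d_pp
        x_pv += dt * d_pv
        x_vv += dt * d_vv
    return x_pp


sigma = 0.5
for alpha, beta in ((4.0, 3.0), (8.0, 3.0), (4.0, 6.0)):
    numeric = stationary_position_variance(alpha, beta, sigma)
    closed_form = sigma ** 2 * alpha / (2.0 * beta)
    print(f"alpha={alpha} beta={beta}  numeric={numeric:.6f}  "
          f"closed form={closed_form:.6f}")
```

Under this assumed model the three parameter pairs illustrate the monotonicity stated in the abstract: the variance grows with stiffness α and shrinks with damping β.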

2604.11734 2026-05-12 cs.RO cs.AI

SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving

Haojie Bai, Aimin Li, Ruoyu Yao, Xiongwei Zhao, Tingting Zhang, Xing Zhang, Lin Gao, Jun Ma

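AI summary This paper proposes SCORP, a scene-consistent multi-agent diffusion planner for cooperative driving combined with stable online reinforcement learning post-training. To address the weak scene consistency and poor alignment with closed-loop cooperative objectives of existing diffusion models, SCORP introduces a scene-conditioned multi-agent denoising architecture and designs a two-layer Markov decision process that integrates the reverse denoising chain with policy-environment interaction. Experiments show that SCORP significantly outperforms existing open-source methods on core safety and efficiency metrics.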

详情
英文摘要

Cooperative driving is a safety- and efficiency-critical task that requires the coordination of diverse, interaction-realistic multi-agent trajectories. Although existing diffusion-based methods can capture multimodal behaviors from demonstrations, they often exhibit weak scene consistency and poor alignment with closed-loop cooperative objectives. This makes post-training necessary for further improvement, yet achieving stable online post-training in reactive multi-agent environments remains challenging. In this paper, we propose SCORP, a scene-consistent multi-agent diffusion planner with stable online reinforcement learning (RL) post-training for cooperative driving. For pre-training, we develop a scene-conditioned multi-agent denoising architecture that couples inter-agent self-attention with a dual-path conditioning mechanism: cross-attention provides direct scene-information injection, while AdaLN-Zero enables additional flexible and stable conditional modulation, thereby improving the scene consistency and road adherence of joint trajectories. For post-training, we formulate a two-layer Markov decision process (MDP) that explicitly integrates the reverse denoising chain with policy-environment interaction. We further co-design dense, well-shaped planning rewards and variance-gated group-relative policy optimization (VG-GRPO) to mitigate advantage collapse and gradient instability during closed-loop training. Extensive experiments show that SCORP outperforms strong open-source baselines on WOMD, with 10.47%-28.26% and 1.70%-7.22% improvements in core safety and efficiency metrics, respectively. Moreover, compared with alternative post-training methods, SCORP delivers significant and consistent gains in both driving safety and traffic efficiency, highlighting stable and sustained advances in closed-loop cooperative driving.
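
Illustrative sketch (not from the paper): group-relative policy optimization standardizes each rollout's reward against the other rollouts in its group. The abstract does not say how the variance gate in VG-GRPO works; the sketch below shows one plausible reading, in which groups whose reward spread falls below a hypothetical threshold `min_std` contribute zero advantage instead of dividing by a near-zero standard deviation.

```python
import statistics


def group_relative_advantages(rewards, min_std=1e-3):
    """Standardize rewards within one group, gated on the group's spread.

    If the population standard deviation is below min_std (an assumed
    gate), return zeros so near-identical rollouts do not produce
    exploding or meaningless advantages.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < min_std:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


print(group_relative_advantages([0.0, 2.0]))       # spread-out group
print(group_relative_advantages([1.0, 1.0, 1.0]))  # gated group
```

Dense, well-shaped rewards make low-spread groups rarer in the first place, which is one way to read the abstract's pairing of reward design with the gated objective.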