arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.24577 2026-05-13 cs.CV cs.AI

EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Falong Fan, Yi Xie, Arnis Lektauers, Bo Liu, Jerzy Rozenblit

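AI summary This paper proposes EndoVGGT, a framework for improving the accuracy of 3D reconstruction of deformable soft tissue in surgical scenes. The method introduces a Deformation-aware Graph Attention (DeGAT) module that dynamically builds feature-space semantic graphs to capture long-range correlations among tissue regions, so that structural information propagates more effectively across occlusions and reconstruction becomes more robust and consistent. Experiments show that EndoVGGT substantially improves reconstruction quality on the SCARED dataset and generalizes well to unseen datasets.

Comments We withdraw this submission due to significant errors in the presentation and logical structure of the paper. We found that the current version does not accurately convey the research findings and requires a major overhaul of the manuscript's methodology description and results analysis.

Abstract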

Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.

2603.24033 2026-05-13 cs.LG

SRG: Score-based Relaxation-guided Generation for Mixed Integer Linear Programming

Ruobing Wang, Xin Li, Yujie Fang, Mingzhong Wang

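AI summary This paper proposes SRG, a score-based, relaxation-guided generative framework for mixed-integer linear programming. Using an approximate formulation of relaxation-guided stochastic differential equations together with a Transformer-based score network, the method folds feasibility and optimality signals into the generative model so that it produces high-quality feasible solutions. At inference time SRG samples diverse solutions directly, without an additional guidance module, and uses them to build compact trust-region subproblems. Experiments show strong results on several benchmarks, especially in hard candidate-generation settings, along with good zero-shot transfer across scales and problem types.

Abstract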

We propose Score-based Relaxation-guided Generation (SRG), a generative framework based on an approximate formulation of relaxation-guided stochastic differential equations (SDEs) for mixed-integer linear programming. SRG employs a Transformer-based score network that incorporates feasibility and optimality signals into score modeling, encouraging the learned generative model to place more probability mass on feasible, high-quality regions of the solution space. At inference time, SRG directly samples diverse candidate solutions from the learned score model without requiring any additional guidance module. These candidates are then used to construct compact trust-region subproblems for standard MILP solvers. Across multiple public benchmarks, SRG matches or improves upon the solution quality of the strongest learning-based baselines, with particularly strong gains in challenging candidate-generation settings. Moreover, SRG shows promising zero-shot transferability to unseen cross-scale and cross-problem instances, improving solver objectives and reducing search time in several cases through higher-quality initial candidates and compact trust-region search.

2603.23878 2026-05-13 cs.LG cs.AI cs.LO

The Luna Bound Propagator for Formal Analysis of Neural Networks

Henry LeCates, Haoze Wu

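AI summary This paper presents Luna, a bound propagator based on abstract interpretation for the formal analysis of neural networks. Luna is implemented in C++, supports interval bound propagation, DeepPoly/CROWN analysis, and alpha-CROWN analysis, and works on general computational graphs. Experiments show that on VNN-COMP 2025 benchmarks Luna outperforms the existing alpha-CROWN implementation in both bound tightness and computational efficiency.

Comments 32 pages, 29 figures

Abstract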

The parameterized CROWN analysis, a.k.a. alpha-CROWN, has emerged as a practically successful abstract interpretation method for neural network verification. However, existing implementations of alpha-CROWN are limited to Python, which complicates integration into existing DNN verifiers and long-term production-level systems. We introduce Luna, a new abstract-interpretation-based bound propagator implemented in C++. Luna supports Interval Bound Propagation, the DeepPoly/CROWN analysis, and the alpha-CROWN analysis over a general computational graph. We describe the architecture of Luna and show that it outperforms the state-of-the-art alpha-CROWN implementation in terms of both bound tightness and computational efficiency on supported benchmarks from VNN-COMP 2025. Luna is publicly available at https://github.com/ai-ar-research/luna.
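
To give a feel for the simplest of the analyses named above, interval bound propagation, the following Python sketch pushes an input box through one affine layer followed by ReLU. It is not Luna's C++ API: the function names and the toy weights are assumptions made purely for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def affine_bounds(W: Matrix, b: Vector, lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """Propagate the input box [lo, hi] through y = W x + b."""
    out_lo: Vector = []
    out_hi: Vector = []
    for row, bias in zip(W, b):
        low = high = bias
        for w, l, h in zip(row, lo, hi):
            # A non-negative weight maps lower bound to lower bound;
            # a negative weight swaps the roles of the two bounds.
            if w >= 0:
                low += w * l
                high += w * h
            else:
                low += w * h
                high += w * l
        out_lo.append(low)
        out_hi.append(high)
    return out_lo, out_hi


def relu_bounds(lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]


# Toy network with hypothetical weights: one affine layer, then ReLU.
W = [[1.0, -1.0], [2.0, 0.5]]
b = [0.0, -1.0]
pre_lo, pre_hi = affine_bounds(W, b, [-1.0, -1.0], [1.0, 1.0])
post_lo, post_hi = relu_bounds(pre_lo, pre_hi)
print(pre_lo, pre_hi)    # [-2.0, -3.5] [2.0, 1.5]
print(post_lo, post_hi)  # [0.0, 0.0] [2.0, 1.5]
```

Interval bounds like these are cheap but loose; the DeepPoly/CROWN and alpha-CROWN analyses mentioned in the abstract use linear relaxations to tighten them.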

2603.11383 2026-05-13 cs.RO cs.AI

Vision-Based Hand Shadowing for Robotic Manipulation via Inverse Kinematics

Hendrik Chiche, Antoine Jamme, Trevor Rigoberto Martinez, Gabriel Gomes

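AI summary This study proposes a vision-based hand-shadowing inverse-kinematics (IK) retargeting method for teleoperating low-cost robot arms. A single RGB-D camera captures hand motion; depth sensing and coordinate transforms turn it into targets in the robot frame, and a damped-least-squares IK solver produces joint commands for the SO-ARM101 arm. Experiments show a high success rate in structured environments; in real-world settings an alternative hand detector improves robustness. The results show both the promise and the current limits of marker-free hand retargeting.

Comments v2: accepted at IEEE Access (2026); minor revisions per peer review, added WiLoR occlusion-mitigation experiment, error analysis, EMA ablation, and author photos

Abstract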

Teleoperation of low-cost robotic manipulators remains challenging due to the difficulty of retargeting human hand motion to robot joint commands. We present an offline hand-shadowing inverse-kinematics (IK) retargeting pipeline driven by a single egocentric RGB-D camera mounted on 3D-printed glasses. The pipeline detects 21 hand landmarks per hand using MediaPipe Hands, deprojects them into 3D via depth sensing, transforms them into the robot coordinate frame, and solves a damped-least-squares IK problem to produce joint commands for the SO-ARM101 robot (5 arm + 1 gripper joints). A gripper controller maps thumb-index finger geometry to grasp aperture with a multi-level fallback hierarchy. Actions are previewed in a physics simulation before replay on the physical robot. We evaluate the pipeline on a structured pick-and-place benchmark (5-tile grid, 10 grasps per tile, 3 independent runs) achieving an 86.7% +/- 4.2% success rate, and compare it against four vision-language-action (VLA) policies (ACT, SmolVLA, pi_0.5, GR00T N1.5) trained on leader-follower teleoperation data. We provide a quantitative error analysis of the pipeline, reporting a mean IK position error of 36.4 mm, trajectory smoothness metrics showing 57-68% jerk reduction from EMA smoothing, and an ablation study over the smoothing parameter. We also test the pipeline in unstructured real-world environments (grocery store, pharmacy) and find that success is reduced to 9.3% due to hand occlusion by surrounding objects. To mitigate this, we integrate WiLoR as an alternative hand detector, achieving an 8% improvement in hand detection rate over MediaPipe, highlighting both the promise and current limitations of marker-free analytical retargeting.
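
The damped-least-squares IK step mentioned above can be sketched on a toy problem. The Python example below uses a 2-link planar arm rather than the 5-joint SO-ARM101, and the link lengths, damping value, and initial guess are all assumptions chosen for illustration, not values from the paper. Each step applies the standard update dq = J^T (J J^T + lambda^2 I)^{-1} e.

```python
import math
from typing import List, Sequence, Tuple

L1, L2 = 1.0, 1.0  # link lengths of the toy arm (assumed)


def forward(q: Sequence[float]) -> Tuple[float, float]:
    """End-effector position of the 2-link planar arm."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return x, y


def jacobian(q: Sequence[float]) -> List[List[float]]:
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]


def dls_step(q: Sequence[float], target: Sequence[float], damping: float = 0.1) -> List[float]:
    """One damped-least-squares update of the joint angles."""
    x, y = forward(q)
    ex, ey = target[0] - x, target[1] - y
    J = jacobian(q)
    # A = J J^T + damping^2 I is a symmetric 2x2 matrix [[a, b], [b, d]].
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
    det = a * d - b * b  # positive, because damping makes A positive definite
    # Solve A v = e in closed form, then map back with J^T.
    vx = (d * ex - b * ey) / det
    vy = (a * ey - b * ex) / det
    return [q[0] + J[0][0] * vx + J[1][0] * vy,
            q[1] + J[0][1] * vx + J[1][1] * vy]


def solve_ik(target: Sequence[float], q_init: Sequence[float] = (0.0, 1.0),
             iters: int = 100, damping: float = 0.1) -> List[float]:
    q = list(q_init)
    for _ in range(iters):
        q = dls_step(q, target, damping)
    return q


q = solve_ik((1.0, 1.0))
x, y = forward(q)
position_error = math.hypot(1.0 - x, 1.0 - y)
print(q, position_error)
```

The damping term keeps the 2x2 system well conditioned near singular poses, at the price of slightly slower convergence than an undamped Gauss-Newton step.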

2603.10281 2026-05-13 cs.LG cs.AI cs.CV

Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework

Rajesh Shrestha, Xiao Fu

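AI summary This paper studies how to integrate score-based denoisers into the ADMM optimization algorithm for solving inverse problems. To address two core challenges, the mismatch between the training data manifold and the geometry of ADMM iterates and the lack of convergence guarantees, it proposes a new ADMM-PnP framework with an AC-DC denoiser made up of three stages: auto-correction, directional correction, and score-based denoising. The analysis shows that, with suitable parameters, the framework is weakly nonexpansive, which guarantees fixed-point ball convergence, and that under more relaxed conditions it converges with an adaptive step size. Experiments show that the method outperforms existing baselines on a range of inverse problems.

Abstract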

While score-based generative models have emerged as powerful priors for solving inverse problems, directly integrating them into optimization algorithms such as ADMM remains nontrivial. Two central challenges arise: i) the mismatch between the noisy data manifolds used to train the score functions and the geometry of ADMM iterates, especially due to the influence of dual variables, and ii) the lack of convergence understanding when ADMM is equipped with score-based denoisers. To address the manifold mismatch issue, we propose ADMM plug-and-play (ADMM-PnP) with the AC-DC denoiser, a new framework that embeds a three-stage denoiser into ADMM: (1) auto-correction (AC) via additive Gaussian noise, (2) directional correction (DC) using conditional Langevin dynamics, and (3) score-based denoising. In terms of convergence, we establish two results: first, under proper denoiser parameters, each ADMM iteration is a weakly nonexpansive operator, ensuring high-probability fixed-point ball convergence using a constant step size; second, under more relaxed conditions, the AC-DC denoiser is a bounded denoiser, which leads to convergence under an adaptive step size schedule. Experiments on a range of inverse problems demonstrate that our method consistently improves solution quality over a variety of baselines.

2603.09678 2026-05-13 cs.AI cs.LG cs.SE

EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

Aman Sharma, Paras Chopra

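AI summary This paper introduces EsoLang-Bench, a benchmark for evaluating the genuine reasoning ability of large language models in unfamiliar programming languages, using five esoteric languages (such as Brainfuck and Befunge-98). Although these languages are Turing-complete, they appear far less often in pre-training corpora than mainstream languages such as Python and JavaScript and have negligible deployment value, so they are an effective test of out-of-distribution generalization. Experiments show that state-of-the-art models perform very well on mainstream-language tasks but lose most of their accuracy on esoteric-language tasks, revealing a large gap in cross-language generalization.

Comments 45 pages, 8 figures, preprint

Abstract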

Large language models achieve near-ceiling performance on code generation benchmarks, yet most of the programming languages used by popular benchmarks such as SWE-bench and HumanEval (e.g. Python, JavaScript) are squarely in-distribution. They appear at scale in pre-training corpora and are heavily reinforced during post-training. To study LLM performance on unfamiliar programming languages, we introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare). All five of our chosen esoteric languages are Turing-complete, so the same algorithmic problems that are solvable in Python or JavaScript are in principle solvable in each of them. Yet they are unfamiliar to LLMs, which makes them a good proxy for evaluating out-of-distribution performance. The unfamiliarity of esoteric languages comprises: (i) the hard-by-design primitives comprising the language; (ii) substantially less representation in pre-training corpora (340x to over 60,000x fewer public GitHub repositories than Python); and (iii) negligible deployment value, which makes targeted inclusion in post-training data economically irrational. We evaluate five frontier models across five prompting strategies and find a dramatic capability gap. The same 80 problems expressed in Python or JavaScript reach 100% accuracy on top frontier models, while the equivalent esoteric versions score only 0-11%. Few-shot learning and self-reflection also fail to close this gap. EsoLang-Bench therefore provides a contamination-resistant testbed for measuring how well frontier models generalise algorithmic problem-solving to programming languages outside their training distribution.

2603.07388 2026-05-13 cs.LG cs.AI

Sparsity and Out-of-Distribution Generalization

Scott Aaronson, Lin Lin Lee, Jiawei Li

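AI summary This paper examines out-of-distribution (OOD) generalization and proposes a theoretical account based on sparsity. It argues that the world is presented through distinguished features, that sparse hypotheses (those depending on as few features as possible) are favored by Occam's Razor, and that such hypotheses generalize when the training and test distributions overlap sufficiently on the relevant features. The paper gives a formal theorem extending a classic sample complexity bound and generalizes sparse classifiers to subspace juntas, offering a new perspective on generalization questions in AI alignment.

Abstract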

Explaining out-of-distribution generalization has been a central problem in epistemology since Goodman's "grue" puzzle in 1946. Today it's a central problem in machine learning, including AI alignment. Here we propose a principled account of OOD generalization with three main ingredients. First, the world is always presented to experience not as an amorphous mass, but via distinguished features (for example, visual and auditory channels). Second, Occam's Razor favors hypotheses that are "sparse," meaning that they depend on as few features as possible. Third, sparse hypotheses will generalize from a training to a test distribution, provided the two distributions sufficiently overlap on their restrictions to the features that are either actually relevant or hypothesized to be. The two distributions could diverge arbitrarily on other features. We prove a simple theorem that formalizes the above intuitions, generalizing the classic sample complexity bound of Blumer et al. to an OOD context. We then generalize sparse classifiers to subspace juntas, where the ground truth classifier depends solely on a low-dimensional linear subspace of the features.
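
For context, the classic in-distribution bound that the abstract refers to says that a learner consistent with m >= (ln|H| + ln(1/delta)) / epsilon samples from a finite hypothesis class H has error at most epsilon with probability at least 1 - delta. The Python sketch below evaluates that bound for Boolean functions depending on at most k of n features, using the counting bound |H| <= C(n, k) * 2^(2^k). It illustrates why sparsity helps; it is not the paper's OOD theorem, and the function names and example numbers are assumptions.

```python
import math


def occam_sample_bound(ln_hypotheses: float, epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((ln_hypotheses + math.log(1.0 / delta)) / epsilon)


def ln_num_sparse_functions(n_features: int, k: int) -> float:
    """Upper bound on ln of the number of Boolean functions of n features
    that depend on at most k of them: ln(C(n, k)) + 2^k * ln(2)."""
    return math.log(math.comb(n_features, k)) + (2 ** k) * math.log(2.0)


# Hypothetical setting: 100 features, 3 relevant, 10% error, 95% confidence.
m_sparse = occam_sample_bound(ln_num_sparse_functions(100, 3), epsilon=0.1, delta=0.05)
m_less_sparse = occam_sample_bound(ln_num_sparse_functions(100, 5), epsilon=0.1, delta=0.05)
print(m_sparse, m_less_sparse)
```

The bound grows only logarithmically in the total number of features n but exponentially in the sparsity k, which is the usual quantitative reading of Occam's Razor for sparse hypotheses.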

2603.04352 2026-05-13 cs.RO cond-mat.mtrl-sci

A Soft Robotic Demonstration in the Stratosphere

Codrin Tugui, Tirth Thakar, Anatol Gogoj, Alexander White, Ang Leo Li, Alexander Yin, Edward Pomianek, Mihai Duduta

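AI summary This study addresses the pressure, temperature, and adaptability challenges facing soft robots that operate in extreme environments such as the stratosphere, and proposes a new crosslinking method for silicone elastomers. A UV-triggered platinum-catalyzed reaction gives fast curing and excellent electro-mechanical actuation performance, markedly improving the reliability of dielectric elastomer actuators under extreme temperatures and near-vacuum. High-altitude balloon flights validated the material under space-like conditions, offering a new materials option for soft robots in space exploration and similar applications.

Abstract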

Machines designed for operation in Space, as well as other extreme environments, need to be both resilient and adaptable when mission parameters change. Soft robots offer advantages in adaptability, but most lack resilience to the pressure and temperature extremes found as close as the Stratosphere. Dielectric elastomer actuators overcome some of those limitations when built as solid state compliant capacitors capable of converting electrical energy into mechanical work, but the elastomer resilience limits the device's operating window. Here we present a crosslinking mechanism for silicone elastomers under ultraviolet light using trimethyl(methylcyclopentadienyl)platinum(IV) as a catalyst to react hydrosilane to vinyl groups. The formation of carbon-carbon bonds enables fast processing under UV light and exceptional electro-mechanical performance in dielectric elastomer actuators. The material resilience advantage is demonstrated in controlled experiments at -40 °C and 120 °C, as well as near vacuum, in comparison with state-of-the-art acrylic and silicone chemistries. Fully autonomous systems controlling grippers made with the novel silicone were integrated into payloads for high altitude balloon testing. Two stratospheric balloon missions were carried out and demonstrated DEAs as a viable soft robotic technology under space-like conditions (as high as 23.6 km elevation, at <0.05 atm and -55 °C). The combinations of chemical building blocks and catalyst can be further expanded to address other challenges for silicones, including adhesion and additive manufacturing.

2602.22586 2026-05-13 cs.LG cs.AI cs.CL

TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion

Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen, Muhan Zhang

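AI summary This paper proposes TabDLM, a unified framework for generating heterogeneous tabular data that combines free-form text with structured numerical and categorical attributes. The method combines a masked diffusion language model with a continuous diffusion process and uses bidirectional attention for cross-modal interaction between text and numerical features, overcoming the limitations of conventional diffusion models and large language models on heterogeneous data. Experiments show that TabDLM performs well on several benchmark datasets and outperforms existing diffusion-based and LLM-based generators.

Comments Preprint

Abstract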

Synthetic tabular data generation has attracted growing attention due to its importance for data augmentation, foundation models, and privacy. However, real-world tabular datasets increasingly contain free-form text fields (e.g., reviews or clinical notes) alongside structured numerical and categorical attributes. Generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing approaches broadly fall into two categories: diffusion-based methods and LLM-based methods. Diffusion models can capture complex dependencies over numerical and categorical features in continuous or discrete spaces, but extending them to open-ended text is nontrivial and often leads to degraded text quality. In contrast, LLM-based generators naturally produce fluent text, yet their discrete tokenization can distort precise or wide-range numerical values, hindering accurate modeling of both numbers and language. In this work, we propose TabDLM, a unified framework for free-form tabular data generation via a joint numerical-language diffusion model built on masked diffusion language models (MDLMs). TabDLM models textual and categorical features through masked diffusion, while modeling numerical features with a continuous diffusion process through learned, specialized numeric token embeddings; bidirectional attention then captures cross-modality interactions within a single model. Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.

2602.22507 2026-05-13 cs.LG cs.CV

Space Syntax-guided Post-training for Residential Floor Plan Generation

Zhuoyang Jiang, Dongqing Zhang

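AI summary This paper studies how to improve spatial configurational logic in residential floor plan generation and proposes SSPT, a post-training framework based on space syntax. It introduces the Space Syntax Integration Oracle (SSIO) to assess the configurational quality of generated floor plans and uses that assessment as a feedback signal for model optimization. The method includes two strategies, SSPT-Iter (iterative retraining) and SSPT-PPO (reinforcement learning), and a new evaluation benchmark, SSPT-Bench. Experiments show that the approach improves public-space dominance and functional-hierarchy alignment in generated floor plans, with SSPT-PPO giving larger gains and higher efficiency.

Abstract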

Residential floor plan generation requires not only geometric fidelity but also spatial configurational logic: shared living spaces should be integrative, while private spaces should remain segregated. Existing generators increasingly use room-relation graphs as input-side conditions, but generated layouts are rarely evaluated on the output side for configurational quality, and such evaluation is rarely fed back into model optimization. We propose Space Syntax-guided Post-training (SSPT), a framework that turns space-syntax integration from a post-hoc analysis tool into a computable feedback signal for already-trained floor plan generators. SSPT introduces the Space Syntax Integration Oracle (SSIO), which converts generated layouts into rectangle-space graphs and measures public-space dominance and functional hierarchy. SSIO is first applied to real residential data to establish empirical configurational references, then connected to two SSPT strategies: SSPT-Iter, a basic generate-filter-retrain route, and SSPT-PPO, the first RL-based post-training route for floor plan generation. We also introduce SSPT-Bench, a new evaluation system for measuring the output-side spatial configurational quality of post-trained generators under an out-of-distribution setting. Experiments show that both strategies improve public-space dominance and functional-hierarchy alignment over the unpost-trained baseline. SSPT-PPO achieves stronger gains, lower variance, and higher efficiency than iterative retraining. These results show that output-side configurational evaluation can serve as actionable post-training feedback, offering a practical path for injecting architectural theory into existing floor plan generation backbones.

2602.19770 2026-05-13 cs.LG cs.AI

The Confusion is Real: GRAPHIC -- A Network Science Approach to Confusion Matrices in Deep Learning

Johanna S. Fröhlich, Bastian Heinlein, Jan U. Claar, Hans Rosenberger, Vasileios Belagiannis, Ralf R. Müller

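AI summary This paper proposes GRAPHIC, a method for analyzing confusion between classes in deep learning models. Drawing on network science, it interprets confusion matrices from intermediate layers as adjacency matrices of directed graphs, which makes it possible to visualize and quantify learning dynamics during training. GRAPHIC reveals class separability, dataset issues, and architectural behavior, giving a new perspective on how neural networks learn.

Comments Transactions on Machine Learning Research, 2026

Abstract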

Explainable artificial intelligence has emerged as a promising field of research to address reliability concerns in artificial intelligence. Despite significant progress in explainable artificial intelligence, few methods provide a systematic way to visualize and understand how classes are confused and how their relationships evolve as training progresses. In this work, we present GRAPHIC, an architecture-agnostic approach that analyzes neural networks on a class level. It leverages confusion matrices derived from intermediate layers using linear classifiers. We interpret these as adjacency matrices of directed graphs, allowing tools from network science to visualize and quantify learning dynamics across training epochs and intermediate layers. GRAPHIC provides insights into linear class separability, dataset issues, and architectural behavior, revealing, for example, similarities between flatfish and man and labeling ambiguities validated in a human study. In summary, by uncovering real confusions, GRAPHIC offers new perspectives on how neural networks learn. The code is available at https://github.com/Johanna-S-Froehlich/GRAPHIC.
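
The core idea of reading a confusion matrix as the adjacency matrix of a directed graph can be sketched in a few lines of Python. The example below is not the GRAPHIC code from the linked repository: the class names, the counts, and the choice of in-strength as the summary statistic are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (true class, predicted class)


def confusion_graph(conf: List[List[int]], labels: List[str]) -> Dict[Edge, int]:
    """Turn off-diagonal confusion counts into weighted directed edges."""
    edges: Dict[Edge, int] = {}
    for i, row in enumerate(conf):
        for j, count in enumerate(row):
            if i != j and count > 0:  # the diagonal holds correct predictions
                edges[(labels[i], labels[j])] = count
    return edges


def in_strength(edges: Dict[Edge, int]) -> Dict[str, int]:
    """Total weight arriving at each class: how often others are mistaken for it."""
    totals: Dict[str, int] = {}
    for (_, dst), weight in edges.items():
        totals[dst] = totals.get(dst, 0) + weight
    return totals


# Hypothetical confusion matrix: rows are true classes, columns are predictions.
labels = ["cat", "dog", "truck"]
conf = [[8, 2, 0],
        [3, 7, 0],
        [0, 1, 9]]

edges = confusion_graph(conf, labels)
strongest = max(edges, key=edges.get)
print(edges)               # {('cat', 'dog'): 2, ('dog', 'cat'): 3, ('truck', 'dog'): 1}
print(strongest)           # ('dog', 'cat')
print(in_strength(edges))  # {'dog': 3, 'cat': 3}
```

Because the edges are directed, the asymmetry between "dog mistaken for cat" and "cat mistaken for dog" is preserved; recomputing the graph per epoch or per layer would show how such confusions evolve.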

2602.13267 2026-05-13 cs.CV cs.RO eess.IV

SOAR: Regression-based LiDAR Relocalization for UAVs

Hengyu Mu, Jianshi Wu, Yuxin Guo, XianLian Lin, Qingyong Hu, Sheng Ao, Chenglu Wen, Cheng Wang

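AI summary This paper proposes SOAR, a regression-based LiDAR relocalization framework for UAVs, aimed at high-precision positioning in GNSS-denied environments. To handle the large pose variations and irregular flight paths of UAV scenarios, SOAR introduces a locality-preserving sliding window attention module with locally invariant positional encoding for robustness to viewpoint changes, and a coordinate-independent feature initialization module to reduce sensitivity to global transformations. The authors also build a large-scale UAV LiDAR localization dataset with 4 scenes and 13 irregular paths, providing a more realistic benchmark for UAV relocalization research. Experiments show state-of-the-art localization success rate and error.

Comments 24 pages, 14 figures

Abstract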

Regression-based LiDAR relocalization has recently emerged as a promising solution for high-precision positioning in GNSS-denied environments. However, these methods are primarily tailored to autonomous driving, exhibiting significantly degraded accuracy in unmanned aerial vehicle (UAV) scenarios due to arbitrary pose variations and irregular flight paths. In this paper, we propose SOAR, a regression-based LiDAR relocalization framework for UAVs. Specifically, we introduce a locality-preserving sliding window attention module with locally invariant positional encoding to capture discriminative geometric structures robust to viewpoint changes. A coordinate-independent feature initialization module is further designed to eliminate sensitivity to global transformations. Furthermore, most existing UAV datasets are of limited use for evaluating LiDAR relocalization in the real world, due to the lack of synchronized LiDAR scans, accurate 6-DoF poses, or multiple traversals. Thus, we construct a large-scale UAV LiDAR localization dataset with 4 scenes and 13 irregular paths exhibiting rotation and altitude variations, providing a more realistic benchmark for UAVs. Extensive experiments demonstrate that our method achieves state-of-the-art performance, improving the localization success rate by 40% and reducing mean error by over 10 m on UAVLoc. Our code and dataset will be released soon.

2602.13004 2026-05-13 cs.LG stat.ML

Towards Uncertainty-Aware Federated Granger Causal Learning

Ayush Mohanty, Nazal Mohamed, Nagi Gebraeel

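AI summary This study addresses the lack of uncertainty awareness in federated Granger causal learning and proposes a way to quantify the uncertainty of cross-client causal relationships. By analyzing how uncertainty propagates through the federated framework, the authors derive closed-form recursions for the covariances between clients and server and establish spectral-radius-based convergence conditions, which yield analytical expressions for the steady-state variances. Experiments show that the method separates genuine cross-client causal relationships from spurious edges and outperforms existing federated causal structure learning methods.

Comments Manuscript under review

Abstract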

Granger causality recovers directed interactions from time-series data, but in many distributed systems, the data are vertically partitioned across clients, with each client observing only the variables of its own subsystem. Federated Granger causality (FedGC) recovers cross-client interactions without sharing raw data. Existing FedGC methods, however, return deterministic point estimates with no calibrated measure of uncertainty, leaving operators without a principled basis for identifying reliable cross-client interactions. We address this limitation by characterizing how uncertainty propagates through the FedGC framework. We derive closed-form covariance recursions for the cross-covariances induced by the coupled client-server feedback loop, and establish spectral-radius-based convergence conditions yielding closed-form expressions for the steady-state variances at both the client and server. Under mild stability conditions, we prove that the steady-state uncertainty depends only on client data statistics (aleatoric) and is independent of the priors placed on the model parameters (epistemic). Building on this asymptotic characterization, we construct a post-training hypothesis testing procedure that separates genuine cross-client interactions from spurious edges. Experiments on synthetic and real-world datasets show that the predicted uncertainty propagation matches the theory across multiple operating regimes, while consistently outperforming the state-of-the-art federated causal structure learning baselines.

2602.07892 2026-05-13 cs.LG cs.CL

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, Yi Zhong

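AI summary This study treats safety alignment as a continual learning process, with the goal of reducing the "alignment tax" that safety fine-tuning can impose on large language models, that is, the loss of general capability that accompanies safety gains. It proposes OGPSA, which uses orthogonal gradient projection: a low-rank reference subspace is estimated from general-capability data, and the component of each safety gradient lying in that subspace is removed, so that the safety objective is still optimized while the damage to general capability is reduced. Experiments show that OGPSA improves the safety-utility balance across several fine-tuning settings and is compatible with mainstream fine-tuning pipelines.

Abstract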

Safety post-training can improve the harmlessness and policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA improves the observed safety-utility trade-off over standard baselines. Under the sequential SFT→DPO pipeline, the average performance gain increases from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct. We have open sourced our code at https://github.com/SunGL001/OGPSA.
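
The projection step described above, removing from a safety gradient the component that lies in a reference subspace, can be sketched as follows. This is not the code from the OGPSA repository: Gram-Schmidt stands in for whatever low-rank estimation the authors use, and the vectors and function names are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: List[Vector], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of the reference gradients."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad: Vector, basis: List[Vector]) -> Vector:
    """Remove from grad its component inside the span of the orthonormal basis."""
    g = list(grad)
    for u in basis:
        c = dot(g, u)
        g = [gi - c * ui for gi, ui in zip(g, u)]
    return g


# Hypothetical gradients in a 3-parameter model.
reference_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # from general-capability data
safety_grad = [1.0, 2.0, 3.0]

basis = orthonormal_basis(reference_grads)
update = project_out(safety_grad, basis)
print(update)  # [0.0, 0.0, 3.0]
```

The projected update is orthogonal to every reference gradient, so to first order a step along it leaves the reference objectives unchanged while still moving the safety objective.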

2602.07668 2026-05-13 cs.CV cs.AI cs.LG cs.RO

Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making

Ross Greer, Laura Fleig, Maitrayee Keskar, Erika Maquiling, Giovanni Tapia Lopez, Angel Martinez-Sanchez, Parthib Roy, Jake Rattigan, Mira Sur, Alejandra Vidrio, Thomas Marcotte, Mohan Trivedi

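AI summary This study proposes L-LIO, a multimodal framework that fuses visual and audio information to improve driver state assessment and environment understanding in intelligent vehicles. Adding audio signals improves perception of the state of the driver, passengers, and people outside the vehicle, giving fuller information for applications such as airbag deployment and takeover time prediction in autonomous driving. Experiments show that audio provides key safety-relevant information in complex or context-rich scenarios, opening new intervention paths for intelligent vehicle decision systems.

Abstract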

The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.

2602.06412 2026-05-13 cs.CL cs.LG

Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding

Daisuke Oba, Danushka Bollegala, Masahiro Kaneko, Naoaki Okazaki

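AI summary This study targets the repeated computation that masked diffusion language models spend on positions that have already stabilized during generation, and proposes an optimization called SureLock. Once the posterior at a position has stabilized, the position is locked: its later computation is skipped and its attention keys and values are cached, which substantially lowers computational cost. Experiments show that the method cuts algorithmic FLOPs by 30% to 50% while preserving generation quality.

Comments Accepted to ICLR 2026

Abstract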

Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step -- even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position -- thereafter skipping its query projection and feed-forward sublayers -- while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$ where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our project page is available at https://daioba.github.io/surelock .
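
A minimal Python sketch of the locking idea follows. The abstract states only that a position is locked once its posterior has stabilized and that a local KL is monitored; the specific criterion used here (KL between consecutive-step posteriors below a fixed threshold), the threshold value, and all names are assumptions, not the authors' implementation. The cost function simply counts the M * N * d term from the abstract.

```python
import math
from typing import List, Sequence, Set


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) over a token distribution, with q floored at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def update_locks(prev: List[Sequence[float]], curr: List[Sequence[float]],
                 unmasked: Sequence[bool], locked: Set[int],
                 threshold: float = 1e-3) -> Set[int]:
    """Lock every unmasked position whose posterior barely moved since the last step."""
    new_locked = set(locked)
    for pos in range(len(curr)):
        if pos in new_locked or not unmasked[pos]:
            continue
        if kl_divergence(curr[pos], prev[pos]) < threshold:
            new_locked.add(pos)
    return new_locked


def attention_cost(n_total: int, n_unlocked: int, d_model: int) -> int:
    """Dominant per-iteration cost: unlocked queries attend to all cached keys."""
    return n_unlocked * n_total * d_model


# Two positions over a 2-token vocabulary (hypothetical posteriors).
prev_step = [[0.5, 0.5], [0.9, 0.1]]
curr_step = [[0.9, 0.1], [0.9, 0.1]]
locked = update_locks(prev_step, curr_step, unmasked=[True, True], locked=set())
print(locked)                    # {1}
print(attention_cost(8, 8, 4))   # 256
print(attention_cost(8, 3, 4))   # 96
```

Position 0 is still changing and stays unlocked; position 1 is identical across the two steps and is locked, so only its cached keys and values would be needed afterwards.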

2602.06339 2026-05-13 cs.RO cs.AI

Action Hallucination in Generative Vision-Language-Action Models

Harold Soh, Eugene Lim

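AI summary This paper studies action hallucination in generative vision-language-action models for robotics, where the model produces actions that violate physical constraints and these in turn cause plan-level failures. It analyzes the causes, attributing them to structural mismatches between feasible robot behavior and common model architectures, and examines the unavoidable tradeoffs imposed by three key barriers: topological, precision, and horizon. The work gives a mechanistic explanation for failures of generative robot policies and points to principled directions for improving their reliability and trustworthiness.

Comments 24 pages; updated setup with minor changes to proofs; changed template

Abstract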

Robot Foundation Models, such as VLAs, promise end-to-end generative robot policies with broad generalization. Yet it remains unclear whether they fundamentally resolve the core problem of action generation in embodied settings, or overcome the long-standing challenges of robotics. We address this question by analyzing action hallucinations that violate physical constraints and their extension to plan-level failures. Focusing on latent-variable generative policies, we show that hallucinations can arise from structural mismatches between feasible robot behavior and common model architectures. We study three such barriers -- topological, precision, and horizon -- and show how they impose unavoidable tradeoffs. Our analysis provides mechanistic explanations for reported empirical failures of generative robot policies and suggests principled directions for improving reliability and trustworthiness, without abandoning their expressive power.

2602.04042 2026-05-13 cs.LG stat.ME stat.ML

Partition Tree: Conditional Density Estimation over General Outcome Spaces

Felipe Angelim, Alessandro Leite

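AI summary This paper proposes Partition Tree, a new tree-based framework for conditional density estimation over general outcome spaces that handles continuous and categorical variables in a unified way. The method models conditional distributions as piecewise-constant densities on data-adaptive partitions and learns the tree structure by directly minimizing conditional negative log-likelihood, giving a scalable nonparametric alternative that makes no parametric assumptions. The paper also introduces Partition Forest, a bagging extension of Partition Tree that averages conditional densities, and reports improved probabilistic prediction and performance competitive with recent methods.

Comments Code available at https://github.com/felipeangelimvieira/partition_tree

Abstract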

We propose Partition Tree, a novel tree-based framework for conditional density estimation over general outcome spaces that supports both continuous and categorical variables within a unified formulation. Our approach models conditional distributions as piecewise-constant densities on data-adaptive partitions and learns trees by directly minimizing conditional negative log-likelihood. This yields a scalable, nonparametric alternative to existing probabilistic trees that does not make parametric assumptions about the target distribution. We further introduce Partition Forest, a bagging extension obtained by averaging conditional densities. Empirically, we demonstrate improved probabilistic prediction over CART-style trees and competitive performance compared to state-of-the-art probabilistic tree methods and Random Forests.
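
The piecewise-constant density and the negative log-likelihood objective can be illustrated with a one-dimensional toy. The Python sketch below is not the partition_tree package: it handles a single outcome variable with no conditioning features, and the sample values, cell edges, and function name are assumptions. Within each cell the density is (fraction of samples in the cell) / (cell width), and a split is worthwhile when it lowers the average NLL.

```python
import math
from typing import List, Sequence


def piecewise_nll(samples: Sequence[float], edges: Sequence[float]) -> float:
    """Average negative log-likelihood of samples under the piecewise-constant
    density fitted to those same samples on the cells defined by sorted edges."""
    n = len(samples)
    counts: List[int] = [0] * (len(edges) - 1)
    for y in samples:
        for k in range(len(counts)):
            is_last = k == len(counts) - 1
            # Cells are half-open [left, right); the last cell also includes its right edge.
            if edges[k] <= y < edges[k + 1] or (is_last and y == edges[-1]):
                counts[k] += 1
                break
    total = 0.0
    for k, c in enumerate(counts):
        if c > 0:
            density = c / (n * (edges[k + 1] - edges[k]))
            total -= c * math.log(density)
    return total / n


# Hypothetical outcomes on [0, 1], clustered in the lower half.
samples = [0.1, 0.2, 0.3, 0.9]
nll_single_cell = piecewise_nll(samples, [0.0, 1.0])     # uniform density on [0, 1]
nll_after_split = piecewise_nll(samples, [0.0, 0.5, 1.0])
print(nll_single_cell, nll_after_split)
```

On this toy data the split at 0.5 gives densities 1.5 and 0.5 in the two cells and lowers the average NLL below that of the single uniform cell, which is the kind of gain a likelihood-driven split search would look for.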

2602.02799 2026-05-13 cs.LG cs.AI

Joint Learning of Hierarchical Neural Options and Abstract World Model

Wasu Top Piriyakulkij, Wolfgang Lehrach, Kevin Ellis, Kevin Murphy

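AI summary This work aims to build agents that learn new skills by composing existing ones. It proposes AgentOWL, a method that efficiently and jointly learns an abstract world model and hierarchical neural options. Compared with existing methods, AgentOWL shows clear advantages in data efficiency and skill generalization, validated on a subset of Object-Centric Atari games.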

Details
English abstract

Building agents that can perform new skills by composing existing skills is a long-standing goal of AI agent research. Towards this end, we investigate how to efficiently acquire a sequence of skills, formalized as hierarchical neural options. However, existing model-free hierarchical reinforcement algorithms need a lot of data. We propose a novel method, which we call AgentOWL (Option and World model Learning Agent), that jointly learns -- in a sample efficient way -- an abstract world model (abstracting across both states and time) and a set of hierarchical neural options. We show, on a subset of Object-Centric Atari games, that our method can learn more skills using less data than baseline methods and possesses learning and generalization capabilities that the baselines do not have.

2602.02408 2026-05-13 cs.CV cs.AI

ReasonEdit: Editing Vision-Language Models using Human Reasoning

Jiaxing Qiu, Kaihua Hou, Roxana Daneshjou, Ahmed Alaa, Thomas Hartvigsen

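AI summary ReasonEdit is a new method for editing vision-language models (VLMs) that corrects their errors without disturbing unrelated behavior, targeting visual question answering tasks that require humans and models to reason. It lets users supply reasoning explanations during editing and retrieves relevant facts at inference time through a network-science-inspired multimodal embedding technique, which improves editing quality. Experiments show that ReasonEdit achieves state-of-the-art editing performance on multiple datasets, confirming that human reasoning substantially improves edit generalization.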

Details
English abstract

Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.

2602.02133 2026-05-13 cs.AI cs.CL

A Theoretical Analysis of Why Masked Diffusion Models Mitigate the Reversal Curse

Moongyu Jeon, Sangwoo Shin, BumJun Kim, Kyelim Lee, Albert No

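AI summary This paper gives a theoretical analysis of why masked diffusion language models (MDMs) mitigate the reversal curse seen in autoregressive language models (ARMs). It argues that the any-order masked training objective of MDMs induces a parameter-level coupling between forward and reverse conditionals, so that token-pair evidence learned during training transfers to reversed queries. Experiments verify this mechanism and show that it improves prediction on reversal tasks.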

Details
English abstract

Autoregressive language models (ARMs) suffer from the reversal curse: after learning ''$A$ is $B$,'' they often fail on the reverse query ''$B$ is $A$.'' Masked diffusion language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to their any-order masked training objective. However, observing ''$[\mathbf{M}]$ is $B$'' during training teaches recovery of $A$ from $B$ in one positional configuration, and does not by itself explain why the learned evidence should transfer to the reverse prompt ''$B$ is $[\mathbf{M}]$.'' We provide a theoretical analysis showing that this transfer arises from a parameter-level coupling between forward and reverse positional conditionals: shared Transformer parameters store token-pair evidence, while relative positional encodings route attention through queries and keys without changing the value-side evidence being retrieved. In a one-layer MDM, we prove that forward masked training strengthens evidence that is reusable in reverse queries, induces correlated forward--reverse attention routes, and yields a positively aligned shared-storage gradient component that decreases the reverse loss to first order. Controlled one-layer experiments and large-scale LLaDA/Dream experiments verify these signatures and show that they translate into improved reverse prediction.

2602.02007 2026-05-13 cs.CL cs.AI

Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

Zhanghao Hu, Qinglin Zhu, Runcong Zhao, Di Liang, Hanqi Yan, Yulan He, Lin Gui

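AI summary This paper addresses the shortcomings of standard retrieval-augmented generation (RAG) for agent memory and proposes xMemory, a new memory management method. Following a decouple-then-aggregate principle, it breaks interaction history into reusable facts, updates, and distinguishing details, and builds a revisable hierarchical memory structure to improve retrieval efficiency and information accuracy. Experiments show that xMemory improves answer quality and inference efficiency across multiple tasks and models.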

Comments Project Address: https://zhanghao-xmemory.github.io/Academic-project-page-template/; Code Address: https://github.com/HU-xiaobai/xMemory

Details
English abstract

Standard Retrieval Augmented Generation (RAG) is poorly matched to agent memory. Unlike large heterogeneous corpora, agent memory forms a bounded and coherent interaction stream in which many spans are highly correlated or near duplicates. As a result, flat top-$k$ similarity retrieval often returns redundant context, while summary-centric hierarchies can blur the subtle details that distinguish one candidate from another. We argue that agent memory should follow the principle of decoupling before aggregation: the system should first isolate reusable facts, updates, and distinguishing details from similar histories, and only then organise them for efficient retrieval. Based on this principle, we propose xMemory, which constructs a revisable hierarchical memory structure from original messages to segments, memory components, and groups. xMemory segments interaction history into local events, decouples each segment into memory components, aggregates related components into high-level groups using a sparsity--semantic faithfulness objective, and maintains this structure incrementally as memory evolves. At inference time, xMemory retrieves top-down, first selecting a compact backbone of complementary groups and components, and then expanding to segments and raw messages only when additional evidence reduces the reader's uncertainty. Experiments on LoCoMo and PerLTQA across diverse open source and closed source LLMs show consistent gains in answer quality and inference token efficiency, supported by analyses of redundancy, evidence density, and coverage.

2602.01682 2026-05-13 cs.LG cs.DS stat.ML

Finite and Corruption-Robust Regret Bounds in Online Inverse Linear Optimization under M-Convex Action Sets

Taihei Oki, Shinsaku Sakaue

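AI summary This paper studies online inverse linear optimization: inferring a hidden objective vector from optimal actions observed over time-varying feasible sets, and recommending actions consistent with that objective. It asks whether a finite regret bound polynomial in the dimension is achievable when the feasible sets are M-convex (for example, matroids). By combining structural properties of optimal solutions on M-convex sets with a geometric volume argument, the authors prove a regret bound of $O(d\log d)$, partially resolving the open question, and extend the result to adversarially corrupted feedback with a regret bound of $O((C+1)d\log d)$ that needs no prior knowledge of $C$.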

Details
English abstract

We study online inverse linear optimization, also known as contextual recommendation, where a learner sequentially infers an agent's hidden objective vector from observed optimal actions over feasible sets that change over time. The learner aims to recommend actions that perform well under the agent's true objective, and the performance is measured by the regret, defined as the cumulative gap between the agent's optimal values and those achieved by the learner's recommended actions. Prior work has established a regret bound of $O(d\log T)$, as well as a finite but exponentially large bound of $\exp(O(d\log d))$, where $d$ is the dimension of the optimization problem and $T$ is the time horizon, while a regret lower bound of $\Omega(d)$ is known (Gollapudi et al. 2021; Sakaue et al. 2025). Whether a finite regret bound polynomial in $d$ is achievable or not has remained an open question. We partially resolve this by showing that when the feasible sets are M-convex -- a broad class that includes matroids -- a finite regret bound of $O(d\log d)$ is possible. We achieve this by combining a structural characterization of optimal solutions on M-convex sets with a geometric volume argument. Moreover, we extend our approach to adversarially corrupted feedback in up to $C$ rounds. We obtain a regret bound of $O((C+1)d\log d)$ without prior knowledge of $C$, by monitoring directed graphs induced by the observed feedback to detect corruptions adaptively.
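The regret notion described in the abstract can be made concrete with a small sketch. It assumes the agent maximizes a linear objective over a finite feasible set in each round; the function names, the objective vector `c`, and the toy feasible sets are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def regret(c, feasible_sets, recommended):
    """Cumulative gap between the agent's optimal value and the value of
    the learner's recommended action, summed over rounds."""
    total = 0.0
    for actions, x_hat in zip(feasible_sets, recommended):
        best = max(dot(c, x) for x in actions)  # agent's optimal value this round
        total += best - dot(c, x_hat)
    return total

# Hypothetical two-round instance with d = 2.
c = (2.0, 1.0)                                   # hidden objective (assumed)
feasible_sets = [[(1, 0), (0, 1)], [(1, 0), (0, 1)]]
recommended = [(0, 1), (1, 0)]                   # wrong in round 1, right in round 2

print(regret(c, feasible_sets, recommended))
```

In this instance the learner loses 1 in the first round and nothing in the second, so the cumulative regret is 1.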

2602.01418 2026-05-13 cs.CV cs.LG

Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General

Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis

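AI summary This paper proposes PaPE, a parabola-based position encoding designed for attention architectures on vision modalities. Starting from the characteristics of vision data, it is designed around translation invariance, rotation invariance, distance decay, directionality, and context awareness, so as to encode positions in images, videos, point clouds, and other vision data more accurately. Experiments show strong extrapolation on ImageNet-1K and broad applicability with strong performance across datasets from several modalities.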

Details
English abstract

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (such as from videos, event camera streams, images, or point clouds), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. Extrapolation experiments on ImageNet-1K show how PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5\% over the next-best encoding. Generality experiments on 8 datasets across 4 modalities show that PaPE is a general vision position encoding, as PaPE matches the best baseline on 5 datasets and exceeds all on 2 datasets. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.

2602.01103 2026-05-13 cs.AI

Probing RLVR training instability through the lens of objective-level hacking

Yiming Dong, Kun Fu, Haoyu Li, Xinyuan Zhu, Yurou Liu, Lijing Shao, Jieping Ye, Zheng Wang

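AI summary This paper studies the training instability of reinforcement learning with verifiable rewards (RLVR) in Mixture-of-Experts (MoE) architectures and proposes an analysis framework based on objective-level hacking that reveals the mechanism behind the instability. It finds that abnormal growth of the training-inference discrepancy is a key pathological dynamic behind the instability, a phenomenon that previously lacked a mechanistic explanation. Supported by extensive experiments, the paper offers theoretical guidance for designing more stable RLVR algorithms.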

Comments Accepted by ICML 2026

Details
English abstract

Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the training is often prone to instabilities, especially in Mixture-of-Experts (MoE) architectures. Training instability severely undermines model capability improvement, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges from token-level credit misalignment and is manifested as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic in MoE models: the abnormal growth of the training-inference discrepancy, a phenomenon widely associated with instability but previously lacking a mechanistic explanation. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.

2602.00400 2026-05-13 cs.AI

KEPO: Knowledge-Enhanced Preference Optimization for Multimodal Reasoning with Applications to Medical VQA

Fan Yang, Rui Meng, Trudi Di Qi, Ali Ezzati, Yuxin Wen

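AI summary This work proposes KEPO, a knowledge-enhanced preference optimization framework that aims to improve multimodal models on complex reasoning tasks such as medical visual question answering. To address the unstable training and difficult exploration of conventional reinforcement learning under sparse rewards, KEPO introduces quality-gated on-policy distillation that applies teacher guidance only to high-quality trajectories, combined with a knowledge-guided exploration strategy, reducing noise and improving reasoning coherence and generalization. Experiments show that KEPO achieves better training stability and out-of-distribution performance on medical VQA.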

Details
English abstract

Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards, leading to ambiguous credit assignment and severe exploration failures that can trap the policy in a ``learning cliff.'' Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified post-training framework that integrates: (i) a quality-gated on-policy distillation objective that selectively applies dense teacher guidance only to high-quality trajectories, and (ii) a knowledge-enhanced exploration strategy that leverages hints learned from a teacher model to rejectively sample reward-positive on-policy trajectories for RL, thereby mitigating exploration collapse. Evaluated on a challenging medical visual question answering benchmark under single-source generalization, KEPO demonstrates improved training stability, more coherent reasoning behaviors, and superior out-of-distribution performance over reinforcement learning and on-policy distillation baselines.

2601.22334 2026-05-13 cs.LG

DP-λCGD: Efficient Noise Correlation for Differentially Private Model Training

Nikita P. Kalinin, Ryan McKenna, Rasmus Pagh, Christoph H. Lampert

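AI summary This paper proposes DP-λCGD, an efficient noise correlation method for improving the accuracy of differentially private model training. The method correlates noise only with the previous iteration and cancels a controlled portion of it, which removes the need to store past noise. Compared with existing methods, it significantly reduces memory overhead while keeping the differential privacy guarantee, and it achieves higher model accuracy in experiments.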

Details
English abstract

Differentially private stochastic gradient descent (DP-SGD) is the gold standard for training machine learning models with formal differential privacy guarantees. Several recent extensions improve its accuracy by introducing correlated noise across training iterations. Matrix factorization mechanisms are a prominent example, but they correlate noise across many iterations and require storing previously added noise vectors, leading to substantial memory overhead in some settings. In this work, we propose a new noise correlation strategy that correlates noise only with the immediately preceding iteration and cancels a controlled portion of it. Our method relies on noise regeneration using a pseudorandom noise generator, eliminating the need to store past noise. As a result, it requires no additional memory beyond standard DP-SGD. We show that the computational overhead is minimal and empirically demonstrate improved accuracy over DP-SGD.
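The noise-regeneration idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the form `z_t - lam * z_{t-1}`, the per-step seeding scheme, and the names `fresh_noise` and `correlated_noise` are hypothetical, and privacy calibration of `sigma` and `lam` is omitted entirely. Python's `random` module is also not a cryptographically secure generator, which a real differentially private system would require.

```python
import random

def fresh_noise(seed, step, dim, sigma):
    """Deterministically regenerate the Gaussian noise vector for one step.

    The same (seed, step) pair always yields the same vector, so earlier
    noise never has to be kept in memory.
    """
    rng = random.Random(seed * 1_000_003 + step)  # assumed per-step seeding
    return [rng.gauss(0.0, sigma) for _ in range(dim)]

def correlated_noise(seed, step, dim, sigma, lam):
    """Noise for this step minus a fraction lam of the previous step's noise."""
    z_t = fresh_noise(seed, step, dim, sigma)
    if step == 0:
        return z_t  # no previous iteration to cancel against
    z_prev = fresh_noise(seed, step - 1, dim, sigma)  # regenerated, not stored
    return [a - lam * b for a, b in zip(z_t, z_prev)]

# Noise that would be added to a clipped gradient at steps 0, 1, 2.
for t in range(3):
    print(t, correlated_noise(seed=7, step=t, dim=3, sigma=1.0, lam=0.5))
```

Because the previous vector is recomputed from its seed, the only state carried between iterations is the seed and the step counter, which matches the abstract's claim of no additional memory beyond standard DP-SGD.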

2601.22301 2026-05-13 cs.CV

Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes

Gonzalo Gomez-Nogales, Yicong Hong, Chongjian Ge, Peiye Zhuang, Marc Comino-Trinidad, Dan Casas, Yi Zhou

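AI summary Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial compute to produce realistic images, yet they still face scalability and realism challenges for scenes with many dynamic people. This paper proposes C2R (Coarse-to-Real), a generative rendering framework that produces real-style urban crowd videos from coarse 3D simulations. Coarse 3D renderings give explicit control over scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. The method uses a two-stage synthetic-real domain alignment strategy: it first learns a generative prior from large-scale real video, then introduces controllability with a small amount of paired synthetic data. The result supports coarse-to-fine control, works with a variety of CG and game inputs, and generates temporally consistent, controllable, and realistic urban scene videos from minimal 3D input.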

Comments Project website at https://gonzalognogales.github.io/coarse2real/

Details
English abstract

Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial computational resources to produce realistic imagery, yet they still face challenges in scalability and realism for populated dynamic scenes. We present C2R (Coarse-to-Real), a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Our approach uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. To overcome the lack of paired training data between coarse simulations and real videos, we adopt a two-stage synthetic-real domain-alignment strategy that first learns a strong generative prior from large-scale real footage, and then introduces controllability by using a small amount of paired synthetic coarse-to-fine data to anchor shared implicit spatio-temporal features across domains. The resulting system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input. We will release the model and project webpage at https://gonzalognogales.github.io/coarse2real/.

2601.21944 2026-05-13 cs.LG

Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models

Konstantinos P. Panousis, Diego Marcos

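AI summary This paper studies the trade-off between flexibility and interpretability in sparsity-aware Concept Bottleneck Models (CBMs) and proposes Clarity, a new metric that captures the interplay between downstream task performance and the sparsity and precision of concept activations. Using an assessment framework built on datasets with ground-truth concept annotations, the authors compare several CBM methods based on vision-language models and attribute predictors, and show that different sparsity-inducing strategies differ markedly in performance and semantic alignment. Experiments and a human study confirm that Clarity reflects human trust in a model more accurately, offering a new approach to evaluating interpretable models.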

Details
English abstract

The widespread adoption of deep learning models in computer vision has intensified concerns about interpretability. Despite strong performance, these models are often treated as black boxes, with limited systematic investigation of their decision-making processes. While many interpretability methods exist, objective evaluation of learned representations remains limited, particularly for approaches that rely on sparsity to "induce" interpretability. In this work, we investigate how modeling choices in Concept Bottleneck Models (CBMs) affect the semantic alignment of concept representations. We introduce Clarity, a novel metric that captures the interplay between downstream performance and the sparsity and precision of concept activations. Using an interpretability assessment framework grounded in datasets with ground-truth concept annotations, we evaluate both VLM- and attribute predictor-based CBMs across three amortized sparsity-inducing strategies ($\ell_1$, $\ell_0$, and Bernoulli-based), alongside several widely used sparsity-aware CBM methods from the literature. Our experiments reveal a critical flexibility-interpretability trade-off: a model's capacity to optimize task performance by deviating from semantic alignment. We demonstrate that under this trade-off, different methods exhibit markedly different behaviors even at comparable performance levels. Finally, we validate our framework through a principled human study, which confirms that Clarity aligns significantly more closely with human trust than standard evaluation metrics.

2601.21351 2026-05-13 cs.LG cs.AI

Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads

Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou

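AI summary This work develops an analytical provisioning framework for large language model serving under the Attention-FFN disaggregated (AFD) architecture with stochastic workloads. By analyzing the stationary per-slot token load, it identifies a key workload statistic θ and derives the optimal Attention-to-FFN ratio for arbitrary prefill-decode distributions. The method also accounts for the bottleneck effect of synchronized execution, giving a closed-form mean-field rule and a Gaussian barrier-aware refinement. Experiments show that the predicted ratio is within 10% of the simulated optimum, providing a theoretical basis and practical guidance for provisioning in disaggregated LLM serving.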

Comments Submitted to NeurIPS 2026

Details
English abstract

Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop an analytical provisioning framework for AFD bundles in an $r$A--$1$F topology under stochastic workloads. Two sources of randomness shape the problem: per-slot Attention workload evolves as KV caches grow and completed requests are replenished with random prompt and decode lengths, and synchronized execution across Attention workers introduces a barrier governed by the slowest worker. We address both via a renewal-reward characterization of the per-slot stationary token load, identifying a single workload statistic $\theta$ that governs provisioning under arbitrary prefill-decode distributions and admits a nonparametric estimator from request traces. The analysis yields a closed-form mean-field rule for the optimal A/F ratio decomposing into Attention-, communication-, and FFN-bottleneck regimes, together with a Gaussian barrier-aware refinement that quantifies cross-worker synchronization overhead. A trace-calibrated AFD simulator supports the framework across workloads: the predicted optimal ratio matches the simulation-optimal within 10%. Together, these results provide a compact, calibratable account of how stochastic workload structure determines provisioning in disaggregated LLM serving.
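To see why a barrier governed by the slowest worker adds overhead beyond the mean per-worker time, the following Monte Carlo sketch may help. It is an illustration only and not the paper's model: independent Gaussian per-worker step times, the parameter values, and the function name `mean_barrier_time` are assumptions.

```python
import random
import statistics

def mean_barrier_time(r, mu, sigma, trials, seed=0):
    """Average time of a synchronized step across r Attention workers.

    Each step finishes when the slowest of the r workers finishes, so the
    step time is the maximum of r per-worker times (assumed i.i.d. Gaussian).
    """
    rng = random.Random(seed)
    step_times = [
        max(rng.gauss(mu, sigma) for _ in range(r))
        for _ in range(trials)
    ]
    return statistics.fmean(step_times)

# With one worker the barrier costs nothing beyond the mean; with more
# workers the expected maximum grows above the per-worker mean.
for r in (1, 2, 4, 8):
    print(r, round(mean_barrier_time(r, mu=1.0, sigma=0.1, trials=20000), 4))
```

The gap between the mean-field value `mu` and the simulated barrier time grows with `r`, which is the kind of synchronization overhead a barrier-aware correction would need to account for.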