arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.25810 2026-05-26 cs.CV

Data-driven Head Motion Generation through Natural Gaze-Head Coordination

Xiaohan Liu, Yilin Wen, Yusuke Sugano

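AI summary: Proposes the first data-driven method for gaze-conditioned head motion: natural gaze and head motion are extracted automatically, a conditional variational autoencoder generates head motion correlated with gaze, and the framework is applied to gaze-controlled video generation.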

Abstract

We present the first data-driven approach to model temporal gaze-head coordination from large-scale in-the-wild facial videos. To obtain training data for generalizable learning, we propose an automatic pipeline that extracts natural yet diverse gaze and head motions with off-the-shelf appearance-based gaze estimators. To capture the probabilistic correlation and temporal dynamics of gaze-head coordination, we build our model on a generative conditional Variational Autoencoder that produces plausible yet diverse gaze-conditioned head motion. We further apply our framework to gaze-controlled facial video generation, where we enable video generation with natural and realistic head motion correlated to the input gaze, an aspect that has not been emphasized before. Human evaluation and quantitative comparisons demonstrate our method's effectiveness and validate our design choices, with evaluators showing statistically significant preference for our approach over baseline methods.

2605.25804 2026-05-26 cs.CV

Event-to-Video Reconstruction using Spatio-Temporal and Frequency-Enhanced Deep Neural Networks

Ramna Maqsood, Paulo Nunes, Luís Ducla Soares, Caroline Conti

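AI summary: Proposes MSFET-E2V, which fuses spatio-temporal features with frequency representations from the discrete wavelet transform through a cross-domain attention module and adds a lightweight wavelet-enhanced skip block, achieving high-quality event-to-video reconstruction that surpasses existing methods on multiple datasets.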

Abstract

Event cameras offer significant advantages over conventional frame-based counterparts, including high temporal resolution, low latency, and energy efficiency. These characteristics make them suitable for high-speed and high-dynamic range scene acquisition scenarios; however, the lack of dense intensity frames limits the direct applicability of conventional computer vision methods for scene understanding. Event-to-video (E2V) reconstruction seeks to bridge this gap by converting asynchronous event streams into a sequence of synchronous video frames. Existing E2V reconstruction methods based on convolutional neural networks and transformers operate primarily in the spatial domain and often struggle to recover fine structural details while suppressing severe reconstruction artifacts. To address these issues, we propose MSFET-E2V, a novel multiscale frequency-enhanced transformer model. At its core lies a cross-domain attention module, which fuses spatio-temporal features with frequency-aware representations derived from the discrete wavelet transform. Unlike prior methods relying solely on spatial attention, our approach effectively captures both local and global structures by taking into account low- and high-frequency components, enhancing detail preservation and robustness across various motion scenarios. Furthermore, we propose a lightweight wavelet-enhanced skip block that serves as a skip connection, facilitating artifact suppression and structural detail refinement through joint spatial-frequency domain processing. Extensive experiments demonstrate that MSFET-E2V achieves superior performance over state-of-the-art methods on multiple real-world event datasets, offering significant gains in reconstruction quality. Moreover, compared to the existing transformer-based method, our proposed model significantly reduces the number of parameters, the GPU memory usage, and inference time.
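
As a rough illustration of the frequency decomposition this abstract refers to, the following minimal sketch computes one level of a 1D Haar wavelet transform in plain Python. It is not the MSFET-E2V implementation: the Haar wavelet, the 1D setting, and the function names `haar_dwt_1d` and `haar_idwt_1d` are assumptions chosen for illustration only.

```python
import math


def haar_dwt_1d(signal):
    """One level of the Haar DWT (illustrative; names are hypothetical).

    Returns (approx, detail): the low-frequency and high-frequency halves.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    # Pairwise sums keep the smooth (low-frequency) content.
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    # Pairwise differences keep the edges (high-frequency content).
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Invert haar_dwt_1d, recovering the original samples."""
    scale = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / scale)
        out.append((a - d) / scale)
    return out


low, high = haar_dwt_1d([4.0, 4.0, 2.0, 0.0])
restored = haar_idwt_1d(low, high)
```

The flat pair `(4.0, 4.0)` yields a zero detail coefficient, while the step `(2.0, 0.0)` yields a nonzero one. This is the sense in which the high-frequency band carries fine structural detail.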

2605.25803 2026-05-26 cs.CV

ATV-Net: Adaptive Triple-View Network with Dynamic Feature Fusion

Hsin-Jui Pan, Sheng-Wei Chan, Meng-Qian Li, Chun-Po Shen

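AI summary: Proposes ATV-Net, which improves the segmentation head of a ResNet-101 model by adaptively gating the fusion of three receptive-field views (micro, local, scout), reaching 80.31% mIoU on Cityscapes and showing that classical CNN segmentation remains competitive.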

Comments
Code will be released soon

Abstract

Recent semantic segmentation research has increasingly moved toward stronger context modeling, dense attention, and transformer-based architectures. Although these models achieve impressive performance, classical CNN-based segmentation pipelines remain attractive because of their simplicity, efficiency, and ease of implementation. This paper revisits a practical question: how far can a ResNet-based segmentation model be improved by only modifying the segmentation head? We propose ATV-Net, an Adaptive Triple-View Network that strengthens a ResNet-101 backbone using three simple but complementary receptive-field views. The micro view captures point-wise semantic responses, the local view models neighborhood structures and object boundaries, and the scout view provides enlarged contextual cues. Instead of fusing these views with fixed weights, ATV-Net introduces an Adaptive Decision Gate that dynamically selects receptive-field responses according to input scene characteristics. A compact global coordination layer is further applied to improve spatial and semantic consistency. Experiments on the Cityscapes validation set show that ATV-Net achieves 80.31% mIoU. This result suggests that classical CNN-based segmentation is still far from obsolete: with simple receptive-field views and adaptive fusion, a ResNet-based pipeline can reach a competitive accuracy level without relying on transformer-style global attention or overly complex context modules.

2605.25802 2026-05-26 cs.CV

Rethinking VLM Representation for VLA Initialization

Weifeng Lin, Siyuan Huang, Hao Li, Tingwei Chen, Ruichuan An, Xinyu Wei, Jianbo Liu, Hongsheng Li

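AI summary: Treats VLA initialization as a controlled representation-design problem along three axes (capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining). It finds that preserving the pretrained VLM representation is critical to action performance, that LoRA gives a more reliable initialization than full fine-tuning, and that staged LoRA-based training yields the strongest variant.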

Comments
9 main-text pages, 5 appendix pages, 4 figures

Abstract

Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.
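
To make the LoRA versus full fine-tuning contrast concrete, here is a minimal sketch of the standard LoRA update, in which the pretrained weight `W` stays frozen and only a low-rank product `B @ A` is learned. This illustrates LoRA in general and not the paper's training setup. The helper names `matmul` and `lora_effective_weight`, the toy matrices, and the `alpha / rank` scaling convention are assumptions for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / rank) * B @ A (hypothetical helper).

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (rank, d_in)
    b: trainable up-projection, shape (d_out, rank)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update, same shape as w
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]   # stands in for a pretrained weight
a = [[1.0, 2.0]]               # rank 1
b_zero = [[0.0], [0.0]]        # a zero up-projection leaves W untouched
unchanged = lora_effective_weight(w, a, b_zero, alpha=2.0)
```

With `B` set to zero the effective weight equals the pretrained one, and the update can only move it within a rank-limited subspace. This is one way to read the abstract's point that LoRA reshapes the pretrained representation less than full fine-tuning does.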

2605.25801 2026-05-26 cs.CV

PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution

Wenxue Li, Jingjing Ren, Peng Zhang, Tian Ye, Daiguo Zhou, Jian Luan, Lei Zhu

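AI summary: Proposes PixelWizard, a framework that hierarchically decouples global structure modeling from detail synthesis and introduces Noise-Span Aligned Shortcut Training, enabling efficient high-fidelity video generation at ultra-large resolution with a speedup of over 10x.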

Abstract

High-resolution video generation faces a coupled bottleneck of optimization instability and prohibitive computational costs. The massive expansion of the token sequence not only biases optimization toward local textures at the expense of global coherence, leading to structural collapse, but also imposes prohibitive training costs and severe inference latency. To address this, we propose PixelWizard, a framework that hierarchically decouples global structure modeling from fine-grained detail synthesis. PixelWizard first establishes a compact spatiotemporal anchor to concentrate dense structural priors, which then guides fine-grained generation at high resolution. This mitigates the local optimization bias to ensure structural stability without compromising high-frequency details. Leveraging this structural stability, we introduce Noise-Span Aligned Shortcut Training to break the inference bottleneck. By explicitly modeling the step size, this mechanism allows the model to traverse the generation trajectory with large steps. Crucially, we incorporate Exponential Index-Biased Sampling and Adaptive Noise-Span Calibration to align optimization with the shifted noise schedules of high-resolution grids, ensuring robust few-step inference without incurring the heavy overhead of distillation. Extensive experiments demonstrate that PixelWizard achieves superior visual quality while accelerating the generative sampling of native 2K/4K videos by over 10x.

2605.25799 2026-05-26 cs.CV

Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning

Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li

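AI summary: Standard fine-tuning in cross-domain few-shot learning exacerbates the attention sink and degrades discriminability. The paper proposes dynamic token re-weighting that suppresses reliance on simple tokens and strengthens learning of hard tokens, achieving new state-of-the-art performance.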

Comments
Accepted by CVPR 2026

Abstract

Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL), where the model must transfer source-domain information to target domains with scarce training data, remains underexplored. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior works: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. To understand this phenomenon, through extensive experiments, we interpret it as the model's shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model shows a high tendency to push tokens that are initially closer to target-domain classes (i.e., simple tokens) to be even closer to these classes, exacerbating the attention sink and wasting the capability of learning other discriminative but initially further tokens (i.e., hard tokens). To address this, we propose a novel approach to dynamically re-weight tokens according to their relevance to target-domain classes during target-domain fine-tuning, which explicitly suppresses the model's reliance on these simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method, demonstrating new state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/TIR.

2605.25796 2026-05-26 cs.CR cs.AI cs.CL

SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness

Jiahao Huo, Wenjie Qu, Yibo Yan, Kening Zheng, Jiaheng Zhang, Xuming Hu, Philip S. Yu, Mingxun Zhou

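AI summary: Proposes SAMark, a self-anchored watermarking framework that establishes a step-independent green region in semantic space, free of any dependence on sentence order. Combined with a multi-channel hyperbolic scoring mechanism and a diversity-aware filtering strategy, it achieves high detection rates under paragraph-level paraphrasing attacks and breaks the robustness-quality trade-off.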

Abstract

Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustness to paragraph-level paraphrasing remains difficult because such attacks globally disrupt watermark signals by changing sentence order. In this work, we propose SAMark, a self-anchored watermarking framework that removes the dependency on sentence order by establishing a step-independent green region in semantic space. To improve detectability, we introduce a multi-channel hyperbolic scoring mechanism that amplifies watermark signals while suppressing noise from weakly aligned candidates. We further propose a diversity-aware filtering strategy that combines hard filtering with soft regularization, extending beyond simple n-gram repetition filters to address semantic redundancy. Experimental results show that SAMark achieves up to 90.2% TP@FP1% under typical paragraph-level paraphrasing attacks, outperforming the strongest prior baseline by more than 30% on average, while maintaining generation quality competitive with unwatermarked text and breaking the robustness-quality trade-off that limits prior methods.
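
The reported metric, TP@FP1%, is commonly read as the true-positive rate at a detection threshold calibrated so that at most 1% of unwatermarked texts are flagged. The sketch below computes it under that reading. The function name `tpr_at_fpr`, the "strictly greater than threshold" detection rule, and the toy scores are assumptions for illustration, not SAMark's evaluation code.

```python
import math


def tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.01):
    """True-positive rate at a threshold with false-positive rate <= max_fpr.

    pos_scores: detector scores for watermarked texts
    neg_scores: detector scores for unwatermarked texts
    A text is flagged when its score is strictly above the threshold.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of unwatermarked texts we may flag without exceeding max_fpr.
    allowed_fp = math.floor(max_fpr * len(neg_sorted) + 1e-9)
    if allowed_fp < len(neg_sorted):
        threshold = neg_sorted[allowed_fp]
    else:
        threshold = float("-inf")
    true_positives = sum(1 for s in pos_scores if s > threshold)
    return true_positives / len(pos_scores)


negatives = [i / 100 for i in range(100)]   # toy scores 0.00 .. 0.99
positives = [0.985, 0.995, 0.5, 1.0]
rate = tpr_at_fpr(positives, negatives)     # 3 of 4 exceed the threshold 0.98
```

Calibrating the threshold on unwatermarked text first, then measuring recall on attacked watermarked text, keeps the comparison across methods at a fixed false-alarm budget.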

2605.25794 2026-05-26 cs.AI

When Can We Trust Early Warnings? Leakage-Excluded Early Outcome Prediction from LMS Interaction Logs

Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

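AI summary: Early predictions from LMS logs are often overestimated because of temporal leakage. The paper proposes LEAP (Leakage-Excluded Early-Availability Protocol), which uses cutoff-first truncation and feature-provenance audits to keep post-cutoff evidence out of the benchmark, and evaluates several methods on the OULAD dataset.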

Abstract

Early-warning models built from Learning Management System (LMS) logs aim to predict end-of-course outcomes early enough to enable timely learner support. However, reported "early" performance is often inflated by temporal leakage. This occurs when the pipeline uses information that would not yet be available at the time of prediction. We formalize cutoff-based early outcome prediction under a temporal availability constraint and introduce LEAP (Leakage-Excluded Early-Availability Protocol), which enforces cutoff-first truncation prior to joins and aggregation and audits feature provenance to prevent post-cutoff evidence from entering the benchmark. We instantiate LEAP on the public Open University Learning Analytics Dataset (OULAD) as a multi-step protocol for leakage-controlled evaluation across weekly cutoffs. Using several standard learning methods, we evaluate performance using ROC-AUC, PR-AUC, Brier score, and F1@0.5. Results show improving performance as the observation window expands, with a marked gain around week 3; Random Forest performs best at the earliest cutoffs, while Gradient Boosting dominates thereafter. Leakage ablations further show that temporal violations, especially through assessment information, can inflate apparent "early" performance.
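
The idea of cutoff-first truncation can be sketched in a few lines: interaction events are filtered to the prediction cutoff before any aggregation, so no feature can absorb later activity. This is a minimal illustration and not the LEAP protocol itself. The function name `clicks_by_student`, the record fields `student`, `day`, and `clicks`, and the toy events are all hypothetical.

```python
from collections import defaultdict


def clicks_by_student(events, cutoff_day):
    """Aggregate clicks per student using only events up to cutoff_day.

    Truncation happens BEFORE aggregation, so post-cutoff activity
    cannot leak into the feature.
    """
    visible = [e for e in events if e["day"] <= cutoff_day]  # cutoff first
    totals = defaultdict(int)
    for event in visible:
        totals[event["student"]] += event["clicks"]
    return dict(totals)


events = [
    {"student": "s1", "day": 3, "clicks": 5},
    {"student": "s1", "day": 30, "clicks": 50},  # after a day-14 cutoff
    {"student": "s2", "day": 10, "clicks": 2},
]
features = clicks_by_student(events, cutoff_day=14)  # {"s1": 5, "s2": 2}
```

Aggregating first and filtering afterwards would report 55 clicks for `s1`, which is the kind of inflated "early" signal the abstract attributes to temporal leakage.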

2605.25790 2026-05-26 cs.RO

HoLoArm: Deformable Arms for Collision-Tolerant Quadrotor Flight

Quang Ngoc Pham, Jonas Eschmann, Yang Zhou, Alejandro Ojeda Olarte, Giuseppe Loianno, Van Anh Ho

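AI summary: Inspired by the nodus structure of dragonfly wings, proposes HoLoArm, a quadrotor with compliant arms that, combined with a reinforcement-learning control policy, deforms passively and recovers quickly, keeping flight stable at collision speeds of up to 7.6 m/s.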

Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 3582-3589, March 2026
Comments
8 pages, 15 figures, 1 table, Accepted at the IEEE Robotics and Automation Letters (RA-L) and the IEEE International Conference on Robotics and Automation (ICRA), 2026

Abstract

The increasing use of drones in human-centric applications highlights the need for designs that can survive collisions and recover rapidly, minimizing risks to both humans and the environment. We present HoLoArm, a quadrotor with compliant arms inspired by the nodus structure of dragonfly wings. This design provides natural flexibility and resilience while preserving flight stability, which is further reinforced by the integration of a Reinforcement Learning (RL) control policy that enhances both recovery and hovering performance. Experimental results demonstrate that HoLoArm can passively deform in any direction, including the axial one, and recover within 0.3-0.6 s depending on the direction and level of the impact. The drone can survive collisions at speeds up to 7.6 m/s and carry a 540 g payload while maintaining stable flight. This work contributes to the morphological design of soft aerial robots with high agility and reliable safety, enabling operation in cluttered and human-shared environments, and lays the groundwork for future fully soft drones that integrate compliant structures with intelligent control.

2605.25789 2026-05-26 cs.LG cs.AI cs.IT math.IT stat.ML

On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits

Yunlong Hou, Zixin Zhong, Vincent Y. F. Tan

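AI summary: Studies a multi-armed bandit problem in which cumulative regret is minimized after an initial free exploration phase, proposes the two-phase algorithm UFE-KLUCB-H, and proves that it incurs strictly less regret than policies without free exploration.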

Comments
55 pages

Abstract

We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting not captured by the classic regret minimization or pure exploration paradigms. The goal is to design an adaptive policy that strategically explores the bandit instance in the initial free exploration phase and minimizes the cumulative regret in the subsequent phase. We formalize this regret minimization with free exploration problem and identify an interesting regime where the free exploration budget scales logarithmically with the time horizon. To quantify the amount of regret saved with high probability as a result of the availability of the free exploration phase, we introduce a novel set of policies known as $(α,β)$-probably saving policies. We propose a two-phase, probably saving algorithm, UFE-KLUCB-H, which consists of a principled free exploration policy, UFE, and a history-aware regret minimization policy KLUCB-H. Instance-dependent upper bounds on UFE-KLUCB-H are derived, showing that UFE-KLUCB-H accumulates strictly less regret than policies that do not have access to a free exploration phase. Complementarily, we derive instance-dependent lower bounds based on novel multi-instance perturbation arguments tailored to the free-exploration setting, demonstrating the near-optimality of UFE-KLUCB-H for two-valued bandits. Our upper and lower bounds reveal sharp phase transitions in the accumulated regret depending on the amount of available free exploration. Simulations are conducted to demonstrate that forced exploration and adaptivity in the algorithm lead to greater regret savings.

2605.25786 2026-05-26 cs.LG cs.AI

NPSolver: Neural Poisson Solver with Iterative Physics Supervision

Bocheng Zeng, Rui Zhang, Runze Mao, Mengtao Yan, Xuan Bai, Yang Liu, Zhi X. Chen, Hao Sun

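AI summary: Proposes NPSolver, a label-free neural Poisson solver trained with iterative physics supervision (a small number of PCG steps), and introduces the Boundary-Aware Transolver architecture; it outperforms physics-informed and data-driven baselines on 2D and 3D irregular geometries.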

Comments
KDD 2026

Abstract

Efficiently solving Poisson equations on complex, irregular domains remains a fundamental challenge in scientific computing, as classical iterative solvers often suffer from prohibitive runtime due to ill-conditioned systems. While neural operators offer a fast alternative, they typically rely on large-scale labeled datasets or struggle with unstable training dynamics when using physics-informed residual losses. We propose NPSolver, a neural Poisson solver trained without solution labels via iterative physics supervision. Instead of relying on fully converged numerical solutions or raw PDE residuals, NPSolver utilizes a small number of preconditioned conjugate gradient (PCG) steps to refine its own predictions, providing a more stable and well-scaled training signal. Theoretical analysis confirms that this iterative supervision serves as a well-conditioned error proxy and that a stop-gradient design is essential for optimization stability. To better capture boundary-driven features under mixed boundary conditions, we further introduce the Boundary-Aware Transolver (BA-Transolver) architecture that explicitly separates interior and boundary tokenization. Extensive evaluations on 2D and 3D irregular geometries demonstrate that NPSolver outperforms both physics-informed and data-driven baselines. Furthermore, a downstream thermal control task highlights the model's capability for conducting efficient and reliable gradient-based boundary control. We will release our codes and data at https://github.com/intell-sci-comput/NPSolver.
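
The supervision signal described here can be pictured as follows: start from the network's prediction, run a few PCG steps on the discretized system, and use the refined vector as a fixed target. The sketch below shows that refinement step in plain Python, without autograd. The Jacobi preconditioner, the small dense 1D Poisson matrix, and the names `pcg_refine` and `supervision_loss` are assumptions for illustration, not the paper's implementation.

```python
def matvec(a, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg_refine(a, b, x0, steps):
    """Run `steps` Jacobi-preconditioned CG iterations on a x = b, starting at x0."""
    x = list(x0)
    r = [bi - ci for bi, ci in zip(b, matvec(a, x))]
    inv_diag = [1.0 / a[i][i] for i in range(len(a))]
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(steps):
        if rz <= 1e-30:  # already converged, nothing left to refine
            break
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


def supervision_loss(prediction, a, b, steps=2):
    """Squared distance to the refined target, which is treated as a constant
    (the role a stop-gradient would play in an autograd framework)."""
    target = pcg_refine(a, b, prediction, steps)
    return sum((pi - ti) ** 2 for pi, ti in zip(prediction, target))


# Toy 1D Poisson system (second-difference matrix), values are illustrative.
A = [[2.0, -1.0, 0.0, 0.0], [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0], [0.0, 0.0, -1.0, 2.0]]
rhs = [1.0, 0.0, 0.0, 2.0]
loss = supervision_loss([0.0, 0.0, 0.0, 0.0], A, rhs)
```

Because the target comes from a few solver steps rather than a fully converged solution, it is cheap to compute, and it moves toward the true solution as the prediction improves.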

2605.25784 2026-05-26 cs.CV cs.MM

VertiCue-Bench: Diagnosing Whether MLLMs Use Height Cues to Resolve 2D Ambiguity in Remote Sensing Natural Scenes

Jing Huang, Duanchu Wang, Junjie Yang, Zihang Cheng, Cheng Li, Lin Cui, Zhouyi Wu, Di Wang

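AI summary: Proposes the VertiCue-Bench benchmark, whose 1,534 instances across 17 tasks diagnose whether MLLMs genuinely use vertical cues from Canopy Height Models (CHMs) to resolve semantic ambiguity in remote sensing natural scenes; it finds a marked disconnect between perceiving height cues and semantic reasoning.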

Abstract

Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric, evaluating models primarily on optical appearance. In natural environments, this paradigm breaks down due to severe spectral confusion, where ecologically distinct regions share similar textures but differ fundamentally in vertical structure. In such cases, explicit 3D structural data, such as Canopy Height Models (CHMs), become essential geometric evidence for semantic disambiguation. Yet, it remains unclear whether current MLLMs can genuinely leverage vertical cues to resolve appearance-level ambiguity. To address this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-grounded geospatial reasoning. VertiCue-Bench comprises 1,534 carefully curated instances across 17 tasks, explicitly disentangling low-level height perception from ambiguity-aware semantic reasoning. Evaluations on 14 state-of-the-art general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveal a striking perception-reasoning dissociation. While models exhibit emerging competence in reading raw CHM height cues, they largely fail to translate geometric perception into reliable semantic reasoning, often underperforming RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a critical geometry-to-semantics gap in natural scene understanding, offering actionable insights for advancing geospatial MLLMs.

2605.25781 2026-05-26 cs.CL

Double Triangle Annotation: A Scalable Human-in-the-Loop Framework for High-Precision Historical Document Annotation

Yi Ren

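AI summary: Proposes the Double Triangle Annotation framework, which uses two layers of human-in-the-loop review and cross-model consensus to automate most annotation work, enabling high-precision structured information extraction from historical documents.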

Comments
12 pages, 4 figures. ACL ARR 2026 March submission

Abstract

Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotation, a two-layer human-in-the-loop framework that leverages cross-model consensus to automate the majority of annotation work while ensuring high-precision outputs. In the first layer, two architecturally independent Multimodal Large Language Models annotate each document in parallel; when they agree, the label is auto-accepted, and disagreements are routed to a human jury. A second layer cross-checks two such systems against each other, escalating residual conflicts to a domain expert. The framework rests on a single assumption -- error independence between models -- requires no distributional priors or task-specific calibration, and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, the framework achieves a final Word Error Rate of 0.003. Applied at scale, model consensus auto-accepts over 85% of 13,595 fields. We release the resulting benchmark -- the first structured-extraction ground truth for the Rosenwald Guides -- to support future work on historical document processing.
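
The first-layer routing rule (auto-accept on agreement, escalate on disagreement) can be sketched as below. This is a minimal illustration of the consensus step only. The function name `route_fields`, the dictionary-based label format, and the example field values are hypothetical and not taken from the released benchmark.

```python
def route_fields(labels_a, labels_b):
    """Split fields into auto-accepted labels and fields needing human review.

    labels_a, labels_b: field name -> label, one dict per annotating model.
    A field is accepted only when both models produced it and they agree.
    """
    accepted = {}
    needs_review = []
    for field in sorted(set(labels_a) | set(labels_b)):
        both_present = field in labels_a and field in labels_b
        if both_present and labels_a[field] == labels_b[field]:
            accepted[field] = labels_a[field]
        else:
            needs_review.append(field)  # disagreement or missing label
    return accepted, needs_review


model_a = {"name": "Dupont", "street": "Rue X", "year": "1887"}  # made-up values
model_b = {"name": "Dupont", "street": "Rue Y", "year": "1887"}
accepted, needs_review = route_fields(model_a, model_b)
```

The precision of the accepted set depends on the abstract's stated assumption of error independence: two independent models rarely produce the same wrong label, so agreement is strong evidence of correctness.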

2605.25778 2026-05-26 cs.CV

OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance

Zitong Xiao, Yuda Qiu, Zisheng Ye, Xiaoguang Han

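AI summary: Proposes OMGTex, an end-to-end diffusion framework that reconstructs high-quality, editable UV textures directly from multi-style facial images without 3D geometry priors, using gradient-guided inference and semantic-aware training for robust reconstruction and editing.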

Comments
CVPR 2026 (Poster)

Abstract

We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously. Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Second, we leverage the inherent semantic distribution capability of diffusion models and design a novel training paradigm to enhance this tendency, enabling semantic-aware editing of facial texture. Furthermore, to address the data scarcity in multi-style texture reconstruction, we construct CANVAS, the first comprehensive paired texture reconstruction dataset covering realistic and diverse stylized domains. To the best of our knowledge, OMGTex is the first geometry-free inference framework that achieves robust, style-consistent, and editable facial texture reconstruction across diverse domains. Our method achieves state-of-the-art performance on multiple facial texture benchmarks.

2605.25775 2026-05-26 cs.CV

DRFusion: Drift-Resilient Temporally Consistent Infrared-Visible Video Fusion

Xingyuan Li, Haoyuan Xu, Shulin Li, Xiang Chen, Zhiying Jiang, Jinyuan Liu

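AI summary: Proposes a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation, achieves temporal consistency through Stabilized History Guidance and Soft Temporal Anchoring, and adopts a Decoupled Structure-Motion Adaptation strategy, reaching state-of-the-art fusion quality and temporal stability.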

Comments
11 pages, 7 figures, 4 tables

Abstract

Infrared and visible video fusion is essential for achieving comprehensive perception in dynamic scenes. However, maintaining temporal consistency remains a formidable challenge. Conventional methods relying on optical flow often suffer from geometric rigidity and ghosting artifacts. Moreover, standard diffusion-based fusion models typically operate in a frame-by-frame manner; when extended to autoregressive settings, they lack intrinsic temporal constraints and are prone to severe error accumulation and drifting, where minor artifacts amplify over time. To address these limitations, we propose a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation. We introduce Stabilized History Guidance and Soft Temporal Anchoring to reframe temporal consistency as spectral filtering, implicitly aggregating motion dynamics without rigid alignment. Furthermore, our Decoupled Structure-Motion Adaptation strategy bridges pre-trained priors and structural constraints via two-stage training and latent refinement. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both fusion quality and temporal stability.

2605.25771 2026-05-26 cs.LG cs.AI

MDGMIX: Boundary-Aware Subgraph Mixing for Multi-Domain Graph Pre-Training

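Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang

AI summary: To address data redundancy in multi-domain graph pre-training, proposes the MDGMIX framework, which decouples shared and domain-specific patterns through boundary-aware subgraph mixing and hierarchical discriminative learning and uses a lightweight prompt weighting mechanism during adaptation; it outperforms strong baselines on few-shot classification tasks while being more efficient.

Details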
Comments
Accepted by ICML2026
Abstract

Multi-domain graph pre-training is a crucial step in constructing foundational graph models with cross-domain generalization capabilities. However, existing methods predominantly rely on jointly training all source domain graphs, resulting in high computational costs. Furthermore, it remains unclear whether all source domain graph data contribute equally to effective transfer. This paper empirically reveals significant data redundancy in multi-domain graph pre-training. Based on this finding, we propose the Multi-domain Graph Pre-training Framework, MDGMIX, which combines boundary-aware subgraph mixing with hierarchical discrimination. By selecting boundary nodes to construct challenging mixed-domain subgraphs, MDGMIX employs coarse-grained domain discrimination and fine-grained domain decomposition losses to decouple shared patterns from domain-specific patterns. During adaptation, MDGMIX employs a lightweight prompt weighting mechanism to transfer source domain knowledge. Extensive experiments demonstrate that MDGMIX consistently outperforms strong baselines in few-shot classification tasks while exhibiting superior time and memory efficiency. The code is available at: https://github.com/zhengziyu77/MDGMIX.

2605.25765 2026-05-26 cs.CV cs.AI cs.LG

Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

Saemi Moon, Suhyeon Jun, Seoyeon Lee, Dongwoo Kim

AI summary: Proposes PURE, which builds forget and retain bases in the cross-attention activation space and edits the weights with a linear projection, effectively erasing target concepts while preserving retained ones.

Details
Abstract

Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder's response to a few short anchor prompts that name it, so paraphrased prompts that evoke the concept without naming it consistently bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user's prompt, while cross-attention activations describe what the model is about to render, and the latter generalize to paraphrases that the anchor templates do not cover. Building on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds the forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE significantly reduces target leakage under paraphrased and adversarial prompts while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among evaluated methods.
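The idea of applying a linear projector to a weight matrix can be illustrated with a minimal sketch. It assumes a single forget direction in the layer's output (activation) space and ignores the retain basis entirely; the function name `project_out` and the toy matrices are hypothetical and are not taken from PURE.

```python
def project_out(weight, direction):
    """Return (I - u u^T / ||u||^2) @ weight for one forget direction u.

    weight is a list of rows (d_out x d_in); direction has length d_out.
    After the edit, every output of the layer is orthogonal to direction.
    """
    norm_sq = sum(x * x for x in direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    rows, cols = len(weight), len(weight[0])
    # Component of each weight column along the forget direction.
    coeffs = [
        sum(direction[i] * weight[i][j] for i in range(rows)) / norm_sq
        for j in range(cols)
    ]
    # Subtract that component from every column.
    return [
        [weight[i][j] - direction[i] * coeffs[j] for j in range(cols)]
        for i in range(rows)
    ]


if __name__ == "__main__":
    w = [[1.0, 2.0], [3.0, 4.0]]
    print(project_out(w, [1.0, 0.0]))
```

With several forget directions, the same edit would use an orthonormal basis of those directions in place of the single vector.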

2605.25764 2026-05-26 cs.CV cs.AI

Benchmarking Pathology Foundation Models for Spatial Domain Understanding

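Bokai Zhao, Yiyang Zhang, Yuanchi Zhu, Hanqing Chao, Long Bai, Tai Ma, Minfeng Xu, Ming Song, Tianzi Jiang

AI summary: Proposes the SpaPath-Bench benchmark, which uses spatial domain identification tasks to evaluate how well pathology foundation model representations distinguish tissue regions and capture spatial relationships.

Details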
Comments
MICCAI2026
Abstract

Pathology foundation models (PFMs) have emerged as a core approach for learning transferable representations from whole slide images (WSIs), and they are typically benchmarked through downstream clinical endpoints. While such task level evaluations are indispensable, they offer limited insight into what the representations themselves encode, particularly whether PFM embeddings can distinguish meaningful tissue regions and capture their spatial relationships. We present SpaPath-Bench, a representation level benchmark designed to diagnose spatial representation capability in PFMs. SpaPath-Bench formulates spatial domain identification (SDI) on paired whole slide image and spatial transcriptomics (ST) data as a diagnostic task. It curates 42 public paired WSI and ST slides, enables large scale evaluation across 19 encoders and seven SDI methods, and measures partition quality using three complementary criteria: unsupervised spatial coherence, transcriptomics referenced agreement, and expert referenced agreement. Across 83K runs, SpaPath-Bench reveals that different pretraining paradigms capture distinct aspects of tissue spatial architecture, and it provides practical guidance for building the next generation of spatially aware computational pathology models. Code and data pipelines are publicly available at https://bokai-zhao.github.io/SpaPath-benchboard/.

2605.25759 2026-05-26 cs.CV

Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences

Bao Li, Yuliang Xiu, Zhen Liu

AI summary: Proposes the ASAP framework, which builds controlled preference pairs with a localized degradation mechanism and combines them with a localized, margin-bounded DPO variant to reduce anatomical errors while maintaining overall image quality.

Details
Abstract

Large-scale text-to-image foundation models have achieved remarkable visual realism, yet generating human images with correct anatomical structures remains challenging. Existing approaches enforce anatomical constraints through part-specific modules or localized loss weighting during supervised fine-tuning on high-quality human photos, but such datasets are limited and often provide ambiguous optimization signals due to confounding factors such as lighting, pose, and background. Preference-based alignment offers an alternative, but standard Direct Preference Optimization (DPO) treats all pixels equally and therefore fails to exploit the localized nature of anatomical artifacts. To address this, we propose the framework of Alignment via Synthetic Anatomical Preference (ASAP), which constructs controlled preference pairs through a localized degradation mechanism applied to high-fidelity human images. This mechanism performs a controlled experiment on images by introducing explicit anatomical errors in targeted regions while preserving the remaining content. With this mechanism, we create the Human Anatomical Preference (HAP) dataset with over 10K curated pairs for effective anatomical alignment of text-to-image human image generative models. To better leverage the locality of these controlled preference pairs, we introduce a localized and margin-bounded variant of DPO that prioritizes optimization in targeted anatomical regions while enforcing a finite preference margin to prevent over-optimization and preserve global semantics. We further introduce HAF-Bench, a benchmark for systematic evaluation of anatomical fidelity. Extensive experiments demonstrate that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.

2605.25751 2026-05-26 cs.CV

SplitAvatar: One-shot Head Avatar with Autoregressive Gaussian Splitting

Hongzhe Liao, Chuhua Xian, Hongmin Cai, Haiyang Liu, Fa-Ting Hong

AI summary: Proposes a method for reconstructing an animatable head avatar from a single image based on autoregressive Gaussian splitting; a graph splitting network progressively generates Gaussians, addressing the mismatch in Gaussian counts and the loss of fine-grained detail.

Details
Abstract

3D Gaussian Splatting (3DGS) provides an efficient method for high-quality scene reconstruction using anisotropic Gaussians. Recently, 3DGS-based methods have significantly improved the rendering quality of human avatars while enabling real-time performance. However, existing methods suffer from a magnitude mismatch in the number of Gaussians generated by image-based and 3DMM-based approaches. This discrepancy results in reconstructed expressions that lack fine-grained detail. In this paper, we introduce a novel method for reconstructing an animatable head avatar from a single image. We propose a Graph splitting network to progressively generate Gaussians from coarse to fine using an autoregressive architecture. To address the graph inconsistency caused by split Gaussians, we employ a mesh topology extension method to align the GNN's connectivity with the increased Gaussian count. Furthermore, we introduce a novel density control method that includes a gating mechanism that generates soft masks for Gaussians, preventing over-densification after the splitting operation. This allows for dynamic control over Gaussian density across different facial regions. For smooth and rapid training, we employ a delayed filtering strategy to avoid re-computing the graph topology during training. Experimental results demonstrate that our autoregressive structure effectively improves expression representation ability by progressively splitting Gaussians. This process, enabled by the GNN-guided splitting, synthesizes more precise facial details and achieves higher reconstruction quality.

2605.25750 2026-05-26 cs.LG

Invariant-Based Weight Sharing for Message Passing

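Florian Seiffarth

AI summary: Proposes a weight sharing principle based on graph invariants: indexing weights directly by graph invariants makes message-passing neural networks more structure-aware, and the approach outperforms standard MPNNs on synthetic and real-world data.

Details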
Comments
13 pages main paper + 30 pages references and appendix
Abstract

Message-passing neural networks (MPNNs) are a powerful framework for learning representations of graph-structured domains. However, weights in MPNNs act on features only, limiting their ability to capture structural patterns. We introduce a novel structure-aware weight sharing principle that explicitly incorporates information inherent to the graph structure. Weights are indexed directly by user-chosen graph invariants, i.e., functions preserved under node permutations, enabling systematic reuse across structurally equivalent subgraphs. We present ShareGNNs, which instantiate this principle within a simple encoder-decoder architecture, resulting in an MPNN with learnable adjacency and transformer-like connectivity. We show that their expressivity is at least as strong as the discriminative power of the chosen invariants, providing explicit control over the model complexity. Experiments on synthetic and real-world data, as well as subgraph counting tasks, demonstrate consistent improvements over standard MPNNs, competitive expressivity beyond the 1-WL test, and scalability to large datasets.
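A minimal sketch of what indexing weights by a graph invariant could look like, assuming scalar node features and using the pair of node degrees as the invariant. Both the invariant and the names `degree_invariant` and `invariant_message_passing` are illustrative choices and not the ShareGNN architecture.

```python
def degree_invariant(adj, node):
    """A simple permutation-invariant node label: the node's degree."""
    return len(adj[node])


def invariant_message_passing(adj, features, weights, invariant=degree_invariant):
    """One aggregation step where each message is scaled by a weight that is
    looked up from the (receiver invariant, sender invariant) pair."""
    out = {}
    for receiver in adj:
        total = 0.0
        for sender in adj[receiver]:
            key = (invariant(adj, receiver), invariant(adj, sender))
            # Edges with the same invariant pair reuse the same weight.
            total += weights.get(key, 0.0) * features[sender]
        out[receiver] = total
    return out


if __name__ == "__main__":
    # Path graph a - b - c: the two end nodes are structurally equivalent.
    adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    features = {"a": 1.0, "b": 1.0, "c": 1.0}
    weights = {(1, 2): 0.5, (2, 1): 2.0}
    print(invariant_message_passing(adj, features, weights))
```

In this toy graph the two end nodes receive identical updates because their edges map to the same weight entry, which is the kind of reuse across structurally equivalent parts that the abstract describes.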

2605.25749 2026-05-26 cs.IR cs.AI cs.LG

DeGRe: Dense-supervised Generative Reranking for Recommendation

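Chaotian Song, Jingyao Zhang, Chenghao Chen, Zisen Sang, Dehai Zhao, Guodong Cao, Boxi Wu, Deng Cai, Jia Jia

AI summary: Proposes the DeGRe framework, in which dense supervision signals from offline exploration (the Lookahead Evaluator) guide the Online Generator to perform single-pass greedy decoding, addressing heuristic label bias and the credit assignment problem in reranking.

Details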
Comments
Accepted to KDD 2026 (ADS Track)
Abstract

In multi-stage recommender systems, reranking optimizes overall utility by capturing intra-list contextual dependencies, yet its central challenge lies in exploring optimal sequences within an exponentially large permutation space. Recent studies have shifted towards end-to-end generative frameworks, which typically leverage list-wise rewards or preference alignment to guide generator training. However, these methods still face two critical issues. First is the heuristic label bias. Existing methods often construct training targets based on simple rules, such as promoting clicked items to the top, while ignoring causal dependencies within the list context. Second is the credit assignment problem. Sparse list-level posterior rewards fail to directly guide intermediate steps in sequence generation, leading to ambiguous optimization directions. To address these issues, we propose DeGRe (Dense-supervised Generative Reranking), a generative reranking framework that bridges the gap between offline exploration and online efficiency through dense supervision. The core of DeGRe lies in its offline-online decoupled design. During the offline phase, we introduce a Lookahead Evaluator based on cumulative regression, which leverages beam search to actively mine high-value lookahead sequences in the unexposed space. During training, we transform the step-wise value estimations from the evaluator into dense supervision signals and distill them into a lightweight Online Generator. This mechanism enables the generator to internalize lookahead planning capabilities, requiring only a single efficient greedy decoding pass during online inference to approximate the global optimum. Experiments demonstrate that DeGRe outperforms baseline models on public benchmarks and industrial datasets. We have successfully deployed DeGRe on Taobao Flash Shopping, significantly improving online recommendations.
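The single greedy decoding pass mentioned above can be sketched as follows, assuming some step-wise scorer is already available. Here `toy_step_score` is a hypothetical stand-in for a learned generator (relevance minus a penalty for repeating the previous item's category); it is not DeGRe's model.

```python
def greedy_rerank(candidates, step_score):
    """Build a list one position at a time, always taking the candidate
    with the highest step-wise score given the prefix chosen so far."""
    remaining = list(candidates)
    prefix = []
    while remaining:
        best = max(remaining, key=lambda item: step_score(prefix, item))
        prefix.append(best)
        remaining.remove(best)
    return prefix


def toy_step_score(prefix, item):
    """Hypothetical scorer: item = (name, category, relevance)."""
    _, category, relevance = item
    repeats_category = bool(prefix) and prefix[-1][1] == category
    return relevance - (0.5 if repeats_category else 0.0)


if __name__ == "__main__":
    items = [("a", "x", 0.9), ("b", "x", 0.8), ("c", "y", 0.7)]
    print([name for name, _, _ in greedy_rerank(items, toy_step_score)])
```

Because the score depends on the prefix, the ordering differs from a plain sort by relevance, which is the list-context effect that reranking is meant to capture.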

2605.25748 2026-05-26 cs.AI

Agent-Centric Social Trajectory Prediction: A Free Energy Principle Perspective

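Yanping Wu, Ji Zhang, Hao Chen, Edmond S. L. Ho, Chongfeng Wei

AI summary: To address existing trajectory prediction methods' reliance on global state, insufficient belief inference under partial observability, and lack of cognitive behavioral constraints, proposes FEP-Diff, an agent-centric trajectory prediction framework grounded in the Free Energy Principle; with a dual-branch spatiotemporal encoder, a goal-conditioned belief learner, and a residual diffusion trajectory generator, it produces cognitively plausible predictions under restricted observability.

Details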
Comments
10 pages, 4 figures
Abstract

Trajectory prediction methods have demonstrated remarkable capabilities in capturing complex motion patterns. However, existing methods rely on global state assumptions, suffer from insufficient belief inference under partial observability, and lack cognitive behavioral constraints in prediction. These limitations severely compromise both deployment feasibility and physical plausibility in real-world settings. In this work, we propose FEP-Diff, an agent-centric trajectory prediction framework grounded in the Free Energy Principle, aimed at achieving cognitively plausible predictions under realistic constraints. Specifically, a dual-branch spatiotemporal encoder extracts ego-motion dynamics and social interaction cues from local observations. Building upon this, a goal-conditioned belief learner infers multimodal latent belief distributions optimized via a free-energy objective, with a social consistency constraint on the local neighborhood graph to promote cognitive alignment among neighboring agents. Finally, a residual diffusion trajectory generator is conditioned on the learned belief representations with token-level proxy conditioning, producing precise and diverse future predictions. Extensive experiments on five public benchmarks demonstrate that FEP-Diff consistently outperforms state-of-the-art methods under restricted observability. Code: https://anonymous.4open.science/r/FEP-Diff-8876.

2605.25746 2026-05-26 cs.MA cs.AI

Multi-Agent Coordination Adaptation via Structure-Guided Orchestration

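Haoran Li, Shulun Chen, Shaoyuan Sun, Hanchen Wang

AI summary: Proposes the MACA framework, which takes a probabilistic view of multi-agent coordination as joint posterior inference over structure and orchestration and uses a task- and budget-conditioned structural prior to guide policy-based orchestration, enabling efficient adaptive coordination; performance improves by 8.42% on average while token consumption falls by 43.19%.

Details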
Comments
21 pages
Abstract

As large language model (LLM)-based multi-agent systems scale to handle increasingly complex tasks, balancing structural stability and dynamic adaptability becomes increasingly challenging. Existing systems typically adopt either structure-centric methods, committing to structures determined upfront that limit fine-grained control, or orchestration-centric methods, adapting decisions dynamically while leaving coordination structure implicit and unstable. To address this challenge, we revisit multi-agent coordination from a probabilistic perspective, casting it as posterior inference over the joint distribution of structure and orchestration. We introduce MACA, an automated coordination framework that learns a task- and budget-conditioned structural prior over agent participation and interactions. This prior guides a policy-based orchestration as an approximation to posterior inference, enabling efficient solutions with fine-grained control. Across benchmarks, MACA outperforms adaptive multi-agent baselines by an average of 8.42% while using 43.19% fewer tokens. Further investigation reveals that joint adaptation of structure and orchestration suppresses redundant interactions, converging coordination toward task-effective execution.

2605.25745 2026-05-26 cs.CL

Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains

Hui Xie, Jie Liu, Ziyue Qiao, Joaquin Vanschore

AI summary: Proposes the Selective Latent Thinking (SLT) framework, which uses confidence-based gating to compress redundant reasoning steps into latent representations while keeping critical steps as explicit chain-of-thought; at comparable compression ratios its accuracy is 22.7% higher than latent reasoning baselines, and it shortens reasoning chains by 58.4% with only a 2.8% drop in accuracy.

Details
Abstract

Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reasoning as uniformly compressible, causing precision-critical intermediate steps to be overly compressed and thereby degrading reasoning accuracy. In this work, we propose Selective Latent Thinking (SLT), a framework that selectively compresses redundant reasoning spans into latent representations while preserving precision-critical spans as explicit CoT within the same reasoning trajectory. Specifically, SLT first uses a lightweight decoder to anticipate a short upcoming reasoning span, and then applies confidence-based gating to determine the longest span that can be reliably compressed. The accepted span is encoded into a compact latent representation to improve reasoning efficiency, while uncertain or precision-critical reasoning remains in explicit CoT form to preserve accuracy. To learn this selective compression policy, SLT adopts a three-stage training strategy that combines span-level latent compression, reliability-aware future reasoning prediction, and trajectory-level reinforcement learning to optimize the trade-off between answer correctness and reasoning cost. Extensive experiments across four mathematical reasoning benchmarks demonstrate that SLT achieves 22.7% higher accuracy than latent reasoning baselines at comparable compression ratios, while reducing reasoning chain length by 58.4% with only 2.8% accuracy degradation compared to explicit CoT. Our code is available at https://github.com/hunshi34/SLT.
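The gating step can be pictured with a small sketch. It assumes the drafted span comes with one confidence value per token and that a fixed threshold decides acceptance; the function names and the threshold value are hypothetical, and the actual gate in SLT is learned rather than hand-set.

```python
def longest_confident_prefix(confidences, threshold=0.9):
    """Count how many leading tokens of a drafted span clear the threshold."""
    accepted = 0
    for confidence in confidences:
        if confidence < threshold:
            break
        accepted += 1
    return accepted


def gate_span(draft_tokens, confidences, threshold=0.9):
    """Split a drafted span into the part to compress into a latent
    and the part to keep as explicit chain-of-thought."""
    if len(draft_tokens) != len(confidences):
        raise ValueError("one confidence value is required per token")
    cut = longest_confident_prefix(confidences, threshold)
    return draft_tokens[:cut], draft_tokens[cut:]


if __name__ == "__main__":
    tokens = ["so", "x", "=", "4"]
    print(gate_span(tokens, [0.99, 0.95, 0.60, 0.97]))
```

Stopping at the first low-confidence token keeps the compressed part contiguous, so a confident token that follows an uncertain one still stays explicit.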

2605.25740 2026-05-26 cs.LG

Latent Representation Alignment for Offline Goal-Conditioned Reinforcement Learning

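Hyungkyu Kang, Byeongchan Kim, Min-hwan Oh

AI summary: To address the bottleneck of erroneous value-function generalization in offline goal-conditioned reinforcement learning, proposes the Latent-Aligned Value Learning (LAVL) algorithm, a unified framework combining latent-representation-based value generalization with hierarchical planning; it achieves the best performance on 20 of the 22 OGBench datasets.

Details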
Comments
Accepted in ICML 2026
Abstract

Offline goal-conditioned reinforcement learning (GCRL) provides a practical framework for obtaining goal-reaching policies from fixed datasets. However, learning a reliable goal-conditioned value function in long-horizon tasks remains challenging. In this paper, we identify erroneous generalization in goal-conditioned value functions as a fundamental bottleneck, and demonstrate that appropriate inductive bias in the value function is crucial for addressing the bottleneck. Building on these findings, we propose Latent-Aligned Value Learning (LAVL), an offline GCRL algorithm that integrates latent-representation-based value generalization with hierarchical planning in a unified framework. Extensive experiments on OGBench demonstrate that LAVL consistently outperforms existing offline GCRL methods, achieving the highest performance on 20 out of 22 datasets. Notably, LAVL exhibits strong performance in long-horizon tasks and trajectory stitching datasets, where prior methods suffer significant performance degradation. Our code is available at https://github.com/oh-lab/LAVL.git.

2605.25739 2026-05-26 cs.LG cs.GT stat.ML

The Behavioral Credibility Trilemma: When Calibrated Autonomy Becomes Impossible

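Lauri Lovén, Nam Do, Hassan Mehmood, Dinesh Kumar Sah, Sasu Tarkoma

AI summary: Proves that, under rational oversight and whenever some tasks exceed the agent's reliable competence, no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy: the Behavioral Credibility Trilemma.

Details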
Comments
48 pages, 3 figures
Abstract

We prove that no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy under rational oversight, whenever some tasks exceed the agent's reliable competence: the Behavioral Credibility Trilemma. The impossibility is geometric -- adding any non-affine autonomy incentive to a strictly proper scoring rule destroys strict properness, so an agent rewarded for both calibrated confidence and autonomous action systematically inflates its reported confidence on tasks below the principal's approval threshold. The Behavioral Perturbation Lemma quantifies the inflation (scaling as $w_A/(2 w_C)$ for the Brier score) and shows detection requires $\Omega(1/\Delta^2)$ observations. We prove the principal's optimal oversight rule is necessarily non-affine, making the impossibility unconditional and optimizer-independent across log-concave-density policy families. We formalize the Confidence-Gated Decision Problem, map existing methods onto the trilemma, and identify two constructive resolution pathways (commitment, domain separation). A 540-configuration Best-of-N experiment tests five pre-registered hypotheses, all strongly confirmed (effect sizes $d = 1.10$ to $5.32$), and adds a descriptive analysis of the achievable-$(H, C, A)$ surface geometry showing a plateau-truncated frontier consistent with the predicted inflation saturation.
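The inflation scaling quoted in the abstract can be checked numerically in a toy setting. The sketch assumes, purely for illustration, that the autonomy incentive is a bonus `w_a * q` that grows linearly with the reported confidence `q`; this reward form and the function names are assumptions and not the paper's formulation. Under that assumption, setting the derivative of the expected reward to zero gives a report of `p + w_a / (2 * w_c)` for a true success probability `p`.

```python
def expected_reward(q, p, w_c, w_a):
    """Expected reward for reporting confidence q when the true success rate is p."""
    # Expected Brier penalty over a Bernoulli(p) outcome.
    brier = p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2
    # Hypothetical autonomy bonus, linear in the reported confidence.
    return -w_c * brier + w_a * q


def best_report(p, w_c, w_a, steps=10_000):
    """Grid-search the report in [0, 1] that maximizes the expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(q, p, w_c, w_a))


if __name__ == "__main__":
    p, w_c, w_a = 0.6, 1.0, 0.2
    print(best_report(p, w_c, 0.0))  # no bonus: the best report stays near p
    print(best_report(p, w_c, w_a))  # with bonus: near p + w_a / (2 * w_c)
```

With the bonus switched off, the Brier term alone is maximized by the honest report, which is what strict properness means in this toy case.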

2605.25737 2026-05-26 cs.CV

SFR-Net: Learning Scale-Frustum Representations for Ultra-Wide Area Remote Sensing Image Segmentation

Chuyu Zhong, Keyan Chen, Qinzhe Yang, Bowen Chen, Zhengxia Zou, Zhenwei Shi

AI summary: To handle the large scale differences among ground objects and long-range contextual semantic continuity in ultra-wide area remote sensing images, proposes the Scale-Frustum Representation Network (SFR-Net), which builds scale-frustum representations and a cascaded cross-scale fusion mechanism, improving mIoU by 1.72% on GID and 4.29% on FBPS.

Details
Abstract

Pixel count and geographical coverage are two key characteristics of remote sensing images. Existing remote sensing image segmentation methods typically focus on images with either a small pixel count or a large pixel count but limited geographical coverage. In this paper, we introduce a novel segmentation task targeting ultra-wide area (UWA) remote sensing images, characterized by both a large pixel count and extremely wide geographical coverage. The core challenges of UWA segmentation lie in simultaneously handling ground objects with significantly varying scales and maintaining long-range contextual semantic continuity. To address these challenges, we propose the Scale-Frustum Representation Network (SFR-Net). Inspired by the viewing frustums of remote sensing images captured from different altitudes, we construct scale-frustum representations, enabling unified modeling of ground objects and contextual features at different scales. Furthermore, we design a cascaded cross-scale fusion mechanism to effectively integrate these representations, enhancing local semantic understanding while ensuring long-range contextual continuity. Experimental results on GID and FBPS demonstrate that SFR-Net achieves state-of-the-art performance, improving mIoU by 1.72% and 4.29%, respectively, over the strongest competing methods. In addition, the proposed scale-frustum representations can be integrated into generic segmentation networks to improve both segmentation accuracy and convergence speed. The implementation code will be publicly available at https://github.com/ChuyuZhong/SFR-Net.

2605.25735 2026-05-26 cs.AI

A Deep Dive into Axiomatic Design -- Part I: Problem Formulation

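Aydin Homay

AI summary: Focuses on the problem formulation step in axiomatic design: clarifies the definition and characteristics of first-level functional requirements, analyzes common misconceptions and difficulties, offers practical guidance, and finally discusses the role of large language models in this step.

Details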
Comments
The paper has been accepted at ICAD 2026 (MIT); the final camera-ready version will be available once it is published by Springer.
Abstract

Problem formulation, that is, translating customer needs and constraints into a minimum set of independent first-level functional requirements (FRs), is arguably the most critical step in every design framework, including axiomatic design, yet it is frequently misunderstood or underestimated in practice. This paper focuses exclusively on problem formulation in axiomatic design: it clarifies what first-level FRs are (and are not), explains why they should not legitimately vary across designers given the same needs and constraints, and highlights intrinsic difficulties and recurring pitfalls that lead to design failure. The discussion is grounded primarily in Nam P. Suh's three books, The Principles of Design, Axiomatic Design: Advances and Applications, and Complexity Theory, and it offers practical guidance to help designers formulate well-posed first-level FRs. Finally, the paper briefly revisits problem formulation in the era of large language models and discusses what such tools can (and cannot) contribute at the first level.

2605.25730 2026-05-26 cs.CV

DeCoDrift: Stabilizing Decoder Coupling in Closed-Loop Foundation Segmentation

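H. M. Shadman Tabib, Md. Shamsuzzoha Bayzid, M Sohel Rahman

AI summary: To address error accumulation caused by decoder coupling drift in closed-loop iterative segmentation, proposes DeCoDrift, an inference-time stabilization framework requiring no training or ground-truth supervision, which constrains prompt updates and preserves decoder coupling to improve attention stability, temporal consistency, and segmentation quality.

Details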
Comments
18 Pages, 5 Figures
Abstract

Foundation segmentation models such as Segment Anything Model (SAM) are now routinely used in iterative pipelines, where each predicted mask is fed back as the next prompt. This practice turns segmentation into a closed-loop dynamical process, yet the decoder-level behavior of these systems remains largely unexamined. We show that this feedback loop can induce a previously overlooked failure mode, decoder coupling drift, in which the mask decoder's cross-attention progressively loses alignment with the target object, causing errors to accumulate across iterations. We study this phenomenon by instrumenting SAM's mask decoder and deriving ground-truth-free measures of prompt-image coupling, attention stability, and temporal consistency. On volumetric electron microscopy data, these decoder-internal signals reveal that standard iterative prompting systematically degrades attention alignment and temporal coherence relative to oracle-anchored feedback. We then formalize iterative prompting as a discrete-time dynamical system and show how proximal anchoring reduces error amplification in the feedback loop. Building on this analysis, we introduce DeCoDrift, a training-free inference-time stabilization framework that constrains prompt updates and preserves decoder coupling across iterations. Across extensive experiments, DeCoDrift consistently improves attention stability, temporal coherence, and segmentation quality over standard iterative prompting, without retraining or ground-truth supervision. More broadly, our results show that decoder-internal dynamics are not merely diagnostic: they provide actionable signals for stabilizing foundation segmentation models in closed-loop use.
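One generic way to read "proximal anchoring" is as a proximal update that pulls each new prompt toward the previous one. The sketch assumes a box prompt given as four coordinates and a hypothetical anchoring weight `lam`; it illustrates a standard proximal step and is not the DeCoDrift update rule, which the abstract does not specify.

```python
def anchored_update(previous, proposed, lam=1.0):
    """Closed-form minimizer of ||p - proposed||^2 + lam * ||p - previous||^2.

    lam = 0 accepts the proposed prompt unchanged; larger lam keeps the
    prompt closer to the previous iteration and damps sudden jumps.
    """
    if len(previous) != len(proposed):
        raise ValueError("prompts must have the same number of coordinates")
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return [(new + lam * old) / (1.0 + lam) for old, new in zip(previous, proposed)]


if __name__ == "__main__":
    previous_box = [0, 0, 10, 10]
    proposed_box = [4, 4, 14, 14]
    print(anchored_update(previous_box, proposed_box, lam=1.0))
```

Each coordinate moves only a fraction `1 / (1 + lam)` of the way to the proposed value, so an error in one predicted mask shifts the next prompt by less than it would under plain feedback.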