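arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.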

2605.15887 2026-05-18 cs.LG

Practical Validity Conditions for Byzantine-Tolerant Federated Learning

Mélanie Cambus, Darya Melnyk, Tijana Milentijević, Stefan Schmid

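AI summary: This paper proposes the minimum enclosing ball (MEB) validity condition and its relaxation, c-MEB validity, as more practical validity guarantees for federated learning. The relaxed form is achievable when a majority of clients is honest, and the paper relates both conditions to classical validity conditions.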

Abstract

Robust aggregation is the core operation in Byzantine-tolerant federated learning. To ensure the quality of aggregation independently of data distribution or attacks, validity conditions are needed. They provide geometric guarantees of where the output of the aggregation must lie. The widespread convex validity requires the output to lie in the convex hull of the honest vectors. Although this guarantee is strong in theory, it is poorly suited to modern federated learning systems, as it has dimension-dependent resilience and excludes many practical aggregation rules. We introduce the minimum enclosing ball (MEB) validity condition for robust aggregation, as well as its multiplicative relaxation, $c$-MEB validity, where $c$ is a constant. We show that exact MEB validity still suffers from limited resilience, while relaxed $c$-MEB validity is achievable if a majority of clients is honest, i.e. $n>2t$. We give an optimal MinMax-MEB rule for the relaxed condition with the bound $c<\sqrt{2}$ and prove explicit relaxed-MEB guarantees for standard aggregators including minimum-diameter averaging, medoid and geometric median. Finally, we relate MEB validity to convex, relaxed-convex and box validity studied in prior literature, thus providing a systematic map of geometric validity conditions for Byzantine-robust aggregation. Our results show that relaxed MEB validity connects validity conditions in distributed computing and Byzantine-tolerant aggregation rules, and offers a practical alternative to convex validity.
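
As a rough illustration of what a c-MEB check could look like, the sketch below approximates the minimum enclosing ball of the honest vectors with the Bădoiu-Clarkson iteration and tests whether an aggregate lies within c times that ball's radius of its center. The abstract does not state the formal definition, so this reading of the "multiplicative relaxation", the choice of approximation algorithm, and the names approx_meb and is_c_meb_valid are all assumptions made for illustration.

```python
import math

def approx_meb(points, iterations=2000):
    """Approximate the minimum enclosing ball (Badoiu-Clarkson iteration).

    Assumption: an approximate center is good enough for illustration.
    """
    center = list(points[0])
    for i in range(1, iterations + 1):
        # Move the center a shrinking step toward the current farthest point.
        farthest = max(points, key=lambda p: math.dist(p, center))
        center = [c + (f - c) / (i + 1) for c, f in zip(center, farthest)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius

def is_c_meb_valid(honest, output, c=1.0, iterations=2000):
    """Hypothetical check: output lies in the honest MEB scaled by c."""
    center, radius = approx_meb(honest, iterations)
    return math.dist(output, center) <= c * radius
```

With c = 1 this reduces to an (approximate) exact-MEB check; larger c admits outputs slightly outside the ball.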

2605.15886 2026-05-18 cs.CL

Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

Daria Blinova, Gayathri Emuru, Rakesh Emuru, Kushagradheer Shridheer Srivastava, Mina Rulis, Sunita Chandrasekaran, Benjamin E. Bagozzi

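AI summary: This paper introduces a dataset of linked multimodal political communication covering several decades of official Russian government speeches. It provides Russian and English text, images, and annotations, with topic labels produced by transformer-based multimodal models, to support multimodal analysis of political communication.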

2605.15880 2026-05-18 cs.CV cs.AI

FSCM: Frequency-Enhanced Spatial-Spectral Coupled Mamba for Infrared Hyperspectral Image Colorization

Tingting Liu, Yuan Liu, Guiping Chen, Xiubao Sui, Qian Chen

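AI summary: This paper proposes the FSCM framework, which improves the visual quality and semantic consistency of infrared hyperspectral image colorization through a frequency-enhanced spatial-spectral state-space generator and a dual-stream hybrid gating module.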

Abstract

Thermal infrared imaging is robust to illumination variations and smoke interference, making it important for all-weather perception. However, the lack of natural color and fine texture limits target recognition, human visual interpretation, and the transfer of visible-light models. Existing infrared colorization methods mainly rely on single-band images, where insufficient spectral cues may lead to structural distortion and semantic confusion. Although infrared hyperspectral images provide rich spectral responses and material information, existing single-band frameworks remain limited in modeling spatial-spectral coupling and weak texture details. To address these issues, this paper presents FSCM, a spectral-information-guided GAN framework. Within FSCM, a frequency-enhanced spatial-spectral state-space generator composed of cascaded FSB units is constructed. Each FSB integrates three complementary components: state-space modeling captures global spatial-spectral dependencies; the frequency enhancement module (FEM) combines multi-level wavelet decomposition and Fourier gating to recover structural contours, directional high-frequency details, and global frequency responses; and the dual-stream hybrid gating module (DGM) integrates deformation-aware sampling with sparse attention to enhance effective local structures and suppress background interference. Additionally, an online semantic segmentation-guided loss is introduced to constrain the generated results, improving semantic consistency in complex road scenes. Experiments show that FSCM outperforms existing infrared colorization methods in visual quality and semantic fidelity.

2605.15877 2026-05-18 cs.LG cs.AI

Shapley Neuron Values for Continual Learning: Which Neurons Matter Most?

Mohammad Ali Vahedifar, Abhisek Ray, Qi Zhang

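AI summary: This paper proposes a Shapley neuron valuation framework that quantifies neuron importance in continual learning to enable buffer-free continual learning. Experiments report accuracy gains of 2.88% in class-incremental learning and 6.46% in task-incremental learning.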

Comments This paper has been accepted to ICML 2026

2605.15874 2026-05-18 cs.LG

Ti-iLSTM: A TinyDL Approach for Logic-Level Anomaly Detection in Industrial Water Treatment Systems

Mandar Joshi, Farzana Zahid, Judy Bowen, Matthew M. Y. Kuo, Valeriy Vyatkin, Emil Karlsson

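AI summary: This paper proposes the Ti-iLSTM framework, which uses TinyDL for lightweight on-device logic-level anomaly detection by optimizing the memory and space footprint of an LSTM. It reports high detection performance on the SWaT dataset, with a deployment-style validation on WADI.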

Abstract

Industrial Water Treatment Systems (IWTS) are safety-critical cyber-physical infrastructures, and increased connectivity exposes them to cyber threats that can manipulate process behaviour without creating obvious device outliers. In particular, logic-layer deception anomalies can preserve numerically plausible measurements while breaking expected cause-and-effect relationships in the control process. These attacks are difficult to detect using threshold-based monitoring, and the alternatives require heavy server-oriented anomaly detection models. This paper explores the potential of Tiny Deep Learning (TinyDL) to provide lightweight on-device logic-level anomaly detection for resource-constrained Programmable Logic Controllers (PLCs). We propose a novel framework, TinyDL-based incremental LSTM (Ti-iLSTM), which optimises the memory and space footprint of Long Short-Term Memory (LSTM) networks to detect logic-layer inconsistencies in PLC-based IWTS. Experiments on the publicly available SWaT dataset show that the optimised model achieves high detection performance (F1-score = 0.983 and ROC-AUC = 0.998). A deployment-style validation on the WADI dataset confirms that the proposed lightweight framework remains applicable beyond a single dataset. The research demonstrates that combining logic-aware supervision with TinyDL sequence learning yields an efficient and accurate anomaly detection method suitable for resource-constrained PLCs in industrial environments.

2605.15871 2026-05-18 cs.AI

Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu, Yoram Bachrach

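AI summary: This paper proposes the AIRA-Compose and AIRA-Design frameworks, in which LLM agents autonomously design neural architectures for foundation models beyond the standard Transformer, improving model performance and efficiency.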

Comments 55 pages, 28 figures, 21 tables

Abstract

Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures reach within 2.3% and 2.6% of human state-of-the-art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models, marking a clear step toward recursive self-improvement.
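
For readers unfamiliar with the bits-per-byte metric quoted above, the following minimal sketch shows the standard conversion from a mean cross-entropy loss to bits per byte. It assumes the loss is measured in nats per token; the abstract does not describe how the benchmark computes the figure, and the function name is hypothetical.

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert a mean per-token loss (in nats) to bits per byte of raw text.

    Assumption: mean_loss_nats is the average cross-entropy over n_tokens
    tokens, and n_bytes is the byte length of the same evaluated text.
    """
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```

Normalizing by bytes rather than tokens makes the metric comparable across models with different tokenizers.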

2605.15868 2026-05-18 cs.CV

SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval

Wenjie Yang, Hang Yu, Yuyu Guo, Peng Di

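AI summary: This paper proposes SOLAR, a self-supervised framework for symmetric multimodal retrieval that leverages unlabeled data; experiments show it outperforms existing methods on the benchmark.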

Comments Accepted by ICML 2026

Abstract

In this work, we address the critical yet underexplored challenge of symmetric multimodal-to-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval works struggle with this task, as they are constrained by the labeled asymmetric datasets they use. We propose SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, in the first stage we learn the intersection mask of each image-text pair, allowing us to align the intersection while preserving the semantics of the differences. In the second stage, the learned mask is further used to construct positive and hard-negative samples by masking different parts of the image or text, which enables self-supervised multimodal embedding learning. Complementing this framework, we present a new benchmark featuring high-quality human-verified positive and hard-negative pairs to evaluate symmetric MM2MM retrieval under realistic conditions, as well as the corresponding pipeline. Extensive experiments against ten SOTA methods show SOLAR surpasses the strongest supervised VLM by 7.08 points on this benchmark, with over 50x fewer model parameters and a 5x smaller embedding dimension. Code and benchmark will be available soon.

2605.15862 2026-05-18 cs.LG q-bio.NC

From Observed Viability to Internal Predictive Approximation: A Single-Subject Latent-Space Analysis of Gait Dynamics Under Occlusal Constraint

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

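AI summary: Using data from a single subject, this study examines whether a longitudinal transformation of gait organization under occlusal constraint can be approximated within a predictive latent-space framework. The model preserves the displacement hierarchy, but the results do not support generalizable prediction or clinical viability forecasting.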

Comments 31 pages, 1 figure, 9 tables. Exploratory single-subject study combining gait analysis, occlusal observational probes, PCA-based latent-space modeling, and supervised predictive approximation

Abstract

Adaptive biomechanical systems may show similar observable gait performance while differing in latent organization and longitudinal behavior. This study examines whether an observed longitudinal transformation of gait organization can be approximated within a predictive latent-space framework, without claiming clinical prediction or causal occlusal effects. Using an exploratory single-subject design in a Parkinsonian participant, gait was recorded with instrumented insoles during two sessions separated by eleven weeks. Six occlusal observational probes were tested: natural occlusion, open-mouth disengagement, strong clenching, two vertical-dimension increases in centric relation, and one vertical-dimension increase with mandibular protrusion. Principal Component Analysis was used to construct a PC1--PC2 latent representation. A simplified supervised machine-learning model, implemented as a feed-forward neural network, was trained to approximate the observed M1--M2 transformation. The primary analysis focused on the three centric-relation conditions and tested whether the displacement hierarchy could be reproduced. The model preserved the ordering OC3 < ONL < OC2.5. The extended six-probe analysis also preserved the global structure of the exploratory displacement pattern, with OC3 and OC3P closely grouped and the highest displacements associated with OC2.5 and open-mouth disengagement. Held-out M2 and leave-condition-out analyses showed condition-dependent approximation variability. These findings do not establish generalizable prediction, therapeutic superiority, causal occlusal effects, or clinical viability forecasting. They support only the restricted conclusion that observed longitudinal latent transformations can be internally approximated within this single-subject dataset, providing a methodological bridge toward future multi-subject predictive viability models.

2605.15860 2026-05-18 cs.CV

On RGB-TIR Stereo Calibration under Extreme Resolution Asymmetry

Michał Król, Michał Salamonowicz, Władysław Skarbek, Michał Tomaszewski

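AI summary: This paper proposes a practical stereo calibration framework for RGB and TIR cameras that addresses calibration with a low-resolution thermal sensor. It achieves accurate calibration using a dynamic OLED screen and a dedicated corner detection algorithm, and validates the system for building energy performance assessment.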

Comments 27 pages, 12 figures, 3 tables

Abstract

Accurate geometric calibration of RGB-thermal infrared (TIR) stereo camera systems is essential for multimodal building envelope analysis, yet remains challenging when low-cost thermal sensors with very low spatial resolution are employed. This paper presents a practical stereo calibration framework for an RGB camera (2028 x 1520 px) paired with a TIR camera operating at only 80 x 62 px - a pixel-count ratio of approximately 1:625. An active OLED screen dynamically switches modality-specific patterns (checkerboard for TIR, ChArUco for RGB) on a single physical surface, providing controlled and repeatable thermal contrast. A dedicated corner detection algorithm combining perspective rectification, Hessian saddle-point analysis, and Mean Shift localisation achieves reliable checkerboard detection at 80 x 62 px without per-frame parameter tuning. A baseline-constrained bundle adjustment enforces physically consistent rig geometry under the planar-calibration-object degeneracy, yielding a stereo baseline of 32.7 mm (nominal 30 mm) with an overall reprojection error of 0.382 px. The system is validated on a thermally active building mock-up using constant-depth and per-pixel depth estimation, demonstrating consistent TIR-to-RGB projection suitable for building energy performance assessment.
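
To make the constant-depth TIR-to-RGB projection concrete, here is a minimal pinhole sketch, not the paper's calibrated pipeline. It assumes parallel optical axes, no lens distortion, and a pure translation along x between the cameras; the intrinsics in the usage note are hypothetical placeholders, and only the sensor resolutions and the 32.7 mm baseline come from the abstract.

```python
def pixel_count_ratio(rgb_size, tir_size):
    """Ratio of total pixel counts between the two sensors."""
    return (rgb_size[0] * rgb_size[1]) / (tir_size[0] * tir_size[1])

def tir_to_rgb(u, v, depth_m, k_tir, k_rgb, baseline_m):
    """Map a TIR pixel to RGB pixel coordinates at an assumed scene depth.

    k_tir and k_rgb are (fx, fy, cx, cy) tuples (hypothetical values).
    Assumption: the RGB frame is the TIR frame shifted by baseline_m along x.
    """
    fx_t, fy_t, cx_t, cy_t = k_tir
    fx_r, fy_r, cx_r, cy_r = k_rgb
    # Back-project the TIR pixel to a 3D point at the given depth.
    x = (u - cx_t) / fx_t * depth_m
    y = (v - cy_t) / fy_t * depth_m
    # Shift into the RGB frame and project with the RGB intrinsics.
    x_rgb = x + baseline_m
    return (fx_r * x_rgb / depth_m + cx_r, fy_r * y / depth_m + cy_r)
```

For example, pixel_count_ratio((2028, 1520), (80, 62)) gives roughly 621.5, consistent with the "approximately 1:625" figure quoted above. The baseline term scales with 1/depth, which is why a constant-depth assumption only holds for roughly planar scenes.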

2605.15855 2026-05-18 cs.CV

Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?

Renye Yan, Jikang Cheng, Shikun Sun, Yi Sun, You Wu, Wei Peng, Zongwei Wang, Ling Liang, Junliang Xing, Yimao Cai

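AI summary: This paper studies how the choice of optimized denoising steps affects RL fine-tuning of diffusion models and proposes AdaScope, which dynamically controls when training is applied to improve generation quality while reducing computational cost.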

Journal ref: CVPR2026

Abstract

Despite strong image-generation performance, diffusion models' reconstruction objectives limit alignment with human preferences. RL enables such alignment through explicit rewards. However, most studies apply RL to the full denoising trajectory, making it computationally costly and weakening preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action-reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose AdaScope, an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, AdaScope adaptively identifies the optimal intervention timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare 'dual benefit': a reduction in computational costs alongside a significant performance improvement. We offer theoretical grounds for the design of AdaScope. Compared with state-of-the-art methods, AdaScope improves performance by 66% while cutting computational cost by 59%.

2605.15845 2026-05-18 cs.RO

Structured Jacobian Construction for Motion Optimization with High-Order Time Derivatives in Multi-Link Systems

Taiki Ishigaki, Ko Ayusawa, Eiichi Yoshida

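AI summary: This paper proposes a structured Jacobian formulation for motion optimization problems with higher-order time derivatives in multi-link systems; by systematically representing physical quantities and their higher-order derivatives, it improves computational efficiency and numerical stability.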

Abstract

This paper presents a novel framework for Jacobian computation in motion optimization problems involving multi-link systems, where physical quantities are represented using higher-order time derivatives. In motion optimization of robots and humans, cost functions may incorporate higher-order time derivatives, such as jerk or the time variation of forces, to capture smoothness and perceptual characteristics, particularly in motion skill analysis and expressive behaviors, thereby necessitating Jacobian computations involving these quantities. However, such Jacobians are typically computed using numerical or automatic differentiation without explicitly exploiting the underlying multi-link structure, which can lead to increased computational cost and numerical instability. To address this limitation, we propose a structured Jacobian formulation for motion optimization, based on the comprehensive motion computation framework, in which physical quantities and their higher-order time derivatives are systematically represented along the multi-link structure. The proposed method systematically derives analytical expressions for Jacobians of kinematic and dynamic quantities, including momentum, forces, and joint torques, with respect to generalized coordinates and their higher-order derivatives. The resulting framework is applicable to both direct and inverse optimization. Through numerical experiments, we demonstrate that the proposed method improves computational efficiency compared to numerical and automatic differentiation, while achieving comparable accuracy. Furthermore, we demonstrate its effectiveness in inverse optimization by recovering cost function weights from motion data. Together, these results indicate that the proposed formulation provides a scalable and structured computational foundation for motion optimization involving higher-order time derivatives in multi-link systems.
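
For context, the sketch below shows the kind of purely numerical jerk term that such cost functions often use: a third-order finite difference over a sampled single-joint trajectory. This is the numerical baseline the abstract contrasts against, not the paper's analytical Jacobian method, and the function name is hypothetical.

```python
def jerk_cost(q, dt):
    """Squared-jerk cost of a uniformly sampled scalar trajectory.

    q  : list of joint positions sampled every dt seconds
    dt : sampling period in seconds
    Returns (cost, jerks), where jerks are third forward differences / dt^3.
    """
    jerks = [
        (q[k + 3] - 3.0 * q[k + 2] + 3.0 * q[k + 1] - q[k]) / dt ** 3
        for k in range(len(q) - 3)
    ]
    # Rectangle-rule approximation of the integral of jerk^2 over time.
    cost = sum(j * j for j in jerks) * dt
    return cost, jerks
```

For a cubic trajectory q(t) = t^3 the third difference recovers the constant jerk of 6 up to rounding, but on noisy data the division by dt^3 amplifies errors, which illustrates the kind of numerical instability the abstract mentions.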

2605.15843 2026-05-18 cs.CV

WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes

Jichen Hu, Jiawei Guo, Jiazhong Cen, Chen Yang, Sikuang Li, Wei Shen

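AI summary: WorldAct uses a multimodal agent to decompose static generated 3D worlds into editable, interaction-ready scenes that support object-level editing and embodied task execution while preserving global scene coherence.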

Comments Project page: https://sjtu-deepvisionlab.github.io/WorldAct

Abstract

Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static monolithic assets with limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To tackle this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable and interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object-level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.

2605.15836 2026-05-18 cs.RO cs.AI

GAP: Geometric Anchor Pre-training for Data-Efficient Visuomotor Learning of Manipulation Tasks

Davide Buoso, Andrea Protopapa, Stefano Di Carlo, Francesca Pistilli, Giuseppe Averta

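AI summary: This paper proposes GAP, which pre-trains a spatial adapter to produce stable geometric anchors, improving visuomotor policy learning when demonstrations are scarce; experiments show it outperforms other methods on several tasks.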

Comments Project webpage at https://lambdavi.github.io/gap

Abstract

Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation. A primary hurdle lies in distilling high-dimensional RGB representations into control-relevant geometry without overfitting. While using frozen pre-trained Vision Foundation Models (VFMs) improves data efficiency, it also shifts most task adaptation onto a small spatial pooling module, which can latch onto task-irrelevant shortcuts and lose geometric grounding when finetuned with few data samples. More broadly, pre-trained visual representations used for policy learning have been observed to struggle under even minor scene perturbations, highlighting the need for robustness-oriented inductive biases. We propose Geometric Anchor Pre-training (GAP), a simple, action-free warm-up stage that regularizes the spatial adapter before downstream imitation learning. GAP pre-trains the pooling layer on a lightweight simulated proxy task where object masks are available at no cost, encouraging the adapter to produce keypoints that lie on the object, cover its spatial extent, and remain sharp and repeatable over time. This yields stable geometric anchors that provide a reliable coordinate interface for few-shot policy learning, while keeping the VFM frozen. We evaluate GAP on RoboMimic and ManiSkill under severe data scarcity (15-50 demonstrations) and domain shift. A simple adapter regularized with GAP consistently outperforms stronger attention-based poolers and end-to-end fine-tuning, achieving 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA), 63% on the long-horizon high-precision Tool Hang task with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine-tuning). The proxy stage is lightweight and fully decoupled from downstream tasks, making it practical to reuse across environments and manipulation skills.

2605.15835 2026-05-18 cs.CV

Community-aware evaluation and threshold calibration for open-set plankton image recognition

Xi Chen, Eryuan Huang, Yingjun Xiao, Gang Fang

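AI summary: This paper studies the mismatch between sample-level and community-level evaluation in open-set plankton recognition, proposes the Open-Set Community Distortion (OSCD) metric, and shows experimentally that community-aware threshold calibration can reduce OSCD. It argues that open-set recognition should be treated as an ecological measurement problem.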

Comments Manuscript. 14 figures/tables in total

Abstract

Automated plankton image recognition is increasingly used in aquatic ecosystem monitoring, but deployed classifiers inevitably encounter unseen taxa and non-target particles. Open-set recognition methods are usually evaluated with sample-level metrics such as AUROC, AUPR, and FPR@95% unknown-recall operating points, whereas ecological monitoring depends on community-level estimates of taxon abundance and diversity. This study examines the mismatch between these objectives using controlled pseudo-communities and three datasets spanning marine zooplankton imaged by ZooScan, marine phytoplankton imaged by IFCB, and freshwater plankton imaged by an in-situ camera. We define Open-Set Community Distortion (OSCD), a Bray-Curtis-style error over known taxa plus an unknown bin, with directional components distinguishing known-taxon overestimation from underestimation. Closed-set classifiers achieved high known-class accuracy, but unknown samples were often absorbed with high confidence and in structured ways. Sample-level OOD metrics were not sufficient to select ecological operating points: for MSP, FPR@95% unknown-recall thresholds produced large test-community OSCD on all three datasets mainly because true known taxa were over-rejected into the unknown bin. Community-aware threshold calibration reduced MSP OSCD relative to fixed 95% known recall on SYKE-ZooScan 2024 and SYKE-IFCB 2022; on ZooLake the fixed-recall baseline was already close to the community-aware threshold, and the best community-level method was a prototype-distance variant rather than MSP. The benefit of community-aware calibration therefore depends on validation-community representativeness and the gap between fixed recall and the community optimum. These results show that open-set plankton recognition should be evaluated as an ecological measurement problem, not only as a sample-level detection task.
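
A Bray-Curtis-style community error with an extra unknown bin can be sketched as follows. The abstract does not give the exact OSCD formula, so the normalization, the treatment of the directional components, and the function name here are assumptions; only the general idea (Bray-Curtis over known taxa plus an unknown bin, with over- and underestimation separated) comes from the text.

```python
def community_distortion(true_counts, pred_counts, unknown="unknown"):
    """Bray-Curtis-style error between true and predicted community counts.

    Both arguments map taxon name -> count and may include an unknown bin.
    Assumes at least one nonzero count overall.
    Returns (distortion, known_overestimation, known_underestimation),
    each normalized by the combined total count.
    """
    taxa = set(true_counts) | set(pred_counts)
    total = sum(true_counts.values()) + sum(pred_counts.values())
    diff = {t: pred_counts.get(t, 0) - true_counts.get(t, 0) for t in taxa}
    distortion = sum(abs(d) for d in diff.values()) / total
    # Directional parts are computed over known taxa only.
    over = sum(d for t, d in diff.items() if t != unknown and d > 0) / total
    under = sum(-d for t, d in diff.items() if t != unknown and d < 0) / total
    return distortion, over, under
```

In this sketch, over-rejecting a known taxon into the unknown bin shows up as known-taxon underestimation, which matches the failure mode the abstract describes for fixed-recall thresholds.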

2605.15831 2026-05-18 cs.SD cs.AI

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

Yuqing Cheng, Xingyu Ma, Guochen Yu, Xiaotao Gu

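AI summary: This paper proposes BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that produces Mel-frequency band tokens from a single shared codebook, making the tokens better suited to autoregressive modeling; experiments show strong results in a data-limited setting.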

Abstract

Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.
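
One common way to build a 2D rotary position embedding is to split the channels in half and rotate one half by the time index and the other by the frequency-band index. The sketch below follows that construction in plain Python; the abstract does not specify the paper's exact layout, so the half-and-half split, the base of 10000, and the function names are assumptions.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of vec by angles proportional to pos."""
    d = len(vec)
    assert d % 2 == 0, "channel count must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_2d(vec, t, band, base=10000.0):
    """Apply rotary embedding: first half by time t, second half by band."""
    assert len(vec) % 4 == 0, "channel count must be divisible by 4"
    half = len(vec) // 2
    return rope_1d(vec[:half], t, base) + rope_1d(vec[half:], band, base)
```

Because each pair is only rotated, vector norms are preserved and the dot product between two rotated vectors depends only on their relative offsets in time and band, which is the property that makes rotary embeddings attractive for a time-frequency token grid.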

2605.15824 2026-05-18 cs.CV

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

FashionChameleon：迈向实时和交互式的人体服装视频定制

Quanjian Song, Yefeng Shen, Mengting Chen, Hao Sun, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Liujuan Cao

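AI summary: This paper proposes FashionChameleon, a framework that achieves interactive multi-garment video customization from single-garment video data while preserving motion coherence, generating in real time at 23.8 FPS, 30-180 times faster than existing baselines.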

Comments Project Page: https://quanjiansong.github.io/projects/FashionChameleon/

AI中文摘要

以人为中心的视频定制，特别是在服装层面，已显示出显著的商业价值。然而，现有方法无法支持低延迟和交互式服装控制，这对电子商务和内容创作应用至关重要。本文研究如何在仅使用单件服装视频数据的情况下，实现交互式多服装视频定制并保持动作一致性。我们提出了FashionChameleon，一个用于自回归视频生成中的人体服装定制的实时交互框架，用户可以在生成过程中交互式切换服装。FashionChameleon包含三个关键技术：(i) 代替在多服装视频数据上训练，我们使用上下文学习在单个参考服装对上训练教师模型。通过保留图像到视频的训练范式，同时强制参考和服装图像之间不匹配，模型被鼓励在单件服装切换时隐式保持一致性。(ii) 为了在生成过程中实现一致性和效率，我们引入了带有上下文学习的流式蒸馏，通过上下文教师强制微调模型，并通过梯度加权分布匹配蒸馏提高外推一致性。(iii) 为了将模型扩展到交互式多服装视频定制，我们提出了无训练KV缓存调度，包括服装KV刷新、历史KV撤回和参考KV解耦，以在保持动作一致性的同时实现服装切换。我们的FashionChameleon独特地支持交互式定制和一致的长视频外推，同时在单个GPU上实现实时生成23.8 FPS，比现有基线快30-180倍。

英文摘要

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garments during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model for interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence. Our FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180× faster than existing baselines.

2605.15822 2026-05-18 cs.LG stat.ML

Intrinsic Wasserstein Rates for Score-Based Generative Models on Smooth Manifolds

光滑流形上基于分数的生成模型的内在Wasserstein速率

Guoji Fu, Taiji Suzuki, Wee Sun Lee, Atsushi Nitanda

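AI summary: This paper studies intrinsic Wasserstein rates for score-based generative models (SGMs) on smooth manifolds, proving that on manifolds satisfying certain conditions a variance-preserving SGM estimator attains a specific sample exponent, and analyzing score approximation across different noise regimes.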

2605.15803 2026-05-18 cs.CV cs.LG

Embedding-perturbed Exploration Preference Optimization for Flow Models

嵌入扰动探索偏好优化用于流模型

Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu, Xiu Li

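AI summary: This paper proposes the E²PO framework, which sustains stable optimization through embedding-level perturbation and improves the alignment of generative models with human preferences.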

Comments Accepted by ICML 2026

AI中文摘要

最近的进展已将强化学习（RL）确立为对齐生成模型与人类意图的关键范式。然而，基于群体的优化框架（如GRPO）面临关键限制：群体内方差的快速衰减。随着群体内样本的差异性降低，方差趋于零。这消除了优化所需的信号，使过程不稳定，迫使策略提前停滞或奖励黑客。现有策略如改变初始噪声或增加群体大小往往无法解决这一根本问题，导致训练不稳定或收益递减。为克服这些挑战，我们提出嵌入扰动探索偏好优化（E²PO），一种通过嵌入层面扰动维持优化的新型框架。我们的方法在样本群体内引入结构化扰动，保证了鲁棒的方差，从而在训练过程中保持判别信号。大量实验表明，我们的方法显著优于现有最佳基线，实现了更忠实的人类偏好对齐。

英文摘要

Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization ($E^2$PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
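
The variance-decay problem described above can be seen in the group-relative advantage that GRPO-style methods compute. The following is a minimal sketch, assuming the common normalization of each reward by its group mean and standard deviation; the function name `group_advantages` and the `eps` value are illustrative choices, not details from the paper.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sample group (GRPO-style advantage).

    eps is a hypothetical stabilizer that avoids division by zero when
    every sample in the group receives the same reward.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A diverse group yields a usable signal: some samples are pushed up,
# others are pushed down.
diverse = group_advantages([0.2, 0.8, 0.5, 0.9])

# A collapsed group (identical rewards) yields all-zero advantages,
# so the policy gradient for that group vanishes.
collapsed = group_advantages([0.7, 0.7, 0.7, 0.7])
```

When every sample in a group scores the same, each advantage is exactly zero, which is the lost learning signal the abstract refers to; perturbing samples within a group is one way to keep that spread from vanishing.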

2605.15796 2026-05-18 cs.CV

Cross-Modal Registration Between 3D and 2D Fingerprints via Pose-Aware Unwrapping and Point-Cloud Fusion

通过姿态感知解缠和点云融合实现3D与2D指纹的跨模态注册

Xiongjun Guan, Jianjiang Feng, Jie Zhou

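AI summary: This paper presents a unified framework for 3D fingerprint preprocessing and registration against both contact-based and contactless 2D fingerprints, combining nonparametric visualization and unwrapping, point-cloud fusion, pose normalization, and pose-aware registration to improve 3D-2D compatibility.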

AI中文摘要

三维（3D）指纹保留全局指纹几何和局部脊线结构，避免接触引起的变形，但难以与传统二维（2D）指纹系统集成。本文针对3D采集与跨模态匹配之间的中间阶段，提出统一框架，用于3D指纹预处理和跨接触式和非接触式2D模态的注册。框架结合四个组件：1）非参数可视化和解缠方法，将3D指纹点云转换为卷轴等效2D表示，无需全局指纹模型；2）点云融合管道，将多个部分3D捕捉注册并拼接为更完整的指纹模型；3）基于椭圆的姿态归一化方法用于标准指纹对齐；4）姿态感知的跨模态注册策略，提高3D指纹与非接触式和接触式2D指纹的兼容性。在自建的多模态指纹数据库（含150个指纹）上的实验表明，所提框架实现了脊线级3D注册精度、鲁棒的姿态估计和一致的2D兼容性提升。特别是3D融合误差集中在0.09 mm，非接触式2D-3D注册达到脊线尺度投影精度，姿态感知解缠相对于通用3D解缠提高了真实匹配分数。这些结果支持3D指纹作为跨异构指纹模态的有效几何桥梁。

英文摘要

Three-dimensional (3D) fingerprints preserve global finger geometry and local ridge structure while avoiding contact-induced deformation, but they remain difficult to integrate with legacy two-dimensional (2D) fingerprint systems. This paper addresses the intermediate stage between 3D acquisition and cross-modal matching, and presents a unified framework for 3D fingerprint preprocessing and registration across contactless and contact-based 2D modalities. The framework combines four components: 1) a nonparametric visualization and unwrapping method that converts a 3D fingerprint point cloud into a rolled-equivalent 2D representation without relying on a global finger-shape model; 2) a point-cloud fusion pipeline that registers and mosaics multiple partial 3D captures into a more complete fingerprint model; 3) an ellipse-based pose normalization method for canonical finger alignment; and 4) a pose-aware cross-modal registration strategy that improves compatibility between 3D fingerprints and both contactless and contact-based 2D fingerprints. Experiments on a self-collected multimodal fingerprint database containing 150 fingers show that the proposed framework achieves ridge-level 3D registration accuracy, robust pose estimation, and consistent gains in 2D compatibility. In particular, the 3D fusion error is concentrated around 0.09 mm, contactless 2D--3D registration reaches ridge-scale projection accuracy, and pose-aware unwrapping improves genuine matching scores relative to generic 3D unwrapping. These results support the use of 3D fingerprints as an effective geometric bridge across heterogeneous fingerprint modalities.

2605.15794 2026-05-18 cs.CL

ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

ForMaT：用于视觉接地多语言PDF翻译的数据集

Michał Ciesiółka, Dawid Wiśniewski, Adrian Charkiewicz, Kamil Guttmann

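AI summary: ForMaT builds a multilingual translation dataset from 3,956 PDFs, retaining layout metadata to support multimodal translation and using K-Medoids sampling to capture complex elements; it reveals shortcomings of current MT systems in spatial grounding and geometric synchronization and provides a benchmark for layout-aware translation models.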

2605.15793 2026-05-18 cs.LG

AOT-POT: Adaptive Operator Transformation for Large-Scale PDE Pre-training

AOT-POT：面向大规模PDE预训练的自适应算子变换

Qitan Lv, Hong Wang, Zhongkai Hao, Wen Wu, Xuenan Xu, Bowen Zhou, Feng Wu, Chao Zhang

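AI summary: This paper proposes AOT-POT, which uses adaptive operator transformation to reshape complex and diverse PDE solution operators into a unified form and improve pre-training; experiments show strong results on 12 PDE benchmarks with only 3% additional parameters.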

AI中文摘要

在多样化的偏微分方程（PDE）数据集上预训练神经算子已成为科学机器学习中构建通用替代模型的有前景方向。然而，PDE解算子的内在复杂性和结构多样性使得多PDE预训练从根本上具有挑战性。现有方法主要通过增加模型容量来应对，而本文受经典数值分析启发，提出将复杂多样的解算子转换为更简单、对齐更好的形式，以便联合建模。由于最优转换因PDE类型而异，必须自适应且输入依赖，使单个神经算子能近似整个算子家族。本文将这一想法实例化为AOT-POT（自适应算子变换预训练算子变换器），通过扩展隐藏表示为多个并行流，自适应聚合和重新分配流并在每个子层前后进行，利用Sinkhorn投影的双随机矩阵混合流以实现稳定训练。这些机制共同将多样化的解算子重塑为统一形式，可被单一架构有效建模。实验表明，AOT-POT在12个PDE基准上仅增加3%的参数，相对L2误差减少最高达77.6%（平均40.9%）。进一步微调AOT-POT在域内PDE上将L2误差减少最高达92%，在域外PDE上减少89%（未在预训练中见到的类型），证明自适应算子变换是提升PDE基础模型的有效且互补方向，超越单纯扩大模型容量。

英文摘要

Pre-training neural operators on diverse partial differential equation (PDE) datasets has emerged as a promising direction for building general-purpose surrogate models in scientific machine learning. However, the inherent complexity and structural diversity of PDE solution operators make multi-PDE pre-training fundamentally challenging. Existing methods mainly address this by increasing model capacity, while leaving the target solution operators unchanged. Inspired by classical numerical analysis, we instead propose to transform complex and diverse solution operators into simpler, better-aligned forms that are easier to model jointly. Since the optimal transformation varies across PDE types, it must be adaptive and input-dependent, allowing a single neural operator to approximate an entire family of operators. We instantiate this idea as AOT-POT (adaptive operator-transformation for pre-training operator transformer), which expands hidden representations into multiple parallel streams, adaptively aggregates and redistributes them before and after each sub-layer, and mixes streams through Sinkhorn-projected doubly stochastic matrices for stable training. These mechanisms together reshape diverse solution operators into a unified form that can be effectively modeled by a single architecture. Empirically, AOT-POT achieves state-of-the-art performance on 12 PDE benchmarks with only 3% additional parameters, reducing relative L2 error by up to 77.6% (40.9% on average). Fine-tuning AOT-POT further reduces L2 error by up to 92% on in-domain PDEs and 89% on out-of-domain PDEs (unseen types during pre-training), demonstrating that adaptive operator transformation is an effective and complementary direction for advancing PDE foundation models beyond simply scaling model capacity.
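
The Sinkhorn projection mentioned above can be illustrated with the classical Sinkhorn-Knopp iteration, which alternately normalizes the rows and columns of a positive square matrix until it is approximately doubly stochastic. This is a generic sketch of that iteration, not the AOT-POT implementation; the function name `sinkhorn_project`, the exponentiation of raw scores, and the iteration count are assumptions.

```python
import math


def sinkhorn_project(scores, n_iters=200):
    """Map a square matrix of raw scores to an approximately doubly
    stochastic matrix by alternating row and column normalization."""
    n = len(scores)
    # Exponentiate so that every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in scores]
    for _ in range(n_iters):
        # Row step: make each row sum to 1.
        for i in range(n):
            row_sum = sum(m[i])
            m[i] = [v / row_sum for v in m[i]]
        # Column step: make each column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


mix = sinkhorn_project([[0.0, 1.0, -1.0],
                        [2.0, 0.5, 0.0],
                        [-0.5, 0.3, 1.5]])
```

Mixing parallel streams with such a matrix keeps each output a weighted average of the inputs and preserves the total weight given to every input stream, which is one plausible reading of why the abstract associates it with stable training.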

2605.15792 2026-05-18 cs.CV

Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models

反向流动：大型多模态模型中的生成到理解协同效应

Yujun Tong, Dongliang Chang, Zijin Yin, Xintong Liu, Yuanchen Fang, Zhanyu Ma

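AI summary: This paper proposes Generation-to-Understanding (G2U) synergy, using visual generation as an intermediate reasoning step to improve multimodal understanding, and examines the two-way relationship between generation and understanding as well as the models' shortfall in self-reflection.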

Comments Accepted by CVPR 2026 Findings

AI中文摘要

多模态AI长期目标是构建统一模型，使视觉理解和生成相互促进。尽管BAGEL和BLIP3o取得进展，但统一仍单向：理解指导生成，而生成如何支持理解未被研究。本文提出生成到理解协同效应，使视觉生成成为显式推理步骤，通过生成细节增强、上下文扩展等可控生成行为产生自生成视觉思考，并反馈至模型以优化感知，无需重新训练。在十二个基准测试中，反向信息流一致提升多模态理解。研究显示生成保真度限制感知增益，不同编辑提示家族影响迁移效率。进一步分析模型能否决定想象内容，尽管模型可生成合理编辑，但自生成视觉思考缺乏稳定任务对齐，揭示当前大多模态模型未达到真实自我反思。本文揭示统一认知缺失机制，表明想象并非理解终点，而是其起点。

英文摘要

The long-standing goal of multimodal AI is to build unified models in which visual understanding and visual generation mutually enhance one another. Although recent works such as BAGEL and BLIP3o have achieved remarkable progress, in practice this unification remains one-directional: understanding routinely guides generation, yet how and why generation can support understanding is rarely investigated. We revisit this asymmetry and propose Generation-to-Understanding (G2U) synergy, where visual generation becomes an explicit intermediate reasoning step. Our framework enables a model to perform controlled generative acts, such as detail enhancement, context expansion or structural visualisation, to produce self-generated visual thoughts, which are then fed back into the model to refine perception without retraining or external tools. Through a comprehensive evaluation on twelve benchmarks, this reversed information flow consistently improves multimodal understanding. We show that generative fidelity bounds perceptual gain and that distinct families of edit prompts govern transfer efficiency. We further analyse whether models can decide what to imagine. While they can produce plausible edits, these self-generated visual thoughts lack stable task alignment, revealing that current large multimodal models fall short of true self-reflection. This work exposes a missing mechanism in unified cognition and suggests that imagination is not the end of understanding but its beginning.

2605.15789 2026-05-18 cs.LG eess.SP stat.ML

Learning Context-conditioned Gaussian Overbounds for Convolution-Based Uncertainty Propagation

基于卷积的不确定性传播的上下文条件高斯上界学习

Ruirui Liu, Xuejie Hou, Yiping Jiang, Hui Ren

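AI summary: This paper proposes a unified learning framework that trains neural networks to produce context-aware Gaussian overbounds with provable conservatism on a finite quantile grid and, under three explicit regularity assumptions, continuous-tail conservatism on a certified interval.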

AI中文摘要

不确定性量化在安全关键领域至关重要——从自动驾驶到航空、金融和健康——其中决策必须依赖保守的界限而非点估计。预测层面的区间（如分位数回归、符合预测、方差网络或贝叶斯模型）通常不具有可组合性：将两个变量的区间相加不一定得到其和的合法区间或保持覆盖率。在航空领域，高斯上界用复杂的误差分布替换为保守的高斯分布，其尾部支配真实分布，因此保守性通过线性操作传播。然而，经典上界是全局的，通常过于保守，且难以适应特征条件误差。我们提出了一种统一的学习框架，训练神经网络生成上下文感知的高斯上界——均值和尺度——在有限分位数网格上具有可证明的保守性，并在满足三个显式正则性假设时在认证区间内保持连续尾保守性。我们的上界损失在选定的分位数上强制保守性，同时用一种类似瓦瑟斯坦的项惩罚分布距离。所学习的界限支持在强制网格上进行保守的线性组合和卷积分析，并在假设成立时在认证区间内进行保守性分析，同时比传统方法更不冗余。我们提供了离散到连续保守性的范围分析和紧域目标正则性的分析，并在合成数据和真实世界数据集上进行了验证，包括多路径、电离层和对流层残差误差。在这些设置中，该方法在保持强制网格上的保守性的同时，提供了更紧的界限。该框架是模态无关的，并适用于需要在动态环境中进行保守、特征条件不确定性估计的学习系统。

英文摘要

Uncertainty quantification is essential in safety-critical settings--from autonomous driving to aviation, finance, and health--where decisions must rely on conservative bounds rather than point estimates. Predictor-level intervals (e.g., from quantile regression, conformal prediction, variance networks, or Bayesian models) generally do not compose: adding two per-variable intervals need not yield a valid interval for their sum or preserve coverage. In aviation, Gaussian overbounding replaces complex error distributions with a conservative Gaussian whose tails dominate the truth, so conservatism propagates through linear operations. Yet classical overbounds are global, often overly conservative, and hard to adapt to feature-conditioned errors. We propose a unified learning framework that trains neural networks to produce context-aware Gaussian overbounds--mean and scale--with provable conservatism on a finite quantile grid and, under three explicit regularity assumptions, continuous-tail conservatism on a certified interval. Our overbounding loss enforces conservativeness at selected quantiles while penalizing distributional distance with a Wasserstein-style term. The learned bounds support conservative linear-combination and convolution analysis on the enforced grid, and on the certified interval when assumptions hold, while being less redundant than traditional methods. We provide a scoped analysis of discrete-to-continuous conservatism and compact-domain objective regularity, and validate on synthetic data and real-world datasets, including multipath, ionospheric, and tropospheric residual errors. Across these settings, the method yields tighter bounds while maintaining conservatism on the enforced grid and in experiments. The framework is modality-agnostic and applicable to learning systems that require conservative, feature-conditioned uncertainty estimates in dynamic environments.
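
To make "conservatism on a finite quantile grid" concrete, the sketch below checks whether a candidate Gaussian bound dominates the empirical upper-tail quantiles of a set of errors at a few chosen levels. It is only an illustration under simplifying assumptions: it looks at the upper tail alone, uses a simple order-statistic quantile, and the helper names and grid values are hypothetical rather than taken from the paper.

```python
import math
from statistics import NormalDist


def empirical_quantile(samples, q):
    """Order-statistic quantile: smallest sample with at least a
    fraction q of the data at or below it."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]


def is_conservative_on_grid(samples, mean, scale, grid=(0.9, 0.99)):
    """True if the Gaussian N(mean, scale) has upper-tail quantiles at
    least as large as the empirical ones at every level in the grid."""
    bound = NormalDist(mean, scale)
    return all(
        bound.inv_cdf(q) >= empirical_quantile(samples, q) for q in grid
    )


errors = [-1.0, -0.5, 0.0, 0.5, 1.0, 3.0]   # one heavy right-tail sample
tight = is_conservative_on_grid(errors, mean=0.0, scale=1.0)
inflated = is_conservative_on_grid(errors, mean=0.0, scale=2.5)
```

A unit-scale Gaussian fails the check because of the outlying error, while an inflated scale passes; a learned, feature-conditioned bound would aim for the smallest scale that still passes in each context.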

2605.15787 2026-05-18 cs.LG cs.AI

Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets

通过结构推断理解Grokking：Transformer需要贝叶斯彩票

Kai Hidajat, Solden Stoll, Joseph An

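AI summary: This study examines structural inference in Transformers as the mechanism behind delayed generalization (grokking), proposes a Bayesian lottery-ticket account, and explains how the generalization delay relates to structure learning.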

AI中文摘要

为什么一个已经记忆了训练集的Transformer要在数千步后才开始泛化？现有解释将这种延迟归因于范数最小化、特征出现或稀疏子网络的晚期发现。这些解释捕捉了过渡过程中的重要部分，但忽略了注意力模型特有的约束：如果注意力丢弃了一个信息性token，就没有有界下游计算能恢复它。我们正式将注意力建模为任务依赖图的隐式贝叶斯后验，并证明泛化需要两个分离的条件：一个与MLP容量相关的Goldilocks界，与基于范数的Grokking理论一致，以及一个新的贝叶斯结构性条件，要求注意力对每个信息性token放置足够的质量。这种分离解释了延迟泛化为延迟结构推断。训练早期，MLP通过不匹配的特征记忆，驱动交叉熵损失接近零，从而使注意力缺乏结构梯度。权重衰减必须在记忆消失前侵蚀记忆，使缺失的图变得可学习，产生已知的逆权重衰减延迟，我们推导为结构等待时间。然后证明这种解释-消除延迟可通过KL基于的结构性干预绕过，产生Grokking时间的逆干预强度缩放定律。在算法序列任务上的实验将结构与容量分离，显示这种贝叶斯彩票与彩票转移相匹配或优于。

英文摘要

Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks. These explanations capture important parts of the transition, but ignore a constraint unique to attention-based models: if attention discards an informative token, no bounded downstream computation can recover it. We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding the known inverse-weight-decay delay, which we derive as a structural waiting time. We then prove that this explaining-away delay can be bypassed by a KL-based structural intervention, yielding an inverse-intervention-strength scaling law for the grokking time. Experiments on algorithmic sequence tasks isolate structure from capacity and show that this Bayesian ticket matches or outperforms lottery-ticket transfer.

2605.15782 2026-05-18 cs.RO cs.SY eess.SY

Reactive Robot-Centric Safety for Autonomous Navigation in Constrained and Dynamic Environments

反应式机器人中心安全机制用于受约束和动态环境中的自主导航

Viswa Narayanan Sankaranarayanan, Vignesh K. Viswanathan, Akshit Saradagi, Sumeet Satpute, George Nikolakopoulos

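AI summary: This paper proposes a composite control barrier function safety filter based on 3D LIDAR perception that ensures real-time safe navigation of autonomous robots in constrained, dynamic environments, and validates its reliability in complex settings through field experiments.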

Comments 9 pages, 12 figures, currently under review

AI中文摘要

本文研究了如何利用仅有的机载传感器确保自主机器人导航在空间受限动态环境中的实时安全性。我们提出了一种实时控制架构，将基于3D激光雷达感知的复合控制屏障函数（CBF）安全过滤器直接集成到自主系统中。该感知驱动框架通过机载点云数据动态施加避障约束，从而在控制频率下处理大量约束，同时对常规任务执行保持最小侵入性。安全区域定义为与平台几何一致的体坐标系下的椭球体，这在机器人旋转时会在世界坐标系中诱导时间变化约束；通过为每个激光雷达点专门制定时间变化（CBF）处理方式来应对这一影响。通过在地下环境中进行多场实验，利用四足平台执行视觉检查任务，验证了系统在动态障碍、不安全高层参考、突然定位异常以及狭窄走廊穿越情况下的可靠运行。

英文摘要

In this work, we address the problem of ensuring real-time safety in autonomous robot navigation in spatially constrained, dynamic environments using only onboard sensors. We present a real-time control architecture that integrates a composite control barrier function (CBF) safety filter, driven by 3D LIDAR perception, directly into the autonomy pipeline. The proposed perception-driven framework enforces collision avoidance constraints dynamically from onboard point cloud data, thus allowing a large number of constraints to be handled at the control frequency, while remaining minimally invasive to nominal task execution. The safety region is defined as an ellipsoid in the body frame, consistent with the geometry of the platform, which induces time-varying constraints in the world frame as the robot rotates; this effect is handled through a dedicated time-varying CBF formulation for each LIDAR point. We validate the system through multiple field experiments in underground environments using a quadruped platform performing a visual inspection task, demonstrating reliable operation in the presence of dynamic obstacles, unsafe high-level references, abrupt localization anomalies, and narrow corridors.
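
The body-frame ellipsoid idea can be sketched as a barrier value that is nonnegative when a point lies on or outside the ellipsoid. The example below is a simplified illustration only: it assumes a yaw-only rotation, hypothetical semi-axes, and made-up function names, and it omits the optimization-based filter that would actually constrain the control input.

```python
import math


def world_to_body(point_w, robot_pos, yaw):
    """Express a world-frame point in the robot body frame
    (yaw-only rotation, a simplifying assumption)."""
    dx, dy, dz = (p - r for p, r in zip(point_w, robot_pos))
    c, s = math.cos(yaw), math.sin(yaw)
    # Apply the transpose of the yaw rotation matrix.
    return (c * dx + s * dy, -s * dx + c * dy, dz)


def barrier(point_b, semi_axes):
    """h >= 0 when the body-frame point is on or outside the ellipsoid."""
    return sum((p / a) ** 2 for p, a in zip(point_b, semi_axes)) - 1.0


axes = (0.5, 0.3, 0.3)        # hypothetical semi-axes in metres
obstacle = (0.0, 1.0, 0.0)    # fixed point in the world frame
origin = (0.0, 0.0, 0.0)

h_facing_x = barrier(world_to_body(obstacle, origin, 0.0), axes)
h_facing_y = barrier(world_to_body(obstacle, origin, math.pi / 2), axes)
```

The same world-frame point gives a different barrier value once the robot turns toward it, because the long axis of the ellipsoid now points at the obstacle; this is the rotation-induced time variation that the abstract says needs a dedicated treatment.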

2605.15779 2026-05-18 cs.RO cs.AI

A Topology-Aware Spatiotemporal Handover Framework for Continuous Multi-UAV Tracking

一种面向拓扑的时空切换框架用于连续多无人机跟踪

Jianlin Ye, Christos Kyrkou, Panayiotis Kolios

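AI summary: This paper presents a real-time multi-camera multi-vehicle tracking system that uses a topology-based spatiotemporal handover mechanism to maintain vehicle identity across multiple UAV views; experiments show a handover success rate of 99.8%, compared with 74.1% for Re-ID baselines.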

Journal ref: 2026 International Conference on Unmanned Aircraft Systems (ICUAS)

AI中文摘要

将无人机（UAVs）整合到智能交通系统（ITS）中为交通监控提供了全景可见性，但可扩展部署受到轨迹碎片化的影响，其中车辆身份在多UAV视角下丢失。尽管最先进的框架在优化局部轨迹提取和稳定性方面表现优异，但它们通常作为孤立的数据孤岛，生成不连贯的轨迹，从而阻碍了网络层面的分析，如起讫点估计。本文提出了一种实时多摄像头多车辆跟踪（MCMT）系统，旨在处理全局身份持续性。针对俯视视角中基于外观的重识别（Re-Identification）的视觉模糊和计算成本，我们引入了一种轻量级的拓扑基于的时空切换机制。我们实现了高吞吐量的并行管道，利用YOLO11和ByteTrack处理同时的4K流。我们的核心贡献是一种确定性的队列基于的匹配算法，利用几何重叠和虚拟车道离散化来通过FIFO队列预测性地管理身份切换。在复杂的城市环境中，包括交叉口和汇入交通，实验结果表明在连续交通流中的切换成功率（HOSR）为99.8%，显著优于Re-ID基线（74.1%），同时验证了边缘部署的可行性。源代码可在https://github.com/JYe9/multi-camera-multi-vehicle-tracking-system获取。

英文摘要

The integration of Unmanned Aerial Vehicles (UAVs) into Intelligent Transportation Systems (ITS) offers synoptic visibility for traffic monitoring, yet scalable deployment is hindered by trajectory fragmentation, where vehicle identity persistence is lost across multi-UAV Fields of View (FOV). While state-of-the-art frameworks excel in optimizing local trajectory extraction and stability for single-drone imagery, they often function as isolated data silos that generate disjointed trajectories, thereby precluding network-level analysis such as Origin-Destination estimation. This paper presents a real-time Multi-Camera Multi-Vehicle Tracking (MCMT) system designed to handle global identity persistence. Addressing the visual ambiguity and computational cost of appearance-based Re-Identification (Re-ID) in nadir views, we introduce a lightweight Topology-Based Spatiotemporal Handover mechanism. We implement a high-throughput parallel pipeline leveraging YOLO11 and ByteTrack to process concurrent 4K streams. Our core contribution is a deterministic queue-based matching algorithm that utilizes geometric overlaps and virtual lane discretization to predictively manage identity handover via FIFO queues. Experimental results on complex urban environments, including intersections and merging traffic, demonstrate a Handover Success Rate (HOSR) of 99.8% in continuous traffic flows, significantly outperforming Re-ID baselines (74.1%) while validating edge deployment feasibility. The source code is available at https://github.com/JYe9/multi-camera-multi-vehicle-tracking-system.
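
The queue-based handover described above can be pictured as one FIFO queue per virtual lane: an identity is enqueued when a vehicle leaves one camera's view and dequeued when a new track appears in the same lane of the next view. The sketch below is a toy version of that idea, not the released system; the class name `LaneHandover`, its methods, and the assumption that vehicles do not overtake within a lane are all illustrative.

```python
from collections import defaultdict, deque
from itertools import count


class LaneHandover:
    """Toy per-lane FIFO identity handover between two camera views."""

    def __init__(self):
        self._pending = defaultdict(deque)   # lane -> queued global IDs
        self._fresh = count(1)               # counter for unseen vehicles

    def on_exit(self, lane, global_id):
        """A tracked vehicle leaves the upstream view in this lane."""
        self._pending[lane].append(global_id)

    def on_enter(self, lane):
        """A new track appears downstream; reuse the oldest queued ID
        for the lane, or mint a new ID if nothing is waiting."""
        queue = self._pending[lane]
        if queue:
            return queue.popleft()
        return f"new-{next(self._fresh)}"


handover = LaneHandover()
handover.on_exit(lane=0, global_id="car-1")
handover.on_exit(lane=0, global_id="car-2")
handover.on_exit(lane=1, global_id="car-3")

first = handover.on_enter(lane=0)    # oldest vehicle queued in lane 0
other = handover.on_enter(lane=1)
second = handover.on_enter(lane=0)
unseen = handover.on_enter(lane=0)   # queue empty, so a new ID is issued
```

Because matching depends only on lane and arrival order, no appearance features are compared, which is consistent with the abstract's motivation of avoiding Re-ID cost in nadir views.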

2605.15775 2026-05-18 cs.LG

Continual Learning of Domain-Invariant Representations

持续学习领域不变表示

Pascal Janetzky, Tobias Schlagenhauf, Stefan Feuerriegel

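AI summary: This paper proposes methods for continual learning of domain-invariant representations, combining replay-based training with a tailored sequential invariance alignment to improve generalization to unseen target domains.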

Comments ICML 2026

AI中文摘要

持续学习（CL）旨在在多个领域上依次训练模型，而不遗忘之前学习的知识。然而，现有CL方法优化的是领域内性能，因此容易学习到虚假的、领域特定的提示（"捷径学习"），这限制了部署后对未见领域的泛化能力。本文通过持续学习领域不变表示来解决这一限制。我们引入了一类广泛的方法，依次学习捕捉跨领域的不变结构的表示。我们的方法受到观察的启发，即这些不变结构往往保留了底层因果机制，这可以减少对领域特定提示的过拟合风险，从而提供更好的领域外泛化能力。我们提出的方法结合了基于回放的训练和定制的序列不变性对齐，以学习并保留时间上的不变结构。我们在面向部署的协议下评估了我们的方法，该协议测量在未见目标领域上的性能。在六个基准和现实世界的数据集上，涵盖视觉、医学、制造和生态领域，我们的方法在泛化到未见目标领域方面始终优于现有CL基线。作为消融研究，我们进一步表明，简单的顺序训练扩展现有领域不变表示学习（DIRL）方法只能提供有限的收益。据我们所知，这是首次开发用于持续学习的领域不变表示方法。

英文摘要

Continual learning (CL) aims to train models sequentially over multiple domains without forgetting previously learned knowledge. However, existing CL methods optimize for in-domain performance and are therefore prone to learning spurious, domain-specific cues ("shortcut learning"), which limits generalization to unseen domains after deployment. In this paper, we address this limitation through continual learning of domain-invariant representations. We introduce a broad class of CL methods that sequentially learn representations capturing invariant structures across domains. Our methods are motivated by the observation that such invariant structures often preserve the underlying causal mechanisms, which can reduce the risk of overfitting to domain-specific cues and thus offer better out-of-domain generalization. Our proposed CL methods combine replay-based training with a tailored sequential invariance alignment to learn and preserve invariant structures over time. We evaluate our methods under a deployment-oriented protocol that measures performance on unseen target domains. Across six benchmark and real-world datasets spanning vision, medicine, manufacturing, and ecology, our methods consistently outperform existing CL baselines in terms of generalization to unseen target domains. As an ablation, we further show that naïve extensions of sequential training with existing domain-invariant representation learning (DIRL) methods provide only limited benefits. To the best of our knowledge, this is the first work to develop domain-invariant representation methods for CL.

2605.15134 2026-05-18 cs.LG

Training ML Models with Predictable Failures

用可预测的故障训练机器学习模型

Will Schwarzer, Scott Niekum

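AI summary: This paper analyzes the forecasting of deployment-scale failure rates from the largest failure scores in an evaluation set, shows when such forecasts over- or under-predict, and proposes a fine-tuning objective, the forecastability loss, that reduces forecast error.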

Comments 32 pages, 9 figures. Updated with acknowledgments

AI中文摘要

估计机器学习模型在部署规模上失败的频率对于预部署安全评估至关重要，但可行的评估集通常不足以观察关键的故障。Jones等人（2025）通过从评估集中最大的k个故障评分进行外推来预测部署规模的故障率。我们对这种估计器的预测误差进行了有限k分解，并表明在典型情况下它具有内在的偏向于过预测的偏差，这在安全有利的方向。这种偏差在评估集遗漏了部署集包含的罕见高故障模式时会被抵消，导致预测在部署规模下低估。我们提出了一种细调目标，即预测性损失，以解决这种故障模式。在两个概念验证实验中，语言模型密码游戏和RL网格世界中，微调显著减少了保留的预测误差，同时保持了主要任务能力，并实现了与监督基线相似的安全性。

英文摘要

Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.

2605.15120 2026-05-18 cs.RO cs.AI cs.CV

CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning

CLOVER：端到端自动驾驶规划的闭环价值估计与排序

Sining Ang, Yuguang Yang, Canyu Chen, Yan Wang

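AI summary: CLOVER is a closed-loop value estimation and ranking framework that addresses the training-evaluation mismatch in end-to-end autonomous driving planning; its lightweight generator-scorer architecture improves planner performance and ranks candidate trajectories more accurately.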

AI中文摘要

端到端自动驾驶规划器通常通过模仿单条记录轨迹进行训练，但通过基于规则的规划指标进行评估，这导致了训练与评估之间的不匹配：接近记录路径的轨迹可能违反规划规则，而偏离记录路径的替代方案可能仍有效且得分高。这种不匹配对提案选择规划器尤其限制，因为其性能依赖于候选集覆盖和评分器排序质量。我们提出了CLOVER，一种用于端到端自动驾驶规划的闭环价值估计与排序框架。CLOVER采用轻量级生成器-评分器架构：生成器产生多样化的候选轨迹，评分器预测规划指标子分数以在推理时对它们进行排序。为了扩展提案支持超越单轨迹模仿，CLOVER构建了评估器过滤的伪专家轨迹，并通过集级别覆盖监督训练生成器。然后，它执行保守的闭环自我蒸馏：评分器被拟合到生成的提案上的真实评估子分数，而生成器则通过稳定性正则化向教师选择的前k和向量帕累托目标进行细化。我们分析了当评分器不完美时如何改进生成器，证明了当评分器选择的目标在真实评估下得到丰富且更新保持保守时，评分器介导的细化是可靠的。在NAVSIM上，CLOVER实现了94.5 PDMS和90.4 EPDMS，建立了新的状态。在更具挑战性的NavHard分割上，它获得了48.3 EPDMS，与最强报告结果相匹配。在补充的nuScenes开环评估中，CLOVER在比较方法中实现了最低的L2误差和碰撞率。代码数据将在https://github.com/WilliamXuanYu/CLOVER上发布。

英文摘要

End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training-evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator-scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-$k$ and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code and data will be released at https://github.com/WilliamXuanYu/CLOVER.

2605.14736 2026-05-18 cs.SD cs.LG

IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments

IsoNet：在复杂声学环境中具有空间意识的音频视觉目标语音提取

Dinanath Padhya, Sajen Maharjan, Binita Adhikari, Ishwor Raj Pokharel

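AI summary: IsoNet combines multi-channel STFT features, GCC-PHAT spatial cues, and face-conditioned visual embeddings to perform target speech extraction on a compact microphone array, outperforming conventional beamformers in complex acoustic environments.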

Comments 8 pages

AI中文摘要

目标语音提取在紧凑设备中仍然具有挑战性，因为单耳神经模型缺乏空间证据，而经典波束成形器在麦克风孔径仅为几厘米时会失去分辨能力。我们提出了IsoNet，一种适用于紧凑4麦克风阵列的音频-视觉目标语音提取系统。IsoNet结合了复数多通道STFT特征、GCC-PHAT空间线索、面部条件视觉嵌入以及辅助到达方向监督，嵌入到U-Net掩码估计网络中。三种课程变体在25,000个模拟VoxCeleb混音上进行了训练，逐步增加SNR难度。在覆盖-1到10 dB SNR的困难测试集上，IsoNet-CL1实现了9.31 dB SI-SDR，比混合物提高了4.85 dB，PESQ 2.13和STOI 0.84。Oracle延迟和求和及MVDR波束成形器在相同混合物上分别降低了4.82 dB和6.08 dB SI-SDRi，表明所提出的学习多模态条件解决了传统空间过滤无效的领域。消融研究显示，视觉条件、GCC-PHAT特征和扩展延迟-bin编码带来了持续的增益。结果建立了紧凑阵列、面部可选语音提取的基线，在受控模拟中识别了剩余的现实部署障碍，特别是相位重建、多干扰源混合物和模拟到现实的转移。

英文摘要

Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB SI-SDRi, respectively, showing that the proposed learned multimodal conditioning solves a regime where conventional spatial filtering is ineffective. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
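
The SI-SDR figures quoted above follow a standard definition: the estimate is projected onto the reference to find the scaled target component, and the ratio of target energy to residual energy is reported in decibels. The sketch below follows that common definition with mean removal; the function name and the toy signals are illustrative, and the paper's own evaluation code may differ in details.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    n = len(reference)
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]

    # Optimal scaling of the reference that best explains the estimate.
    alpha = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    residual_energy = sum(x * x for x in residual)
    if residual_energy == 0.0:
        return math.inf          # perfect reconstruction
    target_energy = sum(t * t for t in target)
    return 10.0 * math.log10(target_energy / residual_energy)


clean = [1.0, -1.0, 1.0, -1.0]
extracted = [1.1, -0.9, 0.9, -1.1]
score = si_sdr(extracted, clean)
rescaled = si_sdr([2.0 * x for x in extracted], clean)
```

The improvement reported as SI-SDRi is the difference between this value for the extracted signal and the same value computed for the unprocessed mixture, so a negative SI-SDRi means the processing made the signal worse than the mixture.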
