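arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.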

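Affiliations Khoury College of Computer Sciences, Northeastern University; Cheriton School of Computer Science, University of Waterloo

AI summary: To address bias affecting intersectional groups defined by multiple sensitive attributes, extends a bias mitigation framework with coverage constraints, optimizes over mitigation strategies with an integer linear program, trades bias approximation error against data efficiency, and characterizes the price of fairness.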

Comments Accepted to FAccT 2026

AI Chinese abstract

机器学习模型已被证明在多个敏感属性（如种族和性别）交叉的个体上表现出歧视性结果或性能下降。这源于两个相互关联的挑战：缺乏量化偏差（可能是交叉的）的原则性措施，以及训练数据中交叉子群的代表性不足。我们扩展了一个最近的偏差缓解框架，以纳入覆盖约束，确保跨群体（包括交叉子群）的充分代表性。由于对所有群体实现完全零偏差可能不是数据高效的（意味着可能需要大量数据），我们的解决方案在满足覆盖约束的同时，用偏差的小近似误差换取更高的数据效率。我们还将偏差缓解表述为一个整数线性规划，优化所有缓解策略，并刻画公平的代价，即最小数据修改成本，作为公平容忍度的函数。这对于法律合规（法规可能规定特定的公平阈值）和数据治理（使从业者能够在偏差减少和数据修改（特别是数据购买）成本之间做出明智的权衡）都至关重要。我们在公开数据集上评估了我们的技术，表明通过我们的框架进行偏差缓解可以保持多个分类器的预测准确性，并且覆盖约束虽然出于统计考虑，但对于保持下游机器学习性能至关重要。

English abstract

Machine learning models have been shown to exhibit discriminatory outcomes or degraded performance for individuals at the intersection of multiple sensitive attributes, such as race and gender. This stems in part from two interrelated challenges: the lack of principled measures for quantifying bias (potentially intersectional), and insufficient representation of intersectional subgroups in training data. We extend a recent bias mitigation framework to incorporate coverage constraints that enforce sufficient representation across groups, including intersectional subgroups. Since achieving exactly zero bias for all groups may not be data efficient (meaning it may require large amounts of data), our solution trades small approximation errors in bias for greater data efficiency while satisfying coverage constraints. We also formulate bias mitigation as an integer linear program that optimizes over all mitigation strategies, and characterize the price of fairness, the minimum data modification cost, as a function of fairness tolerance. This is essential both for legal compliance, where regulations may mandate specific fairness thresholds, and for data governance, enabling practitioners to make informed trade-offs between bias reduction and data modification (particularly, data purchasing) costs. We evaluate our techniques on publicly available datasets, demonstrating that bias mitigation via our framework preserves predictive accuracy across multiple classifiers, and that coverage constraints, while motivated by statistical considerations, are essential for preserving downstream ML performance.

2606.20459 2026-06-19 cs.AI New submission

Context-Aware Hierarchical Bayesian Modeling of IVF Laboratory Environmental Conditions

IVF实验室环境条件的上下文感知分层贝叶斯建模

Zahra Asghari Varzaneh, Reza Khoshkangini, Pia Saldeen, Lars Johansson, Thomas Ebner

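Affiliations Department of Computer Science and Media Technology, Malmö University

AI summary: Engineers 55 context-aware temporal features that capture incubator microenvironment dynamics and combines them with a hierarchical Bayesian Beta regression model that shares environmental effects across clinics, cutting prediction error from 3-5% to 1.27% and reaching R² = 0.86 with a 64% error reduction at a Northern European clinic.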

AI Chinese abstract

IVF妊娠率通常使用患者层面变量进行建模，而高分辨率实验室环境数据仍未得到充分利用。我们表明这是一个错失的机会。我们不再依赖原始传感器平均值，而是设计了55个上下文感知的时间特征，包括滚动热稳定性、同时温湿度符合性、峰值应力持续时间和应力后恢复速度，这些特征捕捉了培养箱微环境的动态。基于来自一家亚洲IVF诊所的61周数据，这些特征将交叉验证预测误差降低至1.27%，而原始平均值的误差为3-5%。然后，我们训练了一个分层贝叶斯Beta回归模型，通过部分池化在亚洲和北欧诊所之间共享环境效应，同时保留特定于诊所的基线。在来自北欧诊所的保留数据上，该模型在35-39岁年龄组中实现了R²=0.86和相对于朴素基线的64%误差降低，表明结构化的环境监测包含具有临床意义的可迁移信号。

English abstract

IVF pregnancy rates are routinely modeled using patient-level variables, while high-resolution laboratory environmental data remain underutilized. We show that this is a missed opportunity. Rather than relying on raw sensor averages, we engineer 55 context-aware temporal features, including rolling thermal stability, simultaneous temperature-humidity adherence, peak stress duration, and post-stress recovery speed, that capture the dynamics of incubator microenvironments. On 61 weeks of data from an Asian IVF clinic, these features reduce cross-validated prediction error to 1.27%, compared to 3-5% for raw averages. We then train a hierarchical Bayesian Beta regression model that shares environmental effects across an Asian and a Northern European clinic via partial pooling, while preserving site-specific baselines. On held-out data from the Northern European clinic, the model achieves R² = 0.86 and a 64% error reduction for the 35-39 age group over a naive baseline, demonstrating that structured environmental monitoring contains clinically meaningful, transferable signal.

2606.20458 2026-06-19 cs.RO New submission

Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation

慢速大脑，快速规划器：延迟鲁棒的VLM增强城市导航

Zhenghao "Mark" Peng, Honglin He, Quanyi Li, Yukai Ma, Bolei Zhou

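Affiliations Amazon FAR; UCLA; Independent; Zhejiang University

AI summary: To close the trajectory scoring gap in sidewalk navigation for mobile robots, proposes a training-free, latency-resilient trajectory-level fusion layer in which a VLM selects a candidate trajectory that is fused with the planner's output, reducing ADE by 30% in challenging scenarios.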

AI Chinese abstract

基于学习的人行道导航规划器可以实时生成多样化的候选轨迹，但其评分函数在挑战性场景中往往无法选择最佳轨迹，即使同一集合中存在更好的候选，也会输出使移动机器人驶入草地、朝向行人或错误方向的轨迹。我们称之为轨迹评分差距：在真实世界的人行道导航中，基于锚点的规划器的最佳选择与最佳候选之间的差距很大，这可能是由于规划器的高层场景理解能力有限。我们不是用端到端的视觉-语言-动作模型替换规划器，而是提出一种VLM-规划器接口，使用VLM从规划器的候选集合中选择一个候选索引，然后将其与规划器的初始输出融合。然而，VLM每次查询需要1-3秒，因此无法直接驱动5-20Hz的控制循环。我们贡献了一种无需训练、延迟鲁棒的轨迹级融合层，通过指数衰减的几何相似性将过时的VLM选择转化为实时规划器评分。在约2000个具有挑战性的真实世界场景（例如交叉口、行人相遇）中，VLM选择相比规划器的最佳选择实现了30%的ADE降低，而规划器在常规场景中仍保持竞争力。在仿真中，Score Fusion在高达5秒的延迟下仍保持>80%的成功率。我们在移动机器人上展示了完整系统，在具有不同网络延迟的具有挑战性的校园人行道上进行导航。

English abstract

Learning-based planners for sidewalk navigation can generate diverse candidate trajectories in real time, yet their scoring functions often fail to select the best trajectory in challenging situations, outputting trajectories that make the mobile robot drive onto grass, toward pedestrians, or in the wrong direction, even when better candidates exist in the same set. We call this the trajectory scoring gap: in real-world sidewalk navigation, the gap between an anchor-based planner's top choice and the best possible candidate is substantial, likely due to limited high-level scene understanding capability of the planner. Rather than replacing the planner with an end-to-end Vision-Language-Action model, we propose a VLM-Planner interface that uses a VLM to select a candidate index from the planner's proposal set and then fuses it with the planner's initial output. However, VLMs take 1-3 s per query and so cannot directly drive a 5-20 Hz control loop. We contribute a training-free, latency-resilient trajectory-level fusion layer that turns a stale VLM selection into real-time planner scoring via geometric similarity with exponential decay. On about 2,000 challenging real-world scenarios (e.g., junctions, pedestrian encounters), VLM selection achieves 30% ADE reduction versus the planner's best selection, while the planner remains competitive in routine situations. In simulation, Score Fusion maintains >80% success rate with delays up to 5 s. We demonstrate the full system on a mobile robot navigating challenging campus sidewalks with varied network latency.
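
The fusion idea in this abstract can be illustrated with a minimal sketch. It is not the authors' Score Fusion implementation: the function name `fuse_scores`, the additive form, the time constant `tau_s`, and the distance kernel `length_scale_m` are all assumptions made for illustration. Each planner candidate receives a bonus proportional to its geometric similarity to the trajectory the VLM picked earlier, and that bonus decays exponentially with the age of the VLM answer.

```python
import math

def fuse_scores(planner_scores, candidates, vlm_choice, vlm_age_s,
                weight=1.0, tau_s=2.0, length_scale_m=1.0):
    """Add a decayed similarity bonus to each planner score (illustrative only)."""
    # The bonus shrinks as the VLM answer gets older (assumed exponential in time).
    decay = math.exp(-vlm_age_s / tau_s)
    fused = []
    for score, traj in zip(planner_scores, candidates):
        # Mean pointwise distance between this candidate and the VLM-selected trajectory.
        dist = sum(math.dist(p, q) for p, q in zip(traj, vlm_choice)) / len(traj)
        # Assumed similarity kernel: 1 for identical trajectories, smaller when far apart.
        similarity = math.exp(-dist / length_scale_m)
        fused.append(score + weight * decay * similarity)
    return fused

# Two hypothetical candidates: go straight, or veer left.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
left = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
fresh = fuse_scores([0.6, 0.5], [straight, left], left, vlm_age_s=0.0)
stale = fuse_scores([0.6, 0.5], [straight, left], left, vlm_age_s=100.0)
```

With a fresh VLM answer the candidate matching the VLM pick ranks first; once the answer is very old the ranking falls back to the planner's own scores, so the control loop never has to wait for the VLM.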

2606.20455 2026-06-19 cs.CV New submission

PCFootprint: A Large-Scale Dataset and Benchmark for Vectorized Building Footprint Extraction from Aerial LiDAR Point Clouds

PCFootprint：用于从航空LiDAR点云中提取矢量化建筑足迹的大规模数据集与基准

Haoyuan Shen, Kuihao Wang, Ruisheng Wang, Yujun Liu

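Affiliations School of Architecture and Urban Planning, Shenzhen University

AI summary: Presents PCFootprint, the first large-scale dataset for building footprint extraction from airborne laser scanning point clouds, with 33,000 tiles and a cross-domain test set, and benchmarks mainstream methods to expose the challenges of complex geospatial environments.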

Comments 14 pages, 9 figures

AI Chinese abstract

建筑足迹提取是摄影测量、遥感和计算机视觉中的基本任务。近年来，基于图像的方法在高分辨率光学影像的矢量化足迹提取方面取得了显著进展。然而，光学影像本质上易受遮挡、透视畸变和残余地形位移的影响，导致足迹提取不完整或错位。此外，缺乏显式高程信息限制了其在细节层次建筑建模中的直接适用性。本文提出PCFootprint，这是首个用于从机载激光扫描点云中提取足迹的大规模公共数据集。PCFootprint包含来自爱沙尼亚土地和空间发展局的33000个瓦片，覆盖多样化的城市和乡村景观。每个瓦片大小为128×128米，并配有与点云对齐的系统性矢量化足迹。该数据集包括一个3000个瓦片的跨域测试集，用于评估跨地理区域的泛化能力。我们通过评估主流方法建立了全面的基准。实验结果表明，在复杂地理环境中存在高类内方差、数据不平衡和噪声等显著挑战。我们相信PCFootprint将推动建筑建模、城市场景理解和地理空间分析的未来研究。PCFootprint数据集公开于：https://huggingface.co/datasets/Haoyuan-Shen/PCFootprint。

English abstract

Building footprint extraction is a fundamental task in photogrammetry, remote sensing, and computer vision. Recent image-based methods have achieved remarkable progress in extracting vectorized footprints from high-resolution optical imagery. However, optical imagery is inherently susceptible to occlusions, perspective distortions, and residual relief displacement, yielding incomplete or misaligned footprint extraction. Furthermore, the lack of explicit elevation information limits its direct applicability to Level of Detail building modeling. In this paper, we present PCFootprint, the first large-scale public dataset for footprint extraction from airborne laser scanning point clouds. PCFootprint comprises 33,000 tiles derived from the Estonian Land and Spatial Development Board, covering diverse urban and rural landscapes. Each tile spans 128 m × 128 m, with systematically vectorized footprints aligned to the point clouds. The dataset includes a 3,000-tile cross-domain test set for evaluating generalization across geographic regions. We establish comprehensive benchmarks by evaluating mainstream methods. Experimental results reveal significant challenges including high intra-class variance, data imbalance, and noise across complex geospatial environments. We believe PCFootprint will advance future research in building modeling, urban scene understanding, and geospatial analysis. The PCFootprint dataset is publicly available at https://huggingface.co/datasets/Haoyuan-Shen/PCFootprint.

2606.20438 2026-06-19 cs.AI New submission

Interpretable Sperm Morphology Classification via Attention-Guided Deep Learning

可解释的精子形态分类：基于注意力引导的深度学习

Zahra Asghari Varzaneh, Reza Khoshkangini, Thomas Ebner, Lars Johansson

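Affiliations Department of Computer Science and Media Technology, Malmö University

AI summary: Proposes an attention-guided deep learning framework that combines EfficientNet-B0 with CBAM modules for sperm morphology classification, reaching 90.2% and 93.9% accuracy on the SMIDS and HuSHem datasets respectively, with Grad-CAM++ visualizations for interpretability.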

2606.20431 2026-06-19 cs.LG New submission

Sparsity, Superposition, and Forgetting: A Mechanistic Study of Representation Retention in Continual Learning

稀疏性、叠加与遗忘：持续学习中表示保持的机制研究

Jan Wasilewski, Jędrzej Kozal, Michał Woźniak, Bartosz Krawczyk

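Affiliations Rochester Institute of Technology; Wrocław University of Science and Technology

AI summary: Studies forgetting mechanisms in continual learning with a controlled toy framework, finding that superposition increases over time with transient dips at task boundaries, that higher sparsity increases superposition without necessarily causing forgetting, and that task-level effective rank grows with sparsity.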

AI Chinese abstract

持续学习（CL）系统常常遗忘先前获得的知识，但由于真实数据集纠缠了许多因素，遗忘的机制在实践中难以孤立。我们提出了一个可控的玩具世界框架，使这些机制可观察和可测试。使用合成生成器-分离器流水线，我们定义了真实潜在特征，构建了具有可调稀疏性和重叠的任务，并引入了表示强度和叠加（特征间的方向重叠）的可测量量。然后，我们通过拟合保留、叠加和暴露历史之间的稀疏动态关系（通过SINDy）来研究保留动态——表示强度的时间变化。基于有效秩的互补任务级分析表征了表示能力如何在任务间分配。我们的受控实验得出三个要点。（1）叠加随时间增加，在任务边界处有瞬降，表明边界特定的干扰而非稳定漂移。（2）更高的特征稀疏性导致更多叠加，但不必然引起遗忘；当表示保持强时，尽管重叠，遗忘可以减少。（3）任务级有效秩随稀疏性增长，表明在稀疏机制下更广泛的能力使用。这些结果共同细化了常见直觉——更多叠加导致更多遗忘，通过显示重叠与表示强度和能力分配相互作用。我们的玩具分析为CL提供了可证伪的假设和诊断工具。

English abstract

Continual learning (CL) systems often forget previously acquired knowledge, yet the mechanisms driving forgetting remain hard to isolate in practice because real datasets entangle many factors. We present a controlled, toy-world framework that makes these mechanisms observable and testable. Using a synthetic generator-separator pipeline, we define ground-truth latent features, build tasks with tunable sparsity and overlap, and introduce measurable quantities for representation strength and superposition (directional overlap among features). We then study retention dynamics, the temporal change of representation strength, by fitting sparse dynamical relations (via SINDy) between retention, superposition, and exposure history. A complementary task-level analysis based on effective rank characterizes how representational capacity is allocated across tasks. Our controlled experiments yield three takeaways. (1) Superposition tends to increase over time with transient dips at task boundaries, suggesting boundary-specific interference rather than steady drift. (2) Higher feature sparsity induces more superposition yet does not inevitably cause forgetting; when representations remain strong, forgetting can be reduced despite overlap. (3) Task-level effective rank grows with sparsity, indicating broader capacity usage under sparse regimes. Together, these results nuance the common intuition that more superposition leads to more forgetting by showing that overlap interacts with representation strength and capacity allocation. Our toy analysis provides falsifiable hypotheses and diagnostic tools for CL.
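
For readers unfamiliar with effective rank, here is a minimal sketch using the common entropy-based definition: the exponential of the Shannon entropy of the normalized singular values. The abstract does not say which definition the authors use, so this choice, the function name `effective_rank`, and the `eps` threshold are assumptions.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """Entropy-based effective rank of a (samples x dimensions) feature matrix."""
    singular_values = np.linalg.svd(features, compute_uv=False)
    # Normalize singular values into a probability distribution.
    p = singular_values / (singular_values.sum() + eps)
    # Drop numerically zero entries so log() stays finite.
    p = p[p > eps]
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))

full = effective_rank(np.eye(4))                                   # four equally used directions
collapsed = effective_rank(np.outer([1.0, 2.0, 3.0], [1.0, 1.0]))  # a single direction
```

A representation that spreads variance evenly over four directions scores close to 4, while one that collapses onto a single direction scores close to 1, which is the sense in which a growing effective rank indicates broader capacity usage.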

2606.20426 2026-06-19 cs.RO New submission

TaCauchy: An Extensible FEM Framework for Vision-Based Tactile Simulation

TaCauchy：面向视觉触觉仿真的可扩展有限元框架

Hengfei Zhao, Yifan Xie, Junhao Gong, Yue Sun, Kai Zhu, Weihua He, Shoujie Li, Haohuan Fu, Wenbo Ding

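Affiliations Shenzhen International Graduate School, Tsinghua University; Huawei Inc.

AI summary: Presents TaCauchy, a framework that integrates the finite element method into Isaac Sim on top of the UIPC solver, directly computing Cauchy stress tensors and projecting them into contact forces for high-fidelity tactile simulation; it supports multiple sensors and reaches SSIM > 0.93 in physical validation.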

Comments Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2026

AI Chinese abstract

基于视觉的触觉传感器需要高保真仿真以支持强化学习，然而现有方法难以在GPU加速的机器人平台中提供精确的机械应力场。我们提出TaCauchy，一个可扩展的有限元法（FEM）框架，将严格的基于物理的力计算集成到Isaac Sim中。TaCauchy基于统一增量势接触（UIPC）求解器，直接从超弹性本构定律计算柯西应力张量，并将其投影到接触表面以获得牵引力和压力分布，从而从第一性原理而非经验估计提供机械真实值。我们的框架具有几何感知自适应细化的自动网格生成和模块化传感器接口，能够以最小配置快速集成多种传感器（GelSight Mini、DIGIT、9DTact）。性能基准测试显示，单环境帧率为33.40 FPS，60个并行环境的总吞吐量为555 FPS，应力提取开销低于1 ms。物理验证实验表明，在1.2556 N至4.7332 N的力范围内，仿真与真实触觉响应高度一致，SSIM超过0.93，证实了该框架为下游机器人操作任务提供准确、基于物理的力监督的能力。

English abstract

Vision-based tactile sensors require high-fidelity simulation for reinforcement learning, yet existing approaches struggle to provide accurate mechanical stress fields within GPU-accelerated robotics platforms. We present TaCauchy, an extensible Finite Element Method (FEM) framework that integrates rigorous physics-based force computation into Isaac Sim. Built on the Unified Incremental Potential Contact (UIPC) solver, TaCauchy directly computes Cauchy stress tensors from hyperelastic constitutive laws and projects them onto contact surfaces to obtain traction forces and pressure distributions, providing mechanical ground truth from first principles rather than empirical estimation. Our framework features automatic mesh generation with geometry-aware adaptive refinement and a modular sensor interface enabling rapid integration of diverse sensors (GelSight Mini, DIGIT, 9DTact) with minimal configuration. Performance benchmarks demonstrate 33.40 FPS for single environments and 555 FPS aggregate throughput across 60 parallel environments, with stress extraction overhead under 1 ms. Physical validation experiments show strong agreement between simulated and real tactile responses across force ranges from 1.2556 N to 4.7332 N, achieving SSIM above 0.93, confirming the framework's capability to provide accurate, physically-grounded force supervision for downstream robotic manipulation tasks.
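
The projection step mentioned in this abstract follows the standard continuum-mechanics relation t = σ·n between a Cauchy stress tensor σ and the traction t on a surface with unit normal n. The sketch below shows only that relation; it does not use the TaCauchy or UIPC APIs, the function name `traction_and_pressure` is hypothetical, and the sign convention (compression counted as positive pressure) is an assumption.

```python
import numpy as np

def traction_and_pressure(stress, normal):
    """Project a 3x3 Cauchy stress tensor onto a surface with the given normal."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)                      # make sure the normal is unit length
    traction = np.asarray(stress, dtype=float) @ n  # t = sigma . n
    normal_component = traction @ n
    pressure = -normal_component                   # assumed convention: compression is positive
    shear = traction - normal_component * n        # tangential part of the traction
    return traction, pressure, shear

# Hypothetical stress state: compressive along z with a small xz shear.
stress = np.array([[-2.0, 0.0, 1.0],
                   [0.0, -2.0, 0.0],
                   [1.0, 0.0, -5.0]])
traction, pressure, shear = traction_and_pressure(stress, [0.0, 0.0, 2.0])
```

For this stress state the traction on the z-facing surface is (1, 0, -5), which splits into a pressure of 5 and a shear of (1, 0, 0) in the same units as the stress.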

2606.20424 2026-06-19 cs.RO New submission

LIT-GS: LiDAR-Inertial-Thermal Gaussian Splatting for Illumination-Robust Mapping

LIT-GS: 面向光照鲁棒建图的激光雷达-惯性-热高斯泼溅

Shikuan Shi, Chunran Zheng, Jiaming Xu, Tianyong Ye, Tao Yu, Yukang Cui

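Affiliations College of Mechatronics and Control Engineering, Shenzhen University; Department of Mechanical Engineering, The University of Hong Kong

AI summary: Presents LIT-GS, a framework that uses LiDAR plane geometry constraints to jointly optimize poses and Gaussians, addressing the fragility of RGB-dependent pipelines under illumination changes and in texture-deficient scenes and improving geometric accuracy and rendering quality.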

Comments Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

AI Chinese abstract

高斯泼溅实现了实时神经渲染，但现有的激光雷达-惯性-视觉（LIV）高斯建图流程由于依赖RGB光度线索，在光照变化和纹理缺失场景下仍然脆弱。我们提出了LIT-GS，一个激光雷达-惯性-热高斯泼溅框架，将激光雷达导出的平面几何作为显式约束注入到位姿/结构优化和高斯优化中。具体来说，我们利用LIV视觉地图点作为置信度感知的跨模态锚点，建立可靠的热-激光雷达关联，并在弱热监督下将加权的激光雷达点到平面残差引入光束法平差，以联合优化相机位姿和3D点。基于优化后的结构，我们进一步引入一个激光雷达平面正则化的可微泼溅目标，约束渲染的3D点与局部观测平面对齐，从而减轻低对比度热图像中的表面增厚和结构漂移。在专有序列和公开数据集上的实验表明，LIT-GS在几何精度和渲染质量上持续优于最先进的基于LIV的高斯泼溅基线，尤其是在具有挑战性的光照条件下。

English abstract

Gaussian Splatting has enabled real-time neural rendering, yet existing LiDAR-inertial-visual (LIV) Gaussian mapping pipelines remain fragile under illumination changes and texture-deficient scenes due to their reliance on RGB photometric cues. We present LIT-GS, a LiDAR-inertial-thermal Gaussian Splatting framework that injects LiDAR-derived plane geometry as an explicit constraint in both pose/structure refinement and Gaussian optimization. Specifically, we exploit LIV visual map points as confidence-aware cross-modal anchors to establish reliable thermal-LiDAR associations, and incorporate weighted LiDAR point-to-plane residuals into bundle adjustment to jointly refine camera poses and 3D points under weak thermal supervision. Building on the refined structure, we further introduce a LiDAR-plane-regularized differentiable splatting objective that constrains rendered 3D points to align with locally observed planes, mitigating surface thickening and structural drift in low-contrast thermal imagery. Experiments on proprietary sequences and public datasets demonstrate that LIT-GS consistently improves geometric accuracy and rendering quality over state-of-the-art LIV-based Gaussian Splatting baselines, particularly in challenging lighting conditions.

2606.20419 2026-06-19 cs.CV New submission

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

谱查询-键乘积权重引导用于免训练VLM幻觉缓解

Karn Tiwari, Varnith Chordia, Prathosh A P

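Affiliations Indian Institute of Science, Bengaluru; Snap Research

AI summary: Proposes QK Product Steering, a data-free, training-free, zero-inference-cost weight edit that reduces object hallucination by suppressing dominant singular modes in middle layers, lowering CHAIR$_s$ by 4.0% on average across three GQA-based VLMs.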

Comments Under Review

AI Chinese abstract

视觉语言模型（VLM）通常生成流畅但视觉上无依据的描述，尤其是提及图像中不存在的对象。我们提出QK乘积引导，一种无数据、免训练、零推理成本的权重编辑方法，用于减少对象幻觉。该方法通过抑制选定中间层中少量主导奇异模式，直接编辑每头的查询-键乘积（即产生softmax前注意力logits的算子）。然后，通过封闭形式的仅查询更新将编辑后的乘积映射回查询权重，同时保持共享的键权重固定，使编辑兼容分组查询注意力。我们进一步将QK乘积分解为对称和反对称分量，以区分相互内容相似性模式与方向性注意力模式。在三个基于GQA的VLM上，QK乘积引导实现了平均相对CHAIR$_s$降低4.0%，而匹配的随机模式控制显示可忽略的变化。可解释性消融表明，幻觉信号特定于主导QK模式，并主要定位于对称相互注意力通道。总体而言，QK乘积引导提供了一种解码时缓解的简单替代方案，无需额外数据、微调或推理时开销，同时基本保持多模态能力。

English abstract

Vision-language models (VLMs) often generate fluent but visually unsupported descriptions, especially by mentioning objects absent from the image. We propose QK Product Steering, a data-free, training-free, and zero-inference-cost weight edit for reducing object hallucination. The method directly edits the per-head query-key product, the operator that produces pre-softmax attention logits, by suppressing a small number of dominant singular modes in selected middle layers. The edited product is then mapped back to the query weights through a closed-form query-only update while keeping shared key weights fixed, making the edit compatible with grouped-query attention. We further decompose the QK product into symmetric and antisymmetric components to distinguish mutual content-similarity patterns from directional attention patterns. Across three GQA-based VLMs, QK Product Steering achieves an average relative CHAIR$_s$ reduction of 4.0%, while matched random-mode controls show negligible change. Interpretability ablations show that the hallucination signal is specific to dominant QK modes and is primarily localized to the symmetric mutual-attention channel. Overall, QK Product Steering offers a simple alternative to decoding-time mitigation, requiring no additional data, fine-tuning, or inference-time overhead while largely preserving general multimodal capability.
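
The steps named in this abstract (edit the query-key product, map the edit back to the query weights, split the product into symmetric and antisymmetric parts) can be sketched with plain linear algebra. This is one reading of the abstract and not the authors' code: the function names, the `scale` factor, the per-head weight shapes, and the use of a pseudoinverse as the closed-form query-only update are all assumptions.

```python
import numpy as np

def steer_query_weights(w_q, w_k, num_modes=1, scale=0.5):
    """Damp the top singular modes of W_Q W_K^T, then recover query-only weights.

    w_q, w_k: (d_model, d_head) projection matrices for a single head (assumed layout).
    """
    qk = w_q @ w_k.T                           # operator behind pre-softmax logits x_i qk x_j^T
    u, s, vt = np.linalg.svd(qk, full_matrices=False)
    s_edit = s.copy()
    s_edit[:num_modes] *= scale                # suppress the dominant modes
    qk_edit = (u * s_edit) @ vt
    # Least-squares solution of w_q_new @ w_k.T = qk_edit with w_k held fixed.
    w_q_new = qk_edit @ np.linalg.pinv(w_k.T)
    return w_q_new, qk_edit

def split_symmetric(qk):
    """Return the symmetric and antisymmetric parts of a square matrix."""
    return 0.5 * (qk + qk.T), 0.5 * (qk - qk.T)

rng = np.random.default_rng(0)
w_q = rng.normal(size=(8, 4))   # toy sizes, hypothetical
w_k = rng.normal(size=(8, 4))
w_q_new, qk_edit = steer_query_weights(w_q, w_k)
sym, anti = split_symmetric(qk_edit)
```

Because the edited product stays in the row space of the key projection, the query-only update reproduces it up to numerical error when the key matrix has full column rank, so the key weights shared across a query group never need to change.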

2606.20418 2026-06-19 cs.SD New submission

MixProLAP: Mixture-Induced Uncertainty Modeling for Probabilistic Language-Audio Pretraining

MixProLAP：混合诱导的不确定性建模用于概率性语言-音频预训练

Yu Nakagome, Jaesong Lee, Soo-Whan Chung

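Affiliations LINE WORKS Corporation; NAVER Cloud Corporation

AI summary: Proposes MixProLAP, a probabilistic audio-language pretraining framework that mixes audio-text pairs to simulate overlapping sounds, models many-to-many correspondence uncertainty, and introduces a multi-level inclusion loss, outperforming deterministic baselines on audio-text retrieval.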

Comments Accepted to Interspeech 2026

AI Chinese abstract

声学环境通常包含多个重叠的声音事件，且同一声学场景可以用不同的文本描述，使得音频-文本对齐存在固有的模糊性。本文提出一种概率性音频-语言预训练框架，用于建模音频-文本对齐中的多对多对应不确定性。与学习确定性点嵌入的传统对比方法不同，我们的方法将每个模态表示为分布，并学习不确定性感知的跨模态对齐。我们不依赖基于掩码的不确定性模拟，而是混合音频-文本对以创建更真实反映实际声学混合的重叠声音，并捕捉声音事件之间的语义包含关系。我们进一步引入多级包含损失，以强制表示与这些关系一致。在音频-文本检索基准上的实验表明，所提方法优于确定性基线。

English abstract

Acoustic environments often contain multiple overlapping sound events, and the same acoustic scene can be described using diverse textual expressions, making audio-text alignment inherently ambiguous. This paper proposes a probabilistic audio-language pretraining framework to model many-to-many correspondence ambiguity in audio-text alignment. Unlike conventional contrastive methods that learn deterministic point embeddings, our approach represents each modality as a distribution and learns uncertainty-aware cross-modal alignment. Rather than relying on masking-based uncertainty simulation, we mix audio-text pairs to create overlapping sounds that better reflect real acoustic mixtures and capture semantic inclusion relations among sound events. We further introduce a multi-level inclusion loss to enforce representations consistent with these relations. Experiments on audio-text retrieval benchmarks show that the proposed method outperforms deterministic baselines.

2606.20416 2026-06-19 cs.LG cs.CV New submission

On the Redundancy of Timestep Embeddings in Diffusion Models

扩散模型中时间步嵌入的冗余性研究

José A. Chávez

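Affiliations Independent Researcher, Lima, Peru

AI summary: Shows theoretically and empirically that, for U-Net and Diffusion Transformer architectures, diffusion models can reach the global optimum without explicit timestep embeddings and can even surpass conditioned models on some metrics.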

Comments 17 pages

AI Chinese abstract

扩散模型严重依赖显式的时间步嵌入来调节不同噪声尺度下的去噪过程。在这项工作中，我们通过分析时间步嵌入对U-Net和Diffusion Transformer架构的影响，挑战了这些时间信号的必要性。除了经验证据外，我们提供了一个理论框架，证明在某些条件下，无需显式时间步条件即可达到扩散训练目标的全局最小值。我们的发现揭示了当完全移除时间步嵌入时令人惊讶的鲁棒性。在CelebA和CIFAR-10数据集上的大量消融研究表明，这些时间无关模型可以保持高结构保真度，甚至在竞争性指标（包括FID、精确率和召回率）上超越其有条件对应模型。我们的分析表明，这些架构可以在特定假设下从损坏输入中隐式推断噪声尺度，使得显式时间条件变得冗余。这项研究挑战了长期以来的时间条件范式，并为更高效、更注重结构的生成架构铺平了道路。

English abstract

Diffusion models rely heavily on explicit timestep embeddings to modulate the denoising process across various noise scales. In this work, we challenge the necessity of these temporal signals by analyzing their impact on U-Net and Diffusion Transformer architectures. Beyond empirical evidence, we provide a theoretical framework demonstrating that, under certain conditions, the global minimizer of the diffusion training objective can be achieved without explicit timestep conditioning. Our findings reveal a surprising robustness when timestep embeddings are completely removed. Extensive ablation studies on the CelebA and CIFAR-10 datasets show that these time-agnostic models can maintain high structural fidelity and even surpass their conditioned counterparts in competitive metrics, including FID, precision, and recall. Our analysis suggests these architectures can implicitly infer noise scales from the corrupted input under specific assumptions, rendering explicit temporal conditioning redundant. This study challenges long-standing temporal conditioning paradigms and paves the way for more efficient and structurally focused generative architectures.

2606.20415 2026-06-19 cs.LG New submission

Pseudo-Feature Padding: A Lightweight Defense Against False Data Injection in Power Grids

伪特征填充：一种针对电网虚假数据注入的轻量级防御方法

Farhin Farhad Riya, Shahinul Hoque, Yingyuan Yang, Jinyuan Sun, Kevin Tomsovic

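Affiliations University of Tennessee; The University of Illinois at Springfield; Clemson University

AI summary: Proposes a lightweight defense framework that increases input dimensionality by padding with pseudo-features derived from the input's statistical distribution, making adversarial attacks computationally infeasible because perturbations do not transfer and the padded structure is unpredictable, and substantially improving the robustness of deep neural networks in power grid state estimation.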

AI Chinese abstract

深度神经网络（DNN）在各种任务中取得了显著的准确性，包括在信息物理系统（CPS）中用于检测关键操作期间的虚假数据注入攻击（FDIA）。然而，CPS的独特基础设施使得DNN容易受到攻击者的利用，以逃避检测。此外，CPS的独特性质对传统的FDIA防御机制提出了挑战。本文提出了一种创新的防御框架，通过引入一个额外的输入层，该层使用从输入统计分布中导出的伪特征值对输入样本进行填充，从而增强DNN抵御此类攻击的能力。这种填充以随机化和数据感知的方式增加了输入维度，使得由于精心设计的扰动的不可转移性和填充结构的不可预测性，对抗攻击在计算上变得不可行。我们的方法轻量级、与模型无关，并且不需要对核心架构进行修改，使其在现实世界的CPS环境中高度可部署。我们在关键电网应用（如使用IEEE 14节点、30节点、118节点和300节点系统的状态估计）上评估了我们的框架。对抗性设置下的实验表明，我们的填充策略显著提高了模型的鲁棒性，对性能的影响可以忽略不计，并有效缓解了原本会绕过传统防御的攻击。

English abstract

Deep Neural Networks (DNNs) have achieved remarkable accuracy in various tasks, including their application in Cyber-Physical Systems (CPS) for detecting False Data Injection Attacks (FDIA) during critical operations. However, the unique infrastructure of CPS makes DNNs vulnerable to exploitation by attackers aiming to evade detection. Additionally, the distinct nature of CPS presents challenges for conventional defense mechanisms against FDIA. This paper proposes an innovative defense framework that strengthens DNNs against such attacks by introducing an additional input layer that performs padding in the input samples using pseudo-feature values derived from the input's statistical distribution. This padding increases the input dimensionality in a randomized and data-aware manner, making adversarial attacks computationally infeasible due to the non-transferable nature of crafted perturbations and the unpredictability of the padded structure. Our method is lightweight, model-agnostic, and requires no modifications to the core architecture, making it highly deployable in real-world CPS settings. We evaluated our framework on critical power grid applications such as state estimation using the IEEE 14-bus, 30-bus, 118-bus, and 300-bus systems. Experiments under adversarial settings demonstrate that our padding strategy significantly improves model robustness with negligible impact on performance and effectively mitigates attacks that would otherwise bypass conventional defenses.
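
As a rough illustration of the padding idea in this abstract, the sketch below inserts pseudo-feature values at randomly chosen positions of an input vector. It is not the authors' layer: the function name `pad_with_pseudo_features`, the Gaussian sampling from the sample's own mean and standard deviation, and the position-selection scheme are assumptions, since the abstract only says the values are derived from the input's statistical distribution.

```python
import numpy as np

def pad_with_pseudo_features(x, num_pad, rng):
    """Insert num_pad pseudo-feature values at random positions of a 1-D input."""
    x = np.asarray(x, dtype=float)
    total = x.size + num_pad
    # Secret, randomized positions that will hold pseudo-features.
    pad_positions = rng.choice(total, size=num_pad, replace=False)
    mask = np.zeros(total, dtype=bool)
    mask[pad_positions] = True
    # Assumed data-aware sampling: match the mean and spread of the real measurements.
    pseudo = rng.normal(loc=x.mean(), scale=x.std(), size=num_pad)
    padded = np.empty(total)
    padded[mask] = pseudo
    padded[~mask] = x          # original features keep their relative order
    return padded, mask

# Hypothetical per-unit bus voltage measurements.
measurements = np.array([1.00, 1.02, 0.98, 1.01])
padded, mask = pad_with_pseudo_features(measurements, num_pad=3, rng=np.random.default_rng(0))
```

An attacker who crafts a perturbation for the original four-dimensional input does not know which of the seven positions carry real measurements, which is the kind of unpredictability the abstract relies on.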

2606.20404 2026-06-19 cs.CV New submission

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

FlowBender: 面向自校正条件流的反馈感知训练

Daniel Gilo, Sven Elflein, Ido Sobol, Or Litany

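Affiliations Technion; NVIDIA; University of Toronto; Vector Institute

AI summary: To address conditional diffusion and flow models that often violate their task constraints, proposes FlowBender, a closed-loop framework that feeds the alignment error back as an input and trains the network to learn a correction policy, improving fidelity and plausibility together in image translation, restoration, and 3D texturing.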

Comments Project page: https://flow-bender.github.io/

AI Chinese abstract

条件扩散和流模型通常无法满足定义其任务的约束条件。例如，深度条件模型经常产生重新提取的深度与输入不一致的图像，尽管定义约束的前向算子（深度预测器）在训练和推理期间都可用。现有方法通常分为两类：将条件信号视为静态线索并在推理时忽略对齐信息的监督模型，以及通过手动调整的线性更新咨询约束的基于引导的方法，通常以生成样本的合理性为代价来换取对条件的保真度。我们认为这两种范式的根本差距在于模型从未被训练利用自身的对齐误差。我们引入FlowBender，一个闭环框架，将此误差视为一等输入，训练网络学习基于推理时反馈的校正策略。在每一步，无引导的前瞻传递估计干净信号，通过前向算子计算特定任务的偏差，然后细化传递消耗此信号以产生校正速度。我们提出了FlowBender的几种变体，包括用于可微算子的基于梯度的公式和用于不可微设置（如JPEG压缩）的零阶变体。为了实现高效采样，我们引入了一个前一步捷径，使得以最小的额外计算成本实现闭环校正。在图像到图像翻译、复原和3D网格纹理贴图中，FlowBender始终优于标准监督基线、对齐损失增强训练和最先进的推理时引导，同时提高保真度和合理性，而不是在它们之间进行权衡。项目页面：https://flow-bender.github.io/

English abstract

Conditional diffusion and flow models routinely fail to satisfy the very constraints that define their task. For instance, a depth-conditioned model often produces images whose re-extracted depth disagrees with the input, even though the forward operator (the depth predictor defining the constraint) is available during both training and inference. Existing approaches generally fall into two categories: supervised models that treat the conditioning signal as a static cue and ignore alignment information at inference, and guidance-based methods that consult it through hand-tuned linear updates, typically trading fidelity to the condition against the plausibility of the generated sample. We argue that the fundamental gap in both paradigms is that the model is never trained to utilize its own alignment error. We introduce FlowBender, a closed-loop framework that treats this error as a first-class input, training the network to learn a correction policy conditioned on inference-time feedback. At each step, an unguided look-ahead pass estimates the clean signal, a task-specific deviation is computed via the forward operator, and a refinement pass consumes this signal to produce a corrected velocity. We propose several variants of FlowBender, including a gradient-based formulation for differentiable operators and a zero-order variant for non-differentiable settings such as JPEG compression. For efficient sampling, we introduce a prior-step shortcut that enables closed-loop correction at a minimal additional computational cost. Across image-to-image translation, restoration, and 3D mesh texturing, FlowBender consistently outperforms standard supervised baselines, alignment-loss-augmented training, and state-of-the-art inference-time guidance, improving fidelity and plausibility simultaneously rather than trading them against each other. Project page: https://flow-bender.github.io/

2606.20400 2026-06-19 cs.LG New submission

The Significance of Style Diversity in Annotation-Free Synthetic Data Generation

无标注合成数据生成中风格多样性的重要性

Zahra Abbasiantaeb, Zeno Belligoli, Omar Essam, Mohammad Aliannejadi

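Affiliations University of Amsterdam

AI summary: Proposes a dialogue generation framework that needs no human annotation, uses topic and style attributes to increase diversity, and adds two post-hoc stylization models; experiments show that style diversity matters more than topic diversity, with performance reaching up to 93.3% of that obtained with human-annotated data.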

AI Chinese abstract

为意图分类生成高实用性的合成数据通常需要人工标注的种子数据，这在快节奏的工业环境中往往不可用。在本文中，我们提出了一个完全无需人工标注数据、仅依赖意图定义的合成对话生成框架。我们提出的对话生成框架利用两种不同类型的主题和风格属性来提高数据多样性。此外，我们提出了两种新颖的后处理风格化模型，称为Univ和Exam，以将合成的LLM生成的语句转换为更多样化、更接近人类的语言风格。为了提升数据质量，我们利用LLM作为评判的过滤过程。在工业数据集和公开数据集上的实验结果表明，所提出的方法达到了使用人工标注训练数据所获得性能的93.3%。至关重要的是，研究结果揭示，对于合成数据的实用性，风格多样性比主题多样性更为关键，因为它能防止模型学习虚假的风格相关性。此外，研究表明，在生成过程中融入风格属性比后处理风格适应更有效。

English abstract

Generating high-utility synthetic data for intent classification typically requires human-annotated seed data, which is often unavailable in fast-paced industrial settings. In this paper, we propose a framework for synthetic dialogue generation that works entirely without human-annotated data, relying solely on intent definitions. Our proposed dialogue generation framework utilizes two different types of topic and style attributes to improve data diversity. Also, we propose two novel post-hoc stylization models called Univ and Exam to transform synthetic LLM-generated utterances into more varied, human-like linguistic styles. To enhance data quality, we utilize an LLM-as-a-judge filtering process. Experimental results on both industrial and public datasets demonstrate that the proposed approach achieves up to 93.3% of the performance obtained using human-annotated training data. Crucially, the findings reveal that style diversity is more critical than topic diversity for synthetic data utility, as it prevents models from learning spurious stylistic correlations. Furthermore, the study shows that incorporating style attributes during the generation process is more effective than post-hoc style adaptation.

2606.20390 2026-06-19 cs.CV New submission

Geometry-Aware Superpixel Graph Transformer with Metadata for Skin Lesion Classification

几何感知超像素图变换器结合元数据用于皮肤病变分类

Muhammad Azeem, Tanveer Hussain, Amr Ahmed, Ardhendu Behera

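Affiliations Edge Hill University

AI summary: Proposes a region-based graph learning framework that models lesions as superpixel graphs with geometric edge attributes and a metadata context node, performing multimodal fusion with an edge-aware graph transformer and outperforming existing methods on four public datasets.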

Comments Accepted at MICCAI 2026

AI Chinese abstract

由于病变结构异质性、类内变异大以及良恶性病例间细微视觉差异，从皮肤镜图像进行自动化皮肤癌分类仍然具有挑战性。现有的CNN/ViT流程通常依赖全局或补丁级特征，并常通过后期融合结合患者元数据，这限制了空间基础的多模态推理。我们提出一种新颖的基于区域的图学习框架，将病变显式建模为空间连贯的超像素区域图，这些区域表示为冻结的CNN特征。为了捕捉细粒度的病变排列，我们将区域间几何编码为边属性，并引入一个与所有区域相连的专用元数据上下文节点，从而在同一关系空间内结构化地整合人口统计学/临床变量。节点表示通过我们的边缘感知图变换器进行更新，随后进行注意力驱动的传播，最终生成用于良恶性分类的图级嵌入。在四个公开基准上的实验表明，显式的区域级关系建模和图原生多模态融合相较于现有技术取得了持续改进。因此，我们建立了一种新的以图为中心的视角，其中CNN特征被建模为关系节点，并通过上下文整合得到改进，从而产生更具表现力和鲁棒性的分类结果。

English abstract

Automated skin cancer classification from dermoscopic images remains challenging due to heterogeneous lesion structure, strong intra-class variability, and subtle visual differences between benign and malignant cases. Existing CNN/ViT pipelines typically rely on global or patch-level features and often combine patient metadata via late fusion, which limits spatially grounded multimodal reasoning. We present a novel region-based graph learning framework that explicitly models lesions as graphs of spatially coherent superpixel regions represented as frozen CNN features. To capture fine-grained lesion arrangements, we encode inter-regional geometry as edge attributes and introduce a dedicated metadata context node connected to all regions, providing structured integration of demographic/clinical variables within the same relational space. Node representations are updated using our edge-aware graph transformer followed by attention-driven propagation, producing a final graph-level embedding for benign-malignant classification. Experiments on four public benchmarks demonstrate that explicit region-level relational modeling and graph-native multimodal fusion yield consistent gains over the state-of-the-art. Consequently, we establish a new graph-centric perspective in which CNN features are modeled as relational nodes and improved through contextual integration, yielding more expressive and robust classifications.

2606.20381 2026-06-19 cs.AI New submission

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

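Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

Affiliations: Ling Team, Ant Group

AI summary: Finds that the E2M1 format suffers from shrinkage bias caused by geometric asymmetry, and that the Random Hadamard Transform amplifies this bias and destabilizes training; proposes uniform E1M2/INT4 grids and the UFP4 training recipe, which achieve lower loss across several models.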

Comments 18 pages, 12 figures

Abstract

FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
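To make the "geometric asymmetry of representable bins" concrete, the sketch below computes the mean signed round-to-nearest error inside each interior rounding cell of a grid, assuming values are uniformly distributed within the cell. This is a toy calculation under that stated assumption, not the paper's analysis; `interior_cell_bias` and the 8-level uniform grid standing in for an INT4-style magnitude grid are illustrative choices.

```python
# Non-negative magnitudes representable in E2M1 (sign bit omitted).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# A uniform 8-level grid standing in for an INT4-style magnitude grid (assumption).
UNIFORM_GRID = [float(i) for i in range(8)]

def interior_cell_bias(grid):
    """Mean of (quantized - x) over each interior round-to-nearest cell.

    Assumes x is uniform within the cell, so the mean error is the grid
    point minus the midpoint of its rounding interval.
    """
    biases = []
    for i in range(1, len(grid) - 1):
        lo = (grid[i - 1] + grid[i]) / 2   # boundary with the lower neighbour
        hi = (grid[i] + grid[i + 1]) / 2   # boundary with the upper neighbour
        biases.append(grid[i] - (lo + hi) / 2)
    return biases

e2m1_bias = interior_cell_bias(E2M1_GRID)
uniform_bias = interior_cell_bias(UNIFORM_GRID)
```

Under this assumption the only non-zero entries for E2M1 sit at the grid points where the spacing changes (2 and 4), and they are negative, meaning values there are pulled toward zero on average. Every cell of the uniform grid is symmetric, so its per-cell bias is zero.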

2606.20376 2026-06-19 cs.LG cs.AI New submission

CRAX: Fast Safe Reinforcement Learning Benchmarking

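Tristan Tomilin, Mourad Boustani, Mickey Beurskens, Thiago D. Simão

Affiliations: Eindhoven University of Technology

AI summary: Proposes CRAX, a JAX-accelerated safe RL benchmark built on the MJX physics engine with speedups of up to about 100x; it includes six environment suites and three agent-specific tasks, and an evaluation of six methods reveals trade-offs between performance and safety.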

Abstract

Safety is a core concern for deploying reinforcement learning (RL) agents in real-world domains such as robotics and autonomous driving. While benchmarks have been central to progress in RL, existing safety benchmarks with high-fidelity 3D physics remain computationally slow, limiting large-scale experimentation and rapid prototyping. To address this gap, we propose CRAX (Constrained RL Accelerated with JAX). Built on top of the MuJoCo XLA (MJX) physics engine with realistic 3D dynamics, CRAX leverages vectorized operations and hardware acceleration, yielding up to ~100x speedups over comparable CPU-based safety benchmarks. The benchmark features six environment suites and three agent-specific tasks, each spanning three difficulty levels. Evaluating six popular safe RL methods shows that no single approach dominates across all tasks, and reveals the trade-offs between performance and safety. We find that curriculum learning across difficulty levels and safety transfer can improve performance over direct training in harder settings.

2606.20369 2026-06-19 cs.CL New submission

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

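Helena Bonaldi, Genoveffa Martone, Marco Guerini

Affiliations: Fondazione Bruno Kessler; Università Cattolica del Sacro Cuore

AI summary: Introduces the first large-scale, expert-curated multilingual dialogue dataset addressing the overlap of hate and misinformation; the dialogues are anchored in fact-checking sources and carry span annotations, supporting RAG systems and the training of better-grounded counterspeech models.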

Abstract

Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer model generation. However, existing counterspeech datasets against the overlap of hate and misinformation are scarce and limited to single-turn English dialogues, while real-life interactions span across multiple turns and languages. To bridge this gap, we introduce the first large-scale, expert-curated, multilingual dataset of dialogues tackling the intersection of hate and misinformation. To ensure factual grounding, the dialogues are also anchored in verified external knowledge (i.e., fact-checking articles and NGO reports) and include document- and chunk-level span annotations, making it directly applicable for RAG systems. Covering five languages and targeting hate directed at seven marginalized groups, this novel resource enables the training and evaluation of more persuasive, factually grounded counterspeech models.

2606.20365 2026-06-19 cs.RO cs.MA New submission

An Infrastructure-less, Control-Independent Solution to Relative Localisation of a Team of Mobile Robots using Ranging Measurements

Paolo Golinelli, Tommaso Faraci, Daniele Fontanelli

Affiliations: Department of Industrial Engineering, University of Trento; Department of Information Engineering and Computer Science, University of Trento

AI summary: Proposes an anchor-less, fully decentralised cooperative localisation algorithm that relies only on local odometry, sparse ranging, and short-range communication; it achieves team observability without controlling the robots' motion and uses a multi-hypothesis Bayesian framework for robustness.

Abstract

The ability to localise teams of robots is essential for applications ranging from robotic fleets in unstructured environments to cooperative control and navigation tasks. In such contexts, fixed infrastructure is often unavailable, deployments must be fast and flexible, and system requirements must be minimal. We present a decentralised cooperative localisation algorithm that addresses all these challenges at once. The method is anchor-less, fully decentralised, and, unlike most existing approaches, does not require controlling the robots' motion to ensure team observability. It relies only on local odometry, sparse inter-agent ranging measurements, and short-range communication, all of which are widely available in practice. The algorithm adopts a multi-hypothesis Bayesian framework that maintains the entire set of feasible solutions, ensuring robustness under transient unobservable conditions. Moreover, through information sharing, each agent benefits from the estimates of the entire group, even in partially connected conditions.
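A small geometric example shows why a range-only method may need to keep several hypotheses alive. Two range measurements taken from two known poses generally leave a mirror ambiguity about the neighbour's position. The sketch below is only an illustration of that ambiguity, not the paper's Bayesian filter; the function name, the static-neighbour assumption, and the choice of frame are hypothetical.

```python
import math

def neighbour_hypotheses(baseline, r0, r1):
    """Candidate neighbour positions from two ranges taken `baseline` apart.

    Frame (an assumption for this sketch): range r0 is measured at the
    origin, range r1 after moving `baseline` along +x according to odometry,
    and the neighbour is static. Returns zero, one, or two (x, y) candidates.
    """
    x = (r0**2 - r1**2 + baseline**2) / (2 * baseline)
    h2 = r0**2 - x**2
    if h2 < 0:
        return []  # the ranges are inconsistent with the odometry
    y = math.sqrt(h2)
    return [(x, 0.0)] if y == 0 else [(x, y), (x, -y)]

hyps = neighbour_hypotheses(6.0, 5.0, 5.0)
```

Both returned candidates explain the two ranges equally well. Only later motion off the original line, or information shared by other agents, can separate them, which is one motivation for maintaining the full set of feasible solutions.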

2606.20364 2026-06-19 cs.LG New submission

Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation

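Ali Asaria, Tony Salomone, Deep Gandhi

Affiliations: Transformer Lab

AI summary: Presents a de-biased, cross-model VLM-as-3D-Judge protocol that extends the judge from ranking to optimization; with separate training and evaluation judges, position-bias correction, and fixes for three failure modes, lightweight adaptation matches the strong base model.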

Abstract

A companion study established a de-biased, cross-model VLM-as-3D-judge that reliably ranks single-image-to-3D mesh quality where cheap geometry and CLIP proxies fall short. This paper asks: can that judge's preferences specialize a strong open generator, TRELLIS, on one asset class (furniture), cheaply and without human labels? Taking the judge from ranking to optimization is where the work lives. Pushing a VLM judge into the training and evaluation loop exposes failure modes ranking never triggered, so our contribution is an optimization-grade hardening of the judge: a training judge (Qwen2.5-VL-7B) held distinct from an evaluation judge (InternVL3-8B) to break circularity; position-bias correction; and fixes for three failure modes (image overload, geometry-hiding splat renders, and reference-free judging that rewards clean-but-wrong outputs), with calibration evidence (clear-gap win-rate 0.83-1.0; base-vs-base ~0.5). Using this protocol as an independent evaluator, and working only from public models and data with lightweight parameter-efficient adaptation, we find our methods match the strong base rather than exceed it. Independent base samples carry essentially no learnable preference (0.94 order-flip rate), so signal must be engineered by quality-contrastive construction. Across six adaptation methods, two input regimes, and a severity sweep, the most targeted - conditioner repair under severe degradation - reaches parity (0.50) with the base, while no method clears the >=65% win-rate target. The result is mechanistic: clean inputs saturate the judge, flow-DIT fine-tuning washes out through the sampler, and conditioning repair is the locus that moves geometry. Win-rates are directional at n=8 objects. Matching a strong public-data base with cheap adaptation is itself informative: exceeding it needs more than lightweight PEFT on public data, and the judge protocol is reusable.
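One common way to correct position bias in pairwise judging is to query the judge in both presentation orders and accept a verdict only when the two agree. The sketch below illustrates that idea and the related order-flip rate. It is not taken from the paper: `judge` is a hypothetical callable, and the two toy judges exist only to exercise the code.

```python
def debiased_vote(judge, a, b):
    """Query the judge in both presentation orders.

    `judge(first, second)` is a hypothetical callable returning "first" or
    "second". A winner is reported only when both orders agree.
    """
    forward = a if judge(a, b) == "first" else b
    backward = b if judge(b, a) == "first" else a
    return forward if forward == backward else None  # None marks an order flip

def order_flip_rate(judge, pairs):
    """Fraction of pairs whose verdict changes when the order is swapped."""
    flips = sum(debiased_vote(judge, a, b) is None for a, b in pairs)
    return flips / len(pairs)

# Toy judges for illustration only.
always_first = lambda first, second: "first"
longer_wins = lambda first, second: "first" if len(first) >= len(second) else "second"

pairs = [("mesh_a", "mesh_bb"), ("mesh_ccc", "mesh_d")]
biased_rate = order_flip_rate(always_first, pairs)
consistent_rate = order_flip_rate(longer_wins, pairs)
```

A judge that always prefers the first slot flips on every pair, while a judge driven by content does not. A flip rate near 1 on a set of pairs therefore suggests the pairs carry little preference signal the judge can detect.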

2606.20363 2026-06-19 cs.AI New submission

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

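Yuexing Hao, Xiaomin Li

Affiliations: Massachusetts Institute of Technology; Harvard University

AI summary: Proposes a three-stage pipeline that mines readable skill libraries from GUI trajectories, but finds that readability does not guarantee downstream policy gains; GRPO yields only marginal improvement, exposing the limits of the current approach.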

Abstract

Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.
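Cluster purity, the readability measure quoted in this abstract, is usually computed as the share of a cluster's members that carry its most common reference label. The sketch below uses that standard definition; whether the paper computes purity in exactly this way is an assumption, and the labels and cluster contents are invented for illustration.

```python
from collections import Counter

def cluster_purity(labels):
    """Share of a cluster's segments that carry its most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical reference labels for the segments in two mined clusters.
clusters = {
    0: ["search"] * 19 + ["login"],
    1: ["login", "checkout", "login", "search"],
}
purity = {cid: cluster_purity(lbls) for cid, lbls in clusters.items()}
readable = [cid for cid, p in purity.items() if p >= 0.95]
```

High purity says a cluster lines up with one human-recognisable label. It says nothing about whether conditioning a policy on that cluster helps, which is the gap the abstract reports.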

2606.20359 2026-06-19 cs.LG New submission

Train, Retrieve, or Both? A Four-Arm Head-to-Head for Correct Statutory Citation on the Ontario Residential Tenancies Act

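Ali Asaria, Tony Salomone, Deep Gandhi

Affiliations: Transformer Lab

AI summary: Studies how self-represented tenants, landlords, and help-desk staff can obtain correct statutory citations; a four-arm experiment compares fine-tuning, retrieval, and a hybrid, and finds that the SFT+RAG hybrid scores highest on exact match with no hallucinated citations.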

Abstract

Self-represented tenants, landlords, and help-desk staff need to be pointed at the provision of law that actually governs a question, with a correct statutory citation. We study this task on the Ontario Residential Tenancies Act, 2006 (RTA) and its core regulation, asking the operator's question empirically: is fine-tuning enough, or is hybrid retrieval needed? We run a four-arm head-to-head on Qwen2.5-7B-Instruct (base zero-shot, LoRA SFT-only, RAG-only, and an SFT+RAG hybrid), scored on citation exact-match (section+subsection) over a small, human-verification-pending real eval set. The base model cannot cite the RTA and SFT-only mis-recalls sections; retrieval is essential and drives hallucination to zero by construction; and the SFT+RAG hybrid scores highest at 0.481 exact-match with zero hallucinated citations. Its edge comes from SFT making provision selection more robust to the higher-recall candidate sets that hurt zero-shot RAG. Notably, this cheap bge-small hybrid matches or beats a pipeline built on bigger, specialized retrieval models (a larger embedder and a cross-encoder reranker), and a larger/improved training set does not help either: strong statutory-citation performance here does not require specialized retrieval models or more data. The artifact zeroes hallucination and clears the lift-over-base bar but does not reach the aspirational 0.70 exact-match target. All results are on a small, human-verification-pending real eval set and are reported as preliminary.
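The two headline metrics, citation exact-match on (section, subsection) and hallucinated citations, can be sketched as below. This is an illustrative scoring function under stated assumptions, not the authors' evaluation code: a prediction is treated as hallucinated when it does not appear among the candidates retrieved for that question, and the section identifiers are placeholders rather than real provisions.

```python
def score_citations(predictions, gold, retrieved):
    """Return (exact_match, hallucination_rate) for (section, subsection) pairs.

    Assumption for this sketch: a prediction counts as hallucinated when it
    is absent from the candidates retrieved for that question.
    """
    n = len(gold)
    exact = sum(p == g for p, g in zip(predictions, gold)) / n
    halluc = sum(p not in c for p, c in zip(predictions, retrieved)) / n
    return exact, halluc

# Placeholder identifiers, not real statutory provisions.
gold = [("20", "1"), ("37", "2"), ("83", "1")]
pred = [("20", "1"), ("37", "1"), ("83", "1")]
retrieved = [
    [("20", "1"), ("20", "2")],
    [("37", "1"), ("37", "2")],
    [("83", "1")],
]
exact, halluc = score_citations(pred, gold, retrieved)
```

The second prediction names a retrieved provision with the wrong subsection, so it lowers exact-match without counting as a hallucination. That distinction is why a system can reach zero hallucinated citations while still missing an exact-match target.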

2606.20336 2026-06-19 cs.RO New submission

Autonomous Driving with Priority-Ordered STL Specifications Under Multimodal Uncertainty

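Taha Bouzid, Shuhao Qi, Mircea Lazar, Sofie Haesaert

Affiliations: Eindhoven University of Technology

AI summary: Proposes an uncertainty-aware trajectory planning framework that resolves conflicting objectives through a lexicographic priority ordering over Signal Temporal Logic specifications, implemented with Model Predictive Path Integral control and validated in simulation.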

Abstract

Autonomous vehicles must plan trajectories that satisfy a multitude of requirements on safety, passenger comfort, and compliance with traffic rules. However, in safety-critical scenarios, it is not always possible to satisfy all requirements simultaneously, necessitating their prioritization based on importance. At the same time, in these safety-critical scenarios, the uncertainty in trajectory predictions of the surrounding traffic, such as other vehicles and pedestrians, should be explicitly accounted for. In this work, we propose an uncertainty-aware trajectory planning framework that incorporates a predefined lexicographic ordering over Signal Temporal Logic (STL) specifications that stays valid under uncertainty. We implement this formulation with Model Predictive Path Integral (MPPI) control and we demonstrate the effectiveness of our method on simulation scenarios, showing that our framework efficiently handles conflicting objectives under realistic multi-modal uncertainty.
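A lexicographic ordering over specifications means that a violation of a higher-priority requirement always outweighs any gain on lower-priority ones. The sketch below shows that comparison on sampled candidate trajectories, using tuples of STL robustness values (positive when satisfied, negative when violated). It is a simplified illustration and not the paper's formulation: the clipping rule, the candidate names, and the numbers are assumptions, and uncertainty is not modelled.

```python
def lexicographic_key(robustness):
    """Map robustness values, ordered from highest to lowest priority, to a sort key.

    Assumption for this sketch: only violations (negative robustness) matter,
    so satisfied specifications are clipped to 0 before comparison.
    """
    return tuple(min(r, 0.0) for r in robustness)

def pick_trajectory(candidates):
    """Choose the candidate whose violations are lexicographically smallest."""
    return max(candidates, key=lambda name: lexicographic_key(candidates[name]))

# Robustness per candidate for (safety, traffic rules, comfort); values are made up.
candidates = {
    "brake_hard": (0.4, 0.2, -0.9),
    "swerve": (0.1, -0.3, 0.5),
    "keep_lane": (-0.2, 0.6, 0.8),
}
best = pick_trajectory(candidates)
```

Here `keep_lane` is rejected first because it violates safety, then `swerve` because it breaks a traffic rule, leaving `brake_hard` even though it is the least comfortable option. In a sampling-based controller such as MPPI, a key of this kind could be used to rank rollouts.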

2606.20333 2026-06-19 cs.AI New submission

SoftSkill: Behavioral Compression for Contextual Adaptation

Xijia Tao, Yihua Teng, Xinyu Fu, Ziru Liu, Kecheng Chen, Yuzhi Zhao, Suiyun Zhang, Rui Liu, Lingpeng Kong

Affiliations: The University of Hong Kong; Huawei Research; City University of Hong Kong; Huazhong University of Science and Technology

AI summary: Proposes SoftSkill, which compresses natural-language skills into compact continuous vectors through a trainable soft-skill prefix, improving question-answering and math performance on a frozen base model while reducing token count.

Abstract

Agent skills are commonly deployed as natural-language Markdown files that encode answer policies, evidence-use habits, and task procedures. These files are readable and portable, but they are consumed indirectly: for each task instance, a frozen language model must translate a long textual artifact into generation-time behavior. This paper asks whether a natural-language skill can instead initialize a compact continuous context object, refined by a trainable soft delta while the base model remains frozen. We propose SoftSkill, a frozen-backbone method that tunes such soft skills with next-token prediction and deploys them as latent behavioral priors at inference time. In our main single-round setting, a length-32 SoftSkill prefix on Qwen3.5-4B improves over no-skill prompting by 8.3 points on SearchQA, 42.1 points on LiveMath, and 1.3 points on DocVQA. Relative to SkillOpt, SoftSkill improves accuracy by 5.2 points on SearchQA and 12.5 points on LiveMath, while replacing hundreds to thousands of Markdown skill tokens with a few virtual tokens. We further study agentic execution as a harder boundary case, where sparse trajectory imitation provides useful signal but does not yet robustly compress long-horizon procedural behavior. More broadly, the results suggest that some task skills are better treated not as additional Markdown to be reinterpreted at inference time, but as compact latent controls over how a frozen model enters the task.

2606.20322 2026-06-19 cs.RO New submission

Towards 3D karst underwater scene reconstruction from rotating sonar data

Georgios Evangelos Margaritis, Lionel Lapierre, Simon Rohou, Zhi Yan, Andreas Nüchter, François Goulette

Affiliations: U2IS, ENSTA, Institut Polytechnique de Paris; Lab-STICC, ENSTA, Institut Polytechnique de Paris; Informatics XVII – Robotics, Julius-Maximilians-Universität Würzburg

AI summary: To address 3D reconstruction difficulties caused by sparse, noisy sonar data and navigation drift, proposes a pipeline that combines continuous-time SLAM trajectory correction with two-stage deep-learning surface reconstruction, producing 3D meshes suitable for immersive navigation.

Comments 1st Workshop on Long-term Deployments in the Wild (LoWi)

2606.20310 2026-06-19 cs.CV New submission

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

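Haoxuan Wu, Lai Man Po, Mengyang Liu, Kun Li, Hongzheng Yang, Wei Liu

Affiliations: City University of Hong Kong; Video Rebirth; The Chinese University of Hong Kong

AI summary: Proposes PRISM, which decodes preference signals from noisy latents using a frozen video diffusion backbone and a lightweight query-based aggregation head; it achieves accurate preference prediction and noise robustness, enabling early-stage Best-of-N sampling that lowers compute cost and improves video quality.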

Abstract

Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce PRISM (Preference Representation in Intermediate States of Diffusion Models). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-N sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone's generative performance and its inherent evaluative power, enabling self-improving video backbones.
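The compute saving from early-stage Best-of-N can be seen in a short sketch: all candidates are denoised for a few steps, scored in latent space, and only the top few are carried to the end. This is a schematic under assumptions, not PRISM itself; `score` and `denoise_step` are hypothetical callables standing in for a latent-space preference head and one sampler step, and the toy numbers below replace real latents.

```python
def early_best_of_n(latents, score, denoise_step, early_steps, total_steps, keep):
    """Denoise all candidates briefly, keep the top-`keep` by score, then finish.

    `score` and `denoise_step` are hypothetical callables standing in for a
    latent-space preference head and one sampler step.
    """
    for _ in range(early_steps):
        latents = [denoise_step(z) for z in latents]
    survivors = sorted(latents, key=score, reverse=True)[:keep]
    for _ in range(total_steps - early_steps):
        survivors = [denoise_step(z) for z in survivors]
    return survivors

calls = [0]  # counts sampler steps actually executed

def toy_step(z):
    calls[0] += 1
    return z * 0.5

out = early_best_of_n(
    [1.0, 8.0, 3.0, 5.0],
    score=lambda z: z,
    denoise_step=toy_step,
    early_steps=2,
    total_steps=4,
    keep=2,
)
```

With four candidates, four total steps, and pruning to two after two steps, the toy run executes 12 sampler steps instead of the 16 that full Best-of-N would need. The saving grows with the number of candidates and with how early the pruning happens.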

2606.20303 2026-06-19 cs.CV New submission

GEN-Guard: Correcting Generalization Failures for Deployable Federated Surgical AI

Julia Alekseenko, Pietro Mascagni, AI4SafeChole Consortium, Nicolas Padoy

Affiliations: University of Strasbourg, CNRS, INSERM, ICube, UMR7357; Bioimage Analysis Center, Fondazione Policlinico Universitario Agostino Gemelli IRCCS; Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico di Milano, University of Milan; Monaldi Hospital, AORN dei Colli

AI summary: Proposes the GEN-Guard framework, which detects performance leakage through Client-Blocked Evaluation and applies feature-level correction through Disagreement-Aware Distillation, improving cross-institutional generalization of federated surgical AI.

Journal ref Int J Comput Assist Radiol Surg. 2026 Jun 14

DOI: 10.1007/s11548-026-03713-0

Abstract

Federated Learning (FL) in surgical video AI enables collaborative model training without sharing sensitive data. However, standard evaluation practices - selecting the "best" global model based only on validation data from participating hospitals - can lead to suboptimal deployment choices. We identify this critical failure mode as performance leakage, where the selected model overfits internal federation data and fails to generalize to unseen institutions. We propose GEN-Guard, a practical post-hoc framework to detect and correct generalization failures in federated surgical AI. It integrates Generalization Detection via Client-Blocked Evaluation (CBE), which validates performance on isolated client distributions to prevent performance leakage, and Generalization Correction through Disagreement-Aware Distillation (DAD), which learns adaptive feature-level corrections for cross-institutional robustness. Both components operate after standard FL convergence while providing robust support for zero-shot adaptation to unseen environments. We first quantify the severity of performance leakage, observing Model Selection Failures (MSFs) exceeding 80% under standard evaluation. GEN-Guard is evaluated on two multi-center clinical challenges: surgical phase recognition in laparoscopic cholecystectomy and polyp segmentation in colonoscopy. Across both datasets, GEN-Guard consistently corrects these failures, improving in-federation F1 scores by up to 2 points, unseen-institution performance by up to 3 points, and worst-case institutional performance by 3-9 points. Performance leakage represents a systematic and previously under-recognized risk in federated surgical AI. GEN-Guard provides a practical solution for detecting and correcting such failures. By improving cross-institutional robustness and zero-shot generalization, it strengthens the reliability of FL for real-world surgical deployment.

2606.20302 2026-06-19 cs.CV New submission

CUPID: Reconstructing UV Texture Maps for Interpretable Person-of-Interest Deepfake Detection

Giovanni Affatato, Sara Mandelli, Edoardo Daniele Cannas, Paolo Bestagini, Stefano Tubaro

Affiliations: Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano

AI summary: Proposes CUPID, which uses UV texture maps from 3D face reconstruction and a Masked Autoencoder to detect person-of-interest deepfakes without training on any deepfake videos, while offering interpretability and robustness.

Abstract

Deepfakes targeting a high-profile individual, known as a Person-of-Interest (POI), are a threat to modern democracies and societies. Current POI deepfake detection methods still struggle to combine robustness to post-processing, efficiency, and interpretability, all key aspects of modern deepfake detectors. In this paper we propose CUPID, a POI video deepfake detector that combines UV texture maps, a facial appearance representation derived from 3D face reconstructions, with the representation learning capabilities of the Masked Autoencoder (MAE). Our method does not require any deepfake videos in its training phase. Moreover, it does not even require a specific POI to be included in the training set: the combination of UV texture maps extracted from real video frames and the MAE context-guided reconstruction yields a latent space that captures rich and discriminative facial features even for identities unseen during training. In the testing phase, the embeddings extracted from a query video depicting the POI can be matched against pristine reference videos to assess the video's authenticity. Furthermore, operating in the UV space naturally provides an additional layer of interpretability. Specifically, we can extract decoded residual maps that highlight which facial regions of a test video deviate most from the identity representation of the corresponding POI. Experiments on four deepfake datasets show that CUPID outperforms the current state of the art on most datasets and achieves the best overall robustness against strong downscaling and compression, while also providing substantially faster inference. Our experimental code will be released at https://github.com/polimi-ispl/CUPID.

2606.20300 2026-06-19 cs.CV New submission

CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

Junhao Cai, Deyu Zeng, Junhao Pang, Junyu Chen, Qiwei Liang, Xiaopin Zhong, Zongze Wu

Affiliations: Shenzhen University; Guangzhou Maritime University; Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes CMDS-AD, a cross-modal dual-stream anomaly detection framework that generates diverse samples with a diffusion model and uses low-frequency normal estimation to help decouple high-frequency defects; in the 1-shot setting it improves I-AUROC on MVTec 3D-AD by 5.7%.

Comments Accepted to ECCV 2026!

Abstract

Few-shot anomaly detection remains challenging due to limited training data. Multi-modal anomaly detection (MAD) offers a viable solution, leveraging 3D geometric cues to enrich 2D RGB representations and compensate for this scarcity. However, existing MAD methods apply spatially uniform feature processing, conflating stable macroscopic structures with high-frequency localized defect signals, exacerbating cross-modal misalignment and inflating false-positive rates. To overcome this, we present CMDS-AD, a Cross-Modal Dual-Stream Anomaly Detection framework. A LoRA-guided diffusion model generates diverse RGB samples to mitigate extreme data scarcity. For 3D normal augmentation, we employ a pre-trained diffusion model as a normal estimator. Crucially, this estimator inherently acts as a non-linear low-pass filter, directly extracting low-frequency normal representations from RGB inputs. This establishes an auxiliary estimated stream of purely low-frequency information, anchoring robust structural templates and assisting the uncompressed real stream, containing coupled high- and low-frequency components, to precisely isolate micro-defects. A Coordinate-Aware Hierarchical Feature Mapper adaptively aligns cross-modal semantics, while a multiplicative scoring mechanism filters modality-specific noise. Under the extreme 1-shot setting, CMDS-AD achieves absolute performance gains of 5.7% (I-AUROC) and 2.0% (AUPRO) on MVTec 3D-AD, alongside 7.7% and 5.6% improvements on EyeCandies, establishing a new state-of-the-art.

2606.20292 2026-06-19 cs.LG cs.LO New submission

Shifting-based Optimizable Linear Relaxations for General Activation Functions

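Philipp Kern, László Antal, Erika Ábráham, Carsten Sinz

Affiliations: Karlsruhe Institute of Technology; Karlsruhe University of Applied Sciences

AI summary: Proposes SLiR, which builds linear relaxations of general activation functions through slope parameterization and a shifting procedure, enabling efficient optimization while preserving soundness; it verifies up to 7.8x more properties than existing methods.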

Comments 21 pages, under review

Abstract

The use of neural networks (NNs) is rapidly increasing, including in safety- and security-critical domains. To provide formal guarantees about NN behavior, many verification methods rely on optimizable linear relaxations of activation functions. However, existing techniques depend on hand-crafted relaxations for each activation function. Extension to state-of-the-art activation functions therefore requires substantial manual effort. In contrast, our approach SLiR (Shifting-based Linear Relaxations) is broadly applicable, requiring only a Lipschitz constant or a set of critical points. SLiR parameterizes relaxations by their slope and computes the corresponding offset via a shifting procedure that ensures sound upper and lower bounds over the input domain, enabling efficient optimization while maintaining correctness. Our experiments show that SLiR produces tight relaxations across a wide range of practical activation functions and enables verification of up to 7.8x more properties compared to state-of-the-art methods.
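One way to obtain sound linear bounds for a fixed slope from nothing but a Lipschitz constant is to sample the gap between the activation and the line, then shift the extremes by the largest change that gap can make between neighbouring samples. The sketch below implements that idea. It is an assumption about how a shifting step could work, not SLiR's actual procedure, and `shifted_bounds` is a hypothetical name.

```python
import math

def shifted_bounds(f, lipschitz, lo, hi, slope, samples=101):
    """Offsets (b_lower, b_upper) with slope*x + b_lower <= f(x) <= slope*x + b_upper on [lo, hi].

    Sketch under assumptions: g(x) = f(x) - slope*x is sampled on a uniform
    grid, and the extremes are shifted by the largest change g can make
    between neighbouring samples, using the Lipschitz constant of g.
    """
    step = (hi - lo) / (samples - 1)
    # g is (lipschitz + |slope|)-Lipschitz and every x lies within step/2 of a sample.
    margin = (lipschitz + abs(slope)) * step / 2
    g = [f(lo + i * step) - slope * (lo + i * step) for i in range(samples)]
    return min(g) - margin, max(g) + margin

# tanh is 1-Lipschitz; bound it on [-1, 2] with lines of slope 0.5.
b_lo, b_hi = shifted_bounds(math.tanh, 1.0, -1.0, 2.0, slope=0.5)
```

Because the slope is a free parameter and the offsets follow from it, a verifier could in principle optimize over the slope alone. More samples shrink the margin and tighten the bounds at the cost of extra function evaluations.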
