arXivDaily: arXiv daily academic digest, updated Monday through Friday
2505.16942 2026-05-27 cs.CV cs.LG

Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation

Karlis Martins Briedis, Markus Gross, Christopher Schroers

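AI summary: Proposes a memory- and compute-efficient algorithm that implements the exact mathematical operator for all-pairs correlation volume sampling, delivering substantial speedups while keeping memory use low, and applies it to high-resolution optical flow estimation to reach state-of-the-art performance.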

Comments CVPR 2026

English abstract

Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, this is significantly slower in practice and therefore many prior methods process images at downsampled resolutions, missing fine-grained details. To address this, we propose an algorithm for both memory and compute-efficient implementation of the all-pairs correlation volume sampling, still matching the exact mathematical operator as defined by RAFT. Our approach outperforms on-demand sampling by up to 92% while maintaining equally low memory usage, and performs at least on par with the default implementation with up to 99% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 63% savings for the total end-to-end model inference on high-resolution inputs. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an inference-time extension of the SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and runtime.
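
To make the memory trade-off concrete, the following is a minimal Python sketch contrasting a dense all-pairs correlation volume with on-demand cost computation. It is an illustration only, not the algorithm proposed in the paper: the function names are hypothetical, features are flattened to one vector per pixel, neighbours are given as integer indices rather than bilinearly sampled positions, there is no correlation pyramid, and the scaling by the square root of the feature dimension is an assumption about the operator's normalisation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def all_pairs_volume(feats1, feats2):
    """Dense volume with one score per (frame-1 pixel, frame-2 pixel) pair.

    Storage is len(feats1) * len(feats2) scores, i.e. quadratic in pixel count.
    """
    scale = math.sqrt(len(feats1[0]))  # assumed normalisation
    return [[dot(p, q) / scale for q in feats2] for p in feats1]


def lookup_dense(volume, i, neighbours):
    """Read precomputed costs for pixel i at the given frame-2 indices."""
    return [volume[i][j] for j in neighbours]


def lookup_on_demand(feats1, feats2, i, neighbours):
    """Compute the same costs only when requested; no volume is stored."""
    scale = math.sqrt(len(feats1[0]))  # same assumed normalisation
    return [dot(feats1[i], feats2[j]) / scale for j in neighbours]


# Hypothetical toy features: 3 pixels per frame, 2 channels each.
feats1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feats2 = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]

volume = all_pairs_volume(feats1, feats2)
dense_costs = lookup_dense(volume, 2, [0, 2])
lazy_costs = lookup_on_demand(feats1, feats2, 2, [0, 2])
```

Both lookups return the same costs; they differ only in whether the full volume is held in memory or each cost is recomputed per query.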

2505.11063 2026-05-27 cs.AI cs.CR

Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

Changyue Jiang, Wenqi Zhang, Xudong Pan, Geng Hong, Min Yang

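AI summary: Proposes Thought-Aligner, a lightweight plug-in safety model that causally corrects unsafe thoughts before actions are executed, without modifying the underlying agent. Trained with two-stage contrastive learning, it raises behavioral safety from about 50% to about 90% across multiple benchmarks and six LLMs, exceeds existing guardrails by about 23%, and improves helpfulness by about 5%.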

Comments Accepted to ICML 2026

English abstract

LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought directly shapes subsequent actions. Small deviations in these thoughts can therefore propagate into unsafe behaviors, yet existing guardrails typically operate only on final outputs or require intrusive model modifications. We introduce Thought-Aligner, a lightweight plug-in safety model that performs causal correction on unsafe thoughts before action execution, without altering the underlying agent. The corrected thoughts are fed back into the agent, steering its decision process and tool use toward safer trajectories. Because it operates solely at the thought level, Thought-Aligner is model-agnostic and can be integrated into diverse agent frameworks. We train Thought-Aligner via two-stage contrastive learning on paired safe and unsafe thoughts generated across ten risk scenarios. Experiments on diverse agent-safety benchmarks and six LLMs show that Thought-Aligner increases behavioral safety from about 50% without protection to around 90% on average, exceeding state-of-the-art guardrails by roughly 23%, while also improving helpfulness by about 5%. The method incurs low per-step latency and minimal overhead, enabling scalable and practical deployment. We publicly release Thought-Aligner-7B at https://huggingface.co/WhitzardAgent/Thought-Aligner-7B.

2502.17666 2026-05-27 cs.LG cs.AI

Yes, Q-learning Helps Offline In-Context RL

Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Igor Kiselev, Vladislav Kurenkov

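AI summary: Integrates RL objectives into an offline in-context RL framework. Experiments on more than 150 datasets show that directly optimizing RL objectives improves performance by about 30% on average over Algorithm Distillation, and that conservatism in value learning brings further gains.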

English abstract

Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are known to have limitations in offline RL settings. In this study, we explore the integration of RL objectives within an offline ICRL framework. Through experiments on more than 150 GridWorld and MuJoCo environment-derived datasets, we demonstrate that optimizing RL objectives directly improves performance by approximately 30% on average compared to widely adopted Algorithm Distillation (AD), across various dataset coverages, structures, expertise levels, and environmental complexities. Furthermore, in the challenging XLand-MiniGrid environment, RL objectives doubled the performance of AD. Our results also reveal that the addition of conservatism during value learning brings additional improvements in almost all settings tested. Our findings emphasize the importance of aligning ICRL learning objectives with the RL reward-maximization goal, and demonstrate that offline RL is a promising direction for advancing ICRL.

2505.02974 2026-05-27 cs.LG

PLAID: A Unified Data Model for Machine Learning on Heterogeneous Physics Simulations

Fabien Casenave, Xavier Roynard, Brian Staber, Alexandre Devaux-Rivière, William Piat, Michele Alessandro Bucci, Nissrine Akkari, Abbas Kabalan, Xuan Minh Vuong Nguyen, Luca Saverio, Raphaël Carpintero Perez, Anthony Kalaydjian, Samy Fouché, Thierry Gonon, Ghassan Najjar, Thomas Daniel, Emmanuel Menier, Matthieu Nastorg, Giovanni Catalani, Christian Rey

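AI summary: Proposes PLAID, a unified data layer that standardizes heterogeneous physics simulation data, and releases six benchmark datasets to address the lack of large-scale, diverse datasets for machine learning surrogate models.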

Comments Presented at EuRIPS 2025 and accepted at the AI4Physics Workshop @ ICML 2026

English abstract

Machine learning-based surrogate models have emerged as a powerful tool to accelerate simulation-driven scientific workflows, but their adoption is limited by the lack of large-scale, diverse, and standardized datasets for physics-based simulations. Existing benchmarks often focus on narrow domains or rely on simplified data models, and fail to capture the heterogeneity arising from variable geometries, meshes, and topologies, which is critical for assessing generalization in realistic settings. We introduce PLAID (Physics-Learning AI Data model), a unified and extensible data layer for heterogeneous physics simulations. It preserves the full complexity of simulation data while enabling efficient and scalable machine learning workflows, together with a library for dataset construction and manipulation (https://github.com/PLAID-lib/plaid). We release six datasets covering structural mechanics and computational fluid dynamics, designed to reflect realistic industrial scenarios and provide standardized benchmarks. The framework includes reproducible evaluation protocols and is integrated with Hugging Face to enable open, community-driven benchmarking with active user participation (https://huggingface.co/PLAIDcompetitions).

2503.21510 2026-05-27 cs.LG cs.CV stat.ML

An uncertainty-aware Bayesian framework for machine learning classification models: A case study in land cover classification

Samuel Bilson, Miles McCrory, Anna Pustogvar

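AI summary: Proposes a Bayesian framework for generative classification models that accounts for input measurement uncertainty. Validated with a Bayesian quadratic discriminant analysis model on land cover datasets, the model outperforms random forests and neural networks in interpretability, uncertainty modeling, and computational efficiency.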

Comments 38 pages, 16 figures

English abstract

Ensuring that predictions of machine learning (ML) classification models are accompanied by uncertainty estimates is one of the main pillars of trustworthy AI. Current research in uncertainty quantification focuses mainly on epistemic uncertainty of the ML model, but rarely takes account of input measurement uncertainty, which is vital for traceability in metrology. In this work we propose a Bayesian framework for generative ML classification models that takes account of input measurement uncertainty. We take the specific case of a Bayesian quadratic discriminant analysis (BQDA) model, and apply it to metrological land cover datasets from Copernicus Sentinel-2 from 2020 and 2021. We benchmark the performance of the model against more popular classification models used in land cover maps such as random forests and neural networks. To validate and assess the generalisability of such a model, we also run simulations over synthetic classification data, varying distribution type and strength of the input measurement noise. We find for both real and synthetic data, the BQDA model presented is more trustworthy, in the sense that it is more interpretable, explicitly models the input measurement uncertainty, and maintains predictive performance of class probability outputs across datasets over different domains and sizes, whilst also being more computationally efficient.

2504.08540 2026-05-27 cs.CV

Datasets for Lane Detection in Autonomous Driving: A Comprehensive Review

Jörg Gamerdinger, Sven Teufel, Oliver Bringmann

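AI summary: Comprehensively reviews 20 public lane detection datasets, analyzes their characteristics, strengths, and limitations with a multidimensional quality metric, and identifies directions for future improvement to drive innovation in robust lane detection.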

English abstract

Accurate lane detection is essential for automated driving, enabling safe and reliable vehicle navigation across a variety of road scenarios. Numerous datasets have been introduced to support the development and evaluation of lane detection algorithms, each differing in terms of the amount of data, sensor types, annotation granularity, environmental conditions, and scenario diversity. This paper provides a comprehensive review of 20 publicly available lane detection datasets, systematically analyzing their characteristics, advantages, and limitations. We classify these datasets based on key performance indicators such as sensor resolution, annotation types and diversity of road and weather conditions using a novel multidimensional metric for dataset quality. By identifying existing challenges and research gaps, we highlight opportunities for future dataset improvements that can further drive innovation in robust lane detection. This review serves as a resource for researchers seeking appropriate datasets for robust lane detection and contributes to the broader goal of advancing autonomous driving.

2504.07853 2026-05-27 cs.CV

V2V3D: View-to-View Denoised 3D Reconstruction for Light-Field Microscopy

Jiayin Zhao, Zhenqi Fu, Tao Yu, Hui Qiao

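AI summary: Proposes V2V3D, an unsupervised view-to-view framework that jointly optimizes image denoising and 3D reconstruction. It exploits the independence of noise across views for noise2noise denoising and introduces a wave-optics-based feature alignment technique to recover high-frequency details, surpassing existing methods in efficiency and performance.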

Comments CVPR 2025; New version: Fix NSFC ID

English abstract

Light field microscopy (LFM) has gained significant attention due to its ability to capture snapshot-based, large-scale 3D fluorescence images. However, existing LFM reconstruction algorithms are highly sensitive to sensor noise or require hard-to-get ground-truth annotated data for training. To address these challenges, this paper introduces V2V3D, an unsupervised view2view-based framework that establishes a new paradigm for joint optimization of image denoising and 3D reconstruction in a unified architecture. We assume that the LF images are derived from a consistent 3D signal, with the noise in each view being independent. This enables V2V3D to incorporate the principle of noise2noise for effective denoising. To enhance the recovery of high-frequency details, we propose a novel wave-optics-based feature alignment technique, which transforms the point spread function, used for forward propagation in wave optics, into convolution kernels specifically designed for feature alignment. Moreover, we introduce an LFM dataset containing LF images and their corresponding 3D intensity volumes. Extensive experiments demonstrate that our approach achieves high computational efficiency and outperforms the other state-of-the-art methods. These advancements position V2V3D as a promising solution for 3D imaging under challenging conditions.

2504.05046 2026-05-27 cs.CV

MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond

Shenghao Ren, Yi Lu, Jiayi Huang, Jiayi Zhao, He Zhang, Tao Yu, Qiu Shen, Xun Cao

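AI summary: Builds MotionPRO, a large-scale human motion capture dataset with pressure, RGB, and optical sensors, and designs pose and trajectory estimation networks that use pressure alone or fuse pressure with RGB. The results demonstrate the necessity and effectiveness of pressure signals for improving physical plausibility and global trajectory accuracy and for driving virtual humans and humanoid robots.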

Comments fix NSFC ID

English abstract

Existing human motion capture (MoCap) methods mostly focus on visual similarity while neglecting physical plausibility. As a result, downstream tasks such as driving virtual humans in 3D scenes or humanoid robots in the real world suffer from issues such as timing drift and jitter, spatial problems like sliding and penetration, and poor global trajectory accuracy. In this paper, we revisit human MoCap from the perspective of the interaction between the human body and the physical world by exploring the role of pressure. First, we construct a large-scale human Motion capture dataset with Pressure, RGB and Optical sensors (named MotionPRO), which comprises 70 volunteers performing 400 types of motion, for a total of 12.4M pose frames. Second, we examine both the necessity and effectiveness of the pressure signal through two challenging tasks: (1) pose and trajectory estimation based solely on pressure: we propose a network that incorporates a small-kernel decoder and a long-short-term attention module, and show that pressure can provide an accurate global trajectory and a plausible lower-body pose. (2) pose and trajectory estimation by fusing pressure and RGB: we impose constraints on orthographic similarity along the camera axis and whole-body contact along the vertical axis to enhance the cross-attention strategy that fuses pressure and RGB feature maps. Experiments demonstrate that fusing pressure with RGB features not only significantly improves performance on objective metrics, but also plausibly drives virtual humans (SMPL) in 3D scenes. Furthermore, we demonstrate that incorporating physical perception enables humanoid robots to perform more precise and stable actions, which is highly beneficial for the development of embodied artificial intelligence. The project page is available at: https://nju-cite-mocaphumanoid.github.io/MotionPRO/

2504.02775 2026-05-27 cs.CV cs.LG

TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection

Yoon Gyo Jung, Jaewoo Park, Jaeho Yoon, Kuan-Chuan Peng, Wonchul Kim, Andrew Beng Jin Teoh, Octavia Camps

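AI summary: Addresses the setting where the normal dataset is contaminated with defects and its class distribution is long-tailed but unknown. Proposes TailSampler, which estimates class sizes so that tail classes and noise can be handled independently, and builds the memory-based anomaly detection model TailedCore, which achieves state-of-the-art performance in unsupervised long-tail noisy anomaly detection.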

Comments Accepted to CVPR2025

English abstract

We aim to solve unsupervised anomaly detection in a practical, challenging setting where the normal dataset is contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from a tail-versus-noise trade-off: if a model is robust against pixel noise, its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be used to sample the tail class samples exclusively, allowing them to be handled separately. Based on these facets, we build a memory-based anomaly detection model, TailedCore, whose memory both captures tail class information well and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.

2504.00307 2026-05-27 cs.LG physics.ao-ph

Generating realistic global precipitation fields from modelled atmospheric circulation

Michael Aich, Sebastian Bathiany, Philipp Hess, Yu Huang, Niklas Boers

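AI summary: Proposes a generative machine learning method that combines a conditional diffusion model with a UNet architecture to generate high-resolution global precipitation fields from a small set of prognostic atmospheric variables. It serves as an alternative to traditional parameterization schemes, reduces biases, and enables efficient ensemble prediction.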

Comments Accepted for publication at Climate Dynamics

English abstract

Improving the representation of precipitation in Earth system models (ESMs) is critical for assessing the impacts of climate change and especially of extreme events like floods and droughts. In existing ESMs, precipitation is not resolved explicitly, but represented by parameterizations. These typically rely on resolving approximated but computationally expensive column-based physics, not accounting for interactions between locations. They struggle to capture fine-scale precipitation processes and introduce significant biases. We present a novel approach, based on generative machine learning, which integrates a conditional diffusion model with a UNet architecture to generate accurate, high-resolution (0.25°) global daily precipitation fields from a small set of prognostic atmospheric variables. Unlike traditional parameterizations, our framework efficiently produces ensemble predictions, capturing uncertainties in precipitation, and does not require fine-tuning by hand. We train our model on the ERA5 reanalysis and present a method that allows us to apply it to unseen ESM data, enabling fast generation of probabilistic forecasts and climate scenarios. By leveraging interactions between global prognostic variables, our approach provides an alternative parameterization scheme that mitigates biases present in the ESM precipitation while maintaining consistency with its large-scale (annual) trends. This work demonstrates that complex precipitation patterns can be learned directly from large-scale atmospheric variables, offering a computationally efficient method to obtain high-resolution precipitation without the cost of running the dynamical model at such high resolution.

2504.00167 2026-05-27 cs.RO

Enhancing Physical Human-Robot Interaction: Recognizing Digits via Intrinsic Robot Tactile Sensing

Teresa Sinico, Giovanni Boschetti, Pedro Neto

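AI summary: Uses a collaborative robot's built-in torque sensors to record joint torques and end-effector forces while a person writes digits on a touchpad. A bidirectional LSTM network performs online digit recognition with 94% accuracy, and a fruit delivery task demonstrates the method's practical potential.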

English abstract

Physical human-robot interaction (pHRI) remains a key challenge for achieving intuitive and safe interaction with robots. Current advancements often rely on external tactile sensors as an interface, which increases the complexity of robotic systems. In this study, we leverage the intrinsic tactile sensing capabilities of collaborative robots to recognize digits drawn by humans on an uninstrumented touchpad mounted to the robot's flange. We propose a dataset of robot joint torque signals along with corresponding end-effector (EEF) forces and moments, captured from the robot's integrated torque sensors in each joint, as users draw handwritten digits (0-9) on the touchpad. The pHRI-DIGI-TACT dataset was collected from different users to capture natural variations in handwriting. To enhance classification robustness, we developed a data augmentation technique to account for reversed and rotated digit inputs. A Bidirectional Long Short-Term Memory (Bi-LSTM) network, leveraging the spatiotemporal nature of the data, performs online digit classification with an overall accuracy of 94% across various test scenarios, including those involving users who did not participate in training the system. This methodology is implemented on a real robot in a fruit delivery task, demonstrating its potential to assist individuals in everyday life. Dataset and video demonstrations are available at: https://TS-Robotics.github.io/pHRI-DIGI/.

2503.14359 2026-05-27 cs.CV

ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

Zhengxian Yang, Shi Pan, Shengqi Wang, Haoxiang Wang, Li Lin, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu

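AI summary: Proposes ImViD, a multi-view, multi-modal dataset that supports capturing complete scenes while on the move, and provides a benchmark and reconstruction pipeline for 6-DoF multi-modal immersive VR experiences.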

Comments CVPR 2025 Highlight; Fix NSFC ID

English abstract

User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60 FPS, last 1 to 5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.

2503.08600 2026-05-27 cs.CL

NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

Delip Rao, Weiqiu You, Eric Wong, Chris Callison-Burch

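AI summary: Proposes the NSF-SciFy dataset, which extracts 2.8 million scientific claims from 400,000 NSF abstracts. Claims and investigation proposals are extracted jointly via zero-shot prompting, and fine-tuning language models on three downstream tasks yields substantial improvements.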

Comments ACL 2026. 19 pages, 7 figures, 11 tables

English abstract

We introduce NSF-SciFy, a comprehensive dataset of scientific claims and investigation proposals extracted from National Science Foundation award abstracts. While previous scientific claim verification datasets have been limited in size and scope, NSF-SciFy represents a significant advance with 2.8 million claims from 400,000 abstracts spanning all science and mathematics disciplines. We present two focused subsets: NSF-SciFy-MatSci with 114,000 claims from materials science awards, and NSF-SciFy-20K with 135,000 claims across five NSF directorates. Using zero-shot prompting, we develop a scalable approach for joint extraction of scientific claims and investigation proposals. We demonstrate the dataset's utility through three downstream tasks: non-technical abstract generation, claim extraction, and investigation proposal extraction. Fine-tuning language models on our dataset yields substantial improvements, with relative gains often exceeding 100%, particularly for claim and proposal extraction tasks. Our error analysis reveals that extracted claims exhibit high precision but lower recall, suggesting opportunities for further methodological refinement. NSF-SciFy enables new research directions in large-scale claim verification, scientific discovery tracking, and meta-scientific analysis. Code and data are available at https://github.com/darpa-scify/NSFSciFy.

2407.15073 2026-05-27 cs.AI cs.CL

Multi-Agent Causal Discovery Using Large Language Models

Hao Duong Le, Xin Xia, Haijie Xu, Chen Zhang

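AI summary: Proposes MAC, a multi-agent causal discovery framework whose Meta Fusion mechanism combines a Debate-Coding Module that autonomously selects an SCD algorithm with a metadata-based adversarial debate module, achieving the best performance on multiple benchmarks.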

English abstract

Causal discovery aims to identify causal relationships between variables and is a fundamental problem across the sciences. Traditional statistical causal discovery (SCD) methods rely solely on observational data and ignore the contextual information available in metadata, whereas recent LLM-based methods exploit metadata but treat the large language model (LLM) as a single agent, leaving its judgments vulnerable to memorized or biased associations. To address this gap, we introduce MAC (Multi-Agent Causal Discovery Framework), which casts causal discovery as a multi-agent debate coupled with the autonomous selection of an SCD algorithm. MAC combines two complementary modules, bridged by a Meta Fusion mechanism: a Debate-Coding Module (DCM) that grounds an initial graph in data by autonomously selecting and executing the best-suited SCD algorithm, and a Meta-Debate Module (MDM) that refines the graph through an adversarial Affirmative-Negative-Judge debate over the metadata. Across five benchmark datasets and three metrics (F1, SHD, NHD), MAC achieves the best aggregate performance among five statistical and four LLM-based baselines, ranking first on 10 of 15 evaluation points with Gemini-2.0-Flash -- including a perfect reconstruction of the Earthquake graph -- and remains robust across three backbone LLMs.
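
One of the metrics named above, the structural Hamming distance (SHD), can be sketched in a few lines of Python. This is a minimal illustration under an assumed convention in which a missing, extra, or reversed edge each counts as one error; the abstract does not state which convention the paper uses, and the function name and toy graphs are hypothetical.

```python
def structural_hamming_distance(true_adj, pred_adj):
    """Count node pairs whose edge status differs between two directed graphs.

    adj[i][j] == 1 means there is an edge i -> j. Under the convention assumed
    here, a missing, extra, or reversed edge each contributes 1.
    """
    n = len(true_adj)
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            true_pair = (true_adj[i][j], true_adj[j][i])
            pred_pair = (pred_adj[i][j], pred_adj[j][i])
            if true_pair != pred_pair:
                distance += 1
    return distance


# Hypothetical 3-node example: the true graph is 0 -> 1 -> 2.
true_graph = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
# The prediction reverses 0 -> 1 and adds a spurious edge 0 -> 2.
pred_graph = [
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
]

shd_errors = structural_hamming_distance(true_graph, pred_graph)
shd_perfect = structural_hamming_distance(true_graph, true_graph)
```

Under this convention a perfect reconstruction has a distance of zero, and lower values indicate a predicted graph closer to the reference.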

2501.00520 2026-05-27 cs.CV cs.LG

Innovative Silicosis and Pneumonia Classification: Leveraging Graph Transformer Post-hoc Modeling and Ensemble Techniques

Bao Q. Bui, Tien T. T. Nguyen, Duy M. Le, Cong Tran, Cuong Pham

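AI summary: Proposes an architecture that combines graph transformer networks with a traditional deep neural network, together with a balanced cross-entropy loss and an ensemble method, to classify silicosis and pneumonia with high accuracy on a self-built chest X-ray dataset.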

Comments Withdrawn by the authors because the manuscript contains incomplete and potentially misleading descriptions of the dataset construction and evaluation protocol, particularly in the Dataset and Experimental Setup sections. The work should not be cited or used as an independent reference in its current form

English abstract

This paper presents a comprehensive study on the classification and detection of silicosis-related lung inflammation. Our main contributions are 1) a newly curated chest X-ray (CXR) image dataset named SVBCX, tailored to the nuances of lung inflammation caused by distinct agents, which provides a valuable resource for the silicosis and pneumonia research community; and 2) a novel deep-learning architecture that integrates graph transformer networks alongside a traditional deep neural network module for the effective classification of silicosis and pneumonia. Additionally, we employ Balanced Cross-Entropy (BalCE) as the loss function to ensure more uniform learning across classes, enhancing the model's ability to discern subtle differences in lung conditions. The proposed model architecture and loss function aim to improve the accuracy and reliability of inflammation detection, particularly in the context of silicosis. Furthermore, our research explores the efficacy of an ensemble approach that combines the strengths of diverse model architectures. Experimental results on the constructed dataset show substantial improvements over baseline models. The ensemble of models achieves a macro-F1 score of 0.9749 and AUC ROC scores exceeding 0.99 for each class, underscoring the effectiveness of our approach in accurate and robust lung inflammation classification.
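
For reference, a macro-F1 score such as the one reported above is the unweighted mean of the per-class F1 scores. The following Python sketch shows that computation; the function name, the class labels, and the toy predictions are hypothetical and are not taken from the SVBCX dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical labels and predictions for illustration only.
y_true = ["silicosis", "silicosis", "pneumonia", "normal"]
y_pred = ["silicosis", "pneumonia", "pneumonia", "normal"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the mean, macro-F1 does not let a large class mask poor performance on a small one.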

2410.19248 2026-05-27 cs.LG

CHESTNUT: A QoS Dataset for Mobile Edge Environments

CHESTNUT: 面向移动边缘环境的QoS数据集

Guobing Zou, Fei Zhao, Shengxiang Hu

AI总结 针对现有QoS数据集忽略时间和地理位置等动态属性的问题,提出CHESTNUT数据集,在采集过程中精确记录时间和地理位置信息,以支持移动边缘环境中的QoS预测。

详情
AI中文摘要

服务质量(QoS)是衡量网络服务性能的重要指标。如今,它被广泛应用于移动边缘环境中,以评估移动设备从边缘服务器请求服务时的服务质量。QoS通常涉及多个维度,如带宽、延迟、抖动和数据包丢失率。然而,大多数现有的QoS数据集,例如常见的WS-Dream数据集,主要关注网络服务的静态QoS指标,而忽略了时间和地理位置等动态属性。也就是说,它们没有详细记录服务请求时移动设备的位置或请求的时间顺序。而这些动态属性对于理解和预测网络服务的实际性能至关重要,因为QoS性能通常随时间和地理位置波动。为此,我们提出了一种新的数据集,在采集过程中精确记录服务质量的时间和地理位置信息,旨在为移动边缘环境中的未来QoS预测提供更准确、可靠的数据支持。

英文摘要

Quality of Service (QoS) is an important metric to measure the performance of network services. Nowadays, it is widely used in mobile edge environments to evaluate the quality of service when mobile devices request services from edge servers. QoS usually involves multiple dimensions, such as bandwidth, latency, jitter, and data packet loss rate. However, most existing QoS datasets, such as the common WS-Dream dataset, focus mainly on static QoS metrics of network services and ignore dynamic attributes such as time and geographic location. That is, they do not record the mobile device's location at the time of the service request or the chronological order in which requests were made. Yet these dynamic attributes are crucial for understanding and predicting the actual performance of network services, as QoS performance typically fluctuates with time and geographic location. To this end, we propose a novel dataset that accurately records temporal and geographic location information on quality of service during the collection process, aiming to provide more accurate and reliable data to support future QoS prediction in mobile edge environments.

2410.00357 2026-05-27 cs.LG stat.ML

Neural Scaling Laws of Deep ReLU and Deep Operator Network: A Theoretical Study

深度ReLU和深度算子网络的神经缩放定律:一项理论研究

Hao Liu, Zecheng Zhang, Wenjing Liao, Hayden Schaeffer

AI总结 本文通过分析深度算子网络的逼近误差和泛化误差,建立了量化神经缩放定律的理论框架,揭示了网络模型大小和训练数据大小与误差之间的关系,并推广到深度ReLU网络。

详情
AI中文摘要

神经缩放定律在深度神经网络的性能中起着关键作用,并在广泛的任务中被观察到。然而,理解这些缩放定律的完整理论框架仍不完善。在本文中,我们探索了深度算子网络的神经缩放定律,这些网络涉及学习函数空间之间的映射,重点关注Chen和Chen风格的架构。这些方法包括流行的深度算子网络(DeepONet),它们使用可学习基函数和依赖于输入函数的系数的线性组合来近似输出函数。我们建立了一个理论框架,通过分析其逼近和泛化误差来量化神经缩放定律。我们阐述了深度算子网络的逼近和泛化误差与网络模型大小和训练数据大小等关键因素之间的关系。此外,我们处理了输入函数表现出低维结构的情况,从而能够推导出更紧的误差界。这些结果也适用于深度ReLU网络和其他类似结构。我们的结果为算子学习中的神经缩放定律提供了部分解释,并为其应用提供了理论基础。

英文摘要

Neural scaling laws play a pivotal role in the performance of deep neural networks and have been observed in a wide range of tasks. However, a complete theoretical framework for understanding these scaling laws remains underdeveloped. In this paper, we explore the neural scaling laws for deep operator networks, which involve learning mappings between function spaces, with a focus on the Chen and Chen style architecture. These approaches, which include the popular Deep Operator Network (DeepONet), approximate the output functions using a linear combination of learnable basis functions and coefficients that depend on the input functions. We establish a theoretical framework to quantify the neural scaling laws by analyzing the approximation and generalization errors of these networks. We articulate the relationship between the approximation and generalization errors of deep operator networks and key factors such as network model size and training data size. Moreover, we address cases where input functions exhibit low-dimensional structures, allowing us to derive tighter error bounds. These results also hold for deep ReLU networks and other similar structures. Our results offer a partial explanation of the neural scaling laws in operator learning and provide a theoretical foundation for their applications.
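
The "linear combination of learnable basis functions and coefficients that depend on the input functions" can be illustrated with a minimal NumPy sketch. This is not the authors' code: the helper names (`relu_mlp`, `init_mlp`, `operator_net`), the one-hidden-layer networks, the sensor count `m`, and the number of basis functions `K` are all assumptions chosen for illustration. The input function is assumed to be represented by its values at `m` fixed sensor points.

```python
import numpy as np

def relu_mlp(x, w1, b1, w2, b2):
    """One-hidden-layer ReLU network: x -> relu(x @ w1 + b1) @ w2 + b2."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def init_mlp(rng, d_in, d_hidden, d_out):
    """Random weights for relu_mlp (illustrative initialization only)."""
    return (rng.normal(size=(d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(size=(d_hidden, d_out)), np.zeros(d_out))

def operator_net(u_samples, y_points, branch, trunk):
    """Evaluate G(u)(y) ~ sum_k coeff_k(u) * basis_k(y).

    u_samples: (batch, m) input functions sampled at m sensor points
    y_points:  (n_points, d) query locations for the output function
    """
    coeffs = relu_mlp(u_samples, *branch)  # (batch, K), depends on the input function
    basis = relu_mlp(y_points, *trunk)     # (n_points, K), learnable basis functions
    return coeffs @ basis.T                # (batch, n_points)

rng = np.random.default_rng(0)
m, K = 5, 4                                # assumed sensor count and basis size
branch = init_mlp(rng, m, 8, K)
trunk = init_mlp(rng, 1, 8, K)
u = rng.normal(size=(3, m))                # three input functions
y = np.linspace(0.0, 1.0, 7).reshape(-1, 1)
out = operator_net(u, y, branch, trunk)    # one row of outputs per input function
```

In this form the model size is governed by the two networks and `K`, which are the kinds of quantities the abstract relates to the approximation and generalization errors.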

2408.15787 2026-05-27 cs.CL cs.IR

Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions

交互式智能体:通过角色扮演的LLM-LLM交互模拟咨询师-来访者心理咨询

Huachuan Qiu, Zhenzhong Lan

AI总结 提出Interactive Agents框架,通过LLM-LLM角色扮演生成高质量心理咨询对话数据,解决隐私、成本和扩展性问题,并验证其治疗有效性和模型性能。

Comments Accepted to *SEM2026

详情
AI中文摘要

创建有效的心理健康支持对话系统需要高质量的多轮咨询对话数据,然而收集真实的咨询师-来访者对话面临重大挑战,包括隐私问题、高成本和有限的扩展性。我们提出了Interactive Agents,一个新颖的框架,通过受控的LLM-LLM交互模拟自然的咨询对话。该框架引入了两个关键创新:(1) 一个个性化的来访者智能体,在整个会话过程中保持一致的心理学特征;(2) 一个咨询师智能体,实现了一个基于理论的三阶段治疗模型,包括探索、洞察和行动阶段。通过使用自动评估指标和基于工作联盟清单的专业咨询师评估进行严格评估,我们证明我们的框架生成了具有治疗有效性的对话,其质量与人类生成的会话相当。在我们的合成数据集(SimPsyDial)上微调的模型,在基于LLM的咨询师的标准成对聊天机器人竞技场评估中达到了最先进的性能。我们的框架提供了一种可扩展、保护隐私的方法,用于生成高质量的咨询对话数据,同时保持专业的治疗标准。

英文摘要

Creating effective dialogue systems for mental health support requires high-quality multi-turn counseling dialogue data, yet collecting real counselor-client conversations presents significant challenges, including privacy concerns, high costs, and limited scalability. We present Interactive Agents, a novel framework that simulates naturalistic counseling dialogues through controlled LLM-to-LLM interactions. The framework introduces two key innovations: (1) a personalized client agent that maintains consistent psychological characteristics throughout a session, and (2) a counselor agent that implements a theoretically grounded three-stage therapeutic model comprising the exploration, insight, and action phases. Through rigorous evaluation using both automatic metrics and professional-counselor assessments based on the Working Alliance Inventory, we demonstrate that our framework generates therapeutically valid dialogues that are comparable in quality to human-generated sessions. Models fine-tuned on our proposed synthetic dataset (SimPsyDial) achieve state-of-the-art performance in a standard pairwise chatbot-arena evaluation of LLM-based counselors. Our framework provides a scalable, privacy-preserving method for generating high-quality counseling dialogue data while maintaining professional therapeutic standards.

2408.05560 2026-05-27 cs.LG math.OC stat.ML

Incremental Gauss-Newton Descent for Machine Learning

增量高斯-牛顿下降法在机器学习中的应用

Mikalai Korbit, Mario Zanon

AI总结 针对标量输出损失逐样本评估的场景,提出增量高斯-牛顿下降法(IGND),通过闭式标量归一化随机梯度实现无需存储或求解曲率矩阵的高效更新,并证明其收敛性。

详情
AI中文摘要

随机梯度更新因其高效性和可扩展性被广泛使用,但其有效步长可能严重依赖于特征缩放和局部模型敏感性。高斯-牛顿方法通过曲率信息处理此类尺度效应,但在标准小批量形式中需要矩阵-向量乘积、线性求解或结构化近似。本文研究每次评估一个样本的标量输出损失的特殊情况。在此设置下,广义高斯-牛顿矩阵的秩至多为1,其唯一可能的非零曲率方向与随机梯度对齐。因此,阻尼高斯-牛顿方向简化为样本梯度的闭式标量归一化。由此产生的更新,即增量高斯-牛顿下降法(IGND),不需要曲率矩阵存储、分解或迭代线性求解。我们推导了该更新,描述了其行为,并将其与归一化梯度下降、自适应一阶方法、随机Polyak步长和小批量高斯-牛顿更新联系起来。在显式光滑性、对齐性和随机逼近假设下,我们证明了IGND更新的平稳性结果。在监督学习、尺度鲁棒性的受控测试以及线性二次控制案例研究上的实验表明,IGND提高了对敏感性缩放的鲁棒性,并且可以在保持简单增量更新的同时,与常见的随机优化器竞争或互补。

英文摘要

Stochastic gradient updates are widely used for their efficiency and scalability, but their effective step sizes can depend strongly on feature scaling and local model sensitivity. Gauss-Newton methods address such scale effects through curvature information, but in their standard mini-batch form they require matrix-vector products, linear solves, or structured approximations. This paper studies the special case of scalar-output losses evaluated one sample at a time. In this setting, the generalized Gauss-Newton matrix has rank at most one, and its only possible nonzero curvature direction is aligned with the stochastic gradient. As a result, the damped Gauss-Newton direction reduces to a closed-form scalar normalization of the sample gradient. The resulting update, Incremental Gauss-Newton Descent (IGND), requires no curvature matrix storage, factorization, or iterative linear solve. We derive the update, characterize its behavior, and relate it to normalized gradient descent, adaptive first-order methods, stochastic Polyak step sizes, and mini-batch Gauss-Newton updates. Under explicit smoothness, alignment, and stochastic approximation assumptions, we prove a stationarity result for the IGND update. Experiments on supervised learning, a controlled test of scale robustness, and a linear-quadratic control case study show that IGND improves robustness to sensitivity scaling and can be competitive with, or complementary to, common stochastic optimizers while retaining a simple incremental update.
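
The rank-one argument in the abstract can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names, the damping parameter `damping`, and the way the loss derivatives `dl` and `d2l` are passed in are all hypothetical. For a scalar model output with parameter gradient g, the generalized Gauss-Newton matrix for one sample is d2l * g g^T, so the damped system (d2l * g g^T + damping * I) x = dl * g has the closed-form solution x = dl * g / (damping + d2l * ||g||^2).

```python
import numpy as np

def damped_gn_step_closed_form(grad_f, dl, d2l, damping):
    """Damped Gauss-Newton step for one sample with a scalar model output.

    grad_f:  gradient of the scalar model output w.r.t. the parameters, shape (p,)
    dl, d2l: first and second derivative of the loss w.r.t. the model output
    damping: positive damping constant (assumed; d2l is assumed non-negative)
    """
    # Scalar normalization of the sample gradient dl * grad_f.
    return dl * grad_f / (damping + d2l * (grad_f @ grad_f))

def damped_gn_step_dense(grad_f, dl, d2l, damping):
    """Same step via an explicit p x p solve, kept only for comparison."""
    p = grad_f.shape[0]
    ggn = d2l * np.outer(grad_f, grad_f)  # rank at most one
    return np.linalg.solve(ggn + damping * np.eye(p), dl * grad_f)

# Example with squared loss 0.5 * (f - y)^2, for which dl = f - y and d2l = 1.
rng = np.random.default_rng(1)
g = rng.normal(size=6)
residual = 0.7
step_fast = damped_gn_step_closed_form(g, residual, 1.0, 0.1)
step_dense = damped_gn_step_dense(g, residual, 1.0, 0.1)
```

The closed form needs only a dot product per sample, which is consistent with the abstract's statement that no curvature matrix storage, factorization, or iterative linear solve is required. How the paper chooses damping and step size is not stated in the abstract.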

2406.03474 2026-05-27 cs.CV

AD-H: Language-guided Autonomous Driving with Hierarchical Agents

AD-H:基于分层智能体的语言引导自动驾驶

Zaibin Zhang, Talas Fu, Shiyu Tang, Yuanhang Zhang, Yifan Wang, Lijun Wang, Huchuan Lu

AI总结 提出AD-H分层多智能体框架,上层MLLM规划器生成中层驾驶指令,下层轻量控制器执行连续动作,通过规则重建1.15M中层指令对,以3B+350M参数超越7B模型,实现长时域泛化与指令遵循。

详情
AI中文摘要

语言引导的自动驾驶需要弥合高级自然语言指令与低级车辆控制之间巨大的抽象鸿沟。使用单个多模态大语言模型(MLLM)将语言直接映射到动作的端到端方法难以应对这种不匹配,往往无法利用模型的推理能力,并且在用于微调的驾驶数据集分布之外表现出有限的泛化能力。为了解决这个问题,我们提出了AD-H,一个分层多智能体框架,明确地将高级决策与低级车辆执行分开。在上层,基于MLLM的规划器解释自然语言命令和环境上下文,生成连贯的中层驾驶指令。在下层,轻量级控制器将这些中层指令转换为精确、连续的控制动作。这种分解与每个组件的功能优势相一致:规划器专注于语义推理和任务分解,而控制器确保稳定和准确的执行。为了支持这种层次结构下的大规模训练,我们设计了一个基于规则的流水线,从驾驶信号中重建中层命令,产生了115万对分层注释。大量实验表明,尽管AD-H使用的参数更少(即3B加350M,而对比模型为7B),但它仍优于最先进的模型,并实现了卓越的长时域泛化和指令遵循性能。我们在https://github.com/zhangzaibin/AD-H公开了我们的数据和代码。

英文摘要

Language-guided autonomous driving requires bridging a large abstraction gap between high-level natural-language instructions and low-level vehicle control. End-to-end approaches that use a single multimodal large language model (MLLM) to map language directly to actions struggle with this mismatch, often failing to exploit the reasoning capabilities of the model and exhibiting limited generalization beyond the distributions of driving datasets used for fine-tuning. To address this issue, we propose AD-H, a hierarchical multi-agent framework that explicitly separates high-level decision-making from low-level vehicle execution. At the upper level, an MLLM-based planner interprets natural-language commands and environmental context to generate coherent mid-level driving instructions. At the lower level, a lightweight controller converts these mid-level instructions into precise, continuous control actions. This decomposition aligns with the functional strengths of each component: the planner focuses on semantic reasoning and task decomposition, while the controller ensures stable and accurate actuation. To support large-scale training under this hierarchy, we design a rule-based pipeline that reconstructs mid-level commands from driving signals, producing 1.15 million hierarchical annotation pairs. Extensive experiments show that AD-H outperforms state-of-the-art models despite using fewer parameters, namely 3B plus 350M compared with 7B, and achieves superior long-horizon generalization and instruction-following performance. We make our data and code publicly accessible at https://github.com/zhangzaibin/AD-H

2405.16417 2026-05-27 cs.CV

CRoFT: Robust Fine-Tuning with Concurrent Optimization for OOD Generalization and Open-Set OOD Detection

CRoFT: 面向OOD泛化和开放集OOD检测的并发优化鲁棒微调

Lin Zhu, Yifeng Yang, Qinying Gu, Xinbing Wang, Chenghu Zhou, Nanyang Ye

AI总结 针对视觉语言预训练模型微调时分布偏移问题,提出一种基于能量分数梯度最小化的统一微调框架,同时提升闭集OOD泛化能力和开放集OOD检测性能。

详情
AI中文摘要

最近的视觉语言预训练模型(VL-PTMs)在开放词汇任务中取得了显著成功。然而,下游用例通常涉及对VL-PTMs的进一步微调,这可能会扭曲其通用知识并损害其处理分布偏移的能力。在现实场景中,机器学习系统不可避免地会遇到协变量偏移(例如,图像风格的变化)和语义偏移(例如,测试时未见类别)。这凸显了增强对协变量偏移的分布外(OOD)泛化能力,同时检测语义偏移的未见类别的重要性。因此,一个关键但尚未充分探索的问题出现了:如何在微调期间提高VL-PTMs对闭集OOD数据的泛化能力,同时有效检测开放集未见类别?在本文中,我们提出了一种新颖的OOD检测目标函数,该函数也有助于改进OOD泛化。我们表明,最小化训练数据上能量分数的梯度幅度会导致分类损失的域一致Hessian矩阵,这是理论分析揭示的OOD泛化的强指标。基于这一发现,我们开发了一个统一的微调框架,允许同时优化这两个任务。大量实验证明了我们方法的优越性。代码可在https://github.com/LinLLLL/CRoFT获取。

英文摘要

Recent vision-language pre-trained models (VL-PTMs) have shown remarkable success in open-vocabulary tasks. However, downstream use cases often involve further fine-tuning of VL-PTMs, which may distort their general knowledge and impair their ability to handle distribution shifts. In real-world scenarios, machine learning systems inevitably encounter both covariate shifts (e.g., changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of enhancing out-of-distribution (OOD) generalization on covariate shifts and simultaneously detecting semantic-shifted unseen classes. Thus a critical but underexplored question arises: How to improve VL-PTMs' generalization ability to closed-set OOD data, while effectively detecting open-set unseen classes during fine-tuning? In this paper, we propose a novel objective function of OOD detection that also serves to improve OOD generalization. We show that minimizing the gradient magnitude of energy scores on training data leads to domain-consistent Hessians of classification loss, a strong indicator for OOD generalization revealed by theoretical analysis. Based on this finding, we have developed a unified fine-tuning framework that allows for concurrent optimization of both tasks. Extensive experiments have demonstrated the superiority of our method. The code is available at https://github.com/LinLLLL/CRoFT.

2404.18539 2026-05-27 cs.CV cs.AI

Enhancing Boundary Segmentation for Topological Accuracy with Skeleton-based Methods

基于骨架的方法增强边界分割的拓扑准确性

Chuni Liu, Boyuan Ma, Xiaojuan Ban, Yujie Xie, Hao Wang, Weihua Xue, Jingchao Ma, Ke Xu

AI总结 提出Skea-Topo Aware损失函数,通过骨架感知加权和边界修正项提升网状图像边界分割的拓扑一致性,在三个数据集上相比13种方法VI指标提升最多7点。

详情
Journal ref
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), pp. 1092-1100, 2024
AI中文摘要

拓扑一致性在网状图像的边界分割任务中起着关键作用,例如神经元电子显微镜图像中的细胞膜分割、材料显微图像中的晶界分割以及航拍图像中的道路分割。在这些领域中,分割结果的拓扑变化对下游任务产生严重影响,甚至可能超过边界本身的错位。为了增强分割结果的拓扑准确性,我们提出了Skea-Topo Aware损失函数,这是一种新颖的损失函数,考虑了每个物体的形状和像素的拓扑重要性。它由两部分组成。首先,骨架感知加权损失通过更好地利用骨架建模物体几何来提高分割准确性。其次,边界修正项通过使用真实标签和预测中的前景和背景骨架,有效识别并强调预测误差中的拓扑关键像素。实验证明,在三个不同的边界分割数据集上,基于客观和主观评估,我们的方法在VI指标上相比13种最先进方法将拓扑一致性提高了最多7点。代码可在https://github.com/clovermini/Skea_topo获取。

英文摘要

Topological consistency plays a crucial role in the task of boundary segmentation for reticular images, such as cell membrane segmentation in neuron electron microscopic images, grain boundary segmentation in material microscopic images, and road segmentation in aerial images. In these fields, topological changes in segmentation results have a serious impact on downstream tasks, which can even exceed that of misalignment of the boundary itself. To enhance the topological accuracy of segmentation results, we propose the Skea-Topo Aware loss, a novel loss function that takes into account the shape of each object and the topological significance of the pixels. It consists of two components. First, a skeleton-aware weighted loss improves the segmentation accuracy by better modeling the object geometry with skeletons. Second, a boundary rectified term effectively identifies and emphasizes topologically critical pixels in the prediction errors using both foreground and background skeletons in the ground truth and predictions. Experiments show that our method improves topological consistency by up to 7 points in VI compared to 13 state-of-the-art methods, based on objective and subjective assessments across three different boundary segmentation datasets. The code is available at https://github.com/clovermini/Skea_topo.

2306.09344 2026-05-27 cs.CV cs.LG

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data

DreamSim: 使用合成数据学习人类视觉相似性的新维度

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola

AI总结 本文提出DreamSim指标,通过合成数据训练,在图像布局、对象姿态和语义内容等中高层面上对齐人类感知,并在检索和重建任务中优于现有指标。

Comments Website: https://dreamsim-nights.github.io/ Code: https://github.com/ssundaram21/dreamsim

详情
Journal ref
Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
AI中文摘要

当前的感知相似性度量在像素和补丁级别上操作。这些度量在低层颜色和纹理方面比较图像,但未能捕捉图像布局、对象姿态和语义内容中的中层相似性和差异。在本文中,我们开发了一种整体评估图像的感知度量。第一步是收集一个关于以多种方式相似的图像对的人类相似性判断的新数据集。该数据集的关键在于判断几乎是自动的,并且所有观察者共享。为了实现这一点,我们使用最近的文本到图像模型创建沿不同维度扰动的合成对。我们观察到流行的感知度量无法解释我们的新数据,因此我们引入了一个新的度量DreamSim,调整以更好地与人类感知对齐。我们分析了不同视觉属性如何影响我们的度量,发现它主要关注前景对象和语义内容,同时对颜色和布局敏感。值得注意的是,尽管在合成数据上训练,我们的度量能够泛化到真实图像,在检索和重建任务上取得了强劲的结果。此外,我们的度量在这些任务上优于先前学习的度量和最近的大型视觉模型。

英文摘要

Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks.

2312.02694 2026-05-27 cs.CV

UPOCR: Towards Unified Pixel-Level OCR Interface

UPOCR:迈向统一像素级OCR接口

Dezhi Peng, Zhenhua Yang, Jiaxin Zhang, Chongyu Liu, Yongxin Shi, Kai Ding, Fengjun Guo, Lianwen Jin

AI总结 提出UPOCR,一种基于ViT编码器-解码器和可学习任务提示的统一像素级OCR模型,在文本移除、分割和篡改检测三个任务上以单一模型实现最先进性能。

Comments ICML 2024 Version

详情
AI中文摘要

现有的光学字符识别(OCR)方法依赖于任务特定的设计,具有不同的范式、架构和训练策略,这显著增加了研究和维护的复杂性,并阻碍了在应用中的快速部署。为此,我们提出UPOCR,一种简单而有效的通用模型,用于统一像素级OCR接口。具体来说,UPOCR将不同OCR任务的范式统一为图像到图像的变换,架构统一为基于视觉Transformer(ViT)的编码器-解码器,并带有可学习的任务提示。这些提示将编码器提取的通用特征表示推向任务特定的空间,赋予解码器任务感知能力。此外,模型训练统一以最小化预测图像与真实图像之间的差异为目标,无论任务之间的异质性如何。在三个像素级OCR任务(包括文本移除、文本分割和篡改文本检测)上进行了实验。无需花哨的附加组件,实验结果表明,所提出的方法能够以统一的单一模型同时在三个任务上实现最先进的性能,为未来通用OCR模型的研究提供了有价值的策略和见解。代码可在 https://github.com/shannanyinxiang/UPOCR 获取。

英文摘要

Existing optical character recognition (OCR) methods rely on task-specific designs with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder with learnable task prompts. The prompts push the general feature representations extracted by the encoder towards task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the predicted and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks: text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results show that the proposed method can simultaneously achieve state-of-the-art performance on the three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code is available at https://github.com/shannanyinxiang/UPOCR.

2210.02573 2026-05-27 cs.LG

Efficient Learning of Mesh-Based Physical Simulation with BSMS-GNN

基于BSMS-GNN的网格物理模拟高效学习

Yadi Cao, Menglei Chai, Minchen Li, Chenfanfu Jiang

AI总结 针对大规模网格物理模拟中图神经网络扩展复杂度和过平滑问题,提出基于二分图确定的双步幅池化策略BSMS-GNN,无需人工粗网格且避免几何边界错误边,显著提升精度和计算效率。

Comments Update summary: added the missing remark for Yadi and Menglei (* work partially done while they were at Snap Inc.)

详情
AI中文摘要

使用扁平的图神经网络(GNN)和堆叠的消息传递(MP)在大规模网格上学习物理模拟具有挑战性,因为其复杂度随节点数量增长,且存在过平滑问题。社区对在用于物理模拟的GNN中引入多尺度结构的兴趣日益增长。然而,当前最先进的方法受限于依赖费力的人工绘制粗网格或基于空间邻近性构建粗层级,后者可能在几何边界处引入错误边。受二分图判定的启发,我们提出了一种新颖的池化策略,即双步幅(bi-stride),以解决上述限制。双步幅在广度优先搜索(BFS)中每隔一个前沿对节点进行池化,无需手动绘制粗网格,并避免了空间邻近性导致的错误边。此外,它实现了每层级单次MP方案以及通过插值进行的非参数化池化和反池化,类似于U-Net,显著降低了计算成本。实验表明,所提出的框架BSMS-GNN在代表性物理模拟中,在精度和计算效率方面均显著优于现有方法。

英文摘要

Learning the physical simulation on large-scale meshes with flat Graph Neural Networks (GNNs) and stacking Message Passings (MPs) is challenging due to the scaling complexity w.r.t. the number of nodes and over-smoothing. There has been growing interest in the community to introduce multi-scale structures to GNNs for physical simulation. However, current state-of-the-art methods are limited by their reliance on the labor-intensive drawing of coarser meshes or building coarser levels based on spatial proximity, which can introduce wrong edges across geometry boundaries. Inspired by the bipartite graph determination, we propose a novel pooling strategy, bi-stride, to tackle the aforementioned limitations. Bi-stride pools nodes on every other frontier of the breadth-first search (BFS), without the need for the manual drawing of coarser meshes and avoiding the wrong edges by spatial proximity. Additionally, it enables a one-MP scheme per level and non-parametrized pooling and unpooling by interpolations, resembling U-Nets, which significantly reduces computational costs. Experiments show that the proposed framework, BSMS-GNN, significantly outperforms existing methods in terms of both accuracy and computational efficiency in representative physical simulations.
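
Pooling "on every other frontier of the breadth-first search" can be sketched as follows. This is a minimal illustration under assumptions, not the BSMS-GNN implementation: the function names, the adjacency-dictionary representation, and the choice of seed node are hypothetical, and the sketch covers only node selection within one connected component. How the paper picks seeds or rebuilds edges at the coarser level is not described in the abstract.

```python
from collections import deque

def bfs_depths(adjacency, seed):
    """Return the BFS depth of every node reachable from seed."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return depth

def bi_stride_select(adjacency, seed):
    """Keep the nodes on every other BFS frontier (even depths here)."""
    depth = bfs_depths(adjacency, seed)
    return sorted(node for node, d in depth.items() if d % 2 == 0)

# A path mesh 0 - 1 - 2 - 3 - 4, with node 0 as the (assumed) seed.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
kept = bi_stride_select(path, seed=0)  # [0, 2, 4]
```

Because the selection uses only graph connectivity, two nodes that are close in space but far apart along the mesh are never merged, which matches the abstract's point about avoiding wrong edges across geometry boundaries.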

2302.13473 2026-05-27 cs.LG

Towards Interpretable Federated Learning

迈向可解释的联邦学习

Anran Li, Rui Liu, Ming Hu, Yuanyuan Chen, Shipeng Wang, Lizhen Cui, Han Yu

AI总结 本文首次综述可解释联邦学习(IFL),提出涵盖模型解释、调试和数据贡献评估的独特分类体系,并分析代表性方法、评估指标和未来方向。

Comments Survey of interpretable federated learning

详情
AI中文摘要

联邦学习(FL)使多个数据所有者能够在不暴露私有本地数据的情况下协作构建机器学习模型。为了使FL得到广泛采用,平衡性能、隐私保护和可解释性的需求至关重要,尤其是在金融和医疗等关键任务应用中。因此,可解释联邦学习(IFL)已成为一个新兴的研究课题,吸引了学术界和工业界的极大兴趣。其跨学科性质对新研究人员来说可能具有挑战性。在本文中,我们通过提供(据我们所知)第一篇关于IFL的综述来弥合这一差距。我们提出了一个独特的IFL分类法,涵盖了使FL模型能够解释预测结果、支持模型调试以及提供关于单个数据所有者或数据样本贡献的见解的相关工作,这对于公平分配奖励以激励在FL中积极可靠的参与至关重要。我们对代表性的IFL方法、常用的性能评估指标以及构建多功能IFL技术的有前景方向进行了全面分析。

英文摘要

Federated learning (FL) enables multiple data owners to build machine learning models collaboratively without exposing their private local data. For FL to achieve widespread adoption, it is important to balance the need for performance, privacy preservation, and interpretability, especially in mission-critical applications such as finance and healthcare. Thus, interpretable federated learning (IFL) has become an emerging research topic attracting significant interest from academia and industry alike. Its interdisciplinary nature can make it challenging for new researchers to pick up. In this paper, we bridge this gap by providing (to the best of our knowledge) the first survey on IFL. We propose a unique IFL taxonomy that covers relevant works enabling FL models to explain prediction results, support model debugging, and provide insights into the contributions made by individual data owners or data samples, which in turn is crucial for allocating rewards fairly to motivate active and reliable participation in FL. We conduct a comprehensive analysis of the representative IFL approaches, the commonly adopted performance evaluation metrics, and promising directions towards building versatile IFL techniques.

2009.11997 2026-05-27 cs.LG cs.AI cs.RO

Continual Model-Based Reinforcement Learning with Hypernetworks

基于超网络的持续基于模型强化学习

Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, Florian Shkurti

AI总结 提出HyperCRL方法,利用任务条件超网络在序列任务中持续学习动力学模型,避免重新训练并固定存储开销,在机器人运动和操作任务中优于现有持续学习方法。

Comments Updated link to project website in the abstract. 7 pages (+2 pages in appendix), 8 figures. In proceedings of the 2021 IEEE International Conference on Robotics and Automation

详情
AI中文摘要

在基于模型的强化学习(MBRL)和模型预测控制(MPC)中,有效规划依赖于学习到的动力学模型的准确性。在MBRL和MPC的许多实例中,该模型被假定为平稳的,并且定期从头开始重新训练,使用从环境交互开始收集的状态转移经验。这意味着训练动力学模型所需的时间(以及计划执行之间的暂停时间)随着收集的经验规模线性增长。我们认为这对于终身机器人学习来说太慢,并提出了HyperCRL,一种使用任务条件超网络在序列任务中持续学习所遇到动力学的方法。我们的方法有三个主要特点:首先,它包括不重新访问先前任务训练数据的动力学学习会话,因此只需存储最近固定大小的状态转移经验;其次,它使用固定容量的超网络来表示非平稳且任务感知的动力学;第三,它优于依赖固定容量网络的现有持续学习替代方案,并且与记忆不断增长的过去经验核心集的基线方法相比具有竞争力。我们展示了HyperCRL在机器人运动和操作场景(如推动和开门任务)中进行持续的基于模型的强化学习的有效性。我们的项目网站(含视频)位于此链接:https://rvl.cs.toronto.edu/blog/hypercrl

英文摘要

Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be stationary and is periodically re-trained from scratch on state transition experience collected from the beginning of environment interactions. This implies that the time required to train the dynamics model - and the pause required between plan executions - grows linearly with the size of the collected experience. We argue that this is too slow for lifelong robot learning and propose HyperCRL, a method that continually learns the encountered dynamics in a sequence of tasks using task-conditional hypernetworks. Our method has three main attributes: first, it includes dynamics learning sessions that do not revisit training data from previous tasks, so it only needs to store the most recent fixed-size portion of the state transition experience; second, it uses fixed-capacity hypernetworks to represent non-stationary and task-aware dynamics; third, it outperforms existing continual learning alternatives that rely on fixed-capacity networks, and does competitively with baselines that remember an ever increasing coreset of past experience. We show that HyperCRL is effective in continual model-based reinforcement learning in robot locomotion and manipulation scenarios, such as tasks involving pushing and door opening. Our project website with videos is at this link https://rvl.cs.toronto.edu/blog/hypercrl

1909.08210 2026-05-27 cs.LG stat.ML

Reformulation of RBM to Unify Linear and Nonlinear Dimensionality Reduction

RBM的重新表述以统一线性和非线性降维

Jiangsheng You, Chun-Yen Liu

AI总结 本文通过最大后验估计和期望最大化算法重新表述受限玻尔兹曼机为确定性模型,提出无需MCMC的对比散度算法,统一了标量和向量变量的线性和非线性降维。

Comments 16 pages with 7 figures

详情
AI中文摘要

受限玻尔兹曼机(RBM)是一种具有共享权重的两层神经网络,在文献中已被广泛研究用于降维、数据表示和推荐系统。传统的RBM需要对两层上的值进行概率解释,并在训练期间使用马尔可夫链蒙特卡洛(MCMC)过程生成样本。对比散度(CD)算法能高效训练RBM,但其收敛性尚未得到数学证明。在本文中,利用最大后验(MAP)估计和期望最大化(EM)算法,我们证明了无MCMC的CD算法对于条件似然目标函数是收敛的。本文的另一个关键贡献是将RBM重新表述为确定性模型。在重新表述的RBM中,无MCMC的CD算法近似于梯度下降(GD)方法。这种重新表述的RBM可以在节点上采用连续的标量和向量变量,并灵活选择激活函数。数值实验显示了其在线性和非线性降维中的能力,并且对于非线性降维,通过选择合适的激活函数,重新表述的RBM可以优于主成分分析(PCA)。最后,我们展示了其在CIFAR-10数据集(彩色图像)和多变量序列数据上的向量值节点应用,这些应用无法用传统RBM自然配置。这项工作不仅为传统RBM提供了理论见解,而且统一了标量和向量变量的线性和非线性降维。

英文摘要

A restricted Boltzmann machine (RBM) is a two-layer neural network with shared weights and has been extensively studied for dimensionality reduction, data representation, and recommendation systems in the literature. The traditional RBM requires a probabilistic interpretation of the values on both layers and a Markov chain Monte Carlo (MCMC) procedure to generate samples during training. The contrastive divergence (CD) algorithm trains the RBM efficiently, but its convergence has not been proved mathematically. In this paper, using a maximum a posteriori (MAP) estimate and the expectation maximization (EM) algorithm, we show that the CD algorithm without MCMC is convergent for the conditional likelihood objective function. Another key contribution of this paper is the reformulation of the RBM into a deterministic model. Within the reformulated RBM, the CD algorithm without MCMC approximates the gradient descent (GD) method. This reformulated RBM can take continuous scalar and vector variables on the nodes, with flexibility in choosing the activation functions. Numerical experiments show its capability in both linear and nonlinear dimensionality reduction, and, for nonlinear dimensionality reduction, the reformulated RBM can outperform principal component analysis (PCA) when proper activation functions are chosen. Finally, we demonstrate its application to vector-valued nodes for the CIFAR-10 dataset (color images) and multivariate sequence data, which cannot be configured naturally with the traditional RBM. This work not only provides theoretical insights regarding the traditional RBM but also unifies linear and nonlinear dimensionality reduction for scalar and vector variables.

2605.27371 2026-05-27 cs.CY cs.AI

Algorithmic Monocultures in Hiring

招聘中的算法同质化

Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel, Dan Jurafsky, Percy Liang

AI总结 研究招聘算法同质化导致相同个体和种族群体被拒绝的问题,通过分析300万求职者的400万份申请数据,发现明显的种族差异和结果同质性。

Comments Published at FAccT 2026. Website: https://algorithmichiring.github.io/

详情
AI中文摘要

许多雇主使用由少数几家算法供应商构建的算法筛选求职者。我们假设算法同质化导致相同的个体和相同种族群体的成员面临拒绝。我们获取并分析了一个包含300万求职者提交400万份申请的新数据集,所有申请均由同一供应商构建的算法筛选。我们发现求职者结果存在明显的种族差异。根据美国就业歧视标准,亚裔和黑人求职者提交的所有申请中,分别有14.74%和25.87%的申请提交给了对亚裔和黑人求职者产生不利影响的职位。个体也收到同质化的结果:在所有申请10个职位的求职者中,有4%被所有职位推荐拒绝,这一比例高于随机预期。为了更好地理解这种同质性,我们利用招聘算法的确定性可复制性,生成如果求职者申请所有职位本应获得的结果。我们表明,求职者需要广泛申请才能确保他们的申请被人审阅。

英文摘要

Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and Black applicants, 14.74% and 25.87% are submitted to positions that adversely impact Asian and Black applicants, respectively, according to U.S. employment discrimination standards. Individuals also receive homogeneous outcomes: 4% of all applicants who apply to 10 positions are recommended for rejection from all positions, a rate higher than expected by chance. To better understand this homogeneity, we leverage the deterministic replicability of hiring algorithms to generate the outcomes applicants would have received if they applied to all positions. We show that applicants would need to apply widely in order to ensure their applications are considered by a human.

2605.27360 2026-05-27 cs.NI cs.AI

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

GENESIS: 利用AI智能体实现自主6G RAN合成、研究与测试

Tamerlan Aghayev, Maxime Elkael, Michele Polese, Minh Dat Nguyen, Gabriele Gemmi, Andrea Lacava, Ali Saeizadeh, Reshma Prasad, Paolo Testolina, Angelo Feraudo, Soumendra Nanda, Pedram Johari, Salvatore D'Oro, Tommaso Melodia

AI总结 提出GENESIS框架,通过智能体、技能和钩子三种可组合原语及知识层SYNAPSE,将意图转化为经空口实验验证的解决方案,以加速6G无线接入网研发。

Comments 18 pages, 16 figures

详情
AI中文摘要

蜂窝研究与开发受制于六个结构性流程,每个流程每次迭代需要数月的体力工程工作:(i) 将标准或研究论文中的新特性综合为生产代码;(ii) 一致性测试和互操作性测试;(iii) 针对现场异常和多样化部署环境进行加固;(iv) 网络功能的数据驱动优化;(v) 发现并原型化未来标准的新波形、功能及能力;(vi) 保护协议栈免受漏洞攻击。尽管大型语言模型已将通用软件工程中类似的研发工作从数天压缩至数分钟,但其已知缺陷在无线接入网用例中更为严重:它们会幻觉应用程序编程接口并误读规范,导致RAN组件在第一次错误时即失去互操作性,并且它们严重依赖仿真来设计算法,而仿真在迁移到真实硬件时往往失效。为应对这些挑战,我们提出GENESIS,一个智能体人工智能框架,将意图(如规范条款、遥测异常或研究假设)转化为经空口实验验证的解决方案,并反馈到持久知识库中。GENESIS建立在三种可组合原语(智能体、技能、钩子)和一个知识层(SYNAPSE)之上,该知识层既作为事实来源,也作为框架产生的所有工件的接收者,使能力在多次运行中累积。

英文摘要

Cellular research and development (R&D) is throttled by six structural processes that each consume months of manual engineering work per iteration: (i) synthesizing new features from standards or research papers into production code; (ii) conformance and interoperability testing; (iii) hardening against field anomalies and diverse deployment environments; (iv) data-driven optimization of network functionalities; (v) discovering and prototyping novel waveforms, functionalities, and capabilities for future standards; and (vi) securing the stack against vulnerabilities. Although Large Language Models (LLMs) have compressed comparable R&D work in general software engineering from days to minutes, their known pitfalls worsen on Radio Access Network (RAN) use cases: they hallucinate Application Programming Interfaces (APIs) and misread specifications, which kills interoperability of RAN components at the first mistake, and they rely heavily on simulations for designing algorithms, an approach that is notorious for breaking when transferred to real hardware. To address these challenges, we present GENESIS, an agentic Artificial Intelligence (AI) framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base. GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.