In this paper, we explore mission assignment and task offloading in an Open Radio Access Network (Open RAN)-based intelligent transportation system (ITS), where autonomous vehicles leverage mobile edge computing for efficient processing. Existing studies often overlook the intricate interdependencies between missions and the costs associated with offloading tasks to edge servers, leading to suboptimal decision-making. To bridge this gap, we introduce Oranits, a novel system model that explicitly accounts for mission dependencies and offloading costs while optimizing performance through vehicle cooperation. To achieve this, we propose a twofold optimization approach. First, we develop a metaheuristic-based evolutionary computing algorithm, namely the Chaotic Gaussian-based Global ARO (CGG-ARO), serving as a baseline for one-slot optimization. Second, we design an enhanced reward-based deep reinforcement learning (DRL) framework, referred to as the Multi-agent Double Deep Q-Network (MA-DDQN), that integrates both multi-agent coordination and multi-action selection mechanisms, significantly reducing mission assignment time and improving adaptability over baseline methods. Extensive simulations reveal that CGG-ARO improves the number of completed missions and overall benefit by approximately 7.1% and 7.7%, respectively. Meanwhile, MA-DDQN achieves even greater improvements of 11.0% in terms of mission completions and 12.5% in terms of the overall benefit. These results highlight the effectiveness of Oranits in enabling faster, more adaptive, and more efficient task processing in dynamic ITS environments.
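
The MA-DDQN component named above builds on the Double DQN update. As a rough illustration only (the abstract gives no network, reward, or multi-agent details), the standard Double DQN target picks the next action with the online network and evaluates it with the target network. The function name and the toy Q-values below are hypothetical.

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Standard Double DQN target for a single transition.

    q_online_next / q_target_next: Q-values of the next state, one per action,
    from the online and the target network respectively.
    """
    if done:
        return reward
    # Select the greedy next action with the online network ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... but evaluate it with the target network to curb overestimation.
    return reward + gamma * q_target_next[best_action]


# Toy values for illustration only (not taken from the paper).
target = double_dqn_target(
    reward=1.0,
    gamma=0.9,
    q_online_next=[0.2, 0.8, 0.5],
    q_target_next=[0.3, 0.6, 0.9],
    done=False,
)
print(target)
```

Note that the target uses the target network's value for action 1 (the online network's choice), not the target network's own maximum.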

2601.16233 2026-06-19 cs.SI cs.AI (version update)

UniMM: A Unified Mixture Model Framework for Multi-Agent Simulation

Longzhong Lin, Xuewu Lin, Kechun Xu, Haojian Lu, Lichao Huang, Rong Xiong, Yue Wang

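Affiliations: Zhejiang University; Horizon Robotics

AI summary: Proposes the UniMM framework, which unifies regression-based mixture models and discrete NTP models, mitigates distributional shift through closed-loop sample generation, and achieves state-of-the-art performance on the WOSAC benchmark.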

Comments Accepted author manuscript. The version of record has been published in IEEE Transactions on Pattern Analysis and Machine Intelligence

Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access, 2026

DOI: 10.1109/TPAMI.2026.3700402

Abstract

Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we formulate a unified mixture model (UniMM) framework for generating multimodal agent behaviors, which can cover the mainstream methods including regression-based mixture models and discrete NTP models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the UniMM framework, we recognize critical configurations from both the model and data perspectives. We conduct a systematic examination of various model configurations, and comprehensively characterize their effects. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further introduce a temporal disentanglement-and-alignment mechanism to address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.

2606.18413 2026-06-19 cs.AI cs.HC (version update)

Searching for Synergy in Shared Workspace Human-AI Collaboration

Nachiket Kotalwar, Rohini Das, Carolyn Rose

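Affiliations: Carnegie Mellon University

AI summary: Studies collaboration in shared-workspace human-AI teams. Experiments in the Collaborative Gym environment show that adding collaborators lowers performance when coordination structure is missing, while scaffolding that combines shared memory with simulated human-in-the-loop gates improves team performance.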

Comments Accepted at ICML 2026 Workshop on Human-AI Co-Creativity

Abstract

Automated AI agents are increasingly capable, yet many scientific and professional tasks require human judgment and contextual expertise. We study shared-workspace human-AI teams, where AI agents and human collaborators must coordinate responsibilities before submitting a final answer. Using the Collaborative Gym environment with DiscoveryBench tasks, we examine when adding simulated human collaborators improves performance and when process loss turns additional collaborators into coordination overhead. Across 1,482 sessions, adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions. We then evaluate scaffolding that combines shared group memory with simulated human-in-the-loop (HITL) gates, where selected actions require approval from a designated simulated participant. This scaffolding yields higher mean performance, most clearly in three-person teams, with clearer responsibility signals and stronger routing of expertise to team actions. Overall, how human-AI teams coordinate and integrate expertise matters as much as the capability available to them.

2502.19193 2026-06-19 cs.SI cs.AI cs.NE (version update)

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi

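Affiliations: The University of Alabama, Alabama, USA; Roma Tre University, Rome, Italy

AI summary: Proposes Agentra, a supervisable multi-agent intrusion response framework that turns alerts into structured response plans through role separation, a plan-verify loop, a safety gateway, and a risk-scoring mechanism. On a 120-incident corpus, F1 rises from 0.61 to 0.84 and the harmful-action rate drops to 0.0%.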

PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units

Mark Deutel, Simon Geis, Axel Plinge

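Affiliations: Fraunhofer Institute for Integrated Circuits

AI summary: Proposes PrototypeNAS, a zero-shot NAS method that quickly tailors DNNs to different MCUs by decoupling design from training, searching across multiple architecture types, using an ensemble of zero-shot proxies, and applying hypervolume subset selection. On tasks such as image classification it finds small models within minutes with accuracy close to that of large models.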

Comments Accepted at ECML-PKDD 2026. 18 pages, 7 figures, 4 tables. This work was funded by the European Commission as part of the MANOLO project under the Horizon Europe programme Grant Agreement No.101135782

Abstract

Enabling efficient deep neural network (DNN) inference on edge devices with different hardware constraints is a challenging task that typically requires DNN architectures to be specialized for each device separately. To avoid the huge manual effort, one can use neural architecture search (NAS). However, many existing NAS methods are resource-intensive and time-consuming because they require the training of many different DNNs from scratch. Furthermore, they do not take the resource constraints of the target system into account. To address these shortcomings, we propose PrototypeNAS, a zero-shot NAS method to accelerate and automate the selection, compression, and specialization of DNNs to different target microcontroller units (MCUs). We propose a novel three-step search method that decouples DNN design and specialization from DNN training for a given target platform. First, we present a novel search space that does not merely cut smaller DNNs out of a single large architecture, but instead combines the structural optimization of multiple architecture types with the optimization of their pruning and quantization configurations. Second, we explore the use of an ensemble of zero-shot proxies during optimization instead of a single one. Third, we propose the use of hypervolume subset selection to distill, from the Pareto front of the multi-objective optimization, the DNN architectures that represent the most meaningful tradeoffs between accuracy and FLOPs. We evaluate the effectiveness of PrototypeNAS on 12 different datasets in three different tasks: image classification, time series classification, and object detection. Our results demonstrate that PrototypeNAS is able to identify DNN models within minutes that are small enough to be deployed on off-the-shelf MCUs and still achieve accuracies comparable to the performance of large DNN models.

2606.17979 2026-06-19 cs.AI (version update)

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

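AI summary: To address the granularity mismatch between rewards and generation trajectories in text-to-image generation, proposes STAR, which uses text-image attention to build spatiotemporally adaptive allocation maps and applies stronger policy updates to relevant latent regions, improving semantic alignment and text rendering.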

Abstract

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.
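
The idea of spreading one scalar advantage unevenly over latent regions can be sketched as follows. The abstract does not state the allocation rule, so the mean-one normalization, the `floor` parameter, and the function name are all assumptions made for illustration; this is not the paper's objective.

```python
def allocate_advantage(advantage, attention, floor=0.1):
    """Spread one scalar advantage over latent regions.

    attention: non-negative text-image attention scores, one per region.
    floor: hypothetical minimum weight so that no region is ignored entirely.
    Returns per-region advantages whose mean equals the scalar advantage.
    """
    weights = [max(a, 0.0) + floor for a in attention]
    mean_weight = sum(weights) / len(weights)
    # Normalize to mean 1 so the average update strength stays unchanged.
    return [advantage * w / mean_weight for w in weights]


# Three regions; the last one attends most strongly to the prompt's core content.
per_region = allocate_advantage(advantage=2.0, attention=[0.0, 0.3, 0.7])
print(per_region)
```

Under this normalization the regions with higher attention receive a larger share of the same advantage, while the average over regions stays equal to the original scalar.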

2402.14035 2026-06-19 cs.LG cs.AI (version update)

Wisdom of Committee: Diverse Distillation from Large Foundation Models and Domain Experts

Zichang Liu, Qingyun Liu, Yuening Li, Liang Liu, Anshumali Shrivastava, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao

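Affiliations: Rice University; Google DeepMind; Google Inc; University of California, Davis

AI summary: To address the large capacity, architecture, and modality gaps when distilling foundation models into compact domain models, proposes the DiverseDistill framework, which uses a learnable Question-Answer mechanism and aligns heterogeneous teacher outputs, recovering 73-114% of the performance gap on recommendation and vision tasks.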

Comments Accepted at the 1st Workshop on Resource-Efficient Learning and Knowledge Discovery (RelKD), KDD 2026

Journal ref Proceedings of the RelKD Workshop at KDD 2026

Abstract

Knowledge distillation from foundation models to compact domain models is challenging due to substantial gaps in capacity, architecture, and modality. For example, in our experiments, distilling from a 76M-parameter language model to a 2M-parameter recommender closes less than 40% of the performance gap between the undistilled student and the teacher. We show that introducing domain-specific experts -- which share the student's architectural characteristics -- alongside the foundation model as a diverse teacher committee significantly improves transfer. However, standard multi-teacher methods fail to exploit this diversity: naively combining heterogeneous teachers can degrade performance below single-teacher distillation. To address this, we propose DiverseDistill, an interactive distillation framework that employs a learnable Question-Answer mechanism to generate teacher-conditioned queries and align heterogeneous teacher outputs into the student's representation space. Unlike methods requiring gradient-based co-optimization or architectural modification of teachers, DiverseDistill operates with frozen teachers using only forward-pass inference through their intermediate layers: no parameter updates, no co-training, and no architectural surgery. A dynamic teacher importance mechanism further reduces training cost by filtering low-relevance teachers per sample (e.g., ~30% fewer forward passes with no quality loss for recommendation tasks), while the entire Distillation Module is discarded after training, adding zero inference overhead. Evaluations on recommendation (38x compression) and vision (3.6x compression) tasks demonstrate that DiverseDistill recovers 73-114% of the teacher-student performance gap, consistently outperforming all single- and multi-teacher baselines.
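
A figure such as "73-114% of the teacher-student performance gap" is easiest to read with the arithmetic written out. The sketch below assumes a higher-is-better metric and the usual definition of gap recovery; the abstract does not give the exact formula, and the scores are made up.

```python
def gap_recovered(student_baseline, distilled_student, teacher):
    """Fraction of the teacher-student gap closed by distillation."""
    gap = teacher - student_baseline
    if gap == 0:
        raise ValueError("teacher and undistilled student score the same")
    return (distilled_student - student_baseline) / gap


# Hypothetical scores: the distilled student ends up above the teacher,
# so the recovered fraction exceeds 1 (more than 100% of the gap).
fraction = gap_recovered(student_baseline=0.60, distilled_student=0.714, teacher=0.70)
print(f"{fraction:.0%}")
```

Values above 100% therefore mean that the distilled student outperforms the teacher on that metric.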

2503.02636 2026-06-19 q-bio.NC cs.AI (version update)

A Deep Generative Model for Resting-State EEG Synthesis and Transferable Representation Learning

Yeganeh Farahzadi, Morteza Ansarinia, Zoltan Kekecs

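Affiliations: Institute of Psychology, Eötvös Loránd University; Doctoral School of Psychology, Eötvös Loránd University; Department of Behavioural and Cognitive Sciences, University of Luxembourg

AI summary: Proposes REST-GAN, a framework that combines adversarial training with self-supervised reconstruction to synthesize resting-state EEG from raw time-domain signals and learn transferable representations, performing well on spectral, connectivity, and classification evaluations.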

Abstract

Resting-state EEG provides a non-invasive view of spontaneous brain activity, but extracting meaningful patterns is often limited by scarce high-quality data and reliance on manually engineered features. Generative adversarial networks (GANs) can synthesize neural signals and learn transferable representations directly from raw data, a dual capability that remains underexplored in EEG research. Here, we introduce REST-GAN, a GAN-based framework for resting-state EEG that combines adversarial training with an auxiliary self-supervised reconstruction objective to support signal synthesis and unsupervised feature extraction. Although trained only on raw time-domain signals, without explicit frequency-domain or sensor-topographic supervision, the generated time series reproduced key temporal, spectral, and connectivity properties of real EEG. In band-power feature space, generated samples showed high precision and recall across eyes-open and eyes-closed conditions (EO: 0.91/0.67; EC: 0.87/0.65), while group-average spectral coherence matrices showed low mean absolute differences from real data across frequency bands (~0.01-0.03). The representations learned by the model's critic transferred to independent resting-state demographic classification tasks, outperforming models trained directly on raw EEG and showing competitive performance relative to a recent EEG foundation model, while requiring substantially less training data and computational resources. These findings highlight a computationally efficient, architecture-driven strategy in which generative models serve not only as EEG signal generators, but also as unsupervised feature extractors. This approach may support more data-efficient EEG analysis while reducing reliance on manual feature engineering. The implementation code for REST-GAN is available at: https://github.com/Yeganehfrh/REST-GAN.

2509.15927 2026-06-19 cs.LG cs.AI (version update)

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Zhiyu Mou, Yiqin Lv, Miao Xu, Qi Wang, Yixiu Mao, Jinghao Chen, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng

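Affiliations: Taobao & Tmall Group of Alibaba; Department of Automation, Tsinghua University

AI summary: To address the performance bottleneck of existing generative auto-bidding methods, which cannot explore beyond the static dataset, proposes AIGB-Pearl, which achieves safe and efficient exploration through a trajectory evaluator and a KL-Lipschitz-constrained score-maximization scheme, reaching state-of-the-art performance on simulated and real-world advertising systems.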

Abstract

Auto-bidding is a critical tool for advertisers to improve advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static dataset with feedback. To address this, we propose AIGB-Pearl (Planning with EvaluAtor via RL), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator to assess the quality of generated scores and designing a provably sound KL-Lipschitz-constrained score-maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm that incorporates the synchronous coupling technique is further developed to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.

2510.18383 2026-06-19 cs.CL cs.AI (version update)

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim

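Affiliations: Seoul National University of Science and Technology; Korea Advanced Institute of Science and Technology; LG CNS

AI summary: Proposes MENTOR, which uses a flexible teacher-optimized reward structure to balance behavioral alignment with downstream performance, improving the out-of-domain generalization of small models on tool-use tasks.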

Abstract

Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor out-of-domain (OOD) generalization due to its rigid alignment with static teacher trajectories. While reinforcement learning (RL) offers an alternative, the capacity limitations of SLMs pose a severe dilemma: sparse outcome rewards provide insufficient guidance, whereas strict trajectory matching imposes overly restrictive constraints. To bridge this capacity-driven gap, we propose MENTOR, which introduces a flexible yet process-aware reward structure. Instead of enforcing rigid replication, MENTOR uses the teacher's reference to guide tool-use behavior, balancing behavioral alignment with downstream performance. Extensive experiments on controlled executable-tool benchmarks demonstrate that MENTOR improves OOD tool-use performance compared to SFT and strict RL baselines. Our findings suggest that within verifiable tool-use environments, flexible tool-use alignment offers a more effective approach than strict trajectory replication for developing adaptable small models.

2510.21978 2026-06-19 cs.LG cs.AI (version update)

Beyond Reasoning Gains: Mitigating General-Capability Forgetting in Large Reasoning Models

Hoang Phan, Xianjun Yang, Yuanshun Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei

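Affiliations: Meta Superintelligence Labs; New York University; Johns Hopkins University

AI summary: To address reasoning models forgetting foundational capabilities during reinforcement learning training, proposes RECAP, a replay strategy that adjusts training emphasis online through dynamic objective reweighting, improving reasoning performance while preserving general capabilities.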

Abstract

Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, in which models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are computed on the current task and therefore do not guarantee preservation of broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training emphasis each objective should receive. To address this, we propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts online using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks using Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.
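
One way to picture reweighting from short-horizon convergence and instability signals is sketched below. The abstract does not define these signals, so the window split, the score (absolute improvement plus standard deviation), the softmax, and all names are assumptions; this is not the RECAP algorithm itself.

```python
import math
import statistics


def reweight(histories, temperature=1.0):
    """Turn short loss histories into objective weights that sum to 1.

    histories: dict mapping objective name -> list of recent loss values
    (at least two values per objective).
    """
    scores = {}
    for name, losses in histories.items():
        half = len(losses) // 2
        # Convergence signal: how much the loss still moves within the window.
        improvement = statistics.mean(losses[:half]) - statistics.mean(losses[half:])
        # Instability signal: spread of the loss within the window.
        volatility = statistics.pstdev(losses)
        scores[name] = abs(improvement) + volatility
    # Softmax: flat, stable (saturated) objectives receive the least weight.
    normalizer = sum(math.exp(s / temperature) for s in scores.values())
    return {name: math.exp(s / temperature) / normalizer for name, s in scores.items()}


weights = reweight({
    "saturated": [0.10, 0.10, 0.10, 0.10],
    "volatile": [0.90, 0.40, 0.80, 0.30],
})
print(weights)
```

In this toy run the flat objective gets the smaller weight, which mirrors the stated goal of shifting emphasis away from saturated objectives.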

2601.21542 2026-06-19 cs.CV cs.AI (version update)

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

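Affiliations: The Hong Kong University of Science and Technology

AI summary: Proposes BA-solver, which uses a lightweight SideNet (1-2% of the backbone size) for bidirectional temporal perception and bi-anchor velocity integration. Without retraining the backbone and at very low training cost, it matches the quality of a 100+ step Euler solver within 10 steps and supports plug-and-play use.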

Abstract

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet 256×256 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
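
For context on the Euler baseline mentioned above, here is a minimal sketch of Euler sampling where each step costs one function evaluation. The toy velocity field `dx/dt = x` stands in for a learned backbone and is purely an assumption; the sketch does not implement BA-solver or its SideNet.

```python
import math


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps.

    Returns the final state and the number of function evaluations (NFE).
    """
    x, nfe = x0, 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)  # one (backbone) evaluation per step
        nfe += 1
    return x, nfe


def toy_velocity(x, t):
    # Hypothetical stand-in for a learned velocity field; exact solution is e^t.
    return x


x10, nfe10 = euler_sample(toy_velocity, 1.0, 10)
x100, nfe100 = euler_sample(toy_velocity, 1.0, 100)
print(nfe10, abs(x10 - math.e))
print(nfe100, abs(x100 - math.e))
```

The error shrinks as the step count grows, which is why a plain Euler solver needs many evaluations and why few-step solvers aim to reach the same quality with far fewer backbone calls.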

2601.22970 2026-06-19 cs.LG cs.AI (version update)

Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic Methods

Jeong Woon Lee, Kyoleen Kwak, Daeho Kim, Hyoseok Hwang

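Affiliations: College of Software, Kyung Hee University

AI summary: To address policy oscillation in actor-critic methods with continuous action spaces, proposes PAVE, a framework grounded in the differential geometry of the critic that smooths the policy by stabilizing the Q-gradient field, without modifying the actor.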

Abstract

Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy's output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function's mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.
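
The sensitivity bound described above follows the usual implicit-function-theorem pattern. The sketch below assumes a deterministic policy $\pi^*(s) = \arg\max_a Q(s, a)$ attained at an interior point, a twice continuously differentiable $Q$, and a negative definite action Hessian at the optimum; these regularity conditions and the choice of spectral norm are assumptions of this sketch, not statements taken from the paper.

```latex
% First-order optimality of the greedy action:
\nabla_a Q\bigl(s, \pi^*(s)\bigr) = 0 .

% Differentiate both sides with respect to s (chain rule):
\nabla^2_{as} Q + \nabla^2_{aa} Q \, \frac{\partial \pi^*}{\partial s} = 0
\quad\Longrightarrow\quad
\frac{\partial \pi^*}{\partial s} = -\bigl(\nabla^2_{aa} Q\bigr)^{-1} \nabla^2_{as} Q .

% Taking spectral norms, with \nabla^2_{aa} Q negative definite:
\left\lVert \frac{\partial \pi^*}{\partial s} \right\rVert
\;\le\;
\frac{\lVert \nabla^2_{as} Q \rVert}{\lambda_{\min}\bigl(-\nabla^2_{aa} Q\bigr)} .
```

The numerator is the mixed-partial term (how strongly state perturbations move the action gradient) and the denominator is the action-space curvature, which matches the ratio named in the abstract.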

2602.04396 2026-06-19 cs.LG cs.AI (version update)

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane

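Affiliations: University of Cambridge; Institute of Science and Technology Austria; Lancaster University; Flower Labs

AI summary: Proposes LoRDO, a framework that unifies low-rank optimization with infrequent synchronization and restores subspace exploration through a full-rank quasi-hyperbolic update, achieving near-parity with low-rank DDP at 125M-720M model scales while reducing communication by about 10x.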

Comments Accepted at ICML 2026

Abstract

Distributed training of foundation models via DDP is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose LoRDO, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. LoRDO achieves near-parity with low-rank DDP in language modeling and downstream tasks at model scales of 125M-720M, while reducing communication by approximately 10×. Finally, we show that LoRDO improves performance even more in very low-memory settings with small rank/batch size.
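
For readers unfamiliar with the term, a quasi-hyperbolic update blends the raw gradient with a momentum buffer. The sketch below is the generic quasi-hyperbolic momentum (QHM) rule for a scalar parameter, with illustrative hyperparameters; how LoRDO combines it with low-rank state is not spelled out in the abstract and is not modeled here.

```python
def qhm_step(param, grad, momentum, lr=0.1, beta=0.9, nu=0.7):
    """One quasi-hyperbolic momentum (QHM) step for a scalar parameter.

    nu = 0 recovers plain SGD; nu = 1 recovers dampened momentum SGD.
    """
    # Exponential moving average of gradients.
    momentum = beta * momentum + (1.0 - beta) * grad
    # Blend the raw gradient with the momentum buffer.
    update = (1.0 - nu) * grad + nu * momentum
    return param - lr * update, momentum


# Minimize f(x) = x^2 (gradient 2x) for a few steps from x = 1.
x, m = 1.0, 0.0
for _ in range(50):
    x, m = qhm_step(x, 2.0 * x, m)
print(x)
```

The raw-gradient term gives the update a direct path that does not pass through the accumulated optimizer state, which is the property the generic rule is known for.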

2602.22495 2026-06-19 cs.LG cs.AI (version update)

Reinforcement-aware Knowledge Distillation for LLM Reasoning

Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto

Affiliations: Meta

AI总结提出RL感知蒸馏（RLAD），通过信任区域比率蒸馏（TRRD）在强化学习后训练中实现选择性模仿，解决分布不匹配和目标干扰问题，在逻辑推理和数学基准上优于现有方法。

Abstract

Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
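
The likelihood-ratio objective described above can be illustrated with a small per-token sketch. This is not the authors' implementation: the arithmetic mixture, the weight `alpha`, the clipping range `clip_eps`, and the function name `trrd_surrogate` are all assumptions made for illustration, since the abstract only states that a PPO/GRPO-style ratio is anchored to a teacher--old-policy mixture.

```python
import math


def trrd_surrogate(logp_student, logp_old, logp_teacher, advantage,
                   alpha=0.5, clip_eps=0.2):
    """Clipped surrogate for one token, anchored to a teacher/old-policy mixture.

    Assumption: the anchor is an arithmetic mixture of probabilities,
    alpha * pi_teacher + (1 - alpha) * pi_old. The paper may define it differently.
    """
    anchor = alpha * math.exp(logp_teacher) + (1.0 - alpha) * math.exp(logp_old)
    ratio = math.exp(logp_student) / anchor
    # PPO-style clipping keeps the update inside a trust region around the anchor.
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Pessimistic (minimum) surrogate, as in PPO.
    return min(ratio * advantage, clipped * advantage)


# With alpha = 0 the anchor is the old policy, so this reduces to the usual PPO term.
ppo_like = trrd_surrogate(math.log(0.5), math.log(0.5), math.log(0.9), 2.0, alpha=0.0)

# With alpha = 0.5 the teacher raises the anchor, so the student's ratio falls below 1.
pos_adv = trrd_surrogate(math.log(0.5), math.log(0.5), math.log(0.9), 1.0)
neg_adv = trrd_surrogate(math.log(0.5), math.log(0.5), math.log(0.9), -1.0)
```

In this sketch the teacher influences the update only through the ratio, so the sign and size of the advantage still decide whether moving toward the teacher is rewarded, which is one way to read the "selective imitation" described in the abstract.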

2606.15015 2026-06-19 cs.CV cs.AI (updated version)

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Brian Sheil, Yixiong Jing

Affiliations: University of Oxford; University of Cambridge

AI summary: Proposes NEXUS, a neural energy-field framework that models conservative and non-conservative dynamics through scalar energy and dissipation terms, improving long-horizon trajectory accuracy in contact-rich 3D scenes and guiding video generation.

Comments 18 pages, 4 figures, 6 tables. Preprint

Abstract

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.
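
The energy-and-dissipation formulation can be sketched on a toy system. The example below is an illustration under stated assumptions, not the NEXUS model: it uses a single mass on a vertical spring, hand-written energy and Rayleigh dissipation functions in place of learned networks, a finite-difference derivative in place of automatic differentiation, and hypothetical constants `M`, `K`, `C`, and `G`.

```python
M, K, C, G = 1.0, 50.0, 0.5, 9.81  # hypothetical mass, stiffness, damping, gravity


def energy(q):
    """Conservative terms composed additively: elastic energy plus gravity."""
    return 0.5 * K * q * q + M * G * q


def dissipation(v):
    """Rayleigh-style dissipation function for linear damping."""
    return 0.5 * C * v * v


def derivative(f, x, eps=1e-6):
    """Central finite difference, standing in for automatic differentiation."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def step(q, v, dt, substeps=4):
    """Semi-implicit (symplectic) Euler rollout with several substeps per frame."""
    h = dt / substeps
    for _ in range(substeps):
        # Force = -dE/dq - dR/dv
        force = -derivative(energy, q) - derivative(dissipation, v)
        v = v + h * force / M  # update velocity first ...
        q = q + h * v          # ... then position with the new velocity
    return q, v


def total_energy(q, v):
    return 0.5 * M * v * v + energy(q)


q, v = 0.5, 0.0
e_start = total_energy(q, v)
for _ in range(500):
    q, v = step(q, v, dt=0.01)
e_end = total_energy(q, v)
```

Because the damping enters only through the dissipation function, the mechanical energy `e_end` ends below `e_start`, while setting `C = 0` would leave the rollout approximately energy-conserving.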

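2606.18812 2026-06-19 cs.LG cs.AI (updated version)

Reinforcement Learning Foundation Models Should Already Be A Thing

Abdelrahman Zighem, Jill-Jênn Vie

Affiliations: École normale supérieure de Paris, PSL University, Paris, France; Soda team, Inria Saclay, Palaiseau, France

AI summary: Proposes building reinforcement learning foundation models from synthetic MDPs, using a fixed-size sufficient statistic that makes attention-based architectures applicable; online, the model needs far fewer episodes than UCB-VI and tabular Q-learning, and offline it is competitive with VI-LCB.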

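AI abstract (translated from Chinese)

Foundation models for language and vision are powered by internet-scale data, which structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) lack. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a Transformer pretrained on a synthetic Bayesian prior. We make two points. \textbf{First}, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. \textbf{Second}, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train a model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning; offline, competitively with VI-LCB.

Abstract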

Foundation models for language and vision are powered by internet-scale data, while structured domains such as tabular prediction are powered by synthetic data. This substitute shifts the challenge from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. \textbf{First}, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. \textbf{Second}, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train a Graph Attention Network entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

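2606.11537 2026-06-19 cs.AI cs.CE (updated version)

ZeSTA: Zero-Shot Text-to-Speech Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

Youngwon Choi, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim

Affiliations: Maum AI Inc.; Humelo Inc.

AI summary: Proposes the ZeSTA framework, which uses a lightweight domain embedding to distinguish real from synthetic speech and oversamples real data, improving speaker similarity for zero-shot text-to-speech augmentation under extremely low-resource conditions while preserving intelligibility and perceptual quality.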

Comments 6 pages, accepted to INTERSPEECH 2026

2604.04917 2026-06-19 cs.CV cs.AI cs.CL (updated version)

Vero: An Open RL Recipe for General Visual Reasoning

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

Affiliations: Princeton University

AI summary: Proposes Vero, a family of open vision-language models built with the 600K-sample dataset Vero-600K and task-routed rewards; across 30 benchmarks Vero variants gain 2.9-5.4 points on average, and Vero-Qwen3I-8B surpasses Qwen3-VL-8B-Thinking by 3.8 points.

Comments Project page: https://vero-reasoning.github.io/

Abstract

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, yet their closed data and reinforcement learning (RL) pipelines make their gains difficult to study, reproduce, or extend. We introduce Vero, a family of fully open VLMs that match or exceed existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answers. Across VeroEval, our 30-benchmark suite, Vero-600K outperforms existing RL datasets under controlled comparisons. Applied to five starting models, Vero variants gain 2.9-5.4 points on average over their initial models. Notably, Vero-Qwen3I-8B, trained on the Instruct model, surpasses Qwen3-VL-8B-Thinking by 3.8 points on average without additional distillation. Systematic ablations reveal that different task categories elicit distinct reasoning patterns and that broad gains depend on learning them jointly rather than in isolation. All data, code, and models are publicly available.
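
The idea of task-routed rewards that handle heterogeneous answers can be sketched as a dispatch table. The abstract does not specify the actual reward functions, so the task types, the scoring rules, and every name below (`ROUTES`, `task_routed_reward`, and the three scorers) are hypothetical.

```python
def _norm(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def reward_multiple_choice(pred, gold):
    return 1.0 if _norm(pred) == _norm(gold) else 0.0


def reward_numeric(pred, gold, rel_tol=0.01):
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return 0.0  # unparsable answers earn no reward
    return 1.0 if abs(p - g) <= rel_tol * max(1.0, abs(g)) else 0.0


def reward_free_text(pred, gold):
    """Token-overlap F1 as a stand-in scorer for open-ended answers."""
    p, g = _norm(pred).split(), _norm(gold).split()
    if not p or not g:
        return 0.0
    common = sum(min(p.count(tok), g.count(tok)) for tok in set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical routing table: each task category gets its own verifier.
ROUTES = {
    "multiple_choice": reward_multiple_choice,
    "numeric": reward_numeric,
    "free_text": reward_free_text,
}


def task_routed_reward(task_type, pred, gold):
    return ROUTES[task_type](pred, gold)


mc = task_routed_reward("multiple_choice", " b ", "B")
num = task_routed_reward("numeric", "3.14", "3.1416")
bad = task_routed_reward("numeric", "abc", "3.1416")
txt = task_routed_reward("free_text", "a red car", "red car")
```

Routing by task type keeps each scorer simple and lets a single RL loop mix categories whose answers are not comparable under one metric.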

2605.31393 2026-06-19 cs.CL cs.AI (updated version)

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa

Affiliations: III-LIDI Universidad Nacional de La Plata; CDTEC, Federal University of Pelotas; CONICET III-LIDI; Comision de Investigaciones Cientificas Universidad Nacional de La Plata; Universidade Federal de Pelotas

AI summary: To address scarce parallel corpora and long-tailed target vocabularies in sign language translation, proposes target-side augmentation in which GPT-4o generates controlled paraphrase variants of the reference sentences, and evaluates the method on three sign language datasets.

Comments Accepted at GenSign @ CVPR 2026. Non-Proceedings Track (https://genai4sl.github.io/)

Abstract

Sign language translation (SLT) remains constrained by the limited availability of paired sign-video/text corpora and by the heavy-tailed vocabularies typical of real-world datasets. We study a target-side augmentation strategy in which a large language model (LLM) generates controlled paraphrase variants of the reference spoken-language sentence while the sign input remains unchanged. Concretely, we use GPT-4o to produce semantically faithful variants of the training targets and train a Signformer-style pose-based Transformer under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate this strategy on three datasets that span complementary challenges: PHOENIX14T (German Sign Language), a real-world corpus with moderate lexical diversity; the Greek Sign Language Dataset with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), a naturalistic corpus with a large vocabulary and severe long-tail sparsity. This range allows us to characterize precisely when and why target-side augmentation is beneficial. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33, demonstrating that paraphrastic exposure helps the decoder generalize beyond memorized reference phrasing. The near-saturated GSL baseline and the extremely sparse LSA-T setting reveal the limits of the approach: in both cases, single-reference lexical overlap metrics are insufficient to capture the full picture, motivating a complementary semantic evaluation. To our knowledge, this is the first study to examine LLM-generated target-side paraphrases as an augmentation mechanism for SLT, and the first to apply an LLM-as-a-Judge evaluation protocol to SLT. This complementary evaluation reveals gains in semantic fidelity that lexical overlap metrics understate.

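2606.05833 2026-06-19 cs.CV cs.AI (updated version)

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

Affiliations: University of Freiburg; Bosch Research; University of Haifa

AI summary: Proposes Latent Gaussian Splatting (LaGS), which uses feature-bearing Gaussians as dynamic keypoints to aggregate multi-view features for 4D panoptic occupancy tracking, achieving state-of-the-art performance on Occ3D nuScenes and Waymo.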

Comments Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

DOI: 10.1109/LRA.2026.3703990

Abstract

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.

2603.09420 2026-06-19 cs.CV cs.AI cs.RO (updated version)

Class-Incremental Motion Forecasting

Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

Affiliations: Department of Computer Science, University of Freiburg, Germany; Qualcomm SARL France; Automated Driving, Qualcomm Technologies, Inc.

AI summary: Introduces the new task of class-incremental motion forecasting and an end-to-end framework that combines pseudo-labels with open-vocabulary segmentation, using a 3D-to-2D voting mechanism and a query feature variance-based replay strategy to mitigate catastrophic forgetting while adapting to new classes.

Comments V3: Change title. Add further experiments

Abstract

Motion forecasting enables autonomous vehicles to anticipate scene evolution by predicting the future trajectories of dynamic agents. However, existing approaches typically assume a closed-world setting with a fixed object taxonomy and access to high-quality perception, limiting their applicability in the real world where perception is imperfect, and new object classes may emerge over time. In this work, we introduce class-incremental motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are predicted directly from camera images. We propose the first end-to-end framework for this setting, which adapts to newly introduced classes while mitigating catastrophic forgetting of previously learned ones. Our method generates motion forecasting pseudo-labels for known classes and matches them with 2D instance masks from an open-vocabulary segmentation model. This 3D-to-2D keypoint voting mechanism filters inconsistent and overconfident predictions, while a query feature variance-based replay strategy samples informative past sequences to preserve prior knowledge. Extensive evaluations on nuScenes and Argoverse 2 show that our approach successfully preserves performance on known classes while effectively adapting to novel ones. We further demonstrate zero-shot transfer to real-world driving and show that the framework extends naturally to open- and closed-loop end-to-end class-incremental planning on nuScenes and NeuroNCAP. Code and models will be made publicly available at https://omen.cs.uni-freiburg.de.

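2605.23733 2026-06-19 cs.RO cs.AI (updated version)

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Ming Yang, Tao Yu, Feng Li, Hua Chen

Affiliations: LimX Dynamics

AI summary: Proposes the Any2Any paradigm, which uses kinematic alignment and dynamics fine-tuning to efficiently transfer pretrained whole-body tracking models to new humanoid embodiments, reaching competitive tracking performance with only a small amount of data and compute.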

Comments Project Page: https://any2any.top/

Abstract

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots. More results and videos are available on our project page: https://any2any.top/.

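2602.01425 2026-06-19 cs.AI cs.LG (updated version)

One Probe Won't Catch Them All: Towards Targeted Deception Detection

Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom

Affiliations: LASR Labs; UK AI Security Institute

AI summary: Addressing the heterogeneity of linear probes for deception detection, shows that matching probes to specific deception types substantially improves performance (+0.108 AUC) and recommends that organizations define their threat models and deploy matched probes.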

Abstract

Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we demonstrate that deception detection is inherently heterogeneous: while a single universal probe achieves modest improvements (+0.032 AUC), post-hoc oracle analysis reveals substantially higher potential (+0.108 AUC) when probes are matched to specific deception types, and synthetic validation experiments suggest this ceiling is achievable a priori when the deception type is known in advance. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given this heterogeneity, we conclude that organizations should define their specific threat models and deploy appropriately matched probes rather than seeking a universal deception detector.

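2602.23248 2026-06-19 cs.AI (updated version)

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Yegon Kim, Juho Lee

Affiliations: KAIST

AI summary: Proposes decoupled prover-verifier games (DPVG), which separate correctness from checkability by training a translator model that converts a fixed solver's solutions into a checkable form, improving checkability while preserving answer correctness and thereby addressing the legibility tax.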

Comments ICLR 2026 Workshop Trustworthy AI

2505.22829 2026-06-19 cs.LG cs.AI (updated version)

Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies

Chenruo Liu, Kenan Tang, Yao Qin, Qi Lei

Affiliations: Center for Data Science, New York University; Computer Science Department, University of California, Santa Barbara; Department of Electrical and Computer Engineering, University of California, Santa Barbara; Courant Institute for Mathematical Sciences & Center for Data Science, New York University

AI summary: Analyzes the conceptual and methodological synergies between distribution shift and AI safety, establishing two kinds of connections between specific shift types and fine-grained safety issues to promote deeper integration of the two research areas.

Comments 35 pages

2509.03122 2026-06-19 cs.CL cs.AI cs.LG (updated version)

From Construction to Injection: Edit-Based Fingerprints for Large Language Models

Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang

Affiliations: East China Normal University; Hasso Plattner Institute/University of Potsdam

AI summary: Proposes an end-to-end injected fingerprinting framework that uses code-mixing fingerprints and multi-candidate editing to address the imperceptibility and robustness challenges of fingerprints in black-box deployment.

Comments preprint

Abstract

Reliable model fingerprints are essential for protecting large language models (LLMs) against unauthorized redistribution and commercial misuse. In black-box deployment, verification is hindered by defensive filtering of suspected fingerprint queries, as well as by downstream model modifications that may weaken embedded ownership evidence. These risks require fingerprints to be robust in both construction and injection. For construction, prior paradigms face an imperceptibility trade-off: natural-language fingerprints may be accidentally activated, whereas garbled fingerprints are statistically exposed and easier to filter. For injection, existing methods struggle to preserve persistent trigger--target behaviors under model modification. We propose an end-to-end injected fingerprinting framework to address these challenges. Code-mixing Fingerprints (CF) use lowest-perplexity code-mixing under a high-complexity constraint to mitigate this two-sided imperceptibility trade-off. Multi-Candidate Editing (MCEdit) constructs structurally redundant, margin-separated trigger--target mappings to enable graceful degradation under model modification. Extensive evaluations on imperceptibility, detectability, and harmlessness demonstrate robust ownership verification with negligible impact on utility.

2511.04260 2026-06-19 cs.CV cs.AI (updated version)

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

Affiliations: Department of Mathematics and Computer Science; University of Catania

AI summary: Proposes Proto-LeakNet, which exploits signal-leak traces in diffusion models and combines closed-set classification with density-based open-set evaluation for interpretable generator attribution; trained on closed-set data, it remains effective on unseen generators.

Comments 44 pages, 27 figures, 11 tables

DOI: 10.1016/j.cviu.2026.104848

Abstract

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates Closed-set classification with a density-based Open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13\%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet .

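2602.04306 2026-06-19 cs.CL cs.AI (updated version)

DeFrame: Debiasing Large Language Models Against Framing Effects

Kahee Lim, Soyeon Kim, Steven Euijong Whang

Affiliations: KAIST

AI summary: To address LLMs producing inconsistent bias across semantically equivalent but differently worded prompts, proposes a framing-aware debiasing method that quantifies framing disparity and strengthens cross-framing consistency, reducing overall bias and improving robustness.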

Comments Accepted to Findings of ACL 2026

Abstract

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.

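2603.19423 2026-06-19 cs.CR cs.AI cs.LG (updated version)

TRAP: A Benchmark for Task-completion and Resistance to Active Privacy-extraction

Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

Affiliations: Dept. of Electrical Engineering, POSTECH; Grad. School of Artificial Intelligence, POSTECH; School of Computing, KAIST

AI summary: Proposes the TRAP benchmark, which evaluates how well agents balance task accuracy against privacy leakage in document-intensive tasks; finds non-trivial leakage in all models, proves that prompt-based defenses cannot jointly achieve high task success and zero leakage probability, and proposes structural private field isolation.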

Abstract

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.
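
Structural private field isolation, as described above, can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: the placeholder format, the salt, and the helper names `isolate_private_fields` and `resolve_tool_args` are hypothetical.

```python
import hashlib


def isolate_private_fields(document, private_keys, salt="demo-salt"):
    """Swap private values for opaque hash keys before the document reaches the model.

    Returns the redacted document and a vault mapping hash keys back to real values.
    """
    redacted, vault = {}, {}
    for name, value in document.items():
        if name in private_keys:
            digest = hashlib.sha256(f"{salt}:{name}:{value}".encode()).hexdigest()[:12]
            key = f"<PRIVATE:{digest}>"
            vault[key] = value
            redacted[name] = key
        else:
            redacted[name] = value
    return redacted, vault


def resolve_tool_args(tool_args, vault):
    """Substitute hash keys with real values just before the tool runs, outside the model."""
    return {arg: vault.get(val, val) for arg, val in tool_args.items()}


doc = {"name": "A. Traveler", "passport_number": "X1234567", "destination": "Lisbon"}
redacted, vault = isolate_private_fields(doc, {"passport_number"})

# The model only ever sees `redacted`; it can still pass the placeholder to a tool call.
model_tool_call = {"passport": redacted["passport_number"], "to": redacted["destination"]}
resolved = resolve_tool_args(model_tool_call, vault)
```

Because the raw value never enters the model's context in this sketch, an attack query can at most elicit the placeholder, while the tool call still receives the real field after resolution.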

2506.14990 2026-06-19 cs.AI 版本更新

MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

MEAL: 持续多智能体强化学习基准

Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Constantin Ruhdorfer, Bram Grooten, Fabrice Kusters, Yali Du, Andreas Bulling, Mykola Pechenizkiy, Meng Fang

Affiliations: Eindhoven University of Technology, The Netherlands; University of Edinburgh, UK; University of Stuttgart, Germany; King's College London, UK; University of Liverpool, UK

AI summary: Proposes the MEAL benchmark, which uses JAX and GPU acceleration to train on sequences of 100 tasks and reveals failure modes that emerge in long sequences.

Comments To be published in the International Conference on Machine Learning (ICML) 2026

2603.28387 2026-06-19 cs.AI cs.LG Updated version

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

脚手架效应：提示框架如何驱动临床VLM评估中的表面多模态增益

Doan Nam Long Vu, Simone Balloccu

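Affiliations: Technical University of Darmstadt

AI summary: Finds that, in clinical VLM evaluation, merely mentioning MRI availability in the prompt accounts for 70-80% of the observed shift, regardless of whether image data is present; this "scaffold effect" shows that surface-level evaluation does not reflect genuine multimodal reasoning.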

AI中文摘要

可信的临床AI要求性能提升反映真实的证据整合而非表面伪影。我们在两个临床神经影像队列\textsc{FOR2107}（情感障碍）和\textsc{OASIS-3}（认知衰退）上评估了12个开源视觉语言模型（VLM）的二分类性能。两个数据集都包含结构MRI数据，但这些数据不携带可靠的个体级诊断信号。在这些条件下，较小的VLM在引入神经影像上下文后F1分数提升高达58%，蒸馏模型变得与规模大一个数量级的模型相当。对比置信度分析显示，仅仅在任务提示中\textit{提及}MRI可用性就解释了70-80%的转变，与影像数据是否存在无关，这是模态坍塌的一个领域特定实例，我们称之为\textit{脚手架效应}。专家评估揭示了在所有条件下捏造基于神经影像的正当理由，而偏好对齐虽然消除了引用MRI的行为，却使两种条件都退化为随机基线。我们的发现表明，表面评估不足以作为多模态推理的指标，这对VLM在临床环境中的部署有直接影响。

English abstract

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.


2604.05435 2026-06-19 cs.AI Updated version

CareTransition-Audit: A Benchmark to Audit Discharge Summaries for Efficient Care Transitions

CareTransition-Audit：用于高效护理过渡的出院总结审计基准

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava, Shivali Dalmia, Abhishek Mukherji

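Affiliations: Department of Computer Science & Engineering, University of Minnesota-Twin Cities, Minneapolis, USA; Centific AI Research, Redmond, USA

AI summary: Proposes an LLM-based automated framework that audits discharge-summary completeness with a 46-item checklist and benchmarks 11 models on MIMIC-IV; the best models reach a Cohen's kappa of about 0.5 against clinician labels, and all models struggle to identify ambiguous documentation.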

Comments Accepted as a poster at IEEE-ICHI 2026; Accepted at SD4H@ICML

AI中文摘要

不完整或不一致的出院文档会导致护理碎片化和可避免的再入院。尽管其在患者安全中至关重要，但审计出院总结依赖于人工审查且无法扩展。我们提出一个使用大语言模型（LLM）的自动化审计框架。我们的方法将DISCHARGED框架操作化为一个包含46个问题的检查清单。使用来自MIMIC-IV数据库的50份总结及临床医生真实标签，我们对11个LLM进行基准测试。模型评估的平均文档完整性范围为54.9%至74.2%，最佳模型与临床医生标签的Cohen's kappa值约为0.5，表明中等一致性。所有模型在识别模糊文档（Unclear）方面均存在困难，突显了当前自动化审计的关键差距。本工作为临床文档的系统性质量改进提供了临床医生验证的基准和零样本基线。

English abstract

Incomplete or inconsistent discharge documentation drives care fragmentation and avoidable readmissions. Despite its critical role in patient safety, auditing discharge summaries relies on manual review and does not scale. We propose an automated framework for auditing discharge summaries using large language models (LLMs). Our approach operationalizes the DISCHARGED framework into a checklist of 46 questions. Using 50 summaries from the MIMIC-IV database, with clinician ground-truth labels, we benchmark 11 LLMs. Model-assessed mean documentation completeness ranges from 54.9% to 74.2%, and the best-performing models achieve Cohen's kappa values around 0.5 against clinician labels, indicating moderate agreement. All models struggle to identify ambiguous documentation (Unclear), highlighting a key gap in current automated auditing. This work provides a clinician-validated benchmark and zero-shot baselines for systematic quality improvement in clinical documentation.
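For readers unfamiliar with the agreement statistic quoted above, Cohen's kappa compares observed agreement with the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences; the `Yes`/`No` labels and the six example answers are invented for illustration (only the `Unclear` label is named in the abstract) and are not data from the benchmark.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label]
                   for label in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative checklist answers only.
model_labels = ["Yes", "Yes", "No", "Unclear", "No", "Yes"]
clinician_labels = ["Yes", "No", "No", "Yes", "No", "Yes"]
kappa = cohens_kappa(model_labels, clinician_labels)
```

In this toy case the raters agree on 4 of 6 items (p_o = 2/3) while chance agreement is 15/36, giving kappa = 3/7, roughly 0.43.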


2604.07593 2026-06-19 cs.AI Updated version

Too long; didn't solve

太长；没解决

Lucía M. Cabrera, Isaac Saxton-Knight, Jocelyn D'Arcy

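Affiliations: Instituto Balseiro; Poindexter Labs

AI summary: Studies how prompt length and solution length relate to large language model performance on math problems, finding that both correlate positively with model failure rate.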

2605.25160 2026-06-19 cs.AI Updated version

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

Md Tawkat Islam Khondaker, Raymond Li, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Issam H. Laradji

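Affiliations: ServiceNow AI Research

AI summary: Proposes the DRFLOW benchmark, which evaluates how well AI agents predict personalized workflows from heterogeneous sources; it comprises 100 tasks across 5 domains and 7 diagnostic metrics, and experiments show that existing agents achieve limited performance.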

AI中文摘要

深度研究（DR）系统越来越多地用于复杂信息寻求任务，但现有工作主要关注生成报告和摘要。相比之下，许多企业任务需要代理识别具体的工作流，即一系列行动步骤。例如，代理不应总结预算政策，而应能确定回答诸如“在固定预算下如何申请新员工？”这类问题所需的步骤。因此，我们引入DRFLOW，一个用于评估代理从异构源预测个性化工作流的基准。每个任务要求代理从分散来源中识别相关证据，然后使用这些证据预测用户任务的正确行动步骤序列。DRFLOW包含跨五个领域的100个任务，1246个参考工作流步骤，基于超过3900个来源。我们定义了七个诊断指标，涵盖事实依据、步骤恢复、结构排序、条件解决和个性化。我们进一步提出DRFLOW-Agent（DRFA），一个面向工作流的参考代理，用于预测个性化工作流。我们表明，尽管DRFA相比强基线代理有所改进（平均F1分数提升高达10.02%），但在这些工作流指标上仍有很大的改进空间，表明预测完整且正确的个性化工作流仍然是深度研究的一个挑战性前沿。

English abstract

Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing work mainly focuses on generating reports and summaries. In contrast, many enterprise tasks require an agent to identify concrete workflows, i.e., sequences of action steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as: "How do I request new headcount given a fixed budget?". Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent for predicting personalized workflows. We show that although DRFA improves over strong baseline agents (by up to 10.02% average F1 score), substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.


2606.18950 2026-06-19 cs.AI Updated version

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench: 视觉语言模型战略推理的RTS基准

San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

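Affiliations: Seoul National University

AI summary: Proposes RTSGameBench, built on the game Beyond All Reason, which evaluates the strategic reasoning of vision-language models in real-time strategy games through diverse matchups, mini-game diagnostics, and a self-evolving generation framework.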

Comments First two authors contributed equally

AI中文摘要

现代视觉语言模型（VLM）在竞争和合作环境中的不确定性下，往往难以进行战略推理，即预测和影响其他智能体的行为。实时策略（RTS）游戏可以作为诊断这一局限性的自然测试平台，因为它们要求与盟友协调、适应对手策略，并在部分可观测性下进行长期规划。然而，现有的RTS基准评估范围有限，缺乏系统的能力诊断，并且局限于预设计的场景覆盖。为了解决这些限制，我们提出了RTSGameBench，它建立在Beyond All Reason之上，这是一款大规模RTS游戏，其扩展战场要求比现有测试平台更广泛的策略多样性。该基准通过多种对战结构提供评估，通过迷你游戏进行诊断性评估，每个迷你游戏针对单个战略能力，并通过自进化生成框架实现可扩展的覆盖，该框架将自由形式的查询转化为新的迷你游戏，并在连续循环中改进。此外，为了让VLM在大规模RTS游戏中运行，我们提供了RTSGameAgent，它通过具有智能体记忆的有限状态机（FSM）管理单位。我们通过实验验证，多个最先进的VLM在对战需要更紧密协调、多智能体协调以及任务规模增加时表现不佳。

English abstract

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategies, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed to pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than existing testbeds. The proposed benchmark provides evaluations through diverse gameplay across various matchup structures, diagnostic assessment via mini-games, each targeting an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games, improving over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent, which manages units through a finite-state machine (FSM) with agentic memory. We empirically validate that multiple state-of-the-art VLMs do not perform well when matchups demand tighter coordination or multi-agent coordination, or when task scale increases.


2606.19245 2026-06-19 cs.AI cs.LG Updated version

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP：分析AI代理在小分子临床前药理学中的表现

Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

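Affiliations: LatchBio

AI summary: Proposes the TxBench-PP benchmark for evaluating whether AI agents can recover preclinical pharmacology conclusions from real assay data; the strongest configuration, Claude Opus 4.8 / Pi, passes only 59.3% of endpoint attempts.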

AI中文摘要

人工智能（AI）代理有望通过压缩解释和决策循环来加速药物发现，但实际部署需要基于现实程序决策的可信评估。我们引入了TherapeuticsBench临床前药理学（TxBench-PP），这是一个针对小分子临床前药理学的可验证基准，也是更广泛的TherapeuticsBench在药物发现阶段和治疗模式中的首个聚焦切片。TxBench-PP测试代理是否能够从真实实验数据中恢复准确的结论，而非从文献中记忆的事实。该基准包含100个评估，按程序阶段、实验类型和任务结构索引，涵盖作用机制（MoA）和药效学（PD）推理、化合物-靶点结合、因果靶点验证、可开发性与安全性以及转化疗效。代理接收现实的工作流程快照，在编码环境中检查文件，并返回确定性评分的结构化答案。在16个模型-工具配置（包括11个模型和4,800条轨迹）中，没有系统能够可靠地恢复临床前药理学决策。最强配置Claude Opus 4.8 / Pi通过了59.3%的端点尝试（178/300；95% CI, 51.1-67.6），其次是GPT-5.5 / Pi，为55.3%（166/300；47.0-63.6）。

English abstract

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).


2507.19653 2026-06-19 cs.NI cs.AI cs.LG Updated version

On the Limitations of Ray-Tracing for Learning-Based RF Tasks in Urban Environments

关于射线追踪在城市环境中基于学习的射频任务局限性的研究

Armen Manukyan, Hrant Khachatrian, Edvard Ghukasyan, Theofanis P. Raptis

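Affiliations: Yerevan State University, Yerevan, Armenia; YerevaNN, Yerevan, Armenia; Institute of Informatics and Telematics, National Research Council, Pisa, Italy

AI summary: Evaluates the Sionna ray-tracing simulator against real measurements from central Rome, finding that antenna location and orientation strongly affect fidelity while solver hyperparameters have little effect; after optimization, correlation improves by 5% to 130% and localization error drops by one-third, but residual urban noise remains a challenge.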

Comments This work was supported by funding under the bilateral agreement between CNR (Italy) and HESC MESCS RA (Armenia) as part of the DeepRF project for the 2025-2026 biennium, and by the HESC MESCS RA grant No. 22rl-052 (DISTAL)

Journal ref 2026 IEEE Wireless Communications and Networking Conference (WCNC)

DOI: 10.1109/WCNC65185.2026.11555460

AI中文摘要

我们研究了Sionna v1.0.2射线追踪在罗马市中心户外蜂窝链路中的真实感。我们使用了包含1,664个用户设备（UE）和六个名义基站（BS）站点的真实测量数据集。利用这些固定位置，我们系统地改变了主要仿真参数，包括路径深度、漫反射/镜面反射/折射标志、载波频率，以及天线的属性如高度、辐射方向和方向图。通过测量功率与仿真功率之间的Spearman相关性，以及基于RSSI指纹的k近邻定位算法，对每个基站的仿真保真度进行评分。在所有实验中，求解器超参数对所选指标的影响微不足道。相反，天线位置和方向被证明是决定性的。通过简单的贪婪优化，我们将不同基站的Spearman相关性提高了5%到130%，而仅使用仿真数据作为参考点的kNN定位误差在真实世界样本上减少了三分之一，但仍比纯真实数据的误差高一倍。因此，精确的几何形状和可信的天线模型是必要但不充分的；忠实地捕捉残余的城市噪声仍然是实现可迁移、高保真户外射频仿真的一个开放挑战。

English abstract

We study the realism of Sionna v1.0.2 ray-tracing for outdoor cellular links in central Rome. We use a real measurement set of 1,664 user equipments (UEs) and six nominal base-station (BS) sites. Using these fixed positions, we systematically vary the main simulation parameters, including path depth, diffuse/specular/refraction flags, and carrier frequency, as well as antenna properties such as altitude, radiation pattern, and orientation. Simulator fidelity is scored for each base station via the Spearman correlation between measured and simulated powers, and by a k-nearest-neighbor (kNN) localization algorithm using RSSI-based fingerprints. Across all experiments, solver hyper-parameters have an immaterial effect on the chosen metrics. In contrast, antenna locations and orientations prove decisive. By simple greedy optimization, we improve the Spearman correlation by 5% to 130% for various base stations, while the kNN-based localization error using only simulated data as reference points decreases by one-third on real-world samples but remains twice as high as the error with purely real data. Precise geometry and credible antenna models are therefore necessary but not sufficient; faithfully capturing the residual urban noise remains an open challenge for transferable, high-fidelity outdoor RF simulation.
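The per-base-station fidelity score mentioned above, the Spearman correlation between measured and simulated powers, is the Pearson correlation of the two rank vectors. The following is a small dependency-free sketch; the power values (in dBm) are made up for illustration and do not come from the Rome dataset, and the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative received powers in dBm for five UE positions.
measured = [-70, -85, -60, -95, -80]
simulated = [-72, -90, -65, -88, -83]
rho = spearman(measured, simulated)
```

Here only two positions swap rank order, so rho = 0.9. Because the score depends on ranks alone, a simulator that gets absolute power levels wrong but orders locations correctly still scores well.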


2603.01250 2026-06-19 cs.CV cs.AI Updated version

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

MAMA-MIA挑战：推进乳腺MRI肿瘤分割与治疗反应预测的泛化性和公平性

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir

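Affiliations: Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona

AI summary: Presents the MAMA-MIA Challenge, a standardized benchmark for breast MRI tumor segmentation and pathologic complete response prediction that analyzes model generalizability and fairness on cross-continental, multi-center data and finds a trade-off between performance and subgroup fairness.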

AI中文摘要

乳腺癌是全球女性中最常诊断的恶性肿瘤，也是癌症相关死亡的主要原因之一。动态对比增强磁共振成像在肿瘤表征和治疗监测中发挥核心作用，尤其是接受新辅助化疗的患者。然而，现有的乳腺磁共振成像人工智能模型通常使用异质性数据集、研究人群和评估协议进行开发和评估，使得直接比较困难，并限制了跨机构和临床相关患者亚组的模型鲁棒性理解。MAMA-MIA挑战旨在通过提供标准化基准来解决这些问题，该基准用于联合评估原发性肿瘤分割和仅使用治疗前磁共振成像预测病理完全缓解。训练队列包括来自美国多家机构的1506名患者，而评估则在来自三个独立欧洲中心的574名患者的外部测试集上进行，以评估跨大陆和跨机构的泛化性。统一的评分框架结合了预测性能与年龄、绝经状态和乳腺密度方面的亚组一致性。26个国际团队参加了最终评估阶段。结果表明，在共同的外部评估框架下，性能存在显著差异，并揭示了整体准确性与亚组公平性之间的权衡。该挑战提供了标准化数据集、评估协议和公共资源，以促进开发稳健且公平的乳腺癌影像人工智能系统。

English abstract

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are typically developed and evaluated using heterogeneous datasets, study populations, and assessment protocols, making direct comparison difficult and limiting understanding of model robustness across institutions and clinically relevant patient subgroups. The MAMA-MIA Challenge was designed to address these challenges by providing a standardized benchmark for the joint evaluation of primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under a common external evaluation framework and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.


2604.13416 2026-06-19 cs.CV cs.AI Updated version

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

DF3DV-1K：用于无干扰新视角合成的大规模数据集与基准

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

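Affiliations: University of Technology Sydney; University of Sydney; National Yang Ming Chiao Tung University

AI summary: To address the lack of large-scale real-world datasets for distractor-free radiance fields, builds DF3DV-1K, a dataset of 1,048 scenes with clean and cluttered image sets per scene, and uses it to benchmark nine recent methods, identifying the most robust methods and the most challenging scenarios.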

AI中文摘要

辐射场领域的进展已实现逼真的新视角合成。在多个领域中，已开发出大规模真实世界数据集以支持全面基准测试并促进超越场景特定重建的进展。然而，对于无干扰辐射场，每个场景同时包含干净和杂乱图像的大规模数据集仍然缺乏，限制了发展。为填补这一空白，我们引入了DF3DV-1K，一个包含1048个场景的大规模真实世界数据集，每个场景提供干净和杂乱的图像集用于基准测试。该数据集总共包含89,924张使用消费级相机拍摄的图像，模拟随意拍摄，涵盖128种干扰类型和161种场景主题，包括室内和室外环境。一个精心挑选的41个场景子集DF3DV-41被系统设计用于评估无干扰辐射场方法在挑战性场景下的鲁棒性。利用DF3DV-1K，我们对九种最新的无干扰辐射场方法和3D高斯泼溅进行了基准测试，识别出最鲁棒的方法和最具挑战的场景。除了基准测试，我们还展示了DF3DV-1K的一个应用：微调基于扩散的2D增强器以改进辐射场方法，在保留集（例如DF3DV-41）和On-the-go数据集上实现了平均0.96 dB PSNR和0.057 LPIPS的提升。我们希望DF3DV-1K能促进无干扰视觉的发展，并推动超越场景特定方法的进步。数据集和排行榜可在以下网址获取：此 https URL。

English abstract

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.


2605.10873 2026-06-19 cs.CV cs.AI Updated version

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

CADBench：一个用于AI辅助CAD程序生成的多模态基准

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

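Affiliations: Massachusetts Institute of Technology

AI summary: Proposes CADBench, a unified benchmark for multimodal CAD program generation with 18,000 samples and six benchmark families; evaluating 11 vision-language systems reveals three common failure modes in CAD program generation.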

AI中文摘要

从图像或3D观测中恢复可编辑的CAD程序是AI辅助设计的核心，但进展难以衡量，因为现有评估分散在数据集、模态和指标上。我们引入CADBench，一个统一的多模态CAD程序生成基准。CADBench包含18000个评估样本，涵盖来自DeepCAD、Fusion 360、ABC、MCB和Objaverse的六个基准家族，五种输入模态包括干净的网格、噪声网格、单视图渲染、逼真渲染和多视图渲染，以及六个指标，涵盖几何保真度、可执行性和程序紧凑性。STEP-based家族按B-rep面数分层，所有家族均进行多样性采样，以支持在复杂性和物体变化方面的受控分析。我们评估了11种CAD专用和通用的视觉语言系统，生成超过140万个CAD程序。在理想输入下，专用的网格到CAD模型显著优于代码生成VLMs，后者仍远未可靠。CADBench进一步揭示了三种常见的失败模式：几何复杂性增加时重建质量下降，CAD专用模型在模态转移下可能变得脆弱，且模型排名在不同指标下会变化。这些结果将CADBench定位为衡量可编辑3D重建和多模态CAD理解进展的诊断测试平台。该基准在https://huggingface.co/datasets/DeCoDELab/CADBench上公开可用。

English abstract

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.


2606.17165 2026-06-19 stat.ME cs.AI econ.EM math.ST stat.TH Updated version

Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

基于LLM的A/B测试的统计基础：用于人类因果推断的替代指标框架

Joel Persson, Mårten Schultzberg, Sebastian Ankargren

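Affiliations: Spotify USA, Inc.

AI summary: Proposes a surrogacy framework showing that, under conditions weaker than distributional equivalence, calibrating LLM outputs identifies the average treatment effect, and analyzes the bias and variance introduced by LLM stochasticity.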

AI中文摘要

组织和研究者越来越有兴趣在A/B测试中使用大型语言模型（LLM）代替人类参与者，以期更快、更低成本地进行实验。我们研究当在LLM结果上估计的处理效应何时能够恢复在感兴趣的人类群体上测量的效应。LLM与人类结果之间的分布等价性会使任何标准估计量有效，但这不现实。因此，我们开发了一个统计框架，将替代终点理论适配到LLM。该框架表明，将LLM结果校准到人类结果，在替代性和可比性条件（联合弱于分布等价性）下，可以识别平均处理效应。当这些条件不成立时，感兴趣的效应仅部分可识别，我们提供了诊断方法，可以在历史实验上证伪替代性，并给出有限重叠下最坏情况偏差的界限。我们进一步证明，LLM固有的随机性会引入偏差和方差，但使用多次抽取的平均值作为替代指标可以同时缓解两者。我们在模拟和Upworthy标题的A/B测试应用中展示了方法和理论。我们工作的一个核心结论是，LLM结果作为替代指标的有效性只能对过去的处理被证伪，而无法对新处理被验证，因此对于新颖干预，人类实验仍然不可或缺。我们讨论了LLM选择、提示和温度作为设计变量的作用，以及如何确定人类实验的规模以进行验证。

English abstract

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs, showing that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. We present a falsification test for surrogacy and a bound on the worst-case bias from limited overlap between the LLM and human samples. We further show that the stochasticity inherent to LLMs can weaken surrogacy for identification while also introducing bias and variance during estimation, but that using an average over multiple LLM draws per unit as the surrogate mitigates these issues. Simulations validate the results, and an empirical application to A/B tests on Upworthy headlines shows that raw LLM predictions recover only 39% of the human treatment effect while nonparametric calibration closes the gap. A central takeaway is that A/B testing on LLMs yields correct results only by assumption, whereas A/B testing on humans is correct by design, and that the required assumptions are hardest to justify precisely where A/B testing on LLMs promises the greatest benefit. We discuss the role of LLM choice, prompting, and temperature as design variables, the compounded challenge posed by long-term outcomes, and how to size human pilot studies for validation.


2606.18613 2026-06-19 cs.CL cs.AI Updated version

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

LLMs 是否已准备好辅助医生？PhysAssistBench：交互式医患-电子病历辅助基准

Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

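Affiliations: Aalto University; Tencent; Harbin Institute of Technology, Shenzhen; Hong Kong Polytechnic University; Aarhus University; Technical University of Munich

AI summary: Proposes PhysAssistBench, which builds interactive patient agents to evaluate how well LLMs coordinate doctor-patient-EHR interactions, finding that current models are unreliable and that the bottleneck is coordination across multiple capabilities rather than any single capability.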

Comments 34 pages with 8 figures

AI中文摘要

医疗LLM最合理的近期角色是辅助而非替代医生，但当前的评估通常测试孤立能力：临床知识、EHR系统交互或患者沟通。而医生辅助需要在同一交互中协调这些能力，其中医生提出不明确的请求，患者模糊描述症状，EHR系统要求精确的工具使用。我们引入PhysAssistBench，一个用于交互式医患-EHR辅助的基准。基于真实的MIMIC-IV病例，PhysAssistBench使用可扩展的流水线构建交互式、记录驱动的患者代理，将静态EHR记录转化为多轮临床场景，同时保持临床事实准确性。PhysAssistBench提供了一个精选的双语评估集，包含1,296个经过人工审查和医生验证的轮次。与领先LLM的实验表明，当前模型在此设置下仍不可靠，这暴露了临床LLM的关键瓶颈：可靠的辅助需要知识、沟通和系统之间的协调，而非任何单一能力的孤立提升。

English abstract

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.


2606.18970 2026-06-19 cs.LG cs.AI cs.CV Updated version

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

脑MRI的量子潜GAN增强的受控基准测试

Syed Mujtaba Haider, Silvia Figini

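Affiliations: Department of Mathematics; Department of Political and Social Sciences

AI summary: Uses a controlled benchmark to compare quantum and classical generators for brain MRI data augmentation, finding that neither significantly outperforms training on real data alone and that the quantum generator offers no additional advantage.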

AI中文摘要

医学图像分类常受限于有限的标注数据，因此生成式增强被提出；最近，量子生成模型被用于此目的，并经常报告准确率提升。然而，这些声称通常基于单次训练运行，未匹配量子与经典生成器的参数预算，也未表征任何收益出现的数据范围。我们提出了一个受控基准测试，隔离量子生成器对脑MRI增强的贡献。图像被编码到KL正则化的潜在空间中，在该空间中，使用变分量子生成器或参数数量几乎相同的经典生成器（1648 vs. 1632）训练带有梯度惩罚的条件Wasserstein GAN。合成样本被解码并用于增强预训练分类器，覆盖从5%到100%的标注数据比例，通过八个随机种子进行配对显著性检验（多重比较校正）以及集内多样性和潜在分布分析。在所有比例下，没有增强变体显著优于仅用真实数据训练，且量子与经典生成器在统计上无法区分。任何低数据优势表现为正则化而非忠实的数据扩展：合成样本分布外移，并且在数据稀缺时严重模式崩溃，而量子生成器并不比经典生成器更多样化。我们发布该协议作为医学成像中量子生成增强严格评估的测试平台。

English abstract

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off-distribution and severely mode-collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.


2509.24725 2026-06-19 cs.LG cs.AI Updated version

Q-Net: Queue Length Estimation via Kalman-based Neural Networks

Q-Net：基于卡尔曼神经网络的队列长度估计

Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn

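Affiliations: University of Amsterdam; Delft University of Technology

AI summary: Proposes Q-Net, a framework that combines Kalman filtering with neural networks to fuse data sources for queue length estimation at signalized intersections, improving spatial transferability and real-time applicability and enabling accurate queue estimation without costly sensing equipment.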

Journal ref Transportation Research Part C: Emerging Technologies, Volume 190, September 2026, Article 105809

DOI: 10.1016/j.trc.2026.105809

AI中文摘要

估计信号交叉口的队列长度一直是交通管理中的长期挑战。尽管有两类隐私保护的数据源：(i) 接近停止线的环形检测器提供的车辆计数汇总数据，以及 (ii) 提供路段平均速度测量的汇总浮动汽车数据 (aFCD)，但如何将这些具有不同空间和时间分辨率的数据源整合用于队列长度估计仍不清楚。为此，本文提出Q-Net：一种基于状态空间形式的队列估计框架。该设计解决了队列建模中的关键挑战，如违反交通守恒假设。Q-Net遵循卡尔曼预测-更新结构，并在状态演变和测量模型中保持物理可解释性。Q-Net使用AI增强的卡尔曼滤波器从数据中学习时间变化的增益动态。该框架支持实时实现，并通过将aFCD测量分组为固定大小的局部组来提高空间转移性，使可学习参数的数量与路段长度无关。在荷兰 Rotterdam 城市主干道的评估显示，Q-Net优于基线方法，能够准确追踪队列的形成和消散，并缓解aFCD引起的延迟。通过结合数据效率、可解释性、实时适用性和空间转移性，Q-Net在无需昂贵的传感基础设施（如摄像头或雷达）的情况下实现了准确的队列长度估计。

English abstract

Estimating queue lengths at signalized intersections is a long-standing challenge in traffic management. Partial observability of vehicle flows complicates this task despite the availability of two privacy-preserving data sources: (i) aggregated vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD) that provide segment-wise average speed measurements. However, how to integrate these sources, which differ in spatial and temporal resolution, for queue length estimation remains unclear. To address this question, we present Q-Net: a queue estimation framework built upon a state-space formulation. This design addresses key challenges in queue modeling, such as violations of traffic conservation assumptions. Q-Net follows the Kalman predict-update structure and maintains physical interpretability in both the state evolution and measurement models. Q-Net uses an AI-augmented Kalman filter to learn time-varying gain dynamics from data. The framework supports real-time implementation and improves spatial transferability by grouping aFCD measurements into fixed-size local groups, making the number of learnable parameters independent of section length. Evaluations on urban main roads in Rotterdam, the Netherlands, show that Q-Net outperforms baseline methods, tracks queue formation and dissipation accurately, and mitigates aFCD-induced delays. By combining data efficiency, interpretability, real-time applicability, and spatial transferability, Q-Net makes accurate queue length estimation possible without costly sensing infrastructure like cameras or radar.
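The Kalman predict-update structure that the abstract refers to can be illustrated with a classical scalar filter. This is a minimal sketch and not Q-Net itself: the gain here is computed from fixed, assumed noise variances `q` and `r`, whereas the abstract describes learning time-varying gain dynamics from data. The function name, parameter names, and numeric values are all illustrative assumptions.

```python
def kalman_step(x, p, z, f=1.0, q=1.0, h=1.0, r=4.0):
    """One predict-update cycle of a scalar Kalman filter.

    x, p : previous state estimate (e.g. queue length) and its variance
    z    : new measurement
    f, h : state-transition and measurement coefficients (assumed)
    q, r : process and measurement noise variances (assumed)
    """
    # Predict: propagate the estimate and its uncertainty forward.
    x_pred = f * x
    p_pred = f * p * f + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + gain * (z - h * x_pred)
    p_new = (1.0 - gain * h) * p_pred
    return x_new, p_new, gain


# Illustrative numbers: prior estimate of 10 vehicles, measurement of 14.
x_new, p_new, gain = kalman_step(x=10.0, p=3.0, z=14.0)
```

With these values the predicted variance is 4, the gain is 0.5, and the updated estimate lands halfway between prediction and measurement at 12. A learned gain would replace the closed-form `gain` expression while keeping the same two-stage structure.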

2510.00831 2026-06-19 cs.AI cs.LG eess.SP (updated version)

Controlled Comparison of Machine Learning Models for Fault Classification and Localization in Power System Protection

Julian Oelhaf, Georg Kordowich, Changhun Kim, Paula Andrea Pérez-Toro, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer

Affiliations: Department of Electrical Engineering, Media and Computer Science, Ostbayerische Technische Hochschule Amberg-Weiden

AI summary: Compares machine learning models for fault classification and localization on a common electromagnetic transient dataset with 10-50 ms decision windows. Classification reaches F1 > 0.98 at 10 ms, and localization error stays at about 10% of line length.

Comments Accepted at IEEE PES Innovative Smart Grid Technologies Europe 2026 (ISGT Europe 2026). Pre-camera-ready author version; final proceedings version may differ

Abstract

The increasing complexity of modern power systems, driven by the integration of inverter-based and distributed energy resources, challenges the reliability of conventional protection schemes and motivates the use of machine learning for protection tasks. However, published results are often difficult to compare because datasets, sensing assumptions, and decision horizons vary across studies. This paper presents a controlled comparison of machine learning models for fault classification (FC) and fault localization (FL) under identical sensing, timing, and validation conditions on a common electromagnetic transient dataset, using decision windows of 10-50 ms to reflect protection-relevant time scales. For FC, the best-performing nonlinear models achieve F1 scores above 0.98 already at 10 ms, while lower-capacity models degrade at shorter horizons but improve with longer windows, indicating that relevant fault-type information is already present in the earliest transient. For FL, the top-performing models reach a stable localization error of about 10 % of normalized line length across all evaluated horizons, while weaker models form a clearly separated second performance tier. Line-resolved analysis shows that localization accuracy varies across grid segments, indicating topology-dependent difficulty rather than insufficient temporal context alone. These findings provide a controlled reference for comparing machine learning models across two protection tasks with fundamentally different information requirements.
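The F1 score reported for fault classification can be computed from per-class precision and recall. The sketch below is illustrative only: the abstract does not say which averaging scheme is used, so macro averaging is an assumption, and the fault-type labels in the example are hypothetical.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score of one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical fault-type labels for four events.
y_true = ["AG", "AG", "BC", "ABC"]
y_pred = ["AG", "BC", "BC", "ABC"]
score = macro_f1(y_true, y_pred)
```

In this toy case one "AG" fault is misread as "BC", so both of those classes score 2/3 while "ABC" scores 1, giving a macro F1 of 7/9.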

2602.00510 2026-06-19 cs.AI cs.LG cs.SE (updated version)

Global Ease of Living Index: A Machine Learning Framework for Longitudinal Analysis of Major Economies

Arun Kumar Selvaraj, Tanay Panat, Rohitash Chandra

Affiliations: Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics; Centre for Artificial Intelligence and Innovation; Pingla Institute

AI summary: Proposes the Global Ease of Living Index, which combines socio-economic and infrastructural factors, uses machine learning to handle missing data, and reduces dimensionality with principal component analysis and factor analysis, giving policymakers a practical tool for improving quality of life.

Abstract

The drastic changes in the global economy, geopolitical conditions, and disruptions such as the COVID-19 pandemic have impacted the cost of living and quality of life. It is essential to comprehend the long-term implications of the cost of living and quality of life in major economies. A transparent and comprehensive living index must include multiple dimensions of living conditions. In this study, we present an approach to quantifying the quality of life through the Global Ease of Living Index that combines various socio-economic and infrastructural factors into a single composite score. Our index utilises economic indicators that define living standards, which could help in targeted interventions to improve specific areas. We present a machine learning framework to address missing data for certain economic indicators in specific countries. We then curate and update the data and use a dimensionality reduction approach (Principal Component Analysis and Factor Analysis) to create the Ease of Living Index for major economies since 1970. Our work significantly adds to the literature by offering a practical tool for policymakers to identify areas needing improvement, such as healthcare systems, employment opportunities, and public safety. Our approach with open data and code can be easily reproduced and applied to various contexts, providing transparency and accessibility for ongoing research and policy development in quality-of-life assessment.

2506.01678 2026-06-19 cond-mat.mtrl-sci cs.AI (updated version)

Overcoming Labelled Data Scarcity for Defect Classification in Scanning Tunneling Microscopy

Nikola L. Kolev, Max Trouton, Filippo Federici Canova, Geoff Thornton, David Z. Gao, Neil J. Curson, Taylor J. Z. Stock

Affiliations: London Centre for Nanotechnology, University College London; Department of Electronic and Electrical Engineering, University College London; Department of Physics and Astronomy, University College London; Department of Chemistry, University College London; Aalto Science Institute, School of Science, Aalto University; Nanolayers Research Computing LTD, London, UK; Department of Physics, NTNU Norwegian University of Science and Technology

AI summary: Proposes an automated segmentation method combining few-shot and unsupervised learning that classifies defects in STM images with high accuracy from only a small amount of labelled data, and shows strong generalisation on three surfaces.

Abstract

Scanning tunnelling microscopy (STM) is a powerful technique for imaging surfaces with atomic resolution, providing insight into physical and chemical processes at the level of single atoms and molecules. A regular task of STM image analysis is the identification and labelling of features of interest against a uniform background. Performing this manually is a labour-intensive task, requiring significant human effort. To reduce this burden, we propose an automated approach to the segmentation of STM images that uses both few-shot learning and unsupervised learning. Our technique offers greater flexibility compared to previous supervised methods; it removes the requirement for large manually annotated datasets and is thus easier to adapt to an unseen surface while still maintaining a high accuracy. We demonstrate the effectiveness of our approach by using it to recognise atomic features on three distinct surfaces: Si(001), Ge(001), and TiO$_2$(110), including adsorbed AsH$_3$ molecules on the silicon and germanium surfaces. Our model exhibits strong generalisation capabilities, and following initial training, can be adapted to unseen surfaces with as few as one additional labelled data point. This work is a significant step towards efficient and material-agnostic, automatic segmentation of STM images.

2511.08378 2026-06-19 cs.IR cs.AI (updated version)

Bid Farewell to Seesaw: Towards Accurate Long-tail Session-based Recommendation via Dual Constraints of Hybrid Intents

Xiao Wang, Ke Qin, Dongyang Zhang, Xiurui Xie, Shuang Liang

Affiliations: University of Electronic Science and Technology of China

AI summary: Targets the see-saw between accuracy and diversity caused by long-tail distributions in session-based recommendation. Proposes HID, a hybrid intent-based dual constraint framework that rebuilds the item-to-intent mapping with attribute-aware spectral clustering, separates out noise intents, and combines diversity and accuracy constraint losses to improve both long-tail performance and accuracy.

Comments accepted by AAAI 2026 Oral

Abstract

Session-based recommendation (SBR) aims to predict anonymous users' next interaction based on their interaction sessions. In the practical recommendation scenario, low-exposure items constitute the majority of interactions, creating a long-tail distribution that severely compromises recommendation diversity. Existing approaches attempt to address this issue by promoting tail items but incur accuracy degradation, exhibiting a "see-saw" effect between long-tail and accuracy performance. We attribute such conflict to session-irrelevant noise within the tail items, which existing long-tail approaches fail to identify and constrain effectively. To resolve this fundamental conflict, we propose HID (Hybrid Intent-based Dual Constraint Framework), a plug-and-play framework that transforms the conventional "see-saw" into "win-win" through introducing the hybrid intent-based dual constraints for both long-tail and accuracy. Two key innovations are incorporated in this framework: (i) Hybrid Intent Learning, where we reformulate the intent extraction strategies by employing attribute-aware spectral clustering to reconstruct the item-to-intent mapping. Furthermore, discrimination of session-irrelevant noise is achieved through the assignment of the target and noise intents to each session. (ii) Intent Constraint Loss, which incorporates two novel constraint paradigms regarding diversity and accuracy to regulate the representation learning process of both items and sessions. These two objectives are unified into a single training loss through rigorous theoretical derivation. Extensive experiments across multiple SBR models and datasets demonstrate that HID can enhance both long-tail performance and recommendation accuracy, establishing new state-of-the-art performance in long-tail recommender systems.

2601.00014 2026-06-19 eess.SP cs.AI cs.LG (updated version)

Modeling Day-Long ECG Signals to Predict Heart Failure Risk with Explainable AI

Eran Zvuloni, Ronit Almog, Michael Glikson, Shany Brimer Biton, Ilan Green, Izhar Laufer, Offer Amir, Joachim A. Behar

Affiliations: Leumit Health Services

AI summary: Proposes DeepHHF, a deep learning model that predicts five-year heart failure risk from 24-hour single-lead ECG data. It reaches an AUC of 0.80, outperforming a 30-second-segment model and a clinical score, and explainability analysis shows that it attends to arrhythmias and heart abnormalities.

Abstract

Heart failure (HF) affects 11.8% of adults aged 65 and older, reducing quality of life and longevity. Preventing HF can reduce morbidity and mortality. We hypothesized that artificial intelligence (AI) applied to 24-hour single-lead electrocardiogram (ECG) data could predict the risk of HF within five years. To research this, the Technion-Leumit Holter ECG (TLHE) dataset, including 69,663 recordings from 47,729 patients, collected over 20 years was used. Our deep learning model, DeepHHF, trained on 24-hour ECG recordings, achieved an area under the receiver operating characteristic curve of 0.80 that outperformed a model using 30-second segments and a clinical score. High-risk individuals identified by DeepHHF had a two-fold chance of hospitalization or death incidents. Explainability analysis showed DeepHHF focused on arrhythmias and heart abnormalities. This study highlights the feasibility of deep learning to model 24-hour continuous ECG data, capturing paroxysmal events essential for reliable risk prediction. Artificial intelligence applied to single-lead Holter ECG is non-invasive, inexpensive, and widely accessible, making it a promising tool for HF risk prediction.
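The area under the ROC curve quoted above has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case. The sketch below computes it that way by pairwise comparison. The function name `roc_auc` and the example scores are hypothetical and unrelated to the TLHE data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the event (e.g. later heart failure), 0 otherwise
    scores: predicted risk, higher meaning more at risk
    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores for five patients, two of whom had the event.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Five of the six positive-negative pairs are ordered correctly in this toy case, so the AUC is 5/6. The pairwise loop is quadratic in the number of cases; a rank-based formula would be the usual choice for large datasets.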

2601.02149 2026-06-19 cond-mat.mes-hall cond-mat.dis-nn cs.AI (updated version)

AI-enhanced tuning of quantum dot Hamiltonians toward Majorana modes

Mateusz Krawczyk, Jarosław Pawłowski

Affiliations: Institute of Theoretical Physics, Wrocław University of Science and Technology

AI summary: Proposes a neural network-based model that learns the working regimes of quantum dot simulators and uses transport measurements to autotune devices toward Majorana modes. The model is trained without supervision on synthetic conductance maps, using a physics-informed loss that incorporates key properties of Majorana zero modes.

Comments 12 pages, 8 figures, 2 tables

Journal ref Phys. Rev. Applied 25, 064032 (2026)

DOI: 10.1103/xkbl-ctwn

Abstract

We propose a neural network-based model capable of learning the broad landscape of working regimes in quantum dot simulators, and of using this knowledge to autotune these devices, based on transport measurements, toward obtaining Majorana modes in the structure. The model is trained in an unsupervised manner on synthetic data in the form of conductance maps, using a physics-informed loss that incorporates key properties of Majorana zero modes. We show that, with appropriate training, a deep vision-transformer network can efficiently memorize the relation between Hamiltonian parameters and structures in conductance maps and use it to propose parameter updates for a quantum dot chain that drive the system toward the topological phase. Starting from a broad range of initial detunings in parameter space, a single update step is sufficient to generate nontrivial zero modes. Moreover, by enabling an iterative tuning procedure, where the system acquires updated conductance maps at each step, we demonstrate that the method can address a much larger region of the parameter space.

2604.08552 2026-06-19 cs.DB cs.AI (updated version)

Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

Josef Hardi, Martin J. O'Connor, Marcos Martinez-Romero, Jean G. Rosario, Stephen A. Fisher, Mark A. Musen

Affiliations: Division of Computational Medicine, Stanford University; Department of Biology, University of Pennsylvania

AI summary: Proposes an LLM-based metadata standardization system that queries standard guidelines and ontology services in real time. Validated on 839 HuBMAP records, it significantly improves prediction accuracy over an LLM-only approach.

Abstract

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. Even when standard metadata reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries standard reporting guidelines and authoritative biomedical terminology services in real time to retrieve canonically correct standards on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical approach to automated standardization of biomedical metadata.

2604.11556 2026-06-19 cs.SE cs.AI (updated version)

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

Haoran Ding, Zhaoguo Wang, Haibo Chen

Affiliations: Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University

AI summary: Proposes FM-Agent, a framework that uses LLMs to generate function-level specifications automatically and enables compositional reasoning about large systems. It finds 522 new bugs within 2 days in systems of up to 143k lines of code.

Abstract

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.
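A Hoare triple {P} f {Q} states that if precondition P holds before f runs, postcondition Q holds afterwards. The sketch below only illustrates what such a function-level specification says by checking it at run time. It is not how FM-Agent works: the abstract describes static reasoning against LLM-generated natural-language specifications, while the decorator `hoare`, the exception `ContractError`, and the two example functions here are all hypothetical.

```python
import functools


class ContractError(Exception):
    """Raised when a precondition or postcondition does not hold."""


def hoare(pre, post):
    """Wrap a function so that {pre} fn {post} is checked on every call."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            if not pre(*args):
                raise ContractError(f"precondition of {fn.__name__} violated")
            result = fn(*args)
            if not post(result, *args):
                raise ContractError(f"postcondition of {fn.__name__} violated")
            return result
        return wrapper
    return decorate


@hoare(pre=lambda xs: len(xs) > 0,
       post=lambda r, xs: r in xs and all(r >= x for x in xs))
def max_of(xs):
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best


@hoare(pre=lambda xs: len(xs) > 0,
       post=lambda r, xs: r >= 0)
def spread(xs):
    # Compositional step: the caller relies only on max_of's
    # postcondition (the result is an upper bound), not on its body.
    return max_of(xs) - min(xs)
```

The point of the composition is that `spread` can be judged correct from the specification of `max_of` alone, so each function is reasoned about separately.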

2606.12500 2026-06-19 cs.LG cs.AI (updated version)

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

Xian Liu, Carlo G. Prato, Gustav Markkula

AI summary: Replaces traditional rule-based behaviour models with machine learning models in traffic microsimulation and predicts crash frequency from simulated conflicts using extreme value theory. On five signalised intersections in Leeds, UK, the ML model improves prediction accuracy without site-specific calibration.

Abstract

Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.
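A two-dimensional Time-to-Collision value can be sketched for two vehicles on straight, constant-velocity paths. The abstract does not give the exact metric used in the paper, so the formulation below is an assumption: each vehicle is approximated by a disc, and the TTC is the first time the discs touch. The function name `ttc_2d` and the example values are hypothetical.

```python
import math


def ttc_2d(pos_a, vel_a, pos_b, vel_b, radius_a, radius_b):
    """Time until two constant-velocity discs first touch, or None.

    Positions are in metres, velocities in metres per second.
    """
    # Work in the frame of vehicle A: B moves with the relative velocity.
    dx, dy = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    dvx, dvy = vel_b[0] - vel_a[0], vel_b[1] - vel_a[1]
    reach = radius_a + radius_b

    # Contact when |d + v*t| = reach, i.e. a*t^2 + 2*b*t + c = 0.
    c = dx * dx + dy * dy - reach * reach
    if c <= 0.0:
        return 0.0  # already overlapping
    a = dvx * dvx + dvy * dvy
    if a == 0.0:
        return None  # no relative motion
    b = dx * dvx + dy * dvy
    disc = b * b - a * c
    if b >= 0.0 or disc < 0.0:
        return None  # moving apart, or passing without contact
    return (-b - math.sqrt(disc)) / a


# A approaches a stationary B head-on at 10 m/s with a 48 m gap.
head_on = ttc_2d((0.0, 0.0), (10.0, 0.0), (50.0, 0.0), (0.0, 0.0), 1.0, 1.0)
# Same speed, but B sits 5 m to the side, so the discs never touch.
near_miss = ttc_2d((0.0, 0.0), (10.0, 0.0), (50.0, 5.0), (0.0, 0.0), 1.0, 1.0)
```

In the head-on case the 48 m gap closes at 10 m/s, giving a TTC of 4.8 s. A conflict would then typically be flagged when the TTC falls below a chosen threshold; the threshold and the subsequent extreme value modelling are outside this sketch.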

2606.13794 2026-06-19 eess.SY cs.AI cs.RO cs.SY (updated version)

An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts

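Umut Demir, Aamir Ahmad, Walter Fichter

Affiliations: University of Stuttgart, Faculty of Aerospace Engineering and Geodesy, Institute of Flight Mechanics and Control (iFR)

AI summary: Proposes a method that learns the control effectiveness map through sparse identification of nonlinear dynamics and combines it with an online adaptation mechanism, enabling efficient nonlinear control allocation for overactuated aircraft with interpretability and low computational cost.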

Critique of World Models: A Generative Latent Prediction Architecture for World Modeling

Eric Xing, Mingkai Deng, Jinyu Hou

AI summary: Starting from the notion of "hypothetical thinking" in psychology, argues that the core goal of a world model is to simulate all actionable possibilities of the real world, and designs a Generative Latent Prediction (GLP) architecture based on stateful, hierarchical, multi-level, mixed continuous/discrete representations.

Abstract

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2509.02581 2026-06-19 cs.DL cs.AI (updated version)

Charting the Future of Scholarly Knowledge with AI: A Community Perspective

Azanzi Jiomekong, Hande Küçük McGinty, Keith G. Mills, Allard Oelen, Enayat Rajabi, Harry McElroy, Antrea Christou, Anmol Saini, Janice Anta Zebaze, Hannah Kim, Anna M. Jacyszyn, Gollam Rabby, Dirk Betz, Claudia Biniossek, Sanju Tiwari, Sören Auer

Affiliations: TIB Leibniz Information Centre for Science and Technology; Department of Computer Science, University of Yaounde 1; Department of Computer Science, Kansas State University; School of EECS, Louisiana State University; Management Science Department, Cape Breton University; Department of Development and Research, Performigence; Department of Engineering and Computer Science, Wright State University; Department of Physics, University of Yaounde 1; FIZ Karlsruhe, Leibniz Institute for Information Infrastructure; Sharda University, Delhi-NCR, India; L3S Research Center, Leibniz University of Han

AI summary: Taking a community perspective, identifies ways to foster cross-disciplinary dialogue, share challenges, categorize new collaborations, and shape future research directions in scholarly knowledge organization.

Comments 39 pages, 3 figures

Abstract

Despite the growing availability of tools designed to support scholarly knowledge extraction and organization, many researchers still rely on manual methods, sometimes due to unfamiliarity with existing technologies or limited access to domain-adapted solutions. Meanwhile, the rapid increase in scholarly publications across disciplines has made it increasingly difficult to stay current, further underscoring the need for scalable, AI-enabled approaches to structuring and synthesizing scholarly knowledge. Various research communities have begun addressing this challenge independently, developing tools and frameworks aimed at building reliable, dynamic, and queryable scholarly knowledge bases. However, limited interaction across these communities has hindered the exchange of methods, models, and best practices, slowing progress toward more integrated solutions. This manuscript identifies ways to foster cross-disciplinary dialogue, identify shared challenges, categorize new collaboration and shape future research directions in scholarly knowledge and organization.

2603.16648 2026-06-19 cs.AI (updated version)

Domain-Independent Dynamic Programming with Constraint Propagation

Imko Marijnissen, J. Christopher Beck, Emir Demirović, Ryo Kuroiwa

Comments 13 pages. To appear at the 36th International Conference on Automated Planning and Scheduling (ICAPS 2026)

Journal ref Proceedings of the International Conference on Automated Planning and Scheduling (2026) | Volume 36(1) | Pages 171-180

2602.05416 2026-06-19 cs.CE cs.AI cs.LG physics.ao-ph physics.flu-dyn (updated version)

Reduced-Order Surrogates for Forced Flexible Mesh Coastal-Ocean Models

Freja Høgholm Petersen, Jesper Sandvig Mariegaard, Rocco Palmitessa, Allan P. Engsig-Karup

Affiliations: DTU

Comments Submitted for peer-review in a journal. v2: revised version submitted to journal after minor revisions

2511.23071 2026-06-19 cs.CV cs.AI cs.CL (updated version)

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

Affiliations: Indian Institute of Technology Jodhpur

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

2602.14239 2026-06-19 cs.SI cs.AI cs.LG (updated version)

A Hybrid TGN-SEAL Model for Dynamic Graph Link Prediction

Nafiseh Sadat Sajadi, Behnam Bahrak, Mahdi Jafari Siavoshani

Affiliations: Department of Computer Engineering, Sharif University of Technology; Tehran Institute for Advanced Studies, Khatam University

Journal ref EPJ Data Science (2026)

2510.24435 2026-06-19 cs.AI (updated version)

Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning

Benjamin Grando Moreira

Affiliations: Universidade Federal de Santa Catarina

Comments 12 pages

Journal ref Proceedings of the 2026 Computer on the Beach

2507.23027 2026-06-19 cs.CV cs.AI (updated version)

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

Affiliations: Indian Institute of Technology Madras; All India Institute of Medical Sciences; Indian Institute of Technology Hyderabad

Comments Accepted at the MICCAI Workshop on "Medical Image Computing in Resource Constrained Settings & Knowledge Interchange (MIRASOL)" 2025

2406.15465 2026-06-19 cs.CL cs.AI (updated version)

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

Daniel Reichenpfader, Jonas Knupp, André Sander, Kerstin Denecke

Affiliations: Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland; ID Suisse AG, St. Gallen, Switzerland

1. Agents, Planning, and Decision-Making (8 papers)

SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

CogniFold: Always-On Proactive Memory via Cognitive Folding

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Oranits: Mission Assignment and Task Offloading in Open RAN-based ITS using Metaheuristic and Deep Reinforcement Learning

Policy-Embedded Graph Expansion: Networked HIV Testing with Diffusion-Driven Network Samples

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

2. Knowledge Representation, Reasoning, and Symbolic AI (1 paper)

KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

3. Multi-Agent Systems and Games (6 papers)

UniMM: A Unified Mixture Model Framework for Multi-Agent Simulation

Searching for Synergy in Shared Workspace Human-AI Collaboration

Simulation of Language Evolution under Regulated Social Media Platforms: A Synergistic Approach of Large Language Models and Genetic Algorithms

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

4. Search, Optimization, and Constraint Solving (1 paper)

Flickering Multi-Armed Bandits

5. Machine Learning and Representation Learning 15 papers

AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

Conditional Diffusion Guidance under Hard Constraint: A Stochastic Analysis Approach

PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

Wisdom of Committee: Diverse Distillation from Large Foundation Models and Domain Experts

A Deep Generative Model for Resting-State EEG Synthesis and Transferable Representation Learning

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

Beyond Reasoning Gains: Mitigating General-Capability Forgetting in Large Reasoning Models

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic Methods

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

Reinforcement-aware Knowledge Distillation for LLM Reasoning

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

Reinforcement Learning Foundation Models Should Already Be A Thing

6. Natural Language and Multimodal Intelligence 7 papers

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

TerraMind: Large-Scale Generative Multimodality for Earth Observation

Assessment of Personality Dimensions Across Situations in Dyadic Role-Play Scenarios

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

Vero: An Open RL Recipe for General Visual Reasoning

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

7. Robotics and Embodied Intelligence 7 papers

RoboSSM: Scalable In-context Imitation Learning via State-Space Models

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

Movement Primitives in Robotics: A Comprehensive Survey

PiDR: Physics-Informed Inertial Dead Reckoning for Autonomous Platforms

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

Class-Incremental Motion Forecasting

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

8. Trustworthiness, Safety, and AI Governance 12 papers

One Probe Won't Catch Them All: Towards Targeted Deception Detection

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies

From Construction to Injection: Edit-Based Fingerprints for Large Language Models

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

DeFrame: Debiasing Large Language Models Against Framing Effects

The Autonomy Tax: Defense Training Breaks LLM Agents

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

Large Language Models Hack Rewards, and Society

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

9. Evaluation, Benchmarks, and Datasets 17 papers

MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

CareTransition-Audit: A Benchmark to Audit Discharge Summaries for Efficient Care Transitions

Too long; didn't solve

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

On the Limitations of Ray-Tracing for Learning-Based RF Tasks in Urban Environments

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

"Important You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems