arXivDaily: arXiv daily academic digest, updated Monday through Friday
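2507.05169 2026-06-17 cs.LG cs.AI cs.CL cs.CV cs.RO (updated version)

Critique of World Model: A Generative Latent Prediction Architecture for World Modeling

Eric Xing, Mingkai Deng, Jinyu Hou

AI summary: Starting from the psychological notion of "hypothetical thinking", the paper argues that the core goal of a world model is to simulate all actionable possibilities of the real world, and proposes a Generative Latent Prediction (GLP) architecture built on stateful, hierarchical, multi-level, mixed continuous/discrete representations.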

Abstract

World Model, the algorithmic simulator of the real-world environment that biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed sci-fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in the psychology literature, we argue that the primary goal of a world model is simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

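2507.17853 2026-06-17 cs.CV cs.AI (updated version)

Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models

Lifeng Chen, Jiner Wang, Zihao Pan, Beier Zhu, Xiaofeng Yang, Chi Zhang

AI summary: Proposes Detail++, a training-free framework that decomposes complex prompts via a progressive detail injection strategy and uses self-attention layout control together with a cross-attention centroid alignment loss to improve generation quality for complex multi-subject prompts.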

Abstract

Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.

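2507.15104 2026-06-17 cs.LG cs.AI (updated version)

AnalogFed: Privacy-Preserving Discovery of Analog Circuits at Scale with Federated Generative AI

Qiufeng Li, Shu Hong, Tian Lan, Weidong Cao

AI summary: Proposes AnalogFed, the first privacy-preserving framework combining federated learning and generative AI for large-scale analog circuit topology discovery; it defends against membership inference and model inversion attacks through dummy token injection and homomorphic encryption, enabling efficient collaborative design.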

Abstract

Recent advances in generative AI (GenAI) have shown transformative potential for modern hardware design. However, existing GenAI-driven approaches fall short of enabling large-scale electronic design automation (EDA) due to the proprietary and siloed nature of hardware datasets, which cannot be centralized for model training. Achieving at-scale GenAI-driven EDA, therefore, requires a novel privacy-preserving framework that can leverage distributed data without compromising confidentiality. This work introduces AnalogFed, the first privacy-preserving framework for large-scale analog circuit topology discovery using federated learning (FedL) and GenAI. AnalogFed establishes the feasibility of collaborative analog topology design while addressing key security challenges: it mitigates membership inference attacks (MIAs) through a novel input perturbation strategy based on dummy token injection, and defends against model inversion attacks with customized, efficient homomorphic encryption. Extensive experiments demonstrate AnalogFed's effectiveness and efficiency, achieving strong privacy protection without degrading model utility. This framework lays the foundation for scalable, multi-party collaboration in next-generation hardware design automation with GenAI.

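2507.05163 2026-06-17 cs.CV (updated version)

4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture

Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue

AI summary: Proposes a high-speed 4D capture system that uses only low-frame-rate cameras; an asynchronous capture scheme raises the equivalent frame rate to 100-200 FPS, and a video diffusion model repairs sparse-view artifacts, yielding high-quality high-speed 4D reconstruction.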

Comments Webpage: https://openimaginglab.github.io/4DSloMo/

Abstract

Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system using only low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
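
The frame-rate arithmetic behind staggered capture can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the camera groups are offset evenly within one base frame period, and the group counts of 4 and 8 are inferred from the stated 25 FPS base rate and 100-200 FPS range rather than taken from the abstract. The function names are hypothetical.

```python
def staggered_timestamps(num_groups, base_fps=25.0, frames_per_camera=3):
    """Return sorted (time_in_seconds, group_index) capture events.

    Assumption: group g starts g / (num_groups * base_fps) seconds late,
    so the groups interleave evenly within one base frame period.
    """
    period = 1.0 / base_fps        # time between two frames of one camera
    offset = period / num_groups   # start-time stagger between groups
    events = [
        (frame * period + group * offset, group)
        for frame in range(frames_per_camera)
        for group in range(num_groups)
    ]
    return sorted(events)


def effective_fps(num_groups, base_fps=25.0):
    """Equivalent frame rate when the groups are evenly staggered."""
    return num_groups * base_fps
```

Under these assumptions, 4 groups give 100 FPS and 8 groups give 200 FPS. Each timestamp is then seen by only one group of cameras, which is the sparse-view condition the abstract's artifact-fix model is meant to address.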

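2505.19937 2026-06-17 cs.CL cs.SD eess.AS (updated version)

ALAS: An Automatic Latent Alignment Score for Audio Language Models

Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan

AI summary: Proposes the ALAS metric, which evaluates the audio-text alignment quality of Speech-LLMs without any training by computing cross-modal cosine similarity between audio and text representations, and reveals how a model's alignment depth relates to task demands.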

Abstract

Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio--text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion strategies, there is no standard way to measure how well a Speech-LLM internally binds audio frames to text tokens. We introduce ALAS (Automatic Latent Alignment Score), a model and task-agnostic metric that probes the LLM's per-layer hidden states, scoring the cross-modal cosine similarity between audio and text representations against a Whisper-derived reference. ALAS needs only a frozen forward pass and an off-the-shelf ASR reference, with no training or fitted classifier, and is calibrated to an interpretable uniform baseline comparable across tasks. Applying ALAS to four open-source Speech-LLMs (AF3, Qwen2-Audio, Qwen-Omni, SALMONN) across emotion recognition (IEMOCAP), open-ended SQA (LibriSQA), and multi-choice audio understanding (MMAU-speech), we find that the depth and strength of alignment reflect each model's audio-encoder design and the acoustic-versus-semantic demands of the task, and that ALAS tracks but does not duplicate task accuracy, exposing models that score well without genuinely grounding in the audio. We release ALAS as an open-source library so that practitioners can probe their own Speech-LLMs or try it on new tasks.
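
The cross-modal cosine similarity that the metric builds on can be sketched for a single layer as below. This is only the similarity step under simple assumptions (hidden states given as plain lists of floats); how ALAS scores the result against the Whisper-derived reference and calibrates it to the uniform baseline is not described in the abstract and is not reproduced here. The function names are hypothetical and are not the API of the released library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(audio_states, text_states):
    """Rows index audio frames, columns index text tokens."""
    return [
        [cosine_similarity(a, t) for t in text_states]
        for a in audio_states
    ]
```

Computing one such matrix per layer from a frozen forward pass would give the per-layer view of audio-text binding that the abstract describes.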

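2506.17639 2026-06-17 cs.RO cs.AI (updated version)

RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models

Yuxuan Chen, Yixin Han, Yize Huang, Xiao Li

AI summary: Proposes RLRC, a three-stage compression and recovery pipeline that combines structured pruning, recovery via SFT and reinforcement learning, and quantization, achieving up to 8x memory reduction and 2.3x inference speedup while maintaining task success rate.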

Comments 8 pages, 10 figures; accepted by RA-L 2026

Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 7, pp. 8864-8871, July 2026
Abstract

Vision-Language-Action models (VLAs) have demonstrated remarkable capabilities and strong potential in complex robotic manipulation. However, their large parameter sizes and high inference latency hinder real-world deployment, especially on resource-constrained platforms. To address this, we conduct a systematic empirical study of model compression for VLAs. Building on these insights, we present RLRC, a three-stage compression and recovery pipeline consisting of structured pruning, performance recovery via SFT and RL, and subsequent quantization. The RL stage incorporates a critic warm-up strategy and BC loss regularization to stabilize training and preserve policy behavior. RLRC achieves up to an 8x memory reduction and a 2.3x inference speedup while maintaining the original task success rate. Extensive experiments across multiple VLA backbones show that RLRC consistently outperforms existing compression baselines, highlighting its effectiveness for on-device deployment. Project website: https://rlrc-vla.github.io

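2506.10981 2026-06-17 cs.CV (updated version)

SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis

Weiliang Chen, Jiayi Bi, Yuanhui Huang, Wenzhao Zheng, Yueqi Duan

AI summary: Proposes SceneCompleter, which performs dense 3D scene completion in an RGBD latent space with a geometry-appearance dual-stream diffusion model and introduces a Scene Embedder to integrate global information, achieving cross-view consistent generative novel view synthesis.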

Abstract

Generative models have shown great promise for novel view synthesis (NVS) by leveraging strong image generation priors. However, existing approaches typically follow a 2D inpainting paradigm, first completing missing image regions and then performing 3D reconstruction. This strategy often causes geometry distortion and appearance drift, as 2D inpainting models cannot reliably infer the underlying 3D structure required for cross-view consistent generation. In this paper, we propose SceneCompleter, a geometry-aware framework that reformulates generative NVS as dense 3D scene completion. Instead of hallucinating isolated 2D views, SceneCompleter jointly completes geometry and appearance through a geometry-appearance dual-stream diffusion model in a spatially aligned RGBD latent space. To provide holistic scene context, we further introduce a Scene Embedder that conditions generation on global semantic and stylistic information from reference images. The completed RGBD predictions are then aligned and integrated into an expandable 3D scene representation, enabling iterative and coherent scene completion. Extensive experiments on in-domain and out-of-distribution datasets demonstrate that SceneCompleter produces visually plausible and geometrically consistent novel views across diverse scenarios. Project Page: https://chen-wl20.github.io/SceneCompleter

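2506.03802 2026-06-17 cs.LG (updated version)

Learning in Matching Games with Bandit Feedback

Andreas Athanasopoulos, Christos Dimitrakakis

AI summary: Studies a learning problem in generalized two-sided matching markets where agents interact through zero-sum games, and proposes a UCB-based algorithm that uses matching instability as the regret measure and achieves a sublinear regret upper bound.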

Comments 22 pages, 2 figures

Abstract

We introduce a learning problem in a generalized two-sided matching market, where agents select actions to interact with their match. Specifically, we consider a setting in which matched agents engage in zero-sum games with initially unknown payoff matrices, and we investigate whether a centralized procedure can learn an equilibrium from bandit feedback. We adopt the solution concept of a matching equilibrium, where a matching \( \mathfrak{m} \) and a set of agent strategies \( X \) form an equilibrium if no agent has an incentive to deviate from \( (\mathfrak{m}, X) \). To quantify deviations of a candidate solution \( (\mathfrak{m}, X) \) from the equilibrium \( (\mathfrak{m}^\star, X^\star) \), we introduce the notion of matching instability, which serves as a regret measure for the learning problem. We propose a UCB-based algorithm in which agents form preferences and select actions according to optimistic estimates of the payoffs. Our analysis establishes a sublinear, instance-independent regret upper bound, further supported by empirical evidence.
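
The idea of acting on optimistic payoff estimates can be illustrated with the textbook UCB1 index for a plain multi-armed bandit. This is a generic sketch and not the paper's algorithm: the matching-game setting, the zero-sum payoff matrices, and the preference-formation step are not modeled, and the exploration constant is the standard UCB1 choice assumed here.

```python
import math


def ucb_index(mean_reward, pulls, total_rounds):
    """Optimistic estimate: empirical mean plus an exploration bonus."""
    if pulls == 0:
        return float("inf")  # force at least one pull of every arm
    return mean_reward + math.sqrt(2.0 * math.log(total_rounds) / pulls)


def select_arm(means, pulls, total_rounds):
    """Pick the arm with the largest optimistic estimate."""
    scores = [ucb_index(m, n, total_rounds) for m, n in zip(means, pulls)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The bonus shrinks as an arm is sampled more often, so rarely tried options keep being explored until their estimates become reliable.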

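2505.17740 2026-06-17 cs.LG cs.NE physics.comp-ph (updated version)

A tensor network approach for chaotic time series prediction

Rodrigo Martínez-Peña, Román Orús

AI summary: For chaotic time series prediction, applies a tensor-network-based model that reduces parameter complexity by decomposing high-dimensional arrays and outperforms conventional echo state networks in accuracy and computational efficiency.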

Comments 15 pages, 4 figures. Comments are welcome!

Abstract

Making accurate predictions of chaotic time series is a complex challenge. Reservoir computing, a neuromorphic-inspired approach, has emerged as a powerful tool for this task. It exploits the memory and nonlinearity of dynamical systems without requiring extensive parameter tuning. However, selecting and optimizing reservoir architectures remains an open problem. Next-generation reservoir computing simplifies this problem by employing nonlinear vector autoregression based on truncated Volterra series, thereby reducing hyperparameter complexity. Nevertheless, the latter suffers from exponential parameter growth in terms of the maximum monomial degree. Tensor networks offer a promising solution to this issue by decomposing multidimensional arrays into low-dimensional structures, thus mitigating the curse of dimensionality. This paper explores the application of a previously proposed tensor network model for predicting chaotic time series, demonstrating its advantages in terms of accuracy and computational efficiency compared to conventional echo state networks. Using a state-of-the-art tensor network approach enables us to bridge the gap between the tensor network and reservoir computing communities, fostering advances in both fields.
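
The parameter growth mentioned for truncated Volterra series can be made concrete by counting monomials. The sketch below assumes one weight per monomial of total degree at most p in d input variables (for example, d delayed samples), which gives the standard count C(d + p, p); the function name is hypothetical and the abstract does not state the exact feature set used.

```python
from math import comb


def num_monomials(num_inputs, max_degree):
    """Count monomials of total degree <= max_degree in num_inputs variables."""
    return comb(num_inputs + max_degree, max_degree)
```

For 10 inputs this count is 286 at degree 3 and 3003 at degree 5. When the number of inputs is much larger than the degree, the count behaves roughly like d^p / p!, which is the growth in the maximum monomial degree that motivates a low-rank tensor network parameterization.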

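2308.14329 2026-06-17 cs.RO cs.AI (updated version)

SSIL: Self-Supervised Imitation Learning for End-to-End Driving

Jin Bok Park, Jinkyu Lee, Muhyun Back, Hyun Min Han, Tianwei Ma, Sang Min Won, Sung Soo Hwang, Il Yong Chun

AI summary: Proposes SSIL, a self-supervised imitation learning framework that generates pseudo steering-angle data from vehicle poses without driving commands or pre-trained models; combined with the cross-attention-based conditioning approach CACA, it achieves driving accuracy comparable to supervised learning on three benchmark datasets.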

Comments 8 pages, 4 figures

Abstract

In autonomous driving, the end-to-end (E2E) driving approach that predicts vehicle control signals directly from sensor data is rapidly gaining attention. To learn a safe E2E driving system, one needs an extensive amount of driving data and human intervention. Vehicle control data is constructed from many hours of human driving, and it is challenging to construct large vehicle control datasets. Often, publicly available driving datasets are collected with limited driving scenes, and only vehicle manufacturers are able to collect vehicle control data. To address these challenges, this paper proposes the first self-supervised learning framework, Self-Supervised Imitation Learning (SSIL), for E2E driving. The proposed SSIL framework can learn vision-based E2E driving networks without using driving command data or a pre-trained model. To construct pseudo steering angle data, the proposed SSIL predicts a pseudo target from the vehicle's poses at the current and previous time points, which are estimated with light detection and ranging sensors. In addition, we propose a new cross-attention-based conditioning approach (CACA) for a vision encoder in E2E driving, where a high-level instruction serves as the conditioning signal for visual information. Our numerical experiments with three different benchmark datasets demonstrate that the proposed SSIL framework achieves E2E driving accuracy comparable to that of its supervised learning counterpart. Furthermore, the proposed pseudo-label predictor outperformed an existing one that uses a proportional-integral-derivative controller, and the proposed CACA achieved superior performance over existing conditioning approaches.

2503.07459 2026-06-17 cs.CL cs.AI (updated version)

MedicalAgentsBench for Complex Medical Reasoning: Comparing Internalized Reasoning Models versus Externalized Agent-based Frameworks

Yanjun Shao, Xiangru Tang, Jiwoong Sohn, Jiapeng Chen, Yuxuan Liao, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein

AI summary: Introduces the MedicalAgentsBench benchmark (862 complex clinical questions) to compare internalized reasoning models with externalized agent frameworks on medical reasoning, finding that their benefits compound; the best combination is o3-mini + MDAgents (35.1% accuracy).

Comments https://github.com/gersteinlab/MedicalAgentsBench

Abstract

Complex medical reasoning requires integrating heterogeneous clinical evidence across multiple inference steps. Large language models (LLMs) now approach this through two routes: internalized reasoning and externalized agent scaffolding (frameworks that decompose problems collaboratively amongst multiple LLMs). To determine whether these routes are exclusive or complementary, we introduce MedicalAgentsBench, a filtered benchmark of 862 complex clinical questions drawn from the union of eight medical datasets via difficulty-aware curation and contamination screening. Evaluating three internalized reasoning models (DeepSeek-R1, o1-mini, and o3-mini), seven base models, and nine externalized agent-based methods, we find that internalized and externalized approaches each independently improve performance, and that their benefits compound: the highest accuracy is achieved by layering agent workflows onto an internalized reasoning model (i.e., o3-mini + MDAgents with 35.1%). Pareto analysis shows this combination dominates the cost-performance frontier; moreover, lightweight optimization on inexpensive models offers an entry point for resource-constrained settings. Our benchmark is at https://github.com/gersteinlab/MedicalAgentsBench.
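
What it means for a configuration to sit on the cost-performance frontier can be sketched with a small Pareto filter. This is a generic illustration, assuming each configuration is summarized as a (cost, accuracy) pair where lower cost and higher accuracy are better; the function name is hypothetical and no numbers from the benchmark are used.

```python
def pareto_front(points):
    """Return the (cost, accuracy) pairs not dominated by any other pair.

    A pair is dominated if another pair is at least as good on both
    criteria (lower or equal cost, higher or equal accuracy) and
    strictly better on at least one of them.
    """
    front = []
    for cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for c, a in points
        )
        if not dominated:
            front.append((cost, acc))
    return front
```

With made-up inputs such as (1, 20), (2, 30), (3, 25), (4, 35), the pair (3, 25) drops out because (2, 30) is both cheaper and more accurate.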

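2502.00241 2026-06-17 cs.LG cs.AI cs.CL cs.CV (updated version)

Mordal: Automated Pretrained Model Selection for Vision Language Models

Shiqi He, Insu Jang, Mosharaf Chowdhury

AI summary: Proposes Mordal, a framework that automates the search for the best vision language model for a user-defined task by reducing the number of candidates and the evaluation time; it uses 8.9x to 11.6x fewer GPU hours than grid search and improves weighted Kendall's τ by about 69% on average.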

Abstract

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using $8.9\times$--$11.6\times$ fewer GPU hours than grid search. We have also found that Mordal achieves about 69% higher weighted Kendall's $\tau$ on average than the state-of-the-art model selection method across diverse tasks.

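2409.17502 2026-06-17 cs.LG (updated version)

Broadcast Product: Redefining Shape-aligned Element-wise Multiplication and Beyond

Yusuke Matsui, Tatsuya Yokota

AI summary: Introduces the broadcast product $\boxdot$, which formally extends the Hadamard product to element-wise multiplication of tensors with mismatched shapes, and establishes its algebraic properties and links to linear algebra, laying a mathematical foundation for broadcast-aware tensor operations.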

Comments TMLR2026. OpenReview: https://openreview.net/forum?id=zv0OtOPpPO

Abstract

Broadcast operations are widely used in scientific computing libraries, yet their mathematical formulation is often implicit and inconsistently represented in machine learning literature. This problem frequently leads to invalid equations when element-wise products are written despite mismatched tensor shapes. In this paper, we formalize such operations by introducing the broadcast product $\boxdot$, which explicitly extends the Hadamard product through shape-aligned element duplication. We provide a rigorous definition of the broadcast product, analyze its algebraic properties, and show how it can be expressed using standard linear algebra. Building on this framework, we formulate least-squares problems and sketch a proof-of-concept broadcast decomposition. As a preliminary illustration, we show that the formalism enables a new family of decompositions with distinct structural properties from conventional tensor decompositions. This work establishes a mathematical foundation for broadcast-aware tensor operations, connecting practical implementations with rigorous tensor analysis.
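
The notion of shape-aligned element duplication can be sketched for the two-dimensional case. The code below follows the usual NumPy-style broadcasting rule (a dimension of size 1 is duplicated to match the other operand), which is an assumption here: the paper's formal definition of $\boxdot$ is not given in the abstract and may differ in detail. The function name is hypothetical.

```python
def broadcast_product(a, b):
    """Element-wise product of two 2-D lists after size-1 duplication."""
    rows_a, cols_a = len(a), len(a[0])
    rows_b, cols_b = len(b), len(b[0])
    # Each dimension must either match or have size 1 in one operand.
    for size_a, size_b in ((rows_a, rows_b), (cols_a, cols_b)):
        if size_a != size_b and 1 not in (size_a, size_b):
            raise ValueError("shapes are not broadcast-compatible")
    rows, cols = max(rows_a, rows_b), max(cols_a, cols_b)
    return [
        [
            # Index 0 is reused along any dimension of size 1 (duplication).
            a[i if rows_a > 1 else 0][j if cols_a > 1 else 0]
            * b[i if rows_b > 1 else 0][j if cols_b > 1 else 0]
            for j in range(cols)
        ]
        for i in range(rows)
    ]
```

When both operands have the same shape, this reduces to the ordinary Hadamard product; a 2x3 matrix times a 2x1 column scales each row by the corresponding column entry.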

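2408.12099 2026-06-17 cs.CV cs.CR (updated version)

Query-Efficient Video Adversarial Attack with Stylized Logo on Service Computing

Duoxun Tang, Yuxin Cao, Xi Xiao, Derui Wang, Sheng Wen, Tianqing Zhu

AI summary: Proposes SLA, a black-box video attack framework that uses stylized logos and reinforcement learning to generate adversarial examples with a low budget and high verisimilitude, outperforming existing methods in targeted attacks.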

Comments Accepted to IEEE Transactions on Dependable and Secure Computing (TDSC)

Abstract

In service computing, video classification has become fundamental to many intelligent applications. While Deep Neural Networks (DNNs) have demonstrated excellent performance in recognizing video content, recent studies have shown that DNNs are highly vulnerable to adversarial examples. Understanding adversarial attacks therefore helps in responding to emergency situations. To improve attack performance, many style-transfer-based attacks and patch-based attacks have been proposed. However, the global perturbation of the former introduces unnatural global colors, while the latter struggle to succeed in targeted attacks due to the limited perturbation space. Moreover, compared to a plethora of methods targeting image classifiers, video adversarial attacks remain relatively underexplored. Therefore, to generate adversarial examples with a low budget and to give them higher verisimilitude, we propose a novel black-box video attack framework, called Stylized Logo Attack (SLA). SLA is conducted in three stages. The first stage involves building a style reference set for logos, which not only makes the generated examples more natural but also carries more target class features in targeted attacks. Then, Reinforcement Learning is employed to determine the style reference and position parameters of the logo within the video, which ensures that the stylized logo is placed in the video with optimal attributes. Finally, perturbations are optimized step by step to improve the fooling rate. Experimental results indicate that SLA can achieve better performance than state-of-the-art methods and still maintain good deception effects when facing various defense methods. We believe SLA can raise awareness among the security community about the reliability and security of video classification systems and serve as a memorandum of possible attack methods.

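2406.07435 2026-06-17 cs.CV cs.LG eess.IV (updated version)

Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration

Shashank Agnihotri, Julia Grabinski, Janis Keuper, Margret Keuper

AI summary: To address the poor robustness that aliasing causes in image restoration networks, proposes BOA-Restormer, which performs part of its downsampling and upsampling in the frequency domain to ensure alias-free paths, improving model robustness at low cost.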

Comments Tags: Adversarial attack, image restoration, image deblurring, frequency sampling

Abstract

Image restoration networks usually comprise an encoder and a decoder, responsible for aggregating image content from noisy, distorted data and for restoring clean, undistorted images, respectively. Data aggregation as well as high-resolution image generation both usually come at the risk of involving aliases, i.e., standard architectures put their ability to reconstruct the model input in jeopardy to reach high PSNR values on validation data. The price to be paid is low model robustness. In this work, we show that simply providing alias-free paths in state-of-the-art reconstruction transformers supports improved model robustness at a low cost in restoration performance. We do so by proposing BOA-Restormer, a transformer-based image restoration model that executes downsampling and upsampling operations partly in the frequency domain to ensure alias-free paths along the entire model while potentially preserving all relevant high-frequency information.

2404.09790 2026-06-17 cs.CV (updated version)

NTIRE 2024 Challenge on Image Super-Resolution (x4): Methods and Results

Zheng Chen, Zongwei Wu, Eduard Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, Hongyuan Yu, Cheng Wan, Yuxin Hong, Zhijuan Huang, Yajun Zou, Yuan Huang, Jiamin Lin, Bingnan Han, Xianyu Guan, Yongsheng Yu, Daoan Zhang, Xuanwu Yin, Kunlong Zuo, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, Hongyu An, Xinfeng Zhang, Zhiyuan Song, Ziyue Dong, Qing Zhao, Xiaogang Xu, Pengxu Wei, Zhi-chao Dou, Gui-ling Wang, Chih-Chung Hsu, Chia-Ming Lee, Yi-Shiuan Chou, Cansu Korkmaz, A. Murat Tekalp, Yubin Wei, Xiaole Yan, Binren Li, Haonan Chen, Siqi Zhang, Sihan Chen, Amogh Joshi, Nikhil Akalwadi, Sampada Malagi, Palani Yashaswini, Chaitra Desai, Ramesh Ashok Tabib, Ujwala Patil, Uma Mudenagudi, Anjali Sarvaiya, Pooja Choksy, Jagrit Joshi, Shubh Kawa, Kishor Upla, Sushrut Patwardhan, Raghavendra Ramachandra, Sadat Hossain, Geongi Park, S. M. Nadim Uddin, Hao Xu, Yanhui Guo, Aman Urumbekov, Xingzhuo Yan, Wei Hao, Minghan Fu, Isaac Orais, Samuel Smith, Ying Liu, Wangwang Jia, Qisheng Xu, Kele Xu, Weijun Yuan, Zhan Li, Wenqin Kuang, Ruijin Guan, Ruting Deng, Zhao Zhang, Bo Wang, Suiyi Zhao, Yan Luo, Yanyan Wei, Asif Hussain Khan, Christian Micheloni, Niki Martinel

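AI summary: Reviews the NTIRE 2024 image super-resolution (x4) challenge, summarizing the submitted solutions and results, which push the performance boundary of single-image super-resolution and give an overview of current trends.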

Comments NTIRE 2024 webpage: https://cvlai.net/ntire/2024. Code: https://github.com/zhengchen1999/NTIRE2024_ImageSR_x4

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 6108-6132
AI中文摘要

本文回顾了NTIRE 2024图像超分辨率($\times$4)挑战赛,重点介绍了提出的解决方案和获得的结果。该挑战涉及利用先验信息从低分辨率(LR)输入生成对应的高分辨率(HR)图像,放大倍数为四倍。LR图像来源于双三次下采样退化。挑战的目标是获得具有最先进SR性能的设计/解决方案,对计算资源(如模型大小和FLOPs)或训练数据没有限制。该赛道在DIV2K测试数据集上使用PSNR指标评估性能。比赛吸引了199名注册者,其中20支队伍提交了有效参赛作品。这一集体努力不仅推动了单图像SR的性能边界,还提供了对该领域当前趋势的全面概述。

英文摘要

This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field.
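For readers unfamiliar with the PSNR metric named in the abstract, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). The function name `psnr`, the flat-sequence image representation, and the default peak value of 255 are assumptions made for illustration; the abstract does not state the challenge's exact evaluation protocol (for example, which color channel is used or whether image borders are cropped).

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel values (an assumption made
    to keep this sketch dependency-free).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher value means the super-resolved image is closer to the ground-truth HR image in the mean-squared-error sense; it does not by itself measure perceptual quality.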

2404.01965 2026-06-17 cs.LG cs.AI 版本更新

Towards Leveraging AutoML for Sustainable Deep Learning: A Multi-Objective HPO Approach on Deep Shift Neural Networks

迈向利用AutoML实现可持续深度学习:深度移位神经网络上的多目标HPO方法

Leona Hennig, Tanja Tornede, Marius Lindauer

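AI summary: To address the high computational cost of deep learning, the paper proposes combining multi-fidelity HPO with multi-objective optimization to maximize accuracy and minimize energy consumption simultaneously on deep shift neural networks; experiments yield models with over 80% accuracy at low computational cost.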

详情
AI中文摘要

深度学习通过从大型数据集中提取复杂模式,推动了各个领域的发展。然而,深度学习模型的计算需求带来了环境和资源方面的挑战。深度移位神经网络(DSNNs)通过利用移位操作来降低推理时的计算复杂度,提供了一种解决方案。遵循标准DNNs的见解,我们感兴趣的是通过AutoML技术充分利用DSNNs的潜力。我们研究了超参数优化(HPO)的影响,以最大化DSNN性能,同时最小化资源消耗。由于这结合了多目标(MO)优化,其中精度和能耗作为潜在互补目标,我们提出将最先进的多保真度(MF)HPO与多目标优化相结合。实验结果表明了我们方法的有效性,得到了精度超过80%且计算成本低的模型。总体而言,我们的方法加速了高效模型开发,同时实现了可持续的AI应用。

英文摘要

Deep Learning (DL) has advanced various fields by extracting complex patterns from large datasets. However, the computational demands of DL models pose environmental and resource challenges. Deep shift neural networks (DSNNs) offer a solution by leveraging shift operations to reduce computational complexity at inference. Following the insights from standard DNNs, we are interested in leveraging the full potential of DSNNs by means of AutoML techniques. We study the impact of hyperparameter optimization (HPO) to maximize DSNN performance while minimizing resource consumption. Since this involves multi-objective (MO) optimization with accuracy and energy consumption as potentially complementary objectives, we propose to combine state-of-the-art multi-fidelity (MF) HPO with multi-objective optimization. Experimental results demonstrate the effectiveness of our approach, resulting in models with over 80% accuracy and low computational cost. Overall, our method accelerates efficient model development while enabling sustainable AI applications.
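As a rough illustration of what a multi-objective search over accuracy and energy returns, the sketch below filters a set of evaluated configurations down to its Pareto front (configurations that no other configuration beats on both objectives). The function name `pareto_front` and the `(name, accuracy, energy)` tuple layout are hypothetical; this is generic non-dominated filtering, not the HPO method used in the paper.

```python
def pareto_front(configs):
    """Return the non-dominated configurations.

    configs: list of (name, accuracy, energy) tuples, where accuracy is to
    be maximized and energy is to be minimized (hypothetical layout).
    """
    front = []
    for name, acc, energy in configs:
        # A configuration is dominated if another one is at least as good
        # on both objectives and strictly better on at least one.
        dominated = any(
            (a >= acc and e <= energy) and (a > acc or e < energy)
            for _, a, e in configs
        )
        if not dominated:
            front.append((name, acc, energy))
    return front
```

For example, among `("a", 0.80, 10)`, `("b", 0.85, 20)`, `("c", 0.78, 15)`, and `("d", 0.85, 25)`, only `a` and `b` remain: `c` is both less accurate and more costly than `a`, and `d` matches `b` in accuracy at a higher energy cost.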

2606.18074 2026-06-17 stat.ML cs.LG stat.ME 新提交

Tensor-based second-order causal discovery

基于张量的二阶因果发现

Nathan Ouyang, Kexin Wan, Anna Seigal

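AI summary: Proposes the TSCD algorithm, which uses a tensor of covariance matrices from observational and interventional data to identify the directed acyclic graph and its edge functions under a linear structural equation model, requiring only uncorrelated noise; it extends to nonlinear models and is identifiable from a logarithmic number of interventions.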

Comments 27 pages, 7 figures. Code available at https://github.com/QWE123665/Tensor-based-Second-order-Causal-Discovery

详情
AI中文摘要

因果发现旨在揭示变量间的因果依赖关系。为此,我们提出了一种称为基于张量的二阶因果发现(TSCD)的算法。其输入是从观测数据和干预数据的协方差矩阵中得到的张量。假设因果依赖关系遵循有向无环图(DAG)上的线性结构方程模型,TSCD输出DAG及其边上的函数,仅要求噪声变量不相关。我们还实现了该方法在非线性模型中的版本。我们关注二阶统计量(通过协方差矩阵)的动机是:相对于高阶矩,它们在统计和计算上更高效;相对于一阶统计量,它们具有可识别性;并且无论变量是否为高斯分布,它们都适用。我们证明,TSCD从对数于变量数量的干预次数中可识别因果顺序和参数。实验表明,TSCD对噪声具有鲁棒性,与现有方法相比具有竞争力,并且可扩展到数百个变量。

英文摘要

Causal discovery seeks to uncover the causal dependencies among variables. For this purpose, we propose an algorithm called Tensor-based Second-order Causal Discovery (TSCD). Its input is a tensor obtained from the covariance matrices of observational and interventional data. Assuming the causal dependencies follow a linear structural equation model on a directed acyclic graph (DAG), TSCD outputs the DAG and the functions on its edges, requiring only that the noise variables are uncorrelated. We also implement a version of the approach for nonlinear models. Our focus on second-order statistics (via the covariance matrices) is motivated by their statistical and computational efficiency relative to higher-order moments, their identifiability relative to first-order statistics, and their applicability regardless of whether the variables are Gaussian. We show that, under TSCD, the causal order and parameters are identifiable from a number of interventions that is logarithmic in the number of variables. Experiments show that TSCD is robust to noise, competitive with existing methods, and scales to hundreds of variables.

2606.17504 2026-06-17 eess.IV cs.CV 新提交

Two-Stage Fine-Tuning of ResNet50 for High-Sensitivity Melanoma Detection on Dermoscopic Images

ResNet50的两阶段微调用于皮肤镜图像中高灵敏度黑色素瘤检测

Aryan Bhagat

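AI summary: Proposes a two-stage fine-tuning method for ResNet50 that addresses class imbalance and suboptimal transfer learning through layer-wise staged training and low-learning-rate fine-tuning, achieving an AUC-ROC of 0.9559 and a sensitivity of 87.56% on 3,826 test images, outperforming single-stage fine-tuning.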

Comments 13 pages, 4 figures, 4 tables. Code available at https://github.com/Aryanbhagat23/melanoma-detection

详情
AI中文摘要

黑色素瘤是最危险的皮肤癌,早期检测五年生存率超过99%,但一旦扩散则急剧下降。本文提出并评估了一种两阶段微调方法,用于皮肤镜图像上的二分类黑色素瘤检测,基于ResNet50。解决的核心挑战是类别不平衡和单阶段微调导致的迁移学习次优。在分层训练/验证/测试分割后,仅对训练集应用随机过采样以实现1:1类别平衡。第一阶段冻结ResNet50骨干网络,仅训练分类头;第二阶段以1e-5的低学习率联合微调所有层,以防止对已学习视觉特征的灾难性遗忘。在包含3826张图像的独立测试集上,模型实现了AUC-ROC为0.9559,准确率88.34%,灵敏度87.56%,特异度89.13%,F1分数88.29%。消融研究证实两阶段协议显著优于单阶段微调,灵敏度提升超过4%。Grad-CAM可视化展示了正确的病变定位。提供了完全可部署的Streamlit检测应用程序及所有训练代码。

英文摘要

Melanoma is the most dangerous form of skin cancer with five-year survival rates exceeding 99% when detected early but falling sharply once the disease spreads. This paper proposes and evaluates a two-stage fine-tuning approach for ResNet50 applied to binary melanoma classification on dermoscopic images. The core challenges addressed are class imbalance and suboptimal transfer learning from single-stage fine-tuning. After stratified train/validation/test splitting, random oversampling was applied exclusively to the training set to achieve a 1:1 class balance. Stage 1 trained only the classification head with the ResNet50 base frozen, while Stage 2 fine-tuned all layers jointly at a low learning rate of 1e-5 to prevent catastrophic forgetting of learned visual features. On an independent test set of 3,826 images, the model achieved an AUC-ROC of 0.9559, accuracy of 88.34%, sensitivity of 87.56%, specificity of 89.13%, and F1-score of 88.29%. An ablation study confirms the two-stage protocol significantly outperforms single-stage fine-tuning, with sensitivity gains of over 4%. Grad-CAM visualizations demonstrate correct lesion localization. A fully deployable Streamlit detection application is provided alongside all training code.
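Two steps in the abstract are easy to get wrong in practice: oversampling only the training split, and reading sensitivity and specificity off the predictions. The sketch below illustrates both with the standard library only. The function names `oversample_to_balance` and `sensitivity_specificity`, the seed handling, and the label encoding (1 for melanoma) are assumptions for illustration and are not taken from the authors' code.

```python
import random


def oversample_to_balance(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes match
    the size of the largest class. Apply this to the training split only,
    so that validation and test data keep their original distribution."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)
```

Sensitivity is the fraction of true melanomas the model flags, which is why the abstract reports it separately from overall accuracy: in a screening setting a missed melanoma is typically costlier than a false alarm.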

2606.12623 2026-06-17 stat.AP cs.LG 新提交

Estimating Individualized Treatment Effects in Acute Ischemic Stroke with Causal Transformation Models (TRAM-DAG): A Multi-Centre Observational Study with External RCT Validation

使用因果变换模型(TRAM-DAG)估计急性缺血性卒中个体化治疗效果:一项多中心观察性研究及外部RCT验证

Oliver Dürr, Lisa Herzog, Pascal Bühler, Susanne Wegener, Beate Sick

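AI summary: Proposes causal transformation models (TRAM-DAG) to estimate individualized treatment effects for acute ischemic stroke patients; after fitting on observational data, validation in an RCT population shows that the average of the estimates is consistent with the ATE and that the estimates correctly rank patients by outcome.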

Comments This submission has been withdrawn by the authors pending completion of internal review. A revised version will be posted in due course

详情
AI中文摘要

急性缺血性卒中的个体化医疗需要从平均治疗效果(ATE)转向个体化治疗效果(ITE)估计,以支持治疗决策。在急性缺血性卒中中,随机对照试验(如MR CLEAN研究)显示机械取栓平均优于溶栓。我们旨在识别哪些个体患者从机械取栓中获益最大。关注的结局是三个月时的改良Rankin量表(mRS),这是一个有序的功能残疾指标(0:无症状,6:死亡)。我们证明,在观察性MAGIC多中心卒中患者数据上拟合后,有向无环图上的因果变换模型(TRAM-DAG)可用于ITE估计。为确保与用于验证的MR CLEAN人群的可比性,我们在MAGIC子人群(入院NIHSS≥6,对应MR CLEAN的一项纳入标准)上训练TRAM-DAG。然后使用拟合模型估计MR CLEAN人群中卒中患者的ITE。虽然这些ITE估计无法通过实验确认,但我们显示其平均值与试验报告的ATE一致。此外,ITE估计正确地将试验患者按观察到的良好结局(三个月mRS≤2)频率排序。这些发现支持使用像TRAM-DAG这样的因果模型进行卒中护理中的个性化决策,并突显其弥合观察性证据与临床试验之间差距的能力。

英文摘要

Personalized medicine in acute ischemic stroke requires moving beyond average treatment effects (ATE) to individualized treatment effect (ITE) estimates to support treatment decisions. In acute ischemic stroke, mechanical thrombectomy has been shown to be more effective on average than lysis in randomized controlled trials (RCTs), such as the MR CLEAN study. We aim to identify which individual patients benefit most from mechanical thrombectomy compared to lysis. The outcome of interest is the modified Rankin Scale (mRS) at three months, an ordinal measure of functional disability (0: no symptoms, 6: death). We demonstrate that causal transformation models on directed acyclic graphs (TRAM-DAG) can be used for ITE estimation after being fitted on observational MAGIC multi-center stroke patient data. To ensure comparability with the MR CLEAN population, which we use for validation, we train the TRAM-DAG on a MAGIC sub-population with NIHSS at admission >= 6, corresponding to one inclusion criterion of MR CLEAN. The fitted model is then used to estimate ITEs for stroke patients in the MR CLEAN population. While these ITE estimates cannot be confirmed experimentally, we show that their average is consistent with the trial's reported ATE. Furthermore, the ITE estimates correctly rank trial patients by their observed frequency of a good outcome (mRS at three months <= 2). These findings support the use of causal models like TRAM-DAG for personalized decision-making in stroke care and highlight their ability to bridge the gap between observational evidence and clinical trials.

2606.12666 2026-06-17 cs.CR cs.AI 新提交

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

CAPED:面向移动GUI代理的上下文感知隐私暴露防御

Siyu Shen, Fenghao Xu, Wenrui Diao, Kehuan Zhang

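AI summary: To address incidental visual privacy exposure caused by screenshot uploads from mobile GUI agents, the paper proposes CAPED, a context-aware pre-upload exposure control layer that selectively exposes only task-relevant content via task requirement extraction, screen-context privacy priors, and UI element parsing, significantly reducing privacy leakage while preserving high task utility.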

详情
AI中文摘要

基于截图的移动GUI代理能够像人类用户一样通过相同的视觉界面操作普通智能手机应用,但这种能力也将每一次屏幕观察变成了隐私边界。在正常任务执行过程中,截图可能暴露联系人、消息、照片、文件、推荐、健康提示等与用户请求无关的敏感上下文。我们称这个问题为附带视觉隐私暴露。现有防御难以解决:文本匿名化遗漏了许多视觉和推理线索,而通用隐私遮蔽可能移除GUI代理完成任务所需的证据和控制。本文提出CAPED,一种面向移动GUI代理的上下文感知预上传暴露控制层。CAPED被设计为手机端保护层:在截图被释放到远程多模态代理之前,它提取任务需求,利用屏幕上下文作为隐私先验,解析可见UI元素,并仅选择性暴露当前任务所需的内容,同时遮蔽附带隐私内容。我们在AndroidWorld上评估CAPED的广泛任务效用,并使用受控的28任务种子隐私评估作为轨迹级附带泄漏的测量工具。在该种子评估中,完整CAPED将成功条件下的加权种子泄漏从原始截图的0.766降低到0.268,同时保持高任务效用。更广泛的AndroidWorld运行显示了剩余的原型级效用成本,但结果支持核心主张:截图上传应被视为明确的设备-云边界决策,由任务驱动的选择性暴露而非全有或全无的屏幕共享来管理。

英文摘要

Screenshot-based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capability also turns every screen observation into a privacy boundary. During normal task execution, screenshots may expose contacts, messages, photos, files, recommendations, health cues, and other sensitive context that is unrelated to the user's request. We call this problem incidental visual privacy exposure. It is difficult to address with existing defenses: text anonymization misses many visual and inferential cues, while generic privacy masking can remove the evidence and controls that a GUI agent needs to complete the task. This paper presents CAPED, a context-aware pre-upload exposure control layer for mobile GUI agents. CAPED is designed as a phone-side protection layer: before screenshots are released to a remote multimodal agent, it extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and selectively exposes only content needed for the current task while masking incidental private content. We evaluate CAPED on AndroidWorld for broad task utility and with a controlled 28-task seeded privacy evaluation used as a measurement instrument for trajectory-level incidental leakage. In this seeded evaluation, Full CAPED reduces success-conditioned weighted seeded leakage from 0.766 under raw screenshots to 0.268 while preserving high task utility. A broader AndroidWorld run shows a remaining prototype-level utility cost, but the results show that task-driven selective exposure can reduce incidental visual leakage before screenshots are released to a remote GUI agent.

2605.12513 2026-06-17 cs.SI cs.AI 版本更新

SP-GCRL: Influence Maximization on Incomplete Social Graphs

SP-GCRL:在不完整社交图上的影响力最大化

Haohua Niu, Yuxuan Yang, Lingfeng Zhang, Hao Li, Jiao Liang, Zongfu Luo, Luca Rossi

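AI summary: This paper proposes the SP-GCRL framework, which achieves end-to-end seed selection through social-propagation-aware graph contrastive reinforcement learning, addressing the challenges posed by incomplete social graphs and non-stationary diffusion dynamics and improving efficiency and scalability.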

Comments Accepted by DASFAA 2026. The first two authors contributed equally

详情
AI中文摘要

在现实平台中,影响力最大化(IM)受到不完整、噪声社交图和非平稳扩散动态的挑战。我们提出了SP-GCRL,一种社交传播感知的图对比强化学习框架,该框架在部分可观测性下学习端到端的种子选择。我们首先引入了一种社交传播感知的非线性扩散函数,以建模强化/衰减效应和概率漂移;然后构建了双结构视图,并执行对比学习以获得对缺失边和弱连接具有鲁棒性的节点表示,同时用基于GAT的回归替代昂贵的策略度量以提高效率和可扩展性;最后,我们使用DDQN在这些表示上学习端到端的种子选择策略。在多个真实世界网络上的实验表明,SP-GCRL在预算和拓扑结构上均显著优于启发式和基于学习的基线,同时保持了强大的大规模可扩展性。

英文摘要

Influence maximization (IM) in real platforms is challenged by incomplete, noisy social graphs and non-stationary diffusion dynamics. We propose SP-GCRL, a social-propagation-aware graph contrastive reinforcement learning framework that learns end-to-end seed selection under partial observability. We first introduce a social-propagation-aware nonlinear diffusion function to model reinforcement/diminishing effects and probability drift under repeated exposure; we then construct dual structural views and perform contrastive learning to obtain node representations robust to missing edges and weak ties, while replacing expensive strategy metrics with a GAT-based regression surrogate to improve efficiency and scalability; finally, we use DDQN to learn an end-to-end seed selection policy on top of these representations. Experiments on multiple real-world networks show that SP-GCRL achieves significant gains over heuristic and learning-based baselines across budgets and topologies, while maintaining strong large-scale scalability.

2604.27583 2026-06-17 q-bio.NC cs.RO 版本更新

Simulating Infant First-Person Sensorimotor Experience via Motion Retargeting from Babies to Humanoids

通过从婴儿到类人机器人的运动重定向模拟婴儿第一人称感觉运动经验

Francisco M. López, Hoshinori Kanazawa, Ondrej Fiala, Yakov Balashov, Valentin Marcel, Lukas Rustler, Miles Lenz, Dongmin Kim, Yasuo Kuniyoshi, Jochen Triesch, Matej Hoffmann

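AI summary: Proposes a method that reconstructs infant 3D pose from a single video and maps it onto physical and virtual humanoid platforms, simulating multisensory streams with sub-centimeter accuracy and offering new tools for developmental research and early detection of neurodevelopmental disorders.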

Comments Accepted at IEEE ICDL 2026. 8 pages, 6 figures. Cite as: F. M. López, H. Kanazawa, O. Fiala, Y. Balashov, V. Marcel, L. Rustler, M. Lenz, D. Kim, Y. Kuniyoshi, J. Triesch, and M. Hoffmann, "Simulating infant first-person sensorimotor experience via motion retargeting from babies to humanoids", in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-8

详情
AI中文摘要

随着人形机器人能力的增强,从人类到类人人工体的运动重定向变得越来越重要。然而,现有方法大多只关注运动学再现,而忽略了与人类运动相关的丰富感觉运动经验。在这项工作中,我们提出了一个框架,使用物理和虚拟类人机器人模拟婴儿的多模态感觉运动经验。从单个视频中,我们的方法通过提取骨骼结构并从每一帧估计完整的3D姿态来重建婴儿的身体配置。然后,我们将重建的运动映射到几个发育平台上:物理iCub机器人和虚拟模拟器pyCub、EMFANT和MIMo。在这些实体上重放重定向的运动会产生模拟的多感觉流,包括本体感觉(关节和肌肉)、触觉和视觉。对于最佳匹配的实体,重定向实现了亚厘米级的精度,并能够对婴儿发育进行丰富的多模态分析,以及增强的行为自动标注。该框架为婴儿的感觉运动经验提供了一个独特的窗口,为机器人学、发育科学和神经发育障碍的早期检测提供了新工具。代码可在 https://github.com/ctu-vras/motion-retargeting/ 获取。

英文摘要

Motion retargeting from humans to human-like artificial agents is becoming increasingly important as humanoid robots grow more capable. However, most existing approaches focus only on reproducing kinematics and ignore the rich sensorimotor experience associated with human movement. In this work, we present a framework for simulating the multimodal sensorimotor experiences of infants using physical and virtual humanoids. From a single video, our method reconstructs the infant's body configuration by extracting its skeletal structure and estimating the full 3D pose from each frame. Then we map the reconstructed motion onto several developmental platforms: the physical iCub robot and the virtual simulators pyCub, EMFANT and MIMo. Replaying the retargeted motions on these embodiments produces simulated multisensory streams including proprioception (joints and muscles), touch, and vision. For the best-matching embodiment, the retargeting achieves sub-centimeter accuracy and enables a rich multimodal analysis of infant development as well as enhanced automated annotation of behaviors. This framework provides a unique window into the infant's sensorimotor experience, offering new tools for robotics, developmental science, and early detection of neurodevelopmental disorders. The code is available at https://github.com/ctu-vras/motion-retargeting/.

2603.25414 2026-06-17 cs.PL cs.AI cs.LG cs.LO 版本更新

Decidable By Construction: Design-Time Verification for Trustworthy AI

可判定性通过构造实现:面向可信AI的设计时验证

Houston Haynes

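AI summary: Proposes a design-time verification framework that constrains AI model properties to decidable problems over finitely generated abelian groups, verifying numerical stability, computational correctness, and physical consistency before training at very low computational cost and eliminating post hoc verification overhead.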

Comments 21 pages, 1 figure

详情
AI中文摘要

机器学习中一个普遍的假设是模型正确性必须在事后强制执行。我们观察到,决定AI模型是否数值稳定、计算正确或与物理领域一致的属性并不一定需要事后强制执行。它们可以在设计时,在训练开始之前,以边际计算成本进行验证,对于部署在高杠杆决策支持和科学约束环境中的模型尤其重要。这些属性共享特定的代数结构:它们可以表示为有限生成阿贝尔群 $\mathbb{Z}^n$ 上的约束,其中推理在多项式时间内可判定,且主要类型是唯一的。基于这一观察构建的框架组合了三个先前的结果(arXiv:2603.16437, arXiv:2603.17627, arXiv:2603.18104):一个维度类型系统,通过模型细化携带任意注释作为持久余数据;一个程序超图,仅从类型签名推断Clifford代数等级并推导几何积稀疏性;以及一个自适应领域模型架构,通过前向模式余效应分析和精确正数累积在训练过程中保持两个不变量。我们相信这种组合产生了一个新颖的信息论结果:阿贝尔群上的Hindley-Milner统一在Solomonoff通用先验的可计算限制下计算最大后验假设,将该框架的类型推断置于与通用归纳相同的正式基础上。我们比较了四种当代的AI可靠性方法,并表明每种方法都会引入开销,这些开销可能在部署、层和推理请求中累积。该框架通过构造消除了这种开销。

英文摘要

A prevailing assumption in machine learning is that model correctness must be enforced after the fact. We observe that the properties determining whether an AI model is numerically stable, computationally correct, or consistent with a physical domain do not necessarily demand post hoc enforcement. They can be verified at design time, before training begins, at marginal computational cost, with particular relevance to models deployed in high-leverage decision support and scientifically constrained settings. These properties share a specific algebraic structure: they are expressible as constraints over finitely generated abelian groups $\mathbb{Z}^n$, where inference is decidable in polynomial time and the principal type is unique. A framework built on this observation composes three prior results (arXiv:2603.16437, arXiv:2603.17627, arXiv:2603.18104): a dimensional type system carrying arbitrary annotations as persistent codata through model elaboration; a program hypergraph that infers Clifford algebra grade and derives geometric product sparsity from type signatures alone; and an adaptive domain model architecture preserving both invariants through training via forward-mode coeffect analysis and exact posit accumulation. We believe this composition yields a novel information-theoretic result: Hindley-Milner unification over abelian groups computes the maximum a posteriori hypothesis under a computable restriction of Solomonoff's universal prior, placing the framework's type inference on the same formal ground as universal induction. We compare four contemporary approaches to AI reliability and show that each imposes overhead that can compound across deployments, layers, and inference requests. This framework eliminates that overhead by construction.

2604.09998 2026-06-17 cs.CR cs.AI 版本更新

Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit

像锤子一样,它能建造,也能破坏:Reddit上网络安全运营中大语言模型的使用、认知与采纳

Souradip Nath, Chih-Yi Huang, Aditi Ganapathi, Kashyap Thimmaraju, Jaron Mink, Gail-Joon Ahn

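AI summary: Through a mixed-methods analysis of 892 posts from cybersecurity forums on Reddit, the study examines how security practitioners use, perceive, and adopt LLM tools; it finds that LLMs are mainly used for low-risk, productivity-oriented tasks and that enterprise-grade security platforms attract interest, while reliability, verification overhead, and security concerns limit their autonomy.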

Comments This paper appears in the Proceedings of the Twenty-Second Symposium on Usable Privacy and Security (SOUPS) 2026

详情
AI中文摘要

大语言模型(LLM)近期作为增强安全运营中心(SOC)工作流程的有前景工具出现,供应商越来越多地推广用于SOC的自主AI解决方案。然而,对于现实世界安全从业者如何使用、感知和采纳这些工具,仍缺乏实证理解。为填补这一空白,我们对网络安全论坛中的讨论进行了混合方法分析,以了解多样化从业者群体如何将现代LLM工具用于安全运营。具体而言,我们分析了Reddit上三个网络安全论坛在2022年12月至2025年9月间的892篇帖子,并采用定性编码和统计分析相结合的方法,从三个维度考察安全从业者如何讨论LLM工具:(1)他们声明的工具和用例,(2)每个工具在一组关键因素上的感知优缺点,以及(3)他们对这些工具的采纳以及对网络安全行业和个人分析师的预期影响。总体而言,我们的发现揭示了LLM工具采纳的细微模式,突出了LLM在低风险、生产力导向任务中的独立使用,以及对企业级、安全导向LLM平台的积极兴趣。尽管从业者报告了LLM辅助工作流程在效率和效果上的显著提升,但可靠性、验证开销和安全问题等持续存在的问题严重限制了赋予LLM工具的自主性。基于这些结果,我们还为开发和采纳LLM工具提供了建议,以确保组织的安全和网络安全从业者的安全。

英文摘要

Large language models (LLMs) have recently emerged as promising tools for augmenting Security Operations Center (SOC) workflows, with vendors increasingly marketing autonomous AI solutions for SOCs. However, there remains a limited empirical understanding of how such tools are used, perceived, and adopted by real-world security practitioners. To address this gap, we conduct a mixed-methods analysis of discussions in cybersecurity-focused forums to learn how a diverse group of practitioners use and perceive modern LLM tools for security operations. More specifically, we analyzed 892 posts between December 2022 and September 2025 from three cybersecurity-focused forums on Reddit, and, using a combination of qualitative coding and statistical analysis, examined how security practitioners discuss LLM tools across three dimensions: (1) their stated tools and use cases, (2) the perceived pros and cons of each tool across a set of critical factors, and (3) their adoption of such tools and the expected impacts on the cybersecurity industry and individual analysts. Overall, our findings reveal nuanced patterns in LLM tool adoption, highlighting independent use of LLMs for low-risk, productivity-oriented tasks, alongside active interest in enterprise-grade, security-focused LLM platforms. Although practitioners report meaningful gains in efficiency and effectiveness in LLM-assisted workflows, persistent issues with reliability, verification overhead, and security risks sharply constrain the autonomy granted to LLM tools. Based on these results, we also provide recommendations for developing and adopting LLM tools to ensure the security of organizations and the safety of cybersecurity practitioners.

2604.04089 2026-06-17 physics.comp-ph cond-mat.str-el cs.AI cs.HC 版本更新

From Paper to Program: Knowledge Externalization for AI-Assisted Quantum Many-Body Code Generation

从论文到程序:AI辅助量子多体代码生成中的知识外化

Yi Zhou

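AI summary: To address failures caused by tacit conventions when AI translates papers directly into code, the paper proposes knowledge externalization, making implicit assumptions explicit through a multi-stage human-AI collaborative workflow, and validates it on DMRG and Pfaffian-MPS tasks.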

Comments New designed experiments added

详情
AI中文摘要

大型语言模型可以编写科学代码,但当正确性依赖于文献中的默认约定时,直接的论文到程序翻译仍然脆弱。我们将这一瓶颈识别为\textbf{知识外化}:在实现之前将隐式计算假设(索引约定、规范选择、费米子符号、收缩顺序和内存约束)转换为明确的技术规范。我们评估了一个多阶段、人在回路的工作流程,该流程在理论提取和代码生成之间插入这样的规范,并带有验证和停止门。该工作流程在两个算法上不同的量子多体任务上进行了测试:基于变分扫描的密度矩阵重整化群(DMRG)来自教学综述,以及将Hartree-Fock-Bogoliubov态构造性地转换为矩阵乘积态的Pfaffian方法,来自Jin等人五页的信件,Phys. Rev. B 105, L081101 (2022),该代码未公开。对于DMRG,在$4\times4$网格中,所有16个规范引导的模型配对都满足物理验证标准,而直接尝试为6/13。散文规范消融实验表明,外化的内容(而非LaTeX格式)是基本要素。对于Pfaffian-MPS,该工作流程在26次存档尝试中成功11次,而直接提示产生零次审计通过。跨规范转移是不对称的:由GPT~5.5实现的非GPT规范通过4/4,而由较弱模型实现的GPT~5.5规范失败4/4,表明存在残留的实现模型瓶颈。由此产生的\textit{论文到程序多体}技能为AI辅助实现多体算法以及诊断外化成功或失败提供了可审计的协议。

英文摘要

Large language models can write scientific code, but direct paper-to-program translation remains fragile when correctness depends on tacit conventions in the literature. We identify this bottleneck as \textbf{knowledge externalization}: converting implicit computational assumptions -- index conventions, gauge choices, fermionic signs, contraction order, and memory constraints -- into an explicit technical specification before implementation. We evaluate a multi-stage, human-in-the-loop workflow that inserts such a specification, with validation and stop gates, between theory extraction and code generation. The workflow is tested on two algorithmically distinct quantum many-body tasks: variational sweep-based Density-Matrix Renormalization Group (DMRG) from a pedagogical review and constructive Pfaffian conversion of Hartree--Fock--Bogoliubov states to matrix product states from the five-page Letter by Jin et al., Phys. Rev. B 105, L081101 (2022), for which no public code is available. For DMRG, all 16 specification-guided model pairings in a $4\times4$ grid satisfy physics-validation criteria, compared with 6/13 direct attempts. A prose-specification ablation indicates that externalized content, not \LaTeX{} formatting, is the essential ingredient. For Pfaffian-MPS, the workflow succeeds in 11/26 archived attempts, whereas direct prompting yields zero audited passes. Cross-specification transfer is asymmetric: non-GPT specifications implemented by GPT~5.5 pass 4/4, while GPT~5.5 specifications implemented by weaker models fail 4/4, indicating a residual implementation-model bottleneck. The resulting \emph{Paper-to-Program Many-Body} skill provides an auditable protocol for AI-assisted implementation of many-body algorithms and for diagnosing where externalization succeeds or fails.

2604.06531 2026-06-17 math.OC cs.LG cs.MA cs.SY eess.SY stat.ML 版本更新

A Generalized Sinkhorn Algorithm for Mean-Field Schrödinger Bridge

平均场薛定谔桥的广义Sinkhorn算法

Asmaa Eldesoukey, Yongxin Chen, Abhishek Halder

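AI summary: For the mean-field Schrödinger bridge problem, the paper proposes a generalized Hopf-Cole transform and designs a Sinkhorn-type recursive algorithm to solve the associated system of integro-PDEs, discusses convergence guarantees under mild assumptions, and illustrates the approach with numerical examples.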

详情
AI中文摘要

平均场薛定谔桥(MFSB)问题涉及设计一个最小努力控制器,引导具有非局部相互作用的扩散过程在固定截止时间内从给定分布到达另一个分布。与标准薛定谔桥不同,MFSB的动态约束是带有控制器的相互作用智能体群体的平均场极限。它是大规模多智能体系统的自然模型。由于非局部相互作用使问题非凸,MFSB在计算上具有挑战性。我们提出了MFSB的Hopf-Cole变换的推广,并在此基础上设计了一种Sinkhorn型递归算法来求解相关的积分-偏微分方程组。在相互作用势的温和假设下,我们讨论了所提算法的收敛性保证。我们通过排斥和吸引相互作用的数值示例来说明理论贡献。

英文摘要

The mean-field Schrödinger bridge (MFSB) problem concerns designing a minimum-effort controller that guides a diffusion process with nonlocal interaction to reach a given distribution from another by a fixed deadline. Unlike the standard Schrödinger bridge, the dynamical constraint for MFSB is the mean-field limit of a population of interacting agents with controls. It serves as a natural model for large-scale multi-agent systems. The MFSB is computationally challenging because the nonlocal interaction makes the problem nonconvex. We propose a generalization of the Hopf-Cole transform for MFSB and, building on it, design a Sinkhorn-type recursive algorithm to solve the associated system of integro-PDEs. Under mild assumptions on the interaction potential, we discuss convergence guarantees for the proposed algorithm. We present numerical examples with repulsive and attractive interactions to illustrate the theoretical contributions.
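As background for the "Sinkhorn-type" recursion mentioned in the abstract, the sketch below shows the classical Sinkhorn iteration for a discrete, static entropy-regularized transport problem, which is the setting the standard (non-mean-field) Schrödinger bridge reduces to after discretization. It is not the generalized algorithm of the paper: there is no nonlocal interaction term and no PDE solve. The function name `sinkhorn`, the regularization parameter `eps`, and the fixed iteration count are assumptions made for illustration.

```python
import math


def sinkhorn(mu, nu, cost, eps=0.5, iters=500):
    """Classical Sinkhorn iteration on a discrete problem.

    mu, nu: source and target probability vectors (lists of floats).
    cost:   cost matrix as a list of rows, cost[i][j] >= 0.
    Returns the coupling P[i][j] = u[i] * K[i][j] * v[j], whose row sums
    approach mu and whose column sums approach nu.
    """
    n, m = len(mu), len(nu)
    # Gibbs kernel built from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale to match the row and column marginals.
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

The two scaling vectors `u` and `v` play the role that the pair of Hopf-Cole factors plays in the continuous setting: each update enforces one endpoint marginal while holding the other factor fixed, and in this convex classical problem the alternation converges. The abstract's point is that the mean-field version is nonconvex, so this plain recursion and its convergence argument do not carry over unchanged.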

2603.27049 2026-06-17 stat.ML cs.LG 版本更新

Overcoming the Incentive Collapse Paradox

克服激励崩溃悖论

Qichuan Yin, Ziwei Su, Shuangning Li

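AI summary: To address incentive collapse in AI-assisted tasks, the paper proposes a sentinel-auditing payment mechanism that sustains positive human effort at finite cost, and builds an incentive-aware active statistical inference framework that optimizes the auditing rate and sampling allocation.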

Comments Accepted to ICML 2026

详情
AI中文摘要

AI辅助任务委派日益普遍,但此类系统中的人力成本高昂且通常不可观测。Bastani和Cachon (2025); Sambasivan等人 (2021) 的最新研究表明,基于准确度的支付方案存在激励崩溃:随着AI准确度提升,维持正向人力努力需要无界支付。我们在预算约束的委托-代理框架中研究这一现象,其中战略型人类代理的输出准确度取决于不可观测的努力。我们的第一个贡献是一般性不可能结果,表明激励崩溃不仅是简单线性支付的局限,而是任何仅基于观测任务结果的支付规则都会出现。为克服这一障碍,我们提出一种哨兵审计支付机制,该机制以有限成本强制执行严格为正且可控的人力努力水平,且与AI准确度无关。在此激励鲁棒的基础上,我们构建了一个激励感知的主动统计推断框架,联合优化(i)审计率和(ii)跨不同难度任务的主动采样与预算分配,以在单一预算下最小化最终统计损失。实验表明,相对于标准主动学习和仅审计基线,该方法改善了成本-误差权衡。

英文摘要

AI-assisted task delegation is increasingly common, yet human effort in such systems is costly and typically unobserved. Recent work by Bastani and Cachon (2025); Sambasivan et al. (2021) shows that accuracy-based payment schemes suffer from incentive collapse: as AI accuracy improves, sustaining positive human effort requires unbounded payments. We study this phenomenon in a budget-constrained principal-agent framework with strategic human agents whose output accuracy depends on unobserved effort. Our first contribution is a general impossibility result showing that incentive collapse is not merely a limitation of simple linear payments, but arises for any payment rule based only on observed task accuracy. To overcome this barrier, we propose a sentinel-auditing payment mechanism that enforces a strictly positive and controllable level of human effort at finite cost, independent of AI accuracy. Building on this incentive-robust foundation, we develop an incentive-aware active statistical inference framework that jointly optimizes (i) the auditing rate and (ii) active sampling and budget allocation across tasks of varying difficulty to minimize the final statistical loss under a single budget. Experiments demonstrate improved cost-error tradeoffs relative to standard active learning and auditing-only baselines.

2603.19697 2026-06-17 eess.AS cs.MM cs.SD 版本更新

Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction

Plug-and-Steer:解耦分离与选择的音视频目标说话人提取

Doyeop Kwak, Suyeon Lee, Joon Son Chung

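AI summary: Proposes Plug-and-Steer, which decouples separation from target selection and uses a frozen audio-only backbone together with a Latent Steering Matrix to achieve high-fidelity audio-visual target speaker extraction.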

Comments Accepted by Interspeech 2026; demo available https://plugandsteer.github.io

详情
AI中文摘要

本文的目标是通过解耦分离和目标选择,为音视频目标说话人提取(AV-TSE)提供新视角。传统的AV-TSE系统通常深度融合音频和视觉特征以重新学习整个分离过程,由于野外音视频数据集的噪声特性,这可能会成为保真度的上限。为了解决这个问题,我们提出了Plug-and-Steer,它将高保真分离分配给冻结的纯音频骨干网络,并将视觉模态的作用严格限制在目标选择上。我们引入了潜引导矩阵(LSM),这是一种最小化的线性变换,它重新路由骨干网络内的潜特征,将目标说话人锚定到指定通道。在四种代表性架构上的实验表明,我们的方法有效地保留了不同骨干网络的声学先验,实现了与原始骨干网络相当的可感知质量。音频样本可在以下网址获取:https://plugandsteer.github.io

英文摘要

The goal of this paper is to provide a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling separation and target selection. Conventional AV-TSE systems typically integrate audio and visual features deeply to re-learn the entire separation process, which can act as a fidelity ceiling due to the noisy nature of in-the-wild audio-visual datasets. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and limits the role of the visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four representative architectures show that our method effectively preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to that of the original backbones. Audio samples are available at: https://plugandsteer.github.io

2603.04438 2026-06-17 eess.IV cs.AI cs.LG 版本更新

CogGen: Cognitive-Load-Inspired Fully Unsupervised Deep Generative Modeling for Compressively Sampled MRI Reconstruction

CogGen: 认知负荷启发的全无监督深度生成模型用于压缩感知MRI重建

Qingyong Zhu, Yumin Tan, Xiang Gu, Dong Liang

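AI summary: Proposes the CogGen framework, which follows the cognitive easy-to-hard principle and decomposes CS-MRI reconstruction into a staged inversion problem through self-paced curriculum learning and an MRI-aware dual-threshold weighting strategy; theory shows reduced local sufficient-iteration and cumulative noise-amplification bounds, and experiments show it outperforms existing unsupervised and supervised methods.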

详情
AI中文摘要

全无监督深度生成建模(FU-DGM)为压缩感知磁共振成像(CS-MRI)重建提供了巨大潜力。代表性的FU-DGM公式,如深度图像先验(DIP)和隐式神经表示(INR),利用架构偏置在图像空间中诱导与正向观测对齐的低维流形。然而,由于底层逆系统高度病态,FU-DGM中长时间的迭代拟合通常导致效率低下和噪声放大。本文受认知易到难学习原则的启发,提出CogGen,一种将CS-MRI重建重新表述为分阶段反演问题的FU-DGM框架。具体地,CogGen通过MRI感知的双阈值加权准则实现自定进度课程学习(SPCL)驱动的渐进调度策略,该准则自适应地调节k空间测量参与。数据一致性残差阈值评估当前生成器的拟合可靠性,而k空间半径阈值控制阶段性的测量暴露,从而避免整个优化过程中的均匀拟合。理论上,我们的分析表明,当早期阶段倾向于易拟合的测量时,CogGen产生更低的局部充分迭代界和更小的累积噪声放大界,解释了CogGen在有限迭代预算内改进的收敛行为和重建保真度。数值实验表明,CogGen的两种实例化,CogGen-DIP和CogGen-INR,在包括无监督和有监督流程在内的现有CS-MRI重建技术中实现了优越的性能。

英文摘要

Fully unsupervised deep generative modeling (FU-DGM) offers significant potential for compressively sampled magnetic resonance imaging (CS-MRI) reconstruction. Representative FU-DGM formulations, such as deep image prior (DIP) and implicit neural representation (INR), employ architectural bias to induce a low-dimensional manifold in the image space that aligns with the forward observation. However, as the underlying inverse system is highly ill-posed, prolonged iterative fitting in FU-DGM typically leads to poor efficiency and noise amplification. In this paper, guided by the cognitive principle of easy-to-hard learning, we propose CogGen, an FU-DGM framework that reformulates CS-MRI reconstruction as a staged inversion problem. Specifically, CogGen implements a self-paced curriculum learning (SPCL)-driven progressive scheduling strategy through an MRI-aware dual-threshold weighting criterion, which adaptively regulates k-space measurement participation. The data-consistency residual thresholding evaluates the fitting reliability of the current generator, while the k-space radius thresholding controls stage-wise measurement exposure, thereby avoiding uniform fitting throughout optimization. Theoretically, our analysis shows that, when early stages favor easy-to-fit measurements, CogGen yields a reduced local sufficient-iteration bound and a smaller cumulative noise-amplification bound, explaining the improved convergence behavior and reconstruction fidelity of CogGen within a finite iteration budget. Numerical experiments demonstrate that both CogGen instantiations, CogGen-DIP and CogGen-INR, achieve superior performance over prevailing CS-MRI reconstruction techniques, including unsupervised and supervised pipelines.