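arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.

Affiliations: Dyson School of Design Engineering, Imperial College London; Halfspace (part of Accenture); Technical University of Denmark (DTU Management); Aarhus University (CoRE)

AI summary: Proposes the QueryMarket framework and the OVBAL algorithm, which estimate marginal utility via a D-optimality criterion and perform cost-aware online active learning under a rolling budget constraint, adapting to nonstationary streams and heterogeneous label costs.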

Comments 10 pages, 8 figures. Submitted to IEEE Transactions on Neural Networks and Learning Systems

Abstract

Data acquisition is a major bottleneck for learning in real-time streams: analysts must decide on the fly which labels to purchase while respecting a rolling budget. However, existing online active learning rarely unifies pricing, information gain, and rolling budget constraints under concept drift. We introduce QueryMarket, a market-inspired framework that queries each incoming data point based on its estimated utility to the model and its price. Within this framework, we propose OVBAL (online variance-based active learning), which integrates data pricing with information-driven selection by estimating each sample's marginal utility via a D-optimality criterion with exponential forgetting and executing cost-aware purchases under rolling budget constraints. OVBAL yields a simple, fully online decision rule that adapts to nonstationary streams and heterogeneous label costs. Experiments on synthetic data and a real-world solar power generation forecasting task show that OVBAL is particularly effective under seller-centric pricing and yields a more favorable long-run error-cost trade-off in the real-world task under both pricing schemes.
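
A minimal sketch of what a cost-aware, D-optimality-based purchase rule could look like, written under stated assumptions rather than taken from the paper. It assumes a linear model whose inverse information matrix `P` is updated recursively with forgetting factor `forgetting`, scores a sample by the log-determinant gain `log(1 + x^T P x / forgetting)`, and buys when the gain per unit price clears a threshold and the rolling budget allows it. The class name `VarianceBuyer`, the threshold rule, and all parameter values are hypothetical; as a simplification, forgetting is applied only when a label is bought.

```python
import math
from collections import deque


class VarianceBuyer:
    """Hypothetical cost-aware label buyer; not the authors' OVBAL rule."""

    def __init__(self, dim, forgetting=0.99, budget=5.0, window=50,
                 min_gain_per_cost=0.05):
        self.lam = forgetting
        self.budget = budget              # maximum spend inside the rolling window
        self.window = window              # window length in time steps
        self.threshold = min_gain_per_cost
        # P approximates the inverse information matrix; start from identity.
        self.P = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.purchases = deque()          # (time step, price) pairs inside the window
        self.t = 0

    def _Px(self, x):
        return [sum(p * xj for p, xj in zip(row, x)) for row in self.P]

    def gain(self, x):
        # D-optimality gain: log det(lam*M + x x^T) - log det(lam*M), with P = M^-1.
        quad = sum(xi * pi for xi, pi in zip(x, self._Px(x)))
        return math.log1p(quad / self.lam)

    def step(self, x, price):
        """Decide whether to buy the label of x at the given price."""
        self.t += 1
        # Drop purchases that have left the rolling window.
        while self.purchases and self.purchases[0][0] <= self.t - self.window:
            self.purchases.popleft()
        spent = sum(p for _, p in self.purchases)
        affordable = spent + price <= self.budget
        worthwhile = self.gain(x) / price >= self.threshold
        buy = affordable and worthwhile
        if buy:
            # Recursive least-squares style update of P with forgetting.
            Px = self._Px(x)
            denom = self.lam + sum(xi * pi for xi, pi in zip(x, Px))
            n = len(x)
            self.P = [[(self.P[i][j] - Px[i] * Px[j] / denom) / self.lam
                       for j in range(n)] for i in range(n)]
            self.purchases.append((self.t, price))
        return buy
```

With `budget=1.0` and unit prices, this sketch buys the first label and declines an identical sample arriving inside the same window, both because the budget is exhausted and because the estimated gain for that direction has shrunk.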

2606.17803 2026-06-17 cs.LG New submission

Continual Self-Improvement with Lightweight Experiential Latent Memories

Vaggelis Dorovatas, Nancy Kalaj, Rahaf Aljundi

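Affiliations: Toyota Motor Europe; University of Trento

AI summary: Proposes an online method that converts inference-time compute into lightweight, modular latent memories trained with self-generated test-time signals, enabling continual improvement while avoiding catastrophic forgetting.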

Abstract

Large language models achieve strong reasoning performance by scaling inference-time compute, yet remain fundamentally stateless, discarding the rich, self-produced reasoning traces generated during this process. We investigate whether models can instead learn online from this experience, converting transient computation (reasoning traces) into persistent reusable knowledge, and without external supervision or access to future data. We show that In-Context Learning (ICL) over raw reasoning traces fails to generalize, reflecting a fundamental limitation of token-level reuse: individual traces lack the abstraction needed for transfer, even after refinement (e.g. self-reflection). In contrast, drawing inspiration from recent works on unsupervised reinforcement learning, we find that lightweight per-instance training with self-generated test-time signals (majority voting) as rewards yields substantial gains, often surpassing full-dataset offline training, motivating a shift from raw traces to learned latent representations. Building on this insight, we propose an online method that distills inference-time compute spent on encountered problems into compact modular latent memories capturing the underlying reasoning structure. These memories are stored and retrieved for future inputs, enabling continual improvement while avoiding catastrophic forgetting through modular design. Importantly, our method is highly efficient, parametrized as extremely lightweight soft prompt memories (~0.001% of model parameters) and trained with only a few gradient steps, yet achieving performance competitive with full parametric updates and offline training. Across challenging mathematical reasoning benchmarks, our approach significantly outperforms zero-shot and raw data ICL baselines, while transferring effectively across datasets.
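
The self-generated reward signal mentioned in the abstract (majority voting over sampled answers) can be sketched as follows. This is only an illustration of the voting step, assuming each sampled reasoning trace has already been reduced to a final answer string; the function name `majority_vote_rewards` and the 0/1 reward scale are assumptions, not the authors' implementation.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a list of sampled final answers.

    The most common answer acts as a pseudo-label; each sample gets
    reward 1.0 if it agrees with it and 0.0 otherwise.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards
```

No ground-truth label is used anywhere, which is what makes the signal usable at test time; the trade-off is that a confidently wrong majority is rewarded as well.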

2606.17800 2026-06-17 cs.CV New submission

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong, Zhengpeng Xie, Haopeng Li, Qinghao Huang, Dandan Shen, Tengjiao Ji, Wei Wang, Peicheng Wu, Yuxuan Zhao, Xiangyu Zhu, Welly Luo, Shurui Yang, Zeke Xie

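Affiliations: Catnip AI Team

AI summary: Proposes MaineCoon, the first real-time audio-visual autoregressive model with 22B parameters, supporting streaming generation at up to 47.5 FPS on a single GPU with sub-second interaction. It is optimized for social-interactive applications and introduces self-resampling, cross-modal alignment, domain-aware preference optimization, and reinforced online-policy distillation.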

Comments 32 pages, 13 figures, 3 tables

Abstract

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

2606.17798 2026-06-17 cs.CV cs.AI New submission

LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

Zhenyu Yang, Kairui Zhang, Bing Wang, Shengsheng Qian, Changsheng Xu

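Affiliations: IEEE

AI summary: Proposes LiveStarPro, which combines three components (Streaming Verification Decoding, Streaming Causal Attention Masks, and Tree-Structured Hierarchical Memory) for proactive understanding of long-horizon video streams, improving semantic correctness by 28.9% and reducing timing error by 18.2%.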

Abstract

Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at https://github.com/sotayang/LiveStarPro.
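
The abstract does not spell out the SVeD decision rule, so the following is only a rough illustration of perplexity-based response gating. It assumes the model has scored a candidate response in one forward pass and returned per-token log-probabilities, and that the assistant speaks only when the resulting perplexity is low enough. The function names and the `max_perplexity` value are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a token sequence from natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def should_respond(token_logprobs, max_perplexity=4.0):
    """Respond now only if the candidate response is sufficiently likely."""
    return perplexity(token_logprobs) <= max_perplexity
```

A gate of this kind needs no dedicated silence token: staying silent is simply the outcome when the check fails.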

2606.17791 2026-06-17 cs.CL cs.CV New submission

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Samar Ansari

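Affiliations: School of Computing and Engineering Sciences, University of Chester

AI summary: Uses controlled experiments to measure the information degradation caused by AI rewriting of radiology reports. EHR summarization destroys content but preserves image-text alignment, whereas standardized rewriting and teaching case preparation do the opposite and cause larger alignment losses, an effect the author calls the slop paradox.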

Abstract

AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using 450 chest X-ray reports from the Indiana University dataset, we generate synthetic versions via three realistic LLM rewriting tasks: EHR summarization, standardized rewriting, and teaching case preparation. We measure entity erosion (via medical NER), hedging collapse (loss of clinical uncertainty language), and cross-modal alignment degradation (via BiomedCLIP image-text similarity). Our central finding is a dissociation between information loss and cross-modal fidelity. EHR summarization is the most destructive at the content level, eroding 51.4% of clinical entities and 43.7% of hedging language, yet it preserves image-text alignment almost entirely (a 2.5% drop). The two tasks meant to produce cleaner training data, standardized rewriting and teaching case preparation, do the reverse: they preserve more entities (26.8% and 29.3% eroded) but cause 14.9-16.5% alignment drops, six to seven times those of EHR summarization. We term this the slop paradox: rewriting that makes clinical text look cleaner for multimodal training is precisely what pulls it away from the image. Contrary to our pre-specified hypothesis, rare pathologies were not preferentially degraded: across nine rare-versus-common comparisons, no difference survived multiple-comparison correction, and nominal differences ran in the opposite direction (common > rare), so contamination is invisible to condition-specific monitoring. The dominant determinant of degradation is the type of AI rewriting task, not the clinical content. These findings bear on multimodal medical AI dataset construction and the governance of AI-assisted clinical documentation.
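
To make the two text-level metrics concrete, here is a small sketch of how entity erosion and hedging collapse could be computed, followed by a check of the "six to seven times" ratio quoted in the abstract. It assumes entities have already been extracted by some NER step and that hedging is detected by simple phrase matching; the `HEDGE_TERMS` list and all function names are illustrative, not the paper's actual lexicon or code.

```python
# Illustrative hedge phrases only; the paper's lexicon is not given in the abstract.
HEDGE_TERMS = ("possible", "probable", "may represent", "cannot exclude", "suggestive of")


def erosion_rate(original_items, rewritten_items):
    """Fraction of distinct original entities missing from the rewrite."""
    original = set(original_items)
    if not original:
        return 0.0
    lost = original - set(rewritten_items)
    return len(lost) / len(original)


def hedge_count(text):
    """Count occurrences of hedge phrases, case-insensitively."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in HEDGE_TERMS)


def collapse_rate(original_text, rewritten_text):
    """Fraction of hedge phrases lost in the rewrite (0.0 if none to lose)."""
    before = hedge_count(original_text)
    if before == 0:
        return 0.0
    return max(0.0, (before - hedge_count(rewritten_text)) / before)


# Alignment drops of 14.9% and 16.5% relative to the 2.5% drop for EHR summarization.
alignment_ratios = [round(drop / 2.5, 2) for drop in (14.9, 16.5)]  # about 5.96 and 6.6
```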

2606.17782 2026-06-17 cs.LG New submission

Blind Recovery of Latent Domains via Unsupervised Symmetry Discovery

Onur Efe, Arkadas Ozakin

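Affiliations: Bogazici University

AI summary: Proposes an unsupervised framework that recovers latent domains and signals from unstructured observations by discovering symmetries of the data distribution, using a shallow group-convolutional network with stationarity and locality regularization.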

Abstract

The primary motivation in blind inverse problems is to recover signals of interest from corrupted observations without knowing the obfuscating mechanism. Blind deconvolution is a prominent approach when the corruption is convolutional, but it is not applicable when general linear transformations obfuscate the domain structure. In this work, we propose an unsupervised framework for recovering latent domains and signals by discovering symmetries of the data distribution. Our framework models observations as linear measurements of signals sampled from a latent random field, and optimizes a shallow group-convolutional network by imposing stationarity and locality regularization at the model output. The model learns a latent symmetry action and an appropriate filter, thereby mapping unstructured observations to a symmetry-based representation that reveals latent signals. Experiments on stochastic processes, Ising models, shuffled and bit-scrambled images, and neural recordings show that the method recovers latent domains and signals from unstructured observations, suggesting symmetry discovery as a new direction for unsupervised structure learning and blind inverse problems.

2606.17775 2026-06-17 cs.SD cs.AI cs.NE New submission

A Neuromorphic Trigger for Efficient Audio Event Detection

Benjamin Hatton, Oliver Rhodes, Luca Peres

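Affiliations: ICNS, University of Manchester

AI summary: Proposes a low-cost front-end trigger based on a spiking neural network (SNN) that selectively gates audio segments, achieving an F1 score of 0.97 on anomalous sound detection and a potential 42.6x reduction in FLOPs on sound event detection.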

Comments 9 pages, 4 figures, 6 tables

Abstract

Efficient processing of continuous audio streams remains a key challenge for real-time and resource-constrained systems. This paper introduces a neuromorphic trigger for audio event detection, based on a spiking neural network (SNN) that selectively gates input to downstream models. The proposed trigger acts as a low-cost front-end, identifying salient audio segments and forwarding only these to a more computationally intensive model for tasks such as classification. The trigger is implemented as a lightweight fully connected SNN and evaluated on two representative tasks: Anomalous Sound Detection (ASD) and Sound Event Detection (SED). For ASD, the trigger achieves a one-second segment-based F1 score of 0.97 on a class-agnostic form of the URBAN-SED dataset, demonstrating high reliability in identifying relevant audio regions. For SED, the trigger is combined with the Dang classifier on the DCASE 2017 Challenge Task 2 dataset, showing a potential 42.6x reduction in FLOPs while reducing the lower bound of the event-based error rate from 0.41 to 0.25. These results highlight the potential of neuromorphic triggers as real-time, energy-efficient front-end filters, enabling substantial reductions in computational cost.
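
As a toy illustration of spike-based gating, the sketch below runs a single leaky integrate-and-fire neuron over per-frame energy values and marks the frames on which it spikes, which is where a downstream classifier would be invoked. The paper's trigger is a fully connected SNN; this one-neuron version, the function name `lif_trigger`, and the `leak`, `weight`, and `threshold` values are simplifying assumptions.

```python
def lif_trigger(energies, leak=0.9, weight=1.0, threshold=1.5):
    """Return one boolean per frame: True where the neuron spikes.

    The membrane potential decays by `leak` each frame, accumulates the
    weighted input, and resets to zero after crossing `threshold`.
    """
    potential = 0.0
    spikes = []
    for energy in energies:
        potential = leak * potential + weight * energy
        if potential >= threshold:
            spikes.append(True)
            potential = 0.0  # reset after a spike
        else:
            spikes.append(False)
    return spikes
```

Quiet frames leave the potential below threshold and cost only a multiply-add each, so the expensive model runs solely on the frames flagged `True`.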

2606.17756 2026-06-17 cs.LG New submission

A fairness-aware extension of Stochastic Multicriteria Acceptability Analysis for ranking

Guilherme Dean Pelegrina, Renata Pelissari

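Affiliations: Engineering School, Mackenzie Presbyterian University

AI summary: Proposes SMAA-Fair, which reweights simulated rankings to improve group fairness using the Statistical Parity, rKL, and nDKL metrics, improving the representation of protected groups in favourable positions while preserving robustness.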

Abstract

Fairness has become a central concern in ranking problems involving individuals or social groups, particularly under the Responsible Artificial Intelligence agenda. In Multi-Criteria Decision Analysis, Stochastic Multicriteria Acceptability Analysis (SMAA) provides a robust framework for handling uncertainty and incomplete preference information, but it does not explicitly address fairness in the resulting rankings. This paper proposes SMAA-Fair, a fairness-aware extension of SMAA for ranking problems. The approach reweights the simulated rankings generated by SMAA according to their level of group fairness, so that fairer rankings contribute more strongly to the acceptability indices and central weights vector. The framework is independent of the aggregation model and can incorporate different fairness metrics. In this study, Statistical Parity, normalized discounted Kullback-Leibler divergence (rKL) and normalized discounted cumulative Kullback-Leibler divergence (nDKL) are adopted. Rankings are derived from the fairness-adjusted acceptability matrix using expected ranking and maximum acceptability ranking. We also derive the central weight according to the degree of fairness in the obtained rankings. Numerical experiments with synthetic and real data show that SMAA-Fair improves the representation of protected groups among favourable ranking positions, while preserving robustness to preference uncertainty.
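
The reweighting idea can be sketched as a fairness-weighted version of the rank acceptability table: each simulated ranking contributes in proportion to a fairness score instead of equally. The sketch assumes the fairness scores are already computed and non-negative; the function name `weighted_acceptability` and the plain proportional weighting are assumptions, and the paper's exact scheme may differ.

```python
def weighted_acceptability(rankings, fairness):
    """Fairness-weighted rank acceptability indices.

    rankings[k] lists the alternatives best-first for simulation k.
    fairness[k] is a non-negative weight for that simulated ranking.
    Returns {alternative: [share of weight at rank 1, rank 2, ...]}.
    """
    total = sum(fairness)
    if total <= 0:
        raise ValueError("at least one ranking needs a positive fairness weight")
    alternatives = sorted(rankings[0])
    table = {a: [0.0] * len(alternatives) for a in alternatives}
    for order, weight in zip(rankings, fairness):
        for position, alternative in enumerate(order):
            table[alternative][position] += weight / total
    return table
```

With equal weights this reduces to the usual SMAA acceptability indices; raising the weight of a fairer ranking shifts acceptability toward the positions it assigns.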

2606.17742 2026-06-17 cs.CV q-bio.NC New submission

BrainWorld: A Structural-Prior-Conditioned Generative Model for Whole-Brain 4D fMRI Dynamics

Junfeng Xia, Wenhao Ye, Junxiang Zhang, Xuanye Pan, Mo Wang, Quanying Liu

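Affiliations: Department of Biomedical Engineering, Southern University of Science and Technology; School of Biomedical Engineering, Shenzhen University

AI summary: Proposes BrainWorld, which conditions on structural MRI as an anatomical prior and generates whole-brain 4D fMRI dynamics through a denoising process. Evaluated on 22 datasets, it generates stable trajectories of up to 400 frames and improves downstream performance through generated-example augmentation.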

Abstract

Whole-brain 4D fMRI generation is valuable for modeling functional brain dynamics, yet existing fMRI foundation models mainly target representation learning and downstream prediction rather than conditional predictive generation. We introduce BrainWorld, a structural-prior-conditioned generative model for whole-brain 4D fMRI dynamics. BrainWorld uses sMRI as subject-level anatomical context to guide future fMRI generation, integrating structural information into the denoising process rather than treating it as a parallel modality. Evaluated on 22 datasets spanning diverse cohorts and brain states, BrainWorld generates stable 4D fMRI trajectories up to 400 frames, improves downstream performance through generated-example augmentation, and learns transferable multimodal representations that outperform baselines. Together, these results establish BrainWorld as a condition-aware generative framework for long-horizon brain dynamics modeling and multimodal representation learning.

2606.17739 2026-06-17 cs.RO cs.AI cs.CV cs.MA New submission

ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents

Lina Magoula, Nikolaos Koursioumpas, Nancy Alonistioti, Ramin Khalili

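Affiliations: Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens; Huawei Heisenberg Research Center (Munich)

AI summary: Proposes the ED3R framework, which detects wildfires under uncertainty at minimal energy cost through hierarchical cooperation between a robot and a remote controller and predictions from distributed neural regression models, reaching a mission success rate of up to 97.18%, up to 36.4% lower energy consumption, and up to 41% faster detection.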

Comments 14 pages, 9 figures

Abstract

Robotics are expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides upon the robot's motion, while the robot senses the environment and decides where to execute the wildfire detection (onboard or remotely) and how. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. Especially in the most demanding missions, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.
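
The robot-side choice of where and how to run detection can be pictured as a constrained selection: among candidate strategies, keep those whose predicted confidence meets the requirement and take the cheapest in energy. The sketch below assumes the predictions are already available (in the paper they would come from the regression models); the strategy names and all numbers are made up for illustration, and `pick_strategy` is a hypothetical helper.

```python
def pick_strategy(candidates, required_confidence):
    """Return the lowest-energy candidate meeting the confidence requirement.

    Each candidate is a dict with predicted "confidence" and "energy" values.
    Returns None when no candidate is confident enough.
    """
    feasible = [c for c in candidates if c["confidence"] >= required_confidence]
    if not feasible:
        return None  # nothing qualifies; the caller has to keep exploring
    return min(feasible, key=lambda c: c["energy"])


# Made-up predictions for three hypothetical execution options.
candidates = [
    {"name": "onboard_small_model", "confidence": 0.80, "energy": 2.0},
    {"name": "onboard_large_model", "confidence": 0.93, "energy": 6.5},
    {"name": "remote_offload", "confidence": 0.95, "energy": 4.0},
]
```

With these example values, a 0.9 confidence requirement rules out the small onboard model and selects remote offloading, since it is cheaper than the large onboard model.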

2606.17735 2026-06-17 cs.AI New submission

GSPan: A Continuous Gaussian Primitive Representation for Arbitrary-Scale Pansharpening

Fangyi Li, Xiaoyuan Yang, Yixiao Li, Zongyang Sui, Kangqing Shen, Gemine Vivone

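Affiliations: Beihang University; Tsinghua University; National Research Council - Institute of Methodologies for Environmental Analysis, CNR-IMAA

AI summary: Proposes the GSPan framework, which brings 2D Gaussian Splatting into pansharpening and represents residual details as continuous, learnable 2D Gaussian primitives, enabling fusion at arbitrary scales without retraining.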

Abstract

Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and panchromatic (PAN) observations. Most existing deep learning methods treat pansharpening as fixed-grid prediction, which limits scale adaptation. To address this, we propose GSPan, a framework that introduces 2D Gaussian Splatting (GS) into pansharpening. Instead of directly predicting pixels, GSPan represents band-wise residual details as continuous and learnable 2D Gaussian primitives. We design a Dual-Stream Hierarchical Interaction (DSHI) architecture with a Spatial-Spectral Interactive Attention (SSIA) module to estimate these primitives from complementary PAN and MS observations. The predicted primitives are rendered as a residual detail field and injected into the upsampled MS image. This continuous representation allows GSPan to render fused images on arbitrary target sampling grids without scale-specific retraining. It further enables a Scale-Decoupled Asymmetric Inference (SDAI) strategy, which estimates primitives at a reduced resolution and renders the fused image at the target resolution for efficient large-scene pansharpening. Experiments on QuickBird, GaoFen-2, WorldView-3, and WorldView-3-4K datasets show that GSPan delivers state-of-the-art fusion performance. Moreover, SDAI markedly accelerates inference, achieving a favorable trade-off between computational efficiency and fusion quality. Our results demonstrate the potential of continuous Gaussian residual representations as a flexible and scale-decoupled alternative to fixed-grid prediction.
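
Why a continuous primitive representation can be rendered at any resolution is easy to see in a small sketch: the primitives live in normalised coordinates, and the output grid is chosen only at render time. The version below assumes isotropic Gaussians described by `(cx, cy, sigma, amplitude)` for a single band; the actual primitives are presumably richer (for example anisotropic), and `render_residual` is a hypothetical name.

```python
import math


def render_residual(primitives, height, width):
    """Render a residual detail field on a height x width grid.

    Each primitive is (cx, cy, sigma, amplitude) with the centre given in
    normalised [0, 1] image coordinates, so the grid size is free.
    """
    image = []
    for row in range(height):
        y = (row + 0.5) / height          # pixel centre in normalised coordinates
        line = []
        for col in range(width):
            x = (col + 0.5) / width
            value = 0.0
            for cx, cy, sigma, amplitude in primitives:
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                value += amplitude * math.exp(-d2 / (2.0 * sigma ** 2))
            line.append(value)
        image.append(line)
    return image
```

The same primitive list can be passed with `height=3, width=3` or `height=5, width=5`; nothing about the primitives changes, which mirrors the idea of estimating them once and rendering at the target resolution.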

2606.17713 2026-06-17 cs.CV New submission

Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset

Jiangong Xu, Weibao Xue, Xiaoyu Yu, Jun Pan, Xinlian Lianga, Mi Wang

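Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing; School of Computer Science and Information Engineering; Hangzhou International Innovation Institute; Oriental Space Port Research Institute; Hubei Luojia Laboratory

AI summary: To address unreliable optical remote sensing under cloud contamination, proposes CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that uses optical reliability modulation, heterogeneous information adaptive aggregation, and a unified semantic mapping transformer for near-real-time LULC mapping. Also builds CloudLULC-Set, a global benchmark dataset of 40,223 triplets, and outperforms existing methods on multiple metrics.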

Abstract

Optical remote sensing imagery is frequently degraded by cloud and cloud-shadow contamination, which limits its reliability for near-real-time land use and land cover (LULC) mapping. Although synthetic aperture radar (SAR) can provide cloud-penetrating structural information, existing SAR-optical fusion methods often assume reliable optical observations and insufficiently address the semantic uncertainty introduced by cloud contamination. To address this issue, we propose CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that directly predicts LULC maps from cloud-contaminated Sentinel-2 imagery and temporally adjacent Sentinel-1 SAR observations. The proposed network incorporates optical reliability modulation to suppress unreliable optical responses, heterogeneous information adaptive aggregation to model high-order spatial-channel interactions between optical and SAR representations, and a unified semantic mapping transformer to organize fused features in a LULC-oriented latent space. A semantic anchor-guided optimization strategy is further introduced to improve the consistency of intermediate semantic representations. To support this task, we construct CloudLULC-Set, a large-scale benchmark dataset containing 40,223 curated SAR-optical-label triplets with pixel-level LULC annotations across diverse geographic regions and cloud conditions. Experimental results show that CloudLULC-Net achieves an OA of 86.60%, an F1-score of 83.29%, and an mIoU of 73.51%, outperforming representative heterogeneous reconstruction-first and end-to-end SAR-optical mapping methods. Comparisons with existing global LULC products and analyses under different cloud-cover levels further demonstrate the robustness and practical value of CloudLULC-Net for target-date LULC mapping in cloud-prone regions. The project is publicly available at: https://github.com/RSIIPAC/CloudLULC
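
For readers less familiar with the three reported metrics, the sketch below computes overall accuracy (OA), F1, and mean intersection-over-union (mIoU) from a pixel-level confusion matrix. Averaging F1 and IoU over classes with equal weight is an assumption here; the abstract does not say which averaging the authors use, and `segmentation_metrics` is a hypothetical helper.

```python
def segmentation_metrics(confusion):
    """Compute OA, class-averaged F1, and mIoU.

    confusion[i][j] counts pixels of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    f1_scores, ious = [], []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                          # missed pixels of class i
        fp = sum(confusion[r][i] for r in range(n)) - tp     # pixels wrongly assigned to i
        denom = tp + fp + fn
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if denom else 0.0)
        ious.append(tp / denom if denom else 0.0)
    return {
        "oa": correct / total,
        "f1": sum(f1_scores) / n,
        "miou": sum(ious) / n,
    }
```

Because IoU penalises both false positives and false negatives in the denominator, mIoU is always at most the class-averaged F1, which is consistent with the ordering of the reported figures.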

2606.17711 2026-06-17 cs.CV cs.AI New submission

Structured Adversarial Camouflage via Voronoi Diagrams

Jens Bayer, Stefan Becker, David Münch, Michael Arens, Jürgen Beyerer

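Affiliations: Fraunhofer IOSB and Fraunhofer Center for Machine Learning; Karlsruhe Institute of Technology

AI summary: Generates structured camouflage patterns by optimizing seed-point locations with a soft assignment under a fixed palette, effectively lowering person-detection AP, with attacks that transfer across domains.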

Abstract

Pixel-wise adversarial patches are computationally heavy and often visually detectable, limiting utility in security-critical systems. We present adversarial Voronoi camouflage that optimizes only seed-point locations under fixed, printable palettes using a soft assignment, producing structured, splinter camouflage-like patterns without additional regularization. Evaluated on person detection with COCO-style AP@[.5:.95], naive placement (Inria -> COCO) performs comparably poorly, while garment-level application via segmentation mask (3DPeople) results in a significant AP drop. The attack transfers to out-of-domain backgrounds and across detector families (YOLOv9/10/11/12), indicating robustness in black-box settings. Repainting with different palettes largely nullifies the effect, and single-color tweaks show limited tolerance (<=0.17), highlighting a structure-palette coupling. The parameter-efficient, palette-constrained design improves visual plausibility while degrading real-time detector performance. Physical validation and color calibration are left for future work. Code: https://github.com/JensBayer/Voronoi. This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-224-RSY - the ICMCIS, held in Bath, United Kingdom, 12-13 May 2026.
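
A soft assignment is what makes seed positions optimizable: instead of giving each pixel the colour of its nearest seed (a hard, non-differentiable choice), the colours are blended with softmax weights over negative squared distances. The sketch below shows only this forward computation for one pixel, assuming one palette colour per seed; the function name, the `temperature` value, and the seed-to-colour pairing are illustrative rather than the released code.

```python
import math


def soft_voronoi_color(pixel, seeds, palette, temperature=0.05):
    """Blend palette colours by softmax over negative squared seed distances.

    pixel is (x, y); seeds is a list of (x, y); palette[i] is the RGB colour
    tied to seeds[i]. A lower temperature approaches a hard Voronoi diagram.
    """
    logits = [-((pixel[0] - sx) ** 2 + (pixel[1] - sy) ** 2) / temperature
              for sx, sy in seeds]
    peak = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return tuple(
        sum(w * color[channel] for w, color in zip(weights, palette)) / total
        for channel in range(3)
    )
```

Because the palette entries stay fixed and only the seed coordinates enter the weights, the parameter count is two numbers per seed, which matches the parameter-efficient, palette-constrained framing in the abstract.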

2606.17710 2026-06-17 cs.CV cs.AI cs.CL cs.LG 新提交

Vision-language models for chest radiography do not always need the image

胸部X光片的视觉-语言模型并不总是需要图像

Mahshad Lotfinia, Sebastian Ziegelmayer, Lisa Adams, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg（弗里德里希-亚历山大-埃尔朗根-纽伦堡大学模式识别实验室）； Department of Diagnostic and Interventional Radiology, TUM University Clinic, School of Medicine and Health, Klinikum rechts der Isar, Technical University of Munich（慕尼黑工业大学医学院与健康学院伊萨尔河右岸医院诊断与介入放射学系）； Lab for AI in Medicine, RWTH Aachen University（亚琛工业大学医学人工智能实验室）； Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen（亚琛工业大学医院诊断与介入放射学系）

AI总结本文通过因果审计方法，发现许多医学视觉-语言模型在胸部X光片任务中依赖文本先验而非图像，纯文本模型与多模态模型性能接近，并提出了基于图像依赖性的评估框架。

AI中文摘要

医学视觉-语言模型报告了强大的胸部X光片准确性，这越来越多地被解读为它们使用了图像的证据。这种推断是不安全的：一个利用发现名称先验的模型得分与读取扫描的模型相同，且没有标准基准能区分它们。我们引入了一种因果审计方法，通过遮挡相关区域、遮挡无关区域以及替换为另一患者的相同标签扫描来干预图像，并结合三种行为指标测试正确答案是否依赖于图像。在九个系统中，一个没有图像访问权限的纯文本模型与最佳多模态模型的准确率差距在5.7个百分点以内，而一个1190亿参数的多模态模型在统计上与70亿参数的纯文本基线无法区分。审计将队列分为三个忽略图像的模型、一个不稳定的模型和五个选择性使用图像的模型（针对部分发现）；这些分类在第二个数据集、分辨率和提示措辞上保持一致。与委员会认证的放射科医生相比，纯文本模型在准确率上与放射科医生无统计差异，但其基础归因率为零，而使用图像的模型的基础归因率与放射科医生相当。报告的置信度仅在模型使用图像时才能标记出无根据的答案。基础归因审计（而非准确性）应成为临床部署的门槛。

英文摘要

Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.

2606.17706 2026-06-17 cs.LG cs.AI 新提交

Confusion-Aware Transfer Teacher Curriculum Learning Framework: Disentangling Scoring and Pacing Effects

混淆感知的迁移教师课程学习框架：解耦评分与节奏效应

Savini Kommalage, Sanka Mohottala, Asiri Gawesha, Dulara Madhusanka, Menan Velayuthan, Dharshana Kasthurirathna, Mahima Milinda Alwis Weerasinghe, Charith Abhayaratne

发表机构 * Faculty of Computing, Sri Lanka Institute of Information Technology, Sri Lanka（斯里兰卡信息科技学院计算机学院，斯里兰卡）； Faculty of Engineering, University of Sri Jayewardenepura, Sri Lanka（斯里兰卡贾亚韦达内普拉大学工程学院，斯里兰卡）； Faculty of Engineering, Sri Lanka Institute of Information Technology, Sri Lanka（斯里兰卡信息科技学院工程学院，斯里兰卡）； University of Sheffield, United Kingdom（谢菲尔德大学，英国）； Utrecht University, The Netherlands（乌得勒支大学，荷兰）

AI总结提出混淆感知难度评分，通过阶段性子集测试和随机基线解耦课程学习的评分与节奏效应，在CIFAR-10上验证评分可解释性，但全数据下无提升，仅在小数据量下提升数据效率。

Comments Accepted at International Conference on Machine Learning (ICML) GlobalSouthML Workshop (2026)

AI中文摘要

课程学习结合了两个设计选择：样本如何按难度评分，以及较难样本如何逐步引入训练，这使得难以将观察到的性能提升归因于任一组件。我们通过两种评估协议解耦这些因素：阶段性子集测试（独立于课程训练验证评分函数）和基线（将相同的节奏调度应用于随机排序数据）。在迁移教师框架（TTF）中，我们使用这些协议评估一种混淆感知的难度评分，该评分同时考虑正确类别的置信度和错误类别上的概率分布。在CIFAR-10上使用ResNet-18和VGG-16，所提出的评分产生了与人类直觉一致的模型可解释难度排序。然而，在全数据下，无论是课程排序还是反课程排序，都没有比标准训练提高准确率，这表明仅改进评分函数不足以克服TTF中课程学习的已知失败模式。相反，我们发现混淆感知的课程排序带来一致的数据效率优势，在20%数据量下比随机排序高出最多8.7个百分点，表明TTF作为一种数据高效训练方法的潜力。

英文摘要

Curriculum learning couples two design choices, how samples are scored by difficulty and how harder samples are paced into training, making it difficult to attribute observed gains to either component. We disentangle these factors with two evaluation protocols: stage-wise test subsets that validate scoring functions independently of curriculum training, and a baseline that applies the same pacing schedule to randomly ordered data. Within the Transfer Teacher framework (TTF), we use these protocols to evaluate a confusion-aware difficulty score that considers both correct-class confidence and the probability distribution over incorrect classes. On CIFAR-10 with ResNet-18 and VGG-16, the proposed score produces model-interpretable difficulty rankings that align with human intuition. However, at full data, neither curriculum nor anti-curriculum ordering improves accuracy over standard training, indicating that improving the scoring function alone is insufficient to overcome the known failure modes of curriculum learning in TTF. In contrast, we find that confusion-aware curriculum ordering results in consistent data-efficiency benefits, outperforming random ordering by up to 8.7 percentage points in the 20% data regime, suggesting the potential of TTF as a data-efficient training method.
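
The separation of scoring from pacing can be illustrated with a small sketch. Everything below is hypothetical: the abstract does not give the score's formula or the pacing schedule, so `confusion_aware_score` (which weights low correct-class confidence by how concentrated the wrong-class mass is), `linear_pacing`, and the `alpha` and `start_fraction` parameters are stand-ins chosen only to show the structure of the comparison.

```python
import math
import random


def confusion_aware_score(probs, label, alpha=0.5):
    """Hypothetical difficulty score in [0, 1]; higher means harder.

    Scales (1 - correct-class confidence) by how concentrated the
    remaining probability mass is on a single wrong class.
    """
    p_true = probs[label]
    wrong = [p for i, p in enumerate(probs) if i != label]
    mass = sum(wrong)
    if mass <= 0.0 or len(wrong) < 2:
        concentration = 0.0
    else:
        q = [p / mass for p in wrong]
        entropy = -sum(p * math.log(p) for p in q if p > 0.0)
        concentration = 1.0 - entropy / math.log(len(q))
    return (1.0 - p_true) * ((1.0 - alpha) + alpha * concentration)


def linear_pacing(epoch, total_epochs, start_fraction=0.2):
    """Fraction of the ordered data made available at a given epoch."""
    if total_epochs <= 1:
        return 1.0
    progress = epoch / (total_epochs - 1)
    return min(1.0, start_fraction + (1.0 - start_fraction) * progress)


def staged_subsets(order, total_epochs, start_fraction=0.2):
    """Yield the index prefix used at each epoch under the pacing schedule."""
    for epoch in range(total_epochs):
        fraction = linear_pacing(epoch, total_epochs, start_fraction)
        yield order[:max(1, round(fraction * len(order)))]


if __name__ == "__main__":
    # Toy teacher outputs for four samples (made-up numbers).
    teacher_probs = [[0.9, 0.05, 0.05], [0.4, 0.6, 0.0],
                     [0.7, 0.2, 0.1], [0.5, 0.25, 0.25]]
    labels = [0, 0, 0, 0]
    scores = [confusion_aware_score(p, y) for p, y in zip(teacher_probs, labels)]
    curriculum = sorted(range(len(scores)), key=scores.__getitem__)
    baseline = list(range(len(scores)))
    random.Random(0).shuffle(baseline)  # same pacing, random order
    for a, b in zip(staged_subsets(curriculum, 3), staged_subsets(baseline, 3)):
        print(a, b)
```

Running both orderings through the same `staged_subsets` generator is what isolates the effect of the score: any difference between the two runs cannot come from the pacing schedule, because it is shared.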

2606.17702 2026-06-17 cs.CV cs.AI 新提交

SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

SegTME-UNI2: 一种基于基础模型的可泛化多类细胞分割框架及LLM驱动的肿瘤微环境表征在组织病理学中的应用

Wan Siti Halimatul Munirah Wan Ahmad, Faris Syahmi Samidi, Mohammad Badal Ahmmed, Vimal Angela Thiviyanathan, Selvam James Thavaraj, Anwar P. P. Abdul Majeed

发表机构 * Department of Data Science and Artificial Intelligence, School of Computing and Artificial Intelligence, Faculty of Engineering and Technology, Sunway University（双威大学工程与技术学院计算与人工智能学院数据科学与人工智能系）； Faculty of Dentistry, Universiti Malaya（马来亚大学牙科学院）

AI总结提出SegTME-UNI2框架，结合UNI2-H病理基础模型与双头UperNet解码器实现六类语义分割和核实例分割，通过三阶段伪标签课程学习解决标注不足问题，并利用LLM生成临床可解释的TME报告。

AI中文摘要

从常规H&E染色组织学图像中表征肿瘤微环境（TME）需要同时进行细胞分割、特征提取和可解释的临床报告。我们提出了SEGTME-UNI2，一个统一框架来满足这些需求。其核心是UNI2-UPERHOVER，一个双头分割模型，将UNI2-H病理基础模型（ViT-Giant，在来自100K张切片的>100M张图块上预训练）与两个并行的UperNet解码器配对：一个用于六类语义分割，另一个用于水平-垂直梯度回归，从而实现基于分水岭的核实例分离。为了解决大型真实世界数据集中缺乏像素级标注的问题，UNI2-UPERHOVER经历了一个三阶段渐进式伪标签课程。每个阶段训练一个全新模型（无权重迁移），完全通过提高伪标签质量来驱动改进：阶段1：使用人工标注的PanNuke（7,901张图像，189,744个细胞核，0.25 um/像素）。阶段2：使用阶段1模型在271,711个TCGA-UT尺度0图块（0.5 um/像素）上生成的熵过滤伪标签。阶段3：使用阶段2模型在所有1,608,060个TCGA-UT图块（覆盖六个分辨率尺度，0.5-1.0 um/像素）上生成的伪标签。分割输出输入到一个结构化的TME特征提取流水线，计算每个图块的20多个组成、形态、空间熵和细胞间距离指标。这些指标编码为JSON，并传递给微调的NVIDIA BioNeMo GPT模型，以生成临床可解释的TME叙述。在保留的PanNuke和TCGA-UT分区上的初步验证证明了框架的可行性和内部一致性。公开释放了伪标注的TCGA-UT数据集和UNI2-UPERHOVER检查点，以支持大规模TME分析和空间生物学研究。

英文摘要

Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on >100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, driving improvement entirely via increased pseudo-label quality: Stage 1: Uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 um/pixel). Stage 2: Uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 um/pixel). Stage 3: Uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 um/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.

2606.17698 2026-06-17 cs.AI cs.CL 新提交

从受训者到训练者：用于多智能体推理的LLM设计的强化学习训练环境

Chao Chen, Chengzu Li, Zhiwei Li, Yinhong Liu, Zhijiang Guo

发表机构 * LARK, HKUST (GZ)（香港科技大学（广州）LARK实验室）； University of Cambridge（剑桥大学）； HKUST（香港科技大学）

AI总结提出LLM-as-Environment-Engineer框架，让策略模型自动分析失败轨迹并修改训练环境配置，在MAPF-FrozenLake测试平台上用Qwen3-4B实现最优性能。

AI中文摘要

用于大语言模型（LLM）训练的强化学习流程通常依赖于阶段之间手动重新设计的环境，要求从业者启发式地推断哪种配置最能改进当前策略。为了自动化这一过程，我们提出了LLM-as-Environment-Engineer框架，其中当前策略模型分析失败轨迹及上下文信息，并提出对下一阶段训练环境配置的修改。我们还引入了MAPF-FrozenLake，一个可控的测试平台，其生成器暴露多维环境配置，适合研究和基准测试环境重新设计。在该测试平台上，我们将环境工程师的条件建立在策略行为、失败案例和环境统计的结构化摘要上，从而生成下一训练阶段的配置。以Qwen3-4B为骨干，我们的框架在基准测试中取得了最强的综合性能，优于更大的专有LLM（如GPT、Gemini）和固定环境训练基线。我们进一步分析了哪种形式的上下文最有效，发现成功的环境更新依赖于失败证据并保留已生效的配置。有趣的是，当前的RL检查点比原始基础模型更适合作为环境工程师，这表明策略学习提高了模型诊断其剩余弱点的能力。

英文摘要

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.

2606.17680 2026-06-17 cs.LG cs.CL 新提交

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

EnvRL: 在智能体强化学习中从环境动力学中学习

Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang, Maosong Sun, Juanzi Li

发表机构 * Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）； Shanghai AI Laboratory（上海人工智能实验室）

AI总结提出EnvRL框架，通过状态预测和逆动力学两个辅助目标，将环境动力学学习融入智能体强化学习，在长周期任务中显著提升成功率。

AI中文摘要

强化学习已成为训练大型语言模型作为智能体的强大范式。然而，针对长周期智能体任务的常规强化学习方法往往难以处理稀疏的结果奖励。直观上，这忽略了展开交互轨迹中包含的丰富环境动力学信息。我们认为交互体验本身固有地充当隐式监督信号，揭示了环境的潜在转换机制，并使智能体能够构建更准确的环境内部模型。因此，在这项工作中，我们研究了如何利用这一额外信号来改进策略学习。具体来说，我们提出了EnvRL，一个通过两个辅助目标（状态预测和逆动力学）将环境动力学学习融入智能体强化学习的框架。通过与主要强化学习目标联合优化，我们鼓励智能体从其自身的交互体验中内化环境动力学。在两个长周期智能体基准上的大量实验表明，EnvRL在成功率上比仅使用强化学习的基线有显著提升，例如，当使用GRPO训练时，在ALFWorld上将Qwen-2.5-1.5B-Instruct从72.8%提升到77.4%，在WebShop上从56.8%提升到67.0%。

英文摘要

Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rate over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.
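
One plausible way to write the joint objective described above is as a weighted sum of the RL loss and the two auxiliary losses. The weights $\lambda_{\mathrm{sp}}$ and $\lambda_{\mathrm{id}}$ and the log-likelihood form of each auxiliary term are assumptions for illustration; the abstract names the two objectives but does not give their exact form.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{RL}}(\theta)
  + \lambda_{\mathrm{sp}}\,\mathcal{L}_{\mathrm{sp}}(\theta)
  + \lambda_{\mathrm{id}}\,\mathcal{L}_{\mathrm{id}}(\theta),
\qquad
\mathcal{L}_{\mathrm{sp}}(\theta) = -\,\mathbb{E}_{(s_t, a_t, s_{t+1})}
  \big[\log p_\theta(s_{t+1} \mid s_t, a_t)\big],
\quad
\mathcal{L}_{\mathrm{id}}(\theta) = -\,\mathbb{E}_{(s_t, a_t, s_{t+1})}
  \big[\log p_\theta(a_t \mid s_t, s_{t+1})\big]
```

Here $s_t$ and $a_t$ denote the observation and action at step $t$ of a rollout. State prediction asks the model to predict the next observation from the current one and the action taken; inverse dynamics asks it to recover the action from two consecutive observations. Both terms draw supervision from every transition in a trajectory, whereas the outcome reward is sparse.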

2606.17678 2026-06-17 cs.CV cs.AI 新提交

See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL

先看后答：基于充分性驱动的强化学习实现视觉证据预对齐

Yilian Liu, Sicong Leng, Guoshun Nan, Junyi Zhu, Jiayu Huang, Minghao Sun, Xuancheng Zhu, Yisong Chen, Zexian Wei, Xiaofeng Tao

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Nanyang Technological University（南洋理工大学）； China Telecom（中国电信）

AI总结提出视觉证据预对齐（VEPA）方法，在预训练与后训练之间引入充分性驱动的GRPO优化，以增强多模态大模型对细粒度视觉证据的利用，显著提升视觉密集型任务性能。

AI中文摘要

多模态大语言模型（MLLMs）将强大的文本推理与视觉输入相结合，但其响应可能与底层图像不一致，表明在推理过程中未能有效利用视觉证据。当前的训练范式依赖于大规模基于标题的预训练进行通用对齐，随后通过监督微调和强化学习实现指令遵循和复杂推理。然而，这种预训练仅提供较弱的视觉基础：简短、粗略的标题使模型偏向显著物体，而忽略了细粒度的视觉证据。本文引入视觉证据预对齐（VEPA），作为预训练与后训练之间的中间阶段，探索一种新颖的充分性驱动目标，结合组相对策略优化（GRPO）来优化基于问题的视觉证据描述。在多种基准上的大量实验表明，我们的VEPA在视觉密集型评估上持续提升性能，并补充了标准的监督后训练。进一步分析表明，这种提升源于增强的、可迁移的视觉基础，而非额外的任务特定训练。

英文摘要

Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gains stem from strengthened, transferable visual grounding, rather than from additional task-specific training.

2606.17675 2026-06-17 cs.CV 新提交

Do We Really Need Diffusion? A Fast U-Net for Paired Medical Image Translation

我们真的需要扩散吗？用于配对医学图像翻译的快速U-Net

Alicia Pirwass, Birte Glimm, Michael Munz, Hans-Joachim Wilke

发表机构 * Institute of Artificial Intelligence, Ulm University（乌尔姆大学人工智能研究所）； Institute of Orthopaedic Research and Biomechanics, Centre for Trauma Research, University Hospital Ulm（乌尔姆大学医院创伤研究中心骨科研究与生物力学研究所）； AI for Sensor Data Analytics Research Group, Ulm University of Applied Sciences（乌尔姆应用科学大学传感器数据分析人工智能研究组）

AI总结本文比较轻量级4级U-Net与去噪扩散概率模型（DDPM）在从T2加权MRI估计脂肪分数任务上的性能，发现U-Net在精度和速度上均优于DDPM。

AI中文摘要

磁共振成像-信号脂肪分数（MRI-SFF）量化组织脂肪，是代谢和肌肉骨骼疾病的既定生物标志物。然而，采集需要专门的MRI序列，这些序列并非常规可用。我们研究是否可以通过图像到图像翻译（I2I）从广泛可用的T2加权（T2w）MRI估计SFF。我们进一步使用来自德国国家队列（NAKO）的230048对2D图像（183517训练，23621验证，22910测试）数据集，将轻量级4级U-Net与最先进的去噪扩散概率模型（DDPM）进行比较。两种模型均明显优于恒等基线（Pearson相关系数r=0.769，平均绝对误差MAE=0.070±0.054），证实模型学习了非平凡的跨模态映射。有趣的是，轻量级U-Net在相关性（r=0.975 vs. 0.962）和误差（MAE=0.014±0.015 vs. 0.019±0.019）方面均优于DDPM，同时推理时间减少了208倍（每张图像25.2 ms vs. 5 227.2 ms，使用50步去噪扩散隐式模型（DDIM））。在显著降低计算成本的同时实现强大的临床性能，使得实时临床使用成为可能。

英文摘要

Magnetic resonance imaging-signal fat fraction (MRI-SFF) quantifies tissue fat and serves as an established biomarker for metabolic and musculoskeletal disorders. However, its acquisition requires specialized MRI sequences that are not routinely available. We investigate whether SFF can be estimated from widely available T2-weighted (T2w) MRI via image-to-image translation (I2I). We further compare a lightweight 4-level U-Net to a state-of-the-art Denoising Diffusion Probabilistic Model (DDPM) using a dataset of 230 048 paired 2D images (183 517 train, 23 621 val, 22 910 test) from the German National Cohort (NAKO). Both models clearly outperform the identity baseline (Pearson correlation r = 0.769, mean absolute error MAE = 0.070 +/- 0.054), which confirms that the models learn a non-trivial cross-modal mapping. Interestingly, the lightweight U-Net outperforms the DDPM in both correlation (r = 0.975 vs. 0.962) and error (MAE = 0.014 +/- 0.015 vs. 0.019 +/- 0.019), while reducing inference time by a factor of 208 (25.2 ms vs. 5 227.2 ms per image using 50 Denoising Diffusion Implicit Model (DDIM) steps). The strong clinical performance at substantially reduced computational cost enables real-time clinical use.
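
The two reported metrics can be computed as in the sketch below, which treats each image as a flat list of pixel values. It is a generic illustration rather than the authors' evaluation code: the function names are made up, and the reading of the "identity baseline" as scoring the unmodified T2w input directly against the SFF target is an assumption.

```python
import math


def pearson_r(pred, target):
    """Pearson correlation between two equal-length value sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def mean_absolute_error(pred, target):
    """Mean absolute difference between prediction and target."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    # Made-up pixel values in [0, 1], for demonstration only.
    target = [0.10, 0.30, 0.50, 0.70]
    model_output = [0.12, 0.28, 0.52, 0.69]
    identity_input = [0.40, 0.45, 0.35, 0.60]  # input passed through unchanged
    for name, pred in [("model", model_output), ("identity", identity_input)]:
        print(name, pearson_r(pred, target), mean_absolute_error(pred, target))
```

Reporting both metrics is useful because they capture different failures: correlation is insensitive to a constant offset or scale error, while MAE penalizes it directly.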

2606.17669 2026-06-17 cs.SD 新提交

DeSRPA: Decoupled Speech Role-Playing Agent via Inference-Time Intervention

DeSRPA: 通过推理时干预的解耦语音角色扮演智能体

Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide

发表机构 * Nagoya University（名古屋大学）； National Institute of Informatics（国立情报学研究所）

AI总结提出DeSRPA框架，通过推理时干预冻结骨干模型，利用双层控制向量机制解耦认知推理与副语言表达，在语音角色扮演中实现个性与情感一致性，超越端到端微调方法。

Comments Accepted to INTERSPEECH 2026

AI中文摘要

虽然大型语言模型（LLMs）已经革新了基于文本的角色扮演，但创建沉浸式语音角色扮演智能体（SRPAs）需要在认知推理和副语言细微差别之间建立无缝桥梁。当前的SRPAs主要依赖于端到端（E2E）微调。然而，这种范式由于依赖角色特定数据而难以泛化到未见过的角色，同时施加了“模态对齐税”，降低了LLM固有的推理能力。我们提出DeSRPA，一种通过在冻结骨干模型上进行推理时干预来实现角色扮演的智能体框架。DeSRPA采用双层控制向量机制，即内部认知引导和外部表达渲染，以同步“思维”和“声音”。在SpeechRole和OmniCharacter基准上的实验表明，DeSRPA在个性和情感一致性上显著优于E2E基线。它实现了高语音自然度，缩小了与GPT-4o Audio等专有模型的差距，同时保持了一种可扩展且无需训练的范式。

英文摘要

While Large Language Models (LLMs) have revolutionized text-based role-playing, creating immersive Speech Role-Playing Agents (SRPAs) requires a seamless bridge between cognitive reasoning and paralinguistic nuances. Current SRPAs primarily rely on end-to-end (E2E) fine-tuning. However, this paradigm suffers from poor generalization to unseen characters due to its reliance on role-specific data, while imposing a "modality alignment tax" that degrades intrinsic LLM reasoning capabilities. We propose DeSRPA, an agentic framework for character role play via inference-time intervention on frozen backbones. DeSRPA employs a dual-level control vector mechanism, Internal Cognitive Steering and External Expressive Rendering, to synchronize "mind" and "voice". Experiments on SpeechRole and OmniCharacter benchmarks demonstrate that DeSRPA significantly outperforms E2E baselines in personality and emotional consistency. It achieves high speech naturalness, narrowing the gap with proprietary models like GPT-4o Audio, while remaining a scalable and training-free paradigm.

2606.17668 2026-06-17 cs.LG cs.AI q-bio.QM 新提交

ASTEROID: A Spatiotemporal Information Transformer for Forecasting Multi-Step Time Series of Molecular Dynamics

ASTEROID: 用于分子动力学多步时间序列预测的时空信息变换器

Kexin Wu, Luonan Chen, Renxiao Wang

发表机构 * Department of Medicinal Chemistry, School of Pharmaceutical Sciences, Fudan University（药学院药物化学系，复旦大学）； School of Mathematical Sciences and School of AI, Shanghai Jiao Tong University（数学科学学院和人工智能学院，上海交通大学）

AI总结提出ASTEROID框架，通过将分子动力学轨迹重构为高维时空序列并集成时空信息变换方程到Transformer中，实现多步原子坐标的直接预测，在多个量子力学分子数据集上显著提升预测精度并降低计算成本。

Comments 32 pages, 10 figures

详情

AI中文摘要

分子动力学（MD）模拟计算需求高，尤其对于需要长期分析的大规模系统。准确预测MD模拟结果不仅是一个有吸引力的科学挑战，而且具有重要的实用价值。在这项工作中，我们开发了一个数据驱动框架，称为ASTEROID（用于推断动力学的先进时空变换器），可以直接预测多步原子坐标，避免传统的迭代积分。为此，我们的ASTEROID将MD轨迹重构为高维时空序列，并将时空信息（STI）变换方程集成到Transformer架构中。ASTEROID的核心创新在于其建模多尺度时空依赖性的能力。具体来说，对于空间依赖性，局部-全局自注意力机制捕获短程和长程相互作用。对于时间依赖性，编码器-解码器结构将全局上下文与自回归预测相结合。ASTEROID在几个量子力学衍生的分子数据集上进行了评估。我们的结果表明，ASTEROID不仅在各种基准测试中实现了比现有方法更高的多步预测精度，而且显著降低了传统MD模拟的计算成本。此外，该模型支持在扩展时间尺度上的迭代多步预测。这项工作为加速MD模拟建立了一个稳健且可推广的数据驱动范式。

英文摘要

Molecular dynamics (MD) simulation is computationally demanding, particularly for large-scale systems requiring long-term analysis. Accurately forecasting the outcomes of an MD simulation is not only an attractive scientific challenge but also has substantial practical value. In this work, we developed a data-driven framework, termed ASTEROID (Advanced Spatiotemporal TransformER fOr Inferring Dynamics), that can directly predict multi-step atomic coordinates, avoiding conventional iterative integration. For this purpose, ASTEROID reformulates MD trajectories as high-dimensional spatiotemporal sequences and integrates the Spatiotemporal Information (STI) Transformation equation into a Transformer architecture. The core innovation of ASTEROID lies in its ability to model multiscale spatiotemporal dependencies. In particular, for spatial dependencies, a local-global self-attention mechanism captures both short- and long-range interactions. For temporal dependencies, an encoder-decoder structure integrates global context with autoregressive forecasting. ASTEROID was evaluated on several quantum-mechanics-derived molecular datasets. Our results indicate that ASTEROID not only achieved higher accuracy in multi-step prediction than existing methods on various benchmarks, but also significantly reduced the computational cost relative to conventional MD simulation. Moreover, the model supports iterative multi-step forecasting over an extended time scale. This work establishes a robust and generalizable data-driven paradigm for accelerating MD simulations.
