arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2606.07724 2026-06-09 cs.LG New submission

A Geometry-Aware Triplane Field Network for Vehicle Aerodynamic Prediction

Kangkang Qi, Huiyu Yang, Keqi Ding, Yunpeng Wang, Yuntian Chen, Yuanwei Bin, Rikui Zhang, Jianchun Wang

Affiliations: Southern University of Science and Technology; Shenzhen Tenfong Technology Co., Ltd.; Eastern Institute of Technology

AI summary: Proposes the geometry-aware triplane field network (GTF-Net), whose dual-stream backbone combines an adaptive Fourier neural operator with a CNN to predict vehicle aerodynamic pressure and wall shear stress efficiently, with accuracy exceeding existing methods.

Comments: 28 pages, 8 figures

Abstract

High-fidelity computational fluid dynamics (CFD) is crucial to vehicle aerodynamic analysis, but its cost still constrains early-stage design exploration. Machine-learning-based surface-field prediction offers a faster alternative if the model can efficiently capture both global flow context and local geometric detail. This work proposes a machine-learning-based method, named the geometry-aware triplane field network (GTF-Net), for vehicle aerodynamic pressure and wall shear stress prediction. GTF-Net constructs triplane features directly from sampled surface points through a shared multilayer perceptron (MLP) and smooth bilinear rasterization. The planes are then processed by a dual-stream backbone that combines adaptive Fourier neural operator (AFNO) spectral mixing with convolutional neural network (CNN) refinement, so that long-range aerodynamic coupling and local geometry-induced variations are modeled in the same representation. At the query stage, sampled triplane features are combined with vehicle-aligned directional coordinates, normal-projection features, and a voxel-based curvature proxy. GTF-Net is compared with Transolver, the geometry-informed neural operator (GINO), and TripNet, a triplane-based surrogate model. GTF-Net improves the relative L2 error from the strongest baseline value of 0.157 to 0.145 for pressure prediction and from 0.237 to 0.226 for wall shear stress prediction. Ablation results show that AFNO mixing, local CNN refinement, and query-side geometric encoding each contribute to accuracy, supporting the proposed mechanism of combining a structured triplane representation with explicit aerodynamic geometry cues.
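
For readers unfamiliar with the metric, the following is a minimal sketch of a relative L2 error for one predicted surface field. The abstract does not say whether the error is computed per sample or over the whole test set, so the per-field form below is an assumption, and the sample values are purely illustrative.

```python
import math

def relative_l2_error(pred, target):
    """Return ||pred - target||_2 / ||target||_2 for two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))
    den = math.sqrt(sum(t ** 2 for t in target))
    if den == 0.0:
        raise ValueError("target norm is zero; relative error is undefined")
    return num / den

# Hypothetical pressure values at four sampled surface points (illustrative only).
target = [3.0, 4.0, 0.0, 0.0]
pred = [3.0, 4.0, 0.0, 1.0]
print(relative_l2_error(pred, target))
```

Because the error is normalised by the norm of the reference field, values such as 0.145 and 0.157 can be compared across fields with different physical units.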

2606.07723 2026-06-09 cs.RO New submission

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

Siyi Chen, Hugo Hadfield, Alex Zook, Mikaela Angelina Uy, Chan Hee Song, Erwin Coumans, Xuning Yang, Faisal Ladhak, Qing Qu, Stan Birchfield, Jonathan Tremblay, Valts Blukis

Affiliations: NVIDIA; University of Michigan

AI summary: Proposes VoLoAgent, a VLM that performs physical orchestration by treating a VLA/WAM as an interruptible tool, enabling open-vocabulary long-horizon manipulation and substantially outperforming existing systems on the new RoboVoLo benchmark.

Abstract

Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneous robot capabilities as interruptible tools. Unlike in virtual AI agents, the timing of decisions, actions and tool calls is important in a physical world that does not pause for reasoning. We refer to this setting as Physical Orchestration, and propose VoLoAgent, a VLM that plans, monitors, and recovers by treating a VLA/WAM as an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark for open-vocabulary long-horizon manipulation across common sense, memory/state tracking, complex references, and world knowledge, with both task-level success and failure-mode diagnostics. Experiments show VoLoAgent substantially outperforms single VLA/VLM or tool-based systems, with validation on real-robot experiments. Project page: https://chicychen.github.io/VoLo/

2606.07714 2026-06-09 cs.LG cs.AI cs.HC New submission

Beyond Accuracy: Interpreting Topic Representation in Suicide Ideation Detection Models

Hamideh Ghanadian, Isar Nejadgholi, Hussein Al Osman

Affiliations: University of Ottawa; National Research Council Canada

AI summary: Uses visualization and geometric analysis to examine how suicide ideation detection models internally encode psychological risk factors, and finds that topic augmentation improves the clarity and interpretability of representations for underrepresented risk factors.

Abstract

Suicide ideation detection models are typically evaluated using aggregate performance metrics, yet little is known about how they internally represent psychologically meaningful risk factors. In high-stakes mental health applications, understanding these internal representations is essential for safety, transparency, and responsible deployment. In this work, we move beyond accuracy and analyze how suicide detection models trained on original and topic-augmented datasets encode psychological risk factors in their internal representation space. Using visualization and geometric analysis, we examine the coherence and separability of topic-related features. Our results show that topic-aware augmentation increases the clarity and distinctness of underrepresented psychosocial risk factors such as immigration, family issues, and financial crisis. These findings suggest that augmentation not only improves model performance but also leads to more structured and interpretable internal representations.

2606.07713 2026-06-09 cs.LG cs.AI cs.PF New submission

Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

Lenore Mullin, Gaetan Hains

Affiliations: University at Albany; Université Paris-Est Créteil

AI summary: Proposes a Mathematics of Arrays (MoA) reformulation of scaled dot-product attention that eliminates all intermediate arrays by algebraic construction, achieving $O(n d_k + n d_v)$ data movement versus $O(n^2 + n d_k + n d_v)$ for the standard implementation, which substantially reduces memory traffic; numerical accuracy is verified.

Abstract

The attention mechanism is the dominant computational bottleneck in modern transformer-based AI. Its standard implementation incurs quadratic memory traffic in the sequence length $n$, and DRAM accesses cost 100-1000$\times$ more energy than arithmetic operations on contemporary hardware, so any analysis focused solely on FLOP counts fundamentally mischaracterises the bottleneck. We present a Mathematics of Arrays (MoA) reformulation of scaled dot-product attention and its numerically stable softmax, deriving a Denotational Normal Form (DNF) that eliminates all intermediate arrays, including the implicit transposed-key buffer and every softmax temporary, by algebraic construction rather than empirical tuning. The DNF achieves $O(n d_k + n d_v)$ data movement versus $O(n^2 + n d_k + n d_v)$ for the standard implementation, where $n$ is the sequence length, $d_k$ is the key dimensionality and $d_v$ the value dimensionality, and is verified numerically against PyTorch at full double-precision floating-point on concrete inputs. Unlike hardware-specific accelerators or empirical tiling schemes such as FlashAttention, MoA simultaneously provides array fusion, shape-transformation correctness, and predictive cost models from a single algebraic framework. Memory minimality is a theorem established before any code is written. A predictive performance model projects $2$-$100\times$ speedup and $2$-$50\times$ energy reduction, with the advantage widening at exascale. The derivation establishes a formally verified pipeline from Python specification through Operational Normal Form (ONF) and dimension-lifted hardware mapping, providing performance-portable AI kernels of direct relevance to DARPA edge-deployment and DOE exascale priorities.
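
To make the idea of avoiding the $n \times n$ score buffer concrete, here is a minimal sketch for a single query row. It uses the well-known online-softmax recurrence, not the paper's MoA derivation or its DNF, and the function names are hypothetical. The streaming version keeps only a running maximum, a running denominator, and a length-$d_v$ accumulator.

```python
import math

def naive_attention(q, K, V):
    """Reference version: materialises every score for one query row."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / z
            for j in range(len(V[0]))]

def streaming_attention(q, K, V):
    """Single pass over (key, value) pairs with no stored score list."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(V[0])
    for k, v in zip(K, V):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, s)
        # Rescale earlier partial sums to the new reference maximum.
        correction = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        denom = denom * correction + w
        acc = [a * correction + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

# Illustrative inputs: one query, three keys/values, d_k = d_v = 2.
q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(naive_attention(q, K, V))
print(streaming_attention(q, K, V))
```

Both functions compute the same numerically stable softmax-weighted sum; the difference is only in which temporaries exist, which is the quantity the abstract's data-movement bounds are about.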

2606.07711 2026-06-09 cs.LG cs.AI New submission

Rosetta Memory: Adaptive Memory for Cross-LLM Agents

Hao Yang, Shiqi Shen, Haoxuan Li, Zhipeng Wang, Zhi Gong, Xu Chen

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Weixin, Tencent; Institute for Artificial Intelligence, Peking University

AI summary: Proposes memory-centric LLM adaptation, in which two profile-conditioned operators and a minimum-gain sampling curriculum address the cross-model problem of upstream memory activating downstream LLMs, outperforming baselines on several QA tasks.

Comments: 19 pages, 7 figures

Abstract

Memory is the key component for transforming a stateless LLM into a persistent, evolving agent through experience accumulation, long-horizon planning, and continual self-improvement. Existing memory systems typically take the LLM as the center and design memory operations tailored to a specific backbone. In practice, however, users frequently switch between LLMs, for example using Claude for coding and GPT for writing across tasks, or routing different steps to different backbones within a single task for cost-effective trade-offs. As a result, memory written by one model often needs to be consumed by another. Making upstream memory effectively adapt to and activate downstream LLMs remains a critical yet underexplored problem. To bridge this gap, we shift the perspective from LLM-centric memory design to memory-centric LLM adaptation. Specifically, we approach the above upstream-downstream memory adaptation problem from both the write and read sides, and design two profile-conditioned operators that are jointly trained to optimize how memory is stored and presented for better task completion. To ensure the learned operators generalize across a broad set of LLMs, we propose a minimum-gain sampling curriculum that prioritizes the least-served LLMs during training. To better measure the operators' actual contribution rather than the LLM's own capability, we design a performance-gap reward that compares against a naive memory baseline. Experiments on HotpotQA, 2WikiMultihopQA, and MuSiQue demonstrate that our model consistently outperforms baselines and remains robust under unseen-model replacement.

2606.07710 2026-06-09 cs.LG cs.AI New submission

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

Young D. Kwon, Miles Williams, Rui Li, Alexandros Kouris, Stylianos I. Venieris

Affiliations: Samsung AI Center, Cambridge, UK

AI summary: Proposes WhiFlash, the first cross-paradigm speculative decoding method to unify autoregressive and diffusion-based parallel drafting, using fine-grained routing and cache optimizations to achieve throughput gains of up to 69.6%.

Comments: Under review

Abstract

The autoregressive nature of large language models (LLMs) remains a significant bottleneck for inference, particularly in complex agentic workloads. While speculative decoding (SD) accelerates inference, current approaches rely on static drafting paradigms, utilising either autoregressive drafting models for reasoning or diffusion-based parallel drafting models for structured outputs. We empirically find that drafting accuracy fluctuates dramatically within a single sequence, leaving significant performance unrealised by static paradigms and coarse-grained routing. To address this volatility, we introduce WhiFlash, the first cross-paradigm SD method that unifies autoregressive and diffusion-based parallel drafting under a single token-level controller. WhiFlash adopts a fine-grained routing mechanism that employs either a lightweight entropy-based or a learned neural policy, both parametrised to provide a tunable balance between expected token gain and latency. To make high-frequency switching computationally viable, we introduce novel cache-management optimisations, Lazy Catch-up and KV-only Prefill, reducing switching overhead to below 7% of per-round latency. By capitalising on the complementary strengths of fundamentally distinct drafting architectures, WhiFlash achieves significantly higher acceptance lengths, yielding category-specific throughput gains of up to 69.6% over the state-of-the-art autoregressive EAGLE-3 and 37.3% over the diffusion-based DFlash.

2606.07708 2026-06-09 cs.CV cs.AI New submission

Cross-View Urban Traffic Dataset: Drone-Supervised Ground Truth for Monocular Bird's-Eye View Localization

Prakhar Bhardwaj, Simone Weikl, Kilian Mang, Elia Jonas Sandtner

Affiliations: OTH Regensburg

AI summary: Presents a cross-view urban traffic dataset built from synchronized bicycle-view and drone-view videos, supporting cross-view identity matching and bird's-eye-view prediction, with identity-level alignment and standardized evaluation.

Abstract

We introduce a dataset and benchmark for cross-view urban traffic perception built from synchronized ego-centric bicycle videos and aerial drone videos recorded at real urban intersections. The benchmark targets two linked tasks: cross-view identity matching between street-view and drone-view object tracks, and ego-to-bird's-eye-view prediction using aerial supervision. In contrast to prior urban driving and V2X datasets, our benchmark provides identity-level alignment across radically different viewpoints together with standardized evaluation, annotation tooling, and baseline implementations. This setting is motivated by intersection-centric traffic analysis, where identity preservation, local interactions, and global spatial structure must be reasoned about jointly across views. We evaluate methods at both the track and frame levels, including cross-view ID precision/recall/IDF1, near-far breakdowns, temporal stability, and consistency metrics. We also provide baseline results for wedge-based cross-view matching and for three BEV prediction baselines: inverse perspective mapping, a MonoLayout-style learned baseline, and a regression baseline. The results show that the benchmark is feasible but challenging: cross-view matching achieves strong recall yet remains limited by over-assignment and temporal inconsistency, while ego-to-BEV prediction benefits from aerial supervision but remains far from saturated under lightweight monocular sensing. We hope that this benchmark will support future research on cross-view perception, urban scene alignment, and ego-to-global traffic understanding.
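
As a reminder of how identity-level scores relate to each other, the sketch below uses the standard tracking definitions of ID precision, ID recall, and IDF1 from identity true positives, false positives, and false negatives. Whether the benchmark's cross-view variants are counted in exactly this way is an assumption, and the counts are invented for illustration.

```python
def id_scores(idtp, idfp, idfn):
    """Return (ID precision, ID recall, IDF1) from identity-level match counts."""
    precision = idtp / (idtp + idfp) if (idtp + idfp) else 0.0
    recall = idtp / (idtp + idfn) if (idtp + idfn) else 0.0
    denom = 2 * idtp + idfp + idfn
    idf1 = 2 * idtp / denom if denom else 0.0
    return precision, recall, idf1

# Hypothetical counts: many extra assignments (idfp) but few misses (idfn).
precision, recall, idf1 = id_scores(idtp=9, idfp=6, idfn=1)
print(precision, recall, idf1)
```

With these made-up counts, recall is high while precision and IDF1 are pulled down by the extra assignments, which is the pattern the abstract describes as strong recall limited by over-assignment.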

2606.07707 2026-06-09 cs.LG New submission

Decoding Naturalistic Emotion Dynamics from the Brain: An LLM-Enhanced Regression Framework

Lemei Zhang, Peng Liu, Hans Dahle Kvadsheim, August Sætre Aasvær, Shuer Ye, Reza Bonyadi, Maryam Ziaei, Jon Atle Gulla

Affiliations: NTNU; Kavli Institute for Systems Neuroscience, NTNU; Microsoft

AI summary: Proposes a multi-target regression framework that uses an LLM to extract continuous sentiment features from a naturalistic narrative and combines dynamic functional connectivity with machine learning algorithms to decode continuous emotional trajectories from fMRI data, revealing interpretable, emotion-specific brain network topologies.

Abstract

Decoding emotional states from neural signals has typically been framed as a discrete, single-label classification task based on emotionally stable stimuli, a formulation that oversimplifies the continuous, fluid, and co-occurring nature of human affect. This study reconceptualizes emotion decoding by adopting a multi-target regression framework to track multiple overlapping emotional dimensions as continuous trajectories over time. Leveraging the robust generalization capabilities of Large Language Models (LLMs), we extracted fine-grained, continuous sentiment profiles from a naturalistic auditory narrative, Alice in Wonderland, to serve as scalable proxies for subjective affect in a human fMRI dataset. Departing from standard classification paradigms or mass-univariate subtractive contrasts that filter out network dynamics, we leverage regularized and kernel-based machine learning algorithms as continuous estimators to track the magnitude of macroscale neural state variations. We demonstrate that models trained on temporal snapshots of Dynamic Functional Connectivity (DFC) significantly outperform static region-of-interest (ROI) amplitude representations, effectively capturing continuous emotional trajectories under rapidly fluctuating narrative input. Furthermore, by implementing graph-theoretical Explainable AI (XAI) techniques, we deconstruct the underlying predictive features to reveal highly interpretable, emotion-specific topological configurations. Collectively, these results highlight the utility of LLM-automated annotation in affective neuroscience and provide compelling empirical evidence for psychological constructionist frameworks, demonstrating that dynamic, distributed network interactions offer superior explanatory power over strictly locationist accounts of emotion.

2606.07705 2026-06-09 cs.LG cs.AI New submission

SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models

Yuchen He, Baolong Bi, Shenghua Liu, Huaming Liao, Yuyao Ge, Bolin Wan, Siqian Tong, Juan Chen, Jiafeng Guo, Xueqi Cheng

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Electronic Science and Technology of China

AI summary: To address asynchronous reward learning in multi-objective reinforcement learning, proposes SAW, a lightweight dynamic weighting mechanism that uses the coefficient of variation to adjust each objective's contribution in real time, improving training efficiency and final performance under both GRPO and GDPO.

Comments: 17 pages, 7 figures, 5 tables

Abstract

Although multi-objective reinforcement learning (MORL) is central to aligning large language models with complex human preferences, the prevailing practice of static weighted summation overlooks a more fundamental phenomenon: reward learning is markedly asynchronous across objectives. Well-learned dimensions quickly produce homogeneous, low-variance signals whose residual noise contaminates the aggregated reward (in GRPO) or occupies a fixed share of the advantage budget (in GDPO), interfering with the scarce yet high-value signals carried by under-learned dimensions. To address this asynchrony, we propose Stage-Aware Dynamic Weighting (SAW), a lightweight, algorithm-agnostic dynamic weighting mechanism. SAW utilizes the coefficient of variation (CV) as a scale-invariant proxy for real-time informativeness, reweighting each dimension's reward or advantage contribution by its relative informativeness within the batch. Unlike gradient-based methods that require multiple forward and backward passes, SAW relies solely on batch-level statistics, introducing nearly negligible computational overhead. Experiments on tool-calling and text summarization tasks demonstrate that SAW consistently improves both training efficiency and final performance under both GRPO and GDPO frameworks, confirming it as a general-purpose plug-in for multi-reward LLM alignment. Our code is available at https://github.com/Zhaolutuan/SAW
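
The batch-level reweighting idea can be illustrated with a small sketch. The abstract only states that the coefficient of variation (CV) serves as a scale-invariant proxy for informativeness; the normalisation of CVs into weights, the `eps` guard, and the uniform fallback below are assumptions made for illustration, not SAW's actual rule.

```python
import statistics

def cv_weights(reward_columns, eps=1e-8):
    """Weight each reward dimension in proportion to its batch CV (std / |mean|)."""
    cvs = []
    for values in reward_columns:
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        cvs.append(std / (abs(mean) + eps))
    total = sum(cvs)
    if total == 0.0:
        # Every dimension is constant in this batch: fall back to uniform weights.
        return [1.0 / len(cvs)] * len(cvs)
    return [cv / total for cv in cvs]

# Hypothetical batch: dimension 0 is already saturated, dimension 1 still varies.
batch = [
    [1.0, 1.0, 1.0, 1.0],  # well-learned, zero variance
    [0.0, 1.0, 0.0, 1.0],  # under-learned, informative
]
print(cv_weights(batch))
```

Only per-batch means and standard deviations are needed, which is consistent with the abstract's point that no extra forward or backward passes are required.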

2606.07702 2026-06-09 cs.LG cs.AI New submission

EvoCSFL: Surrogate-Assisted Evolutionary Client Selection for Efficient and Robust Federated Learning

Lin Qiang, Sun Xiaoyan, Hu Yao, Fang Wei

Affiliations: Jiangnan University; The Hong Kong Polytechnic University

AI summary: To address the slow convergence and poor robustness caused by client data and system heterogeneity in federated learning, proposes a surrogate-assisted evolutionary client selection framework that formulates selection as a combinatorial optimization problem and uses a surrogate model to accelerate the evolutionary search; experiments show faster convergence, lower energy consumption, and stronger robustness.

Abstract

The heterogeneity of client data and systems makes it difficult to achieve satisfactory convergence speed and robustness in federated learning with random client selection. To address this issue, this paper proposes a surrogate-assisted evolutionary client selection framework for federated learning. In this framework, several typical client selection strategies are first used to generate candidate sets, and a metric function that integrates model performance, communication latency, and energy consumption is developed to formulate the client selection problem as a combinatorial optimization problem. Subsequently, a surrogate model is constructed using the candidate selections and the metric to efficiently approximate the performance of selected client subsets. An evolutionary algorithm is employed to search the combinatorial space of client selections, guided by the surrogate model to accelerate convergence. Experiments on MNIST, CIFAR10, CINIC10, and TinyImageNet demonstrate that the proposed algorithm achieves faster convergence, lower energy consumption, and improved robustness compared to existing methods.

2606.07698 2026-06-09 cs.LG cs.AI New submission

Pharmacogenomic Knowledge Graph Augmentation for Graph Neural Network-Based Drug-Drug Interaction Prediction

Juergen Dietrich

Affiliations: AI Solutions Berlin

AI summary: Integrates pharmacogenomic prior knowledge from PharmGKB (CYP enzyme annotations) as a feature vector to augment graph neural networks for drug-drug interaction prediction; DDI type classification improves substantially under pair-level splits, but the Information Ceiling is not broken.

Comments: 13 pages

Abstract

Graph neural networks (GNNs) applied to drug-drug interaction (DDI) prediction rely exclusively on molecular structure encoded as SMILES-derived graphs. Prior work in this series demonstrated that model performance is bounded by the structural information content of training labels (an Information Ceiling) that architectural refinements alone cannot overcome. The present study investigates whether pharmacogenomic prior knowledge from the PharmGKB database partially lifts this ceiling by providing metabolic pathway context that is independent of, and complementary to, molecular structure. Cytochrome P450 (CYP) enzyme substrate, inhibitor, and inducer annotations for four clinically relevant isoforms (CYP2D6, CYP3A4, CYP2C19, CYP2C9) are extracted and incorporated as a 12-dimensional feature vector concatenated to the molecular embedding prior to interaction prediction. Experiments are conducted under both pair-level and drug-level data splits to quantify generalization to unseen drugs. Results indicate that knowledge graph (KG) augmentation substantially improves DDI type classification under pair-level split conditions (F1-macro: 0.532 vs. 0.241 baseline), while binary interaction detection and drug-level generalization remain bounded by the Information Ceiling (AUC inflation: 0.224 vs. 0.250 baseline). Mechanistic validation on strictly held-out compounds confirms that augmentation preferentially improves CYP2C9-mediated interaction prediction, with probabilities increasing from 0.033-0.117 (baseline) to 0.560-0.586 (KG-augmented). An extension to single-molecule toxicity prediction on the Tox21 benchmark confirms that the effect is contingent on pharmacogenomic annotation coverage. These findings motivate the multimodal framework proposed for the subsequent study in this series.
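
The 12 dimensions follow from four isoforms times three annotation roles. A minimal sketch of such an encoding is shown below; the binary 0/1 values, the ordering of the slots, and the function names are assumptions, and the example annotations are placeholders rather than real pharmacology.

```python
ISOFORMS = ("CYP2D6", "CYP3A4", "CYP2C19", "CYP2C9")
ROLES = ("substrate", "inhibitor", "inducer")

def cyp_feature_vector(annotations):
    """Encode a set of (isoform, role) pairs as a 12-dimensional 0/1 vector."""
    return [1.0 if (iso, role) in annotations else 0.0
            for iso in ISOFORMS for role in ROLES]

def augment(embedding, annotations):
    """Concatenate the CYP features to a molecular embedding (assumed layout)."""
    return list(embedding) + cyp_feature_vector(annotations)

# Placeholder annotations for a hypothetical drug, not taken from PharmGKB.
drug_x = {("CYP2C9", "substrate"), ("CYP3A4", "inhibitor")}
features = cyp_feature_vector(drug_x)
print(len(features), features)
print(len(augment([0.1, -0.3, 0.7], drug_x)))
```

A drug with no annotations maps to the all-zero vector under this encoding, which is one way to see why the abstract reports that the benefit depends on annotation coverage.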

2606.07696 2026-06-09 cs.LG cs.AI New submission

Adversarial Robustness of Activation Steering in Large Language Models

Kien Le, Thai Le

Affiliations: Independent Researcher; Indiana University

AI summary: Studies the robustness of activation steering under adversarial text perturbations and finds that, across all methods, models, and settings, directional robustness drops by up to 64%, confidence collapses, and layer selection is fragile, revealing a structural brittleness.

Comments: 9 pages, 2 figures

Abstract

Activation steering has become a popular training-free method to control LLM behavior by injecting precomputed direction vectors into the model's residual stream at inference time. Yet its robustness to realistic input variation remains unstudied. We present the first systematic evaluation of activation steering robustness under adversarial text perturbations on the inputs, covering four extraction methods, three attack strategies, six personas from the Anthropic Model-Written Evaluation dataset, and five models ranging from 1.5B to 30B parameters. Attacks succeed broadly across all settings: directional robustness drops by up to 64%, post-attack confidence collapses to near or below 0.25 across all methods and models, and steering strength degrades on nearly every steerable input. Layer selection is equally fragile, with the optimal layer identified by an automated method on clean inputs shifting by up to 17 positions under perturbation, a failure that compounds the vector-level breakdown. Extracting vectors from adversarially perturbed inputs partially recovers steerability for PCA and MD on mid-to-large models, but they consistently fail to locate the improved optimal layer, limiting the practical benefit of this mitigation. Together, these findings reveal that the brittleness of activation steering is structural rather than method-specific, and that current layer selection strategies are not robust enough for real-world deployment.
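
The injection step described in the first sentence can be sketched on a single hidden-state vector. Unit-normalising the direction, the scalar `alpha`, and the use of cosine similarity between a clean-input vector and a perturbed-input vector as a directional check are all assumptions for illustration; the abstract does not define its robustness metrics.

```python
import math

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalised direction to one hidden-state vector."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 2-dimensional example with made-up vectors.
clean_direction = [3.0, 4.0]
perturbed_direction = [4.0, 3.0]
print(steer([0.0, 0.0], clean_direction, alpha=5.0))
print(cosine_similarity(clean_direction, perturbed_direction))
```

In a real model the same addition would be applied to residual-stream activations at a chosen layer, which is why the choice of layer matters as much as the vector itself.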

2606.07695 2026-06-09 cs.LG cs.AI New submission

DSFNet: Learning Dual-Domain Spectral Operators for Multi-Modality Spatio-Temporal Forecasting in Urban Transportation Systems

Yongchao Li, Yang Li, Zhuoxuan Li, Jun Chen, Chu Zhang, Jinde Cao, Leszek Rutkowski

Affiliations: Southeast University; Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies; City University of Hong Kong; School of Mathematics, Southeast University; Systems Research Institute of the Polish Academy of Sciences; Luoyang Normal University; Purple Mountain Laboratories; AGH University of Krakow

AI summary: Proposes DSFNet, a dual-domain spectral filtering network that factorizes space-modality interactions into feature-domain and spatial-domain spectral operators, explicitly models cross-variable couplings and heterogeneous spatial dependencies, and uses an external gating mechanism to adaptively regulate temporal dynamics, reducing MAE by 3.21%-10.16% on five real-world traffic datasets.

Abstract

Multi-Modality Spatio-Temporal Forecasting (MoSTF) extends traditional spatio-temporal forecasting by incorporating diverse traffic modalities. Despite significant recent strides in spatio-temporal modeling, existing approaches often fail to explicitly model the coupling relationships between different modality variables. Accurate MoSTF is challenging, as it requires modeling (1) temporal dynamic heterogeneity under exogenous influences and (2) heterogeneous spatial dependencies alongside complex cross-variable couplings. To address these challenges, we propose the Dual-Domain Spectral Filtering Network (DSFNet). Our framework employs dual-domain spectral filtering to capture heterogeneous spatial patterns and explicitly model the relationships between variables. Unlike graph-based message passing or dense attention over node-modality pairs, DSFNet factorizes space-modality interactions into feature-domain and spatial-domain spectral operators, enabling scalable modeling of nonlocal dependencies and cross-modality couplings. Furthermore, we introduce an external gating mechanism to adaptively regulate temporal dynamics under external influences. We validate our method through extensive experiments on five representative real-world traffic datasets. Compared with the second-best baselines, DSFNet reduces MAE by 3.21%-10.16% across these datasets. The results demonstrate that DSFNet significantly outperforms existing state-of-the-art baselines in accuracy while exhibiting efficiency and robustness.

2606.07694 2026-06-09 cs.LG stat.ML New submission

Vessel Traffic Flow Prediction on Sparse Data via Spatio-Temporal Graph Neural Networks with a Learnable Tweedie Head

Kyeongjun Lee, Heeyoung Kim

Affiliations: Korea Advanced Institute of Science and Technology (KAIST)

AI summary: To handle vessel traffic flow data that are highly sparse with intermittent bursts, proposes a model-agnostic learnable Tweedie head as a plug-and-play output module that optimizes the closed-form Tweedie unit deviance, predicts the mean, and learns a node-level variance power to capture heterogeneity across port areas, significantly improving RMSE on real-world AIS data.

Abstract

Accurate vessel traffic flow prediction is crucial for smart port operations and navigational safety. However, maritime traffic flow data are often highly sparse with intermittent bursts, making robust forecasting challenging. Under such conditions, conventional spatio-temporal graph neural networks (ST-GNNs) can degrade toward conservative near-zero predictions and fail to capture non-zero activity. Although zero-inflated negative binomial (ZINB) models partially address excess zeros, their two-part formulation can still remain conservative around abrupt transitions. To address these issues, we propose a model-agnostic learnable Tweedie head that can be attached as a plug-and-play output module to arbitrary ST-GNN backbones. Instead of likelihood-based Tweedie training, which typically requires surrogate objectives, our approach optimizes the closed-form Tweedie unit deviance and predicts the mean for point forecasting while learning a node-level variance power to capture heterogeneous variability across port areas. Experiments on a maritime traffic graph constructed from real-world AIS data in the Port of Los Angeles and Long Beach show that the proposed head consistently improves RMSE across multiple ST-GNN backbones, especially on non-zero events, leading to more reliable forecasts for practical maritime traffic control.
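
The closed-form Tweedie unit deviance mentioned in the abstract can be sketched as follows. This is a minimal illustration of the standard textbook deviance for a variance power 1 < p < 2, not the authors' implementation; the function names `tweedie_unit_deviance` and `mean_node_deviance`, and the idea of passing one `p` per node as a plain list, are assumptions made for this example.

```python
def tweedie_unit_deviance(y: float, mu: float, p: float) -> float:
    """Standard Tweedie unit deviance for a variance power 1 < p < 2.

    y  : observed non-negative count or flow (may be exactly zero)
    mu : predicted mean, must be strictly positive
    p  : variance power (in the paper this is learned per node; here it is an input)
    """
    if not 1.0 < p < 2.0:
        raise ValueError("this sketch only covers 1 < p < 2")
    if y < 0 or mu <= 0:
        raise ValueError("need y >= 0 and mu > 0")
    term_y = y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
    term_cross = y * mu ** (1.0 - p) / (1.0 - p)
    term_mu = mu ** (2.0 - p) / (2.0 - p)
    return 2.0 * (term_y - term_cross + term_mu)


def mean_node_deviance(ys, mus, ps):
    """Average deviance over nodes, each with its own (hypothetical) variance power."""
    values = [tweedie_unit_deviance(y, mu, p) for y, mu, p in zip(ys, mus, ps)]
    return sum(values) / len(values)
```

The deviance is zero when the prediction equals the observation and stays finite at y = 0, which is what makes it usable as a loss on sparse flows with many exact zeros.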

2606.07690 2026-06-09 cs.LG cs.AI 新提交

HARP: Efficient Data Selection for Finetuning Large Language Models

HARP:高效数据选择用于微调大型语言模型

Ning Wang, Zhengxin Zhang, Maosen Tang, Yitang Gao, Claire Cardie, Sainyam Galhotra

发表机构 * Cornell University(康奈尔大学) The Hong Kong University of Science and Technology(香港科技大学)

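AI summary: Proposes Hierarchical Active Region Pruning (HARP), an efficient train-based data selection method that cuts selection cost through a hierarchy and empirical Bayes inference while preserving downstream alignment. Across multiple benchmarks it outperforms the strongest baseline by up to 8.9 points while using roughly 7x fewer training examples.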

详情
AI中文摘要

微调数据选择需要平衡两个相互竞争的目标:选择改善下游目标的示例,以及在不重复微调模型的情况下做到这一点。无训练选择器具有可扩展性,但依赖于嵌入相似性或聚类等代理,这些可能无法匹配目标目标。基于训练的选择器通过梯度信号、子集评估或Shapley归因更好地反映下游效用,但需要大量昂贵的训练-评估迭代。我们提出层次主动区域剪枝(HARP),一种高效的基于训练的选择器,在降低选择成本的同时保持下游对齐。HARP将训练池组织成节点-叶子层次结构,仅评估代表性叶子,并使用经验贝叶斯后验推断未测量的效用。然后,它使用两个互补的包络选择数据:HARP-C,保守地控制冗余,以及HARP-E,加性地奖励互补区域。我们理论上证明,在局部平滑和有界估计误差下,HARP控制选择误差同时降低训练-评估成本。我们进一步验证,HARP变体实现了最佳结果,并在使用大约7倍更少训练示例的情况下,比最强基线高出最多8.9分。

英文摘要

Finetuning data selection requires balancing two competing goals: selecting examples that improve the downstream objective, and doing so without repeatedly finetuning models. Train-free selectors are scalable but rely on proxies such as embedding similarity or clustering, which may not match the target objective. Train-based selectors better reflect downstream utility through gradient signals, subset evaluation, or Shapley attribution, but require many costly train-evaluate iterations. We propose Hierarchical Active Region Pruning (HARP), an efficient train-based selector that preserves downstream alignment while reducing selection cost. HARP organizes the training pool into a node-leaf hierarchy, evaluates only representative leaves, and infers unmeasured utilities with empirical Bayes posteriors. It then selects data using two complementary envelopes: HARP-C, which conservatively controls redundancy, and HARP-E, which additively rewards complementary regions. We theoretically show that, under local smoothness and bounded estimation error, HARP controls selection error while reducing train-evaluate cost. We further validate that HARP variants achieve the best results and outperform the strongest baseline by up to $+8.9$ points, while using roughly $7\times$ fewer training examples.

2606.07687 2026-06-09 cs.CV cs.AI 新提交

What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction

什么使视频世界模型潜在空间与动作相关:预测优于重建

Jewon Yeom, Hanseul Kim, Jeongjae Park, Sungmok Jung, Jaejin Lee, Taesup Kim

发表机构 * Graduate School of Data Science, Seoul National University(首尔大学数据科学研究生院)

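AI summary: A unified probe-based evaluation finds that action-relevant structure is driven mainly by temporal video pretraining rather than pixel reconstruction fidelity; video-pretrained self-supervised encoders achieve the best Pareto trade-off between visual fidelity and action prediction.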

详情
AI中文摘要

视频世界模型越来越多地用于提供预测性视觉表示,但尚不清楚哪些预训练信号在其潜在空间中诱导出与动作相关的结构。我们通过跨多种编码器家族的统一探针评估来研究这个问题,包括仅图像自监督、带或不带潜在预测的视频预训练、基于重建的自编码器、扩散模型以及捷径强制动力学模型。使用共同的逆动力学探针目标,我们发现动作相关结构主要由时间视频预训练驱动,而非像素重建保真度:具有强像素解码质量的模型可能表现出接近零的动作可恢复性,而视频预训练的自监督编码器在视觉保真度和动作预测之间始终实现最佳帕累托权衡。比较V-JEPA和VideoMAE进一步表明,大部分收益来自自然视频时间上下文,特征级潜在预测提供了较小的额外收益。这些趋势在机器人基准测试中转移,尽管CALVIN显示静态环境任务可以通过允许强图像先验来部分掩盖时间结构的重要性。最后,逆动力学监督显著提高了对视觉损坏的鲁棒性,表明动作感知目标正则化了潜在几何,超越了干净环境性能。我们的结果确定时间预测结构——而非重建保真度——是动作相关视频表示的主要成分。

英文摘要

Video world models are increasingly used to provide predictive visual representations, yet it remains unclear which pretraining signals induce action-relevant structure in their latent spaces. We study this question through a unified probe-based evaluation across diverse encoder families, including image-only self-supervision, video pretraining with and without latent prediction, reconstruction-based autoencoders, diffusion models, and shortcut-forcing dynamics models. Using a common inverse-dynamics probing objective, we find that action-relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity: models with strong pixel decoding quality can exhibit near-zero action recoverability, while video-pretrained self-supervised encoders consistently achieve the best Pareto trade-off between visual fidelity and action prediction. Comparing V-JEPA and VideoMAE further shows that most gains arise from natural-video temporal context, with feature-level latent prediction providing a smaller additional benefit. These trends transfer across robotic benchmarks, though CALVIN reveals that static-environment tasks can partially mask the importance of temporal structure by allowing strong image priors to suffice. Finally, inverse-dynamics supervision substantially improves robustness to visual corruption, suggesting that action-aware objectives regularize latent geometry beyond clean-setting performance. Our results identify temporal predictive structure, not reconstruction fidelity, as the primary ingredient underlying action-relevant video representations.

2606.07686 2026-06-09 cs.LG cs.AI 新提交

Knowledge-Inclusive Adaptive Physics-Informed Neural Network for Microbial Interaction Modelling

知识包容的自适应物理信息神经网络用于微生物相互作用建模

Ravisha Rupasinghe, Rajith Vidanaarachchi, Asela Hevapathige, Sachith Seneviratne, Sen-Lin Tang, Saman Halgamuge

发表机构 * University of Melbourne(墨尔本大学) Academia Sinica(中央研究院)

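AI summary: Proposes a knowledge-inclusive adaptive PINN framework that improves microbial community modelling by integrating textual and network-structure knowledge, with performance gains of up to 53% on real and simulated datasets.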

详情
Comments
33 pages
AI中文摘要

物理信息神经网络(PINN)是一种在机器学习方法中以方程形式包含知识的方式。除了方程,知识还以其他形式存在,如文本和网络结构。虽然现有的基于PINN的方法从数据中发现方程参数,但它们仅依赖实验测量。我们提出一个新的PINN框架,通过整合辅助知识源来丰富参数发现。我们将该框架应用于微生物学,其中广义Lotka-Volterra(gLV)作为建模微生物群落的生物学基础。我们证明,整合知识可以改进微生物群落建模。我们的框架利用同行评审的宏基因组学文献丰富gLV参数,因为文本提供了gLV单独无法捕捉的外部影响的生物学背景。我们使用数据驱动的整合方法将这些知识与微生物丰度的实验测量相结合。我们通过显式建模微生物相互作用来整合基于网络的结构知识。我们的知识包容框架推断微生物网络,揭示生态学见解。我们根据文献中记录的生态角色验证这些发现。我们在涵盖人类和植物相关微生物群落的真实和模拟数据集上进行评估。我们的框架在无知识情况下比现有技术提升最高53%。知识添加在基于Bray-Curtis差异的准确率上带来最高23%的提升,在R²上带来47%的提升。

英文摘要

A Physics-Informed Neural Network (PINN) is a way of including knowledge, in the form of equations, in machine learning methods. Beyond equations, knowledge exists in other forms, such as text and network structure. While existing PINN-based approaches discover equation parameters from data, they rely solely on experimental measurements. We propose a new PINN framework that enriches parameter discovery by incorporating auxiliary knowledge sources. We instantiate our framework for microbiology, where the generalised Lotka-Volterra (gLV) model serves as a biological foundation for modelling microbial communities. We demonstrate that incorporating knowledge improves microbial community modelling. Our framework enriches the gLV parameters using peer-reviewed metagenomics literature, as text provides biological context on external influences that gLV alone cannot capture. We combine this knowledge with experimental measurements of microbial abundance using a data-driven integration approach. We integrate network-based structural knowledge by explicitly modelling microbial interactions. Our knowledge-inclusive framework infers microbial networks, revealing ecological insights. We validate these findings against ecological roles documented in the literature. We evaluate on real and simulated datasets spanning human- and plant-associated microbial communities. Our framework improves over the state of the art by up to 53%, even without knowledge. Knowledge addition yields gains of up to 23% in Bray-Curtis Dissimilarity-based accuracy and 47% in $\mathrm{R}^2$.
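
For readers unfamiliar with gLV, the dynamics that such a PINN would constrain can be sketched as below. This shows only the standard generalised Lotka-Volterra right-hand side, dx_i/dt = x_i (r_i + sum_j A_ij x_j), with a simple forward Euler step; the names `glv_rhs` and `euler_step` are hypothetical, and nothing here reproduces the paper's network, loss, or knowledge integration.

```python
def glv_rhs(x, r, A):
    """Generalised Lotka-Volterra right-hand side.

    x : list of abundances, one per taxon
    r : list of intrinsic growth rates
    A : interaction matrix as nested lists; A[i][j] is the effect of taxon j on taxon i
    """
    n = len(x)
    return [
        x[i] * (r[i] + sum(A[i][j] * x[j] for j in range(n)))
        for i in range(n)
    ]


def euler_step(x, r, A, dt):
    """One forward Euler step, clipped at zero because abundances cannot be negative."""
    dx = glv_rhs(x, r, A)
    return [max(xi + dt * dxi, 0.0) for xi, dxi in zip(x, dx)]
```

In a PINN setting, the residual between a network's time derivative and `glv_rhs` would typically enter the loss, with `r` and `A` treated as the parameters to be discovered.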

2606.07678 2026-06-09 cs.LG cs.AI 新提交

DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

DOG-DPO:几何中的动态优化用于安全对齐

Yi Nian, Tiankai Yang, Yudi Zhang, Qi Pan, Zelong Xu, Shenzhe Zhu, Qingqing Luan, Yue Huang, Xiangliang Zhang, Yue Zhao

发表机构 * University of Southern California(南加州大学) Iowa State University(爱荷华州立大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) UT Austin(德克萨斯大学奥斯汀分校) Independent Researcher(独立研究员) University of Notre Dame(圣母大学)

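AI summary: Proposes the DOG-DPO framework, which represents preference pairs as directions in model representation space and selects subsets through geometric decomposition and diversity-based coverage, recovering most of the safety gains with only 11% of the data.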

详情
AI中文摘要

大型语言模型的安全对齐依赖于偏好数据,但当前的流水线通常训练于大规模冗余数据集。现有的数据选择方法通常独立地对每个偏好对评分,将方向性偏好信息压缩为标量质量或多样性分数。这种以样本为中心的视角在多数据集设置中尤其受限,其中共享的安全方向与数据集特定的残余风险共存。我们提出DOG-DPO,一种无需训练的数据选择框架,将偏好对视为结构化几何信号。DOG-DPO首先将每个偏好对表示为模型表示空间中的一个方向。然后,它将多数据集偏好几何分解为全局锚点子空间和数据集特定的残余子空间。最后,它通过最大化基于多样性的覆盖来选择子集,鼓励在DPO训练前广泛、非冗余地覆盖对齐方向。在六个安全基准和两个模型骨干上,DOG-DPO仅使用11%的偏好对就实现了强大的效用-鲁棒性权衡。它恢复了全数据训练的大部分安全增益,同时完全无需教师、无需训练,并且比代表性选择基线快得多。

英文摘要

Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing directional preference information into scalar quality or diversity scores. This sample-centric view is especially limiting in multi-dataset settings, where shared safety directions coexist with dataset-specific residual risks. We propose DOG-DPO, a training-free data selection framework that treats preference pairs as structured geometric signals. DOG-DPO first represents each preference pair as a direction in model representation space. It then decomposes multi-dataset preference geometry into a global anchor subspace and dataset-specific residual subspaces. Finally, it selects subsets by maximizing diversity-based coverage, encouraging broad, non-redundant coverage of alignment directions before DPO training. Across six safety benchmarks and two model backbones, DOG-DPO achieves a strong utility-robustness trade-off using only 11% of the preference pairs. It recovers most of the safety gains of full-data training while remaining entirely teacher-free, training-free, and substantially faster than representative selection baselines.

2606.07674 2026-06-09 cs.CV q-bio.NC 新提交

Simultaneous hyperkinetic movement disorders phenotyping: a cross-cohort pediatric transfer study using routine videos, markerless pose estimation and a tabular foundation model

同时性多动症表型分析:基于常规视频、无标记姿态估计和表格基础模型的跨队列儿科迁移研究

Laura Cif, Diane Demailly, Zohra Souei, Muhammad Mushhood Ur Rehman, Juan Dario Ortigoza Escobar, Mayté Castro Jiménez, Cécile A. Hubsch, Sophie Huby, Morgan Dornadic, Gun-Marie Hariz, Eduardo M. Moraud, Jocelyne Bloch, Gabriella A. Horvath, Xavier Vasques

发表机构 * Lausanne University Hospital (CHUV)(洛桑大学医院) University of Lausanne (UNIL)(洛桑大学) Institut du Neurone(神经元研究所) Clinique Beau Soleil(博索莱伊诊所) Institut Mutualiste Montpelliérain(蒙彼利埃互助研究所) Military University Hospital of Sfax(斯法克斯军事大学医院) University of Edinburgh(爱丁堡大学) Hospital Sant Joan de Déu(圣琼德迪乌医院) European Reference Network for Rare Neurological Diseases (ERN-RND)(欧洲罕见神经系统疾病参考网络) Instituto de Salud Carlos III(卡洛斯三世健康研究所) CHU Montpellier(蒙彼利埃大学医院) Umeå University(于默奥大学) University Hospital Lausanne(洛桑大学医院) Ecole Polytechnique Fédérale de Lausanne(洛桑联邦理工学院) British Columbia Children’s Hospital(不列颠哥伦比亚儿童医院)

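AI summary: Proposes a video-based framework combining markerless pose estimation, kinematic descriptors, and a pretrained foundation model; trained on adult data and transferred to a pediatric cohort, it detects multiple hyperkinetic movement disorder phenomenologies simultaneously after lightweight calibration.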

详情
AI中文摘要

目的:开发并外部测试一个基于视频的框架,用于同时检测多动症运动障碍现象:肌张力障碍、震颤、肌阵挛、舞蹈症、手足徐动症、投掷症、刻板动作和抽动,使用常规临床记录,并明确测试从成人到儿科人群的外部跨队列迁移。方法:在这项概念验证研究中,该框架结合了无标记姿态估计、运动学描述符和预训练基础模型。在21名确诊多动症的成人和4名健康对照(按标准化方案评估)上开发了共享预测骨干。外部验证在一个独立的外部队列上进行:一个真实世界的儿科样本(n=12,单基因联合多动症)。对于外部数据集,骨干网络未经重新训练直接部署;轻量校准仅调整最终受试者级别的决策步骤,使用由临床医生选择的小标记子集(代表队列表型范围)。结果:在临床医生选择的子集上对决策层进行本地校准后,在保留的儿科患者(n=7)上性能持续提升:汉明准确率从0.804提高到0.839,Jaccard指数从0.548提高到0.633。当评估限制在临床医生一致性更高的现象时,校准后的性能得以保持,Jaccard指数进一步提高(汉明准确率0.9,Jaccard指数0.786),表明增益并非依赖于最不可靠的标签。

英文摘要

Objective: To develop and externally test a video-based framework for simultaneous detection of hyperkinetic movement disorder (MD) phenomenologies (dystonia, tremor, myoclonus, chorea, athetosis, ballismus, stereotypies, and tics) using routine clinical recordings, with explicit testing of external, cross-cohort transfer from adult to pediatric populations. Methods: In this proof-of-concept study, the framework combines markerless pose estimation, kinematic descriptors, and a pretrained foundation model. A shared predictive backbone was developed on 21 adults with confirmed hyperkinetic MDs and 4 healthy controls assessed under a standardized protocol. External validation was performed on an independent external cohort: a real-world pediatric sample (n=12, monogenic combined MDs). For the external dataset, the backbone was deployed without retraining; lightweight calibration adjusted only the final subject-level decision step using a small labeled subset of patients selected by clinicians as representative of the cohort's phenotypic range. Results: After local calibration of the decision layer on the clinician-selected subset, performance improved consistently on the held-out pediatric patients (n=7): Hamming accuracy rose from 0.804 to 0.839 and the Jaccard index from 0.548 to 0.633. This calibrated performance was preserved, and the Jaccard index further improved, when the evaluation was restricted to the phenomenologies with more definite clinician agreement (Hamming accuracy 0.9, Jaccard index 0.786), indicating that the gains did not rest on the least-reliable labels.
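
The two multi-label metrics reported above can be computed as in the following sketch. It assumes one binary vector per subject with one entry per phenomenology, Hamming accuracy taken as the fraction of correctly predicted labels, and a Jaccard index averaged over subjects; the abstract does not state the averaging scheme, so that choice and the function names are assumptions.

```python
def hamming_accuracy(y_true, y_pred):
    """Fraction of individual labels predicted correctly (1 minus the Hamming loss)."""
    total = sum(len(row) for row in y_true)
    correct = sum(
        t == p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return correct / total


def jaccard_index(y_true, y_pred):
    """Per-subject intersection over union of positive labels, averaged over subjects."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(row_t, row_p) if t and p)
        union = sum(1 for t, p in zip(row_t, row_p) if t or p)
        # Convention assumed here: two empty label sets count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)
```

Hamming accuracy rewards correct negatives, which dominate when most phenomenologies are absent, whereas the Jaccard index ignores them. That difference is one reason the two numbers in the abstract sit so far apart.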

2606.07673 2026-06-09 cs.SD cs.AI cs.LG 新提交

A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction

声带创伤性与非声带创伤性声音亢进的自动分类的分层特征工程框架

June-Woo Kim, Kangwook Jang, Minu Kim, Hyunju Lee

发表机构 * Department of Electronic Engineering, Wonkwang University(圆光大学电子工程系) AI Convergence Research Institute, Wonkwang University(圆光大学人工智能融合研究院) GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science and Technology(光州科学技术院GIST InnoCORE AI-Nano神经退行性疾病早期检测融合研究所) School of Electrical Engineering, KAIST(韩国科学技术院电气工程学院) Department of AI Convergence, Gwangju Institute of Science and Technology(光州科学技术院人工智能融合系)

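AI summary: Proposes a hierarchical feature engineering framework with static, dynamic, ratio-based, and coupling features to classify phonotraumatic and non-phonotraumatic vocal hyperfunction; coupling features prove crucial for both tasks, with an AUC of 0.891 for PVH and 0.728 for NPVH.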

详情
Comments
Interspeech 2026
AI中文摘要

动态颈部表面加速度能够实现声音亢进的无创监测,但其亚型的稳健生物标志物仍然有限。本研究利用NeckVibe Challenge数据集区分声带创伤性(PVH)和非声带创伤性(NPVH)声音亢进与健康对照组。我们提出一个分层特征工程框架,包括:(i)静态特征,(ii)动态特征,(iii)基于比率的特征,(iv)捕捉源-滤波器交互的耦合特征。单变量统计分析显示PVH具有强可分性,但NPVH显著性有限,而我们针对高维特征集成优化的机器学习流程发现,耦合特征对两项任务都至关重要。我们实现了PVH的AUC为0.891,NPVH的AUC为0.728,表明虽然PVH近似线性可分,但NPVH的区分受益于非线性特征交互建模。

英文摘要

Ambulatory neck-surface acceleration enables non-invasive monitoring of vocal hyperfunction, yet robust biomarkers for its subtypes remain limited. This study investigates the NeckVibe Challenge dataset to distinguish phonotraumatic (PVH) and non-phonotraumatic (NPVH) vocal hyperfunction from healthy controls. We propose a hierarchical feature engineering framework comprising (i) static, (ii) dynamic, (iii) ratio-based, and (iv) coupling features capturing source-filter interactions. While univariate statistical analysis shows strong separability for PVH but limited significance for NPVH, our machine learning pipeline, tailored for high-dimensional feature integration, identifies that coupling features are crucial for both tasks. We achieve an AUC of 0.891 for PVH and 0.728 for NPVH, suggesting that while PVH is near-linearly separable, NPVH discrimination benefits from modeling non-linear feature interactions.

2606.07670 2026-06-09 cs.CV cs.AI 新提交

Liquid Neural Networks as a Drop-in Continuous-Time Deformation Field for Dynamic 3D Gaussian Splatting

液态神经网络作为动态3D高斯泼溅的即插即用连续时间变形场

Mingzhao Li, Arghya Pal, Guan Yuan Tan

发表机构 * Monash University(莫纳什大学)

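AI summary: Replaces the MLP with Closed-form Continuous-time (CfC) cells of a Liquid Neural Network (LNN) to build an explicit continuous-time deformation field that matches or exceeds the MLP baseline in dynamic scene reconstruction and is strongest on high-frequency articulated motion.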

详情
AI中文摘要

可变形3D高斯泼溅(D-3DGS)通过一个位置编码的MLP(以帧时间t为输入)变形一组规范3D高斯,从单目视频重建动态场景。尽管拟合连续变量,但MLP在架构中不耦合任意两个t值,实际上预测离散的逐帧偏移,使得时间平滑性仅作为优化的副产品出现。我们将变形场重新设计为一组闭式连续时间(CfC)单元,即液态神经网络(LNN),它是液态时间常数ODE的闭式解,同时保留D-3DGS管道的其他部分。每个单元暴露一个sigmoid时间门,在两个候选隐藏状态之间插值,将学习到的对t的平滑响应嵌入损失景观,无需调用任何数值求解器。在八个D-NeRF和七个NeRF-DS场景上,液态场在总体上匹配或超过MLP基线,其最大增益集中在具有最高频关节运动的场景上。结果是一种近乎零摩擦的架构设计,将离散的MLP变形场转变为t的显式连续时间函数。

英文摘要

Deformable 3D Gaussian Splatting (D-3DGS) reconstructs dynamic scenes from monocular video by deforming a canonical set of 3D Gaussians through a positionally encoded MLP of frame time t. Although fitted to a continuous variable, the MLP couples no two values of t in its architecture and effectively predicts discrete per-frame offsets, leaving temporal smoothness to emerge only as a byproduct of optimisation. We redesign the deformation field as a stack of Closed-form Continuous-time (CfC) cells, a Liquid Neural Network (LNN) formulation that is the closed-form solution of the Liquid Time-constant ODE, while preserving every other part of the D-3DGS pipeline. Each cell exposes a sigmoidal time gate that interpolates between two candidate hidden states, baking a learned smooth response to t into the loss landscape without invoking any numerical solver. On the eight D-NeRF and seven NeRF-DS scenes, the liquid field matches or exceeds the MLP baseline in aggregate, with its largest gains concentrated on the scenes with the most high-frequency articulated motion. The result is a near-zero-friction architectural design that turns the discrete MLP deformation field into an explicit continuous-time function of t.
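
The sigmoidal time gate described above can be illustrated with the gating form usually given for CfC cells, h(t) = sigma(-f * t) * g + (1 - sigma(-f * t)) * h, applied element-wise. In a real cell, f, g, and h are outputs of learned network heads; here they are plain lists supplied by the caller, and whether the paper uses exactly this parameterisation is an assumption.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def cfc_gate_mix(f, g, h, t):
    """Blend two candidate hidden states with a time-dependent gate.

    f : per-unit gate rates (stand-in for a learned head)
    g : candidate state that dominates while the gate is open
    h : candidate state that dominates once the gate has closed
    t : continuous time at which the cell is queried
    """
    out = []
    for fi, gi, hi in zip(f, g, h):
        gate = sigmoid(-fi * t)  # smooth function of t, no ODE solver involved
        out.append(gate * gi + (1.0 - gate) * hi)
    return out
```

Because the output is a smooth closed-form function of t, the field can be queried at any timestamp, including ones between training frames, without integrating an ODE.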

2606.07669 2026-06-09 cs.CV cs.AI 新提交

MemoVAD: Resource-Efficient Video Anomaly Detection via Dynamic Semantic Memory in Edge Computing Scenarios

MemoVAD: 边缘计算场景下基于动态语义记忆的资源高效视频异常检测

Guo Li, Jiandian Zeng, Yang Li, Zihao Peng, Ke Chen, Tian Wang

发表机构 * Institute of Artificial Intelligence and Future Networks, Beijing Normal University(北京师范大学人工智能与未来网络研究院) School of Computing and Artificial Intelligence, Southwest Jiaotong University(西南交通大学计算机与人工智能学院) Engineering Research Center of Cloud-Edge Intelligent Collaboration on Big Data, Ministry of Education, Beijing Normal University(北京师范大学大数据云边智能协同教育部工程研究中心)

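AI summary: Proposes MemoVAD, an edge-cloud collaborative framework that selectively invokes a cloud-based vision-language model through an uncertainty-aware gating policy and caches prototypes in a dynamic semantic memory, improving video anomaly detection performance while reducing communication overhead.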

详情
Comments
Accepted by IJCAI2026
AI中文摘要

在真实监控场景中部署视频异常检测(VAD)面临着对高层语义的需求以确保有效性,与边缘设备有限计算资源之间的根本矛盾。视觉语言模型(VLM)提供了丰富的开放词汇语义,但其延迟和计算成本阻碍了设备端部署。为解决这一挑战,我们提出MemoVAD,一种边缘-云协同框架,选择性地将VLM语义融入流式VAD。MemoVAD在边缘端使用轻量级检测器和因果时序上下文编码器(TCE)建模时序依赖,运行大部分推理。具体而言,我们引入基于主观逻辑的不确定性感知门控(UAG)策略,以建模感知不确定性,并仅对高不确定性和语义新颖的片段查询云端VLM。此外,设计动态语义记忆(DSM)缓存经VLM验证的原型以实现高效检索,使边缘模型通过语义适配器逐步融入VLM级语义。在真实边缘设备上对UCF-Crime和XD-Violence数据集的实验表明,MemoVAD在显著降低通信开销的同时,超越了当前最优性能。

英文摘要

Deploying Video Anomaly Detection (VAD) in real-world surveillance faces a fundamental tension between the demand for high-level semantics to ensure effectiveness and the limited computational resources of edge devices. Vision-Language Models (VLMs) provide rich open-vocabulary semantics, but their latency and computational cost preclude on-device deployment. To address this challenge, we propose MemoVAD, an edge-cloud collaborative framework that selectively incorporates VLM semantics into streaming VAD. MemoVAD runs most inference on the edge with a lightweight detector and a causal Temporal Context Encoder (TCE) to model temporal dependencies. Specifically, we introduce an Uncertainty-Aware Gating (UAG) policy grounded in Subjective Logic to model perceived uncertainty and query the cloud-based VLM only for high-uncertainty and semantically novel clips. In addition, a Dynamic Semantic Memory (DSM) is designed to cache VLM-verified prototypes for efficient retrieval, enabling the edge model to progressively incorporate VLM-level semantics via a semantic adapter. Experiments on the UCF-Crime and XD-Violence datasets on a real edge device show that MemoVAD substantially reduces communication overhead while surpassing state-of-the-art performance.
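
A gating rule of the kind described above might look like the following sketch. It uses the common Subjective Logic mapping from non-negative class evidence to belief masses and an uncertainty mass (b_k = e_k / S, u = K / S, with S the sum of e_k + 1), as used in evidential deep learning. The thresholds, the scalar `novelty` score, and the function names are all hypothetical and not taken from the paper.

```python
def subjective_opinion(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty mass)."""
    k = len(evidence)
    strength = sum(e + 1.0 for e in evidence)  # Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = k / strength
    return belief, uncertainty


def should_query_cloud(evidence, novelty, u_thresh=0.5, novelty_thresh=0.5):
    """Escalate a clip to the cloud VLM only if it is both uncertain and novel.

    novelty is assumed to be a score in [0, 1], for example one minus the best
    similarity to any cached prototype; both thresholds are illustrative.
    """
    _, uncertainty = subjective_opinion(evidence)
    return uncertainty > u_thresh and novelty > novelty_thresh
```

Belief masses and uncertainty always sum to one, so a clip with no supporting evidence gets u = 1 and is escalated when it is also novel, while a confidently classified clip stays on the edge device.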

2606.07661 2026-06-09 cs.CV cs.DL 新提交

PereStruct: Multimodal Semantic Assembly for Robust Historical Document Parsing

PereStruct: 面向鲁棒历史文档解析的多模态语义组装

Maksim Shandybo, Ivan Bespalov, Daniil Yefimov, Marina Kosheleva, Alexander Loukianov

发表机构 * IGIC RAS(俄罗斯科学院信息传输问题研究所) Yandex Cloud National University of Science and Technology MISIS(莫斯科国立钢铁合金学院) Nekrasov Central Universal Scientific Library(涅克拉索夫中央综合科学图书馆)

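AI summary: To parse the complex multi-column layouts of historical newspapers, the paper proposes a multimodal approach combining a fine-tuned YOLO model with a semantic assembly module, reaching an F1 of 0.904 on block-to-article mapping and a BLEU of about 0.96, substantially better than general-purpose vision-language models.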

详情
Comments
Code and data available at https://github.com/makSShandybo/PereStruct
AI中文摘要

解析具有复杂非标准布局的历史文档仍是大规模档案数字化的基本瓶颈。与现代排版不同,历史报纸存在严重的物理退化和高度不规则的页面结构,即使最先进的视觉语言模型也难以应对,呈现出严重的分布外挑战。我们通过一个专门为解析历史报纸(具有特别复杂多栏布局的文档)设计的自动化流程来弥补这一差距。我们的方法结合了用于布局分析和块检测的微调YOLO架构(在1,426张完全人工标注的扫描页面上训练),以及一个新颖的语义组装模块,该模块通过联合建模基于TF-IDF的词法语义相似性、来自微调YOLO的视觉嵌入以及几何布局约束来重构文章。这种多模态集成实现了最先进的性能,在块到文章映射上取得了0.904的F1分数。值得注意的是,与视觉语言模型(Qwen3.6-35B-A3B和Qwen3.6-Plus)的端到端评估表明,PereStruct实现了显著更高的保真度(BLEU约0.96 vs 0.34),验证了模块化架构在通用VLM难以处理的复杂历史布局上表现出色。为了支持可重复性并推动该领域的研究,我们发布了包含599张标注页面的训练语料库和包含93张页面(具有专家验证的真实块到文章映射)的精选PereStruct基准。该框架为复杂档案材料的高保真数字化和语义重建奠定了坚实基础。

英文摘要

Parsing historical documents with complex, non-standard layouts remains a fundamental bottleneck in large-scale archival digitization. Unlike modern typography, historical newspapers exhibit severe physical degradation and highly irregular page structures that confound even state-of-the-art vision-language models, presenting severe out-of-distribution challenges. We address this gap with an automated pipeline specifically designed for parsing historical newspapers, documents characterized by particularly intricate multi-column layouts. Our approach combines a fine-tuned YOLO architecture for layout analysis and block detection, trained on 1,426 fully human-annotated scanned pages, with a novel semantic assembly module that reconstructs articles by jointly modeling lexical-semantic similarity via TF-IDF, visual embeddings from our fine-tuned YOLO, and geometric layout constraints. This multi-modal integration yields state-of-the-art performance, achieving an F1 score of 0.904 on block-to-article mapping. Notably, end-to-end evaluation against vision-language models (Qwen3.6-35B-A3B and Qwen3.6-Plus) demonstrates that PereStruct achieves substantially higher fidelity (BLEU approximately 0.96 vs 0.34), validating that modular architectures excel where generic VLMs fail on complex historical layouts. To support reproducibility and advance research in this domain, we release both the training corpus of 599 annotated pages and a curated PereStruct benchmark of 93 pages with expert-verified ground-truth block-to-article mappings. This framework establishes a robust foundation for high-fidelity digitization and semantic reconstruction of complex archival materials.

2606.07660 2026-06-09 cs.CV cs.LG 新提交

Need We Teach Foundation Models What is a Generative Image? Gradient-Free Generative Artifact Detection via Analytic Spectral Adaptation

我们是否需要教基础模型什么是生成图像?基于解析谱自适应的无梯度生成伪影检测

Qiaoyu Chen, Bing Zhang

发表机构 * Harbin University of Commerce(哈尔滨商业大学)

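AI summary: Proposes a gradient-free method that reframes generative artifact detection as an out-of-distribution anomaly measurement problem; by analytically decoupling statistical and semantic deviations, it significantly outperforms gradient-optimized methods in a zero-shot setting.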

详情
AI中文摘要

通过基于梯度的更新来适应基础模型以检测生成伪影会损害其内在表示。在有限样本上优化时,模型会过拟合到局部领域捷径。在专门数据上微调大量权重会引入错误的归纳偏差,在高维特征空间中引起可测量的 $\mathcal{L}_2$ 范数扰动——我们将这一现象形式化为锚点漂移。非线性激活放大了这种漂移,损害了跨未见领域的零样本伪造检测。我们提出了一种无梯度方法,将检测从二分类重新定义为分布外(OOD)异常度量问题。将冻结的基础模型视为稳定的坐标系,通过解析解耦统计和语义偏差,在真实视觉流形上建立一个绝对的自然锚点,该锚点源自注意力加权的空间矩和感知不一致性的正交投影。在极端零样本设置下(在面部伪造上训练,在通用文本到图像生成上测试),我们的方法显著优于梯度优化范式。无反向传播的前向传递和线性求解器实现了硬件无关、边缘可部署的校准,延迟极低。此外,Sherman-Morrison公式使得能够针对新型攻击进行即时在线学习,并通过协方差增量传输实现隐私保护的联邦协作。

英文摘要

Adapting foundation models to detect generative artifacts via gradient-based updates compromises their intrinsic representations. Under optimization on limited samples, models overfit to local domain shortcuts. Fine-tuning massive weights on specialized data introduces erroneous inductive biases, inducing a measurable $\mathcal{L}_2$ norm perturbation in the high-dimensional feature space, a phenomenon we formalize as anchor drift. Amplified by nonlinear activations, this drift impairs zero-shot forgery detection across unseen domains. We propose a gradient-free methodology reframing detection from binary classification to an out-of-distribution (OOD) anomaly measurement problem. Treating a frozen foundation model as a stable coordinate system, we establish an absolute natural anchor on the real visual manifold by analytically decoupling statistical and semantic deviations, derived from attention-weighted spatial moments and orthogonal projection of perceptual inconsistencies. Evaluated in an extreme zero-shot setting (trained on face forgeries, tested on universal Text-to-Image generations), our method significantly outperforms gradient-optimized paradigms. Backpropagation-free forward passes and linear solvers enable hardware-agnostic, edge-deployable calibration with minimal latency. Furthermore, the Sherman-Morrison formula unlocks instantaneous online learning against novel attacks and enables privacy-preserving federated collaboration via covariance delta transmission.

2606.07658 2026-06-09 cs.CV cs.LG 新提交

What neurosurgeons need to see: synthetic intra-operative MRI from ultrasound for brain-shift compensation in brain tumour surgery

神经外科医生需要看到的:用于脑肿瘤手术中脑移位补偿的超声合成术中MRI

Santiago Cepeda, Olga Esteban-Sinovas, Ignacio Arrese, Rosario Sarabia

发表机构 * Department of Neurosurgery, Neurovascular Unit, Río Hortega University Hospital, Valladolid, Spain(西班牙巴利亚多利德里奥·奥尔特加大学医院神经外科神经血管科) Specialized Group in Biomedical Imaging and Computational Analysis (GEIBAC), Instituto de Investigación Biosanitaria de Valladolid (IBioVALL), Valladolid, Spain(西班牙巴利亚多利德生物医学研究与计算分析专业组(GEIBAC),巴利亚多利德生物健康研究所(IBioVALL))

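AI summary: Proposes an end-to-end pipeline that generates a whole-brain MRI volume in the preoperative imaging space by merging the preoperative MRI, a synthetic MRI generated from intraoperative ultrasound, and a deformable registration anchored on that synthetic image, compensating for brain shift and giving neuronavigation an MRI-like update of the operative field.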

详情
AI中文摘要

最大安全切除是胶质瘤手术的主要目标。硬脑膜打开后,神经导航引导会因脑移位而逐渐退化。术中MRI可以补偿,但需要专用基础设施且很少可用,而术中超声(ioUS)廉价、可重复且与常规工作流程兼容。将ioUS与术前MRI结合的导航系统通常依赖刚性配准;即使是可变形多模态配准也受限于超声散斑对比度、窄视野以及无法表示术前扫描中不存在的结构,最关键的是切除腔和残余肿瘤。我们提出一个端到端流水线,通过合并术前MRI、从ioUS生成的合成MRI以及锚定在该合成图像上的可变形配准,生成术前成像空间中的全脑MRI体积。它集成了一个2.5D残差变换器合成骨干(ResViT-2.5D)和一个两阶段配准,将NiftyReg与合成锚定的SynthMorph阶段耦合,直接对原始扫描仪输入进行操作。在切除后的ReMIND队列上,ResViT-2.5D生成的合成图像在结构、强度和感知指标上与术中T2紧密匹配。在14名受试者的215个专家标志点上,合成锚定配准将平均目标配准误差从6.27毫米降低到5.86毫米,与强大的经典NiftyReg基线(5.85毫米)相当,同时为每个受试者产生微分同胚变形场。贡献不在于配准精度的提高,而在于集成的体积本身,它在超声视野内反映了术中切除后的状态。这为外科医生提供了手术视野的类似MRI的更新,并有可能集成到手术导航工作流程中。

英文摘要

Maximal safe resection is the primary objective in glioma surgery. Neuronavigation guidance is progressively degraded by brain shift after dural opening. Intraoperative MRI can compensate but needs dedicated infrastructure and is rarely available, whereas intraoperative ultrasound (ioUS) is inexpensive, repeatable, and compatible with routine workflows. Navigation systems combining ioUS with preoperative MRI usually rely on rigid registration; even deformable multimodal registration is limited by ultrasound speckle contrast, a narrow field of view, and the inability to represent structures absent from the preoperative scan, most critically the resection cavity and residual tumor. We propose an end-to-end pipeline that generates a new whole-brain MRI volume in the preoperative imaging space by merging the preoperative MRI, a synthetic MRI generated from the ioUS, and a deformable registration anchored on that synthetic image. It integrates a 2.5D residual-transformer synthesis backbone (ResViT-2.5D) and a two-stage registration coupling NiftyReg with a synthesis-anchored SynthMorph stage, operating directly on raw scanner inputs. On a post-resection ReMIND cohort, ResViT-2.5D produced synthetic images closely matching the intraoperative T2 across structural, intensity, and perceptual metrics. In 14 subjects with 215 expert landmarks, the synthesis-anchored registration reduced the mean target registration error from 6.27 to 5.86 mm, matching a strong classical NiftyReg baseline (5.85 mm) while yielding a diffeomorphic deformation field in every subject. The contribution is not a gain in registration accuracy but the integrated volume itself, which, inside the ultrasound field of view, reflects the intraoperative post-resection state. This provides the surgeon with an MRI-like update of the operative field with potential for integration into surgical-navigation workflows.
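
The target registration error (TRE) quoted above is conventionally the Euclidean distance between corresponding landmarks after registration, averaged over landmarks. A minimal sketch follows, assuming landmark coordinates are already expressed in millimetres in a common space; the function name and the tuple representation are illustrative choices.

```python
import math


def mean_tre(reference_landmarks, registered_landmarks):
    """Mean Euclidean distance (same unit as the inputs) between landmark pairs."""
    if len(reference_landmarks) != len(registered_landmarks):
        raise ValueError("landmark lists must have the same length")
    distances = [
        math.dist(p, q)
        for p, q in zip(reference_landmarks, registered_landmarks)
    ]
    return sum(distances) / len(distances)
```

Applied to the 215 expert landmarks mentioned in the abstract, a function like this would yield the kind of single summary figure reported there, although the abstract does not say whether the mean is taken over landmarks or over subjects.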

2606.07654 2026-06-09 cs.CV cs.AI 新提交

MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework

MM-Matryoshka:通过二维多模态套娃训练框架实现预算弹性视觉文档检索

Haowen Xiang, Yibo Yan, Jiahao Huo, Yu Huang, Yi Cao, Mingdong Ou, Xuming Hu

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Alibaba Cloud Computing(阿里云计算) Hong Kong University of Science and Technology(香港科技大学)

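AI summary: Proposes MM-Matryoshka, a 2D Matryoshka training framework that lets a visual document retriever choose its budget elastically along vector dimension and encoder depth, without training a separate model for each budget.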

详情
AI中文摘要

多向量视觉文档检索器通过深度视觉语言模型(VLM)为每个页面生成多个向量,实现强大的细粒度匹配,但这种设计在存储和计算开销上导致部署成本高昂。现有效率技术通常只优化预算的一部分,使得多模态检索器缺乏统一的方法来权衡精度与向量宽度和编码器深度。因此,我们提出MM-Matryoshka,一种用于预算弹性视觉文档检索(VDR)的二维套娃训练框架,使ColPali风格的多向量检索在维度和层两个方向上实现弹性。在推理时,单个检索器可以选择二维可调预算,无需为不同预算训练独立模型。通过在多个代表性骨干网络上的全面实验,我们证明MM-Matryoshka在显著降低存储和计算开销的同时,保留了比直接截断基线高得多的质量,从而为高效VDR提供了稳健的预算弹性。

英文摘要

Multi-vector visual document retrievers achieve strong fine-grained matching by representing each page with multiple vectors from deep Vision-Language Models (VLMs), but this design makes deployment expensive in both storage and computational overhead. Existing efficiency techniques usually optimize only part of this budget, leaving multimodal retrievers without a unified way to trade accuracy for both vector width and encoder depth. Therefore, we propose MM-Matryoshka, a 2D Matryoshka training framework for budget-elastic Visual Document Retrieval (VDR), making ColPali-style multi-vector retrieval elastic along both the dimension and layer axes. At inference time, a single retriever can select a budget along both axes without training separate models for different budgets. Through comprehensive experiments across multiple representative backbones, we demonstrate that MM-Matryoshka retains significantly higher quality than direct truncation baselines while substantially reducing storage and computational overhead, offering robust budget elasticity for efficient VDR.
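
The dimension axis of such a budget can be illustrated with late-interaction (MaxSim) scoring over truncated vectors, as sketched below. It assumes the Matryoshka convention that the first `dim` coordinates of each vector form a usable sub-embedding, and it re-normalises after truncation, which is an assumption rather than a documented step. The layer axis (reading vectors from an earlier encoder layer) is not shown.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def maxsim_score(query_vecs, page_vecs, dim):
    """Late-interaction score: for each query vector, take the best-matching
    page vector by dot product, then sum over query vectors."""
    q = [truncate_and_normalize(v, dim) for v in query_vecs]
    d = [truncate_and_normalize(v, dim) for v in page_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )
```

Halving `dim` halves both index storage and the cost of every dot product. Training with a Matryoshka-style objective is what would be expected to keep the truncated prefixes informative, in contrast with directly truncating a model trained only at full width.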

2606.07653 2026-06-09 cs.CV cs.AI 新提交

A Dataset for Dynamic Human Preferences for Vision Language Models

面向视觉语言模型的动态人类偏好数据集

Hannah Gao, Dylan Hadfield-Menell, Rachel Ma

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

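AI summary: Proposes a benchmark for evaluating how well vision language models understand dynamic human preferences, generates a dataset with variations in image dependence through an automated pipeline, and evaluates existing models.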

详情
AI中文摘要

鉴于视觉语言模型(VLM)在人机交互场景中的广泛应用,评估这些模型适应不同用户实时偏好的能力变得重要。尽管近年来引入了越来越多的视觉语言基准,但它们主要侧重于评估静态能力和从大量训练数据中学习的一般偏好。本文引入了一个新的基准,用于评估VLM理解动态人类偏好的能力,即在推理时通过上下文传递的偏好。我们提供了一个自动化管道来生成该基准,包含图像依赖变化、动态多模态人类偏好数据集,并对最新模型在新基准上的表现进行了评估。

英文摘要

Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important to evaluate how well these models can adapt to the real-time preferences of different users. While an increasing number of vision-language benchmarks have recently been introduced, they focus largely on evaluating static capabilities and generally held preferences learned from extensive training data. This work introduces a new benchmark for evaluating the ability of VLMs to understand dynamic human preferences, i.e., preferences that are passed in-context at inference time. We provide an automated pipeline for generating this benchmark with variations on image dependence, a dynamic multi-modal human-preference dataset, and evaluations of state-of-the-art models on the novel benchmark.

2606.07649 2026-06-09 cs.CV cs.AI New submission

ViMax: Agentic Video Generation

ViMax: 智能体视频生成

Lingxuan Huang, Sizhe He, Hengji Zhou, Liqiang Nie, Lianghao Xia, Chao Huang

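Affiliations * The University of Hong Kong; South China University of Technology; Harbin Institute of Technology, Shenzhen

AI summary: Proposes the ViMax framework, which generates long-form video through multi-agent collaboration, using a hierarchical narrative engine and a visual consistency mechanism to keep the narrative coherent and the visuals consistent.

Details
Comments
20 pages, 13 figures
AI Chinese abstract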

长视频生成需要系统的叙事规划和视觉一致性,而当前的短视频方法无法提供。现有方法生成孤立的序列,缺乏叙事结构,并且缺乏跨场景保持角色和环境一致性的机制。我们提出ViMax,一个智能体视频生成框架,通过协调的多智能体协作来解决视频创作问题,其中专门的组件协商叙事决策、视觉连续性和制作质量。我们的框架采用分层叙事引擎,结合检索增强生成以实现全局故事连贯性,以及依赖感知的视觉一致性机制,跨时间边界跟踪角色和环境状态,同时VLM引导的智能体持续监控和优化叙事连贯性和视觉保真度。该框架支持协调的智能体协作以生成扩展的叙事内容,在多场景时间线上保持叙事完整性和视觉连贯性。

English abstract

Long-form video generation requires systematic narrative planning and visual consistency that current short-clip methods cannot provide. Existing methods generate isolated sequences without narrative structure and lack mechanisms for maintaining character and environmental consistency across scenes. We present ViMax, an agentic video generation framework that addresses video creation through coordinated multi-agent collaboration, in which specialized components negotiate narrative decisions, visual continuity, and production quality. Our framework employs a hierarchical narrative engine with retrieval-augmented generation for global story coherence and a dependency-aware visual consistency mechanism that tracks character and environmental states across temporal boundaries, while VLM-guided agents continuously monitor and refine both narrative coherence and visual fidelity. Together, these components generate extended narrative content that maintains both storytelling integrity and visual coherence across multi-scene timelines.

2606.07648 2026-06-09 cs.CV cs.AI New submission

AQIFormer: A Transformer-Based Multi-View Architecture for Cross-City Air Quality Classification

AQIFormer:一种基于Transformer的多视角架构用于跨城市空气质量分类

Om Kathalkar, Nitin Nilesh, Sachin Chaudhari, Anoop Namboodiri

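Affiliations * IIIT Hyderabad (International Institute of Information Technology, Hyderabad, India)

AI summary: Proposes AQIFormer, a transformer-based ensemble architecture that uses front-rear view fusion, weather-aware attention, and multi-task learning to reach 89.96% accuracy in cross-city air quality classification, a 14.96% improvement over existing methods.

Details
Comments
Accepted at ICVGIP 2025 (Indian Conference on Computer Vision, Graphics and Image Processing), 9 pages, 4 figures
AI Chinese abstract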

空气污染是全球最严峻的环境和公共卫生挑战之一,传统的基于传感器的监测系统面临显著的可扩展性和经济性限制。基于图像的空气质量估计已成为一种有前景的替代方案,利用交通场景中大气污染物的视觉特征。然而,现有方法存在跨城市泛化能力有限以及对多视角信息利用不足的问题。我们提出AQIFormer,一种新颖的基于Transformer的集成架构,通过创新的双视图融合、天气感知注意力机制和全面的多任务学习来解决这些根本性限制。我们的方法独特地将前后交通图像与气象参数相结合,以实现跨不同城市环境的稳健空气质量分类。在包含26,678个同步前后图像对的综合数据集上进行的大量评估表明,该模型性能良好,准确率达到89.96%,比现有最优方法提高了14.96%。最重要的是,我们的模型保持了出色的跨城市泛化能力,在印度那格浦尔收集的独立数据集上达到81.67%的准确率,通过少量样本自适应仅用极少的训练样本,性能下降仅为8.29%。

English abstract

Air pollution is one of the most critical environmental and public health challenges globally, and traditional sensor-based monitoring systems face significant scalability and economic constraints. Image-based air quality estimation has emerged as a promising alternative, leveraging the visual characteristics of atmospheric pollutants in traffic scenes. However, existing methods suffer from limited cross-city generalization and inadequate exploitation of multi-view perspectives. We present AQIFormer, a transformer-based ensemble architecture that addresses these limitations through dual-view integration, weather-aware attention mechanisms, and multi-task learning. Our approach combines front and rear traffic imagery with meteorological parameters to achieve robust air quality classification across diverse urban environments. Evaluation on a dataset of 26,678 synchronized front-rear image pairs yields 89.96% accuracy, a 14.96% improvement over state-of-the-art methods. The model also generalizes across cities: using few-shot adaptation with minimal training samples, it achieves 81.67% accuracy on an independent dataset collected in Nagpur, India, a drop of only 8.29 percentage points.

2606.07646 2026-06-09 cs.CV cs.AI New submission

DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation

DOME:从稀疏监督中学习可迁移域变量用于测试时自适应

Xiaoran Xu, Yifan Xu, Yupeng Wu, Xiaoshan Yang, Changsheng Xu

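Affiliations * MAIS, IACAS (Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences)

AI summary: Proposes DOME, a domain encoder that uses vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank, enabling explicit zero-shot domain modeling and outperforming complex TTA methods on multiple benchmarks.

Details
AI Chinese abstract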

测试时自适应(TTA)旨在仅使用无标签流数据将模型对齐到变化的测试域。现有方法大多隐式推断单个全局域分布,忽略了真实世界域迁移的多维性和样本特异性,导致自适应脆弱。我们提出DOME,一种有效的域编码器,以零样本方式显式建模每个样本的域。DOME利用视觉-语言预训练提取密集、连续的表示,将域参数化为分布变量,并引入动量更新的稀疏域库用于解耦监督。通过将这些显式域线索注入下游模型,即使是最基本的熵最小化TTA策略也在ImageNet-C、ImageNet-R和ImageNet-Sketch上达到了最先进的性能,超越了复杂的TTA方法。我们的结果表明,鲁棒的自适应并非源于复杂的自适应算法,而是源于显式的、结构化的域表示。

English abstract

Test-time adaptation (TTA) aims to align a model to shifting test domains using only unlabeled streaming data. Most existing methods implicitly infer a single global domain distribution and ignore the multidimensional, sample-specific nature of real-world domain shifts, which leads to fragile adaptation. We propose DOME, an effective domain encoder that explicitly models each sample's domain in a zero-shot manner. DOME leverages vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank for disentangled supervision. By injecting these explicit domain cues into downstream models, even a basic entropy-minimization TTA strategy achieves state-of-the-art performance across ImageNet-C, ImageNet-R, and ImageNet-Sketch, outperforming complex TTA approaches. Our results demonstrate that robust adaptation stems not from intricate adaptation algorithms, but from explicit, structured domain representation.
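For readers unfamiliar with the "basic entropy-minimization TTA strategy" mentioned above, here is a toy sketch of that objective. It is not DOME and does not model domains. It assumes, purely for illustration, that the only adapted parameter is a per-class bias added to fixed logits, and the name `adapt_bias` and the sample logits are hypothetical. In practice the adapted parameters would belong to the model itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_grad_wrt_logits(logits):
    """Gradient of the prediction entropy H with respect to the logits:
    dH/dz_i = -p_i * (log p_i + H)."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) if p > 0 else 0.0 for p in probs]


def mean_entropy(batch_logits, bias):
    """Average prediction entropy over an unlabeled batch."""
    total = sum(
        entropy(softmax([z + b for z, b in zip(row, bias)]))
        for row in batch_logits
    )
    return total / len(batch_logits)


def adapt_bias(batch_logits, bias, lr=0.5):
    """One gradient-descent step on the bias that lowers mean entropy.
    No labels are used, only the model's own predictions."""
    grad = [0.0] * len(bias)
    for row in batch_logits:
        g = entropy_grad_wrt_logits([z + b for z, b in zip(row, bias)])
        grad = [acc + gi / len(batch_logits) for acc, gi in zip(grad, g)]
    return [b - lr * g for b, g in zip(bias, grad)]


# Hypothetical logits for an unlabeled test batch of two samples.
batch = [[2.0, 1.0, 0.1], [0.5, 1.5, 0.2]]
bias = [0.0, 0.0, 0.0]
before = mean_entropy(batch, bias)
after = mean_entropy(batch, adapt_bias(batch, bias))
```

After one step the mean entropy is lower than before, meaning the predictions have become more confident without any labels. Repeating this step indefinitely on a bias alone would eventually collapse predictions toward a single class, which is one reason the choice of adapted parameters matters.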