arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26585 2026-05-27 cs.LG

Near-Optimal Regret in Adversarial Kernel Bandits

对抗性核赌博中的近最优遗憾

Yu-Jie Zhang, Hao Qiu, Jonathan Scarlett, Kevin Jamieson

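AI summary For the adversarial kernel bandit problem, proposes an exponential-weights algorithm based on a regularized importance-weighted loss estimator, with an explicit correction term that removes the bias, achieving a regret bound that matches the known optimal rate for stochastic kernel bandits.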

AI Chinese abstract

我们研究对抗性核赌博问题,其中每轮的损失由再生核希尔伯特空间(RKHS)中的任意有界元素诱导。我们提出了一种基于正则化重要性加权损失估计的指数权重算法,并带有一个显式修正项,用于抵消正则化引入的偏差。我们的主要结果将遗憾界限制为 $\widetilde{O}\big(\sqrt{T\, d_*(λ)\,\log|{X}|}\big)$,其中 $d_*(λ)$ 是广泛采用的有效维度概念,用于捕捉核的复杂度。忽略对数因子,这匹配了相关随机核赌博问题中已知的速率。一个显著的应用是 $\mathbb{R}^d$ 上具有平滑参数 $ν$ 的 Matérn$(ν,d)$ 核,此时我们的界特化为 $\widetilde{O}\big(T^{(ν+d)/(2ν+d)}\big)$,改进了 Chatterji 等人 [2019] 先前已知的最佳速率,同时去除了他们分析所需的秩一对手假设。此外,该速率与随机核赌博的已知最优速率相同,并且与并发工作中的下界仅相差一个 $\log T$ 因子。

English abstract

We study the adversarial kernel bandit problem, in which the loss at each round is induced by an arbitrary bounded element of a reproducing kernel Hilbert space (RKHS). We propose an exponential-weights algorithm built on a regularized importance-weighted loss estimator, together with an explicit correction term that cancels the bias introduced by the regularization. Our main result bounds the regret by $\widetilde{O}\big(\sqrt{T\, d_*(λ)\,\log|{X}|}\big)$, where $d_*(λ)$ is a widely-adopted notion of effective dimension that captures the complexity of the kernel. Up to logarithmic factors, this matches the known rate achieved in the related stochastic kernel bandit problem. A notable application is the Matérn$(ν,d)$ kernel with smoothness parameter $ν$ on $\mathbb{R}^d$, for which our bound specializes to $\widetilde{O}\big(T^{(ν+d)/(2ν+d)}\big)$, improving over the best-known prior rate of Chatterji et al. [2019] while simultaneously removing the rank-one adversary assumption required by their analysis. Moreover, this rate is the same as the known optimal rate for stochastic kernel bandits, and also matches a lower bound from concurrent work up to a $\log T$ factor.
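The Matérn rate quoted in this abstract can be sanity-checked numerically. The sketch below only evaluates the exponent $(ν+d)/(2ν+d)$ stated above; the function name `matern_regret_exponent` is hypothetical, and nothing here reproduces the paper's algorithm or analysis.

```python
def matern_regret_exponent(nu: float, d: int) -> float:
    """Return a = (nu + d) / (2*nu + d), the exponent in the rate T**a.

    nu is the Matern smoothness parameter and d the input dimension,
    as named in the abstract. The function name is hypothetical.
    """
    if nu <= 0 or d <= 0:
        raise ValueError("nu and d must be positive")
    return (nu + d) / (2 * nu + d)


if __name__ == "__main__":
    # Smoother kernels (larger nu) push the exponent toward 1/2.
    for nu, d in [(0.5, 1), (1.5, 2), (2.5, 2), (50.0, 2)]:
        a = matern_regret_exponent(nu, d)
        print(f"nu={nu}, d={d}: regret grows like T^{a:.3f}")
```

For any ν > 0 the exponent lies strictly between 1/2 and 1, so the bound is always sublinear in $T$ and approaches the $\sqrt{T}$ regime as the kernel becomes smoother.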

2605.26584 2026-05-27 cs.CV

O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding

O-MARC: 全记忆增强压缩蒸馏用于高效视频理解

Peiran Wu, Yunze Liu, Chi-Hao Wu, Chen Chen, Junxiao Shen

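AI summary Proposes the O-MARC framework: the training-free compression method OMAC preserves visual memory and audio anchors, and compression distillation makes compact models robust, improving performance on multiple benchmarks while lowering inference cost.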

AI Chinese abstract

全模态大语言模型实现了统一的音频视频理解,但长联合令牌序列导致推理成本高昂,且现有基准未能完全隔离噪声用户生成视频中的音视频关联。我们引入了UGC-AVQA,一个公开的UGC基准,包含1000个视频和4816个问答对,其中音频移除测试确保基准问题需要声学和视觉证据。为了降低推理成本,我们提出了OMAC,一种无需训练的即插即用压缩方法,保留显著的视觉记忆和时域锚定的音频锚点。为了进一步使紧凑模型对压缩输入鲁棒,我们引入了O-MARC,一种用于学习记忆压缩多模态上下文的压缩蒸馏框架。在Qwen2.5-Omni-3B上,O-MARC在四个基准上的平均得分提升至45.8,优于全令牌推理的44.1和OmniZip的41.0。与全令牌推理相比,OMAC还保持了推理效率,延迟降低34.6%(1.53倍加速),内存降低34.7%。

English abstract

Omnimodal large language models enable unified audio-video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio-visual association in noisy user-generated videos. We introduce UGC-AVQA, a public UGC benchmark with 1,000 videos and 4,816 QA pairs, where an audio-removal test ensures that benchmark questions require both acoustic and visual evidence. To reduce inference cost, we propose OMAC, a training-free plug-in compression method that preserves salient visual memory and temporally grounded audio anchors. To further make compact models robust to compressed inputs, we introduce O-MARC, a compression distillation framework for learning with memory-compressed multimodal contexts. On Qwen2.5-Omni-3B, O-MARC improves the average score across four benchmarks to 45.8, outperforming full-token inference at 44.1 and OmniZip at 41.0. OMAC also keeps inference efficient, reducing latency by 34.6\% (1.53$\times$ speedup) and memory by 34.7\% compared with full-token inference.

2605.26582 2026-05-27 cs.LG cs.AI

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

离散扩散中随机性的纠错效应

William Yuan, Sungwon Jeong, Amirali Aghazadeh

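AI summary Systematically studies how the degree of stochasticity in the Markov transitions of discrete diffusion models trades off sampling efficiency against quality, and proposes Discrete Churn and Restart Sampling (DCRS), which injects controlled stochasticity by alternating forward and reverse diffusion processes and improves the speed-quality tradeoff at low numbers of function evaluations.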

AI Chinese abstract

离散扩散模型在文本和图像生成中取得了强劲性能,但其推理仍然缓慢,且必须内在平衡采样效率与样本质量。在这项工作中,我们系统研究了马尔可夫转移中随机性程度如何主导采样权衡。我们表明,高度确定性的转移收敛迅速但遭受误差累积,而更随机的转移收敛更慢但能达到更高的最终样本质量。通过信息论分析,我们识别出潜在机制为一种由对称地在状态间交换质量的冗余转移诱导的纠错效应,并表明这些转移可证明地收缩采样误差。受此分析启发,我们提出离散搅动与重启采样(DCRS),一种新颖的推理算法,通过交替正向和反向扩散过程注入受控随机性。在合成数据集和大规模基准上的实验表明,DCRS在低函数评估次数下改善了速度-质量权衡。在图像数据集上,与标准采样器相比,DCRS在保持竞争性样本质量的同时,实现了高达10倍的采样步数减少;而在语言基准上,我们观察到更细微的行为,取决于损坏过程和采样程序。

English abstract

Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently balance sampling efficiency and sample quality. In this work, we present a systematic study of how the degree of stochasticity in Markov transitions governs the sampling tradeoff. We show that highly deterministic transitions converge rapidly but suffer from error accumulation, while more stochastic transitions converge more slowly yet can achieve higher final sample quality. Using an information-theoretic analysis, we identify the underlying mechanism as an error-correcting effect induced by redundant transitions that symmetrically exchange mass between states, and show that these transitions can provably contract sampling errors. Motivated by this analysis, we propose Discrete Churn and Restart Sampling (DCRS), a novel inference algorithm that injects controlled stochasticity by alternating between forward and reverse diffusion processes. Experiments on synthetic datasets and large-scale benchmarks show that DCRS improves the speed-quality tradeoff in the low number of function evaluations regime. On image datasets, DCRS achieves up to a $10\times$ reduction in sampling steps compared to standard samplers while maintaining competitive sample quality, whereas on language benchmarks, we observe more nuanced behavior depending on the corruption process and sampling procedure.

2605.26579 2026-05-27 cs.LG

Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

Focal Reward: 基于评分标准的强化学习中的平衡奖励

Yu Huang, Zihua Zhao, Zhaoxin Huan, Wanli Gu, Feng Hong, Xinmu Ge, Lin Yuan, Weichang Wu, Qiang Hu, Xiaolu Zhang, Jun Zhou, Jiangchao Yao

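AI summary To address reward imbalance when training large language models with reinforcement learning under multi-dimensional rubrics, proposes Focal Reward, which uses an inverse reward projection mechanism to estimate the saturation of each dimension and reweights automatically for fine-grained balance, outperforming the strongest static aggregation baseline in all 18 model-benchmark comparisons.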

Comments Preprint

AI Chinese abstract

大语言模型中的开放式生成通常需要多维评分标准来充分评估质量并指导强化学习的改进。然而,这种训练范式固有的一个关键困境是不同评分标准维度上的奖励极化不平衡。在此瓶颈下,即使大语言模型在训练后获得相对较高的奖励,它们仍可能在某些维度上表现出严重缺陷,直接导致用户体验下降。为了解决这个问题,我们提出了Focal Reward,一种新颖的目标函数,用于自动平衡基于评分标准的强化学习训练。具体来说,我们首先利用逆奖励投影机制来估计评分标准中每个准则的饱和程度,这构成了校准奖励方向的基础。然后,最终目标函数为每个准则设计了一个自动重新加权的系数,以实现细粒度平衡。跨三个模型规模和六个基准的大量实验表明,我们的Focal Reward方法在所有18个模型-基准比较中均优于最强的静态聚合基线。展开、机制和消融分析进一步表明,这些增益来自于向仍有改进空间的评分标准进行在线、饱和感知的重新分配。

English abstract

The open-ended generation in LLMs usually requires multi-dimensional rubrics to adequately assess quality and guide the improvement of reinforcement learning. However, a critical dilemma inherent in this training paradigm is the imbalanced reward polarization along different rubric dimensions. Under this bottleneck, even if LLMs achieve relatively high rewards after training, they may still exhibit severe deficiencies in certain dimensions, leading to a direct deterioration in user experience. To address this problem, we propose Focal Reward, a novel objective to automatically balance the training of reinforcement learning under rubric-based rewards. Specifically, we first leverage an inverse reward projection mechanism to estimate the saturation degree of each criterion in the rubric, which forms the basis to calibrate the reward direction. Then, the final objective is designed with an automatically reweighting coefficient for each criterion to achieve the fine-grained balancing. Extensive experiments across three model scales and six benchmarks demonstrate that our Focal Reward method outperforms the strongest static aggregation baseline in all 18 model-benchmark comparisons. Rollout, mechanism, and ablation analyses further show that these gains arise from online, saturation-aware reallocation toward rubrics that still have room for improvement.

2605.26576 2026-05-27 cs.CV cs.LG

TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting

TrackRef3D: 面向开放世界3D高斯泼溅分割的多视角一致跟踪-标注方法

Yuyang Tan, Renhe Zhang, Hang Zhang, Ao Li, Xin Tan

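AI summary Proposes TrackRef3D, a fully automatic pipeline whose multi-view consistent track-then-label paradigm decouples object discovery from semantic grounding, enabling open-world referring segmentation in 3D Gaussian Splatting without manual annotation.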

AI Chinese abstract

引用3D高斯泼溅(R3DGS)利用自然语言进行3D目标分割,已成为具身AI的关键能力。然而,现有方法通常依赖昂贵的每场景人工标注和每视图伪掩码生成,存在多视角不一致以及对不同查询特异性的泛化能力差的问题。为此,我们提出TrackRef3D,一种全自动流水线,通过引入多视角一致的跟踪-标注范式,从根本上将目标发现与语义定位解耦,无需人工标注即可实现3D高斯泼溅(3DGS)中的开放世界引用分割。具体而言,我们提出轨迹感知语义共识模块(TSCM),通过同义词聚类和轨迹感知投票聚合跨视图预测,建立规范语义身份,从而确保多视角一致性。此外,我们采用可见性感知描述生成策略以缓解歧义,并提出混合训练策略(HTS),利用多正例对比目标联合优化粗粒度类别语义和细粒度引用线索,确保在不同查询特异性下的鲁棒性。在基准上的大量实验表明,TrackRef3D达到了最先进的性能。

English abstract

Referring 3D Gaussian Splatting (R3DGS), which utilizes natural language for 3D object segmentation, has emerged as a crucial capability for embodied AI. However, existing methods typically rely on expensive per-scene manual annotation and per-view pseudo mask generation, which suffer from multi-view inconsistency and poor generalization to varying query specificities. To address this, we present TrackRef3D, a fully automatic pipeline that achieves open-world referring segmentation in 3D Gaussian Splatting (3DGS) without manual annotation by introducing a multi-view consistent track-then-label paradigm that fundamentally decouples object discovery from semantic grounding. Specifically, we propose a Trajectory-Aware Semantic Consensus Module (TSCM) which aggregates cross-view predictions via synonymous clustering and trajectory-aware voting to establish a canonical semantic identity, thereby ensuring multi-view consistency. Furthermore, we employ a visibility-aware description generation strategy to mitigate ambiguity and propose a Hybrid Training Strategy (HTS) that jointly optimizes coarse category semantics and fine-grained referential cues to ensure robustness under varying query specificities using a multi-positive contrastive objective. Extensive experiments on benchmarks demonstrate that TrackRef3D achieves state-of-the-art performance.

2605.26575 2026-05-27 cs.CL

Hubness, Not Anisotropy, Drives Cross-Lingual Retrieval Asymmetry in Multilingual Embedding Models

中心性而非各向异性驱动多语言嵌入模型中的跨语言检索不对称性

Adib Sakhawat, Fardeen Sadab, Atik Shahriar

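AI summary Shows experimentally that in multilingual embedding models hubness, rather than anisotropy, centroid drift, or vector magnitude, is the main geometric pathology behind cross-lingual retrieval asymmetry, and recommends CSLS in place of cosine similarity as the default retrieval metric.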

Comments 17 pages, 5 figures

AI Chinese abstract

多语言嵌入模型在部署时假设跨语言检索是对称的:如果语言A的查询检索到语言B中的翻译,反之亦然。但实际上并非如此。我们使用包含英语、孟加拉语、印地语和阿拉伯语的6,518个习语和谚语表达的平行语料库,通过五个生产级编码器(Gemini、Mistral、OpenAI-L、OpenAI-S、Qwen)进行嵌入,将这种失败形式化为互近邻互惠性的缺陷,并测试一个单一的机制性主张:在多语言空间的几何病理中,中心性(hubness),而非各向异性、质心漂移或幅度,是主要的因果驱动因素。在五个预先注册的实验中,预先指定了证伪条件,中心质量在互惠性的联合回归中占主导地位(49.5%的主导份额,是下一个预测因子的1.68倍;偏R²=0.302,而各向异性为0.003),而中心性感知的分数校正(CSLS)缩小了最差到最佳互惠性差距的63.5%,并产生平均模型内效应量,是外科中心向量消融的130倍。后一对比指出了机制:中心性是相似度度量的病理,而非单个中心向量的病理。我们解决了著名的各向异性-中心性悖论,证明两者在统计上是可分离的,并建议用CSLS替换余弦相似度作为多语言嵌入管道的默认检索度量。

English abstract

Multilingual embedding models are deployed under the assumption that cross-lingual retrieval is symmetric: if a query in language A retrieves its translation in language B, the reverse should also hold. In practice it does not. Using a parallel corpus of 6,518 idiomatic and proverbial expressions in English, Bangla, Hindi, and Arabic, embedded by five production-grade encoders (Gemini, Mistral, OpenAI-L, OpenAI-S, Qwen), we formalise this failure as a deficit in mutual nearest-neighbour reciprocity and test a single mechanistic claim: among the geometric pathologies of multilingual spaces, hubness, not anisotropy, centroid drift, or magnitude, is the dominant causal driver. Across five pre-registered experiments with falsification conditions specified in advance, hub mass dominates a joint regression on reciprocity (49.5% dominance share, 1.68x the next predictor; partial R^2 = 0.302 versus 0.003 for anisotropy), while a hub-aware score correction (CSLS) closes 63.5% of the worst-to-best reciprocity gap and yields a mean within-model effect size 130x larger than surgical hub-vector ablation. The latter contrast pinpoints the mechanism: hubness is a pathology of the similarity metric, not of individual hub vectors. We resolve the well-known anisotropy-hubness paradox by showing the two are statistically dissociable, and we recommend replacing cosine similarity with CSLS as the default retrieval metric for multilingual embedding pipelines.
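The hub-aware correction recommended in this abstract, CSLS (cross-domain similarity local scaling), can be sketched in a few lines. The version below follows the commonly used definition, CSLS(x, y) = 2·cos(x, y) − r_T(x) − r_S(y), where each r term is the mean cosine similarity to the k nearest neighbours on the other side. It is a plain-Python illustration under that assumption, not the authors' code; the function names and the default `k` are hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def csls_matrix(queries, targets, k=10):
    """Return CSLS scores for every (query, target) pair.

    Each score is 2*cos minus the mean similarity of the query to its
    k nearest targets and of the target to its k nearest queries, which
    penalises "hub" vectors that are close to everything.
    """
    sim = [[cosine(q, t) for t in targets] for q in queries]

    def mean_top_k(values):
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top)

    r_query = [mean_top_k(row) for row in sim]
    r_target = [
        mean_top_k([sim[i][j] for i in range(len(queries))])
        for j in range(len(targets))
    ]
    return [
        [2 * sim[i][j] - r_query[i] - r_target[j] for j in range(len(targets))]
        for i in range(len(queries))
    ]


def retrieve(scores):
    """Index of the best-scoring target for each query row."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]
```

Running `retrieve` once on `csls_matrix(A, B)` and once on `csls_matrix(B, A)` gives the two retrieval directions whose agreement the abstract calls reciprocity.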

2605.26571 2026-05-27 cs.LG

Separate Aggregation of Split Network for Personalized Federated Learning

分离网络的分组聚合用于个性化联邦学习

Yunseok Kang, Jaeyoung Song

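AI summary Proposes the PGFedSplit framework, which uses a split architecture and adaptive aggregation scheduling and combines local and server-generated representations to address the tradeoff between personalization and global generalization under heterogeneous client data.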

AI Chinese abstract

联邦学习能够在不共享原始数据的情况下进行协作模型训练,但在客户端数据分布异构时性能会大幅下降。单一的全局模型往往无法满足不同客户端的需求,因此个性化联邦学习被探索用于在保持全局泛化的同时提升客户端特定性能。现有的PFL方法通常面临一个基本权衡:更强的全局共享可能削弱本地专业化,而更强的本地适应则可能导致在数据有限、标签不平衡和缺失类别场景下的过拟合。在这项工作中,我们提出了PGFedSplit,一个在严重客户端异构下同时提升个性化和全局泛化的个性化联邦学习框架。PGFedSplit采用分离架构,并根据不同模型组件的角色执行自适应聚合调度,在保持客户端特定适应的同时实现稳定的知识共享。每个客户端进一步利用本地提取的表示和从服务器端高斯统计生成的合成表示的混合,提升了在标签不平衡和缺失类别条件下的鲁棒性。在Fashion MNIST、CIFAR-10、CIFAR-100和Tiny ImageNet上的大量实验表明,与最先进的PFL方法相比,PGFedSplit在高度异构设置下实现了持续改进,具有稳定的收敛和优越的个性化性能。

English abstract

Federated learning enables collaborative model training without sharing raw data, but its performance can degrade substantially under heterogeneous client data distributions. A single global model often cannot satisfy diverse client requirements, so personalized federated learning (PFL) has been explored to improve client-specific performance while preserving global generalization. Existing PFL methods often face a fundamental tradeoff in which stronger global sharing can undermine local specialization, whereas stronger local adaptation can lead to overfitting under limited data, label imbalance, and missing-class scenarios. In this work, we propose PGFedSplit, a personalized federated learning framework that improves both personalization and global generalization under severe client heterogeneity. PGFedSplit adopts a split architecture and performs adaptive aggregation scheduling tailored to the roles of different model components, enabling stable knowledge sharing while maintaining client-specific adaptation. Each client further leverages a mixture of locally extracted representations and synthetic representations generated from server-side Gaussian statistics, improving robustness under label imbalance and missing-class conditions. Extensive experiments on Fashion-MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet demonstrate consistent improvements over state-of-the-art PFL methods, with stable convergence and superior personalization in highly heterogeneous settings.

2605.26569 2026-05-27 cs.LG

Distribution-Aware Conformal Prediction: A Framework for generating efficient prediction intervals for time series

分布感知共形预测:一种为时间序列生成高效预测区间的框架

Daniel Schweizer, Peter Kuhn, Jayant Sharma, Shivali Dubey, Malte von Ramin, Christoph Brockt-Haßauer

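AI summary Proposes the Distribution-aware Conformal Prediction (DCP) framework, which integrates probabilistic predictors with score-agnostic conformal calibration to produce valid and efficient prediction intervals for time series.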

Comments submitted to Journal of Machine Learning Research (JMLR)

AI Chinese abstract

我们提出了分布感知共形预测(DCP),这是一个统一框架,将蒙特卡洛dropout、深度集成和分位数回归等概率预测器与分数无关的共形校准相结合,以生成有效且高效的预测区间。利用数值反演方法构建区间边界,DCP能够适应任意组合的分布生成预测器和非一致性分数。对合成和真实时间序列数据的基准分析表明,DCP能够在不同的不确定性机制下自适应地校准预测区间。关键的是,DCP的模块化设计便于对不同预测器-分数配对进行即插即用实验,并通过新引入的修正Winkler分数进行定量支持,该分数通过显式惩罚欠覆盖来平衡有效性和效率。虽然DCP推广并扩展了现有方法(如共形分位数回归和共形蒙特卡洛),但其模块化设计允许进一步扩展,为在动态环境和高风险应用中推进不确定性量化奠定了基础。

English abstract

We present Distribution-aware Conformal Prediction (DCP), a unified framework integrating probabilistic predictors like Monte Carlo dropout, deep ensembles, and quantile regression with score-agnostic conformal calibration to produce valid and efficient prediction intervals. Leveraging a numerical inversion approach to construct interval bounds, DCP accommodates arbitrary combinations of distribution-generating predictors and nonconformity scores. Benchmark analyses on synthetic and real-world time series data demonstrate DCP's ability to adaptively calibrate prediction intervals under varying uncertainty regimes. Crucially, DCP's modular design facilitates plug-and-play experimentation with different predictor-score pairings, quantitatively supported by a newly introduced modified Winkler score that balances validity and efficiency by explicitly penalizing undercoverage. While DCP generalizes and extends existing approaches like Conformalized Quantile Regression and Conformalized Monte Carlo, its modular design allows further extensions, setting a foundation for advancing uncertainty quantification in dynamic environments and high-risk applications.
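The abstract does not define its modified Winkler score, so the sketch below shows only the classical Winkler interval score it presumably builds on: the interval width plus a penalty of 2/α times the miss distance whenever the observation falls outside the interval. The function names are hypothetical, and the paper's modification is not reproduced.

```python
def winkler_score(lower: float, upper: float, y: float, alpha: float) -> float:
    """Classical Winkler score of a (1 - alpha) interval [lower, upper] for y.

    Lower is better: narrow intervals are rewarded, misses are penalised
    in proportion to how far y lies outside the interval.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    width = upper - lower
    if y < lower:
        return width + (2.0 / alpha) * (lower - y)
    if y > upper:
        return width + (2.0 / alpha) * (y - upper)
    return width


def mean_winkler_score(intervals, observations, alpha: float) -> float:
    """Average Winkler score over paired (lower, upper) intervals and observations."""
    scores = [
        winkler_score(lo, hi, y, alpha)
        for (lo, hi), y in zip(intervals, observations)
    ]
    return sum(scores) / len(scores)
```

Because the penalty scales with 1/α, a miss costs more at higher nominal coverage, which is how the score couples validity with efficiency.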

2605.26567 2026-05-27 cs.AI

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

MedGuideX: 将可执行指南中的决策逻辑内化到大型语言模型中以进行临床推理

Yuhao Shen, Lang Cao, Simo Du, Yuqing Wang, Juexiao Zhou, Hao Peng, Yue Guo

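AI summary Proposes a training pipeline that converts clinical practice guidelines into executable decision logic and generates factual and counterfactual question-answering data; fine-tuning a medical large language model on this data yields MedGuideX, which achieves a 10.28% relative improvement in average accuracy on four clinical reasoning benchmarks, with physician evaluation rating its reasoning steps as better.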

AI Chinese abstract

临床实践指南(CPGs)编码了基于证据的决策逻辑,临床医生通过评估患者变量、条件标准和推荐规则来应用这些逻辑。然而,现有方法通常将CPGs作为自由文本训练数据或检索源,未能充分利用其程序性决策结构。为了更好地利用这种结构,我们引入了一个基于指南的训练流程,将CPG推荐转化为可执行的临床决策逻辑,并利用它生成事实性和反事实性的问答数据。这些数据教会模型既支持指南推荐的决策,也了解在不同患者条件下决策如何变化。在生成的医学数据上对医学大语言模型进行后训练,得到MedGuideX。在四个临床推理基准上,MedGuideX的平均准确率相对提高了10.28%。医生评估进一步表明,MedGuideX能更好地恢复临床医生撰写的推理步骤,并在忠实性、有效性、完整性和清晰度方面产生医生偏好的推理依据。总体而言,我们的结果表明,来自CPGs的可执行决策逻辑可以转化为可扩展的监督信号,用于构建可靠的医学大语言模型。

English abstract

Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules. However, existing methods often use CPGs as free-text training data or retrieval sources, underutilizing their procedural decision structure. To better exploit this structure, we introduce a guideline-derived training pipeline that transforms CPG recommendations into executable clinical decision logic and uses it to generate factual and counterfactual question-answering data. These data teach models both guideline-supported decisions and how decisions change under different patient conditions. Post-training a medical LLM on the generated data yields MedGuideX. Across four clinical reasoning benchmarks, MedGuideX achieves a 10.28% relative improvement in average accuracy. Physician evaluation further shows that MedGuideX better recovers clinician-authored reasoning steps and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity. Overall, our results show that executable decision logic from CPGs can be transformed into scalable supervision for building reliable medical LLMs.

2605.26562 2026-05-27 cs.LG

Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting

超越整体模型:深度多变量时间序列预测的系统性组件级基准测试

Shuang Liang, Chaochuan Hou, Xu Yao, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang

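AI summary Proposes the TSCOMP benchmark, which decomposes deep forecasting methods into core components through orthogonal experiments, reveals their effectiveness, and builds a performance corpus that enables zero-shot model construction outperforming manually designed complex architectures.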

Comments accepted by KDD 2026 Datasets and Benchmarks Track

AI Chinese abstract

虽然先前在多变量时间序列预测中的研究集中于开发复杂的整体模型,但本工作倡导转向对其影响的细粒度、组件级理解。我们提出TSCOMP,这是第一个大规模基准,系统地将深度预测方法分解为其核心、细粒度的组件——涵盖序列预处理、编码策略、包括特定和大规模时间序列模型的网络架构以及优化方法。通过使用约束正交实验设计和广泛评估,我们进行多视角分析,揭示组件在不同骨干网络、数据特征及其交互中的有效性。除了提供见解外,该基准建立了一个包含超过20,000个模型-数据集评估的细粒度性能语料库,支持自动组件选择的学习,从而在新数据集上实现零样本模型构建。我们的实验表明,尽管简单,但基于语料库的方法始终优于最先进的方法,验证了我们评估设计的合理性,并确认系统性组件选择超越了手工设计的复杂架构。所有代码和性能语料库均可在 https://github.com/SUFE-AILAB/TSCOMP 公开获取。

English abstract

While previous research in multivariate time series forecasting has focused on developing complex holistic models, this work advocates for a shift toward a granular, component-level understanding of their impacts. We propose TSCOMP, the first large-scale benchmark that systematically deconstructs deep forecasting methods into their core, fine-grained components--spanning series preprocessing, encoding strategies, network architectures including specific and large time-series models, and optimization methods. Using constrained orthogonal experimental design and extensive evaluations, we conduct multi-view analyses that reveal component effectiveness across different backbones, data characteristics, and their interactions. Beyond providing insights, this benchmark establishes a fine-grained performance corpus comprising over 20,000 model-dataset evaluations, which supports the learning of automated component selection, enabling zero-shot model construction on new datasets. Our experiments demonstrate that the corpus-driven approach, despite its simplicity, consistently outperforms state-of-the-art methods, validating the soundness of our evaluation design and confirming that systematic component selection surpasses manually designed complex architectures. All code and the performance corpus are publicly available at https://github.com/SUFE-AILAB/TSCOMP.

2605.26560 2026-05-27 cs.CL cs.AI

Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline

可靠提取临床随访指令:一种混合神经符号管道

Michal Laufer, Yehudit Aperstein, Alexander Apartsin

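AI summary Proposes a hybrid neural-symbolic pipeline combining BioBERT entity extraction with deterministic date arithmetic, achieving near-perfect F1 on (action, date) pair extraction from synthetic outpatient notes and outperforming direct generation.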

Comments 17 pages, 5 figures

AI Chinese abstract

目标。门诊笔记携带随访指令,将动作与未来时间配对(“两周内进行脑部MRI”)。提取(动作,日期)对支持调度和审计,但生成式提取器会错过日期,因为链接和算术在解码中是隐式的。我们测试了一种混合神经符号管道与直接生成方法的对比。方法。我们定义了TestSpecification和TimeSpecification实体以及ScheduledFor关系。BioBERT提供BIO标注和双仿射链接器;实体通过28动作本体规范化,时间通过确定性方式归一化为天数偏移。我们在一个包含2000份笔记的合成门诊语料库上评估,采用动作不相交划分(18个训练,6个OOV测试),与零样本GPT-4o-mini和LoRA微调LLaMA-3 8B对比,使用笔记级bootstrap 95%置信区间。结果。在259份笔记的已知和OOV划分上,混合管道实现了测试时间对F1分别为0.997和0.986,MAE为0.00天。基线达到了高动作F1(LLaMA-3 0.992;GPT-4o-mini 0.963已知),但对F1保持在0.51-0.57(LLaMA-3)和0.53(GPT-4o-mini),置信区间与混合管道不重叠。结论。将学习的实体提取与确定性日期算术分离在此基准上优于生成方法,泛化到未见动作,并暴露了失败模式。迁移到真实电子健康记录笔记是下一步验证;初步的现实性检查见局限性。

English abstract

Objective. Outpatient notes carry follow-up instructions pairing actions with future times ("MRI brain in two weeks"). Extracting (action, date) pairs supports scheduling and audit, but generative extractors miss the date because linking and arithmetic are implicit in decoding. We test a hybrid neural-symbolic pipeline against direct generation. Methods. We define TestSpecification and TimeSpecification entities and a ScheduledFor relation. BioBERT feeds BIO tagging and a biaffine linker; entities are canonicalized via a 28-action ontology and times normalized to day offsets deterministically. We evaluate on a 2,000-note synthetic outpatient corpus with action-disjoint splits (18 train, 6 OOV-test) against zero-shot GPT-4o-mini and LoRA-fine-tuned LLaMA-3 8B with note-level bootstrap 95% CIs. Results. On 259-note seen and OOV splits the hybrid pipeline achieves Test-Time Pair F1 of 0.997 and 0.986 with 0.00-day MAE. Baselines reach high action F1 (LLaMA-3 0.992; GPT-4o-mini 0.963 seen) but Pair F1 stays at 0.51-0.57 (LLaMA-3) and 0.53 (GPT-4o-mini), CIs non-overlapping with the hybrid. Conclusion. Separating learned entity extraction from deterministic date arithmetic outperforms generation on this benchmark, generalizes to held-out actions, and exposes failure modes. Transfer to real EHR notes is the next validation; a first-pass realism check is in Limitations.
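The deterministic half of such a pipeline, turning a time expression into a day offset, can be illustrated with a small rule-based normalizer. This is a minimal sketch, not the authors' implementation: the pattern, the number-word table, the function names, and the 30-day month and 365-day year conventions are all assumptions made for the example.

```python
import re
from datetime import date, timedelta

# Hypothetical lookup tables; a real system would cover far more phrasing.
_NUMBER_WORDS = {
    "a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
    "ten": 10, "eleven": 11, "twelve": 12,
}
_DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}  # assumed conventions
_PATTERN = re.compile(
    r"\bin\s+(\d+|[a-z]+)\s+(day|week|month|year)s?\b", re.IGNORECASE
)


def to_day_offset(text):
    """Return the day offset for phrases like 'in two weeks', or None if unmatched."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    raw, unit = match.group(1).lower(), match.group(2).lower()
    count = int(raw) if raw.isdigit() else _NUMBER_WORDS.get(raw)
    if count is None:
        return None
    return count * _DAYS_PER_UNIT[unit]


def resolve_due_date(text, note_date):
    """Add the extracted offset to the note date; None if no offset was found."""
    offset = to_day_offset(text)
    return None if offset is None else note_date + timedelta(days=offset)
```

For instance, `resolve_due_date("MRI brain in two weeks", date(2026, 5, 27))` adds 14 days to the note date. Keeping this step rule-based means date arithmetic never depends on a decoder's token-level guesses.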

2605.26559 2026-05-27 cs.LG cs.AI econ.EM

Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice

审计与修复离散选择中表格基础模型的经济有效性

Yingshuo Wang, Xian Sun, Yanhang Li, Zhichao Fan, Zexin Zhuang

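AI summary Proposes a two-stage adapter that embeds tabular foundation model predictions in a utility-maximization framework, improving choice prediction accuracy while guaranteeing economic consistency.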

Comments 5 pages, 1 table. Accepted at the FMSD Workshop, ICML 2026

AI Chinese abstract

表格基础模型在选择预测任务上取得了很高的准确率,但其预测常常违反这些任务所需的经济逻辑:提高价格有时会增加预测需求,隐含的支付意愿估计经常为负或不合理。我们提出了一种两阶段适配器,将基础模型预测嵌入效用最大化框架。在第一阶段,我们估计一个标准选择模型,其参数受经济理论约束。在第二阶段,我们冻结这些参数,并训练一个校正项,将基础模型的预测作为附加信息纳入。结果模型继承了基础模型的精度提升,同时保证了政策扰动下价格-需求的单调关系,并产生可解析计算的权衡指标。在两个交通数据集上,适配器在保持完美经济一致性的同时,相比标准logit模型恢复了高达13个百分点的准确率,这是原始基础模型或传统蒸馏都无法实现的。

English abstract

Tabular foundation models achieve strong accuracy on choice prediction tasks, but their predictions often violate the economic logic those tasks require: raising a price sometimes increases predicted demand, and implied willingness-to-pay estimates are frequently negative or implausible. We propose a two-stage adapter that embeds foundation model predictions within a utility-maximization framework. In the first stage, we estimate a standard choice model whose parameters are constrained to obey economic theory. In the second stage, we freeze those parameters and train a correction term that incorporates the foundation model's predictions as additional information. The result is a model that inherits the foundation model's accuracy gains while guaranteeing monotonic price-demand relationships under policy perturbation and producing analytically computable trade-off measures. On two transportation datasets, the adapter recovers up to 13 percentage points of accuracy over a standard logit model while maintaining perfect economic consistency, something neither the raw foundation models nor conventional distillation achieve.

2605.26554 2026-05-27 cs.LG cs.AI

Linear and Neural Dueling Bandits with Delayed Feedback

具有延迟反馈的线性与神经对决赌博机

Xiangyi Wang, Pingchen Lu, Jie Mao, Mingze Kong, Zhi Hong, Zhiyong Wang, Zhongxiang Dai

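AI summary For contextual dueling bandits with stochastic delayed feedback, proposes a linear (LDB-DF) and a neural (NDB-DF) algorithm that integrate an inverse probability weighting (IPW) mechanism directly into the loss function for unbiased correction, with an O(d*sqrt(T)) regret bound in the linear setting and sublinear guarantees in the neural setting.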

AI Chinese abstract

上下文对决赌博机构成了基于偏好的决策制定的基石,在推荐系统和大语言模型对齐中有关键应用。然而,标准算法依赖于即时反馈的理想化假设,这一条件在现实场景(如提示优化)中经常被违反。这种设置带来了独特的理论挑战:与线性赌博机不同,对决赌博机估计量缺乏闭式解,使得标准加权技术的朴素适应产生偏差。为解决这一问题,我们形式化了具有随机延迟反馈的上下文对决赌博机问题,并提出了两种新颖算法:线性延迟反馈对决赌博机(LDB-DF)和神经延迟反馈对决赌博机(NDB-DF)。我们方法的核心是一种新颖的估计量,它将逆概率加权(IPW)机制直接集成到损失函数中,确保对延迟或缺失反馈的无偏校正。我们提供了全面的理论分析,为线性设置建立了O(d*sqrt(T))的遗憾界,并为神经设置建立了次线性保证。在模拟和真实数据集上的大量实验证明了我们提出方法的有效性。

English abstract

Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. This setting introduces a unique theoretical challenge: unlike linear bandits, dueling bandit estimators lack closed-form solutions, rendering naive adaptations of standard weighting techniques biased. To address this, we formalize the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear (LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback. Central to our approach is a novel estimator that integrates an Inverse Probability Weighting (IPW) mechanism directly into the loss function, ensuring unbiased correction for delayed or missing feedback. We provide comprehensive theoretical analysis, establishing an O(d*sqrt(T)) regret bound for the linear setting and sub-linear guarantees for the neural setting. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our proposed methods.
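The idea of putting inverse probability weights inside the loss can be illustrated generically. The sketch below assumes a linear Bradley-Terry (logistic) preference model and a known probability `p_obs` that each duel's feedback has arrived; both are assumptions made for illustration, since the abstract does not give the paper's estimator, and the function name is hypothetical.

```python
import math


def ipw_dueling_loss(theta, duels):
    """Inverse-probability-weighted logistic loss over a list of duels.

    Each duel is (x_first, x_second, y, observed, p_obs):
      y        -- 1 if the first arm won, 0 otherwise
      observed -- whether the (possibly delayed) feedback has arrived
      p_obs    -- assumed-known probability that feedback is observed
    Unobserved duels contribute zero; observed ones are up-weighted by
    1 / p_obs, so the expected loss matches the fully observed loss.
    """
    if not duels:
        raise ValueError("duels must be non-empty")
    total = 0.0
    for x_first, x_second, y, observed, p_obs in duels:
        if not observed:
            continue
        margin = sum(t * (a - b) for t, a, b in zip(theta, x_first, x_second))
        prob_first = 1.0 / (1.0 + math.exp(-margin))
        nll = -math.log(prob_first) if y == 1 else -math.log(1.0 - prob_first)
        total += nll / p_obs
    return total / len(duels)
```

The weighting works because the observation indicator divided by `p_obs` has expectation one, so missing feedback does not shrink the loss toward zero on average.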

2605.26546 2026-05-27 cs.AI

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

MobileExplorer: 通过在线探索加速移动GUI代理的端侧推理

Runxi Huang, Liyu Zhang, Shengzhong Liu, Xiaomin Ouyang

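AI summary Proposes the MobileExplorer framework, which probes UI elements in parallel through online exploration and records the results as structured memory, combined with a two-level rollback mechanism, reducing reasoning steps and latency and making on-device deployment of mobile GUI agents more efficient.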

AI Chinese abstract

移动图形用户界面(GUI)代理使AI模型能够代表用户自主操作智能手机。然而,现有系统主要关注优化任务准确性,并依赖云端模型进行推理,这引入了隐私问题和网络依赖延迟。因此,移动GUI代理的完全端侧部署仍未被充分探索。我们提出MobileExplorer,一种通过在线探索加速基于视觉的移动GUI代理端侧推理的新框架。关键思想是利用视觉语言模型(VLM)较长的每步推理时间,对UI元素进行轻量级并行探索。在模型推理期间,代理主动探测语义相关的UI元素,并将这些探索轨迹记录为结构化记忆。为确保在真实移动环境中可靠执行,我们设计了两级回滚机制,当快速但简单的回溯策略失败时,能够稳健地恢复初始UI状态。收集的探索轨迹随后被总结为简洁的上下文提示,并注入到提示中,以增强后续推理步骤。我们在多个现成设备上使用AndroidWorld基准测试以及新设计的更复杂任务和动态端侧环境评估了MobileExplorer。MobileExplorer将平均推理步骤数和端到端延迟减少了23%,同时将任务成功率维持或提高了高达5%。真实世界中MobileExplorer性能的视频演示可在https://youtu.be/thK7MJmdlvM获取。

English abstract

Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existing systems focus primarily on optimizing task accuracy and rely on cloud-hosted models for inference, which introduces privacy concerns and network-dependent latency. As a result, fully on-device deployment of mobile GUI agents remains underexplored. We propose MobileExplorer, a new framework that accelerates on-device inference for vision-based mobile GUI agents via online exploration. The key idea is to exploit the long per-step reasoning time of vision-language models (VLMs) by performing lightweight, parallel exploration of UI elements. During model inference, the agent proactively probes semantically relevant UI elements and records these exploration traces as structured memory. To ensure reliable execution in live mobile environments, we design a two-level rollback mechanism that robustly restores the initial UI state when a fast but naive backtracking strategy fails. The collected exploration traces are then summarized into concise contextual hints and injected into the prompt to enhance the subsequent reasoning step. We evaluate MobileExplorer on multiple off-the-shelf devices using the AndroidWorld benchmark, as well as newly designed, more complex tasks and dynamic on-device environments. MobileExplorer reduces the average number of reasoning steps and end-to-end latency by 23\%, while maintaining or improving task success rates by up to 5\%. A video demonstration of MobileExplorer performance in the real world is available at https://youtu.be/thK7MJmdlvM .

2605.26543 2026-05-27 cs.AI cs.LG

PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

PolyFusionAgent: 用于聚合物性能预测和逆向设计的多模态基础模型与自主AI助手

Manpreet Kaur, Xingying Zhang, Qian Liu

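AI summary Proposes the PolyFusionAgent framework, combining the multimodal polymer foundation model PolyFusion with the tool-augmented design agent PolyAgent; it learns a shared latent space by aligning sequence, topology, 3D geometry, and fingerprint views, enabling thermophysical property prediction and inverse design of chemically valid, structurally novel polymers, and closes the design loop with evidence retrieval from the literature.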

Comments 23 pages, 5 figures, 2 tables; Supplementary material included

AI Chinese abstract

聚合物的发现对于从能量存储到生物医学等领域至关重要,但受到天文数字般的化学设计空间以及结构、性能和先验知识的碎片化表示的阻碍。这种碎片化使得许多AI模型与物理和实验现实脱节,限制了它们支持直接可操作设计决策的能力。在这里,我们介绍PolyFusionAgent,一个交互式框架,将多模态聚合物基础模型(PolyFusion)与工具增强、基于文献的设计代理(PolyAgent)相结合。PolyFusion对齐互补的聚合物视图,包括序列、拓扑、3D几何和指纹,跨越数百万种聚合物,学习一个跨化学和数据体系可迁移的共享潜在空间,改进了热物理性能预测,并实现了超出参考设计空间的化学有效、结构新颖聚合物的性能条件生成。PolyAgent通过将预测和逆向设计与从聚合物文献中检索证据联系起来,在一个工作流中提出、评估和情境化假设,从而闭合设计循环。PolyFusionAgent共同实现了交互式、证据关联的聚合物发现,结合了大规模表示学习、多模态化学知识和可验证的科学推理。

English abstract

Polymer discovery is central to fields ranging from energy storage to biomedicine, but it is hindered by an astronomically large chemical design space and fragmented representations of structure, properties, and prior knowledge. This fragmentation leaves many AI models disconnected from physical and experimental reality, restricting their ability to support directly actionable design decisions. Here we introduce PolyFusionAgent, an interactive framework coupling a multimodal polymer foundation model (PolyFusion) with a tool-augmented, literature-grounded design agent (PolyAgent). PolyFusion aligns complementary polymer views including sequence, topology, 3D geometry, and fingerprints across millions of polymers to learn a shared latent space transferable across chemistries and data regimes, improving thermophysical property prediction and enabling property-conditioned generation of chemically valid, structurally novel polymers beyond the reference design space. PolyAgent closes the design loop by linking prediction and inverse design with evidence retrieval from the polymer literature, proposing, evaluating, and contextualizing hypotheses with explicit precedent in one workflow. Together, PolyFusionAgent enables interactive, evidence-linked polymer discovery combining large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.

2605.26538 2026-05-27 cs.CV

Scheduled Style Injection: Expanding the Style-Content Pareto Frontier in Training-Free Diffusion-based Style Transfer

调度式风格注入:在免训练扩散风格迁移中扩展风格-内容帕累托前沿

Amey Sunil Kulkarni

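AI summary Systematically explores scheduling along four dimensions covering layers, timesteps, and ControlNet geometric conditioning, and finds that decreasing schedules (strong structural injection in shallow layers and early timesteps) outperform increasing ones, that cosine and square-root timestep schedules outperform linear, and that combining the nearly independent gamma scheduling and ControlNet conditioning expands the Pareto frontier, with a 6.1% relative improvement in ArtFID.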

Comments Accepted to CVPR NTIRE 2026

AI Chinese abstract

基于预训练扩散模型的风格迁移已取得快速进展,但一个核心问题仍未充分探索:模型中风格注入应在何处最强?领先的免训练方法StyleID在所有层和时间步上统一使用单个全局参数(gamma),这强制了风格质量与内容保留之间的固定权衡。我们证明这种权衡是不必要的刚性。我们系统地探索了四个控制维度:跨解码器层改变风格注入强度、跨去噪时间步改变强度,以及沿两个轴调度ControlNet几何条件。模式在所有地方一致:递减调度(在较浅层和较早时间步注入更强的结构信号)可靠地优于反向调度。除方向外,调度形状也很重要:余弦和平方根时间步调度优于线性。最重要的是,我们发现gamma调度和ControlNet条件几乎独立。由此产生的组合配置扩展了帕累托前沿,与任何单一基线设置相比,提供了风格保真度和内容保留之间的优越权衡。我们最佳的平衡配置实现了ArtFID 27.036,而StyleID为28.801——相对改进6.1%,在整个风格-内容权衡前沿上具有一致的增益。结果在35种配置(总计超过28,000张风格化图像)上使用四种互补指标进行了验证。这些发现在SD骨干网络上具有相同的排名顺序。所有修改都是免训练、无参数的,仅需几行调度代码;代码可在https://github.com/ameyskulkarni/scheduled_style_injection获取。

英文摘要

Style transfer with pre-trained diffusion models has advanced rapidly, but a core question remains underexplored: where in the model should style injection be strongest? StyleID, the leading training-free method, uses a single global parameter (gamma) uniformly across all layers and timesteps, which forces a fixed tradeoff between style quality and content preservation. We show this tradeoff is unnecessarily rigid. We systematically explore four dimensions of control: varying style injection strength across decoder layers, across denoising timesteps, and scheduling ControlNet geometric conditioning along both axes. The pattern is consistent everywhere: decreasing schedules, with stronger structural signal injection in shallower layers and earlier timesteps, reliably outperform the reverse. Beyond direction, schedule shape matters: cosine and square-root timestep schedules outperform linear. Most importantly, we find that gamma scheduling and ControlNet conditioning are nearly independent. The resulting combined configurations expand the Pareto frontier, offering superior tradeoffs between style fidelity and content preservation compared to any single baseline setting. Our best balanced configuration achieves ArtFID of 27.036 versus StyleID's 28.801 - a 6.1% relative improvement, with consistent gains across the full style-content tradeoff frontier. Results are validated across 35 configurations totaling over 28,000 stylized images using four complementary metrics. These findings generalize across SD backbones with identical rank ordering. All modifications are training-free, parameter-free, and require only a few lines of scheduling code; code is available at https://github.com/ameyskulkarni/scheduled_style_injection.
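The schedule shapes named in the abstract (linear, cosine, square-root) can be illustrated with a minimal sketch. This is not the authors' code: the function name `strength_schedule`, its parameters, and the start and end values are hypothetical, and the mapping from "strength" to StyleID's gamma is left open because the abstract does not specify it. The second helper simply re-derives the reported 6.1% relative ArtFID improvement from the two scores quoted above.

```python
import math


def strength_schedule(step, num_steps, start, end, shape="cosine"):
    """Hypothetical injection-strength schedule over denoising steps.

    Returns `start` at step 0 and `end` at the last step; choosing
    start > end gives a decreasing schedule.
    """
    if num_steps < 2:
        return start
    t = step / (num_steps - 1)  # progress in [0, 1]
    if shape == "linear":
        w = t
    elif shape == "cosine":
        w = 0.5 * (1.0 - math.cos(math.pi * t))
    elif shape == "sqrt":
        w = math.sqrt(t)
    else:
        raise ValueError(f"unknown schedule shape: {shape}")
    # Convex combination, so the endpoints are reproduced exactly.
    return (1.0 - w) * start + w * end


def relative_improvement(baseline, ours):
    """Relative reduction of a lower-is-better metric such as ArtFID."""
    return (baseline - ours) / baseline


if __name__ == "__main__":
    for shape in ("linear", "cosine", "sqrt"):
        values = [strength_schedule(s, 5, 1.0, 0.2, shape) for s in range(5)]
        print(shape, [round(v, 3) for v in values])
    print(f"{relative_improvement(28.801, 27.036):.1%}")
```

All three shapes share the same endpoints and differ only in how quickly the strength falls between them, which is the property the abstract reports as mattering.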

2605.26537 2026-05-27 cs.CL

Conceptual Steganography

概念隐写术

Zhejian Zhou, Jonathan May

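AI summary: Proposes conceptual steganography, which embeds information in a chain of thought through patterns of high-level reasoning behavior rather than lexical choice. Experiments show it is more robust to paraphrase defenses than standard keyword approaches and does not affect reasoning utility.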

详情
AI中文摘要

语言模型(LM)发出的思维链(CoT)驱动了其大部分能力。然而,承载有用推理的同一序列也可以隐蔽地传递信息:一个未对齐的模型可能在其CoT中嵌入隐蔽信息,从而逃避人类监督,这种隐写术形式被称为编码推理。先前的LM隐写术方案在词元或词汇空间操作,而内容保留释义器是近期工作中规范且有效的防御手段。我们引入了概念隐写术,其中CoT的每一步通过高级推理行为模式而非词汇选择来携带信息。在四个模型家族和两个推理领域中,这种后门通信渠道被证明比标准关键词方法对强释义防御具有更一致的鲁棒性,并且将信息编码到CoT中不会影响其在推理过程中的效用。在提高对这一新风险的认识后,我们进一步证明,一个策略感知的释义器可以关闭大部分渠道,强调了确保野外忠实LLM推理的新挑战和推荐防御措施。

英文摘要

Language Models (LMs) emit Chains-of-Thought (CoTs) that drive much of their capability. However, the same sequence that carries useful reasoning can also covertly convey messages: a misaligned model may embed covert information in its CoT that slips through human supervision, a form of steganography known as encoded reasoning. Prior LM steganography schemes operate in the token or lexical space, and a content-preserving paraphraser is the canonical and effective defense in recent work. We introduce conceptual steganography, in which each step of a CoT carries information through patterns of high-level reasoning behavior, rather than through lexical choice. Across four model families and two reasoning domains, this backdoor communication channel is shown to be consistently more robust to a strong paraphrase defense than standard keyword approaches, and the encoding of information into CoTs does not affect their utility in the reasoning process. Having raised awareness of this new risk, we then demonstrate that a strategy-aware paraphraser can close much of the channel, highlighting new challenges and recommended defenses for ensuring faithful LLM reasoning in the wild.

2605.26535 2026-05-27 cs.LG cs.AI cs.CV cs.NA math.NA

Recursive Flow Matching

递归流匹配

Jiahe Huang, Sihan Xu, Sharvaree Vadgama, Rose Yu

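AI summary: Proposes the Recursive Flow Matching (RecFM) framework, which uses a self-consistency constraint to align trajectories across discretization scales and achieves high-fidelity one-step or few-step (2-4 step) dynamic generation. On scientific benchmarks it is up to 20x faster than leading diffusion-based emulators while improving predictive accuracy.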

Comments Project page: https://jhhuangchloe.github.io/RecFM/

详情
AI中文摘要

生成模型已成为解决物理系统和建模复杂时空动态的强大范式。然而,在不产生高计算成本的情况下实现高物理精度仍然是一个基本挑战,因为现有方法面临关键的速度-保真度权衡。在这项工作中,我们引入了递归流匹配(RecFM),一个用于预测复杂时空动态的生成框架。RecFM强制执行自一致性以对齐跨离散化尺度的轨迹,减少离散化误差并改善基于物理任务的各种指标。据我们所知,这是第一种在科学系统中实现高保真单步和少步(2-4步)动态生成的方法,其性能可与最先进的多步求解器相媲美。在具有挑战性的科学基准测试中,RecFM相比领先的扩散模拟器实现了高达20倍的加速,同时提高了预测精度。此外,与普通流匹配相比,RecFM将均方误差降低了超过15%,为实时科学模拟提供了一种可扩展且高效的解决方案。

英文摘要

Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high physical accuracy without incurring high computational cost remains a fundamental challenge, as existing approaches face a critical speed-fidelity trade-off. In this work, we introduce Recursive Flow Matching (RecFM), a generative framework for forecasting complex spatiotemporal dynamics. RecFM enforces self-consistency to align trajectories across discretization scales, reducing discretization errors and improving performance across metrics for physics-based tasks. To our knowledge, this is the first method to achieve high-fidelity one- and few-step (2-4 step) dynamic generation for scientific systems with performance comparable to state-of-the-art multi-step solvers. Across challenging scientific benchmarks, RecFM achieves up to a 20$\times$ speedup over leading diffusion-based emulators while improving predictive accuracy. Furthermore, RecFM reduces mean squared error by over 15% compared to vanilla flow matching, offering a scalable and efficient solution for real-time scientific emulation.
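One plausible reading of "self-consistency across discretization scales" is that a single coarse step should land where two fine steps of half the size land. The sketch below shows only that idea on a toy scalar system with an explicit Euler step. It is an assumption-laden illustration, not RecFM's actual objective: `euler_step`, `self_consistency_gap`, and the velocity field `decay` are hypothetical names introduced here.

```python
def euler_step(x, dt, velocity):
    """Advance state x by one explicit Euler step of size dt."""
    return x + dt * velocity(x)


def self_consistency_gap(x, dt, velocity):
    """Squared gap between one step of size dt and two steps of size dt/2.

    A training objective could penalize this gap so that coarse and fine
    discretizations of the same trajectory agree.
    """
    coarse = euler_step(x, dt, velocity)
    half = euler_step(x, dt / 2.0, velocity)
    fine = euler_step(half, dt / 2.0, velocity)
    return (coarse - fine) ** 2


def decay(x):
    """Toy velocity field dx/dt = -x (exponential decay)."""
    return -x


if __name__ == "__main__":
    for dt in (0.5, 0.25, 0.125):
        print(dt, self_consistency_gap(1.0, dt, decay))
```

For this toy field the gap shrinks as the step size shrinks, which is the discretization error that a few-step generator would need to remove by learning rather than by taking more steps.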

2605.26533 2026-05-27 cs.CV cs.AI cs.CL cs.LG

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

一种用于工业检测中自动缺陷推理与报告生成的混合视觉-语言架构

Malikussaid, Imad Gohar

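AI summary: Proposes a decoupled, edge-deployable pipeline that combines a YOLO26-x-obb detector, a deterministic encoding module, and a QLoRA-fine-tuned Qwen-2.5-1.5B model to localize wind turbine blade defects and generate structured reports. It substantially outperforms a zero-shot VLM baseline on BLEU-4, hallucination rate, and expert score.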

Comments 23 pages, 6 figures, 9 equations, and 6 tables

详情
AI中文摘要

自动化工业检测需要精确的缺陷定位和结构化的维护报告生成;在当前的实践中,这些任务被分开处理,语言解释留给人类专家。本文描述了一种解耦的、边缘可部署的风电叶片检测管道,由三个组件组成,每个组件处理一个不同的子任务。“眼睛”是一个YOLO26-x-obb定向边界框检测器,在数据集原生分辨率下定位缺陷。“桥梁”是一个确定性的、无参数的编码模块,将每个检测到的边界框映射到嵌入结构化提示中的网格参考空间令牌。“大脑”是一个4比特量化的Qwen-2.5-1.5B模型,通过量化低秩适应(QLoRA)在947个合成生成的维护报告上进行适配,从该提示生成结构化的JSON报告。检索增强微调(RAFT)进一步将每个建议基于索引的维护程序。五项消融实验,通过BLEU-4、ROUGE-L、幻觉率(HR)和LLM-as-a-Judge评分标准,将该管道与单一视觉-语言模型(VLM)基线以及移除一个组件的部分配置进行比较。完整系统实现了BLEU-4 0.41、HR=4%和专家评分8.6/10,而零样本VLM基线分别为0.07、65%和3.3/10。在相同的检测证据下,QLoRA适配的1.5B模型在单个T4级GPU上以每秒47个令牌的速度生成比671B参数通用API模型更高质量的报告。结果表明,具有小型领域特定训练语料库的专用解耦架构在此结构化生成任务上优于通用端到端模型。

英文摘要

Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes, a YOLO26-x-obb oriented bounding-box detector, localizes defects at dataset-native resolution. The Bridge, a deterministic, parameter-free encoding module, maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain, a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports, generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 = 0.41, HR = 4%, and Expert Score = 8.6/10, compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that a purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.
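The "Bridge" step, mapping a detected box to grid-referenced spatial tokens, can be sketched as a small deterministic function. The abstract does not give the grid size, the token format, or how oriented boxes are handled, so everything below is a labeled assumption: an 8x8 grid, spreadsheet-style labels such as `B3` (row letter, 1-indexed column), and coverage computed from the axis-aligned envelope of the box corners. The name `bbox_to_grid_tokens` is hypothetical.

```python
import string


def bbox_to_grid_tokens(corners, img_w, img_h, rows=8, cols=8):
    """Map box corners [(x, y), ...] to the grid cells they span.

    Assumed convention (not from the paper): rows are labeled A, B, C, ...
    from the top, columns 1, 2, 3, ... from the left. Coverage uses the
    axis-aligned envelope of the corners, so a rotated box may be
    over-approximated.
    """
    if rows > len(string.ascii_uppercase):
        raise ValueError("too many rows for single-letter labels")
    cell_w = img_w / cols
    cell_h = img_h / rows
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]

    def clamp(index, upper):
        # Keep indices inside the grid even for points on the image border.
        return max(0, min(index, upper - 1))

    col_lo = clamp(int(min(xs) / cell_w), cols)
    col_hi = clamp(int(max(xs) / cell_w), cols)
    row_lo = clamp(int(min(ys) / cell_h), rows)
    row_hi = clamp(int(max(ys) / cell_h), rows)
    return [
        f"{string.ascii_uppercase[r]}{c + 1}"
        for r in range(row_lo, row_hi + 1)
        for c in range(col_lo, col_hi + 1)
    ]


if __name__ == "__main__":
    box = [(12, 12), (28, 12), (28, 28), (12, 28)]
    tokens = bbox_to_grid_tokens(box, img_w=80, img_h=80)
    print("Defect spans grid cells: " + ", ".join(tokens))
```

Because the mapping has no learned parameters, the same box always yields the same tokens, which is what lets a downstream text-only model consume location information without seeing pixels.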

2605.26530 2026-05-27 cs.AI

Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

哪些变化重要?通过相关性敏感评估和求解器基础推理实现可信赖的法律AI

Chen Linze, Cai Yufan, Hou Zhe, Dong Jin Song

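AI summary: Formulates the legal-relevance-sensitive evaluation problem, introduces a unified evaluation suite, and designs LexGuard, an adversarial multi-agent framework grounded in formal reasoning, to improve the calibrated sensitivity of legal AI to legally relevant changes.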

详情
AI中文摘要

法律推理需要区分重要的变化和不重要的变化。法律AI应在法律无关的扰动下保持稳定,但当扰动改变法律实质要点时应发生变化。我们将这一要求形式化为法律相关性敏感评估问题:LLM应仅对法律相关的变化敏感。我们引入了一个统一的评估套件,涵盖司法公平性、鲁棒性和法规混淆场景中的应变化和不应变化评估。我们的评估表明,现有的法律LLM系统性地对法律无关的变化敏感,并且常常无法区分相关的法律要素和法规规则。为了缓解这些失败,我们提出了LexGuard,一个基于形式推理的对抗多智能体框架。LexGuard将法规形式化为可执行约束,使用对抗智能体提取竞争的事实-法规论点,并调用SMT求解器验证法律满足性和逻辑一致性。实验表明,LexGuard通过减少对操纵性框架的脆弱性、改善相似法规之间的区分、限制法律无关属性的影响以及增加良性重述下的一致性,提高了法律推理的可靠性。我们表明,法律可信赖性不仅需要准确性,还需要对法律实质性变化的校准敏感性。

英文摘要

Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points. We formulate this requirement as a legal-relevance-sensitive evaluation problem: LLMs should only be sensitive to the legally relevant change. We introduce a unified evaluation suite covering should-change and should-not-change evaluation across judicial fairness, robustness, and statute-confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact-statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy, but calibrated sensitivity to legally material changes.

2605.26526 2026-05-27 cs.LG cs.CR

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

开源权重大语言模型微调防御易受简单攻击

Kevin Kuo, Chhavi Yadav, Virginia Smith

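AI summary: Finds that defenses for open-weight large language models are susceptible to low-cost attacks such as abliteration and prefilling, and proposes abliteration-resistant tuning (ART), which reduces attack success rates by 10%-20%.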

Comments main body: 9 pages, 3 figures

详情
AI中文摘要

近期针对开源权重大语言模型(LLMs)的防御措施旨在防止对抗性使用。这些防御措施基于一个假设:新的有害行为是通过微调学习到的,而不是通过越狱模型诱发的。然而,预训练的LLMs已经在许多领域编码了大量有害知识,这引发了一个重要问题:对手能否越狱受保护的模型,在不进行任何微调的情况下实现有害使用?在本文中,我们表明开源权重防御措施容易受到更简单的策略攻击,这些策略虽然众所周知,但尚未针对这些防御措施进行系统评估。具体来说,我们评估了两种低成本攻击——abliteration和prefilling——它们不依赖于基于梯度的优化。在三个有害性评估基准(BeaverTails、HarmBench和AdvBench)上,这些攻击将针对受保护开源权重模型的攻击成功率从低于10%提高到16%-96%的范围。为了缓解这一漏洞,我们引入了abliteration-resistant tuning (ART),它将基于abliteration的目标纳入训练。ART可以叠加到现有防御措施上,并将abliteration、prefilling及其组合的成功率降低10%-20%。这些发现表明,开源权重模型的攻击面比先前描述的要更广,并且对防御措施的评估应包含更多样化的攻击策略,而不仅仅是对抗性微调。

英文摘要

Recent defenses for safeguarding open-weight large language models (LLMs) are intended to prevent adversarial usage. Underlying these defenses is an assumption that new harmful behavior is learned through fine-tuning rather than elicited by jailbreaking the model. Yet pretrained LLMs already encode substantial harmful knowledge across many domains, which raises an important question: can an adversary jailbreak safeguarded models to achieve harmful usage without any fine-tuning at all? In this paper, we show that open-weight safeguards are susceptible to simpler strategies that, despite being well known, have not been systematically evaluated against these safeguards. Specifically, we evaluate two low-cost attacks, abliteration and prefilling, that do not rely on gradient-based optimization. Across three harmfulness evaluation benchmarks (BeaverTails, HarmBench, and AdvBench), these attacks increase attack success rates against safeguarded open-weight models from below 10% to a range of 16%-96%. To mitigate this vulnerability, we introduce abliteration-resistant tuning (ART), which incorporates an abliteration-based objective into training. ART can be layered onto existing defenses and reduces the success rates of abliteration, prefilling, and their combination by 10%-20%. These findings indicate that the attack surface for open-weight models is broader than previously characterized, and that evaluations of safeguarding defenses should incorporate a more diverse set of attack strategies beyond adversarial fine-tuning.

2605.26525 2026-05-27 cs.CV cs.AI

ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation

ReCA: 通过递归上下文分配实现多镜头长视频外推

Akide Liu, Jinbo Xing, Chaojie Mao, Ye Li, Zeyu Zhang, Yefei He, Weijie Wang, Zihan Wang, Yu Liu, Gholamreza Haffari, Bohan Zhuang

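AI summary: Targets the context-allocation bottleneck in multi-shot video extrapolation and proposes a recursive context allocation framework that improves the consistency and quality of long-video generation through hierarchical decomposition and structured state propagation.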

Comments Project Page: https://reca.vmv.re , Code: https://github.com/ali-vilab/ReCA

详情
AI中文摘要

分钟级电影式视频生成是生成式视频模型的核心挑战。现有范式仅解决该挑战的片段:单镜头外推保留锚点但缺乏电影结构,而多镜头叙事施加结构却可自由创造视觉状态而非延续观察到的状态。我们定义多镜头视频外推(MSVE)任务,该任务将观察到的帧或片段扩展为一系列具有电影结构的镜头,同时保留锚点状态并推进叙事意图。该设置受限于短视频模型的每次调用生成预算。我们识别出三个耦合瓶颈:(1)全局规划器从完整剧本中过度指定不支持的细节;(2)镜头级提示在携带完整故事时稀释任务相关状态;(3)时间链将生成帧转变为有损记忆,其中身份、场景、对象和动作状态衰减。MSVE揭示长视频失败不仅是上下文长度的限制,更是上下文分配失败。我们提出递归上下文分配(ReCA),一种推理时框架,在规划和生成之间分层分配上下文。ReCA递归地将MSVE分解为上下文有界子问题,在叶节点调用冻结生成器,并跨时间传播结构化状态更新。为评估该设置,我们进一步提出MSVE-Bench和NB-Q,一种源接地协议,带有专为3至5分钟长视频生成设计的提示,该场景未被现有短视频基准覆盖。与先前方法相比,ReCA在最强竞争控制器上将平均归一化分数提高8%至16%,并将多镜头一致性指标提高28%至43%。查看项目页面:https://reca.vmv.re。

英文摘要

Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of this challenge: single-shot extrapolation preserves an anchor but lacks cinematic structure, while multi-shot storytelling imposes structure yet remains free to invent its visual states rather than continue an observed one. We define Multi-Shot Video Extrapolation (MSVE), a task that extends an observed frame or clip into a sequence of cinematically structured shots while preserving anchor state and advancing narrative intent. This setting operates under the finite per-call generation budget of short-video models. We identify three coupled bottlenecks: (1) global planners over-specify unsupported details from full screenplays; (2) shot-level prompts dilute task-relevant state when carrying the complete story; and (3) temporal chaining turns generated frames into a lossy memory in which identity, scene, object, and action state decay. MSVE reveals that long-video failure is not merely a limitation of context length, but a failure of context allocation. We propose Recursive Context Allocation (ReCA), an inference-time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes MSVE into context-bounded subproblems, invokes frozen generators at leaf nodes, and propagates structured state updates across time. To evaluate this setting, we further propose MSVE-Bench and NB-Q, a source-grounded protocol with prompts purpose-built for 3 to 5 minute long-video generation, a regime not addressed by existing short-clip benchmarks. Compared to previous methods, ReCA improves average normalized score by 8 to 16 percent over the strongest competing controller and improves multi-shot consistency metrics by 28 to 43 percent. View the project page at https://reca.vmv.re.

2605.26524 2026-05-27 cs.CV cs.AI

CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence

CmIVTP:面向海事智能的基于跨模态交互的船舶轨迹预测

Yuxu Lu, Dong Yang, Xiaoyu Li, Mengwei Bao, Congcong Zhao

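AI summary: Addresses inaccurate vessel trajectory prediction caused by the limits of single-source data with CmIVTP, a cross-modal interaction framework that fuses AIS and CCTV data and uses a target-aware scene encoder and a cross-modal interaction transformer for high-accuracy prediction.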

详情
AI中文摘要

海事智能交通系统(MITS)对于确保繁忙水域的航行安全和效率至关重要。然而,由于单源数据的局限性,准确的船舶轨迹预测仍然具有挑战性。自动识别系统(AIS)数据对于小型船舶通常稀疏或不可用,而仅靠闭路电视(CCTV)数据无法完全捕捉动态船舶行为。为缓解这些挑战,我们提出了一种基于跨模态交互的船舶轨迹预测(称为CmIVTP)框架,以建模船舶动力学与环境约束之间的复杂交互。具体地,我们引入了一个目标感知场景编码器来提取场景语义特征,有效捕捉船舶-环境交互并提高轨迹预测精度。此外,我们提出了一个跨模态交互变换器,它集成了AIS衍生的运动特征、基于CCTV的环境特征和场景表示。它利用跨模态注意力机制同时捕捉模态内语义和模态间交互,确保动态一致且环境可行的预测。此外,我们通过将历史AIS轨迹聚类为代表性运动模式构建了船舶群体轨迹库,为候选轨迹生成提供了一种高效且可扩展的方法。另外,我们引入了海事多模态数据集增强版(名为Maritime-MmD$^+$),这是一个同步AIS数据和CCTV视频数据的大规模数据集,为多模态轨迹预测研究提供了有力支持。大量实验表明,CmIVTP在多模态驱动的船舶轨迹预测基准上取得了更好的性能。本工作的代码资源可在https://github.com/LouisYxLu/CmIVTP获取。

英文摘要

Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However, accurate vessel trajectory prediction remains challenging due to the limitations of single-source data. Automatic identification system (AIS) data is often sparse or unavailable for small vessels, while closed-circuit television (CCTV) data alone cannot fully capture dynamic vessel behavior. To mitigate these challenges, we propose a cross-modal interaction-based vessel trajectory prediction (named CmIVTP) framework to model the intricate interactions between vessel dynamics and environmental constraints. Specifically, we introduce a target-aware scene encoder to extract scene semantic features, effectively capturing vessel-environment interactions and enhancing trajectory prediction accuracy. In addition, we propose a cross-modal interaction transformer, which integrates AIS-derived motion features, CCTV-based environmental features, and scene representations. It leverages cross-modal attention mechanisms to simultaneously capture intra-modal semantics and inter-modal interactions, ensuring dynamically consistent and environmentally feasible predictions. Furthermore, we construct a vessel group trajectory bank by clustering historical AIS trajectories into representative motion patterns, providing an efficient and scalable approach for candidate trajectory generation. Additionally, we introduce the maritime multimodal dataset plus (named Maritime-MmD$^+$), a large-scale dataset that synchronizes AIS data and CCTV video data, providing robust support for multimodal trajectory prediction research. Extensive experiments demonstrate that CmIVTP achieves better performance on multimodal-driven vessel trajectory prediction benchmarks. The code resources for this work are available at https://github.com/LouisYxLu/CmIVTP.

2605.26520 2026-05-27 cs.CV cs.AI

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

InterSketch: 一种具有自校正视觉草图和逐步奖励的交错推理模型

Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu

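AI summary: Addresses the limits of the text-centric paradigm in long-horizon visual reasoning for vision-language models with InterSketch, which strengthens interleaved visual-textual chain-of-thought through self-correction and stepwise reward mechanisms and outperforms proprietary models such as Gemini-3-Pro on visual reasoning benchmarks.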

详情
AI中文摘要

尽管视觉-语言模型(VLM)已展现出多轮视觉推理能力,但其推理轨迹仍相对浅层且以文本为中心,限制了其在复杂视觉挑战中的适用性。相比之下,人类思维通常涉及长程推理,并伴有交错的视觉-文本思维链(VT-CoT)。为弥合这一差距,我们引入InterSketch,一种交错推理模型,通过自校正和逐步奖励机制增强VT-CoT能力。InterSketch使用外部工具动态生成中间视觉草图,并将其与文本推理交错进行,从而在长程视觉理解任务中实现有效的感知和逻辑推理。具体而言,在第一个冷启动阶段,我们提出了一个合成的高质量交错VT-CoT数据集,并引入反思机制,使模型具备多轮交错推理和自校正能力。在后续的强化学习(RL)阶段,我们设计了一种逐步奖励机制,以缓解长程推理中仅端到端监督固有的奖励信号稀疏性问题。在视觉推理基准上的大量实验证明了InterSketch的有效性,其性能甚至超越了Gemini-3-Pro等专有模型。

英文摘要

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.

2605.26514 2026-05-27 cs.CV cs.AI cs.LG

CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimer's Disease Pathologies

CSV-ViT: 一种使用可变大小皮层超顶点的视觉Transformer用于阿尔茨海默病病理检测

Geonwoo Baek, Ikbeom Jang

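AI summary: Proposes an ROI-preserving, vertex-based, variable-sized cortical surface tokenization (cortical supervertices) and a Vision Transformer that tolerates variable-sized patches (CSV-ViT), which outperforms existing surface models on three classification tasks: Alzheimer's disease diagnosis, amyloid positivity, and tau positivity.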

详情
AI中文摘要

确认阿尔茨海默病(AD)通常依赖于正电子发射断层扫描(PET),该方法仍然昂贵且有创,这促使了基于结构MRI的预筛查的使用。在非欧几里得流形,特别是大脑皮层表面上的深度学习,由于数据的球形拓扑结构面临重大挑战。最近的表面模型已经能够从皮层表面数据中学习;然而,施加基于面的均匀补丁通常会导致补丁边界处的重复顶点。一般来说,许多基于表面的模型对感兴趣区域(ROI)的感知有限,这可能导致非皮层区域(如内侧壁)被包含在内。我们提出了一种皮层表面分块方法,该方法执行保留ROI的、基于顶点的、可变大小的补丁划分。我们将这些皮层表面补丁称为皮层超顶点(CSV)。基于这种表示,我们设计了CSV视觉Transformer(CSV-ViT),这是一种可变大小补丁容忍的视觉Transformer,使用填充和掩码感知的补丁嵌入。我们使用T1加权MRI,并通过将AD相关状态分类为三个类别来评估我们的框架:AD诊断、淀粉样蛋白阳性和tau蛋白阳性。在实验中,CSV-ViT取得了比最近基于表面的模型更高的分类性能。结果表明,所提出的CSV-ViT可能支持在PET或脑脊液确认之前基于MRI的AD相关状态预测。

英文摘要

Confirming Alzheimer's disease (AD) typically relies on positron emission tomography (PET), which remains costly and invasive, motivating the use of structural MRI-based prescreening. Deep learning on non-Euclidean manifolds, particularly brain cortical surfaces, faces significant challenges due to the data's spherical topology. Recent surface models have enabled learning from cortical surface data; however, imposing face-based uniform patches often causes duplicate vertices at patch boundaries. In general, many surface-based models are limited in their awareness of the region of interest (ROI), which can result in non-cortical regions, such as the medial wall, being included. We propose a cortical surface tokenization that performs ROI-preserving, vertex-based, variable-sized patch partitioning. We refer to these cortical surface patches as cortical supervertices (CSVs). Building on this representation, we design the CSV Vision Transformer (CSV-ViT), a variable-size patch-tolerant Vision Transformer that uses padding and a mask-aware patch embedding. We used T1-weighted MRI and evaluated our framework by classifying AD-related status into three categories: AD diagnosis, amyloid positivity, and tau positivity. Across the experiments, CSV-ViT achieved higher classification performance than recent surface-based models. The results suggest that the proposed CSV-ViT may support MRI-based prediction of AD-related status prior to PET or CSF confirmation.

2605.26513 2026-05-27 cs.CV

Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression

Re-M3Dr:重新平衡的多模态均值偏差回归

Haojie Yin, Chengcheng Feng, Tianyi Liu, Tianqi Zhang, Kaizhu Huang

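AI summary: Addresses the finding that multimodal medical image fusion underperforms unimodal models with the Re-M3Dr framework, which performs multimodal mean deviation regression using adaptive-margin supervised contrastive learning and sharpness-aware gradient modulation, reducing mean squared error by 29% on clinical datasets.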

详情
AI中文摘要

均值偏差(MD)是评估眼科视野损失的关键指标。虽然以往的工作仅关注从光学相干断层扫描(OCT)预测MD,但直观上假设将OCT与另一种眼底摄影(FP)成像结合可以提高性能,因为两种眼科医学成像提供了互补信息。当应用复杂的多目标优化时,这一点尤其值得期待,正如常见的多模态分类中所记载的那样。令人惊讶的是,我们的研究表明,在这种医学成像场景中,多模态融合的性能不如单模态模型。通过详细分析,我们确定根本原因是数据分布和模态学习冲突之间的耦合不平衡。这种不平衡扭曲了优化景观,导致训练不稳定。为了解决这一挑战,我们提出了重新平衡的多模态均值偏差回归(Re-M3Dr)方法,这是一种新颖的多模态回归框架。我们通过自适应边界的监督对比学习增强单模态表示。然后,我们的框架通过锐度感知梯度调制稳定联合优化。在公共和私人临床数据集上的实验结果表明,与最先进的多模态学习方法相比,均方误差平均降低29%,证明了Re-M3Dr的优越性。代码可在补充材料中获得。

英文摘要

Mean Deviation (MD) is a critical metric for assessing visual field loss in ophthalmology. While previous work has focused solely on predicting MD from Optical Coherence Tomography (OCT), it is intuitive to assume that combining OCT with a second imaging modality, fundus photography (FP), could improve performance, as the two ophthalmic imaging modalities provide complementary information. This is particularly expected when sophisticated multi-objective optimization is applied, as documented in common multimodal classification. Surprisingly, our investigations reveal that multimodal fusion in this medical imaging scenario performs worse than a unimodal model. Through detailed analysis, we identify the root cause as a coupled imbalance between data distribution and modality learning conflict. This imbalance distorts the optimization landscape, leading to unstable training. To address this challenge, we propose Rebalanced MultiModal Mean Deviation Regression (Re-M3Dr), a novel multimodal regression framework. We enhance unimodal representations through adaptive-margin supervised contrastive learning. Then, our framework stabilizes the joint optimization with sharpness-aware gradient modulation. Experimental results on both public and private clinical datasets show an average 29% reduction in MSE compared to SOTA multimodal learning methods, demonstrating the superiority of Re-M3Dr. The code is available in the supplementary materials.

2605.26509 2026-05-27 cs.LG math.PR stat.CO

SIKA-GP: Accelerating Gaussian Process Inference with Sparse Inducing Kernel Approximations for Bayesian Deep Learning

SIKA-GP:利用稀疏诱导核近似加速贝叶斯深度学习中的高斯过程推断

Wenyuan Zhao, Rui Tuo, Chao Tian

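AI summary: Proposes SIKA-GP, which uses sparse inducing kernel approximations based on a dyadic ordered template basis to reduce the dependence of Gaussian process inference cost on the number of inducing points to O(log M), with efficient tensorized GPU computation. It embeds naturally into Bayesian neural networks and significantly speeds up training and inference on vision and transformer-based language benchmarks without sacrificing predictive performance.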

Comments 20 pages, 8 figures; accepted to International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

高斯过程(GPs)为不确定性估计提供了原则性的贝叶斯框架,但其计算复杂度严重限制了在大规模数据集上的可扩展性。我们提出SIKA-GP,该方法使用基于二元有序模板基的稀疏诱导核近似来加速GP推断,对诱导点数量的复杂度依赖仅为${O}(\log M)$。我们的方法从稀疏激活基构建紧凑且表达力强的核表示,从而实现高效的张量化GPU计算,并与现代大规模模型无缝集成。SIKA-GP可以自然地嵌入具有稀疏激活的贝叶斯神经网络(BNNs)中,在训练和推断中均实现显著加速,且不牺牲预测性能。该方法自然地扩展到深度特征学习,解决了深度架构和高维特征表示带来的可扩展性挑战。在视觉和基于Transformer的语言基准上的实验结果表明,我们的方法始终提供快速且准确的GP模型,为可扩展核学习提供了一条原则性路径。

英文摘要

Gaussian processes (GPs) provide a principled Bayesian framework for uncertainty estimation, but their computational complexity severely limits scalability to large datasets. We propose SIKA-GP, which accelerates GP inference using sparse inducing kernel approximations based on a dyadic ordered template basis, incurring only ${O}(\log M)$ complexity dependence on the number of inducing points. Our approach constructs compact and expressive kernel representations from sparsely activated bases, enabling efficient tensorized GPU computation and seamless integration with modern large-scale models. SIKA-GP can be naturally embedded into Bayesian neural networks (BNNs) with sparse activations, yielding significant speedups in both training and inference without sacrificing predictive performance. The method naturally extends to deep feature learning, addressing the scalability challenges introduced by deep architectures and high-dimensional feature representations. Empirical results on vision and transformer-based language benchmarks demonstrate that our approach consistently delivers fast and accurate GP models, providing a principled path toward scalable kernel learning.

2605.26503 2026-05-27 cs.CV

Uncertainty-Aware Gaussian Map for Vision-Language Navigation

面向视觉-语言导航的不确定性感知高斯地图

Jianzhe Gao, Rui Liu, Yuxuan Xu, Tongtong Cao, Yingxue Zhang, Zhanguang Zhang, Sida Peng, Yi Yang, Wenguan Wang

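AI summary: Proposes an uncertainty-aware Gaussian map that explicitly models three kinds of perceptual uncertainty (geometric, semantic, and appearance) and integrates them into the observation space, improving the reliability of agent decisions in vision-language navigation.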

详情
AI中文摘要

视觉-语言导航(VLN)要求智能体按照自然语言指令在3D环境中导航。在导航过程中,现有智能体通常遇到感知不确定性,例如缺乏可靠定位的证据或空间线索解释的模糊性,但在预测动作时通常忽略此类信息。在这项工作中,我们显式建模三种形式的感知不确定性(即几何、语义和外观不确定性),并将其整合到智能体的观测空间中,以实现知情决策。具体来说,我们的智能体首先构建一个语义高斯地图(SGM),由从全景观测初始化的可微3D高斯原语组成,编码环境的几何结构和语义内容。在SGM之上,通过高斯位置和尺度的变分扰动估计几何不确定性,以评估结构可靠性;通过扰动高斯语义属性捕获语义不确定性,以揭示模糊解释;通过Fisher信息刻画外观不确定性,该信息衡量渲染观测对高斯级变化的敏感性。这些不确定性被纳入SGM,将其扩展为统一的3D价值地图,将其作为支持可靠导航的可供性和约束。在多个VLN基准上的综合评估显示了我们的智能体的有效性。

英文摘要

Vision-Language Navigation (VLN) requires an agent to navigate 3D environments following natural language instructions. During navigation, existing agents commonly encounter perceptual uncertainty, such as insufficient evidence for reliable grounding or ambiguity in interpreting spatial cues, yet they typically ignore such information when predicting actions. In this work, we explicitly model three forms of perceptual uncertainty (i.e., geometric, semantic, and appearance uncertainty) and integrate them into the agent's observation space to enable informed decision-making. Concretely, our agent first constructs a Semantic Gaussian Map (SGM), composed of differentiable 3D Gaussian primitives initialized from panoramic observations, that encodes both the geometric structure and semantic content of the environment. On top of SGM, geometric uncertainty is estimated through variational perturbations of Gaussian position and scale to assess structural reliability; semantic uncertainty is captured by perturbing Gaussian semantic attributes to reveal ambiguous interpretations; and appearance uncertainty is characterized by Fisher Information, which measures the sensitivity of rendered observations to Gaussian-level variations. These uncertainties are incorporated into SGM, extending it into a unified 3D Value Map, which grounds them as affordances and constraints that support reliable navigation. Comprehensive evaluations across multiple VLN benchmarks show the effectiveness of our agent.

2605.26501 2026-05-27 cs.CV cs.AI

Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

揭示视觉-语言模型的脆弱性:通过纹理约束扰动和跨模态优化的多模态对抗协同

Xiang Fang, Wanlong Fang, Changshuo Wang

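AI summary: Proposes a multi-modal adversarial synergy framework that jointly optimizes a texture-constrained universal adversarial perturbation and a learnable text prompt perturbation in a black-box setting, revealing the fragility of vision-language models under multi-modal attacks.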

Comments Publish in AAAI 2026

详情
AI中文摘要

大型视觉-语言模型(LVLMs)通过整合视觉和文本输入,在图像描述和视觉问答等任务中表现出色,改变了多模态理解。然而,它们对抗攻击的鲁棒性,特别是利用两种模态的攻击,仍未被充分探索,这给自动驾驶和内容审核等关键应用带来了风险。现有攻击集中于单一模态或需要不切实际的白盒访问,限制了其现实相关性。在本文中,我们引入了多模态对抗协同(MMAS),这是一个开创性的框架,用于针对LVLMs构建通用的黑盒多模态攻击。MMAS同时生成纹理尺度约束的通用对抗扰动用于图像,以及可学习的提示扰动用于文本,仅通过模型查询进行联合优化。图像扰动利用基于小波的纹理约束确保在各种视觉输入中的不可感知性和鲁棒性。文本扰动在嵌入空间中受L范数约束,在保持语义连贯性的同时将输出导向目标。一种新颖的跨模态正则化项对齐扰动的梯度方向,增强了它们在任务和模型间的协同影响和可迁移性。大量实验表明,我们提出的攻击在主流LVLMs上具有强大的通用对抗能力。

英文摘要

Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual question answering by integrating visual and textual inputs. However, their robustness against adversarial attacks, particularly those exploiting both modalities, remains underexplored, posing risks to critical applications like autonomous driving and content moderation. Existing attacks focus on single modalities or require impractical white-box access, limiting their real-world relevance. In this paper, we introduce Multi-Modal Adversarial Synergy (MMAS), a framework that crafts universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained universal adversarial perturbation for images and a learnable prompt perturbation for text, optimized jointly using only model queries. The image perturbation leverages wavelet-based texture constraints to ensure imperceptibility and robustness across diverse visual inputs. The text perturbation, constrained by an L-norm in the embedding space, maintains semantic coherence while steering outputs toward a target. A novel cross-modal regularization term aligns the perturbations' gradient directions, enhancing their synergistic impact and transferability across tasks and models. Extensive experiments show the strong universal adversarial capabilities of our proposed attack against prevalent LVLMs.

2605.26500 2026-05-27 cs.CV

3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

面向视觉-语言导航的开放集语义分组3D高斯地图

Jianzhe Gao, Rui Liu, Wenguan Wang

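AI summary: Proposes a 3D Gaussian Map to represent the environment, enriching geometric and semantic information through an egocentric scene map built online and an open-set semantic grouping operation, and designs a multi-level action prediction strategy. Effectiveness is validated on three public benchmarks.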

详情
AI中文摘要

视觉-语言导航(VLN)要求智能体基于自然语言指令遍历复杂的3D环境,这需要对场景有透彻的理解。现有工作为智能体配备了各种场景表示以增强空间感知,但往往忽略了VLN场景中复杂的3D几何和丰富的语义,限制了在多样化和未见环境中的泛化能力。为应对这些挑战,本文提出一种3D高斯地图,将环境表示为一组可微分的3D高斯,并据此开发了用于VLN的导航策略。具体地,通过从稀疏伪激光雷达点云初始化3D高斯来在线构建自中心场景地图,为场景理解提供信息丰富的几何先验。每个高斯基元进一步通过开放集语义分组操作得到增强,该操作基于3D高斯在开放世界中属于对象实例或材质类别的成员关系对其进行分组,形成统一的3D高斯地图。基于该地图,设计了多层级动作预测策略,结合多粒度的空间-语义线索,辅助智能体进行决策。在三个公开基准(即R2R、R4R和REVERIE)上进行的大量实验验证了我们方法的有效性。

英文摘要

Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, an Egocentric Scene Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors for scene understanding. Each Gaussian primitive is further enriched through an Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world, resulting in a unified 3D Gaussian Map. Building on this map, a Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist agents in decision-making. Extensive experiments conducted on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of our method.