arXivDaily arXiv daily academic digest, updated Monday to Friday
2605.20372 2026-05-21 cs.CV cs.AI

Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities

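Irem Ulku, Ö. Özgür Tanrıöver, Erdem Akagündüz

AI summary This paper proposes a new training strategy that learns a scenario sampling distribution directly from the pretrained latent space and uses it to guide fine-tuning of multimodal segmentation under missing modalities, thereby improving performance.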

Comments 14 pages, 4 figures, 9 tables

Abstract

Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine-tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine-tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently based on the distortion it induces in the shared latent representation. We then capture scenario relations using a radial basis function kernel and derive refined scenario scores through a regularized kernel smoothing. These scores are then converted into a probability distribution during scenario sampling for fine-tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC-SLP, CBC, and CMX backbones. The experimental results with different image sets and backbones show that our method outperforms standard fine-tuning and LoRA-based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine-tuning. Code is available at https://github.com/iremulku/Latent-Space-Guided-Scenario-Sampling
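
The pipeline described in the abstract (per-scenario distortion scores, an RBF kernel over scenarios, regularized kernel smoothing, then a sampling distribution) can be sketched as follows. This is a minimal illustration and not the authors' implementation: representing each scenario as a binary availability mask, the kernel-ridge-style smoother `K (K + lam*I)^-1 d`, the softmax conversion, and the parameters `gamma`, `lam`, and `tau` are all assumptions, and the distortion values are made-up numbers.

```python
import numpy as np

def scenario_sampling_probs(masks, distortion, gamma=1.0, lam=0.1, tau=1.0):
    """Turn per-scenario distortion scores into a sampling distribution.

    masks:      (S, M) binary array, 1 = modality available (assumed encoding).
    distortion: (S,) raw score per scenario, e.g. latent-space distortion.
    """
    masks = np.asarray(masks, dtype=float)
    d = np.asarray(distortion, dtype=float)
    # RBF kernel between scenarios, based on squared distance between masks.
    sq_dist = ((masks[:, None, :] - masks[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq_dist)
    # Regularized kernel smoothing (kernel-ridge form, an assumed choice).
    smoothed = K @ np.linalg.solve(K + lam * np.eye(len(d)), d)
    # Softmax with temperature tau converts scores to probabilities.
    z = smoothed / tau
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

# Three modalities; each row is one missing-modality scenario (hypothetical).
masks = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
distortion = [0.2, 0.5, 0.9, 0.7, 1.1, 1.4]  # made-up scores
probs = scenario_sampling_probs(masks, distortion)

# Draw the scenario to apply for one fine-tuning step.
rng = np.random.default_rng(0)
scenario = int(rng.choice(len(masks), p=probs))
```

Setting `tau` large pushes the distribution back toward uniform modality dropout, which gives a simple way to compare against the baseline the abstract mentions.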

2605.20369 2026-05-21 cs.CL cs.AI cs.LG

DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

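Zhaohui Zheng, Chenhang He, Shihao Wang, Yuxuan Li, Ming-Ming Cheng, Lei Zhang

AI summary This paper proposes Digit Entropy Loss (DEL) for auto-regressive numerical learning in large language models. It reworks conventional unsupervised entropy optimization by introducing digit conditional probability and binary cross-entropy, which makes the entropy optimization supervised, and it generalizes integer-based numerical learning to floating-point optimization, improving the accuracy of number prediction.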

Abstract

Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, have introduced an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we present an in-depth analysis of LLM numerical learning and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents the optimization pattern and the distance term instills a geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to make the entropy optimization supervised; dropping the distance term to bypass the issue of numerical distance; and generalizing integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source code is available at https://github.com/PolyU-VCLab/DEL

2605.20364 2026-05-21 cs.CL

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

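Jinlong Liu, Mohammed Bahja, Mark Lee

AI summary This paper studies TTCW-based long-form literary review generation. It builds a dataset of 263,911 long-form stories, each annotated with scores and meta-synthesised review comments across 14 TTCW-based dimensions. Experiments show that non-reasoning fine-tuning is stronger and more stable, whereas reasoning-supervised models are more prone to parse failures and tend to produce irrelevant or repetitive reasoning text, which suggests that reasoning supervision is not always beneficial under a fixed-format rubric.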

Comments Submitted to EMNLP 2026

Abstract

Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.

2605.20362 2026-05-21 cs.CV

HAPS: Rethinking Image Similarity for Virtual Staining

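Fedor Gubanov, Svetlana Illarionova, Vlad Kozlovskiy, Mikhail Romanov, Yersultan Akhmetov, Aida Akaeva, Vyacheslav Grinevich, Rifat Hamoudi, Maxim Sharaev

AI summary This paper proposes the HAPS metric, which measures similarity between histology images and addresses the shortcomings of generic metrics in assessing virtual staining quality; the metric is also used to improve training data quality.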

Comments 17 pages, 3 figures

Abstract

Virtual staining of histopathology images (e.g., H&E-IHC) is an emerging tool in digital pathology, enabling faster and cheaper workflows by synthesizing target stains from routinely acquired slides. Yet, the quality of virtual staining models is still predominantly assessed with generic metrics such as SSIM, PSNR, and LPIPS. Originally developed for natural images, these metrics are inherently misaligned with the domain-specific characteristics of histological data, failing to capture tissue morphology preservation and biomarker expression patterns. Consequently, a robust, domain-specific standard for quantifying similarity across diverse histological modalities remains a critical gap in the field. In this work, we formalize histology image similarity as a standalone problem and systematically evaluate a broad set of full-reference metrics against a dataset of H&E-IHC patch pairs annotated with expert similarity scores. We further analyze the sensitivity of these metrics to controlled geometric distortions (shifts, rotations, and non-rigid deformations) that mimic realistic registration errors between serial sections. Guided by these observations, we propose the Histology-Aware Perceptual Similarity (HAPS) metric. HAPS computes distances in the feature space of a frozen encoder pretrained on histopathology data, adding a linear head to aggregate feature-level differences into a final score that aligns with expert assessments. Finally, we demonstrate the practical value of HAPS for quality control of training data. By quantifying the similarity of training pairs in the MIST dataset and filtering low-scoring samples, we create a cleaner training set. Virtual staining models trained on this refined data outperform those trained on the original, unfiltered dataset.

2605.20357 2026-05-21 cs.LG cs.AI

Consistently Informative Soft-Label Temperature for Knowledge Distillation

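Hoang-Chau Luong, Nghia Van Vo, Kaiqi Zhao, Lingwei Chen

AI summary This paper proposes CIST, which assigns sample-wise adaptive temperatures to the teacher and the student. It addresses two problems of the conventional fixed-temperature design, namely inconsistent entropy in the teacher's soft labels and overly rigid teacher-student logit-scale alignment, and thereby improves knowledge distillation.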

Abstract

Knowledge distillation (KD) transfers knowledge from a high-capacity teacher to a compact student by matching their predictive distributions, with temperature scaling serving as a central mechanism for smoothing teacher predictions and exposing informative "dark knowledge" beyond the hard label. However, the standard fixed-temperature design is inherently sample-agnostic. Since samples differ in logit scale and learning difficulty, a single global temperature produces teacher soft labels with highly inconsistent entropy: some predictions remain overly sharp and provide limited inter-class information, whereas others become over-smoothed and lose class-discriminative information. Moreover, sharing the same temperature between teacher and student further imposes rigid logit-scale alignment despite their capacity mismatch. To address these limitations, we propose CIST (Consistently Informative Soft-label Temperature), which assigns separate sample-wise adaptive temperatures to the teacher and student. This design produces consistently informative teacher soft labels while relaxing rigid teacher-student logit-scale matching. It also reweights the distillation objective according to teacher confidence and student learning difficulty. Theoretically, we show that teacher-label entropy is largely governed by the ratio between the maximum teacher logit and the temperature, providing a principled basis for adaptive smoothing. Empirically, CIST mitigates the inconsistency induced by fixed temperature, and experiments on both vision and language distillation tasks show consistent improvements over standard KD and strong baselines with negligible computational overhead.
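
The claim that soft-label entropy depends on the ratio between the maximum logit and the temperature can be illustrated with a small sketch. It is not the CIST rule itself: `ratio_temperature` is a hypothetical per-sample rule that simply holds `max_logit / T` at a chosen `target_ratio`, and it assumes the maximum logit is positive. The two logit vectors are made-up and differ only in scale.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ratio_temperature(logits, target_ratio=2.0):
    """Hypothetical per-sample rule: pick T so that max_logit / T is constant."""
    return max(logits) / target_ratio

sharp = [10.0, 2.0, 1.0, 0.5]  # large logit scale
flat = [2.0, 0.4, 0.2, 0.1]    # same shape, 5x smaller scale

# One global temperature: the two samples get very different entropies.
fixed_T = 4.0
h_sharp_fixed = entropy(softmax(sharp, fixed_T))
h_flat_fixed = entropy(softmax(flat, fixed_T))

# Per-sample temperature with a fixed max-logit/T ratio: entropies match.
h_sharp_adapt = entropy(softmax(sharp, ratio_temperature(sharp)))
h_flat_adapt = entropy(softmax(flat, ratio_temperature(flat)))
```

In this toy case the two samples share the same logit shape, so fixing the ratio equalizes entropy exactly; with differently shaped logits the match would only be approximate.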

2605.20356 2026-05-21 cs.CL cs.AI cs.SD

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

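Pablo Riera, Pablo Brusco, Cristina Kuo, Marcelo Sancinetti, S. R. K. Branavan

AI summary This paper studies how full-duplex spoken dialogue models coordinate interaction through synchronization of their internal representations. It finds that synchronization degrades under noise and that internal states encode anticipatory turn-taking information.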

Abstract

Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how such models coordinate their internal representations during interaction. We simulate full-duplex dialogues between two instances of the pretrained Moshi model under controlled conditions, manipulating channel noise and decoding bias. Synchronization is measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues are probed from delayed internal activations using causal LSTM models, from both speaker and listener perspectives. We find strong representational synchronization under noise-free conditions, peaking near zero lag and degrading with noise, and we show that internal states encode anticipatory information that supports turn-taking prediction ahead of time.
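
Measuring synchronization with CKA across temporal lags can be sketched as below. The abstract does not say which CKA variant or lag convention is used, so the linear form of CKA, the helper `lagged_cka`, and the convention that a positive lag means the second sequence trails the first are assumptions; the activations are synthetic.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (time, features) activation matrices."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

def lagged_cka(A, B, max_lag):
    """CKA between A[t] and B[t + lag] for each lag in [-max_lag, max_lag]."""
    T = len(A)
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = A[: T - lag], B[lag:]
        else:
            a, b = A[-lag:], B[: T + lag]
        scores[lag] = float(linear_cka(a, b))
    return scores

# Synthetic activations: B is a noisy copy of A delayed by 3 time steps.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 16))
B = np.roll(A, 3, axis=0) + 0.1 * rng.standard_normal((200, 16))

scores = lagged_cka(A, B, max_lag=5)
best_lag = max(scores, key=scores.get)  # lag with the highest alignment
```

In the setting of the abstract, `A` and `B` would be the internal activations of the two model instances, and a peak near zero lag would correspond to the reported synchronization.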

2605.20355 2026-05-21 cs.RO cs.HC cs.LG

Proximal State Nudging: Reducing Skill Atrophy from AI Assistance

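Megha Srivastava, Jonathan Ouyang, Eric Zhou, Andrew Silva, Emily Sumner, Dorsa Sadigh, Yuchen Cui, Deepak Gopinath, Guy Rosman

AI summary This paper proposes Proximal State Nudging (PSN), a shared autonomy algorithm that nudges users toward the states estimated to be most learnable and jointly optimizes skill development and task performance, in order to reduce skill atrophy under AI assistance.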

Comments 9 pages

Abstract

Skill atrophy, the gradual decline of human capability under AI assistance, poses a safety risk in shared-control of semi-autonomous systems, where operators may be unable to distinguish their own inputs from autonomous corrections. We propose Proximal State Nudging (PSN), a shared autonomy algorithm that jointly optimizes for skill development and task performance by nudging users toward states estimated to be most learnable. We first show that PSN outperforms existing shared autonomy baselines in balancing student improvement in unassisted reward with overall shared performance, using simulated students in the classic LunarLander environment. We then present, to the best of our knowledge, the first human subject studies of a planner incorporating learning-compatible shared autonomy: across two driving tasks in the CARLA simulator (High Performance Racing and Parallel Parking, n = 60), PSN produces up to 7x larger gains in unassisted skill than standard blended shared autonomy, while incurring 50% fewer collisions than unassisted self-practice.

2605.20337 2026-05-21 cs.CV

Capability ≠ Interpretability: Human Interpretability of Vision Foundation Models

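Julien Colin, Lore Goetschalckx, Nuria Oliver, Thomas Serre

AI summary This paper studies the human interpretability of vision foundation models and proposes an evaluation framework. It finds that foundation models are harder to interpret than supervised models, and that interpretability is unrelated to downstream task performance but related to the locality of feature activations and coarse-grained semantic alignment.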

Abstract

How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than 15,000 behavioral responses, analyzing the 13,400 responses from the 377 participants who passed our pre-specified quality checks. Foundation models are consistently less interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality -- and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.

2605.20316 2026-05-21 cs.CV cs.AI

FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision-Language Generation

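Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann

AI summary This paper proposes FullFlow, which upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision-language generator by training only LoRA adapters and lightweight text heads. It keeps images in their continuous flow and adds a discrete insertion process for text, improving both text-to-image and image-to-text generation quality.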

Comments project page: https://ericbill21.github.io/fullflow/

Abstract

Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision-language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision-language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text→image, image→text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text→image FID from 62.7 to 31.6 and image→text CIDEr from 2.0 to 99.4 over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ~84 GB to ~38 GB and raising throughput by ~8× on two RTX A5000 GPUs in under 24 hours, training only ~5% of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision-language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.

2605.20315 2026-05-21 cs.CL

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

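Haiquan Lu, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang

AI summary This paper proposes Mix-Quant, a simple and effective phase-aware quantization framework for accelerating agentic LLM inference. It applies high-throughput NVFP4 quantization to the prefilling phase while keeping BF16 precision for decoding, which largely preserves task performance and yields up to a 3x speedup during prefilling.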

Abstract

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.

2605.20314 2026-05-21 cs.LG cs.AI

Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases

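Jingwen Liu, Ezra Edelman, Surbhi Goel, Bingbin Liu

AI summary This work examines the "small-vs-large gap", in which repeated training on fewer samples saves compute compared with using a larger dataset. The speedup is attributed to layer-wise growth enabled by sampling biases, which offers a new inductive bias for optimization.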

Comments ICML 2026

Abstract

This work investigates the "small-vs-large gap", where repeating on fewer samples can lead to compute savings during training compared to using a larger dataset. This is observed across algorithmic tasks, architectures, and optimizers and cannot be explained using prior theory. We argue that the speedup comes from appropriate layer-wise growth enabled by sampling biases, which is more pronounced when the dataset size is smaller. We provide both theoretical analysis and empirical evidence from various interventions. Our results suggest that using a smaller dataset with more repetitions is not just a fallback strategy under data scarcity, but can be proactively leveraged as a favorable inductive bias for optimization, particularly in reasoning tasks.

2605.20311 2026-05-21 cs.LG

WaveGraphNet: Physics-Consistent Guided-Wave Damage Localization through Coupled Inverse-Forward Graph Learning

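Vinay Sharma, Aditya Bharade, Olga Fink

AI summary This paper proposes WaveGraphNet, a coupled inverse-forward graph learning framework for guided-wave damage localization in carbon fiber reinforced polymer plates. It models the sensing layout as a graph whose connectivity is defined by the measured propagation paths, and combines an inverse branch with a forward branch to make damage localization more robust.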

Abstract

Guided-wave structural health monitoring enables damage localization in composite plates using sparse networks of bonded piezoelectric transducers. However, inferring the spatial location of defects from pitch-catch measurements remains weakly constrained when only a limited set of damage locations is available for training. As a result, models trained to predict defect locations may perform well on seen cases but generalize poorly to unseen regions of the structure. This paper proposes WaveGraphNet, a coupled inverse-forward graph learning framework for guided-wave damage localization in Carbon Fiber Reinforced Polymer (CFRP) plates. The sensing layout is explicitly modeled as a graph, where transducers are represented as nodes and measured propagation paths define the graph connectivity. An inverse branch maps graph-structured spectral descriptors of differential guided-wave responses to a damage location, while a forward branch predicts the path-wise energy-deviation patterns of measured wave responses associated with a candidate location. During training, the forward branch serves as a physics-consistent regularizer, discouraging location estimates that are numerically plausible but inconsistent with the measured redistribution of wave-response energy. This coupling encourages agreement between inferred damage coordinates and the underlying wave propagation behavior. Within this benchmark, the proposed graph-based formulation provides a strong localization model for sparse guided-wave sensing and demonstrates improved robustness in extrapolation to held-out regions compared to both non-graph and graph baselines. These results highlight the potential of coupled inverse-forward graph learning as an effective strategy for guided-wave localization under limited spatial coverage.

2605.20309 2026-05-21 cs.CV cs.AI

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

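Runyuan Cai, Yiming Wang, Yu Lin, Xiaodong Zeng

AI summary This paper proposes Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary, enabling control over concepts in frozen image and video generators. Each concept is parameterized as memory entries indexed by registered n-gram matches, and text-encoder hidden states are modulated only within the matched trigger region, so a rare trigger phrase is bound to a target identity while compositional control from the surrounding prompt is preserved.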

Abstract

Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.
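
The idea of a trigger-indexed table that modulates hidden states only inside the matched n-gram region can be sketched in a few lines. This is an illustrative toy, not the Tiny-Engram implementation: the additive update, the word-level tokens, the two-dimensional hidden states, and the helper names `find_trigger_spans` and `apply_concept_table` are all assumptions.

```python
def find_trigger_spans(tokens, trigger):
    """Return (start, end) index pairs where the trigger n-gram occurs."""
    n = len(trigger)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == tuple(trigger)]

def apply_concept_table(tokens, hidden, table):
    """Add each concept's memory entries to hidden states inside its trigger span.

    table maps a trigger n-gram (tuple of tokens) to one delta vector per
    trigger token. Positions outside every matched span are left unchanged.
    """
    out = [list(row) for row in hidden]
    for trigger, deltas in table.items():
        for start, _end in find_trigger_spans(tokens, trigger):
            for offset, delta in enumerate(deltas):
                pos = start + offset
                out[pos] = [h + d for h, d in zip(out[pos], delta)]
    return out

# Hypothetical concept registered under the rare trigger phrase "sks dog".
table = {("sks", "dog"): [[1.0, 0.0], [0.0, 1.0]]}

prompt = ["a", "photo", "of", "sks", "dog", "on", "grass"]
hidden = [[0.0, 0.0] for _ in prompt]
with_trigger = apply_concept_table(prompt, hidden, table)

# A prompt without the trigger passes through untouched.
plain = ["a", "photo", "of", "a", "dog"]
plain_hidden = [[0.5, 0.5] for _ in plain]
without_trigger = apply_concept_table(plain, plain_hidden, table)
```

Because the update is gated by an exact n-gram match, the conditioning for any prompt that lacks the trigger is identical to the input, which mirrors the activation boundary the abstract describes.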

2605.20308 2026-05-21 cs.CV cs.AI cs.LG

SDM: A Powerful Tool for Evaluating Model Robustness

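Xinlei Liu, Tao Hu, Jichao Xie, Peng Yi, Hailong Ma, Baolin Li

AI summary This paper proposes SDM, a new gradient-based attack method. By redefining the objective of adversarial example generation, it addresses the performance degradation caused by "high-loss non-adversarial examples" in earlier methods, and experiments show advantages in both attack performance and cost-effectiveness.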

Comments 16 pages

Journal ref Forty-third International Conference on Machine Learning (ICML 2026)

Abstract

Gradient-based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To achieve such a breakthrough, we first analyze the issue of "high-loss non-adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as "maximizing the difference between the non-ground-truth label probability upper bound and the ground-truth label probability", and propose a novel and powerful gradient-based attack method named Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage-wise sequential optimization. Experiments demonstrate that compared with previous state-of-the-art methods, SDM not only achieves stronger attack performance but also exhibits superior cost-effectiveness. The code is available at https://github.com/X-L-Liu/ICML-SDM.
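
A small numeric sketch can show why a high loss does not imply a successful attack, and what a probability-difference objective measures instead. It reads the "non-ground-truth label probability upper bound" as the largest probability among the wrong classes, which is an assumption; the function `probability_margin` and the probability vectors are hypothetical, and this is not the DPDR loss from the paper.

```python
import math

def cross_entropy(probs, true_label):
    """Standard cross-entropy loss for one sample."""
    return -math.log(probs[true_label])

def probability_margin(probs, true_label):
    """Largest wrong-class probability minus the true-class probability.

    A positive value means the sample is misclassified (adversarial).
    """
    p_true = probs[true_label]
    p_wrong_max = max(p for j, p in enumerate(probs) if j != true_label)
    return p_wrong_max - p_true

# Sample A: probability is spread out, loss is high, prediction still correct.
probs_a = [0.34, 0.33, 0.33]
# Sample B: lower loss than A, but the prediction has flipped to class 1.
probs_b = [0.45, 0.55, 0.00]

loss_a = cross_entropy(probs_a, 0)
loss_b = cross_entropy(probs_b, 0)
margin_a = probability_margin(probs_a, 0)
margin_b = probability_margin(probs_b, 0)
```

Here cross-entropy ranks sample A above sample B even though only B fools the classifier, whereas the margin is negative for A and positive for B.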

2605.20304 2026-05-21 cs.RO

Terrestrial Soft Mobile Robots: A Review

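Dimuthu D. K. Arachchige

AI summary This article reviews the current state of soft mobile robot research, focusing on locomotion strategies, actuation methods, modeling approaches, and control systems for wheelless terrestrial locomotive systems, and identifies the key challenges to widespread adoption of soft mobile robots across application areas.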

Abstract

Soft mobile robots have emerged as a promising area of research with potential applications in various disciplines including but not limited to search-and-rescue, service, surveillance, exploration, and manufacturing. In this article, we provide a comprehensive review of the current state of soft mobile robot research, focusing on wheelless terrestrial locomotive systems. We include past and present developments in locomotion strategies, actuation methods, modeling approaches, and control systems. Further, we identify key research challenges that must be overcome to enable the widespread adoption of soft mobile robots in various applications. Overall, this article provides a valuable resource for researchers and practitioners interested in the field of soft mobile robots and soft robotics.

2605.20300 2026-05-21 cs.LG cs.AI

Robust Subspace-Constrained Quadratic Models for Low-Dimensional Structure Learning

鲁棒的子空间约束二次模型用于低维结构学习

Zheng Zhai, Xiaohui Li

AI总结 本文提出了一种鲁棒的子空间约束二次模型(SCQM),用于从高维数据中学习低维结构。基于子空间约束二次矩阵分解(SQMF)框架,该模型能够适应广泛噪声分布,包括广义高斯和径向拉普拉斯模型。这种泛化能力使其在重尾和轻尾噪声下均能保持稳定性能,显著提高了在不同数据场景下的鲁棒性。为高效解决由此产生的非凸优化问题,我们开发了一种基于梯度的算法,配备回溯线搜索策略以确保稳定和高效的收敛。此外,我们还对$\ell_p^p$和$\ell_2$损失函数进行了敏感性分析,阐明了它们在不同噪声特性下的不同行为。大量数值实验验证了理论分析,并展示了所提方法在鲁棒性和重建准确性方面优于现有方法。

详情
AI中文摘要

在本文中,我们提出了一种鲁棒的子空间约束二次模型(SCQM),用于从高维数据中学习低维结构。基于子空间约束二次矩阵分解(SQMF)框架,所提出的模型能够适应广泛噪声分布,包括广义高斯和径向拉普拉斯模型。这种泛化能力使该方法在重尾和轻尾噪声下均能保持稳定性能,从而在不同数据场景下显著提高了鲁棒性。为高效解决由此产生的非凸优化问题,我们开发了一种基于梯度的算法,配备回溯线搜索策略以确保稳定和高效的收敛。此外,我们还对$\ell_p^p$和$\ell_2$损失函数进行了敏感性分析,阐明了它们在不同噪声特性下的不同行为。大量数值实验验证了理论分析,并展示了所提方法在鲁棒性和重建准确性方面优于现有方法。

英文摘要

In this paper, we propose a robust subspace-constrained quadratic model (SCQM) for learning low-dimensional structure from high-dimensional data. Building upon the subspace-constrained quadratic matrix factorization (SQMF) framework, the proposed model accommodates a broad class of noise distributions, including generalized Gaussian and radial Laplace models. This generalization enables reliable performance under both heavy-tailed and light-tailed noise, thereby substantially enhancing robustness across diverse data regimes. To efficiently address the resulting nonconvex optimization problem, we develop a gradient-based algorithm equipped with a backtracking line-search strategy that ensures stable and efficient convergence. In addition, we present a sensitivity analysis of the $\ell_p^p$ and $\ell_2$ loss functions, elucidating their distinct behaviors under varying noise characteristics. Extensive numerical experiments corroborate the theoretical analysis and demonstrate that the proposed approach consistently outperforms existing methods in terms of robustness and reconstruction accuracy.
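A gradient method with backtracking line search, as mentioned in the abstract, can be sketched as follows. This is a generic Armijo backtracking loop on a toy quadratic, not the SCQM objective; the function name, the constants `shrink` and `c`, and the toy loss are illustrative assumptions.

```python
def backtracking_gradient_descent(f, grad, x0, steps=50, t0=1.0, shrink=0.5, c=1e-4):
    """Gradient descent where each step size is shrunk until the
    Armijo sufficient-decrease condition holds."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        g_sq = sum(gi * gi for gi in g)
        if g_sq == 0.0:  # stationary point reached
            break
        fx = f(x)
        t = t0
        candidate = [xi - t * gi for xi, gi in zip(x, g)]
        # Shrink t until f decreases by at least c * t * ||g||^2.
        while t > 1e-12 and f(candidate) > fx - c * t * g_sq:
            t *= shrink
            candidate = [xi - t * gi for xi, gi in zip(x, g)]
        x = candidate
    return x

# Toy quadratic with minimizer (1, -2).
def toy_loss(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

def toy_grad(x):
    return [2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]

solution = backtracking_gradient_descent(toy_loss, toy_grad, [0.0, 0.0])
```

The line search removes the need to hand-tune a fixed learning rate, which is one reason it is a common choice for nonconvex problems.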

2605.20299 2026-05-21 cs.LG cs.AI cs.RO

Mechanisms of Misgeneralization in Physical Sequence Modeling

物理序列建模中泛化错误的机制

Kento Nishi, Raphael Tang, Karun Kumar, Core Francisco Park, Hidenori Tanaka

AI总结 本文研究了物理序列建模中由于局部误差传播导致的物理泛化错误,提出了一种数据偏差核来预测物理量的质量变化,并提出了基于核的干预策略。

Comments Preprint. kentonishi.com/physical-misgeneralization

详情
AI中文摘要

生成式序列模型常被训练用于在物理领域中规划运动,涵盖从机器人到机械仿真的各类场景。在构建用于训练此类模型的数据集时,工程师可能会精心挑选演示,以指定轨迹在行进距离或机械能等物理量上应如何分布。例如,构建迷宫导航智能体的机器人工程师可能会选择行进距离均匀覆盖某一固定范围的演示,希望借此约束智能体的预期功耗。我们发现标准深度学习可能违背这一意图:每条生成的轨迹单独来看都合理,但物理量上的总体分布却是错误的。我们将这种失败称为物理泛化错误,并对其机制给出了解释。通过受控的合成任务,我们表明,当模型类典型的局部误差经由物理测量传播并使恢复出的分布发生偏移时,就会出现物理泛化错误。我们用数据偏差核来估计这些误差,并用它预测在合成任务以及更偏应用的迷宫导航和双摆运动任务中,哪些物理量取值会获得或失去概率质量。最后,我们的机制性解释有助于识别哪些缓解策略在结构上具有前景,我们据此提出了一种基于核的干预方法。

英文摘要

Generative sequence models are often trained to plan motion in physical domains, from robotics to mechanical simulations. When constructing a dataset to train such a model, engineers may curate demonstrations to specify how trajectories should be distributed over a physical quantity like travel distance or mechanical energy. For example, a roboticist building a maze navigation agent might choose demonstrations whose travel distances cover a fixed range uniformly, hoping to constrain the agent's expected power usage. We find that standard deep learning can violate this intent: each generated trajectory can seem plausible on its own, but the aggregate distribution over the physical quantity is wrong. We call this failure physical misgeneralization, and develop an account of its mechanism. Using controlled synthetic tasks, we show that physical misgeneralization arises when local errors typical of the model class propagate through the physical measurement to shift the recovered distribution. We estimate these errors with a data deviation kernel, and we use it to predict which physical quantities gain or lose mass in both our synthetic and more applied maze navigation and double-pendulum motion tasks. Finally, our mechanistic interpretation helps identify which mitigation strategies are structurally promising, and we use it to propose a kernel-informed intervention.

2605.20297 2026-05-21 cs.CV cs.LG

MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery

MedCRP-CL: 通过贝叶斯非参数语义模态发现实现连续医学图像分割

Ziyuan Gao

AI总结 该研究提出MedCRP-CL框架,通过在线任务结构发现和结构感知的连续学习方法,解决医学图像分割在持续学习中的挑战,实现了73.3%的Dice得分和仅4.1%的遗忘率。

Comments Accepted by ICML 2026

详情
AI中文摘要

医学图像分割在持续学习中面临一个根本性挑战:数据从异质来源按顺序到达,而有效的持续学习需要发现哪些任务共享足够的结构、能够从联合学习中受益。现有方法要么对所有任务施加统一约束,在任务冲突时导致灾难性遗忘,要么需要预定义的任务分组,无法预见未来任务的多样性。我们提出MedCRP-CL框架,实现在线任务结构发现和结构感知的持续学习。借助中国餐馆过程(Chinese Restaurant Process,CRP),我们的方法在任务到达时从临床文本提示中动态推断任务分组,无需预定义聚类数量,也无需访问未来任务。我们将发现的分组称为语义模态,因为它们通过整合解剖区域和病理背景,捕捉到比物理成像模态更细粒度的结构。在所发现结构的指导下,我们维护语义模态特定的LoRA适配器,并以模态内EWC进行正则化,既保证不相似任务组之间的参数隔离,又促进相似任务组内部的知识迁移。该框架无需回放,仅存储聚合统计信息而非原始患者数据。在涵盖四种成像模态的16个医学分割任务上的实验表明,MedCRP-CL取得了73.3%的Dice得分,遗忘率仅为4.1%,比最佳基线高出8.0%,同时所需参数量少6倍。代码可在https://github.com/zygao930/MedCRP-CL获取。

英文摘要

Medical image segmentation faces a fundamental challenge in continual learning: data arrives sequentially from heterogeneous sources, yet effective continual learning requires discovering which tasks share sufficient structure to benefit from joint learning. Existing methods either apply uniform constraints across all tasks, causing catastrophic forgetting when tasks conflict, or require predefined task groupings that cannot anticipate future task diversity. We introduce MedCRP-CL, a framework that performs online task structure discovery and structure-aware continual learning. Leveraging the Chinese Restaurant Process (CRP), our method dynamically infers task groupings from clinical text prompts as tasks arrive, without requiring predefined cluster counts or access to future tasks. We term these discovered groupings semantic modalities, as they capture finer-grained structure than physical imaging modalities by integrating anatomical region and pathological context. Guided by this discovered structure, we maintain semantic modality-specific LoRA adapters regularized by intra-modality EWC, ensuring parameter isolation across dissimilar task groups while facilitating knowledge transfer within similar ones. The framework is also replay-free, storing only aggregate statistics rather than raw patient data. Experiments on 16 medical segmentation tasks across four imaging modalities demonstrate that MedCRP-CL achieves 73.3% Dice score with only 4.1% forgetting, outperforming the best baseline by 8.0% while requiring 6$\times$ fewer parameters. Code is available at https://github.com/zygao930/MedCRP-CL.
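The reason a Chinese Restaurant Process needs no predefined cluster count can be seen from its prior alone. The sketch below shows only the textbook CRP seating probabilities; how MedCRP-CL combines this prior with text-prompt similarity is not stated in the abstract, so that part is omitted, and the function name is an assumption.

```python
def crp_assignment_probs(cluster_sizes, alpha):
    """Textbook CRP prior for the next arriving task.

    Returns one probability per existing cluster (proportional to its size)
    followed by the probability of opening a new cluster (proportional to alpha).
    """
    total = sum(cluster_sizes) + alpha
    probs = [n / total for n in cluster_sizes]
    probs.append(alpha / total)
    return probs

# Two existing groups with 3 and 1 tasks, concentration alpha = 1.
probs = crp_assignment_probs([3, 1], alpha=1.0)
```

Because the last entry is always positive, a new group can be opened whenever an incoming task fits none of the existing ones.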

2605.20296 2026-05-21 cs.LG cs.AI

Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining

谱反遗忘:无需重新训练即可事后恢复受损能力

Aarash Abro, Muhammad Tahir

AI总结 本文研究语言模型在针对目标任务微调时,训练数据从未显式威胁到的能力也会退化的现象,并提出一种仅使用预训练检查点和微调后检查点的事后修复方法,通过谱修复恢复受损能力,同时保留目标任务上的收益。

详情
AI中文摘要

针对目标任务微调语言模型,常常会使训练数据从未显式威胁到的能力发生退化。我们研究这一被称为灾难性遗忘的现象,并提出一种事后修复方案,仅使用预训练检查点W_base及其微调后的检查点W_ft。目标不只是让模型回退到基础检查点,而是恢复被微调损坏的能力,同时保留目标任务上的收益以及在留出基准上出现的任何有益改进。我们提出DG-Hard,一种仅依赖检查点的谱修复方法,作用于微调更新Δ = W_ft - W_base。DG-Hard将Δ视为嵌入在类IID噪声残差中的低秩任务对齐信号,而梯度下降没有动力去除这部分噪声残差;它对每个权重增量矩阵应用Donoho-Gavish硬奇异值阈值,保留更新中结构化的高能量部分,去除谱主体(spectral bulk)。这样,修复就简化为一个闭式的SVD滤波步骤,无需依赖数据的调参。一个核心困难在于评估:平均准确率会掩盖单个基准上的失败,而朴素的恢复分数会奖励那些只是简单回退到基础模型的方法。因此,我们引入一种按分区计算的条件度量,分别跟踪修复、保留、无损害和目标任务保持四个方面。在14个(模型,任务)设置和九个跨领域留出基准上,DG-Hard在事后修复基线中实现了最强的平衡修复效果。尽管不使用任何对齐数据,DG-Hard还在三个独立的安全维度上恢复了因良性微调而退化的安全对齐。这些结果表明,微调导致的能力损失有一部分并非专业化不可避免的后果,而是权重更新本身中可以去除的谱残余。

英文摘要

Fine-tuning a language model for a target task routinely degrades capabilities the training data never explicitly threatened. We study this phenomenon, known as catastrophic forgetting, and propose a post-hoc repair solution that uses only the pretrained checkpoint $W_{\mathrm{base}}$ and its fine-tuned descendant $W_{\mathrm{ft}}$. The goal is not merely to revert the model toward the base checkpoint, but to recover capabilities damaged by fine-tuning while preserving both the target-task gains and any beneficial held-out improvements. We introduce DG-Hard, a checkpoint-only spectral repair method for the fine-tuning update $\Delta = W_{\mathrm{ft}} - W_{\mathrm{base}}$. DG-Hard treats $\Delta$ as a low-rank task-aligned signal embedded in an IID-like noise residual that gradient descent has no incentive to remove, and applies the Donoho-Gavish hard singular-value threshold to each weight-delta matrix, keeping the structured high-energy part of the update and removing the spectral bulk. This reduces repair to a closed-form SVD filtering step requiring no data-dependent tuning. A central difficulty is evaluation: average accuracy hides per-benchmark failures, while naive recovery scores reward models that simply revert toward the base. We therefore introduce a partition-conditional metric that separately tracks healing, preservation, non-damage, and target-task retention. Across $14$ (model, task) settings and nine cross-domain held-out benchmarks, DG-Hard achieves the strongest balanced repair among post-hoc baselines. DG-Hard also restores safety alignment degraded by benign fine-tuning on three independent safety axes, despite using no alignment data. These results suggest that part of fine-tuning-induced capability loss is not an unavoidable consequence of specialization, but a removable spectral residue in the weight update itself.
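The thresholding step can be sketched on a list of singular values. The sketch assumes the unknown-noise variant of the Gavish-Donoho rule, where the cutoff is a multiple of the median singular value given by their published cubic approximation; whether DG-Hard uses this variant is not stated in the abstract. The SVD itself is left out, and the function names are hypothetical.

```python
import statistics

def omega(beta):
    """Gavish-Donoho cubic approximation of the optimal hard-threshold
    coefficient for unknown noise level; beta is the aspect ratio in (0, 1]."""
    return 0.56 * beta ** 3 - 0.95 * beta ** 2 + 1.82 * beta + 1.43

def dg_hard_filter(singular_values, rows, cols):
    """Zero out singular values at or below omega(beta) * median(singular values)."""
    beta = min(rows, cols) / max(rows, cols)
    tau = omega(beta) * statistics.median(singular_values)
    return [s if s > tau else 0.0 for s in singular_values]

# Two strong directions on top of a flat bulk, for an 8 x 8 weight delta.
filtered = dg_hard_filter([10.0, 9.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], rows=8, cols=8)
```

Reassembling the update from the surviving singular values and adding it back to $W_{\mathrm{base}}$ would give the repaired weights under these assumptions.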

2605.20295 2026-05-21 cs.LG cs.AI

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

Quant.npu:通过完全静态量化实现高效的移动NPU推理以支持设备端LLM

Jinghe Zhang, Daliang Xu, Chenghua Wang, Weikai Xie, Tao Qi, Yun Ma, Mengwei Xu, Gang Huang

AI总结 本文提出Quant.npu框架,通过完全静态量化方法实现高效的移动NPU推理,解决了传统后训练量化方法在NPU硬件约束下的兼容性问题,并在实际移动NPU上实现了较高的准确性和较低的推理延迟。

详情
AI中文摘要

大型语言模型(LLMs)正越来越多地部署在移动设备上,其中神经处理单元(NPUs)需要完全静态量化以实现最优的推理效率。然而,现有的后训练量化(PTQ)方法主要依赖于动态激活量化,使其与NPU硬件约束不兼容。为了弥合高保真PTQ与NPU受限推理之间的差距,我们提出了Quant.npu,一个仅整数的完全静态量化框架。它结合了可学习的量化参数和旋转矩阵,使低比特激活-权重量化无需运行时重新计算量化参数。关键的是,我们发现初始化和选择性优化量化参数对于优化稳定性至关重要,因为不恰当的初始化和简单的联合优化会引发梯度不稳定,破坏旋转矩阵的优化。为此,我们提出了针对不同激活特征的旋转和比特宽感知初始化,以及针对旋转和未旋转张量的分布感知选择性优化(双阶段量化流水线)。此外,我们引入了一种敏感性引导的自适应混合精度方案,以在准确性和推理效率之间取得平衡。在实际移动NPU上的大量实验表明,Quant.npu在准确度上与最先进的方法相当,同时将推理延迟降低了最高15.1%。

英文摘要

Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) require fully static quantization for optimal inference efficiency. However, existing post-training quantization (PTQ) methods predominantly rely on dynamic activation quantization, rendering them incompatible with NPU hardware constraints. To bridge the gap between high-fidelity PTQ and NPU-constrained inference, we propose Quant.npu, an integer-only, fully static quantization framework. It incorporates learnable quantization parameters and rotation matrices, enabling low-bit activation-weight quantization without recomputing quantization parameters at runtime. Crucially, we identify that the initialization and selective optimization of quantization parameters are pivotal for optimization stability, as improper initialization and naive joint optimization induce gradient instability that disrupts the optimization of rotation matrices. To address this, we propose a rotation-and-bit-width-aware initialization tailored to diverse activation profiles and a distribution-aware selective optimization (a two-stage quantization pipeline) tailored to rotated and unrotated tensors. Furthermore, we introduce a sensitivity-guided adaptive mixed-precision scheme to balance accuracy with inference efficiency. Extensive experiments on real-world mobile NPUs demonstrate that Quant.npu achieves accuracy comparable to state-of-the-art methods while reducing inference latency by up to 15.1%.
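The difference between static and dynamic activation quantization can be sketched in a few lines. This is a generic symmetric integer quantizer, not the Quant.npu pipeline; the function names and the choice of a signed 8-bit range are assumptions.

```python
def quantize_static(values, scale, zero_point=0, bits=8):
    """Quantize with a scale fixed ahead of time (no runtime statistics)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dynamic_scale(values, bits=8):
    """Scale a dynamic quantizer would recompute from each input at runtime."""
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)

# Static: the scale 0.5 was chosen offline, so the outlier 100.0 is clipped.
static_q = quantize_static([1.0, -1.0, 100.0], scale=0.5)
# Dynamic: the scale depends on the current input's largest magnitude.
runtime_scale = dynamic_scale([1.0, -2.54])
```

The static path avoids the per-input maximum reduction, which is the runtime recomputation the abstract says NPUs cannot afford; the cost is that a poorly chosen fixed scale clips outliers.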

2605.20293 2026-05-21 cs.LG cs.AI cs.NE

Closed-form predictive coding via hierarchical Gaussian filters

通过分层高斯滤波器实现闭式预测编码

Aleksandrs Baskakovs, Sylvain Estebe, Kenneth Enevoldsen, Kristoffer Nielbo, Chris Mathys, Nicolas Legrand

AI总结 本文提出通过分层高斯滤波器实现预测编码,恢复了精度加权的信息传递,实现了动态不确定性估计和Hebbian兼容的更新规则,从而在单个自由能目标下同时学习激活、权重和精度,无需全局误差信号,且无需迭代或自动微分。

详情
AI中文摘要

预测编码(PC)提供了一种局部且生物基础的替代反向传播方法,用于训练人工神经网络,但至今仍较慢,且随着网络深度增加性能急剧下降。我们追溯这两个问题到一个简化:当前PC网络将精度矩阵固定为单位矩阵,丢弃了变分推导所需的精度加权预测误差,以实现快速、局部和贝叶斯的特性。我们通过将预测编码网络表示为深度分层高斯滤波器(HGF)并恢复精度加权的信息传递,从而在每一层实现动态不确定性估计和Hebbian兼容的更新规则。所得到的网络可以在单个自由能目标下同时学习激活、权重和精度,无需全局误差信号,并且在推断过程中无需迭代或自动微分。在FashionMNIST上,我们的解决方案在epoch级的运行时间成本上接近反向传播,同时在更少的epoch中收敛,并在在线、数据效率和概念漂移任务上优于反向传播。因此,我们证明了闭式变分推断与在线精度学习相结合,为深度预测编码网络提供了一个可处理的基础,保留了生物和解释性优势,而无需迭代松弛或全局误差信号。

英文摘要

Predictive coding (PC) offers a local and biologically grounded alternative to backpropagation for training artificial neural networks, yet to date it remains slower, and its performance degrades sharply as network depth increases. We trace both problems to a single simplification: current PC networks fix the precision matrix to the identity, discarding the precision-weighted prediction errors that the variational derivation requires to be fast, local, and Bayesian. We close this gap by expressing predictive coding networks as deep hierarchical Gaussian filters (HGFs) and restoring precision-weighted message passing, yielding dynamic uncertainty estimates and Hebbian-compatible update rules at every layer. The resulting networks can simultaneously learn activations, weights, and precisions under a single free-energy objective, with no global error signal, and resolve inference without requiring iterations or automatic differentiation. On FashionMNIST, our solution approaches backpropagation in epoch-level wall-clock cost while converging in fewer epochs, and outperforms it on online, data-efficiency, and concept-drift tasks. We thus establish that closed-form variational inference with online precision learning provides a tractable foundation for deep predictive coding networks, retaining biological and interpretative advantages, without requiring iterative relaxation or global error signals.
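What a precision-weighted prediction error does can be shown with the standard closed-form Gaussian belief update. This is the generic conjugate update for a single scalar node, offered as an illustration of the idea rather than the paper's HGF update equations; the function and argument names are assumptions.

```python
def precision_weighted_update(mu, pi_prior, x, pi_obs):
    """Closed-form Gaussian update of a belief (mean mu, precision pi_prior)
    after an observation x that carries precision pi_obs."""
    pi_post = pi_prior + pi_obs            # precisions add
    prediction_error = x - mu
    mu_post = mu + (pi_obs / pi_post) * prediction_error
    return mu_post, pi_post

# Equal precisions: the belief moves halfway toward the observation.
balanced = precision_weighted_update(mu=0.0, pi_prior=1.0, x=2.0, pi_obs=1.0)
# A confident prior (precision 9) moves only a tenth of the way.
confident = precision_weighted_update(mu=0.0, pi_prior=9.0, x=2.0, pi_obs=1.0)
```

Fixing every precision to 1 collapses both cases into the same half-step, which is the simplification the abstract identifies as the source of slow and depth-sensitive learning.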

2605.20292 2026-05-21 cs.LG

TreeText-CTS: Compact, Source-Traceable Tree-Path Evidence for Irregular Clinical Time-Series Prediction

TreeText-CTS: 用于不规则临床时间序列预测的紧凑、可追溯的树路径证据

Kwanhyung Lee, Juhwan Choi, Jongheon Kim, Joohyung Lee, Hyeongwon Jang, Eunho Yang

AI总结 本文提出TreeText-CTS,一种用于不规则临床时间序列预测的紧凑、可追溯的树路径证据方法,通过冻结XGBoost模型生成多尺度窗口摘要,并将激活的树路径转换为确定性、可追溯的证据单元,从而在多个数据集上实现了最佳的AUROC和AUPRC性能。

Comments 27 pages, 4 figures

详情
AI中文摘要

数值时间序列模型能够有效处理不规则的电子健康记录(EHR)轨迹,但它们无法自然地把支撑每个风险估计的测量值和时间模式呈现为可读的证据。现有的基于文本的接口提高了可读性,但通常要么依赖冗长且重复的原始序列化,要么依赖难以追溯到源测量值和时间窗口的患者级自由文本摘要。为弥合这一差距,我们提出TreeText-CTS(Clinical Time-Series),将不规则EHR轨迹转换为人类可读、紧凑、可溯源的树路径证据单元,而无需患者级摘要或推理时的自回归解码。TreeText-CTS让多尺度窗口摘要经过冻结的XGBoost模型,并把被激活的树路径表述为由阈值条件组成的确定性、可溯源的证据单元。证据选择器从这些单元中组合出一个信息量大的子集,再由语言模型编码器整合后用于预测。在PhysioNet 2012死亡率、MIMIC-III死亡率和PhysioNet 2019脓毒症发病预测任务上,TreeText-CTS在所评估的基于文本的EHR时间序列接口中取得了最佳的AUROC和AUPRC,AUPRC比此前最强的基于文本的接口高出6.0到9.7个百分点,同时与数值时间序列模型相比仍具竞争力。消融实验表明,树路径证据构建、证据选择和语言模型组合各自都对性能有贡献。由于传递给语言模型编码器的每个文本片段都由被激活的树路径阈值条件构建而成,TreeText-CTS使提供给最终预测器的证据可检查、可溯源。

英文摘要

Numerical time-series models can effectively process irregular electronic health record (EHR) trajectories, but they do not naturally expose the measurements and temporal patterns supporting each risk estimate as readable evidence. Existing text-based interfaces improve readability, but typically rely on either raw serialization, which is lengthy and redundant, or patient-level free-form summaries, which are difficult to trace to source measurements and time windows. To bridge this gap, we introduce TreeText-CTS (Clinical Time-Series), which converts irregular EHR trajectories into human-readable, compact, source-traceable tree-path evidence units without patient-level summarization or inference-time autoregressive decoding. TreeText-CTS routes multi-scale window summaries through frozen XGBoost models and verbalizes activated tree paths as deterministic, source-traceable evidence units composed of threshold conditions. An evidence selector assembles an informative subset of these units, which a language-model encoder then integrates for prediction. Across PhysioNet 2012 mortality, MIMIC-III mortality, and PhysioNet 2019 sepsis-onset forecasting, TreeText-CTS achieves the best AUROC and AUPRC among evaluated text-based EHR time-series interfaces, improving AUPRC by 6.0 to 9.7 absolute percentage points over the strongest prior text-based interface while remaining competitive with numerical time-series models. Ablations show that tree-path evidence construction, evidence selection, and language-model composition each contribute to performance. Because every span passed to the language-model encoder is constructed from activated tree-path threshold conditions, TreeText-CTS makes the evidence supplied to the final predictor inspectable and source-traceable.
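Turning an activated tree path into a deterministic, source-traceable text unit might look like the sketch below. The output format, the window label, and the feature names are all hypothetical; the abstract does not specify how TreeText-CTS phrases its evidence units.

```python
def verbalize_path(window, conditions):
    """Render one activated tree path as a readable evidence unit.

    window: label of the time window the summary features came from.
    conditions: (feature_name, operator, threshold) tuples along the path.
    """
    parts = [f"{name} {op} {threshold:g}" for name, op, threshold in conditions]
    return f"[{window}] " + " AND ".join(parts)

# Hypothetical path through a tree trained on 6-hour window summaries.
unit = verbalize_path(
    "last 6h",
    [("heart_rate_max", ">", 110.0), ("lactate_mean", "<=", 2.5)],
)
```

Because the string is a pure function of the path, the same measurements always produce the same evidence text, and each clause names the feature and window it came from.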

2605.20289 2026-05-21 cs.LG cs.AI

Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers

即插即用的脉冲算子:突破脉冲Transformer中的非线性瓶颈

Xinzhe Yuan, Xiang Peng, Bin Gu, Huan Xiong

AI总结 本文提出了一种插件式框架,通过将Transformer中的非线性运算分解为三个基本算子(除法、指数和ℓ2范数),并利用LIF神经元群体和轻量级位移缩放实现脉冲友好的近似,从而在不需微调的情况下支持常见的Transformer非线性运算。

Comments Accepted to ICML 2026. 9 pages main paper, 8 pages appendix, 6 figures, 5 tables. Correspondence to Bin Gu and Huan Xiong

详情
AI中文摘要

ANN到SNN的转换提供了一条实用且无需训练的途径来构建脉冲大规模语言模型。然而,当前的流程主要关注于脉冲驱动实现Transformer线性代数运算,而对关键非线性运算的支持有限。这种差距限制了与神经形态风格执行约束的兼容性,其中此类非线性通常需要除法、指数或范数计算,这些计算并不自然支持于标准的泄漏积分-放电动力学。为了解决这个问题,我们提出了一种插件式框架,实现了Transformer非线性的脉冲友好的近似,并整合到现有的ANN到SNN流程中。我们的方法将这些非线性计算分解为三个反复出现的基本算子——除法、指数和ℓ2范数——并通过利用LIF神经元群体进行群体计算,并结合轻量级位移缩放以避免浮点运算来实现它们。通过将这些基本算子作为模块化运算块进行组合,我们的框架支持常见的Transformer非线性运算(例如Softmax、SiLU和归一化),而无需任何微调。在一系列LLM Transformer上的实验表明,选择性地替换目标非线性运算符在所有评估任务中导致的精度下降少于1%。

英文摘要

ANN-to-SNN conversion offers a practical, training-free route to spiking large language models. However, current pipelines primarily focus on spike-driven realizations of Transformer linear-algebra operations, while providing limited support for key nonlinear operators. This gap limits compatibility with neuromorphic-style execution constraints, where such nonlinearities typically require division, exponentiation, or norm computations that are not naturally supported by standard leaky integrate-and-fire dynamics. To solve this problem, we propose a plug-and-play framework that implements spike-friendly approximations for Transformer nonlinearities and integrates into existing ANN-to-SNN pipelines. Our method decomposes these nonlinear computations into three recurring primitives -- division, exponentiation, and $\ell_2$ norms -- and realizes them via population computation using LIF neuron groups, combined with lightweight bit-shift scaling to avoid floating-point arithmetic. By composing these primitives as modular operator blocks, our framework supports common Transformer nonlinearities (e.g., Softmax, SiLU, and normalization) without any fine-tuning. Experiments on a range of Transformer-based LLMs show that selectively replacing the targeted nonlinear operators incurs less than a $1\%$ accuracy drop across all evaluated tasks.
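The decomposition into three primitives can be made concrete with ordinary floating-point stand-ins. The sketch only shows how Softmax, SiLU, and RMS-style normalization reduce to exponentiation, division, and an $\ell_2$ norm; the spiking LIF realizations and bit-shift scaling from the paper are not modeled, and all function names are assumptions.

```python
import math

# Floating-point stand-ins for the three primitives.
def exp_primitive(x):
    return math.exp(x)

def div_primitive(a, b):
    return a / b

def l2_primitive(xs):
    return math.sqrt(sum(x * x for x in xs))

def softmax(xs):
    m = max(xs)  # shift for numerical stability
    exps = [exp_primitive(x - m) for x in xs]
    total = sum(exps)
    return [div_primitive(e, total) for e in exps]

def silu(x):
    # x * sigmoid(x) rewritten as one exponentiation and one division
    return div_primitive(x, 1.0 + exp_primitive(-x))

def rms_norm(xs):
    # root-mean-square = l2 norm divided by the constant sqrt(d)
    rms = div_primitive(l2_primitive(xs), math.sqrt(len(xs)))
    return [div_primitive(x, rms) for x in xs]
```

Swapping the three primitive functions for spike-based approximations would, under this structure, leave the composite operators unchanged, which is the modularity the abstract describes.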

2605.20287 2026-05-21 cs.LG cs.AI cs.CV

FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction

FusionCell: 通过交叉注意力融合布局几何与网表拓扑以实现标准单元性能预测

Haoyi Zhang, Kairong Guo, Bojie Zhang, Yibo Lin, Runsheng Wang

AI总结 本文提出FusionCell,通过交叉注意力机制融合布局几何和网表拓扑,以提高标准单元性能预测的准确性,解决了传统快速预测方法因忽略布局几何而无法捕捉耦合效应和布局相关效应的问题。

详情
AI中文摘要

标准单元是数字电路的基本构建块,其延迟和功耗对芯片级性能有关键影响;然而,标准单元的特征化仍依赖缓慢的仿真扫描,而许多快速预测器忽略了布局几何,无法捕捉耦合效应和布局相关效应。挑战在于如何联合表示布局几何和网表拓扑,使模型同时捕捉细粒度的空间细节和结构连接关系,从而实现准确的性能预测。我们提出FusionCell,一种双模态预测器,以布线后的布局几何和网表拓扑为输入,并在统一模型中显式融合二者。DeiT编码器处理三层布线布局,图Transformer则对异构的器件/线网图进行建模。两种模态通过拓扑引导机制融合:网表充当结构“地图”,主动查询布局中相关的物理区域,以进行几何与拓扑的联合推理。我们基于ASAP7 PDK,使用自动化工具构建了一个7nm数据集,包含149种类型、超过19500个单元,针对六项指标:信号上升/下降的延迟、转换时间和功耗。实验结果表明,FusionCell降低了回归误差,平均MAPE为0.92%,并在Spearman/Kendall排序相关性上优于基线,同时与电路仿真相比将特征化过程加速了数个数量级。

英文摘要

Standard cells form the building blocks of digital circuits, so their delay and power critically influence chip-level performance; yet characterization still relies on slow simulation sweeps, and many fast predictors ignore layout geometry, missing coupling and layout-dependent effects. The challenge is to jointly represent layout geometry and netlist topology so models capture fine-grained spatial details together with structural connectivity for accurate performance prediction. We introduce FusionCell, a dual-modality predictor that treats routed layout geometry and netlist topology as inputs and fuses them explicitly in a unified model. A DeiT encoder processes three-layer routed layouts, while a graph transformer models heterogeneous device/net graphs. The modalities are integrated through a topology-guided mechanism, where the netlist acts as a structural "map" to actively query relevant physical regions in the layout for joint geometric and topological reasoning. We build a 7nm dataset based on the ASAP7 PDK with over 19.5k cells spanning 149 types using automatic tools, targeting six metrics: signal rise/fall delay, transition, and power. Experimental results demonstrate that FusionCell reduces regression error, with an average MAPE of 0.92 percent, and improves Spearman/Kendall ranking over baselines, while accelerating the characterization process by orders of magnitude compared to circuit simulation.

2605.20285 2026-05-21 cs.LG cs.AI

Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

反思式X训练:反馈条件化提升跨所有LLM训练阶段的扩展性

Brandon Cui, Ximing Lu, Jaehun Jung, Syeda Nahida Akter, Hyunwoo Kim, Yuxiao Qu, David Acuna, Shrimai Prabhumoye, Yejin Choi, Prithviraj Ammanabrolu

AI总结 本文提出反思式训练(IXT),通过利用后续阶段的动态来改进早期阶段,从而提高LLM训练的扩展效率,实验表明该方法在计算效率和性能上均有显著提升。

详情
AI中文摘要

我们探讨了如何更高效地扩展当前LLM训练流水线中的多个不断增长的阶段。我们的核心直觉源于后续阶段的动态(例如训练后)可以用来指导早期阶段(例如预训练)。为此,我们提出了反思式训练(或IXT),受离线奖励条件强化学习启发,适用于训练的任何阶段。IXT使用一个思考奖励模型来用自然语言批评性反馈标注数据,使从流水线的最早阶段开始就能进行质量感知训练。然后通过将生成的反馈作为前缀条件化数据来训练模型——确保在训练早期阶段并非所有token都被同等对待。在7.5-12B基于transformer的密集LLM上进行的全面实验表明,我们的方法:使扩展曲线弯曲,从而在一般情况下实现高达2.8倍的计算效率提升;并在数学和代码等领域达到其他训练方法无法达到的性能水平。

英文摘要

We tackle the question of how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our guiding intuition is that the dynamics of later stages of the pipeline, e.g. post-training, can be used to inform earlier stages such as pre-training. To this end, we propose Introspective Training (or IXT), inspired by offline reward-conditioned reinforcement learning and applicable to any stage of training. IXT uses a thinking reward model to annotate data with natural-language, critique-based feedback, enabling quality-aware training from the earliest stages of the pipeline. Models are then trained by prefix-conditioning the data on the generated feedback, ensuring that not all tokens are treated equally, starting much earlier in training than usual. Comprehensive experiments on 7.5-12B transformer-based dense LLMs trained from scratch on up to 18 trillion tokens show that our method bends scaling curves, yielding up to 2.8x more compute efficiency in general, and reaches performance levels unachievable for models trained otherwise in domains such as math and code.

2605.20284 2026-05-21 cs.CV cs.AI cs.LG

JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

JUDO: 一种面向工业异常问答的多模态推理框架

Hyunju Kang, Woohyun Lee, Jaewon Kim, Hogun Park

AI总结 本文提出JUDO框架,通过结合领域知识和上下文提升多模态推理能力,以解决工业异常检测中模型缺乏领域知识的问题,实验表明其在MMAD基准上优于Qwen2.5-VL-7B和GPT-4o。

Comments Published at ICLR 2026

详情
AI中文摘要

工业异常检测已显著受益于大多模态模型(LMMs),使检测能力超越了单纯的检测,尤其通过视觉引导推理提升图像理解能力。然而,LMMs缺乏领域特定知识,限制了其在复杂工业场景中生成准确响应的能力。在本工作中,我们提出了JUDO,即Juxtaposed Domain-Oriented Multimodal Reasoner,一种能够高效整合领域知识和上下文的视觉和文本推理框架。通过视觉推理,我们的模型通过将查询图像与正常图像进行对比,分割缺陷区域,实现细粒度的视觉比较检查。此外,我们通过监督微调(SFT)注入领域知识,以增强上下文理解,并通过强化学习(GRPO)引导领域推理,采用领域导向的推理过程。实验结果表明,JUDO在MMAD基准上表现优异,超越了Qwen2.5-VL-7B和GPT-4o等模型。这些结果突显了增强领域知识和上下文对有效推理在异常理解中的重要性。

英文摘要

Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.

2605.20277 2026-05-21 cs.CV cs.AI

Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

通过轨迹积分反馈调节解剖感知奖励用于体积计算断层扫描分析

Tianwei Lin, Zhongwei Qiu, Jie Cao, Jiang Liu, Wenjie Yan, Bo Zhang, Yu Zhong, Wenqiao Zhang, Yingda Xia, Ling Zhang

AI总结 本文提出轨迹积分反馈GRPO(TIF-GRPO)框架,用于提升医学视觉-语言模型在三维CT分析中的表现;并引入临床异常基准基底(CABS),以解决优化目标与临床严谨性不匹配的问题,提升异常检测能力和临床忠实度。

详情
AI中文摘要

医学视觉-语言模型(VLMs)已迅速发展为通用多模态助手,但其在三维计算机断层扫描(CT)分析中的部署仍受制于优化目标与临床严谨性之间长期存在的不匹配。当前的强化学习(RL)范式仍依赖词汇层面的代理信号,由此引发“评估幻觉”,即模型优化的是语言流畅性而非事实层面的临床正确性,从而导致对诊断至关重要的错误。为弥合这一差距,我们提出临床异常基准基底(Clinical Abnormality Benchmarking Substrate,CABS),一个将放射学报告分解为可验证临床语义单元的结构化系统。借助CABS,我们发现标准RL中存在“机制性偏离”,即表面相似性奖励驱使策略梯度绕开医学事实。为此,我们提出轨迹积分反馈GRPO(TIF-GRPO),一种将控制理论原理融入策略优化的新框架。TIF-GRPO将临床推理建模为用于发现异常的伪时间轨迹,并通过积分反馈回路调节解剖感知奖励:该回路把持续的遗漏作为累积状态误差加以惩罚,把幻觉作为过度的控制量加以抑制。在3D CT基准上的实验表明,我们的方法显著提升了异常检测能力和临床忠实度,为医学VLMs的细粒度调节建立了新范式。我们的项目可在GitHub上获取。

英文摘要

Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce "Evaluation Hallucinations", where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the Clinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a "Mechanistic Divergence" in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose Trajectory-Integral Feedback GRPO (TIF-GRPO), a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available at https://github.com/ZJU4HealthCare/TIF-GRPO.

2605.20276 2026-05-21 cs.LG

OmniISR: A Unified Framework for Centralized and Federated Learning via Intermediate Supervision and Regularization

OmniISR: 通过中间监督与正则化统一集中式学习和联邦学习的框架

Wei-Bin Kou, Guangxu Zhu, Ming Tang, Chen Zhang, Lisheng Wu, Lei Zhou, Yujiu Yang

AI总结 本文提出OmniISR框架,通过中间监督和正则化信号融合纯集中学习、纯联邦学习和混合集中-联邦学习训练模式,解决了集中学习和联邦学习之间的不兼容优化问题,并在理论上推导了收敛界、联邦漂移界、梯度对齐保证和逃逸时间界。

Comments 18 pages

详情
AI中文摘要

边缘智能的全球部署横跨各不相同的法律框架。一些地区允许通过云端数据聚合进行集中式学习(CL),另一些地区则强制要求严格的数据本地化,因而必须采用联邦学习(FL)。这种运行上的二分引入了两种互不兼容的优化机制(即CL中梯度全局无偏但伴随内部协变量偏移,而FL中本地更新有偏且易漂移),使得对二者的任何简单整合都缺乏严格的理论保证。为填补这一空白,我们提出OmniISR,一个统一框架,通过在多个隐藏层施加中间监督与正则化(ISR)信号,融合纯CL、纯FL和混合CL-FL三种训练模式。具体而言,我们提出(i)以互信息(MI)作为中间监督,对齐CL中偏移的内部协变量和FL中随客户端漂移的表示,以及(ii)以负熵(NE)作为中间正则项,惩罚过度自信的预测、保持表示的不确定性并避免设备特定的坍缩。在理论方面,我们推导了(i)一个统一的、与ISR无关的非渐近O(1/sqrt(T))收敛界,表明引入ISR不会破坏标准SGD的收敛性,(ii)一个联邦漂移界,量化了ISR所减小的客户端漂移,(iii)一个梯度对齐保证,确保在轻微偏差下CL与FL更新互不冲突,以及(iv)一个显式的逃逸时间界,表明CL-FL混合训练增大了有效随机性并加速了从严格鞍点的逃逸。大量实验表明,OmniISR在集中式和联邦两种范式下都能稳定提升模型性能,将CL-FL差距缩小了22.60%,并在多种FL算法上的48组成对指标比较中取得37组胜出。

英文摘要

The global deployment of edge intelligence operates across heterogeneous legal frameworks. While some regions permit centralized learning (CL) via cloud data aggregation, others enforce strict data localization, necessitating federated learning (FL). This operational dichotomy introduces two incompatible optimization regimes (i.e., unbiased global gradients coupled with internal covariate shift in CL versus biased, drift-prone local updates in FL), so that any naive integration of the two lacks rigorous theoretical guarantees. To fill this gap, we propose OmniISR, a unified framework that fuses pure CL, pure FL, and hybrid CL-FL training modes by attaching intermediate supervision and regularization (ISR) signals to multiple hidden layers. Specifically, we propose (i) to use mutual information (MI) as intermediate supervision to align shifting internal covariates in CL and client-drifting representations in FL, and (ii) to adopt negative entropy (NE) as an intermediate regularizer to penalize overconfident predictions, preserve representational uncertainty, and avoid device-specific collapse. On the theory side, we derive (i) a unified, ISR-agnostic, non-asymptotic O(1/sqrt(T)) convergence bound showing that the introduced ISR does not violate standard SGD convergence, (ii) a federated drift bound that quantifies the ISR-reduced client drift, (iii) a gradient-alignment guarantee that ensures non-conflicting CL and FL updates under mild bias, and (iv) an explicit escape-time bound indicating that hybrid CL-FL mixing enlarges effective stochasticity and accelerates escape from strict saddles. Extensive experiments demonstrate that OmniISR consistently improves model performance in both centralized and federated paradigms, reduces the CL-FL gap by 22.60%, and yields 37/48 paired metric wins across multiple FL algorithms.
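Why negative entropy penalizes overconfident predictions can be checked numerically. The sketch applies the regularizer to a single probability vector with a hypothetical weight; where exactly OmniISR attaches it inside the hidden layers, and how the mutual-information term is estimated, are not covered here.

```python
import math

def negative_entropy(probs):
    """Sum of p * log(p); equals 0 for a one-hot vector and
    -log(K) for a uniform vector over K classes."""
    return sum(p * math.log(p) for p in probs if p > 0.0)

def regularized_loss(task_loss, probs, weight=0.1):
    """Task loss plus a weighted negative-entropy penalty (weight is illustrative)."""
    return task_loss + weight * negative_entropy(probs)

uniform = negative_entropy([0.5, 0.5])    # most uncertain prediction
confident = negative_entropy([0.9, 0.1])  # more peaked prediction
one_hot = negative_entropy([1.0, 0.0])    # fully collapsed prediction
```

Since the penalty grows as the distribution sharpens, minimizing it alongside the task loss pushes intermediate predictions away from collapse.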

2605.20275 2026-05-21 cs.CV cs.AI

You Don't Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection

你不需要注意力:用于手表端跌倒检测的门控卷积建模

Sana Alamgeer, Ronish Kumar, Awatif Yasmin, Muhammad Irshad, Anne H. H. Ngu

AI总结 本文提出了一种轻量级的双流架构Gated-CNN,用于基于手表的跌倒检测,通过门控机制提升特征提取效率,实现在不使用注意力机制的情况下达到更高的检测精度。

详情
AI中文摘要

现有的基于可穿戴设备的跌倒检测系统依赖于自注意力机制,这种机制带来了二次计算开销,将权重分布到所有时间步。这种全局权重分布会损害在短固定长度窗口中跌倒特征的精确定位。为克服这一挑战,我们提出Gated-CNN,一种轻量级双流架构,通过独立的一维卷积特征提取器处理加速度计和陀螺仪流,随后(i)一个sigmoid门控模块,选择性地抑制无信息的背景激活,同时增强跌倒区分特征;(ii)一个全局平均池化层,将每个流压缩成紧凑的固定长度描述符;(iii)一个共享的分类头,融合两个描述符进行二分类跌倒预测。对于离线评估,我们在五个腕部惯性测量单元(IMU)数据集上评估模型,分别在SmartFallMM、WEDA-Fall、FallAllD、UMAFall和UP-Fall数据集上获得平均F1分数为93%、93%、90%、91%和90%的结果,优于Transformer基线。对于实时评估,我们将模型部署在Google Pixel Watch 3上,并在12名参与者上进行测试。模型在零次遗漏的情况下实现了97%的平均F1分数和98%的准确率,表明sigmoid门控提供了一种在结构上更一致且计算更高效的替代方案,用于商用智能手表的跌倒检测。

英文摘要

Existing deep learning approaches for wearable fall detection systems rely on self-attention mechanisms that impose quadratic computational overhead, distributing weights across all time steps. This global weight distribution impairs the precise localization of the brief impact signatures that characterize falls within short, fixed-length windows. To overcome this challenge, we propose Gated-CNN, a lightweight dual-stream architecture that processes accelerometer and gyroscope streams through independent one-dimensional convolutional feature extractors, followed by (i) a sigmoid gating module that selectively suppresses uninformative background activations while amplifying fall-discriminative features, (ii) a global average pooling layer that compresses each stream into a compact fixed-length descriptor, and (iii) a shared classification head that fuses both descriptors for binary fall prediction. For offline evaluation, we evaluate the model across five wrist-mounted inertial measurement unit (IMU) datasets, achieving average F1-scores of 93%, 93%, 90%, 91%, and 90% on SmartFallMM, WEDA-Fall, FallAllD, UMAFall, and UP-Fall, outperforming Transformer baselines. For real-time evaluation, we deployed the model on a Google Pixel Watch 3 and tested across 12 participants. The model achieves an average F1-score of 97% and an accuracy of 98% with zero missed falls, showing that sigmoid gating offers a more structurally aligned and computationally efficient alternative to attention for commodity smartwatch-based fall detection.
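Steps (i) and (ii) of the architecture, sigmoid gating followed by global average pooling, can be sketched on plain lists. The convolutional feature extractors and the classification head are omitted, and the gate logits are passed in directly as an assumption about how the gate is parameterized.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_descriptor(features, gate_logits):
    """Apply an element-wise sigmoid gate, then average over time.

    features, gate_logits: one list of per-time-step values for each channel.
    Returns one pooled value per channel (a fixed-length descriptor).
    """
    descriptor = []
    for channel, logits in zip(features, gate_logits):
        gated = [x * sigmoid(g) for x, g in zip(channel, logits)]
        descriptor.append(sum(gated) / len(gated))
    return descriptor

# One channel, two time steps: a neutral gate halves every activation.
neutral = gated_descriptor([[2.0, 4.0]], [[0.0, 0.0]])
# Strongly negative logits suppress the channel almost entirely.
suppressed = gated_descriptor([[2.0, 4.0]], [[-50.0, -50.0]])
```

Unlike self-attention, each gate value depends only on its own position, so the cost stays linear in the window length.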

2605.20273 2026-05-21 cs.LG cs.AI

Modality-Decoupled Online Recursive Editing


Siyuan Li, Youyuan Zhang, Fangming Liu, Jing Li

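AI summary: This paper proposes M-ORE, a modality-decoupled online recursive editor for lifelong adaptation of multimodal large language models. A unified proximal-projection formulation with a Sherman-Morrison recursion gives constant per-edit overhead. By maintaining module-wise locality statistics and a fixed orthogonal low-rank edit subspace, it reduces long-horizon interference and improves reliability, generality, and locality.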

AI Chinese abstract

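Online model editing for multimodal large language models (MLLMs) must handle a continuous stream of corrections under compute and memory budget constraints, but editors developed for text-only LLMs often perform poorly on MLLMs: visually dominant activations skew the statistics that shape updates, leading to cross-modal conflict, while sequential writes become entangled in a shared edit space, amplifying long-horizon interference across edits. To address these problems, we propose M-ORE, a modality-decoupled online recursive editor for lifelong MLLM adaptation. M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update through a Sherman-Morrison recursion, giving constant per-edit overhead. It maintains module-level locality statistics for the text stack and the visual projector to avoid visually dominated update shaping, and performs continual updates in a fixed orthogonal low-rank edit subspace via a Sherman-Morrison recursion to mitigate long-horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that M-ORE outperforms strong baselines in reliability, generality, and locality, while achieving favorable quality-efficiency scaling. Our code is publicly available at https://github.com/lab-klc/M-ORE.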

English abstract

Online model editing for multimodal large language models (MLLMs) requires assimilating a stream of corrections under tight compute and memory budgets. Yet editors developed for text-only LLMs often degrade on MLLMs for two reasons. First, visually dominant activations skew the statistics that shape updates, causing cross-modal conflict. Second, sequential writes become entangled in a shared edit space, causing inter-edit interference that grows over long horizons. To address these issues, we propose M-ORE, a modality-decoupled online recursive editor for lifelong MLLM adaptation. M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update with a Sherman-Morrison recursion, yielding constant per-edit overhead. It maintains module-wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping, and it performs continual updates in a fixed orthogonal low-rank edit subspace via a Sherman-Morrison recursion to mitigate long-horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that M-ORE consistently improves reliability, generality, and locality over strong baselines, while achieving favorable quality-efficiency scaling. Our code is publicly available at https://github.com/lab-klc/M-ORE.
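The constant per-edit overhead attributed to the Sherman-Morrison recursion can be illustrated with the generic identity itself. The sketch below is not M-ORE's update rule, which the abstract does not spell out. It assumes a running symmetric statistic C = λI + Σ k kᵀ built from hypothetical key vectors k with a ridge term λ, and shows how its inverse can be refreshed after each rank-one addition without re-inverting. The names `scaled_identity` and `sherman_morrison_update` are illustrative.

```python
def scaled_identity(d, scale):
    """Return scale * I as a d x d nested list."""
    return [[scale if i == j else 0.0 for j in range(d)] for i in range(d)]


def sherman_morrison_update(c_inv, k):
    """Given c_inv = C^{-1} for symmetric C, return (C + k k^T)^{-1}.

    Identity used:
        (C + k k^T)^{-1} = C^{-1} - (C^{-1} k)(k^T C^{-1}) / (1 + k^T C^{-1} k)
    Because C is symmetric, k^T C^{-1} is the transpose of C^{-1} k.
    Cost is O(d^2) per call and does not depend on how many updates
    have already been applied.
    """
    d = len(k)
    ck = [sum(c_inv[i][j] * k[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(k[i] * ck[i] for i in range(d))
    return [[c_inv[i][j] - ck[i] * ck[j] / denom for j in range(d)]
            for i in range(d)]


# Start from C = lam * I, whose inverse is (1 / lam) * I.
lam = 1.0
c_inv = scaled_identity(2, 1.0 / lam)

# Stream of hypothetical key vectors, one per edit.
for key in ([1.0, 0.0], [0.0, 2.0]):
    c_inv = sherman_morrison_update(c_inv, key)

# Here C = I + diag(1, 4) = diag(2, 5), so c_inv is diag(0.5, 0.2).
print(c_inv)
```

Each call touches only the d × d inverse and the new vector, which is why the per-edit cost stays fixed as the edit stream grows. The denominator stays positive as long as C is positive definite, which the ridge term guarantees in this sketch.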