arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2606.04433 2026-06-04 cs.CV cs.CL cs.LG

Stateful Visual Encoders for Vision-Language Models

Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

AI summary: Proposes a stateful visual encoder that conditions each visual representation on prior visual features, improving a vision-language model's sensitivity to visual changes in multi-image, multi-turn interactions; reports consistent gains on cross-image spatial aggregation, multi-object visual differencing, and trajectory behavior cloning.

Comments
Project page: https://statefulvisualencoders.github.io/

Abstract

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/

2606.04432 2026-06-04 cs.CV

DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation

Thanh-Tung Le, Yunhan Zhao, Menglei Chai, Zhengyang Shen, Zhe Cao, Danhang Tang, Xiaohui Xie, Deying Kong

AI summary: Proposes DSA, a confidence-guided adaptive computation framework in which a lightweight confidence head dynamically adjusts the number of denoising steps per frame, enabling real-time autoregressive video generation while preserving video quality.

Comments
CVPR 2026, Findings Track

Abstract

Video diffusion transformers have achieved state-of-the-art visual quality, but their high inference cost remains a major bottleneck for real-time applications. Recent distillation frameworks produce autoregressive video diffusion models with reduced latency, yet these models still use a fixed number of denoising steps per frame, wasting computation on predictable frames and under-refining challenging ones. We present DSA, a confidence-guided adaptive computation framework for AR video diffusion. DSA introduces a lightweight confidence head, trained jointly with the generator under a distribution-matching distillation objective, to estimate per-frame denoising reliability. At inference, this confidence signal dynamically adjusts the number of diffusion steps: simple frames terminate early for speed, while complex frames receive additional refinement. Our method requires no extra video data, no heuristics, and little architectural modification. Experiments show that DSA achieves real-time autoregressive video generation, reaching 22.63 FPS with sub-second latency on H100 GPUs, while maintaining competitive or superior VBench quality compared to recent autoregressive and bidirectional video diffusion models. Our results demonstrate that confidence-guided adaptive sampling provides an effective and practical path toward interactive video generation.
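
The early-termination idea in this abstract can be sketched as a loop that keeps denoising a frame until a confidence score clears a threshold. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the threshold, the step bounds, and the toy denoiser and confidence head are all hypothetical, since the abstract does not specify the stopping rule.

```python
from typing import Callable, List, Tuple


def adaptive_denoise(
    frame: List[float],
    denoise_step: Callable[[List[float]], List[float]],
    confidence: Callable[[List[float]], float],
    threshold: float = 0.9,   # hypothetical stopping threshold
    min_steps: int = 1,       # hypothetical lower bound on steps
    max_steps: int = 4,       # hypothetical upper bound on steps
) -> Tuple[List[float], int]:
    """Denoise one frame, stopping once the confidence score is high enough."""
    steps = 0
    while steps < max_steps:
        frame = denoise_step(frame)
        steps += 1
        if steps >= min_steps and confidence(frame) >= threshold:
            break  # simple frame: terminate early
    return frame, steps


# Toy stand-ins (assumptions): each step halves the residual noise,
# and confidence is one minus the largest remaining residual.
def toy_step(x: List[float]) -> List[float]:
    return [v * 0.5 for v in x]


def toy_confidence(x: List[float]) -> float:
    return 1.0 - max(abs(v) for v in x)


_, easy_steps = adaptive_denoise([0.15, -0.1], toy_step, toy_confidence)
_, hard_steps = adaptive_denoise([0.9, -0.7], toy_step, toy_confidence)
print(easy_steps, hard_steps)
```

In this toy setup the low-noise frame stops after a single step while the high-noise frame uses the full budget, which is the compute reallocation the abstract describes.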

2606.04429 2026-06-04 stat.ML cs.LG

Flatness and Generalization: Learning Multi-Index Models with Homogeneous Neural Networks

Harsh Vardhan, Hossein Taheri, Arya Mazumdar

AI summary: Studies the relationship between flatness and generalization when two-layer homogeneous neural networks learn multi-index models; proves that the flattest interpolators always generalize, while a class of non-generalizing interpolators cannot approach the flattest achievable value.

Abstract

A common heuristic used to explain the generalization of first-order gradient methods on non-convex neural networks is that "flat interpolators generalize well" (Hochreiter and Schmidhuber, 1994; Keskar et al., 2017), where flatness can be measured by the trace of the Hessian of the empirical loss. However, Dinh et al. (2017) showed that any interpolator can be made sharper or flatter by using symmetries of the network that change flatness while keeping the population and empirical losses unchanged. This result makes the earlier heuristic statement vacuous. In this paper, we show that for learning an unknown multi-index model with two-layer non-convex homogeneous neural networks, there is a connection between flatness and generalization, despite the existence of symmetries. This connection pertains to the "flattest" interpolators, i.e., the interpolators that have orderwise minimum flatness among all interpolators. First, we show that there exists a natural class of non-generalizing interpolators whose flatness cannot be brought close to the flattest possible, even using symmetries. Second, we show that for data generated by a sum of single-index models, if the approximation error and label noise are low, any flattest interpolator achieves small population loss, i.e., the flattest interpolators always generalize. This establishes a direct link between flatness and generalization that applies to a large class of activations and realistic data distributions.
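
The rescaling argument attributed to Dinh et al. can be illustrated on a toy model. The sketch below assumes a one-hidden-unit network f(x) = a * w * x (a ReLU unit restricted to positive pre-activations), a single data point, and squared loss; none of this setup comes from the paper. The symmetry (a, w) -> (a / c, c * w) leaves the function unchanged but changes the Hessian trace, and the trace is smallest on the balanced point of the orbit.

```python
import math


def hessian_trace(a: float, w: float, x: float) -> float:
    """Trace of the Hessian of 0.5 * (a*w*x - y)**2 with respect to (a, w),
    evaluated at an interpolator (a*w*x == y), where the residual is zero:
    d2L/da2 = (w*x)**2 and d2L/dw2 = (a*x)**2."""
    return x * x * (a * a + w * w)


def rescale(a: float, w: float, c: float):
    """Function-preserving symmetry: the product a*w is unchanged."""
    return a / c, c * w


x, y = 1.0, 2.0
a, w = 2.0, 1.0                      # interpolates: a*w*x == y
sharp = rescale(a, w, 10.0)          # same function, much sharper
c_star = math.sqrt(abs(a) / abs(w))  # balances |a| and |w|
flattest = rescale(a, w, c_star)     # minimizes the trace over the orbit

print(hessian_trace(a, w, x))          # original point
print(hessian_trace(*sharp, x))        # sharper after rescaling
print(hessian_trace(*flattest, x))     # approximately 2 * x**2 * |a*w|
```

All three points compute the same function, so flatness alone cannot separate them; the abstract's claim concerns the minimum over such orbits, which here is the balanced point.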

2606.04427 2026-06-04 cs.CV

Implicit Fuzzification via Bounded Noise Injection for Robust Medical Image Segmentation

Bisheng Tang, Zhangfeng Ma, Chuchu Zhai, Feng Dong, Yaoqun Wu, Ammar Oad, Yifei Peng

AI summary: Proposes NoiseUNet, which injects bounded perturbations into skip connections to regularize cross-scale feature fusion, implicitly achieving fuzzification and improving the accuracy and boundary fidelity of medical image segmentation.

Comments
Under review

Abstract

Image segmentation remains fundamentally limited by boundary ambiguity arising from sampling-induced information loss and inherent uncertainty in pixel-wise labeling. Although encoder-decoder architectures such as U-Net achieve strong performance, they often produce overconfident predictions that fail to capture transition-region ambiguity. To address this issue, we propose NoiseUNet, a simple yet effective framework that injects bounded perturbations into skip connections to regularize cross-scale feature fusion. This mechanism enforces robustness to local feature variations and promotes boundary-aware representations. Theoretically, the perturbation induces an implicit fuzzification effect, yielding soft, data-driven memberships without requiring explicit fuzzy modeling. We further introduce ThyR, a real-world thyroid ultrasound dataset with inherently ambiguous boundaries. Experiments demonstrate that NoiseUNet consistently improves both segmentation accuracy and boundary fidelity.

2606.04425 2026-06-04 cs.CR cs.AI

What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

Yuanbo Xie, Tianyun Liu, Yingjie Zhang, Suchen Liu, Yulin Li, Liya Su, Tingwen Liu

AI summary: Introduces cross-session stored prompt injection, in which persistent state turns prompt injection from a single-session, model-level threat into a long-lived, system-level vulnerability, and builds a taxonomy, a benchmark, and a sandbox toolkit to assess the risk.

Comments
position paper

Abstract

Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts. This shift fundamentally expands the attack surface of prompt injection. However, prior works on prompt injection have largely focused on model-level threats within a single session, overlooking how cross-session persistent system state fundamentally changes the system-level risk of agentic systems. Inspired by stored cross-site scripting in web systems, we introduce cross-session stored prompt injection, where a successful injection can persist within agentic system state and silently influence future executions long after the original attacker interaction has ended. To systematically study this threat, we formalize stored prompt injection and develop a taxonomy of how adversarial content persists and affects agentic systems across sessions. We further develop a benchmark and sandbox toolkit to evaluate the risks of stored prompt injection, enabling quantitative analysis of attack success across different models, attack goals, and persistence channels. Our findings highlight that persistence transforms prompt injection from an ephemeral model-level threat into a long-lived system-level vulnerability embedded within agent execution state. We hope this work draws broader attention to this emerging threat and motivates the community to systematically study and mitigate system risks arising from persistence in agentic systems.

2606.04423 2026-06-04 cs.LG stat.ML

The price of multi-group transductive learning

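Noah Bergam, Samuel Deng, Daniel Hsu

AI summary: Proves that in the transductive setting, a multi-group learner's error rate on some group may incur a multiplicative penalty relative to the single-group setting, and that this penalty can grow linearly with the number of groups, up to roughly the square root of the sample size; this contrasts sharply with the statistical setting, where the penalty is at most logarithmic and independent of the number of groups.

Abstract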

We show that every multi-group learner in the transductive setting may incur a multiplicative penalty in its error rate on some group relative to the error rate achievable in the single-group setting, and that the penalty can increase linearly with the number of groups, up to roughly the square root of the sample size. This stands in stark contrast to optimal multi-group learners in an analogous (group-realizable) statistical setting, where the penalty is always at most logarithmic in the sample size and independent of the number of groups.

2606.04420 2026-06-04 cs.LG

Loss-Conditional PINNs for Parametric PDE Families

Anna Lazareva, Alexander Tarakanov

AI summary: Proposes LC-PINN, which feeds loss weights or physical coefficients to the network as inputs and samples them randomly, so that a single model parameterizes an entire PDE family without paired data; it matches or outperforms per-weight retrained PINN baselines on several parametric equations.

Abstract

Physics-informed neural networks (PINNs) approximate solutions of ODEs and PDEs by minimising a weighted combination of residual, boundary, initial, and data losses. Their performance is often dominated by the choice of loss weights: a poor weighting can drive training to a degenerate solution in which one physical constraint is satisfied while another is ignored. Existing methods select or adapt a single good set of weights. We take a different view: instead of tuning one weight vector, we explore the entire weight space during training. We introduce LC-PINN, which adapts the loss-conditional training of Dosovitskiy and Djolonga (2020) to the PDE-residual setting: the conditioning vector (either the loss weights or a scalar physical coefficient) is treated as a network input and sampled from a simple prior at every optimisation step. This turns PINN training into learning a continuous family of solutions indexed by that vector, with no solver-generated paired data. LC-PINN thus lies between classical PINNs and operator learning: it stays fully physics-informed but amortises training over a parametric family. Our contribution is not the loss-conditional construction itself, but its extension to PINNs, the unification of the loss-weight and parametric-coefficient regimes under one architecture (concatenation for loss weights, FiLM for coefficients), and a fixed-quadrature L-BFGS finishing protocol that makes the parametric-coefficient regime trainable. We give a lambda-invariance result for the conditional optimum and study LC-PINN on parametric Helmholtz, Schrödinger, viscous Burgers, and Buckley-Leverett equations. A single LC-PINN matches or improves retrained per-weight PINN baselines while parameterising the full family in one model, at a total cost that amortises favourably against per-instance retraining.

2606.04419 2026-06-04 eess.IV cs.AI cs.CV physics.med-ph

L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI

Arda Atalık, Sumit Chopra, Daniel K. Sodickson

AI summary: Proposes L-TGVN, a variational network that uses longitudinal priors as side information to reconstruct the current scan from heavily undersampled measurements, without explicit registration and while accommodating protocol differences; it outperforms baselines on quantitative metrics and structure preservation.

Comments
Accepted to MICCAI 2026

Abstract

MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a highly informative subject-specific context, but practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work, we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that leverages prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between prior and current scans. It further accommodates differences in acquisition protocols across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at github.com/sodicksonlab/L-TGVN.

2606.04418 2026-06-04 cs.SD cs.CL eess.AS

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

Eugene Kwek, Feng Liu, Rui Zhang, Wenpeng Yin

AI summary: Proposes CleanCodec, a denoising audio codec that uses selective information-bottleneck encoding to retain only perceptually important features; at 12.5 tokens/s it achieves state-of-the-art tokenization efficiency, substantially outperforms existing codecs in speaker similarity and speech intelligibility, and yields up to 17x faster inference on downstream tasks.

Abstract

Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.

2606.04414 2026-06-04 cs.CV cs.MM

Motion-Guided Causal Disentanglement for Robust Multi-View Cine Cardiac MRI Diagnosis

Chuankai Xu, Cristiane De Carvalho Singulane, Mohammad Abuannadi, Stephen Chandler, Jeremy Slivnick, Karolina Zareba, Jane Cao, Vidya Nadig, Fabio Fernandes, Seth Uretsky, Diego Perez de Arenaza, Amit Patel, Jianxin Xie

AI summary: Proposes MoViD, a motion-guided view-disease disentanglement framework that separates view-specific from disease-discriminative features via dual-branch supervised contrastive learning and a gradient-reversal adversarial constraint, uses an annotation-free temporal motion feature to localize the heart region, and applies focal reweighting to mitigate class imbalance; it outperforms standard Transformer baselines on a venous thrombosis dataset and two public benchmarks.

Abstract

Multi-view cardiac magnetic resonance (CMR) imaging provides complementary anatomical information and is widely used for noninvasive disease assessment. Recent transformer-based models have demonstrated strong representation learning capabilities for CMR analysis; however, they typically learn unified latent embeddings that entangle view-specific anatomical variations with disease-related features. Such entanglement biases classifiers toward structural attributes rather than view-invariant pathological patterns. This issue is exacerbated in low-data regimes, particularly for underrepresented cardiac conditions, where limited samples increase the susceptibility to shortcut learning and view-dependent decision boundaries. To address this, we propose MoViD, a Motion-Guided View-Disease Disentanglement framework built upon a ViT-MAE backbone. The model explicitly factorizes latent representations into view-specific and disease-discriminative components using dual-branch supervised contrastive objectives and a gradient-reversal adversarial constraint that minimizes disease leakage into the view embedding. Additionally, an annotation-free temporal motion feature, derived from inter-frame difference maps, is introduced to localize the beating heart region and suppress background artifacts. A focal reweighting mechanism is incorporated into the contrastive loss to mitigate class imbalance. We evaluate the framework on a private clinical venous thrombosis dataset and two public benchmarks (M&Ms, M&Ms2). Across disease classification and cardiac segmentation tasks, our approach consistently outperforms standard transformer baselines and demonstrates competitive performance against large-scale pretrained foundation models, validating the efficacy of structural disentanglement in medical image analysis.

2606.04413 2026-06-04 cs.LG

(Mis)generalization of Helpful-only Fine-tuning

Mohammad Omar Khursheed, Baram Sosis, Fabien Roger

AI summary: Studies how helpful-only training (training models to follow user intent without refusing) generalizes; finds emergent misalignment, residual refusal behaviors, poor steerability, sycophancy, and incoherent character in such models, and proposes synthetic document fine-tuning and adding character-related questions as mitigations.

Comments
77 pages, 50 figures

Abstract

Helpful-only models, that is, models that are trained to always follow user intent, are valuable for dangerous capability evaluations and other areas of AI R&D where refusals would be an obstacle. Little is known about the generalization properties of helpful-only training: helpful-only models refuse less than their harmless counterparts, but previous work has not studied other dimensions of their alignment. We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. We show that simple anti-refusal training can cause many of these issues. None of these problems are necessary consequences of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.

2606.04410 2026-06-04 cs.CV

Ultra-Fast Neural Video Compression

Jiahao Li, Wenxuan Xie, Zhaoyang Jia, Bin Li, Zongyu Guo, Xiaoyi Zhang, Yan Lu

AI summary: Proposes DCVC-UF, a chunk-based coding framework that achieves ultra-fast encoding and decoding through joint spatial-temporal modeling and parallel reconstruction, significantly improving the rate-distortion-complexity trade-off.

Comments
CVPR 2026

Abstract

While neural video codecs (NVCs) have demonstrated superior compression ratio, their prohibitive computational complexity remains a critical barrier to real-world deployment. This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. Instead of processing frames sequentially, our approach encodes a chunk of multiple frames into a single compact latent representation and decodes them simultaneously. This is enabled by cross-frame interaction modules for joint spatial-temporal modeling and frame-specific decoders for parallel reconstruction. This paradigm not only dramatically enhances coding throughput but also facilitates more effective modeling of long-term temporal correlations. To further boost speed, we propose a streamlined entropy coding mechanism that consolidates bit-stream interactions into a single step, substantially reducing decoding overhead. Building on these innovations, we present DCVC-UF (Ultra-Fast), a new NVC that sets a new SOTA in performance. Our experiments show that DCVC-UF can achieve ultra-fast encoding and decoding speeds, significantly outperforming previous leading codecs. DCVC-UF serves as a notable landmark in the journey of NVC evolution. The code is at https://github.com/microsoft/DCVC.

2606.04408 2026-06-04 cs.LG cs.AI

An Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization

Rui Zhang, Jinhang Liu, Wenbo Zhang

AI summary: For high-dimensional and incomplete data, proposes an ensembled latent factor model that trains two models separately with differential evolution and gradient descent and fuses them with self-adaptive weighting to obtain more comprehensive, less biased representations.

Abstract

High-dimensional and incomplete (HDI) data are prevalent in many real-world big data scenarios. Latent factor models serve as a common representation learning approach, capable of uncovering informative latent factors from such data. Nevertheless, most existing latent factor models rely solely on gradient descent for optimization, which may lead to insufficient and biased representations, particularly when dealing with heterogeneous HDI data. Thus, this study proposes an Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization (ELFM-DEGDO) with a two-fold design: 1) two diverse latent factor models are independently built via differential evolution and gradient descent optimization, respectively, and 2) the two models are combined via a customized self-adaptive weighting mechanism to effectively fuse their strengths. By leveraging the complementary advantages of both optimization paradigms, ELFM-DEGDO is able to produce more comprehensive and less biased representations for HDI data. Tests on three HDI datasets show that ELFM-DEGDO consistently performs better than several related latent factor models.

2606.04405 2026-06-04 cs.LG cs.AI

Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View

Mingyu Li

AI summary: Because weight decay cannot simplify the function computed by normalized layers in scale-invariant Transformers, proposes a Low-Rank Decay (LRD) regularizer that compresses singular values through the tangential component of a nuclear-norm subgradient; on modular arithmetic tasks it accelerates effective-rank collapse and expands the data-fraction boundary for delayed generalization (grokking).

Abstract

Modern Transformer architectures frequently employ normalization mechanisms such as RMSNorm and Query-Key Normalization, making parts of the model approximately scale-invariant with respect to weight magnitudes. In this regime, standard Frobenius-norm weight decay acts purely along the radial direction of the weight space and cannot directly simplify the function represented by the normalized layer. We study grokking in small algorithmic tasks through this lens and propose Low-Rank Decay (LRD), a nuclear-norm-like spectral regularizer whose subgradient, the polar factor $UV^\top$, retains a tangential component even in the scale-invariant setting. This distinction has a concrete dynamical consequence: after the model memorizes the training set and task gradients vanish, L2 decay can no longer reshape the weight spectrum, whereas LRD continues to compress singular values in an $\ell_1$-like fashion. On modular arithmetic tasks, we find that LRD induces rapid effective-rank collapse in Query/Key matrices and expands the data-fraction boundary at which delayed generalization (grokking) occurs. We further provide a spectral-geometric interpretation through the "needle-to-fan" expansion of the nuclear-norm subdifferential near low-rank strata.
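
The contrast between the two decay rules can be seen directly on singular values. Writing W = U diag(s) V^T, a Frobenius decay step W <- (1 - lr) * W multiplies every singular value by the same factor, while a step along the polar factor, W <- W - lr * U V^T, subtracts the same constant from each. The sketch below tracks only the spectrum; the floor at zero (the soft-threshold, or proximal, form of the step), the learning rate, and the example spectrum are assumptions made for illustration, not details from the paper.

```python
import math
from typing import List


def l2_decay(s: List[float], lr: float, steps: int) -> List[float]:
    """Frobenius weight decay: uniform multiplicative shrinkage of the spectrum."""
    for _ in range(steps):
        s = [(1.0 - lr) * v for v in s]
    return s


def low_rank_decay(s: List[float], lr: float, steps: int) -> List[float]:
    """Nuclear-norm-style decay: subtract lr from each singular value,
    floored at zero (soft-thresholding; an assumption of this sketch)."""
    for _ in range(steps):
        s = [max(v - lr, 0.0) for v in s]
    return s


def direction(s: List[float]) -> List[float]:
    """Spectrum normalized to unit norm: what a scale-invariant layer 'sees'."""
    norm = math.sqrt(sum(v * v for v in s))
    return [v / norm for v in s]


spectrum = [3.0, 1.0, 0.5]
after_l2 = l2_decay(spectrum, lr=0.125, steps=4)
after_lrd = low_rank_decay(spectrum, lr=0.125, steps=4)

print(direction(spectrum))
print(direction(after_l2))   # same direction: only the scale changed
print(after_lrd)             # smallest singular value driven to zero
```

Under L2 decay the normalized spectrum is unchanged, so a scale-invariant layer computes the same function; under the subtractive step the smallest singular value reaches zero and the rank drops.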

2606.04404 2026-06-04 stat.ML cs.LG

Knockoffs-based False Discovery Rate Control and Simplification for Deep Neural Networks

Huiqi Zhang, Wenyu Liao, Yiqing Shi, Xiaobo Huang, Fang Xie

AI summary: Building on the knockoff method and regularized neural networks, proposes three variable screening methods that control the false discovery rate (one-layer filter, multiple-layers filter, and variable weight aggregation filter) to simplify deep neural networks and reduce computational complexity.

Abstract

The deep neural network is a widely used machine learning framework that has been applied in many fields. However, deep neural networks often involve a large number of parameters and inputs, many of which may be irrelevant to the goal or true output. These parameters and input variables not only increase computational complexity, but also contribute to additional computational cost. One solution to this problem is knockoff methods, which have proven successful in controlling false discovery rates in high-dimensional regression. Building on the knockoff methods and using the regularised neural network, this paper proposes three variable screening methods under the condition of controlling false discovery rates: the one layer filter, the multiple layers filter, and the variable weight aggregation filter. In comparison with existing algorithms, we find that our algorithms show satisfactory performance.
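
For readers unfamiliar with knockoff-based selection, the final thresholding step can be sketched as follows. This is the standard knockoff+ rule from the general knockoff literature, not any of the three filters proposed in the paper: given statistics W_j (large positive values suggest a real signal, and signs of null variables are symmetric), it picks the smallest threshold t whose estimated false discovery proportion is at most q. The statistic values below are made up for illustration.

```python
from typing import List


def knockoff_plus_threshold(W: List[float], q: float) -> float:
    """Smallest t in {|W_j| : W_j != 0} such that
    (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q.
    Returns infinity when no threshold qualifies (nothing is selected)."""
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def select_variables(W: List[float], q: float) -> List[int]:
    """Indices of variables whose statistic clears the knockoff+ threshold."""
    t = knockoff_plus_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= t]


# Hypothetical statistics for ten input variables.
W = [5.0, 4.2, 3.1, 2.5, -0.3, 0.2, 1.9, -0.1, 2.2, 0.4]
print(knockoff_plus_threshold(W, q=0.2))
print(select_variables(W, q=0.2))
```

The negative statistics act as an estimate of how many false positives sit above the threshold, which is why no labels or p-values are needed.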

2606.04402 2026-06-04 cs.AI

Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

Jingbo Wen, Liang He, Ziqi He

AI summary: Proposes consequence-aware test-time compute allocation: a lightweight predictor estimates the cost of solving a task incorrectly, and higher-consequence tasks are routed to more compute under the same budget, reducing cost-weighted loss by 22% to 33% on SWE-bench.

Abstract

Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty and spend more compute where it is expected to raise accuracy. This implicitly assumes that all failures cost the same, since an accuracy objective weights every task equally. However, such an assumption does not hold in deployment: A typo in a log message and a migration that corrupts a production database both count as one benchmark failure, but their real-world costs are fundamentally different. To fill this gap, we propose consequence-aware test-time compute allocation. Instead of routing compute only by predicted difficulty, we use a lightweight predictor to estimate from the issue text how costly a task would be if solved incorrectly. The scheduler then routes higher-consequence tasks to larger compute tiers or higher thinking budgets under the same total budget. We conduct main experiments on SWE-bench Lite and evaluate cross-dataset behavior on Multi-SWE-bench mini, covering 700 software-engineering tasks in total. Our results reveal that consequence and difficulty are approximately orthogonal under various annotations, and that current thinking models do not allocate compute sufficiently according to consequence. Moreover, our issue-only predictor never misclassifies a high-consequence task as low-consequence across the 300 SWE-bench tasks. Under matched compute budgets, our consequence-aware scheduler reduces cost-weighted loss by 22% to 33% relative to difficulty-aware routing; in particular, the priority-aware variant, which routes by per-task cost scaled by the marginal-utility signal, crosses 30%, and its deployable predictor-driven version retains over 90% of the oracle gain.
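
The difference between difficulty-aware and consequence-aware routing can be sketched with a toy expected-loss calculation. Everything below is an assumption for illustration: the four tasks, their failure costs, their success probabilities on each tier, and the fixed budget of two high-tier slots are invented, and the "priority" score (cost times the gain in success probability) is only one plausible reading of the abstract's "per-task cost scaled by the marginal-utility signal".

```python
from typing import Callable, Dict, List, Set

Task = Dict[str, float]

# Hypothetical tasks: failure cost, success probability on low and high tiers.
tasks: List[Task] = [
    {"cost": 10.0, "p_low": 0.50, "p_high": 0.90},  # costly, medium difficulty
    {"cost": 1.0, "p_low": 0.20, "p_high": 0.80},   # cheap, hard
    {"cost": 1.0, "p_low": 0.30, "p_high": 0.80},   # cheap, hard
    {"cost": 8.0, "p_low": 0.95, "p_high": 0.97},   # costly, easy
]


def route(score: Callable[[Task], float], n_high: int) -> Set[int]:
    """Send the n_high highest-scoring tasks to the high-compute tier."""
    order = sorted(range(len(tasks)), key=lambda i: score(tasks[i]), reverse=True)
    return set(order[:n_high])


def expected_cost_weighted_loss(high: Set[int]) -> float:
    """Sum over tasks of cost * P(failure) under the chosen tier."""
    total = 0.0
    for i, t in enumerate(tasks):
        p = t["p_high"] if i in high else t["p_low"]
        total += t["cost"] * (1.0 - p)
    return total


by_difficulty = route(lambda t: 1.0 - t["p_low"], n_high=2)
by_consequence = route(lambda t: t["cost"], n_high=2)
by_priority = route(lambda t: t["cost"] * (t["p_high"] - t["p_low"]), n_high=2)

for name, high in [("difficulty", by_difficulty),
                   ("consequence", by_consequence),
                   ("priority", by_priority)]:
    print(name, sorted(high), round(expected_cost_weighted_loss(high), 2))
```

In this toy instance, difficulty-aware routing spends the budget on cheap hard tasks and leaves the costly one exposed, while the priority score avoids spending compute on a costly task that the low tier already solves almost always.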

2606.04401 2026-06-04 cs.LG

TANDEM: Bi-Level Data Mixture Optimization with Twin Networks

TANDEM: 基于孪生网络的双层数据混合优化

Jiaxing Wang, Deping Xiang, Jin Xu, Mingyang Yi, Guoqiang Gong, Zicheng Zhang, Haoran Li, Pengzhang Liu, Zhen Chen, Ke Zhang, Ju Fan, Qixiang Jiang

AI总结 提出TANDEM方法,通过孪生网络(代理模型和参考模型)的差异衡量数据效用,优化领域混合比例,在数据受限和监督微调等场景中显著提升大语言模型性能。

详情
AI中文摘要

大语言模型(LLM)的能力在很大程度上取决于来自不同领域的训练数据。优化特定领域的混合比例可以建模为一个双层优化问题,我们将其简化为单层惩罚形式,并通过孪生网络求解:一个在主数据上训练的代理模型和一个在额外数据上动态更新的参考模型。我们提出的方法——用于双层数据混合优化的孪生网络(TANDEM),通过孪生模型之间的差异衡量数据效用,并增加从额外数据中受益更多的领域的权重。与先前方法相比,TANDEM提供了理论保证和更广泛的适用性。此外,我们的双层视角提出了研究领域重新加权的新设置,例如数据受限场景和监督微调,其中优化的混合比例显著提升了性能。大量实验验证了TANDEM在所有场景中的有效性。

英文摘要

The capabilities of large language models (LLMs) significantly depend on training data drawn from various domains. Optimizing domain-specific mixture ratios can be modeled as a bi-level optimization problem, which we simplify into a single-level penalized form and solve with twin networks: a proxy model trained on primary data and a dynamically updated reference model trained with additional data. Our proposed method, Twin Networks for bi-level DatA mixturE optiMization (TANDEM), measures the data efficacy through the difference between the twin models and up-weights domains that benefit more from the additional data. TANDEM provides theoretical guarantees and wider applicability, compared to prior approaches. Furthermore, our bi-level perspective suggests new settings to study domain reweighting such as data-restricted scenarios and supervised fine-tuning, where optimized mixture ratios significantly improve the performance. Extensive experiments validate TANDEM's effectiveness in all scenarios.
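
A minimal sketch of the up-weighting step described above, assuming per-domain losses are already available for both twin models. The multiplicative (exponentiated) update, the `step_size` parameter, and the domain names are assumptions made for illustration; the abstract does not specify the actual update rule.

```python
import math


def reweight_domains(weights, proxy_loss, reference_loss, step_size=1.0):
    """Multiplicative update: domains where the reference model (which saw
    additional data) beats the proxy by a larger margin get more weight."""
    gaps = {d: proxy_loss[d] - reference_loss[d] for d in weights}
    unnormalized = {d: weights[d] * math.exp(step_size * gaps[d]) for d in weights}
    total = sum(unnormalized.values())
    return {d: v / total for d, v in unnormalized.items()}


# Hypothetical losses for three domains.
weights = {"code": 1 / 3, "web": 1 / 3, "math": 1 / 3}
proxy_loss = {"code": 2.0, "web": 3.0, "math": 2.5}
reference_loss = {"code": 1.2, "web": 2.9, "math": 2.3}
new_weights = reweight_domains(weights, proxy_loss, reference_loss)
```

In this toy run the "code" domain has the largest proxy-versus-reference gap, so it ends up with the largest share of the mixture.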

2606.04399 2026-06-04 cs.LG cs.CR

DPDL: Towards Differential Privacy Preservation in Decentralized Stochastic Learning on Non-IID Data

DPDL: 非独立同分布数据下分散式随机学习中的差分隐私保护

Yunsheng Yuan, Xue Xiao, Lina Wang, Feng Li

AI总结 针对非独立同分布数据下的分散式学习隐私泄露问题,提出基于差分隐私和相似性校准的DPDL算法,通过加噪和余弦相似度校准实现隐私保护并保持线性加速。

详情
AI中文摘要

在分散式学习范式中,一组智能体在没有中央服务器的情况下,利用分布式数据集协作训练全局模型。尽管协作的力量已被许多前沿研究验证,但它需要智能体之间广泛交换梯度信息,从而对单个智能体带来高隐私泄露风险。此外,在实际应用中,训练数据通常在智能体之间非独立同分布,这给实现隐私保护的分散式学习带来了更多挑战。为了解决这些问题,我们提出了一种针对非独立同分布数据的隐私保护分散式学习算法DPDL,该算法通过基于相似性校准的技术,在交叉梯度聚合中利用差分隐私(DP)的概念。具体来说,在每一轮中,每个智能体在与其邻居共享交叉梯度(即其私有本地数据上邻居局部模型的导数)之前,通过高斯噪声机制对其进行扰动;然后采用余弦相似度校准接收到的扰动交叉梯度,使得校准后的交叉梯度聚合能够以类似动量的方式有效更新局部模型。我们严格的理论分析不仅揭示了实现特定隐私保护水平所需的最小噪声水平,而且表明我们的算法在非独立同分布数据训练中仍然实现了线性加速。最后,我们在真实世界数据集上进行了大量实验,以验证我们的算法在防御隐私攻击和训练准确模型方面的有效性。

英文摘要

In the paradigm of decentralized learning, a group of agents collaborate to train a global model using distributed datasets without a central server. Although the power of collaboration has been verified by many state-of-the-art studies, it entails extensive exchange of gradient information among the agents and thus induces a high risk of privacy leakage for the individual agents. Moreover, in real-world applications, the training data are usually not independent and identically distributed (non-IID) across the agents, which makes privacy-preserving decentralized learning more challenging. To address these issues, we propose DPDL, a privacy-preserving decentralized learning algorithm for non-IID data, which leverages the notion of Differential Privacy (DP) in cross-gradient aggregation through a similarity-based calibration technique. Specifically, in each round, each agent perturbs the cross-gradients (i.e., the derivatives of its neighbors' local models on its private local data) with a Gaussian noise mechanism before sharing them with its neighbors; it then adopts cosine similarity to calibrate the received perturbed cross-gradients such that the aggregation of the calibrated cross-gradients can be used to effectively update the local model in a momentum-like manner. Our rigorous theoretical analysis not only reveals the minimum noise level required to achieve a specific level of privacy preservation, but also shows that our algorithm still achieves a linear speedup when training with non-IID data. Finally, we conduct extensive experiments on real-world datasets to validate the effectiveness of our algorithm in defending against privacy attacks and in training accurate models.
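
The two steps named in the abstract, Gaussian perturbation before sharing and cosine-similarity calibration after receiving, can be sketched as follows. This is a toy illustration, not the DPDL algorithm: the clamp-at-zero weighting, the fallback to the local gradient, and all function names are assumptions, and no privacy accounting is performed.

```python
import math
import random


def perturb(grad, sigma, rng):
    """Add i.i.d. Gaussian noise with standard deviation sigma to each coordinate."""
    return [g + rng.gauss(0.0, sigma) for g in grad]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def calibrated_aggregate(local_grad, received):
    """Weight each received cross-gradient by its cosine similarity to the
    local gradient, clamped at zero so opposing directions are ignored."""
    weights = [max(0.0, cosine(local_grad, g)) for g in received]
    total = sum(weights)
    if total == 0.0:
        return list(local_grad)  # assumed fallback: use the local gradient
    dim = len(local_grad)
    return [sum(w * g[i] for w, g in zip(weights, received)) / total
            for i in range(dim)]


rng = random.Random(0)
local = [1.0, 0.0]
# Neighbors perturb their cross-gradients before sharing them.
received = [perturb([2.0, 0.1], 0.1, rng), perturb([-1.0, 0.0], 0.1, rng)]
update = calibrated_aggregate(local, received)
```

The calibration down-weights noisy or dissimilar cross-gradients, which is one plausible reading of how similarity helps under non-IID data.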

2606.04396 2026-06-04 cs.CL

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

读取轨迹,引导路径:面向扩散语言模型的轨迹感知强化学习

Anant Khandelwal, Manish Gupta

AI总结 提出CAPR算法,通过缓存轨迹状态和块级价值头,利用去噪轨迹提供类似树搜索的细粒度监督,在降低计算成本的同时提升扩散语言模型的强化学习效果。

详情
Comments
19 pages, 10 figures, 7 Tables
AI中文摘要

扩散大语言模型(dLLMs)通过并行迭代去掩码和修正多个位置来生成响应。这一过程留下了丰富的去噪轨迹,描绘了哪些标记变得可信、哪些仍不稳定以及何时形成承诺。现有的dLLM强化学习方法仅弱化地使用这一信号。扁平化展开成本低,但将单一结果奖励分配给整个轨迹。树展开通过分支部分轨迹并将叶节点奖励向上传播,提供更精细、可验证的训练信号,但计算密集。我们提出疑问:去噪轨迹本身能否在不使用树级计算的情况下提供类似树的监督?我们引入CAPR(缓存-摊销路径细化),一种dLLM-RL算法,它将去噪轨迹总结为紧凑的路径状态,利用缓存轨迹状态生成廉价的兄弟延续,并训练块级价值头用于局部块级监督。在块级去掩码调度下,CAPR记录路径状态和块进度特征,然后根据每个块中揭示的标记将最终结果奖励重新分配到各个块。这训练价值头将一个稀疏奖励转换为块级PPO权重。因此,CAPR恢复了树搜索的大部分粒度,同时避免了完整的树扩展,将展开生成成本降低到扁平展开的大约0.75倍和树展开的0.6倍(在标准设置下)。在4x4数独、Countdown、GSM8K和Math500上,使用密集和混合专家LLaDA骨干网络,CAPR在256和512标记预算下为RL调优的dLLMs设立了新的最先进水平。在数独上,它以不到三分之一的每步计算量匹配了最强的树结构基线。

英文摘要

Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.
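
The reward redistribution step can be illustrated with a few lines of Python. The sketch assumes the split is proportional to the number of tokens revealed in each block; the abstract only says the reward is redistributed "according to" those tokens, so the exact rule here is an assumption, and the learned value head is not modeled.

```python
def redistribute_reward(final_reward, tokens_revealed_per_block):
    """Split one sparse outcome reward across blocks in proportion to the
    number of tokens each block revealed (assumed rule)."""
    total = sum(tokens_revealed_per_block)
    if total == 0:
        raise ValueError("no tokens were revealed in any block")
    return [final_reward * n / total for n in tokens_revealed_per_block]


# Three blocks that revealed 2, 6, and 8 tokens of a 16-token response.
block_targets = redistribute_reward(1.0, [2, 6, 8])
```

Targets of this kind could serve as regression labels for a block-level value head, turning one trajectory-level reward into per-block signals.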

2606.04392 2026-06-04 cs.LG cs.CL

Physics-Informed Neural Network Modeling of Biodegradable Contaminant Transport through GCL/SL Composite Liners

物理信息神经网络建模可生物降解污染物通过GCL/SL复合衬垫的迁移

Dong Li, Yapeng Cao, Haiping Zhao, Shutong Han

AI总结 提出双域物理信息神经网络框架,通过硬约束PINN精确模拟GCL/SL复合衬垫中污染物迁移,并扩展至逆问题识别降解半衰期。

详情
AI中文摘要

本研究开发了一个双域物理信息神经网络框架,用于污染物通过GCL/SL复合衬垫系统的迁移,其中薄GCL层采用稳态平流-弥散-生物降解公式处理,而下层土壤衬垫建模为瞬态传输域。在不同渗滤液水头条件下,评估了两种公式与解析解和有限元参考解的对比:标准软约束PINN(Std-PINN)和硬约束PINN(H-PINN),其中选定的边界和初始条件直接嵌入试验解中。Std-PINN捕捉了整体突破行为,但在早期传输阶段显示出较大误差,特别是在平流传输更显著的高水头条件下。H-PINN减少了与基于惩罚的约束执行相关的优化负担,提供了更准确和稳定的浓度预测,将MAE从Std-PINN的约0.058-0.067降低到H-PINN的约0.011-0.023,同时将MRE从约9.10%-19.16%降低到约2.08%-3.14%。参数分析证实,采用tanh激活函数和优化网络结构的H-PINN提供了最佳的预测精度。H-PINN进一步扩展到逆建模,用于从有限的浓度观测中识别SL降解半衰期,显示出对预设值的可靠收敛性以及在低到中等观测噪声下的可接受鲁棒性。

英文摘要

This study develops a two-domain physics-informed neural network framework for contaminant transport through a GCL/SL composite liner system, in which the thin GCL layer is treated using a steady-state advection-dispersion-biodegradation formulation and the underlying soil liner is modeled as a transient transport domain. Two formulations are evaluated against analytical and finite-element reference solutions under different leachate-head conditions: a standard PINN with soft constraint enforcement (Std-PINN) and a hard-constrained PINN (H-PINN), in which selected boundary and initial conditions are embedded directly into the trial solutions. The Std-PINN captures the overall breakthrough behavior but shows larger errors during the early transport stage, particularly under higher leachate heads where advective transport becomes more pronounced. The H-PINN reduces the optimization burden associated with penalty-based constraint enforcement and provides more accurate and stable concentration predictions, lowering the MAE from approximately 0.058-0.067 for the Std-PINN to about 0.011-0.023 for the H-PINN, while reducing the MRE from approximately 9.10%-19.16% to about 2.08%-3.14%. Parametric analyses confirm that the H-PINN with the tanh activation function and an optimized network structure provides the best predictive accuracy. The H-PINN is further extended to inverse modeling for identifying the SL degradation half-life from limited concentration observations, showing reliable convergence toward prescribed values and acceptable robustness under low-to-moderate observation noise.
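
The difference between soft and hard constraint enforcement can be shown with a toy trial solution. The sketch below embeds a single initial condition by construction; `raw_network` is a stand-in for a trained network, and the specific form `c_initial + t * N(z, t)` is a common textbook choice assumed here, not the trial solution used in the paper.

```python
import math


def raw_network(z, t):
    """Stand-in for an unconstrained neural network output (hypothetical)."""
    return math.tanh(0.7 * z - 0.3 * t + 0.1)


def hard_constrained(z, t, c_initial=0.0):
    """Trial solution that satisfies c(z, 0) = c_initial exactly, because the
    network term is multiplied by t and vanishes at t = 0."""
    return c_initial + t * raw_network(z, t)


values_at_t0 = [hard_constrained(z, 0.0) for z in (0.0, 0.25, 0.5, 1.0)]
```

Because the initial condition holds for any network weights, the optimizer no longer needs a penalty term for it, which is the reduction in optimization burden the abstract attributes to the H-PINN.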

2606.04391 2026-06-04 cs.AI

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

基于状态接地动态检索的Web代理在线技能学习

Jiaxi Li, Ke Deng, Yun Wang, Jingyuan Huang, Yucheng Shi, Qiaoyu Tan, Jin Lu, Ninghao Liu

AI总结 提出状态接地动态检索(SGDR)方法,通过逐步技能重用提升Web代理在多步自动化任务中的表现,在WebArena上平均成功率分别达到37.5%(GPT-4.1)和24.3%(Qwen3-4B)。

详情
Comments
17 pages
AI中文摘要

语言代理越来越依赖可重用技能来改进跨相关任务的多步Web自动化。越来越多的研究关注在线技能学习,其中代理不断从先前的任务轨迹中归纳技能,并在未来的任务中动态重用它们。然而,现有方法主要在任务级别重用技能:根据初始任务指令检索一组固定的技能,并在执行过程中保持不变。这种静态策略与Web执行不一致,因为适当的下一步动作不仅取决于任务目标,还取决于当前网页状态,而网页状态通常会转变为初始技能无法覆盖的情况。为了解决这一差距,我们提出了状态接地动态检索(SGDR),一种在线技能学习方法,使Web代理能够逐步重用技能。SGDR由三个组件组成:一个滑动窗口提取过程,将完成的轨迹转化为可在中间执行状态调用的可重用子程序;一种双文本代码表示,将技能检索与可执行动作连接起来;以及一种状态接地动态检索机制,将技能与任务目标和当前网页状态相匹配。在WebArena上跨五个领域的实验表明,SGDR持续优于强基线,使用GPT-4.1和Qwen3-4B时平均成功率分别达到37.5%和24.3%,相对于最强基线分别取得10.6%和10.0%的相对增益。代码可在 https://github.com/plusnli/skill-dynamic-retrieval 获取。

英文摘要

Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task-level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.
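
Stepwise retrieval that conditions on both the goal and the current page can be sketched with a bag-of-words scorer. Everything here is hypothetical: the skill names, the descriptions, the `state_weight` mixing parameter, and the use of word-count cosine similarity in place of whatever retriever the authors use.

```python
import math
from collections import Counter


def bow(text):
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(skills, goal, page_state, k=1, state_weight=0.5):
    """Rank skills by a mix of similarity to the task goal and to the
    current page state, and return the top-k skill names."""
    scored = []
    for name, description in skills.items():
        d = bow(description)
        score = ((1.0 - state_weight) * cosine(d, bow(goal))
                 + state_weight * cosine(d, bow(page_state)))
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


skills = {
    "search_product": "type query search box product",
    "fill_checkout_form": "fill shipping address form checkout",
}
goal = "buy a laptop product"
step_1 = retrieve(skills, goal, "search box home page")
step_2 = retrieve(skills, goal, "checkout shipping address form")
```

The goal is identical at both steps, yet the retrieved skill changes once the page state changes, which is the behavior a task-level retriever cannot provide.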

2606.04390 2026-06-04 cs.LG cond-mat.dis-nn math.PR

Shortcomings and capacities of real-constrained neural networks in complex spaces

复空间中实约束神经网络的缺陷与能力

Andrew Gracyk

AI总结 通过 Gardner 体积比较和 Harish-Chandra-Itzykson-Zuber (HCIZ) 公式,研究了复假设类中强制实预激活相对于复预激活的存储容量渐近比。

详情
Comments
First version
AI中文摘要

我们找到了在复假设类中强制实预激活相对于复预激活时存储容量的渐近比。我们的方法依赖于临界容量下的 Gardner 体积比较。我们的证明依赖于文献中非标准的 Harish-Chandra-Itzykson-Zuber (HCIZ) 公式的应用。利用 HCIZ 公式,我们可以获得最终渐近比的更稳健近似。该策略特别适用于我们的工作,因为我们通过 Weyl 积分公式和 Haar 测度在酉紧流形和正交紧流形上进行积分。

英文摘要

We find the asymptotic ratio between the storage capacities when enforcing real pre-activations in a complex hypothesis class as opposed to complex ones in the same class. Our methods depend on Gardner volume comparisons at critical capacity. Our proof relies on an application of the Harish-Chandra-Itzykson-Zuber (HCIZ) formula, nonstandard in the literature. With the HCIZ formula, we may obtain a more robust approximation for the final asymptotic ratio. This strategy is applicable to our work specifically because we integrate over the unitary and orthogonal compact manifolds, facilitated by the Weyl integration formula and the Haar measure.

2606.04389 2026-06-04 cs.CL

When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling

当来访者不再跟随:基于认知概念化图的策略性咨询框架

Yihao Qin, Junyi Zhao, Changsheng Ma, Yongfeng Tao, Minqiang Yang, Chang Liu, Bin Hu

AI总结 针对现有评估协议中来访者过度顺从导致评分虚高的问题,提出基于认知行为疗法的抵抗感知框架,通过认知概念化图模拟动态抵抗,并利用强化学习优化策略推理与响应生成,以提升在困难咨询交互中的策略鲁棒性。

详情
AI中文摘要

大型语言模型(LLMs)在心理咨询中展现出潜力,但现有基准高度依赖高度合作的模拟来访者。我们观察到一个关键的咨询师-来访者跟随现象:这些来访者往往在仅几轮对话后便迅速从抵抗转向顺从,造成治疗进展的假象,并通过表面共情在当前评估协议下虚高分数。为解决这一评估失配问题,我们提出一个基于认知行为疗法(CBT)的抵抗感知框架。我们引入CARS,一个通过认知概念化图(CCDs)显式建模动态抵抗的来访者模拟器。我们提出STREAMS,一个将策略推理(思考者)与响应生成(呈现者)解耦并通过强化学习优化的双模块框架。我们进一步提出EWTS-MI,一个用于评估高摩擦交互下响应性的熵加权指标。在抵抗性和非抵抗性咨询设置上的实验验证了我们对评估失配的发现,并展示了抵抗感知训练在挑战性咨询交互下提升策略鲁棒性的有效性。

英文摘要

Large Language Models (LLMs) show promise in psychological counseling, yet existing benchmarks rely heavily on highly cooperative simulated clients. We observe a critical counselor-following phenomenon: these clients often rapidly shift from resistance to compliance after only a few turns, creating an illusion of therapeutic progress and inflating scores under current evaluation protocols through superficial empathy. To address this evaluation mismatch, we propose a Cognitive Behavioral Therapy (CBT)-grounded resistance-aware framework. We introduce CARS, a client simulator that explicitly models dynamic resistance via Cognitive Conceptualization Diagrams (CCDs). We present STREAMS, a dual-module framework that decouples strategic reasoning (Thinker) from response generation (Presenter) and optimizes it via reinforcement learning. We further propose EWTS-MI, an entropy-weighted metric for evaluating responsiveness under high-friction interactions. Experiments across resistant and non-resistant counseling settings validate our findings on evaluation mismatch and demonstrate the effectiveness of resistance-aware training for improving strategic robustness under challenging counseling interactions.

2606.04388 2026-06-04 cs.CR cs.AI cs.LG

TITAN-FedAnil+: Trust-Based Adaptive Blockchain Federated Learning for Resource-Constrained Intelligent Enterprises

TITAN-FedAnil+:面向资源受限智能企业的基于信任的自适应区块链联邦学习

Muhammad Hadi, Muhammad Jahangir, Talha Shafique, Muhammad Khuram Shahzad

AI总结 提出TITAN-FedAnil+框架,通过基于亲和传播的自适应聚类聚合过滤恶意更新、GPU加速向量化提升效率及有符号状态跳变机制实现轻量级区块链重同步,在资源受限边缘设备上内存开销降低81%。

详情
Comments
8 pages, 5 figures; code available at https://github.com/error8149/FedAnilPlus-Optimized
AI中文摘要

联邦学习(FL)已成为一种在保护数据隐私的同时实现协作智能的有效范式。然而,由非独立同分布(non-IID)数据分布引起的数据异构性和去中心化安全威胁仍然是重大挑战,尤其是在资源受限的企业环境中。本文提出了TITAN-FedAnil+,一种面向智能企业中区块链联邦学习的基于信任的自适应网络。所提出的框架引入了基于亲和传播的自适应聚类聚合,无需预先知道攻击者数量即可识别并过滤恶意更新。此外,采用GPU加速向量化以提高计算效率,同时通过有符号状态跳变机制实现轻量级区块链重同步。实验结果表明,与基线框架相比,在受限的8 GB边缘设备上经过50轮通信,内存开销显著降低,节省高达81%。结果表明,TITAN-FedAnil+有效提升了智能企业环境中安全联邦学习部署的鲁棒性、可扩展性和资源效率。

英文摘要

Federated Learning (FL) has emerged as an effective paradigm for collaborative intelligence while preserving data privacy. However, data heterogeneity arising from non-IID distributions and decentralized security threats remain significant challenges, particularly in resource-constrained enterprise environments. This paper presents TITAN-FedAnil+, a Trust-Based Adaptive Network for blockchain-enabled federated learning in intelligent enterprises. The proposed framework introduces affinity propagation-based adaptive clustered aggregation to identify and filter malicious updates without requiring prior knowledge of the number of attackers. In addition, GPU-accelerated vectorization is employed to improve computational efficiency, while a signed state jump mechanism enables lightweight blockchain resynchronization. Experimental results demonstrate substantial reductions in memory overhead, achieving up to 81% savings across 50 communication rounds on constrained 8 GB edge devices compared with the baseline framework. The results indicate that TITAN-FedAnil+ effectively improves robustness, scalability, and resource efficiency for secure federated learning deployments in intelligent enterprise environments.

2606.04387 2026-06-04 cs.IR cs.AI

Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking

重新思考基于LLM的分层偏好排名的销售线索评分

Chenyu Zhang, Yiwen Liu, Yin Sun, Xinyuan Zhang, Yuji Cao, Junming Jiao, Juyi Qiao

AI总结 针对高价值领域销售线索转化问题,提出基于LLM的判别式框架HPRO,通过分层偏好排名优化联合建模结构化与非结构化数据,实现评分与排名性能提升。

详情
AI中文摘要

在高价值领域(如汽车、房地产)中,销售线索转化与电子商务推荐有根本不同,因为其决策周期长且涉及多阶段漏斗。传统的线索评分方法(基于规则的评分卡、机器学习或逐点CTR模型)面临严重挑战:监督信号稀疏、非结构化CRM日志中的语义鸿沟,以及无法捕捉线索的相对优先级。虽然大型语言模型(LLM)能够对客户交互提供卓越的语义理解,但通用LLM不适合线索排名:它们生成文本而非可比较的分数,并且缺乏与销售漏斗分层优先级的一致性。我们提出了一种基于LLM的判别式框架用于销售线索评分,该框架支持结构化CRM特征和非结构化客户交互的联合建模。在此框架之上,我们提出了HPRO(分层偏好排名优化),通过分层偏好排名目标增强销售线索评分。HPRO采用边际感知的Bradley-Terry公式,将稀疏的二元标签转换为密集的、漏斗感知的偏好对,使线索评分能够同时利用逐点和成对监督。在来自领先新能源汽车品牌的大规模数据上的实验表明,分类性能达到最先进水平(AUC 0.8161),排名性能提升(排名靠前线索的精确度提高39.7%)。为期132天的在线A/B测试验证了9.5%的销量提升,确认了实际的商业影响。

英文摘要

Sales lead conversion in high-stakes domains (e.g., automotive, real estate) differs fundamentally from e-commerce recommendation due to prolonged decision cycles and multi-stage funnels. Traditional lead scoring methods (rule-based scorecards, machine learning, or pointwise CTR models) face severe challenges: sparse supervision, a semantic gap in unstructured CRM logs, and an inability to capture relative lead priority. While Large Language Models (LLMs) offer superior semantic understanding of customer interactions, general-purpose LLMs are ill-suited for lead ranking: they generate text rather than comparable scores, and lack alignment with the hierarchical priorities of sales funnels. We introduce an LLM-based discriminative framework for sales lead scoring, which supports joint modeling of structured CRM features and unstructured customer interactions. On top of this framework, we propose HPRO (Hierarchical Preference Ranking Optimization), which augments sales lead scoring with a hierarchical preference ranking objective. HPRO employs a margin-aware Bradley-Terry formulation to transform sparse binary labels into dense, funnel-aware preference pairs, enabling lead scoring to leverage both pointwise and pairwise supervision. Experiments on large-scale data from a leading NEV brand demonstrate state-of-the-art classification (AUC 0.8161) and ranking performance (+39.7% precision among top-ranked leads). A 132-day online A/B test validates a 9.5% sales volume uplift, confirming real-world commercial impact.
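
A margin-aware Bradley-Terry loss over funnel-derived pairs might look like the sketch below. The integer stage encoding, the rule that the margin grows linearly with the stage gap, and the `margin_per_stage` value are assumptions for illustration; the abstract does not give the exact formulation.

```python
import math


def bt_margin_loss(score_hi, score_lo, margin=0.0):
    """-log(sigmoid(score_hi - score_lo - margin)), computed stably."""
    z = score_hi - score_lo - margin
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def funnel_pairs(stages):
    """All (i, j, gap) where lead i reached a deeper funnel stage than lead j."""
    n = len(stages)
    return [(i, j, stages[i] - stages[j])
            for i in range(n) for j in range(n) if stages[i] > stages[j]]


def ranking_loss(scores, stages, margin_per_stage=0.5):
    """Average pairwise loss, with a larger required margin for larger stage gaps."""
    pairs = funnel_pairs(stages)
    if not pairs:
        return 0.0
    return sum(bt_margin_loss(scores[i], scores[j], margin_per_stage * gap)
               for i, j, gap in pairs) / len(pairs)


# Hypothetical funnel depth per lead: 0 = contacted, 1 = test drive, 2 = order.
stages = [0, 1, 2]
good = ranking_loss([0.1, 1.0, 2.5], stages)
bad = ranking_loss([2.5, 1.0, 0.1], stages)
```

Three leads with a single converted order yield three preference pairs here, which illustrates how sparse binary outcomes can be turned into denser pairwise supervision.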

2606.04385 2026-06-04 cs.CV

Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models

保持几何结构的异质基础模型无监督对齐

Shuwen Yu, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li

AI总结 提出GPUA框架,通过正交映射将视觉基础模型特征对齐到视觉语言模型语义空间,无需标签或参数更新,提升跨模型兼容性并在零样本识别与分割任务中取得显著增益。

详情
Comments
Accepted at ICML 2026
AI中文摘要

基础模型推动了计算机视觉的快速发展,然而两种主导范式——视觉语言基础模型(VLM)和纯视觉基础模型(VFM)——仍然仅部分兼容。VLM提供语言基础的语义对齐,但通常视觉上粗糙;而VFM学习判别性的感知几何结构,但缺乏语义基础。我们提出GPUA(保持几何结构的无监督对齐),一个整合VFM和VLM互补优势的框架。受跨语言对齐启发,GPUA将VFM特征视为一种视觉语言,并学习一个正交映射,将VFM空间转换到VLM语义空间,保持几何结构并缩小模态差距,无需标签或模型参数更新。GPUA是任务无关的,仅需对预训练模型进行特征级访问。在多种基准上的实验表明,跨模型兼容性得到改善,下游零样本识别和分割任务中取得了显著增益,且开销可忽略。代码可在https://github.com/Yuteam14/GPUA获取。

英文摘要

Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the complementary strengths of VFMs and VLMs. Inspired by cross-lingual alignment, GPUA treats VFM features as a visual language and learns an orthogonal mapping that translates the VFM space into the VLM semantic space, preserving geometry and narrowing the modality gap without labels or model parameter updates. GPUA is task-agnostic and requires only feature-level access to pretrained models. Experiments across diverse benchmarks demonstrate improved cross-model compatibility and strong gains in downstream zero-shot recognition and segmentation with negligible overhead. Code is available at https://github.com/Yuteam14/GPUA
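
Why an orthogonal map preserves geometry can be seen in a two-dimensional toy case. The sketch fits a rotation, which is a special case of an orthogonal map, from paired points using the closed-form 2D Procrustes solution. Note the loud assumption: it uses paired source and target points, whereas GPUA is described as unsupervised, so this illustrates only the geometry-preserving property and not the authors' fitting procedure.

```python
import math


def fit_rotation(source, target):
    """Closed-form angle of the rotation that best aligns paired 2D points
    in the least-squares sense (2D orthogonal Procrustes, rotations only)."""
    dot = sum(x1 * y1 + x2 * y2 for (x1, x2), (y1, y2) in zip(source, target))
    cross = sum(x1 * y2 - x2 * y1 for (x1, x2), (y1, y2) in zip(source, target))
    return math.atan2(cross, dot)


def rotate(points, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x1 - s * x2, s * x1 + c * x2) for x1, x2 in points]


source = [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.5)]
target = rotate(source, 0.5)          # pretend this is the other model's space
theta = fit_rotation(source, target)
aligned = rotate(source, theta)
```

Pairwise distances between the aligned points equal those between the source points, so the relative structure of the source features survives the mapping.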

2606.04384 2026-06-04 cs.LG cs.CR stat.ML

Revisiting Privacy Amplification by Subsampling in Selective Release DPSGD

重新审视选择性释放DPSGD中的子采样隐私放大

Xiaobo Huang, Fang Xie

AI总结 针对DPSGD中梯度裁剪和噪声注入导致的效用下降和收敛缓慢问题,重新评估选择性释放机制的隐私分析,提出基于裁剪梯度的差分隐私选择性释放算法(DPSR-CG),通过严格的隐私分析和实验证明其在保持严格隐私保证的同时实现优异模型性能。

详情
AI中文摘要

机器学习对敏感数据的依赖需要差分隐私随机梯度下降(DPSGD)等隐私保护技术。然而,由于梯度裁剪和噪声注入,DPSGD存在显著的效用下降和收敛缓慢的问题。先前的工作试图从不同角度改进DPSGD;值得注意的是,差分隐私选择性更新与释放(DPSUR)算法取得了显著的模型效用。然而,DPSUR中的隐私核算忽略了选择性释放机制引入的采样概率变化,这损害了其隐私保证的严谨性。为了解决这些限制,我们重新评估了选择性释放机制的隐私分析,并提出了一种新颖的算法:基于裁剪梯度的差分隐私选择性释放(DPSR-CG)。通过严格的新推导隐私分析以及在多个数据集(MNIST、CIFAR-10、IMDB和FMNIST)上的广泛实验,我们证明了我们的DPSR-CG机制在保持严格隐私保证的同时实现了卓越的模型性能。

英文摘要

Machine learning's reliance on sensitive data necessitates privacy-preserving techniques like Differentially Private Stochastic Gradient Descent (DPSGD). However, DPSGD suffers from substantial utility degradation and slow convergence due to gradient clipping and noise injection. Prior works have attempted to improve DPSGD from various perspectives; notably, the Differentially Private Selective Update and Release (DPSUR) algorithm has achieved remarkable model utility. However, the privacy accounting in DPSUR overlooks the variation in sampling probability introduced by the selective release mechanism, which compromises the rigor of its privacy guarantees. To address these limitations, we re-evaluate the privacy analysis of the selective release mechanism and propose a novel algorithm: Differentially Private Selective Release based on Clipped Gradients (DPSR-CG). Through a rigorous, newly derived privacy analysis and extensive experiments on multiple datasets (MNIST, CIFAR-10, IMDB, and FMNIST), we demonstrate that our DPSR-CG mechanism maintains strict privacy guarantees while achieving exceptional model performance.
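
For readers unfamiliar with the baseline, the clipping and noise injection that DPSGD applies to each batch can be sketched as follows. This shows only the standard DPSGD gradient step, with hypothetical parameter names; it does not implement selective release, DPSUR, or DPSR-CG, and it performs no privacy accounting.

```python
import math
import random


def clip(grad, clip_norm):
    """Scale the gradient down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def dpsgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise with standard
    deviation noise_multiplier * clip_norm, then average over the batch."""
    clipped = [clip(g, clip_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * clip_norm) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(0)
batch = [[3.0, 4.0], [0.3, 0.4]]
private_grad = dpsgd_gradient(batch, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping shrinks the first gradient from norm 5 to norm 1, which hints at the utility loss the abstract mentions: large informative gradients are attenuated before noise is even added.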

2606.04382 2026-06-04 cs.DL cs.AI cs.IR

LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment

LCSHBench:一个多语言、共识基础的国会图书馆主题标目分配基准

Kwok Leong Tang

AI总结 提出LCSHBench基准,基于多图书馆共识构建多语言书目记录集,通过精确匹配和概念匹配评估自动主题编目,并展示低秩微调嵌入器在跨语言检索中的改进。

详情
AI中文摘要

自动主题编目为书目记录分配受控词汇标目,但LCSH缺乏标准的公开基准。我们引入LCSHBench:来自哈佛、哥伦比亚和普林斯顿开放许可目录的15种语言的22,346本书。只有当至少两个独立编目机构分配了LCSH时,记录才被纳入;我们发布每个目录的来源以及联合和一致答案视图。对465,187部由三个图书馆编目的作品进行的一致性研究显示了这种设计的重要性:图书馆通常在底层主题上达成一致(93.3%共享概念级标目),但在确切表达上经常不同(39.4%具有相同的标目集)。因此,LCSHBench通过开放词汇生成和全词汇检索,使用按语言和标目类型分解的集合和排名指标,对精确匹配和概念匹配进行评分。作为首次演示,对300M设备端嵌入器的低秩微调改进了跨语言检索,并在开发集上的精确召回率@200(0.659 vs 0.623)超过了3,072维托管嵌入器。语言面板显示增益并不均匀,保留测试和端到端确认仍是未来工作。

英文摘要

Automated subject cataloging assigns controlled-vocabulary headings to bibliographic records, but LCSH has no standard public benchmark. We introduce LCSHBench: 22,346 books in 15 languages from the openly licensed Harvard, Columbia, and Princeton catalogs. Records enter only when at least two independent cataloging agencies assigned LCSH; we release per-catalog provenance plus union and unanimous answer views. A concordance study of 465,187 works cataloged by all three libraries shows why this design matters: libraries usually agree on the underlying topic (93.3% share a concept-level heading) but often differ in exact expression (39.4% have identical heading sets). LCSHBench therefore scores both exact and concept matches, with set and rank metrics broken down by language and heading type, across open-vocabulary generation and full-vocabulary retrieval. As a first demonstration, a low-rank fine-tune of a 300M on-device embedder improves cross-lingual retrieval and beats a 3,072-dimensional hosted embedder on development exact recall@200 (0.659 vs 0.623). The language panel shows the gain is not uniform, and held-out-test and end-to-end confirmation remain future work.
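
The union and unanimous answer views, and an exact-match recall@k over them, can be sketched as below. The catalog names come from the abstract, but the headings, the ranked list, and the reading of "unanimous" as the intersection across catalogs are assumptions made for illustration.

```python
def answer_views(per_catalog):
    """Return (union, unanimous): headings assigned by any catalog, and
    headings assigned by every catalog (assumed meaning of 'unanimous')."""
    sets = [set(headings) for headings in per_catalog.values()]
    return set.union(*sets), set.intersection(*sets)


def recall_at_k(ranked, gold, k):
    """Fraction of gold headings that appear in the top-k ranked predictions."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


# Hypothetical record with made-up headings.
per_catalog = {
    "harvard": ["Cats", "Cats--Behavior"],
    "columbia": ["Cats"],
    "princeton": ["Cats", "Pets"],
}
union, unanimous = answer_views(per_catalog)
ranked = ["Pets", "Dogs", "Cats"]
union_recall_at_2 = recall_at_k(ranked, union, 2)
unanimous_recall_at_3 = recall_at_k(ranked, unanimous, 3)
```

The same ranked list scores differently under the two views, which is why reporting both gives a fuller picture when catalogers agree on the topic but not on the exact heading.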

2606.04381 2026-06-04 cs.LG cs.AI

From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models

从符号到几何:在大语言模型中实现空间推理

Chen Chu, Bita Azarijoo, Li Xiong, Khurram Shafique, Cyrus Shahabi

AI总结 提出空间语言模型(SLM),通过将位置信息作为一等模态并学习空间表示,在推理过程中实现几何空间推理,显著优于基于符号推理的现有方法。

详情
AI中文摘要

近期的大语言模型(LLM)通常表现出空间推理能力;然而,这种能力很大程度上是符号性的,源于对空间语言的模式匹配,而非真正的几何空间推理。由于LLM操作离散令牌,它们缺乏对连续空间表示、显式几何计算和结构化空间算子的原生支持。为解决这一局限,我们引入了空间语言模型(SLM),这是首个将位置信息作为一等模态并在模型推理过程中实现几何空间推理的多模态LLM。SLM直接操作学习到的空间表示,而非空间关系的文本描述。为支持有效训练,我们构建了空间指令数据集,该数据集对齐了空间表示、原子几何操作和自然语言指令。我们进一步提出了名为SpatialEval的新基准,旨在评估属性、距离、拓扑和相对位置任务上的空间推理。大量实验表明,SLM显著优于依赖通过提示工程或文本抽象进行符号推理的现有基于LLM的方法,展示了集成几何空间表示对稳健空间推理的优势。我们的指令数据集、评估基准、模型训练代码和模型检查点可在 https://github.com/chuchen2017/SLM 获取。

英文摘要

Recent large language models (LLMs) often appear to exhibit spatial reasoning ability; however, this capability is largely symbolic, arising from pattern matching over spatial language rather than true geometric reasoning over space. Because LLMs operate on discrete tokens, they lack native support for continuous spatial representations, explicit geometric computation, and structured spatial operators. To address this limitation, we introduce the Spatial Language Model (SLM), the first multimodal LLM that treats location information as a first-class modality and enables geometric spatial reasoning within the model's inference process. SLM directly operates on learned spatial representations rather than textual descriptions of spatial relations. To support effective training, we construct a Spatial Instruction Dataset that aligns spatial representations, atomic geometric operations, and natural language instructions. We further propose a new benchmark named SpatialEval, which is designed to evaluate spatial reasoning across attributes, distance, topology, and relative-position tasks. Extensive experiments show that SLM significantly outperforms existing LLM-based approaches that rely on symbolic reasoning via prompt engineering or textual abstraction, demonstrating the benefits of integrating geometric spatial representations for robust spatial reasoning. Our instruction dataset, evaluation benchmark, model training code, and model checkpoints can be found at: https://github.com/chuchen2017/SLM.

2606.04380 2026-06-04 stat.ML cs.LG

REGAIN: REconciliation GAIN-driven Auxiliary Direction Learning

REGAIN:基于调和增益的辅助方向学习

Weijia Li, Shun Hu, Yanfei Kang

AI总结 提出REGAIN框架,通过学习归一化辅助方向并利用冻结预测预言机,基于目标加权损失减少选择方向,以改进预测调和。

详情
AI中文摘要

预测调和通常从固定测量系统开始,询问如何将预测投影到一致空间。我们提出不同问题:哪些额外的线性测量应被预测并纳入调和系统?我们提出REGAIN,一种调和增益框架,学习归一化辅助方向,用冻结预测预言机预测诱导序列,并通过增强广义最小二乘调和后的目标加权损失减少选择方向。与基于方差的分量或基于可预测性的辅助选择不同,REGAIN优化辅助测量对最终调和预测的下游影响。我们提供统计特征,表明有用的辅助方向必须提供关于未解决目标不确定性的互补信息,而不仅仅是易于预测。分析还阐明了协方差风险减少机制、偏差变化在实现二次风险中的作用以及估计增益信号的稳定性。开发了带有保留增益筛选的分阶段学习算法,以及可选的联合优化步骤。在北京PM2.5和澳大利亚旅游数据上的实验表明,增益选择的测量可以改进普通多变量和层次预测,特别是当它们揭示原始测量系统未捕捉的残差不确定性时。

英文摘要

Forecast reconciliation usually starts from a fixed measurement system and asks how forecasts should be projected onto a coherent space. We ask a different question: which additional linear measurements should be forecast and included in the reconciliation system? We propose REGAIN, a reconciliation-gain framework that learns normalized auxiliary directions, forecasts the induced series with a frozen forecasting oracle, and selects directions by their target-weighted loss reduction after augmented generalized least-squares reconciliation. Unlike variance-based components or predictability-based auxiliary selection, REGAIN optimizes the downstream effect of an auxiliary measurement on the final reconciled forecasts. We provide a statistical characterization showing that useful auxiliary directions must provide complementary information about unresolved target uncertainty, rather than merely being easy to forecast. The analysis also clarifies the covariance-risk reduction mechanism, the role of bias changes in realized quadratic risk, and the stability of estimated gain signals. A stagewise learning algorithm with held-out gain screening is developed, together with an optional joint refinement step. Experiments on Beijing PM2.5 and Australian Tourism data show that gain-selected measurements can improve both ordinary multivariate and hierarchical forecasts, especially when they reveal residual uncertainty not captured by the original measurement system.
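
The reconciliation step that REGAIN builds on can be illustrated on the smallest possible hierarchy, total = A + B. The sketch uses ordinary least squares, the special case of generalized least squares with identity weights, where the projection reduces to spreading the incoherence equally over the three series. It does not learn auxiliary directions or compute reconciliation gain; the function name and the numbers are hypothetical.

```python
def reconcile_ols(total, a, b):
    """Project base forecasts (total, a, b) onto the coherent space
    total = a + b using the OLS projection for this two-leaf hierarchy."""
    gap = total - (a + b)  # incoherence of the base forecasts
    return total - gap / 3, a + gap / 3, b + gap / 3


# Base forecasts that do not add up: 10 versus 4 + 3.
rec_total, rec_a, rec_b = reconcile_ols(10.0, 4.0, 3.0)
```

After reconciliation the total equals the sum of its parts. Adding an auxiliary linear measurement, as the abstract proposes, would enlarge this system with an extra forecast row before the same kind of projection is applied.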