arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28190 2026-05-28 cs.CL

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

Manuel Frank, Haithem Afli

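Affiliations: Department of Computer Science, Munster Technological University

AI summary: Proposes HTEB, a dynamic evaluation framework that uses an LLM to stochastically transform inputs and challenges the robustness of text embedding models along three axes: lexical/stylistic, length, and language. It finds that models have specific, partly decoupled robustness profiles, that scale raises absolute scores without closing the gap between original and transformed evaluations, and that English datasets are more sensitive to the transformations.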

Comments 29 pages, 11 figures

Abstract

Embedding benchmarks like MTEB report a single score per model, implicitly treating robustness as a static, scalar property. We argue that embedding robustness is multidimensional, since models respond differently to different types of variation, and requires dynamic evaluation to expose failures hidden by static benchmarks. We introduce the Harder Text Embedding Benchmark (HTEB), a dynamic evaluation framework that challenges model robustness along three practically interpretable axes (Lexical/Stylistic, Length and Language) by stochastically transforming inputs at evaluation time with an LLM. Evaluating 16 open-weight embedding models on 32 datasets covering 42 languages under transformations validated by 4,800 human ratings on an English subsample, we find three patterns: (1) Models exhibit specific, partly decoupled robustness profiles across axes. (2) Across three model families, scale increases absolute scores but does not close the gap between original and transformed evaluations. Here, scaling tends to improve specifically the Language axis. (3) English datasets are more sensitive to HTEB transformations than multilingual datasets. This demonstrates that HTEB identifies strengths and weaknesses of models along deployment-relevant axes, challenging current embedding benchmarks and arguing for multidimensional, dynamic robustness evaluation.

2605.28188 2026-05-28 cs.CL

Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment

Seojin Hwang, Minju Kim, Junhyuk Choi, JeongHyun Park, Hwanhee Lee

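Affiliations: Chung-Ang University

AI summary: Proposes Fragile, a benchmark that systematically evaluates how stable LLM decisions are under factually equivalent but differently framed inputs, and Valign, a representation-level intervention that reduces framing-induced decision flips.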

Comments 29 pages, 7 figures, 31 tables

Abstract

Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but differently framed inputs can significantly destabilize LLM decisions. To systematically investigate this problem, we introduce Fragile, a large-scale benchmark that isolates fact-preserving semantic framing across three controlled dimensions: value-tinted narration, temporal slice, and narrative vividness. Our experiments reveal a high susceptibility of LLMs to framing, with an average decision flip rate of 28.6%. We find that simple prior prompt-level and activation-level interventions not only fail to suppress framing sensitivity but actively amplify it. We therefore propose Valign, a representation-level method that explicitly targets these framing dimensions by anchoring decisions to a stable value prior, steering hidden states toward the model's value-consistent direction, and projecting out temporal-vividness-sensitive directions from the model's hidden states. Valign consistently reduces framing-induced decision flips, demonstrating that robust mitigation requires directly targeting the internal pathways in which framing operates.
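
The decision flip rate reported in this abstract can be read as the share of cases whose decision changes between a base framing and a fact-preserving reframing. The following is a minimal sketch under that reading, assuming paired decisions are available as two lists; the function name and the toy data are hypothetical and are not the authors' evaluation code.

```python
def decision_flip_rate(base_decisions, reframed_decisions):
    """Fraction of paired cases whose decision differs across framings."""
    if len(base_decisions) != len(reframed_decisions):
        raise ValueError("decision lists must be paired one-to-one")
    if not base_decisions:
        raise ValueError("need at least one paired case")
    flips = sum(b != r for b, r in zip(base_decisions, reframed_decisions))
    return flips / len(base_decisions)


# Hypothetical toy data: the same five cases judged under two framings.
base = ["guilty", "not guilty", "guilty", "guilty", "not guilty"]
reframed = ["guilty", "guilty", "guilty", "not guilty", "not guilty"]
rate = decision_flip_rate(base, reframed)  # two of five decisions change
```

Averaging this quantity over framing dimensions and models would give a single summary figure, although the exact aggregation used in the paper is not stated in the abstract.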

2605.28186 2026-05-28 cs.RO cs.AI

Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

Daisuke Yasui, Toshitaka Matuki, Hiroshi Sato

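Affiliations: Mathematics and Computer Science, National Defense Academy of Japan

AI summary: Proposes a framework that extends the clustering features to include actions, next states, and next actions, and introduces a method for choosing the number of clusters that suppresses self-transitions, revealing clearer and more regular latent motion phase structures in deep reinforcement learning locomotion policies.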

Abstract

Deep reinforcement learning (DRL) has been shown to achieve high performance on locomotion control tasks in MuJoCo benchmarks such as HalfCheetah, Ant, and Walker2D. However, visualizing the motion structures internally obtained by a trained policy function implemented as a deep neural network remains challenging. It is known from biomechanics and related fields that locomotion control is realized through the repetition of motion phases such as the stance phase and swing phase. In this study, we propose a framework for uncovering latent motion phase structures from trajectories generated by locomotion control policies through interaction with the environment. The proposed method extends the clustering features from state observations alone to augmented features including actions, next states, and next actions, and introduces a method for determining the number of clusters that suppresses self-transitions. Applying the proposed method to three environments -- Ant-v5, HalfCheetah-v5, and Walker2D-v5 -- we successfully identified phase structures with clearer and more regular transition rules than those obtained by the existing method.
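
The two ingredients named in this abstract, augmented clustering features and self-transitions, can be illustrated with a short sketch. It assumes a trajectory is stored as equal-length lists of state and action tuples; the helper names are hypothetical, and how the paper turns the self-transition count into a choice of cluster number is not specified in the abstract, so the sketch only computes the quantity itself.

```python
def augment_features(states, actions):
    """Build (s_t, a_t, s_{t+1}, a_{t+1}) feature rows from one trajectory."""
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    return [
        list(states[t]) + list(actions[t]) + list(states[t + 1]) + list(actions[t + 1])
        for t in range(len(states) - 1)
    ]


def self_transition_rate(labels):
    """Share of consecutive steps that stay in the same cluster."""
    pairs = list(zip(labels, labels[1:]))
    if not pairs:
        raise ValueError("need at least two labels")
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical toy trajectory: 2-D states, 1-D actions, 4 time steps.
states = [(0.0, 0.1), (0.2, 0.3), (0.4, 0.5), (0.6, 0.7)]
actions = [(1.0,), (-1.0,), (1.0,), (-1.0,)]
features = augment_features(states, actions)

# Hypothetical cluster labels, as produced by any clustering method.
rate = self_transition_rate([0, 0, 1, 1, 1, 2])
```

A lower self-transition rate means the label sequence moves between clusters more often, which is one plausible reading of why suppressing self-transitions yields phase-like structure.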

2605.28184 2026-05-28 cs.LG

Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin

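Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; MAIS, Institute of Automation, Chinese Academy of Sciences; Meituan

AI summary: Analyzes from an optimization perspective why joint training of multi-token prediction with reinforcement learning fails, and proposes Optimal Coefficient Calibration, which tracks the optimal coefficient online to improve performance.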

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining. Combining them is a natural approach, yet current RL practices detach MTP gradients because joint training degrades the performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation and a second-order perturbation penalty. This decomposition unifies three MTP training regimes: Detach, Cross-Entropy loss, and Policy loss, and explains why each succeeds or fails. Further analysis of policy loss reveals that, although it aligns with intuition, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by the analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.

2605.28181 2026-05-28 cs.CL

When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models

Jungwon Park, Jimyeong Kim, Jungmin Ko, Nojun Kwak, Wonjong Rhee

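Affiliations: RICS; AIIS; IPAI; Department of Intelligence and Information; Daegu Gyeongbuk Institute of Science and Technology

AI summary: Addresses incomplete generation and premature decoding caused by misleading confidence in diffusion language models by proposing suffix anchoring with anchor-proximity confidence modulation, a training-free method that improves fully non-autoregressive decoding.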

Comments Preprint

Abstract

Diffusion language models decode text by iteratively denoising masked token sequences, making the choice of which positions to decode a central inference-time decision. Most training-free decoding strategies use model confidence for position selection, assuming that high-confidence positions are ready to be decoded. In this work, we revisit this assumption by studying when confidence misleads fully non-autoregressive (fully non-AR) decoding. EOT tokens can receive high confidence and cause incomplete generation; inserting a suffix anchor can mitigate this issue but introduces local overconfidence near the anchor, causing anchor-adjacent tokens to be decoded too early. To address these issues, we propose Suffix-Anchored Confidence Modulation, a simple training-free method that inserts a short suffix anchor to encourage response completion and modulates confidence near the anchor according to decoding progress. This preserves the response-completion benefit of suffix anchoring while reducing premature decoding of anchor-adjacent tokens. Across text-only reasoning, vision-language reasoning, and code-generation benchmarks, our method consistently improves confidence-based fully non-AR decoding, outperforms explicit EOT suppression, and preserves the parallel decoding advantage of fully non-AR generation.

2605.28179 2026-05-28 cs.CL

SuperValid: Capability-Aligned OOD Validation for Generalizable Downstream Scaling

Quanen Sun, Changxin Tian, Ke Shi, Cai Chen, Cunyin Peng, Jia Liu, Kunlong Chen, Zhiqiang Zhang

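Affiliations: Ant Group

AI summary: Proposes SuperValid, a framework that distills core concepts from benchmarks and expands them into diverse, knowledge-rich texts to synthesize capability-aligned out-of-distribution validation data, predicting downstream performance at the capability level and supporting model selection, early stopping, and scaling decisions.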

Abstract

Scaling laws guide large language model training by relating compute to cross-entropy loss, and recent work further extends them to predict downstream benchmark performance. However, prior approaches face generalization limitations from two aspects: focusing on benchmark-level performance introduces scenario-specific artifacts, while relying on IID validation loss fails to track capability improvements when training distributions vary. In this work, we argue that downstream scaling should be studied at the capability level, which captures shared skill factors across related tasks while abstracting away benchmark-specific noise. We propose SuperValid, a framework that synthesizes OOD (out-of-distribution), capability-aligned validation data by distilling core concepts from benchmarks within a capability domain and expanding them into diverse, knowledge-rich texts. Extensive experiments spanning 17 benchmarks grouped into 6 capability domains show that SuperValid loss exhibits strong and stable correlation with downstream performance across models of different architectures, scales, and training data distributions. As a training-free metric computable during training without benchmark evaluation, SuperValid enables effective model selection, early stopping, and scaling decisions.
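
One simple way to check whether a validation loss tracks downstream performance across checkpoints is a rank correlation. The sketch below computes Spearman's rho in plain Python; it is an illustration only, the abstract does not say which correlation measure the authors use, and the loss and accuracy values are made-up toy numbers.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation; assumes no tied values in either list."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two items")

    def rank(values):
        # Rank 1 goes to the smallest value.
        order = sorted(range(n), key=lambda i: values[i])
        ranks = [0] * n
        for position, index in enumerate(order):
            ranks[index] = position + 1
        return ranks

    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical checkpoints: validation loss and benchmark accuracy.
val_loss = [2.9, 2.6, 2.4, 2.1]
benchmark_acc = [0.31, 0.38, 0.36, 0.52]
rho = spearman_rho(val_loss, benchmark_acc)
```

A strongly negative rho indicates that lower validation loss goes with higher benchmark accuracy, which is the kind of relationship a validation set needs before it can stand in for benchmark evaluation.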

2605.28176 2026-05-28 cs.CV

From Kellgren-Lawrence to Calcium Pyrophosphate Crystal Deposition: A Soft-Labelling Framework for Knee Osteoarthritis Assessment

Francisco Bérchez-Moreno, Riccardo Rosati, Maria Chiara Fiorentino, Víctor M. Vargas, Edoardo Cipolletta, Emilio Filippucci, Luca Romeo, Pedro A. Gutiérrez, César Hervás-Martínez

Affiliations: Department of Political Science, Communication and International Relations, University of Macerata, Macerata, Italy; Department of Economics and Law, University of Macerata, Macerata, Italy; Department of Innovative Technologies in Medicine & Dentistry, Università degli Studi "G. D'Annunzio" Chieti - Pescara, Chieti, Italy; Department of Internal Medicine, Azienda Ospedaliero Universitaria delle Marche, Ancona, Italy; Academic Rheumatology, University of Nottingham, Nottingham, UK; Department of Rheumatology, Polytechnic University of Marche, Ancona, Italy

AI summary: Proposes an ordinal deep learning framework based on soft labels that replaces one-hot encoding with unimodal probability distributions, handling both the ordinal uncertainty and the asymmetric relationship in KL and CPPD grading, and significantly improving grading performance on knee X-ray images.

Abstract

Background and objective. Conventional Deep Learning (DL) approaches for Knee Osteoarthritis (KOA) grading rely on one-hot labels, which fail to capture both the ordinal uncertainty of Kellgren-Lawrence (KL) and Calcium Pyrophosphate Deposition Disease (CPPD) severity scores and the asymmetric relationship between the two scales observed in clinical practice. Methods. We retrospectively collected 2172 knee X-ray images, including 968 radiographs jointly annotated for KL and CPPD severity. An ordinal DL framework based on soft-labelling was developed for both tasks, replacing one-hot targets with unimodal probability distributions centred on the annotated grade. Four formulations were investigated: binomial, beta, triangular, and exponential. Results. All soft-labelling strategies consistently outperformed the nominal baseline. For CPPD grading, the triangular formulation achieved the highest Quadratic Weighted Kappa (QWK) and the lowest Mean Absolute Error (MAE) (QWK = 0.796; MAE = 0.438), while the beta formulation yielded the most balanced class-wise performance considering Average MAE (AMAE) and Maximum MAE (MMAE) across classes (AMAE = 0.458; MMAE = 0.573). For KL grading, the beta-based approach provided the best overall performance, achieving the highest QWK together with the lowest MAE and class-wise errors (QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775). Statistical analysis demonstrated significant improvements over conventional one-hot supervision (p < 0.001).
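
To make the idea of a unimodal soft label concrete, the sketch below builds a triangular target centred on the annotated grade. The linear decay and the `width` parameter are simplifying assumptions chosen for illustration; the paper's exact triangular, binomial, beta, and exponential parametrizations are not given in the abstract.

```python
def triangular_soft_label(grade, num_classes, width=2.0):
    """Unimodal target peaked at `grade`, decaying linearly with distance.

    `width` (hypothetical parameter) is the distance at which the weight
    reaches zero; the weights are normalized to sum to one.
    """
    if not 0 <= grade < num_classes:
        raise ValueError("grade out of range")
    weights = [max(0.0, 1.0 - abs(k - grade) / width) for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


# Five ordinal grades (0 to 4), annotated grade 2.
target = triangular_soft_label(grade=2, num_classes=5)
# target is [0.0, 0.25, 0.5, 0.25, 0.0]; a one-hot label would be [0, 0, 1, 0, 0]
```

Training against such a target with cross-entropy penalizes a prediction of grade 1 less than a prediction of grade 4, which is how soft labels encode the ordinal structure that one-hot targets ignore.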

2605.28174 2026-05-28 cs.CV cs.AI

FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

Jorge L. Rodriguez, Victor Angulo Morales, Areej Alwahas, Mariana Elias Lara, Fida Mohammad Thoker, Kasper Johansen, Bernard Ghanem, Fernando T. Maestre, Matthew F. McCabe

Affiliations: Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology; Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology

AI summary: Proposes FLORO, a multimodal geospatial foundation model pretrained with masked autoencoding on heterogeneous remote sensing data, which uses availability-aware inputs to unify heterogeneous sensor configurations and achieves strong transfer performance on the PANGAEA benchmark.

Comments 29 pages, 9 figures

Abstract

Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations, limiting their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images, remained competitive on scene classification, and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.

2605.28173 2026-05-28 cs.CV

MangaFlow: An End-to-End Agentic Framework for Controllable Story to Manga Generation

Muyao Wang, Zeke Xie, Yanhao Chen, Lixin Xiu, Hideki Nakayama

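Affiliations: The University of Tokyo; Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes MangaFlow, an agentic framework that decomposes manga creation into planning, grounding, layout construction, reference-conditioned rendering, composition, and text placement for controllable long-form manga generation. It treats layout and visual references as explicit intermediate variables and introduces a story section memory to maintain cross-panel consistency.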

Abstract

End-to-end manga generation is a structured visual storytelling task that requires story decomposition, recurring character and scene grounding, page layout design, panel rendering, page composition, and lettering. However, existing generative models often perform direct page synthesis, entangling these factors in a single visual output and limiting precise control over layout geometry, visual references, and cross-panel consistency. To address these limitations, we propose MangaFlow, an agentic framework for controllable long-form manga generation that decomposes manga creation into planning, grounding, layout construction, reference-conditioned rendering, composition, and text placement. By treating layout and visual references as explicit intermediate variables, MangaFlow enables both simple text-to-manga generation and more precise user-controlled manga creation. This design exposes layout, visual assets, and lettering as editable intermediate controls for refining panel geometry, references, and text placement. To support long-form consistency, MangaFlow introduces a story section memory that links section descriptions with corresponding character, scene, and object references for reuse across panels. We further present a meta-benchmark for evaluating layout controllability, visual consistency, and generation quality. Experiments show that MangaFlow improves layout adherence and cross-panel consistency over direct generation baselines while supporting flexible human control.

2605.28172 2026-05-28 cs.RO

Provably Guaranteed Polytopic Uncertainty Quantification for SLAM

Guangyang Zeng, Yulong Gao, Yuan Shen, Lingpeng Chen, Haoying Li, Guodong Shi, Junfeng Wu

Affiliations: School of Data Science, The Chinese University of Hong Kong, Shenzhen; School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen; Department of Electrical and Electronic Engineering, Imperial College London; School of Aerospace, Mechanical and Mechatronic Engineering, The University of Sydney

AI summary: Proposes uncertainty quantification algorithms based on polytopic representations that, through three modules (forward mapping, backward pose tracking, and pose compound), provide provable deterministic guarantees for 3D-3D landmark-based SLAM, and incorporates conformal prediction to improve practicality.

Comments 16 pages, 10 figures; accepted by Robotics: Science and Systems 2026

Abstract

In safety-critical robotics applications, guaranteed and practical uncertainty quantification (UQ) in perception is vital. Many existing works either offer no formal containment guarantee, rely on restrictive modeling assumptions, or focus only on pose estimation rather than a complete SLAM pipeline. This paper presents provably guaranteed UQ algorithms for 3D-3D landmark-based SLAM. The algorithms consist of three basic UQ modules: forward UQ for mapping, backward UQ for pose tracking, and pose compound. Each module produces a certified uncertainty set; when the input uncertainty bounds are deterministic, the output sets inherit deterministic guarantees, i.e., they provably contain the true poses and landmarks. Specifically, we use polytopes to represent uncertainty sets, enabling tractable computations and a unified treatment of pose uncertainty. To enhance algorithms' practical usability, we incorporate conformal prediction to calibrate measurement uncertainty from data with prescribed probability. Simulations and experiments demonstrate that the proposed algorithms provide both strong theoretical guarantees and practical usability. The code is open-sourced at https://github.com/LIAS-CUHKSZ/Polytopic-SLAM-Uncertainty-Quantification.
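
The conformal calibration step mentioned in this abstract can be illustrated with standard split conformal prediction: given measurement error magnitudes from a calibration set, pick a bound that a new exchangeable error stays below with probability at least 1 - alpha. This is a generic sketch, not the authors' code; the function name and the toy errors are hypothetical, and how the resulting bound is turned into a polytopic uncertainty set is not described in the abstract.

```python
import math


def conformal_bound(calibration_errors, alpha):
    """Split-conformal bound on a new error, valid with probability >= 1 - alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration error,
    which is the standard finite-sample-corrected quantile.
    """
    n = len(calibration_errors)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        raise ValueError("too few calibration samples for this alpha")
    return sorted(calibration_errors)[rank - 1]


# Hypothetical calibration set: norms of landmark measurement errors (metres).
errors = [0.02, 0.05, 0.01, 0.04, 0.03, 0.06, 0.08, 0.07, 0.09]
bound = conformal_bound(errors, alpha=0.25)  # rank = ceil(10 * 0.75) = 8
```

The guarantee is marginal and relies on exchangeability between calibration and test measurements, which is why the abstract speaks of a prescribed probability for this step and of deterministic guarantees only when the input bounds are themselves deterministic.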

2605.28170 2026-05-28 cs.AI

Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

Seongjun Lee, Suwan Yoon, Changhee Lee

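Affiliations: Department of Artificial Intelligence, Korea University

AI summary: Proposes ShaQ, a framework that uses Shapley values to model ambiguous spans in the input as players in a cooperative game and quantifies each span's contribution to input uncertainty as a weighted average of marginal reductions in conditional entropy, achieving state-of-the-art ambiguity detection on AmbigQA and AmbiEnt and demonstrating utility on MediTOD.

Comments Code is available at https://anonymous.4open.science/r/ShaQ-0E39/README.md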

Abstract

As large language models (LLMs) are increasingly integrated into high-stakes decision-making, the ability to reliably quantify uncertainty has become a critical requirement for safety and trust. However, current uncertainty quantification methods primarily operate at the output level, often failing to distinguish whether uncertainty arises from the model's lack of knowledge or from ambiguity in the user's input. While input-centric uncertainty quantification has recently emerged as a promising direction, it remains relatively underexplored and typically relies on coarse, input-level information. Consequently, users are provided with scalar uncertainty scores that offer little actionable guidance on which parts of the input should be clarified to improve reliability. To address this limitation, we propose Shapley-based input uncertainty Quantification (ShaQ), a framework for span-level attribution of input-induced uncertainty. Our approach models ambiguous spans in the input as players in a cooperative game and quantifies their contributions using Shapley values, defined via the weighted average of marginal reductions in conditional entropy obtained by clarifying each span coalition. Unlike existing input-level approaches, our formulation captures complex interactions among spans and provides a principled decomposition in which individual attributions sum exactly to the total input-induced uncertainty. We evaluate ShaQ on the AmbigQA and AmbiEnt benchmarks, where it achieves state-of-the-art performance in ambiguity detection. We further demonstrate its utility on MediTOD, showing that ShaQ can localize under-specified clinical utterances and facilitate human-AI collaboration in high-stakes settings. Overall, ShaQ improves uncertainty estimation and provides actionable insights for targeted input clarification.
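
The Shapley attribution described in this abstract can be sketched with an exact computation over a small game. In the sketch, the value of a coalition is the entropy reduction obtained by clarifying that set of spans; the span names and the numbers in the table are hypothetical, and in practice the value function would come from model queries rather than a lookup table.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value(frozenset) -> float`."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


# Hypothetical entropy reductions (bits) from clarifying each set of spans.
reduction = {
    frozenset(): 0.0,
    frozenset({"it"}): 0.4,
    frozenset({"bank"}): 1.0,
    frozenset({"it", "bank"}): 1.6,
}
attributions = shapley_values(["it", "bank"], reduction.__getitem__)
# attributions: "it" gets 0.5 and "bank" gets 1.1, which sum to 1.6
```

The attributions sum to the value of clarifying every span, which is the efficiency property behind the abstract's claim that individual attributions add up to the total input-induced uncertainty. Exact enumeration is exponential in the number of spans, so larger inputs would need sampling or another approximation.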

2605.28168 2026-05-28 cs.AI

OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings

Shadmehr Zaregarizi, Khashayar Yavari

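Affiliations: Politecnico di Torino

AI summary: Proposes OccuReward, a framework in which a large language model iteratively shapes the reward function using feedback from the Comfort Equity Index (CEI), improving comfort equity across demographic groups in CityLearn v2 while reducing energy costs.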

Comments 4 pages, 2 figures. Accepted at OccuSys 2026, co-located with ACM Sustainability Week 2026. Preprint version

Abstract

Large language models (LLMs) have demonstrated promising capability in generating reward functions for deep reinforcement learning (DRL)-based building energy management. However, their potential to exhibit or exacerbate disparities in occupant comfort across heterogeneous demographic populations remains unexplored. We present OccuReward, a framework investigating how LLM-mediated reward design affects demographic equity. Our contribution is three-fold: the introduction of the Comfort Equity Index (CEI) as a novel feedback signal; a methodology for iterative, equity-aware LLM reward shaping; and a performance analysis of DRL agents under these refined objectives. Utilizing four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes), we deploy a Soft Actor-Critic agent in CityLearn v2. Our approach employs the Gemini API to generate reward function logic and weights--rather than performing per-step inference--across three refinement rounds. Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds. By Round 3, equity-aware LLM refinement activates specific reward components that improve satisfaction for Young Males (+17.6%), Mid-aged Females (+28.2%), Health Sensitive (+53.8%), and Elderly Females (+567%), while simultaneously reducing energy costs by 3.2%. Our findings highlight that while reward-level intervention significantly improves equity, demographic disparities in AI-driven controllers persist, necessitating further research into algorithmic fairness in building systems.

2605.28167 2026-05-28 cs.CV

DebFilter: Eradicating Biases Stashed in Value

Seung Hyuk Lee, Songkuk Kim

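Affiliations: School of Integrated Technology, BK21 Graduate Program in Intelligent Semiconductor Technology

AI summary: Proposes DebFilter, a lightweight, training-free method that corrects social biases in text-to-image diffusion models by adjusting the value components in cross-attention, mitigating bias at inference time.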

Comments 8 pages, 7 figures, supplementary material included, CVPR 2026

Abstract

Text-to-image diffusion models, which are theoretically equivalent to score-based generative models, generate images through a multi-step denoising process guided by text embeddings extracted from pretrained vision-language models such as CLIP. However, these text embeddings inherently encode social and semantic biases -- such as those related to gender and age -- that are subsequently propagated and amplified through the guidance mechanism, along with the model's training on large-scale datasets that are imbalanced with respect to these bias-related concepts, often leading to skewed outputs in text-to-image generation. We propose DebFilter, a lightweight and training-free framework for mitigating such biases in text-to-image diffusion models. Observing that the model's error prediction at each denoising step is primarily influenced by cross-attention dynamics, we introduce a bias-correction strategy that adjusts the value components within cross-attention. Specifically, we apply a fixed offset to the slice of guidance embedding, effectively steering the semantic direction of cross-attention values toward unbiased representations. This adjustment reconfigures the score landscape to produce balanced outputs while maintaining alignment with the intended text semantics. Unlike prior approaches that rely on fine-tuning or retraining, DebFilter operates entirely at inference time, requiring no additional data or model updates. Our results demonstrate that this method effectively mitigates social biases in generated images, offering an efficient and scalable pathway toward fairer and more inclusive text-to-image generation.

2605.28165 2026-05-28 cs.LG

Unification and Optimization of Robust Supervised Learning

鲁棒监督学习的统一与优化

Jonas Hanselle, Valentin Margraf, Clemens Damke, Eyke Hüllermeier

发表机构 * LMU Munich, MCML(慕尼黑大学,MCML)

AI总结 提出一个统一框架,将分布鲁棒优化、标签平滑、邻域风险最小化和Mixup等鲁棒学习方法组织为三个设计轴,并通过联合超参数优化自动组合适合任务的鲁棒策略。

详情
AI中文摘要

文献中提出了各种经验风险最小化的鲁棒替代方案,以应对分布偏移、标签噪声和有限样本退化等故障模式。例如分布鲁棒优化、标签平滑、邻域风险最小化和Mixup。然而,这些方法通常是孤立开发的,迫使从业者事先承诺单一故障模式,即使任务的主要模式尚不清楚。为了解决这个问题,我们将现有的一大类方法沿着三个共同的设计轴组织起来,并推导出一个可行的训练程序,将鲁棒学习分解为顺序阶段(参考分布丰富化、输入空间扰动、标签空间扰动和样本级聚合),每个阶段都有立场选择(悲观、中性或乐观)。这产生了一个统一的设计空间,其中联合超参数优化可以组合和配置适合手头任务的鲁棒策略。在表格、图像和奖励建模基准测试中,联合超参数优化与每种设置中最佳单方法基线具有竞争力,为那些事先不知道其任务中哪种故障模式占主导地位的从业者提供了可靠的默认选择。

英文摘要

The literature has proposed various robust alternatives to empirical risk minimisation to address failure modes such as distribution shift, label noise and finite-sample degeneracies. Examples include distributionally robust optimization, label smoothing, vicinal risk minimization, and Mixup. However, such approaches are typically developed in isolation, forcing practitioners to commit a priori to a single failure mode even when the dominant mode for the task is unclear. To address this, we organize a broad class of existing methods along three common design axes and derive a tractable training procedure that decomposes robust learning into sequential stages (reference distribution enrichment, input-space perturbation, label-space perturbation, and sample-level aggregation), each with a choice of stance (pessimistic, neutral, or optimistic). This results in a unified design space in which joint hyperparameter optimization can compose and configure robustness strategies suited to the task at hand. Across tabular, image, and reward modeling benchmarks, joint hyperparameter optimization is competitive with the best single-method baseline in each setting, offering a reliable default for practitioners who do not know a priori which failure mode dominates their task.
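As a rough illustration of how such stages could compose, the sketch below chains an input-space perturbation (Mixup), a label-space perturbation (label smoothing), and a sample-level aggregation with a stance. It is a minimal sketch and not the authors' training procedure: the function names, the Beta(alpha, alpha) mixing coefficient, and the mapping of the pessimistic/neutral/optimistic stances to max/mean/min are all assumptions made for illustration.

```python
import random


def mixup(x1, y1, x2, y2, lam):
    """Input-space perturbation: convex combination of two samples and their targets."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def smooth_labels(y, eps):
    """Label-space perturbation: blend the target with the uniform distribution."""
    k = len(y)
    return [(1.0 - eps) * p + eps / k for p in y]


def aggregate(losses, stance="neutral"):
    """Sample-level aggregation; the stance-to-aggregator mapping is hypothetical."""
    if stance == "pessimistic":
        return max(losses)  # focus on the worst-off sample
    if stance == "optimistic":
        return min(losses)  # focus on the best-off sample
    return sum(losses) / len(losses)  # neutral: plain empirical mean


def perturb_pair(x1, y1, x2, y2, eps=0.1, alpha=0.4, rng=None):
    """Apply Mixup, then label smoothing, to one pair of samples."""
    rng = rng or random.Random(0)
    lam = rng.betavariate(alpha, alpha)  # assumed Beta(alpha, alpha) mixing weight
    x, y = mixup(x1, y1, x2, y2, lam)
    return x, smooth_labels(y, eps)
```

In a joint hyperparameter search, `eps`, `alpha`, and `stance` would be tuned together rather than one method being fixed in advance.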

2605.28163 2026-05-28 cs.CL cs.AI

DEPART: DEcomposing PARiTy across Multilingual LLMs

DEPART: 跨多语言大模型的性能差异分解

Manan Uppadhyay, Prashant Kodali, Pranjal Chitale, Reshma Ramaprasad, Himanshu Beniwal, Sunayana Sitaram

发表机构 * Microsoft Research India(微软印度研究院)

AI总结 提出DEPART框架,通过贝叶斯分层模型分解多语言大模型性能差异,发现语言特征解释79%-92%的方差,且模型内部表示与英语的相似性是主要预测因子。

详情
AI中文摘要

多语言大模型(mLLMs)排行榜报告每种语言的准确率,但很少解释为何出现差异,导致系统性偏差未被归因,且从业者无法采取可操作的杠杆。我们首先通过无分布Friedman和Kruskal-Wallis检验确定这些差距是系统性的而非抽样噪声的产物,然后引入一个两步贝叶斯分层框架,将多语言性能方差分解为可解释的组成部分。首先,隔离语言身份归因的方差,我们表明可观察的语言特征(文字、语系、类型学距离)在理解任务上解释了$R^2_{\text{ling}} = 79\%$的方差,在推理任务上解释了$92\%$,而模型内部表示与英语的相似性成为两个任务桶中的主导预测因子。其次,分解完整的(模型×基准×语言)立方体,我们发现NLU和推理具有根本不同的方差分布:模型身份主导理解(方差的66.7%),而基准×模型交互主导推理(46.3%)。这些结果共同将多语言评估从被动的性能映射重塑为一个可解释的诊断框架,并提供针对语言差异根本驱动因素的具体杠杆。

英文摘要

Leaderboards for multilingual large language models (mLLMs) report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish, via distribution-free Friedman and Kruskal--Wallis tests, that these gaps are systematic rather than artifacts of sampling noise, and then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together, these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.
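The Friedman test mentioned above can be sketched in a few lines. The version below is an illustrative, stdlib-only implementation of the classical statistic without tie correction. Treating each row as one (model, benchmark) block and each column as one language is an assumption about how the data would be arranged, not something the abstract specifies. In practice a library routine such as `scipy.stats.friedmanchisquare` would normally be used instead.

```python
def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of the 1-based positions i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square statistic for an n-blocks by k-treatments table."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

A large statistic (compared against a chi-square distribution with k - 1 degrees of freedom) indicates that the columns are ranked consistently across blocks, which is the sense in which per-language gaps would be "systematic".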

2605.28161 2026-05-28 cs.CV

MeniOmni: A Structured Multimodal Benchmark for Holistic Meniscus Injury Assessment

MeniOmni:用于整体半月板损伤评估的结构化多模态基准

Shurui Xu, Siqi Yang, Weiping Ding, Hui Wang, Mengzhen Fan, Yuyu Sun, Shuyan Li

发表机构 * 1 School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, UK; 2 Radiology Department, Affiliated Nantong Clinical College of Nantong University, Nantong First People's Hospital, School of Clinical Medicine, Nantong University, Nantong, Jiangsu, China; 3 School of Artificial Intelligence and Computer Science, Nantong University, Nantong, China; 4 Faculty of Data Science, City University of Macau, Macau, China; 5 Department of Chemistry, University of Oxford, Oxford, UK; 6 Orthopedics Department, Nantong First People's Hospital, Southeast University, Nantong, Jiangsu, China

AI总结 提出MeniOmni基准,包含多中心MRI、临床先验和专家标注文本,支持细粒度Stoller分级和诊断报告生成,并引入风险感知序数评估和语义一致性指标Meni-Score。

Comments Accepted by IEEE International Conference on Multimedia and Expo (ICME) 2026 (Oral Presentation)

详情
AI中文摘要

半月板损伤的临床诊断需要放射科医生将体积MRI证据与患者背景(如性别、年龄、BMI)相结合,并生成结构化诊断报告。现有的膝关节MRI基准通常是单模态的,依赖粗粒度标签,限制了评估整体临床推理的能力。我们提出了MeniOmni,一个用于半月板损伤评估的结构化多模态基准,包含746个多中心MRI研究,具有三平面体积输入、临床先验和专家标注的临床文本。MeniOmni支持两个任务:(1)细粒度Stoller严重程度分级和(2)诊断报告生成。我们进一步提出了风险感知序数评估和语义一致性指标(Meni-Score),以更好地反映临床相关性。基线实验表明,纳入临床先验可提高分级性能并减少严重错误,凸显了多模态上下文对更安全评估的价值。代码和数据可在https://github.com/ShuruiXu/MeniOmni获取。

英文摘要

Clinical diagnosis of meniscus injuries requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports. Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, limiting their ability to evaluate holistic clinical reasoning. We introduce MeniOmni, a structured multimodal benchmark for meniscus injury assessment, consisting of 746 multi-center MRI studies with tri-planar volumetric inputs, Clinical Priors, and expert-annotated clinical text. MeniOmni supports two tasks: (1) fine-grained Stoller severity grading and (2) diagnostic report generation. We further propose risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) to better reflect clinical relevance. Baseline experiments show that incorporating Clinical Priors improves grading performance and reduces severe errors, highlighting the value of multimodal context for safer assessment. Code and data are available at https://github.com/ShuruiXu/MeniOmni.

2605.28160 2026-05-28 cs.AI

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

按需查看:多模态推理中视觉证据获取的认知调度框架

Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun, Yidong Chen, Jiayi Ji, Qian Chen, Rongrong Ji

发表机构 * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China(多媒体可信感知与高效计算教育部重点实验室,厦门大学,361005,中国) Sino-Russian Research Center for Digital Economy(中俄数字经济研究中心) Institute of Artificial Intelligence, Xiamen University, China(人工智能研究院,厦门大学,中国) School of Informatics, Xiamen University, China(信息学院,厦门大学,中国) School of Information Engineering, Xiamen Ocean Vocational College, Xiamen 361102, China(信息工程学院,厦门海洋职业技术学院,厦门361102,中国)

AI总结 提出CSMR框架,通过语言模型控制何时调用独立视觉感知模块获取任务相关视觉证据,在零样本设置下多个基准上优于基线方法。

Comments Accepted at ICML 2026

详情
AI中文摘要

现有的多模态推理方法主要遵循两种范式:在推理前将视觉输入转换为文本,或在统一的视觉-语言表示空间中进行端到端推理。尽管取得了经验上的进展,但两种范式都存在根本性的结构限制。前者依赖于静态的视觉到文本转换,往往会压缩并丢失细粒度的视觉细节。后者容易受到联合优化和注意力机制引起的语言主导,导致推理过程中对视觉证据的忠实性系统性减弱。在这项工作中,我们认为核心挑战在于视觉证据如何以及何时被引入推理过程。受此启发,我们提出了CSMR,一种多模态推理框架,其中语言模型通过决定何时调用独立的视觉感知模块来获取任务相关的视觉证据,从而控制推理过程。在多个多模态推理基准上的实验表明,在零样本设置下,CSMR在准确性上始终优于代表性基线方法。进一步的实验分析证实,这些优势主要源于所提出的认知调度机制。

英文摘要

Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.

2605.28158 2026-05-28 cs.AI

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

OR-Space:面向工业优化智能体的全生命周期工作空间基准

Chenyu Zhou, Xinyun Lu, Jiangyue Zhao, Jianghao Lin, Dongdong Ge, Yinyu Ye

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai University of Finance and Economics(上海财经大学) Stanford University(斯坦福大学)

AI总结 提出OR-Space基准,通过构建、修订和解释三种任务模式,评估大语言模型智能体在工业优化工作流中的可靠性。

Comments 34 pages, 8 figures

详情
AI中文摘要

大语言模型(LLM)智能体越来越多地被用于辅助运筹学(OR)建模,然而现有的面向OR的基准通常将评估简化为从自包含的问题陈述到数学公式或求解器程序的一次性翻译。这种设置忽略了实际工业OR工作流的两个特征:持久的多工件工作空间和多阶段任务生命周期。我们引入了OR-Space,一个全生命周期的工作空间基准,用于评估工业优化智能体在模型构建、模型修订和基于解释的任务中的表现。每个实例都是一个可执行的工作空间,包含业务文档、结构化数据、可选的代码工件、求解器输出以及分布在相互依赖文件中的任务特定评估器。OR-Space定义了三种任务模式:构建模式,智能体从异构工件构建可求解的优化模型;修订模式,智能体在需求变化或求解器反馈下修改现有模型,同时保留有效的先前逻辑;解释模式,智能体利用工作空间工件中的证据回答关于解决方案、约束和业务影响的基于解释的问题。通过将持久工作空间与生命周期导向的任务相结合,OR-Space评估智能体是否能够执行超越端到端文本生成的可靠优化工作。我们描述了基准设计、评估协议和质量控制流程,并将OR-Space定位为研究LLM智能体在工业OR工作流中的可靠性、失败模式和实际准备程度的基准。

英文摘要

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

2605.28157 2026-05-28 cs.CV

Intra-YOLO: A Small Object Detection Model for Caries and Molar-Incisor Hypomineralization in Intraoral Photography Based on Transfer Learning with Reinforcement Learning

Intra-YOLO:基于迁移学习与强化学习的口内摄影龋齿与磨牙-切牙矿化不良小目标检测模型

Po-Lun Chwang, Po-Yu Chang, Wen-Liang Lin, Tung-Sheng Wu, Min-Ching Wang, Yun-Chien Cheng

发表机构 * Department of Mechanical Engineering, College of Engineering, National Yang Ming Chiao Tung University(国立阳明交通大学工学院机械工程系) Taipei Medical University Hospital(台北医学大学医院) Wan Fang Hospital, Taipei Medical University(台北医学大学万芳医院)

AI总结 提出Intra-YOLO模型,结合迁移学习与强化学习,解决口内照片中龋齿和MIH小目标检测难题。

详情
AI中文摘要

本研究开发了一种计算机辅助诊断(CAD)系统,用于检测口内照片中的龋齿和磨牙-切牙矿化不良(MIH)。这些病变外观相似,使得临床鉴别具有挑战性,尤其是考虑到它们尺寸小且成像条件多变。

英文摘要

This study developed a computer-aided diagnosis (CAD) system for detecting caries and molar-incisor hypomineralization (MIH) in intraoral photographs. These lesions share similar appearances, making clinical differentiation challenging, especially given their small size and variability in imaging conditions.

2605.28155 2026-05-28 cs.LG cs.NI

Temporal Hyperbolic Graph Representation Learning for Scale-Free Internet Routing and Delay Prediction

面向无标度互联网路由与延迟预测的时间双曲图表示学习

Yi-Ling Kuo, Hao-Yu Tien, Shih-Yu Tsai

发表机构 * Department of Information Management and Finance, National Yang Ming Chiao Tung University(信息管理与金融系,国立阳明交通大学)

AI总结 提出HERMIT框架,结合双曲流形保持的时间图神经网络与随机森林回归器,利用双曲几何建模互联网路由图的层次和无标度结构,实现链路预测与RTT预测,在大规模真实数据集上优于基线模型。

详情
AI中文摘要

预测互联网往返时间(RTT)对于路由优化、服务质量(QoS)保障和流量工程至关重要,但由于长期时间依赖、动态路由演变和重尾延迟分布,仍然具有挑战性。虽然时间图神经网络(TGNN)可以建模不断演变的网络拓扑,但大多数现有方法在欧几里得空间中运行,难以捕捉互联网路由图的层次和无标度结构。双曲几何提供了更合适的表示空间。我们提出HERMIT(通过集成拓扑的双曲边缘感知RTT建模),这是一个混合框架,结合了双曲流形保持的时间GNN与随机森林回归器,用于联合链路预测和RTT预测。HERMIT基于HMPTGN,引入了RTT感知的边缘特征和可学习的边缘编码器,以改进对不断演变的链路状态和路由行为的建模。得到的双曲节点表示与历史RTT统计相结合,用于鲁棒的延迟预测。我们在2015-2024年的大规模真实互联网数据集上评估HERMIT。HERMIT始终优于仅使用历史RTT统计的强随机森林基线,RMSE改进6%,同时减少了重尾样本上的大误差。在链路预测性能上,它也超越了先前的双曲TGNN模型,包括HMPTGN和HTGN。这些结果表明,将双曲时间图学习与基于树的回归相结合,为真实世界互联网拓扑中的RTT预测提供了可扩展的解决方案。

英文摘要

Predicting Internet round-trip time (RTT) is critical for routing optimization, quality-of-service (QoS) provisioning, and traffic engineering, yet remains challenging due to long-term temporal dependencies, evolving routing dynamics, and heavy-tailed latency distributions. While Temporal Graph Neural Networks (TGNNs) can model evolving network topologies, most existing approaches operate in Euclidean space, which poorly captures the hierarchical and scale-free structure of Internet routing graphs. Hyperbolic geometry provides a more suitable representation space. We propose HERMIT (Hyperbolic Edge-aware RTT Modeling via Integrated Topology), a hybrid framework combining a hyperbolic manifold-preserving temporal GNN with a Random Forest regressor for joint link prediction and RTT prediction. Built on HMPTGN, HERMIT introduces RTT-aware edge features and a learnable edge encoder to improve modeling of evolving link states and routing behavior. The resulting hyperbolic node representations are combined with historical RTT statistics for robust latency prediction. We evaluate HERMIT on a large-scale real Internet dataset spanning 2015-2024. HERMIT consistently outperforms a strong Random Forest baseline using only historical RTT statistics, achieving a 6% RMSE improvement while reducing large errors on heavy-tailed samples. It also surpasses prior hyperbolic TGNN models, including HMPTGN and HTGN, in link prediction performance. These results demonstrate that combining hyperbolic temporal graph learning with tree-based regression provides a scalable solution for RTT prediction in real-world Internet topologies.
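To make the role of hyperbolic geometry concrete, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. This is an illustration only: the abstract does not state which hyperbolic model HERMIT or HMPTGN uses, so the choice of the Poincaré ball and the function name are assumptions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    def sq_norm(w):
        return sum(c * c for c in w)

    if sq_norm(u) >= 1.0 or sq_norm(v) >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

Distances grow rapidly as points approach the boundary of the ball, so the available volume expands exponentially with radius. This is why tree-like, scale-free graphs tend to embed with lower distortion here than in Euclidean space.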

2605.28151 2026-05-28 cs.CV

A novel ordinal multi-view aggregation scheme for oak defoliation

一种用于橡树落叶的新型有序多视图聚合方案

Francisco Bérchez-Moreno, Ricardo Enrique Hernández-Lambraño, David Guijo-Rubio, Víctor Manuel Vargas, Francisco José Ruiz-Gómez, Juan Carlos Fernández, Pablo González-Moreno

发表机构 * Department of Forest Engineering, Laboratory of Dendrochronology, Silviculture and Global Change – DendrodatLab, Universidad de Córdoba(森林工程系、树轮学实验室、林学与全球变化——DendrodatLab,科尔多瓦大学) ERSAF. Andalusian Institute for Earth System Research (IISTA), Universidad de Córdoba(安达卢西亚地球系统研究所(IISTA)、科尔多瓦大学)

AI总结 提出一种基于有序分类的多视图集成框架,通过聚合从不同视角(北、南、树冠)训练的CNN预测,实现更稳健准确的橡树落叶估计。

详情
AI中文摘要

由气候和生物胁迫驱动的森林衰退威胁着生态系统功能,使得准确监测树木健康至关重要。在这项工作中,我们将树木落叶估计视为一个有序分类问题,使用地面图像。我们提出了一种新颖的多视图集成框架,该框架聚合了从不同视角(北、南和树冠)训练的卷积神经网络(CNN)的预测。该方法通过同质集成设计利用互补的视觉信息,同时保持建模一致性。通过比较多种有序分类方法并分析每个视图及其组合的贡献,进行了全面评估。结果表明,对落叶水平的有序结构进行建模比名义方法提高了性能,而所提出的多视图集成始终优于单视图和成对配置。特别是,三视图集成在所有评估指标上实现了最稳健和准确的预测。这些发现凸显了结合深度学习(DL)、有序分类(OC)和多视图聚合在地中海牧场等复杂生态系统中进行可扩展、一致和客观的森林健康评估的潜力。

英文摘要

Forest decline driven by climate and biotic stressors threatens ecosystem functioning, making accurate monitoring of tree health essential. In this work, we address tree defoliation estimation as an ordinal classification problem using ground-level imagery. We propose a novel multi-view ensemble framework that aggregates predictions from Convolutional Neural Networks (CNNs) trained on different perspectives of individual trees (north, south, and crown). This approach leverages complementary visual information while preserving modelling consistency through a homogeneous ensemble design. A comprehensive evaluation is conducted by comparing multiple ordinal classification methods and analysing the contribution of each view and their combinations. Results show that modelling the ordinal structure of defoliation levels improves performance over nominal approaches, while the proposed multi-view ensemble consistently outperforms single-view and pairwise configurations. In particular, the three-view ensemble achieves the most robust and accurate predictions across all evaluation metrics. These findings highlight the potential of combining Deep Learning (DL), Ordinal Classification (OC), and multi-view aggregation for scalable, consistent, and objective forest health assessment in complex ecosystems such as Mediterranean dehesas.
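One simple way to aggregate per-view predictions is sketched below, assuming each view-specific CNN outputs a probability vector over ordered defoliation classes. Averaging the vectors and taking the median class of the result is an illustrative choice, not necessarily the aggregation scheme proposed in the paper. The class count and the example probabilities in the usage note are made up.

```python
def aggregate_views(view_probs):
    """Average the class-probability vectors predicted from each view."""
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    return [sum(p[c] for p in view_probs) / n_views for c in range(n_classes)]


def ordinal_decision(probs):
    """Return the median class: the first class whose cumulative probability reaches 0.5.

    The median minimises the expected absolute error, so it respects class order.
    """
    cumulative = 0.0
    for label, p in enumerate(probs):
        cumulative += p
        if cumulative >= 0.5:
            return label
    return len(probs) - 1  # guard against probabilities summing to slightly below 1


def mean_absolute_error(y_true, y_pred):
    """Ordinal-aware metric: average distance, in classes, between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, with hypothetical north, south, and crown predictions `[0.6, 0.3, 0.1, 0.0]`, `[0.1, 0.5, 0.3, 0.1]`, and `[0.2, 0.4, 0.3, 0.1]`, the north view alone would pick class 0, while the three-view average selects class 1.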

2605.28150 2026-05-28 cs.LG

Off-Policy Learning to Reason Works Because It Is More Pessimistic Than You Think

离策略学习推理之所以有效是因为它比你想象的更悲观

Otmane Sakhi, Aleksei Arzhantsev, Imad Aouali, Flavian Vasile

发表机构 * Criteo AI Lab(Criteo人工智能实验室)

AI总结 本文通过隐式悲观主义解释离策略强化学习目标的有效性,并提出稳定诱导分布的改进方法。

详情
AI中文摘要

大规模强化学习已成为改进大型语言模型推理能力的核心工具。在此规模下,生成往往滞后或异步,因此更新是在旧策略收集的数据上进行的。这使得学习本质上是离策略的。然而,大多数现有方法仍根植于PPO风格的信任区域目标,将训练视为近似在策略,并使用重要性权重来纠正分布不匹配。这些修正可能引入高方差,破坏优化稳定性,并加速熵崩溃。最近的研究提出了一种替代方案:与其纠正不匹配,不如接受离策略数据并移除重要性权重,这通常能产生更强的算法。在本文中,我们提供了一种直观的离策略目标构建方法,包括成功的离策略目标,并表明其有效性可以通过隐式悲观主义来理解:它们优化的目标策略比名义目标所暗示的更保守。这一视角解释了为什么某些特定的实现选择能提高稳定性:它们隐式地控制了有效目标分布。然后,我们提出了一种原则性的改进,以稳定这种诱导分布并改善离策略学习。

英文摘要

Large-scale reinforcement learning has become a central tool for improving reasoning in large language models. At this scale, generation is often lagged or asynchronous, so updates are performed on data collected by older policies. This makes learning inherently off-policy. Most existing approaches nevertheless remain rooted in PPO-style trust-region objectives, treating training as approximately on-policy and using importance weights to correct distribution mismatch. These corrections can introduce high variance, destabilize optimization, and accelerate entropy collapse. Recent work suggests an alternative: rather than correcting the mismatch, one can embrace off-policy data and remove importance weights, often yielding stronger algorithms. In this paper, we provide an intuitive construction of off-policy objectives that includes successful existing off-policy objectives, and show that their effectiveness can be understood through implicit pessimism: they optimize toward target policies that are more conservative than their nominal objectives suggest. This perspective explains why some particular implementation choices improve stability: they implicitly control the effective target distribution. We then propose a principled modification that stabilizes this induced distribution and improves off-policy learning.
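The contrast between the two families of objectives can be sketched per sample as follows. This is a minimal illustration under stated assumptions: the first function is the standard PPO-style clipped importance-weighted surrogate, and the second is a plain log-likelihood-times-advantage term with the importance weight removed. The abstract does not give the exact objectives analysed in the paper, so neither function should be read as the authors' formulation.

```python
import math


def clipped_is_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style surrogate: importance ratio clipped to [1 - eps, 1 + eps]."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def unweighted_surrogate(logp_new, advantage):
    """Off-policy surrogate with the importance weight dropped entirely."""
    return logp_new * advantage
```

The ratio in the first function depends on how stale the behaviour policy is, which is where the variance comes from. The second function ignores `logp_old` altogether, so the policy it implicitly optimizes toward differs from the nominal one.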

2605.28149 2026-05-28 cs.LG

Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations

符号感知门控稀疏自编码器:使用Bi-Jump-ReLU激活函数建模反相关特征

Bartosz Wieciech, Zmnako Awrahman, Marcin Czelej, Victor Hugo Jaramillo Velasquez, Wioletta Stobieniecka

发表机构 * Amazon Web Services(亚马逊网络服务)

AI总结 提出符号感知门控稀疏自编码器(SA-GSAE),通过双面门控稀疏性、符号幅度路径和辅助重构,利用Bi-Jump-ReLU激活实现双极性共享,在保持参数效率的同时,在多个LLM激活点上优于标准门控SAE。

详情
AI中文摘要

稀疏自编码器(SAE)从大型语言模型中提取可解释特征,但标准变体强制非负性,迫使对截然相反的概念(例如“压力过高”与“压力过低”)使用单独的潜在变量,并在特征反相关时浪费字典容量。我们提出了符号感知门控SAE(SA-GSAE):双面门控稀疏性,带有符号幅度和辅助监督。极性敏感门控在任一符号上选择支持,符号幅度路径避免L1收缩,辅助重构防止门控崩溃。双极性共享——一个潜在变量沿共享方向编码两种符号——通过新的Bi-Jump-ReLU激活实现;参数核算表明,即使反相关对很少,符号感知也能保持参数效率。在Pythia-1B和SmolLM3-3B(6个单元,3个种子)的三个中层钩点上的真实LLM激活上,宽度为H的半宽SA-GSAE在3/6个单元(两个MLP输出钩点和resid-mid/Pythia-1B)上在整个扫描的L0重叠上严格帕累托支配宽度为2H的全宽门控SAE;在其余3个单元上,R²差距在0.025以内(最大差距-0.008),同时死单元分数绝对降低0.35-0.62。扫描几何平均死单元分数降低在MLP输出单元和Pythia-1B resid上约为100倍-500倍,在注意力单元和SmolLM3-3B resid上约为2倍-4倍。消融实验表明,双面门控和辅助损失是承重的(无辅助时LR降至0.27,98%死单元);绑定r_i^+ = r_i^-与无绑定不可区分(|ΔR²| = 0.0015),我们推荐这种对称变体作为默认。MLP输出的增益来自大多数潜在变量携带两种极性;在注意力上,双极性结构集中在一小部分顶级潜在变量中。全宽SA-GSAE在SmolLM3-3B resid上表现出可复现的重构崩溃,而半宽完全避免了这一点。

英文摘要

Sparse Autoencoders (SAEs) extract interpretable features from Large Language Models, but standard variants enforce non-negativity, forcing separate latents for diametrically opposed concepts (e.g., "pressure too high" vs. "pressure too low") and wasting dictionary capacity when features are anticorrelated. We propose the Sign-Aware Gated SAE (SA-GSAE): two-sided gated sparsity with signed magnitude and auxiliary supervision. A polarity-sensitive gate selects support on either sign, a signed-magnitude path avoids L1 shrinkage, and an auxiliary reconstruction prevents gate collapse. Bipolar sharing - one latent encoding both signs along a shared direction - is realised via a new Bi-Jump-ReLU activation; parameter accounting shows sign-awareness stays parameter-efficient even when anticorrelated pairs are rare. On real LLM activations across three mid-depth hookpoints on Pythia-1B and SmolLM3-3B (6 cells, 3 seeds), a half-width SA-GSAE at width H strictly Pareto-dominates a full-width Gated SAE at 2H over the entire swept L0 overlap on 3 of 6 cells (both MLP-output hookpoints and resid-mid/Pythia-1B); on the remaining 3 it matches R^2 within 0.025 (max gap -0.008) while cutting dead fraction by 0.35-0.62 absolute. Sweep-geomean dead-fraction reductions are ~100x-500x on MLP-output cells and Pythia-1B resid, ~2x-4x on attention cells and SmolLM3-3B resid. Ablations show the two-sided gate and auxiliary loss are load-bearing (no auxiliary collapses LR to 0.27, 98% dead); tying r_i^+ = r_i^- is indistinguishable (|Delta R^2| = 0.0015), and we recommend this symmetric variant as default. MLP-output gains come from most latents carrying both polarities; on attention, bipolar structure concentrates in a small set of top latents. Full-width SA-GSAE exhibits a reproducible reconstruction collapse at SmolLM3-3B resid that the half-width entirely avoids.
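The abstract does not define Bi-Jump-ReLU, so the sketch below is only a guess at its shape. It extends the standard JumpReLU (which passes a pre-activation through unchanged when it exceeds a threshold and outputs zero otherwise) to two sides. The thresholds `theta_pos` and `theta_neg`, and the `decode` helper, are hypothetical names introduced for illustration.

```python
def jump_relu(z, theta):
    """Standard one-sided JumpReLU: keep z only when it exceeds the threshold."""
    return z if z > theta else 0.0


def bi_jump_relu(z, theta_pos, theta_neg):
    """Hypothetical two-sided variant: keep z when it is large in either direction."""
    if z > theta_pos or z < -theta_neg:
        return z
    return 0.0


def decode(latents, directions):
    """Linear decoder: sum of latent activations times their dictionary directions."""
    dim = len(directions[0])
    return [sum(a * d[i] for a, d in zip(latents, directions)) for i in range(dim)]
```

Under this reading, a single latent with direction `d` can reconstruct both `+0.8 * d` and `-0.8 * d`, whereas a non-negative SAE would need two latents with opposite directions to cover the same pair of concepts.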

2605.27351 2026-05-28 cs.CV

Feedforward 3D Editing Learns from Semantic-Part Transformation

前馈3D编辑从语义部分变换中学习

Jiawei Weng, Saining Zhang, Zhenxin Diao, Peishuo Li, Henghaofan Zhang, Junhao Chen, Hao Zhao

发表机构 * Nanyang Technological University(南洋理工大学) Tsinghua University(清华大学)

AI总结 提出Pxform数据集和PartFlow网络,通过语义部分变换实现高质量前馈3D编辑,在几何和外观编辑基准上达到最优性能。

Comments 31 pages, 22 figures. Project Page: https://dennis-jwweng.github.io/pxform/

详情
AI中文摘要

3D编辑是可扩展3D内容创作的基本能力。虽然图像编辑已迅速向大规模前馈生成范式发展,但3D AI生成仍以无需训练的编辑流程为主。前馈3D编辑的核心挑战在于缺乏高质量配对监督。可编辑的3D资产需要同时保持几何、多视图一致性、结构连贯性和局部编辑可控性。现有的3D编辑数据集通常依赖于独立生成的资产、图像介导的重建或狭窄的编辑分类,导致定位不准确、保持性弱、编辑边界模糊和语义一致性有限。在这项工作中,我们引入了一个新视角:可扩展的前馈3D编辑应从语义部分变换中学习。基于这一见解,我们提出了Pxform,一个高质量的3D编辑数据集,包含超过10万对七种编辑类型的一致前后编辑对。我们的流程不是将对象视为无结构形状,而是直接将编辑锚定在语义3D部分。基于Pxform,我们进一步提出了PartFlow,一个前馈3D编辑网络,它将源感知潜在控制注入预训练的3D生成先验中。PartFlow引入了掩码感知速度保持和渲染空间一致性监督,以共同提高编辑保真度和源保持,同时在推理时不需要3D编辑掩码。大量实验表明,高质量的语义部分监督显著改进了可扩展的3D编辑,使PartFlow在几何和外观编辑基准上均达到了最先进的性能。

英文摘要

3D editing is a fundamental capability for scalable 3D content creation. While image editing has rapidly evolved toward large-scale feedforward generative paradigms, 3D AI generation remains dominated by training-free editing pipelines. A central challenge of feedforward 3D editing lies in the lack of high-quality paired supervision. Editable 3D assets require simultaneous preservation of geometry, multi-view consistency, structural coherence, and localized edit controllability. Existing 3D editing datasets often rely on independently generated assets, image-mediated reconstruction or narrow edit taxonomies, leading to inaccurate localization, weak preservation, blurred edit boundaries, and limited semantic consistency. In this work, we introduce a new perspective: scalable feedforward 3D editing should be learned from semantic-part transformations. Based on this insight, we propose Pxform, a high-quality 3D editing dataset with over 100K consistent before/after editing pairs across seven edit types. Instead of treating objects as unstructured shapes, our pipeline grounds edits directly in semantic 3D parts. Built upon Pxform, we further propose PartFlow, a feedforward 3D editing network that injects source-aware latent control into pretrained 3D generative priors. PartFlow introduces mask-aware velocity preservation and render-space consistency supervision to jointly improve edit fidelity and source preservation, while requiring no 3D edit mask during inference. Extensive experiments demonstrate that high-quality semantic-part supervision substantially improves scalable 3D editing, enabling PartFlow to achieve state-of-the-art performance on both geometric and appearance editing benchmarks.

2605.27102 2026-05-28 cs.CV cs.LG

JLT: Clean-Latent Prediction in Latent Diffusion Transformers

JLT: 潜在扩散Transformer中的干净潜在预测

Funing Fu, Tenghui Wang, Guanyu Zhou, Junyong Cen, Qichao Zhu

发表机构 * Independent Researcher(独立研究者) Wuhan University of Technology(武汉理工大学) Hangzhou Jiyi Artificial Intelligence Co., Ltd.(杭州智益人工智能有限公司)

AI总结 本文提出JLT,一种在冻结的FLUX.2 VAE编码上训练的130M潜在扩散Transformer,通过干净潜在预测相比速度预测在ImageNet 256×256上获得更优的FID分数,表明潜在扩散中的预测目标是依赖于表示的几何选择。

详情
AI中文摘要

使用干净数据预测的流匹配表明,回归干净点比预测环境空间中的加噪量更能有效利用低维结构。我们探究在图像被映射到学习到的潜在空间后,这一原则是否仍然有用,因为压缩已经去除了原始像素的大部分变异性。我们引入了JLT,一个在冻结的FLUX.2 VAE编码上的130M潜在扩散Transformer,并在相同的表示、主干和训练设置下,将干净潜在预测与匹配的速度预测DiT进行比较。尽管三个变量x、epsilon和v在固定损坏时间下是线性可转换的,但局部高斯分析表明,速度回归继承了各向同性的目标协方差下限,并放大了低方差潜在方向,而干净预测则抑制了它们。在ImageNet 256×256上,JLT-B/1在无分类器引导下获得了FID-50K 2.50,与速度预测相比有较大的匹配目标差距。这些结果表明,潜在扩散中的预测目标是依赖于表示的几何选择,而不是可互换的代数参数化。

英文摘要

Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256×256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.
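The linear convertibility of x, epsilon, and v at a fixed corruption time can be checked directly. The sketch below assumes the common rectified-flow convention z_t = (1 - t) x + t epsilon with v = epsilon - x. The abstract does not state which convention JLT uses, and some papers reverse the time direction, so the signs here are an assumption.

```python
def corrupt(x, eps, t):
    """Noised latent under the assumed convention z_t = (1 - t) * x + t * eps."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def velocity(x, eps):
    """Velocity target v = eps - x (the time derivative of z_t)."""
    return [ei - xi for xi, ei in zip(x, eps)]


def x_from_v(z_t, v, t):
    """Recover the clean latent from a velocity prediction: x = z_t - t * v."""
    return [zi - t * vi for zi, vi in zip(z_t, v)]


def eps_from_v(z_t, v, t):
    """Recover the noise from a velocity prediction: eps = z_t + (1 - t) * v."""
    return [zi + (1.0 - t) * vi for zi, vi in zip(z_t, v)]


def v_from_x(z_t, x, t):
    """Convert a clean-latent prediction to a velocity (requires t > 0)."""
    return [(zi - xi) / t for zi, xi in zip(z_t, x)]
```

Because the conversions are exact, any difference between training on x and training on v has to come from how regression errors are weighted across latent directions, which is the effect the abstract attributes to the target covariance.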

2605.26910 2026-05-28 cs.LG cs.AI q-bio.NC

EEG-FM-Audit: A Systematic Evaluation and Analysis Pipeline for EEG Foundation Models

EEG-FM-Audit:脑电图基础模型的系统评估与分析流程

Xianheng Wang, Yige Yang, Damien Coyle

发表机构 * Bath Institute for the Augmented Human(巴斯增强人类研究所) University of Bath(巴斯大学)

AI总结 提出EEG-FM-Audit流程,通过ASHA驱动的基准测试、范式级消融研究和神经生理探测,系统评估脑电图基础模型,发现调优的监督基线可媲美或超越先进基础模型。

Comments 26 pages

详情
AI中文摘要

大型脑电图基础模型在解码跨多种认知任务的脑电图信号方面展现出巨大潜力。然而,现有的EEG-FM研究存在三个关键局限性:不透明的监督基线调优、复杂学习范式的贡献未经验证以及模型决策缺乏透明度。为解决这些问题,我们提出了EEG-FM-Audit,一个旨在系统化评估EEG-FM的综合评估与分析流程。EEG-FM-Audit包含三个主要组成部分:(1) ASHA驱动的基准测试协议,通过透明优化监督基线确保公平比较;(2) 范式级消融研究,评估FM中学习范式的有效性;(3) 神经生理探测框架,探究FM是否利用了有效的时域、空域和频域脑电图特性。我们将EEG-FM-Audit应用于四个最先进的EEG-FM和五个代表性监督模型,涉及三个公开数据集。结果表明,尽管参数显著减少,但适当调优的监督基线可以匹配或超越先进的FM。此外,我们发现FM学习范式的有效性高度依赖于数据集规模和架构。最后,NPP分析展示了FM如何依赖特定的生理特征,为更可解释的神经解码建立了框架。

英文摘要

Large EEG Foundation Models (FMs) have shown great potential for decoding EEG signals across diverse cognitive tasks. However, existing EEG-FM studies exhibit three critical limitations: opaque supervised baseline tuning, unverified contributions of complex learning paradigms, and a lack of transparency in model decision-making. To address these, we propose EEG-FM-Audit, a comprehensive evaluation and analysis pipeline designed to systematize the assessment of EEG-FMs. EEG-FM-Audit consists of three primary components: (1) an ASHA-driven benchmarking protocol that ensures fair comparisons by transparently optimizing supervised baselines; (2) paradigm-level ablation studies to evaluate the effectiveness of learning paradigms in FMs; and (3) a neurophysiological probing (NPP) framework, which explores whether FMs leverage valid temporal, spatial, and spectral EEG properties. We apply EEG-FM-Audit to four state-of-the-art EEG-FMs and five representative supervised models across three public datasets. Our results reveal that properly tuned supervised baselines can match or outperform advanced FMs, despite requiring significantly fewer parameters. Furthermore, we find that the effectiveness of learning paradigms of FMs is highly dependent on dataset scale and architecture. Finally, NPP analysis demonstrates how FMs rely on specific physiological features, establishing a framework for more interpretable neural decoding.
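ASHA is an asynchronous variant of successive halving. The sketch below shows only the simpler synchronous version, to illustrate the idea of giving many baseline configurations a small budget and promoting the best fraction to a larger one. The function signature, the `evaluate` callback, and the default values are assumptions for illustration and do not reproduce the benchmarking protocol in the paper.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Return the best configuration after repeated keep-the-top-1/eta rounds.

    configs:  non-empty list of candidate hyperparameter settings
    evaluate: callable (config, budget) -> score, where higher is better
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        keep = max(1, len(ranked) // eta)  # promote the top 1/eta to the next rung
        survivors = ranked[:keep]
        budget *= eta  # survivors are trained longer on the next rung
    return survivors[0]
```

With nine candidate learning rates and `eta=3`, the search evaluates nine configurations at budget 1, three at budget 3, and returns the single survivor, so most of the compute goes to the promising settings.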

2605.26368 2026-05-28 cs.CV cs.AI

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

统一全景几何估计:基于多视角基础模型

Vukasin Bozic, Isidora Slavkovic, Dominik Narnhofer, Nando Metzger, Denis Rozumny, Konrad Schindler, Nikolai Kalischek

发表机构 * ETH Zürich(苏黎世联邦理工学院) Google(谷歌)

AI总结 提出PaGeR框架,利用预训练3D基础模型,从单张全景图像中统一预测尺度不变深度、度量深度、表面法线和天空掩码,实现360度场景重建。

详情
AI中文摘要

从透视图像进行几何估计已取得巨大进展,成熟到现成的基础模型不仅能够从多视角图像重建3D场景结构,甚至能从单视图进行重建。一个自然的扩展是从全景图像进行3D重建,其令人兴奋的前景是从单张全景图像恢复完整的360度场景。在这项工作中,我们引入了PaGeR(全景几何重建),这是一个将专为透视图像设计的强大3D基础模型提升到全景领域的框架。我们的策略是从一个预训练的3D重建Transformer开始,将其转变为一个统一的高性能模型,该模型在单次前向传播中从透视和全向图像预测尺度不变深度、度量深度、表面法线和天空掩码。通过将架构改动保持在最小,并在训练中混合透视和全景图像,PaGeR保留了底层基础模型的丰富3D先验,同时学会从单张全景图像估计几何一致的360度场景。我们在室内和室外环境中广泛测试了我们的方法,发现它在各种场景中提供了最先进的性能和出色的零样本性能。代码、数据和模型可在此处获取:https://github.com/prs-eth/PaGeR。

英文摘要

Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to reconstruct 3D scene structure not only from multi-view imagery, but even from a single view. A natural extension is 3D reconstruction from panoramas, with the exciting prospect of recovering a full 360-degree scene from a single panoramic image. In this work, we introduce PaGeR (Panoramic Geometry Reconstruction), a framework to lift powerful 3D foundation models designed for perspective imagery to the panorama domain. Our strategy is to start from a pre-trained transformer for 3D reconstruction and turn it into a unified high-performance model that predicts scale-invariant depth, metric depth, surface normals, and sky masks from both perspective and omnidirectional images, in a single forward pass. By keeping architectural changes to a minimum and mixing perspective and panoramic images during training, PaGeR retains the rich 3D prior of the underlying foundation model while learning to also estimate geometrically consistent 360-degree scenes from single panoramas. We extensively test our method in both indoor and outdoor environments and find that it delivers state-of-the-art performance and excellent zero-shot performance across a wide range of scenes. Code, data and models are available at https://github.com/prs-eth/PaGeR.
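Turning a panoramic depth map into a 3D scene relies on mapping each equirectangular pixel to a viewing ray. The sketch below shows one common convention: longitude spans the image width, latitude spans the height, with y up and z forward. The axis convention, the half-pixel offset, and the reading of depth as distance along the ray are assumptions for illustration and may differ from what PaGeR uses.

```python
import math


def equirect_pixel_to_ray(u, v, width, height):
    """Unit viewing direction for pixel (u, v) of an equirectangular panorama."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi    # range [-pi, pi), 0 at image centre
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi   # +pi/2 at the top row, -pi/2 at the bottom
    return (
        math.cos(lat) * math.sin(lon),  # x: right
        math.sin(lat),                  # y: up
        math.cos(lat) * math.cos(lon),  # z: forward
    )


def depth_to_point(u, v, depth, width, height):
    """Back-project a pixel, assuming depth is measured along the viewing ray."""
    dx, dy, dz = equirect_pixel_to_ray(u, v, width, height)
    return (depth * dx, depth * dy, depth * dz)
```

Applying `depth_to_point` to every pixel of a predicted depth map yields a point cloud covering the full sphere around the camera.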

2605.26357 2026-05-28 cs.LG

Balancing Plasticity and Stability with Fast and Slow Successor Features

Raymond Chua, Doina Precup, Blake Richards

Affiliations * School of Computer Science, McGill University, Montréal, Canada; Mila - Quebec Artificial Intelligence Institute, Montréal, QC, Canada; CIFAR Learning in Machines; Google DeepMind; Montreal Neurological Institute of McGill University, Montréal, Canada

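AI Summary Addresses the stability-plasticity dilemma that deep reinforcement learning faces in non-stationary environments by applying multi-timescale synaptic consolidation to successor features; experiments indicate that stability matters more than plasticity when the environment changes gradually.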

Comments Main Paper: 9 pages, 9 figures. Accepted at The International Conference on Machine Learning (ICML) 2026

Abstract

A hallmark of intelligence is the ability to adapt in non-stationary environments, yet deep Reinforcement Learning (RL) agents often struggle in such settings. Prior studies introduce non-stationarity through abrupt shifts in features or dynamics, whereas real-world environments often evolve gradually through continual drift. This distinction has important implications for the "stability-plasticity dilemma" in RL, as abrupt task changes may demand more plasticity than naturalistic settings. To address this, we modify existing 3D Miniworld and MuJoCo environments to incorporate naturalistic, continual non-stationarity, and use them to examine how stability and adaptation affect performance under continuous environmental change. We find that methods favoring stability, such as synaptic consolidation, outperform approaches focused on plasticity, such as parameter resetting. Motivated by this result, and prior evidence that Successor Features (SFs) reduce interference, we investigate whether SFs are better consolidation targets than Q-values. Across both environments, applying neuro-inspired synaptic consolidation to SFs yields superior performance in continually changing settings. Moreover, consolidation is most effective when SFs are stabilized across multiple timescales, which capture complementary aspects of gradual environmental change. Together, these results suggest that stability is more critical in continual learning when changes are gradual, and that multi-timescale consolidation of predictive representations is an effective approach.
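As a rough illustration of what stabilizing successor features across multiple timescales could look like, the sketch below uses a tabular, state-only setting under a fixed policy. The TD update for successor features is standard. The consolidation step is an assumption made for illustration only: it keeps several slow copies that track the fast estimate as exponential moving averages with different rates, and softly pulls the fast estimate back toward them. The abstract does not specify the paper's consolidation mechanism, and the names `sf_td_update`, `consolidate`, and `value` are hypothetical.

```python
def sf_td_update(psi, phi, s, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step moving psi[s] toward phi[s] + gamma * psi[s_next]."""
    target = [f + gamma * n for f, n in zip(phi[s], psi[s_next])]
    psi[s] = [p + alpha * (t - p) for p, t in zip(psi[s], target)]

def consolidate(psi_fast, psi_slow_list, rates, pull=0.05):
    """Illustrative multi-timescale consolidation (an assumption, see lead-in).

    Each slow copy follows the fast estimate as an exponential moving
    average with its own rate; the fast estimate is then pulled toward the
    mean of the slow copies with strength `pull`.
    """
    for s in range(len(psi_fast)):
        for psi_slow, rate in zip(psi_slow_list, rates):
            psi_slow[s] = [m + rate * (f - m)
                           for m, f in zip(psi_slow[s], psi_fast[s])]
        dim = len(psi_fast[s])
        anchor = [sum(ps[s][i] for ps in psi_slow_list) / len(psi_slow_list)
                  for i in range(dim)]
        psi_fast[s] = [f + pull * (a - f)
                       for f, a in zip(psi_fast[s], anchor)]

def value(psi, w, s):
    """Value of state s as the dot product of psi[s] with reward weights w."""
    return sum(p * wi for p, wi in zip(psi[s], w))
```

A small `rate` gives a copy that changes slowly and therefore favors stability, while a larger `rate` tracks recent drift more closely; `pull` sets how strongly the fast estimate is anchored.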

2605.26277 2026-05-28 cs.CV cs.AI

VesselSim: learning 3D blood vessel segmentation without expert annotations

Erin Rainville, Melissa Ananian, Tristan Mirolla, Hassan Rivaz, Yiming Xiao

Affiliations * Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada; Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada

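AI Summary Proposes VesselSim, a two-stage framework that combines geometry-driven synthetic vessel generation with self-supervised test-time adaptation to segment 3D blood vessels without real annotations, reaching performance competitive with state-of-the-art vascular segmentation foundation models on multiple clinical datasets.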

Comments This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published as part of the MICCAI 2026 proceedings in October

Abstract

Blood vessel segmentation is a core task in medical image analysis for the care of vascular diseases and surgical planning, yet the challenges of providing expert vascular annotations pose a major obstacle to the progress of related deep learning techniques. To address this, we propose VesselSim, a two-stage framework for universal 3D blood vessel segmentation that eliminates the need for real annotated data during training. First, we introduce a stochastic, geometry-driven vascular simulation framework that models recursive branching, curvature-controlled growth, and collision-aware topology, followed by domain-randomized intensity synthesis to generate 16,500 anatomically plausible 3D angiographic volumes. Second, a 3D U-Net is trained solely on this synthetic data. To bridge the domain gap from synthetic to real images at inference time, we introduce a test-time adaptation strategy via a self-supervised mask reconstruction decoder, enabling adaptation to unseen clinical scans without prior domain knowledge. We evaluate VesselSim in a zero-shot setting on multiple real-world datasets spanning MR and CT across several anatomical regions, including the brain and kidneys. Despite being trained exclusively on synthetic data, VesselSim achieves performance competitive with state-of-the-art vascular segmentation foundation models. These findings suggest that learning vessel geometry from synthetic tubular structures is effective for robust cross-domain generalization, substantially reducing the reliance on acquired medical imaging data and, more importantly, on expert annotations.
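The recursive branching and curvature-controlled growth mentioned in the abstract can be pictured with a toy centerline generator. This sketch is not the VesselSim simulator. It assumes symmetric bifurcations whose child radii follow Murray's law (parent radius cubed equals the sum of the child radii cubed), limits turning with a bounded random jitter per step, and omits collision handling entirely; `grow_tree` and all parameter names are hypothetical.

```python
import math
import random

def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _perturb(direction, jitter, rng):
    """Add bounded random jitter to a unit direction and renormalize.
    A small jitter keeps the per-step change of direction small."""
    noise = tuple(rng.uniform(-jitter, jitter) for _ in range(3))
    return _normalize(tuple(d + e for d, e in zip(direction, noise)))

def grow_tree(start, direction, radius, levels, rng,
              n_steps=8, step=1.0, jitter=0.15, split=0.6):
    """Return centerline segments (p0, p1, radius) of a binary vessel tree.

    Each branch grows n_steps segments of length `step`, then splits into
    two children until `levels` reaches zero. Child radii use the symmetric
    form of Murray's law, r_child = r_parent * 2 ** (-1/3). No collision
    check is performed in this sketch.
    """
    segments = []
    pos, d = start, _normalize(direction)
    for _ in range(n_steps):
        d = _perturb(d, jitter, rng)
        nxt = tuple(p + step * c for p, c in zip(pos, d))
        segments.append((pos, nxt, radius))
        pos = nxt
    if levels > 0:
        child_radius = radius * 2.0 ** (-1.0 / 3.0)
        for _ in range(2):
            child_dir = _perturb(d, split, rng)
            segments += grow_tree(pos, child_dir, child_radius, levels - 1,
                                  rng, n_steps, step, jitter, split)
    return segments
```

For example, `grow_tree((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0, 2, random.Random(0))` yields a three-generation tree; rasterizing such segments into a volume and synthesizing intensities would be separate steps that this sketch does not cover.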

2605.25770 2026-05-28 cs.RO

Implicit Null-space Manifold Generation for Redundant Robotic Systems

Taiki Ishigaki, Teresa Vidal-Calleja, Ko Ayusawa, Eiichi Yoshida

Affiliations * Tokyo University of Science, Japan; University of Technology Sydney, Australia; National Institute of Advanced Industrial Science and Technology, Japan

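AI Summary For redundant robotic systems, proposes an implicit scalar-field method built on Jacobian-guided exploration that represents the solution manifold as the field's zero-level set, enabling effective estimation of the solution space's geometric structure and modeling of continuously varying tasks.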

Comments Corrected author names in references

Abstract

Robotic systems with redundant degrees of freedom can achieve the same task outcome using multiple configurations, resulting in solution sets that form manifolds in the configuration space. Existing approaches typically exploit such redundancy locally through Jacobian-based techniques to compute individual solutions or trajectories. While effective for solution computation, these methods do not retain a representation of the geometry of the solution set itself. In this work, we adopt a representation-centric approach to estimate the geometric structure of the solution space. We consider solution manifolds induced by general task-defining maps and construct an implicit scalar field over the configuration space, whose zero-level set corresponds to the solution manifold. To this end, we generate samples in the neighborhood of the solution manifold using a Jacobian-guided exploration strategy, which efficiently captures its local and global structure. The resulting implicit representation is defined over the configuration space and naturally induces a continuous distance field that encodes proximity to the solution manifold. Experiments on a planar three-link robot and a seven-degree-of-freedom Franka manipulator demonstrate the effectiveness of the proposed representation. Furthermore, the framework enables consistent modeling of solution spaces across families of tasks with continuous variation.
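For the planar three-link case, the solution manifold and a Jacobian-guided way of sampling it can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' method: the task is taken to be the 2D end-effector position, the scalar field is a hand-written task-space distance rather than a constructed or learned representation, exploration steps along the Jacobian null space and then corrects with the pseudoinverse, and the unit link lengths and all function names are hypothetical.

```python
import math

LINKS = (1.0, 1.0, 1.0)  # illustrative link lengths, not from the paper

def fk(q):
    """End-effector position of a planar serial arm with joint angles q."""
    x = y = angle = 0.0
    for length, qi in zip(LINKS, q):
        angle += qi
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y

def jacobian(q):
    """Rows (dx/dq, dy/dq) of the 2x3 task Jacobian."""
    angles, total = [], 0.0
    for qi in q:
        total += qi
        angles.append(total)
    n = len(q)
    row_x = [-sum(LINKS[i] * math.sin(angles[i]) for i in range(j, n))
             for j in range(n)]
    row_y = [sum(LINKS[i] * math.cos(angles[i]) for i in range(j, n))
             for j in range(n)]
    return row_x, row_y

def field(q, target):
    """Task-space distance to the target; zero exactly on the solution set."""
    x, y = fk(q)
    return math.hypot(x - target[0], y - target[1])

def project(q, target, iters=10):
    """Pull q back onto the solution set with pseudoinverse Newton steps."""
    for _ in range(iters):
        x, y = fk(q)
        ex, ey = target[0] - x, target[1] - y
        jx, jy = jacobian(q)
        a = sum(v * v for v in jx)
        b = sum(u * v for u, v in zip(jx, jy))
        c = sum(v * v for v in jy)
        det = a * c - b * b  # vanishes only at kinematic singularities
        l0 = (c * ex - b * ey) / det
        l1 = (a * ey - b * ex) / det
        q = tuple(qi + u * l0 + v * l1 for qi, u, v in zip(q, jx, jy))
    return q

def null_direction(q):
    """Unit vector spanning the Jacobian null space (cross product of rows)."""
    jx, jy = jacobian(q)
    n = (jx[1] * jy[2] - jx[2] * jy[1],
         jx[2] * jy[0] - jx[0] * jy[2],
         jx[0] * jy[1] - jx[1] * jy[0])
    norm = math.sqrt(sum(v * v for v in n))
    return tuple(v / norm for v in n)

def trace_manifold(q0, target, n_samples=30, step=0.05):
    """Collect samples by stepping along the null space, then projecting."""
    q = project(q0, target)
    samples, prev = [q], None
    for _ in range(n_samples - 1):
        n = null_direction(q)
        if prev is not None and sum(u * v for u, v in zip(n, prev)) < 0:
            n = tuple(-v for v in n)  # keep a consistent travel direction
        q = project(tuple(qi + step * ni for qi, ni in zip(q, n)), target)
        samples.append(q)
        prev = n
    return samples
```

With three joints and a two-dimensional task, the solution set is a one-dimensional curve in joint space, so a single null-space direction is enough to walk along it. Samples gathered this way could serve as zero-level points when fitting an implicit field, which is the part this sketch leaves out.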