arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.25626 2026-05-26 cs.CL

Beyond Literal Translation: Evaluating Cultural Effectiveness in Social Media UGC

Linjuan Wu, Ruiqi Zhang, Xinze Lyu, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Yixin Cao, Yongliang Shen, Weiming Lu

Institutions: Zhejiang University; Fudan University; Xiaohongshu Inc.

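AI summary: To address the weak cultural transfer and emotional resonance in translations of social media user-generated content (UGC), the paper proposes the CULTURE-MT benchmark: 1,002 UGC notes spanning 14 domains and four culture-loaded types, together with a cultural-effectiveness evaluation criterion. Experiments show that traditional metrics fail to capture cultural effectiveness and that, for base LLMs, cultural effectiveness correlates with model size.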

Comments: Accepted by ICML 2026

Abstract

Social media platforms enable large-scale cross-lingual communication, but translating user-generated content (UGC) remains challenging due to its informal style, cultural references, and interaction-based expressions. While recent LLMs have improved translation quality, existing benchmarks and metrics often fail to capture whether translations convey intended meaning and cultural resonance in real-world settings. In this work, we introduce CULTURE-MT, a benchmark for social media translation that focuses on both CULtural Transmission and UGC-specific emotion REsonance. CULTURE-MT consists of 1,002 UGC notes across 14 domains, categorized into four types based on culture-loaded symbols and linguistic style features. We also construct UGC-oriented training data to fine-tune Qwen3-8B and Qwen3-32B as baselines. We propose cultural effectiveness as a new evaluation criterion, focusing on expression accuracy and cultural adaptability. Testing 15 models, including the baselines, we find that traditional metrics fail to capture cultural effectiveness. We also observe that cultural effectiveness on base LLMs correlates with model size. Our work provides a comprehensive evaluation system for UGC translation models and will offer an open evaluation platform to advance research in this area. We release the CULTURE-MT benchmark and provide an online leaderboard where submitted translation results can be evaluated by our trained JUDGER.

2605.25621 2026-05-26 cs.CV

StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu

Institutions: Shanghai Innovation Institute; Fudan University; Nanjing University; Shanghai Artificial Intelligence Laboratory; Huazhong University of Science and Technology

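AI summary: Proposes StreamOV, a framework that uses multimodal evidence-guided long-short term memory and a hidden-state-driven trigger to support online reasoning and proactive responses in streaming omni-video understanding; it achieves state-of-the-art performance on the new SOVBench benchmark.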

Abstract

While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios due to two fundamental flaws. First, they lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.

2605.25620 2026-05-26 cs.AI

Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

Minghao Fu, Fan Feng, Nicklas Hansen, Biwei Huang

Institutions: University of California, San Diego

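AI summary: Proposes TC-WM, a framework that linearly projects pretrained visual embeddings into a compact latent state, aligns a subspace via contrastive learning, and reconstructs the embeddings, turning foundation-model features into task-sufficient world representations with better world-modeling quality and control precision.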

Abstract

World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task-irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control. This is especially challenging in reward-free offline settings, where the model must learn from fixed trajectories without reward supervision or online interaction. To address this, we propose TC-WM, a framework for turning foundation-model embeddings into compact, task-sufficient world representations. The key design is to treat the pretrained embedding space as a semantic scaffold rather than as the final state space: TC-WM linearly projects high-dimensional visual embeddings into a compact latent as the dynamic space, aligns a subspace with the agent's physical state via contrastive learning, and reconstructs embeddings to preserve useful visual structure. This combines the generality of foundation features with the controllability of task-centric dynamics. Theoretically, we show that TC-WM suffices to identify the underlying task-centric latent factors up to a simple transformation. Empirically, TC-WM enables test-time planning across diverse environments (e.g., Robomimic and D4RL), achieving better world-modeling quality and more precise control than state-of-the-art approaches.

2605.25616 2026-05-26 cs.LG stat.ML

Courtroom Analogy: New Perspective on Uncertainty-Aware Classification

Taeseong Yoon, Heeyoung Kim

Institutions: Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology

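AI summary: Proposes the courtroom analogy, which models uncertainty aggregation in classification with a structured mixture of Dirichlet distributions, and designs MoDEX, a single-pass feed-forward neural network for efficient, interpretable uncertainty quantification.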

Comments: ICML 2026

Abstract

Single-pass uncertainty quantification (UQ) methods for classification represent uncertainty by predicting a tractable distribution over the class probability vector. While existing approaches primarily focus on enhancing the expressiveness of this distribution, they often provide limited insight into how predictive uncertainty is structured and aggregated, resulting in weak interpretability. We introduce the courtroom analogy, which conceptualizes uncertainty-aware classification as a structured debate among class-specific advocates. Each advocate forms a probabilistic opinion, and a final verdict is reached by aggregating these opinions using input-dependent plausibility weights. In this framework, each advocate's opinion is modeled as a Dirichlet distribution whose concentration parameter is decomposed into shared evidence and class-specific advocacy. This yields a structured mixture of Dirichlet distributions with semantically interpretable parameters. To instantiate this formulation, we propose Mixture of Dirichlet EXperts (MoDEX), a single-pass neural architecture that predicts the courtroom parameters, enabling efficient and expressive UQ while explicitly modeling uncertainty aggregation. We demonstrate that MoDEX enjoys strong theoretical properties and achieves state-of-the-art UQ performance across diverse benchmarks, yielding interpretable uncertainty estimates with meaningful semantics.
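The aggregation step in this abstract can be illustrated with a small numeric sketch. The additive form of each advocate's concentration (a unit prior, plus shared evidence, plus an advocacy boost on the advocate's own class) and all numbers are assumptions made for illustration; the abstract does not give MoDEX's exact parameterization. The function name `mixture_of_dirichlet_mean` is hypothetical.

```python
def mixture_of_dirichlet_mean(shared_evidence, advocacy, weights):
    """Mean class-probability vector of a mixture of Dirichlet distributions.

    shared_evidence: K non-negative values seen by every advocate.
    advocacy: K non-negative values; advocate k adds advocacy[k] to class k only
              (an assumed decomposition, not taken from the paper).
    weights: K plausibility weights that sum to 1.
    """
    num_classes = len(shared_evidence)
    verdict = [0.0] * num_classes
    for k in range(num_classes):
        # Concentration of advocate k: unit prior + shared evidence + own advocacy.
        alpha = [
            1.0 + shared_evidence[c] + (advocacy[k] if c == k else 0.0)
            for c in range(num_classes)
        ]
        total = sum(alpha)
        for c in range(num_classes):
            # The mean of Dirichlet(alpha) is alpha / sum(alpha); the mixture mean
            # is the weighted sum of the component means.
            verdict[c] += weights[k] * alpha[c] / total
    return verdict


# One advocate holds all the plausibility and argues for class 0.
print(mixture_of_dirichlet_mean([0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
```

With these inputs the single active advocate has concentration (3, 1, 1), so the verdict puts three fifths of the probability on class 0 and one fifth on each of the others.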

2605.25615 2026-05-26 cs.CV

UAV-OVO: Out-of-Viewpoint Generalization in UAV Action Recognition

Yu Xia, Zhengbo Zhang, Shuaihu Zhang, Zhigang Tu

Institutions: Wuhan University; Singapore University of Technology and Design

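AI summary: To address the performance drop caused by mismatched training and test viewpoints in UAV action recognition, the paper proposes the UAV-OVO benchmark and the LATER method, which achieve viewpoint-robust generalization through view isolation and LoRA-anchored feature re-centering.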

Abstract

UAV action recognition faces a deployment shift that standard benchmarks often obscure: a model trained on UAV footage captured from low-depression viewpoints may be required to recognize the same action classes from high-depression viewpoints. While the action labels remain unchanged, this shift alters body visibility, motion projection, and scene context, encouraging models to rely on viewpoint-specific shortcuts. We introduce UAV-OVO, an Out-of-Viewpoint generalization benchmark for UAV action recognition. UAV-OVO derives view scores from uncalibrated videos, uses a view-isolation band to assign low-depression videos to the training and in-distribution test splits while reserving high-depression videos for out-of-distribution testing, and constructs ID/OOD test sets matched by class distribution so that performance differences reflect viewpoint shift rather than label imbalance. Across representative video recognizers, UAV-OVO reveals a substantial ID/OOD gap: models that fit the low-depression training distribution well often fail to transfer to held-out high-depression views, exposing viewpoint shortcuts hidden by aggregate accuracy. We further propose LATER, LoRA-Anchored Test-time Re-centering, which first adapts the recognizer with Low-Rank Adaptation (LoRA) and then uses the learned LoRA subspace as a semantic anchor for online feature re-centering. Specifically, LATER projects target-domain displacement onto the orthogonal complement of the LoRA subspace before re-centering features, reducing viewpoint-induced drift while preserving task-relevant semantics. Together, UAV-OVO and LATER provide a controlled testbed and a practical adaptation method for viewpoint-robust UAV video understanding.
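The projection step in this abstract is plain linear algebra and can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the LoRA subspace is given as an orthonormal basis, that the displacement is the difference between target and source feature means, and that re-centering subtracts the projected displacement. All function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_onto_complement(displacement, basis):
    """Remove from `displacement` its components along each orthonormal basis vector."""
    residual = list(displacement)
    for u in basis:
        coeff = dot(displacement, u)
        residual = [r - coeff * ui for r, ui in zip(residual, u)]
    return residual


def recenter(feature, source_mean, target_mean, basis):
    """Subtract only the part of the domain displacement that lies outside the subspace."""
    displacement = [t - s for t, s in zip(target_mean, source_mean)]
    drift = project_onto_complement(displacement, basis)
    return [f - d for f, d in zip(feature, drift)]


lora_basis = [[1.0, 0.0, 0.0]]  # assumed one-dimensional anchor subspace
print(recenter([5.0, 4.0, 1.0], [0.0, 0.0, 0.0], [2.0, 3.0, 0.0], lora_basis))
# prints [5.0, 1.0, 1.0]
```

The coordinate along the anchor direction is left untouched (5.0 stays 5.0), while the shift in the orthogonal direction is removed (4.0 becomes 1.0), which is the sense in which the subspace acts as a semantic anchor in this sketch.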

2605.25612 2026-05-26 cs.LG cs.AI

Towards the Connection between Activation Sparsity and Flat Minima

Ze Peng, Jian Zhang, Lei Qi, Yang Gao, Yinghuan Shi

Institutions: State Key Laboratory for Novel Software Technology, Nanjing University; Institute of Brain-Machine Interface, Nanjing University; School of Computer Science and Engineering, Southeast University

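AI summary: The paper finds that the flatness of the loss landscape is closely related to MLP activation sparsity in Transformers, and strengthens sparsity through theoretical analysis and three practical methods, with the potential to reduce inference and training cost.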

Abstract

The observation that activation sparsity emerges in MLP blocks of standardly trained Transformers offers an opportunity to drastically reduce computation costs without sacrificing performance. To theoretically explain this phenomenon, existing works have shown that activation sparsity does not result from the data properties or data fitting but from the implicit bias of the training process. However, these connections are obtained under strong assumptions that do not apply to deep models standardly trained for a large number of steps. Unlike these works, we find that the flatness of loss landscapes is also closely related to MLP activation sparsity and can serve as a weaker, naturally emerging assumption for standard deep networks. Specifically, we find that 1) the MLP activation sparsity equals a ratio between "augmented flatness" (a weighted sum of flatness measures) and the product of the input norm and activation gradient of the MLP. We empirically find that this ratio decreases during training, leading to sparse activations. 2) We also propose the notion of derivative sparsity, which reduces to activation sparsity under ReLU, but further enables pruning in the backward propagation and is more stable than activation sparsity. With the theoretical findings, we can further encourage activation sparsity by decreasing the numerator and increasing the denominator of the ratio using three methods. These plug-and-play modifications can effectively reduce the ratio and produce sparser activations. Experiments on ImageNet-1K and C4 demonstrate relative improvements of at least 36% on inference sparsity and at least 50% on training sparsity over vanilla Transformers, indicating further potential cost reduction in both inference and training.
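The statement that derivative sparsity reduces to activation sparsity under ReLU can be checked on a toy example. The sketch below measures sparsity as the fraction of exact zeros and takes the ReLU derivative at 0 to be 0; both choices are assumptions made for illustration, and the paper's formal definitions may differ.

```python
def relu(x):
    return x if x > 0 else 0.0


def relu_grad(x):
    # Derivative of ReLU, with the value at x == 0 taken to be 0 (a convention).
    return 1.0 if x > 0 else 0.0


def zero_fraction(values):
    return sum(1 for v in values if v == 0.0) / len(values)


# Hypothetical pre-activations of one MLP hidden layer.
pre_activations = [-1.5, 0.3, -0.2, 2.0, 0.0, -3.0, 0.7, -0.1]

act_sparsity = zero_fraction([relu(x) for x in pre_activations])
der_sparsity = zero_fraction([relu_grad(x) for x in pre_activations])
print(act_sparsity, der_sparsity)  # prints 0.625 0.625
```

For ReLU the output and the derivative vanish on exactly the same inputs, so the two fractions coincide. For smooth activations such as GELU the two sets generally differ, which is where a separate notion of derivative sparsity becomes meaningful.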

2605.25604 2026-05-26 cs.CL cs.LG

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

Guochao Jiang, Jingyi Song, Guofeng Quan, Chuzhan Hao, Guohua Liu, Yuewei Zhang

Institutions: Alibaba Cloud Computing

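AI summary: To address training instability from reward combination and the reliance of advantage combination on static hyperparameters in multi-reward reinforcement learning, the paper proposes Dynamic Variance-adaptive Advantage Optimization, which adjusts combination weights dynamically from empirical reward variance to achieve stable training and a better multi-objective Pareto frontier.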

Abstract

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
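The idea of weighting objectives by their within-group reward variance can be sketched as follows. The abstract does not state DVAO's actual weighting rule, so taking weights proportional to the empirical variance is only one simple reading of "variance-adaptive" and should be treated as an assumption, as should the function names and the toy rewards.

```python
from statistics import mean, pstdev, pvariance


def group_normalize(rewards, eps=1e-8):
    """GRPO-style advantage: center and scale rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def variance_adaptive_advantages(rewards_by_objective):
    """Combine per-objective advantages with weights proportional to reward variance."""
    variances = {name: pvariance(r) for name, r in rewards_by_objective.items()}
    total = sum(variances.values())
    if total == 0:
        # No objective separates the rollouts; fall back to uniform weights.
        weights = {name: 1.0 / len(variances) for name in variances}
    else:
        weights = {name: v / total for name, v in variances.items()}

    group_size = len(next(iter(rewards_by_objective.values())))
    combined = [0.0] * group_size
    for name, rewards in rewards_by_objective.items():
        for i, adv in enumerate(group_normalize(rewards)):
            combined[i] += weights[name] * adv
    return weights, combined


weights, advantages = variance_adaptive_advantages({
    "correctness": [1.0, 0.0, 1.0, 0.0],  # separates the rollouts
    "format": [1.0, 1.0, 1.0, 1.0],       # identical for every rollout
})
print(weights)
```

In this toy group the format reward is identical for every rollout, so it carries no learning signal and receives zero weight, while the correctness reward receives all of it.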

2605.25603 2026-05-26 cs.AI

Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

Xu Shen, Zhen Tan, Song Wang, Pingjun Hong, Rui Miao, Xin Wang, Tianlong Chen

Institutions: Jilin University; University of Central Florida; Arizona State University; University of Vienna; University of North Carolina at Chapel Hill

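AI summary: Proposes CIE-Scorer, a framework that traces sentence-level circuits and measures the discrepancy between internal and external reasoning graphs with the Fused Gromov-Wasserstein distance for instance-level detection of unfaithful chain-of-thought; it achieves state-of-the-art performance on FaithCoT-Bench while reducing circuit-construction cost.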

Abstract

Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may not faithfully reflect the model's actual decision process. Existing CoT unfaithfulness detectors mainly rely on external signals from generated rationales, such as textual plausibility or answer consistency, while overlooking evidence from the model's internal computation. Although recent circuit tracing methods provide a way to obtain model-internal evidence by tracing how information flows through model components during reasoning, constructing full reasoning circuits for long CoTs is costly and difficult to scale. To address these challenges, we propose Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a framework for instance-level CoT unfaithfulness detection. The key idea is that faithful reasoning traces should align with the model's computational process, whereas unfaithful traces may diverge from it. CIE-Scorer efficiently traces compact sentence-level circuits from informative reasoning tokens, constructs internal and external reasoning graphs, and measures their discrepancy using Fused Gromov-Wasserstein distance. Experiments on four datasets from FaithCoT-Bench show that CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction, demonstrating the effectiveness of combining mechanistic interpretability signals with external reasoning traces for CoT unfaithfulness detection.

2605.25601 2026-05-26 cs.CL cs.AI

Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models

Alexander Apartsin, Omri Sason, Yehudit Aperstein

Institutions: Holon Institute of Technology; Afeka Tel Aviv Academic College of Engineering

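AI summary: Proposes a benchmark framework that uses prompting to steer language models into simulating students with specified skill profiles and evaluates how controllable they are, in support of deliberate practice in teacher education.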

Comments: 22 pages, 7 figures

Abstract

Teacher education requires deliberate practice with learners who exhibit identifiable strengths, weaknesses, and partial mastery. Large language models could support such practice by simulating students with known skill components, enabling teachers to rehearse explanations, diagnoses, and instructional responses. For this purpose, however, the central requirement is neither to maximize benchmark accuracy nor to suppress isolated facts, but to control model behavior so that it reflects a specified skill profile. This paper investigates whether prompted language models can be steered to retain some skills while suppressing others. We introduce a benchmark-oriented framework in which an explicit skill vector represents a simulated student, prompt-based control specifies retained and missing competencies, and behavior is evaluated using profile-alignment metrics, retained-versus-forgotten comparisons, and cross-skill calibration analyses. The results show that selective partial mastery can be induced and measured in a structured mathematics setting, although the degree of controllability remains model-dependent. These findings position controllable learner simulation as a distinct research problem at the intersection of teacher education, educational simulation, and language-model control.

2605.25599 2026-05-26 cs.LG cs.CV

Generalized Evidential Deep Learning: From a Bayesian Perspective

Yuanye Liu, Yibo Gao, Yuanyang Chen, Xiahai Zhuang

Institutions: School of Data Science, Fudan University, Shanghai, China

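AI summary: Starting from a generalized Bayesian framework, the paper establishes a theoretical foundation for evidential deep learning and proposes Generalized Evidential Deep Learning, a unified and extensible framework that achieves comparable results on classification, uncertainty estimation, and OOD detection.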

Comments: Submitted to ICML 2026

Abstract

Evidential Deep Learning (EDL) has emerged as an efficient, sampling-free strategy for uncertainty estimation. A series of EDL variants have been proposed to address specific limitations of the original framework, achieving notable success. However, the underlying theoretical structure of EDL and the relationships among these variants have received limited systematic investigation. In this work, we establish a principled theoretical foundation for EDL by interpreting it within a generalized Bayesian framework that includes prior specification, posterior update, and training objective. We further characterize evidential uncertainty from a Bayesian distributional uncertainty viewpoint, established via asymptotic analysis. Building on this perspective, we propose Generalized Evidential Deep Learning (GEDL), a unified and extensible framework that explicitly disentangles the roles of individual components and systematically relates GEDL to existing variants. Extensive experiments demonstrate that GEDL yields comparable results on classification, uncertainty estimation, and OOD detection, with theoretical grounding.

2605.25598 2026-05-26 cs.CV

SurfSurg6D: Geometry Consistent Dense Correspondence for Textureless Surgical Instrument Pose Estimation

Daiyun Shen, Shuojue Yang, Chang Han Low, Qian Li, Mengya Xu, Qi Dou, Yueming Jin

Institutions: National University of Singapore; Chinese University of Hong Kong

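AI summary: To address data scarcity and geometric-consistency challenges in pose estimation for textureless surgical instruments, the paper builds the SynSurg6D dataset and proposes the SurfSurg6D dense-correspondence framework, achieving RGB-only pose estimation that outperforms existing methods on multiple datasets.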

Abstract

Surgical instrument pose estimation provides crucial information for promising applications, including autonomous robotic surgery, skill assessment, and standardization of surgical workflow. However, this task remains highly challenging due to high precision requirements, frequent occlusions, textureless instruments, scarcity of depth information, and very limited annotated data. These constraints often lead to unsatisfactory performance when general object pose estimation approaches are applied to surgical scenarios. To address these issues, we first construct a new dataset, SynSurg6D, to alleviate the data shortage in this task. We further propose SurfSurg6D, a dense-correspondence framework tailored for surgical instrument pose estimation. Experimental results on the SurgRIPE, EndoVis2018, and SurgPose datasets demonstrate that introducing our generated dataset SynSurg6D diversifies the pose distributions and thereby improves the performance of existing approaches. Furthermore, SurfSurg6D outperforms existing methods, providing a robust solution for precise and efficient RGB-only pose estimation.

2605.25596 2026-05-26 cs.CL

Multilingual Phonological Feature Recognition with Self-Supervised Speech Models

Abner Hernandez, Tomás Arias-Vergara, Daiqi Liu, Andreas Maier, Paula Andrea Pérez-Toro

Institutions: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany; GITA Lab, Facultad de Ingeniería, Universidad de Antioquia UdeA, Medellín, Colombia

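AI summary: Proposes PhonoQ-2.0, a multilingual frame-level phonological feature recognizer built on self-supervised speech models; it predicts structured feature vectors directly using a manner-conditioned gating mechanism and outperforms a CTC baseline both in-domain and out-of-domain.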

Comments: Submitted to Interspeech 2026

Abstract

Phonological features provide a language-general and linguistically grounded representation of speech. We present PhonoQ-2.0, a multilingual frame-level phonological feature recognizer built on self-supervised speech models. The system directly predicts a structured 22-dimensional feature vector per frame encoding manner, vowel quality, place, and voicing, instead of deriving features from phoneme outputs. To ensure phonologically coherent predictions, we introduce a manner-conditioned gating mechanism that activates valid feature groups. Evaluated across multiple languages and corpora, PhonoQ-2.0 achieves an average macro-F1 of 91.3% in-domain and 88.9% out-of-domain. Compared to a strong CTC phoneme baseline, it delivers consistent gains of +8.8 F1 in-domain and +8.6 out-of-domain on average. In unseen-language evaluation, PhonoQ-2.0 improves macro-F1 from 66.9% to 73.6% (+6.7 on average), with gains of up to +10.8 points.

2605.25595 2026-05-26 cs.CV

How Far Has AI Come in Liver Fibrosis Staging? A Large-Scale Real-World Dataset and Benchmark

Yuanye Liu, Nannan Shi, Zhejia Zhang, Hanxiao Zhang, Boya Wang, Derong Yu, Nao Wang, Yuxin Jin, Yang Zhou, Kunhao Yuan, Siqi Wang, Lida Yang, Xu Qiao, Wentao Liu, Xuelei He, Xin Hong, Guoyan Zheng, Xin Chen, Guang-Zhong Yang, Le Zhang, Lei Li, Yuxin Shi, Xiahai Zhuang

Institutions: School of Data Science, Fudan University, Shanghai, China; Department of Radiology, Shanghai Public Health Clinical Center, Fudan University, Shanghai, China; Department of Electrical and Computer Engineering, Northwestern University, Evanston, USA; Shanghai Key Laboratory of Flexible Medical Robotics, Tongren Hospital, Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China; School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Computer Science, University of Nottingham, Nottingham, UK; Institute of Medical Robotics, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; College of Computer Science and Technology, Huaqiao University, Xiamen, China; School of Electronic Information (School of Artificial Intelligence), Northwest University, Xi'an, China; Department of Mechanical Engineering, University College London, London, UK; Institute of Neuroscience and Cardiovascular Research, University of Edinburgh, Edinburgh, UK; CAS Center for Excellence in Nanoscience, National Center for Nanoscience and Technology, Beijing, China; School of Control Science and Engineering, Shandong University, Jinan, China; School of Engineering, College of Engineering and Physical Sciences, University of Birmingham, Birmingham, UK

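AI summary: Using LiFS, a large-scale real-world dataset of multi-center, multi-sequence MRI, the paper systematically evaluates nine AI methods for liver fibrosis staging and finds that the best AI is comparable to a senior radiologist, while cross-center heterogeneity and label imbalance remain the main challenges.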

Comments: Submitted to Medical Image Analysis

Abstract

Despite years of methodological progress, how far AI has come in liver fibrosis staging has never been systematically evaluated under the heterogeneous, multi-center conditions that define clinical practice. To address this gap, we introduce LiFS, a large-scale dataset and benchmark derived from the MICCAI 2025 CARE-Liver challenge, comprising 610 patients across multiple centers and scanners with multi-sequence MRI. To the best of our knowledge, LiFS is the first benchmark providing complete gadoxetic acid-enhanced sequences with histopathology-confirmed annotations from diverse real-world scanners. Through systematic evaluation of 9 independently developed methods selected from 96 registered teams against in-cohort radiologist reference results, our findings address how far current AI has progressed toward clinical-level liver fibrosis staging from three complementary perspectives. First, against radiologists, the best AI methods were broadly comparable to the senior radiologist and significantly exceeded the junior radiologist in selected settings, while median AI performance generally approached junior-radiologist levels. Second, from a data perspective, cross-center heterogeneity, label imbalance, and contrast-enhanced sequence variability emerge as the dominant challenges for AI methods. Third, from a technical perspective, methodological design choices, including spatial registration, input dimensionality, multi-modal fusion strategy, and backbone architecture, appear to modulate cross-center robustness, although no single choice alone closes the gap. Overall, LiFS provides a rigorous real-world benchmark for positioning the current state of AI in liver fibrosis staging and for enabling future research on the key challenges that limit clinically reliable deployment.

2605.25589 2026-05-26 cs.CV

Artifact Correction for Echo-Planar Imaging at Low-Field and Ultra-Low-Field MRI

Sisi Qiao, Yilin Yu, Tiecheng Lin, Yuhao Liu, Jiajia Sun, Xiaoling Li

Institutions: School of Mechanical Engineering, Xi'an Jiaotong University

AI总结 针对低场和超低场MRI中回波平面成像的奈奎斯特鬼影问题,提出一种无需参考扫描的校正流程,结合峰值对齐与插值重采样方法,有效抑制鬼影并提升图像质量。

Comments 19 pages, 10 figures, 2 tables

详情
AI中文摘要

目的:低场和超低场MRI中的回波平面成像因奇偶k空间错位而遭受严重的奈奎斯特鬼影伪影。本研究开发了一种无参考扫描的伪影校正流程,减少对传统参考扫描的依赖,同时实现更好的鬼影抑制。方法:从传统的基于参考扫描的鬼影校正方法出发,我们首先引入一种基于峰值对齐的鬼影校正方法,无需参考数据即可校正奇偶行位移。为进一步减少残余伪影,采用了插值与重采样策略。该组合方法在低场和超低场下的EPI和扩散加权EPI数据上进行了评估。结果:所提出的流程有效减轻了奈奎斯特鬼影,改善了结构连续性,并增强了信号均匀性。仅基于峰值对齐的鬼影校正方法提供了与基于参考扫描的鬼影校正方法相当的伪影抑制效果,而插值与重采样进一步抑制了残余伪影,使得在超低场条件下能够可靠地可视化脑结构。结论:为低场和超低场EPI提出了一种实用的无参考校正流程,结合了基于峰值对齐的鬼影校正方法和插值重采样,实现了高效的鬼影抑制,扩展了低场MRI系统的临床适用性,为基于超低场EPI的DWI成像提供了理论指导和实践经验。

英文摘要

Purpose: Echo-planar imaging (EPI) in low-field (LF) and ultra-low-field (ULF) MRI suffers from severe Nyquist ghost artifacts due to odd-even k-space misalignment. This study develops a reference-free artifact correction pipeline that reduces reliance on conventional reference scans while achieving improved ghost suppression. Methods: Starting from the traditional reference-scan-based ghost artifact correction method, we first introduce a peak-alignment-based ghost artifact correction method that corrects odd-even line displacement without reference data. To further reduce residual artifacts, an interpolation-and-resampling strategy is applied. The combined method was evaluated on EPI and diffusion-weighted EPI data acquired at LF and ULF. Results: The proposed pipeline effectively mitigated Nyquist ghosts, improved structural continuity, and enhanced signal uniformity. The peak-alignment-based method alone provided artifact suppression comparable to that of the reference-scan-based method, while interpolation and resampling further suppressed residual artifacts, enabling reliable visualization of brain structures under ULF conditions. Conclusion: A practical, reference-free correction pipeline is presented for LF and ULF EPI. It combines peak-alignment-based ghost artifact correction with interpolation and resampling to achieve efficient ghost suppression and expand the clinical applicability of low-field MRI systems, and it provides both theoretical guidance and practical experience for ULF EPI-based diffusion-weighted imaging (DWI).
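
A minimal sketch of the peak-alignment idea, not the authors' implementation: it assumes the odd-even misalignment is an integer shift along the readout axis and that the echo peak of each line group can be located from the summed magnitude. The function name `peak_align` and the synthetic Gaussian echo are illustrative assumptions; sub-sample shifts are not handled here.

```python
import numpy as np

def peak_align(kspace):
    """Align odd k-space lines to even lines by matching echo-peak positions.

    kspace: complex array of shape (n_phase_lines, n_readout).
    Returns the corrected array and the integer shift applied to odd lines.
    """
    k = np.array(kspace, dtype=complex)
    # Echo-peak position along the readout axis, estimated per line group.
    even_peak = int(np.argmax(np.abs(k[0::2]).sum(axis=0)))
    odd_peak = int(np.argmax(np.abs(k[1::2]).sum(axis=0)))
    shift = even_peak - odd_peak
    # Circularly shift odd lines so their peak coincides with the even-line peak.
    k[1::2] = np.roll(k[1::2], shift, axis=1)
    return k, shift

# Synthetic data (assumption): identical Gaussian echoes, odd lines displaced by 2 samples.
n_lines, n_read = 8, 64
x = np.arange(n_read)
echo = np.exp(-0.5 * ((x - 32) / 3.0) ** 2)
clean = np.tile(echo, (n_lines, 1)).astype(complex)
ghosted = clean.copy()
ghosted[1::2] = np.roll(ghosted[1::2], 2, axis=1)
fixed, shift = peak_align(ghosted)
```

No reference scan is used: the shift is estimated from the imaging data itself, which is the property the abstract emphasizes.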

2605.25584 2026-05-26 cs.RO cs.AI

Acting on the Unseen: Communication-Free Collaborative Filtering for Decentralized Multi-Robot Task Allocation

作用于未知:面向分散式多机器人任务分配的无通信协同过滤

Alexander Apartsin, Yigal Meshulam, Yehudit Aperstein

发表机构 * Holon Institute of Technology(霍洛技术学院) Afeka Tel Aviv Academic College of Engineering(阿法卡特拉维夫工程学院)

AI总结 针对零知识多机器人任务分配问题,提出基于在线低秩协同过滤的SwarmCF方法,无需通信、先验知识或协调者,实现每个机器人在未见任务上的有效行动,并证明其样本复杂度优势。

Comments 27 pages, 12 figures

详情
AI中文摘要

多机器人任务分配通常假设某种通信、已知任务模型或协调者的组合。我们研究相反的极端情况,这在实践中常见但在理论上被忽视,我们称之为零知识MRTA(ZK-MRTA):一个没有先验知识(没有任务模型,甚至没有潜在秩)、没有通信(没有消息、没有参数共享、没有协调者)、并且只能部分且私下带噪地观察队友结果的公共流的机器人团队。一个隐藏的低秩结构决定了哪个机器人适合哪个任务,并且任务数量远多于轮次,因此大多数(机器人,任务)对从未被尝试过。然而,每个机器人可以通过在广播流上运行在线低秩协同过滤(SwarmCF)来很好地处理从未尝试过的任务以及新任务。与任何无结构学习器相比,优势是类别性的,而不是常数因子:无结构学习器在未见对上的误差被证明处于先验均值水平。我们证明了每个机器人的匹配样本复杂度(在秩d和任务数n下,Θ(d) vs Θ(n)),任务稀缺下的任意时间(累积奖励)分离,以及一个确定性条件,在该条件下从掩码广播中分散恢复是精确的(经验验证)。实验量化了广播的价值、一个正比例缩放律(每个机器人的未见对技能随团队规模增加)、以及低秩方法中最强的掩码鲁棒性和任意时间曲线,恢复了集中式全通信上限的大部分(约80%的技能收益),并在容量1竞争和基于机器人的感知实例中保持有效。

英文摘要

Multi-robot task allocation usually assumes some combination of communication, known task models, or a coordinator. We study the opposite extreme, a regime common in practice but overlooked in theory, which we name Zero-Knowledge MRTA (ZK-MRTA): a robot team with no prior knowledge (no task models, not even the latent rank), no communication (no messages, no parameter sharing, no coordinator), and only a partial and privately-noisy view of a public stream of teammates' outcomes. A hidden low-rank structure governs which robot suits which task, and there are far more tasks than rounds, so most (robot, task) pairs are never attempted. Yet each robot can act well on tasks it never attempted, and onboard new tasks, by running online low-rank collaborative filtering over the broadcast (SwarmCF). The advantage over any structure-free learner is categorical, not a constant factor: a structure-free learner is provably at the prior-mean error floor on unseen pairs. We prove a matching per-robot sample complexity (Θ(d) versus Θ(n), in the rank d and the task count n), an anytime (cumulative-reward) separation under task scarcity, and a deterministic condition under which decentralized recovery from the masked broadcast is exact (validated empirically). Experiments quantify the value of the broadcast, a positive scaling law (per-robot unseen-pair skill rises with team size), and the strongest masking-robustness and anytime profile among low-rank methods, recovering most (about 80% on earned skill) of a centralized full-communication ceiling, and holding under capacity-1 contention and in a robotics-grounded sensing instance.
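
A minimal sketch of why low-rank structure helps on unseen (robot, task) pairs, under simplifying assumptions that are not from the paper: a noise-free rank-1 skill matrix, roughly half of the pairs observed, and plain SGD matrix factorization over a replayed stream. The name `fit_low_rank` and all hyperparameters are hypothetical; this is not SwarmCF itself.

```python
import numpy as np

def fit_low_rank(stream, n_robots, n_tasks, rank=2, lr=0.05, epochs=200, seed=0):
    """SGD matrix factorization over (robot, task, outcome) observations."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(0.1, 0.5, size=(n_robots, rank))  # robot factors
    V = rng.uniform(0.1, 0.5, size=(n_tasks, rank))   # task factors
    for _ in range(epochs):
        for i, j, y in stream:
            err = y - U[i] @ V[j]
            u_old = U[i].copy()
            U[i] += lr * err * V[j]
            V[j] += lr * err * u_old
    return U, V

rng = np.random.default_rng(1)
n_robots, n_tasks = 8, 30
# Hidden rank-1 skill matrix (assumption made for this sketch).
truth = rng.uniform(0.5, 1.5, (n_robots, 1)) @ rng.uniform(0.5, 1.5, (1, n_tasks))
seen = rng.random((n_robots, n_tasks)) < 0.5
stream = [(i, j, truth[i, j]) for i, j in zip(*np.nonzero(seen))]

U, V = fit_low_rank(stream, n_robots, n_tasks)
pred = U @ V.T
# Error on never-attempted pairs: low-rank model versus a structure-free prior-mean guess.
mse_cf = float(np.mean((pred[~seen] - truth[~seen]) ** 2))
mse_mean = float(np.mean((truth[seen].mean() - truth[~seen]) ** 2))
```

The prior-mean baseline stands in for a structure-free learner, which has no information about pairs it never observed.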

2605.25581 2026-05-26 cs.LG

Learning Latent Dynamical Causal Processes for Single-Cell Perturbation Prediction

学习单细胞扰动预测的潜在动态因果过程

Wenkang Jiang, Yuhang Liu, Erdun Gao, Ehsan Abbasnejad, Lina Yao, Javen Qinfeng Shi

发表机构 * AIML, Adelaide University(AIML,阿德莱德大学) Responsible AI Research Centre(负责任人工智能研究中心) Monash University(莫纳什大学) University of New South Wales(新南威尔士大学)

AI总结 提出一种潜在动态因果生成模型(CITE-VAE),联合捕获潜在细胞程序、扰动条件机制和时间演化,实现单细胞扰动预测的分布外泛化。

Comments Accepted to SIGKDD 2026 AI4Science Track

详情
AI中文摘要

单细胞扰动预测旨在推断细胞如何响应未见过的干预,并实现分布外(OOD)泛化,为理解扰动如何随时间重塑细胞程序提供计算途径。现有的机器学习方法取得了重要进展,但通常仅捕捉响应的一方面。潜在因果方法寻求支持泛化和解释的机制,但往往将扰动效应视为静态结果。时间模型描述基因表达随时间的变化,但通常不显式恢复驱动这些变化的潜在因果生成机制。在实践中,扰动效应既是潜在的也是动态的:干预通过未观察到的细胞程序起作用,这些程序的状态随时间演变并产生观察到的表达谱。受此观点启发,我们提出一个用于单细胞扰动数据的潜在动态因果生成模型,联合捕获潜在细胞程序、扰动条件机制和时间演化。我们进一步提供可识别性分析,表明在适当条件下,潜在因果变量可恢复至标准等价类。在此分析指导下,我们开发了CITE-VAE,一个从单细胞测序数据中恢复潜在细胞程序及其扰动驱动动态的学习框架。在Causal-3DIdent上的实验验证了理论结果和所提方法在受控环境中的有效性。在真实世界的基于CRISPR的单细胞扰动数据上的额外实验表明,与最先进的基线相比,对未见扰动的泛化能力有所提升,突显了我们方法的实际鲁棒性。

英文摘要

Single-cell perturbation prediction aims to infer how cells respond to unseen interventions and to achieve out-of-distribution (OOD) generalization, providing a computational route to understanding how perturbations reshape cellular programs over time. Existing machine learning methods have made important progress, but typically capture only one side of the response. Latent causal approaches seek mechanisms that support generalization and interpretation, yet often treat perturbation effects as static outcomes. Temporal models describe how gene expression changes across time, but usually do not explicitly recover the latent causal generative mechanisms driving these changes. In practice, perturbation effects are both latent and dynamical: interventions act through unobserved cellular programs, whose states evolve over time and give rise to observed expression profiles. Motivated by this view, we propose a latent dynamical causal generative model for single-cell perturbation data that jointly captures latent cellular programs, perturbation-conditioned mechanisms, and temporal evolution. We further provide an identifiability analysis showing that, under suitable conditions, the latent causal variables are recoverable up to standard equivalence classes. Guided by this analysis, we develop CITE-VAE, a learning framework for recovering latent cellular programs and their perturbation-driven dynamics from single-cell sequencing data. Experiments on Causal-3DIdent validate the theoretical results and the effectiveness of the proposed method in controlled settings. Additional experiments on real-world CRISPR-based single-cell perturbation data show improved generalization to unseen perturbations compared with state-of-the-art baselines, highlighting the practical robustness of our approach.

2605.25577 2026-05-26 cs.LG cs.AI

Geometric Flow Matching for Molecular Conformation Generation via Manifold Decomposition

基于流形分解的几何流匹配分子构象生成

Yunqing Liu, Yi Zhou, Wenqi Fan

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出GO-Flow方法,通过将生成过程分解为平移、旋转和构象三个物理子空间,利用流形上的最优传输和测地流,解决现有方法忽略分子几何层次结构的问题,实现高质量、高效率的分子构象生成。

详情
AI中文摘要

生成准确的3D分子构象是计算化学和药物发现中的关键挑战。最近,扩散和流匹配模型取得了显著成功。然而,它们的数学公式与分子的物理现实之间存在严重的不匹配。现有方法主要将分子视为笛卡尔空间中的无结构点云,忽略了键长和键角相对刚性而扭转角构成主要柔性自由度的内在层次力学。这种对流形的不感知迫使模型从头重新学习基本几何约束,常常导致物理上不可信的中间结构。为了解决这个问题,我们提出了GO-Flow,通过流形分解将生成建模与分子几何对齐。GO-Flow不是强制在欧几里得空间中运动,而是将生成过程分解为三个物理驱动的子空间:具有线性最优输运的平移空间、$SO(3)$上具有测地流的旋转空间以及具有熵最优输运的构象空间。这种分解注入了几何归纳偏置,使生成路径更好地与分子自由度对齐。当与等变神经架构结合时,它鼓励旋转一致的生成并提高几何有效性。在GEOM-Drugs和GEOM-QM9上的大量实验表明,GO-Flow实现了最先进的生成质量。值得注意的是,通过在正确的流形上自然地学习更直的概率路径,我们的方法能够在仅50步的情况下实现高保真采样,有效弥合了结构精度与计算效率之间的差距。

英文摘要

The generation of accurate 3D molecular conformations is a pivotal challenge in computational chemistry and drug discovery. Recently, diffusion and flow matching models have achieved remarkable success. However, there is a critical misalignment between their mathematical formulation and the physical reality of molecules. Existing approaches predominantly treat molecules as unstructured point clouds in Cartesian space, overlooking the intrinsic hierarchical mechanics in which bond lengths and bond angles are relatively stiff, whereas torsion angles constitute the dominant flexible degrees of freedom. This lack of manifold awareness forces models to relearn fundamental geometric constraints from scratch, often leading to physically implausible intermediate structures. To address this, we propose GO-Flow, which aligns generative modeling with molecular geometry via manifold decomposition. Instead of forcing motion through Euclidean space, GO-Flow decomposes the generation process into three physically motivated subspaces: translation space with linear optimal transport, rotation space with geodesic flows on $SO(3)$, and conformation space with entropic optimal transport. This decomposition injects geometric inductive biases and makes the generative paths better aligned with molecular degrees of freedom. When combined with equivariant neural architectures, it encourages rotation-consistent generation and improves geometric validity. Extensive experiments on GEOM-Drugs and GEOM-QM9 demonstrate that GO-Flow achieves state-of-the-art generation quality. Notably, by learning straighter probability paths on the correct manifolds, our method enables high-fidelity sampling with as few as 50 steps, effectively bridging the gap between structural precision and computational efficiency.
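
A minimal sketch of a geodesic path on $SO(3)$, the kind of interpolant the rotation subspace refers to; it uses the standard formula R(t) = R0 exp(t log(R0^T R1)). The helper names are illustrative, the log map below is not robust near rotation angles of pi, and nothing here is taken from the GO-Flow code.

```python
import numpy as np

def log_so3(R):
    """Rotation matrix -> rotation vector (axis * angle)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * axis / (2.0 * np.sin(theta))

def exp_so3(v):
    """Rotation vector -> rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(v)
    if theta < 1e-8:
        return np.eye(3)
    k = v / theta
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def geodesic(R0, R1, t):
    """Point at time t in [0, 1] on the shortest rotation path from R0 to R1."""
    return R0 @ exp_so3(t * log_so3(R0.T @ R1))

def rot_z(deg):
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]])

mid = geodesic(rot_z(30), rot_z(120), 0.5)
```

The translation component would use the ordinary straight line x_t = (1 - t) x_0 + t x_1; every intermediate `geodesic` output stays a valid rotation, which a straight line between matrix entries would not guarantee.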

2605.25574 2026-05-26 cs.CV cs.AI

Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending

Mosaic: 通过向量场混合的组合式多概念擦除

Junseok Ko, Jungwoo Kim, Jong-Seok Lee

发表机构 * Department of Artificial Intelligence, Yonsei University(延世大学人工智能系) School of Integrated Technology, Yonsei University(延世大学整合技术学院)

AI总结 针对流式文本到图像模型中同时擦除多个目标概念的任务,提出Mosaic框架,通过动态构建概念特定掩码并选择性混合向量场,无需额外优化即可有效移除复杂场景中的多概念。

详情
AI中文摘要

概念擦除已成为确保文本到图像(T2I)模型安全与伦理图像合成的关键研究方向。现有研究虽探索了多概念擦除,但通常假设每张图像仅有一个目标概念,这一限制被现代基于流的T2I模型日益暴露,此类模型可同时生成包含多个概念的复杂场景。为弥补这一空白,我们引入组合式多概念擦除这一新任务,旨在同时移除单个场景中的多个目标概念。我们提出CoME-Bench,一个用于评估组合式多概念擦除的基准,涵盖类别内和跨类别场景。我们进一步提出Mosaic,一个用于基于流的T2I模型中多概念擦除的新框架,该框架通过动态构建概念特定掩码并选择性混合它们,利用向量场中目标概念的空间局部性,无需额外优化。大量实验表明,Mosaic能有效移除复杂组合场景中的多个目标概念,同时保留非目标上下文。

英文摘要

Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While existing studies have explored concept erasure across multiple concepts, they typically assume only a single target concept per image, a limitation increasingly exposed by modern flow-based T2I models, which can generate complex scenes with multiple concepts simultaneously. To address this gap, we introduce compositional multi-concept erasure, a new task that aims to simultaneously remove multiple target concepts within a single scene. We propose CoME-Bench, a benchmark for evaluating compositional multi-concept erasure, which covers both intra- and cross-category scenarios. We further propose Mosaic, a novel framework for multi-concept erasure in flow-based T2I models, which exploits the spatial locality of target concepts in the vector field by dynamically constructing concept-specific masks and selectively blending them without additional optimization. Extensive experiments demonstrate that Mosaic effectively removes multiple target concepts in complex compositional scenes while preserving non-target contexts.
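
A minimal sketch of mask-based vector field blending, assuming the concept-specific masks are already given (how Mosaic constructs them dynamically is not described in the abstract). The function `blend_fields`, the normalization of overlapping masks, and the array shapes are assumptions made for illustration.

```python
import numpy as np

def blend_fields(v_orig, v_erased, masks):
    """Blend per-concept erased velocity fields into the original field.

    v_orig:   (H, W, C) velocity predicted for the unmodified prompt.
    v_erased: (K, H, W, C) velocity for each of K concept-erased conditions.
    masks:    (K, H, W) soft masks in [0, 1] marking where each concept lives.
    """
    masks = np.clip(masks, 0.0, 1.0)
    total = masks.sum(axis=0)                    # (H, W)
    cover = np.clip(total, 0.0, 1.0)             # how much of each pixel is edited
    weights = masks / np.maximum(total, 1e-8)    # share overlapping pixels among concepts
    mixed = (weights[..., None] * v_erased).sum(axis=0)
    return (1.0 - cover)[..., None] * v_orig + cover[..., None] * mixed

v_orig = np.zeros((2, 2, 1))
v_erased = np.stack([np.full((2, 2, 1), 1.0), np.full((2, 2, 1), 2.0)])
masks = np.zeros((2, 2, 2))
masks[0, 0, 0] = 1.0      # concept 0 occupies pixel (0, 0)
masks[1, 1, 1] = 1.0      # concept 1 occupies pixel (1, 1)
masks[:, 0, 1] = 0.5      # both concepts overlap at pixel (0, 1)
blended = blend_fields(v_orig, v_erased, masks)
```

Pixels outside every mask keep the original field, which corresponds to the goal of preserving non-target context, and no optimization step is involved.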

2605.25572 2026-05-26 cs.CL cs.AI

PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

PennySynth:基于RAG的数据合成用于自动量子代码生成

Minghao Shao, Nouhaila Innan, Hariharan Janardhanan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

发表机构 * eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD)(eBRAIN实验室,工程系,纽约大学阿布扎比分校) Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute(量子与拓扑系统中心(CQTS),NYUAD研究所) Department of Computer Science and Engineering, NYU Tandon School of Engineering(计算机科学与工程系,纽约大学坦顿工程学院)

AI总结 提出PennySynth框架,通过检索增强生成和代码感知嵌入,利用13,389个PennyLane指令-代码对数据集,在QHack竞赛中实现52%-68%的pass@5,显著提升量子代码生成的结构有效性和功能正确性。

Comments 11 pages, 3 figures

详情
AI中文摘要

量子编程框架日益增长的复杂性暴露了现有基于大语言模型(LLM)的代码助手的一个关键局限性:通用模型在面对专门的量子编码挑战时,会幻觉出PennyLane特定的门名称、错误放置设备配置并生成结构无效的电路。我们提出PennySynth,一个检索增强生成框架,通过将LLM推理条件化为一个包含13,389个PennyLane指令-代码对的精选知识库来解决这一差距,该知识库通过一个三阶段(提取、验证和去重)流程从官方PennyLane仓库、社区GitHub源和QHack竞赛档案中构建。PennySynth引入了一种使用st-codesearch-distilroberta-base的代码感知嵌入策略,该策略针对自然语言到代码的检索进行训练,将平均检索余弦相似度从通用基线的0.45提高到0.726。在涵盖QHack竞赛三年(2022、2023、2024)的74个挑战上进行评估,PennySynth在QHack 2022、2023和2024上分别达到64%、68%和52%的pass@5,相比无检索的Claude Sonnet 4.6提高了+28、+25和+28个百分点。我们进一步引入了一个量子适应的CodeBLEU指标,该指标对qml.*令牌模式进行加权,并表明结构代码相似性和功能正确性捕捉了量子代码质量的不同方面。受控消融实验揭示,代码感知嵌入是检索性能的主要驱动因素,而当检索质量足够精确时,数据集扩展和源组合提供了额外的增益。

英文摘要

The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based code assistants: general-purpose models hallucinate PennyLane-specific gate names, misplace device configurations, and produce structurally invalid circuits when faced with specialized quantum coding challenges. We present PennySynth, a retrieval-augmented generation framework that addresses this gap by conditioning LLM inference on a curated knowledge base of 13,389 PennyLane instruction-code pairs, built via a three-stage extraction, verification, and deduplication pipeline over official PennyLane repositories, community GitHub sources, and QHack competition archives. PennySynth introduces a code-aware embedding strategy using st-codesearch-distilroberta-base, trained for natural-language-to-code retrieval, increasing average retrieval cosine similarity from 0.45 to 0.726 compared to a general-purpose baseline. Evaluated across 74 challenges spanning three years of the QHack competition (2022, 2023, 2024), PennySynth achieves 64%, 68%, and 52% pass@5 on QHack 2022, 2023, and 2024, respectively, improving over Claude Sonnet 4.6 without retrieval by +28, +25, and +28 percentage points. We further introduce a quantum-adapted CodeBLEU metric that upweights qml.* token patterns and show that structural code similarity and functional correctness capture distinct aspects of quantum code quality. Controlled ablations reveal that code-aware embeddings are the primary driver of retrieval performance, while dataset expansion and source composition provide additional gains when retrieval quality is sufficiently precise.
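
A minimal sketch of the two quantities the abstract reports, retrieval by cosine similarity and pass@k. The embedding vectors are placeholders rather than outputs of st-codesearch-distilroberta-base, and the pass@k function is the commonly used unbiased estimator; whether PennySynth computes pass@5 this way is an assumption.

```python
import math
import numpy as np

def top_k_by_cosine(query_vec, corpus_vecs, k=3):
    """Return indices and cosine similarities of the k closest corpus entries."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

def pass_at_k(n, c, k):
    """Probability that at least one of k samples passes, given c of n samples passed."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Placeholder embeddings (assumption): three instruction-code pairs in a 2-D space.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx, sims = top_k_by_cosine(np.array([1.0, 0.1]), corpus, k=2)
```

Retrieved pairs would then be placed in the prompt before generation; the abstract's ablations attribute most of the gain to the quality of this retrieval step.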

2605.25571 2026-05-26 cs.CV

AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution

AnE: 通过锚点进化推动多模态大语言模型的推理前沿

Zehao Wang, Yihan Zeng, Zidong Gong, Yuanfan Guo, Feng Zhu, Hongzhi Zhang, Wei Zhang, Wangmeng Zuo

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Huawei Noah's Ark Lab(华为诺亚实验室) Independent Researcher(独立研究员)

AI总结 提出锚点进化(AnE)范式,通过真值锚点数据策展和脚手架剥离机制,解决多模态大模型推理中的认知漂移和幻觉路径问题,显著提升推理性能。

Comments 34 pages, 10 figures

详情
AI中文摘要

通过监督微调(SFT)和强化学习(RL)进行的后训练对于增强多模态大语言模型(MLLMs)的推理能力至关重要,然而现有范式由于静态数据的限制常常达到性能瓶颈。虽然当前方法利用自我反思或自我进化来突破这些界限,但它们仍然受到低质量合成数据导致的认知漂移和幻觉推理路径的影响。为了解决这些挑战,我们提出了锚点进化(AnE),一种整合了真值锚点数据策展和模型进化的新范式,在推理前沿实现了忠实且稳定的性能提升。具体来说,我们提出了真值锚点扩展,通过轨迹展开定位模型失败前沿,并利用真实数据库检索高保真锚点以进行忠实的数据策展。随后,我们引入了脚手架剥离机制来内化推理能力。该机制首先通过脚手架增强监督来锚定推理路径,以减轻直接在原始数据上进行SFT的学习复杂性和分布漂移,然后利用强化学习剥离脚手架模板,从而有效地将推理路径转化为内在模型能力。在多模态推理基准上的实验结果表明,我们的方法显著推进了模型性能前沿,在八个多模态基准上将基础模型提升了10.3%,并达到了最先进的结果。代码将公开提供。

英文摘要

Post-training via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is crucial for enhancing reasoning in Multimodal Large Language Models (MLLMs), yet existing paradigms often reach a performance bottleneck due to the limitations of static data. While current methods leverage self-reflection or self-evolution to push these boundaries, they still suffer from cognitive drift and hallucinated reasoning paths caused by low-quality synthetic data. To address these challenges, we propose Anchor Evolution (AnE), a new paradigm that integrates truth-anchored data curation and model evolution, achieving faithful and steady performance gains at the reasoning frontier. Specifically, we propose Truth Anchor Expansion, which pinpoints the model's failure frontier via trajectory rollouts and leverages ground-truth databases to retrieve high-fidelity anchors for faithful data curation. Subsequently, we introduce the Scaffold-Stripping Mechanism to internalize reasoning capabilities. This mechanism first anchors reasoning paths via scaffold-augmented supervision to mitigate the learning complexity and distribution drift of direct SFT on raw data, then leverages RL to strip the scaffold template, thereby effectively transitioning the reasoning paths into intrinsic model capabilities. Experimental results on multimodal reasoning benchmarks show that our method substantially advances the model performance frontier, improving the base model by 10.3% across eight multimodal benchmarks and achieving state-of-the-art results. The code will be made publicly available.

2605.25568 2026-05-26 cs.CV

Rethinking Scribble-Guided Image Editing: Generalization, Instruction Adherence, and Multi-Tasking

重新思考涂鸦引导的图像编辑:泛化、指令遵循与多任务

Mingyi Xu, Jinpeng Lin, Min Zhou, Tiezheng Ge, Ming Zeng

发表机构 * Xiamen University(厦门大学) Taobao & Tmall Group of Alibaba(阿里巴巴淘宝与天猫集团)

AI总结 针对涂鸦引导图像编辑在多任务场景下性能不稳定的问题,通过实证研究揭示指令级泛化瓶颈,提出覆盖-真实课程、多任务拼接和编辑聚焦损失三种策略,在VIBE基准上实现单任务和多任务的最优结果。

详情
AI中文摘要

涂鸦引导的图像编辑允许用户将简单的涂鸦注释与文本提示相结合,以指定图像编辑的位置和方式,从而实现灵活交互和精确的空间控制。然而,现有模型在这种范式下仍表现出不稳定的性能,尤其是在多任务场景中。为了提升性能,我们使用开源编辑模型进行实证研究,并揭示了泛化中的不对称性:指令级泛化(包括跨编辑任务以及从单任务到多任务设置)比图像域泛化(例如从合成图像到真实图像,或从马赛克图像到常规图像)更具挑战性。这表明主要瓶颈在于对多样化编辑指令的学习不足,而非图像域差异。受此启发,我们提出了三种策略:(a) 覆盖-真实课程,一个两阶段流程,首先构建大规模合成、指令丰富的数据以提供广泛的任务监督,然后精选少量真实数据以细化生成的真实性;(b) 多任务拼接,通过几乎零成本地拼接单任务样本来构建多任务训练样本,同时使学习到的能力泛化到非马赛克图像;(c) 编辑聚焦损失,利用合成数据中输入和输出图像之间的变化区域,将训练聚焦于编辑区域,提高学习效率和编辑准确性。通过这些策略,我们在VIBE基准上显著提升了单任务和多任务涂鸦引导编辑的性能,取得了最先进的结果。我们将公开发布我们的数据集和模型。

英文摘要

Scribble-guided image editing allows users to combine simple scribble annotations with text prompts to specify both where and how an image should be edited, enabling flexible interaction with precise spatial control. However, existing models still exhibit unstable performance under this paradigm, especially in multi-task scenarios. To improve performance, we conduct empirical studies using an open-source editing model and reveal an asymmetry in generalization: instruction-level generalization, including across editing tasks and from single-task to multi-task settings, is more challenging than image-domain generalization, such as from synthetic to real-world images or from mosaicked to regular images. This suggests that the primary bottleneck lies in insufficient learning for diverse editing instructions rather than in the image domain gap. Motivated by this insight, we propose three strategies: (a) a Coverage-then-Realism Curriculum, a two-stage pipeline that first builds large-scale synthetic, instruction-rich data for broad task supervision, then curates a small set of real-world data to refine generation realism; (b) Multi-Task Mosaicking, which constructs multi-task training samples by concatenating single-task examples at nearly zero cost while enabling the learned capability to generalize to non-mosaicked images; and (c) an Edit-Focused Loss, which leverages the changed regions between input and output images in synthetic data to focus training on edited regions, improving both learning efficiency and editing accuracy. With these strategies, we substantially improve both single-task and multi-task scribble-guided editing on the VIBE benchmark, achieving state-of-the-art results. We will publicly release our dataset and model.
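
A minimal sketch of an edit-focused loss, assuming a weighted mean-squared error in which pixels that differ between the input and output images receive a larger weight. The threshold `tau`, the weight `w_edit`, and the function name are hypothetical; the abstract does not give the exact formulation.

```python
import numpy as np

def edit_focused_loss(pred, target, source, tau=0.05, w_edit=5.0):
    """Weighted MSE that emphasizes regions changed between source and target.

    pred, target, source: arrays of shape (H, W, C) with values in [0, 1].
    """
    # A pixel counts as edited if any channel changed by more than tau.
    changed = np.abs(target - source).max(axis=-1) > tau      # (H, W)
    weights = np.where(changed, w_edit, 1.0)[..., None]       # (H, W, 1)
    sq_err = (pred - target) ** 2
    return float((weights * sq_err).sum() / (weights.sum() * pred.shape[-1]))

source = np.zeros((4, 4, 3))
target = source.copy()
target[0, 0] = 1.0                       # the only edited pixel
pred_in = target.copy()
pred_in[0, 0] -= 0.5                     # error inside the edited region
pred_out = target.copy()
pred_out[3, 3] += 0.5                    # same error outside the edited region
loss_in = edit_focused_loss(pred_in, target, source)
loss_out = edit_focused_loss(pred_out, target, source)
```

Because the change mask comes from paired input and output images, this kind of weighting is only available where such pairs exist, which matches the abstract's use of synthetic data for it.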

2605.25566 2026-05-26 cs.AI

Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

基于大语言模型的不确定性推理用于可解释疾病诊断

Xiaoyang Fan, Yufan Cai, Zhe Hou, Jin Song Dong

发表机构 * National University of Singapore(新加坡国立大学) Griffith University(格里菲斯大学)

AI总结 提出一种神经符号推理框架,将大语言模型与模糊逻辑和声明式规则结合,实现可解释且形式可验证的医学诊断。

详情
AI中文摘要

临床决策需要对不完整、不精确且以语言表达的患者叙述进行推理。虽然大语言模型(LLMs)擅长从自然语言中提取潜在信息,但它们缺乏可信赖医疗AI所必需的可验证性和可解释性。我们提出一种神经符号推理框架,将LLMs与形式逻辑对齐,以实现可解释且形式可验证的医学诊断。患者描述和临床指南被嵌入神经知识库,其中LLMs提取结构化医疗实体、时间关系和模糊症状模式,这些被解码为用模糊逻辑和声明式规则表达的符号知识库。我们执行两阶段推理:(1)归纳符号泛化,从编码叙述中捕获诊断模式;(2)通过逻辑编程引擎进行推理验证,推导并验证符合临床标准的诊断。每个症状被视为具有概率权重的模糊谓词,推理路径可审计、可调整,并与医生反馈兼容。与纯统计方法不同,我们的系统支持迭代优化:LLM生成的诊断与真实情况之间的偏差可以通过形式规则追踪、解释和纠正。通过结合基于逻辑的透明性、LLM的适应性和概率鲁棒性,该框架实现了与人类一致的医疗推理,具有强泛化能力和可验证的逐步推理链。我们在公开基准上验证了该框架,展示了符号推理与LLM在真实临床叙述中的有效协调。结果显示,性能与最先进的LLM相当,同时额外提供了可解释的推理路径和形式可验证的诊断结论。

英文摘要

Clinical decision-making requires reasoning over incomplete, imprecise, and linguistically expressed patient narratives. While large language models (LLMs) excel at extracting latent information from natural language, they lack the verifiability and interpretability essential for trustworthy medical AI. We propose a neuro-symbolic reasoning framework that aligns LLMs with formal logic to enable explainable and formally verifiable medical diagnosis. Patient descriptions and clinical guidelines are embedded into a neural knowledge base, where LLMs extract structured medical entities, temporal relations, and fuzzy symptom patterns, which are decoded into a symbolic knowledge base expressed in fuzzy logic and declarative rules. We perform two-stage reasoning: (1) inductive symbolic generalization to capture diagnostic patterns from encoded narratives, and (2) inference verification via a logic programming engine to derive and validate diagnoses consistent with clinical standards. Each symptom is treated as a fuzzy predicate with probabilistic weights, and inference paths are auditable, adjustable, and compatible with physician feedback. Unlike purely statistical methods, our system supports iterative refinement: misalignment between LLM-generated diagnoses and ground truth can be traced, explained, and corrected through formal rules. By combining logic-based transparency, LLM adaptability, and probabilistic robustness, the framework enables human-aligned healthcare inference with strong generalization and verifiable, step-by-step reasoning chains. We validate our framework on public benchmarks, demonstrating effective reconciliation of symbolic reasoning and LLMs with real-world clinical narratives. Results show performance comparable to state-of-the-art LLMs, while additionally providing interpretable reasoning paths and formally verifiable diagnostic conclusions.
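
A minimal sketch of weighted fuzzy rule evaluation with an auditable trace, assuming the minimum t-norm for conjunction. The symptom names, membership degrees, rule weights, and condition labels are toy placeholders with no clinical meaning, and the paper's actual rule language and logic programming engine are not reproduced.

```python
def evaluate_rules(symptoms, rules):
    """Score each rule and keep the membership degrees that produced the score.

    symptoms: dict mapping a symptom predicate to a degree in [0, 1].
    rules:    dict mapping a conclusion to (list of antecedents, rule weight).
    """
    results = {}
    for conclusion, (antecedents, weight) in rules.items():
        # Missing symptoms count as degree 0, so the rule cannot fire on them.
        degrees = {s: symptoms.get(s, 0.0) for s in antecedents}
        strength = weight * min(degrees.values())   # fuzzy AND = minimum
        results[conclusion] = (strength, degrees)
    return results

# Toy inputs (assumption): degrees an LLM extraction step might emit.
symptoms = {"fever": 0.8, "cough": 0.6, "rash": 0.1}
rules = {
    "condition_a": (["fever", "cough"], 0.9),
    "condition_b": (["fever", "rash"], 1.0),
}
results = evaluate_rules(symptoms, rules)
best = max(results, key=lambda name: results[name][0])
```

Returning the per-antecedent degrees alongside each score is what makes the inference path inspectable: a reviewer can see which predicate limited a conclusion and adjust the rule or the extracted degree.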

2605.25565 2026-05-26 cs.LG cs.CL

RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism

RotMoLE:通过旋转门控机制增强混合低秩专家

Mengyang Sun, Maochuan Dou, Tao Feng, Dan Zhang, Yihao Wang, Junpeng Liu, Yifan Zhu, Jie Tang

发表机构 * Tsinghua University(清华大学) Beijing Information Science and Technology University(北京信息科技大学) National University of Singapore(新加坡国立大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 针对MoE-LoRA中传统门控仅标量加权限制表示能力的问题,提出RotMoLE框架,通过引入旋转门控机制对每个专家进行旋转操作,提升专家利用率和专业化程度,在多任务和多语言训练中验证有效性。

详情
AI中文摘要

虽然大型语言模型(LLM)通常在进行垂直应用之前会针对特定领域任务进行微调,但将它们适应于具有多样化专业知识的复杂场景仍然具有挑战性。与此同时,混合专家(MoE)架构已成为训练LLM的关键范式,最近的一些工作也将MoE引入参数高效微调(PEFT),提出了混合低秩专家(MoE-LoRA),以增强低秩适配器学习复杂知识的能力。然而,MoE中的传统门控机制通常仅对选中的专家应用标量重新加权,从而限制了其表示和泛化的潜在能力。受MoE-LoRA中低秩结构的启发和推动,我们提出了RotMoLE,一个专门针对低秩专家的MoE框架,其特点是一个额外的旋转门控。除了简单的缩放,RotMoLE为每个选中的专家实现了一个旋转机制,从而在专家候选有限的情况下,实现了更好的专家利用和专业化,以学习多样化的数据。在复杂多任务和多语言训练场景下的实证结果验证了我们的有效性。

英文摘要

While Large Language Models (LLMs) are commonly fine-tuned on domain-specific tasks before being deployed in vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging. Meanwhile, the Mixture-of-Experts (MoE) architecture has emerged as a crucial paradigm for training LLMs, and some recent works have incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT), yielding the Mixture of Low-rank Experts (MoE-LoRA), which strengthens low-rank adapters for learning complex knowledge. However, conventional gating mechanisms in MoE typically apply only a scalar reweighting to selected experts, thereby limiting their capacity for representation and generalization. Motivated and enabled by the low-rank structures in MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts featuring an additional rotation gate. Beyond simple scaling, RotMoLE implements a rotation mechanism for each selected expert, enabling better expert exploitation and specialization for learning diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate the effectiveness of our method.
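
A minimal sketch of what rotating a low-rank expert could look like, assuming a rank-2 LoRA expert whose bottleneck activation is rotated by an angle before the up-projection. How RotMoLE parametrizes and predicts its rotation is not stated in the abstract, so `theta`, the rank, and the function names are purely illustrative.

```python
import numpy as np

def rotation_2d(theta):
    """2x2 rotation matrix acting on a rank-2 bottleneck."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def rotated_lora_delta(x, A, B, gate, theta):
    """Expert output: gate * B @ R(theta) @ A @ x.

    A: (2, d_in) down-projection, B: (d_out, 2) up-projection.
    With theta = 0 this reduces to ordinary scalar-gated LoRA.
    """
    h = A @ x                                  # low-rank activation, shape (2,)
    return gate * (B @ (rotation_2d(theta) @ h))

rng = np.random.default_rng(0)
x = rng.standard_normal(6)
A = rng.standard_normal((2, 6))
B = rng.standard_normal((4, 2))
plain = rotated_lora_delta(x, A, B, gate=0.7, theta=0.0)
rotated = rotated_lora_delta(x, A, B, gate=0.7, theta=np.pi / 3)
```

A scalar gate can only rescale the expert's output, whereas the rotation changes its direction inside the low-rank subspace at the cost of very few extra parameters, which is cheap precisely because the bottleneck is small.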

2605.25563 2026-05-26 cs.CV

CodecSplat: Ultra-Compact Latent Coding for Feed-Forward 3D Gaussian Splatting

CodecSplat: 用于前馈式3D高斯泼溅的超紧凑潜在编码

Pengpeng Yu, Runqing Jiang, Qi Zhang, Dingquan Li, Jing Wang, Yulan Guo

发表机构 * Sun Yat-sen University(中山大学) Peking University(北京大学) Pengcheng Laboratory(鹏城实验室)

AI总结 提出CodecSplat框架,通过将压缩集成到前馈式高斯生成流水线中,利用结构化中间特征表示实现超紧凑场景编码,显著降低存储和传输开销。

详情
AI中文摘要

尽管前馈式3D高斯泼溅无需逐场景优化即可从稀疏上下文视图重建可渲染的高斯基元,但现有流水线并未提供紧凑的场景表示用于存储或传输。一种自然的解决方案是将现有的3DGS压缩方法应用于生成的高斯基元。然而,这种方法作用于最终的不规则3D表示,且与内部特征到高斯的生成过程解耦,限制了压缩效率。为解决此问题,我们引入了CodecSplat,一种用于前馈式3D高斯泼溅的超紧凑潜在编码框架。CodecSplat首先将中间2D高斯生成特征编码为熵编码的场景比特流。在解码器端,潜在特征被重建并用于预测深度和高斯参数,然后映射到3D高斯基元。注意,通过将压缩集成到前馈式高斯生成流水线中,CodecSplat避免了对不规则3D高斯基元的低效压缩,并允许编解码器利用结构化的中间特征表示。我们在前馈式高斯泼溅骨干网络上实例化了CodecSplat,该网络具有深度引导的多视图特征细化和分层学习特征编解码器。在DL3DV和RealEstate10K数据集上,CodecSplat分别实现了23.56-26.36 dB和24.76-27.05 dB的PSNR,每场景仅需20.00-107.77 KiB和3.37-12.51 KiB。这比压缩前馈式生成的高斯基元大约小一个数量级,同时保持了可控的率失真行为。

英文摘要

While feed-forward 3D Gaussian splatting reconstructs renderable Gaussian primitives from sparse context views without per-scene optimization, existing pipelines do not provide a compact scene representation for storage or transmission. A natural solution is to apply existing 3DGS compression methods to the generated Gaussian primitives. However, this approach operates on the final irregular 3D representation and is decoupled from the internal feature-to-Gaussian generation process, which limits compression efficiency. To address this, we introduce CodecSplat, an ultra-compact latent coding framework for feed-forward 3D Gaussian splatting. CodecSplat first encodes an intermediate 2D Gaussian-generation feature into an entropy-coded scene bitstream. At the decoder, the latent feature is reconstructed and used to predict depth and Gaussian parameters, which are then mapped to 3D Gaussian primitives. Note that, by integrating compression into the feed-forward Gaussian generation pipeline, CodecSplat avoids inefficient compression over irregular 3D Gaussian primitives and allows the codec to exploit the structured intermediate feature representation. We instantiate CodecSplat on a feed-forward Gaussian splatting backbone with depth-guided multi-view feature refinement and a hierarchical learned feature codec. On DL3DV and RealEstate10K datasets, CodecSplat achieves 23.56-26.36 dB and 24.76-27.05 dB PSNR with only 20.00-107.77 KiB and 3.37-12.51 KiB per scene, respectively. This is roughly one order of magnitude smaller than compressing feed-forward generated Gaussian primitives, while preserving controllable rate-distortion behavior.

2605.25561 2026-05-26 cs.CV

Are We Overconfident in Models and Results for Semi-Supervised 3D Medical Image Segmentation?

我们在半监督3D医学图像分割的模型和结果上是否过于自信?

Jun Li, Ziwei Qin

发表机构 * Institute of Systems Science and Technology, School of Electrical Engineering, Southwest Jiaotong University, China(系统科学与技术研究院,电气工程学院,西南交通大学,中国)

AI总结 针对半监督医学图像分割中伪标签框架的确认偏差和基准测试集使用不当导致的性能高估问题,提出一种基于双轴可靠性评估的三空间校准分割框架(TCSeg),以解耦置信度与不确定性并协同校正偏差。

Comments Accepted by ICML 2026

详情
AI中文摘要

半监督学习已成为减少标注成本的主流范式。然而,我们认为当前的进展被双重过度自信问题所掩盖。在算法层面,主流的伪标签框架常常将预测置信度与不确定性混为一谈,导致严重的确认偏差。在策略层面,由于多个基准数据集缺乏专用的验证集,一些研究也使用测试集进行验证,导致性能估计膨胀。后续方法为了超越已报告的最先进水平而被迫采用相同策略,引发了过拟合的军备竞赛。这引发了担忧,即社区中令人印象深刻的数值提升可能反映的是过拟合而非真正的进步。因此,我们提出了一种基于原则性双轴可靠性评估引擎的三空间校准分割框架。它明确地将置信度与不确定性解耦,并利用这一信号在特征空间、概率空间和图像空间中以协作方式检测和纠正确认偏差。在三个基准数据集上,TCSeg在现有评估协议下始终提供强大的性能。更重要的是,我们主张社区在多次运行协议下报告最终检查点结果,从而以更现实的视角建立更严格的基准。代码将公开:github.com/DirkLiii/TCSeg。

英文摘要

Semi-supervised learning has become a dominant paradigm for reducing annotation costs. However, we argue that current progress is clouded by a twofold overconfidence problem. Algorithmically, mainstream pseudo-labeling frameworks often conflate prediction confidence with uncertainty, leading to severe confirmation bias. Strategically, since multiple benchmark datasets lack dedicated validation sets, some studies also use the test set for validation, leading to inflated performance estimates. Subsequent methods, compelled to employ the same strategy to surpass the reported SOTA, trigger an arms race of overfitting. This raises concerns that the impressive numerical gains in the community may reflect overfitting rather than genuine progress. Thus, we propose a tri-space calibrated segmentation framework (TCSeg) founded on a principled dual-axis reliability assessment engine. It explicitly decouples confidence from uncertainty and uses this signal to detect and correct confirmation bias across feature, probability, and image spaces in a collaborative manner. Across three benchmark datasets, TCSeg consistently delivers strong performance under existing evaluation protocols. More importantly, we advocate that the community report final-checkpoint results under multiple-run protocols, thereby establishing more rigorous benchmarks with a more realistic perspective. Code will be available at github.com/DirkLiii/TCSeg.
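
A minimal sketch of treating confidence and uncertainty as separate axes, assuming confidence is the maximum of the mean softmax output and uncertainty is the total variance across several stochastic forward passes. The thresholds and the function name are hypothetical, and TCSeg's actual reliability engine may define both quantities differently.

```python
import numpy as np

def reliability(probs, conf_thr=0.9, unc_thr=0.01):
    """Flag voxels whose pseudo-labels are both confident and stable.

    probs: (T, N, C) softmax outputs from T stochastic passes over N voxels.
    """
    mean = probs.mean(axis=0)                       # (N, C)
    confidence = mean.max(axis=-1)                  # how peaked the average is
    uncertainty = probs.var(axis=0).sum(axis=-1)    # how much the passes disagree
    reliable = (confidence >= conf_thr) & (uncertainty <= unc_thr)
    return reliable, confidence, uncertainty

probs = np.array([
    [[0.95, 0.05], [1.0, 0.0], [0.6, 0.4]],
    [[0.95, 0.05], [1.0, 0.0], [0.6, 0.4]],
    [[0.95, 0.05], [1.0, 0.0], [0.6, 0.4]],
    [[0.95, 0.05], [0.7, 0.3], [0.6, 0.4]],
])
reliable, confidence, uncertainty = reliability(probs)
```

The second voxel is the case a confidence-only filter would miss: its average prediction clears the confidence threshold, yet the passes disagree enough that accepting it as a pseudo-label risks confirmation bias.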

2605.25558 2026-05-26 cs.AI

Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

超越查询记忆化:基于查询分解和历史匹配的大语言模型路由

Bo Lv, Jingbo Sun

发表机构 * Tencent Hunyuan(腾讯混元) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出DecoR路由框架,通过查询能力分解和历史日志匹配来避免记忆化陷阱,在保持高准确率的同时降低推理成本。

详情
AI中文摘要

优化预测性能与计算成本之间的权衡是大语言模型(LLM)部署中的核心关注点。当前的路由方法主要依赖于基于表面特征的查询到模型的直接映射,使其容易陷入记忆化陷阱,并导致在分布外(OOD)数据上的泛化能力差。在本文中,我们提出DecoR,一种新颖的路由框架,将路由任务重新定义为从历史日志中筛选相似查询的匹配过程,有效缓解了记忆化陷阱。为了提高匹配准确性,我们引入了一种查询能力分解方法,将语言表面形式与任务内在需求解耦,将匹配导向能力维度,从而将决策基于基本任务属性。此外,我们开发了CodaSet,一个用于评估路由泛化能力的综合基准,实验结果表明,DecoR在分布内和OOD设置下均保持优越的准确性,同时大幅降低推理成本。所有代码和数据可在https://github.com/lvbotenbest/DecoR获取。

英文摘要

Optimizing the trade-off between predictive performance and computational cost is a central focus in the deployment of Large Language Models (LLMs). Current routing methods primarily rely on a direct mapping from queries to models based on surface-level features, making them susceptible to the memorization trap and leading to poor generalization on out-of-distribution (OOD) data. In this paper, we propose DecoR, a novel routing framework that recasts the routing task as a matching process that retrieves similar queries from historical logs, effectively mitigating the memorization trap. To enhance matching accuracy, we introduce a query capability deconstruction method that decouples linguistic surface forms from task-intrinsic requirements, directing matching toward capability dimensions to ground decisions in essential task attributes. Furthermore, we develop CodaSet, a comprehensive benchmark for assessing routing generalization. Experimental results demonstrate that DecoR maintains superior accuracy while substantially lowering inference costs across both in-distribution and OOD settings. All code and data are available at https://github.com/lvbotenbest/DecoR.
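
A minimal sketch of routing by historical matching, assuming each logged query already has a capability vector and a record of which models answered it correctly. The capability dimensions, the accuracy threshold, and the cost figures are hypothetical, and the LLM-based capability decomposition step is not modeled.

```python
import numpy as np

def route(query_caps, hist_caps, hist_correct, costs, k=2, min_acc=0.6):
    """Pick the cheapest model that did well on the k most similar past queries.

    query_caps:   (D,) capability vector of the new query.
    hist_caps:    (M, D) capability vectors of logged queries.
    hist_correct: (M, n_models) 1 if the model answered that query correctly.
    costs:        (n_models,) relative inference cost per model.
    """
    q = query_caps / np.linalg.norm(query_caps)
    h = hist_caps / np.linalg.norm(hist_caps, axis=1, keepdims=True)
    nearest = np.argsort(-(h @ q))[:k]
    est_acc = hist_correct[nearest].mean(axis=0)
    good = np.nonzero(est_acc >= min_acc)[0]
    if good.size == 0:
        return int(np.argmax(est_acc))         # fall back to the most accurate model
    return int(good[np.argmin(costs[good])])   # cheapest model that is good enough

# Toy log (assumption): capability axes are [math, coding]; models are [small, large].
hist_caps = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
hist_correct = np.array([[0, 1], [0, 1], [1, 1], [1, 1]])
costs = np.array([1.0, 10.0])
coding_choice = route(np.array([0.0, 1.0]), hist_caps, hist_correct, costs)
math_choice = route(np.array([1.0, 0.0]), hist_caps, hist_correct, costs)
```

Matching in capability space rather than on surface wording is what lets a rephrased or out-of-distribution query still land next to logged queries with the same underlying requirements.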

2605.25554 2026-05-26 cs.AI

PHGNet: Prototype-Guided Hypergraph Construction for Heterogeneous Spatiotemporal Forecasting

PHGNet: 原型引导的超图构建用于异质时空预测

Ruiwen Gu, Yahao Liu, Zhenyu Liu, Qitai Tan, Xiao-Ping Zhang

发表机构 * Shenzhen Ubiquitous Data Enabling Key Lab(深圳市泛在数据赋能重点实验室) Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) School of Computer Science and Engineering(计算机科学与工程学院) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出基于原型引导超图构建的时空预测框架PHGNet,通过原型学习机制自适应地将模式相似节点分配到超边以捕获高阶交互,并引入全局-局部节点表示模块和迭代残差细化与时间查询注意力机制提升预测精度。

AI中文摘要

作为智能交通系统的核心任务,交通预测在城市交通管理中起着关键作用。准确的交通预测依赖于对复杂时空依赖关系的建模,而由于交通系统中的空间异质性,这本身就具有挑战性。尽管取得了显著进展,大多数现有方法仍局限于成对空间依赖建模,难以捕获具有相似交通模式的节点之间的动态高阶交互。为了解决这个问题,我们提出了PHGNet,一种基于原型引导超图构建的新型时空预测框架。在PHGNet的核心,设计了一种原型学习机制,自适应地将模式相似的节点分配到超边,从而捕获具有时变结构的高阶交互。为了提高动态超图构建的可靠性,我们进一步开发了一个全局-局部节点表示模块来提取时间一致的特征。对于预测,引入了迭代残差细化和时间查询注意力机制,以提高预测精度并支持高效的并行解码。在多个真实世界数据集上的大量实验表明,与最先进的方法相比,PHGNet实现了优越的预测性能。

英文摘要

As a core task in intelligent transportation systems, traffic forecasting plays a critical role in urban traffic management. Accurate traffic forecasting relies on modeling complex spatiotemporal dependencies, which is inherently challenging due to spatial heterogeneity in traffic systems. Despite significant progress, most existing methods are still limited to pairwise spatial dependency modeling, making it difficult to capture dynamic high-order interactions among nodes with similar traffic patterns. To address this issue, we propose PHGNet, a novel spatiotemporal forecasting framework based on prototype-guided hypergraph construction. At the core of PHGNet, a prototype learning mechanism is designed to adaptively assign pattern-similar nodes to hyperedges, thereby capturing high-order interactions with time-varying structures. To improve the reliability of dynamic hypergraph construction, we further develop a global-local node representation module to extract time-consistent features. For forecasting, iterative residual refinement and Temporal Query Attention are introduced to improve forecasting accuracy while supporting efficient parallel decoding. Extensive experiments on multiple real-world datasets demonstrate that PHGNet achieves superior predictive performance compared with state-of-the-art methods.
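To make the idea of assigning pattern-similar nodes to hyperedges more concrete, here is a minimal sketch in which each prototype defines one hyperedge and nodes are softly assigned by a softmax over dot-product similarities. The function names, the dot-product similarity, the `temperature` parameter, and the two-step mean propagation are illustrative assumptions, not PHGNet's actual design, which the abstract does not detail.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def assign_to_hyperedges(node_emb, prototypes, temperature=1.0):
    """Soft incidence matrix H (nodes x hyperedges): one hyperedge per prototype."""
    incidence = []
    for z in node_emb:
        sims = [sum(a * b for a, b in zip(z, p)) / temperature for p in prototypes]
        incidence.append(softmax(sims))
    return incidence


def hypergraph_propagate(node_feat, incidence):
    """Node -> hyperedge -> node message passing with the soft incidence weights."""
    n, k, d = len(incidence), len(incidence[0]), len(node_feat[0])
    edge_feat = []
    for e in range(k):
        weight = sum(incidence[i][e] for i in range(n))  # > 0 because softmax is positive
        edge_feat.append(
            [sum(incidence[i][e] * node_feat[i][c] for i in range(n)) / weight for c in range(d)]
        )
    return [
        [sum(incidence[i][e] * edge_feat[e][c] for e in range(k)) for c in range(d)]
        for i in range(n)
    ]


# Nodes 0 and 1 share a pattern; node 2 follows a different one.
node_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
H = assign_to_hyperedges(node_emb, prototypes, temperature=0.1)
out = hypergraph_propagate([[10.0], [12.0], [50.0]], H)
```

Recomputing `H` from fresh embeddings at every time step would give a hypergraph whose structure varies over time, which is the property the abstract emphasizes.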

2605.25553 2026-05-26 cs.CV cs.RO

ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

ComPose:用于鲁棒类别级物体姿态估计的统一补全-姿态框架

Huan Ren, Yihan Chen, Chuxin Wang, Nailong Liu, Wenfei Yang, Tianzhu Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学) National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory(深空探测全国重点实验室,深空探测实验室) Beijing Institute of Control Engineering(北京控制工程研究所)

AI总结 提出ComPose框架,通过关键点渐进补全模块和几何关系一致性损失,将形状补全与姿态估计紧密集成,在不依赖类别级形状先验的情况下提升点云不完整场景下的姿态估计精度和效率。

Comments Accepted by CVPR 2026 (Oral, Best Paper Award Candidate). Project page is available at renhuan1999.github.io/ComPose

AI中文摘要

类别级物体姿态估计旨在预测特定类别中任意物体的姿态和尺寸。现有方法难以处理观测点云固有的不完整性,这限制了它们捕捉完整物体形状以实现鲁棒姿态推理的能力。虽然点云补全提供了一种有前景的解决方案,但将其作为部分观测的独立预处理步骤会引入复合误差和额外计算开销,最终阻碍准确性和效率。为解决这些挑战,我们提出了ComPose,一种新颖的统一框架,紧密集成形状补全以提供完整的几何线索,从而增强姿态估计。ComPose的核心是一个基于关键点的渐进补全模块,通过逐步预测稀疏关键点及其周围的密集点集来恢复完整形状表示,使关键点能够捕捉整体物体几何结构。几何关系编码模块进一步用局部和全局几何上下文丰富关键点特征。此外,我们引入了一种新颖的几何关系一致性损失,以强制观测关键点与其预测的NOCS坐标之间的结构对齐,确保全局一致的坐标变换。在标准基准上的大量实验表明,我们的方法在不依赖类别级形状先验的情况下优于现有最先进方法。

英文摘要

Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors.

2605.25551 2026-05-26 cs.LG

Learning Permutation from Structure Without Supervision

从结构中无监督学习排列

Ran Eisenberg, Ofir Lindenbaum

发表机构 * Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel(巴伊兰大学工程学院,拉马特甘,以色列)

AI总结 提出熵自适应Gumbel-Sinkhorn方法,通过局部调节温度改善无监督排列学习的稳定性和质量。

AI中文摘要

许多学习问题需要揭示隐藏的排序,以揭示无序数据中的结构,例如排序中的单调性或拼图重建中的空间连续性。在这些设置中,排列可以作为潜在算子,通过优化直接定义在重排序输出上的目标来学习,且通常无法获得真实排序。Gumbel-Sinkhorn等可微松弛方法用双随机矩阵近似排列矩阵,使这种方法变得实用。然而,无监督地从结构中学习会带来非均匀的不确定性:一些分配很早就变得自信,而另一些分配仍然模糊。现有方法使用单个全局温度控制这一过程,迫使所有分配同时锐化或扩散,导致在大规模问题上不稳定。我们引入了一种熵自适应的Gumbel-Sinkhorn公式,根据分配不确定性局部调节温度。这使得自信的分配可以尽早离散化,同时在仍存在不确定性的位置保留探索。在排序和拼图重建任务以及路由式设置中,相对于固定温度基线,自适应熵控制提高了训练稳定性和最终排列质量,特别是在问题规模和分配模糊性增加时。

英文摘要

Many learning problems require uncovering a hidden ordering that reveals structure in unordered data, such as monotonicity in sorting or spatial continuity in jigsaw reconstruction. In these settings, permutations can be learned as latent operators by optimizing objectives defined directly on the reordered output, often without access to ground-truth orderings. Differentiable relaxations such as Gumbel-Sinkhorn make this approach practical by approximating permutation matrices with doubly stochastic matrices. However, learning from structure without supervision induces a non-uniform uncertainty: some assignments become confident early, while others remain ambiguous. Existing methods control this process using a single global temperature, forcing all assignments to sharpen or diffuse simultaneously and leading to instability at scale. We introduce an entropy-adaptive formulation of Gumbel-Sinkhorn that locally modulates temperature based on assignment uncertainty. This allows confident assignments to discretize early while preserving exploration where uncertainty remains. Across sorting and jigsaw reconstruction tasks and in routing-style settings, adaptive entropy control improves training stability and final permutation quality relative to fixed-temperature baselines, particularly as problem size and assignment ambiguity increase.
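The Gumbel-Sinkhorn relaxation mentioned above can be sketched in a few lines: add Gumbel noise to a score matrix, divide by a temperature, and alternately normalize rows and columns until the result is close to doubly stochastic. The per-row temperature rule in `entropy_adaptive_taus` (a linear map from normalized row entropy to a range `[tau_min, tau_max]`) is a hypothetical stand-in chosen for illustration; the abstract does not give the paper's actual entropy-adaptive formulation.

```python
import math
import random


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(log_alpha, n_iters=50):
    """Alternate row and column normalization in log space."""
    n = len(log_alpha)
    la = [row[:] for row in log_alpha]
    for _ in range(n_iters):
        for i in range(n):  # rows
            z = logsumexp(la[i])
            la[i] = [v - z for v in la[i]]
        for j in range(n):  # columns
            z = logsumexp([la[i][j] for i in range(n)])
            for i in range(n):
                la[i][j] -= z
    return [[math.exp(v) for v in row] for row in la]


def gumbel_sinkhorn(scores, taus, noise_scale=1.0, rng=None, n_iters=50):
    """Soft permutation from scores; taus holds one temperature per row."""
    rng = rng or random.Random()
    log_alpha = []
    for row, tau in zip(scores, taus):
        noisy = []
        for s in row:
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            gumbel = -math.log(-math.log(u))
            noisy.append((s + noise_scale * gumbel) / tau)
        log_alpha.append(noisy)
    return sinkhorn(log_alpha, n_iters)


def row_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_adaptive_taus(P, tau_min=0.2, tau_max=1.0):
    """Hypothetical rule: confident (low-entropy) rows get a lower temperature.

    Requires at least two rows, since the entropy is normalized by log(n).
    """
    max_h = math.log(len(P))
    return [
        tau_min + (tau_max - tau_min) * min(1.0, row_entropy(row) / max_h)
        for row in P
    ]


# Row 0 has a clear best column; rows 1 and 2 are nearly tied.
scores = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.9], [0.0, 0.9, 1.0]]
P = gumbel_sinkhorn(scores, [1.0, 1.0, 1.0], noise_scale=0.0)
taus = entropy_adaptive_taus(P)
P_adaptive = gumbel_sinkhorn(scores, taus, noise_scale=0.0)
```

With a single global temperature, row 0 and the ambiguous rows 1 and 2 would be sharpened at the same rate; the per-row temperatures let row 0 sharpen while the other two stay soft.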

2605.25548 2026-05-26 cs.LG cs.AI

'Si'multaneous 'S'patial-'T'emporal Message Passing for Dynamic Graph Representation Learning

Shubhajit Roy, Anirban Dasgupta

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Indian Institute of Technology Gandhinagar(印度理工学院甘地纳加尔)

AI总结 提出SiST-GNN,通过在一个消息传递操作中融合空间和时间信号,实现动态图表示学习的联合推理,在链接预测任务上超越先前方法109%-277%。

AI中文摘要

操作于快照序列的动态图神经网络(DGNN)通常分为两类:“时间优先”方法先构建每个节点的时间嵌入,然后进行空间聚合;而“空间优先”方法则颠倒这一顺序,将图卷积的输出馈送到下游时间模块。无论哪种情况,严格的顺序迫使第二阶段消耗第一阶段已压缩的摘要,排除了对拓扑和演化的联合推理;具体而言,消息传递算子永远无法根据邻居的过去轨迹来加权其贡献。本文介绍了SiST-GNN(Simultaneous Spatial-Temporal GNN),它在单个消息传递操作中融合两种信号,而不是将它们串联。具体地,在每个快照中,我们为每个节点维护一个循环隐藏状态来总结其历史,将其与节点当前特征向量配对,并将该配对视为由跨时间边连接的两个节点;在此时间增强图上运行标准图卷积,得到更新后的表示。我们的实证研究涵盖九个公开基线和十四个模型-数据集组合,覆盖固定分割和实时更新评估场景。在每个公开基准上,SiST-GNN在链接预测任务中相对于最强先前方法,在固定分割设置中提升109%-277%,在实时更新设置中提升68%-194%。我们还通过离散化底层连续时间事件流,构建了三个动态节点分类任务;在此,SiST-GNN以7%-22%的优势击败领先的离散时间(DTDG)基线,并与直接消费原始事件的连续时间(CTDG)方法相匹配。

英文摘要

Dynamic graph neural networks (DGNNs) that operate on snapshot sequences typically fall into one of two categories. Temporal-first approaches build per-node temporal embeddings and only afterwards perform spatial aggregation, whereas spatial-first approaches invert this order, feeding the output of a graph convolution into a downstream temporal module. In either case, the rigid sequencing forces the second stage to consume an already-compressed summary produced by the first, ruling out joint reasoning over topology and evolution; concretely, the message-passing operator never gets to weight a neighbor's contribution by that neighbor's past trajectory. This paper introduces SiST-GNN (Simultaneous Spatial-Temporal GNN), which fuses the two signals inside a single message-passing operation rather than chaining them. Concretely, at each snapshot we maintain a recurrent hidden state per node that summarises its history, pair it with the node's current feature vector, and treat the pair as two nodes joined by a cross-time edge; running a standard graph convolution on this temporally augmented graph yields the updated representation. Our empirical study spans nine public baselines and fourteen model-dataset combinations, covering both fixed-split and live-update evaluation regimes. Across every public benchmark, SiST-GNN sets a new state of the art on the link prediction task, improving over the strongest prior method by 109% to 277% in the fixed-split setting and by 68% to 194% in the live-update setting. We additionally construct three dynamic node-classification tasks by discretising the underlying continuous-time event streams; here SiST-GNN beats the leading discrete-time (DTDG) baseline by 7% to 22% and matches continuous-time (CTDG) methods that consume the raw events directly.
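The temporally augmented graph described above can be sketched as follows: every node gets a second "history" copy that holds its hidden state, a cross-time edge joins the two copies, and one graph convolution runs over the doubled graph. Several details here are assumptions made for illustration, since the abstract does not state them: spatial edges are attached to the current-feature copies only, features and hidden states share one dimension, the convolution is an unweighted mean over neighbors with a self-loop (no learned parameters), and the output is reused as the next hidden state. The function names are hypothetical.

```python
def build_augmented_edges(num_nodes, snapshot_edges):
    """Directed edge list over 2 * num_nodes vertices.

    Vertices 0..n-1 are current-feature copies; vertices n..2n-1 are history copies.
    """
    edges = set()
    for u, v in snapshot_edges:
        if u != v:  # self-loops are added by the convolution itself
            edges.add((u, v))
            edges.add((v, u))
    for i in range(num_nodes):
        history_copy = num_nodes + i
        edges.add((i, history_copy))  # cross-time edge, both directions
        edges.add((history_copy, i))
    return sorted(edges)


def mean_graph_conv(features, edges):
    """Parameter-free graph convolution: average over self and in-neighbors."""
    n = len(features)
    neighbors = [[v] for v in range(n)]  # start with a self-loop
    for u, v in edges:
        neighbors[v].append(u)
    dim = len(features[0])
    return [
        [sum(features[u][c] for u in neighbors[v]) / len(neighbors[v]) for c in range(dim)]
        for v in range(n)
    ]


def sist_step(x_t, hidden, snapshot_edges):
    """One snapshot update: convolve over the augmented graph, keep the current copies."""
    n = len(x_t)
    stacked = x_t + hidden  # rows 0..n-1: current features, rows n..2n-1: history
    out = mean_graph_conv(stacked, build_augmented_edges(n, snapshot_edges))
    return out[:n]  # assumed to serve as the next hidden state


x_t = [[1.0], [2.0], [3.0]]
hidden = [[0.0], [0.0], [6.0]]
updated = sist_step(x_t, hidden, [(0, 1)])
```

In this toy run, isolated node 2 still mixes its current feature with its own history through the cross-time edge, while nodes 0 and 1 combine spatial and temporal signals in the same aggregation.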