arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30039 2026-06-01 cs.AI

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang

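AI summary: Proposes the DOMINO framework, which learns a minimal sufficient domain representation through contrastive disentanglement and uses it to guide the generation of domain-aligned synthetic data, improving fine-tuning performance when the domain is defined only implicitly.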

Comments
Accepted by KDD 2026
Abstract

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.

2605.30018 2026-06-01 cs.CL cs.LG

Latent Performance Profiling of Large Language Models

Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti, Supratik Chakraborty, Partha Pratim Das, Lipika Dey, Richa Singh, Mayank Vatsa

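AI summary: Proposes the Latent Performance Profiling (LPP) framework, which extracts task-agnostic diagnostic metrics from hidden activations and output distributions to reveal intrinsic model traits and complement traditional benchmark evaluation.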

Abstract

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU-Pro, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes information, calibrates uncertainty, or structures internal knowledge. In this article, we advocate for a shift from benchmark-centric evaluation toward a complementary, state-centered intrinsic assessment of LLMs. To this end, we introduce Latent Performance Profiling (LPP), a framework that derives task-agnostic diagnostics from hidden activations and output distributions. LPP defines a set of scalar metrics on a model's latent representations and dynamics, revealing scale-independent traits that enable interpretable comparisons and uncover hidden vulnerabilities. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. With extensive empirical analyses across eight LLMs, spanning a size range of 0.5B-14B, we demonstrate that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability. Guided by these insights, we design synthetic probes for uncertainty and symbolic reasoning that align with intrinsic metrics while decoupling from leaderboard bias. We recommend reporting LPP alongside benchmarks to provide a deeper, interpretable understanding of model behavior, enabling more reliable model selection, safety assessment, and evaluation beyond surface-level accuracy.
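
The abstract names entropy of the output distribution as one latent trait on which models can differ. The sketch below is not the LPP implementation: the function names and the choice of mean per-token Shannon entropy are assumptions made purely for illustration of what a scalar, task-agnostic metric over output distributions might look like.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(logits_per_token):
    """Hypothetical scalar metric: predictive entropy averaged over positions."""
    entropies = [shannon_entropy(softmax(row)) for row in logits_per_token]
    return sum(entropies) / len(entropies)
```

A uniform distribution over four tokens gives the maximum value, log 4 nats, while a sharply peaked distribution gives a value close to zero, so two models with the same accuracy can still be told apart by how confident their predictions are.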

2605.29879 2026-06-01 cs.CV cs.RO

DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding

Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li

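AI summary: Proposes DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system that couples a probabilistic voxel grid with explicit 3D Gaussians for robust cross-modal instance fusion and incremental semantic mapping, and builds a hierarchical scene graph and a 3D Gaussian Mind for multimodal reasoning, achieving leading performance in zero-shot 3D visual grounding, open-vocabulary semantic segmentation, and scene reconstruction.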

Comments
9 pages, 6 figures
Abstract

Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at https://icr-lab.github.io/DGSG-Mind

2605.29852 2026-06-01 cs.CV cs.LG cs.MM

Parameter-Efficient Subspace Decoupling ViT for Mitigating Multi-Task Negative Transfer in Histological Scoring

Youhan Huang, Jiajun Li, Yilin Fang, Shuai Wang, Chuheng Li

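AI summary: Proposes a subspace-decoupled multi-task Vision Transformer that builds independent feature subspaces through lightweight task-specific adapters and orthogonality constraints, reducing task interference while retaining shared representations and thereby mitigating multi-task negative transfer.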

Comments
6 pages, 5 figures, 2 tables. IEEE ICME 2026 (Oral). Camera-ready version
Abstract

Histological scoring is essential for diagnosing Non-Alcoholic Fatty Liver Disease (NAFLD), yet its automation remains challenging due to the high annotation cost and negative transfer among the strongly correlated NAFLD Activity Score (NAS) indicators in multi-task learning. To address this issue, we propose a subspace-decoupled multi-task Vision Transformer (ViT) that integrates lightweight task-specific Adapters with orthogonality-based constraints. This design constructs independent feature subspaces for steatosis, ballooning, and inflammation, effectively reducing task interference while retaining shared representations. We further construct a curated multi-task mouse NAFLD histology dataset with expert annotations for all NAS components. Experimental results demonstrate that the proposed method improves multi-task stability and generalization with substantially reduced computational cost compared to training separate single-task models. The code and the curated dataset have been prepared and will be made publicly available upon acceptance to support reproducibility.
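
The abstract does not state the exact form of the orthogonality-based constraint. One common choice, shown here strictly as an assumption, is to penalize the squared Frobenius norm of the cross-Gram matrix between the adapter bases of different tasks, which is zero exactly when the task subspaces are mutually orthogonal. The helper names and the list-of-rows matrix layout are illustrative only.

```python
def cross_gram(a, b):
    """Return A^T B for two matrices stored as lists of rows (same row count)."""
    rows, cols_a, cols_b = len(a), len(a[0]), len(b[0])
    return [[sum(a[k][i] * b[k][j] for k in range(rows)) for j in range(cols_b)]
            for i in range(cols_a)]

def orthogonality_penalty(adapters):
    """Sum of squared entries of A_i^T A_j over all distinct task pairs.

    `adapters` is a list of (features x rank) matrices, one per task.
    This is an assumed regularizer, not the paper's stated loss.
    """
    total = 0.0
    for i in range(len(adapters)):
        for j in range(i + 1, len(adapters)):
            gram = cross_gram(adapters[i], adapters[j])
            total += sum(v * v for row in gram for v in row)
    return total
```

With one adapter spanning the first two feature axes and another spanning the last two, the penalty is zero; two identical rank-2 orthonormal adapters give a penalty of 2.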

2605.29833 2026-06-01 cs.AI

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields

Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang, Jue Wang, Ran Sun, Zhuo Yang, Wanli Ouyang, Lei Bai, Tianfan Fu, Lu Chen, Xin Chen, Yuqiang Li

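AI summary: Because existing benchmarks overlook the reasoning process from materials knowledge to application, proposes OmniMatBench, which contains 3,171 expert-curated QA and calculation problems across 19 subfields; an evaluation of 13 multimodal large language models finds that the best model scores only 0.372, revealing a substantial gap in materials-science reasoning.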

Comments
22 Pages
Abstract

As multimodal language models play an increasingly important role in scientific research, materials science offers a critical testbed due to its interdisciplinary, multimodal, and application-driven nature. However, existing materials benchmarks mainly focus on property prediction, knowledge QA, or characterization understanding, leaving the broader reasoning process from materials knowledge to application underexplored. To fill this gap, we present OmniMatBench, a human-calibrated multimodal reasoning benchmark for materials science. OmniMatBench contains 3,171 expert-curated QA and calculation problems across 19 materials-science subfields, spanning fundamental materials knowledge, structural and engineering materials, materials processing and manufacturing, and functional and applied materials. We evaluate 13 open-source and closed-source MLLMs and find that the best model achieves only a 0.372 overall score, revealing a substantial gap in current materials-science reasoning. Further analysis shows strong variation across subfields, fixed reasoning heuristics, uneven materials knowledge, and limited high-level knowledge application under formula-, retrieval-, and code-assisted settings. OmniMatBench provides crucial insights into the capabilities and limitations of current MLLMs and establishes a foundation for reliable AI assistants in materials-science research.

2605.29796 2026-06-01 cs.AI cs.CL cs.LG

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su

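AI summary: Proposes SAAS, a reinforcement learning framework that gives LLM agents dynamic self-awareness through search boundary modeling, a boundary-aware reward, and a stage-wise optimization strategy, substantially reducing over-search without lowering accuracy.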

Abstract

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe over-search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search while maintaining accuracy. Our code and implementation details are released at https://github.com/XMUDeepLIT/SAAS.

2605.29751 2026-06-01 cs.CL

DySem: Uncovering Dynamic Semantic Components of Large Language Models for Calculating Semantic Textual Similarity

Kaijie Zheng, Weiqin Wang, Yile Wang, Hui Huang

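AI summary: Proposes the DySem framework, which uses multilingual consensus to extract the internal components of LLMs that are more relevant to semantics and builds a text-dependent joint semantic set to compute similarity over dynamic dimensions; it is training-free and outperforms the baselines.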

Comments
18 pages, 23 figures, 5 tables
Abstract

Calculating semantic textual similarity is a foundational task in natural language processing. Current large language model (LLM)-based methods typically rely on extracting last-layer hidden states with fixed dimensions to compute similarity for every text pair. We argue that this paradigm suffers from two limitations: (i) the last hidden layer encodes more general knowledge rather than just semantic knowledge, making it suboptimal for semantic similarity computation; (ii) the hidden layer dimensions of LLMs are generally very large, which introduces redundancy and noise when representing semantics. In this work, we propose DySem, a novel training-free framework that investigates more semantic-related internal components of LLMs via multilingual consensus, and shifts away from static representation spaces in favor of dynamic, sample-specific semantic dimensions by constructing a text-dependent joint semantic set and computing similarity over this shared dimensional subset. Extensive experiments across various LLMs show that our method consistently outperforms recent baselines while maintaining lower dimensions for similarity calculation. The code is released at https://github.com/szu-tera/DySem.
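
To make the idea of computing similarity over a shared, text-dependent subset of dimensions concrete, here is a minimal sketch. It is not the DySem procedure: the abstract does not say how semantic dimensions are chosen, so picking each text's `k` largest-magnitude dimensions and taking their union is an assumption, and both function names are hypothetical.

```python
import math

def top_k_dims(vec, k):
    """Indices of the k largest-magnitude entries (assumed selection rule)."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    return set(order[:k])

def joint_subset_cosine(u, v, k):
    """Cosine similarity restricted to the union of both texts' top-k dimensions."""
    dims = sorted(top_k_dims(u, k) | top_k_dims(v, k))
    dot = sum(u[i] * v[i] for i in dims)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in dims))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in dims))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # degenerate case: no signal on the selected dimensions
    return dot / (norm_u * norm_v)
```

The point of the restriction is that low-magnitude dimensions, which may carry noise rather than meaning, no longer affect the score, and the number of dimensions used varies from pair to pair.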

2605.29655 2026-06-01 cs.CV cs.GR

SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation

Yuan Li, Congyi Zhang, Xifeng Gao, Xiaohu Guo

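AI summary: Proposes the SuperVoxelGPT framework, which resolves the tension between sequence length and spatial ordering in autoregressive 3D generation through adaptive, ordered supervoxel tokenization, achieving high-quality and efficient shape generation.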

Abstract

Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes due to inadequate 3D tokenizations. Compact set-based representations discard deterministic spatial ordering, leading to ambiguous sequence prediction, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive and deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. Conditioned on the text and ordered supervoxel layout, we introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10× speedup over prior methods.

2605.24700 2026-06-01 cs.CV cs.GR

SRUG: Shadow-Guided Relightable Urban Scene with Generation Model

Yonghao Zhao, Zexin Yin, Jian Yang, Beibei Wang, Jin Xie

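AI summary: Proposes the SRUG framework, which uses shadows to guide a 3D completion model that recovers the geometry of invisible regions and combines iterative material decomposition with a physically based lighting model to produce relightable urban scenes from sparse input views.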

Abstract

Creating relightable urban scenes from images or videos is widely useful but highly ill-posed. Urban environments are typically unbounded and extend beyond the visible regions. As a result, many portions of the scene remain unobserved, yet these invisible regions can cast shadows onto visible areas. Reasonably modeling shadows cast by such invisible regions is challenging and poses a significant obstacle to creating relightable urban scenes. At the same time, sparse input views and complex illumination conditions further complicate relighting, as they introduce severe ambiguities in material decomposition. In this paper, we propose Shadow-guided Relightable Urban Scene with Generation model (SRUG), a novel framework designed to address relighting challenges in urban scenes. SRUG leverages shadows to guide a 3D completion model for recovering the geometry of invisible regions, promoting the synthesis of physically reasonable shadows. In addition, SRUG employs an iterative material decomposition scheme that applies the large material model (LMM) to provide material supervision and iteratively decompose the scene's material properties, enabling robust material decomposition. Building upon these components, we introduce a physically-based lighting model that captures the complex illumination of urban scenes and supports reliable relighting. Extensive quantitative evaluations and visual comparisons demonstrate that our method outperforms existing approaches in both novel view synthesis and relighting tasks.

2605.22737 2026-06-01 cs.LG cs.AI

The Distillation Game: Adaptive Attacks & Efficient Defenses

Youssef Allouah, Mahdi Haghifam, Sanmi Koyejo, Reza Shokri

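AI summary: Studies the deployment trade-off that distillation attacks pose for model providers through a minimax game framework, and proposes an adaptive evaluation rule and the Product-of-Experts (PoE) defense; experiments show that adaptive students recover more capability, and that PoE has advantages in cost and quality.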

Abstract

Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade-off through a minimax game between a utility-constrained teacher and an adaptive student. Our framework yields tractable one-sided response rules: an adaptive evaluation rule in which the student reweights high-value examples, and a teacher-side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value, we derive Product-of-Experts (PoE), a simple forward-pass-only defense that combines the teacher with a proxy student during generation. Empirically, adaptive evaluation reveals a large passive-adaptive gap: on state-of-the-art defenses, adaptive students recover substantially more capability than passive evaluation suggests on GSM8K and MATH. Under this stronger evaluation, the apparent robustness gap between expensive defenses and PoE narrows considerably, while PoE remains substantially cheaper and preserves higher-quality reasoning traces. Overall, our results suggest that strong distillation remains difficult to stop, and that progress on antidistillation should be judged against adaptive students rather than passive ones. Our code is available at: https://github.com/ysfalh/distillation-game.
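
The abstract says only that PoE combines the teacher with a proxy student during generation; it does not give the formula. The sketch below shows a generic weighted product of two next-token distributions in log space. The function name, the `proxy_weight` parameter, and the use of a negative weight to down-weight tokens the proxy already predicts well are all assumptions for illustration, not the paper's defense.

```python
import math

def product_of_experts(log_p_teacher, log_p_proxy, proxy_weight):
    """Return p(x) proportional to p_teacher(x) * p_proxy(x) ** proxy_weight.

    Inputs are per-token log-probabilities over the same vocabulary.
    The result is renormalized with a log-sum-exp for numerical stability.
    """
    scores = [lt + proxy_weight * lp
              for lt, lp in zip(log_p_teacher, log_p_proxy)]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]
```

With a weight of zero the teacher distribution is returned unchanged. Because the combination needs only the two models' forward-pass outputs, a rule of this shape is cheap to apply at every decoding step.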

2605.29417 2026-06-01 cs.CV

ParCo-SDF: Learning Prior-Free Partial-to-Complete Signed Distance Fields of Deformable Objects

Deokmin Hwang, Minseok Song, Daehyung Park

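AI summary: Proposes ParCo-SDF, a two-stage framework that uses temporal geometry encoding and FiLM-conditioned SDF prediction to reconstruct the complete geometry of deformable objects from partial observations without object-specific priors.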

Comments
Accepted at the 23rd International Conference on Ubiquitous Robots (UR 2026), 6 pages
Abstract

This study addresses the partial-to-complete geometry reconstruction of deformable objects (DOs) from point-cloud observations toward precise DO manipulation. Recent DO reconstruction approaches often adopt implicit neural representations (INRs) to model continuous surfaces and capture structural variability. However, these methods typically rely on object-specific shape priors that improve training stability but limit generalization. To address this, we introduce ParCo-SDF, a two-stage partial-to-complete signed distance field (SDF) reconstruction framework consisting of temporal geometry encoding followed by FiLM-conditioned SDF prediction. The temporal encoder captures structural similarity across DO sequences, enabling prior-free stable training. FiLM-based conditioning preserves reconstruction expressivity while reducing network complexity. We evaluate the proposed method against a state-of-the-art DO surface reconstruction baseline on a rubber band manipulation dataset, demonstrating robust and high-fidelity reconstruction under severe occlusions.
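
FiLM (feature-wise linear modulation) conditions a network by scaling and shifting each hidden feature with values predicted from a conditioning vector. The sketch below shows that operation on a single hidden vector. It is a generic illustration, not the ParCo-SDF network: the weight names and the use of plain linear maps to produce the scale and shift are assumptions.

```python
def linear(weights, bias, x):
    """Dense layer: one output per weight row, y_i = w_i . x + b_i."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(hidden, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: out_i = gamma_i * hidden_i + beta_i.

    gamma and beta are predicted from `cond`, for example a latent code
    produced by an encoder (hypothetical here).
    """
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]
```

Because the conditioning enters only through a per-feature scale and shift, the SDF network itself can stay small, which is consistent with the abstract's remark about reduced network complexity.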

2605.29373 2026-06-01 cs.LG cs.NA math.NA

Deep Adaptive Dimension Reduction for Bayesian Inference in Inverse Problems

Yueyang Wang, Xili Wang, Kejun Tang, Xiaoliang Wan, Tao Zhou, Chao Yang

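AI summary: Proposes a deep adaptive dimension-reduction Bayesian inference framework based on the Variational Flow model, combining VAE-based nonlinear dimension reduction, dual normalizing flows, and an iterative prior updating strategy with an adaptively fine-tuned Fourier Neural Operator surrogate to efficiently handle complex non-Gaussian posterior distributions in high-dimensional PDE-governed inverse problems.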

Comments
25 pages, 5 figures
Abstract

Solving high-dimensional PDE-governed inverse problems is often challenging due to complex non-Gaussian posterior distributions, expensive forward model evaluations, and misspecified prior information. To address these issues, we propose a deep adaptive dimension-reduction Bayesian inference framework based on the Variational Flow (VF) model. Since standard normalizing flows are restricted by bijective mappings and cannot directly reduce dimensions, VF overcomes this limitation by integrating VAE-based nonlinear dimension reduction with dual normalizing flows for the latent prior and encoder. This design provides a strictly higher evidence lower bound than VAE and allows more flexible approximation of complex posterior distributions. We further introduce an iterative prior updating strategy that gradually moves the prior mean toward high-probability posterior regions, avoiding manual prior tuning. These components form a closed adaptive loop together with an adaptively fine-tuned Fourier Neural Operator (FNO) surrogate: VF generates posterior-concentrated samples to refine the surrogate, while the updated surrogate further improves posterior inference. Numerical experiments on a 100-dimensional Rosenbrock problem and three standard PDE-governed inverse problems show that our method delivers competitive or superior accuracy compared with MCMC, UKI, and SVGD baselines across all tested configurations, with the most pronounced advantages emerging in challenging scenarios such as high-noise observations and high-dimensional parameter spaces.

2605.29343 2026-06-01 cs.CL

Draft-OPD: On-Policy Distillation for Speculative Draft Models

Haodi Lei, Yafu Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng

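AI summary: To address the performance bottleneck caused by the mismatch between offline training and online inference of draft models in speculative decoding, proposes Draft-OPD, which achieves on-policy distillation through target-assisted rollout and replay from verification-exposed error positions, reaching over 5× lossless acceleration across diverse tasks.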

Abstract

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models, as they cannot reliably roll out complete sequences independently, whereas target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
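
The verification step and the notion of acceptance length can be illustrated with the greedy variant of speculative decoding. This is a generic sketch rather than Draft-OPD or any particular system: `verify_draft` is a hypothetical helper, and it assumes the target model's greedy token at every draft position has already been computed in a single parallel pass.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of a block of drafted tokens.

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k + 1 greedy target tokens, where target_tokens[i] is the
                   target's choice given the prefix plus draft_tokens[:i].
    Returns (accepted, emitted): the number of draft tokens kept and the
    tokens actually appended to the output.
    """
    accepted = 0
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break
        accepted += 1
    # The target always contributes one token: the correction at the first
    # mismatch, or a bonus token when every draft token was accepted.
    emitted = list(draft_tokens[:accepted]) + [target_tokens[accepted]]
    return accepted, emitted
```

The emitted tokens always match what the target would have produced on its own under greedy decoding, which is why the speedup is lossless; the first mismatching position is also the kind of verification-exposed error location the abstract describes replaying from. Note that some papers count the extra target token in the reported acceptance length.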

2605.29317 2026-06-01 cs.CL

FoRA: Fisher-orthogonal Rank Adaptation for Parameter-Efficient Fine-Tuning

Juneyoung Park, Seongbae Lee, Han-Sang Lee, Kyuho Lee, Minjae Kim, Seungheon Hyeon, Kiduk Kwon, Seongwan Kim, Jaeho Lee

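AI summary: Proposes FoRA, which selects informative layers using Fisher information and trains the LoRA down-projection on the Stiefel manifold, maintaining performance with a smaller parameter budget and outperforming LoRA and DoRA.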

Comments
EMNLP 2026
Abstract

Parameter-efficient fine-tuning (PEFT) has largely focused on LoRA and its accuracy-oriented variants, while the original goal of reducing trainable parameters has received comparatively little attention. We introduce FoRA, which revisits this goal by reducing the number of adapted layers rather than adapter rank. FoRA selects task-informative layers via a single-pass diagonal Fisher score (under 1% of training cost) and trains the LoRA down-projection at selected layers on the Stiefel manifold, preserving column orthonormality and effective rank. FoRA consistently outperforms LoRA and DoRA at half their parameter budget, and falls within 0.7-0.8 accuracy points of AdaLoRA at one-quarter its parameter count, across five LLaMA-family backbones. Cross-architecture experiments on twelve backbones from the LLaMA, Qwen3, and Gemma families confirm consistent gains from 270M to 32B parameters. The two components combine super-additively: Fisher selection alone matches rank reduction at the same budget, while the Stiefel constraint provides the decisive additional gain.
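
The two ingredients named in the abstract can be sketched generically. A diagonal empirical Fisher score is the mean squared gradient of each parameter, aggregated here per layer, and a matrix on the Stiefel manifold has orthonormal columns. The code below is not the FoRA implementation: the per-layer sum, the top-`k` selection, and the use of Gram-Schmidt to restore orthonormal columns after an update are assumptions, and all names are hypothetical.

```python
import math

def diagonal_fisher_scores(per_sample_grads):
    """Per-layer diagonal empirical Fisher: sum over parameters of E[g^2].

    per_sample_grads: one dict per sample, {layer_name: [gradient entries]}.
    """
    n = len(per_sample_grads)
    scores = {}
    for grads in per_sample_grads:
        for name, g in grads.items():
            scores[name] = scores.get(name, 0.0) + sum(x * x for x in g) / n
    return scores

def select_layers(scores, k):
    """Keep the k layers with the largest Fisher score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def orthonormalize_columns(columns):
    """Modified Gram-Schmidt on linearly independent column vectors."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis
```

Selecting layers needs only one pass of gradient computation over a sample of the data, which fits the abstract's description of a single-pass score that costs a small fraction of training.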

2605.29299 2026-06-01 cs.CV cs.AI

Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan, Yiran Shen, Ting Dang, Hong Jia

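AI summary: Proposes the Pocket-Dentist benchmark; an evaluation of 14 vision-language models finds that compact models (2B parameters) achieve higher accuracy at lower computational cost in dental image understanding, and demonstrates low-latency deployment on an iPhone 17 Pro.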

Abstract

Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their computational cost. This limits their widespread deployment for dental screening outside specialist centres, where timely inference, limited hardware, and local handling of patient images are vital for practical, privacy-preserving clinical prescreening. Here we present Pocket-Dentist, an efficiency-aware benchmark for dental multimodal question answering that brings together three datasets spanning approximately 1,159 patients, five task types and seven metrics. Across 14 representative VLMs, our results reveal an interesting observation: compact VLMs (e.g., 2B-parameter models) outperform larger VLMs in accuracy while requiring substantially lower computational costs in dental image understanding. Deployed locally on an iPhone 17 Pro, our finetuned compact VLM Pocket-Dentist-2B processed each sample in 4.31 s, reducing latency by 4.9-fold and memory use by 2.3-fold compared with a 7B baseline.

2605.29268 2026-06-01 cs.CL cs.AI cs.LG cs.NE

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

进化搜索中的计算分配:从深度-广度到多臂老虎机

Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang, Haozheng Luo, Tianfan Fu, Aarthy Nagarajan

AI总结 针对LLM引导的进化搜索中固定预算的LLM调用分配问题,提出基于多臂老虎机的BaSE方法,通过跨并行轨迹分配调用,平均适应度提升12.3%。

详情
AI中文摘要

LLM引导的进化搜索(Evolve系统)在数学和组合任务上达到了最先进的结果,但现有系统通常只报告多次运行中的最佳结果,而未记录运行间的分布。我们询问如何分配固定的LLM调用预算,以及单次运行达到报告数字的可靠性如何。通过扫描五个模型和三个任务的深度-广度网格,我们识别出两个经验规律:一个适应度-计算包络线,其中能力排序主要取决于有效FLOPs;以及一个双线性深度-广度拟合,具有任务特定的交互;两者都受模型-任务能力门控。受这些规律启发,我们提出BaSE(基于老虎机的自进化),一种多臂老虎机,它在并行轨迹间分配LLM调用。在不改变模型、提示或评估器的情况下,BaSE在8个(模型,任务)单元上比最强的岛屿协议基线平均适应度提高12.3%,在方差高的设置上增益最大:仅通过分配实现可靠性提升。

英文摘要

LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most existing systems report only the best of many runs and leave the run-to-run distribution undocumented. We ask how a fixed budget of LLM calls should be allocated, and how reliably a single run reaches the reported numbers. Sweeping the depth-breadth grid over five models and three tasks, we identify two empirical regularities: a fitness-compute envelope along which capability ordering largely collapses on effective FLOPs, and a bilinear depth-breadth fit with task-specific interaction; both are gated by model-task capability. Motivated by these regularities, we propose BaSE (Bandit-based Self-Evolving), a multi-armed bandit that allocates LLM calls across parallel trajectories. Without changing the model, prompt, or evaluator, BaSE improves mean fitness by 12.3% over the strongest island-protocol baseline across 8 (model, task) cells, with the largest gains on high-variance settings: a reliability gain from allocation alone.
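
The idea of treating parallel trajectories as bandit arms can be sketched with the classic UCB1 rule. The abstract does not say which bandit algorithm BaSE uses, so UCB1, the reward scale in [0, 1], and the `step_fns` interface below are all assumptions made for illustration.

```python
import math


def allocate_calls_ucb1(step_fns, budget, c=1.0):
    """Spend a fixed budget of calls across trajectories with UCB1.

    step_fns: list of zero-argument callables. Calling one performs a single
    (hypothetical) LLM-guided step on that trajectory and returns a reward in
    [0, 1], for example a normalised fitness improvement.
    Returns the number of calls given to each trajectory.
    """
    n = len(step_fns)
    counts = [0] * n
    totals = [0.0] * n
    for t in range(1, budget + 1):
        if t <= n:
            arm = t - 1  # try every trajectory once before comparing them
        else:
            arm = max(
                range(n),
                key=lambda i: totals[i] / counts[i]
                + c * math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = step_fns[arm]()
        counts[arm] += 1
        totals[arm] += reward
    return counts


# Two toy trajectories with fixed rewards; the better one should get more calls.
calls = allocate_calls_ucb1([lambda: 0.1, lambda: 0.9], budget=100)
```

The exploration bonus keeps occasionally revisiting weak trajectories, which is one plausible way allocation alone could reduce run-to-run variance.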

2605.29198 2026-06-01 cs.CV

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

引导对比令牌信用分配用于离散策略优化

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yuta Kyuragi, Aditya Grover

AI总结 针对组优势强化学习方法中令牌级信用分配缺失的问题,提出引导对比策略优化(GCPO),通过正负提示下的对比预测分配令牌级优势,在文本到图像生成和思维链推理任务上优于GRPO和DAPO。

详情
Comments
21 pages, 11 figures
AI中文摘要

基于组优势的强化学习方法,如GRPO和DAPO,在包括数学推理和文本到图像生成在内的多个领域展示了强大的性能。然而,它们对样本级奖励的依赖引入了一个关键限制,即所有令牌的均匀信用分配无法捕捉细粒度的令牌级贡献。为了解决这个问题,我们提出了引导对比策略优化(GCPO),一种新颖的算法,通过对比正负提示下的模型预测来实现每个令牌的信用分配。GCPO不是均匀地广播样本级优势,而是分配与这些对比预测差异成比例的令牌级优势,从而提供更精确和信息丰富的学习信号。实验上,我们发现GCPO强调语义相关区域,例如文本到图像生成中与文本提示对齐的视觉区域,以及思维链任务中推理轨迹内的关键关键词。通过大量实验,GCPO在文本到图像生成和思维链推理基准测试上始终优于GRPO和DAPO基线,证明了其作为离散策略学习的通用且可扩展优化策略的有效性。

英文摘要

Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards introduces a key limitation: uniform credit assignment across all tokens fails to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, allowing more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions, such as visual areas aligned with textual prompts in text-to-image generation and critical keywords within reasoning traces for chain-of-thought tasks. Through extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.
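
The token-level assignment described above can be sketched as follows. The abstract states only that token advantages are proportional to the difference between predictions under positive and negative prompts. Measuring that difference as an absolute gap in per-token log-probability, and normalising the weights to mean 1, are assumptions of this sketch and not necessarily what GCPO does.

```python
def token_level_advantages(sample_advantage, logp_pos, logp_neg, eps=1e-8):
    """Spread one sample-level advantage over tokens by contrastive gap.

    logp_pos / logp_neg: per-token log-probabilities of the sampled tokens
    under the positive and negative prompt (hypothetical inputs).
    Weights are normalised to mean 1, so the average token advantage equals
    the original sample-level advantage.
    """
    gaps = [abs(p - n) for p, n in zip(logp_pos, logp_neg)]
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap < eps:
        # No contrast anywhere: fall back to uniform broadcasting.
        return [sample_advantage] * len(gaps)
    return [sample_advantage * g / mean_gap for g in gaps]


# The first token is unaffected by the prompt contrast and receives no credit.
advantages = token_level_advantages(
    2.0, logp_pos=[-1.0, -0.5, -2.0], logp_neg=[-1.0, -1.5, -2.5]
)
```

Uniform broadcasting, as in GRPO-style methods, corresponds to the fallback branch where every token receives the same value.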

2605.29146 2026-06-01 cs.CL cs.AI

SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

SafeRx-Agent: 基于知识的多智能体框架用于安全且可解释的药物推荐

Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu, Qincheng Lu, Ziyu Zhao, Jijun Chi, Jingrui Tian, Xiao-Wen Chang, Ziyang Song

AI总结 提出SafeRx-Agent,一种基于知识的多智能体框架,通过患者上下文、外部临床知识和安全验证来推荐可追溯的药物集合,在MIMIC-III和MIMIC-IV数据集上提高了细粒度药物预测准确性,同时控制了药物相互作用、禁忌症和药物集合大小。

详情
AI中文摘要

药物推荐预测患者就诊时的用药,但现有方法仍面临两个关键挑战。在模型层面,传统药物推荐方法仅预测结构化的药物代码,证据基础有限,而LLM智能体可以利用更丰富的临床上下文,但可能缺乏安全验证和可追溯性。在任务层面,现有基准通常使用宽泛的药物类别,忽略了亚组级别的安全性差异,可能导致风险高估。我们引入了基于第四级ATC代码生成的第一个细粒度药物推荐设置。我们提出了安全处方智能体(SafeRx-Agent),一种基于知识的多智能体框架,利用患者上下文、外部临床知识和安全验证来推荐可追溯的药物集合。在MIMIC-III和MIMIC-IV数据集上的实验结果表明,SafeRx-Agent提高了细粒度药物预测准确性,同时控制了药物相互作用、禁忌症和药物集合大小。

英文摘要

Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.

2605.28918 2026-06-01 cs.LG cs.AI cs.IR

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

当LLM奖励设计失败时:面向诊断的稀疏结构化RL改进

Youting Wang, Yuan Tang, Bowen Liu, Xuan Liu, Dingyan Shang

AI总结 针对稀疏结构化强化学习任务,提出诊断驱动的迭代奖励函数改进方法,通过训练诊断和失败模式分类指导修正,显著提升MiniGrid任务成功率。

详情
AI中文摘要

对于具有语义奖励函数接口的稀疏结构化强化学习任务,LLM生成的奖励塑造更适合被视作调试而非一次性生成。我们使用MiniGrid作为核心评估、MuJoCo作为边界压力测试,研究PPO训练的智能体。我们的审计发现两种主要的一次性失败模式——奖励泛滥和语义/API误解,以及一种较罕见的弱塑造情况。我们提出诊断驱动的迭代改进,其中训练诊断和失败模式分类法指导有针对性的奖励函数修订。改进使DoorKey-8x8从2.3%提升至97.6%,KeyCorridor从31.2%提升至86.7%,但种子间方差较高。控制实验表明这些提升并非来自重试或额外训练:仅指标重新提示导致大幅下降,而静态词汇控制恢复了大部分差距(87.6%;70.7%),表明分类法提示是主要机制,动态标签仅提供部分孤立的增量证据。预算匹配和Best-of-3比较将改进与选择和训练时间效应分离。组件移除测试、敏感性分析以及针对作者标签的审计为调试解释提供了汇聚证据,同时揭示了校准限制。连续控制结果显示了边界:基于成功的诊断可能在密集奖励的运动控制任务中误报,而回报趋势反馈移除了一个假阳性机制但未带来稳健提升。低调用协议是与基于种群的奖励搜索的成本对比,而非基准比较。在四个交叉方差设计环境中,点估计表明当LLM奖励函数方差占主导时收益更大,但bootstrap区间较宽。该方法局限于PPO下具有可靠接口的稀疏结构化任务;event_text等字段可能有益、有害或中性。

英文摘要

For sparse, structured reinforcement-learning tasks with semantic reward-function interfaces, LLM-generated reward shaping is better framed as debugging than one-shot generation. We study PPO-trained agents using MiniGrid as the core evaluation and MuJoCo as a boundary stress test. Our audit finds two dominant one-shot failure modes -- reward flooding and semantic/API misunderstanding -- plus a rarer weak-shaping case. We propose diagnostic-driven iterative refinement, where training diagnostics and a failure-mode taxonomy guide targeted reward-function revision. Refinement improves DoorKey-8x8 from 2.3% to 97.6% and KeyCorridor from 31.2% to 86.7%, albeit with high seed-to-seed variance. Controls show these gains are not from retrying or extra training: metrics-only re-prompting yields large drops, while a static-vocabulary control recovers much of the gap (87.6%; 70.7%), showing the taxonomy prompt is a major mechanism and dynamic labels provide only partially isolated incremental evidence. Budget-matched and Best-of-3 comparisons separate refinement from selection and training-time effects. Component-removal tests, sensitivity analyses, and an audit against author labels provide converging evidence for the debugging interpretation while revealing calibration limits. Continuous-control results show the boundary: success-based diagnostics can misfire in dense-reward locomotion, and return-trend feedback removes one false-positive mechanism without robust gains. The low-call protocol is a cost contrast with population-based reward search, not a benchmark comparison. In four crossed-variance-design environments, point estimates suggest larger gains when LLM reward-function variance dominates, but bootstrap intervals are wide. The method is bounded to sparse structured tasks with reliable interfaces under PPO; fields like event_text may help, hurt, or be neutral.

2605.28836 2026-06-01 cs.CL cs.AI

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand

不让任何读者掉队:人人能理解的多智能体摘要

Jimin Jung, MyoungJin Kim, Jaehyung Seo, Heuiseok Lim

AI总结 提出NRLB多智能体框架,通过模拟三类读者群体并结合模板规划与迭代优化,生成既忠实又易于理解的平实语言摘要。

详情
AI中文摘要

美国的《平实语言法案》要求政府文件使用清晰、简单的语言,以便公众易于理解,但现有的摘要系统难以应对普通读者中多样化的语言和认知障碍。我们提出了NRLB(不让任何读者掉队),一个用于平实语言摘要的多智能体框架,它模拟了三类代表性读者群体:小学生读者、非母语读者和注意力缺陷读者。NRLB结合了基于模板的规划与迭代的、面向读者的优化,能够系统地检测和解决难懂术语、缺失上下文和令人困惑的句子。在多个数据集上的评估显示,在保持事实准确性的同时,可读性持续提升。人工评估进一步验证了NRLB的效果,标注者偏好率在55%到76%之间,突显了NRLB在生成既忠实于原文又广泛适用于公众的平实语言摘要方面的潜力。

英文摘要

The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi-agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non-native readers, and readers with attention deficits. NRLB combines template-based planning with iterative, reader-oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.

2605.25134 2026-06-01 cs.LG cs.AI

Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate

重参数化、权重衰减和自适应学习率下稀疏优化的理论分析

Huangyu Xu, Jingqin Yang, Qianqian Xu, Jiaye Teng

AI总结 针对稀疏优化中的不稳定问题,提出基于重参数化、权重衰减和自适应学习率的ReWA方法,通过改善优化景观实现比ℓ1正则化更好的稀疏性,同时保持测试精度。

详情
Comments
32 pages, 5 figures. Submitted to ICML 2026
AI中文摘要

稀疏优化是各种实际应用中的一个基本挑战。一种流行的稀疏优化方法是ℓ_p正则化。然而,当0<p<1时,由于无界梯度,它可能遇到优化不稳定性。在本文中,我们介绍了一种新的稀疏优化方法,称为ReWA,它基于重参数化、权重衰减和自适应学习率。ReWA与ℓ_p正则化密切相关,但它揭示了一个不同的优化景观,有助于缓解不稳定性问题。在CIFAR-10和ImageNet上使用ResNets进行的实验表明,与ℓ_1正则化方法相比,ReWA在保持测试精度的同时显著提高了稀疏性。

英文摘要

Sparse optimization is a fundamental challenge in various practical applications. A popular approach to sparse optimization is $\ell_p$ regularization. However, it may encounter optimization instability due to the unbounded gradients when $0<p<1$. In this paper, we introduce a novel approach to sparse optimization termed ReWA, based on Reparameterization, Weight decay, and Adaptive learning rate. ReWA is closely connected to $\ell_p$-regularization, yet it unveils a distinct optimization landscape that helps mitigate instability issues. Experiments on CIFAR-10 and ImageNet with ResNets demonstrate that ReWA leads to significant sparsity improvements over the $\ell_1$-regularization approach while preserving test accuracy.
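
The stated link between reparameterization with weight decay and $\ell_p$ regularization can be illustrated with a standard identity. The abstract does not give ReWA's exact parameterization, so the product factorization below is an assumption chosen for illustration. Both equalities follow from the AM-GM inequality.

```latex
\min_{u,v \,:\, uv = w} \frac{\lambda}{2}\left(u^2 + v^2\right) = \lambda\,|w|,
\qquad
\min_{u_1,\dots,u_L \,:\, \prod_{i} u_i = w} \frac{\lambda}{2}\sum_{i=1}^{L} u_i^2
  = \frac{\lambda L}{2}\,|w|^{2/L}.
```

With $L > 2$ factors the induced penalty behaves like $\ell_p$ with $p = 2/L \in (0,1)$. The weight-decay penalty on the factors is smooth, whereas $|w|^p$ has an unbounded gradient at $w = 0$, which is one way to read the claim that the reparameterized landscape mitigates instability.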

2604.22409 2026-06-01 cs.CV

SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

SpaMEM:具身环境中通过感知-记忆集成进行动态空间推理的基准测试

Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, Xin Cao

AI总结 提出SpaMEM基准,通过动作条件场景变换和多模态数据,分层评估多模态大模型在具身环境中的空间信念演化能力,揭示坐标一致性和视觉记忆瓶颈。

详情
AI中文摘要

多模态大语言模型(MLLMs)在静态视觉-空间推理方面取得了进展,但在具身环境中,当信念必须根据环境变化下的自我中心观察不断修正时,它们往往无法保持长期的空间连贯性。我们引入了SpaMEM(动作序列的空间记忆),这是一个大规模诊断基准,通过长交互时间内的动作条件场景变换(生成、放置、移除)来隔离空间信念演化的机制。SpaMEM基于一个物理基础数据集构建,包含来自1000个程序生成房屋中25000多个交互序列的10,601,392张高保真图像,涵盖四种模态(RGB、深度、实例、语义分割)。我们将具身空间推理形式化为一个三级层次结构,包含15个诊断任务:第1级测量单次观察的原子空间感知;第2级利用神谕文本状态历史探测时间推理,以排除感知噪声;第3级要求在同一任务维度下从原始视觉流进行端到端的信念维护。我们还评估了短期(逐步)更新和长期(情节)重建。对代表性开源VLM系列的基准测试揭示了一个一致的堆叠瓶颈:坐标一致的定位仍然是一个硬上限,从第2级到第3级的急剧下降暴露了显著的符号脚手架依赖性,即模型在基于文本的记账中成功,但难以维持稳健的视觉记忆。SpaMEM提供了一个细粒度的诊断标准,并激发了状态表示、信念修正和长期情节集成的显式机制。SpaMEM的一个子集可在https://huggingface.co/datasets/mill-ct-liao/SpaMEM公开获取。

英文摘要

Multimodal large language models (MLLMs) have advanced static visual--spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration. A subset of SpaMEM is publicly available at https://huggingface.co/datasets/mill-ct-liao/SpaMEM.

2603.09632 2026-06-01 cs.CV cs.CL

X-GS: An Extensible Framework for Perceiving and Thinking via 3D Gaussian Splatting

X-GS:基于3D高斯溅射的感知与思考可扩展框架

Yueen Ma, Zenglin Xu, Irwin King

AI总结 提出X-GS框架,包含感知器和思考器,统一多种3DGS技术实现实时在线SLAM与语义蒸馏,并支持多模态模型完成下游任务。

详情
AI中文摘要

3D高斯溅射(3DGS)已成为新颖视图合成的强大技术,随后扩展到众多空间AI应用。然而,大多数现有3DGS方法孤立运行,专注于特定领域。本文介绍X-GS,一个包含两个主要组件的可扩展框架。X-GS-感知器统一了广泛的3DGS技术,以实现具有语义蒸馏的实时在线SLAM。X-GS-思考器容纳多模态模型,使其能够与感知器无缝交互以完成下游任务。在我们的X-GS实现中,感知器利用最新的视觉基础模型提高在线SLAM性能,并采用三种关键机制加速语义蒸馏。思考器可以基于对比和生成视觉语言模型构建,并利用感知器的语义高斯溅射解锁3D视觉定位和场景描述等功能。在多个基准上的实验结果表明了X-GS框架的高效性和新解锁的多模态能力。

英文摘要

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods operate in isolation, focusing on specific domains. In this paper, we introduce X-GS, an extensible framework consisting of two major components. The X-GS-Perceiver unifies a broad range of 3DGS techniques to enable real-time online SLAM with semantic distillation. The X-GS-Thinker accommodates multimodal models, enabling them to seamlessly interface with the Perceiver to complete downstream tasks. In our implementation of X-GS, the Perceiver leverages the latest vision foundation models to improve online SLAM performance and employs three key mechanisms to accelerate semantic distillation. The Thinker can be built upon both contrastive and generative vision-language models and utilizes the Perceiver's semantic Gaussian splats to unlock capabilities such as 3D visual grounding and scene captioning. Experimental results on diverse benchmarks demonstrate the efficiency and newly unlocked multimodal capabilities of the X-GS framework.

2602.10388 2026-06-01 cs.CL cs.AI

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

少即足够:利用稀疏自编码器在LLM特征空间中合成多样化数据

Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu

AI总结 提出基于稀疏自编码器的特征激活覆盖率(FAC)指标及数据合成框架FAC Synthesis,通过识别缺失特征并生成对应样本来提升数据多样性和下游任务性能。

详情
AI中文摘要

后训练数据的多样性对于大型语言模型(LLM)的有效下游性能至关重要。许多现有的后训练数据构建方法使用基于文本的指标来衡量多样性,这些指标捕捉语言变化,但此类指标仅能为决定下游性能的任务相关特征提供微弱信号。在这项工作中,我们引入了特征激活覆盖率(FAC),该指标在可解释的特征空间中衡量数据多样性。基于此指标,我们进一步提出了一个多样性驱动的数据合成框架,名为FAC Synthesis,该框架首先使用稀疏自编码器从种子数据集中识别缺失特征,然后生成明确反映这些特征的合成样本。实验表明,我们的方法在包括指令遵循、毒性检测、奖励建模和行为引导在内的各种任务上,持续提高了数据多样性和下游性能。有趣的是,我们识别出跨模型家族(即LLaMA、Mistral和Qwen)共享的可解释特征空间,从而实现了跨模型知识迁移。我们的工作为探索以数据为中心的LLM优化提供了坚实且实用的方法论。

英文摘要

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.
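
A coverage-style metric over sparse-autoencoder features can be sketched as below. The abstract does not define FAC precisely, so this sketch assumes coverage is the fraction of features that fire above a threshold on at least one sample. The function name, the threshold, and the input layout are hypothetical.

```python
def feature_activation_coverage(activations, threshold=0.0):
    """Return (coverage, missing feature indices) for a dataset.

    activations: list of per-sample feature activation vectors, assumed to
    come from a sparse autoencoder applied to model hidden states.
    A feature counts as covered if any sample activates it above threshold.
    """
    n_features = len(activations[0])
    covered = [
        any(sample[j] > threshold for sample in activations)
        for j in range(n_features)
    ]
    missing = [j for j, is_covered in enumerate(covered) if not is_covered]
    return sum(covered) / n_features, missing


# Two seed samples over four features; features 2 and 3 never fire.
coverage, missing = feature_activation_coverage(
    [[0.0, 1.2, 0.0, 0.0], [0.3, 0.0, 0.0, 0.0]]
)
```

Under this reading, the `missing` indices are the features a synthesis step would target when generating new samples.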

2509.21190 2026-06-01 cs.LG cs.AI

Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy

面向零样本时间序列异常检测的基础模型:利用合成数据和相对上下文差异

Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang

AI总结 提出基于相对上下文差异(RCD)的预训练范式,通过合成数据训练Transformer模型比较查询模式与上下文,实现零样本时间序列异常检测,在多个基准上超越现有基础模型。

详情
Comments
This manuscript is withdrawn, as the authors intend to further extend and develop the work beyond its current scope
AI中文摘要

时间序列异常检测(TSAD)是一项关键任务,但开发能够以零样本方式泛化到未见数据的模型仍然具有挑战性。现有的TSAD基础模型通常依赖推理时的重构误差评分,这可能会遗漏重构良好的细微异常,并可能错误地标记未见领域中复杂但正常的模式。我们引入了TimeRCD,这是一个基于相对上下文差异(RCD)构建的TSAD基础模型,RCD是一种预训练范式,通过比较查询模式与其周围上下文来训练模型检测异常。这种关系公式通过标准Transformer架构实现,使模型能够从输入上下文中推断正常性,而不是依赖固定的全局正常模式。我们进一步构建了一个大规模合成语料库,其中包含上下文相关的异常标签,为RCD提供监督预训练信号。跨多个基准的实验表明,在大多数零样本TSAD设置中,TimeRCD优于现有的通用和异常特定基础模型,同时与数据集特定的全样本基线保持竞争力。这些结果提供了实证证据,表明RCD是构建鲁棒且可泛化的TSAD模型的有效方向。

英文摘要

Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains challenging. Existing foundation models for TSAD often rely on reconstruction-error scoring at inference time, which can miss subtle anomalies that are well reconstructed and can falsely flag complex but normal patterns in unseen domains. We introduce TimeRCD, a foundation model for TSAD built on Relative Context Discrepancy (RCD), a pre-training paradigm that trains the model to detect anomalies by comparing a query pattern with its surrounding context. This relational formulation, implemented with a standard Transformer architecture, enables the model to infer normality from the input context rather than relying on fixed global normal patterns. We further construct a large-scale synthetic corpus with context-dependent anomaly labels to provide supervised pre-training signals for RCD. Experiments across diverse benchmarks show that TimeRCD outperforms existing general-purpose and anomaly-specific foundation models in most zero-shot TSAD settings, while remaining competitive with dataset-specific full-shot baselines. These results provide empirical evidence that RCD is an effective direction for building robust and generalizable TSAD models.

2605.25842 2026-06-01 cs.AI cs.CL

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

MuCRASP: 多模态思维链推理感知的结构化剪枝

Aritra Dutta, Somak Aditya

AI总结 针对视觉语言模型在结构化剪枝后思维链推理准确性下降的问题,提出MuCRASP框架,通过识别推理关键令牌并保持跨模态对齐,在压缩下维持推理质量。

详情
Comments
Preprint ver. 2
AI中文摘要

视觉语言模型(VLM)越来越依赖思维链(CoT)推理来解决复杂的多模态任务,但其庞大的参数量使得部署成本高昂。结构化剪枝提供了一种自然的解决方案;然而,现有方法无法在VLM中保持CoT推理的准确性。我们确定了两个关键原因:(1)CoT一致性依赖于生成轨迹中的稀疏过渡点(枢轴令牌),而现有剪枝方法对CoT不敏感;(2)为单模态LLM设计的剪枝方法未考虑视觉和文本模态之间的激活分布差异。基于这些观察,我们提出了MuCRASP,一种结构化剪枝框架,针对推理关键组件,同时保持跨模态对齐并在全局参数预算下考虑层间敏感性。在三个推理基准测试上的四个VLM实验表明,MuCRASP在不断增加压缩的情况下始终能保持推理质量。在Qwen2.5-VL-7B上剪枝30%时,MuCRASP在物理推理任务上获得了8.87的LLM-as-a-Judge评分,而最强基线为7.32。此外,MuCRASP在高达50%的剪枝率下仍保持高推理一致性,显著优于先前的剪枝方法,同时表现出更低的困惑度退化。

英文摘要

Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.

2602.01173 2026-06-01 cs.CV

EEmo-Logic: A Unified Dataset and Multi-Stage Framework for Comprehensive Image-Evoked Emotion Assessment

EEmo-Logic:面向全面图像诱发情感评估的统一数据集与多阶段框架

Lancheng Gao, Ziheng Jia, Zixuan Xing, Wei Sun, Huiyu Duan, Guangtao Zhai, Xiongkuo Min

AI总结 提出最大图像诱发情感理解数据集EEmoDB和统一多模态大语言模型EEmo-Logic,通过指令微调和任务定制GRPO实现细粒度情感问答与评估。

详情
AI中文摘要

理解图像诱发情感的多维属性和强度细微差别对于提升机器共情能力和赋能多样化人机交互应用至关重要。然而,现有模型仍局限于粗粒度情感感知或推理能力不足。为弥补这一差距,我们引入了EEmoDB,这是迄今为止最大的图像诱发情感理解数据集。它包含跨越5个不同任务类别的5个分析维度,促进全面解读。具体而言,我们通过自动生成从125K张图像中整理了1.2M问答对(EEmoDB-QA),以及从25K张图像中策划了36K数据集(EEmoDB-Assess)用于细粒度评估。此外,我们提出了EEmo-Logic,一个通过指令微调和具有新颖奖励设计的任务定制组相对偏好优化(GRPO)开发的一体化多模态大语言模型(MLLM)。大量实验表明,EEmo-Logic在域内和跨域数据集上实现了稳健性能,在情感问答和细粒度评估方面表现出色。数据集和代码可在https://github.com/workerred/EEmo-Logic获取。

英文摘要

Understanding the multi-dimensional attributes and intensity nuances of image-evoked emotions is pivotal for advancing machine empathy and empowering diverse human-computer interaction applications. However, existing models are still limited to coarse-grained emotion perception or deficient reasoning capabilities. To bridge this gap, we introduce EEmoDB, the largest image-evoked emotion understanding dataset to date. It features 5 analysis dimensions spanning 5 distinct task categories, facilitating comprehensive interpretation. Specifically, we compile 1.2M question-answering (QA) pairs (EEmoDB-QA) from 125K images via automated generation, alongside a 36K dataset (EEmoDB-Assess) curated from 25K images for fine-grained assessment. Furthermore, we propose EEmo-Logic, an all-in-one multimodal large language model (MLLM) developed via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design. Extensive experiments demonstrate that EEmo-Logic achieves robust performance in in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment. The dataset and code are available at https://github.com/workerred/EEmo-Logic.

2503.07482 2026-06-01 cs.LG cs.AI

How does Bayesian Sampling help Membership Inference Attacks?

贝叶斯采样如何帮助成员推断攻击?

Zhenlong Liu, Wenyu Jiang, Feng Zhou, Hongxin Wei

AI总结 提出贝叶斯成员推断攻击(BMIA),通过拉普拉斯近似对单个参考模型进行贝叶斯采样以估计条件分数分布,理论证明降低模型内方差从而提升攻击性能,并在多模态数据集上实现最先进的效果与效率。

详情
Comments
Accepted to ICML 2026
AI中文摘要

成员推断攻击(MIAs)旨在估计特定数据点是否用于给定模型的训练。现有的最先进攻击通常依赖于训练多个参考模型来近似单个数据点的条件分数分布,这导致显著的计算开销并限制了其实际适用性。在这项工作中,我们提出了一种新颖的方法——贝叶斯成员推断攻击(BMIA),通过贝叶斯采样执行条件攻击。具体来说,我们对单个参考模型应用拉普拉斯近似以获得模型参数的后验分布,从而能够直接估计条件分数分布。理论上,我们证明了贝叶斯采样降低了模型内方差,从而提高了攻击能力。这一见解自然地激发了多参考变体,当有额外的参考模型可用时,该变体进一步提升了性能。在图像、文本和表格数据集上的大量实验表明,我们的方法在有效性和效率方面均达到了最先进的性能。

英文摘要

Membership Inference Attacks (MIAs) aim to estimate whether a specific data point was used in the training of a given model. Existing state-of-the-art attacks typically rely on training multiple reference models to approximate the conditional score distribution for individual data points, which leads to significant computational overhead and limits their practical applicability. In this work, we propose a novel approach -- Bayesian Membership Inference Attack (BMIA), which performs a conditional attack through Bayesian sampling. Specifically, we apply the Laplace approximation to a single reference model to obtain a posterior over model parameters, enabling direct estimation of the conditional score distribution. Theoretically, we demonstrate that Bayesian sampling reduces intra-model variance, thereby improving attack power. This insight naturally motivates a multi-reference variant that further enhances performance when additional reference models are available. Extensive experiments across image, text, and tabular datasets indicate that our method achieves state-of-the-art performance in both effectiveness and efficiency.

2605.25193 2026-06-01 cs.CV

SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

SpongeBob:同步感知的和谐视听生成式编辑

Sen Liang, Cong Wang, Fengbin Guan, Zhentao Yu, Yiting Lu, Yuanzhi Wang, Yuan Zhou, Xin Li, Zhibo Chen

AI总结 提出首个端到端视听联合编辑框架SpongeBob,通过双向跨模态交互的同步感知机制和上下文感知模块,解决视频编辑中的音画不同步和语义冲突问题。

详情
AI中文摘要

物理世界中的视觉和声学事件本质上是耦合的,然而现有的视频编辑方法通常采用解耦的流水线,缺乏双向模态交互。这导致两个关键限制:(i) 视听不同步和(ii) 生成的音频与保留内容之间的上下文冲突。为了解决这些问题,我们提出了SpongeBob,这是第一个具有双向跨模态交互的端到端视听联合编辑框架。对于同步,同步感知机制通过双向注意力、时间对齐和空间约束将视觉编辑与声音事件对齐。对于上下文一致性,上下文感知模块利用声学和视觉上下文注意力来防止语义冲突。此外,我们引入了同步保持训练和指导(SPTG),以在不降低质量的情况下增强对齐。由于配对数据的稀缺,我们构建了一个可扩展的数据流水线和一个大规模的主题级数据集。我们还提出了SpongeBob-Bench用于系统评估。实验表明,SpongeBob显著优于现有基线,将Sync-C提高了30%,Ctx-F1提高了12.5%。我们的项目页面位于:https://hy-spongebob.github.io/。

英文摘要

Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines, lacking bidirectional modality interaction. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these, we propose SpongeBob, the first end-to-end audio-visual joint editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Due to the scarcity of paired data, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%. Our project page is available at: https://hy-spongebob.github.io/.

2603.24254 2026-06-01 cs.LG cs.AI

Beyond Static Uncertainty: Modeling Temporal Uncertainty Dynamics for Probabilistic Time Series Forecasting

超越静态不确定性:面向概率时间序列预测的时间不确定性动态建模

Yijun Wang, Qiyuan Zhuang, Larysa Marchanka, Xiu-Shen Wei

AI总结 提出VolDy-VAE模型,通过循环尺度路径捕捉波动率动态,实现时间一致的概率预测,提升准确性和不确定性校准。

详情
AI中文摘要

现实世界的时间序列表现出时间结构化的不确定性:波动率在动荡时期聚集,在稳定时期消散,并在结构断裂处突然变化。然而,许多概率预测方法将预测不确定性估计为独立的逐点量,忽略了波动率机制的演变和持续性。我们将这一缺失维度形式化为时间不确定性动态,并在波动率动态变分自编码器(VolDy-VAE)中实例化它,这是一个具有位置-尺度解码器的非自回归生成预测器。VolDy-VAE结合了用于均值预测的位置路径和用于传递和演化波动率隐藏状态的循环尺度路径,该状态从回溯窗口转移到预测范围,从而实现时间一致的预测方差。这种设计产生了一种自适应衰减机制:高方差观测值对位置估计的影响较小,而其不确定性通过明确的尺度预测得以保留。我们进一步提供了一个简化的机制转换分析,表明当方差已知或一致估计时,波动率感知目标简化为逆方差加权,而基于MSE的估计量保持无偏但统计效率较低。在九个基准上的实验表明,VolDy-VAE在保持低推理延迟的同时,提高了预测准确性和不确定性校准,优于竞争的概率和点预测基线;插件研究进一步表明,VolDy原理可以有益于GAN、Koopman VAE和Transformer骨干网络。源代码公开于https://github.com/wangyijunlyy/VolDy-VAE。

英文摘要

Real-world time series exhibit temporally structured uncertainty: volatility clusters in turbulent regimes, dissipates in stable periods, and shifts abruptly around structural breaks. Yet many probabilistic forecasting methods estimate predictive uncertainty as an independent per-step quantity, leaving the evolution and persistence of volatility regimes under-modeled. We formalize this missing dimension as temporal uncertainty dynamics and instantiate it in the Volatility Dynamics Variational Autoencoder (VolDy-VAE), a non-autoregressive generative forecaster with a location-scale decoder. VolDy-VAE combines a location path for mean prediction with a recurrent scale path that transfers and evolves a volatility hidden state from the look-back window to the forecasting horizon, enabling temporally coherent predictive variances. This design yields an adaptive attenuation mechanism: high-variance observations receive lower influence on the location estimate while their uncertainty is preserved through explicit scale predictions. We further provide a simplified regime-switching analysis showing that, when variances are known or consistently estimated, the volatility-aware objective reduces to inverse-variance weighting, whereas MSE-based estimators remain unbiased but statistically inefficient. Experiments on nine benchmarks show that VolDy-VAE improves forecasting accuracy and uncertainty calibration over competitive probabilistic and point-forecasting baselines while maintaining low inference latency; plug-in studies further indicate that the VolDy principle can benefit GAN, Koopman VAE, and Transformer backbones. The source code is publicly available at https://github.com/wangyijunlyy/VolDy-VAE.
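
The inverse-variance weighting mentioned in the regime-switching analysis can be illustrated with a small sketch. It assumes independent observations of a common mean with known variances. This is a textbook setting, not the VolDy-VAE model itself, and the helper names are hypothetical.

```python
def inverse_variance_mean(values, variances):
    """Estimate a common mean, weighting each observation by 1 / variance.

    This is the minimiser of the Gaussian negative log-likelihood
    sum((x_i - mu)**2 / (2 * var_i)) when the variances are known.
    """
    weights = [1.0 / v for v in variances]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def estimator_variance(weights, variances):
    """Variance of the weighted mean for independent observations."""
    total = sum(weights)
    return sum((w / total) ** 2 * v for w, v in zip(weights, variances))


values = [1.0, 1.0, 10.0]
variances = [1.0, 1.0, 100.0]  # the third point comes from a turbulent regime

weighted_mean = inverse_variance_mean(values, variances)
plain_mean = sum(values) / len(values)

# Both estimators are unbiased, but the weighted one has lower variance.
var_weighted = estimator_variance([1.0 / v for v in variances], variances)
var_plain = estimator_variance([1.0, 1.0, 1.0], variances)
```

The high-variance point pulls the unweighted mean far more than the weighted one, which mirrors the attenuation behaviour described in the abstract. The unweighted (MSE-style) estimator stays unbiased but pays for it in variance.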