arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2606.00706 2026-06-09 cs.CV Updated version

CR-JEPA: Cross-Modal Joint-Embedding Predictive Learning for Remote Sensing Image Retrieval

Md Aminur Hossain, Ayush V. Patel, Nitant Dube, Biplab Banerjee

Affiliations: Space Applications Centre, Indian Space Research Organisation; Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay

AI summary: Proposes CR-JEPA, an architecture that combines modality-specific stems, a shared transformer trunk, and JEPA-style predictive objectives to achieve cross-modal semantic alignment and same-modal neighbourhood preservation, markedly improving cross-modal retrieval on BEN-14K and other datasets.

Comments
24 pages

English abstract

Cross-modal remote sensing image retrieval aims to retrieve semantically related scenes across heterogeneous sensing modalities. This remains challenging because paired observations may differ substantially in imaging physics, spatial resolution, spectral configuration, and visual appearance. Moreover, a single retrieval projection trained with one objective may be insufficient to jointly support cross-modal semantic alignment and same-modal neighbourhood preservation. We propose CR-JEPA, a Cross-modal Retrieval Joint-Embedding Predictive Architecture for dual-modality remote sensing retrieval. The model uses modality-specific stems, a shared transformer trunk, and JEPA-style predictive objectives to estimate masked latent target features within and across modalities. Inspired by LeJEPA, we apply Sketched Isotropic Gaussian Regularization to raw retrieval projections to stabilize embeddings and mitigate collapse. CR-JEPA further employs a decoupled-head design with a unified retrieval head for same-modal retrieval and a cross-modal retrieval head for cross-modal search. We evaluate CR-JEPA on BEN-14K, CBRSIR_VS, and DSRSID. On BEN-14K, CR-JEPA improves S1 to S2 retrieval from 61.23% to 75.82% and S2 to S1 retrieval from 63.73% to 75.40% over X-JEPA, while also achieving competitive same-modal retrieval with fewer parameters.

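2606.00568 2026-06-09 cs.LG q-bio.GN Updated version

On the Recoverability of Causal Relations from Bulk Gene Expression Data

Gongxu Luo, Boyang Sun, Kun Zhang

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University

AI summary: Studies whether causal relations can be recovered from bulk gene expression data by formalizing consistency under aggregation and deriving necessary and sufficient conditions. Recoverability holds only under linear aggregation combined with affine structural equations, and empirical data deviate from that linearity assumption.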

English abstract

Bulk gene expression profiling, which aggregates pooled RNA across cells within a biological sample, remains important in the single-cell era because it is typically less noisy, more sensitive, and more cost-effective than single-cell assays. Accordingly, a growing body of computational methods seeks to recover causal relations among genes from bulk expression data. However, aggregation is a lossy, non-invertible coarsening of the underlying cellular system, and it remains unclear whether and under what conditions causal relations are recoverable from aggregated bulk gene expression data. To answer this, we formalize recoverability under aggregation through two notions of consistency: functional-form consistency and conditional-independence consistency. We then derive necessary and sufficient conditions for recoverability, showing that these properties are preserved only under linear aggregations (e.g., sum/mean) coupled with affine structural equations. To assess the practical plausibility of these conditions, analyses of four bulk and four single-cell gene expression datasets further reveal that the estimated pairwise regulatory functions among genes deviate from linearity in both data types, providing limited empirical support for the linearity assumptions required for recoverability. Together, these results caution against recovering causal relations from aggregated bulk expression data without strong additional assumptions.
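To see why linearity is singled out, a small worked example may help. It is only an illustrative sketch under assumed notation that does not come from the abstract: $x_i$ and $y_i$ denote the per-cell expression of two genes across $n$ cells, $X$ and $Y$ their bulk sums, $a$ and $b$ hypothetical coefficients, and $\varepsilon_i$ per-cell noise.

```latex
% Affine per-cell mechanism with sum aggregation: the bulk relation is again affine in X.
\[
y_i = a x_i + b + \varepsilon_i
\quad\Longrightarrow\quad
Y = \sum_{i=1}^{n} y_i = a X + n b + E,
\qquad X = \sum_{i=1}^{n} x_i,\; E = \sum_{i=1}^{n} \varepsilon_i .
\]
% Nonlinear per-cell mechanism: Y is no longer determined by X.
\[
y_i = x_i^{2}:
\quad (x_1, x_2) = (1, 3) \Rightarrow (X, Y) = (4, 10),
\qquad (x_1, x_2) = (2, 2) \Rightarrow (X, Y) = (4, 8).
\]
```

In the second case two samples with the same bulk value of $X$ have different bulk values of $Y$, so no function of $X$ alone reproduces the per-cell mechanism. This matches the abstract's functional-form consistency claim in spirit, though the paper's formal conditions are not reproduced here.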

2606.00384 2026-06-09 cs.AI cs.CL cs.CV cs.LG stat.CO Updated version

VESTA: Visual Exploration with Statistical Tool Agents

William Rudman, Abhishek Divekar, Kanishk Jain, Sebastian Joseph, Stella S. R. Offner, Matthew Lease, Kyle Mahowald, Greg Durrett, Junyi Jessy Li

Affiliations: The University of Texas at Austin; New York University

AI summary: Proposes VESTA, a framework whose dynamically growing toolkit guides data transformations, hypothesis-driven visualizations, and statistical tests, improving vision-language models on complex statistical modeling tasks.

English abstract

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.

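2606.00229 2026-06-09 cs.RO cs.AI cs.LG Updated version

Continuous Reasoning for Vision-Language-Action

Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

Affiliations: Airoa

AI summary: Addresses the granularity mismatch between language and continuous control in vision-language-action policies with a shareable, verifiable continuous reasoning method that uses a Gaussian latent interface and a self-verification objective to raise robot task success rates.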

Comments
Project page: https://continuous-reasoning.airoa.io

English abstract

Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over π0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action.

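2606.00094 2026-06-09 cs.CV cs.AI Updated version

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

Duoduo Xue, Zhiyu Zhu, Junhui Hou

Affiliations: City University of Hong Kong

AI summary: Proposes MIND, a framework that explicitly models manifold geometry by integrating discrete patch tokenization into the score function of a continuous diffusion model, combining the structural quantification of discrete tokens with the parallel-generation flexibility of continuous diffusion and substantially lowering FID on ImageNet 256×256.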

English abstract

Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, low-dimensional, and compact parameterization space. To achieve this, we propose the Data Manifold-aware Image diffusioN moDel (MIND), a novel framework that explicitly models manifold geometry by integrating discrete patch tokenization into the score function of a continuous diffusion model. This approach successfully leverages both the structural quantification capabilities of discrete tokens and the parallel generation flexibility of continuous diffusion. Moreover, we enable end-to-end differentiable training via a novel soft top-$k$ aggregation mechanism and introduce dual-branch high-frequency feature embedding layers to alleviate the spectral bias of transformer backbones on low-dimensional inputs. Furthermore, for inference, we design a multi-stage transition sampling scheme that dynamically adjusts the sampling scheme based on timestep. Extensive experiments on ImageNet 256$\times$256 demonstrate the effectiveness of MIND. After 80-epoch training, our base model achieves an FID of 22.73 without guidance, nearly halving the 43.47 FID of the vanilla DiT-B/2 baseline. The proposed method reduces FID by 15.95 and 9.06 on average compared with the baselines DiT and SiT, respectively. For image generation on ImageNet 256$\times$256 with guidance, the proposed MIND-B with only 130M parameters achieves an FID of 2.06, surpassing LlamaGen-3B with 3.1B parameters. The proposed MIND-XL with 715M parameters further reduces the FID to 1.95. Our MIND introduces a fresh perspective on diffusion-based image generation, paving the way for future research and innovation in this community. The code will be publicly available.

2606.00024 2026-06-09 cs.CL Updated version

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Chen Qiu, Guozhong Li, Cristian McGee, Aritra Dutta, Panos Kalnis

Affiliations: King Abdullah University of Science and Technology; University of Central Florida

AI summary: Proposes Attention Run-time Termination (ART), which tracks accumulated attention outputs and stops further KV block accesses once their contribution becomes negligible, raising generation throughput at large batch sizes by 20% without a significant loss of accuracy.

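AI abstract (translated from Chinese)

Long-context decoding in large language models (LLMs) is severely limited by the memory bandwidth needed to fetch a large key-value (KV) cache. Most existing KV management methods rely on key-only pruning before decoding, even though evidence shows that attention outputs depend jointly on keys and values, because incorporating values into these methods incurs prohibitive additional overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks accumulated attention outputs during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. This design makes ART orthogonal to existing key-based KV cache management methods, so it integrates with them seamlessly. Experiments on the LongBench benchmark show that, compared with state-of-the-art baselines, ART achieves 20% higher generation throughput at large batch sizes while maintaining comparable accuracy.

English abstract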

Long-context decoding in Large Language Models (LLMs) is constrained by the cost of accessing and processing the Key-Value (KV) cache. Despite the evidence that attention outputs depend jointly on keys and values, most existing KV management methods rely on key-only pruning, as incorporating values incurs prohibitive additional overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks accumulated attention outputs during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. Rather than replacing KV selection, ART dynamically terminates redundant KV traversal on top of existing dense or sparse attention policies. We introduce a stability-based criterion that monitors both magnitude and directional changes of intermediate attention outputs, and provide a theoretical characterization of the resulting truncation error. Experiments on LongBench and RULER Needle-in-a-Haystack tasks show that ART increases the generation throughput of existing KV-cache methods by up to 20%, without compromising the quality of the results.
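The idea of stopping KV traversal once the running output stabilizes can be illustrated with a small NumPy sketch. This is not the authors' kernel: the function name `blockwise_attention`, the `tol` parameter, and the exact stopping rule (relative norm change plus cosine change, both below `tol`) are assumptions made for illustration, loosely following the abstract's mention of magnitude and directional changes. The sketch uses the standard online-softmax recurrence so the intermediate output is available after every block.

```python
import numpy as np

def blockwise_attention(q, K, V, block=4, tol=0.0):
    """Single-query attention over KV blocks with optional early termination.

    q: (d,), K: (n, d), V: (n, dv). Returns (output, blocks_visited).
    `tol` and the stopping rule are hypothetical, not taken from the paper.
    """
    d = q.shape[0]
    m = -np.inf                     # running max of logits (online softmax)
    l = 0.0                         # running softmax denominator
    acc = np.zeros(V.shape[1])      # running unnormalized output
    prev = None
    visited = 0
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)   # rescale old partial sums to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
        visited += 1
        out = acc / l
        if prev is not None:
            n_out, n_prev = np.linalg.norm(out), np.linalg.norm(prev)
            mag = abs(n_out - n_prev) / (n_prev + 1e-12)     # magnitude change
            cos = out @ prev / (n_out * n_prev + 1e-12)      # direction agreement
            if mag < tol and (1.0 - cos) < tol:
                break               # skip the remaining KV blocks
        prev = out
    return acc / l, visited
```

With `tol=0.0` the loop never terminates early and the result matches dense softmax attention. A positive `tol` trades accuracy for fewer block reads, and the truncated result is only a good approximation when the skipped blocks really do carry little attention mass.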

2605.31498 2026-06-09 cs.LG q-bio.BM Updated version

Scalable Inference-Time Annealing with Surrogate Likelihood Estimators

Daniel Peñaherrera, Rishal Aggarwal, David Ryan Koes

Affiliations: CMU-Pitt PhD Program in Computational Biology; Dept. of Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260, USA

AI summary: Proposes scalable inference-time annealing (SITA), which uses an energy-based model to provide fast surrogate likelihoods and avoids costly divergence computations, achieving state-of-the-art performance on alanine dipeptide and alanine tripeptide.

Comments
26 pages, 5 figures, submitted to JMLR 2026

English abstract

A long-standing challenge in computational chemistry and biophysics is efficiently sampling the Boltzmann distribution of molecules. Advances in generative modeling have been proposed to address the limitations of conventional sampling techniques by eliminating the computational cost of simulation. A promising direction is iteratively finetuning diffusion models along a temperature ladder, whereby training data is generated via importance sampling during inference-time annealing. Unfortunately, these methods require computing a divergence over the score field to estimate importance weights, rendering them intractable for larger systems. Here we present scalable inference-time annealing (SITA), which retrains flow-based models to generate samples at progressively lower temperatures using an energy-based model to facilitate fast surrogate likelihoods. We demonstrate state-of-the-art performance on both Alanine Dipeptide and Alanine Tripeptide while avoiding costly divergence terms. Our code is available at https://github.com/countrsignal/sita.git
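The importance-sampling step that the abstract refers to can be sketched generically. The snippet below computes self-normalized importance weights for a Boltzmann target $\exp(-U(x)/kT)$ given samples from some proposal with log-density `log_q`. It is a textbook construction, not the SITA code: the function names are hypothetical, and `log_q` stands in for whatever likelihood estimate is available (per the abstract, SITA obtains a fast surrogate from an energy-based model rather than a divergence-based likelihood; how that surrogate is trained is not shown here).

```python
import numpy as np

def importance_weights(energies, log_q, kT):
    """Self-normalized weights for target exp(-U/kT) and proposal log-density log_q."""
    log_w = -np.asarray(energies, dtype=float) / kT - np.asarray(log_q, dtype=float)
    log_w -= log_w.max()            # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

def effective_sample_size(w):
    """Kish effective sample size of normalized weights."""
    return 1.0 / np.sum(np.asarray(w) ** 2)
```

A low effective sample size after reweighting to a colder temperature signals that the proposal covers the target poorly, which is one reason a temperature ladder with small steps is commonly used.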

2605.31158 2026-06-09 cs.CV cs.LG Updated version

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo

Affiliations: Zhejiang University; NVIDIA

AI summary: Targets the high inference cost of interactive video world models with Light Interaction, a training-free acceleration framework that combines adaptive context management, denoising cache acceleration, and 3D block sparse attention for up to a 2.59x speedup.

Comments
13 pages, 6 figures, 3 tables. Project page: https://2843721358l-del.github.io/Light-Interaction-Project/

English abstract

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

2605.31014 2026-06-09 cs.LG Updated version

SDM-Q: Cost-Aware Staged Decision-Making for Multi-Omics Classification with Deep Q-Learning

Nan Mu, Yangfan Xiao, Ling Wang, Xiaoning Li, Yue Kang, Chen Zhao

Affiliations: College of Computer Science, Sichuan Normal University; Department of Mathematics, College of Science and Mathematics, Kennesaw State University; Department of Computer Science, College of Computing and Software Engineering, Kennesaw State University

AI summary: Proposes SDM-Q, a reinforcement learning framework that casts multi-omics diagnosis as a finite-horizon sequential decision problem and uses an action-value function to balance classification correctness against modality acquisition cost, reducing redundant modality acquisition on four public datasets while keeping classification performance competitive.

English abstract

Multi-omics data provide complementary molecular characterizations of disease phenotypes and play an important role in disease diagnosis and subtype classification in precision medicine. However, acquiring complete multi-omics profiles is expensive and time-consuming, while most existing deep learning methods assume full modality availability during inference, resulting in substantial redundancy and limited practicality in clinical settings. To address this issue, we propose SDM-Q, a reinforcement learning framework for adaptive and cost-aware multi-omics classification. Specifically, multi-omics diagnosis is reformulated as a finite-horizon sequential decision problem, where the currently acquired omics modalities define the diagnostic state at each stage. An action-value function determines whether to acquire an additional modality or terminate the decision process and output the final prediction. To balance diagnostic utility and acquisition cost, the reward is defined only at the terminal stage and jointly determined by classification correctness and cumulative modality acquisition cost. A backward stage-wise optimization strategy is introduced to improve policy consistency and training stability. Experiments on four public multi-omics datasets, including ROSMAP, LGG, BRCA, and KIPAN, demonstrate that SDM-Q effectively reduces redundant modality acquisition while maintaining competitive classification performance compared with methods using complete multi-omics inputs. In the BRCA and KIPAN datasets, more than 99% and 95% of subjects, respectively, achieve accurate classification using only a single omics modality, while the average number of acquired modalities remains below two for ROSMAP and LGG. These results suggest that cost-aware sequential decision-making provides an effective paradigm for improving the efficiency of precision medicine workflows.
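A terminal-only reward of the kind described above could look like the following sketch. The abstract says only that the reward is jointly determined by correctness and cumulative acquisition cost; the additive form, the `cost_weight` parameter, the function names, and the action labels are all assumptions made for illustration, not the paper's definition.

```python
def terminal_reward(correct, acquired_costs, cost_weight=1.0):
    """Reward paid only at termination: 1 for a correct prediction, 0 otherwise,
    minus the weighted sum of the costs of all modalities acquired so far."""
    return (1.0 if correct else 0.0) - cost_weight * sum(acquired_costs)

def choose_action(q_values):
    """Greedy choice over a dict mapping action names to estimated action values,
    e.g. {"stop": 0.9, "acquire:methylation": 0.4} (hypothetical labels)."""
    return max(q_values, key=q_values.get)
```

Under this form, acquiring another modality is worthwhile only when the expected gain in correctness exceeds its weighted cost, which is the trade-off the action-value function has to learn.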

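2605.30836 2026-06-09 cs.LG math.DG Updated version

Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits

Snigdha Chandan Khilar

Affiliations: Independent Researcher

AI summary: Presents a framework that unifies SVD-based compression methods such as SVD LLM and Basis Sharing under a single optimization problem, but finds experimentally that cross-layer coupling fails on practical tasks because the residual stream decouples adjacent layers during the forward pass, so per-layer optimization beats joint optimization.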

English abstract

Recent SVD-based compression methods for large language models, such as SVD LLM and Basis Sharing, can be unified under one optimization problem. While mathematical proofs and tests on Pythia models show that this unified approach improves weight reconstruction error by up to 46%, it fails in practical tasks. Downstream metrics such as perplexity and accuracy degrade severely compared with standard per-layer SVD LLM. The authors explain this failure mechanistically: although the bundle method mathematically couples adjacent layers, the transformer residual stream decouples them during forward passes. Thus per-layer optimality matters more than joint cross-layer optimization. The paper concludes that weight-space reconstruction is a flawed objective for cross-layer compression and that future methods must focus on per-layer activation reconstruction instead.

2605.30608 2026-06-09 cs.CL Updated version

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

Varsha Suresh, Mohammad Mahdi Abootorabi, Mohamed Salman, M. Hamza Mughal, Christian Theobalt, Ashwin Ram, Jürgen Steimle, Vera Demberg

Affiliations: Saarland University; MPI for Informatics; Saarland Informatics Campus; University of British Columbia; Vector Institute; Zuse School

AI summary: Proposes semantic motion anchors: 3D gestures are discretized into body-hand motion primitives and verbalized as structured descriptions, providing auxiliary contrastive supervision between text and motion and markedly improving the semantic relevance of co-speech gesture retrieval.

English abstract

Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors, natural-language abstractions of gesture motion capturing physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text retrieval directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns. A downstream retrieval-augmented gesture generation study showed that users significantly preferred gestures retrieved by our approach over a retrieval-augmented generation baseline, demonstrating that semantically grounded retrieval translates to gestures that better convey communicative intent in downstream generation.

2605.30407 2026-06-09 cs.CL cs.AI cs.IR cs.LG Updated version

Exploring Autonomous Agentic Data Engineering for Model Specialization

Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng

Affiliations: Zhejiang University; Platform and Content Group, Tencent

AI summary: Introduces the Autonomous Agentic Data Engineering task, in which LLMs act as autonomous data engineers that drive model specialization through end-to-end data curation. Experiments show GPT-5.2 improving a student model by 57.29% through iterative data adaptation.

Comments
Work in progress

English abstract

Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization (Code will be released at https://github.com/zjunlp/DataAgent).

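2605.28207 2026-06-09 cs.CL cs.AI cs.LG Updated version

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

Affiliations: KRAFTON; KAIST

AI summary: Presents the first systematic framework for converting a Mixture-of-Experts (MoE) model into a standard dense architecture through expert scoring, selection, grouping, concatenation, and knowledge distillation. At matched parameter count it beats dense-to-dense pruning by 6.3 percentage points in average downstream accuracy with 1.6x faster training.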

English abstract

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
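The concatenation step can be made concrete with a small NumPy sketch. It assumes plain two-matrix ReLU experts rather than the gated FFNs used by the models named above, and the function names are hypothetical. Under that assumption, stacking the input projections row-wise and the output projections column-wise yields a single dense FFN whose output equals the unweighted sum of the selected experts' outputs.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def expert_forward(x, w_in, w_out):
    """One FFN: w_in has shape (h, d), w_out has shape (d, h)."""
    return w_out @ relu(w_in @ x)

def concat_experts(experts):
    """Stack selected experts into one dense FFN.

    experts: list of (w_in, w_out) pairs.
    Returns (W_in, W_out) whose hidden size is the sum of the experts' hidden sizes.
    """
    W_in = np.concatenate([w_in for w_in, _ in experts], axis=0)     # (k*h, d)
    W_out = np.concatenate([w_out for _, w_out in experts], axis=1)  # (d, k*h)
    return W_in, W_out
```

The identity holds because the activation is elementwise, so each expert's hidden units stay independent after stacking. The router's per-token gate weights are lost in this merge, which is presumably where the magnitude scaling and distillation stages mentioned in the abstract come in.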

2605.27786 2026-06-09 cs.LG cs.AI Updated version

Locality-Aware Redundancy Pruning for LLM Depth Compression

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo, Minkyu Kim, Sunwoo Lee

Affiliations: University of Southern California; Neural Superintelligence Lab, MODULABS; Seoul National University; Inha University

AI summary: Proposes LoRP, a training-free one-shot depth pruning framework based on representation locality that introduces the Representation Locality Score (RLS) to identify and prune redundant layers, improving perplexity and downstream task accuracy across several LLMs.

English abstract

Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for improving inference efficiency. Existing one-shot pruning methods rely on local layer importance or fixed redundancy assumptions across architectures. We propose Locality-Aware Redundancy Pruning (LoRP), a training-free one-shot depth pruning framework guided by representation locality. We show that inter-layer redundancy can be either localized or globally distributed depending on the LLM architecture. To characterize this phenomenon, we introduce Representation Locality Score (RLS), derived from global inter-layer hidden-state similarity. Using a small calibration set, LoRP computes pairwise layer similarity, clusters layers by representational similarity, and allocates pruning according to residual intra-cluster redundancy. Experiments across diverse LLM families show improvements in both perplexity and downstream task accuracy. Official GitHub repository: https://github.com/daniel-eai/LoRP-Locality-Aware-Redundancy-Pruning/
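The first step of the pipeline, pairwise layer similarity on a calibration set, might be sketched as below. Mean token-wise cosine similarity is an assumed choice of metric, and `layer_similarity` is a hypothetical name. The RLS definition, the clustering, and the pruning allocation are not reproduced because the abstract does not specify them.

```python
import numpy as np

def layer_similarity(hidden_states):
    """Mean token-wise cosine similarity between every pair of layers.

    hidden_states: array of shape (num_layers, num_tokens, dim), e.g. the
    per-layer hidden states collected on a small calibration set.
    Returns a (num_layers, num_layers) matrix.
    """
    h = np.asarray(hidden_states, dtype=float)
    h = h / (np.linalg.norm(h, axis=-1, keepdims=True) + 1e-12)
    # sim[i, j] = mean over tokens t of <h[i, t], h[j, t]>
    return np.einsum("itd,jtd->ij", h, h) / h.shape[1]
```

A band of high values near the diagonal would indicate localized redundancy between neighbouring layers, while high values far from the diagonal would indicate the globally distributed case the abstract contrasts it with.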

2605.26872 2026-06-09 cs.LG cs.AI cs.CL Updated version

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

Zhengyu Hu, Zheyuan Xiao, Linxin Song, Fengqing Jiang, Yuetai Li, Zhengyu Chen, Zhihan Xiong, Yue Liu, Junhao Lin, Yao Su, Lijie Hu, Kaize Ding, Teng Xiao, Radha Poovendran

Affiliations: University of Washington; University of Texas at Austin; University of Southern California; Independent Researcher; National University of Singapore; Microsoft; Google; Mohamed bin Zayed University of Artificial Intelligence; Northwestern University; Allen Institute for AI (AI2)

AI summary: Proposes Student-Centric Answer Sampling (SCAS), a framework that selects among teacher-generated answers by their estimated student-centric learning cost, improving student model performance.

详情
AI中文摘要

LLM训练越来越依赖教师生成的监督,包括合成响应、推理轨迹和工具使用演示。当前实践通常选择表现最好的教师来生成学生训练数据,隐含地将教师测试表现视为教学质量的代理。我们表明这一假设可能失败:即使多个教师对同一问题提供正确答案,最强教师的答案也不一定是对给定学生的最佳监督。为解决这一问题,我们提出以学生为中心的答案采样(SCAS),该框架根据估计的学生中心学习成本从经过验证的教师生成答案中进行选择。受逐词梯度分解的启发,我们推导出该成本的高效前向代理,并在训练中用于指导答案选择。在30个教师模型、6个学生基础模型和6个任务上的实验表明,SCAS持续提升学生性能,表明有效的蒸馏应优先考虑与当前学生匹配的监督,而非仅依赖教师强度。

英文摘要

LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations. Current practice often chooses the highest-performing teacher to generate student training data, implicitly treating teacher test performance as a proxy for teaching quality. We show that this assumption can fail: even when multiple teachers provide correct answers to the same question, the answer from the strongest teacher is not necessarily the best supervision for a given student. To address this gap, we propose Student-Centric Answer Sampling (SCAS), a framework that selects from verified teacher-generated answers according to their estimated student-centric learning cost. Motivated by a token-wise gradient decomposition, we derive an efficient forward-only proxy for this cost and use it to guide answer selection during training. Experiments across 30 teacher models, 6 student base models, and 6 tasks show that SCAS consistently improves student performance, suggesting that effective distillation should prioritize supervision matched to the current student rather than teacher strength alone.

2605.26078 2026-06-09 cs.LG 版本更新

Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

Wasserstein策略梯度在熵正则化强化学习中的全局收敛性

Zhaoyu Zhu, Rui Gao, Shuang Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) The University of Texas at Austin(德克萨斯大学奥斯汀分校) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 本文通过利用熵正则化强化学习的Bellman结构,证明了Wasserstein策略梯度(WPG)方法的全局收敛性,并建立了分布Polyak-Łojasiewicz条件。

详情
AI中文摘要

Wasserstein策略梯度(WPG)是一种利用动作分布的最优传输几何的强化学习(RL)策略优化方法。对于熵正则化RL目标,WPG通过将每个状态条件策略沿软Q函数的动作梯度以及Langevin型扩散进行传输来演化。尽管它在连续控制问题中具有吸引力,但其全局收敛性质仍不清楚。标准的Langevin分析并不直接适用,因为RL目标通过Bellman递归而非静态凸泛函依赖于策略,且Langevin漂移由软Q函数决定,其正则性必须在策略迭代过程中加以控制。在本文中,我们通过利用熵正则化RL的Bellman结构,发展了WPG的全局收敛理论。我们表明,通常由凸性扮演的角色可以被基于Bellman的论证所取代:软Bellman残差相对于Gibbs策略具有状态级KL表示;Bellman压缩将此残差与全局最优性差距联系起来;而Bellman预解恒等式将价值改进与相对Fisher信息联系起来。结合演化Gibbs族的均匀对数Sobolev不等式(LSI),这些要素产生了分布Polyak-Łojasiewicz条件。我们进一步建立了控制离散化误差所需的正则性和一致界,从而获得直到离散化偏差的几何收缩。概念上,我们的分析表明,尽管熵正则化RL在通常的平坦意义上不是凸的,但Bellman递归诱导了一种有利的Polyak-Łojasiewicz型(PL)几何,支持WPG的全局收敛。

英文摘要

Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the action gradient of the soft Q-function together with a Langevin-type diffusion. Despite its appeal for continuous-control problems, its global convergence properties remain poorly understood. Standard Langevin analyses do not directly apply, because the RL objective depends on the policy through the Bellman recursion rather than through a static convex functional, and the Langevin drift is determined by the soft Q-function, whose regularity must be controlled along the policy iterates. In this paper, we develop a global convergence theory for WPG by exploiting the Bellman structure of entropy-regularized RL. We show that the role usually played by convexity can be replaced by a Bellman-based argument: the soft Bellman residual admits a statewise KL representation with respect to a Gibbs policy; Bellman contraction relates this residual to the global optimality gap; and a Bellman resolvent identity connects value improvement to relative Fisher information. Combined with a uniform log-Sobolev inequality (LSI) for the evolving Gibbs family, these ingredients yield a distributional Polyak-Łojasiewicz condition. We further establish the regularity and uniform bounds needed to control the discretization error, thereby obtaining geometric contraction up to a discretization bias. Conceptually, our analysis shows that although entropy-regularized RL is not convex in the usual flat sense, the Bellman recursion induces a favorable Polyak-Łojasiewicz-type (PL) geometry that supports global convergence of WPG.
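The particle-level picture behind "transport along the action gradient of the soft Q-function together with a Langevin-type diffusion" can be sketched with an Euler-Maruyama discretization. This is an illustration under stated assumptions, not the paper's algorithm: the action is one-dimensional, the soft Q-function is the toy quadratic Q(a) = -a^2/2 for a single fixed state, and `langevin_step`, `tau` (temperature), and `eta` (step size) are hypothetical names.

```python
import math
import random

def langevin_step(a, grad_q, eta, tau, rng):
    """One Euler-Maruyama step: drift along grad_a Q plus temperature-scaled Gaussian noise."""
    return a + eta * grad_q(a) + math.sqrt(2.0 * eta * tau) * rng.gauss(0.0, 1.0)

def grad_q(a):
    # Toy soft Q-function Q(a) = -a^2 / 2; its Gibbs policy exp(Q / tau) is N(0, tau).
    return -a

rng = random.Random(0)
tau, eta = 0.5, 0.05
particles = [rng.uniform(-3.0, 3.0) for _ in range(2000)]
for _ in range(400):
    particles = [langevin_step(a, grad_q, eta, tau, rng) for a in particles]

mean = sum(particles) / len(particles)
var = sum((a - mean) ** 2 for a in particles) / len(particles)
```

For this linear drift the discretized chain settles at variance tau / (1 - eta / 2) rather than exactly tau, a small instance of the discretization bias the abstract refers to.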

2605.26452 2026-06-09 cs.RO cs.LG cs.SY eess.SY 版本更新

Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

鲁棒Koopman控制屏障滤波器用于安全演员-评论家强化学习

Dhruv S. Kushwaha, Zoleikha A. Biron

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出鲁棒Koopman-CBF SAC框架,通过数据驱动学习Koopman预测器、构建提升空间中的仿射CBF约束并利用二次规划安全层实施,同时通过投影残差裕度处理近似误差,实现零约束违反或减少违规。

详情
Comments
17 pages, 7 figures
AI中文摘要

机器人系统的安全强化学习需要策略在训练和部署期间满足状态和输入约束的同时提高任务性能。控制屏障函数通过最小侵入性安全滤波器提供强制执行前向不变性的原则性机制,但其在无模型强化学习中的应用受限于对精确动力学和手工设计屏障证书的需求。我们提出鲁棒Koopman-CBF SAC,一种安全滤波的演员-评论家框架,从数据中学习有限维Koopman预测器,在提升空间中构建仿射CBF约束,并通过二次规划安全层强制执行。为考虑有限维Koopman近似误差,使用从留出轨迹数据估计的投影残差裕度收紧CBF条件。评论家在实际执行的安全动作上训练,而演员则被正则化向Koopman-CBF可行集,减少训练中对滤波器的依赖。在安全控制基准测试中,该方法在CartPole稳定和跟踪上实现零约束违反,同时匹配或超过无约束SAC的回报。在高维Safety Gymnasium运动任务中,该方法在某些设置下减少了违规,但也暴露了一阶速度屏障和线性EDMD模型的重要局限性,推动了高阶和多步Koopman-CBF扩展。这些结果表明,鲁棒Koopman-CBF滤波器是无模型强化学习和可证明安全之间的有前途桥梁,同时阐明了此类滤波器保持有效的结构条件。所有代码可在GitHub仓库获取:https://github.com/DhruvKushwaha/Koopman-CBF-Soft-Actor-Critic。

英文摘要

Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a principled mechanism for enforcing forward invariance through minimally invasive safety filters, but their use in model-free RL is limited by the need for accurate dynamics and hand-designed barrier certificates. We propose Robust Koopman-CBF SAC, a safety-filtered actor-critic framework that learns a finite-dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them through a quadratic-program safety layer. To account for finite-dimensional Koopman approximation error, the CBF condition is tightened using a projected residual margin estimated from held-out rollout data. The critic is trained on the executed safe action, while the actor is regularized toward the Koopman-CBF feasible set, reducing dependence on the filter over training. Across safe-control benchmarks, the method achieves zero constraint violations on CartPole stabilization and tracking while matching or exceeding unconstrained SAC returns. On high-dimensional Safety Gymnasium locomotion tasks, the method reduces violations in some settings but also exposes important limitations of first-order velocity barriers and linear EDMD models, motivating high-order and multi-step Koopman-CBF extensions. These results suggest that robust Koopman-CBF filters are a promising bridge between model-free RL and certifiable safety, while clarifying the structural conditions under which such filters remain effective.
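A quadratic-program safety layer with a single affine constraint has a well-known closed-form solution, which makes the "minimally invasive filter" idea easy to see. The sketch below assumes one constraint of the form a·u + b >= margin, where `margin` stands in for a robustness tightening. In the paper the constraint is built in the lifted Koopman space and the margin is estimated from data; here `a`, `b`, `margin`, and `safety_filter` are hypothetical placeholders.

```python
def safety_filter(u_nom, a, b, margin=0.0):
    """Solve min ||u - u_nom||^2 subject to a.u + b >= margin (single affine constraint).

    Returns u_nom unchanged when it is already feasible; otherwise projects it
    onto the constraint boundary along the direction of a.
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b - margin
    if slack >= 0.0:
        return list(u_nom)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint does not depend on u; the QP is infeasible")
    return [ui - slack * ai / norm_sq for ui, ai in zip(u_nom, a)]

# The nominal action violates -u[0] + 1 >= 0.2, so it is pulled back to u[0] = 0.8.
u_safe = safety_filter([2.0, 0.0], a=[-1.0, 0.0], b=1.0, margin=0.2)
# A feasible nominal action passes through untouched.
u_same = safety_filter([0.0, 0.0], a=[-1.0, 0.0], b=1.0, margin=0.2)
```

With several simultaneous constraints the closed form no longer applies and a general QP solver would be needed.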

2605.26108 2026-06-09 cs.CV 版本更新

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

通过奖励倾斜分布匹配增强少步生成器

Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang

发表机构 * Tencent Hunyuan(腾讯混元) Hong Kong University of Science and Technology(香港科技大学) Westlake University(西湖大学)

AI总结 提出奖励倾斜分布匹配蒸馏(RTDMD)两阶段框架,结合分布匹配蒸馏与奖励引导强化学习,在仅4步推理下实现文本到图像生成的最新性能。

详情
Comments
Code and models are available at https://github.com/Harahan/RTDMD
AI中文摘要

近期少步扩散蒸馏的进展实现了高效图像生成,但将这些模型与人类偏好对齐仍具挑战。我们提出奖励倾斜分布匹配蒸馏(RTDMD),一个两阶段框架,将分布匹配蒸馏与奖励引导的强化学习统一用于少步流生成器。我们证明,最小化到奖励倾斜教师分布的KL散度自然分解为分布匹配项和奖励最大化项。在第一阶段,我们引入环境一致分布匹配蒸馏(AC-DMD),它执行子区间分布匹配,并用一致性正则化增强假分数目标,帮助假分数模型在有限更新下跟踪变化的生成器分布。在第二阶段,我们联合优化两项:对于奖励最大化项,我们推导出一个混合策略梯度,将GRPO风格的估计器用于随机中间过渡,与通过确定性最后步骤的直接奖励反向传播相结合,并进一步引入步骤子集GRPO(SubGRPO)以降低方差。在SD3、SD3.5和FLUX.2上的实验表明,RTDMD在偏好、美学和组合指标上仅用4步推理就建立了新的最先进结果,超越了先前的少步文本到图像生成方法。代码和模型见https://github.com/Harahan/RTDMD。

英文摘要

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.

2605.30226 2026-06-09 cs.RO cs.AI 版本更新

BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

BORA: 弥合离线强化学习与在线残差适应以实现真实世界灵巧VLA模型

Zhongxi Chen, Yifan Han, Yanming Shao, Huanming Liu, Congsheng Xu, Xiaoyu Chen, Yao Mu, Wenzhao Lian

发表机构 * Shanghai Jiao Tong University(上海交通大学) CASIA(中国科学院自动化研究所) Shanghai AI Laboratory(上海人工智能实验室) USTC(中国科学技术大学)

AI总结 提出BORA框架,通过离线构建动作条件价值引导的评论家,并结合在线冻结VLA基础、引入人类在环的分块残差适应机制,解决灵巧操作中高维探索导致的时间不一致、样本低效和硬件风险问题,在五个真实灵巧任务上平均成功率提升33%。

详情
Comments
24 pages, 11 figures
AI中文摘要

视觉-语言-动作(VLA)模型已成为将视觉-语言理解融入真实世界机器人操作的一种有前景的范式。然而,由于高维手部控制和复合执行误差,灵巧操作对VLA策略仍然具有挑战性,这使得真实世界的强化学习后训练对于弥合视觉基础动作生成与物理可靠灵巧执行之间的差距至关重要。然而,高维灵巧探索常常引发真实世界中的时间不一致性、样本低效和硬件风险。为应对这些挑战,我们提出BORA,一种为真实世界灵巧VLA模型设计的离线到在线强化学习后训练框架。在离线阶段,BORA构建一个以VLM的认知令牌和动作块作为输入的评论家。这种设计实现了动作条件价值引导,使评论家能够评估超越视觉上下文的灵巧手部运动。在随后的在线阶段,BORA冻结VLA基础,并引入一种轻量级、人类在环(HiL)的分块残差适应机制,以减轻真实世界执行误差并进一步在真实物理环境中纠正离线学习到的意图。通过继承离线评论家并采用干预驱动奖励,BORA有效纠正执行差异并适应真实世界物理变化,同时将预训练策略作为稳定先验。在五个复杂真实世界灵巧任务上的广泛评估表明,BORA显著优于纯模仿学习和传统解耦强化学习基线,在标准设置下平均成功率绝对提升33%,在未见物体泛化中提升高达43%。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. Yet high-dimensional dexterous exploration often triggers temporal inconsistency, sample inefficiency, and hardware risks in the real world. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. During the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism to mitigate real-world execution errors and further correct the offline-learned intents within the actual physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA effectively corrects execution discrepancies and adapts to real-world physical variances while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.

2605.30184 2026-06-09 cs.LG physics.ao-ph 版本更新

Can AI Weather Models Predict Beyond Two Weeks? A Quantitative Benchmark and Analysis of Long Rollouts

AI天气模型能否预测两周以上?长期推演的定量基准与分析

Fanny Lehmann, Firat Ozdemir, Yun Cheng, Torsten Hoefler, Sebastian Schemm, Benedikt Soja, Siddhartha Mishra

发表机构 * ETH AI Center(ETH人工智能中心) ETH Zurich(苏黎世联邦理工学院) Swiss Data Science Center(瑞士数据科学中心) Scalable Parallel Computing Lab(可扩展并行计算实验室) Dep. of Applied Mathematics and Theoretical Physics(应用数学与理论物理系) University of Cambridge(剑桥大学) Institute of Geodesy and Photogrammetry(大地测量与摄影测量研究所) Seminar for Applied Mathematics(应用数学研讨会)

AI总结 通过九种AI天气模型的一年推演,将长期不稳定性分类为爆发、漂移和季节性丧失三种模式,并发现稳定性取决于对小时空尺度的处理。

详情
AI中文摘要

虽然AI天气模型在短期到中期预报(最多15天)中表现出色,但在更长时间推演时经常出现定义不清的“不稳定性”。本文通过九种最先进的AI天气模型的一年推演,将这些失败形式化为三种不同的模式:爆发、漂移和季节性丧失。我们的分析表明,稳定性取决于对小时空尺度的处理:不稳定的模型放大高频能量,而稳定的模型在输入中添加噪声时起到去噪作用。我们的发现远未将这些模型简化为随机鹦鹉,而是强调稳定模型根据初始状态生成独特的天气轨迹。我们通过对架构设计选择的消融研究验证了我们的发现,这些研究使用了最先进的Vision Transformer(ViT)AI天气模型架构。

英文摘要

While AI weather models excel at short-to-medium range forecasts (up to 15 days), they frequently suffer from ill-defined "instabilities" when rolled out over longer horizons. This work addresses the lack of a formal taxonomy by categorizing these failures into three distinct regimes: blow-up, drift, and loss of seasonality, through year-long rollouts of nine state-of-the-art AI weather models. Our analysis reveals that stability hinges on the treatment of small spatio-temporal scales: unstable models amplify high-frequency energy, while stable models act as denoisers when noise is added to their inputs. Far from reducing these models to mere stochastic parrots, our findings highlight that stable models generate unique weather trajectories, conditioned on the initial state. We verify our findings through ablation studies on architectural design choices, conducted using state-of-the-art Vision Transformer (ViT) AI weather model architectures.

2605.29920 2026-06-09 cs.LG 版本更新

Midpoint Generative Models

中点生成模型

Daniil Shlenskii, Nikita Gushchin, Lev Novitskiy, Dmitry V. Dylov, Alexander Korotin

发表机构 * AXXX, Russia(俄罗斯AXXX) Applied AI Institute, Russia(俄罗斯应用人工智能研究所) Kandinsky Lab, Russia(俄罗斯康定斯基实验室)

AI总结 提出中点生成模型(MGM),利用流匹配的对称性定义中点散度,并通过变分目标训练单步生成模型,在性能上与现有方法竞争。

详情
AI中文摘要

我们引入了中点生成模型(MGM),这是一个用于训练单步生成模型的原则性框架。MGM基于线性插值流匹配的一个简单对称性:当两个端点分布重合时,相应的漂移场在中点时间$t=1/2$处消失。我们证明该场的范数定义了分布之间的有效差异,称为中点散度。我们通过引入随机翻转插值将该散度扩展到中点之外,并通过用对称随机插值替代确定性线性流匹配插值进一步推广,得到广义中点散度。最后,我们推导了广义散度的变分形式,从而得到一个可处理的目标用于训练单步生成器。由此产生的MGM算法为生成建模提供了一种有效且理论上有依据的方法,在单步生成建模方法中取得了有竞争力的性能。

英文摘要

We introduce Midpoint Generative Models (MGM), a principled framework for training one-step generative models. MGM is based on a simple symmetry of Flow Matching with linear interpolation: when the two endpoint distributions coincide, the corresponding drift field vanishes at the midpoint time, $t=1/2$. We show that the norm of this field defines a valid discrepancy between distributions, which we call the Midpoint Divergence. We extend this discrepancy beyond the midpoint by introducing randomly flipped interpolations and further generalize it by replacing deterministic linear Flow Matching interpolations with symmetric stochastic interpolants, yielding a generalized Midpoint Divergence. Finally, we derive a variational formulation of our generalized divergence, yielding a tractable objective for training a one-step generator. The resulting MGM algorithm offers an effective and theoretically grounded approach to generative modeling, achieving competitive performance against existing one-step generative modeling methods.
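The midpoint symmetry stated in the abstract can be verified in a few lines. The derivation below is a sketch in our own notation, assuming that $x_0$ and $x_1$ are drawn independently from the same distribution $p$ and that the drift is the usual conditional expectation of linear-interpolation Flow Matching; the paper may state the result under more general couplings.

```latex
\begin{aligned}
x_t &= (1-t)\,x_0 + t\,x_1, \qquad
v_t(x) = \mathbb{E}\left[\,x_1 - x_0 \mid x_t = x\,\right], \\
x_{1/2} &= \tfrac{1}{2}\,(x_0 + x_1)
  \quad \text{is unchanged by the swap } x_0 \leftrightarrow x_1, \\
v_{1/2}(x) &= \mathbb{E}\left[\,x_1 - x_0 \mid x_{1/2} = x\,\right]
  = \mathbb{E}\left[\,x_0 - x_1 \mid x_{1/2} = x\,\right]
  = -\,v_{1/2}(x)
  \;\Longrightarrow\; v_{1/2}(x) = 0 .
\end{aligned}
```

The middle equality uses the fact that $(x_0, x_1)$ and $(x_1, x_0)$ have the same joint law when both endpoints come from $p$. When the endpoint distributions differ, this exchangeability fails and $v_{1/2}$ need not vanish, which is the property the Midpoint Divergence builds on.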

2605.29823 2026-06-09 cs.AI 版本更新

Quantifying and Optimizing Simplicity via Polynomial Representations

通过多项式表示量化和优化简单性

Tianren Zhang, Xiangxin Li, Minghao Xiao, Guanyu Chen, Feng Chen

发表机构 * arXiv.org [cs.AI](计算机科学与人工智能)

AI总结 提出多项式表示作为分布感知的低维神经函数代理,通过正交多项式基近似网络预测行为,以有效度作为简单性度量,并导出可微正则化器以提升泛化。

详情
Comments
ICML 2026
AI中文摘要

深度网络通常表现出对“简单”解的偏好,这种简单性偏差被广泛认为在泛化中起关键作用。然而,一种广泛适用、定量的简单性度量仍然难以捉摸。我们引入多项式表示作为分布感知的、低维神经函数代理:我们使用正交多项式基沿数据依赖的插值路径近似网络的预测行为,从而得到紧凑的函数表示。我们表明,该表示的有效度可作为实用的简单性度量,能够预测跨任务和架构的泛化,并且持续优于现有的泛化代理(如锐度)。最后,多项式表示自然产生可微的简单性正则化器,在图像和文本分类、微调对比视觉语言模型以及强化学习中持续改善泛化。

英文摘要

Deep networks often exhibit a preference for "simple" solutions, and such a simplicity bias is widely believed to play a key role in generalization. Yet a broadly applicable, quantitative measure of simplicity remains elusive. We introduce polynomial representations as a distribution-aware, low-dimensional surrogate for neural functions: we approximate a network's predictive behavior along data-dependent interpolation paths using orthogonal polynomial bases, yielding a compact functional representation. We show that the effective degree of this representation serves as a practical simplicity metric that is predictive of generalization across tasks and architectures, and consistently outperforms existing generalization proxies such as sharpness. Finally, polynomial representations naturally yield a differentiable simplicity regularizer, which consistently improves generalization in image and text classification, fine-tuning contrastive vision-language models, and reinforcement learning.

2504.19399 2026-06-09 cs.RO 版本更新

Follow Everything: A Leader-Following and Obstacle Avoidance Framework with Goal-Aware Adaptation

跟随一切:具有目标感知适应的领导者跟随与避障框架

Qianyi Zhang, Shijian Ma, Boyi Liu, Jianhao Jiao, Dimitrios Kanoulas

发表机构 * Institute of Robotics and Automatic Information System, Nankai University, China(南开大学机器人与自动化信息系统研究所) Centre for Data Science, University of Macau, China(澳门大学数据科学中心) Electrical and Computer Engineering Department, Hong Kong University of Science and Technology, China(香港科技大学电子与计算机工程系) Department of Computer Science, University College London, UK(伦敦大学学院计算机科学系) Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, Hong Kong, China(香港理工大学航空及民航工程学系)

AI总结 提出统一框架,用分割模型替代检测模型以跟随任意形态领导者,并设计目标感知适应机制和基于图的规划器,实现领导者暂时离开视野时的鲁棒跟随与避障。

详情
AI中文摘要

鲁棒且灵活的领导者跟随是机器人融入人类社会的一项关键能力。现有方法难以泛化到任意形态的领导者,并且在领导者暂时离开机器人视野时常常失败,本文引入了一个统一框架来应对这两个挑战。首先,用分割模型替代传统检测模型,使领导者可以是任何物体。为了增强识别鲁棒性,实现了一个距离帧缓冲区,在多个距离存储领导者嵌入,以考虑领导者跟随任务的独特特征。其次,设计了一种目标感知适应机制,根据领导者的可见性和运动来控制机器人规划状态,并辅以基于图的规划器,为每个状态生成候选轨迹,确保高效跟随和避障。在室内外环境中,使用腿式机器人跟随者与各种领导者(人、地面机器人、无人机、腿式机器人、停止标志)进行的仿真和真实世界实验显示,在跟随成功率、减少视觉丢失时长、降低碰撞率和减小领导者-跟随者距离方面取得了竞争性改进。

英文摘要

Robust and flexible leader-following is a critical capability for robots to integrate into human society. While existing methods struggle to generalize to leaders of arbitrary form and often fail when the leader temporarily leaves the robot's field of view, this work introduces a unified framework addressing both challenges. First, traditional detection models are replaced with a segmentation model, allowing the leader to be anything. To enhance recognition robustness, a distance frame buffer is implemented that stores leader embeddings at multiple distances, accounting for the unique characteristics of leader-following tasks. Second, a goal-aware adaptation mechanism is designed to govern robot planning states based on the leader's visibility and motion, complemented by a graph-based planner that generates candidate trajectories for each state, ensuring efficient following with obstacle avoidance. Simulations and real-world experiments with a legged robot follower and various leaders (human, ground robot, UAV, legged robot, stop sign) in both indoor and outdoor environments show competitive improvements in follow success rate, reduced visual loss duration, lower collision rate, and decreased leader-follower distance.

2605.29475 2026-06-09 cs.CL cs.AI cs.CE cs.HC 版本更新

MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery

MOOSE-Copilot:一个基于网络的交互式助手,用于统一探索性和细粒度科学假设发现

Hongran An, Zonglin Yang

发表机构 * Central Conservatory of Music(中央音乐学院) Nanyang Technological University(南洋理工大学)

AI总结 提出MOOSE-Copilot,通过形式化的人机交互协议,将发散性探索和收敛性细化统一,利用蓝图、路由和反馈三种信号引导生成,显著优于纯自主基线。

详情
Comments
Accepted to ACL 2026 (System Demonstrations)
AI中文摘要

大型语言模型(LLMs)在科学假设发现中展现出显著潜力。然而,现有方法存在两个关键限制:它们将发散性探索构思和收敛性细粒度细化视为孤立任务,并且自主运行,几乎没有人类指导。我们提出了MOOSE-Copilot,这是第一个通过形式化的人机交互(HAII)协议弥合这一抽象差距的统一框架。我们的系统使科学家能够通过三种显式信号引导生成过程:初始蓝图、阶段间路由和再生反馈。定量评估表明,注入这些结构化专家信号显著优于纯自主基线,并在神谕指导下建立了性能上限。此外,为了普及这一范式,我们开发了一个直观的基于网络界面,具有交互式树状可视化。这明确消除了复杂命令行代理工具的陡峭学习曲线,使跨学科研究人员能够直接利用、视觉编排并加速端到端的科学突破。

英文摘要

Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory search and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human-AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and intra-stage feedback. Using an oracle-simulated evaluation in which an LLM provides idealized expert signals, we show that injecting these structured signals significantly outperforms purely autonomous baselines, characterizing the gains achievable under high-quality guidance. Furthermore, we build a web-based interface that turns the framework into a no-code workflow: researchers pose a question, watch the hypothesis search unfold as an interactive tree, and steer it by selecting hypotheses, routing between stages, and injecting feedback, with no command-line agents required. This makes end-to-end hypothesis discovery directly accessible to interdisciplinary researchers.

2605.28912 2026-06-09 cs.LG cs.CR 版本更新

Cycle-Space Informed Detection of Autoencoded Blind False Data Injection Attacks on Power Systems

基于环空间感知的电力系统自编码器盲假数据注入攻击检测

Xin Li, Chenhan Xiao, Jonathan Cohen, Aviad Elyashar, Yang Weng, Rami Puzis

发表机构 * Faculty of Computer and Information Science, Ben-Gurion University, Be’er Sheva, Israel(计算机与信息科学学院,本·古里安大学,贝尔谢巴,以色列)

AI总结 针对自编码器利用测量流形零空间生成的盲假数据注入攻击,提出基于拓扑环空间检测器,利用最小环基实现最优泛化误差,有效检测数据驱动攻击。

详情
Comments
13 pages, 11 figures
AI中文摘要

人工智能驱动的数据中心和大型储能系统的快速增长,使得电力系统运行越来越依赖实时测量数据和自动决策。然而,许多现有的检测方法依赖于对测量值的统计或数据驱动分析,当攻击者利用相同的数据结构构造隐蔽扰动时,这些方法可能会失效。为说明这一局限性,我们展示了一种盲假数据注入攻击(FDIA),其中自编码器学习测量流形并生成与雅可比零空间对齐的扰动,从而使得攻击能够逃避基于残差的坏数据检测器和时间序列异常检测器。为了缓解利用零空间的数据驱动FDIA,我们提出了一种拓扑感知的环空间检测器(CSD),该检测器利用网络的环空间施加结构约束,以增强零空间估计。此外,我们证明,通过使用最小环基(MCB),所提出的CSD实现了攻击检测的最优泛化误差。通过利用拓扑导出的环约束而不是仅仅依赖于数值零空间估计,所提出的方法不需要精确的线路参数,并改善了正常测量与受攻击测量之间的分离。在IEEE 14、30、57和118节点系统上的仿真结果表明,该方法在实际测量噪声下有效检测数据驱动FDIA。

英文摘要

The rapid growth of AI-driven data centers and large-scale energy storage systems is increasing the reliance of power system operation on real-time measurement data and automated decision-making. However, many existing detection methods rely on statistical or data-driven analysis of measurements and can fail when attackers exploit the same data structure to craft stealthy perturbations. To illustrate this limitation, we demonstrate a blind False Data Injection Attack (FDIA) in which an autoencoder learns the measurement manifold and generates perturbations aligned with the Jacobian null space, thereby allowing the attack to evade both residual-based bad-data detectors and time-series anomaly detectors. To mitigate data-driven FDIAs that exploit the null space, we propose a topology-informed Cycle-Space Detector (CSD) that leverages the cycle space of the network to impose structural constraints that enhance null space estimation. In addition, we prove that by using the Minimum Cycle Basis (MCB), the proposed CSD achieves the optimal generalization error for attack detection. By exploiting topology-derived cycle constraints rather than relying solely on numerical null space estimation, the proposed method does not require precise line parameters and improves the separation between normal and attacked measurements. Simulation results on IEEE 14-, 30-, 57-, and 118-bus systems demonstrate that the proposed method effectively detects data-driven FDIAs under realistic measurement noise.
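The idea of a topology-derived cycle constraint can be illustrated on a toy graph. The sketch assumes edge quantities that are differences of node potentials (for example, phase-angle differences), so their signed sum around any cycle is zero. It builds a fundamental cycle basis from a breadth-first spanning tree, which is a simpler substitute for the Minimum Cycle Basis named in the abstract. The graph, `fundamental_cycles`, and `cycle_residuals` are hypothetical and are not the paper's detector.

```python
from collections import deque

def fundamental_cycles(n_nodes, edges):
    """Return one oriented cycle per non-tree edge, as lists of (edge_index, sign).

    Assumes a connected graph whose nodes are numbered 0..n_nodes-1.
    """
    adj = {v: [] for v in range(n_nodes)}
    for k, (i, j) in enumerate(edges):
        adj[i].append((j, k, +1))  # traversing edge k from i to j follows its orientation
        adj[j].append((i, k, -1))  # traversing it from j to i goes against it
    parent = {0: None}             # node -> (parent, edge index, sign of parent -> node)
    tree = set()
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for w, k, s in adj[v]:
            if w not in parent:
                parent[w] = (v, k, s)
                tree.add(k)
                queue.append(w)

    def path_to_root(v):
        path = []
        while parent[v] is not None:
            p, k, s = parent[v]
            path.append((k, s))
            v = p
        return path

    cycles = []
    for k, (i, j) in enumerate(edges):
        if k in tree:
            continue
        # Walk i -> j along edge k, then j -> root, then root -> i.
        cycle = [(k, +1)]
        cycle += [(e, -s) for e, s in path_to_root(j)]
        cycle += [(e, s) for e, s in path_to_root(i)]
        cycles.append(cycle)
    return cycles

def cycle_residuals(cycles, d):
    """Signed sum of edge quantities d around each cycle; zero for consistent data."""
    return [sum(s * d[k] for k, s in cycle) for cycle in cycles]

edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]
theta = [0.0, 0.1, -0.2, 0.05]
clean = [theta[i] - theta[j] for i, j in edges]
attacked = list(clean)
attacked[1] += 0.3  # perturb a single edge measurement
cycles = fundamental_cycles(4, edges)
```

Evaluating `cycle_residuals(cycles, clean)` gives values at floating-point zero, while the perturbed data leaves a nonzero residual on the cycle that contains edge 1. A perturbation that itself comes from node potentials would pass this check, so the sketch shows only the structural constraint, not a complete detector.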

2605.28860 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

灾难性遗忘的机制起源:为什么RL比SFT更好地保留电路?

Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 通过引入差异电路脆弱性指标,研究比较了强化学习与监督微调在大型语言模型微调中对内部计算电路的保留程度,发现RL虽任务适应较慢但能更好保留电路,从而减轻灾难性遗忘。

详情
AI中文摘要

微调大型语言模型(LLMs)经常导致先前能力的灾难性遗忘。最近的研究表明,强化学习(RL)比监督微调(SFT)更有效地保留先前能力,这归因于策略梯度更新更接近基础策略\cite{shenfeld2025rl}。我们将这种行为解释扩展到机制层面,并探究RL的优势是否通过内部计算电路的更强保留来体现。我们引入了差异电路脆弱性,一种头部级别的度量,用于衡量电路在微调下的退化程度,并将其用于比较RL和SFT在Qwen2.5-3B-Instruct适应科学问答任务上的表现。我们发现了清晰的机制权衡:SFT更快地适应目标任务,但导致更大的电路破坏和先前能力的遗忘,而RL保留了更大比例的基础电路,代价是任务适应较慢。这些发现表明,电路保留可能有助于解释为什么RL对灾难性遗忘更具鲁棒性。我们在此发布了代码:https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability。

英文摘要

Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.

2605.28831 2026-06-09 cs.CL cs.AI 版本更新

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

S3Mem:用于长时域交互式问答的结构化时空场景-事件记忆

Encheng Su, Jianyu Wu, Jinouwen Zhang, Qiucheng Yu, Chen Tang, Pengze Li, Lintao Wang, Aoran Wang, Xinzhu Ma, Shixiang Tang, Yizhou Wang, Houqiang Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) City University of Hong Kong(香港城市大学) The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) The University of Sydney(悉尼大学) Beihang University(北京航空航天大学)

AI总结 提出S3MEM框架,通过结构化场景-事件记忆和锚点敏感检索,在长时域交互式问答中实现比通用记忆接口更优的准确率-效率平衡。

详情
AI中文摘要

长时域交互代理通常积累大量轨迹历史,但仍无法可靠地回答关于早期事件的问题。我们认为主要瓶颈不仅是上下文长度,而是长期记忆的轨迹到答案接口。当历史以纯文本块存储并使用标准检索增强生成(RAG)查询时,系统通常检索到局部相关但链不完整的证据,特别是对于空间、时间、重复事件和多跳状态问题。我们提出S3MEM,一种用于长时域交互式问答(QA)的结构化场景-事件情节记忆框架。S3MEM将轨迹写入结构化记忆单元,通过锚点敏感检索检索证据,并为答案时间推理提供紧凑的令牌预算感知证据接口。从这个意义上说,S3MEM是一种结构化证据利用工具,将代理轨迹转换为查询对齐的支持。我们在两个内部标题环境(Crafter、Jericho)和两个外部环境(SciWorld、ALFWorld)上评估S3MEM。在共享的冻结答案时间协议下,S3MEM在所有四个环境中一致优于Vanilla RAG,在Crafter、Jericho和ALFWorld上超过Graph-NoReader,在SciWorld上与之匹配,同时使用的证据令牌显著减少。三个改编的近期基线——A-MEM启发、MemoryOS改编和LightMem改编——在多个设置中优于Vanilla RAG,但没有一个达到S3MEM的整体准确率-效率前沿。总体而言,证据支持一个有限的结论:在当前冻结的答案时间协议下,结构化写入和锚点敏感证据路由为长时域交互式QA提供了比通用记忆接口更强的准确率-效率前沿。

英文摘要

Long-horizon memory question answering often requires sparse evidence from heterogeneous histories, including events, object states, visual observations, temporal relations, and causal steps. Existing memory interfaces expand reader context, retrieve semantically related chunks, or expose graph neighborhoods, but they are not explicitly designed to select compact evidence for a fixed reader. We propose Structured Spatiotemporal Scene-Event Memory (S3Mem), a query-time memory interface that writes textual, visual, and agent-use histories into structured scene-event units and routes compact evidence packs to the reader. Its router scores candidate units, query anchors, and anchor-support links, enabling both single-hop selection and short multi-hop evidence chains without reader fine-tuning or test-time training. Across LoCoMo, EMemBench Visual Games, and AMA-Bench, S3Mem provides a strong score-token trade-off, with the clearest gains on localized event, state, temporal, causal, or provenance evidence. On LoCoMo, S3Mem reaches 0.48 F1 and 0.40 BLEU with 1,073 evidence tokens per question, about 15.8× fewer than the LoCoMo reference. On EMemBench Visual Games, it obtains the best F1 and second-best accuracy with only 189 tokens. On AMA-Bench, it is not the highest-scoring method, but remains competitive while using the fewest reader-visible evidence tokens.

2605.19276 2026-06-09 cs.CL cs.LG 版本更新

OpenCompass: A Universal Evaluation Platform for Large Language Models

OpenCompass:大型语言模型的通用评估平台

Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang, Zhiwei Fei, Tong Gao, Ge Jiaye, Mo Li, Hongwei Liu, Junnan Liu, Yuan Liu, Chengqi Lyu, Han Lyu, Ningsheng Ma, Zerun Ma, Yu Sun, Zhiyong Wu, Linchen Xiao, Zhuozhi Xiong, Jun Xu, Haochen Ye, Zhaohui Yu, Yike Yuan, Songyang Zhang, Yufeng Zhao, Fengzhe Zhou, Peiheng Zhou, Dongsheng Zhu, Lin Zhu, Jingming Zhuo

发表机构 * OpenCompass Team(OpenCompass团队) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出OpenCompass,一个模块化、高兼容性、灵活且高并发的通用LLM评估平台,支持多种任务场景和主流基准数据集。

详情
AI中文摘要

近年来,人工智能领域经历了从特定任务的小规模模型到通用大型语言模型(LLM)的范式转变。随着LLM的快速迭代,对其能力进行客观、定量和全面的评估已成为推动技术发展的关键环节。目前,基于静态基准数据集的主流评估方法面临任务类型多样性、评估标准不一致以及数据处理流程碎片化等挑战,难以高效进行跨领域和大规模模型评估。为解决上述问题,本文提出并开源了OpenCompass,一个一站式、可扩展且支持高并发的通用LLM评估平台。该平台遵循模块化和组件解耦的设计理念,具有三大核心优势:高兼容性、灵活性和高并发性。OpenCompass的核心架构包括五个关键组件:配置系统、任务划分模块、执行与调度模块、任务执行单元和结果可视化模块。其工作流程提供基于规则、LLM作为评判者和级联评估器,以适应不同任务场景的需求。平台支持知识、推理、计算、科学、语言、代码等多个领域的基准数据集,为学术界和工业界提供统一高效的LLM评估工具,有助于准确识别LLM的优缺点并进行后续优化。

英文摘要

In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.

2605.25985 2026-06-09 cs.AI 版本更新

Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables

面向多自由变量复杂逻辑查询的神经可扩展符号搜索框架

Weizhi Fei, Hang Yin, Zihao Wang, Shukai Zhao, Wei Zhang, Yangqiu Song

发表机构 * Department of Mathematical Sciences, Tsinghua University(清华大学数学科学系) Squarepoint Capital(Squarepoint资本) Department of Computer Science and Engineering, Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系) Department of Computer Sciences, University of Rochester(罗切斯特大学计算机科学系)

AI总结 针对知识图谱上多自由变量复杂查询的联合排序难题,提出神经可扩展符号搜索(NS3)框架,通过预算约束和超节点合并近似联合排序,显著提升性能。

详情
Comments
10 pages, 5 figures
AI中文摘要

复杂查询回答(CQA)是在不完整知识图谱(KG)上进行知识表示和推理的基本任务。回答带有$k$个自由变量的存在性一阶查询(即$\text{EFO}_k$查询)是一个关键但具有挑战性的问题,因为它需要对$\mathcal{E}^k$中的答案元组进行排序,其中$\mathcal{E}$表示KG的实体集。随着$k$的增长,这很快变得难以处理。因此,现有基准和方法依赖于单个变量的边际排序;然而,边际排序是元组真实联合排序的较差代理。基于$\text{EFO}_1$查询的神经符号搜索,我们提出了神经可扩展符号搜索(NS3),这是一个预算框架,无需枚举$\mathcal{E}^k$即可近似联合排序。NS3 (i) 回答边际化子查询以获得必要的候选集,(ii) 将多个自由变量合并为超节点,其域由动态预算$B$修剪和控制,以及(iii) 逐步将$\text{EFO}_k$查询简化为在预算缩减域上的$\text{EFO}_{k-1}$查询。在三个标准KG数据集上,NS3在保持强边际准确性的同时,显著提高了联合排序性能。我们进一步发布了一个联合排序基准,将现有的$\text{EFO}_1$数据集扩展到$k=3$,从而能够系统评估多变量查询。我们的代码提供在https://github.com/HKUST-KnowComp/NS3_KDD2026。

英文摘要

Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answering existential first-order queries with $k$ free variables (i.e., $\text{EFO}_k$ queries) is a crucial yet challenging problem, as it requires ranking answer tuples in $\mathcal{E}^k$, where $\mathcal{E}$ denotes the entity set of a KG. This quickly becomes intractable as $k$ grows. Consequently, existing benchmarks and methods rely on marginal rankings over individual variables; however, marginal rankings are a poor proxy for the true joint ranking of tuples. Building on neural symbolic search for $\text{EFO}_1$ queries, we propose Neural Scalable Symbolic Search (NS3), a budgeted framework that approximates joint ranking without enumerating $\mathcal{E}^k$. NS3 (i) answers marginalized sub-queries to obtain necessary candidate sets, (ii) merges multiple free variables into hypernodes whose domains are pruned and controlled by a dynamic budget $B$, and (iii) progressively reduces an $\text{EFO}_k$ query to an $\text{EFO}_{k-1}$ query over a budgeted reduced domain. Across three standard KG datasets, NS3 substantially improves joint ranking performance while retaining strong marginal accuracy. We further release a joint-ranking benchmark that extends existing $\text{EFO}_1$ datasets to $k=3$, enabling systematic evaluation of multi-variable queries. Our code is provided in https://github.com/HKUST-KnowComp/NS3_KDD2026.

2605.25449 2026-06-09 cs.CV 版本更新

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion

Pantheon360: 通过3D感知的360°视频扩散驯服数字孪生生成

Ting-Hsuan Chen, Ying-Huan Chen, Tao Tu, Jie-Ying Lee, Cho-Ying Wu, Fangzhou Lin, Hengyuan Zhang, David Paz, Xinyu Huang, Yuliang Guo, Yu-Lun Liu, Yue Wang, Liu Ren

发表机构 * University of Southern California(南加州大学) National Yang Ming Chiao Tung University(国立阳明交通大学) Cornell University(康奈尔大学) Bosch Research(博世研究)

AI总结 提出Pantheon360框架,利用显式3D缓存从稀疏360°输入生成高保真视频,实现全局几何一致性和可控相机路径,解决传统透视视频生成器视野受限导致的跨视图不一致和时间漂移问题。

详情
Comments
Accepted to CVPR 2026. Project page: https://koi953215.github.io/pantheon360_page/
AI中文摘要

从视频生成完整的数字孪生需要精确的相机控制、全局场景覆盖以及严格的空间-时间一致性约束,由于透视视频生成器的视野(FoV)有限,这些要求仍然具有挑战性。其狭窄的FoV迫使使用长轨迹或多视图轨迹,从而加剧了跨视图不一致和时间漂移。我们认为360°视频生成提供了一种自然的解决方案:全景覆盖简化了轨迹设计,并为保持一致性提供了强大的全局上下文。我们提出Pantheon360:通过3D感知的360°视频扩散驯服数字孪生生成,这是一个可控的360°视频生成框架,能够从稀疏的360°输入合成高保真视频。关键思想是一个显式的3D缓存,从输入中重建,作为任何用户定义相机路径的几何骨架。这使得扩散模型可以专注于逼真的纹理细化,同时3D缓存强制执行全局几何一致性。实验表明,Pantheon360实现了卓越的视觉质量和无与伦比的几何一致性,为下游仿真和数字孪生应用提供了可靠且灵活的360°场景生成。

英文摘要

Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides a strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.