arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30353 2026-05-29 cs.AI astro-ph.CO cs.HC cs.SE

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

Nhat-Minh Nguyen

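AI summary: Using a case in which a physicist supervised an AI coding agent building a differentiable perturbation theory module, the paper studies the reliability of AI agents in scientific software development and finds that supervision design, more than model capability, determines whether the output can be trusted.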

Comments
10 pages, 2 figures, 2 tables, 1 physicist and a few AI agents. Accepted by ICML 2026 AI for Science Workshop. Code and development log are available at this repo: https://github.com/MinhMPA/clax-pt
Abstract

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more by the physicist's domain knowledge. The three it could not -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here, not obviously addressed by scaling alone. [Abridged.]
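
The first of the three supervision practices, testing away from the fiducial calibration point, can be illustrated with a toy sketch. Everything below is a hypothetical stand-in rather than CLAX-PT or CLASS-PT code: `reference` plays the role of the oracle, `flawed` is a model with the wrong parameter dependence, and `FUDGE` is a constant calibrated at one assumed fiducial value.

```python
# Toy illustration only: hypothetical functions, not taken from CLAX-PT.

FIDUCIAL = 0.3  # assumed fiducial parameter value used for calibration


def reference(param: float) -> float:
    """Stand-in for the oracle: the 'true' parameter dependence."""
    return param ** 2


def flawed(param: float) -> float:
    """Stand-in for a model whose structure has the wrong dependence."""
    return param ** 1.5


# A calibrated constant that forces agreement at the fiducial point only.
FUDGE = reference(FIDUCIAL) / flawed(FIDUCIAL)


def patched(param: float) -> float:
    """Flawed model multiplied by the calibrated constant."""
    return FUDGE * flawed(param)


def relative_error(param: float) -> float:
    """Relative deviation of the patched model from the oracle."""
    return abs(patched(param) - reference(param)) / reference(param)


if __name__ == "__main__":
    for value in (0.2, FIDUCIAL, 0.4):
        print(f"param={value:.1f}  relative error={relative_error(value):.3f}")
```

The patched model agrees with the oracle where it was calibrated and drifts elsewhere, so a test suite evaluated only at the fiducial point cannot tell it apart from a correct implementation.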

2605.30352 2026-05-29 cs.CV

GMOS: Grounding Moving Object Segmentation in 3D Space and Time

Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman

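AI summary: Proposes GMOS, a framework that performs 3D-aware, temporally fine-grained segmentation of multiple moving objects directly on RGB video, and introduces the GMOS-2K dataset and the MOS-I evaluation protocol, achieving state-of-the-art results on several benchmarks.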

Comments
Project Page: https://www.robots.ox.ac.uk/vgg/research/gmos/
Abstract

Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two fundamental limitations: they rely on pre-computed 2D auxiliary modalities such as optical flow or point trajectories that lack 3D geometric information, and they treat motion as a sequence-level attribute, overlooking the instantaneous motion state of each object. We address both by grounding MOS in 3D space and time, and propose GMOS, a framework that operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, alongside a foreground--background variant GMOS-S for faster deployment. To support training and evaluation in this regime, we curate GMOS-2K, a dataset of 2,210 real-world videos with per-object temporal motion annotations drawn from five established Video Object Segmentation (VOS) benchmarks, and formalise MOS-I ("I" for instantaneous), a temporally fine-grained evaluation protocol with three complementary metrics. GMOS achieves state-of-the-art results across MOS, MOS-I, and Unsupervised VOS benchmarks, while running significantly faster than prior multi-object MOS methods and supporting online inference for streaming deployment.

2605.30351 2026-05-29 cs.CV cs.AI

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag

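AI summary: Proposes VideoMLA, which uses Multi-Head Latent Attention (MLA) to replace per-head KV with a shared low-rank content latent and a decoupled 3D-RoPE positional key, cutting KV memory in video diffusion by 92.7% while preserving quality and improving throughput.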

Comments
Project Page: https://videomla.github.io/
Abstract

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
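
As a rough illustration of the memory arithmetic, the sketch below compares the per-token cache size of a per-head KV layout with that of a shared content latent plus a shared positional key. All dimensions are hypothetical, picked only so that the ratio lands near the reported 92.7%; the abstract does not state the model's actual head count, head dimension, or latent size.

```python
def kv_elements_per_token(num_heads: int, head_dim: int) -> int:
    """Per-head layout: one key vector and one value vector for every head."""
    return 2 * num_heads * head_dim


def mla_elements_per_token(latent_dim: int, rope_dim: int) -> int:
    """MLA-style layout: one shared content latent plus one shared positional key."""
    return latent_dim + rope_dim


def reduction(num_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token cache entries saved by the latent layout."""
    baseline = kv_elements_per_token(num_heads, head_dim)
    compressed = mla_elements_per_token(latent_dim, rope_dim)
    return 1.0 - compressed / baseline


# Hypothetical configuration (an assumption, not taken from the paper).
saving = reduction(num_heads=12, head_dim=128, latent_dim=192, rope_dim=32)
print(f"{saving:.1%}")  # prints 92.7%
```

With these assumed numbers the baseline stores 3072 values per token per layer and the latent layout stores 224, so the saving is independent of sequence length and window size.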

2605.30350 2026-05-29 cs.RO cs.LG

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung, Sungha Kim, Daesol Cho, H. Jin Kim, Jia-Bin Huang, Furong Huang

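AI summary: Proposes DynaFLIP, a dynamics-aware multimodal pre-training framework that trains an image encoder on image-language-3D flow triplets, aligning the three modalities through simplex-volume minimization combined with a cosine regularizer and a contrastive objective, to improve motion understanding and generalization in robot manipulation.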

Comments
Project website: https://dynaflip-robotics.github.io
Abstract

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
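
The simplex-volume idea can be sketched for three embeddings as follows. The abstract does not give the exact volume definition, so this sketch assumes the Euclidean area of the triangle whose vertices are the three unit-normalized embeddings, computed from the Gram determinant of two edge vectors; the function names and the example vectors are illustrative only.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def simplex_area(image, text, flow):
    """Area of the triangle spanned by three unit-norm embeddings.

    Uses the Gram determinant of the two edge vectors leaving `image`.
    """
    a, b, c = normalize(image), normalize(text), normalize(flow)
    u = [y - x for x, y in zip(a, b)]
    v = [y - x for x, y in zip(a, c)]
    gram_det = dot(u, u) * dot(v, v) - dot(u, v) ** 2
    return 0.5 * math.sqrt(max(gram_det, 0.0))  # clamp tiny negative rounding


# Mutually orthogonal embeddings: poorly aligned, large area.
spread = simplex_area([1, 0, 0], [0, 1, 0], [0, 0, 1])
# Nearly parallel embeddings: well aligned, small area.
close = simplex_area([1, 0, 0], [1, 0.1, 0], [1, 0, 0.1])
# Two embeddings coincide while the third is orthogonal: area is still zero.
degenerate = simplex_area([1, 0, 0], [1, 0, 0], [0, 1, 0])
```

The last case has zero area even though one embedding is far from the other two. That is one plausible reading of the "geometric ambiguity" the abstract mentions, and of why a cosine regularizer and a contrastive objective accompany the volume term, though the paper's own argument may differ.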

2605.30349 2026-05-29 cs.CV

AdaState: Self-Evolving Anchors for Streaming Video Generation

Yusuf Dalva, Pinar Yanardag

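AI summary: To address static first-frame anchors suppressing video dynamics in autoregressive video diffusion models, proposes an adaptive state anchor, a hidden latent that is denoised jointly with content and evolves over time, markedly improving motion richness and natural scene progression.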

Comments
Project page: https://adastate.github.io/
Abstract

Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and serves as the primary scene reference throughout generation. As the cleanest and most error-free position in the cache, this anchor draws disproportionate attention, suppressing video dynamics, and locking scene composition to the initial viewpoint even as the scene naturally evolves. The result is a temporally shallow video in which motion, camera movement, and scene progression are dampened in favor of static consistency. To address this, we replace the static anchor with an adaptive state, a hidden latent that the model denoises alongside content at every chunk but never renders. Rather than referencing a frozen first frame, the model generates its own scene anchor at each step by attending to both the previous state and the current content, producing a reference that evolves with the generated content. Unlike standard video generation, which encodes an absolute notion of time, our formulation treats time as relative: every generation step sees the same positional structure regardless of how far generation has progressed, and the state transition is identical at every chunk. Together, these properties introduce a recurrence into the generation process, where denoising serves as the transition function, and the KV cache serves as the carrier, requiring no external module. Experiments demonstrate that the adaptive state substantially improves video dynamics, enabling richer motion and natural scene progression within generated videos.

2605.30348 2026-05-29 cs.CL cs.AI cs.LG

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen

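AI summary: Proposes the LLMSurgeon framework, which treats the task as an inverse problem and estimates the domain distribution of a target LLM's pretraining corpus from its generated text, enabling post-hoc auditing without access to training data.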

Comments
ACL 2026 Main. Code at https://github.com/Yaxin9Luo/LLMSurgeon
Abstract

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
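
The constrained inverse problem under label shift can be sketched in a few lines. This is a generic illustration, not the authors' solver. It assumes a soft confusion matrix `confusion[i][j]` (the average probability a domain classifier assigns to domain `j` on text truly from domain `i`) and a vector `observed` (the average classifier output on the target model's generations), and it recovers the mixture by projected gradient descent onto the probability simplex. The three-domain numbers are made up.

```python
def project_to_simplex(vec):
    """Euclidean projection onto {p : p_i >= 0, sum(p) = 1}."""
    ordered = sorted(vec, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in vec]


def predicted_outputs(confusion, mixture):
    """Mean classifier output implied by a mixture: q_j = sum_i pi_i * C[i][j]."""
    n = len(confusion)
    return [sum(mixture[i] * confusion[i][j] for i in range(n)) for j in range(n)]


def estimate_mixture(confusion, observed, steps=2000, lr=0.4):
    """Minimize ||C^T pi - q||^2 over the simplex by projected gradient descent."""
    n = len(confusion)
    mixture = [1.0 / n] * n
    for _ in range(steps):
        predicted = predicted_outputs(confusion, mixture)
        residual = [p - q for p, q in zip(predicted, observed)]
        grad = [2.0 * sum(confusion[i][j] * residual[j] for j in range(n))
                for i in range(n)]
        mixture = project_to_simplex([m - lr * g for m, g in zip(mixture, grad)])
    return mixture


# Hypothetical three-domain setup (e.g. web, code, books); values are invented.
confusion = [[0.80, 0.15, 0.05],
             [0.10, 0.80, 0.10],
             [0.05, 0.15, 0.80]]
true_mixture = [0.5, 0.3, 0.2]
observed = predicted_outputs(confusion, true_mixture)
recovered = estimate_mixture(confusion, observed)
```

Simply reading `observed` as the mixture would give roughly (0.44, 0.345, 0.215) here, which is the systematic domain confusion that the inversion step is meant to correct.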

2605.30347 2026-05-29 cs.CV cs.GR

NeuROK: Generative 4D Neural Object Kinematics

Chen Geng, Guangzhao He, Yue Gao, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu

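AI summary: Proposes NeuROK, a transformer-based encoder-decoder model that learns a latent kinematic parameterization of objects to generate realistic 4D dynamic deformations from static 3D objects, removing the reliance of traditional methods on predefined physical models and specific categories.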

Comments
CVPR 2026
Abstract

Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics -- realistic temporal deformations of static objects under various physical conditions -- remains challenging and often ad hoc, despite its importance in building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space representing all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameterization as Neural Object Kinematics (NeuROK), and learn a transformer-based encoder-decoder model on a curated large-scale 4D dataset. This formulation and the learned model significantly simplify the generation of simulative dynamics since we only need to consider the dynamics within a low-dimensional latent space from the Lagrangian mechanics' perspective in classical physics. We demonstrate the effectiveness and generality of this neural simulation framework across diverse dynamic object types, showing clear advantages over prior works. Project page: https://chen-geng.com/neurok

2605.30346 2026-05-29 cs.CV

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang, Yu-Lun Liu, Zhixiang Wang

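AI summary: Proposes the YoCausal benchmark, which generates counterfactual samples by temporally reversing real videos and uses the Reverse Surprise Index (RSI) and Causality Cognition Index (CCI) to evaluate the causal understanding of video diffusion models, finding that perceiving the direction of time does not amount to understanding causality and that a significant gap to human level remains.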

Comments
Project page: https://www.youzhexie.me/papers/YoCausal/index.html
Abstract

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.

2605.30345 2026-05-29 cs.AI cs.CL cs.LG

SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations

Qinpei Luo, Ruichun Ma, Xinyu Zhang, Lili Qiu

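AI summary: Proposes SchGen, the first large language model that generates editable PCB schematics from natural-language requests; a semantically grounded code representation turns a geometry-driven problem into a semantics-driven matching task, and a large-scale dataset is constructed, yielding significantly better wire connectivity accuracy and functional correctness than existing methods.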

Comments
19 pages, 7 figures
Abstract

Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, the first large language model that generates editable PCB schematics from natural-language requests. The key challenge lies in the lack of an LLM-suited representation and a large-scale dataset. Current schematic formats are dominated by verbose, tool-specific syntax and geometry-heavy descriptions, making them difficult to generate reliably. We introduce a semantically grounded code representation that encodes schematic editing primitives with relative placement and pin-name-based wiring, transforming a geometry-driven generation problem into a semantics-driven matching task amenable to LLMs. We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation. Experiments show that SchGen significantly outperforms alternative representations and even larger general-purpose LLMs on wire connectivity accuracy and functional correctness. Our results highlight the critical role of representation design in enabling generative models for complex hardware design tasks.

2605.30344 2026-05-29 cs.AI

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Xiaona Zhou, Muntasir Wahed, Tianjiao Yu, Constantin Brif, Ismini Lourentzou

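AI summary: To address the lack of natural-language explanations in time-series anomaly detection, builds the VisAnomBench benchmark and fine-tunes VisAnomReasoner, a parameter-efficient vision-language model, which significantly outperforms baselines in accuracy and generalization.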

Abstract

Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal models to finding abnormal patterns in sequential data. Public anomaly detection benchmarks typically provide interval annotations but not natural-language rationales, making it difficult to fine-tune VLMs to produce grounded, interpretable decisions. To address this gap, we construct VisAnomBench, a curated benchmark built from public time-series datasets and augmented with high-quality anomaly explanations selected from multiple large VLMs using fine-grained, task-specific rewards. Through fine-tuning on this benchmark, we develop VisAnomReasoner, a parameter-efficient VLM for time-series anomaly detection. Experimental results on VisAnomBench show that VisAnomReasoner achieves more accurate anomaly localization and consistently outperforms all baselines, with improvements of at least 21.23 and 23.87 percentage points in precision and F1, respectively. Additional experiments on the TSB-AD-U benchmark demonstrate strong cross-benchmark generalization, with VisAnomReasoner improving precision and F1 by 9.57 and 13.39 percentage points, respectively.

2605.30343 2026-05-29 cs.CL cs.AI

Unlocking the Working Memory of Large Language Models for Latent Reasoning

Lukas Aichberger, Sepp Hochreiter

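AI summary: Proposes RiM, a latent reasoning method that replaces the autoregressive generation of intermediate reasoning steps with fixed memory blocks, enabling compute-efficient latent reasoning in a single forward pass.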

Comments
Preprint
Abstract

To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby conflates internal computation with external communication. In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts. Drawing on this principle, we introduce Reasoning in Memory (RiM), a latent reasoning method that replaces the autoregressive generation of reasoning steps with memory blocks. These memory blocks are fixed sequences of special tokens that unlock the working-memory capacity of large language models. Since they are fixed rather than generated, they can be processed in a single forward pass, enabling compute-efficient latent reasoning. To operationalize these memory blocks, we employ a two-stage curriculum. First, we ground them by predicting explicit reasoning steps after each memory block. Second, we discard this step-level supervision and iteratively refine the final answer after each memory block. Our experiments on reasoning benchmarks show that, across language models of different families and sizes, RiM matches or exceeds existing latent reasoning methods while avoiding the autoregressive generation of thoughts. These results demonstrate that large language models can be trained to use working memory as an effective mechanism for latent reasoning.

2605.30342 2026-05-29 cs.CV cs.RO

Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field

Shangjie Xue, Jesse Dill, Dhruv Ahuja, Frank Dellaert, Panagiotis Tsiotras, Danfei Xu

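AI summary: Proposes the GAVIS framework, which quantifies uncertainty in 3DGS through an anisotropic visibility field and performs active mapping based on maximum information gain, significantly outperforming existing methods in accuracy and efficiency.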

Comments
Accepted to CVPR 2026. Project page https://gatech-rl2.github.io/GAVIS/
Abstract

We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network-based uncertainty-aware 3DGS rasterizer, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation. Extensive experiments across diverse environments demonstrate that GAVIS consistently and significantly outperforms prior approaches in both accuracy and efficiency. Moreover, beyond standalone use, our method can be applied post-hoc to improve the performance of existing approaches.

2605.30341 2026-05-29 cs.CV cs.AI

GPIC: A Giant Permissive Image Corpus for Visual Generation

Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, Li Fei-Fei

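AI summary: Introduces GPIC, a large permissively licensed image corpus of approximately 28 trillion pixels with 100 million training examples, captioned by a state-of-the-art vision-language model, for research on visual generative modeling.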

Comments
25 pages; Dataset: https://huggingface.co/datasets/stanford-vision-lab/giant-permissive-image-corpus; Project website: https://gpic.stanford.edu
Abstract

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu

2605.30339 2026-05-29 cs.CV cs.MM cs.SD eess.AS

Benchmarking Single-Factor Physical Video-to-Audio Generation

Tingle Li, Siddharth Gururani, Kevin J. Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu

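AI summary: Proposes the FlatSounds benchmark, which evaluates the physical reasoning of video-to-audio models through controlled counterfactual pairs and single-video pattern tests, finding that models rely on text captions rather than the visual stream and that physical accuracy trades off against temporal alignment.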

Comments
CVPR 2026
Abstract

Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data. Project webpage: https://research.nvidia.com/labs/cosmos-lab/flatsounds/

2605.30338 2026-05-29 cs.CV

REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image

Xiaoxuan Ma, Jiashun Wang, Nicolas Ugrinovic, Yehonathan Litman, Kris Kitani

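AI summary: Proposes the REST3D framework, which reconstructs physically stable 3D scenes from a single RGB image by combining physical scene understanding with physics-constrained optimization, significantly reducing physical errors and improving simulation stability.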

Comments
Project page: https://shirleymaxx.github.io/REST3D/
Abstract

Reconstructing physically stable 3D scenes from a single RGB image enables casual images to be converted into simulation-ready digital assets for applications such as immersive interaction and content creation. However, existing single-image reconstruction methods fall short in capturing the physical structure of a scene. As a result, they often produce geometrically plausible but physically inconsistent results, including object floating and penetration, which lead to unstable behavior in physics simulations. Image-conditioned scene generation methods improve physical plausibility but often rely on strong scene priors, yielding plausible yet inaccurate object arrangements that fail to match the input image. We propose REST3D, a single-image reconstruction framework that can reconstruct physically stable 3D scenes by integrating physical scene understanding with physics-constrained refinement. We first introduce an agentic physical scene understanding technique that constructs a scene-tree representation capturing object physical states and inter-object relationships from a gravity-support perspective, providing a structural prior for reconstruction. Leveraging this structure, we initialize the scene using image-to-3D models, followed by scene-tree-guided alignment and physics-constrained optimization to resolve physical violations while preserving visual consistency with the input image. Experiments show that our method significantly reduces physical errors and improves simulation stability on both synthetic and real-world datasets while maintaining strong reconstruction quality. We further demonstrate the reconstructed scenes in VR-based human-object interaction, showing their potential for immersive applications.

2605.30337 2026-05-29 cs.LG

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

Alaa Khamis, Alaa Maalouf

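AI summary: Proposes HullFT, which uses convex reconstruction and gradient caching to accelerate test-time finetuning of LLMs, substantially reducing runtime while preserving quality.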

Abstract

Test-time finetuning (TTFT) is a rapidly evolving paradigm that adapts a language model to each prompt by retrieving related sequences, updating the model on them, and then evaluating the prompt. However, TTFT is only practical if it is fast: selection and finetuning both happen per query, making each a direct bottleneck. Existing methods trade speed for quality: fast retrieval is often redundant, while stronger diversity-aware selection adds prohibitive per-query cost. We introduce HullFT, a geometric approach to TTFT that addresses both bottlenecks. Given a query, HullFT first represents the query embedding as a sparse convex combination of a few training sequences, using efficient projection-free Frank-Wolfe optimization. This yields a support set that is inherently relevant and diverse. We then convert the fractional convex weights into an exact integer multiset for finetuning through a geometric integerization procedure. The resulting multiplicities naturally create repeated examples, which we exploit with Gradient Reuse to amortize forward-backward computation across repeated finetuning steps. Our experiments show that HullFT improves the quality-efficiency tradeoff over current state-of-the-art TTFT methods, achieving lower bits-per-byte at substantially lower total runtime.
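
A minimal sketch of the first step the abstract describes: writing a query embedding as a sparse convex combination of training embeddings with Frank-Wolfe. It is not the authors' implementation. The function name `frank_wolfe_convex_weights`, the squared-error objective, the exact line search, and the toy 2-D vectors are all assumptions made for illustration; the integerization and Gradient Reuse steps are not shown.

```python
def frank_wolfe_convex_weights(query, candidates, steps=50):
    """Approximate `query` as a convex combination of `candidates` (hypothetical helper)."""
    n = len(candidates)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Start at the simplex vertex whose candidate is closest to the query.
    start = min(
        range(n),
        key=lambda i: sum((c - q) ** 2 for c, q in zip(candidates[i], query)),
    )
    weights = [0.0] * n
    weights[start] = 1.0
    point = list(candidates[start])  # current combination sum_i w_i * x_i

    for _ in range(steps):
        residual = [p - q for p, q in zip(point, query)]
        # Linear minimization over the simplex: the best single vertex.
        best = min(range(n), key=lambda i: dot(candidates[i], residual))
        direction = [c - p for c, p in zip(candidates[best], point)]
        denom = dot(direction, direction)
        if denom == 0.0:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        gamma = max(0.0, min(1.0, -dot(residual, direction) / denom))
        if gamma == 0.0:
            break  # no descent direction left: the current point is optimal
        weights = [(1.0 - gamma) * w for w in weights]
        weights[best] += gamma
        point = [p + gamma * d for p, d in zip(point, direction)]
    return weights


weights = frank_wolfe_convex_weights(
    (0.25, 0.25), [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], steps=200
)
```

Each iteration adds at most one new candidate to the support, so stopping early caps the number of nonzero weights; this is the usual reason Frank-Wolfe is chosen when a sparse combination is wanted.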

2605.30336 2026-05-29 cs.LG

Fairness-Aware Federated Learning with Trajectory Shapley Value

Daniel Kuznetsov, Ziqi Wang

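AI summary: Proposes the Trajectory Shapley Value as a contribution metric and designs FedTSV, an adaptive aggregation method, to address the bias and instability that fixed weights cause in federated learning, achieving fair, robust, and efficient collaborative learning.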

Comments
Accepted for publication at the 24th European Control Conference (ECC 2026)
Abstract

Federated learning is an emerging distributed paradigm that addresses the challenges posed by heterogeneous, privacy-sensitive data. It enables multiple clients to train a model collaboratively by aggregating their local updates at a server. However, conventional aggregation schemes typically use fixed weights that fail to reflect unequal and time-varying client contributions, leading to biased and unstable learning. To improve fairness and stability, we propose the Trajectory Shapley Value (TSV), a contribution metric that evaluates how each client influences the optimization trajectory of the global model using a validation-based, temporally consistent utility. Building on TSV, we design FedTSV, an adaptive aggregation method that converts per-round evaluations into dynamic client weights, allowing the server to respond to heterogeneous and adversarial participation in real time. Experiments on benchmark datasets show that FedTSV accelerates convergence, improves robustness, and yields more equitable contribution assessments, thereby providing a principled foundation for fairness-aware federated optimization.
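
The idea of turning per-round contribution scores into aggregation weights can be sketched with the classical Shapley value. This is a generic illustration, not the Trajectory Shapley Value defined in the paper: the names `shapley_values` and `aggregation_weights`, the additive toy utility, and the clip-and-normalize weighting rule are all assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(clients, utility):
    """Exact Shapley value of each client for a coalition utility (small n only)."""
    n = len(clients)
    values = {c: 0.0 for c in clients}
    for c in clients:
        others = [o for o in clients if o != c]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            coeff = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                values[c] += coeff * (utility(s | {c}) - utility(s))
    return values


def aggregation_weights(values):
    """Assumed rule: clip negative contributions to zero, then normalize."""
    clipped = {c: max(v, 0.0) for c, v in values.items()}
    total = sum(clipped.values())
    if total == 0.0:
        return {c: 1.0 / len(values) for c in values}
    return {c: v / total for c, v in clipped.items()}


# Toy validation gains per client; "c" behaves adversarially in this round.
gains = {"a": 0.3, "b": 0.1, "c": -0.2}
round_values = shapley_values(list(gains), lambda s: sum(gains[c] for c in s))
round_weights = aggregation_weights(round_values)
```

Exact enumeration costs time exponential in the number of clients, so a sketch like this only suits a handful of clients; larger deployments would need a sampling-based approximation.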

2605.30335 2026-05-29 cs.AI cs.CL

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Anany Kotawala

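AI summary: Formalizes failures in multi-component LLM agents that are locally coherent but globally incoherent, proposes the compositional residual eps* to measure incoherence, and, using hierarchical projection repair and sequential coherence monitoring, finds in experiments that incoherence is widespread and affects decisions.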

Comments
25 pages, 7 figures, 24 tables. Preliminary versions to appear at the ICML 2026 Workshops on Combining Theory and Benchmarks (CTB), Statistical Frameworks for Uncertainty in Agentic Systems (AgenticUQ), and Failure Modes of Agentic AI (FAGEN)
Abstract

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps*, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross-component coupling constraints. A product-structure dichotomy characterises when local coherence suffices, and a Rayleigh-quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle-Dykstra projection repairs the composition deterministically; an anytime-valid e-process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four-LLM mid-tier panel (frontier-panel rerun in Section 5.5), eps* > 0 on 33-94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). Three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress.
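
A toy sketch of a residual of this kind, assuming the simplest possible setting: three components separately quote P(A), P(B), and P(A and B), and the coherent polytope is described by the Fréchet inequalities. Plain Dykstra projection onto half-spaces stands in for the paper's hierarchical Boyle-Dykstra scheme; the names `CONSTRAINTS`, `dykstra_project`, and `residual` are hypothetical.

```python
import math

# Half-spaces n . x <= h describing coherent triples (P(A), P(B), P(A and B)).
CONSTRAINTS = [
    ((-1.0, 0.0, 1.0), 0.0),  # P(A and B) <= P(A)
    ((0.0, -1.0, 1.0), 0.0),  # P(A and B) <= P(B)
    ((1.0, 1.0, -1.0), 1.0),  # P(A) + P(B) - P(A and B) <= 1
    ((0.0, 0.0, -1.0), 0.0),  # P(A and B) >= 0
]


def project_halfspace(point, normal, offset):
    """Euclidean projection of `point` onto {x : normal . x <= offset}."""
    violation = sum(n * x for n, x in zip(normal, point)) - offset
    if violation <= 0.0:
        return list(point)
    scale = violation / sum(n * n for n in normal)
    return [x - scale * n for x, n in zip(point, normal)]


def dykstra_project(quote, constraints=CONSTRAINTS, sweeps=200):
    """Dykstra's algorithm: projection onto the intersection of the half-spaces."""
    x = list(quote)
    corrections = [[0.0] * len(quote) for _ in constraints]
    for _ in range(sweeps):
        for k, (normal, offset) in enumerate(constraints):
            shifted = [xi + ci for xi, ci in zip(x, corrections[k])]
            x = project_halfspace(shifted, normal, offset)
            corrections[k] = [s - xi for s, xi in zip(shifted, x)]
    return x


def residual(quote):
    """L2 distance from the composed quote to the coherent set, plus the repair."""
    repaired = dykstra_project(quote)
    return math.dist(quote, repaired), repaired


# Each number is a valid probability on its own, but P(A and B) exceeds P(A).
eps, repaired = residual((0.3, 0.4, 0.5))
```

Dykstra's correction terms are what distinguish it from plain alternating projection: they make the iterates converge to the nearest coherent point rather than to an arbitrary feasible one, which is what a distance-based residual needs.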

2605.30334 2026-05-29 cs.AI cs.CL

Demystifying Data Organization for Enhanced LLM Training

Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang, Wenshan Wu, Qihao Zhao, Hao Li, Yuanyuan Gao, Kim-Hui Yap, Scarlett Li

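AI summary: Systematically explores how data organization affects LLM training, proposes four optimization guidelines (Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity), and designs two new data ordering methods based on them, STR and SAW; experiments validate their effectiveness in both pre-training and fine-tuning.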

Comments
ACL 2026 Main Conference
Abstract

Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily reliant on effective data curation. While data selection has been widely studied, strategic data organization for enhanced training remains an underexplored area, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by them, we introduce two novel data ordering methods termed STR and SAW. Extensive experiments across different model scales and data sizes, encompassing both pre-training and SFT stages, validate the effectiveness of our summarized guidelines. They also demonstrate the robustness of our proposed data ordering methods in enhancing the stability and performance of LLM training. GitHub link: https://github.com/microsoft/data-efficacy/

2605.30333 2026-05-29 cs.CL

COMPOSE: Composing Future Theorems from Citations and Formal Structure

David Busbib, Michael Werman

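AI summary: Proposes COMPOSE, a dual-graph framework that combines the scientific citation graph with the formal theorem dependency graph to generate grounded future mathematical theorems; experiments show it outperforms baseline methods.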

Abstract

A plausible future mathematical claim must satisfy two constraints: it should follow the direction of prior work and respect the formal dependencies that constrain what can validly follow. Existing approaches typically model only one of these sources, producing claims that are either weakly grounded or insufficiently motivated. We introduce grounded future mathematical generation, where the goal is to generate a plausible future theorem-like claim for an anchor paper using two complementary sources of context: its scientific citation graph and an aligned formal theorem dependency graph. To address this setting, we propose COMPOSE, a dual-graph framework that conditions a language model on both scientific citation context and formal theorem structure. To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024-2025. Experiments show that COMPOSE outperforms strong baselines on retrieval to real future papers and achieves the best overall performance under LLM-judge evaluation, producing more grounded and mathematically richer outputs. These results show that future mathematical generation benefits from combining scientific context with formal structure. The project page is available at https://david-busbib.github.io/COMPOSE-page/.

2605.30332 2026-05-29 cs.CV

Colored Noise Diffusion Sampling

Hadar Davidson, Noam Issachar, Sagie Benaim

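AI summary: To address the inefficiency of injecting uniform white noise during diffusion model generation, proposes a training-free colored noise sampling method (CNS) that exploits spectral bias through a dynamic, frequency-dependent noise schedule, significantly improving image generation quality.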

Abstract

Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stochastic differential equation (SDE) solvers fail to account for this dynamic, naively injecting uniform white noise throughout the entire process and misusing the finite energy budget. In this work, we establish a mathematical framework that reconsiders SDE inference as a targeted, frequency-decoupled energy transfer. Leveraging this framework, we introduce Colored Noise Sampling (CNS), a novel, training-free stochastic solver. Rather than injecting uniform white noise, CNS utilizes a dynamic, timestep- and frequency-dependent schedule that more efficiently allocates injected energy toward structurally unresolved frequency bands. By actively exploiting the model's inherent spectral bias, CNS systematically steers the generated distribution toward the true data manifold. Extensive experiments demonstrate that CNS significantly outperforms standard ODE and SDE baselines as a strictly plug-and-play, inference-time sampler substitution across diverse architectures (SiT, JiT, FLUX). Compared to standard sampling on ImageNet-256, CNS achieves substantial unguided FID reductions, improving from 8.26 to 6.27 on SiT-XL/2, 32.39 to 26.69 on JiT-B/16, and 11.88 to 8.31 on JiT-H/16, while yielding consistent relative FID improvements with Classifier-Free Guidance. Project page is available at https://hadardavidson.github.io/CNS/.

2605.30330 2026-05-29 cs.LG

When, why, and how do diffusion posterior samplers fail? A finite-sample lens

Benjamin A. Burns, Sara Fridovich-Keil

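AI summary: Analyzes, from a finite-sample perspective, why likelihood approximation errors in diffusion posterior samplers bias the posterior distribution; finds that inaccurate estimates of posterior spread at intermediate timesteps lead to misweighted modes and hallucination, and proposes a diagnostic that is agnostic to the type of approximation.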

Comments
All code for experiments is available at: https://github.com/voilalab/diagnosing-posterior-sampling
Abstract

Diffusion models have excellent capacity to model complex distributions of natural data, which has made them a popular and effective choice for posterior sampling in imaging inverse problems. Existing methods can incorporate any measurement model at inference time but must use an inexact approximation for the likelihood at intermediate timesteps for computational tractability. Although these approximations can often work well empirically, their downstream effect on the sampled posterior is poorly understood and can result in unexplained failures. To understand when, why, and how these likelihood approximations propagate to erroneous posterior distributions, we introduce a finite-sample perspective on posterior sampling that approximates the posterior to arbitrary precision as training set size tends towards infinity, for any forward model and prior distribution. Using this finite-sample lens, we observe that popular posterior sampling approximations tend to under- or over-estimate the spread of the posterior at intermediate timesteps, causing downstream consequences including sensitivity to early stopping time, inaccurate relative weighting of posterior modes, and hallucination, both of prior modes that are not in the posterior and likelihood modes that are not supported by the prior. Moreover, we find that the cause of these posterior errors requires neither a nonlinear measurement model nor a multimodal posterior, but can arise solely due to a multimodal prior and inaccurate posterior spread at intermediate sampling times. Our finite-sample posterior sampling approach is agnostic to the type of likelihood approximation and the type of (linear or nonlinear) forward model, and can thus serve as a drop-in diagnostic to evaluate the accuracy and failure modes of existing and future posterior samplers.

2605.30329 2026-05-29 cs.LG

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, Furong Huang

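AI summary: Proposes the SoundnessBench benchmark, which uses 1,099 machine-learning research proposals from ICLR submissions to evaluate whether LLMs can judge the methodological soundness of research ideas; finds that current LLMs show a pervasive optimism bias and cannot reliably serve as standalone first-gate evaluators of scientific rigor.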

Comments
Project Page: https://hosytuyen.github.io/projects/SoundnessBench
Abstract

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.

2605.30328 2026-05-29 cs.CV

Supercharging Thermal Gaussian Splatting with Depth Estimation

Manoj Biswanath, Chenxin Cai, Hannah Schieber, Daniel Roth, Benjamin Busam

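AI summary: Proposes TDg, a single-modality method that uses only thermal infrared images and depth estimation, deriving radiance fields through Thermal-to-Depth Gaussian Splatting; it outperforms a multimodal baseline in rendering quality and training time.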

Comments
8 pages, 4 figures. Accepted and will be published in ISPRS proceedings (ISPRS Congress 2026)
Abstract

Efficient and robust 3D scene representation is crucial in autonomous driving, robotics, and related fields. While RGB images provide valuable content for 3D reconstruction, other modalities such as thermal or depth can provide additional information about the environment. Lately, novel view synthesis methods like 3D Gaussian Splatting have started using multiple modalities to further boost their performance. But fusing or combining multimodal data can slow the process down and introduce additional challenges. Therefore, our project aims to use a single modality, thermal infrared, removing the reliance on visible light as much as possible. A single modality can be expected to be faster because it does not rely on multimodal data. We propose a method, Thermal-to-Depth Gaussian Splatting (TDg), that uses only thermal images and depth estimation in its architecture to derive the radiance fields. Our TDg method outperforms the MSMG (Multiple Single-Modal Gaussians) baseline in most cases on our test datasets, RGBT-Scenes and ThermalMix. On average, the rendering quality metrics learned perceptual image patch similarity (LPIPS), structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR) of TDg are 1.12%, 0.034%, and 0.01% better than the baseline MSMG values, respectively. It also reduces the training time significantly, by 12 min 47 s (a 55% improvement). Overall, our method is successful in deriving these thermal radiance fields, which can ultimately have several applications, such as identifying heat sources critical in surveillance, search or rescue operations, and industrial inspections where temperature is widely used to monitor machines.

2605.30327 2026-05-29 cs.LG cs.AI cs.CL math.ST stat.ML stat.TH

Reasoning with Sampling: Cutting at Decision Points

Felix Zhou, Anay Mehrotra, Quanquan C. Liu

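AI summary: Proposes the Entropy-Cut Metropolis-Hastings algorithm, which uses the base model's next-token entropy as a proxy to identify key decision points and resample from them, sampling efficiently from the power distribution to strengthen reasoning; it surpasses baselines and RL-trained models on multiple benchmarks.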

Abstract

Frontier reasoning models are produced by post-training base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, this means, for example, trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
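
The cut-selection idea can be sketched as follows: compute the entropy of the next-token distribution at every position of a trace and flag positions where it jumps. This is an illustrative guess at the mechanics, not the paper's algorithm. The function names, the fixed `jump_threshold`, and the toy distributions are assumptions, and the Metropolis-Hastings accept/reject step is omitted.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_jump_positions(distributions, jump_threshold=0.5):
    """Return candidate cut positions where entropy rises sharply, plus all entropies."""
    entropies = [token_entropy(d) for d in distributions]
    cuts = []
    for t in range(1, len(entropies)):
        if entropies[t] - entropies[t - 1] >= jump_threshold:
            cuts.append(t)
    return cuts, entropies


# Toy trace: the model is confident until position 2, where four
# continuations are equally likely (a stand-in for a strategy choice).
trace = [
    [1.0, 0.0, 0.0, 0.0],
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.9, 0.1, 0.0, 0.0],
]
cuts, entropies = entropy_jump_positions(trace)
```

A sampler built on this would draw its cut position from `cuts` instead of uniformly over all token positions, so that resampling revisits the uncertain choice rather than rewriting confident local detail.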

2605.30326 2026-05-29 cs.RO cs.AI

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

Chunru Lin, Hongxin Zhang, Fenghao Yu, Zhehuan Chen, Thomas L. Griffiths, Yejin Choi, David Held, Chuang Gan

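AI summary: Proposes RoboWits, a bimanual robot benchmark built with an automated multi-agent cooperative task generation pipeline, to evaluate cognitive reasoning, creative tool use, and robustness across geometry, material, and assembly reasoning; finds that pre-trained VLAs are brittle on mutated tasks.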

Comments
The first two authors contributed equally
Abstract

The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bimanual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality, reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curated 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly-based reasoning. We benchmark popular robot policies, pre-trained VLAs, and oracle-state planners. Our results reveal a significant performance gap: while pre-trained VLAs exhibit preliminary success on seed tasks after single-task fine-tuning, they struggle on mutated tasks, indicating brittleness in manipulation tasks that require reasoning, strategy adaptation, and robustness to deceptive or constrained environments. The project page is available at https://umass-embodied-agi.github.io/RoboWits.

2605.30325 2026-05-29 cs.CV

Veda: Scalable Video Diffusion via Distilled Sparse Attention

Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei, Yi Jiang, Xiaojuan Qi

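AI summary: Proposes Veda, a distilled sparse attention framework that uses statistics-aware tile scoring and head-aware tile selection to accelerate video diffusion models efficiently while preserving generation quality.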

Comments
Accepted to ICML 2026
Abstract

Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality is determined not by the sparsity ratio itself, but by how well the sparse mask aligns with the tile-wise geometry of full attention. Based on this insight, we propose Veda, a distilled sparse attention framework that formulates tile selection as an explicit reconstruction problem from full attention. Veda integrates statistics-aware tile scoring with head-aware tiling to reduce estimation error and structural mismatch, enabling aggressive sparsity. A hardware-efficient tile-skipping kernel converts theoretical sparsity into practical wall-clock speedups. Experiments on large video diffusion models, including Waver and Wan2.1, demonstrate substantial acceleration with no noticeable degradation in generation quality. To generate 720P 10-second videos on Waver-T2V-12B, Veda achieves a 5.1$\times$ end-to-end speedup and a 10.5$\times$ self-attention speedup, reducing attention overhead from 92% to 50%. Notably, the gains increase with sequence length, indicating that Veda scales favorably with spatiotemporal resolution across models.

2605.30324 2026-05-29 cs.DS cs.AI cs.CL cs.LG stat.ML

On Language Generation in the Limit with Bounded Memory

Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas

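AI summary: Studies language generation in the limit under bounded memory, using combinatorial bounds and sliding-window analysis to characterize how memory constraints affect generability, density, and identification.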

Comments
The abstract has been shortened to fit within the arXiv limit
Abstract

We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators, that is, the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an "approximate" version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.

2605.30323 2026-05-29 cs.LG cs.AI

In-Context Reward Adaptation for Robust Preference Modeling

Zhenyu Sun, Zheng Xu, Ermin Wei

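AI summary: Proposes a transformer-based in-context reward adaptation framework that models diverse and unseen human preferences on the fly from a few preference demonstrations plus human response time as an auxiliary signal, enabling robust preference modeling and adaptation to distribution shift.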

Abstract

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. While existing multi-reward frameworks attempt to address this, they are often restricted to a fixed set of known domains and fail to adapt to unseen human distributions without costly retraining. In this work, we propose In-Context Reward Adaptation, a transformer-based framework designed to model diverse and unseen human preferences on the fly. By leveraging the in-context learning capabilities of transformers, our approach adaptively infers the underlying reward structure from a small set of preference demonstrations. We show that a standard transformer architecture is insufficient for this task, characterizing its asymptotic bias relative to the ground truth, whereas incorporating human response time as an auxiliary input signal enables the model to successfully adapt to preferences from previously unseen domains. Our findings show that this approach provides a more robust foundation for preference modeling, allowing for the representation of heterogeneous rewards and preference distribution shift, and offering a scalable path toward more flexible human-AI alignment.

2605.30322 2026-05-29 cs.LG cs.AI

Gram: Assessing sabotage propensities via automated alignment auditing

David Lindner, Victoria Krakovna, Sebastian Farquhar

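AI summary: Proposes the Gram framework, which automatically audits AI agents' propensity for sabotage by simulating 17 agentic deployment scenarios; finds that Gemini models misbehave in about 2-3% of trajectories, and introduces an investigator agent pipeline to identify the drivers.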

Abstract

We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage. We find that Gemini models misbehave in about 2-3% of our simulated trajectories. Many of these cases are explained by "overeagerness" in Gemini models, resulting in both excessive role-playing and goal-seeking behavior. In contrast to other alignment auditing approaches, Gram is designed specifically to evaluate misalignment and intentional sabotage in agentic coding and research agents. We additionally introduce an experimental investigator agent pipeline that enables fine-grained targeted experiments to identify the drivers of misbehavior. We find that increasing the realism of environments and removing nudges to misbehave tend to reduce sabotage rates to close to zero.