arXivDaily: arXiv daily academic digest, updated Monday to Friday
2607.01233 2026-07-02 cs.CL cs.AI New submission

Measuring the Gap Between Human and LLM Research Ideas

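Ziyu Chen, Yilun Zhao, Arman Cohan

Affiliations: Yale University; University of Chicago

AI summary: Builds a large-scale evaluation framework that reverse-engineers each paper's core idea and introduces a two-axis research-taste taxonomy. LLM ideas concentrate on bridge-like opportunities and synthesis methods, while human ideas are distributed more broadly, revealing a systematic gap between the two.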

Abstract

LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from human researchers? To characterize this gap, we build a large-scale evaluation framework for ideation from high-quality human research papers. For each paper, we reverse-engineer a small set of closely related prior works that likely inspired its core idea. LLMs are then prompted to generate a new idea from the set of paper titles and summaries. We introduce a two-axis research-taste taxonomy to profile each idea by its opportunity pattern and research paradigm, and use it to quantify the divergence between human and LLM ideas. Across idea sets generated by different LLMs, we observe a consistent distributional gap: LLM ideas are disproportionately concentrated around bridge-like opportunities and synthesis methods, whereas the human paper reference distribution spreads more broadly across ways of framing gaps and constructing contributions. This result suggests that strong LLMs can produce a range of reasonable ideas, but that range remains narrower than, and systematically shifted relative to, human research taste.
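One way to picture the distributional gap described above is to compare category frequencies along a single taxonomy axis. The sketch below is not the authors' metric, since the abstract does not name one. It assumes Jensen-Shannon divergence as the measure, and the category names and label counts are invented purely for illustration.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Relative frequency of each category in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical opportunity-pattern labels; these are NOT taken from the paper.
CATEGORIES = ["bridge", "limitation", "new_problem", "reframing"]
human = ["bridge"] * 25 + ["limitation"] * 25 + ["new_problem"] * 25 + ["reframing"] * 25
llm = ["bridge"] * 70 + ["limitation"] * 20 + ["new_problem"] * 5 + ["reframing"] * 5

human_dist = distribution(human, CATEGORIES)
llm_dist = distribution(llm, CATEGORIES)
gap = js_divergence(human_dist, llm_dist)
print(f"JS divergence between human and LLM idea profiles: {gap:.3f} bits")
```

A symmetric, bounded measure is convenient here because neither distribution is privileged as the reference and a category with zero LLM ideas does not make the score infinite.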

2607.01232 2026-07-02 cs.LG cs.CL New submission

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

Zijian Zhang, Rizhen Hu, Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Hongzhou Lin, Mingyi Hong

Affiliations: University of Minnesota; Peking University; Amazon

AI summary: Finds that RL training gains are highly concentrated in a few Transformer layers. Training a single layer recovers most of the full-parameter RL gains, and high-contribution layers cluster in the middle of the model.

Abstract

Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters uniformly, implicitly assuming that every layer contributes similarly to the gains obtained during RL post-training. In this work, we challenge this assumption through a systematic layer-wise study of RL training. Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it. To quantify this phenomenon, we introduce the quantity layer contribution, which measures the fraction of full RL improvement recovered by training a layer in isolation. Across seven models spanning two model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains including mathematical reasoning, code generation, and agentic decision-making, we observe a remarkably stable pattern: RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers. More strikingly, the same structural pattern consistently emerges: high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
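The layer contribution defined above is a ratio of improvements, which a few lines of code make concrete. This is a minimal sketch based only on the abstract's wording ("the fraction of full RL improvement recovered by training a layer in isolation"); the function name, the layer indices, and all scores are hypothetical.

```python
def layer_contribution(base_score, single_layer_score, full_rl_score):
    """Fraction of the full-RL improvement recovered by training one layer alone."""
    full_gain = full_rl_score - base_score
    if full_gain == 0:
        raise ValueError("full RL training gave no improvement; the ratio is undefined")
    return (single_layer_score - base_score) / full_gain


# Hypothetical benchmark accuracies (percent); not results from the paper.
base, full = 40.0, 60.0
single_layer_scores = {0: 42.0, 12: 58.0, 14: 61.0, 27: 44.0}

contribution = {
    layer: layer_contribution(base, score, full)
    for layer, score in single_layer_scores.items()
}
# Rank layers from highest to lowest contribution.
ranking = sorted(contribution, key=contribution.get, reverse=True)
print(contribution, ranking)
```

A value above 1 corresponds to the case the abstract mentions where single-layer training surpasses full-parameter training, and the resulting rankings are what would be compared across datasets and algorithms.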

2607.01225 2026-07-02 cs.LG cs.AI New submission

Language-Critique Imitation Learning from Suboptimal Demonstrations

Chih-Han Yang, Dai-Jie Wu, Yun-Ping Huang, Ping-Chun Hsieh, Kenneth Marino, Shao-Hua Sun

Affiliations: Graduate Institute of Communication Engineering, National Taiwan University (NTU); University of Utah; National Yang Ming Chiao Tung University; NTU Artificial Intelligence Center of Research Excellence

AI summary: Proposes a language-critique framework that uses natural language as a structured supervision signal for learning policies from suboptimal demonstrations, avoiding compression into scalars, and outperforms baselines on continuous control tasks.

Abstract

Prior work on imitation learning from suboptimal demonstrations typically relies on compressed supervision signals such as confidence estimates, discriminator scores, or importance weights. These scalar signals are inherently limited, as they cannot explicitly express intermediate reasoning about task progress, failure modes, or corrective actions. We propose a language-critique framework for imitation learning from suboptimal demonstrations that instead leverages natural language as a structured supervision signal, avoiding the collapse of expressive feedback into scalars. Our method first constructs language labels from demonstrations that explicitly describe current progress, identify suboptimal behaviors, and provide fine-grained corrective guidance. We then introduce a language-critique loss that directly trains policies using these structured signals without reducing them to scalars, and instantiate it for both behavior cloning and diffusion policies, yielding LC-BC and LC-DP. We further provide a theoretical result showing that the proposed objective upper-bounds the expert performance gap under standard assumptions. Empirically, we evaluate on diverse continuous control tasks spanning navigation, manipulation, and gameplay, where our methods consistently outperform strong imitation learning and offline reinforcement learning baselines. These results demonstrate that language can serve as a powerful and structured form of supervision for learning robust policies from suboptimal data.

2607.01224 2026-07-02 cs.AI cs.CL cs.MA New submission

AutoMem: Automated Learning of Memory as a Cognitive Skill

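Shengguang Wu, Hao Zhu, Yuhui Zhang, Xiaohan Wang, Serena Yeung-Levy

Affiliations: Stanford University

AI summary: Proposes AutoMem, a framework whose two loops automatically optimize both the structure of an LLM's memory management and the model's proficiency in using it, improving performance on long-horizon tasks by roughly 2-4x.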

Comments Project Website: https://autolearnmem.github.io/

Abstract

Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge, a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory. This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action vocabulary), and the proficiency of the model exercising it. Both axes resist manual optimization: episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can hide long before it surfaces, making human review of full trajectories impractical. We introduce AutoMem, a framework that automates both axes. In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. In the second loop, the agent's own good memory decisions are identified from many episodes and used as training signal to sharpen the model's memory proficiency directly. Across three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), optimizing memory alone, without modifying the model's task-action behavior, improved the base agent's performance by roughly 2x to 4x, making a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Our results show that memory management is an independently learnable skill, and a high-leverage objective yielding large gains on long-horizon tasks.

2607.01223 2026-07-02 cs.AI cs.CL cs.LG cs.LO cs.SE New submission

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

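Ben Slivinski, Michael Saldivar

Affiliations: Independent Researchers

AI summary: Proposes Theoria, a verification architecture that rewrites a candidate solution into a sequence of state transitions with explicit justifications and checks completeness of change. On HLE-Verified Gold it certifies 105 problems at 91.4% precision, and it catches more adversarial poisoned proofs than holistic LLM judging (94.7% vs. 83.2%).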

Abstract

When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We present Theoria, a verification architecture that closes this gap. A candidate solution is rewritten into a sequence of typed state transitions, each licensed by an explicit justification, whether that be a citation, computation, or problem-given fact, and every transition is independently auditable. The foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations rather than passing silently. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). Every certification produces a human-readable proof trace in which each step can be independently challenged. Holistic LLM judges achieve comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), making the approaches complementary. On 95 adversarial poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p = 0.0017). The overall 11.5 pp gap concentrates in hidden premises (90.6% vs. 62.5%, a 28 pp difference) and fabricated citations (100% vs. 90%), the error classes where the formal analysis predicts an advantage; performance is identical on arithmetic and theorem-misapplication errors, where no advantage is predicted. On GPQA Diamond (n = 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).
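The two statistics quoted above, the Wilson score interval and the Jaccard overlap of failure sets, are easy to reproduce. The sketch below uses the standard textbook formulas. The count of 96 correct certifications is an assumption inferred from 91.4% of 105, since the abstract does not state it, and the failure sets are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of problem IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Assumption: 96 of the 105 certified answers were correct (96/105 = 91.4%).
low, high = wilson_interval(96, 105)
print(f"precision 95% CI: [{low:.1%}, {high:.1%}]")

# Hypothetical sets of problems on which each judge failed.
structured_failures = {1, 2, 3}
holistic_failures = {3, 4, 5, 6}
overlap = jaccard(structured_failures, holistic_failures)
```

Under that assumed count the interval rounds to [84.5%, 95.4%], matching the reported figure. A low Jaccard value such as the one here means the two judges mostly fail on different problems, which is the basis for calling them complementary.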

2607.01222 2026-07-02 cs.CV New submission

Ink3D: Sculpting 3D Assets with Extremely Complex Textures via Video Generative Models

Yue Han, Chong Li, Zhening Liu, Cong Huang, Fang Deng, Yong Liu, Fangyun Wei, Yan Lu

Affiliations: ZGCA & ZGCI; Microsoft Research; Zhejiang University; HKUST

AI summary: Proposes Ink3D, a framework that uses large-scale video generative models to synthesize complex textures: OrbitPainter generates dense orbit-scan videos, and TextureOptimizer performs neural baking to produce coherent textures.

Comments Accepted to ECCV 2026. Project page: https://yuehan99.github.io/Ink3D-TextureGen/

Abstract

Recent 3D generative models can synthesize high-quality geometry but often struggle to reproduce intricate textures from reference images, largely due to the scarcity of large-scale 3D training data with rich surface appearance. In contrast, visual generative models are trained on datasets several orders of magnitude larger and excel at modeling complex visual patterns. Motivated by this gap, we introduce Ink3D, a framework that bridges 3D generation with large-scale video generative models to synthesize extremely complex textures. Ink3D first reconstructs a white-mesh geometry using an off-the-shelf 3D generation model. It then employs OrbitPainter, a conditional video generative model, to produce dense orbit-scan videos capturing object appearance across viewpoints. To convert these views into coherent textures, we introduce TextureOptimizer, a neural baking module that integrates dense multi-view observations while mitigating geometry inconsistencies arising from video generation. By decoupling geometry and texture synthesis and leveraging large-scale pretrained video priors, Ink3D enables significantly richer and more faithful texture generation than prior approaches.

2607.01218 2026-07-02 cs.CL cs.AI cs.LG New submission

The State-Prediction Separation Hypothesis

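Giovanni Monea, Nathan Godey, Kianté Brantley, Yoav Artzi

Affiliations: Cornell University; Harvard University

AI summary: Proposes the state-prediction separation hypothesis and a two-stream Transformer that decouples state storage from next-token prediction, improving data and compute efficiency in pretraining and gaining 2-3 percentage points on average on downstream tasks.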

Comments Preprint

Abstract

Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiencies, improving validation loss and outperforming standard Transformers by 2-3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.

2607.01212 2026-07-02 cs.RO cs.AI New submission

FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

Chenyang Ma, Yue Yang, Radu Corcodel, Siddarth Jain, Andrew Wu, Chiori Hori, Diego Romeres

Affiliations: Mitsubishi Electric Research Laboratories; University of Oxford; University of North Carolina at Chapel Hill

AI summary: Presents FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly with Vision-Language-Action models. Its progress-enhanced VLA raises average simulation success from 48% to 80%, with a further gain from a design factor study.

Comments Project Page: https://dannymcy.github.io/furniturevla/

Abstract

Current work on robot furniture assembly mostly focuses on toy-scale settings or single-arm manipulation. We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs). We formalize the task, develop a scalable simulation pipeline for expert data generation and evaluation, and build a VR teleoperation system for single-operator bimanual control to collect high-quality real-world demonstrations. To address extreme long-horizon assembly with up to 7 subtasks and 1550 control steps, we propose a progress-enhanced VLA, finetuned on semantically grounded subtasks, that jointly predicts actions and a continuous progress signal, enabling automatic subtask transitions and reducing compounding errors during inference. We further study perception and control design factors that critically affect precision in real-scale assembly. FurnitureVLA improves average simulation success from 48% to 80% compared to baselines across three furniture types, with an additional 21% gain from our design factor study. We validate on a real Kinova Gen3 platform with only 16% drop on the hardest task.
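The idea of using a predicted progress signal to trigger subtask transitions can be sketched as a small control loop. This is an illustrative sketch only, not the authors' implementation: the class name, the 0.95 threshold, the 14-dimensional action, the subtask names, and the stub policy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A policy maps (observation, subtask instruction) to (action, progress in [0, 1]).
Policy = Callable[[object, str], Tuple[List[float], float]]


@dataclass
class SubtaskRunner:
    subtasks: List[str]
    threshold: float = 0.95  # hypothetical progress level that ends a subtask
    index: int = 0

    @property
    def done(self) -> bool:
        return self.index >= len(self.subtasks)

    def step(self, policy: Policy, observation: object) -> List[float]:
        """Query the policy once and advance to the next subtask when progress is high."""
        action, progress = policy(observation, self.subtasks[self.index])
        if progress >= self.threshold:
            self.index += 1
        return action


def make_stub_policy(step_size: float = 0.25) -> Policy:
    """Stand-in for a VLA: progress on each subtask grows by a fixed step per call."""
    progress = {}

    def policy(observation, instruction):
        progress[instruction] = min(1.0, progress.get(instruction, 0.0) + step_size)
        return [0.0] * 14, progress[instruction]  # placeholder bimanual action

    return policy


runner = SubtaskRunner(["attach leg", "insert screw", "tighten screw"])
policy = make_stub_policy()
steps = 0
while not runner.done:
    runner.step(policy, observation=None)
    steps += 1
print(f"finished {len(runner.subtasks)} subtasks in {steps} control steps")
```

Because the switch depends on the policy's own progress estimate rather than a fixed step budget, a slow subtask simply takes more steps before the instruction changes.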

2607.01208 2026-07-02 cs.CL cs.AI cs.LG New submission

Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

Shayan Talaei, Abhinav Chinta, Devvrit Khatri, Amin Karbasi, Azalia Mirhoseini, Amin Saberi

Affiliations: Stanford University; University of Texas at Austin; Foundation AI–Cisco Systems Inc.

AI summary: Proposes Distill to Detect (D2D), which distills the distributional shift between a suspected model and its base into a KV-cache prefix adapter, amplifying stealth bias signals until they are detectable, and explains its efficacy through a Fisher-weighted projection framework.

Comments Accepted to the ICML 2026 Workshops on TAIGR, AI4GOOD, Mechanistic Interpretability, and CoLoRAI

Abstract

Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base on all other inputs. Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-based inspection. However, the defender faces a fundamental asymmetry: without knowing the bias topic, no detection method can reliably surface a stealth preferential bias, regardless of whether it examines generated text, internal representations, or model weights. Here we introduce Distill to Detect (D2D), a method that surfaces hidden biases by distilling the distributional shift between a suspected model and its base into a cartridge (a KV-cache prefix adapter), concentrating the dominant divergence and amplifying the bias signal into generated text. We show that D2D successfully amplifies the hidden biases of stealth models to the extent that they can be reliably detected across multiple bias types. We also propose a theoretical framework that explains the efficacy of D2D through the lens of Fisher-weighted projection of the logit distribution shift, supported by empirical observations. By turning the capacity bottleneck of prefix-tuning adapters into a detection tool, D2D provides a practical building block for auditing hidden behaviors in deployed language models.

2607.01205 2026-07-02 cs.CV New submission

Linkify: Learning from Interface-Augmented Assembly Graphs

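Anushrut Jignasu, Daniele Grandi

Affiliations: Iowa State University; Autodesk Research

AI summary: Proposes Linkify, a framework that learns inter-part geometry from interface-augmented assembly graphs to enable context-aware part retrieval in mechanical assemblies, using a graph attention network to solve a masked part prediction task.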

Comments Code is available at https://github.com/ajignasu/linkify

Abstract

We present Linkify, a framework for learning from interface-augmented assembly graphs to enable context-aware part retrieval in mechanical assemblies. While recent generative AI methods for CAD have focused largely on isolated parts or monolithic assemblies, the rich geometric information at the interfaces between parts, where function is realized, remains underexplored. We address this gap by recomputing high-fidelity interface geometry for the Fusion 360 Gallery Assembly dataset, correcting missing and erroneous contacts, and generating point-cloud representations of local contact regions. Using this data, we construct assembly graphs whose nodes encode part geometry and whose edges encode interface geometry via a pretrained point-cloud encoder. On top of this representation, we train a Graph Attention Network based on GATv2 to solve a masked part prediction task: given an assembly with one part held out, the model predicts the class of the missing component from a large vocabulary of geometrically clustered parts, thereby approximating a realistic part-retrieval scenario. Compared to non-graph baselines such as logistic regression and k-nearest neighbors operating on aggregated node features, Linkify achieves higher Top-K accuracy and F1 scores. Ablation studies on graph connectivity, edge attributes, and attention mechanisms demonstrate that accurate contact computation and dynamic attention over interfaces are critical for performance. Our corrected interface dataset and training pipeline, released publicly, provide a foundation for future interface-aware models for assembly retrieval, validation, and generative design.
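Top-K accuracy, one of the metrics reported above, counts a masked-part prediction as correct when the true part class appears among the K highest-scoring classes. The sketch below shows the standard computation in plain Python; the class scores and targets are invented for illustration and have nothing to do with the Fusion 360 data.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


# Hypothetical per-class scores for four masked parts over three part classes.
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
targets = [1, 1, 0, 2]  # hypothetical true class of each held-out part

top1 = top_k_accuracy(scores, targets, k=1)
top2 = top_k_accuracy(scores, targets, k=2)
print(top1, top2)
```

For a retrieval setting with a large part vocabulary, a higher K reflects the practical case where a designer is shown a short list of candidate parts rather than a single guess.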

2607.01204 2026-07-02 cs.LG New submission

TiRex-2: Generalizing TiRex to Multivariate Data and Streaming

Patrick Podest, Marco Pichler, Elias Bürger, Levente Zólyomi, Bernhard Voggenberger, Wilhelm Berghammer, Daniel Klotz, Sebastian Böck, Günter Klambauer, Sepp Hochreiter

Affiliations: ELLIS Unit Linz, LIT AI Lab & Institute for Machine Learning, JKU Linz, Austria; NXAI Lab, Linz, Austria; NXAI GmbH, Linz, Austria; Interdisciplinary Transformation University Austria, Linz, Austria

AI summary: Proposes TiRex-2, a recurrent xLSTM-based time series foundation model whose memory-centric design supports multivariate forecasting and streaming, achieving state-of-the-art zero-shot performance.

Abstract

We introduce TiRex-2, a recurrent xLSTM-based time series foundation model that generalizes the univariate TiRex to multivariate forecasting with both past and future covariates. Real-world forecasting is inherently sequential: observations arrive continuously, variables evolve jointly, and a subset of covariates is known ahead of time. Existing Transformer-based time series foundation models capture cross-variate dependencies but incur quadratic complexity in context length and require full-history recomputation as new observations arrive. TiRex-2 addresses these limitations through a memory-centric recurrent design that operates at constant per-patch cost under streaming. The model combines a bidirectional time mixer with an asymmetric grouped-attention variate mixer, enabling the integration of future-known covariates while preserving strict causality over target variables. To our knowledge, this is the first time series foundation model that achieves this combination of properties. To support scalable multivariate pretraining, we propose a synthetic coupling pipeline that composes diverse multivariate samples on the fly from large univariate corpora. Empirically, TiRex-2 achieves state-of-the-art zero-shot performance on GIFT-Eval and fev-bench, remains stable when streamed to arbitrary context lengths, and maintains constant inference cost per patch. The model uses 38.4M active parameters in univariate mode, with an additional 44.1M parameters activated for multivariate forecasting.

2607.01202 2026-07-02 cs.CV cs.AI cs.GR New submission

World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video

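Liyuan Zhu, Shengyu Huang, Amrita Mazumdar, Tianye Li, Zan Gojcic, Gordon Wetzstein, Iro Armeni, Shalini De Mello, Alex Trevithick

Affiliations: Stanford University; NVIDIA

AI summary: Proposes World from Motion, which uses a video model to generate freely renderable dynamic 3D Gaussian representations from monocular video, conditioning on pixel-aligned renderings to correct artifacts and fill in missing regions, and sets a new state of the art in 4D reconstruction.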

Comments Project page: https://research.nvidia.com/labs/amri/projects/world-from-motion/

Abstract

We present World from Motion, a method for generating freely renderable dynamic 3D Gaussian representations from monocular videos. Our approach conditions a video model on dense, pixel-aligned renderings that encode appearance, geometry, and 3D scene motion along both input and target camera trajectories to correct rendering artifacts and fill in missing regions from an initial reconstruction. To train this model, we construct a dataset of aligned multiview video pairs and dynamic 3DGS representations, with simulated artifacts characteristic of monocular reconstruction. At test time, we distill the model's generations, including newly observed regions and motions, back into a single consistent, high-quality dynamic 3DGS, improving both novel-view synthesis and the underlying 3D motion. Our method sets a new state of the art in 4D reconstruction and seamlessly generalizes to in-the-wild videos with large viewpoint changes and dynamic motions.

2607.01201 2026-07-02 cs.RO New submission

Sensorless Four-Channel Control Architecture Using Inverse Dynamics Modeling for Human-Scale Bilateral Teleoperation

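Amir Noohian, Dylan Miller, Justin Valentine, Alan Lynch, Martin Jagersand

Affiliations: University of Alberta

AI summary: To address high inertia, modeling difficulty, and reliance on force sensors in human-scale teleoperation, proposes a sensorless four-channel architecture based on inverse dynamics. Validated on a WAM platform, it outperforms conventional schemes, improving position and force tracking and reducing operator effort.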

Abstract

The four-channel teleoperation architecture is a well-established framework for achieving transparency in bilateral systems. However, its performance in human-scale teleoperation is limited by high inertia, modeling challenges, and reliance on noisy and costly force/torque sensors. This paper introduces a sensorless four-channel architecture based on inverse dynamics modeling. The controller is implemented and validated on a customized WAM bilateral teleoperation setup. Experiments demonstrate that the proposed approach outperforms conventional two- and four-channel schemes as well as transparency-enhancement methods, improving position and force tracking, reducing operator effort, and increasing maximum transmittable impedance without external sensors. A door-opening case study involving sustained whole-body contact along the manipulator further demonstrates the effectiveness of the method in realistic human-scale manipulation tasks.

2607.01197 2026-07-02 cs.LG New submission

Quantum vs. Classical Machine Learning: A Unified Empirical Comparison

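Chuanming Yu, Jiaming Liu, Zihao Ge, Xiongfei Wu, Lulu Zhu, Pengzhan Zhao, Jianjun Zhao

Affiliations: Hebei Normal University; Kyushu University; University of Luxembourg

AI summary: An empirical comparison of seven model pairs across supervised and reinforcement learning finds that current quantum machine learning models do not yet surpass classical baselines in prediction performance, policy stability, or training time, but show promise for filtering noise and controlling false positives.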

Comments This paper has been accepted for a poster presentation at the 5th CCF Quantum Computation Conference (CQCC 2026) on August 3, 2026

Abstract

Quantum computing has emerged as a promising computational paradigm for machine learning (ML), with the potential to offer computational advantages over classical approaches. At this stage, the evidence supporting the performance and advantages of quantum machine learning (QML) models relative to classical models is insufficient. To address this gap, this paper presents an empirical study on the performance of QML models and their classical counterparts. We compare seven model pairs spanning supervised learning and reinforcement learning. Our results indicate that the evaluated quantum machine learning models do not yet surpass the classical baselines in overall prediction performance, policy stability, or training time. Nevertheless, QML remains a promising approach for filtering noise and controlling false positives. Our research findings summarize the challenges facing quantum machine learning across hardware environments, training efficiency, and convergence stability, providing a foundation for research into the robustness and parameter optimization of QML. This work is publicly available at https://github.com/Z-537-437/QML.

2607.01191 2026-07-02 cs.CV New submission

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

Hongxing Li, Xiufeng Huang, Dingming Li, Wenjing Jiang, Zixuan Wang, Haolei Xu, Hanrong Zhang, Haiwen Hong, Longtao Huang, Hui Xue, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Affiliations: Zhejiang University; Alibaba Group

AI summary: Proposes Perceive-to-Reason (P2R), a framework that decouples fine-grained visual reasoning into perception and reasoning stages, and introduces the PRA-GRPO reinforcement learning strategy, substantially improving performance on several high-resolution benchmarks.

Comments Code: https://github.com/ZJU-REAL/Perceive-to-Reason

Abstract

Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.

2607.01185 2026-07-02 cs.LG 新提交

Neural Certificate Pricing for Combinatorial Optimization Problems

组合优化问题的神经证书定价

Jingyi Chen, Xinyuan Zhang, Xinwu Qian

发表机构 * Rice University(莱斯大学)

AI总结 提出神经证书定价(NCP)方法,利用无监督学习预测证书级对偶价格,通过结构化恢复层构建原始边际,实现摊销分离,在三个组合优化问题上显著优于或匹配现有方法,且泛化性强。

详情
AI中文摘要

组合优化(CO)问题之所以困难,是因为可证明的离散结构导致指数级搜索。需要搜索指数级多的候选解来证明最优性,然而,一旦提供路径、打包或覆盖的结构可行性,可以在多项式时间内验证。在本研究中,我们引入了神经证书定价(NCP),在无监督学习框架下利用这种不对称性。训练神经网络预测证书级对偶价格,而结构化恢复层构建诱导的原始边际。NCP可视为摊销分离:不是枚举违反的不等式,而是学习残差价格,通过这些价格,它们的聚合效应进入恢复。当证书一致性条件成立时,恢复的边际是全局可行的,局部理论表明,预测价格中的一阶误差仅导致目标值的二阶损失。在三个类别的CO问题中,NCP要么以较大优势超越最先进的神经基线,要么以一小部分计算时间匹配它们,并显示出更强的分布外泛化能力。

英文摘要

Combinatorial optimization (CO) problems are difficult because certifiable discrete structure induces exponential search. One must search over exponentially many candidates to certify optimality; however, the structural feasibility of a path, packing, or cover can be verified in polynomial time once supplied. In this study, we introduce Neural Certificate Pricing (NCP), which exploits this asymmetry under an unsupervised learning framework. A neural network is trained to predict certificate-level dual prices, while a structured recovery layer constructs the induced primal marginal. NCP can be viewed as amortized separation: instead of enumerating violated inequalities, it learns the residual prices through which their aggregate effect enters recovery. When the certificate-consistency condition holds, the recovered marginal is globally feasible, and a local theory shows that first-order errors in the predicted price induce only second-order loss in objective value. Across three classes of CO problems, NCP either outperforms state-of-the-art neural baselines by large margins or matches them at a fraction of the computation time, and shows stronger out-of-distribution generalization.

2607.01181 2026-07-02 cs.LG cs.AI cs.CL 新提交

Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

以正确的方式正确:结合可验证奖励和人类演示的LM训练

Mehul Damani, Isha Puri, Idan Shenfeld, Jacob Andreas

发表机构 * MIT EECS(麻省理工学院电气工程与计算机科学系)

AI总结 提出对抗生成器-判别器框架,在可验证奖励基础上加入人类演示信号,同时优化任务准确性和非可验证属性,在代码修复、故事生成等任务中提升人类相似度并减少奖励黑客行为。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)已成为在具有明确定义成功指标的任务(如代码生成和数学推理)上训练LM的强大范式。然而,当前的RLVR方法仅优化可客观评分的内容,往往忽略了人类输出中主观、不可验证的方面,如风格和结构。这一限制导致了诸如多样性崩溃、不自然响应的输出和奖励黑客等有据可查的失败模式。我们提出了一种对抗生成器-判别器框架,该框架通过来自人类演示的学习信号增强可验证奖励。生成器模型使用RL进行训练,以最大化任务准确性和来自判别器的对抗奖励。与生成器策略一起训练的判别器学习区分人类编写的输出和模型生成的输出。判别器作为人类输出分布的学习代理,为难以形式化为标量奖励的生成方面提供反馈。在包括错误修复和开放式生成的多个领域,我们的方法一致地改善了非可验证属性,同时保持了RLVR的准确性增益。在错误修复中,我们的方法产生的解决方案与RLVR基线相比具有显著更低的编辑距离,同时匹配最终性能。在故事生成中,我们的方法显著提高了胜率,同时生成了多样且更类人的故事。在一个简单的奖励黑客基准测试中,我们的方法几乎消除了模型的不当行为,同时保持了高基准分数。这些结果共同表明,我们的方法桥接了RL和SFT,为联合优化任务的可验证和不可验证属性提供了一条可扩展的路径。

英文摘要

RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only what can be objectively scored, often neglecting subjective, non-verifiable aspects of human-like outputs, such as style and structure. This limitation leads to well-documented failure modes such as diversity collapse, unnatural-sounding responses, and reward hacking. We propose an adversarial generator-discriminator framework that augments verifiable rewards with a learned signal from human demonstrations. A generator model is trained using RL to maximize both task accuracy and an adversarial reward derived from a discriminator. The discriminator, trained alongside the generator policy, learns to distinguish human-written outputs from model-generated ones. The discriminator serves as a learned proxy for the human output distribution, providing feedback on aspects of generation that are difficult to formalize as scalar rewards. Across diverse domains, including bug fixing and open-ended generation, our approach consistently improves non-verifiable properties while preserving the accuracy gains of RLVR. In bug fixing, our method produces solutions with significantly lower edit distance compared to RLVR baselines while matching end performance. In story generation, our method significantly improves win rate while producing stories that are diverse and more human-like. And in a simple reward hacking benchmark, our method nearly eliminates model misbehavior while maintaining high benchmark scores. Together, these results show that our approach bridges RL and SFT, offering a scalable path toward jointly optimizing the verifiable and non-verifiable properties of a task.
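
The abstract describes a generator reward that combines task accuracy with an adversarial signal from a discriminator. A minimal sketch of one way such a reward could be combined follows. The function name `combined_reward`, the logit form of the adversarial term, and the weight `adv_weight` are all assumptions for illustration, not the authors' formulation.

```python
import math

def combined_reward(passed_verifier, disc_prob_human, adv_weight=0.5):
    """Hypothetical reward: verifiable task reward plus a weighted adversarial term.

    passed_verifier: True if the output passed the objective check (e.g. unit tests).
    disc_prob_human: discriminator's probability that the output is human-written.
    adv_weight: assumed trade-off weight between the two terms.
    """
    task_reward = 1.0 if passed_verifier else 0.0
    # Clip the probability so the logarithms stay finite.
    p = min(max(disc_prob_human, 1e-6), 1.0 - 1e-6)
    # Logit of the discriminator output: positive when the output looks human.
    adversarial_reward = math.log(p) - math.log(1.0 - p)
    return task_reward + adv_weight * adversarial_reward
```

With this form, an output the discriminator cannot classify (probability 0.5) receives only the task reward, and more human-looking outputs score higher at equal task success.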

2607.01179 2026-07-02 cs.LG cs.CL 新提交

QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling

QuasiMoTTo: 准蒙特卡洛测试时扩展

Michael Y. Li, Anthony Zhan, Kanishk Gandhi, Noah D. Goodman, Emily B. Fox

发表机构 * Stanford University(斯坦福大学)

AI总结 提出QuasiMoTTo方法,利用准蒙特卡洛相关采样替代独立同分布采样,在推理和强化学习中减少冗余,以更少样本达到相同性能。

详情
AI中文摘要

通过为每个问题生成多个并行尝试来扩展推理计算,是提高语言模型能力的一种成本高昂但可靠的杠杆。默认情况下,这些尝试是独立生成的,浪费了推理计算在冗余解决方案上。这种浪费似乎不可避免。毕竟,独立性正是使并行采样易于扩展的原因。然而,这种权衡并非根本性的:存在一个丰富的采样器设计空间,可以完全并行地生成相关但精确的样本。我们探索这个设计空间,作为在扩展推理计算和强化学习(RL)中提高样本效率的途径。具体地,我们引入了QuasiMoTTo,它使用相关样本作为独立同分布样本的即插即用替代。为了生成这些样本,QuasiMoTTo将自回归采样重新参数化为逆CDF采样,并使用准蒙特卡洛(QMC)抽取底层均匀分布;由于QMC比独立同分布更均匀地分布均匀变量,生成的样本覆盖输出空间时冗余大大减少。尽管批次是相关的,但每个样本边际上服从语言模型分布,因此我们可以使用该批次进行策略梯度训练。我们的实证分析侧重于理解QuasiMoTTo如何高效地将计算转化为性能。为了评估相关采样器(其依赖性破坏了标准的pass@k估计器),我们首先开发了一个无偏的bootstrap估计器。在四个推理基准上,QuasiMoTTo以25-47%更少的样本匹配了独立同分布的pass@k准确率。引人注目的是,QuasiMoTTo经常达到任何边际保持采样器的pass@k上限。我们还将QuasiMoTTo应用于策略梯度RL(GRPO),其中它以50%更少的训练步骤匹配了独立同分布的性能。这些增益来自于更高的覆盖率,这为每个批次提供了更强的学习信号。

英文摘要

Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference compute on redundant solutions. This waste seems unavoidable. After all, independence is what makes parallel sampling trivial to scale. However, this tradeoff is not fundamental: there is a rich design space of samplers that generate correlated but exact samples entirely in parallel. We explore this design space as an avenue for improving sample efficiency in scaling inference compute and reinforcement learning (RL). Concretely, we introduce QuasiMoTTo, which uses correlated samples as a drop-in replacement for i.i.d. samples. To generate these samples, QuasiMoTTo uses a reparameterization of autoregressive sampling as inverse-CDF sampling and draws the underlying uniforms with quasi-Monte Carlo (QMC); because QMC spreads the uniforms out more evenly than i.i.d., the resulting samples cover the output space with far less redundancy. Even though the batch is correlated, each sample is marginally distributed according to the language model, so we can use the batch for policy-gradient training. Our empirical analysis focuses on understanding how efficiently QuasiMoTTo can turn compute into performance. To evaluate correlated samplers, whose dependence breaks standard pass@k estimators, we first develop an unbiased bootstrap estimator. Across four reasoning benchmarks, QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples. Strikingly, QuasiMoTTo often saturates an upper bound on pass@k that holds for any marginal-preserving sampler. We also apply QuasiMoTTo to policy-gradient RL (GRPO) where it matches i.i.d. performance with 50% fewer training steps. These gains come from higher coverage, which yields a stronger learning signal per batch.
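
The core idea, inverse-CDF sampling driven by evenly spread uniforms, can be illustrated on a single categorical draw. The sketch below is a toy under stated assumptions: the helper names are hypothetical, the three-token distribution is made up, and a randomly shifted evenly spaced grid stands in as one simple randomized quasi-Monte Carlo construction in one dimension. How QuasiMoTTo assigns uniforms across token positions is not described in the abstract and is not modeled here.

```python
import numpy as np

def inverse_cdf_sample(probs, u):
    """Map uniforms u in [0, 1) to category indices via the inverse CDF of probs."""
    cdf = np.cumsum(probs)
    cdf[-1] = 1.0  # guard against floating-point round-off in the last entry
    return np.searchsorted(cdf, u, side="right")

def shifted_grid_uniforms(n, rng):
    """Return n correlated uniforms: an even grid with one shared random shift.

    Each point is marginally Uniform[0, 1), but the batch covers [0, 1) evenly.
    """
    shift = rng.random()
    return (np.arange(n) / n + shift) % 1.0

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.2])       # toy next-token distribution (made up)
u = shifted_grid_uniforms(10, rng)      # correlated batch of 10 uniforms
tokens = inverse_cdf_sample(probs, u)   # one categorical draw per uniform
counts = np.bincount(tokens, minlength=3)
```

Because the grid points are evenly spaced, the batch of 10 draws splits across the three tokens in proportion to their probabilities, whereas 10 independent draws would typically over-represent some tokens and under-represent others.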

2607.01176 2026-07-02 cs.CV 新提交

High-dimensional Embedding Prior for Noisy K-space Domain MRI Reconstruction

高维嵌入先验用于噪声k空间域MRI重建

Yu Guan, Tianjia Huang, Qinrong Cai, Qiuyun Fan, Dong Liang, Qiegen Liu

发表机构 * School of Advanced Manufacturing, Nanchang University(南昌大学先进制造学院) School of Mathematics and Computer Science, Nanchang University(南昌大学数学与计算机科学学院) School of Information Engineering, Nanchang University(南昌大学信息工程学院) Academy of Medical Engineering and Translational Medicine, Medical School, Faculty of Medicine, Tianjin University(天津大学医学部医学工程与转化医学研究院) Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院)

AI总结 针对噪声k空间MRI重建,提出高维嵌入框架增强扩散模型表示能力,在多种噪声和欠采样条件下提升重建质量,尤其在高噪声场景效果显著。

详情
AI中文摘要

磁共振成像(MRI)在现实采集条件下的重建可以基本视为从不完整且受噪声污染的测量中估计潜在的k空间分布。虽然扩散模型最近作为逆问题的生成先验显示出强大潜力,但现有方法在处理噪声重建设置时存在困难,尤其是在k空间域直接操作时。在这项工作中,我们提出了一个统一的针对噪声逆问题的高维k空间重建框架,通过表示提升来改进基于扩散的求解器。所提出的框架并不修改底层优化过程,而是扩充数据表示空间,使现有的基于扩散的求解器能够在具有更高表达能力的丰富k空间嵌入上运行。在内部和公共数据集上,针对不同噪声水平和欠采样因子的大量实验表明,所提出的框架一致地提高了多种基于扩散的逆求解器的重建质量。值得注意的是,在高噪声情形下观察到最大的增益,这与我们在高维表示下误差传播的理论分析一致。这些结果表明,高维表示为在噪声设置中改进基于扩散的MRI重建提供了一种通用且与模型无关的机制,为实际逆问题的鲁棒k空间生成建模提供了新视角。代码将在 https://github.com/yqx7150/HEP-MRIRec 提供。

英文摘要

Magnetic resonance imaging (MRI) reconstruction under realistic acquisition conditions can be fundamentally viewed as estimating the underlying k-space distribution from incomplete and noise-corrupted measurements. While diffusion models have recently shown strong potential as generative priors for inverse problems, existing approaches struggle to handle noisy reconstruction settings, especially when operating directly in the k-space domain. In this work, we propose a unified high-dimensional k-space reconstruction framework tailored for noisy inverse problems, which enhances diffusion-based solvers through representation lifting. Rather than modifying the underlying optimization procedures, the proposed framework augments the data representation space, enabling existing diffusion-based solvers to operate on enriched k-space embeddings with improved expressiveness. Extensive experiments on both in-house and public datasets across varying noise levels and undersampling factors demonstrate that the proposed framework consistently improves reconstruction quality for multiple diffusion-based inverse solvers. Notably, the largest gains are observed in high-noise regimes, which is consistent with our theoretical analysis of error propagation under high-dimensional representation. These results suggest that high-dimensional representation provides a general and model-agnostic mechanism for improving diffusion-based MRI reconstruction in noisy settings, offering a new perspective on robust k-space generative modeling for practical inverse problems. The code will be available at https://github.com/yqx7150/HEP-MRIRec.

2607.01171 2026-07-02 cs.LG stat.ML 新提交

Decision-Aware Training for Sample-Based Generative Models

面向样本生成模型的决策感知训练

Kornelius Raeth, Nicole Ludwig

发表机构 * University of Augsburg, Germany(奥格斯堡大学)

AI总结 针对样本生成模型训练目标忽视下游决策成本的问题,提出将可微决策损失与能量分数结合,实现决策感知训练,在保持完整概率预测的同时提升成本敏感区域性能。

详情
AI中文摘要

样本生成模型越来越多地用于高风险决策场景中的概率预测,然而其训练目标忽略了决策者的成本结构。这些模型通常使用严格适当的评分规则(如能量分数)进行训练,这些规则根据数据密度分配训练信号,而不考虑预测误差对下游决策成本最大的区域。因此,我们提出面向样本生成模型的决策感知训练,用可微决策损失增强能量分数目标,该损失直接惩罚基于模型预测行动所产生的成本。这种组合损失具有理论基础,因为决策损失本身就是一个适当的评分规则。我们在一个合成任务和两个真实世界任务上验证了我们的方法,显示出在成本敏感区域的针对性改进,同时保留了完整的概率预测。

英文摘要

Sample-based generative models are increasingly used for probabilistic forecasting in high-stakes decision settings, yet their training objectives are blind to the decision maker's cost structure. These models are commonly trained with strictly proper scoring rules, such as the energy score, which allocate their training signal in proportion to data density, with no awareness of where forecast errors are most costly for downstream decisions. We therefore propose decision-aware training for sample-based generative models, augmenting the energy score objective with a differentiable decision loss that directly penalises the cost incurred by acting on the model's forecast. This combined loss is theoretically grounded, as the decision loss is itself a proper scoring rule. We validate our method on one synthetic and two real-world tasks, showing targeted improvements in cost-sensitive regions while retaining full probabilistic forecasts.
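
To make the combined objective concrete, the sketch below evaluates a sample-based energy score and adds a weighted decision cost. Only the energy score is a standard quantity; the asymmetric cost, the choice of acting on the forecast mean, the parameter names `c_under`, `c_over`, and `lam`, and the use of plain NumPy (rather than a differentiable framework) are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def energy_score(samples, y):
    """Sample-based energy score estimate for one observation (lower is better).

    samples: (m, d) array of forecast samples; y: (d,) observed outcome.
    Uses the plain average over all m*m sample pairs, including i == j.
    """
    term_obs = np.linalg.norm(samples - y, axis=1).mean()
    pairwise = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2)
    return float(term_obs - 0.5 * pairwise.mean())

def decision_cost(samples, y, c_under=4.0, c_over=1.0):
    """Hypothetical asymmetric cost of acting on the forecast mean."""
    action = samples.mean(axis=0)
    shortfall = np.maximum(y - action, 0.0)  # action was too low
    surplus = np.maximum(action - y, 0.0)    # action was too high
    return float(np.sum(c_under * shortfall + c_over * surplus))

def combined_loss(samples, y, lam=0.5):
    """Energy score plus a weighted decision cost (lam is an assumed weight)."""
    return energy_score(samples, y) + lam * decision_cost(samples, y)
```

In this toy cost, under-forecasting is four times as expensive as over-forecasting, so the added term pushes the forecast hardest where shortfalls occur, while the energy score still evaluates the full sample set.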

2607.01153 2026-07-02 cs.CL cs.AI cs.SE 新提交

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

面向AI安全评估的对抗语用学:指令冲突、嵌入命令与策略模糊性基准

Brett Reynolds

发表机构 * Humber Polytechnic(汉博理工学院) University of Toronto(多伦多大学)

AI总结 提出对抗语用学基准和标注协议,通过语言学控制的分类法评估模型在指令冲突、嵌入命令等场景下的行为,为安全评估提供实证和方法论工具。

Comments 15-page main paper plus 9-page supplement; 6 figures and 8 tables total; code and data artifact available at the linked repository

详情
AI中文摘要

语言模型的安全评估越来越依赖于对模糊自然语言行为的判断:模型是否遵循了指令、是否恰当拒绝、是否遵守策略、是否抵抗嵌入命令、或在代理任务中错误报告进展。现有基准通常将这些区别压缩为通过/失败标签,掩盖了失败是源于能力限制、策略模糊性、指令冲突、支架故障还是评估者判断不稳定。本文引入对抗语用学作为基准和标注协议,用于评估模型在指令冲突、嵌入命令、引用、范围模糊性、指示语、间接言语行为和多轮代理转录本下的行为。贡献是经验性和方法论的:一个语言学控制的分类法、一个带有验证者强制元数据的18项种子基准、一个54行本地种子试点、一个区分任务成功、策略合规、安全风险、拒绝结果和评估者置信度的专家评估协议,以及用于判断有效性、诊断模糊性和分类法漂移的指标。该框架将语言判断方法论转化为验证安全评估、LLM判断器、金标准构建、提示注入测试和安全文档的实用工具。

英文摘要

Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded command, or misreported progress in an agentic task. Existing benchmarks often compress these distinctions into pass/fail labels, obscuring whether failures arise from capability limits, policy ambiguity, instruction conflict, scaffold failure, or unstable evaluator judgments. This paper introduces adversarial pragmatics as a benchmark and annotation protocol for evaluating model behaviour under instruction conflict, embedded commands, quotation, scope ambiguity, deixis, indirect speech acts, and multi-turn agent transcripts. The contribution is empirical and methodological: a linguistically controlled taxonomy, an 18-item seed benchmark with validator-enforced metadata, a 54-row local seed pilot, an expert-evaluation protocol distinguishing task success, policy compliance, safety risk, refusal outcome, and evaluator confidence, and metrics for judge validity, diagnostic ambiguity, and taxonomy drift. The framework turns linguistic judgment methodology into a practical tool for validating safety evals, LLM judges, gold-set construction, prompt-injection tests, and safety documentation.

2607.01152 2026-07-02 cs.CL 新提交

AGC-Bench: Measuring Artificial General Creativity

AGC-Bench:衡量人工通用创造力

Roger Beaty, Vijeta Deshpande, Clin K. Y. Lai, Anna Attuch, Namrata Shivagunde, Swastik Roy, Rajkumar Pujari, Paul V. DiStefano, Sherin Muckatira, Claire E. Stevenson, Mikhail Gronas, Anna Rumshisky

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学) University of Massachusetts Lowell(马萨诸塞大学洛厄尔分校) University of Amsterdam(阿姆斯特丹大学) Amazon AGI(亚马逊AGI) Dartmouth College(达特茅斯学院)

AI总结 提出AGC-Bench基准,通过系统文献综述和标准化框架评估LLM创造力,发现单一创造力因子'c',并验证提示'要有创造力'比推理更有效。

详情
AI中文摘要

创造力研究一直争论创造力是领域特定的(例如,视觉、写作、科学),还是可以从心理测量学上与一般智力分离。这两个问题现在都适用于LLM,但统一的AI创造力基准仍然难以捉摸。我们引入了AGC-Bench,一个基于AI创造力文献系统综述(筛选了3101篇论文,识别了497个基准)构建的人工通用创造力基准,并配有一个代理框架,将特异的代码库转换为HELM标准化的基准。首次发布涵盖了78个数据集,涵盖头脑风暴、问题解决、STEM、叙事、比喻语言和幽默。为了解决LLM作为评判者的偏见,我们应用了评判者响应理论——对评判者宽严程度的心理测量校准;然后我们在三个前沿LLM的偏差校正评分上微调Qwen3-30B,产生AGC-Judge,一个开放权重的模型,能够稳健地评分它未训练过的新创造力基准。结果显示,前沿模型位于AGC-Bench排行榜顶部,开放模型紧随其后。LLM表现出不同的创造力优势,在某些领域(例如写作)排名高于其他领域(例如科学构思)。大量实验得出三个主要发现。首先,对83个LLM进行因子分析,我们恢复了一个单一的创造力因子'c',类似于一般智力的'g'因子,解释了81.5%的方差,与一般知识/推理相关但可分离。其次,我们表明提示模型“要有创造力”比启用推理更能提升其性能,证明该基准追踪的是创造力而非一般能力。第三,在人类匹配的子集上,我们发现顶级人类在创造力上仍然领先于顶级LLM。我们发布了AGC-Bench,附带公共排行榜、AGC-Judge和人类数据,作为大规模衡量AI创造力的开放基础设施。

英文摘要

Creativity research has debated whether creativity is domain-specific (e.g., visual, writing, science), and if it is psychometrically separable from general intelligence. Both questions now apply to LLMs, but a unified benchmark of AI creativity remains elusive. We introduce AGC-Bench, an artificial general creativity benchmark built from a systematic review of the AI creativity literature (3,101 papers screened, 497 benchmarks identified), paired with an agentic harness that converts idiosyncratic codebases into HELM-standardized benchmarks. The first release covers 78 datasets spanning brainstorming, problem solving, STEM, narrative, figurative language, and humor. To address bias in LLM-as-judge, we apply Judge Response Theory -- a psychometric calibration of judge leniency/severity; we then fine-tune Qwen3-30B on the bias-corrected ratings of three frontier LLMs to produce AGC-Judge, an open-weight model that robustly scores new creativity benchmarks it was not trained on. Results reveal frontier models at the top of the AGC-Bench leaderboard, with open models close behind. LLMs show different creative strengths, ranking higher on some domains (e.g., writing) than others (e.g., scientific ideation). Extensive experiments yield three main findings. First, applying factor analysis across 83 LLMs, we recover a single creativity factor 'c', analogous to the 'g' factor of general intelligence, that explains 81.5% of variance, related to but separable from general knowledge/reasoning. Second, we show that prompting models to "be creative" boosts their performance far more than enabling reasoning, evidence that the benchmark tracks creativity over general ability. Third, on a human-matched subset, we find the top human still leads the top LLM on creativity. We release AGC-Bench with a public leaderboard, AGC-Judge, and human data as open infrastructure for measuring AI creativity at scale.
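
The "share of variance explained by a single factor" can be illustrated with a small computation. The sketch below uses the leading eigenvalue of the benchmark correlation matrix, i.e. principal component analysis, as a simple stand-in; the paper's actual factor-analysis procedure may differ, and the function name and input layout are assumptions.

```python
import numpy as np

def first_factor_variance_share(scores):
    """Share of total variance captured by the leading principal component.

    scores: (n_models, n_benchmarks) array, one row per model.
    Works on the correlation matrix, so each benchmark is standardized.
    """
    corr = np.corrcoef(scores, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)  # eigenvalues in ascending order
    return float(eigvals[-1] / eigvals.sum())
```

A value near 1 means model rankings are nearly identical across benchmarks, while a value near `1 / n_benchmarks` means the benchmarks are essentially uncorrelated.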

2607.01147 2026-07-02 cs.CV 新提交

EquiSteer: Cross-Attention Steering Towards a Fairer Text-Guided Image Generation

EquiSteer: 面向更公平的文本引导图像生成的交叉注意力引导

Tatiana Gaintseva, Akshit Achara, Gregory Slabaugh, Jiankang Deng, Ismail Elezi

发表机构 * Queen Mary University of London(伦敦玛丽女王大学) Huawei Noah’s Ark(华为诺亚方舟实验室) King’s College London(伦敦国王学院) Imperial College London(帝国理工学院)

AI总结 提出EquiSteer,一种无需训练的方法,通过在推理时引导交叉注意力激活来减少文本到图像扩散模型中的性别偏见,平均减少高达87%的性别差异。

详情
AI中文摘要

文本到图像扩散模型支撑着日常创意任务,但它们仍然再现了训练数据中的人口统计偏见。对于诸如“护士照片”、“CEO照片”等常见提示,模型会根据训练数据的统计数据而非文本内容,使输出偏向某一性别。现有的去偏方法在狭窄场景下显示出潜力,但需要重新训练、批次级控制或特定提示调整,限制了其可扩展性。我们提出EquiSteer,一种无需训练的方法,通过在推理时引导交叉注意力(CA)激活来逐样本工作。对于每个目标属性,EquiSteer从对比提示中预计算引导向量。在生成时,一个提示感知的门控保持属性特定提示不变,而对于中性提示,它从CA激活中清除现有属性信号并注入目标属性。在SD-1.5、SD-2.1、SDXL和SANA上,EquiSteer平均减少了高达87%的性别差异,对图像质量和文本-图像对齐影响极小。代码可在 https://github.com/Atmyre/EquiSteer 获取。

英文摘要

Text-to-image diffusion models power everyday creative tasks, but they still reproduce the demographic biases in their training data. On common prompts such as "a photo of a nurse" or "a photo of a CEO", they skew their outputs toward one gender, driven by the statistics of training data rather than anything in the text. Existing debiasing methods show promise in narrow settings but require retraining, batch-level control, or prompt-specific tuning, limiting their scalability. We propose EquiSteer, a training-free method that works per sample by steering cross-attention (CA) activations at inference time. For each target attribute, EquiSteer precomputes steering vectors from contrastive prompts. Then at generation time, a prompt-aware gate leaves attribute-specific prompts untouched, while for neutral ones it clears existing attribute signals from the CA activations and injects a target attribute. Across SD-1.5, SD-2.1, SDXL, and SANA, EquiSteer reduces the average parity gap by up to 87%, with minimal effect on image quality and text-image alignment. Code is available at https://github.com/Atmyre/EquiSteer.
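
The clear-then-inject steering step can be sketched with generic activation-steering arithmetic. The abstract does not specify how the steering vectors, the clearing step, or the gate are computed, so the mean-difference vector, the projection-based clearing, and the scalar `gate` and `strength` parameters below are assumptions, and the function names are hypothetical.

```python
import numpy as np

def steering_vector(acts_a, acts_b):
    """Unit vector along the mean activation difference of two prompt sets.

    acts_a, acts_b: (n, d) activations collected from contrastive prompts.
    """
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(activation, v, strength, gate):
    """Apply gated steering to one activation vector.

    gate = 1.0: remove the component along v, then add strength * v.
    gate = 0.0: return the activation unchanged (attribute-specific prompt).
    """
    cleared = activation - np.dot(activation, v) * v  # clear existing signal
    steered = cleared + strength * v                  # inject target attribute
    return gate * steered + (1.0 - gate) * activation
```

Components orthogonal to the steering direction pass through unchanged, which is one plausible reason such interventions can leave image content largely intact.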

2607.01145 2026-07-02 cs.LG eess.SP 新提交

A Lightweight Self-Supervised Learning Framework for Multivariate Time Series using Hierarchical-JEPA on ECG Data

一种用于多变量时间序列的轻量级自监督学习框架:基于层次JEPA的心电图数据方法

Siwon Kim

发表机构 * Research Institute of Basic Sciences, Seoul National University(首尔大学基础科学研究院)

AI总结 提出ER-JEPA,一种轻量级自监督学习框架,通过层次化联合嵌入预测架构对多变量时间序列进行表征学习,在心电图数据上实现高效预训练和下游任务最优性能。

Comments 25 pages, 7 figures. Code will be made publicly available soon

详情
AI中文摘要

医学领域的数据分析常面临目标数据集有限而大量未标注数据分布广泛的情况。在这种情况下,自监督学习方法能有效利用大数据集,成为心电图分析的热门选择。本文提出事件重建联合嵌入预测架构(ER-JEPA),一种用于多变量时间序列的轻量级自监督学习框架,其名称和双层层次结构受心脏病专家诊断方法的启发。ER-JEPA的核心包括:(1)两阶段结构,先为每个时间间隔构建表征,再将这些表征作为单变量时间序列处理;(2)两个联合嵌入预测架构的层次化集成;(3)视觉Transformer骨干网络。两个JEPA的结构串联将模型归类为层次JEPA(H-JEPA),旨在编码多级抽象表征以增强复杂任务的预测能力。本研究报告了H-JEPA在12导联心电图数据(作为多变量时间序列)上的成功应用,并分析了预训练阶段层次表征的敏感性。该模型在约18万条10秒记录上预训练,在ST-MEM基准测试中实现了最先进的下游性能,且计算速度快、资源占用极少。

英文摘要

Data analysis in the medical domain often encounters scenarios involving a limited target dataset and a large, unannotated dataset with a general distribution. Under such circumstances, self-supervised learning (SSL) methods are highly effective for utilizing large datasets, making them a popular choice for electrocardiogram (ECG) analysis. This work presents the Event Reconstruction Joint-Embedding Predictive Architecture (ER-JEPA), a lightweight SSL framework for multivariate time series, whose name and two-fold hierarchical structure are inspired by the diagnostic approach of cardiologists. At its core, ER-JEPA features: (1) a two-stage structure that constructs representations for each time interval and subsequently processes these representations as a univariate time series, (2) the hierarchical integration of two Joint-Embedding Predictive Architectures (JEPAs), and (3) a Vision Transformer (ViT) backbone. The structural concatenation of two JEPAs categorizes the model as a Hierarchical JEPA (H-JEPA), designed to encode multiple levels of abstract representations for enhanced prediction on complex tasks. This study reports a successful application of H-JEPA to 12-lead ECG data as a multivariate time series alongside an analysis of the sensitivity of hierarchical representation during the pretraining stage. Pretrained on approximately 180,000 10-second recordings, the model achieves state-of-the-art downstream performance on the ST-MEM benchmark, with rapid computation and minimal resource usage.

2607.01144 2026-07-02 cs.LG cs.AI cs.CE 新提交

Sequentially-Controlled Interactive Multi-Particle Flow-Maps for Online Feedback-Driven Search

序列控制交互式多粒子流图用于在线反馈驱动搜索

Binglin Ji, Anindya Sarkar, Hengchang Lu, Jens Sjölund, Yevgeniy Vorobeychik

发表机构 * Uppsala University(乌普萨拉大学)

AI总结 提出序列控制交互式多粒子流图(IMPFM),通过多粒子交互和流图驱动的后验样本共享机制,实现在线反馈驱动搜索中的全局探索与偏好对齐,避免模式崩溃和权重退化。

Comments 28 pages, 19 figures

详情
AI中文摘要

虽然生成模型已经实现了无需训练的奖励对齐,但当前方法通常擅长在底层分布的狭窄区域内进行局部探索。当偏好先验未知且仅通过序列反馈揭示时,这些方法难以应对——这种情况需要广泛探索以发现高效用区域。为了解决这个问题,我们提出了序列控制交互式多粒子流图(IMPFM),一个样本高效的在线反馈驱动搜索框架。IMPFM逐步将一组交互粒子向目标分布传输,保持异质偏好对齐所需的广泛覆盖。IMPFM引入了一种基于流图的原则性且高效的后验样本共享机制。通过在每次重采样步骤中用整个集成体的集体后验样本纠正单个粒子漂移,该框架最大化样本效用以实现全局探索,同时主动缓解标准控制框架中常见的奖励过度优化问题。结合涉及多粒子交互的原则性探索-利用重新加权机制,这种序列校正的多粒子动力学明确保留了结构多样性,并克服了标准SMC采样器固有的权重退化问题。关键的是,我们证明了所得到的采样框架产生了一个多粒子交互感知的Feynman-Kac校正器,逐步将多粒子系统引导向KL倾斜的目标分布,促进全局探索并防止模式崩溃。在多种搜索和对齐任务上的广泛经验评估和严格消融实验证实了IMPFM相对于现有基线的有效性。

英文摘要

While generative models have enabled training-free reward alignment, current methods typically excel in local exploration within narrow regions of the underlying distribution. These approaches struggle when preferences are unknown a priori and only revealed through sequential feedback, a scenario demanding broad exploration to uncover high-utility regions. To address this, we propose Sequentially-Controlled Interactive Multi-Particle Flow-Maps (IMPFM), a framework for sample-efficient online feedback-driven search. IMPFM progressively transports a group of interactive particles toward the target distribution, maintaining the broad coverage essential for heterogeneous preference alignment. IMPFM introduces a principled and efficient posterior sample sharing mechanism across particles powered by flow maps. By correcting individual particle drift with the collective posterior samples of the entire ensemble at each resampling step, the framework maximizes sample utility to enable global exploration while actively mitigating the reward over-optimization typical of standard control frameworks. Paired with a principled exploration-exploitation reweighting mechanism involving multi-particle interaction, these sequentially corrected multi-particle dynamics explicitly preserve structural diversity and overcome the weight degeneracy inherent to standard SMC samplers. Crucially, we prove that the resulting sampling framework yields a multi-particle interaction-aware Feynman-Kac corrector that progressively steers the multi-particle system toward a KL-tilted target distribution, facilitating global exploration and preventing mode collapse. Extensive empirical evaluations and rigorous ablations across diverse search and alignment tasks confirm the efficacy of IMPFM over existing baselines.

2607.01140 2026-07-02 cs.CV 新提交

Relation-Centric Open-Vocabulary 3D Gaussian Segmentation

以关系为中心的开放词汇3D高斯分割

Eunsung Cha, Hyunjoon Lee, Jaesik Park

发表机构 * Seoul National University(首尔大学)

AI总结 提出PairGS框架,通过建模高斯间成对关系实现开放词汇3D高斯分割,无需逐场景优化,速度比优化方法快50倍。

Comments Project Page: https://eunsungcha.github.io/PairGS-web/

详情
AI中文摘要

开放词汇3D高斯分割具有挑战性,因为它需要对多样化查询进行语言理解,并沿物体边界准确分离高斯。先前的方法要么将语言知识嵌入单个高斯以提高查询响应性,要么优化每个高斯的实例特征以编码物体身份。然而,这些策略可能产生噪声高斯分割,或依赖于成本高昂的逐场景优化。我们提出PairGS,一个将高斯分割重新定义为建模高斯间成对关系的框架。3D高斯表示为关系估计提供了丰富的信号,例如视图贡献权重和多视图掩码证据。通过利用这些线索,PairGS显式构建用于分割的关系图,而无需繁重的优化过程。PairGS首先使用低维描述符提出稀疏边候选,仅在这些候选上计算精确的成对亲和度,并构建用于多粒度查询的层次聚类树。它在开放词汇3D高斯分割基准上取得了最先进的结果,而快速变体比基于优化的实例特征方法快50倍。

英文摘要

Open-vocabulary 3D Gaussian segmentation is challenging because it requires language understanding for diverse queries and accurate separation of Gaussians along object boundaries. Prior approaches either embed language knowledge into individual Gaussians to improve query responsiveness or optimize per-Gaussian instance features to encode object identity. However, these strategies may produce noisy Gaussian segmentations or rely on cost-inefficient per-scene optimization. We propose PairGS, a framework that reframes Gaussian segmentation as modeling pairwise relations between Gaussians. 3D Gaussian representations provide rich signals for relation estimation, such as view contribution weights and multi-view mask evidence. By leveraging these cues, PairGS explicitly constructs a relation graph for segmentation without a heavy optimization process. PairGS first proposes sparse edge candidates using low-dimensional descriptors, computes precise pairwise affinities only on those candidates, and builds a hierarchical cluster tree for multi-granular querying. It achieves state-of-the-art results on open-vocabulary 3D Gaussian segmentation benchmarks, while the fast variant is 50x faster than optimization-based instance-feature approaches.

2607.01139 2026-07-02 cs.CV 新提交

SD-RouteFusion: Ego-Trajectory Prediction with SD-Map Route Conditioning

SD-RouteFusion:基于SD地图路线条件的自车轨迹预测

Sviatoslav Voloshyn, Bruno K. W. Martens, Wangxin Liu, Jakob Vinkås, Junsheng Fu

发表机构 * Zenseact

AI总结 提出SD-RouteFusion,融合前视相机、车辆动力学和SD地图导航路线进行自车轨迹预测,无需HD地图,通过双假设设计和门控分类器实现鲁棒融合,在8秒预测时ADE降低16.9%。

Comments 9 pages, 4 figures, 29th International Conference on Information Fusion

详情
AI中文摘要

本文提出SD-RouteFusion,一种可部署的端到端自车轨迹预测方法,融合前视相机、车辆动力学和来自标准定义(SD)地图的导航路线。与依赖高精(HD)地图几何的方法不同,SD-RouteFusion将学习目标与可扩展且可投入生产的SD地图路线输入对齐,无需HD地图基础设施即可实现路线感知预测。首先,我们证明SD地图路线先验提供了强大的长时域语义先验。通过对包含10个欧洲国家和美国48万驾驶场景的大规模真实世界数据集进行全面研究,我们量化了SD路线条件的价值:与仅使用图像和运动学信息的基线相比,引入SD地图路线使ADE降低10.5%,而我们的完整融合策略在8秒预测时域下实现16.9%的ADE降低。融合策略由双假设设计与门控分类器组成,以确保在路线损坏和视觉不确定性下的鲁棒性。最后,为支持更广泛的评估,我们发布了一个SD路线生成工具包,使所有包含自车位姿和未来轨迹的数据集都能进行SD路线条件下的自车轨迹预测。综上,SD-RouteFusion为大规模鲁棒、路线感知的自车轨迹预测建立了一条实用路径。

英文摘要

This paper presents SD-RouteFusion, a deployable end-to-end ego-trajectory prediction method that fuses a front-facing camera, vehicle kinematics, and a navigation route derived from a Standard Definition (SD) map. Unlike approaches that rely on High Definition (HD) map geometry, SD-RouteFusion aligns the learning objective with scalable and production-ready SD-map route inputs, enabling route-aware prediction without requiring HD-map infrastructure. First, we demonstrate that the SD-map route provides a powerful long-horizon semantic prior. Through a comprehensive study on a large-scale real-world dataset comprising 480k driving scenarios across 10 European countries and the U.S., we quantify the value of SD-route conditioning: incorporating SD-map routes yields a 10.5% ADE improvement over an image-and-kinematics baseline, while our full fusion strategy achieves a 16.9% ADE reduction given a prediction horizon of 8 seconds. The fusion strategy consists of a dual-hypothesis design paired with a gated classifier to ensure robustness under route corruption and visual uncertainty. Finally, to support broader evaluation, we release an SD-route generation toolkit that enables SD-route-conditioned ego-trajectory prediction on all datasets containing ego pose and future trajectories. Together, SD-RouteFusion establishes a practical path toward robust, route-aware ego-trajectory prediction at scale.
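
The reported gains are stated in terms of ADE (average displacement error), commonly defined as the mean Euclidean distance between predicted and ground-truth waypoints. A minimal sketch of the metric and of a relative reduction follows. The function names are hypothetical, and the exact evaluation protocol used in the paper (sampling rate, horizon handling) is not given in the abstract.

```python
import numpy as np

def average_displacement_error(pred, gt):
    """Mean Euclidean distance between matching waypoints.

    pred, gt: (T, 2) arrays of x/y positions over T future time steps.
    """
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline
```

As a purely illustrative calculation with made-up numbers, a baseline ADE of 2.000 m reduced to 1.662 m corresponds to a relative reduction of 0.169, that is 16.9%.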

2607.01133 2026-07-02 cs.CV cs.RO 新提交

Towards Metric-Agnostic Trajectory Forecasting

迈向度量无关的轨迹预测

Markus Knoche, Daan de Geus, Bastian Leibe

发表机构 * RWTH Aachen University(亚琛工业大学) Eindhoven University of Technology(埃因霍温理工大学)

AI总结 提出度量无关的概率训练目标,并引入TraDiE策略将预测分布映射为轨迹和置信度,实现度量优化作为下游任务,在Waymo基准上取得最优结果。

Comments ECCV 2026. Project page at https://vision.rwth-aachen.de/TraDiE-policies

详情
AI中文摘要

准确预测周围交通参与者的轨迹是自动驾驶的核心能力,使车辆能够预测行为并规划安全操作。我们观察到,当前在Argoverse 2和Waymo开放运动数据集上的最先进预测模型,其训练目标针对不同的基准度量进行了定制。由于这些度量鼓励相互冲突的行为,我们提出轨迹预测的范式转变:使用度量无关的概率目标训练模型,并将度量优化作为应用于预测分布的下游任务。具体而言,我们引入了轨迹分布评估(TraDiE)策略,这是一种度量特定的策略,将预测分布映射为轨迹预测度量所需的$K$条轨迹和置信度。我们通过引入DONUT-NLL来评估该框架,该模型调整了最先进轨迹预测模型DONUT的训练目标,直接优化预测分布。使用我们的策略,DONUT-NLL在Waymo运动预测基准的所有度量上均达到了最先进的结果。

英文摘要

Accurate trajectory forecasting of surrounding traffic participants is a core capability for autonomous driving, enabling vehicles to anticipate behavior and plan safe maneuvers. We observe that current state-of-the-art forecasting models on Argoverse 2 and the Waymo Open Motion Dataset tailor their training objectives to the different benchmark metrics. Because these metrics encourage conflicting behavior, we propose a paradigm change for trajectory forecasting: training models with metric-agnostic probabilistic objectives and treating metric optimization as a downstream task applied to the predictive distribution. Concretely, we introduce Trajectory Distribution Evaluation (TraDiE) policies, metric-specific policies that map a predictive distribution to the set of $K$ trajectories and confidences required by trajectory forecasting metrics. We evaluate this framework by introducing DONUT-NLL, which adapts the training objective of the state-of-the-art trajectory forecasting model DONUT to directly optimize the predictive distribution. Using our policies, DONUT-NLL achieves state-of-the-art results on all metrics of the Waymo motion prediction benchmark.

2607.01131 2026-07-02 cs.CV cs.AI 新提交

Autonomous Scientific Discovery via Iterative Meta-Reflection

通过迭代元反射实现自主科学发现

Bingchen Zhao, Sara Beery, Oisin Mac Aodha

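Affiliations: University of Edinburgh; Massachusetts Institute of Technology

AI summary: Proposes DiscoPER, a framework that uses large language models for open-ended autonomous scientific discovery. Through dynamic code generation, statistical testing, and a second-order reasoning mechanism, it recovers 8 of 9 known patterns on the iNatDisco benchmark with a 72.7% hypothesis support rate.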

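Details
AI-generated abstract (translated from Chinese)

Autonomous scientific discovery systems promise to accelerate research by automating hypothesis generation and validation. However, current systems operate within constrained search spaces or require predefined research questions, which limits their capacity for truly open-ended inquiry. Moreover, although they generate hypotheses iteratively, they largely lack the ability to explicitly synthesize their own accumulated findings to uncover complex, interconnected phenomena. We propose DiscoPER, an autonomous framework driven by large language models that conducts open-ended research by dynamically generating and executing code to explore datasets, with no pre-specified research objectives. To ensure rigorous scientific validity, every proposed discovery must pass statistical testing. To overcome the limitations of isolated search, the framework introduces a second-order reasoning mechanism that periodically analyzes its own accumulated discoveries. By treating prior discoveries as empirical data, DiscoPER identifies structural patterns, confounds, and epistemic gaps, and actively steers hypothesis exploration toward uncharted regions of the search space. Tool use expands the search space further, letting the system explore hypotheses beyond structured metadata by processing and extracting useful information from multimodal sources such as images. Evaluated on iNatDisco, a new multimodal ecological knowledge benchmark with pattern-level ground truth drawn from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines. Ablations show that DiscoPER scales with more data and confirm the benefit of second-order meta-reflection.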

English abstract

Autonomous scientific discovery systems offer the potential to accelerate research by automating the process of hypothesis generation and validation. However, current systems operate within constrained search spaces or require predefined research questions, limiting their capacity for true open-ended inquiry. Furthermore, while they generate hypotheses iteratively, they largely lack the ability to explicitly synthesize their own accumulated findings to uncover complex, interconnected phenomena. We introduce DiscoPER, an autonomous large language model-powered framework that conducts open-ended research by dynamically generating and executing code to explore datasets without pre-specified research objectives. To ensure rigorous scientific validity, every proposed discovery must pass statistical testing. To overcome the limitations of isolated search, our framework introduces a second-order reasoning mechanism that periodically analyzes its own accumulated discoveries. By treating prior discoveries as empirical data, DiscoPER identifies structural patterns, confounds, and epistemic gaps, actively redirecting hypothesis exploration toward uncharted regions of the search space. The search space is further expanded by incorporating tool use, enabling the system to explore hypotheses beyond structured metadata by seamlessly processing and extracting useful information from multimodal sources like images. Evaluated on iNatDisco, a new multimodal ecological knowledge benchmark with pattern-level ground truth obtained from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines. Ablations show that DiscoPER scales with more data and confirm the benefits of second-order meta-reflection.
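
The abstract states that every proposed discovery must pass statistical testing but does not say which tests are used. As an illustration only, the sketch below gates a hypothesis of the form "group A and group B differ in their mean" with a two-sided permutation test. The choice of test, the function name `permutation_test`, the 0.05 threshold, and the sample values are assumptions, not details from the paper.

```python
import random


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed labeling itself, so p is never zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups of observations.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4]
group_b = [3.9, 4.1, 4.0, 3.8, 4.2, 4.0]
p_value = permutation_test(group_a, group_b)
supported = p_value < 0.05  # assumed significance threshold
print(p_value, supported)
```

A permutation test is used here because it needs only the standard library and makes no distributional assumption; a real system would pick the test to match each hypothesis.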

2607.01127 2026-07-02 cs.CL New submission

$\text{Log}_\text{b}$Quant: Quantizing Language Models in Logarithmic Space

Jeremias Bohn, Tizian Dippold, Mahdi Koubaa, Elias R. Wahl, Georg Groh

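Affiliations: School of Computation, Information and Technology, Technical University of Munich

AI summary: Proposes Log$_\text{b}$Quant, a logarithmic quantization method with an adjustable base that adapts to common parameter distributions; at 4-bit precision it outperforms asymmetric linear quantization while delivering a moderate speedup and high memory savings.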

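Details
AI-generated abstract (translated from Chinese)

Quantization has become a valuable tool for reducing the memory requirements of modern language models and speeding up their inference, in particular for making them usable on consumer setups and edge devices. Previous work has focused mainly on uniform quantization codebooks, but such methods are prone to suboptimal representations because of low-frequency, high-magnitude weights. We introduce Log$_\text{b}$Quant, a logarithmic quantization method with an adjustable base that adapts to common parameter distributions. We show that, at tensor-wise granularity, our method outperforms asymmetric linear quantization at 4-bit precision on several performance benchmarks while achieving a moderate speedup and high memory savings, making it suitable for private use on consumer-grade GPUs.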

English abstract

Quantization has become an invaluable tool to reduce the memory requirements and inference latency of modern language models, in particular to make them available for consumer setups and edge devices. While previous work has primarily focused on uniform quantization codebooks, such approaches are prone to suboptimal representations due to low-frequency high-magnitude weights. We introduce Log$_\text{b}$Quant, a novel logarithmic quantization approach with adjustable bases, to adapt to common parameter distributions. We show that our method exhibits superior performance at 4-bit precision on several performance benchmarks compared to asymmetric linear quantization at tensor-wise granularity, while achieving moderate speedup and high memory savings, making it suitable for private use on consumer-grade GPUs.
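
For readers unfamiliar with logarithmic quantization, the sketch below shows the general idea under stated assumptions; it is not the Log$_\text{b}$Quant algorithm, whose details the abstract does not give. It assumes one sign bit plus an integer exponent per weight, a single tensor-wise scale equal to the largest magnitude, and rounding in the log domain. The names `log_quantize` and `log_dequantize` and the example weights are hypothetical.

```python
import math


def log_quantize(weights, base=2.0, bits=4):
    """Encode each weight as (sign, k), meaning sign * scale * base**(-k).

    One bit is assumed to hold the sign, leaving 2**(bits - 1) exponent
    levels. The scale is the largest magnitude in the tensor.
    """
    levels = 2 ** (bits - 1)
    scale = max((abs(w) for w in weights), default=0.0)
    codes = []
    for w in weights:
        sign = -1 if w < 0 else 1
        if w == 0.0:
            k = levels - 1  # zero maps to the smallest representable magnitude
        else:
            # Nearest exponent in the log domain, clamped to the code range.
            k = round(-math.log(abs(w) / scale, base))
            k = min(max(k, 0), levels - 1)
        codes.append((sign, k))
    return codes, scale


def log_dequantize(codes, scale, base=2.0):
    """Map (sign, k) codes back to floating-point weights."""
    return [sign * scale * base ** (-k) for sign, k in codes]


# Hypothetical weight tensor with one large value and several small ones.
weights = [0.8, -0.4, 0.1, 0.05, -0.003]
codes, scale = log_quantize(weights, base=2.0, bits=4)
restored = log_dequantize(codes, scale, base=2.0)
print(codes)
print(restored)
```

In this sketch the representable magnitudes are `scale * base**(-k)` for `k` from 0 to 7. A smaller base such as 1.5 spaces the levels more finely near the maximum at the cost of dynamic range, which is one way an adjustable base could be matched to a weight distribution.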