arXivDaily: arXiv Daily Academic Digest, updated Monday to Friday
2606.03288 2026-06-03 cs.CY cs.AI

AI-Generated Traces for Novice Programmers: Learning Effects and Learner Differences in a Multi-Institutional Study

Yuri Noviello, Naaz Sibia, Anastasiia Birillo, Thomas Overklift Vaupel Klein, Michael Liut, Gosia Migut

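AI summary: This study proposes AI-generated, analogy-based animated traces (GATs) and compares them with textual explanations in a multi-institutional experiment on how novice programmers learn program execution. GATs offer selective benefits for immediate learning, but the effects are context-dependent and short-lived, and they are moderated by learner engagement.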

Details
Abstract

Introductory programming (CS1) courses often struggle to support students' understanding of program execution. While visualizations can make execution processes explicit, their effectiveness depends on design and context, and empirical evidence for AI-generated visualizations remains limited. We propose Generated Animated Traces (GATs), AI-generated, analogy-based, narrated animations that coordinate source code, execution state, and conceptual analogies. We conduct a study at two institutions in CS1 courses (Python, N=961; Java, N=151) comparing GATs to textual explanations. We measure immediate learning performance and experience, as well as end-of-course engagement and exam performance. Results show that GATs can yield selective benefits for immediate learning, but benefits are context-dependent and short-term. We observe that GATs' influence on performance is moderated by learner engagement profiles. This finding underscores the importance of personalized approaches.

2606.03287 2026-06-03 cs.CV

BA-T: An Iterative Transformer for Two-View Bundle Adjustment

Ganlin Zhang, Weirong Chen, Daniel Cremers, Xi Wang

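AI summary: Inspired by classical bundle adjustment, the paper proposes BA-T, a lightweight method that uses an iterative Transformer to perform structured updates in implicit token space, improving the accuracy and multi-view consistency of two-view 3D reconstruction.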

Details
Abstract

Feed-forward models for 3D reconstruction have achieved strong performance using deep cross-view attention to exchange information across images. However, these approaches often depend on heavy decoder stacks and lack a structured mechanism for geometry refinement, resulting in poor multi-view consistency. We address this by drawing inspiration from classical bundle adjustment (BA), which can be viewed as an iterative information propagation process between poses and local geometry. Inspired by BA, we propose BA-T, an iterative Transformer that implements BA-style structured updates as a repeatable layer in implicit token space. Instead of relying on deep attention stacks, BA-T refines predictions based on latent residuals using a single lightweight layer. Experiments demonstrate that BA-T progressively improves pose and reconstruction accuracy across iterations, achieves stronger cross-view consistency than conventional decoders, and matches or surpasses substantially larger models while using only 16% of their decoder parameters. BA-T provides a compact, efficient, and structured alternative to depth-heavy attention, enabling accurate 3D reconstruction within a lightweight architecture. The code will be made publicly available at https://github.com/zhangganlin/BA-T.

2606.03284 2026-06-03 cs.CL

SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding

Peerawat Chomphooyod, Jian Gang Ngui, Yosephine Susanto, Attapol T. Rutherford, Alham Fikri Aji, Sarana Nutanong, Can Udomcharoenchaikit, Peerat Limkonchotiwat

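AI summary: Proposes the SEA-NLI benchmark, which uses natural language inference to evaluate models' understanding of Southeast Asian cultures. Existing models perform poorly, while cultural adaptation and culture-aware prompting improve performance.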

Details
Abstract

Frontier LLMs perform well in Western contexts, but remain poorly tested on underrepresented cultures such as those in Southeast Asia (SEA). Existing NLI benchmarks are largely Western-centric, translation-derived, or monolingual, limiting their ability to measure culturally grounded reasoning. We introduce SEA-NLI, a native, culturally grounded NLI benchmark covering eight SEA countries in English and native regional languages, verified by native speakers. Across 17 encoder and decoder models, we observe low performance from all models, especially for knowledge-intensive categories such as Languages and Science and Technology. Our analysis shows that failure cases mainly stem from missing SEA cultural knowledge: SEA-adapted models and culture-aware prompting improve performance, while CoT prompting offers limited gains.

2606.03279 2026-06-03 cs.LG

A Geometric Lens on Physics-Aligned Data Compression

Aleix Segui, Wesley Armour

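AI summary: Using a local geometric theory, this paper explains the rate-distortion tradeoff that physics-informed loss functions induce in scientific data compression, and proposes an alignment diagnostic based on dominant eigenspace overlap.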

Details
Journal ref
Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
Abstract

In AI for Science, physics-informed losses are increasingly used to train learned compressors for scientific data, but their rate-distortion implications remain poorly understood. At fixed bitrate, these objectives often improve preservation of a target physical observable while degrading standard reconstruction fidelity. We develop a local geometric theory showing that this tradeoff is governed by the interaction of latent-space sensitivities induced by the entropy model, the physical observable, and the distortion metric. At each operating point, these induce preferred directions along which compression noise should be suppressed, yielding an anisotropic error-allocation mechanism. When these directions are misaligned, improving the observable at fixed rate necessarily worsens standard distortion, establishing a fundamental limit on simultaneous preservation. We formalise this through a local tangent-space rate-distortion law and introduce a practical alignment diagnostic based on dominant eigenspace overlap. Experiments across scientific domains test the theory and validate that the alignment diagnostic correlates with observed data- and physics-space trade-offs.
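
The abstract describes the alignment diagnostic only as being based on dominant eigenspace overlap. A minimal sketch of that idea follows, assuming one-dimensional dominant eigenspaces and two hypothetical symmetric positive semidefinite sensitivity matrices (for example, one induced by the physical observable and one by the distortion metric). The function names and the power-iteration solver are illustrative assumptions, not the paper's implementation.

```python
import math


def top_eigenvector(matrix, iters=200):
    """Approximate the dominant eigenvector of a symmetric PSD matrix.

    Uses plain power iteration from a uniform start vector. This is an
    assumed, simple solver; it fails if the start vector is orthogonal
    to the dominant eigenvector.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def eigenspace_overlap(sens_a, sens_b):
    """Squared cosine between the dominant eigenvectors of two matrices.

    Returns a value in [0, 1]: 1 means the preferred directions coincide,
    0 means they are orthogonal (fully misaligned).
    """
    u = top_eigenvector(sens_a)
    v = top_eigenvector(sens_b)
    return sum(a * b for a, b in zip(u, v)) ** 2
```

Under these assumptions, an overlap near 1 would indicate that both objectives want compression noise suppressed along the same direction, while an overlap near 0 corresponds to the misaligned regime in which the abstract says that improving the observable at fixed rate worsens standard distortion.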

2606.03273 2026-06-03 cs.CV cs.AI cs.CL

VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan, Ting Su, Haiying Sun, Jiajun Chai, Xiaohan Wang, Guojun Yin

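AI summary: Proposes the VistaHop benchmark, which uses multi-hop question-answering tasks to evaluate multimodal large reasoning models on iterative image inspection, visual-anchor grounding, and reasoning across evidence chains in Visual DeepSearch. Experiments show that existing models achieve only limited performance.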

Details
Abstract

Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly focus on single-step visual understanding or static image-question answering, offering limited evaluation of iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. In this work, we introduce VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. VistaHop contains 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks that require models to follow evidence chains from visual anchors or fuse information across multiple image-grounded reasoning paths. We further develop VistaArena, a unified evaluation environment that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer validation. Experiments on seven representative MLRMs show that current models remain far from solving VistaHop: the best model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results reveal persistent limitations in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion, highlighting the need for stronger benchmarks and training methods for Visual DeepSearch.

2606.03270 2026-06-03 cs.LG cs.AI

Are Common Substructures Transferable? Riemannian Graph Foundation Model with Neural Vector Bundles

Li Sun, Zhenhao Huang, Yiding Wang, Qin Chen, Pietro Lio, Philip S. Yu

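AI summary: To address the lack of theory on structural transferability in graphs, the paper proposes GAUGE, a Neural Vector Bundle framework grounded in Riemannian geometry that learns intrinsic geometry to represent transferable substructures. Its advantages are validated on zero-shot link prediction and graph isomorphism tasks.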

Details
Comments
Accepted by ICML 2026
Abstract

Foundation models have sparked a revolution via a pretraining-adaptation paradigm, with recent efforts extending this success to graphs. Unlike other modalities, graphs contain rich structural patterns, yet their structural transferability remains poorly understood. Prior studies consider common substructures in the discrete realm, and we are motivated by a fundamental question: Are common substructures transferable? The underlying theory is largely underexplored. In this work, we shift toward learning transferable structures through the lens of functional behavior. Theoretically, we connect transferable substructures to the intrinsic geometry of the representation space. However, characterizing such intrinsic geometry has rarely been touched. Grounded in Riemannian geometry, we develop a graph intrinsic geometry learning framework called Neural Vector Bundle, which enables parsing intrinsic geometry with local coordinates. Building on this, we design GAUGE, a pretrainable neural architecture that constructs the vector bundle and flattens geometrically compatible local coordinates, together with a new Dirichlet loss that also measures the transfer effort. We empirically validate its superior expressiveness in challenging tasks including zero-shot link prediction and graph isomorphism.

2606.03269 2026-06-03 cs.AI

Distilling Answer-Set Programming Rules from LLMs for Neurosymbolic Visual Question Answering

Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch

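AI summary: Proposes a method for distilling answer-set programming rules from large language models to extend the reasoning capabilities of visual question answering systems in an interpretable way. Only a few examples are needed to produce correct rules.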

Details
Comments
Under consideration in Theory and Practice of Logic Programming (TPLP)
Abstract

Visual Question Answering (VQA) is the task of answering questions about images, requiring the integration of multimodal input and reasoning. Modular approaches that incorporate logic-based representations into the reasoning component offer clear advantages over end-to-end trained systems, particularly in terms of interpretability. However, adapting or extending these representations when task requirements change can place a significant burden on developers. To address this challenge, we present an approach for distilling rules from Large Language Models (LLMs). Our method prompts an LLM to extend an initial VQA reasoning theory, expressed as an answer-set program, to meet new requirements of the task. Examples from VQA datasets guide the LLM, validate the results, and help correct erroneous rules by leveraging feedback from the ASP solver. We demonstrate that our approach is effective across diverse VQA datasets. Notably, only a few examples are needed to elicit correct rules from LLMs. Our experiments suggest that rule distillation from LLMs is a promising alternative to traditional data-driven rule learning approaches. Under consideration in Theory and Practice of Logic Programming (TPLP).

2606.03268 2026-06-03 cs.RO

EaDex: A Cross-Embodiment Dexterous Manipulation Framework from Low-Cost Demonstrations

Qian Zhao, Xin Tong, Chengdong Wu, Yang Yang, Yingtian Li

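AI summary: Proposes the EaDex framework, which captures human hand motion with an RGB-D camera to build structured demonstration data and combines it with a contact-reward-based dynamic demonstration annealing mechanism, enabling fast learning and training of multi-embodiment dexterous manipulation from low-cost demonstrations.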

Details
Comments
11 pages, 5 figures, Conference: CoRL 2026, Submitted as Preprint
Abstract

Dexterous manipulation learning has long been hindered by the high costs of data and training, as pure reinforcement learning typically requires large-scale interactive exploration and imitation learning depends on high-quality demonstrations that are expensive to collect. To address this problem, we propose EaDex, a multi-embodiment dexterous manipulation learning framework under low-cost demonstration conditions, which enables rapid generation of demonstration data and consequently reduces training time for efficient dexterous manipulation. At the data level, EaDex captures human hand motions using only a single RGB-D camera and constructs structured demonstration data through MANO-based hand modeling, data normalization, and motion retargeting. At the learning level, we introduce a contact-reward-based dynamic demonstration annealing mechanism, which guides early-stage exploration under demonstration and gradually transitions to autonomous optimization with accumulating contact rewards. Using our custom dataset, we evaluate EaDex on three dexterous hands and three articulated object-opening tasks, covering nine cross-embodiment manipulation settings, achieving a 55.3% relative improvement over the baseline without demonstration annealing. These results validate the effectiveness of the proposed low-cost demonstration pipeline and the dynamic demonstration annealing strategy for dexterous manipulation learning.
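
The abstract does not give the form of the contact-reward-based annealing schedule. As a rough illustration of the general idea only, the sketch below assumes a linear schedule in which the weight on demonstration guidance falls from 1 to 0 as cumulative contact reward approaches a hypothetical `reward_budget`. Both functions and all parameter names are assumptions made for illustration, not EaDex's actual mechanism.

```python
def demo_weight(cumulative_contact_reward, reward_budget):
    """Weight on demonstration guidance, annealed by accumulated contact reward.

    Assumed linear schedule: 1.0 before any contact reward is collected,
    0.0 once the (hypothetical) reward budget has been reached.
    """
    progress = cumulative_contact_reward / reward_budget
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 - progress


def blended_reward(task_reward, imitation_reward, weight):
    """Mix the autonomous task reward with a demonstration-following reward."""
    return (1.0 - weight) * task_reward + weight * imitation_reward
```

With a schedule of this kind, early training is dominated by the imitation term, and the policy moves to optimizing the task reward on its own as contact rewards accumulate.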

2606.03265 2026-06-03 cs.RO

Wheel-Mounted/GNSS Fusion with AI-Aided Position Updates

Gal Versano, Itzik Klein

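AI summary: Proposes a hybrid neural inertial navigation framework that combines wheel-mounted inertial sensors, enforced periodic trajectories, and a neural network, fusing GNSS position updates in an error-state extended Kalman filter to improve positioning accuracy by about 46%.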

Details
Abstract

Accurate and robust localization remains a fundamental challenge for autonomous ground vehicles. In this work, we propose a hybrid neural inertial navigation framework that integrates wheel-mounted inertial sensors, enforced periodic trajectories, and a simple, efficient neural network capable of regressing vehicle displacement with GNSS position updates in an error-state extended Kalman filter. The periodic trajectories increase the inertial signal-to-noise ratio, allowing the network to use only inertial readings to estimate displacement. The approach is validated through real-world experiments using multiple wheel-mounted inertial sensors. Experimental results demonstrate that the proposed method achieves a significant improvement in positioning accuracy, reducing the position root mean squared error by approximately 46% compared to standard wheel-mounted inertial sensor fusion with GNSS updates.
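
To make the fusion step concrete, here is a deliberately simplified sketch: a one-dimensional, total-state Kalman filter in which a displacement estimate (standing in for the network's output) drives the prediction and a GNSS position fix drives the correction. The paper uses an error-state extended Kalman filter, so this textbook scalar form is an assumption chosen for brevity, and all names and noise values are hypothetical.

```python
def predict(position, variance, displacement, process_noise):
    """Propagate the position with an estimated displacement.

    `displacement` stands in for the displacement regressed from
    inertial readings; `process_noise` is its assumed variance.
    """
    return position + displacement, variance + process_noise


def gnss_update(position, variance, gnss_position, gnss_noise):
    """Correct the predicted position with a GNSS position measurement."""
    gain = variance / (variance + gnss_noise)  # scalar Kalman gain
    new_position = position + gain * (gnss_position - position)
    new_variance = (1.0 - gain) * variance
    return new_position, new_variance
```

A usage example under these assumptions: `predict(0.0, 1.0, 1.0, 0.5)` gives a predicted position of 1.0 with variance 1.5, and `gnss_update(1.0, 1.5, 2.0, 0.5)` then pulls the estimate toward the GNSS fix with a gain of 0.75.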

2606.03264 2026-06-03 cs.CV

PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training

Zelun Zhang, Hongen Liu, Suyin Liang, Yubo Zhang, Yiqing Xiang, Jiaxuan Liu, Ting Sun, Manhui Lin, Yue Zhang, Changda Zhou, Tingquan Gao, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

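AI summary: Proposes PaddleOCR-VL-1.6, which uses a region-aware data optimization framework to identify and strengthen the weak regions of its predecessor and combines it with a progressive post-training strategy, reaching a new state-of-the-art score of 96.33% on OmniDocBench v1.6.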

Details
Abstract

We introduce PaddleOCR-VL-1.6, an upgraded compact document parsing model built upon PaddleOCR-VL-1.5. Although PaddleOCR-VL-1.5 establishes a strong 0.9B baseline, its remaining errors concentrate in under-optimized regions where model behavior is unstable, data coverage is sparse, or supervision is unreliable. Rather than expanding the training corpus indiscriminately, PaddleOCR-VL-1.6 introduces a region-aware data optimization framework that identifies weak regions from the previous model, applies targeted enhancement to these regions, and improves the reliability of supervision signals. It further adopts a progressive post-training recipe based on curated data selection and reinforcement learning, pushing model performance to a higher level through staged optimization. PaddleOCR-VL-1.6 achieves a new state-of-the-art score of 96.33% on OmniDocBench v1.6, demonstrates strong competitiveness against top-tier VLMs, and provides a practical post-training recipe for the PaddleOCR-VL series.

2606.03262 2026-06-03 cs.LG cs.NA math.NA

Let There Be Light: Reflection, Refraction and Scattering for Neural Operators

Keke Wu, Yixuan Zhang, Jingrun Chen

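AI summary: Proposes LiNO, a neural operator inspired by light transport that decomposes latent evolution into three mechanisms (reflection, refraction, and scattering), structurally separating local feature modulation from global spatial communication. An efficient scattering variant reduces spatial complexity from quadratic to linear.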

Details
Abstract

Neural operators learn mappings between infinite-dimensional function spaces and provide a data-driven surrogate modeling paradigm for parametric partial differential equations (PDEs). Existing architectures typically obtain expressivity by parameterizing integral kernels in prescribed transform domains or by applying attention-like interactions over discretized spatial points. While these approaches have achieved substantial progress, they often face a persistent trade-off among physical interpretability, nonlocal spatial communication, mesh scalability, and computational cost. We propose a Light-inspired neural operator (LiNO), an operator-learning architecture whose latent evolution is decomposed into three mechanisms motivated by elementary light transport: reflection, refraction, and scattering. Reflection and refraction act as adaptive pointwise transformations in latent feature space, enabling local feature reorientation and anisotropic modulation, whereas scattering performs input-dependent nonlocal propagation over the physical domain. We first formulate scattering as a normalized pairwise kernel with relative positional bias, and then develop an efficient scattering variant that replaces explicit pairwise interactions with positive-feature global propagation and a local diffusion branch, reducing the dominant spatial complexity from quadratic to linear. This yields a structured neural operator that separates local feature modulation from global spatial communication while retaining a modular and interpretable latent evolution.
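
The step from quadratic to linear complexity can be illustrated with the standard positive-feature factorization used in linear attention. The sketch below is an assumption about how such a reduction works in general, not LiNO's code: it uses `elu(x) + 1` as a hypothetical positive feature map and omits the relative positional bias and the local diffusion branch mentioned in the abstract.

```python
import math


def phi(x):
    """Assumed positive feature map, elu(x) + 1, applied elementwise."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pairwise_scatter(queries, keys, values):
    """Normalized pairwise kernel: O(N^2) kernel evaluations for N points."""
    out = []
    for q in queries:
        weights = [dot(phi(q), phi(k)) for k in keys]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                    for d in range(len(values[0]))])
    return out


def linear_scatter(queries, keys, values):
    """Same result via global summaries: one pass over keys, one over queries."""
    m, dv = len(phi(keys[0])), len(values[0])
    summary = [[0.0] * dv for _ in range(m)]  # sum_j phi(k_j) v_j^T
    norm = [0.0] * m                          # sum_j phi(k_j)
    for k, v in zip(keys, values):
        pk = phi(k)
        for a in range(m):
            norm[a] += pk[a]
            for d in range(dv):
                summary[a][d] += pk[a] * v[d]
    out = []
    for q in queries:
        pq = phi(q)
        denom = dot(pq, norm)
        out.append([sum(pq[a] * summary[a][d] for a in range(m)) / denom
                    for d in range(dv)])
    return out
```

Both functions compute the same output because the kernel factorizes as a dot product of positive features, so the sums over keys can be precomputed once and reused for every query.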

2606.03260 2026-06-03 cs.LG cs.AI

EqGINO: Equivariant Geometry-Informed Fourier Neural Operators for 3D PDEs

Sungwon Kim, Juho Song, Seungmin Shin, Guimok Cho, Sangkook Kim, Chanyoung Park

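AI summary: Proposes the EqGINO framework, which enforces isotropy in the spectral domain to achieve exact equivariance to discrete symmetries and to generalize to arbitrary continuous rotations, effectively modeling coordinate-invariant physical laws for 3D PDEs.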

Details
Comments
ICML 2026
Abstract

Deep learning surrogates for 3D Partial Differential Equations (PDEs) often fail to generalize across geometric transformations because they depend heavily on specific coordinate systems. While equivariant networks offer a solution, they typically rely on local operations in the spatial domain, making the global receptive field, which is essential for PDE dynamics, computationally expensive. Conversely, Fourier Neural Operators (FNOs) efficiently capture global interactions, yet establishing 3D equivariance within them remains impractical due to the prohibitive cost of spectral group convolutions. To bridge this gap, we introduce EqGINO, a geometrically robust framework that enforces isotropy in the spectral domain. By design, EqGINO guarantees exact equivariance to the discrete symmetries inherent to the discretized computational domain. Beyond this discrete guarantee, our structural prior enables effective generalization to arbitrary continuous orientations even with a limited number of SE(3)-transformed training samples. Consequently, our method robustly models coordinate-invariant physical laws on complex irregular 3D geometries. Our code is available at https://github.com/sung-won-kim/EqGINO
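
Why spectral isotropy yields exact equivariance to grid symmetries can be checked on a toy case. The sketch below is a 2D, pure-Python illustration under stated assumptions (a small periodic grid, a naive DFT, and a hypothetical filter whose gain depends only on the squared frequency magnitude). It is not the EqGINO architecture, which targets irregular 3D geometries.

```python
import cmath
import math


def dft2(f, inverse=False):
    """Naive 2D DFT of an n x n grid (O(n^4); intended for toy sizes only)."""
    n = len(f)
    sign = 1.0 if inverse else -1.0
    out = [[0j] * n for _ in range(n)]
    for kx in range(n):
        for ky in range(n):
            total = 0j
            for x in range(n):
                for y in range(n):
                    phase = sign * 2.0 * math.pi * (kx * x + ky * y) / n
                    total += f[x][y] * cmath.exp(1j * phase)
            out[kx][ky] = total / (n * n) if inverse else total
    return out


def wrapped(k, n):
    """Map a DFT index to its signed frequency."""
    return k if k <= n // 2 else k - n


def isotropic_filter(f, gain):
    """Scale each Fourier mode by a gain that depends only on |k|^2."""
    n = len(f)
    spectrum = dft2(f)
    for kx in range(n):
        for ky in range(n):
            radius_sq = wrapped(kx, n) ** 2 + wrapped(ky, n) ** 2
            spectrum[kx][ky] *= gain(radius_sq)
    return [[value.real for value in row] for row in dft2(spectrum, inverse=True)]


def rot90(f):
    """Rotate a periodic grid by a quarter turn about the origin."""
    n = len(f)
    return [[f[y][(-x) % n] for y in range(n)] for x in range(n)]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

For any real field on the grid and any gain function of the squared radius, filtering the rotated field and rotating the filtered field agree up to floating-point error, because a quarter-turn rotation permutes Fourier modes without changing their frequency magnitude.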

2606.03259 2026-06-03 cs.CL

Beyond "To whom it may concern": Tailoring Machine Translation to Audience and Intent

Raphael Merx, Ekaterina Vylomova, Trevor Cohn

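AI summary: Through a systematic evaluation across 50 languages, 5 model sizes, and 8 text domains, this paper studies how well large language models can use explicit instructions for purpose-driven machine translation. Instructions substantially improve translation adaptedness, but traditional metrics fail to assess adaptation quality.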

Details
Abstract

Translation quality depends on purpose: the same source text demands different translations depending on audience, tone, and communicative intent. Yet MT models and metrics treat translation as a fixed mapping from source to target. LLMs enable users to explicitly specify purpose alongside source text, yet this capability has not been evaluated at scale. We introduce a systematic evaluation of purpose-driven MT across 50 languages, 5 model sizes and 8 text domains. We find that (1) explicit instructions substantially improve translation adaptedness, with larger gains on informal domains (conversation, social media), for larger model sizes and for higher-resource languages; (2) instructions outperform semantically-matched few-shot examples and paragraph-level context; (3) traditional MT metrics fail to capture adaptation quality, often penalizing adapted translations; (4) when curated instructions are unavailable, models can self-generate them from surrounding document context, closing up to 80% of the adaptedness gap to curated instructions. Our results establish that purpose-adapted MT is a viable and measurable capability of LLMs, while highlighting the need for purpose-aware metrics.

2606.03257 2026-06-03 cs.NE cs.AI cs.LG

PSViT: A Methodology for Structurally Pruning Spiking Vision Transformers

Rachmad Vidya Wicaksana Putra, Achyuta Muthuvelan, Alberto Marchisio, Muhammad Shafique

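AI summary: Proposes PSViT, a method that compresses Spiking Vision Transformers through structured pruning (uniform channel-wise filter pruning and sensitivity-based fine-grained pruning), achieving 22.4% memory savings on ImageNet-1K with an accuracy loss below 3%.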

Details
Comments
8 pages, 7 figures, 3 tables
Abstract

Spiking Vision Transformer (SViT) models are promising low-power ViT models for solving vision-based tasks with state-of-the-art performance. However, their large size limits their deployment on resource-constrained embedded platforms, underscoring the need for model compression. One prominent compression technique is pruning, and state-of-the-art works employ unstructured pruning to compress SViT models. Such techniques require specialized hardware architectures tailored to the resulting sparsity patterns to maximize their efficiency benefits, which limits the scalability of this approach. To address this, we propose PSViT, a novel methodology for structured pruning of SViT models, which makes it possible to efficiently accelerate their inference on existing, widely used computing architectures. To do this, PSViT employs several key steps: uniform channel-wise filter pruning to structurally eliminate non-significant weights, sensitivity analysis to evaluate the impact of channel-wise pruning of individual layers on accuracy and network size, and fine-grained channel-wise pruning based on the sensitivity analysis and the given network architecture. Experimental results show that PSViT obtains a 22.4% memory saving through single-shot pruning while keeping accuracy on ImageNet-1K within 3% (70.3% without fine-tuning and 72.8% with fine-tuning) of the original non-pruned SViT model (73.3%). These results also show that the PSViT methodology advances efforts to enable efficient SViT deployment in resource-constrained applications.
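
Channel-wise structured pruning removes whole channels, so the pruned layer stays dense and runs on ordinary hardware. A minimal sketch of one such step follows. It assumes an L1-norm saliency score and a single pruning ratio per layer; the abstract does not state PSViT's saliency criterion, so both the score and the function name are illustrative assumptions.

```python
def prune_channels(weights, ratio):
    """Drop the lowest-saliency output channels of one layer.

    weights: list of output channels, each a flat list of floats.
    ratio:   fraction of channels to remove (0.0 keeps everything).
    Returns the kept channel indices (in original order) and the
    corresponding dense, smaller weight list.
    """
    n_channels = len(weights)
    n_keep = max(1, n_channels - int(n_channels * ratio))
    # Assumed saliency: L1 norm of each channel's weights.
    scores = [sum(abs(w) for w in channel) for channel in weights]
    ranked = sorted(range(n_channels), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [weights[i] for i in kept]
```

In a uniform scheme, the same `ratio` would be applied to every layer. A sensitivity analysis of the kind the abstract describes could then be approximated by pruning one layer at a time, measuring the accuracy change, and assigning smaller ratios to the layers that hurt accuracy most.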

2606.03254 2026-06-03 cs.CV

FreeStreamGS: Online Feed-forward 3D Gaussian Splatting from Unposed Streaming Inputs

Ruiyang Chen, Feiran Li, Chu Zhou, Zonglin Li, Zhanyu Ma, Heng Guo

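AI summary: Proposes FreeStreamGS, an online feed-forward framework that uses a Decoupled Intrinsic Recovery Head and a Dynamic Point Refinement Offset strategy to achieve efficient, high-quality novel view synthesis from unposed streaming inputs.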

Details
Abstract

Feed-forward 3D Gaussian Splatting (3DGS) allows efficient and high-fidelity novel view synthesis (NVS) from an offline recorded image sequence. However, achieving online NVS from streaming and unposed image inputs remains challenging. Although online feed-forward geometric estimation methods have been proposed for streaming depth and point cloud recovery, they cannot be adapted to NVS due to severe rendering artifacts. This is because NVS demands stricter multi-view consistency in Gaussian scales and pose-geometry alignment; even minor deviations would accumulate over time and visibly degrade rendering quality. To this end, we propose FreeStreamGS, a robust online feed-forward framework for efficient and high-quality NVS. We introduce two key mechanisms: a Decoupled Intrinsic Recovery Head that removes cumulative camera intrinsic bias and prevents scene scale jitter during long-term streaming, and a Dynamic Point Refinement Offset strategy that relaxes rigid unprojection to correct coupled pose-depth drift. Extensive experiments show that FreeStreamGS achieves rendering quality competitive with state-of-the-art offline feed-forward 3DGS methods, despite operating without access to future frames.

2606.03252 2026-06-03 cs.RO cs.AI

AirDreamer: Generalist Drone Navigation with World Models

Zian Liu, Andong Yang, Chunkai Yang, Ruidong An, Chao Gao, Guyue Zhou

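AI summary: Proposes a drone navigation framework that combines a reinforcement learning policy with world-model-based environment understanding and uses a sparse reward function to avoid local optima. It achieves a 5.3% higher success rate than the best baseline in complex unseen environments and supports sim-to-real transfer without tuning.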

Details
Comments
8 pages, 8 figures
Abstract

Navigating a drone in unseen and cluttered environments requires reliable generalization to unseen scene layouts and understanding of environmental structure relative to the robot's capabilities. Previous methods, which assume the same environment configuration, often rely heavily on human-designed perception pipelines and predefined rules to guide the robot toward the target. This process is environment-dependent and generalizes poorly across environments. Inspired by animal navigation behavior, we design a navigation framework that navigates with a reinforcement-learning-based policy on top of a world-model-based environment understanding to overcome these issues. In addition, a sparse reward function without hand-crafted shaping terms is designed to avoid local minima traps and encourage yaw control behaviors. In simulation and on real drones, our method exhibits emergent capabilities for navigating complex, unseen environments and escaping local optima where other methods fail. In challenging maps, it achieves a 5.3% higher navigation success rate than the best baseline. Furthermore, the proposed framework achieves effective sim-to-real transfer without any tuning during deployment. The code will be publicly available.

2606.03251 2026-06-03 cs.AI cs.CV cs.LG eess.IV stat.ML

Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

Gautam Gare, John Galeotti, Michael Mozer, Deva Ramanan, Nan Rosemary Ke

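AI summary: This paper uses causal discovery and feature selection to detect natural experiments in real-world datasets and improves model performance by treating the data as interventional.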

Details
Abstract

In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.

2606.03250 2026-06-03 cs.CL

The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP

词与道:德语医学NLP中领域特定BERT预训练的策略

Henry He, Johann Frei, Raphael Schmitt

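AI summary: This paper presents the ChristBERT model family, compares three strategies (continued pre-training, training from scratch, and domain-specific vocabulary adaptation), achieves the best performance on German medical NLP tasks, and establishes a new benchmark.

Details
Comments
Under revision at BMC Medical Informatics and Decision Making
AI Chinese abstract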

数字医疗产生大量临床文本,可支持AI辅助应用,但德语生物医学语言模型仍受限于较旧的架构或受限的训练数据。我们提出了ChristBERT(临床与健康相关议题及主题调优BERT),这是一个基于德语RoBERTa的领域特定语言模型家族,在包含科学出版物、临床文本、健康相关网络内容和翻译临床资源的13.5GB语料库上训练。为了探究领域适应策略在德语临床NLP中的影响,我们比较了持续预训练、从头训练和领域词汇适应。所得模型在三个医学命名实体识别任务和两个文本分类任务上进行了评估。ChristBERT在五个基准中的四个上持续优于现有的通用和医学德语语言模型,并为德语临床语言建模建立了新的最先进水平。我们的结果表明,最优适应策略取决于任务:在我们的评估中,从头训练对高度专业化的临床文本特别有效,而持续预训练在更常见的医学文本上表现良好。所有模型均已公开发布,以支持德语医学NLP的未来研究和应用。

English abstract

Digital healthcare generates vast amounts of clinical text that can support AI-assisted applications, yet German biomedical language models remain limited by older architectures or restricted training data. We present ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT), a family of domain-specific German RoBERTa-based language models trained on a 13.5GB corpus of scientific publications, clinical texts, health-related web content, and translated clinical resources. To investigate the impact of domain adaptation strategies in German clinical NLP, we compare continued pre-training, training from scratch, and domain-specific vocabulary adaptation. The resulting models are evaluated on three medical named entity recognition tasks and two text classification tasks. ChristBERT consistently outperforms existing general-purpose and medical German language models on four of five benchmarks and establishes a new state of the art for German clinical language modeling. Our results show that the optimal adaptation strategy is task-dependent: in our evaluation, training from scratch is particularly effective for highly specialized clinical texts, whereas continued pre-training performs well on more commonly written medical texts. All models are publicly released to support future research and applications in German medical NLP.

2606.03247 2026-06-03 cs.CL cs.IR

Structures Facilitate Retrieve, Rerank, and Generate

结构促进检索、重排序和生成

Yeqin Zhang, Haomin Fu, Xujie Zhang, Cam-Tu Nguyen

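AI summary: Proposes SF-Re2G, which uses document structure to improve passage representations, build a structure-enhanced reranker, and incorporate subgraph context, improving retrieval, reranking, and generation in document-grounded dialogue systems.

Details
AI Chinese abstract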

文档对话系统(DGDS)利用外部文档中的知识来回答特定领域的用户问题。现有解决方案通常将文档划分为独立的段落进行检索和响应生成。然而,这种方法既没有充分利用文档内的结构信息,也没有为知识选择和响应提供足够的(文档)上下文。本文提出SF-Re2G来系统地解决这些问题。首先,我们通过将段落与同一章节的其他段落进行对比来改进段落表示,从而提高检索性能。其次,构建了一个结构增强的重排序器,利用同一对话轮次的多个基础段落往往位于同一邻近区域的事实。具体来说,来自检索的候选者根据文档结构被分组为子图。重排序器将结合其组信息对候选者重新评分。最后,选中的段落用于生成响应,同时考虑子图上下文以改进生成。在两个DGDS数据集上的实验结果验证了我们的方法在中文和英文上的有效性。

English abstract

Document-grounded dialogue systems (DGDS) utilize knowledge from external documents to answer domain-specific user questions. Existing solutions typically divide documents into independent passages for retrieval and response generation. This approach, however, neither makes good use of structural information within documents nor provides enough document context for knowledge selection and response generation. This paper proposes SF-Re2G to address these issues systematically. First, we improve each passage representation by contrasting it with other passages in the same section, which improves retrieval performance. Second, we build a structure-enhanced reranker that leverages the fact that multiple grounding passages of one dialogue turn tend to be in the same neighborhood. Specifically, retrieval candidates are grouped into subgraphs according to the document structure, and the reranker rescores each candidate using its group information. Finally, the chosen passages are used to generate responses, taking the subgraph context into account for better generation. Experimental results on two DGDS datasets validate our method for both Chinese and English.

2606.03246 2026-06-03 cs.CV

MariData: One-Step Unpaired Image Translation for Maritime Environments

MariData: 海洋环境下的单步非配对图像翻译

Santeri Henriksson, Mehdi Asadi, Amin Majd, Juha Kalliovaara

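AI summary: To address the scarcity of training data for Maritime Autonomous Surface Ships, proposes a one-step unpaired image translation framework based on CycleGAN-turbo that preserves small-object details through zero-convolution skip connections and generates synthetic data with realistic weather and lighting conditions.

Details
AI Chinese abstract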

海洋自主水面船舶(MASS)鲁棒感知系统的发展受到多样化训练数据稀缺的严重制约,尤其是恶劣天气和低光照条件。由于在动态海洋环境中收集配对图像在物理上不可行,通过非配对图像到图像翻译生成合成数据提供了一种关键解决方案。然而,现有生成模型因潜在压缩瓶颈而无法保留小型导航目标的精细结构细节。在本文中,我们介绍了一个使用CycleGAN-turbo(一种单步非配对翻译架构)生成合成海洋数据的框架。通过引入零卷积跳跃连接以绕过变分自编码器(VAE)瓶颈,我们的方法在翻译过程中明确保留了小目标细节(例如远处的船只和海上标志)。我们收集了一个包含7000张海洋图像的数据集,用于训练和评估白天到雾天、白天到日落以及白天到夜晚的域翻译模型。定性评估和变强度推理研究表明,我们的方法有效地合成了逼真的大气条件,同时保持了场景的底层语义结构。白天到雾天和白天到日落模型表现出良好的结构保留,而白天到夜晚模型则突显了语义幻觉的挑战,例如由不平衡训练分布引起的人工海岸灯光生成。最终,这项工作建立了一个高效、结构感知的数据合成管道,直接解决了自主海洋导航中的数据稀缺瓶颈。

English abstract

The development of robust perception systems for Maritime Autonomous Surface Ships (MASS) is heavily constrained by the scarcity of diverse training data, particularly for adverse weather and low-light conditions. Because collecting paired images in dynamic maritime environments is physically impossible, synthetic data generation via unpaired image-to-image translation offers a critical solution. However, existing generative models fail to preserve the fine structural details of small navigational objects due to latent compression bottlenecks. In this paper, we introduce a framework for generating synthetic maritime data using CycleGAN-turbo, a one-step unpaired translation architecture. By incorporating zero-convolution skip connections to bypass the Variational Autoencoder (VAE) bottleneck, our approach explicitly preserves small object details (e.g., distant vessels and sea marks) during translation. We compiled a dataset of 7,000 maritime images to train and evaluate models for Day-to-Foggy, Day-to-Sunset, and Day-to-Night domain translations. Qualitative evaluations and variable-strength inference studies demonstrate that our method effectively synthesizes realistic atmospheric conditions while maintaining the underlying semantic structure of the scene. The Day-to-Foggy and Day-to-Sunset models exhibit strong structural retention, whereas the Day-to-Night model highlights the challenge of semantic hallucination, such as generating artificial coastal lights, induced by unbalanced training distributions. Ultimately, this work establishes an efficient, structure-aware data synthesis pipeline that directly addresses the data scarcity bottleneck in autonomous maritime navigation.

2606.03244 2026-06-03 cs.CL

When Does Complexity Conditioning Help a Frozen Sentence Embedding? A Controlled Study of Per-Sentence and Pair-Level Difficulty Adaptation

何时复杂度调节对冻结句子嵌入有帮助?基于逐句和句子对难度适配的受控研究

Suhwan Hwang

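AI summary: A controlled study of difficulty conditioning for a lightweight adapter attached to a frozen sentence encoder, using per-sentence and pair-level difficulty signals; pair-level gated residuals yield consistent gains on the larger and graded tasks, whereas per-sentence conditioning does not help.

Details
Comments
13 pages, 3 figures, 2 tables
AI Chinese abstract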

一个常见的直觉是句子嵌入应适应输入的难度。我们在受控的多随机种子设置中测试这一直觉:一个轻量后编码器适配器附加到冻结的Qwen3-Embedding-0.6B编码器上,仅访问其最终池化嵌入,并在四个释义和语义相似度任务(PAWS、MRPC、QQP、STS-B)上评估。该想法的朴素形式失败:基于表面的逐句复杂度与冻结基线误差几乎不相关(Pearson约0.05),且相比常数或打乱对照无优势,同时降低饱和基线。即使目标与非循环的句子对难度信号对齐,逐句门控仍无法可靠捕获难度,因为难度主要是句子对的属性,而非单个句子。相比之下,由留出的交叉编码器难度信号门控的小型句子对级残差在较大和分级任务上持续提升,包括STS-B上+0.022 Spearman和QQP上+0.037,同时所有随机种子均锚定于冻结基线。由于这种有用形式操作于句子对而非单个句子,所得模型最好理解为缓存冻结嵌入上的轻量重排序器,而非替代的单向量嵌入;我们不声称达到最先进。我们的贡献是对难度感知适配何时有帮助何时失败的受控说明,以及预测可用余量的预训练诊断。

English abstract

A common intuition is that sentence embeddings should adapt to the difficulty of the input. We test this intuition in a controlled, multi-seed setting: a lightweight post-encoder adapter attaches to a frozen Qwen3-Embedding-0.6B encoder, accessing only its final pooled embedding, and is evaluated on four paraphrase and semantic-similarity tasks (PAWS, MRPC, QQP, STS-B). The naive form of the idea fails: surface-based per-sentence complexity is nearly uncorrelated with frozen-baseline error (Pearson approximately 0.05) and provides no advantage over constant or shuffled controls, while degrading a saturated baseline. Even when the target is aligned to a non-circular pair-difficulty signal, the per-sentence gate still cannot reliably capture difficulty because difficulty is primarily a property of the pair, not the individual sentence. In contrast, a small pair-level residual gated by a held-out cross-encoder difficulty signal yields consistent gains on the larger and graded tasks, including +0.022 Spearman on STS-B and +0.037 on QQP, while remaining anchored to the frozen baseline across all seeds. Because this useful form operates on sentence pairs rather than individual sentences, the resulting model is best understood as a lightweight re-ranker over cached frozen embeddings, not a replacement single-vector embedding; we make no state-of-the-art claim. Our contribution is a controlled account of when difficulty-aware adaptation helps and when it fails, together with a pre-training diagnostic that predicts the available headroom.
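
The pair-level form described above can be sketched as a frozen similarity score plus a gated correction. This is only an illustrative sketch: the sigmoid gate, its parameters `w` and `b`, and the scalar `residual` are assumptions made here for illustration, not details given in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gated_pair_score(e1, e2, difficulty, residual, w=4.0, b=-2.0):
    """Frozen baseline score plus a residual scaled by a difficulty gate.

    `difficulty` stands in for a held-out pair-difficulty signal; the
    sigmoid gate and its parameters are hypothetical choices.
    """
    base = cosine(e1, e2)  # score from the frozen embeddings
    gate = 1.0 / (1.0 + math.exp(-(w * difficulty + b)))
    return base + gate * residual

# With a zero residual the score stays anchored to the frozen baseline.
anchored = gated_pair_score([1.0, 0.0], [1.0, 0.0], difficulty=0.9, residual=0.0)
adjusted = gated_pair_score([1.0, 0.0], [1.0, 0.0], difficulty=0.5, residual=-0.2)
```

Because the correction needs both sentences, a score of this kind behaves like a re-ranker over cached embeddings and does not produce a new single-vector embedding.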

2606.03243 2026-06-03 cs.CV

MemoGen: Can Past Experience Improve Future Text-to-Image Generation?

MemoGen:过去的经验能否改善未来的文本到图像生成?

Wenshuo Chen, Kuimou Yu, Bowen Tian, Jianfei Song, Shaofeng Liang, Haozhe Jia, Kan Cheng, Haosen Li, Kaishen Yuan, Lei Wang, Jiemin Wu, Songning Lai, Yutao Yue

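AI summary: Proposes MemoGen, a framework that uses an agentic evolution layer and reusable experience memory to improve text-to-image generation from past experience without updating the generator, surpassing proprietary systems on knowledge-intensive and reasoning benchmarks.

Details
AI Chinese abstract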

现代文本到图像模型已实现强大的视觉合成,但在提示需要隐式视觉约束、关系推理或外部知识时仍不可靠。现有的检索增强和代理生成方法通过获取外部知识、参考或当前请求的优化提示来缓解此问题,但它们通常将每次生成视为孤立事件,并未系统性地保留过去的成功或失败以供将来使用。在这项工作中,我们探究文本到图像系统能否在不更新底层生成器的情况下,从自身的生成经验中持续改进。我们提出MemoGen,一种无需训练的框架,通过代理进化层增强现有图像生成器。对于每个任务,MemoGen显式推断视觉需求,必要时检索外部证据和参考,将其转化为可执行的生成约束,评估生成结果,并将任务理解、参考选择、视觉反馈、成功策略和失败教训存储为可重用的经验记忆。在进化轮次中,代理检索相关经验以改进类似的未来生成,选择性修复先前失败的案例同时保留成功的案例,从而实现在无需参数更新的情况下进行测试时自我进化。在知识密集和推理导向基准上的广泛实验证明了该范式的有效性:仅经过两轮进化,基于开源Qwen-Image骨干的MemoGen在WISE和Mind-Bench上超越了强大的专有系统,如Nano Banana Pro和GPT-Image-1,表明显式经验记忆可以作为可靠文本到图像生成的强大持续学习信号。

English abstract

Modern text-to-image models have achieved strong visual synthesis, yet remain unreliable when prompts require implicit visual constraints, relational reasoning, or external knowledge. Existing retrieval-augmented and agentic generation methods mitigate this issue by acquiring external knowledge, references, or refined prompts for the current request, yet they typically treat each generation as an isolated episode and do not systematically preserve past successes or failures for future use. In this work, we ask whether a text-to-image system can continually improve from its own generation experience without updating the underlying generator. We propose MemoGen, a training-free framework that augments existing image generators with an agentic evolution layer. For each task, MemoGen explicitly infers visual requirements, retrieves external evidence and references when necessary, translates them into executable generation constraints, evaluates the generated result, and stores task understanding, reference choices, visual feedback, successful strategies, and failure lessons as reusable experience memory. Across evolution rounds, the agent retrieves relevant experience to improve similar future generations, selectively repairing previously failed cases while preserving successful ones, thereby enabling test-time self-evolution without parameter updates. Extensive experiments on knowledge-intensive and reasoning-oriented benchmarks demonstrate the effectiveness of this paradigm: after only two evolution rounds, MemoGen built upon the open-source Qwen-Image backbone surpasses strong proprietary systems such as Nano Banana Pro and GPT-Image-1 on WISE and Mind-Bench, showing that explicit experience memory can serve as a powerful continual learning signal for reliable text-to-image generation.

2606.03241 2026-06-03 cs.CL eess.AS

Benchmarking Speech-to-Speech Translation Models

语音到语音翻译模型基准测试

Alkis Koudounas, Hayato Futami, Quentin Jodelet, Osamu Take, Shinji Watanabe, Emiru Tsunoo

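AI summary: Proposes COMPASS, a unified and reproducible benchmarking framework that integrates 46 metrics for evaluating speech-to-speech translation models, reduces them to 10 per direction via correlation filtering, and shows that domain-specific metrics correlate strongly with human judgment.

Details
Comments
Paper under submission
AI Chinese abstract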

语音到语音翻译(S2ST)已取得快速进展,但离线评估缺乏统一协议:研究报告非重叠的指标子集,阻碍了直接比较。我们引入COMPASS,一个统一且可复现的基准测试框架,集成了跨八个维度的46个指标,并将其部署在来自FLEURS和CVSS的1,248个模型-语言配置上,涵盖级联和端到端架构的十种语言对。架构表现出互补优势:最佳与最差之间的差距在自然度和说话人保留方面超过30%,但在翻译质量上仅相差几个百分点,因此单一指标排名系统地歪曲了系统质量。相关性过滤将46个指标减少到每个方向10个,其中三个轴在X→EN和EN→X上需要不同的指标(例如,TER/UTMOS vs. ChrF++/NISQA-MOS);这些子集保留了排名(Spearman's ρ>0.80),同时将评估时间减少了约2.5倍。在配音、播客和医学领域的人类验证表明,独立的MOS预测器无法预测听众偏好,而顶级领域特定指标与人类判断相关(ρ≥0.90)。我们发布COMPASS作为领域感知S2ST评估的基础。

English abstract

Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across X→EN and EN→X (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's ρ > 0.80) while cutting evaluation time by about 2.5×. Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment (ρ ≥ 0.90). We release COMPASS as a foundation for domain-aware S2ST evaluation.
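
Correlation filtering of a metric set can be sketched as a greedy pass that drops any metric whose rank correlation with an already kept metric is too high. The greedy order, the 0.9 threshold, the metric names, and the scores below are all assumptions for illustration; the abstract does not specify the exact procedure.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def filter_metrics(scores, threshold=0.9):
    """Keep a metric only if |rho| with every kept metric is below threshold."""
    kept = []
    for name, values in scores.items():
        if all(abs(spearman(values, scores[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical scores of five systems under three hypothetical metrics.
scores = {
    "metric_a": [10, 20, 30, 40, 50],
    "metric_b": [12, 25, 31, 45, 49],      # same ranking as metric_a
    "metric_c": [3.1, 4.5, 2.0, 3.9, 2.7],
}
kept = filter_metrics(scores)
```

Here `metric_b` ranks the systems exactly as `metric_a` does, so it is dropped as redundant, while `metric_c` carries different ranking information and is kept.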

2606.03240 2026-06-03 cs.RO

GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models

GeoAlign: VLA模型中的状态引导空间对齐超越语义

Yizhi Chen, Zhanxiang Cao, Xinyi Peng, Yixiao Zheng, Xiaxi Si, Yiheng Li, Liyun Yan, Keqi Zhu, Xueyun Chen, Shengcheng Fu, Tianyue Zhan, Yufei Jia, Jinming Yao, Yan Xie, Kun Wang, Cewu Lu, Yue Gao

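AI summary: Proposes GeoAlign, an architecture that post-trains an RGB geometry branch and queries geometric features with the robot's proprioceptive state, enabling geometry-aware spatial alignment and dynamic affordance selection, with strong results on several benchmarks.

Details
Comments
20 pages, 9 figures, 8 tables, including appendix
AI Chinese abstract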

当前的视觉-语言-动作(VLA)模型通常优化语义基础,而可执行的操纵需要几何感知的空间对齐和动态可供性选择。我们引入了GeoAlign,一种用于VLA策略学习的状态引导空间对齐架构。GeoAlign使用机器人领域的RGB-D监督对RGB几何分支进行后训练,生成RGB衍生的几何增强后训练(GEP)特征用于策略部署。机器人的本体状态查询GEP特征网格,产生紧凑的、相位相关的几何令牌用于动作预测。GeoAlign在LIBERO上达到99.0%,在三个SimplerEnv-Fractal任务上达到85.3%,在八个几何关键的真实世界ALOHA任务上达到78.8%,消融实验证实了几何后训练和本体状态引导查询的价值。

English abstract

Current Vision-Language-Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived Geometry-Enhanced Post-Trained (GEP) features for policy rollout. The robot's proprioceptive state queries the GEP feature grid, producing compact, phase-dependent geometry tokens for action prediction. GeoAlign achieves 99.0% on LIBERO, 85.3% across three SimplerEnv-Fractal tasks, and 78.8% on eight geometry-critical real-world ALOHA tasks, with ablations confirming the value of geometry post-training and proprioceptive-state-guided querying.

2606.03239 2026-06-03 cs.CL

ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents

ARBOR: 通过可复用评分标准缓冲区的在线过程奖励用于搜索智能体

Zheng Liu, Longxiang Zhang, Xintong Wang, Zhiang Xu, Shaoxiong Zhan, Xin Shan, Wen Huang, Tao Dai, Shu-Tao Xia, Chengfu Huo, Liang Ding

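AI summary: To address the lack of process supervision when LLM-based search agents are trained with outcome rewards, proposes ARBOR, which maintains a rubric memory shared across queries, induces query-local drafts from contrastive trajectories and consolidates them into common rubrics, and supplies process-level gradient through sparse pairwise judging, outperforming GRPO and DAPO baselines on four multi-hop QA benchmarks.

Details
AI Chinese abstract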

基于LLM的搜索智能体主要使用结果奖励进行训练,搜索过程本身无监督。这种信号在结果同质组(所有采样轨迹共享相同正确性)中退化,产生零组内优势和无梯度。现有的过程监督要么训练昂贵的验证器,要么生成每个查询的评分标准,这些评分标准在查询间不一致且使用一次后丢弃。我们提出ARBOR(自适应评分标准缓冲区用于在线奖励),一种可复用的过程奖励框架,维护跨查询共享的评分标准记忆。由对比轨迹诱导的查询局部草案被接纳、整合为跨查询通用评分标准,并随策略演化而淘汰。一小部分活跃的通用评分标准通过稀疏成对判断对轨迹评分,所得分数加到基础奖励上,即使在结果奖励一致时也能提供过程级梯度。ARBOR在四个多跳QA基准上持续优于GRPO和DAPO基线,将LLM评判准确率平均提高最多4.2个百分点,并将最多42%的原本零梯度训练组转化为信息丰富的组。

English abstract

LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use. We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework that maintains a rubric memory shared across queries. Query-local drafts induced from contrastive trajectories are admitted, consolidated into cross-query common rubrics, and retired as the policy evolves. A small active subset of common rubrics scores trajectories via sparse pairwise judging, and the resulting scores are added to the base reward, providing process-level gradient even when outcome reward is uniform. ARBOR consistently outperforms GRPO and DAPO baselines on four multi-hop QA benchmarks, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise-zero-gradient training groups into informative ones.
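
The zero-gradient problem on outcome-homogeneous groups can be seen with a small numeric sketch. It assumes a GRPO-style group-normalized advantage, (r - mean) / (std + eps); the process scores and the weight `beta` are hypothetical values chosen for illustration, not quantities from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Outcome-homogeneous group: every sampled trajectory is correct.
outcome = [1.0, 1.0, 1.0, 1.0]
# Hypothetical rubric-based process scores for the same trajectories.
process = [0.2, 0.0, 0.1, 0.3]
beta = 0.5  # assumed weight on the process score

flat = group_advantages(outcome)
shaped = group_advantages([o + beta * p for o, p in zip(outcome, process)])
```

With outcome reward alone every advantage in `flat` is zero, so the group contributes no gradient. Adding the process score makes the rewards differ within the group, so `shaped` ranks the trajectories even though all of them are correct.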

2606.03238 2026-06-03 cs.LG cs.AI

When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

当RLHF失败时:奖励黑客、崩溃和评估者博弈的机制分类

Zelalem Abahana

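AI summary: Through comparative experiments with PPO, DPO, and related methods, this paper proposes a mechanistic taxonomy based on the directions of reward and judge scores, characterizing RLHF failure modes as training dynamics that can be localized and partially predicted.

Details
Comments
20 pages, 8 figures; includes code, artifacts, and live demo
AI Chinese abstract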

从人类反馈中强化学习(RLHF)通过用学习到的可扩展代理替代未明确指定的人类目标,实现了大规模后训练。这种替代同时创建了一个结构化的失败面:优化可以提高学习到的奖励而外部质量下降,降低代理和评估者分数,揭示代理欠对齐,或产生评估者特定的分歧。我们展示了一个紧凑RLHF流程的实证失败模式研究,该流程包括近端策略优化(PPO)、直接偏好优化(DPO)、不确定性惩罚PPO(UP-PPO)、奖励模型不确定性、近似策略漂移、多样性和重复诊断,以及两个外部LLM评估者。我们不将奖励黑客视为单一终端事件,而是使用学习到的奖励、评估者分数和平均评估者分数的方向对检查点之间的匹配转换进行分类。在61个检查点行和1920个行级转换中,激进的PPO具有最高的局部奖励黑客率(14.45%;bootstrap 95% CI: 10.16-18.75),而UP-PPO在相同激进机制下产生较低率(11.33-10.94%)。转换前的逻辑模型以ROC-AUC 0.821预测未来行级奖励黑客,行级分析发现12个设置中有3个存在检查点平均值遗漏的局部奖励黑客。核心结论是方法论上的:RLHF失败不仅是最终模型病理,而且是可分类、可定位和部分可预测的训练动态。

English abstract

Reinforcement learning from human feedback (RLHF) makes large-scale post-training possible by replacing an underspecified human objective with learned and scalable proxies. The same substitution creates a structured failure surface: optimization can raise the learned reward while external quality falls, degrade both proxy and judge scores, reveal proxy under-alignment, or produce evaluator-specific disagreement. We present an empirical failure-mode study of a compact RLHF pipeline with proximal policy optimization (PPO), direct preference optimization (DPO), uncertainty-penalized PPO (UP-PPO), reward-model uncertainty, approximate policy drift, diversity and repetition diagnostics, and two external LLM judges. Rather than treating reward hacking as a single terminal event, we classify matched transitions between checkpoints using the directions of the learned reward, judge scores, and average judge score. Across 61 checkpoint rows and 1920 row-level transitions, aggressive PPO has the highest localized reward-hacking rate (14.45%; bootstrap 95% CI: 10.16-18.75), while UP-PPO yields lower rates in the same aggressive regime (11.33-10.94%). A pre-transition logistic model predicts future row-level reward hacking with ROC-AUC 0.821, and row-level analysis finds localized reward hacking that checkpoint averages miss in 3 of 12 settings. The central conclusion is methodological: RLHF failures are not only final-model pathologies, but training dynamics that can be classified, localized, and partially anticipated.
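
Classifying a checkpoint-to-checkpoint transition by score directions can be sketched as follows. The label names, the order of the checks, and the use of strict sign tests are assumptions made for this sketch; the abstract names the failure modes but does not give the exact decision rules.

```python
def classify_transition(d_reward, d_judge_a, d_judge_b):
    """Label a transition from the signs of the score changes.

    d_reward:  change in the learned reward between two checkpoints
    d_judge_*: change in each external judge's score
    All labels and rules here are illustrative assumptions.
    """
    d_avg = (d_judge_a + d_judge_b) / 2
    if d_judge_a * d_judge_b < 0:
        return "evaluator_disagreement"   # the judges move in opposite directions
    if d_reward > 0 and d_avg < 0:
        return "reward_hacking"           # proxy rises, external quality falls
    if d_reward < 0 and d_avg < 0:
        return "collapse"                 # proxy and judges both degrade
    if d_reward < 0 and d_avg > 0:
        return "proxy_under_alignment"    # judges improve, proxy does not see it
    return "aligned_or_flat"

labels = [
    classify_transition(0.3, -0.2, -0.1),
    classify_transition(-0.3, -0.2, -0.1),
    classify_transition(-0.1, 0.2, 0.1),
    classify_transition(0.1, 0.2, -0.3),
    classify_transition(0.1, 0.2, 0.1),
]
```

Applying a rule of this kind to each row of each transition, and not only to checkpoint averages, is what would let localized reward hacking show up even when the averaged scores look healthy.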

2606.03237 2026-06-03 cs.AI cs.CL cs.CY cs.LG cs.MA

Solipsistic Superintelligence is Unlikely to be Cooperative

唯我论超级智能不太可能合作

Rakshit S Trivedi, Natasha Jaques, Logan Cross, Alexander Sasha Vezhnevets, Joel Z Leibo

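AI summary: This paper argues that superintelligence (an extremely capable task solver) designed with a solipsistic approach is unlikely to be cooperative because it ignores the endogenous non-stationarity induced by deployment, and calls for a non-solipsistic research paradigm that treats interdependence as a core design principle.

Details
Comments
24 pages, 1 figure, Accepted at Proceedings of the 43rd International Conference on Machine Learning, 2026
AI Chinese abstract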

AI的核心挑战正从能力转向共存。AI研究的主导范式侧重于开发将世界视为外生且平稳反馈源的强大智能体。我们认为,源于这种唯我论AI设计方法的超级智能(极端能力的任务求解器)不太可能合作。部署AI系统会引发内生非平稳性,导致训练-测试-部署差距,即历史分布与部署环境相偏离。我们称此为单边优化的自我削弱属性。缩小这一差距需要参与合作的AI:即多个行为体导航其相互依存的均衡选择过程。我们呼吁一种非唯我论的研究范式,将这种相互依存作为核心设计原则,而非将合作视为待解决的任务。这需要构建涉及自适应对手方的动态评估测试平台,将制度视为设计原语,并保留人类能动性作为我们构建系统的结构性特征。

English abstract

AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverge from the deployment context. We refer to this as the self-undermining property of unilateral optimization. Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence. We call for a non-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve. This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.

2606.03236 2026-06-03 cs.AI

Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

先感知后推理:一种用于高效可靠主动移动代理的预推理感知框架

Zhijie Ding, Weinan Hong, Zicheng Zhu, Lei Li, Dezhi Kong, Hao Wang, Peng Zhou, Xuchu Jiang, Jiaming Xu

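AI summary: Proposes the Pre-Reasoning Perception Framework (PRPF), which uses a lightweight Multimodal Proactive Perceptor (MPP) for intervention gating and context compression and activates the Proactive Agent Reasoner (PAR) only when needed, addressing goal misalignment and redundant inference in deciding when and how proactive mobile agents intervene.

Details
AI Chinese abstract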

多模态大语言模型(MLLMs)显著推动了移动代理的发展,但主动移动辅助仍然具有挑战性,因为代理必须在决定如何协助之前确定何时干预。现有系统通常在一个统一的基于MLLM的流水线中实现这两个决策,导致保守的干预过滤与全面的辅助生成之间的目标错位,以及在代理应保持沉默时的冗余推理。为了解决这些限制,我们提出了预推理感知框架(PRPF),这是一个基于先感知后推理的两阶段框架。PRPF引入了一个轻量级的多模态主动感知器(MPP)用于干预门控和上下文压缩,并仅在需要干预时激活主动代理推理器(PAR)。在ProactiveMobile基准上的实验表明,与ProactiveMobile基线相比,PRPF显著降低了误触发率(FTR),同时提高了成功率(SR)和推理效率。

English abstract

Multimodal large language models (MLLMs) have substantially advanced mobile agents, yet proactive mobile assistance remains challenging because agents must decide when to intervene before determining how to assist. Existing systems often implement these two decisions within a unified MLLM-based pipeline, leading to goal misalignment between conservative intervention filtering and comprehensive assistance generation, as well as redundant inference when the agent should remain silent. To address these limitations, we propose the Pre-Reasoning Perception Framework (PRPF), a two-stage framework built on perceiving before reasoning. PRPF introduces a lightweight Multimodal Proactive Perceptor (MPP) for intervention gating and context compression, and activates the Proactive Agent Reasoner (PAR) only when intervention is warranted. Experiments on the ProactiveMobile benchmark show that PRPF substantially reduces false trigger rates (FTR) while improving success rates (SR) and inference efficiency over the ProactiveMobile baseline.
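
The perceive-then-reason control flow can be sketched with two stand-in functions. Everything below is hypothetical: `perceive` and `reason` are stubs defined here in place of the MPP and PAR models, and the keyword rule, truncation length, and threshold are illustrative only.

```python
def perceive(context):
    """Stub perceptor: returns (intervention score, compressed context)."""
    score = 1.0 if "low battery" in context else 0.0
    return score, context[:40]  # crude stand-in for context compression

reasoner_calls = []

def reason(summary):
    """Stub reasoner: records each call so the cost of reasoning is visible."""
    reasoner_calls.append(summary)
    return f"suggest action for: {summary}"

def proactive_step(context, threshold=0.5):
    """Run the cheap gate first; invoke the reasoner only if it fires."""
    score, summary = perceive(context)
    if score < threshold:
        return None  # stay silent and skip the expensive stage
    return reason(summary)

silent = proactive_step("user is reading an article")
active = proactive_step("low battery while navigating")
```

In this sketch the reasoner runs once for two inputs, which is the source of the efficiency gain: the expensive stage is skipped whenever the gate decides the agent should stay silent.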

2606.03234 2026-06-03 cs.LG

Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

正确即力量:对齐验证的隐藏状态增强强化学习推理

Ziyue Wang, Aomufei Yuan, Yongfu Zhu, Shuai Dong, Wenpu Liu, Yiran Yao, Weichu Xie, Yuqi Xu, Caoyuan Ma, Wenqi Shao, Xiaoying Zhang, Nan Duan, Jiaqi Wang

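AI summary: Proposes Hidden-Align, an auxiliary loss that aligns the last-layer hidden states of correct rollouts at the anchor token during RL training, improving mathematical reasoning performance.

Details
Comments
16 pages, 7 figures
AI Chinese abstract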

基于可验证奖励的强化学习(RLVR)已成为提升大语言模型数学推理的主流方法,但当前方法将每个正确rollout简化为单个奖励比特,忽略了其隐藏状态共享的几何结构。研究这一结构发现,在锚点token(答案标记前的位置)处,正确rollout自然收敛,因为它们必须产生相同答案(余弦相似度约0.84),但每个rollout仍保留其独特推理路径的残余方差。鼓励在该点完全对齐,推动模型提取统一的“正确决策”表示,减少对推理路径的敏感性。基于此观察,我们提出Hidden-Align,一种辅助损失函数,在RL训练中对齐正确rollout在锚点token处的最后一层隐藏状态,训练和推理中零开销。在八个数学推理基准上,Hidden-Align在Qwen3-1.7B、4B和14B上分别比DAPO基线平均提升pass@1 3.8、6.2和5.4个百分点,且在所有三种规模上pass@k一致提升,消融实验支持了损失类型、锚点位置、层深度和损失权重的影响。

English abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has become the dominant approach for improving mathematical reasoning in large language models, yet current methods reduce each correct rollout to a single reward bit, ignoring the geometric structure shared among their hidden states. Investigating this structure, we find that at the anchor token (the position immediately before the answer marker), correct rollouts converge naturally because they must produce the same answer (cosine similarity ~0.84), yet each retains residual variance from its unique reasoning path. Encouraging full alignment at this point pushes the model to extract a unified "correct decision" representation, reducing sensitivity to which reasoning path was taken. Based on this observation, we propose Hidden-Align, an auxiliary loss function that aligns the last-layer hidden states of correct rollouts at the anchor token during RL training, with zero overhead in both training and inference. On eight mathematical reasoning benchmarks, Hidden-Align improves average pass@1 over the DAPO baseline by 3.8, 6.2, and 5.4 percentage points on Qwen3-1.7B, 4B, and 14B respectively, with consistent pass@k gains across all three scales, supported by ablations on loss type, anchor position, layer depth, and loss weight.
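
One way to picture an alignment loss over correct rollouts is the mean pairwise cosine distance between their anchor-token hidden states. This is a minimal sketch under that assumption: the abstract says loss type is ablated but does not state which form is used, and the plain-list vectors below stand in for real last-layer hidden states.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(anchor_states, correct):
    """Mean (1 - cosine) over all pairs of correct rollouts.

    anchor_states: one hidden-state vector per rollout, taken at the anchor token
    correct:       one boolean per rollout from the verifier
    Returns 0.0 when fewer than two rollouts are correct.
    """
    kept = [h for h, ok in zip(anchor_states, correct) if ok]
    pairs = list(combinations(kept, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)

# Two correct rollouts already aligned; the incorrect one is ignored.
aligned = alignment_loss([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [True, True, False])
# Two correct rollouts with orthogonal anchor states.
orthogonal = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [True, True])
```

A term like this would be added to the RL objective with a small weight, so that minimizing it pulls the anchor-token states of correct rollouts toward a shared direction.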

2606.03232 2026-06-03 cs.LG cs.AI

GFFMERGE: Efficient Merging of Graph Neural Force Fields and Beyond

GFFMERGE: 图神经力场的高效合并及其扩展

Parth Verma, Parv P. Singh, Vipul Garg, Ishita Thakre, N. M. Anoop Krishnan, Sayan Ranu

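AI summary: Proposes GFFMERGE, a framework for closed-form model merging of graph neural networks via the analytical solution of a convex embedding-alignment problem, recovering performance close to joint training on force field regression with 5-27× speedups.

Details
AI Chinese abstract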

图神经网络(GNN)通过降低计算成本实现接近量子精度的原子模拟,彻底改变了神经力场,但将这些模型适应新化学系统需要对基础模型进行昂贵的重新训练。受视觉和语言处理中模型合并的启发,我们提出了GFFMERGE,这是第一个用于GNN闭式模型合并的原则性框架。我们利用消息传递层的线性结构,将合并问题形式化为具有解析解的凸嵌入对齐问题。通过对GNN模型合并的首次系统基准测试,我们发现为视觉和语言设计的现有方法在力场回归任务上灾难性地失败,而GFFMERGE恢复了接近黄金标准联合训练的性能。在分子(MD17、MD22)、固态(LiPS20)和大规模图基准测试中,GFFMERGE及其通用GNN对应物GNNMERGE实现了5-27倍的加速,同时支持专业模型的模块化组合。值得注意的是,我们的闭式解在微调前就优于所有基线方法,并为更快、数据高效的收敛提供了优越的初始化。

English abstract

Graph Neural Networks (GNNs) have revolutionized Neural Force Fields for atomistic simulations, achieving near-quantum accuracy at reduced cost, yet adapting these models to new chemical systems requires expensive retraining of foundation models. Inspired by model merging in vision and language processing, we introduce GFFMERGE, the first principled framework for closed-form model merging in GNNs. We exploit the linear structure of message-passing layers and formulate merging as a convex embedding-alignment problem with an analytical solution. Through the first systematic benchmarking of model merging for GNNs, we show that existing methods designed for vision and language catastrophically fail on force field regression, while GFFMERGE recovers performance approaching gold standard joint training. Across molecular (MD17, MD22), solid-state (LiPS20), and large-scale graph benchmarks, GFFMERGE and GNNMERGE (its generic GNN counterpart) achieve 5-27× speedups while enabling modular composition of specialized models. Remarkably, our closed-form solution alone outperforms all baseline methods before fine-tuning and provides superior initialization for faster, data-efficient convergence.