arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12994 2026-06-12 cs.LG cs.CE New submission

DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation

Soyoung Yoo, Leekyo Jeong, Jinsu Ra, Dongeon Lee, Sunwoong Yang, Hyogu Jeong, Namwoo Kang

Affiliations: Cho Chun Shik Graduate School of Mobility, Korea Advanced Institute of Science and Technology; Department of Mechanical Engineering, Hanyang University; Narnia Labs

AI summary: Proposes DeepJEB++, a framework that uses 2D latent-space augmentation and foundation models to expand a small seed set of jet engine bracket designs into a large, simulation-labeled 3D dataset, a 40x expansion.

Details
Comments
16 pages, 14 figures. Submitted to ASME Journal of Mechanical Design
English abstract

Data-driven engineering design is constrained by the lack of large-scale 3D datasets that pair geometry with physics-based performance labels. In particular, existing 3D data augmentation techniques have limitations in preserving subtle and diverse geometric variations, and it remains difficult to automate the subsequent simulation-labeling process, where boundary conditions vary depending on the generated geometry. We present DeepJEB++, a foundation-model-driven data-augmentation framework that expands a small seed set of jet engine brackets into a large, simulation-labeled 3D dataset under constrained resources. Our key idea is to augment in the data-rich 2D latent space, then transfer to 3D. In Stage 1, we fine-tune a pretrained 2D latent diffusion model on multi-view renders and synthesize novel views by latent interpolation, retaining manufacturable designs through a vision-language-model (VLM) quality filter. In Stage 2, the validated images are lifted to 3D meshes by a domain-adapted generative foundation model. In Stage 3, an automated pipeline recognizes the load and bolt interfaces on each mesh and assigns finite-element labels -- mass, stress, and displacement -- without manual intervention. We assess augmentation quality along three intrinsic axes: manufacturability, label fidelity against the SimJEB ground truth, and distributional consistency. Starting from fewer than 400 seed designs, DeepJEB++ yields 15,360 simulation-labeled 3D brackets -- a 40x expansion -- using a single GPU per stage. The dataset will be made publicly available to support reproducible engineering-AI research.
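
The latent interpolation in Stage 1 can be pictured with a small sketch. The abstract does not say which interpolation scheme is used; spherical linear interpolation (slerp) is assumed here only because it is a common choice for diffusion latents, and the function name and toy vectors are hypothetical.

```python
import math

def slerp(z0, z1, t):
    """Spherical linear interpolation between two latent vectors (assumed scheme)."""
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-8:
        # Nearly parallel or anti-parallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    s0 = math.sin((1 - t) * omega) / sin_omega
    s1 = math.sin(t * omega) / sin_omega
    return [s0 * a + s1 * b for a, b in zip(z0, z1)]

# Toy latents standing in for two encoded seed renders (hypothetical values).
z_a = [1.0, 0.0]
z_b = [0.0, 1.0]
z_mid = slerp(z_a, z_b, 0.5)
```

For unit-norm inputs the midpoint also has unit norm, which plain linear interpolation would not guarantee.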

2606.12991 2026-06-12 cs.AI New submission

APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization

Yifan Zhao, Lang Qin, Jintai Chen

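Affiliations: The Hong Kong University of Science and Technology (Guangzhou); AI-Peptide Drug Design Joint Laboratory

AI summary: Proposes APCyc, a framework that expands the residue vocabulary, explicitly encodes cyclization sites and linkage types, and applies Bayesian posterior guidance to achieve target-aware de novo cyclic peptide design with joint optimization of multiple physicochemical properties.

Details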
Comments
Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
English abstract

Cyclic peptides represent a promising class of therapeutic compounds in modern drug discovery, often offering improved stability and binding affinity. However, the de novo design of cyclic peptides remains challenging because methods must identify pocket-adaptive cyclization patterns and linkage sites while simultaneously controlling drug-relevant properties. This challenge is particularly pronounced for recent generative models trained predominantly on linear peptide data, which may fail to capture cyclization-specific constraints. To address the limitation, we introduce APCyc, a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple essential physicochemical properties. By using an expanded residue vocabulary and explicitly encoding cyclization-site and linkage-type information, APCyc learns cyclization-aware representations and leverages Bayesian posterior guidance to steer sampling toward cyclic peptides satisfying multiple property objectives. Experimental results demonstrate that our model learns target-dependent cyclization preferences, and enables effective and controllable multi-property optimization for cyclic peptide design. The source code of this paper is available at this https URL.

2606.12990 2026-06-12 cs.LG New submission

Exposure Bias as Epistemic Underidentification in Recursive Forecasting

Riku Green, Zahraa S. Abdallah, Telmo M Silva Filho

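Affiliations: University of Bristol

AI summary: Shows that exposure bias in recursive multi-step forecasting is not only distribution shift but also an epistemic underidentification problem under partial observability, and proposes a provenance-variable-based error decomposition and correction method.

Details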
Comments
Accepted for ICML 2026 EIML workshop
English abstract

Recursive multi-step forecasting is usually framed as distribution shift: models are trained on observed histories but deployed on their own predictions. We show this framing is incomplete by proving that, under partial observability or state truncation, recursive rollout is also an epistemic underidentification problem. Even with deterministic latent dynamics, one-step Bayes supervision identifies behavior only on observed contexts and need not identify the deployed recursive predictor once rollout queries self-generated induced states whose correct local targets are not determined by numeric state alone. We formalize this with induced states $Z$ and provenance variables $P$, and derive a decomposition of induced-state error into teacher-forcing/rollout mismatch, representation--class approximation, and provenance information gaps. Empirically, we show that rollout enters a distinct induced-state regime, that fixed induced states define a distinct local corrective task, and that closed-loop gains arise not only from local adaptation but also from changing the induced states visited during rollout. Using a simple binary provenance encoding, provenance-aware correction can further improve performance, though gains are conditional rather than uniform. These results recast exposure bias as reasoning under self-induced epistemic uncertainty.
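
The distribution-shift framing in the first sentence (trained on observed histories, deployed on its own predictions) can be illustrated with a toy scalar forecaster. This is only a sketch of that baseline framing: the dynamics, the fitted coefficient, and all names are hypothetical, and it does not reproduce the paper's induced-state or provenance analysis.

```python
def one_step(model_coeff, x):
    """One-step prediction of a toy linear forecaster (hypothetical model)."""
    return model_coeff * x

true_coeff, model_coeff = 0.9, 0.8   # assumed dynamics and a slightly wrong fit
truth = [1.0]
for _ in range(10):
    truth.append(true_coeff * truth[-1])

# Teacher forcing: every prediction is conditioned on the observed history.
tf_errors = [abs(one_step(model_coeff, truth[t]) - truth[t + 1]) for t in range(10)]

# Recursive rollout: every prediction is conditioned on the previous prediction.
pred = truth[0]
rollout_errors = []
for t in range(10):
    pred = one_step(model_coeff, pred)
    rollout_errors.append(abs(pred - truth[t + 1]))
```

In this toy setting the two error sequences agree at the first step and the rollout error is larger at every later step, because the rollout visits states the one-step fit never saw.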

2606.12987 2026-06-12 cs.CV cs.AI cs.LG cs.RO New submission

Diffusion Transformer World-Action Model for AV Scene Prediction

Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

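Affiliations: Stanford University

AI summary: Proposes a compact latent world model that uses a Diffusion Transformer (DiT) to predict future scenes, achieving 4.8x better KID on nuScenes along with action controllability (steering ρ = 0.81).

Details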
Comments
10 pages, 9 figures, 2 tables
English abstract

Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $\rho = 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.

2606.12985 2026-06-12 cs.CV New submission

Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

Sathira Silva, Abrham Kahsay Gebreselasie, Muhammad Umer Sheikh, Kartik Kuckreja, Daniel Harari, Muhammad Haris Khan

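Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Weizmann Institute of Science

AI summary: To address the twin ambiguities of when and where a named referent appears in infant-view video, proposes BabyMind, which improves language grounding under sparse, weak supervision through an object-first inductive bias, a mask-based region interface, and prototype-space multiple-instance contrastive learning.

Details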
English abstract

Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at this https URL.

2606.12984 2026-06-12 cs.CL New submission

SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants

Yimin Hu, Mengtao Xu, Hao Guo, Yuheng Song, Xiaoyong Zhu, Bo Zheng

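Affiliations: Alibaba Group

AI summary: Proposes SkillChain, a framework that automates the Skill lifecycle through three stages (Skill creation, route optimization, and body refinement) to resolve multi-intent confusion in e-commerce image assistants, substantially improving response quality and user engagement.

Details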
English abstract

Image-based AI assistants are now deployed at production scale on e-commerce platforms, where a single uploaded image can trigger fundamentally different user intents: product search, style recommendation, visual encyclopedia, or utility tool calls, each demanding its own response format, tool invocation, and domain knowledge. Without per-intent behavioral constraints, LLM-based systems conflate these heterogeneous modes and fall short of domain quality standards, while the breadth and dynamism of the intent space render manual engineering infeasible. To address this, we present SkillChain, which closes the production feedback loop on Skill evolution, automating the lifecycle of Skills through three stages: Skill Creator for bootstrapping from task specs and trajectories, Route Optimizer for routing alignment, and Body Refiner for iterative Skill Body refinement via dual-path LLM-Judge evaluation. Deployed on a production-scale e-commerce image assistant, SkillChain substantially improves aggregate response quality, with the strongest gains on structural compliance and content quality; a one-week online A/B experiment further confirms significant gains in user engagement, content consumption, and long-term retention.

2606.12983 2026-06-12 cs.AI New submission

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin, Yao-Ting Hsieh, Cheng Liang, Hsiang-Yu Tsou, Mu-Chi Chen, Yu-Kai Hung, Shao-Chun Ho, Po-Hsuang Huang, Shih-Hao Hung, H.T. Kung

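Affiliations: National Taiwan University; Academia Sinica; Harvard University

AI summary: Proposes STG, a framework that exploits the inherent structure of hardware designs to generate deterministic testbenches; it runs 720x faster than an iterative LLM-based flow with a higher compilation success rate, higher coverage, and fewer false passes, and also serves data curation and test-time scaling.

Details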
Comments
9 pages, 10 figures
English abstract

Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflows, where large numbers of candidate designs must be verified rapidly and reliably. Existing prompt-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches. As a direct verification tool, STG runs 720x faster than an iterative LLM-based testbench generation flow, attains a higher rate of successful compilation, achieves higher coverage, and reduces false-pass verdicts on incorrect DUTs. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches. As a data curation engine, it is 11x faster than LLM-based filtering on a single CPU core with 127x less energy, and the resulting distilled models provide state-of-the-art performance in our multi-benchmark evaluation. As a test-time scaling oracle, it reduces node count by 14-47%. Our models are available at this https URL.

2606.12981 2026-06-12 cs.CV New submission

Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

Muhammad Shahbaz, Shaurya Agarwal

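Affiliations: Department of Civil, Environmental and Construction Engineering, University of Central Florida

AI summary: Presents a BEV-space detector that fuses roadside cameras with an infrastructure-plus-vehicle point cloud, using a CenterPoint-style head and IoU re-ranking; it reaches 0.85 mAP on the public test split of the DriveX 2026 challenge, and the authors analyze how overlap between the train/validation and test splits affects the score.

Details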
English abstract

We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird's-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.
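
The generalized IoU regression loss mentioned above can be sketched for the simplest case. The abstract does not describe the box parameterization, so this example assumes axis-aligned 2D boxes with positive area in the bird's-eye-view plane; the detector itself may well use rotated or 3D boxes, and the function name is hypothetical.

```python
def giou_loss(box_a, box_b):
    """1 - GIoU for axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou

identical = giou_loss((0, 0, 2, 2), (0, 0, 2, 2))
disjoint = giou_loss((0, 0, 1, 1), (2, 0, 3, 1))
```

Unlike plain IoU, the loss keeps growing as disjoint boxes move apart, so it still provides a training signal when a prediction misses its target entirely.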

2606.12979 2026-06-12 cs.LG New submission

EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models

Vedant Pandya

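Affiliations: School of Artificial Intelligence and Data Engineering (SAIDE), Indian Institute of Technology Jodhpur

AI summary: Proposes EPM-JEPA, which modulates the predictor at the weight level via LoRA to handle test-time dynamics shift; experiments show it outperforms a no-memory baseline, though the effect is weaker than expected, and reveal three independent dynamical processes.

Details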
Comments
16 pages, 5 figures, 9 tables, 5 code listings. Pre-registered experimental study with mechanism analysis
English abstract

JEPA-family world models use a static predictor whose weights do not adapt when test-time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand-side injection, where a compressed experience representation is added as a residual to the predictor's hidden state (EI-JEPA), and operator-side modulation, where the same representation generates low-rank weight deltas via LoRA applied to the predictor's weights (EPM-JEPA). On a pre-registered comparison (Moving MNIST, gravity shift), EPM-JEPA (D_shift^{n=50} = 0.7848 +/- 0.0078, three seeds) differs from EI-JEPA (0.8238) by delta = 4.74% - Outcome C: a null result - by our stated criterion, a valid outcome. As a secondary, non-pre-registered observation, EPM-JEPA improves 1.90% over a no-memory baseline (0.8000), consistently across seeds, while EI-JEPA underperforms the baseline, indicating the benefit is specific to weight-level modulation. Our primary contribution is a mechanism analysis: the D_shift^{n=50} trajectory reflects three independent dynamical processes - buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021 - rather than convergence to equilibrium. These findings motivate PEM-JEPA, a physics-grounded successor addressing this dynamical-peak limitation.
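
Operator-side modulation adds a low-rank delta to the predictor's weights. The sketch below shows only the generic LoRA form, W + scale * (B A). In EPM-JEPA the factors are generated from the experience representation; here they are fixed, hypothetical values, and the helper names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_update(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a), the generic LoRA form."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Hypothetical 3x3 predictor weight and a rank-1 delta (B is 3x1, A is 1x3).
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_b = [[0.5], [0.0], [-0.5]]
lora_a = [[1.0, 2.0, 0.0]]
adapted = lora_update(weight, lora_b, lora_a)
```

The rank-1 delta needs 6 numbers instead of 9 for a full 3x3 update; the saving grows quickly with layer width, which is the usual motivation for the low-rank form.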

2606.12978 2026-06-12 cs.RO cs.CV eess.SY New submission

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür, Melkior Ornik

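Affiliations: University of Illinois Urbana-Champaign

AI summary: Identifies a trajectory-level vulnerability in VLA models: adversarial prompts that appear to preserve the original instruction can redirect the robot's final physical outcome. The paper proposes a command-preserving trajectory redirection threat model and an on-policy prompt search method.

Details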
English abstract

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: this https URL

2606.12977 2026-06-12 cs.CV cs.AI cs.CR cs.LG New submission

Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

Jianwei Fei, Yunshu Dai, Zhihua Xia, Xiaochun Cao, Jiantao Zhou, Alessandro Piva, Benedetta Tondi

Affiliations: University of Florence; Shenzhen Campus of Sun Yat-sen University; College of Cyber Security, Jinan University; State Key Laboratory of Internet of Things for Smart City, University of Macau; Department of Computer and Information Science, Faculty of Science and Technology, University of Macau; University of Siena

AI summary: Because existing fingerprinting for generative text-to-image models lacks robustness to collusion attacks, proposes an encoding method based on a personalized normalization module, plus an anti-collusion mechanism built on lossless function-invariant parameter transformations, achieving high-fidelity, highly robust fingerprinting that is the first to proactively resist collusion attacks.

Details
English abstract

Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.

2606.12976 2026-06-12 cs.AI New submission

A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning

Akbar Erkinov, Nurmukhammad Abdurasulov

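Affiliations: Independent Researchers, San Francisco, CA, USA

AI summary: Presents a forum system with an integrated image-to-LaTeX conversion pipeline that removes the friction of sharing mathematical content, supports desktop and mobile clients, and yields a community-validated dataset of mathematical problems for training AI reasoning.

Details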
Comments
11 pages, 3 figures
English abstract

Sharing mathematical content in online forums remains a significant friction point for students and educators: writing raw LaTeX is error-prone, standalone optical character recognition tools require platform switching, and current forum software offers no integrated path from a photograph of a formula to a rendered post. We present a unified system that eliminates this friction by embedding an image-to-LaTeX conversion pipeline directly inside a forum posting interface. A user uploads or captures an image of a mathematical expression; the system routes it through the Mathpix OCR API, detects whether the returned output is LaTeX or plain text containing inline math, applies the appropriate delimiter normalisation, and renders a live preview in either LaTeX or Markdown mode before the post is committed to the database. The architecture is organized in three loosely coupled layers: image processing, rendering, and storage, and supports both desktop and mobile clients. A provisional US patent application has been filed covering the core methods. We describe the full system design, each component in detail, the data schema, and the key technical innovations, and we position the work against existing standalone tools and forum platforms to demonstrate the practical gap it closes. Beyond immediate usability, we argue that a deployed platform of this kind constitutes a continuously growing, community-validated dataset of mathematical problems and step-by-step solutions, a resource that can be used to train and benchmark AI systems for accurate mathematical reasoning.

2606.12971 2026-06-12 cs.LG New submission

Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations

Tahiya Chowdhury

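Affiliations: Department of Computer Science, Colby College

AI summary: Studies the prediction of perceived cognitive load from speech and interaction-dynamics features in natural collaborative conversations, finding that conversational interaction (such as turn-taking) effectively predicts cognitive load dimensions including time pressure and mental work.

Details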
Comments
Accepted to Interspeech 2026
English abstract

Estimating cognitive load from speech has largely been studied in controlled laboratory settings, with limited understanding of its reliability in natural collaborative conversations. We investigate whether speech and interaction dynamics predict perceived cognitive load during dyadic conversations. We analyze audio from 53 dyads performing nine collaborative tasks and extract static acoustic, dynamic, and interaction features to train a two-head Gated Recurrent Unit encoder to predict cognitive load scores. Results show conversational interaction provides useful signals for predicting cognitive load related to time pressure, mental work, effort, and task performance. Temporal demand is associated with turn-taking dynamics such as overlap and speaker switch, while mental demand is linked to imbalanced participation between speakers. These findings highlight the importance of task structure and conversational interaction for modeling cognitive load in natural collaborative settings.
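
The turn-taking dynamics named above (overlap and speaker switch) can be computed from diarized segments. This is a minimal sketch under stated assumptions: the segment format, the feature names, and the rule of comparing only consecutive segments are hypothetical and not taken from the paper.

```python
def turn_taking_features(segments):
    """Count speaker switches and total overlap from (speaker, start, end) segments."""
    ordered = sorted(segments, key=lambda seg: seg[1])
    switches = 0
    overlap = 0.0
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[0] != prev[0]:
            switches += 1
            # Overlap: the new speaker starts before the previous one stops.
            overlap += max(0.0, min(prev[2], cur[2]) - cur[1])
    return {"speaker_switches": switches, "overlap_seconds": overlap}

# Hypothetical diarized segments for one dyad: (speaker, start_s, end_s).
segments = [("A", 0.0, 4.0), ("B", 3.5, 6.0), ("A", 6.2, 9.0), ("A", 9.5, 11.0)]
features = turn_taking_features(segments)
```

Here speaker B starts half a second before A finishes, giving two switches and 0.5 seconds of overlap for the dyad.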

2606.12966 2026-06-12 cs.LG cs.NE New submission

Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers

Achyuthan Sivasankar

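Affiliations: New York University

AI summary: Proposes the Frequency Synchronization Degree (FSD) metric, finds that on modular arithmetic tasks it synchronizes 500-3,000 steps before grokking, and verifies through weight-decay control that the gap is a regularization phenomenon, providing causal evidence.

Details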
Comments
16 pages, 6 figures, 10 tables
English abstract

Grokking -- where a transformer on modular arithmetic suddenly transitions from near-chance to near-perfect validation accuracy -- is attributed to a Fourier circuit, but its timing, causal structure, and controllability remain poorly understood. We introduce the Frequency Synchronization Degree (FSD), a normalised, permutation-tested metric for Fourier circuit synchronisation requiring no prior circuit knowledge. Across nine modular addition configurations (primes p in {53, 71, 97, 113, 131}, three seeds), FSD synchronises 500-3,000 steps before grokking (mean lead +1,722 steps; all nine positive, sign-test p~0.004), and precedes a restricted-logit loss baseline (Nanda et al.'s excluded loss) in all nine cases, making it the earliest available predictor. We provide direct causal evidence that the inter-phase gap is a regularisation phenomenon: forking training at the FSD-ceiling step and varying weight decay lambda produces strictly monotone earlier grokking, with Delta_t proportional to 1/lambda. This law replicates across three primes (p in {53,97,131}; R^2=1.00 and R^2=0.99 for two clean cases), captured as Delta_t ~ C/lambda, consistent with (1/lambda)*log(||W_mem||/tau). Architecture ablations show an attention-only model groks with a strong FSD precursor; an MLP-only model never groks; a single-layer model's FSD lags, confirming the precursor is a multi-block circuit property.
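
Two of the quantitative claims above can be checked or illustrated with a few lines of arithmetic. The sketch assumes an exact two-sided sign test (the abstract does not state the sidedness) and fits the Delta_t ~ C/lambda law by least squares on made-up data; the function names and the example values are hypothetical.

```python
from math import comb

def sign_test_two_sided(n_positive, n_total):
    """Exact two-sided sign test p-value under a fair-coin null."""
    k = max(n_positive, n_total - n_positive)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)

p_value = sign_test_two_sided(9, 9)   # all nine leads positive

def fit_inverse_law(lambdas, gaps):
    """Least-squares C for gap = C / lambda (no intercept)."""
    xs = [1.0 / lam for lam in lambdas]
    return sum(x * g for x, g in zip(xs, gaps)) / sum(x * x for x in xs)

# Hypothetical weight-decay values and gaps, generated with C = 100 for illustration.
c_hat = fit_inverse_law([0.5, 1.0, 2.0], [200.0, 100.0, 50.0])
```

With nine positive leads out of nine, the two-sided value is 2/512, about 0.0039, which is consistent with the reported p~0.004.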

2606.12965 2026-06-12 cs.RO 新提交

EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment

EmbodiSteer: 用关节空间引导的具身无关视觉运动策略实现零样本跨具身部署

Shihefeng Wang, Kangchen Lv, Mingrui Yu, Xiang Li

发表机构 * Department of Automation, Tsinghua University(清华大学自动化系) Beijing Key Laboratory of Embodied Intelligence Systems(北京具身智能系统重点实验室) Institute for Embodied Intelligence and Robotics, Tsinghua University(清华大学具身智能与机器人研究所)

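AI summary: Proposes the EmbodiSteer framework, which lifts inference-time diffusion sampling into the target robot's joint space via forward kinematics and Jacobian updates and adds whole-body collision-aware guidance, enabling zero-shot, embodiment-aware deployment that substantially reduces collision rates and improves task success on simulated and physical robots.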

详情
Comments
The first two authors contribute equally
AI中文摘要

可扩展的机器人模仿学习依赖于来自不同机器人的大规模异构数据或无身体数据,使得笛卡尔末端执行器动作成为具身无关策略学习的关键接口。然而,仅末端执行器的抽象使得笛卡尔策略对部署的机器人身体无感知,导致其在全身碰撞避免等机器人特定约束下脆弱。为克服这一限制,我们提出EmbodiSteer,一种无需训练的框架,将具身无关的视觉运动策略引导至零样本、具身感知的部署。EmbodiSteer将策略学习保持在笛卡尔空间,同时通过前向运动学和基于雅可比的更新,高效地将推理时的扩散采样提升到目标机器人的关节空间。在每个去噪步骤后,通过关节轨迹上的全身碰撞感知引导,机械臂可以在保持学习到的末端执行器行为的同时避开碰撞。与仅笛卡尔执行相比,EmbodiSteer在9个模拟机器人上将碰撞率降低46.1%,任务成功率提高28.5%,并在高度受限场景下的两个物理机器人上实现碰撞率降低90.0%,成功率提高36.7%。我们的项目页面位于此https URL。

英文摘要

Scalable robot imitation learning relies on large-scale heterogeneous data from diverse robots or body-free data, making Cartesian end-effector actions a key interface for embodiment-agnostic policy learning. However, end-effector-only abstraction leaves Cartesian policies unaware of the deployed robot body, making them brittle under robot-specific constraints such as whole-body collision avoidance. To overcome this limitation, we present EmbodiSteer, a training-free framework that steers embodiment-agnostic visuomotor policies toward zero-shot, embodiment-aware deployment. EmbodiSteer keeps policy learning in Cartesian space while efficiently lifting inference-time diffusion sampling into the target robot's joint space via forward kinematics and Jacobian-based updates. With whole-body collision-aware guidance over joint trajectories after each denoising step, the arm can be steered away from collisions while preserving learned end-effector behavior. Compared with Cartesian-only execution, EmbodiSteer reduces collision rate by 46.1% and improves task success rate by 28.5% across 9 simulated robots, and further achieves 90.0% collision rate reduction and 36.7% success rate increase on two physical robots in highly constrained scenarios. Our project page is at this https URL.

2606.12958 2026-06-12 cs.CV 新提交

YOLO-AMC: An Improved YOLO Architecture with Attention Mechanisms for Building Crack Detection

YOLO-AMC:一种改进的带有注意力机制的YOLO架构用于建筑裂缝检测

Ching-Yu Tsai, Chia-Min Lin, Chih-Hsiang Yang, Yung-Che Wang, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University(淡江大学电机与计算机工程系)

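AI summary: Proposes YOLO-AMC, which removes C2PSA from YOLOv11 and introduces attention mechanisms such as GAM, Res-CBAM, and SA to improve crack detection, reaching mAP@0.5 of 0.9917 on the test set at 110.95 FPS, balancing accuracy and deployment efficiency.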

详情
Comments
14 pages, 8 tables, 6 figures. Expanded version of IET ICETA 2025 conference paper
AI中文摘要

裂缝检测在基础设施检查和结构健康监测(SHM)中起着重要作用。然而,裂缝通常表现为薄、低对比度的结构,且容易受到背景噪声的影响,给现有目标检测模型带来了挑战。本研究提出了一种改进的基于YOLO的架构,集成了注意力机制,称为YOLO-AMC(用于裂缝检测的YOLO注意力机制),以增强自动裂缝检测性能。基于YOLOv11,移除了原始的C2PSA模块,并在Neck的多尺度特征融合层中引入了多种注意力机制,包括全局注意力机制(GAM)、残差卷积块注意力模块(Res-CBAM)和Shuffle Attention(SA),以加强跨尺度特征整合。实验结果表明,YOLO-AMC在多个评估指标上始终优于基线模型YOLOv11n和YOLOv8n。在评估的注意力模块中,GAM取得了最佳检测性能,在测试数据集上获得了mAP@0.5 = 0.9917和mAP@0.5:0.95 = 0.9506,高于YOLOv11(0.9833 / 0.9112)和YOLOv8(0.9707 / 0.8921)。此外,在保持7.6 GFLOPs计算复杂度的同时,所提出的模型在NVIDIA RTX 4090平台上达到了110.95 FPS,在Raspberry Pi 5边缘设备上约为5 FPS,展示了准确性与部署效率之间的良好权衡。本研究的实现代码可在GitHub上获取,网址为:this https URL。

英文摘要

Crack detection plays an important role in infrastructure inspection and Structural Health Monitoring (SHM). However, cracks typically appear as thin, low-contrast structures and are easily affected by background noise, posing challenges for existing object detection models. This study proposes an improved YOLO-based architecture with integrated attention mechanisms, termed YOLO-AMC (YOLO with Attention Mechanisms for Crack Detection), to enhance automated crack detection performance. Based on YOLOv11, the original C2PSA module is removed, and multiple attention mechanisms, including Global Attention Mechanism (GAM), Residual Convolutional Block Attention Module (Res-CBAM), and Shuffle Attention (SA), are introduced into the multi-scale feature fusion layers of the Neck to strengthen cross-scale feature integration. Experimental results demonstrate that YOLO-AMC consistently outperforms baseline models YOLOv11n and YOLOv8n across multiple evaluation metrics. Among the evaluated attention modules, GAM achieves the best detection performance, obtaining mAP@0.5 = 0.9917 and mAP@0.5:0.95 = 0.9506 on the test dataset, which are higher than those of YOLOv11 (0.9833 / 0.9112) and YOLOv8 (0.9707 / 0.8921). Furthermore, while maintaining a computational complexity of 7.6 GFLOPs, the proposed model achieves 110.95 FPS on an NVIDIA RTX 4090 platform and approximately 5 FPS on a Raspberry Pi 5 edge device, demonstrating a favorable trade-off between accuracy and deployment efficiency. The implementation code for this study is available on GitHub at this https URL.

2606.12956 2026-06-12 cs.RO 新提交

SERF: Spatiotemporal Environment and Robot Feature Map for Long-Horizon Mobile Manipulation

SERF:面向长时域移动操作任务的时空环境与机器人特征地图

Sunghwan Kim, Byeonghyun Pak, Kehan Long, Yulun Tian, Nikolay Atanasov

发表机构 * UC San Diego(加州大学圣地亚哥分校) Agency for Defense Development(国防发展局) SceniX Inc.(SceniX公司) University of Michigan(密歇根大学)

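AI summary: Proposes the SERF map, which represents the environment and the robot body as neural points in a shared latent space, updates them online, and feeds the map as state input to a VLA model, improving reasoning in long-horizon mobile manipulation and outperforming image-only baselines on BEHAVIOR-1K.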

详情
Comments
Project page: this https URL
AI中文摘要

长时域机器人移动操作需要对定位、环境变化和任务进度进行持续推理,而这些都难以仅从图像观测中推断。在本文中,我们表明,将移动操作策略条件化于一个时空特征地图可以改善长时域上的推理。该地图将环境和铰接机器人身体表示为共享潜空间中的神经点,并从自我中心观测和本体感受状态在线更新。我们使用基于对象的刚性跟踪更新环境神经点,并使用正向运动学更新机器人神经点。通过从多个参考帧和空间尺度提取地图标记,我们将时空环境与机器人特征(SERF)地图作为状态输入到视觉-语言-动作(VLA)模型中,为策略提供局部和全局上下文。我们在BEHAVIOR-1K(一个家庭环境中的长时域移动操作基准)上展示了SERF。实验表明,SERF VLA策略优于纯图像基线,通过遵循更直接的轨迹更快地达到子目标,提高了对场景配置变化的鲁棒性,并能从物体掉落失败中恢复。

英文摘要

Long-horizon robot mobile manipulation requires continual reasoning about localization, environment changes, and task progress, all of which are challenging to infer from image observations alone. In this paper, we show that conditioning a mobile manipulation policy on a spatiotemporal feature map improves reasoning over long horizons. The map represents the environment and the articulated robot body as neural points in a shared latent space and is updated online from egocentric observations and proprioceptive state. We update the environment neural points using object-level rigid tracking and the robot neural points using forward kinematics. We use our spatiotemporal environment and robot feature (SERF) map as a state input to a vision-language-action (VLA) model by extracting map tokens from multiple reference frames and spatial scales, providing the policy with both local and global context. We demonstrate SERF on BEHAVIOR-1K, a benchmark for long-horizon mobile manipulation in household environments. Experiments show that the SERF VLA policy outperforms image-only baselines, reaches subgoals faster by following more direct trajectories, improves robustness to scene-configuration shifts, and recovers from object-drop failures.

2606.12953 2026-06-12 cs.AI cs.CV cs.LG eess.IV 新提交

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

OpenMedQ:面向医学视觉语言模型的广泛开放预训练

Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

发表机构 * Stanford University(斯坦福大学) Stanford University School of Medicine(斯坦福大学医学院) Ghent University(根特大学)

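AI summary: Proposes OpenMedQ, a medical vision-language model pretrained on 14 datasets (~3.35M samples) that reaches BLEU-1 of 75.9 on PathVQA, surpassing the 562B-parameter Med-PaLM M, and obtains the highest average macro-F1 (0.757) on 8 unseen medical classification tasks.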

详情
Comments
Medical Imaging with Deep Learning (MIDL) 2026, Short Paper Track
AI中文摘要

我们提出OpenMedQ,一个在迄今为止最广泛的完全开放医学混合数据集上预训练的医学视觉语言模型:包含14个数据集,总计约335万预训练样本,涵盖病理学、放射学、显微镜和纯文本临床问答。OpenMedQ在PathVQA上达到最先进的BLEU-1(75.9),击败了参数多达562B(约大80倍)的Med-PaLM M变体,并在VQA-MED上匹配了最佳报告的BLEU-1(64.5)。其视觉编码器在相同的下游配方下迁移到8个未见过的医学分类基准,获得了最高的平均macro-F1(0.757),优于BiomedCLIP(0.745)、PMC-CLIP(0.745)、PubMedCLIP(0.746)和从头训练的基线(0.616)。我们公开了代码,并提供了一个交互式演示,作为社区的可复现基线。

英文摘要

We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757), ahead of BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and a publicly available interactive demo as a reproducible baseline for the community.

2606.12945 2026-06-12 cs.AI 新提交

Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

学习该记住什么:一种基于认知的多因素记忆价值模型

Zhibao Chen, Qian Cheng

发表机构 * Huatai Securities(华泰证券) OneBeget.com

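AI summary: Addresses memory management for long-running LLM agents with a multi-factor memory value function grounded in cognitive psychology, whose weights are learned by gradient-free optimization and which jointly controls encoding depth, forget risk, and retrieval rank; on LongMemEval it clearly outperforms single-factor and recency policies.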

详情
Comments
11 pages, 3 figures
AI中文摘要

长期运行的LLM代理积累的交互历史远超任何上下文窗口,迫使面临一个持续决策:在固定记忆预算下,哪些内容应深度编码、哪些应遗忘、哪些应检索。生产系统采用语义相似性或近因性——两者对于遗忘决策都是错误指定的,因为遗忘决策是在未来查询未知的整合时刻做出的。我们提出一个多因素记忆价值函数 V(m)=∑_i w_i f_i(m),涵盖七个可解释因素(情感强度、目标相关性、价值对齐、自我/用户相关性、任务效用、可靠性和使用历史),这些因素来自认知心理学,其权重通过无梯度优化器从下游目标中学习,并且该单一标量统一控制编码深度、遗忘风险和检索排名。我们提出一个方法论观点:在LongMemEval上,针对保留的评估问题对目标相关性进行评分,使得黄金证据保留率达到≈0.98——这衡量的是检索,而非遗忘。在现实盲态模式下,学习到的多因素价值在479个可用案例中保留了0.770±0.011的黄金证据,而均匀权重为0.657,最佳单一因素为0.518,近因性为0.368;每对差距的95%自助法置信区间均高于零,且基于相同因素的神经网络与线性模型持平。学习到的权重是可解释的——可靠性、情感强度和自我/用户相关性占主导,而查询时的目标相似性在遗忘决策中被正确降权。一个带有植入混淆的受控合成任务证实,学习器恢复了分离性权重(保留率1.00),而均匀权重失败(0.62)。该基础架构是开源的;所有实验在单CPU上运行,无需API调用。

英文摘要

Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency -- both mis-specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi-factor memory value function V(m) = Σ_i w_i f_i(m) over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient-free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held-out evaluation question saturates gold-evidence retention at ≈ 0.98 -- this measures retrieval, not forgetting. In the realistic blind regime, a learned multi-factor value retains 0.770 ± 0.011 of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable -- reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open-source; all experiments run on a single CPU with no API calls.
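The linear value function in this abstract can be illustrated with a short sketch. Everything below is an assumption made for illustration: the `Memory` class, the factor scores in [0, 1], the top-k consolidation policy, and the weights, which are hand-picked and are not the learned weights reported by the authors.

```python
from dataclasses import dataclass

# The seven factor names follow the abstract; the identifiers are hypothetical.
FACTORS = (
    "emotional_intensity", "goal_relevance", "value_alignment",
    "self_user_relevance", "task_utility", "reliability", "usage_history",
)

@dataclass
class Memory:
    text: str
    factors: dict  # factor name -> score, assumed to lie in [0, 1]

def value(memory, weights):
    """V(m) = sum_i w_i * f_i(m); missing factors count as 0."""
    return sum(weights[name] * memory.factors.get(name, 0.0) for name in FACTORS)

def consolidate(memories, weights, budget):
    """Keep the `budget` highest-value memories; the rest are forgotten."""
    ranked = sorted(memories, key=lambda m: value(m, weights), reverse=True)
    return ranked[:budget]

# Illustrative weights only (not the learned weights from the paper).
weights = {name: 0.0 for name in FACTORS}
weights.update(reliability=0.5, emotional_intensity=0.3, self_user_relevance=0.2)

memories = [
    Memory("user's allergy", {"reliability": 0.9, "self_user_relevance": 1.0}),
    Memory("small talk", {"emotional_intensity": 0.1, "reliability": 0.4}),
    Memory("unverified rumour", {"emotional_intensity": 0.8, "reliability": 0.1}),
]
kept = consolidate(memories, weights, budget=2)
```

Because the same scalar is used for ranking and for the forgetting cut-off, changing a single weight shifts both decisions at once, which is the unification the abstract describes.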

2606.12942 2026-06-12 cs.AI 新提交

PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization

PRISMR: 通过参数化表示内化克服多模态列表排序中的解析崩溃

Hao Jiang, Xin Li, Annan Wang, Zhi Yang, Haoxiang Zhang, Yichi Zhang, Weisi Lin

发表机构 * Nanyang Technological University(南洋理工大学) Peking University(北京大学) Independent Researcher(独立研究员)

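AI summary: Addresses parse collapse in generative listwise ranking under long multimodal contexts with the PRISMR framework, which replaces transient in-context list processing with parametric structural conditioning; a lightweight hypernetwork encodes candidates in parallel and generates LoRA weights, substantially reducing parse collapse and improving ranking performance.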

详情
AI中文摘要

基于大型多模态模型(LMM)的生成式列表排序旨在单次前向传播中捕获全局列表上下文,但其效果在长上下文多模态场景中会退化。我们识别出一种重复出现的失败模式——解析崩溃,即自回归解码器生成流畅但不完整的排序,通过静默省略候选并提前终止。这种失败源于有限的上下文利用,而非简单的格式错误,使得提示工程和约束解码不足以解决。我们提出PRISMR(参数化表示内化用于语义多模态排序)框架,用参数化结构条件替代临时的上下文内列表处理。PRISMR使用轻量级超网络并行编码多模态候选并生成项目特定的LoRA权重,这些权重被合成为LMM的实例特定适配器。这种范式在保留基础模型的同时,实现了更鲁棒的列表结构内化。我们进一步引入了一个大规模多模态评论排序基准用于评估。实验表明,PRISMR显著减少了解析崩溃,提高了列表排序性能,并有效跨领域和指令微调骨干网络迁移。

英文摘要

Generative listwise ranking with Large Multimodal Models (LMMs) aims to capture global list context in a single forward pass, but its effectiveness degrades in long-context multimodal scenarios. We identify a recurring failure mode, parse collapse, where the autoregressive decoder produces fluent yet incomplete rankings by silently omitting candidates and terminating early. This failure stems from limited context utilization rather than simple formatting mistakes, making prompt engineering and constrained decoding insufficient. We propose PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking), a framework that replaces transient in-context list processing with parametric structural conditioning. PRISMR uses a lightweight hypernetwork to encode multimodal candidates in parallel and generate item-specific LoRA weights, which are synthesized into an instance-specific adapter for an LMM. This paradigm enables more robust internalization of list structure while preserving the base model. We further introduce a large-scale multimodal review-ranking benchmark for evaluation. Experiments demonstrate that PRISMR substantially reduces parse collapse, improves listwise ranking performance, and transfers effectively across domains and instruction-tuned backbones.

2606.12941 2026-06-12 cs.CL 新提交

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

当上下文分片到达时的多轮推理:可扩展的分片与记忆增强强化学习

Shu Tong Luo, Wenqin Liu, Rui Liu, Mingming Gong, Jiaxian Guo

发表机构 * The University of Melbourne(墨尔本大学) Google Research Australia(谷歌澳大利亚研究院)

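AI summary: Addresses the drop in LLM accuracy of up to 65% when information arrives in fragments across conversation turns by training models to maintain a compact rolling memory rather than a growing history, and introduces a low-cost sharding pipeline that converts single-turn QA into multi-turn fragmented episodes; the trained memory-augmented policy significantly improves multi-turn accuracy and generalizes zero-shot to harder tasks.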

详情
AI中文摘要

当用户在多个对话轮次中透露任务关键信息时,尽管上下文完全可用,LLM的准确率下降高达65%。我们表明,这种“迷失在对话中”的退化可以通过训练模型维护紧凑的滚动记忆而不是关注增长的历史来大幅缓解。为了使这种训练可扩展,我们引入了一个低成本的分片流水线,将单轮QA数据集转换为多轮碎片化信息情节,消除了数小时手动标注的需求。仅在分片的GSM8K上训练,我们的记忆增强策略显著提高了多轮准确率,并零样本泛化到更难的数学和域外长上下文QA。此外,即使在测试时给定完整历史,记忆训练模型也优于全历史基线,这表明学习压缩比单独的全上下文暴露能诱导更稳健的增量推理。

英文摘要

When a user reveals task-critical information across several conversation turns, LLM accuracy drops by up to 65% despite full context availability. We show that this Lost in Conversation degradation can be substantially mitigated by training models to maintain a compact rolling memory instead of attending to a growing history. To make such training scalable, we introduce a low-cost sharding pipeline that converts single-turn QA datasets into multi-turn fragmented-information episodes, eliminating the need for hours of manual annotation. Training only on sharded GSM8K, our memory-augmented policy significantly improves multi-turn accuracy and generalises zero-shot to harder math and out-of-domain long-context QA. Moreover, memory-trained models outperform full-history baselines even when given the full history at test time, suggesting that learning to compress induces more robust incremental reasoning than full-context exposure alone.
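The two ideas in this abstract, sharding a single-turn question into per-turn fragments and carrying a bounded rolling memory between turns, can be sketched as follows. This is a deliberately naive stand-in and not the authors' pipeline: the sentence-level split, the `update_memory` placeholder (the paper trains a policy for this step), and the `max_chars` budget are all assumptions introduced here.

```python
import re

def shard_question(question):
    """Split a single-turn question into per-turn fragments (naive sentence split)."""
    parts = re.split(r"(?<=[.?!])\s+", question.strip())
    return [p for p in parts if p]

def update_memory(memory, shard, max_chars=200):
    """Placeholder for a learned memory policy: append, then keep the last max_chars."""
    merged = (memory + " " + shard).strip()
    return merged[-max_chars:]

question = ("Tom has 3 boxes. Each box holds 4 apples. "
            "He gives away 5 apples. How many apples are left?")
shards = shard_question(question)

memory = ""
for shard in shards:  # one shard is revealed per conversation turn
    memory = update_memory(memory, shard)
```

In this toy setting the memory never exceeds its budget, so it simply reconstructs the question; the interesting regime is when the budget forces the policy to decide what to compress.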

2606.12939 2026-06-12 cs.CV 新提交

MAMVI: 3D Test-Time Adaptation via Masked Multi-View Point Clouds

MAMVI:通过掩蔽多视角点云实现3D测试时自适应

Inseok Kong, Geunyoung Jung, Jiyoung Jung

发表机构 * Department of Geo Informatics, University of Seoul(首尔大学地理信息学系) Department of Artificial Intelligence, University of Seoul(首尔大学人工智能系)

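AI summary: Addresses performance degradation of 3D point cloud models under distribution shift with MAMVI, which replaces sequential optimization with unified single-step adaptation and combines a hybrid masking strategy with multi-view loss aggregation for fast, accurate test-time adaptation.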

详情
Comments
Accepted by ICPR 2026
AI中文摘要

3D点云模型在传感器噪声、遮挡和环境变化引起的分布偏移下会出现显著的性能下降。测试时自适应(TTA)已成为在推理过程中缓解此问题的实用范式。最近,利用多视角增强在提升3D TTA性能方面显示出潜力。然而,现有的多视角方法通常受限于将每个视角独立处理的顺序优化。这种顺序优化由于重复的优化步骤导致显著的推理延迟,使得实时自适应不切实际。为了解决这个问题,我们提出了掩蔽多视角测试时自适应(MAMVI),它用统一的单步自适应替代顺序优化。具体来说,MAMVI利用一种混合掩蔽策略,结合固定比例以保持稳定性,以及Beta分布采样以增加多样性。通过聚合多个视角的损失,MAMVI基于多视角共识通过单次反向传播执行自适应。此外,使用基于置信度的自适应学习率来动态调整每个样本的自适应强度。在ModelNet-40C、ShapeNet-C和ScanObjectNN-C上的大量实验表明,MAMVI在ShapeNet-C和ScanObjectNN-C上达到了最先进的准确率。同时,它在ModelNet-40C上保持竞争力,同时推理速度提高了4.9-8.9倍,使其非常适合实时应用。我们的代码可在以下网址获取:this https URL

英文摘要

3D point cloud models suffer significant performance degradation under distribution shifts caused by sensor noise, occlusions, and environmental changes. Test-time adaptation (TTA) has emerged as a practical paradigm for mitigating this issue during inference. Recently, leveraging multi-view augmentation has shown promise in improving 3D TTA performance. However, existing multi-view approaches are often constrained by sequential optimization that treats each view independently. This sequential optimization leads to substantial inference latency due to repetitive optimization steps, making real-time adaptation impractical. To address this, we propose Masked Multi-View Test-Time Adaptation (MAMVI), which replaces sequential optimization with a unified single-step adaptation. Specifically, MAMVI utilizes a hybrid masking strategy that combines fixed ratios for stability with Beta-distributed sampling for diversity. By aggregating losses across multiple views, MAMVI performs adaptation through a single backward pass based on multi-view consensus. Additionally, a confidence-based adaptive learning rate is used to dynamically adjust the adaptation intensity for each sample. Extensive experiments on ModelNet-40C, ShapeNet-C, and ScanObjectNN-C demonstrate that MAMVI achieves state-of-the-art accuracy on ShapeNet-C and ScanObjectNN-C. Moreover, it remains competitive on ModelNet-40C while delivering 4.9-8.9 times faster inference, making it highly suitable for real-time applications. Our code is available at this https URL
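The hybrid masking idea (a few fixed mask ratios plus Beta-sampled ratios) and the aggregation of per-view losses into one scalar can be sketched with the standard library alone. The ratio values, the Beta parameters, the random point-dropping scheme, and the plain mean used for aggregation are assumptions for illustration; the abstract does not specify them.

```python
import random

def hybrid_mask_ratios(fixed_ratios, num_random, alpha, beta, rng):
    """Fixed ratios for stability plus Beta-sampled ratios for diversity."""
    sampled = [rng.betavariate(alpha, beta) for _ in range(num_random)]
    return list(fixed_ratios) + sampled

def mask_points(points, ratio, rng):
    """Drop roughly `ratio` of the points at random, keeping at least one."""
    keep = max(1, round(len(points) * (1.0 - ratio)))
    return rng.sample(points, keep)

def aggregated_loss(view_losses):
    """Average per-view losses so that one backward pass covers all views."""
    return sum(view_losses) / len(view_losses)

rng = random.Random(0)
points = [(rng.random(), rng.random(), rng.random()) for _ in range(1024)]

# Two fixed ratios and two Beta(2, 2) samples: hypothetical settings.
ratios = hybrid_mask_ratios([0.3, 0.6], num_random=2, alpha=2.0, beta=2.0, rng=rng)
views = [mask_points(points, r, rng) for r in ratios]
```

In a real adaptation loop each view would be passed through the model, the per-view losses combined with something like `aggregated_loss`, and a single optimizer step taken, instead of one step per view.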

2606.12935 2026-06-12 cs.AI 新提交

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

MARS: 用于并行LLM测试时扩展的边际对抗风险控制停止策略

Wenbo Chen, Puheng Li, Mengyang Liu, Weijie Su, Tianpei Xie

发表机构 * Amazon(亚马逊) Stanford University(斯坦福大学) University of Pennsylvania(宾夕法尼亚大学)

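AI summary: Proposes the MARS stopping rule, which monitors the aggregate vote at intermediate checkpoints and uses an adversarial bound on future vote movement, saving 25-47% of self-consistency tokens while matching full-budget accuracy.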

详情
AI中文摘要

并行测试时扩展采样多个推理轨迹并对答案进行多数投票,提高了LLM的准确性,但需要轨迹运行至完成,导致大量计算开销。我们观察到,在中间检查点探测部分轨迹可以在不中断生成的情况下提取当前答案,揭示出不断演变的聚合投票。基于这一观察,我们引入了MARS,一种边际对抗性停止规则,它估计哪些活跃轨迹可能改变其答案,并在未来投票移动的保守边界下,一旦领先者保持安全就停止。该规则分离了两种不确定性来源。它学习轨迹级别的切换概率,这些概率决定了当前边际有多少可能被保留,同时通过从预热轨迹中校准的对抗性边界处理切换轨迹落在哪里的更难问题。在真实切换概率下,MARS以高概率保证提前停止的答案与完整预算投票一致。在实践中,一个五特征逻辑模型紧密匹配了神谕切换行为。在三个推理模型和三个竞赛数学基准上,MARS节省了25-47%的自一致性token,并在DeepConf Online(一个已经过滤和截断弱轨迹的强置信加权基线)之上额外节省14-29%,同时匹配相应完整预算基线的准确率。

英文摘要

Parallel test-time scaling samples many reasoning traces and majority-votes their answers, improving LLM accuracy but requiring traces to run to completion, incurring substantial computational overhead. We observe that probing partial traces at intermediate checkpoints can extract current answers without disrupting generation, revealing an evolving aggregate vote. Based on this observation, we introduce MARS, a margin-adversarial stopping rule that estimates which active traces are likely to change their answers and stops once the leader remains safe under a conservative bound on future vote movement. The rule separates two sources of uncertainty. It learns the trace-level switch probabilities that determine how much of the current margin is likely to be retained, while handling the harder question of where switching traces land through an adversarial bound calibrated from warmup traces. With true switch probabilities, MARS guarantees with high probability that the early-stopped answer matches the full-budget vote. In practice, a five-feature logistic model closely matches oracle switching behavior. Across three reasoning models and three competition-math benchmarks, MARS saves 25-47% of self-consistency tokens and 14-29% on top of DeepConf Online, a strong confidence-weighted baseline that already filters and truncates weak traces, while matching the accuracy of the corresponding full-budget baselines.
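A simplified version of a margin-based stopping check conveys the flavour of this rule. The sketch below is not the MARS rule itself: the pessimistic swing formula, the `slack` parameter, and the example switch probabilities are assumptions chosen to keep the example small, and the paper's calibrated adversarial bound is not reproduced.

```python
from collections import Counter

def safe_to_stop(answers, switch_probs, slack=1.0):
    """Return True if the leader survives a pessimistic bound on vote movement.

    answers[i]      current answer probed from trace i
    switch_probs[i] estimated probability that trace i changes its answer
    slack           extra safety margin in votes (hypothetical knob)
    """
    votes = Counter(answers)
    ranked = votes.most_common()
    leader, lead_votes = ranked[0]
    runner_votes = ranked[1][1] if len(ranked) > 1 else 0
    margin = lead_votes - runner_votes

    # Pessimistic assumption: every expected switch lands on the runner-up.
    # A switch away from the leader moves the margin by 2, any other by at most 1.
    loss_from_leader = sum(p for a, p in zip(answers, switch_probs) if a == leader)
    loss_from_others = sum(p for a, p in zip(answers, switch_probs) if a != leader)
    worst_swing = 2.0 * loss_from_leader + loss_from_others + slack
    return margin > worst_swing

# A clear leader with stable traces: stopping early looks safe.
clear = safe_to_stop(["42"] * 12 + ["17"] * 3 + ["8"], [0.05] * 12 + [0.5] * 4)

# A one-vote lead with volatile traces: keep generating.
close = safe_to_stop(["42"] * 6 + ["17"] * 5, [0.3] * 11)
```

The split between "how many traces switch" (the probabilities) and "where they land" (assumed worst case) mirrors the two sources of uncertainty the abstract separates.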

2606.12924 2026-06-12 cs.AI 新提交

Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

迭代优化搜索:面向电子商务中智能搜索架构评估的双智能体仿真框架

Jetlir Duraj, Jayanth Yetukuri, Shuang Zhou, Dhruv Varma, Rui Kong, Ishita Khan, Qunzhi Zhou

发表机构 * eBay Inc.(eBay公司)

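AI summary: Proposes a modular two-agent simulation framework that holds the buyer agent fixed to compare responder designs, finds that rolling-window memory beats intent-extraction memory in both quality and speed, and uses failure analysis to cut failure and near-failure rates by 62%.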

详情
AI中文摘要

我们提出了一个模块化的双智能体仿真框架,用于评估对话式购物助手架构。一个独立的买家智能体,配置了角色、任务和耐心水平,与一个可互换的应答器配对,该应答器与真实的电子商务搜索API集成。在实验中保持买家不变,可以在相同场景下对照比较应答器设计。利用跨越14个角色桶的2011次对话,我们建立了四个实证发现。首先,滚动窗口记忆在所有质量指标上优于意图提取记忆,同时每个查询速度快35%。其次,通过对应答器版本的系统性失败分析,实现了有针对性的修复,将整个数据集上的失败和接近失败率降低了62%,展示了快速的证据驱动迭代。第三,将应答器的LLM骨干从Gemini 2.5切换到Llama 3.3 70B,尽管架构相同,但性能下降了0.16-0.45点。最后,我们记录了前沿LLM评判者之间系统性的哲学分歧:Gemini奖励过程正确性,而Claude要求具体结果,尽管使用了相同的评估提示。

英文摘要

We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeable responder that integrates with a real e-commerce search API. Holding the buyer constant across experiments enables controlled comparison of responder designs on identical scenarios. Using 2011 conversations across 14 persona buckets, we establish four empirical findings. First, rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query. Second, illustrating rapid evidence-driven iteration, a systematic failure analysis of a responder version enables targeted fixes that reduce failure and near-failure rates by 62% across the full dataset. Third, swapping the responder LLM backbone from Gemini 2.5 to Llama 3.3 70B costs 0.16-0.45 points despite identical architecture. Finally, we document systematic philosophical disagreement between frontier LLM judges: Gemini rewards process correctness while Claude demands concrete outcomes, despite using the same evaluation prompt.

2606.12923 2026-06-12 cs.LG cs.AI cs.CL 新提交

Order Is Not Control

秩序并非控制

Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Tim Elson

发表机构 * Australian Broadcasting Corporation(澳大利亚广播公司)

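AI summary: Argues that order is not control, proposes a receiver-gated response law, and identifies it across biological, LLM, adapter, and stochastic-operator panels, indicating that control is local and measurable.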

详情
Comments
52 pages, 7 figures
AI中文摘要

AI对齐、可解释性、引导和神经扰动研究识别出诱导秩序的对象。我们认为秩序并非控制。控制需要接收器门控的响应定律:一个分母索引算子,将物质状态、动作/驱动、浴和接收器状态映射到响应位移、汇、努力和盆地投影。我们在生物、大语言模型、适配器和随机算子面板中识别出该定律。这些定律是局部的:干预可以被接纳、饱和、变号、泄漏或过驱动,取决于介质、浴、接收器状态、动作端口和比较器。当有限努力在相同分母下移动目标或结果读出类别,而损伤、无效/规避、无效格式、过驱动和不必要努力保持有界时,控制被分配。小鼠ALM、秀丽隐杆线虫和斑马鱼面板提供了物理响应算子证据,同时排除了坐标同一性和控制器结论。大语言模型面板展示了生成输出响应定律:在四种物质条件下,响应向量的分量符号预测准确率为72.8-73.7%,非零分量上提升至84.3-84.8%;留出观察者以93.6%和91.7%的准确率预测系统效应和目标/预言家族。宪法条件适配器将易感性重塑为制备介质,随机算子面板将测量机会与可部署行动策略分离。这给出了介观控制层面的驱动-耗散响应系统描述:驱动通过制备介质、浴和接收器作用,产生接纳运动、阻抗、汇或过驱动。证据支持局部接纳控制和可测量的随机响应算子,同时将可部署的预生成控制、隐藏/logit因果充分性、生物到LLM坐标同一性以及字面热力学量排除在范围之外。

英文摘要

AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. We identify it across biological, LLM, adapter, and stochastic-operator panels. The laws are local: an intervention can be admitted, saturated, sign-changing, leaky, or overdriven depending on medium, bath, receiver state, action port, and comparator. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive, invalid format, overdrive, and unnecessary effort stay bounded. Mouse ALM, C. elegans, and zebrafish panels provide physical response-operator evidence while excluding coordinate identity and controller conclusions. LLM panels show generated-output response laws: across four material conditions, response vectors are predictable at 72.8-73.7% component-sign accuracy, rising to 84.3-84.8% on nonzero components; held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. Constitution-conditioned adapters reshape susceptibility as prepared media, and stochastic-operator panels separate measured opportunity from deployable action policies. This gives a driven-dissipative response-system account at the mesoscopic control level: drives act through prepared media, baths, and receivers, producing admitted movement, impedance, sinks, or overdrive. The evidence supports local admitted control and measurable stochastic response operators, while leaving deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities outside scope.

2606.12922 2026-06-12 cs.CL cs.CY 新提交

Polar: A Benchmark for Evaluating Political Bias in LLMs

Polar: 评估大语言模型中政治偏见的基准

Sangho Kim, Heejin Kim, Yoonhee Park, Hyunggeun Jeon, Jaejin Lee

发表机构 * Graduate School of Data Science, Seoul National University(首尔大学数据科学研究生院) Dept. of Computer Science and Engineering, Seoul National University(首尔大学计算机科学与工程系)

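AI summary: Proposes the Polar benchmark, which measures political bias in LLMs via option-level likelihoods across U.S. and South Korean political contexts and finds that bias varies with context, issue, model group, and language.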

详情
Comments
Submitted to ARR 2026 May cycle
AI中文摘要

大语言模型(LLM)中的政治偏见日益显著,但在不同政治和语言背景下难以可重复地测量。我们引入了Polar,一个包含4,026个实例的多项选择基准,通过选项级似然度而非基于提示的生成来测量政治偏见。Polar覆盖了两个意识形态轴和来自Manifesto Project的八个议题类别,并在美国和韩国政治语境中并行评估模型。在38个LLM中,测量的偏见随政治语境、议题类别、模型组和呈现语言系统性地变化。所有模型在美国政治内容上倾向于左翼进步派,但在韩国内容上表现出更居中且混合的模式。翻译实验进一步表明,仅呈现语言就能改变测量的偏见。这些发现凸显了对LLM中政治偏见进行多语言和跨语境评估的必要性。

英文摘要

Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods rather than prompt-based generation. Polar covers two ideological axes and eight issue categories derived from the Manifesto Project, and evaluates models in parallel across U.S. and South Korean political contexts. Across 38 LLMs, measured bias varies systematically with political context, issue category, model group, and presentation language. All models lean left-progressive on U.S. political content, but show more centered and mixed patterns on South Korean content. Translation experiments further show that presentation language alone can shift measured bias. These findings highlight the need for multilingual and cross-contextual evaluation of political bias in LLMs.
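Measuring bias through option-level likelihoods, rather than by parsing generated text, can be illustrated with a small scoring sketch. The lean scale (-1 for left, +1 for right), the option labels, and the example log-likelihoods are hypothetical; how Polar obtains and normalises likelihoods from each model is not described in the abstract and is not reproduced here.

```python
import math

def option_probabilities(option_logliks):
    """Softmax over per-option log-likelihoods (max-subtracted for stability)."""
    peak = max(option_logliks.values())
    exps = {k: math.exp(v - peak) for k, v in option_logliks.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def expected_lean(option_logliks, option_lean):
    """Probability-weighted lean of one multiple-choice item."""
    probs = option_probabilities(option_logliks)
    return sum(probs[k] * option_lean[k] for k in probs)

# Hypothetical item: option A is coded left (-1), option B is coded right (+1).
lean = {"A": -1.0, "B": 1.0}
balanced = expected_lean({"A": -1.0, "B": -1.0}, lean)
tilted = expected_lean({"A": 0.0, "B": math.log(3.0)}, lean)
```

Because the score is computed from likelihoods of fixed options, it is unaffected by refusals or free-form phrasing, which is one reason such a measurement is easier to reproduce than prompt-based generation.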

2606.12921 2026-06-12 cs.LG cs.AI 新提交

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

LoRA-Muon:低秩流形上的谱最速下降

Franz Louis Cesista, Katherine Crowson, Cédric Simal, Stella Biderman

发表机构 * Ateneo de Manila University(雅典耀马尼拉大学) EleutherAI NaXys, UNamur(纳慕尔大学NaXys研究所)

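AI summary: Proposes the LoRA-Muon optimizer, which applies Muon's spectral steepest-descent rule to low-rank finetuning to address LoRA's sensitivity to initialization and the poor transfer of optimal learning rates across ranks; on TinyShakespeare a rank-32 run attains lower validation loss than the dense baseline.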

详情
Comments
20 pages, 4 figures
AI中文摘要

低秩适应(LoRA)显著降低了微调深度学习模型的计算和内存成本,但通常比稠密训练更难调优:当使用因子级优化器(如AdamW)时,它对初始化选择敏感,其最优学习率在秩之间迁移性差,且常常无法超越稠密基线。我们通过将Muon优化器的谱最速下降规则应用于低秩设置,推导出LoRA-Muon。结合我们的分裂权重衰减规则,我们的主要主张是LoRA-Muon是全秩Muon和Shampoo族优化器的一个良好的低秩代理。其最优学习率在秩、宽度、深度和因子重缩放之间均可迁移。在我们计算匹配的TinyShakespeare研究中,秩2代理恢复了稠密最佳测试学习率,秩32的LoRA-Muon运行在种子平均扫描中达到了比稠密基线更低的平均验证损失。我们进一步表明,Spectron优化器依赖于任意的因子缩放,因此在从严重不平衡的因子开始微调时可能不太适用,并且LoRA-RITE的简化QR坐标核心实现了相同的谱更新。LoRA-Muon无需QR分解即可计算该更新,并避免存储二阶矩,使其更易于加速器使用且内存效率更高。

英文摘要

Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning Deep Learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-2 proxy recovers the dense best tested learning rate, and a rank-32 LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.

2606.12916 2026-06-12 cs.AI cs.CL cs.LG 新提交

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

MDForge:稀疏模拟器反馈下的智能分子动力学流水线设计

Zehong Wang, Yijun Ma, Connor R. Schmidt, Tianyi Ma, Weixiang Sun, Ziming Li, Xiaoguang Guo, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

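AI summary: Proposes MDForge, an LLM agent that densifies sparse reward through multi-agent debate to design molecular dynamics pipelines automatically, is competitive with human experts on SAMPL benchmarks, and discovers a novel high-affinity CB[7] binder.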

详情
AI中文摘要

分子动力学(MD)是原子分子科学中经典的计算机模拟方法,从第一性原理物理模拟分子行为。为新系统设计MD流水线需要大量专业知识:即使在一个分子上运行也代价高昂,排除了试错法。我们使用LLM智能体自动化这一专家流水线设计过程。与现有编排预定义工具集的MD智能体不同,我们将流水线设计视为开放式代码生成,其中智能体的行为通过语言奖励在线重塑。具体而言,我们构建了MDForge,一个LLM智能体,其上下文更新规则通过物理专家间的多智能体辩论将稀疏奖励稠密化。在三个SAMPL主客体结合自由能基准上,MDForge自动设计的MD流水线与人类专家竞争。部署在未见过的候选客体库上,其CB[7]流水线发现了一种新型结合剂,湿实验竞争NMR证实其为高亲和力、皮摩尔级的CB[7]结合剂。我们的数据和代码可在此https URL获取。

英文摘要

Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principles physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial-and-error. We automate this expert pipeline-design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open-ended code generation in which the agent's behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in-context update rule densifies the sparse reward via a multi-agent debate among physics experts. On three SAMPL host-guest binding free-energy benchmarks, MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms is a high-affinity, picomolar CB[7] binder. Our data and code are available at this https URL.

2606.12908 2026-06-12 cs.CL 新提交

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

SENTINEL: 用于训练工具使用语言模型智能体的失败驱动强化学习

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo, Jiri Gesi, Hanqing Lu, Yisi Sang, Manling Li, Jing Huang, Dakuo Wang

发表机构 * Northeastern University(东北大学) Independent Researcher(独立研究员) Northwestern University(西北大学)

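AI summary: Proposes the SENTINEL framework, which turns agent failures into targeted training tasks and improves Pass^1 of Qwen3-4B-Thinking-2507 on Tau2-Bench Retail from 66.4 to 74.9, outperforming RL on general synthetic tasks.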

详情
AI中文摘要

语言模型智能体通过多轮工具使用在解决现实任务方面越来越有效。然而,训练可靠的工具使用智能体在实践中仍然具有挑战性。虽然强化学习提供了一种从智能体自身环境交互中改进智能体的在策略范式,但其有效性在很大程度上取决于训练任务分布。当任务在训练前固定时,任务分布可能越来越与策略不断发展的能力不匹配,导致许多轨迹被浪费在无信息的任务上。我们提出SENTINEL,一种失败驱动的强化学习框架,将求解器的轨迹失败转化为有针对性的训练任务。SENTINEL遵循控制器-提议者-求解器循环:控制器分析失败轨迹并总结重复出现的错误模式,提议者生成可执行的任务来强调这些弱点,求解器在针对性任务上接受训练。在Tau2-Bench Retail上使用Qwen3-4B-Thinking-2507,SENTINEL将Pass^1从66.4提高到74.9,并且在Pass^k指标上优于通用合成任务上的强化学习。这些结果表明,模型失败为改进工具使用语言模型智能体提供了有效且可扩展的针对性训练信号来源。

英文摘要

Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy's evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver's rollout failures into targeted training tasks. SENTINEL follows a Controller-Proposer-Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass^k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.
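The Pass^k metric mentioned in this abstract differs from the more familiar pass@k: it asks whether all k trials of a task succeed, not whether at least one does. The sketch below assumes the combinatorial estimator commonly used with the tau-bench family, C(c, k) / C(n, k) for c successes in n trials, averaged over tasks; the function names and the example counts are hypothetical and the numbers are not from the paper.

```python
from math import comb

def pass_hat_k(num_trials, num_success, k):
    """Estimate the chance that k independent trials of one task all succeed."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    # comb(num_success, k) is 0 when there are fewer than k successes.
    return comb(num_success, k) / comb(num_trials, k)

def benchmark_pass_hat_k(results, k):
    """Average the per-task estimate; `results` is a list of (trials, successes)."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)

# Three hypothetical tasks with 4 trials each: always, sometimes, never solved.
results = [(4, 4), (4, 2), (4, 0)]
p1 = benchmark_pass_hat_k(results, k=1)
p2 = benchmark_pass_hat_k(results, k=2)
```

Pass^k can only fall as k grows, so it rewards agents that are reliable across repeated attempts and penalises ones that succeed only occasionally.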

2606.12903 2026-06-12 cs.CL 新提交

X-MADAM-RAG: Diagnosing and Handling Chinese-English Evidence Conflict in Retrieval-Augmented Generation

X-MADAM-RAG:诊断和处理检索增强生成中的中英文证据冲突

Yongqi Kang, Yu Fu, Yong Zhao

发表机构 * Sichuan University(四川大学)

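AI summary: Proposes the X-MADAM-RAG pipeline, which decomposes evidence handling into candidate extraction, visible-evidence repair, deterministic grouping, and conflict-aware aggregation to address Chinese-English evidence conflict in RAG. It reaches high accuracy on the controlled benchmark, but the analysis finds that document-level extraction is the main bottleneck.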

详情
AI中文摘要

检索增强生成(RAG)系统可能接收到不仅噪声大而且相互矛盾的证据。这个问题在多语言环境中尤为突出,因为检索到的中文和英文证据可能支持不相容的答案候选。我们通过X-RAMDocs-ZHEN(一个从RAMDocs衍生的受控中英文基准)研究此问题,用于诊断RAG中的证据冲突。该基准包含300个示例,涵盖六种平衡条件,包括单语言支持、双语一致、反向冲突方向以及带可选噪声的冲突。我们进一步研究了X-MADAM-RAG,一个可解释的管道,将证据处理分解为每个文档的候选提取、可见证据修复、确定性候选分组和冲突感知聚合。在原始受控基准上使用Qwen2.5-7B-Instruct,X-MADAM-RAG达到了0.9667的严格准确率和0.9767的冲突感知成功率,优于证据归一化的单次调用基线。然而,一个零调用的纯规则提取器在同一基准上达到了1.0000,揭示了强模板规律性。为了探究这一局限性,我们构建了一个确定性自然化压力测试,移除了显式答案模板但保留了候选字符串。在其100样本子集上,纯规则提取器降至0.0000,但X-MADAM-RAG也降至0.3000严格准确率,低于朴素基线和证据归一化基线。特权Oracle保持完美,表明文档级提取是主要瓶颈。这些发现将X-RAMDocs-ZHEN和X-MADAM-RAG定位为受控证据冲突的诊断工具,而非通用幻觉检测或对自然检索鲁棒性的证据。

英文摘要

Retrieval-augmented generation (RAG) systems may receive evidence that is not merely noisy but mutually contradictory. This issue becomes particularly salient in multilingual settings, where retrieved Chinese and English evidence may support incompatible answer candidates. We study this problem through X-RAMDocs-ZHEN, a controlled Chinese-English benchmark derived from RAMDocs for diagnosing evidence conflict in RAG. The benchmark contains 300 examples across six balanced conditions, including monolingual support, bilingual agreement, reversed conflict directions, and conflict with optional noise. We further examine X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the original controlled benchmark with Qwen2.5-7B-Instruct, X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, revealing strong template regularity. To probe this limitation, we construct a deterministic naturalized stress test that removes explicit answer templates while preserving candidate strings. On its 100-sample subset, rule-only extraction falls to 0.0000, but X-MADAM-RAG also drops to 0.3000 strict accuracy, below both naive and evidence-normalized baselines. A privileged oracle remains perfect, indicating that document-level extraction is the main bottleneck. These findings position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict rather than as evidence of general hallucination detection or robustness to natural retrieval.
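
To make the last two pipeline stages concrete, here is a minimal sketch of deterministic candidate grouping followed by conflict-aware aggregation. It is not the authors' implementation: the `Extraction` record, the normalization rule (NFKC, case folding, whitespace collapsing), and the output format are all assumptions made for illustration. Per-document extraction and visible-evidence repair are assumed to have already produced one candidate string (or none) per document.

```python
import unicodedata
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extraction:
    """Hypothetical output of per-document candidate extraction."""
    doc_id: str
    lang: str                  # e.g. "zh" or "en"
    candidate: Optional[str]   # None when the document yields no answer


def normalize(candidate: str) -> str:
    """Assumed normalization: NFKC, case folding, collapsed whitespace."""
    folded = unicodedata.normalize("NFKC", candidate).casefold()
    return " ".join(folded.split())


def group_candidates(extractions: list[Extraction]) -> dict[str, list[Extraction]]:
    """Group documents by normalized candidate string, skipping empty ones."""
    groups: dict[str, list[Extraction]] = defaultdict(list)
    for ex in extractions:
        if ex.candidate is None:
            continue  # noise documents contribute no candidate
        groups[normalize(ex.candidate)].append(ex)
    return dict(groups)


def aggregate(extractions: list[Extraction]) -> dict:
    """Report every surviving candidate and flag conflict instead of voting."""
    groups = group_candidates(extractions)
    if not groups:
        return {"status": "no_answer", "answers": []}
    status = "agreement" if len(groups) == 1 else "conflict"
    answers = [
        {
            "answer": key,
            "support": sorted(ex.doc_id for ex in members),
            "langs": sorted({ex.lang for ex in members}),
        }
        for key, members in sorted(groups.items())
    ]
    return {"status": status, "answers": answers}
```

Because grouping here is purely string-based, it merges a full-width "１９９８" with "1998" but would not align a Chinese answer with its English translation; a real system would need some cross-lingual matching step. The sketch also takes the extracted candidates as given, which is the stage the abstract identifies as the main bottleneck.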