arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.11382 2026-06-11 cs.LG q-bio.BM New submission

GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction

Emily Nguyen, Yongchan Hong, Harsh Toshniwal, Yan Liu, Andreas Luttens

Affiliations: Department of Computer Science, University of Southern California; Department of Quantitative and Computational Biology, University of Southern California; Amazon; Department of Medical Biochemistry and Biophysics, Science for Life Laboratory, Karolinska Institutet

AI summary: Proposes GLACIER, a student-teacher framework that fuses three modalities (molecular graphs, SMILES strings, and physicochemical descriptors) and distills knowledge from large teacher models for efficient, accurate molecular property prediction.

Details
Abstract

Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases, limiting their scalability. Most large-scale models are unimodal in nature and overlook the potential to leverage complementary molecular data modalities. To address these shortcomings, this paper introduces the Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER) model, a student-teacher framework that integrates molecular graphs, SMILES strings, and physicochemical descriptors to learn rich molecular embeddings. Our framework consists of three stages: (1) we pretrain three student encoders on 100,000 drug-like molecules: a message-passing neural network for molecular graphs, a transformer-based encoder for SMILES strings, and a multilayer perceptron for physicochemical descriptors; (2) we fuse these student modalities using a novel Finsler geometry-aware module; and (3) we distill complementary knowledge from large teacher models, including MiniMol and MolFormer, into a single lightweight model via contrastive learning. We demonstrate that GLACIER is a robust framework that delivers high predictive performance and computational efficiency in complex molecular property prediction tasks. Our code is publicly available at this https URL.
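
The abstract describes stage (3) only as distillation "via contrastive learning". As a rough illustration, the sketch below assumes an InfoNCE-style objective in which each student embedding should be most similar to the teacher embedding of the same molecule. The function names, the temperature of 0.1, the toy 2-D vectors, and the assumption that student and teacher embeddings already share a dimension are all hypothetical; this is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(student, teacher, temperature=0.1):
    """Mean InfoNCE loss; student[i] and teacher[i] embed the same molecule."""
    total = 0.0
    for i, s in enumerate(student):
        logits = [cosine(s, t) / temperature for t in teacher]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(student)

# Toy batch of two molecules (hypothetical 2-D embeddings).
student_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(student_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(student_batch, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when student and teacher agree and large when the pairing is swapped, which is the signal a contrastive distillation step would minimize.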

2606.11381 2026-06-11 cs.CV New submission

From Simulation to Real-World: An In-Field 6D Pose Dataset and Baseline for Robotic Strawberry Harvesting

Woojung Son (1), Won Suk Lee (1), Zijing Huang (1), Daeun Choi (1), Catia Silva (2), Yu She (3), Yan Gu (4) ((1) Department of Agricultural and Biological Engineering, University of Florida, (2) Department of Electrical and Computer Engineering, University of Florida, (3) Edwardson School of Industrial Engineering, Purdue University, (4) School of Mechanical Engineering, Purdue University)

Affiliations: Department of Agricultural and Biological Engineering, University of Florida; Department of Electrical and Computer Engineering, University of Florida; Edwardson School of Industrial Engineering, Purdue University; School of Mechanical Engineering, Purdue University

AI summary: Addresses the sim-to-real gap in 6D pose estimation for robotic strawberry harvesting by building the first in-field strawberry 6D pose ground-truth dataset (12,040 images) and a synthetic dataset with scene-level realism generated in NVIDIA Isaac Sim, and quantifies the gap through baseline experiments.

Details
Comments
7 pages, 6 figures, 1 table
Abstract

Robotic strawberry harvesting requires precise 6D pose estimation; however, collecting 6D pose ground truth in real agricultural fields is inherently challenging. Existing 6D pose estimation methods have therefore relied solely on synthetic data that lacks scene-level realism, leaving their performance under real agricultural field conditions unquantified. In this work, we present, to the best of our knowledge, the first real-world 6D pose ground truth dataset of strawberries collected in actual agricultural fields (12,040 images). We also introduce a synthetic dataset rendered in NVIDIA Isaac Sim, featuring scene-level realism and domain randomization. Nevertheless, our experiments reveal that a significant sim-to-real gap persists, underscoring the necessity of real agricultural field data for reliable evaluation. We further quantify the sim-to-real gap through baseline 6D pose estimation results across backbone encoders, serving as a reference for future work. The real-world dataset will be made available upon acceptance.

2606.11375 2026-06-11 cs.CL cs.AI cs.LG New submission

When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

Orion Reblitz-Richardson

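Affiliations: Distiller Labs

AI summary: Addresses the rapid saturation of linear-probe accuracy during pre-training by proposing fragility, a metric that measures probe robustness as an activation-noise level, revealing the evolution of representational structure that accuracy cannot capture.

Details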
Comments
22 pages, 5 figures. Code and datasets at this https URL
Abstract

Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few thousand steps, leaving most of training invisible to the instrument. We introduce fragility, a complementary per-layer metric defined as the activation-noise level at which probe accuracy collapses. Fragility is sensitive to both the margin of separability and the redundancy of representation, both of which keep evolving long after accuracy plateaus. Applied to open-checkpoint language models, fragility recovers structure that accuracy alone cannot see. Moralized representations emerge along a lexical $\to$ compositional gradient: lexical moral detection first, compositional moral encoding later. Because probe accuracy on its own tracks how lexically separable a dataset is, we establish the compositional encoding directly, by showing it transfers across construction types that share no contrast tokens. A layer-depth robustness gradient develops monotonically across training while accuracy stays flat. And matched fine-tuning corpora that produce identical probing accuracy leave distinct fragility fingerprints, showing that data curation reshapes probe robustness without changing probe accuracy. In every comparison we test, where probing accuracy returns a flat answer, fragility returns a structured one.
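
The definition of fragility as "the activation-noise level at which probe accuracy collapses" can be made concrete with a small sweep. The sketch below is one possible reading, not the paper's code: it assumes isotropic Gaussian noise, a fixed linear probe, a collapse threshold of 0.75, and a hand-picked noise grid, all of which are hypothetical choices.

```python
import random

def probe_accuracy(points, labels, weights, bias, sigma, rng):
    """Accuracy of a fixed linear probe after adding Gaussian noise to activations."""
    correct = 0
    for x, y in zip(points, labels):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        score = sum(w * xi for w, xi in zip(weights, noisy)) + bias
        correct += int((score > 0) == (y == 1))
    return correct / len(points)

def fragility(points, labels, weights, bias, sigmas, collapse_at=0.75, seed=0):
    """Smallest sigma at which accuracy drops below collapse_at (None if it never does)."""
    rng = random.Random(seed)
    for sigma in sorted(sigmas):
        if probe_accuracy(points, labels, weights, bias, sigma, rng) < collapse_at:
            return sigma
    return None

def two_class_data(margin, n_per_class=200):
    """1-D toy activations at +margin (label 1) and -margin (label 0)."""
    points = [[margin]] * n_per_class + [[-margin]] * n_per_class
    labels = [1] * n_per_class + [0] * n_per_class
    return points, labels

sigma_grid = [0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
narrow = fragility(*two_class_data(0.5), weights=[1.0], bias=0.0, sigmas=sigma_grid)
wide = fragility(*two_class_data(5.0), weights=[1.0], bias=0.0, sigmas=sigma_grid)
```

Both toy datasets give the probe perfect accuracy without noise, yet the narrow-margin one collapses at a much smaller noise level. That is the kind of difference the abstract says accuracy alone cannot see.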

2606.11372 2026-06-11 cs.RO New submission

HiPi: Reproducible High-Fidelity Piezoresistive Sensors for Robotic Manipulation

Changyi Lin, Raihan Haque, Hui-Ping Wang, Ding Zhao

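Affiliations: Carnegie Mellon University; General Motors

AI summary: Proposes HiPi, which uses a low-crosstalk readout principle and a reproducible hardware design to achieve 220 Hz readout in a bimanual setup with four arrays totaling 2048 taxels, raising contact-geometry IoU from 0.428 to 0.797.

Details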
Abstract

Piezoresistive tactile sensors are attractive for robotic manipulation because they are thin, lightweight, low-cost, and scalable to dense large-area sensing. However, existing systems still face a practical trade-off: recent reproducible designs emphasize accessibility and ease of reproduction, whereas high-fidelity readout architectures remain more difficult to fabricate, assemble, and deploy. We present HiPi, a reproducible high-fidelity piezoresistive sensing system for robotic manipulation. Building on a low-crosstalk readout principle, HiPi redesigns the complete hardware stack around reproducibility, deployability, and multi-sensor scalability. The system includes a compact readout PCB compatible with commercial PCB fabrication and assembly services, eliminating manual soldering; a smaller and lower-cost STM32-based MCU module; an optimized communication pipeline that achieves 220 Hz readout in a bimanual setup with four dense tactile arrays (2048 taxels in total); and FPCB-based conductive layers that simplify sensor fabrication and stacking. Experiments with structured 3D-printed contact patterns show that HiPi preserves contact geometry substantially better than a reproducible baseline, improving the average IoU from 0.428 to 0.797 and the average Dice score from 0.539 to 0.886. These results suggest that HiPi bridges an important gap between reproducible fabrication and high-fidelity readout, making dense piezoresistive tactile sensing more practical for bimanual manipulation and multi-fingered robotic systems.
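
The two reported scores, IoU and Dice, both compare a measured contact mask against a reference pattern. A minimal sketch of how they are computed on binary masks follows; it assumes the taxel readings have already been thresholded to 0/1, and the small masks are made-up examples rather than data from the paper.

```python
def iou_and_dice(pred, truth):
    """IoU and Dice for two same-shaped 2-D lists of 0/1 contact masks."""
    inter = sum(p & t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    pred_area = sum(map(sum, pred))
    truth_area = sum(map(sum, truth))
    union = pred_area + truth_area - inter
    # Two empty masks are treated as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / (pred_area + truth_area) if pred_area + truth_area else 1.0
    return iou, dice

truth_mask = [[0, 1, 1, 0],
              [0, 1, 1, 0]]
pred_mask = [[0, 1, 1, 1],
             [0, 1, 0, 0]]
iou, dice = iou_and_dice(pred_mask, truth_mask)
```

Here three taxels overlap out of five in the union, so the IoU is 0.6 and the Dice score is 0.75.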

2606.11363 2026-06-11 cs.CV New submission

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

Hao Lu, Yongxin Guo, Onur Koyun, Zhengjie Zhu, Abbas Alili, Metin N. Gurcan

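Affiliations: Wake Forest University School of Medicine; Advocate Health

AI summary: Proposes NSVQ, a training strategy that combines a non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing to mitigate codebook collapse in large-codebook VQ, improving reconstruction quality on ImageNet-1k while maintaining 100% codebook utilization.

Details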
Abstract

Vector quantization is central to modern generative modeling pipelines, but large-codebook VQ models often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder moves the latent distribution, sparsely updated code vectors can lag behind, lose assignments, and increase quantization error, creating a feedback loop through the straight-through estimator. We propose NSVQ, a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. NSVQ first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. Experiments on ImageNet-1k show that NSVQ improves reconstruction quality while maintaining full codebook utilization. On ImageNet-1k at 128$\times$128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ, while both methods maintain 100% utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.
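
To illustrate the failure the abstract attributes to encoder drift, the sketch below shows plain nearest-neighbor quantization and a codebook-utilization measure on a three-entry toy codebook. It does not implement NSVQ's loss, replacement, or freezing schedule; the codebook, the latents, and the size of the drift are hypothetical.

```python
def nearest_code(z, codebook):
    """Index of the code vector closest to latent z (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])))

def utilization(latents, codebook):
    """Fraction of codebook entries that receive at least one assignment."""
    used = {nearest_code(z, codebook) for z in latents}
    return len(used) / len(codebook)

codebook = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]

# Before drift: each latent sits near a different code vector.
latents_before = [[0.1, 0.0], [0.9, 1.1], [9.8, 10.1]]
# After drift: the encoder has moved all latents, but the codebook has not followed.
latents_after = [[4.0, 4.0], [4.5, 4.2], [5.0, 5.1]]

util_before = utilization(latents_before, codebook)
util_after = utilization(latents_after, codebook)
```

After the shift every latent maps to the same code vector, so two of the three entries stop receiving assignments (and therefore updates), which is the lag the abstract describes.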

2606.11350 2026-06-11 cs.CL cs.IR New submission

When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

Nabaraj Subedi, Ahmed Abdelaty, Shivanand Venkanna Sheshappanavar

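Affiliations: Dept. of Electrical Engineering & Computer Science, University of Wyoming; Dept. of Civil, Architectural Engineering & Construction Management, University of Wyoming

AI summary: Addresses the performance drop that vector search dilution causes when retrieval-augmented generation runs over heterogeneous document collections by proposing MASDR-RAG, a domain-scoping method based on organizational metadata that significantly raises P@10 to 0.86, and identifies a precision-faithfulness paradox in multi-agent orchestration.

Details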
Comments
24 pages, 8 figures, 30 tables. Preprint under review
Abstract

Retrieval-augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power and top-k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. Even with hybrid dense+sparse retrieval, we observed this firsthand in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dilution, we propose MASDR-RAG (Multi-Agent Scoped Domain Retrieval for RAG) and evaluate it on 200 expert-validated queries across five LLM backbones, six corpora, and two index stacks. Our results indicate that domain scoping using organizational metadata is the key fix, significantly improving P@10 from 0.77 to 0.86 ($p < 0.05$). Furthermore, our investigation of multi-agent orchestration revealed a high degree of configuration dependence, creating what we call the precision-faithfulness paradox. Based on these varied outcomes, our practical recommendation is simple: scope first, then perform a single synthesis call, reserving full multi-agent orchestration for genuinely multi-domain corpora paired with native-tool-call backbones. Code and data will be made public upon acceptance.
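
The "scope first" recommendation can be pictured as a metadata filter applied before similarity ranking, with P@k measuring the effect. The sketch below is a toy illustration under stated assumptions: the chunk records, the domain labels, the dot-product scoring, and the idea that the query's domain is already known are all hypothetical, and none of it reflects the MASDR-RAG code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, chunks, k, domain=None):
    """Rank chunks by dot-product similarity; if domain is given, filter by it first."""
    pool = [c for c in chunks if domain is None or c["domain"] == domain]
    ranked = sorted(pool, key=lambda c: dot(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:k]]

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    return sum(1 for i in retrieved_ids[:k] if i in relevant_ids) / k

chunks = [
    {"id": "bridge-1", "domain": "bridges", "vec": [1.0, 0.0]},
    {"id": "bridge-2", "domain": "bridges", "vec": [0.8, 0.2]},
    # Semantically close to the query but from the wrong organizational domain.
    {"id": "payroll-1", "domain": "payroll", "vec": [0.95, 0.05]},
]
query = [1.0, 0.0]
relevant = {"bridge-1", "bridge-2"}

unscoped = top_k(query, chunks, k=2)
scoped = top_k(query, chunks, k=2, domain="bridges")
p_unscoped = precision_at_k(unscoped, relevant, 2)
p_scoped = precision_at_k(scoped, relevant, 2)
```

Without scoping, the look-alike payroll chunk displaces a relevant one and P@2 drops to 0.5; with the domain filter both retrieved chunks are relevant.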

2606.11349 2026-06-11 cs.AI cs.HC New submission

Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

Aijing Gao, Yiming Kang, Mengdie Flora Wang, Jae Oh Woo

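Affiliations: Amazon Web Services

AI summary: Proposes ACTION-RATING, a framework that places clarification requests in the agent's action space on an ordinal scale shared with navigation, enabling self-gated clarification in hierarchical reasoning and improving decision accuracy through two information-seeking modes, mandatory and opportunistic.

Details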
Abstract

In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ACTION-RATING, a formulation that places it inside the agent's action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states. Two structurally distinct information-seeking modes emerge from the agent's own ratings: mandatory (no viable branch) and opportunistic (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9 LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, with Information-Seeking Effectiveness (ISE), a local diagnostic defined as the fraction of help interactions followed by a correct next navigation step (not a final-task metric), rising from 50% to 74%. Three diagnostic contrasts fail to reproduce this structure. A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (-18.8% accuracy), supporting an empirical separation between where an agent seeks help and the quality of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at 10-digit; we read this as an upper bound on what better localization could unlock, not a deployment estimate.
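
One way to picture "asking competes directly with acting" is a single argmax over ratings that include an ask action. The sketch below is an assumption-heavy illustration: the 1-to-5 scale, the viability threshold of 3, the rule that ties go to asking, and the branch codes are all hypothetical and are not taken from the paper. The ISE helper follows the definition given in the abstract.

```python
def choose_action(branch_ratings, ask_rating, viable_at=3):
    """Pick between navigating and asking on one shared ordinal scale.

    branch_ratings maps branch name -> rating; ask_rating rates the ask action.
    Ties go to asking (an assumption made for this sketch).
    """
    best_branch = max(branch_ratings, key=branch_ratings.get)
    best = branch_ratings[best_branch]
    if ask_rating < best:
        return ("navigate", best_branch)
    # Asking won: label the mode by whether any branch looked viable.
    mode = "mandatory" if best < viable_at else "opportunistic"
    return ("ask", mode)

def information_seeking_effectiveness(next_step_correct):
    """Fraction of help interactions followed by a correct next navigation step."""
    if not next_step_correct:
        return 0.0
    return sum(next_step_correct) / len(next_step_correct)

confident = choose_action({"0101": 5, "0102": 2}, ask_rating=3)
no_viable_branch = choose_action({"0101": 2, "0102": 1}, ask_rating=4)
residual_doubt = choose_action({"0101": 4, "0102": 2}, ask_rating=4)
ise = information_seeking_effectiveness([True, True, True, False])
```

Because the ask action is rated alongside the branches, every help request is visible at the decision point where it happened, which is what makes a local diagnostic such as ISE computable.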

2606.11341 2026-06-11 cs.LG cs.RO New submission

Energy-Conserved Neural Pipelines: Attenuating Error Propagation in Modular Neural Networks via Physical Conservation Constraints

David Young, Swan Yi Htet

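Affiliations: ORION Robotics

AI summary: Proposes enforcing energy conservation between modules (an invariant L2 norm of feature vectors) as a hard constraint; experiments show the method substantially outperforms baselines under several noise types, and it comes with depth invariance and a theoretical guarantee.

Details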
Comments
22 pages, 2 figures, 7 tables, 25 references
Abstract

Modular neural network pipelines suffer from error compounding: noise at any module boundary propagates and potentially amplifies through subsequent modules. We introduce energy conservation as a hard physical constraint on inter-module information flow. Activation energy (the squared L2 norm of feature vectors) is enforced to be exactly preserved at every module boundary. Unlike soft energy penalties, conservation is an inviolable law: the network may redistribute energy across neurons but cannot create or destroy it. Four experiments on CIFAR-10 demonstrate: (1) conservation retains 77.4% of clean accuracy at noise sigma=0.2, versus 35.1% for baselines and 30.9% for energy-penalized models (p<0.001, 5 seeds); (2) pipelines become depth-invariant, retaining 93.3% at depths 2 through 5 with noise at every boundary; (3) the advantage generalizes to systematic bias (+45.1%), Gaussian (+40.4%), and adversarial noise (+4.8%), with a principled non-effect on dropout (-0.3%); (4) on ResNet-18, the conservation advantage scales inversely with intrinsic normalization: +0.3 pp with BatchNorm, +26.2 pp without at sigma=0.2, reaching +58.0 pp at sigma=0.5. Experiment 5 validates the operator on a real modular robotic pipeline (MuJoCo physics, Franka Panda). Across three independent runs on separate machines (90 trials per cell), conservation provides +18.9 pp average advantage on monocular-depth-style noise. A formal bound proves conserved noise energy is strictly less than input noise energy.
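
The constraint that activation energy (the squared L2 norm) is exactly preserved at a module boundary can be sketched as a rescaling step. The abstract does not say how the authors enforce it, so the uniform rescaling below, the function names, and the `eps` guard are assumptions made for illustration.

```python
import math

def energy(v):
    """Activation energy: the squared L2 norm of a feature vector."""
    return sum(x * x for x in v)

def conserve(module_input, module_output, eps=1e-12):
    """Rescale module_output so its energy equals that of module_input.

    Energy may be redistributed across coordinates by the module, but the
    total is forced to match; eps guards against an all-zero output.
    """
    scale = math.sqrt(energy(module_input) / max(energy(module_output), eps))
    return [x * scale for x in module_output]

features_in = [3.0, 4.0]             # energy 25
raw_out = [1.0, 1.0, 1.0, 1.0]       # energy 4, as produced by some module
features_out = conserve(features_in, raw_out)
```

In this example the output is scaled by 2.5 so that its energy returns to 25 while its direction is unchanged.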

2606.11337 2026-06-11 cs.AI cs.CL cs.CY New submission

Can AI Agents Synthesize Scientific Conclusions?

Hayoung Jung, Pedro Viana Diniz, José Reinaldo Corrêa Roveda, Abner Fernandes da Silva, Haeun Jung, Enoch Tsai, Aleksandra Korolova, Manoel Horta Ribeiro

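Affiliations: Princeton University; Universidade Federal de Minas Gerais; Stony Brook University; Hackensack Meridian School of Medicine

AI summary: Introduces the SciConBench benchmark and the SciConHarness evaluation harness, which decompose conclusions into atomic facts and compute precision and recall; finds that frontier AI agents reach a factual F1 of only 0.337 on scientific conclusion synthesis, that unconstrained evaluation suffers from data leakage, and that consumer-facing agents often generate incomplete or contradictory conclusions.

Details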
Comments
79 pages, 34 figures, 17 tables. Under Submission
Abstract

Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
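
Factual precision, recall, and F1 over atomic facts can be sketched as set arithmetic. The example below assumes facts are matched by exact string equality, which is a simplification: the benchmark's own pipeline for judging whether a fact is supported is not described in the abstract, and the fact strings here are invented.

```python
def factual_prf(predicted_facts, reference_facts):
    """Precision, recall, and F1 over atomic facts, matched by exact equality."""
    predicted, reference = set(predicted_facts), set(reference_facts)
    matched = len(predicted & reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

# Hypothetical atomic facts extracted from an agent's conclusion and an expert's.
agent_facts = ["drug A lowers risk", "effect holds in adults",
               "no serious harms", "evidence is high quality"]
expert_facts = ["drug A lowers risk", "effect holds in adults",
                "evidence is low quality"]
precision, recall, f1 = factual_prf(agent_facts, expert_facts)
```

Two of the agent's four facts are supported (precision 0.5) and two of the expert's three facts are covered (recall 2/3), so a conclusion can be penalized both for unsupported claims and for omissions.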

2606.11326 2026-06-11 cs.CV New submission

DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax

Minseong Kweon, Wenyuan Zhao, Nuo Chen, Lulin Liu, Huiwen Han, Zihao Zhu, Srinivas Shakkottai, Chao Tian, Zhiwen Fan

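Affiliations: University of Minnesota; Texas A&M University; Stanford University

AI summary: Proposes DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes, introducing thermal factorization and geometry-shared routing modules that preserve accuracy under degraded RGB conditions.

Details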
Comments
Project Page: this https URL
Abstract

Recent feed-forward 3D reconstruction methods have demonstrated strong performance and flexibility in efficient end-to-end scene geometry estimation from image streams. However, their reliance on visible-light appearance makes them vulnerable in dark and low-visibility environments, where RGB cues are severely degraded and geometric evidence becomes ambiguous. To address this challenge, we propose DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes. DarkVGGT introduces two complementary modules. First, physics-inspired thermal factorization extracts emissive-dominant, geometry-consistent thermal cues while isolating sparse reflective residuals that may introduce geometric ambiguity. Second, geometry-shared thermal routing isolates modality-invariant geometric structures from thermal-specific patterns, selectively injecting reliability-aware structural guidance into the RGB stream. Together, these components enable accurate thermal-informed geometry estimation under degraded RGB conditions while largely preserving performance in well-lit environments. Experiments on low-visibility RGB-T benchmarks demonstrate consistent improvements in both depth and camera pose estimation over existing feed-forward geometry baselines.

2606.11324 2026-06-11 cs.RO cs.AI cs.LG New submission

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Yifu Yuan, Yaoting Huang, Xianze Yao, Yutong Li, Shuoheng Zhang, Linqi Han, Pengyi Li, Jiangeng Sun, Wenting Jia, Zhao Zhang, Yuhao Liu, Ruihao Liao, Yucheng Hu, Qiyu Wu, Yuxiao Li, Zibin Dong, Fei Ni, Yan Zheng, Shuyang Gu, Yi Ma, Hongyao Tang, Han Hu, Jianye Hao

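Affiliations: Tianjin University; Tencent Hunyuan

AI summary: Proposes Embodied-R1.5, a unified embodied foundation model that uses automated data pipelines and multi-task balanced reinforcement learning to reach state-of-the-art results on 16 of 24 benchmarks with 8B parameters, and that can be fine-tuned into a VLA model.

Details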
Comments
Embodied R1.5 technical report. Project page: this https URL
Abstract

We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like $\pi_{0.5}$ across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.

2606.11320 2026-06-11 cs.CV New submission

Semantic Segmentation of Node and Edge Diagrams for Assistive Technology

Michael Cormier, Yichun Zhao, Laura Paul, Cameron Swift, Duc Tri Dang, Miguel Nacenta

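Affiliations: Natural Sciences and Engineering Research Council of Canada

AI summary: Proposes compact deep learning models for semantic segmentation of node-edge diagrams that reach over 93% pixel accuracy on a synthetic dataset, to support non-visual access.

Details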
Comments
8 pages, 6 figures, 1 table. In Proceedings of the 23rd Conference on Robots and Vision (2026)
Abstract

In this paper, we present a novel set of related models for semantic segmentation of node-link diagrams. These diagrams are frequently used to represent mathematical graphs, relationships between concepts, and flowcharts. Such diagrams are difficult to access non-visually; while some assistive interfaces have been designed for node-link diagrams, they rely upon a machine-readable representation of the diagram, whereas such diagrams will generally be made available as bitmap images. Our compact deep learning models show excellent quantitative and qualitative performance on a large synthetic dataset of node-link diagrams, reaching per-pixel accuracy over 93%.

2606.11319 2026-06-11 cs.LG cond-mat.dis-nn New submission

Learning from almost nothing: How neural networks survive heavy input corruption

Justin Tahmassebpur, Asadullah Bhuiyan, Hyejin Kim, Omri Lesser

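Affiliations: Cornell University

AI summary: Studies the robustness of neural networks that keep high accuracy even when inputs are heavily corrupted (>90%), and uses a mean-field approach to derive that the network implements a nearest-class-mean prototype rule, explaining why learning succeeds.

Details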
Comments
26 pages, 10 figures
Abstract

Learning from imperfect data is a central theme in machine learning, connecting practical questions of robustness to fundamental questions of learnability. Here we examine attribute noise: learning from corrupted inputs while keeping the labels intact, a setting that has received considerably less analytical attention than its label-noise counterpart. We consider two types of corruption models: additive noise and replacement noise. Through experiments with multi-layer perceptrons (MLPs) on corrupted classification datasets, we find that neural networks remain robust, maintaining well-above-chance accuracy even when inputs are >90% corrupted -- far beyond human recognition. To understand this robustness, we analyze infinite-width networks in the heavy-corruption regime using a mean-field-inspired approach and derive a leading-order decision rule for the classification outcome: the network implements a prototype rule, the nearest-class-mean, assigning each test point to the class whose training-set average it most closely resembles. This leading-order decision rule is universal across a broad range of MLP architectures, holding for any depth, as well as a wide class of activation functions and noise distributions. The same centroid mechanism closely matches finite-width network behavior in our experiments and provides an interpretable and analytically tractable account of why learning can succeed even when individual training examples carry almost no signal.
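
The leading-order decision rule named in the abstract, nearest-class-mean, is simple enough to write down directly. The sketch below implements that prototype rule on a tiny hand-made dataset; it is an illustration of the rule itself, not of the paper's mean-field analysis, and the data points are invented.

```python
def class_means(train_x, train_y):
    """Average the training vectors of each class into one prototype per class."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def nearest_class_mean(x, means):
    """Assign x to the class whose prototype is closest in squared Euclidean distance."""
    return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])))

train_x = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
train_y = [0, 0, 1, 1]
means = class_means(train_x, train_y)
pred_a = nearest_class_mean([3.0, 1.0], means)
pred_b = nearest_class_mean([9.0, 9.0], means)
```

For independent zero-mean additive noise with standard deviation σ, averaging n training samples shrinks the noise in a class mean to σ/√n, which is one way to read why a prototype can stay informative even when each individual example carries almost no signal.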

2606.11314 2026-06-11 cs.CV cs.GR New submission

TRON: Tracing Rays to Orchestrate a Neural Renderer for 3D Gaussian Reconstructions

Or Perel, Hassan Abu Alhaija, Zian Wang, Jacob Munkberg, Matan Atzmon, Sanja Fidler, Masha Shugrina

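Affiliations: NVIDIA; University of Toronto; Vector Institute

AI summary: Proposes TRON, a framework that combines 3D Gaussian ray tracing with neural rendering for realistic, controllable rendering of real-world 3D scenes under novel lighting, dynamic object motion, object insertion, and material editing, bridging physically based and neural rendering through intrinsic decomposition priors and ray-traced radiometric guidance.

Details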
Comments
Project page: this https URL
Abstract

We introduce TRON, a rendering framework that combines 3D Gaussian ray tracing with neural rendering to enable realistic and controllable rendering of real-world 3D scenes under novel lighting, dynamic object motion, object insertion, and material editing. Prior approaches that rely solely on physically based rendering (PBR) of Gaussian representations struggle to achieve realistic relighting due to imperfections in reconstructed geometry, material estimates, and light transport estimation. At the same time, neural rendering methods often lack an explicit scene representation, limiting their ability to support interactive editing with fine-grained manipulation. TRON bridges these two paradigms. We use intrinsic decomposition priors from a learned inverse rendering model to regularize the material properties of a Gaussian field, and repurpose a ray tracer to provide radiometric guidance rather than final pixels. By treating this output as a structured 3D scaffold, we empower a lightweight neural renderer to bridge the domain gap between shading-model constrained estimates and photorealistic output. Our key insight is that the combination of explicit 3D knowledge with robust material priors provides speed and controllability, while neural rendering enables the synthesis of photorealistic images. To support real-world scenarios, we train our neural renderer with a multi-stage strategy consisting of large-scale pretraining and targeted fine-tuning on a newly constructed dataset of 2.1M rendered synthetic and real-world frames from 3D reconstructions. TRON outperforms Gaussian-based relighting methods in realism, and prior neural renderers in editability and speed. To the best of our knowledge, TRON is the first method to enable practical interactive applications in captured 3D environments, offering realistic appearance under dynamic geometric, lighting and material conditions.

2606.11289 2026-06-11 cs.CV 新提交

i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

i1: 一种简单且完全开放的强文本到图像模型配方

Boya Zeng, Tianze Luo, Shu Pu, Jucheng Shen, Taiming Lu, Gabriel Sarch, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学)

AI总结 本文通过300多次控制实验系统研究文本到图像扩散模型的设计选择,提出i1模型,仅用公开数据集训练3B参数模型,在五个基准上平均超越现有最佳完全开放模型29.5个百分点。

详情
Comments
Project page at this https URL
AI中文摘要

扩散模型持续推动文本到图像生成的进展。然而,将最近的进展归因于特定的建模和数据选择是困难的:最先进的开放权重模型提供的消融研究有限,并且不公开其训练数据和完整的训练细节。研究社区需要完全开放(权重、数据和代码)的模型作为进一步研究的基础;然而,现有的完全开放模型在性能上仍显著落后于领先模型。在本项目中,我们通过300多次控制实验(总计超过70万TPU v6e小时)系统研究了文本到图像扩散训练和推理中的建模与数据设计选择。我们的实验突出了几个经验发现(例如,等权重是混合策划数据集的强默认设置)和简单的设计决策(例如,更大的文本编码器适配器以最小的参数增加提升性能),用于训练强模型。在这些见解的指导下,我们训练了i1,一个仅使用公开可用数据集的3B参数文本到图像扩散模型。i1在五个代表性基准(GenEval、DPG、PRISM、CVTG-2K和LongText)上与领先模型竞争,并且平均超越现有最佳完全开放模型29.5个百分点。我们提供i1检查点、训练和推理代码以及数据处理流程。总之,我们的发现和i1配方为未来文本到图像扩散模型的开放研究建立了实践基础。我们的代码可从此https URL获取。

英文摘要

Diffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details. The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at this https URL.

2606.11286 2026-06-11 cs.LG cs.AI 新提交

FreeBridge: Variational Schrödinger Bridges for Cellular Transition Dynamics

FreeBridge: 用于细胞转变动力学的变分薛定谔桥

Xurui Wang, Qin Ren, Jun Ma, Haibin Ling, Chenyu You

发表机构 * Stony Brook University(石溪大学) University of Toronto(多伦多大学) University Health Network(大学健康网络)

AI总结 针对高内涵成像中细胞扰动建模的端点监督问题,提出FreeBridge方法,通过变分薛定谔桥在固定细胞流形上学习随机传输,并利用经验潜在支持正则化约束中间路径,在保持端点保真度的同时减少中间支持违规。

详情
Comments
Accepted to MICCAI 2026 (early accept). Project page: this https URL
AI中文摘要

高内涵成像实验量化细胞对化学和遗传扰动的反应,但由于细胞在采集时被化学固定,单个细胞的连续轨迹无法观测。因此,扰动建模简化为推断仅在对照和处理群体之间观察到的随机传输,这些群体作为单独的边际分布。虽然最近的生成模型实现了强端点对齐,但边界一致性并不决定中间演化:多个随机过程可能连接相同的边际分布,同时穿过观察到的单细胞形态不支持的区域。我们引入了 FreeBridge,一种在仅端点监督下进行单细胞转变建模的薛定谔桥公式。FreeBridge 将原子状态定义为实例分割的单细胞表示,建立固定的细胞流形,并通过经验潜在支持正则化学习在此几何结构内约束的随机传输。在 BBBC021、RxRx1 和 JUMP 数据集上,FreeBridge 在统一评估协议下保持竞争性或改进的端点保真度和作用机制保留;在 BBBC021 上,它进一步减少了中间支持违规。这些发现强调了几何基础对于生物学可解释的扰动动力学的重要性。项目页面:此 https URL。

英文摘要

High-content imaging assays quantify cellular responses to chemical and genetic perturbations, yet continuous trajectories of individual cells are unobservable because cells are chemically fixed at acquisition. Perturbation modeling therefore reduces to inferring stochastic transport between control and treated populations observed only as separate marginals. While recent generative models achieve strong end-point alignment, boundary consistency does not determine intermediate evolution: multiple stochastic processes may connect identical marginals while traversing regions unsupported by observed single-cell morphologies. We introduce FreeBridge, a Schrödinger Bridge formulation for single-cell transition modeling under endpoint-only supervision. FreeBridge defines atomic states as instance-segmented single-cell representations, establishing a fixed cellular manifold, and learns stochastic transport constrained within this geometry via empirical latent support regularization. Across BBBC021, RxRx1, and JUMP, FreeBridge maintains competitive or improved endpoint fidelity and mechanism-of-action retention under a unified evaluation protocol; on BBBC021, it further reduces intermediate support violations. These findings highlight the importance of geometric grounding for biologically interpretable perturbation dynamics. Project page: this https URL.

2606.11278 2026-06-11 cs.RO 新提交

Model-based Optimization of Anguilliform Swimming Gaits for Soft Robotic Applications

基于模型的鳗鲡式游泳步态优化及其在软体机器人中的应用

Brian Van Stratum, James Gallentine, Caleb Rucker, Eric Barth, Jonathan E. Clark, Kourosh Shoele

发表机构 * FAMU/FSU College of Engineering(佛罗里达农工大学/佛罗里达州立大学工程学院) Vanderbilt University(范德堡大学) The University of Tennessee, Knoxville(田纳西大学诺克斯维尔分校)

AI总结 本文提出软体七鳃鳗启发双环境机器人(SLIDER),通过结合Lighthill理论、非线性结构模型和遗传算法,优化游泳控制模式与尾鳍设计,实现系留游泳速度21.7±0.4 cm/s,并探索多模态机器人优化。

详情
AI中文摘要

在本文中,我们介绍了软体七鳃鳗启发双环境机器人(SLIDER)以及用于设计该机器人的适当建模和优化流程。我们使用Lighthill的大振幅细长体理论来表示主要的流体环境作用——惯性效应、涡流力和粘性耗散。对于结构设计参数,如内部压力、尾部尺寸和身体刚度,我们开发并验证了一个快速的几何和材料非线性模型。流固耦合方程采用高效的二阶box方法隐式求解。采用气动歧管机器人系统在静水槽环境中驱动SLIDER,实现计算与实验结果的交叉比较。我们发现低频游泳主要受阻力环境影响,而高频游泳主要受惯性流体力的影响。利用我们的高效模型和遗传算法,我们共同优化了游泳控制模式和尾鳍设计(受限于SLIDER的攀爬形态),实现了21.7±0.4 cm/s(0.59 Bl/s)的系留游泳速度。此外,我们研究了执行游泳和攀爬任务的多模态机器人的优化程序。

英文摘要

In this paper, we introduce the Soft Lamprey-Inspired Dual Environment Robot (SLIDER) and the modeling and optimization procedure used to design it. We represent the primary fluid environment actions (inertial effects, vortex forces, and viscous dissipation) using Lighthill's theory for large-amplitude elongated bodies. For structural design parameters such as internal pressure, tail size, and body stiffness, a fast, geometrically and materially nonlinear model is developed and validated. The fluid-structure interaction equations are solved implicitly with an efficient second-order box method. A pneumatic manifold robotic system is employed to actuate SLIDER in a quiescent water tank environment, allowing cross-comparison of computational and experimental results. We find that low-frequency swimming is dominated by resistive environmental forces, whereas higher-frequency swimming is primarily affected by inertial fluid forces. Using our efficient model alongside a genetic algorithm, we co-optimize a swimming control pattern and caudal fin design (subject to SLIDER's climbing morphology) to achieve a tethered swimming speed of 21.7 ± 0.4 cm/s (0.59 Bl/s). Furthermore, we investigate the optimization procedure for a multimodal robot performing both swimming and climbing tasks.

2606.11277 2026-06-11 cs.LG physics.comp-ph 新提交

Least-Action-Guided Diffusion for Physical Extrapolation

最小作用量引导扩散用于物理外推

Zhongxin Yang, Yuanwei Bin, Xiang I.A. Yang, Shiyi Chen

发表机构 * College of Engineering, Peking University(北京大学工学院) Ningbo Institute for Digital Twin, Eastern Institute of Technology(东方理工宁波数字孪生研究院) Eastern Institute for Advanced Study, Eastern Institute of Technology(东方理工高等研究院) Shenzhen Tenfong Technology Co., Ltd.(深圳腾方科技有限公司) Mechanical Engineering, The Pennsylvania State University(宾夕法尼亚州立大学机械工程系)

AI总结 提出最小作用量引导扩散(LAPG)框架,通过将最小作用量原理转化为可微的推理时校正机制,在时间、参数和几何外推中保持物理一致性,优于训练时物理信息基线。

详情
AI中文摘要

可靠的外推仍然是计算物理学中生成模型的核心挑战,因为模型在有限的时间、参数或几何范围内训练,可能会在训练分布之外产生物理上不一致的预测。我们引入了最小作用量引导扩散(LAPG),这是一个在推理过程中促进物理一致性而非仅依赖训练时施加约束的框架。该方法结合了条件得分扩散模型与作用量导出的物理引导得分。在第一阶段,学习的得分模型生成一个分布内的提议;在第二阶段,基于作用量的变分先验将该提议向目标分布外条件细化。这一公式将最小作用量原理转化为可微的推理时校正机制,并提供了对通常需要经验损失平衡的点态残差惩罚的替代方案。我们在代表性的常微分和偏微分方程系统上评估了LAPG,包括自由落体、保守和耗散弹簧-质量动力学、相互作用点涡以及参数化翼型上的势流。在时间、参数和几何外推测试中,与训练时物理信息基线相比,LAPG减少了相位漂移,保持了耗散衰减,捕捉了涡旋运动,并改善了翼型流动的升力响应。

英文摘要

Reliable extrapolation remains a central challenge for generative models in computational physics, because models trained over finite ranges of time, parameters, or geometries may produce physically inconsistent predictions outside the training distribution. We introduce least-action-principle-guided diffusion (LAPG), a framework that promotes physical consistency during inference rather than relying solely on constraints imposed during training. The method combines a conditional score-based diffusion model with an action-derived physical guidance score. In the first stage, the learned score model generates an in-distribution proposal; in the second, an action-based variational prior refines this proposal toward the target out-of-distribution condition. This formulation turns the principle of least action into a differentiable inference-time correction mechanism and provides an alternative to pointwise residual penalties that often require empirical loss balancing. We evaluate LAPG on representative ordinary- and partial-differential-equation systems, including free fall, conservative and dissipative spring-mass dynamics, interacting point vortices, and potential flow over parameterized airfoils. In temporal, parameter, and geometric extrapolation tests, LAPG reduces phase drift, preserves dissipative decay, captures vortex motion, and improves the lift response of airfoil flows compared with training-time physics-informed baselines.
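The idea of using an action as a differentiable correction can be illustrated on the conservative spring-mass case mentioned in the abstract. The sketch below is not the authors' method: it replaces the diffusion model with a fixed, hand-made proposal trajectory and replaces the guidance score with plain gradient descent on a discretized action. The names `action_gradient` and `refine`, the unit mass and stiffness, the step size, and the horizon are all assumptions chosen for illustration.

```python
import math

def action_gradient(x, dt, m=1.0, k=1.0):
    """Gradient of the discrete spring-mass action w.r.t. interior points.

    S = sum_t [0.5*m*((x[t+1]-x[t])/dt)**2 - 0.5*k*x[t]**2] * dt
    The endpoints x[0] and x[-1] are held fixed, so their gradient is zero.
    """
    g = [0.0] * len(x)
    for t in range(1, len(x) - 1):
        g[t] = m * (2.0 * x[t] - x[t - 1] - x[t + 1]) / dt - k * x[t] * dt
    return g

def refine(proposal, dt, steps=5000, lr=0.01):
    """Descend the action starting from a proposal (hypothetical guidance loop)."""
    x = list(proposal)
    for _ in range(steps):
        g = action_gradient(x, dt)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Assumed setup: horizon T = 1 is shorter than pi/omega (omega = 1 here),
# so the stationary path is a true minimum of the discrete action.
N, T = 20, 1.0
dt = T / N
ts = [i * dt for i in range(N + 1)]
# Stand-in for a model proposal: a straight line plus a wiggle that vanishes at both ends.
proposal = [t / T + 0.3 * math.sin(3.0 * math.pi * t / T) for t in ts]
refined = refine(proposal, dt)
# Analytic solution of x'' = -x with x(0) = 0 and x(T) = 1.
exact = [math.sin(t) / math.sin(T) for t in ts]
max_err = max(abs(a - b) for a, b in zip(refined, exact))
```

Setting the gradient to zero recovers the discretized equation of motion, which is why no separate residual penalty or loss weighting appears in the loop. For longer horizons the stationary path is a saddle of the action rather than a minimum, so plain descent as written here would no longer apply.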

2606.11275 2026-06-11 cs.LG cs.AI 新提交

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

RoVE: 旋转值嵌入注意力实现相对位置相关的值路径

Alejandro García-Castellanos, Maurice Weiler, Erik J Bekkers

发表机构 * AMLab University of Amsterdam(阿姆斯特丹大学AMLab) MIT CSAIL(麻省理工学院计算机科学与人工智能实验室)

AI总结 提出RoVE方法,通过同时旋转键和值使值对位置敏感,将RoPE注意力转化为注意力卷积,在少样本学习、分布外困惑度和长上下文检索上优于RoPE。

详情
AI中文摘要

旋转位置嵌入(RoPE)使注意力分数具有位置相对性,但值路径对位置不敏感:值令牌发送的消息与其到查询的距离无关。我们提出RoVE,一种无需参数修改的方法,通过同时旋转键和值使值对位置敏感,并证明它将RoPE注意力转化为注意力卷积。这一新视角统一了计算机视觉、机器人技术和现代LLM架构中同一操作的几种独立表述。训练124M和354M参数的GPT-2模型在少样本上下文学习、分布外困惑度和长上下文检索上一致优于RoPE,在需要长距离聚合的任务上改进最为明显。

英文摘要

Rotary Position Embeddings (RoPE) make attention scores position-relative but leave the value pathway position-blind: the message sent by a value token is the same regardless of its distance from the query. We propose RoVE, a parameter-free modification that makes values position-sensitive by rotating them simultaneously with keys, and show that it turns RoPE attention into attentive convolution. This new perspective unifies several independent formulations of the same operation across computer vision, robotics, and modern LLM architectures. Trained 124M and 354M GPT-2 models show consistent empirical gains over RoPE on few-shot in-context learning, out-of-distribution perplexity, and long-context retrieval, with the clearest improvements on tasks that require long-range aggregation.
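A small single-head sketch may help make the "position-blind value pathway" concrete. The abstract only states that values are rotated together with keys; the counter-rotation of the mixed output by the query position used below is an assumption of this sketch, chosen so that the message from token j to token i depends on j - i alone. The function names `rotate` and `attend` and the `offset` parameter are hypothetical.

```python
import math

def rotate(vec, pos, base=10000.0):
    """RoPE-style rotation of an even-length vector by position `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def attend(qs, ks, vs, rotate_values=True, offset=0):
    """Causal single-head attention with rotated queries and keys.

    With rotate_values=True, each value is rotated by its source position and
    the mixed output is counter-rotated by the query position (an assumption
    of this sketch), so a message from j to i sees only the rotation for j - i.
    """
    d = len(qs[0])
    outs = []
    for i in range(len(qs)):
        qi = rotate(qs[i], i + offset)
        scores = []
        for j in range(i + 1):
            kj = rotate(ks[j], j + offset)
            scores.append(sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d))
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        src = [rotate(vs[j], j + offset) if rotate_values else vs[j] for j in range(i + 1)]
        mixed = [sum(w * v[k] for w, v in zip(weights, src)) for k in range(d)]
        outs.append(rotate(mixed, -(i + offset)) if rotate_values else mixed)
    return outs

# Toy inputs: three tokens, head dimension 4 (values are arbitrary).
qs = [[0.1, 0.2, -0.3, 0.4], [0.5, -0.1, 0.2, 0.0], [-0.2, 0.3, 0.1, 0.6]]
ks = [[0.3, -0.2, 0.1, 0.5], [0.0, 0.4, -0.1, 0.2], [0.6, 0.1, 0.3, -0.4]]
vs = [[1.0, 0.5, -0.5, 0.2], [0.2, -0.3, 0.7, 0.1], [-0.4, 0.8, 0.0, 0.3]]
rotated = attend(qs, ks, vs, rotate_values=True)
shifted = attend(qs, ks, vs, rotate_values=True, offset=7)
plain = attend(qs, ks, vs, rotate_values=False)
```

Shifting every position by the same offset leaves `rotated` unchanged, which is the relative-position property; `plain` corresponds to ordinary RoPE attention, where the value vectors are mixed without any position information.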

2606.11272 2026-06-11 cs.LG cs.AI 新提交

Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

联邦持续学习:分布式和非平稳数据上的终身与隐私保护学习综述

Masoume Gholizade, Fabrizio Ruffini, Pietro Ducange, Francesco Marcelloni

发表机构 * University of Pisa(比萨大学) University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学)

AI总结 本文系统综述联邦持续学习(FCL),定义问题、分析经典联邦学习在非平稳数据下的局限,提出多维分类法,并讨论应用、评估指标及开放挑战。

详情
Comments
77 pages, 8 figures
AI中文摘要

联邦学习(FL)能够在分布式客户端之间实现协作和隐私保护的模型训练,但大多数现有的FL系统隐含地假设数据是平稳的。在现实场景中——如医疗、工业物联网(IIOT)、网络安全和智慧城市——数据流本质上是非平稳的,导致经典FL方法遭受性能下降、不稳定和灾难性遗忘。持续学习(CL)解决了在演化数据分布下的学习问题,但主要在集中式环境中研究,忽视了联邦系统的关键约束,包括隐私、有限通信和客户端异质性。联邦持续学习(FCL)出现在FL和CL的交汇处,旨在支持分布式和非平稳数据上的终身、自适应和隐私感知学习。本综述提供了FCL的全面和系统概述。我们首先给出FCL问题的正式定义并阐明其独特特征。然后分析经典FL在非平稳条件下的局限性,强调CL原理如何支持长期适应。为了组织快速增长的文献,我们提出了FCL方法的多维分类法。此外,我们回顾了代表性的应用领域和数据模态,总结了常用的评估指标,并讨论了评估长期性能和遗忘的实验视角。最后,我们强调了关键开放挑战,包括处理时间漂移下的极端异质性、设计可扩展且隐私保护的记忆机制,以及建立标准化基准。本综述旨在为推进FCL走向鲁棒和可部署的现实世界系统提供参考和路线图。

英文摘要

Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings such as healthcare, industrial IoT (IIoT), cybersecurity, and smart cities, data streams are inherently non-stationary, leading classical FL methods to suffer from performance degradation, instability, and catastrophic forgetting. Continual Learning (CL) addresses learning under evolving data distributions but has been largely studied in centralized settings, overlooking key constraints of federated systems, including privacy, limited communication, and client heterogeneity. Federated Continual Learning (FCL) emerges at the intersection of FL and CL, aiming to support lifelong, adaptive, and privacy-aware learning over distributed and non-stationary data. This survey provides a comprehensive and systematic overview of FCL. We first present a formal definition of the FCL problem and clarify its distinctive characteristics. We then analyze the limitations of classical FL under non-stationary conditions, highlighting how CL principles support long-term adaptation. To organize the rapidly growing literature, we propose a multi-dimensional taxonomy of FCL approaches. Furthermore, we review representative application domains and data modalities, summarize commonly used evaluation metrics, and discuss experimental perspectives for assessing long-term performance and forgetting. Finally, we highlight key open challenges, including handling extreme heterogeneity under temporal drift, designing scalable and privacy-preserving memory mechanisms, and establishing standardized benchmarks. This survey aims to serve as a reference and a roadmap for advancing FCL toward robust and deployable real-world systems.

2606.11269 2026-06-11 cs.CV cs.HC 新提交

Traits Run Deeper: Trait-Specific Asymmetric Fusion for Personality Assessment

特质更深:面向人格评估的特质特异性非对称融合

Jia Li, Qian Chen, Wei Wang, Xinyu Li, Zhenzhen Hu, Dongsheng Shao, Richang Hong, Meng Wang

发表机构 * Hefei University of Technology(合肥工业大学) Intelligent Interconnected Systems Laboratory of Anhui Province(安徽省智能互联系统实验室) Jianghuai Advanced Technology Center(江淮前沿技术中心) Anhui Provincial Industry Innovation Center of Humanoid Robots(安徽省人形机器人产业创新中心) Anhui Provincial Key Laboratory of Humanoid Robots(安徽省人形机器人重点实验室)

AI总结 提出Traits Run Deeper框架,通过多模态基础表示、特质特异性非对称融合和分布校准回归模块,解决人格评估中模态偏好差异和标签偏差问题,在AVI Challenge 2026上MSE降低约25%。

详情
AI中文摘要

人格评估旨在从语言、声音和面部线索等动态行为中推断稳定的人格特质。由于不同的人格维度通过不同的行为视角展现,建模特质特异性证据具有挑战性。然而,现有大多数方法对所有维度采用统一的多模态融合策略,假设模态贡献相同。这忽略了特质特异性的模态偏好,并引入了跨模态干扰。为解决这一问题,我们提出了一种新颖的人格评估框架,称为Traits Run Deeper,由三个组件组成。具体而言,多模态基础表示(MFR)模块构建面向人格的多模态输入,并利用心理学启发的语义模板作为锚点,使基础模型能够捕获特质相关信息。基于MFR,特质特异性模态融合(TSMF)模块作为一种非对称融合机制,允许每个维度从模态特定建模到互补融合中,选择性地利用不同的模态路径。因此,TSMF捕获了异质的模态偏好,同时减少了跨模态污染。此外,分布校准人格回归(DCPR)模块通过目标分布校准减轻了标签不平衡和中心趋势偏差,提高了鲁棒性和稳定性。在AVI Challenge 2026验证集上的实验结果表明了所提出框架的有效性,与基线相比,均方误差(MSE)降低了约25%。在官方测试集上观察到一致的改进,我们的方法取得了最佳性能,并在人格评估赛道中排名第一。源代码将在此https URL提供。

英文摘要

Personality assessment aims to infer stable personality traits from dynamic behaviors across language, voice, and facial cues. Since different personality dimensions are revealed through distinct behavioral perspectives, modeling trait-specific evidence is challenging. However, most existing approaches adopt a uniform multimodal fusion strategy across all dimensions, assuming identical modality contributions. This overlooks trait-specific modality preferences and introduces cross-modal interference. To address this issue, we propose a novel personality assessment framework called Traits Run Deeper, which consists of three components. Specifically, the Multimodal Foundation Representation (MFR) module constructs personality-oriented multimodal inputs and leverages psychology-informed semantic templates as anchors, enabling foundation models to capture trait-relevant information. Building upon MFR, the Trait-Specific Modality Fusion (TSMF) module acts as an asymmetric fusion mechanism, allowing each dimension to selectively exploit different modality pathways from modality-specific modeling to complementary fusion. Thus, TSMF captures heterogeneous modality preferences while reducing cross-modal contamination. Furthermore, the Distribution-Calibrated Personality Regression (DCPR) module mitigates label imbalance and central tendency bias through target distribution calibration, improving robustness and stability. Experimental results on the AVI Challenge 2026 validation set demonstrate the effectiveness of the proposed framework, reducing mean squared error (MSE) by approximately 25% compared with the baseline. Consistent improvements are observed on the official test set, where our method achieves the best performance and ranks first in the Personality Assessment Track. The source code will be made available at this https URL.

2606.11268 2026-06-11 cs.LG 新提交

LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data

LakeFM:基于不规则多变量多深度时间序列数据的水生生态系统基础模型

Abhilash Neog, Sepideh Fatemi, Medha Sawhney, Kazi Sajeed Mehrab, Aanish Pradhan, Bennett J. McAfee, Emma Marchisin, Arka Daw, Robert Ladwig, Cayelan C. Carey, Paul Hanson, Anuj Karpatne

发表机构 * Virginia Tech(弗吉尼亚理工大学) Grand Valley State University(大峡谷州立大学) University of Wisconsin - Madison(威斯康星大学麦迪逊分校) Amazon AGI(亚马逊AGI) Aarhus University(奥胡斯大学)

AI总结 针对湖泊时间序列数据不规则采样和跨湖泊泛化难题,提出预训练基础模型LakeFM,在模拟和观测数据上学习表征,实现优于现有模型的预测性能。

详情
Comments
KDD 2026
AI中文摘要

理解和预测湖泊动态对于监测湖泊和水库的水质及生态系统健康至关重要。尽管机器学习方法最近已被应用于生态时间序列数据,但现有工作假设时间和深度上的规则采样,并且难以在具有异质变量、深度和观测模式的湖泊之间泛化。为了解决这些局限性,我们引入了LakeFM,一个用于水生系统的基础模型,在包含模拟和观测湖泊的大规模生态数据集上预训练。通过广泛的实证评估,我们表明LakeFM学习了跨越更广泛湖泊层面特征的有意义表征,并在与现有时间序列基础模型和非基础模型相比时,实现了具有竞争力或通常更优的预测性能,同时产生与真实湖泊动态一致的物理上合理的预测。

英文摘要

Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have been recently applied to ecological time-series data, existing works assume regular sampling in time and depth, and struggle to generalize across lakes with heterogeneous variables, depths, and observation patterns. To address these limitations, we introduce LakeFM, a foundation model for aquatic systems, pre-trained on large-scale ecological datasets comprising both simulated and observed lakes. Through extensive empirical evaluation, we show that LakeFM learns meaningful representations spanning broader lake-level characteristics, and achieves competitive or often superior forecasting performance compared to existing time-series foundation and non-foundation models, while producing physically plausible predictions consistent with real-world lake dynamics.

2606.11267 2026-06-11 cs.LG cs.CR 新提交

A prior-free blind detection of information leakage from model predictions

基于模型预测的信息泄露的无先验盲检测

Laurence A. Jacobs

发表机构 * Center for Molecular Cardiology, University of Zurich(苏黎世大学分子心脏病学中心) Center for Complexity Sciences, National University of Mexico(墨西哥国立自治大学复杂性科学中心)

AI总结 针对机器学习模型输出中信息泄露的检测问题,提出决策理论框架,证明校准泄露与诚实模型不可区分,但近确定性子组可被无先验检测,并在UK Biobank上验证。

详情
AI中文摘要

数据泄露——模型被基线不可用的信息污染——是基于机器学习的科学中主要的可重复性失败,然而检测工具需要训练代码、外部数据或领域专业知识。没有一种工具能作用于审计员最常持有的工件:模型的输出。我们询问仅从预测和结果中能判断出关于泄露的什么信息。我们给出了一个决策理论框架,其中泄露诊断是预测风险/结果规律的泛函,由与适当评分规则和决策曲线分析相关的阈值加权参数化。我们证明了一个尖锐的不可能性:重新校准的泄露匹配诚实模型的校准和区分度,通过预测的任何函数与诚实性能不可区分,因此广泛类别仅能通过外部提供的可实现区分度上限来检测。然后我们证明了泄露无法隐藏什么:近确定性子组——近标签泄露的特征——产生一个持续的单位纯度头部,任何非确定性结果的合法预测器都无法制造,从而产生一个无先验测试。这些结果将泄露组织成三分法——未校准、广泛校准和确定性——每个都有匹配的检测器和失败模式。我们在UK Biobank上使用时窗共病泄露进行验证,已知分级严重性,测量该终点上的检测下限$\Delta\cstar \approx 0.007$,低于此的残余泄露从输出中无法检测,且太小无法改变结论。数值下限是队列和终点特定的;结构教训是通用的:仅输出检测在残余泄露与诚实的更强预测器无法区分时失败。该测试在商品硬件上不到一秒内返回对预测向量的判定。

英文摘要

Data leakage -- contamination of a model with information unavailable at baseline -- is the dominant reproducibility failure in machine-learning-based science, yet detection tools require training code, external data, or domain expertise. None operates on the artifact an auditor most often holds: the model's output. We ask what can be decided about leakage from predictions and outcomes alone. We give a decision-theoretic framework in which leakage diagnostics are functionals of the predicted-risk/outcome law, parameterized by a threshold-weighting linked to proper scoring rules and decision-curve analysis. We prove a sharp impossibility: a recalibrated leak matching an honest model's calibration and discrimination is indistinguishable from honest performance by any function of the predictions, so the broad class is detectable only against an externally supplied ceiling on achievable discrimination. We then prove what leakage cannot hide: a near-deterministic subgroup -- the signature of a near-label leak -- produces a sustained unit-purity head that no legitimate predictor of a non-deterministic outcome can manufacture, yielding a prior-free test. These results organize leakage into a trichotomy -- miscalibrated, broad-calibrated, and deterministic -- each with a matched detector and failure mode. We validate on UK Biobank using time-windowed comorbidity leakage with known, graded severity, measuring a detection floor of $\Delta\cstar \approx 0.007$ on this endpoint, below which residual leakage is undetectable from output and too small to alter conclusions. The numerical floor is cohort- and endpoint-specific; the structural lesson is general: output-only detection fails where residual leakage is indistinguishable from an honestly stronger predictor. The test returns a verdict on a prediction vector in under a second on commodity hardware.

2606.11266 2026-06-11 cs.LG 新提交

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

碰撞前的预见:利用冻结视觉-语言模型的预期性安全强化学习

Samuel Tetteh, Cody Fleming

发表机构 * Iowa State University(爱荷华州立大学)

AI总结 提出VLM-Safe-RL框架,通过冻结视觉-语言模型生成预期性成本项,改进CMDP拉格朗日更新,在高速碰撞场景下实现安全与回报的平衡。

详情
Comments
44 pages, 26 figures
AI中文摘要

约束强化学习算法优化的成本信号几乎总是反应性的:模拟器仅在碰撞开始后发出非零成本,而PPO-Lagrangian的拉格朗日乘子仅在超出回合预算后增长。在比赛速度下,碰撞是瞬时且不可逆的,任何等待成本累积的安全机制在结构上都为时已晚。我们提出VLM-Safe-RL,一个将冻结的视觉-语言模型作为预期性成本项集成到CMDP拉格朗日更新中的框架。该框架包含四个贡献:(i) 解耦双路径CLIP,独立的奖励/成本路径,尊重CMDP的分解;(ii) VLM-Lagrange,一种增强的乘子更新,将每步VLM成本作为预期性项纳入;(iii) 置信门控,基于CLIP间隔的逻辑噪声模型导出的贝叶斯最优权重;(iv) VLMPPOLag,组合算法。在Safety-Gymnasium FormulaOne L2上,我们的主要评估($n=5$个种子,$10^{6}$步,预算$d_{\text{lim}}=25$)中,VLMPPOLag$+$Conf是默认预算比较中唯一同时保持实质性回报($J_r\approx40$)并在大多数种子上将成本控制在预算内的配置;五个约束感知基线(PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND)均至少未能满足一项要求。该机制泛化到保留的MetaDrive Medium(灾难率$41\%\to26\%$,95%自助法置信区间$[-26,-5]$个百分点),并显示出向Bullet Safety-Gym的方向一致迁移;我们诚实地报告了其不适用的情况(MetaDrive Easy/Hard, Qwen2-VL骨干),并将Hard失败归因于拉格朗日调节病理而非VLM信号本身。据我们所知,这是首个在CMDP拉格朗日更新中使用冻结VLM信号作为预期性成本项的工作。

英文摘要

The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the episode budget has been exceeded. At race speeds, where collisions are instantaneous and irreversible, any safety mechanism that waits for cost to accumulate is structurally too late. We present VLM-Safe-RL, a framework that integrates a frozen vision-language model into the CMDP Lagrangian update as an anticipatory cost term. The framework comprises four contributions: (i) Decoupled Dual-Path CLIP, independent reward/cost paths that respect the CMDP's factorization; (ii) VLM-Lagrange, an augmented multiplier update that incorporates a per-step VLM cost as an anticipatory term; (iii) Confidence Gating, a Bayes-optimal weight derived from a logistic noise model on the CLIP margin; and (iv) VLMPPOLag, the composed algorithm. On Safety-Gymnasium FormulaOne L2, our principal evaluation ($n{=}5$ seeds, $10^{6}$ steps, budget $d_{\text{lim}}{=}25$), VLMPPOLag$+$Conf is the only configuration in our default budget comparison that simultaneously retains substantive return ($J_r{\approx}40$) and holds cost within budget on a majority of seeds; the five constraint-aware baselines (PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND) each fail at least one requirement. The mechanism generalizes to held-out MetaDrive Medium (catastrophe rate $41\%{\to}26\%$, 95% bootstrap CI $[-26,-5]$ pp) and shows directionally consistent transfer to Bullet Safety-Gym; we report honestly where it does not (MetaDrive Easy/Hard, Qwen2-VL backbone) and trace the Hard failure to a Lagrangian-regulation pathology rather than the VLM signal itself. To our knowledge, this is the first work to use frozen VLM signals as an anticipatory cost term inside the CMDP Lagrangian update.
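The difference between a reactive and an anticipatory multiplier update can be sketched in a few lines. The abstract does not give the update rule or the form of the confidence gate, so everything below is an assumption for illustration: the function names, the additive form of the anticipatory term, the `beta` and `lr` parameters, and the gate `2*sigmoid(|margin|/temperature) - 1` are hypothetical and are not taken from the paper.

```python
import math

def confidence_weight(margin, temperature=1.0):
    """Map a CLIP margin to a weight in [0, 1) (hypothetical gate).

    Uses 2*sigmoid(|margin|/temperature) - 1, so a zero margin contributes nothing
    and a large margin approaches full weight.
    """
    return 2.0 / (1.0 + math.exp(-abs(margin) / temperature)) - 1.0

def multiplier_update(lam, episode_cost, budget, vlm_costs, margins, lr=0.05, beta=1.0):
    """One projected ascent step on the Lagrange multiplier (assumed form).

    reactive:     episode cost minus budget, nonzero only after costs are incurred.
    anticipatory: mean per-step VLM cost, each step gated by its confidence weight.
    """
    reactive = episode_cost - budget
    gated = [confidence_weight(m) * c for c, m in zip(vlm_costs, margins)]
    anticipatory = sum(gated) / len(gated) if gated else 0.0
    return max(0.0, lam + lr * (reactive + beta * anticipatory))

# Episode exactly on budget, but the (hypothetical) VLM flags risk with high confidence:
lam_anticipatory = multiplier_update(0.0, 25.0, 25.0, [1.0, 1.0], [5.0, 5.0])
# Same episode with no VLM signal: the multiplier does not move.
lam_reactive = multiplier_update(0.0, 25.0, 25.0, [], [])
```

In this toy setting the multiplier rises before the budget is exceeded only when the gated VLM cost is nonzero, which is the behavior the abstract describes as anticipatory.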

2606.11262 2026-06-11 cs.LG cs.AI 新提交

PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry

PermDoRA -- 理解语言模型中的适配器干扰:参数空间几何的局限性

Gowtham Sivaramakrishnan, Sarvesha Kumar Kombaiah Seetha, Kishan Gupta Balaji, Santhosh Baradwaj Vaduvur Ranganathan

发表机构 * Independent Researcher(独立研究员)

AI总结 研究适配器组合中的干扰是否源于线性参数更新重叠,通过DoRA-RBAC框架和几何感知合并策略实验,发现参数空间几何不是干扰主因,而是共享非线性表示中的交互。

详情
Comments
18 pages, COLM 2026
AI中文摘要

大型语言模型(LLMs)中的访问控制需要模块化机制,以在不重新训练或跨领域干扰的情况下实现特定领域行为。一个常见的假设是,适配器组合过程中的干扰源于线性参数更新的重叠,这表明强制正交性或方向独立性应能提高多领域性能。我们使用DoRA-RBAC(一种基于权重分解低秩适配的分层适配器组合框架)来测试这一假设。我们比较了传统的欧几里得合并与一种几何感知的黎曼启发式合并策略,该策略通过在LLaMA-3.1-8B和Mistral-7B上的多个QA基准(GPQA、PubMedQA、SimpleQA、WMDP)上进行归一化方向平均来近似弗雷歇均值。我们的结果表明,虽然单领域性能与LoRA相当,但几何感知合并相比标准平均在多领域组合中并未提供一致的优势。进一步分析揭示,适配器更新的角度对齐和正交性是组合性能的弱预测因子。这些发现表明,适配器干扰并非主要由参数空间几何决定,而是与共享非线性表示中的交互一致。

英文摘要

Access control in large language models (LLMs) requires modular mechanisms to enable domain-specific behavior without retraining or cross-domain interference. A common hypothesis is that interference during adapter composition arises from overlap in linear parameter updates, suggesting that enforcing orthogonality or directional independence should improve multi-domain performance. We test this hypothesis using DoRA-RBAC, a hierarchical adapter composition framework based on weight-decomposed low-rank adaptation. We compare conventional Euclidean merging with a geometry-aware Riemannian-inspired merging strategy that approximates the Fréchet mean via normalized directional averaging across multiple QA benchmarks (GPQA, PubMedQA, SimpleQA, WMDP) on LLaMA-3.1-8B and Mistral-7B. Our results show that while single-domain performance matches LoRA, geometry-aware merging provides no consistent advantage over standard averaging in multi-domain composition. Further analysis reveals that angular alignment and orthogonality of adapter updates are weak predictors of composition performance. These findings suggest that adapter interference is not governed primarily by parameter-space geometry, but is instead consistent with interactions in shared nonlinear representations.
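The two merging strategies compared in the abstract can be contrasted on flattened update vectors. This is a minimal sketch under stated assumptions, not the DoRA-RBAC implementation: adapter updates are treated as plain Python lists, "normalized directional averaging" is read as averaging unit vectors and renormalizing, and the merged magnitude is taken to be the mean of the input norms. The function names are hypothetical.

```python
import math

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in u))

def cosine(u, v):
    """Angular alignment between two update vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def euclidean_merge(updates):
    """Conventional merge: coordinate-wise average of the updates."""
    n = len(updates)
    return [sum(col) / n for col in zip(*updates)]

def directional_merge(updates):
    """Average directions on the unit sphere, then restore a magnitude.

    Averaging unit vectors and renormalizing is a common approximation of the
    spherical Frechet mean; using the mean input norm as the magnitude is an
    assumption of this sketch. Zero-length updates are not handled.
    """
    norms = [norm(u) for u in updates]
    units = [[x / n for x in u] for u, n in zip(updates, norms)]
    mean_dir = euclidean_merge(units)
    length = norm(mean_dir)
    if length == 0.0:
        raise ValueError("directions cancel; the spherical mean is undefined")
    magnitude = sum(norms) / len(norms)
    return [magnitude * x / length for x in mean_dir]

# Two orthogonal toy updates with different magnitudes.
a, b = [3.0, 0.0], [0.0, 4.0]
flat = euclidean_merge([a, b])       # pulled toward the larger update, shorter than either input
curved = directional_merge([a, b])   # bisects the two directions, keeps the mean magnitude
```

The toy case shows where the two merges differ geometrically: the Euclidean average shrinks in norm when directions disagree, while the directional merge preserves magnitude. The abstract's finding is that this geometric difference does not translate into a consistent downstream advantage.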

2606.11260 2026-06-11 cs.SD cs.AI 新提交

RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark

RAIL: 基于CHC框架重新思考大型音频语言模型中的听觉智能

Hongyu Jin, Siyi Wang, Yang Xiao, Jiaheng Dong, Shihong Tan, Kaiyuan Peng, Georgiana Juravle, Shanquan Chen, Gongping Huang, Hong Jia, Eun-Jung Holden, James Bailey, Ting Dang

发表机构 * School of Computing and Information Systems, The University of Melbourne(墨尔本大学计算与信息系统学院) Faculty of Psychology and Educational Sciences, Alexandru Ioan Cuza University of Iași(亚历山德鲁伊万库扎大学心理学与教育科学学院) School of Electronic Information, Wuhan University(武汉大学电子信息学院) School of Public Health, The University of Hong Kong(香港大学公共卫生学院) School of Computer Science, The University of Auckland(奥克兰大学计算机科学学院) Department of Data Science and Artificial Intelligence, Monash University(莫纳什大学数据科学与人工智能系)

AI总结 提出RAIL基准,基于CHC认知框架将听觉智能分解为五种核心能力,构建结构化评估任务,系统评测大型音频语言模型的认知行为。

详情
AI中文摘要

人类通过紧密集成的认知能力(如音频感知、音频推理和记忆)处理丰富的听觉环境。尽管大型音频语言模型(LALMs)在语音理解和多模态音频推理方面取得了近期进展,但当前的评估范式仍然主要围绕任务或模态,关注最终性能而忽视了潜在的听觉认知行为。这揭示了人类听觉认知理解与LALMs评估之间的根本差距,特别是缺乏将认知原则操作化到任务级指标之外以系统捕捉模型行为的框架。在这项工作中,我们引入了RAIL,一种基于Cattell-Horn-Carroll(CHC)认知框架的以人为中心的评估范式。RAIL将听觉认知形式化为五种核心能力,并将其发展为结构化评估任务,探究模型如何处理、保留和整合听觉信息。我们进一步构建了一个认知基础的基准,包含原则性数据收集和人类对齐的评估协议。评估26个最先进的LALMs,我们发现当前模型在认知能力上表现出高度不平衡的性能。RAIL建立了一种新的评估范式,从以任务为中心的基准测试转向基于认知的听觉智能评估。

英文摘要

Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and memory. Despite recent progress in large audio-language models (LALMs) across speech understanding and multimodal audio reasoning, current evaluation paradigms remain largely task- or modality-centric, focusing on end performance while overlooking underlying auditory cognitive behaviours. This reveals a fundamental gap between how auditory cognition is understood in humans and how it is evaluated in LALMs, particularly in the lack of frameworks that operationalise cognitive principles beyond task-level metrics to systematically capture model behaviour. In this work, we introduce RAIL, a human-centric evaluation paradigm grounded in the Cattell-Horn-Carroll (CHC) cognitive framework. RAIL formalises auditory cognition into five core capabilities and develops them into structured evaluation tasks that probe how models process, retain, and integrate auditory information. We further construct a cognitively grounded benchmark with principled data curation and human-aligned evaluation protocols. Evaluating 26 state-of-the-art LALMs, we find that current models exhibit highly uneven performance across cognitive abilities. RAIL establishes a new evaluation paradigm that moves beyond task-centric benchmarking toward cognitively grounded assessment of auditory intelligence.

2606.11257 2026-06-11 cs.CL cs.LG cs.PF 新提交

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

移动NPU上的能效型设备端RAG:Snapdragon X Elite系统设计与基准测试

Zhiyuan Cheng, Longying Lai

发表机构 * Qualcomm(高通) Snapdragon X Elite(骁龙X Elite) Dell XPS 13 laptop(戴尔XPS 13笔记本电脑) Qualcomm Hexagon NPU(高通Hexagon NPU) Adreno X1-85

AI总结 本文首次在Snapdragon X Elite的Hexagon NPU上实现端到端RAG流水线,通过对比CPU和GPU,NPU在嵌入吞吐量、系统能耗和查询延迟上分别提升9.1倍、降低12.3倍和4.0倍,且答案质量相当。

详情
Comments
9 pages, 2 figures, 6 tables
AI中文摘要

检索增强生成(RAG)流水线计算密集,结合了嵌入、检索、重排序和大语言模型(LLM)生成。完全在设备端运行有利于隐私、延迟和离线使用,但CPU推理的能耗成本是一个主要障碍。我们提出了据我们所知第一个在Snapdragon X Elite的Qualcomm Hexagon NPU上运行所有神经阶段(嵌入、重排序和LLM生成)的端到端RAG流水线。在Dell XPS 13笔记本电脑上进行性能分析,我们比较了NPU加速的RAG与CPU和OpenCL/Adreno GPU基线在索引和查询工作负载上的表现。在索引方面,NPU实现了9.1倍的嵌入吞吐量提升和12.3倍的系统能耗降低。在120查询的Wikipedia段落基准测试中,与CPU基线相比,NPU实现了18.1倍的LLM预填充加速、4.0倍的端到端查询延迟降低和4.0倍的系统能耗降低;集成GPU上的相同工作负载比CPU慢1.7倍,且能耗比NPU高6.5倍。GPT-4.1 LLM作为评判者的评估发现,NPU的答案质量与CPU和GPU相当,在评估者噪声范围内(1-10分制下平均9.32 vs. 8.95 vs. 9.03),86.7%的查询在所有三个后端上得分相同。因此,在Snapdragon X Elite / Hexagon类笔记本电脑SoC上,NPU实现了实用、能效高的设备端RAG,且无质量退化——这是一条通往绿色边缘智能的可持续路径,我们预计随着软件栈的成熟,该方法将推广到类似的移动NPU(Apple Neural Engine、Intel NPU、MediaTek APU)。

英文摘要

Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression -- a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.

2606.11251 2026-06-11 cs.LG New submission

Mechanical Field Networks: Structured Neural Dynamics for Multivariate Systems

Xingji Cui

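Affiliations: Xi’an Jiaotong University

AI summary: Proposes MF-Net, a recurrent model that represents a multivariate system as a shared field state and updates that state through a learnable relation law, achieving competitive forecasting while retaining interpretable structure.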

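Details
AI abstract (translated from Chinese)

Many multivariate dynamical systems are observed only through trajectories, and the mechanisms governing their joint dynamics remain hidden. Existing methods can impose interpretable dynamics or learn flexible state transitions, but the resulting interaction structure is usually either specified in advance or left implicit in the learned dynamics. We introduce MF-Net, a recurrent dynamical model that represents all variables in a shared field state and updates that state through a learned relation law. Each variable carries a field component, and these components evolve jointly through a learnable mechanical transition. Here, "mechanical" refers to the relation-to-motion organization of the transition, in which learned relations shape the state-dependent flows, field responses, and motion tendencies that move the field state forward. The resulting structure is part of the rollout itself: learned relations influence how the field moves, and the same internal quantities support both forecasting and structural readout. Across known-law interaction systems, chaotic benchmarks, real neural recordings, and ecological time series, MF-Net achieves competitive short- and medium-horizon forecasting while retaining an inspectable structural readout. On the 40-dimensional Lorenz-96 testbed, MF-Net reaches an eight-step $R^2$ of $0.798\pm0.018$; across five random seeds, its learned relation matrix recovers the local coupling support with a local/nonlocal strength ratio of $19.80\pm1.00$ and a Precision@$K$ of $1.000\pm0.000$. MF-Net provides a structure-readable framework for dynamical modeling in which learned relations are trained through forward evolution and, on real data, are interpreted as functional predictive couplings under appropriate observational limits.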

English abstract

Many multivariate dynamical systems are observed only through trajectories, leaving the mechanisms governing their joint dynamics hidden. Existing approaches can impose interpretable dynamics or learn flexible state transitions, yet the resulting interaction structure is typically either specified in advance or left implicit within the learned dynamics. We introduce MF-Net, a recurrent dynamical model that represents all variables in a shared field state and updates this state through a learned relation law. Each variable carries a field component, and these components evolve jointly through a learnable mechanical transition. Here, mechanical refers to the relation-to-motion organization of the transition, where learned relations shape state-dependent flows, field responses, and motion tendencies that move the field state forward. The resulting structure is part of the rollout itself: learned relations influence how the field moves, and the same internal quantities support both forecasting and structural readout. Across known-law interaction systems, chaotic benchmarks, real neural recordings, and ecological time series, MF-Net achieves competitive short- and medium-horizon forecasting while retaining inspectable structural readout. On the 40-dimensional Lorenz--96 testbed, MF-Net achieves an eight-step $R^2$ of $0.798\pm0.018$; across five seeds, its learned relation matrix recovers the local coupling support with a local/nonlocal strength ratio of $19.80\pm1.00$ and Precision@$K$ of $1.000\pm0.000$. MF-Net provides a structure-readable dynamical modeling framework in which learned relations are trained through forward evolution and, on real data, interpreted as functional predictive couplings under appropriate observational limits.
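
The structural metrics quoted for Lorenz-96 can be illustrated with a small sketch. It assumes the standard Lorenz-96 equations, $dx_i/dt = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F$, so that variable $i$ is coupled to its neighbours $i-2$, $i-1$, and $i+1$. The abstract does not define Precision@$K$ or the strength ratio exactly, so the definitions below (top-$K$ off-diagonal entries, mean magnitude on versus off the support) and the toy relation matrix are assumptions for illustration, not the paper's evaluation code.

```python
def lorenz96_support(n):
    """Off-diagonal (i, j) pairs where x_j enters dx_i/dt in Lorenz-96."""
    return {(i, (i + d) % n) for i in range(n) for d in (-2, -1, 1)}


def precision_at_k(relation, support, k):
    """Fraction of the k strongest off-diagonal entries that lie in `support`."""
    n = len(relation)
    ranked = sorted(
        ((abs(relation[i][j]), i, j)
         for i in range(n) for j in range(n) if i != j),
        reverse=True,
    )
    hits = sum((i, j) in support for _, i, j in ranked[:k])
    return hits / k


def strength_ratio(relation, support):
    """Mean |entry| on the support divided by mean |entry| off it.

    The diagonal is ignored (an assumption of this sketch).
    """
    n = len(relation)
    near, far = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            bucket = near if (i, j) in support else far
            bucket.append(abs(relation[i][j]))
    return (sum(near) / len(near)) / (sum(far) / len(far))


# Hypothetical relation matrix: strong entries on the true support,
# weak entries everywhere else. Not data from the paper.
n = 40
support = lorenz96_support(n)
toy = [[1.0 if (i, j) in support else 0.05 for j in range(n)]
       for i in range(n)]

p_at_k = precision_at_k(toy, support, k=len(support))
ratio = strength_ratio(toy, support)
```

With $K$ set to the size of the true support, a Precision@$K$ of 1 means every one of the strongest learned relations falls on a genuine coupling.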

2606.11249 2026-06-11 cs.RO cs.LG cs.MA New submission

MASK: Multi-Agent Semantic K-Scheduling for Risk-Sensitive 6G Robotics

Ahmet Gunhan Aydin, Elif Tugce Ceran

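Affiliations: Middle East Technical University; Aselsan Inc.

AI summary: To address limited spectrum resources in 6G collaborative robotic sensing, proposes the Multi-Agent Semantic K-Scheduling (MASK) architecture, which uses an Arbiter-Assisted Semantic Information Gating (A-SIG) mechanism to schedule only the K agents with the highest semantic importance. Combined with a self-supervised global encoder and a distributional policy, it achieves robust, risk-aware coordination under strict bandwidth limits, with performance close to communication-unconstrained baselines.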

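Details
AI abstract (translated from Chinese)

Realizing the vision of 6G connected robotics requires reconciling high-performance collaborative control with the rigid spectral limits of physical wireless channels. In realistic collaborative sensing scenarios, spectral resources are quantized into a finite number of physical resource blocks or orthogonal subcarriers, which makes simultaneous transmission by all agents infeasible. To address this, we propose Multi-Agent Semantic K-Scheduling (MASK), a control architecture designed to maintain robust, risk-aware coordination under strict instantaneous bandwidth limits. We introduce Arbiter-Assisted Semantic Information Gating (A-SIG), a lightweight coordination mechanism that enforces hard access constraints by scheduling only the top-K agents according to locally computed semantic importance scores. By aggregating these prioritized observations into a compact latent state, a self-supervised global encoder allows a distributional policy to mitigate tail risk despite sparse data. We evaluate MASK on multiple benchmarks and show that it matches the performance of communication-unconstrained baselines even when channel access is limited to a small fraction of the swarm size. The framework is also inherently resilient to packet erasures, which validates semantic scheduling as a key enabling technology for resource-constrained 6G systems.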

English abstract

Realizing the vision of 6G connected robotics requires reconciling high-performance collaborative control with the rigid spectral limitations of physical wireless channels. In realistic collaborative sensing scenarios, spectral resources are quantized into finite physical resource blocks or orthogonal subcarriers, rendering simultaneous transmission by all agents infeasible. To address this, we propose Multi-Agent Semantic K-Scheduling (MASK), a control architecture designed to sustain robust, risk-aware coordination under strict instantaneous bandwidth caps. We introduce Arbiter-Assisted Semantic Information Gating (A-SIG), a lightweight coordination mechanism that enforces hard access constraints by scheduling only the top-K agents based on locally computed semantic importance scores. By aggregating these prioritized observations into a compact latent state, a self-supervised global encoder enables a distributional policy to mitigate tail risks despite data sparsity. We evaluate MASK across diverse benchmarks, demonstrating that it matches the performance of communication-unconstrained baselines even when channel access is restricted to a small fraction of the swarm size. Furthermore, the framework exhibits inherent resilience to packet erasures, validating semantic scheduling as a critical enabler for resource-constrained 6G systems.
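
The top-K gating idea described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' A-SIG implementation: the function names, the tie-breaking rule, and the use of a plain element-wise mean in place of the learned self-supervised encoder are all hypothetical.

```python
def schedule_top_k(importance, k):
    """Return the ids of the k agents with the highest importance scores.

    Ties are broken by agent id so the result is deterministic
    (a choice made for this sketch).
    """
    ranked = sorted(importance, key=lambda agent: (-importance[agent], agent))
    return ranked[:k]


def aggregate(observations, scheduled):
    """Average the scheduled agents' observation vectors element-wise.

    Stand-in for the learned global encoder mentioned in the abstract.
    """
    rows = [observations[agent] for agent in scheduled]
    return [sum(column) / len(rows) for column in zip(*rows)]


# Hypothetical scores and observations for four agents.
importance = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7}
observations = {
    "a": [0.0, 0.0],
    "b": [1.0, 3.0],
    "c": [9.0, 9.0],
    "d": [3.0, 5.0],
}

scheduled = schedule_top_k(importance, k=2)   # only 2 channel slots
state = aggregate(observations, scheduled)    # compact shared state
```

Here only agents `b` and `d` transmit, so the shared state is built from two of the four observations; the hard cap `k` plays the role of the number of available resource blocks.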

2606.11243 2026-06-11 cs.LG cs.CL New submission

ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation

Chuanzhen Wang, Meade Cleti, Pete Jano

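Affiliations: Arizona State University; University of Wisconsin-Madison; Tongji University

AI summary: Proposes ProHiFlo, a hierarchical flow matching framework that achieves efficient, accurate de novo protein generation through coarse-to-fine generation, functional guidance, and an adaptive SE(3)-equivariant architecture, reaching a 58.9% success rate on enzyme active site scaffolding.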

Details
Comments
23 pages
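AI abstract (translated from Chinese)

De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology. Although diffusion-based and flow matching methods have made progress, they typically operate at a single resolution and lack mechanisms for incorporating functional constraints. We propose ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse-to-fine generation, which models backbone geometry first and then refines to all-atom coordinates, reducing computational cost while maintaining accuracy; (2) functional guidance, which uses pretrained predictors to steer generation toward desired properties without retraining; and (3) an adaptive SE(3)-equivariant architecture for efficient multi-scale processing. Experiments on unconditional generation, motif scaffolding, and functional design show state-of-the-art performance while requiring 4x fewer sampling steps. On enzyme active site scaffolding, ProHiFlo reaches a 58.9% success rate, compared with 41.2% for RFDiffusion.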

English abstract

De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology. While diffusion-based and flow matching approaches have achieved progress, they typically operate at a single resolution and lack mechanisms for incorporating functional constraints. We introduce ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse-to-fine generation that models backbone geometry before refining to all-atom coordinates, reducing computational cost while maintaining accuracy; (2) functional guidance leveraging pretrained predictors to steer generation toward desired properties without retraining; (3) an adaptive SE(3)-equivariant architecture for efficient multi-scale processing. Experiments on unconditional generation, motif scaffolding, and functional design demonstrate state-of-the-art performance while requiring 4x fewer sampling steps. On enzyme active site scaffolding, ProHiFlo achieves a 58.9% success rate compared to 41.2% for RFDiffusion.
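
To make "sampling steps" and "guidance without retraining" concrete, here is a one-dimensional toy sketch of flow matching sampling by Euler integration. It is not ProHiFlo: the closed-form velocity field, the scalar state, and the additive guidance term are assumptions chosen for illustration. For a straight-line path toward a fixed target, the velocity at state $x$ and time $t$ is $(\text{target} - x)/(1 - t)$, and each Euler step costs one velocity evaluation.

```python
def euler_sample(x0, velocity, steps, guidance=None, weight=0.0):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with `steps` Euler steps.

    `guidance(x)` is a hypothetical extra term (for example the gradient of a
    pretrained property predictor) added to the velocity at sampling time,
    so the velocity model itself is left unchanged.
    """
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        if guidance is not None:
            v += weight * guidance(x)
        x += dt * v
    return x


def toy_velocity(target):
    """Velocity field of a straight-line path that ends at `target`."""
    def velocity(x, t):
        return (target - x) / (1.0 - t)
    return velocity


v = toy_velocity(2.0)
coarse = euler_sample(-1.0, v, steps=4)    # 4 velocity evaluations
fine = euler_sample(-1.0, v, steps=16)     # 16 velocity evaluations
nudged = euler_sample(-1.0, v, steps=4, guidance=lambda x: 1.0, weight=0.5)
```

In this toy case both step counts land on the target because the path is exactly straight; with a learned, curved velocity field, fewer steps generally trade accuracy for speed, which is why a reduction in the required step count matters.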