arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.11969 2026-06-11 cs.CV New submission

SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation

Xu Zhang, Yu Lu, Ruijie Quan, Zhaozheng Chen, Bohan Wang, Yi Yang

Affiliations: ReLER, College of Artificial Intelligence, Zhejiang University; Huawei Central Research Institute

AI summary: Proposes SpecLoR, a plug-and-play inference method that reduces spatiotemporal inconsistency in text-to-video generation through lookahead prediction and frequency-domain rectification; on Wan2.2 it markedly improves motion coherence while adding only 4 NFEs.

Abstract

Flow Matching has enabled robust text-to-video generation via latent ODE sampling. However, velocity approximation and numerical discretization errors inevitably accumulate, causing sampling trajectories to drift. Consequently, generated videos often suffer from severe spatiotemporal inconsistencies. Nevertheless, directly correcting these drifted, noisy latents is challenging: (i) timestep-dependent noise obscures reliable structural cues; (ii) spatial interventions risk disrupting intricate local geometry while incurring heavy computational costs. To address this, we propose Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference method that bypasses noise via lookahead prediction, and circumvents spatiotemporal entanglement by shifting corrections to the frequency domain, where universal statistical priors of natural videos are readily available. First, during early sampling stages, SpecLoR looks ahead to estimate the clean latent $z_{t,0}$ and computes its 3D spatiotemporal spectrum. Next, SpecLoR rectifies the amplitude spectrum to match the prior, leaving the phase intact. Finally, the corrected state is re-noised to resume ODE integration. Experiments on Wan2.2 demonstrate that SpecLoR significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks with minimal computational overhead (4 additional NFEs).
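
The amplitude-rectification step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `rectify_amplitude`, the blend strength `alpha`, and the toy target `prior_amp` are assumptions made for illustration, and the lookahead and re-noising steps are left out.

```python
import numpy as np

def rectify_amplitude(z0, prior_amp, alpha=1.0):
    """Pull the 3D amplitude spectrum of z0 toward prior_amp, keeping the phase.

    z0        : real array of shape (T, H, W), a hypothetical clean-latent estimate
    prior_amp : non-negative array of the same shape, a hypothetical target amplitude
    alpha     : assumed blend strength in [0, 1]; 1.0 replaces the amplitude entirely
    """
    spectrum = np.fft.fftn(z0)                   # 3D spatiotemporal spectrum
    amp, phase = np.abs(spectrum), np.angle(spectrum)
    new_amp = (1.0 - alpha) * amp + alpha * prior_amp
    rectified = new_amp * np.exp(1j * phase)     # phase is left untouched
    return np.fft.ifftn(rectified).real

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
# Toy prior: the amplitude spectrum of another real-valued sample.
prior_amp = np.abs(np.fft.fftn(rng.standard_normal((4, 8, 8))))
out = rectify_amplitude(z0, prior_amp, alpha=1.0)
```

Because the toy prior comes from a real-valued signal, it is symmetric under frequency negation, so the rectified spectrum still corresponds to a real latent. A prior without that symmetry would lose information when the imaginary part is dropped.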

2606.11968 2026-06-11 cs.LG stat.ML New submission

Efficient Multinomial Logistic Bandit via Frequent Directions

Linzhe He, Yu-Jie Zhang, Sifan Yang, Lijun Zhang

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Paul G. Allen School of Computer Science & Engineering, University of Washington

AI summary: To address the high-dimensional computational bottleneck of multinomial logistic bandits, proposes EOFD-MLogB, which integrates frequent directions matrix sketching, reduces per-round cost to O(Kd(m+K)^2) time and O(Kd(m+K)) space, and proves a regret bound close to that of the original algorithm.

Abstract

This paper studies efficient online algorithms for multinomial logistic bandits (MLogB), where the feedback distribution over $K+1$ outcomes follows a multinomial logistic model of $d$-dimensional action vectors. A representative UCB-type algorithm, OFUL-MLogB, achieves a regret bound of $\tilde{\mathcal{O}}(Kd\sqrt{T})$, but still requires $\mathcal{O}(K^3d^3)$ time and $\mathcal{O}(K^2d^2)$ space per round due to parameter estimation and optimistic reward construction, which is prohibitive in high-dimensional settings. To address this limitation, we propose EOFD-MLogB, which integrates frequent directions matrix sketching into OFUL-MLogB. By maintaining a low-rank SVD sketch of the accumulated Hessian, constrained online Newton updates in parameter estimation and $Kd \times K$ spectral-norm computations in the reward bonus are reduced to one-dimensional root-finding tasks and $K \times K$ eigenvalue computations, respectively. This yields dominant per-round time complexity $\mathcal{O}(Kd(m+K)^2)$ and space complexity $\mathcal{O}(Kd(m+K))$, where $m \ll d$ is the sketch size. We further prove a regret bound of $\tilde{\mathcal{O}}(\Delta_T(Kd\ln\Delta_T+m)\sqrt{T})$, where the sketching error factor $\Delta_T$ is controlled by the $m$-truncated spectral tail of the Hessian. Thus, when the Hessian is approximately low-rank, the regret is close to that of OFUL-MLogB. Experiments validate the computational efficiency and competitive performance.
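
For readers unfamiliar with frequent directions, the generic sketching primitive can be written in a few lines of NumPy. This is a textbook-style variant that shrinks after every row, not EOFD-MLogB itself: the function name and the streaming-rows setting are assumptions, and the bandit-specific Hessian updates are not shown.

```python
import numpy as np

def frequent_directions(rows, m):
    """Stream rows of A (n x d) into an m x d sketch B with B^T B close to A^T A."""
    rows = np.asarray(rows, dtype=float)
    d = rows.shape[1]
    assert m <= d, "this simple variant assumes m <= d"
    B = np.zeros((m, d))
    for a in rows:
        B[-1] = a                                # last row is zero before insertion
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        delta = s[-1] ** 2                       # smallest squared singular value
        shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
        B = shrunk[:, None] * Vt                 # shrinking zeroes the last row again
    return B

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 30))
B = frequent_directions(A, m=10)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)       # spectral-norm sketching error
bound = np.linalg.norm(A, "fro") ** 2 / 10       # squared Frobenius norm over m
```

In this variant the spectral error stays below the squared Frobenius norm of `A` divided by `m`. Practical implementations usually batch the SVD over several rows to lower the amortized cost.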

2606.11966 2026-06-11 cs.CV New submission

Feature extraction for plant growth estimation

Simbarashe Aldrin Ngorima, Albert Helberg, Marelie H. Davel

Affiliations: Faculty of Engineering, North-West University; Centre for Artificial Intelligence Research; National Institute for Theoretical and Computational Sciences

AI summary: For real-time estimation of plant growth stages in precision agriculture, proposes two feature extraction methods (Gabor filters with morphological operations, and pre-trained CNNs with transfer learning), tested on a public dataset; the CNN approach beats hand-crafted features in both speed and accuracy, and the best system (VGG-19 features with an RBF SVM) reaches 98.4% accuracy at 0.08 seconds per image.

Comments
13 pages

Abstract

Precision agriculture requires the estimation of plant growth stages in real-time. When the plant growth stage is known, the wastage of resources in cultivation, such as nutrients and water, is reduced as only the required resources need to be supplied. Plants at different growth stages, however, have similar morphological features, which can make autonomous growth stage estimation difficult. This paper presents two feature extraction methods for growth stage estimation: one that uses a bank of Gabor filters and morphological operations, and the other that uses pre-trained convolutional neural networks (CNNs) and transfer learning. We test these methods on a publicly available plant growth stage dataset ("bccr-segset") for two species, canola and radish, grown and captured under indoor conditions. The two proposed feature extraction methods are compared, using support vector machines and boosted trees as classifiers. We find that both methods are suitable for real-time applications, and that CNN features outperform the hand-crafted features, both with regard to speed and accuracy. The best system (VGG-19 features, classified with a radial basis function support vector machine) obtained an accuracy of 98.4% for both species, processing an image in 0.08 seconds.

2606.11963 2026-06-11 cs.LG physics.comp-ph New submission

HAMNO: A Hierarchical Adaptive Multi-scale Neural Operator with Physics-Informed Learning for Dynamical Systems

Mostafa Bamdad, Mohammad Sadegh Eshaghi, Timon Rabczuk

Affiliations: Bauhaus-Universität Weimar; Leibniz University Hannover

AI summary: Proposes HAMNO, a neural operator architecture that balances local and global information through an adaptive gating mechanism, together with the physics-informed extension PI-HAMNO, improving long-term prediction accuracy and physical consistency on non-periodic Allen-Cahn and related equations.

Abstract

Neural operators provide a powerful framework for learning solution mappings of partial differential equations directly in function space. However, many existing architectures still struggle to represent nonlinear time-dependent systems that involve multi-scale structures, long-range interactions, and stable long-time evolution. In this work, we introduce the Hierarchical Adaptive Multi-scale Neural Operator (HAMNO), a neural-operator architecture that combines local convolutional representations, global spectral operators, and hierarchical encoder-decoder processing. The central component of HAMNO is a data-dependent gating mechanism that adaptively balances local and global information at each spatial location, allowing the model to resolve fine-scale features while preserving long-range dependencies. We further develop a physics-informed extension, PI-HAMNO, based on a multi-objective loss strategy that combines data fitting with strong- and weak-form physics constraints. The strong-form term penalizes the domain-integrated squared PDE residual in physical coordinates, while the weak-form term is constructed by multiplying the governing residual by finite-element test functions and evaluating the resulting element integrals using centroid-based tetrahedral quadrature. The framework is evaluated on non-periodic Allen-Cahn (AC), Cahn-Hilliard (CH), and Swift-Hohenberg (SH) equations defined on cubic domains. Across long-horizon rollout, data-limited training, out-of-distribution initial-condition shifts, and random-seed variations, HAMNO improves predictive accuracy over standard neural-operator baselines, while PI-HAMNO further enhances stability, physical consistency, and data efficiency. The implementation is publicly available at https://github.com/HAMNO/HAMNO.
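
The idea of a data-dependent gate that balances local and global information at each spatial location can be sketched as follows. This is only an illustration under assumed shapes: the function `gated_fusion`, the linear gate with weights `w` and bias `b`, and the flat `(N, C)` layout are hypothetical and are not taken from the HAMNO code.

```python
import numpy as np

def gated_fusion(local_feat, global_feat, w, b):
    """Blend two feature maps with a per-location sigmoid gate.

    local_feat, global_feat : arrays of shape (N, C) for N spatial locations
    w : hypothetical gate weights of shape (2 * C,), b : hypothetical scalar bias
    """
    logits = np.concatenate([local_feat, global_feat], axis=1) @ w + b
    gate = 1.0 / (1.0 + np.exp(-logits))         # shape (N,), values in (0, 1)
    return gate[:, None] * local_feat + (1.0 - gate[:, None]) * global_feat

local_feat = np.ones((3, 2))
global_feat = np.zeros((3, 2))
mixed = gated_fusion(local_feat, global_feat, np.zeros(4), 0.0)   # gate is 0.5 everywhere
```

A gate near 1 favors the local branch and a gate near 0 favors the global branch, so the blend can differ from one location to the next.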

2606.11961 2026-06-11 cs.LG cs.AI New submission

Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data

Antonio Pelusi, Stefano Braghin, Alberto Trombetta

Affiliations: University of Insubria; IBM Research Ireland

AI summary: Studies the limits of in-context learning for structured data generation with large language models, finding that it cannot update the categorical prior inherited from pre-training, so rare classes are never generated; parameter-efficient fine-tuning resolves this but introduces memorization risk.

Comments
9 pages, 5 figures. Empirical study of in-context learning and LoRA fine-tuning for synthetic tabular data generation, introducing the phenomenon of categorical prior lock-in. Under review

Abstract

Large language models (LLMs) are increasingly used as conditional generators for structured data, relying on in-context learning (ICL) to adapt to new distributions without parameter updates. We investigate the limits of ICL for structured generation under distribution mismatch, using high-cardinality tabular data as a controlled test case, and identify a structural failure mode we term categorical prior lock-in: the inability of ICL to update the model's prior over token distributions inherited from pre-training. Across two 7B-parameter open-weight models, ICL improves numerical fidelity with additional examples but exhibits a sharp ceiling on categorical distributions, failing to reproduce rare classes entirely. Parameter-efficient fine-tuning (LoRA) overcomes these limitations but introduces measurable memorization risk and, in some cases, destabilizes structured output generation, highlighting a fundamental trade-off between adaptability and privacy.

2606.11953 2026-06-11 cs.CL New submission

Decoding Multimodal Cues: Unveiling the Implicit Meaning Behind Hateful Videos

Junyu Lu, Deyi Ji, Liqun Liu, Xiaokun Zhang, Youlin Wu, Roy Ka-Wei Lee, Peng Shu, Huan Yu, Jie Jiang, Bo Xu, Liang Yang, Hongfei Lin

Affiliations: Dalian University of Technology; Tencent; City University of Hong Kong; Singapore University of Technology and Design

AI summary: Proposes the IARE framework, which achieves explainable hateful video detection through information augmentation and reasoning enhancement, reaching state-of-the-art performance on the Ex-HateMM and Ex-ImpliHateVid datasets.

Abstract

Hateful videos have become prevalent on online platforms, highlighting an urgent need for effective detection. However, existing studies primarily focus on binary classification and fail to provide contextual rationales that reveal the implicit meanings behind these judgments, significantly undermining model explainability. To fill this gap, we aim to achieve explainable hateful video detection, enabling models to provide contextual rationales that integrate relevant evidence and logical reasoning alongside decisions. This approach can comprehensively enhance the understanding of video content and the explainability of the decision-making process. We first introduce two datasets, Ex-HateMM and Ex-ImpliHateVid, for explainable hateful video detection. Each dataset provides fine-grained annotations of multimodal harmful elements, along with contextual rationales. We then propose an Information Augmentation and Reasoning Enhancement (IARE) framework designed for explainable detection. The framework employs an information augmentation phase that leverages the multimodal chain-of-thought to integrate harmful elements, thereby enriching rationale evidence. Additionally, IARE incorporates a reasoning enhancement phase, in which Direct Preference Optimization guides the model toward correct reasoning paths and away from incorrect ones, thereby improving the logical coherence of its justifications. We conduct extensive experiments on the two datasets, comparing multiple baselines with our proposed IARE framework. The results demonstrate that IARE achieves state-of-the-art performance while also generating accurate rationales.
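
The reasoning enhancement phase relies on Direct Preference Optimization. As a reminder of what that objective computes, here is a minimal sketch of the standard DPO loss for a single preference pair. The function name, argument names, and the default `beta` are assumptions, and nothing here reflects how IARE builds its preference pairs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, from summed sequence log-probabilities.

    logp_*     : log-probabilities under the policy being trained
    ref_logp_* : log-probabilities under the frozen reference model
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for numerical stability
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The policy favors the chosen rationale more than the reference does, so the loss is below log(2).
example = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference faster than it raises the rejected one's.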

2606.11952 2026-06-11 cs.RO New submission

Deformable In-Hand Slip-Aware Tactile Sensor with Integrated Velocity, Force/Torque, and Pressure Map Sensing

Gabriel Arslan Waltersson, Yiannis Karayiannidis

Affiliations: Chalmers University of Technology; Lund University

AI summary: Proposes a new tactile sensor that integrates velocity, force/torque, and pressure map sensing through a deformable contact pad, enabling slip-aware control for in-hand manipulation and supporting rapid, low-cost fabrication.

Abstract

This paper introduces a novel tactile sensor for in-hand manipulation with slip-aware control that integrates velocity, force/torque, and pressure map sensing into a single device with a deformable contact pad. To the best of our knowledge, this is the first sensor to combine these sensing modalities within a single compliant structure. The sensor features a deformable contact surface and can robustly track both flat and curved surfaces across a wide range of materials. Its performance is evaluated through a comprehensive set of experiments that highlight both its capabilities and limitations. The sensor is designed for rapid and low-cost fabrication using a combination of standard PCB manufacturing and rapid prototyping techniques.

2606.11945 2026-06-11 cs.CL cs.IR New submission

uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking

Simon Lupart, Kidist Amde Mekonnen, Zahra Abbasiantaeb, Mohammad Aliannejadi

Affiliations: University of Amsterdam

AI summary: Proposes a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation for conversational systems across four domains, handling unanswerable queries effectively.

Comments
SemEval-2026, The 20th International Workshop on Semantic Evaluation, collocated with ACL 2026, 9 pages, 5 figures, 6 tables

Abstract

This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.
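
First-stage scoring in sparse retrieval reduces to a dot product between term-weight vectors. The sketch below shows that generic operation with made-up weights. The helper names and the toy documents are hypothetical, and the team's actual retriever, query rewriting, and reranking prompts are not represented.

```python
def sparse_score(query_vec, doc_vec):
    """Dot product of two sparse term-weight vectors stored as dicts."""
    if len(query_vec) > len(doc_vec):            # iterate over the shorter vector
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

def retrieve(query_vec, docs, k=2):
    """Return the ids of the k highest-scoring documents."""
    ranked = sorted(docs, key=lambda d: sparse_score(query_vec, docs[d]), reverse=True)
    return ranked[:k]

# Toy weights, invented for illustration; a learned model would produce these.
docs = {
    "d1": {"refund": 1.2, "policy": 0.8},
    "d2": {"cloud": 1.5, "storage": 0.9},
    "d3": {"refund": 0.3, "cloud": 0.2},
}
query = {"refund": 1.0, "policy": 0.5}
top = retrieve(query, docs, k=2)
```

In a pipeline like the one described, a list such as `top` would then be passed to a reranker.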

2606.11931 2026-06-11 cs.CL New submission

Semantic Grading of Written Answers in Low-Resource Language Bangla Using a Fine-Tuned Lightweight Language Model

Meherun Farzana, Aniket Joarder, Mahmudul Hasan, Md. Mosaddek Khan

Affiliations: Computer Science and Engineering, University of Dhaka

AI summary: For the low-resource language Bangla, proposes a bilingual evaluation system built on a fine-tuned lightweight language model that grades automatically by semantic correctness rather than lexical overlap, achieving the best performance in both synthetic and human evaluation.

Comments
10 pages, 5 figures, 2 tables. Preprint

Abstract

Bangla is among the world's most widely spoken languages, yet it remains underserved in educational NLP research. In many remote and rural regions, access to qualified subject teachers is limited, and written answers are consequently graded largely by hand, restricting timely and consistent feedback. Automatic assessment is challenging because semantically correct responses can vary substantially in surface form. We present a bilingual (Bangla-English) evaluation system designed for low-resource educational settings that prioritizes semantic correctness over lexical overlap. Our approach fine-tunes a lightweight language model to grade each response using the question, reference answer, and student answer, producing a numeric score and concise, context-grounded feedback suitable for classroom deployment. We also construct a synthetic bilingual dataset to enable controlled training and evaluation. Across proprietary and open-source LLMs evaluated under a unified protocol, our QLoRA-tuned Qwen3-8B confirms consistent improvement by producing the most leakage-resistant feedback (RoRa = 0.819) in synthetic evaluation and the strongest agreement with human scores (rho = 0.936, MAE = 0.725) in a dedicated human study.
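
The agreement figures quoted above (rho and MAE) can be computed as in the following sketch. It assumes that rho denotes Spearman's rank correlation, which the abstract does not state, and the function names are illustrative rather than taken from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                    # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def mean_absolute_error(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
```

Given lists of model scores and human scores for the same answers, `spearman_rho` measures whether the two rankings agree, and `mean_absolute_error` measures how far apart the scores are on average.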

2606.11926 2026-06-11 cs.CL cs.AI New submission

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Microsoft Research

AI summary: Proposes the Arbor framework, which runs a long-horizon autonomous research loop through Hypothesis Tree Refinement (HTR); across six real tasks its average relative held-out gain is more than 2.5x that of Codex and Claude Code.

Abstract

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
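
A persistent tree that links hypotheses, evidence, and insights might be modeled roughly as below. This is a hypothetical data-structure sketch, not Arbor's implementation: the class `HypothesisNode`, its fields, and the `frontier` and `best` helpers are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HypothesisNode:
    hypothesis: str
    evidence: Optional[float] = None             # e.g. a held-out score, once tested
    insight: str = ""                            # distilled lesson, filled in after testing
    children: List["HypothesisNode"] = field(default_factory=list)

    def add_child(self, hypothesis):
        child = HypothesisNode(hypothesis)
        self.children.append(child)
        return child

    def frontier(self):
        """Untested leaves: candidates an executor could pick up next."""
        if not self.children:
            return [self] if self.evidence is None else []
        return [n for c in self.children for n in c.frontier()]

    def best(self):
        """Tested node with the highest evidence score, or None."""
        tested = [self] if self.evidence is not None else []
        for c in self.children:
            b = c.best()
            if b is not None:
                tested.append(b)
        return max(tested, key=lambda n: n.evidence, default=None)

root = HypothesisNode("baseline", evidence=0.70)
a = root.add_child("larger batch size")
b = root.add_child("cosine schedule")
a.evidence = 0.74                                # result returned by an executor
```

Here `root.frontier()` contains only the untested node `b`, and `root.best()` returns `a`, which mirrors the coordinator's two jobs of picking what to try next and tracking the best verified result.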

2606.11925 2026-06-11 cs.CV cs.LG New submission

Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching

Zsolt Robotka, Ádám Rák, Jalal Al-Afandi, András Horváth, György Cserey

Affiliations: Peter Pazmany Catholic University, Faculty of Information Technology and Bionics; DeepSign Technologies Ltd.

AI summary: Proposes a corpus augmentation method for sign language translation that needs no additional annotation or generative models: sign clips are extracted with CTC forced alignment, an LLM generates sentences, and videos are stitched together, yielding a gain of 2.92 BLEU-4 over the GFSLT-VLP baseline; also finds that synthetic data harms vision-language pretraining yet improves the downstream task.

Abstract

Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. While large weakly-aligned datasets have enabled pre-training at scale and gloss-free methods have reduced reliance on expert annotation, high-quality parallel sign video-text pairs for fine-tuning remain scarce, limiting generalisation on long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models, relying only on the existing gloss-annotated training corpus and an LLM for sentence generation: per-gloss clips are extracted from training videos via CTC forced-alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage and can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. re-evaluated five recent gloss-free methods under strictly identical conditions; the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further identify that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at this https URL.

2606.11922 2026-06-11 cs.SD cs.AI New submission

Lung-SRAD: Spectral-Aware Regularized Audio DASS with Dual-Axis Patch-Mix Contrastive Learning for Respiratory Sound Classification

Hemansh Shridhar, Miika Toikkanen, June-Woo Kim

Affiliations: RSC LAB, MODULABS; Department of Electronic Engineering, Wonkwang University; AI Convergence Research Institute, Wonkwang University

AI summary: To address the insensitivity of AST models to localized abnormal patterns in respiratory sound classification, proposes spectral-aware layer regularization and Dual-Axis Patch-Mix contrastive learning built on state space models, reaching a score of 64.48% on the ICBHI benchmark, 5% above the AST baseline.

Comments
Accepted to Interspeech 2026

Abstract

Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce sensitivity to localized abnormal patterns. In this work, we investigate State Space Models (SSMs) as an alternative backbone for RSC. Using the Distilled Audio State Space model, we analyze intermediate representations through spectral response curves and observe stronger preservation of mid-to-high spatial-frequency components. Based on these observations, we introduce spectral-aware layer regularization using Gaussian convolution applied to selected layers. We further propose Dual-Axis Patch-Mix contrastive learning tailored to SSM-based audio models for robust representation learning. Experiments on the ICBHI benchmark show that our approach achieves a score of 64.48%, outperforming the AST baseline by 5%. Code is available at this https URL.

2606.11915 2026-06-11 cs.SD cs.AI New submission

Quality Adaptive Angular Margin Learning for Respiratory Sound Classification

Yoon Tae Kim, Heejoon Koo, Miika Toikkanen, June-Woo Kim

Affiliations: RSC LAB, MODULABS, Republic of Korea; Department of Electronic Engineering, Wonkwang University, Republic of Korea; AI Convergence Research Institute, Wonkwang University, Republic of Korea

AI summary: Proposes QLung, a quality-adaptive angular-margin learning framework that derives a no-reference audio quality margin from spectral entropy and root-mean-square energy and adaptively scales angular margins to improve feature generalization; it improves by 2.46% on ICBHI and achieves the strongest out-of-distribution performance on SPRSound.

Comments
Accepted to Interspeech 2026

Abstract

We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and root-mean-square energy, which adaptively scales angular margins based on recording quality. To this end, we propose a log-scaled angular margin that stabilizes training under severe class imbalance. We also use an angular classifier that normalizes features and class weights, ensuring margin penalties are applied consistently on the unit hypersphere. Our approach improves in-distribution performance on the ICBHI dataset by 2.46% over the cross-entropy baseline, and most significantly, achieves the strongest out-of-distribution performance on the SPRSound dataset compared to prior state-of-the-art methods. Code is available at this https URL.
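
The two quality features and the normalized angular classifier named in the abstract can be sketched with NumPy. How QLung maps the features to a margin, and its log-scaling rule, are not given in the abstract, so the sketch takes `margin` as a plain input. All function names and the `scale` default are assumptions.

```python
import numpy as np

def rms_energy(x):
    """Root-mean-square energy of a waveform."""
    return float(np.sqrt(np.mean(np.square(x))))

def spectral_entropy(x):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]."""
    power = np.abs(np.fft.rfft(x)) ** 2
    p = power / power.sum()
    p = p[p > 0]                                 # skip empty bins, avoiding log(0)
    return float(-(p * np.log(p)).sum() / np.log(len(power)))

def angular_margin_logits(feature, class_weights, target, margin, scale=30.0):
    """Cosine logits with an additive angular margin on the target class."""
    f = feature / np.linalg.norm(feature)
    W = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(W @ f, -1.0, 1.0)              # both sides lie on the unit hypersphere
    theta = np.arccos(cos)
    cos[target] = np.cos(theta[target] + margin)
    return scale * cos

t = np.arange(256)
tone = np.sin(2 * np.pi * 5 * t / 256)           # energy concentrated in one bin
noise = np.random.default_rng(0).standard_normal(256)
```

A pure tone yields a spectral entropy near 0 and broadband noise a value near 1, which is why the feature is a plausible proxy for recording quality.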

2606.11910 2026-06-11 cs.CL New submission

An Ontology-Guided Multi-Anchor Graph Retrieval Framework for Traffic Legal Liability Determination

Xu Li, Shuqi Tian, Xun Han, Kuncheng Zhao, Xinyi Li

Affiliations: Southwest Petroleum University; Sichuan Police College

AI summary: Proposes the OMAGR framework, which uses ontology guidance to decompose queries into anchors and run parallel graph retrieval, resolving the multi-dimensional retrieval bottleneck and improving Context Precision and Faithfulness on the TrafficLaw-QA dataset.

Comments
Submitted to ICONIP. 15 pages, 3 figures

Abstract

Traffic law liability determination is critical for assigning legal penalties, requiring the simultaneous identification of interdependent statutory provisions across multiple legal dimensions. However, existing retrieval-augmented generation methods suffer from a multi-dimensional retrieval bottleneck: single-axis architectures compress complex legal queries into a single pathway, causing interdependent statutory dimensions to be overlooked. To address this, we propose OMAGR, an ontology-guided framework that decomposes queries into ontology-aligned anchors and executes parallel graph retrieval across each dimension, ensuring independent retrieval across dimensions before fusion. To evaluate the proposed method, we created the TrafficLaw-QA dataset, an expert-validated benchmark dataset containing 200 questions and 527 legal provisions. Results show that TrafficOmni-RAG outperforms baselines on Context Precision and Faithfulness metrics. The findings demonstrate that parallel multi-anchor retrieval effectively resolves the multi-dimensional retrieval bottleneck, offering a promising direction for traffic law liability determination research.

2606.11909 2026-06-11 cs.AI New submission

Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Zhe Ji, Jinshan Lai, Xi Ren, Jianwei Hu, Qiang Ma

Affiliations: QiYuan Lab; School of Information and Software Engineering, University of Electronic Science and Technology of China; Beijing University of Posts and Telecommunications; School of Computer Science and Engineering, Northeastern University; School of Computer Science and Engineering, Beihang University

AI summary: Proposes Embodied-BenchClaw, an autonomous system coordinated by three agents over a five-stage pipeline that automatically constructs verifiable, executable, maintainable, and diagnostically useful embodied spatial intelligence benchmarks with less manual effort.

Abstract

Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor-intensive, hard to reuse, and difficult to maintain. Existing embodied benchmarks are often static and may quickly become saturated as models improve, limiting their ability to distinguish new capabilities. We propose Embodied-BenchClaw, an autonomous agentic system for constructing embodied spatial intelligence benchmarks. Given a user-specified evaluation intent, Embodied-BenchClaw automatically produces a complete and continually updatable benchmark package through a five-stage pipeline: intent blueprinting, data collection, structuring and cleaning, benchmark synthesis, and evaluation reporting. The pipeline is coordinated by three agents for planning, construction, and evaluation. To improve reusability and reliability, Embodied-BenchClaw introduces an extensible Skill Library and process quality control, enabling benchmark construction to be composable, verifiable, and repairable. We instantiate multiple benchmarks covering indoor spatial reasoning, outdoor spatial reasoning, robotic manipulation, quadruped robot navigation, UAV/aerial-view understanding, and static benchmark enhancement. These benchmarks span diverse embodied carriers, data sources, and spatial capabilities. Experiments with human evaluation, judge-based assessment, consistency checks, cost analysis, and ablations show that Embodied-BenchClaw can construct verifiable, executable, maintainable, and diagnostically useful embodied spatial benchmarks with reduced manual effort.

2606.11906 2026-06-11 cs.CL 新提交

When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

语言何时重要?多语言指令揭示视觉-语言-动作模型中的逐步语言敏感性

Xuan Dong, Zhe Han, Tianhao Niu, Qingfu Zhu, Wanxiang Che

发表机构 * Harbin Institute of Technology(哈尔滨工业大学)

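AI summary: By translating the LIBERO benchmark into ten languages, this study presents the first systematic evaluation of the multilingual robustness of VLA models, finds that success rates drop by 30-50% under non-English instructions, and proposes an inference-time alignment intervention based on step-level language sensitivity that substantially improves performance.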

详情
Comments
Accepted to ACL 2026 Main Conference
AI中文摘要

视觉-语言-动作(VLA)模型在语言条件机器人操作中表现出强大性能,但其对语言变化的鲁棒性仍知之甚少。在这项工作中,我们通过将LIBERO基准翻译成十种语言,首次对VLA模型进行了系统的多语言评估,揭示了在非英语指令下性能严重下降,成功率下降30-50%。通过对任务执行的细粒度分析,我们发现语言影响在步骤间高度不均匀:某些步骤表现出强烈的语言依赖性并主导整体任务失败,而其他步骤则基本与语言无关。基于这一见解,我们提出了一种逐步推理时干预方法,根据步骤语言敏感性对齐表示,显著提高了语言变化下的性能。我们的结果表明,VLA模型中的语言鲁棒性本质上是一个逐步控制问题,突出了时间结构化分析对于可靠具身智能体的重要性。

英文摘要

Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic multilingual evaluation of VLA models by translating the LIBERO benchmark into ten languages, revealing severe performance degradation under non-English instructions, with success rates dropping by 30-50%. Through fine-grained analysis of task executions, we find that language influence is highly non-uniform across steps: certain steps exhibit strong language dependence and dominate overall task failure, while others are largely language-agnostic. Based on this insight, we propose a step-wise inference-time intervention that aligns representations according to step language sensitivity, substantially improving performance under linguistic variation. Our results indicate that language robustness in VLA models is fundamentally a step-wise control problem, highlighting the importance of temporally structured analysis for reliable embodied agents.

2606.11901 2026-06-11 cs.RO cs.AI 新提交

DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

DuoBench: 一个可复现的双手操作基准,涵盖仿真与现实世界

Tobias Jülg, Seongjin Bien, Simon Hilber, Yannik Blei, Pierre Krack, Maximilian Li, Sven Parusel, Rudolf Lioutikov, Florian Walter, Wolfram Burgard

发表机构 * University of Technology Nuremberg(纽伦堡工业大学) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Franka Robotics Technical University of Munich(慕尼黑工业大学)

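AI summary: Proposes DuoBench, a bimanual manipulation benchmarking framework built on the FR3 Duo platform, with 11 tasks and a stage-based evaluation scheme, for diagnosing the failure modes of current policies in bimanual coordination, sim-to-real transfer, and related aspects.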

详情
AI中文摘要

双手机器人系统极大地扩展了操作能力,但协调两只手臂引入了额外的控制复杂性和故障模式,现有基准未能很好地捕捉这些。我们介绍了DuoBench,一个针对FR3 Duo平台上的双手操作策略的可扩展基准框架。DuoBench包含跨越四个协调类别的十一个任务,在仿真中实现,并通过可复现的任务配方和3D打印资产部分地在现实世界中复现。此外,我们提出了一种基于阶段的评估方案,支持超出二元成功之外的细粒度语义故障分析,并为所有基准任务提供人类遥操作数据集。我们在仿真和真实硬件上对几种双臂模仿学习和视觉-语言-动作策略进行了基准测试。我们的结果表明,当前策略在双手操作中仍然面临挑战,特别是在早期交互阶段、并行手臂执行以及仿真与现实环境之间的迁移方面。DuoBench为诊断这些故障模式和研究未来的双臂策略学习方法提供了一个可复现的测试平台。代码、数据集和视频可在该https URL获取。

英文摘要

Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity and failure modes that are not well captured by existing benchmarks. We introduce DuoBench, an extensible benchmarking framework for bimanual manipulation policies on the FR3 Duo platform. DuoBench comprises eleven tasks spanning four coordination categories, implemented in simulation and partially reproduced in the real world through reproducible task recipes with 3D-printable assets. In addition, we propose a stage-based evaluation scheme that supports fine-grained semantic failure analysis beyond binary success and provide human-teleoperated datasets for all benchmark tasks. We benchmark several dual-arm imitation-learning and vision-language-action policies in simulation and on real hardware. Our results show that current policies remain challenged by bimanual manipulation, particularly in early interaction stages, parallel arm execution, and transfer between simulation and real-world settings. DuoBench provides a reproducible testbed for diagnosing these failure modes and studying future methods for dual-arm policy learning. Code, datasets, and videos are available at this https URL
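To make "stage-based evaluation beyond binary success" concrete, here is a minimal sketch. It assumes each episode is logged as an ordered list of stage outcomes; the stage names and the `summarize` helper are hypothetical and not taken from DuoBench itself.

```python
from collections import Counter

# Hypothetical stage names for a handover-style bimanual task.
STAGES = ("reach", "grasp", "handover", "place")

def episode(*flags):
    """Pair each stage name with a success flag for one rollout."""
    return list(zip(STAGES, flags))

def stage_progress(ep):
    """Return (stages completed in order, first failed stage or None)."""
    for index, (stage, succeeded) in enumerate(ep):
        if not succeeded:
            return index, stage
    return len(ep), None

def summarize(episodes):
    """Aggregate binary success, mean stage progress, and first-failure counts."""
    completed_total = 0
    full_successes = 0
    first_failures = Counter()
    for ep in episodes:
        completed, failed_stage = stage_progress(ep)
        completed_total += completed
        if failed_stage is None:
            full_successes += 1
        else:
            first_failures[failed_stage] += 1
    return {
        "binary_success_rate": full_successes / len(episodes),
        "mean_stage_progress": completed_total / (len(STAGES) * len(episodes)),
        "first_failures": dict(first_failures),
    }

episodes = [
    episode(True, True, True, True),
    episode(True, False, False, False),
    episode(True, True, False, False),
    episode(True, False, False, False),
]
summary = summarize(episodes)
```

In this toy log the binary success rate is 0.25, yet the stage view shows that half of all stages were completed and that most rollouts first fail at grasping, which is the kind of diagnosis a single success bit hides.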

2606.11898 2026-06-11 cs.CL cs.LG 新提交

GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

GraspLLM: 面向文本属性图与LLM的零样本泛化

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Wentao Zhang

发表机构 * Peking University(北京大学) National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

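AI summary: Proposes the GraspLLM framework, which fuses graph structural comprehension with the semantic capabilities of LLMs and uses motif-aware contrastive learning and optimal contextual subgraph alignment to achieve zero-shot generalization across datasets and tasks.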

详情
AI中文摘要

近年来,对文本属性图(TAGs)的研究因其在引文网络、电子商务平台、社交媒体和网页等各类真实数据场景中的广泛应用而备受关注。受大语言模型(LLMs)卓越语义理解能力的启发,已有许多尝试将LLMs集成到TAGs中。然而,现有方法仍难以在不同图和任务间泛化,且其捕获可迁移图结构模式的能力有限。为此,我们提出了GraspLLM框架,该框架将图结构理解与LLM的语义理解能力相结合,以增强跨数据集和跨任务的泛化能力。具体而言,我们使用冻结的通用嵌入模型将不同图的节点文本表示在统一语义空间中,在此基础上,我们在多个基序诱导的邻接矩阵上进行基序感知对比学习,以提取与数据集无关的结构信息。然后,通过我们提出的最优上下文子图,为每个目标节点提取最相关的上下文子图,并通过对齐投影仪将这些子图对齐到LLM的令牌空间。在涵盖不同领域的TAG基准数据集上的大量实验表明,GraspLLM在零样本场景下始终优于先前基于LLM的TAG方法,突显了其在不同数据集和任务上的强泛化能力。我们的代码可在以下网址获取:此 https URL。

英文摘要

Research on Text-Attributed Graphs (TAGs) has gained significant attention recently due to its broad applications across various real-world data scenarios, such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding ability of Large Language Models (LLMs), there have been numerous attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across diverse graphs and tasks, and their ability to capture transferable graph structural patterns remains limited. To address this, we introduce GraspLLM, a framework that combines Graph structural comprehension with the semantic understanding prowess of LLMs to enhance cross-dataset and cross-task generalizability. Specifically, we represent node texts from different graphs in a unified semantic space with a frozen general embedding model, on top of which we perform motif-aware contrastive learning across multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, with our proposed optimal contextual subgraph, we extract the most contextually relevant subgraph for each target node and align these subgraphs to the token space of the LLM via an alignment projector. Extensive experiments on TAG benchmark datasets spanning diverse domains reveal that GraspLLM consistently outperforms previous LLM-based methods for TAGs, especially in zero-shot scenarios, highlighting its strong generalizability across different datasets and tasks. Our code is available at this https URL.

2606.11897 2026-06-11 cs.CL 新提交

Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills

Notes2Skills: 从实验室笔记本到具有确定性意识的科学智能体技能

Shi Liu, Jiayao Chen, Chengwei Qin, Yanqing Hu, Jufan Zhang, Linyi Yang

发表机构 * Southern University of Science and Technology(南方科技大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University College Dublin(都柏林大学学院)

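AI summary: Proposes the Notes2Skills framework, which turns lab notes into verifiable scientific agent skills that preserve the author's certainty, addressing the conflation of uncertain judgments with confirmed conclusions.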

详情
Comments
28 pages, preprint
AI中文摘要

科学发现工作流程通常包含并严重依赖实验室笔记,研究人员在其中记录观察结果、解释不确定的结果并规划后续实验。这些信息丰富的实验室笔记保留了不断演变的科学推理和作者的不确定性,而不是出版物中展示的经过修饰的最终结果,为人工智能在更全面和更深层次上参与科学探索提供了宝贵机会。然而,大多数先前关于科学文本的工作集中在论文、协议或结构化数据库上,使得非正式的实验室笔记作为科学AI智能体的输入未被充分探索。这一差距很重要,因为实验室笔记通常在同一段落中混合了经过验证的观察结果、初步判断和可能的实验下一步。如果这些信号被混淆,AI智能体可能会将不确定的科学判断误认为是已确认的结论或可执行的行动。为此,我们提出了Notes2Skills,一个两阶段框架,用于将实验室笔记本转化为可验证的科学AI智能体技能,同时保留作者的不确定性。在七个条件和三个湿实验环节中,Notes2Skills是唯一既不会将不确定的笔记误认为是明确的指令,也不会丢弃明确指令的配置。我们表明,确定性保留是实验室笔记本与可靠智能体技能之间缺失的一环,为更安全的AI共同科学家系统开辟了一条道路。

英文摘要

Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final results exhibited in publications, providing a valuable opportunity for AI to engage in scientific exploration at a more comprehensive and deeper level. However, most prior work on scientific text focuses on papers, protocols, or structured databases, leaving informal laboratory notes underexplored as inputs to AI agents for science. This gap matters because lab notes often intermingle validated observations, tentative judgments, and possible experimental next steps within the same passage. If these signals are conflated, an AI agent may mistake uncertain scientific judgments for confirmed conclusions or executable actions. To this end, we present Notes2Skills, a two-stage framework for turning lab notebooks into verifiable skills for scientific AI agents while preserving the author's certainty. Across seven conditions and three wet-lab sessions, Notes2Skills is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones. We show that certainty preservation is the missing piece between lab notebooks and reliable agent skills, opening a path toward safer AI co-scientist systems.

2606.11894 2026-06-11 cs.CV 新提交

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Wild3R: 从无约束稀疏照片集合进行前馈式3D高斯泼溅

Yuto Furutani, Takashi Otonari, Kaede Shiohara, Toshihiko Yamasaki

发表机构 * The University of Tokyo(东京大学)

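AI summary: Proposes Wild3R, a feed-forward 3D Gaussian Splatting method for unconstrained sparse photo collections. By introducing the WildCity dataset, which covers diverse lighting and transient objects, the model learns cross-view appearance consistency and removes transient content, outperforming existing feed-forward methods and performing on par with per-scene optimization-based methods.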

详情
AI中文摘要

前馈式3D高斯泼溅(3DGS)消除了传统3DGS所需的耗时逐场景优化。然而,现有的前馈方法难以处理包含多样光照条件和瞬态物体的真实世界照片集合。在本文中,我们提出了Wild3R,一种针对无约束稀疏照片集合的前馈方法。主要瓶颈在于缺乏提供多视角、多种光照和瞬态变化的训练数据,而这些是学习鲁棒场景表示所必需的。为解决这一问题,我们引入了WildCity数据集,该数据集包含200个场景、170种光照条件和瞬态物体,总计337,500张图像。通过利用该数据集,我们的模型在参考视图条件下学习跨视角的外观一致性,同时移除瞬态内容。大量实验表明,我们的方法优于现有的前馈方法,并取得了与先前基于逐场景优化的方法相竞争的结果。

英文摘要

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

2606.11893 2026-06-11 cs.LG cs.AI cs.CL q-bio.NC 新提交

Beyond representational alignment with brain-guided language models for robust reasoning

超越表征对齐:基于大脑引导的语言模型实现稳健推理

Mingqing Xiao, Kai Du, Zhouchen Lin

发表机构 * State Key Lab of General AI, School of Intelligence Science and Technology, Peking University(北京大学通用人工智能国家重点实验室、智能科学与技术学院) Department of Psychological and Cognitive Sciences, Tsinghua University(清华大学心理与认知科学系) Microsoft Research Asia(微软亚洲研究院)

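AI summary: Studies how fMRI signals can enhance the reasoning of large language models and proposes a brain-guided framework that achieves accuracy gains of up to 13% across 10 models.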

详情
AI中文摘要

大型语言模型(LLMs)与人类高阶认知背后的神经机制之间的对应关系仍未得到充分表征。鉴于人脑中语言和推理似乎是可分离的,一个开放的问题是LLMs是否与来自推理相关区域的神经信号对齐,以及这些信号是否能够改进它们。在此,我们聚焦于演绎推理,表明LLM内部表征不仅与任务fMRI活动部分对齐,而且可以直接通过这些信号增强。使用神经预测性度量,我们发现LLMs在聚合水平上解释了推理相关区域中可解释方差的很大一部分,而在特定推理类型内的预测性较低,表明对齐和分歧并存。基于此,我们提出一个脑引导框架:我们沿着由模型和大脑表征的联合结构诱导的方向引导模型表征,在推理时进行干预,在训练时进行微调。我们证明任务诱发的脑信号可以直接增强LLM推理,在10个LLM(1.5B-72B)上产生与仅语言监督正交的增益,具有跨推理类型的迁移,以及高达13%的绝对准确率提升。我们的结果将LLM-大脑对应关系从相关性推进到引导,建立了一条由脑信号驱动的路径,通向更稳健和认知对齐的AI。

英文摘要

The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.

2606.11886 2026-06-11 cs.SD cs.OS 新提交

Real-Time Language Model Jamming: A Case Study for Live Music Accompaniment Generation

实时语言模型即兴合奏:现场音乐伴奏生成的案例研究

Bowen Zheng, Andrew H. Yang, Jiaqi Ruan, Jia He, Xinyue Li, Yuan-Hsin Chen, Ziyu Wang, Xiaosong Ma

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学)

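AI summary: Proposes StreamMUSE, a system that performs frame-synchronous streaming inference in a client-server architecture, and validates real-time synchronization under different latency conditions through a live music accompaniment task.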

详情
Comments
Accepted to RTAS 2026. 14 pages, 5 figures, 3 tables
AI中文摘要

语言模型(LMs)已成为现代生成建模中最突出的范式之一。虽然提高速度是实时部署的主要焦点,但仅靠速度是不够的。许多实际应用,如同步翻译和语音合成,还需要生成内容与外部信号在生成内容和时序上精确对齐。我们将此问题称为“帧同步流式推理”。为了解决这个问题,我们提出了StreamMUSE,一个在客户端-服务器架构中响应外部信号流执行LM生成的推理系统。客户端基于最新输入持续发送高频推理请求,并接收与外部时钟同步的输出,而服务器执行模型推理。我们通过现场音乐伴奏任务演示了该框架,展示了在不同往返延迟的部署环境中如何实现实时同步。我们进一步建模了系统超参数与往返延迟之间的关系,并评估了不同环境如何影响实现实时性能的最佳配置。实验结果表明,系统实时性能与音乐质量之间存在一致对应关系,证明了所提出框架的有效性。该项目是开源的。相关代码和最新更新可在此https URL获取。

英文摘要

Language models (LMs) have become one of the most prominent paradigms in modern generative modeling. While making them faster has been the main focus of real-time deployment, speed alone is not enough. Many real-world applications, such as synchronized translation and voice synthesis, also require precise alignment between generation and external signals, both in terms of generation content and timing. We refer to this problem as "frame-synchronous streaming inference". To address it, we present StreamMUSE, an inference system that performs LM generation in response to an external signal stream within a client-server architecture. The client continuously sends high-frequency inference requests based on the most recent inputs and receives outputs synchronized to the external clock, while the server executes model inference. We demonstrate the framework through a live music accompaniment task, showing how real-time synchronization can be achieved across different deployment environments with varying round-trip latencies. We further model the relationship between system hyperparameters and round-trip latency, and evaluate how different environments affect optimal configurations to achieve real-time performance. Experimental results show a consistent correspondence between system real-time performance and music quality, demonstrating the effectiveness of the proposed framework. The project is open source. Relevant code and the latest updates are available at this https URL.
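One way to picture why round-trip latency constrains the configuration is a simple deadline argument. The sketch below is an assumption made for illustration, not the model used in the paper: it supposes playback is divided into fixed-length frames and that a response is only useful if it arrives before its target frame starts. The function name and the `safety_frames` parameter are hypothetical.

```python
import math

def min_lookahead_frames(round_trip_s, frame_s, safety_frames=0):
    """Smallest number of frames ahead of 'now' that a request must target
    so that its response arrives before the target frame begins.

    A request sent at a frame boundary returns after round_trip_s seconds,
    so the target frame must start at least that far in the future.
    """
    if round_trip_s < 0 or frame_s <= 0:
        raise ValueError("round_trip_s must be >= 0 and frame_s must be > 0")
    return math.ceil(round_trip_s / frame_s) + safety_frames

# Frame length of 0.125 s (for example, a sixteenth note at 120 BPM).
local_deployment = min_lookahead_frames(round_trip_s=0.03, frame_s=0.125)
remote_deployment = min_lookahead_frames(round_trip_s=0.3, frame_s=0.125)
```

Under these assumptions a slower link forces the client to ask for frames further in the future, so the model conditions on older input; this trade-off is one plausible reason real-time performance and musical quality move together.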

2606.11884 2026-06-11 cs.CV cs.CR 新提交

Image Quality Assessment of Identity Cards Using Measures from Open Face Image Quality

使用开放人脸图像质量度量对身份证进行图像质量评估

Gregor Grote, Juan E. Tapia, Christian Rathgeb

发表机构 * da/sec - Biometrics and Internet Security Research Group, Hochschule Darmstadt(达姆施塔特应用科学大学生物识别与互联网安全研究组)

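AI summary: Applies capture-related quality measures from the OFIQ standard to ID card images, proposes a preprocessing pipeline, and analyzes how these measures correlate with the performance of three presentation attack detection algorithms, suggesting that quality assessment based on some OFIQ measures can significantly improve PAD performance.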

详情
Comments
Presented at IWBF 2026 (14th International Workshop on Biometrics and Forensics)
AI中文摘要

本文通过将开放人脸图像质量(OFIQ)标准中的捕获相关质量度量应用于身份证图像,解决了远程验证系统中身份证图像质量评估的挑战。我们的预处理流程包括角点检测、透视归一化和全面的前景掩码,以确保准确且无偏的质量度量计算。我们通过分析这些度量与三种呈现攻击检测(PAD)算法在四个不同身份证数据集上的性能相关性来评估其有效性,其中两个数据集包含真实(即原始)图像,两个包含打印的模拟身份证。我们的结果表明,基于某些OFIQ度量的质量评估可以显著提升PAD性能。

英文摘要

This paper addresses the challenge of assessing the image quality of ID cards in remote verification systems by applying capture-related quality measures from the Open Face Image Quality (OFIQ) standard to ID card images. Our preprocessing pipeline includes corner detection, perspective normalization, and comprehensive foreground masking to ensure accurate and unbiased quality measure computation. We evaluate the effectiveness of these measures by analyzing their correlation with the performance of three presentation attack detection (PAD) algorithms across four diverse ID card datasets, where two datasets contain bona fide, i.e., pristine, images and two contain printed mock ID cards. Our results suggest that quality assessment based on some OFIQ measures can significantly improve PAD performance.

2606.11880 2026-06-11 cs.CV 新提交

SG2Loc: Sequential Visual Localization on 3D Scene Graphs

SG2Loc: 基于3D场景图的顺序视觉定位

Nicole Damblon, Olga Vysotska, Federico Tombari, Marc Pollefeys, Daniel Barath

发表机构 * ETH Zurich(苏黎世联邦理工学院) Google(谷歌) TU Munich(慕尼黑工业大学) Microsoft(微软)

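AI summary: Proposes a lightweight sequential visual localization method that represents the environment with a compact 3D scene graph and localizes efficiently through particle filtering and semantic matching, significantly reducing storage requirements.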

详情
Comments
The code will be available at this https URL
AI中文摘要

复杂室内环境中的视觉定位仍然是机器人和AR应用的关键挑战。顺序定位,即随时间细化位姿估计,对自主智能体至关重要。然而,传统方法通常需要存储大量图像数据库或点云,导致显著开销。本文提出一种新颖的轻量级顺序视觉定位方法,使用3D场景图。我们的方法用紧凑的场景图表示环境,其中节点表示对象(带有粗略网格),边编码空间关系。在定位阶段,对于每张图像,我们提取逐块语义特征,预测对象身份。定位在粒子滤波框架内进行。每个粒子代表一个相机位姿,将场景图中的粗略对象网格投影到图像中,根据可见性为块分配对象身份。输入图像中逐块特征与场景图对象特征的相似度决定粒子的权重。后续图像顺序融合,细化位姿估计。通过利用紧凑的场景图和高效的语义匹配,我们的方法在保持真实世界数据集性能的同时显著减少存储。代码将在该网址提供。

英文摘要

Visual localization in complex indoor environments remains a critical challenge for robotics and AR applications. Sequential localization, where pose estimates are refined over time, is important for autonomous agents. However, traditional methods often require storing extensive image databases or point clouds, leading to significant overhead. This paper introduces a novel, lightweight approach to sequential visual localization using 3D scene graphs. Our method represents the environment with a compact scene graph, where nodes represent objects (with coarse meshes) and edges encode spatial relationships. For each image in the localization phase, we extract per-patch semantic features that predict object identities. Localization is performed within a particle filter framework. Each particle, representing a camera pose, projects the coarse object meshes from the scene graph into the image, assigning object identities to patches based on visibility. The similarity between the per-patch features of the input image and the object features from the scene graph determines the weight of a particle. Subsequent images are incorporated sequentially, refining the pose estimate. By leveraging a compact scene graph and efficient semantic matching, our method significantly reduces storage while maintaining performance on real-world datasets. The code will be available at this https URL.
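The particle-weighting step described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the SG2Loc code: the mesh projection is replaced by a precomputed per-patch object assignment for each particle, and the use of mean cosine similarity with clipping at zero is an assumed choice, since the abstract does not specify the exact similarity or normalization.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def particle_weights(patch_features, object_features, particle_assignments):
    """Weight each particle by how well its projected objects explain the image.

    patch_features: per-patch feature vectors extracted from the input image.
    object_features: object id -> feature vector stored in the scene graph.
    particle_assignments: for each particle, the object id expected at each
        patch under that particle's pose (None where no object is visible).
    """
    raw = []
    for assignment in particle_assignments:
        sims = [cosine(feature, object_features[obj])
                for feature, obj in zip(patch_features, assignment)
                if obj is not None]
        score = sum(sims) / len(sims) if sims else 0.0
        raw.append(max(score, 0.0))  # clip negative similarity to zero weight
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(raw)] * len(raw)  # no evidence: fall back to uniform
    return [value / total for value in raw]

object_features = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
patch_features = [[1.0, 0.0], [0.0, 1.0]]  # patch 0 looks like a chair, patch 1 like a table
particle_assignments = [
    ["chair", "table"],  # pose consistent with the image
    ["table", "chair"],  # pose that swaps the two objects
    ["chair", None],     # pose that sees only the chair
]
weights = particle_weights(patch_features, object_features, particle_assignments)
```

A full filter would resample particles according to these weights and repeat the update for each subsequent image, which is how the pose estimate is refined over the sequence.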

2606.11875 2026-06-11 cs.CL cs.SD 新提交

I Understand How You Feel: Enhancing Deeper Emotional Support Through Multilingual Emotional Validation in Dialogue System

我理解你的感受:通过对话系统中的多语言情感验证增强深层情感支持

Zi Haur Pang, Yahui Fu, Koji Inoue, Tatsuya Kawahara

发表机构 * Graduate School of Informatics, Kyoto University(京都大学信息学研究科)

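AI summary: Brings emotional validation to dialogue systems: builds the multilingual corpus M-EDESConv and the test set M-TESC, designs MEGUMI, a multilingual emotion-aware gated unit, for timing detection, and evaluates how current LLMs perform at generating validating responses.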

详情
Comments
This paper has been accepted for presentation at SIGdial Meeting on Discourse and Dialogue 2026 (SIGDIAL 2026)
AI中文摘要

情感验证——明确承认用户的感受是合理的——已被证明具有治疗价值,但很少受到计算方面的关注。对话系统中的情感验证可以分解为:(i) 验证响应识别,(ii) 验证时机检测,以及 (iii) 验证响应生成。为了支持所有三个子任务的研究,我们发布了 M-EDESConv,一个通过混合手动和自动标注创建的 12 万条英日多语言语料库,以及 M-TESC,一个多语言口语对话测试集。对于时机检测,我们提出了 MEGUMI,一种多语言情感感知门控单元用于相互融合,它通过跨模态注意力和门控融合将冻结的 XLM-RoBERTa 语义与特定语言的情感编码器融合。MEGUMI 在 M-EDESConv 和 M-TESC 数据集上均表现出优越的性能,无论是客观还是主观评价。最后,我们的 EmoValidBench 基准测试(使用 GPT-4.1 Nano 和 Llama-3.1 8B)表明,当前的 LLM 能够生成上下文相似且多样化的验证响应,但情感理解仍然是一个需要改进的主要领域。项目页面:this https URL

英文摘要

Emotional validation, that is, explicitly acknowledging that a user's feelings make sense, has proven therapeutic value but has received little computational attention. Emotional validation in dialogue systems can be decomposed into (i) validating response identification, (ii) validation timing detection, and (iii) validating response generation. To support research on all three subtasks, we release M-EDESConv, a 120k English-Japanese multilingual corpus created through hybrid manual and automatic annotation, and M-TESC, a multilingual spoken-dialogue test set. For timing detection, we propose MEGUMI, a Multilingual Emotion-aware Gated Unit for Mutual Integration, which fuses frozen XLM-RoBERTa semantics with language-specific emotion encoders via cross-modal attention and gated fusion. MEGUMI shows superior performance on both the M-EDESConv and M-TESC datasets in objective and subjective evaluations. Finally, our EmoValidBench benchmarks of GPT-4.1 Nano and Llama-3.1 8B indicate that current LLMs generate contextually similar and diverse validating responses, but emotional understanding remains a major area for improvement. Project page: this https URL

2606.11874 2026-06-11 cs.AI 新提交

AutoMine Solution for AV2 2026 Scenario Mining Challenge

AutoMine 解决方案:面向 AV2 2026 场景挖掘挑战

Songliang Cao, Jiele Zhao, Yuru Wang, Hao Li, Daqi Liu, Zehan Zhang, Fangzhen Li, Yu Wang, Yue Zhang, Bing Wang, Guang Chen, Hao Lu, Hangjun Ye

发表机构 * Xiaomi EV(小米汽车) Huazhong University of Science and Technology(华中科技大学)

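AI summary: Proposes AutoMine, a self-refining scenario mining method based on LLMs and VLMs that combines semantics-preserving prompt augmentation, robust trajectory atomic functions paired with VLM-based functions, and refinement through execution feedback, achieving leading performance in the CVPR 2026 challenge.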

详情
Comments
CVPR 2026 Scenario Mining Challenge (Temporal Track Winners)
AI中文摘要

随着自动驾驶系统的发展,从大规模驾驶日志中挖掘高价值、安全关键且与规划相关的场景已成为数据驱动评估的关键。本文提出 AutoMine,一种基于 LLM 和 VLM 的鲁棒自优化场景挖掘方法。AutoMine 使用语义保持提示增强来降低 LLM 提示敏感性,结合鲁棒轨迹原子函数与基于 VLM 的函数以处理感知噪声和开放世界视觉线索,并通过真实日志的执行反馈来优化生成的代码。在 CVPR 2026 的 Argoverse 2 场景挖掘竞赛中,AutoMine 取得了 36.38 的 HOTA-Temporal 分数和 77.21 的 Timestamp BA 分数。

英文摘要

With the development of autonomous driving systems, mining high-value, safety-critical, and planning-relevant scenarios from large-scale driving logs has become essential for data-driven evaluation. In this paper, we propose AutoMine, a robust self-refining scenario mining method based on LLMs and VLMs. AutoMine uses semantics-preserving prompt augmentation to reduce LLM prompt sensitivity, combines robust trajectory atomic functions with VLM-based functions to handle perception noise and open-world visual cues, and refines generated code through execution feedback from real logs. In the Argoverse 2 Scenario Mining Competition at CVPR 2026, AutoMine achieves a HOTA-Temporal score of 36.38 and a Timestamp BA score of 77.21.

2606.11868 2026-06-11 cs.LG q-bio.QM 新提交

MemNovo: Look Back at the Spectrum for Balanced De Novo Peptide Sequencing from Mass Spectrometry

MemNovo: 回顾谱图以实现质谱中平衡的从头肽段测序

Dongxin Lyu, Jingbo Zhou, Hongxin Xiang, Yuqiang Li, Jun Xia

发表机构 * Westlake University(西湖大学) Hunan University(湖南大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) HKUST-GZ & HKUST(香港科技大学(广州)与香港科技大学)

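AI summary: To address the tendency of existing Transformer models for de novo peptide sequencing to over-rely on generated-sequence priors and neglect spectral evidence, proposes MemNovo, a training-free, plug-and-play mechanism that builds a persistent spectral memory bank and injects spectral features at the decoding stage through an ultra-conservative residual connection, significantly improving amino acid and peptide precision.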

详情
Comments
Code: this https URL
AI中文摘要

从串联质谱中进行从头肽段测序是蛋白质组学的关键,能够在不依赖参考数据库的情况下识别新型肽段。尽管基于Transformer的编码器-解码器模型已取得显著性能,但我们发现其推理动态中存在关键病理现象。通过全面的特征缩放实验,我们证明现有的自回归肽段解码器倾向于过度依赖生成序列的先验,同时逐渐未能充分利用输入质谱中的细粒度物理证据。这一现象导致次优结果,生成的肽段序列在生物学上合理但不符合输入谱图。为解决此问题,我们提出MemNovo,一种无需训练且即插即用的机制,在推理时重新平衡肽段和谱图的贡献。MemNovo通过建立持久的谱记忆库,并通过超保守残差连接将检索到的特征直接注入最终解码阶段,从而缓解信息瓶颈。理论分析证实,该机制恢复了解码器状态与原始谱图之间的互信息。在Nine Species基准上使用两个代表性基线模型Casanovo和InstaNovo进行的大量实验表明,MemNovo持续提高了氨基酸精度和肽段精度,对于Casanovo,肽段精度相对提升高达39.1%,对于InstaNovo提升高达3.9%,且计算开销可忽略不计。

英文摘要

De novo peptide sequencing from tandem mass spectrometry is pivotal in proteomics, enabling identification of novel peptides without reference databases. While recent Transformer-based encoder-decoder models have achieved remarkable performance, we uncover a critical pathology in their inference dynamics. Through comprehensive feature scaling experiments, we demonstrate that existing auto-regressive peptide decoders tend to over-rely on generated-sequence priors while progressively under-utilizing fine-grained physical evidence from the input mass spectrum. This phenomenon leads to suboptimal results, where generated peptide sequences are biologically plausible yet not faithful to the input spectrum. To rectify this, we propose MemNovo, a training-free and plug-and-play mechanism that re-balances peptide and spectral contributions at inference time. MemNovo alleviates the information bottleneck by establishing a persistent spectral memory bank and injecting retrieved features directly into the final decoding stage via an ultra-conservative residual connection. Theoretical analysis confirms that this mechanism restores the mutual information between the decoder state and the raw spectrum. Extensive experiments on the Nine Species benchmark with two representative baselines, Casanovo and InstaNovo, demonstrate that MemNovo consistently improves both amino acid precision and peptide precision, achieving up to 39.1% relative improvement in peptide precision for Casanovo and up to 3.9% for InstaNovo, with negligible computational overhead.
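The "retrieve from a spectral memory bank, then inject through a conservative residual" mechanism can be sketched as follows. This is only an illustration of the general pattern: dot-product attention as the retrieval rule, the small fixed scale `alpha`, and the function names are assumptions, not details confirmed by the abstract.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(query, memory):
    """Attention-style read: weight memory rows by dot product with the query."""
    scores = [sum(q * k for q, k in zip(query, row)) for row in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(weights[i] * memory[i][d] for i in range(len(memory)))
            for d in range(dim)]

def inject(decoder_state, memory, alpha=0.05):
    """Add a small, scaled copy of the retrieved spectral features to the state.

    Keeping alpha small leaves the frozen decoder's state nearly unchanged,
    which is the sense of 'conservative' assumed in this sketch.
    """
    retrieved = retrieve(decoder_state, memory)
    return [h + alpha * r for h, r in zip(decoder_state, retrieved)]

spectral_memory = [[1.0, 0.0], [1.0, 0.0]]  # toy encoder features kept from the spectrum
decoder_state = [2.0, 3.0]
updated_state = inject(decoder_state, spectral_memory, alpha=0.05)
```

Because nothing is trained, the only knob in this sketch is `alpha`; setting it to zero recovers the original decoder exactly, which is what makes such a mechanism plug-and-play.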

2606.11854 2026-06-11 cs.LG cs.AI cs.CL 新提交

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

使用ART微调多模态大语言模型:基于艺术的强化训练

Michal Chudoba, Sergey Alyaev, Petra Galuscakova, Tomasz Wiktorski

发表机构 * University of Stavanger(斯塔万格大学) NORCE Research(NORCE研究机构)

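AI summary: Proposes ART, which injects information into a frozen multimodal large language model by optimizing its raw visual input, enabling soft-prompt-style fine-tuning without modifying the computational graph and reaching accuracy comparable to LoRA on mathematics and tool-use benchmarks.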

详情
AI中文摘要

大语言模型有两种主要的参数高效微调技术。低秩适应在LLM层之间引入额外权重,而软提示则向LLM输入引入额外的微调特定原始token。然而,两者都需要修改预编译、预优化LLM的计算图。因此,两者在vLLM等高吞吐引擎中均未得到完全支持。我们提出使用ART(基于艺术的强化训练)进行微调。该方法通过仅优化冻结的多模态大语言模型的原始视觉输入来注入信息,从而在预编译计算图上实现软token方法。它依赖于将梯度反向传播到普通像素阵列,因此支持任何微调目标。此外,优化的视觉输入可以风格化为与任务相关的计算艺术品。该方法在流行的开源Qwen架构的不同规模以及多个文本基准上的有效性得到确认。具体而言,ART在数学和结构化工具使用基准上达到了与LoRA竞争的精度。

英文摘要

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on precompiled computational graphs. It relies on backpropagating gradients into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
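The core idea of training the input rather than the model can be shown with a deliberately tiny stand-in. In this sketch the "frozen model" is a fixed linear map and the objective is a squared error, both chosen only so the gradient can be written by hand; the real method backpropagates through a multimodal LLM and, per its name, uses a reinforcement-style objective. All names here are hypothetical.

```python
FROZEN_WEIGHTS = (0.5, -0.25, 1.0)  # stand-in for a frozen model; never updated

def predict(weights, pixels):
    """Frozen 'model': a fixed linear readout of the pixel values."""
    return sum(w * p for w, p in zip(weights, pixels))

def optimize_pixels(weights, pixels, target, lr=0.1, steps=50):
    """Gradient descent on the pixels only, keeping them in the valid range [0, 1]."""
    pixels = list(pixels)
    for _ in range(steps):
        error = predict(weights, pixels) - target
        # d/dp_i (w . p - target)^2 = 2 * error * w_i
        pixels = [min(1.0, max(0.0, p - lr * 2.0 * error * w))
                  for p, w in zip(pixels, weights)]
    return pixels

start_pixels = [0.5, 0.5, 0.5]
target_output = 1.0
tuned_pixels = optimize_pixels(FROZEN_WEIGHTS, start_pixels, target_output)

loss_before = (predict(FROZEN_WEIGHTS, start_pixels) - target_output) ** 2
loss_after = (predict(FROZEN_WEIGHTS, tuned_pixels) - target_output) ** 2
```

Since only the pixel array changes, the tuned input can be served to an unmodified, precompiled model like any other image, which is the deployment advantage the abstract claims over LoRA and soft prompts.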

2606.11846 2026-06-11 cs.CV 新提交

SheafStain: Sheaf-Theoretic Schrödinger Bridge for Spatially and Biologically Coherent Virtual Staining

SheafStain:用于空间和生物学一致虚拟染色的层论薛定谔桥

Hyeongyeol Lim, Hongjun Yoon, Eunjin Jang, Daeky Jeong, Won June Cho, Hwamin Lee

发表机构 * Department of Medical Informatics, College of Medicine, Korea University(高丽大学医学院医学信息学系) DEEPNOID Inc.(DEEPNOID公司)

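AI summary: To address the spatial discontinuity and context contamination caused by patch-wise inference in virtual staining, proposes SheafStain, which reinterprets vision foundation model features as sheaf-like sections and combines them with a Schrödinger Bridge framework for spatially and biologically coherent virtual staining, outperforming six existing methods on HER2 and other markers.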

详情
Comments
32 pages
AI中文摘要

当前的虚拟染色方法为癌症诊断和预后中的生物标志物量化提供了节省时间和成本的潜力。然而,对于千兆像素全切片图像(WSI)的补丁推理无法保持空间连续性,产生伪影,导致与真实图像出现灾难性不匹配。尽管病理视觉基础模型(VFM)提供了丰富的表示,但其自注意力机制导致不同的全局上下文为同一物理区域产生不一致的嵌入。我们将这种“上下文污染”形式化并验证为一个层论问题,其中这些嵌入形成一个违反粘合公理的预层。为了解决这个问题,我们提出了SheafStain,一种新方法,将VFM特征重新解释为层状截面,用于空间和生物学一致的虚拟染色。具体来说,SheafStain将类别和补丁令牌集成到薛定谔桥框架中作为层状截面。类别令牌锚定生物学一致性,而补丁令牌形成逐位置的空间图。在苏木精和伊红(H&E)与免疫组化(IHC)上共同预训练的主干网络产生非退化的跨染色茎,因此单个VFM特征空间同时监督输入条件和输出染色对齐。与先前在孤立$256 \times 256$补丁上评估并对$1024 \times 1024$真实图像进行随机裁剪或调整大小的工作不同,我们在$256 \times 256$上进行翻译,并在拼接后的$1024 \times 1024$输出上评估HER2、ER、PR和Ki-67。SheafStain在减轻补丁边界拼接伪影的同时,展示了优于六种先前方法的结果。代码即将发布。

英文摘要

Current virtual staining approaches offer the potential for time- and cost-efficient biomarker quantification in cancer diagnostics and prognostics. However, patch-wise inference for gigapixel whole slide images (WSIs) fails to maintain spatial continuity, yielding artifacts that cause catastrophic mismatches with ground-truth images. Although pathology Vision Foundation Models (VFMs) offer rich representations, their self-attention causes varying global contexts to produce inconsistent embeddings for the same physical region. We formalize and validate this "context contamination" as a sheaf-theoretic problem where these embeddings form a presheaf that violates the gluing axiom. To address this, we propose SheafStain, a new approach that reinterprets VFM features as sheaf-like sections for spatially and biologically coherent virtual staining. Specifically, SheafStain integrates class and patch tokens into a Schrödinger Bridge framework as sheaf-like sections. While the class token anchors biological consistency, patch tokens form a per-position spatial map. A backbone co-pretrained on Hematoxylin & Eosin (H&E) and Immunohistochemistry (IHC) yields non-degenerate cross-stain stalks, so a single VFM feature space supervises both input conditioning and output stain alignment. Departing from prior work that evaluates on isolated $256 \times 256$ patches and either random-crops or resizes the $1024 \times 1024$ ground truth, we translate at $256 \times 256$ and evaluate on the stitched $1024 \times 1024$ outputs across HER2, ER, PR, and Ki-67. SheafStain demonstrates promising results against six prior methods while mitigating patch-boundary stitching artifacts. Code will soon be released.

2606.11844 2026-06-11 cs.LG 新提交

TaskFusion: Continual Anomaly Detection for Heterogeneous Tabular Data

TaskFusion: 异构表格数据的持续异常检测

Dayananda Herurkar, Federico Raue, Joachim Folz, Jörn Hees, Andreas Dengel

发表机构 * German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI)) RPTU Kaiserslautern-Landau(凯泽斯劳滕-兰道大学) Hochschule Bonn-Rhein-Sieg (H-BRS)(波恩-莱茵-锡格应用技术大学)

AI总结 提出TaskFusion方法,通过AGF模型、任务融合增强和异常暴露技术,解决异构表格数据在持续学习中的特征空间变化、分布偏移和类别不平衡问题,在21个数据集上显著提升持续异常检测性能。

详情
Comments
22 Pages
AI中文摘要

表格数据中的持续异常检测具有挑战性且尚未充分探索,尤其是在异构特征模式、分布偏移和严重类别不平衡的情况下。在许多实际应用中,数据来自不同领域并按顺序到达,这使得传统的持续学习方法因依赖固定输入空间而失效。我们提出了一种持续学习方法,能够克服这些挑战并持续从不同任务中学习。我们的方法包含三个主要部分:AGF模型、TaskFusion增强和异常暴露。AGF模型将任务特定特征映射到共享空间,然后对齐分布以减少表示漂移,并在对齐空间中学习异常决策边界。为了提高稳定性,我们引入了TaskFusion增强,结合任务内的边界感知插值来细化模型异常边界,以及跨任务混合以在数据集间传递异常结构。为了处理类别不平衡和内存限制,我们采用表格数据集蒸馏来存储紧凑的合成回放样本,这些样本与增强数据一起在异常暴露目标中用于鲁棒的异常检测。我们在多个领域的21个异构数据集上评估了该方法。结果表明,与顺序微调和其他持续学习基线相比,我们的方法显著提高了持续异常检测性能,同时减少了灾难性遗忘并在异构数据集上保持稳定的检测。

英文摘要

Continual anomaly detection in tabular data is challenging and remains largely underexplored, particularly in settings with heterogeneous feature schemas, distribution shifts, and severe class imbalance. In many real-world applications, data arrive sequentially from diverse domains, rendering conventional continual learning methods ineffective due to their reliance on a fixed input space. We propose a continual learning (CL) method that overcomes these challenges and learns continually from different tasks. Our method consists of three main parts: the AGF model, TaskFusion augmentation, and outlier exposure. The AGF model maps task-specific features into a shared space, aligns distributions to reduce representation drift, and learns anomaly decision boundaries in the aligned space. To improve stability, we introduce TaskFusion augmentation, which combines boundary-aware interpolation within tasks, to refine the model's anomaly boundaries, with cross-task mixing, to transfer anomaly structure across datasets. To handle class imbalance and memory constraints, we employ tabular dataset distillation to store compact synthetic replay samples, which are used jointly with augmented data in an outlier exposure objective for robust anomaly detection. We evaluate the approach on 21 heterogeneous datasets across multiple domains. Results show that our approach substantially improves continual anomaly detection performance over sequential fine-tuning and other CL baselines while reducing catastrophic forgetting and maintaining stable detection across heterogeneous datasets.
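
The idea of mapping tasks with different feature schemas into one shared space, then interpolating within and across tasks, can be sketched roughly as below. This is a generic mixup-style illustration under stated assumptions, not the paper's AGF model or its TaskFusion procedure: the random linear projections `w_a` and `w_b`, the shared dimension, the mixing weights, and the helper names are all hypothetical, and the abstract's boundary-aware sample selection, distribution alignment, and dataset distillation are not modelled.

```python
import numpy as np

def project_to_shared(x, w):
    """Linear map from a task-specific feature space to the shared space.

    Stand-in for a learned encoder (assumption).
    """
    return x @ w

def interpolate(a, b, lam):
    """Convex combination of two equally shaped batches of shared-space points."""
    return lam * a + (1.0 - lam) * b

rng = np.random.default_rng(0)
d_shared = 8                                  # hypothetical shared dimension

task_a = rng.normal(size=(5, 12))             # task A: 12 raw features
task_b = rng.normal(size=(5, 30))             # task B: 30 raw features
w_a = rng.normal(size=(12, d_shared))         # per-task projection (assumed)
w_b = rng.normal(size=(30, d_shared))

z_a = project_to_shared(task_a, w_a)          # both tasks now live in R^8
z_b = project_to_shared(task_b, w_b)

within_task = interpolate(z_a, z_a[::-1], 0.7)   # mix pairs inside task A
cross_task = interpolate(z_a, z_b, 0.5)          # mix task A with task B
```

Once both tasks share a fixed-size representation, a single detector can be trained on `z_a`, `z_b`, and the mixed samples even though the raw tables have different column counts.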