arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12925 2026-06-12 cs.CV cs.LG New submission

Multi-Label Test-Time Adaptation with Bayesian Conditional Priors

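Qiru Li, Ao Zhou, Zhiwei Jiang, Zifeng Cheng, Cong Wang, Yafeng Yin, Qing Gu

AI summary: Proposes Bayesian Conditional Priors (BCP) estimation, a gradient-free test-time adaptation method that injects label dependency by estimating anchor-conditioned priors online, improving the robustness of frozen vision-language models to distribution shift in multi-label recognition.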

Comments
accepted by ICML2026
Abstract

Multi-label recognition with frozen Vision-Language Models (VLMs) is brittle under distribution shift: standard zero-shot inference scores labels independently, ignoring co-occurrence structure and producing incoherent label sets where dominant concepts suppress weaker but compatible labels. We introduce Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that injects label dependency without tuning the backbone. BCP views zero-shot logits as a proxy for marginal posteriors under a fixed image-text likelihood and attributes shift-induced errors mainly to a mismatched label prior. For each test image, it selects a high-confidence anchor label and applies an anchor-conditioned Bayesian refinement. This update is closed-form in logit space and admits a pointwise mutual information (PMI) interpretation, explicitly promoting compatible labels and suppressing incompatible ones. BCP operates without target annotations by estimating anchor-conditioned priors online from the unlabeled test stream via lightweight second-order co-occurrence statistics, adding negligible overhead beyond a single forward pass. Across standard multi-label benchmarks and multiple CLIP backbones, BCP consistently outperforms strong TTA baselines, e.g., improving RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79.
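
As a rough illustration of the anchor-conditioned update described in this abstract, the Python sketch below keeps running second-order co-occurrence counts from thresholded predictions and adds a PMI term to the zero-shot logits. The class name `OnlinePMIRefiner`, the pseudo-label threshold, the smoothing constant, and the weight `alpha` are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


class OnlinePMIRefiner:
    """Hypothetical sketch of PMI-based logit refinement on an unlabeled stream."""

    def __init__(self, num_labels, alpha=1.0, threshold=0.5, eps=1.0):
        self.alpha = alpha          # weight of the PMI correction (assumed)
        self.threshold = threshold  # pseudo-label cutoff on sigmoid scores (assumed)
        self.eps = eps              # additive smoothing for counts (assumed)
        self.n = 0                  # images seen so far
        self.single = np.zeros(num_labels)              # per-label counts
        self.pair = np.zeros((num_labels, num_labels))  # co-occurrence counts

    def pmi(self, anchor):
        # PMI(j; anchor) = log p(j | anchor) - log p(j), from smoothed counts.
        p_j = (self.single + self.eps) / (self.n + 2 * self.eps)
        p_j_given_a = (self.pair[anchor] + self.eps) / (self.single[anchor] + 2 * self.eps)
        return np.log(p_j_given_a) - np.log(p_j)

    def refine(self, logits):
        logits = np.asarray(logits, dtype=float)
        anchor = int(np.argmax(logits))       # highest-confidence label
        refined = logits + self.alpha * self.pmi(anchor)
        refined[anchor] = logits[anchor]      # leave the anchor's own logit untouched
        return anchor, refined

    def update(self, logits):
        # Update running statistics from thresholded predictions (no ground truth).
        probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
        present = (probs > self.threshold).astype(float)
        self.n += 1
        self.single += present
        self.pair += np.outer(present, present)
```

In this sketch, labels that have co-occurred with the anchor more often than their marginal rate get a positive shift, and labels that rarely appear alongside it get a negative one. With no statistics collected yet, the correction is zero and the zero-shot logits pass through unchanged.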

2606.12917 2026-06-12 cs.LG New submission

Where Computation Lives Inside TabPFN: Causal Localisation of Attention Head Function

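Atharva Gupta, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

AI summary: Using activation patching, ablation, and attention entropy analysis, finds that one attention head in TabPFN 2.5 is 2 to 5 times more causally necessary than the other heads at its peak layer, that its dominant layer shifts with task complexity, and that the remaining heads show symmetric late-layer profiles.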

Comments
Accepted to Workshop FMSD @ ICML 2026
Abstract

We present the first causal mechanistic analysis of a tabular foundation model, investigating how TabPFN 2.5's feature-wise attention heads distribute computation across layers. Using activation patching, ablation, and attention entropy across two synthetic regression datasets, we find clear temporal specialisation: one head's causal necessity dominates that of the others by 2 to 5 times at its peak layer, with its dominant layer shifting across tasks of different complexity, while the remaining heads exhibit symmetric late-layer profiles. Attention entropy and patching provide convergent evidence for the computationally active layers of the dominant head. We additionally investigate inference-time steerability via contrastive activation steering, which fails to transfer across samples. We attribute this result to TabPFN's in-context learning mechanism, which encodes task structure through context-dependent attention rather than the stable parametric directions that make steering tractable in language models.

2606.12913 2026-06-12 cs.LG cs.CV New submission

Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

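Dongyue Wu, Zilin Guo, Xiaoyu Li, Jiajia Liu, Jingdong Chen, Nong Sang, Changxin Gao

AI summary: Proposes a unified graph-based dataset pruning framework that models the dataset as a weighted graph, selects samples via the Maximum Weight Clique Problem, and designs a greedy algorithm; it outperforms existing methods across pruning ratios and speeds up training on ImageNet-1k by more than 40% without loss of accuracy.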

Comments
ICML 2026
Abstract

The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods that retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, reducing training time by over 40% without sacrificing accuracy on ImageNet-1k with ResNet-50.
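
The greedy, marginal-gain selection mentioned in this abstract can be sketched as follows. The objective used here (sum of selected node weights plus sum of pairwise edge weights within the selection) is an assumed form chosen for illustration, and `greedy_prune` is a hypothetical name; the paper's exact objective and conditions may differ.

```python
import numpy as np


def greedy_prune(node_w, edge_w, k):
    """Pick up to k samples by greedy marginal gain.

    Assumed objective (illustrative only):
        F(S) = sum_{i in S} node_w[i] + sum_{i<j in S} edge_w[i, j]
    with edge_w symmetric and a zero diagonal.
    """
    node_w = np.asarray(node_w, dtype=float)
    edge_w = np.asarray(edge_w, dtype=float)
    n = len(node_w)
    gain = node_w.copy()                    # marginal gain of each sample for S = {}
    available = np.ones(n, dtype=bool)
    selected = []
    for _ in range(min(k, n)):
        best = int(np.argmax(np.where(available, gain, -np.inf)))
        selected.append(best)
        available[best] = False
        gain += edge_w[best]                # each remaining sample gains its edge to `best`
    return selected
```

For example, with node weights `[1.0, 0.9, 0.2]` and edge weights that are zero between samples 0 and 1 (near-duplicates) but 1 between either of them and sample 2, a budget of two picks sample 0 first and then the diverse sample 2 rather than the higher-scoring duplicate.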

2606.12911 2026-06-12 cs.CL New submission

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

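Giang Son Nguyen, Tung X. Nguyen, Hieu Minh Truong, Nhu Vo, Wray Buntine, Dung D. Le

AI summary: To address ASR error propagation in cascaded speech translation, proposes PiDA, a phonetically informed data augmentation method that generates similar-sounding substitutions from phonetic word embeddings, improving translation of erroneous ASR output on FLEURS Vietnamese-English (+2.04 BLEU).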

Comments
Accepted to INTERSPEECH 2026
Abstract

Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling. We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.

2606.12910 2026-06-12 cs.RO cs.AI cs.CV cs.SY eess.SY New submission

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

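Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez

AI summary: Proposes the GRASP framework, which uses a pretrained VLM to translate natural-language queries into neuro-symbolic goal states and grounds them through bounding-box detection, enabling zero-shot tabletop manipulation without task-specific training.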

Comments
Project website: https://allisonandreyev.github.io/grasp.github.io/
Abstract

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

2606.12900 2026-06-12 cs.AI cs.CL cs.LG New submission

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

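Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan

AI summary: Proposes the HCPD paradigm, which uses a human-like criteria probing mechanism to emulate the multi-faceted reasoning of human evaluators and combines reward-based alignment with multi-sample aggregation for effective, interpretable hallucination detection under the zero-source constraint.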

Comments
Accepted at ICML 2026
Abstract

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.

2606.12896 2026-06-12 cs.LG cs.AI cs.CR New submission

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

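Junfeng Guo, Heng Huang

AI summary: Proposes PolicyGuard, a test-time, step-level backdoor defense based on Gaussian process posterior variance that adapts pseudo trajectories to compute per-step uncertainty, reaching average AUROCs of 0.856 and 0.859 across seven RL games.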

Abstract

While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time, step-level backdoor defense that leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. We also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.
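
To make the role of GP posterior variance concrete, here is a minimal NumPy sketch that scores query states by their posterior variance under a GP fitted to reference states. The RBF kernel, the length scale, the noise level, and the use of raw state vectors as GP inputs are assumptions for illustration; the sketch does not reproduce PolicyGuard's pseudo-trajectory construction.

```python
import numpy as np


def rbf(a, b, length_scale=1.0):
    """RBF kernel matrix between rows of a and rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior_variance(x_ref, x_query, length_scale=1.0, noise=1e-3):
    """Posterior variance at x_query for a zero-mean GP conditioned on x_ref."""
    k_rr = rbf(x_ref, x_ref, length_scale) + noise * np.eye(len(x_ref))
    k_qr = rbf(x_query, x_ref, length_scale)
    chol = np.linalg.cholesky(k_rr)
    v = np.linalg.solve(chol, k_qr.T)      # solves chol @ v = k_rq
    # var(x) = k(x, x) - k_xr K^{-1} k_rx, and k(x, x) = 1 for the RBF kernel.
    return 1.0 - (v ** 2).sum(axis=0)
```

States close to the reference set receive a variance near zero, while states far from anything seen before approach the prior variance of 1, so a threshold on this value could serve as a per-step anomaly flag.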

2606.12895 2026-06-12 cs.LG New submission

LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

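Xinrui He, Qiyu Kang, Xuhao Li, Zheng-Jun Zha

AI summary: Proposes LongSpike, a framework that brings fractional-order state-space models (f-SSM) into spiking neural networks and uses long-memory kernels for efficient long-sequence learning, outperforming existing SNNs on several benchmarks.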

Abstract

Spiking Neural Networks (SNNs) are well-regarded for their biological plausibility and energy efficiency in processing sequential data. However, dominant SNN architectures typically rely on first-order Ordinary Differential Equations (ODEs) to govern neuronal state transitions. This first-order assumption imposes a "memoryless" bottleneck, limiting the model's capacity to capture the complex, long-range dependencies inherent in long-sequence tasks. In this work, we propose LongSpike, a novel SNN framework that integrates fractional-order State-Space Modeling, or f-SSM, from control theory into the spiking domain. By extending traditional integer-order SSMs to the fractional-calculus regime, LongSpike enables the hierarchical integration of neuronal dynamics with long-memory kernels. To mitigate the computational overhead and parallelization challenges typically associated with fractional operators, we leverage a state-space formulation that supports efficient, parallel training. Empirical evaluations on challenging benchmarks, including Long Range Arena (LRA), large-scale WikiText-103, and Speech Commands, demonstrate that LongSpike outperforms state-of-the-art SNNs in accuracy while preserving sparse synaptic computation. The code is available at https://github.com/xinruihe389-commits/LongSpike.

2606.12882 2026-06-12 cs.AI New submission

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

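Xiaoxuan Wang, Haixin Wang, Alexander Taylor, Jason Cong, Yizhou Sun, Wei Wang

AI summary: Proposes HarnessBridge, a lightweight learnable harness controller that parameterizes the agent-environment interface as a bidirectional projection, reducing token usage and trajectory length and generalizing to larger models.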

Abstract

Large language models are increasingly deployed as agents for long-horizon tasks, yet their performance is shaped not only by model capability and environment design, but also by the harness that mediates agent-environment interaction. Existing harnesses are largely manually engineered, making them difficult to scale as trajectories grow longer and interactions become more complex. In this work, we ask whether a harness can be generated by a learnable plug-in module that can be trained in an end-to-end fashion. We introduce HarnessBridge, a lightweight learnable harness controller that parameterizes the agent-environment interface as a bidirectional projection. HarnessBridge learns two projections: an observation projection, which distills raw trajectories into compact, decision-relevant states, and an action projection, which converts proposed actions into executable transitions or trajectory-grounded rejections. We train HarnessBridge on a harness supervision dataset via unified instruction tuning. On Terminal-Bench 2.0 and SWE-bench Verified, HarnessBridge matches or surpasses strong specialized harnesses while substantially reducing token usage and trajectory length, and generalizes from smaller generators to larger commercial models.

2606.12881 2026-06-12 cs.CL cs.LG New submission

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

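Yvonne Qiu, Dezhi Yu, ShuoJia Fu

AI summary: An empirical study of Direct Preference Optimization (DPO) for chatbot fine-tuning, showing that it simplifies the training pipeline, improves computational efficiency, and performs competitively, but exhibits training instability.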

Comments
7 pages, 3 figures, 1 table
Abstract

We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our experimental results demonstrate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. The evaluation using BLEU, ROUGE, and cosine similarity metrics indicates effective learning and convergence, though further investigation is needed to address observed training instability.
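
For reference, the DPO objective is commonly written as a logistic loss on the difference between the policy-versus-reference log-ratios of the preferred and dispreferred responses. The NumPy sketch below computes that loss from precomputed sequence log-probabilities; the function name `dpo_loss` and the default `beta` are illustrative choices, and nothing here reflects this paper's specific training setup.

```python
import numpy as np


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean DPO loss over a batch.

    Each argument holds summed log-probabilities of full responses under the
    trainable policy (pi_*) or the frozen reference model (ref_*).
    """
    pi_chosen = np.asarray(pi_chosen, dtype=float)
    pi_rejected = np.asarray(pi_rejected, dtype=float)
    ref_chosen = np.asarray(ref_chosen, dtype=float)
    ref_rejected = np.asarray(ref_rejected, dtype=float)
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) written stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls below that as the policy raises the preferred response's log-probability relative to the dispreferred one.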

2606.12867 2026-06-12 cs.LG New submission

SMGFM: Spectral Multimodal Graph Pretraining for Multimodal-Attributed Graphs

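Zhengyu Wu, Xu Wang, Hongchao Qin, Xunkai Li, Guang Zeng, Rong-Hua Li, Guoren Wang

AI summary: Proposes the SMGFM framework, which uses graph spectral decomposition to separate structure-induced semantics from modality-specific semantics and performs cross-modal fusion through frequency-band routing, achieving state-of-the-art performance on graph-level and modality-level tasks.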

Abstract

Multimodal-attributed graphs (MAGs) couple graph topology with node semantics from text, images, and other modalities. Traditional graph learning contextualizes node semantics by coupling topology with node features. However, this coupling design becomes troublesome in MAGs, where structure-induced and modality-intrinsic semantics may contribute differently to downstream tasks. Structure-induced semantics promote relational consistency through smooth topological variation, whereas modality-intrinsic semantics often encode local, fine-grained distinctions that should not be uniformly smoothed or aligned. Therefore, the key challenge is to identify semantic roles before cross-modal fusion. To this end, we leverage graph-frequency variation as a prior, where low-frequency components capture topology-consistent semantics and high-frequency components preserve modality-specific semantics. Based on this intuition, we propose SMGFM, a spectral multimodal graph pretraining framework that decomposes each modality-specific node signal into graph-frequency bands and assigns band-level semantic roles before cross-modal interaction. Concretely, SMGFM constructs frequency-resolved modality tokens with scalable Chebyshev filters, estimates their coupling reliability through topology-conditioned routing, and performs band-modality interaction before fusion. Its frequency-routed objectives align smooth consensus routes while preserving modality-specific routes, mitigating spatial-domain entanglement and uniform cross-modal alignment. Extensive experiments conducted on the MAG datasets demonstrate that SMGFM achieves state-of-the-art performance across graph-level and modality-level tasks.

2606.12863 2026-06-12 cs.LG New submission

Multimodal Graph Negative Learning

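Zhengyu Wu, Xu Wang, Hongchao Qin, Xunkai Li, Guang Zeng, Rong-Hua Li, Guoren Wang

AI summary: Proposes the GraphMNL framework, which uses negative learning to address node-level branch semantic imbalance in multimodal attributed graphs and to avoid propagating the dominant branch's bias, achieving the best performance on the Grocery and Reddit M datasets.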

Abstract

Multimodal attributed graphs (MAGs) integrate graph topology with heterogeneous modality attributes, such as text and images, thereby enabling richer modeling of complex relational systems. However, such expressiveness also makes learning on MAGs depend on multiple semantic sources, including structural topology and textual and visual attributes, each of which can be regarded as a branch for node representation. Node-level branch semantic imbalance arises when these branches differ across nodes in semantic informativeness and reliability: a branch that provides discriminative semantics for one node may mislead another due to bias in modality quality or structural context. Existing methods often mitigate such heterogeneity through cross-branch agreement or alignment, implicitly treating the dominant prediction as reliable supervision. When the dominant branch is biased, forced imitation may propagate its bias to other branches and suppress original semantics that are useful for classification. We propose GraphMNL, a graph-aware multimodal negative learning framework that addresses this issue by using Negative Learning as cross-branch guidance. Instead of forcing inferior branches to imitate a teacher prediction, the model teaches them which classes a node is unlikely to belong to. GraphMNL builds a branch library, identifies dominant and inferior branches via graph-aware reliability arbitration, gates unstable transfer, and applies target-preserving negative learning over non-target classes. This design decouples target supervision from branch guidance so that supervised losses learn the correct class, while Negative Learning suppresses unlikely alternatives when branch agreement is unreliable. In a comprehensive experimental evaluation, GraphMNL achieves the best performance, with 72.47% accuracy on the Grocery dataset and a 76.60 F1 score on the Reddit M dataset.
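
A generic negative-learning loss, which penalizes probability mass on classes flagged as unlikely instead of rewarding a single target, can be sketched as below. How GraphMNL actually selects the unlikely classes, arbitrates branch reliability, and gates transfer is not specified in the abstract, so the `negative_mask` input and the function name are assumptions for illustration.

```python
import numpy as np


def negative_learning_loss(logits, negative_mask, eps=1e-12):
    """Mean of -log(1 - p_k) over classes k flagged as unlikely.

    logits:        (batch, classes) scores from the branch being guided.
    negative_mask: (batch, classes) array with 1 for classes treated as
                   unlikely and 0 elsewhere (the target class stays 0).
    """
    logits = np.asarray(logits, dtype=float)
    negative_mask = np.asarray(negative_mask, dtype=float)
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    terms = -np.log(np.clip(1.0 - p, eps, 1.0)) * negative_mask
    return float(terms.sum() / max(negative_mask.sum(), 1.0))
```

Because the target class is excluded from the mask, this term only pushes probability away from the flagged alternatives and leaves the supervised loss to handle the correct class.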

2606.12843 2026-06-12 cs.LG cs.CE New submission

Interpretable Factor Decomposition for Decision Intelligence in Large-Scale Financial Markets: Evidence from China's A-Share Market

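Xiao Han, Yao Xiao, Zhen Zhang, Moxuan Zheng

AI summary: Proposes an interpretable machine learning pipeline that decomposes cross-sectional equity return prediction into auditable factor contributions; validated on China's A-share market with XGBoost and TreeSHAP, it finds that behavioral signals account for 58.2% of predictive attribution.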

Abstract

We present an interpretable machine learning pipeline that decomposes cross-sectional equity return predictability into auditable factor contributions. We apply an XGBoost model with TreeSHAP attribution and conduct stress testing on 3632 Chinese A-share stocks from 2009 to 2019. Using 60-month rolling windows over 55 months of out-of-sample data, XGBoost obtains a mean AUC of 0.547 and a long-short spread of +2.38%/month between the top and bottom quintiles (Newey-West t = 5.94; annualized Sharpe 2.23). This alpha persists after adjusting for the Carhart four-factor model (+2.31%/month; t = 7.48). SHAP decomposition indicates that behavioral signals (turnover and momentum) account for 58.2% of predictive attribution, compared to 10.7% for valuation ratios, on average across 55 industry groups. Ablation analysis cross-validates this ranking and provides evidence that SHAP and ablation diverge in a manner that highlights a feature-substitutability structure that is largely invisible to either method used in isolation.

2606.12840 2026-06-12 cs.LG New submission

CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees

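Yixiao Wang, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin

AI summary: Proposes an algorithm that combines lookahead search with rank-one Cholesky updates to build near-optimal sparse piecewise linear regression trees, achieving a favorable balance among computational efficiency, predictive accuracy, and sparsity.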

Comments
Accepted at ICML 2026
Abstract

Regression trees are among the most interpretable yet expressive model classes in machine learning. Historically, greedy induction has been the dominant approach for constructing well-performing regression trees. While optimal methods based on dynamic programming and branch-and-bound exist, they are computationally prohibitive for general linear regression trees, despite often achieving substantially better performance than greedy approaches. Recent work has shown that specialized lookahead strategies can dramatically improve runtime while maintaining near-optimal performance, primarily in classification settings. In this work, we develop a novel algorithm for near-optimal, sparse, piecewise linear regression trees that combines a lookahead-style search strategy with efficient rank-one Cholesky updates of the Gram matrix. We demonstrate, both theoretically and empirically, that our method achieves a favorable trade-off between computational efficiency, predictive accuracy, and sparsity, and scales significantly better than the current state of the art.
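
The rank-one Cholesky update mentioned in this abstract is a standard linear-algebra routine: when one sample row x is added to a leaf, the Gram matrix changes by x xᵀ, and its Cholesky factor can be refreshed in O(d²) instead of being refactorized in O(d³). The sketch below shows the textbook update only; the function name is hypothetical, and how CLARITree schedules such updates during its search is not described here.

```python
import numpy as np


def chol_rank_one_update(L, x):
    """Return L' with L' @ L'.T == L @ L.T + outer(x, x).

    L is a lower-triangular Cholesky factor with a positive diagonal.
    """
    L = np.array(L, dtype=float)   # work on copies, inputs stay untouched
    x = np.array(x, dtype=float)
    n = len(x)
    for k in range(n):
        r = np.hypot(L[k, k], x[k])
        c = r / L[k, k]
        s = x[k] / L[k, k]
        L[k, k] = r
        # Rotate the rest of column k, then fold the change back into x.
        L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
        x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L
```

A least-squares fit in a leaf can then reuse the updated factor with two triangular solves, which is where the savings over refitting from scratch would come from.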

2606.12828 2026-06-12 cs.AI New submission

Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

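Rasul Khanbayov, Hasan Kurban

AI summary: Analyzing papers from five major AI conferences from 2017 to 2025, finds that AI topics surge abruptly through "phase transitions" and uses an early-warning signature to identify topics to monitor in the coming years.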

Abstract

Do research topics in artificial intelligence grow gradually, or do they advance through abrupt, detectable jumps? Analyzing 80,814 accepted main-track papers from five premier AI conferences (ACL, CVPR, ICLR, ICML, NeurIPS) spanning 2017 to 2025, we show major AI topics advance through topical phase transitions: remaining marginal for years, then surging across venues within one to three years. Large language models became the dominant cross-venue topic by 2025, diffusion models rose with comparable abruptness, and language-model methods crossed into computer vision via vision-language models, whereas reinforcement learning compounded smoothly, distinguishing genuine phase transitions from ordinary growth. This structure is our primary contribution: a large-scale, cross-venue characterization of how AI research reorganizes. We then ask whether a transition leaves a detectable footprint before it peaks. We define an early-warning signature, four publication-dynamics criteria frozen on 2017-2021 data, and evaluate it out of sample on 2023-2025 transitions, obtaining a precision of 27% and recall of 63% against a 13.5% base rate. Applied to 2025 data, the signature flags reasoning and test-time compute, agentic AI, multimodal LLMs, retrieval-augmented generation, and world models as topics to monitor over 2026-2028. The source code is also publicly available on GitHub at https://github.com/KurbanIntelligenceLab/ai-phase-transitions.

2606.12809 2026-06-12 cs.AI cs.LG New submission

MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs

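He Li, Haoang Chi, Qizhou Wang, Yunxin Mao, Zhiheng Zhang, Jie Tan, Tongliang Liu, Wenjing Yang, Bo Han

AI summary: Proposes the MLUBench benchmark for evaluating multimodal large language models under sequential unlearning requests, finds that existing methods suffer cumulative degradation, highlights the challenge of preserving multimodal alignment, and proposes LUMoE to mitigate the degradation.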

Comments
36 pages, accepted to the ICML 2026
Abstract

Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owners may request the removal of specific content. In practice, these requests often arrive sequentially over time, giving rise to the challenging problem of MLLM lifelong unlearning. However, most existing benchmarks are limited in scale and scope, failing to capture the complexities of MLLM lifelong unlearning. To fill this gap, we introduce MLUBench, a large-scale and comprehensive benchmark featuring 127 entities across 9 classes under lifelong unlearning requests. We perform extensive experiments using MLUBench and reveal that existing unlearning methods suffer from severe, cumulative degradation. More critically, we further identify the unique challenge of this problem: unlike in unimodal models, MLLM lifelong unlearning is constrained by the need to preserve multimodal alignment. Continually unlearning from one modality could degrade the entire model. To alleviate this challenge, we propose LUMoE, an effective method. Experiments demonstrate that LUMoE significantly mitigates the degradation problem faced by baselines. The source code and the MLUBench dataset are open-sourced at https://github.com/lihe-maxsize/Lifelong_Unlearning_main.

2606.12808 2026-06-12 cs.LG cs.AI New submission

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

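Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal

AI summary: Proposes SymQNet, an amortized reinforcement-learning method that learns a posterior-conditioned acquisition policy offline and runs a fast forward pass online, substantially reducing acquisition latency in adaptive Hamiltonian learning.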

Abstract

Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across hundreds of shots, those seconds become a significant wall-clock cost for adaptivity. We introduce SymQNet, an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning. SymQNet learns a posterior-conditioned acquisition policy offline, then uses a fast policy forward pass online while retaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet substantially reduces acquisition latency relative to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At five qubits, it reduces acquisition-only decision latency by $47.1\times$ and $72.6\times$ relative to these online baselines; at twelve qubits, full simulated steps take $1.02$ s for SymQNet versus $13.27$ s for bounded two-step BALD. Overall, we show that learned acquisition can make adaptive Hamiltonian learning practical for repeated low-latency workloads.
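
The twelve-qubit timings quoted above can be turned into a per-step ratio and a cumulative wall-clock saving with a few lines of arithmetic. The sketch below is only an illustration: the two per-step times come from the abstract, while the run length `ASSUMED_STEPS` and the helper `total_seconds` are hypothetical and assume every step costs the same.

```python
# Per-step timings quoted in the abstract (twelve qubits, full simulated step).
SYMQNET_STEP_S = 1.02
BALD_STEP_S = 13.27


def total_seconds(per_step_s: float, steps: int) -> float:
    """Wall-clock cost of a run if every step takes the same time (an assumption)."""
    return per_step_s * steps


# Ratio of the two quoted per-step times.
step_speedup = BALD_STEP_S / SYMQNET_STEP_S

# HYPOTHETICAL run length, chosen only for illustration; not from the abstract.
ASSUMED_STEPS = 300
saved_seconds = (
    total_seconds(BALD_STEP_S, ASSUMED_STEPS)
    - total_seconds(SYMQNET_STEP_S, ASSUMED_STEPS)
)
```

Under these assumptions the full-step ratio is about 13x, noticeably smaller than the acquisition-only ratios reported at five qubits, which is expected when the rest of the step (simulation, posterior update) is not accelerated.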

2606.12765 2026-06-12 cs.CL cs.DC 新提交

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Rigel:逆向工程 Apple M4 Max GPU 上的 Metal 4.1 张量计算路径

Ramchand Kumaresan

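AI summary: Reverse-engineers the Metal 4.1 tensor compute path on the Apple M4 Max through microbenchmarks, showing that fp8 matmul2d is emulated rather than hardware-accelerated, and reconstructs the 8x8 tensor fragment layout.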

详情
AI中文摘要

Apple 的 Metal 4.1 暴露了一条张量计算路径:基于 cooperative_tensor 片段的 Metal Performance Primitives (MPP) matmul2d 操作,其接口有文档记录,但硬件行为被故意隐藏。规范说明了支持哪些数据类型行,但从未说明它们是否经过硬件加速、操作在物理上何处执行、其累加器宽度是多少,或者如何在线程间划分矩阵片段。我们提出了 Rigel,这是对单个 Apple M4 Max(前神经加速器一代)上该路径的经验性表征。使用校验和门控、来源追踪的微基准测试工具,Rigel 恢复了 v4.1 规范隐藏或矛盾的十一个事实。主要发现:Metal 4.1 fp8 (E4M3) matmul2d 是模拟的,而非加速的:尽管读取的操作数字节数减半,但其吞吐量仅为 fp16 的 0.94 倍,因此在 M4 上它是一个内存占用特性,而非性能特性。我们进一步通过三信号三角测量(吞吐量上限、与 simdgroup_matrix 的比较以及每路功率归因)表明,matmul2d 完全在 GPU 着色器核心上执行,没有专用的矩阵数据路径,也没有证据表明路由到 Apple 神经引擎;它使用 >=fp32 累加;并且我们重建了 Apple 在任何地方都没有记录的 opaque 8x8 cooperative_tensor 片段布局。基于该表征,一个手动融合的 GEMM + bias + GELU 内核在缓存驻留状态下比分解路径快 6.5-12.9%。所有发现均可从 MIT 许可的代码和逐单元 CSV 中重现。

英文摘要

Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, but never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding is that the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature. We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing, and that it accumulates in >=fp32. We also reconstruct the opaque 8x8 cooperative_tensor fragment layout, which Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by 6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.
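
To make the "fused versus decomposed" distinction concrete, here is a minimal pure-Python sketch. It is not the paper's Metal kernel: the function names are hypothetical, and the exact erf-based GELU is an assumption, since the abstract does not say which GELU variant is used. The point is only that the decomposed path materializes a full intermediate matrix after each stage, while the fused path applies bias and GELU to each accumulator before it is stored.

```python
import math


def gelu(x: float) -> float:
    # Exact GELU via the Gaussian CDF: x * Phi(x). (Assumed variant.)
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def decomposed(a, b, bias):
    # Three separate passes, each producing a full intermediate matrix.
    m, k, n = len(a), len(b), len(b[0])
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)] for i in range(m)]
    c = [[c[i][j] + bias[j] for j in range(n)] for i in range(m)]
    return [[gelu(c[i][j]) for j in range(n)] for i in range(m)]


def fused(a, b, bias):
    # One pass: bias add and GELU are applied to the accumulator before the store.
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = gelu(acc + bias[j])
    return out
```

Both functions compute the same values; on real hardware the fused form avoids writing and re-reading the intermediates, which is plausibly why the gain is reported for the cache-resident regime specifically.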

2606.12754 2026-06-12 cs.CL cs.AI 新提交

LLMs Can Better Capture Human Judgments--With the Right Prompts

LLMs 能更好地捕捉人类判断——使用合适的提示

Danica Dillion, Chen Cecilia Liu, Baihui Wang, Daniele Barolo, Tanmay Rajore, Niket Tandon, Pranathi Ravikumar, Kurt Gray

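AI summary: With simple prompting strategies, LLMs can recover the full distribution of human responses and become less sensitive to wording variations, improving AI-human alignment.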

详情
AI中文摘要

大型语言模型(LLMs)在捕捉人类判断方面是否表现不佳?两个常被提及的限制是:LLMs 无法捕捉反应的全分布,以及它们的判断在措辞变化上不稳定。我们展示了缓解这些限制的简单提示策略。在两个数据集上——一个代表美国的 144 个道德情景集,以及国际社会调查项目“家庭与性别角色变化”模块涵盖 32 个国家的 38 个道德信念——我们展示了简单的启发式技术如何帮助改善 AI-人类对齐。首先,提示模型报告标准差和反应比例,比常见策略更好地恢复了人类反应的完整范围。其次,确保情景对人类参与者清晰——如人类困惑评分所反映——提升了模型对齐度,且 LLMs 可以跟踪人类困惑评分。同时,我们发现 LLMs 对自身误差的估计校准不佳,尽管它们能相对较好地预测人类变异性。这些结果表明,向 LLMs 提出更好的问题可以得到更好的答案。

英文摘要

Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking LLMs better questions can yield better answers.

2606.12748 2026-06-12 cs.CL 新提交

Agent-based models for the evolution of morphological alternation patterns

基于智能体的形态交替模式演化模型

Aravinth Kulanthaivelu, Richard Sproat

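AI summary: Uses multi-agent simulation to study how morphological alternations (such as go/went) emerge, finding that scale-free social networks and random adoption policies yield more realistic morphological patterns.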

详情
Comments
51 + 37 pages. 31 Figures
AI中文摘要

为什么英语中“go”的过去式是看似无关的“went”?这种交替在语言中很常见。它们既无助于交流也不利于学习,却能持续存在数百年或数千年。我们提出了一个多智能体模拟,用于研究形态词干和屈折交替的涌现。交替形式源于语音变化,或者像“go/went”一样,来自与部分人群相关的词汇替代。当一个智能体“听到”另一个智能体对某个词形位(例如go的过去式)使用新形式时,它们会以一定概率采纳该形式,并可能将其使用扩展到共享相同原始形式的其他词形位。因此,替代形式可以在人群中传播,并固化为词干或屈折标记的交替形式。与许多先前的计算研究不同,我们的系统允许自然主义的词汇形式、现实的语音规则、包含数百或数千条目的词典,以及数十或数百个智能体的人群。它支持多种网络拓扑、扩散模式和智能体采纳策略。这类模拟的一个问题是评估:与真实语言相比,产生的形态有多真实?我们引入了AI历史语言学家,这是一个新颖的大型语言模型驱动系统,模拟两位历史语言学家之间的辩论。我们用它来比较一组真实语言的形态、伪装形态和实验演化形态。结果表明,有利于产生更合理形态的因素包括无标度社交网络和随机伯努利形式采纳。我们还提出了三个案例研究,模拟了有记载的历史变化,使我们能够测试如果历史不同会发生什么。所有代码和数据均已发布。

英文摘要

Why is the past of English "go" the apparently unrelated "went"? Such alternations are frequent in languages. They aid neither communication nor learnability, yet they can be persistent, surviving over centuries or millennia. We present a multi-agent simulation of the emergence of morphological stem and inflection alternations. Alternate forms arise by phonological changes or, as with "go/went", from lexical alternatives associated with a subset of the population. When an agent 'hears' another agent use a novel form for a slot in the paradigm of a word (say, the past tense of go), it will with some probability adopt that form, possibly spreading its use to other slots in the paradigm that shared the same original form. Thus alternative forms can spread through the population and become entrenched as stem or inflectional marker alternants. Unlike many previous computational studies, our system allows for naturalistic lexical forms, realistic phonological rules, lexicons with hundreds or thousands of entries, and agent populations in the tens or hundreds. It supports several network topologies, diffusion patterns and agent adoption policies. One issue with such simulations is evaluation: how realistic is the resulting morphology compared to that of real languages? We introduce the AI Historical Linguist, a novel Large Language Model-driven system that models a debate between two historical linguists. We use this to compare a set of real language morphologies, disguised morphologies, and experimentally evolved morphologies. The results suggest that among the factors that favor more plausible morphologies are scale-free social networks and random Bernoulli adoption of forms. We also present three case studies modeling attested historical changes, allowing us to test what might have happened if history had been different. All code and data are released.
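
The adoption rule described above (a hearer adopts a novel form with some probability) can be sketched as a toy simulation. This is a deliberately reduced illustration, not the authors' system: it tracks a single paradigm slot, uses a fully connected population rather than a scale-free network, never reverts an adopted form, and the function name and parameters are hypothetical.

```python
import random


def simulate_adoption(n_agents: int, p_adopt: float, rounds: int, seed: int = 0) -> float:
    """Return the fraction of agents using the novel form after `rounds` interactions."""
    rng = random.Random(seed)
    uses_novel = [False] * n_agents
    uses_novel[0] = True  # one innovator seeds the novel form
    for _ in range(rounds):
        # Pick a random speaker/hearer pair (fully connected population, an assumption).
        speaker, hearer = rng.sample(range(n_agents), 2)
        # Bernoulli adoption: the hearer takes up the novel form with probability p_adopt.
        if uses_novel[speaker] and not uses_novel[hearer] and rng.random() < p_adopt:
            uses_novel[hearer] = True
    return sum(uses_novel) / n_agents
```

With `p_adopt = 0.0` the form stays confined to the innovator; raising it lets the form spread. Swapping the uniform pair sampling for draws along the edges of a network is where topology effects such as those reported above would enter.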

2606.12718 2026-06-12 cs.LG eess.SP 新提交

Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting

面向开放集射频指纹识别的分布外检测器

Sudeepta Mondal, Ganesh Sundaramoorthi

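AI summary: To address distribution shift from unknown transmitters and temporal drift in open-set RF fingerprinting, introduces a unified information-theoretic framework for OOD detection and adopts methods that need no OOD tuning data, with performance on the POWDER dataset close to baselines that have real OOD data.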

详情
AI中文摘要

射频指纹识别系统必须在开放世界环境中运行,其中来自未知发射机的信号和时间漂移会在测试时引入分布偏移。分布外检测为该问题提供了自然框架,但其在射频指纹识别中的应用仍然有限。其采用的一个关键障碍是大多数OOD检测器需要辅助OOD数据进行参数调优,而在射频环境中收集代表性OOD数据不切实际,这一假设难以满足。在这项工作中,我们将机器学习文献中一组有前景的OOD检测方法引入开放集RFF领域。我们基于信息论(通信系统的自然框架)在一个统一的数学框架中呈现这些方法。我们的框架允许对方法进行系统分析并开发新方法。我们进一步展示了最近关于无需给定OOD调优数据即可调优OOD检测器的工作在开放集RFF中的适用性。我们在POWDER射频指纹数据集上进行评估,表明无需任何给定OOD数据调优的检测器性能与能够访问真实OOD调优数据的基线相当,并且大大优于无法访问真实OOD调优数据的基线方法,展示了RFF问题的实际可行性。

英文摘要

Radio-frequency (RF) fingerprinting systems must operate in open-world environments where signals from unknown transmitters and temporal drift introduce distribution shift at test time. Out-of-distribution (OOD) detection provides a natural framework for this problem, yet its application to RF fingerprinting (RFF) remains limited. A key barrier to adoption is that most OOD detectors require auxiliary OOD data for parameter tuning, an assumption that is difficult to satisfy in RF environments where representative OOD data is impractical to collect. In this work, we introduce a promising set of OOD detection methods from the machine learning literature to the open-set RFF domain. We present these methods within a unified mathematical framework based on information theory, which is a natural framework for communication systems. Our framework allows for the systematic analysis of existing methods and the development of new ones. We further demonstrate the applicability of recent work on tuning OOD detectors without given OOD tuning data to open-set RFF. We evaluate on the POWDER RF fingerprinting dataset, showing that detectors tuned without any given OOD data achieve performance comparable to baselines with access to true OOD tuning data and greatly outperform baseline approaches without such access, demonstrating practical viability for the RFF problem.
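
As a rough illustration of an information-theoretic OOD score tuned without OOD data, the sketch below scores an input by the Shannon entropy of the classifier's softmax output and sets the threshold from in-distribution scores alone. This is a generic, well-known baseline, not necessarily one of the detectors studied in the paper; the function names and the `keep_fraction` parameter are assumptions.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_score(logits):
    # Shannon entropy (nats) of the predictive distribution; higher means more OOD-like.
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def threshold_from_id(id_scores, keep_fraction=0.95):
    # Uses in-distribution scores only: roughly `keep_fraction` of them
    # fall at or below the returned threshold.
    ordered = sorted(id_scores)
    index = min(len(ordered) - 1, int(keep_fraction * len(ordered)))
    return ordered[index]


def is_ood(logits, threshold):
    return entropy_score(logits) > threshold
```

A known transmitter should produce a peaked distribution (low entropy), while an unseen transmitter tends to spread probability mass across classes. Fixing the threshold from held-out in-distribution data is one simple way to avoid needing representative OOD samples.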

2606.12702 2026-06-12 cs.AI 新提交

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

以部署为中心的评估:预测临床大语言模型系统中的查询级拒绝风险

Alyssa Unell, Miguel Fuentes, Brenna Li, Bridget Lin, Meena Jagadeesan, Sanmi Koyejo, Nigam Shah

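AI summary: For a clinical LLM system, proposes a pre-response classifier based on deployment context (such as provider type and department name) that predicts user rejection risk, reaching an AUROC of 0.719, and shows its utility for guardrail triggering and abstention.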

详情
AI中文摘要

大语言模型(LLMs)正越来越多地集成到临床系统中,因此评估这些系统的实际效用至关重要。然而,静态基准倾向于衡量正确性而非用户接受度,跨查询聚合性能,并需要密集标注的数据集——这导致评估临床系统时存在重大盲点。在这项工作中,我们对嵌入某学术医疗中心电子健康记录中的LLM系统进行了以部署为中心的评估,其中用户反馈稀疏但密切反映了部署条件。具体而言,我们训练了一个预响应分类器,该分类器基于查询内容和生成前可用的部署特定上下文,估计未来交互导致用户拒绝LLM响应的风险。我们对模型进行了4.5个月用户反馈的前瞻性分析,发现我们的预测模型达到了0.719的AUROC。此外,我们估计了此类预测在两个下游用例(触发护栏和弃权)中的益处。我们的关键概念洞察是,利用部署特定上下文(即提供者类型、科室名称、用于响应的语言模型),而不仅仅是查询内容,可以提高预测用户是否会拒绝系统输出的能力。总之,我们的实证案例研究证明了使用部署特定上下文预测用户拒绝的可行性,为定向护栏打开了大门。

英文摘要

Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets -- leading to major blind spots for evaluating clinical systems. In this work, we perform a deployment-centered evaluation of an LLM system embedded within electronic health records at an academic medical center, where user feedback is sparse but closely reflects the deployment conditions. Specifically, we train a pre-response classifier that estimates the risk that a future interaction will result in the user rejecting the LLM response, based on query content and deployment-specific context available before generation. We conduct a prospective analysis of our model over 4.5 months of user feedback, finding that our prediction model achieves an AUROC of 0.719. Further, we estimate the benefit of such predictions in two downstream use cases (guardrail triggering and abstention). Our key conceptual insight is that making use of deployment-specific context (i.e., the provider type, department name, language model used for response), as opposed to only query content, improves the ability to predict whether the user will reject the system output. Altogether, our empirical case study demonstrates the feasibility of predicting user rejection using deployment-specific context, opening the door to targeted guardrails.
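
For readers less familiar with the headline metric, AUROC can be read as the probability that a randomly chosen rejected interaction receives a higher risk score than a randomly chosen accepted one. The sketch below computes it directly from that definition. It is a generic illustration with a hypothetical function name, not the authors' evaluation code, and the quadratic loop is meant for clarity rather than scale.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

On this reading, the reported 0.719 means the classifier ranks a rejected interaction above an accepted one roughly 72% of the time, against 50% for random scoring.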

2606.12674 2026-06-12 cs.AI 新提交

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Evoflux: 紧凑型智能体的可执行工具工作流的推理时演化

Kushal Raj Bhandari, Ling Yue, Ching-Yun Ko, Dhaval Patel, Shaowu Pan, Pin-Yu Chen, Jianxi Gao

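AI summary: Proposes Evoflux, an inference-time evolutionary search method that repairs the tool workflows of compact language models through structured edits and execution feedback, raising execution feasibility from roughly 3% to 17-24%. SFT variants match or underperform, while ReAct reaches higher peaks at higher variance and token cost.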

详情
Comments
Code is available at https://github.com/IBM/Evoflux
AI中文摘要

紧凑型语言模型(LMs)降低了工具智能体的成本、延迟和部署风险。然而,MCP风格的工具使用不仅仅需要孤立的函数调用:智能体必须从实时目录中发现工具、满足模式、跨中间输出保留依赖关系,并在执行证据中基于最终响应。小型规划器通常生成看似合理的工作流图,但在工具解析、参数验证、依赖跟踪或执行中失败。我们认为,小语料蒸馏难以处理这种失败模式。几百个教师轨迹可以教授工作流格式,但很少涵盖修复失败计划所需的恢复行为。我们引入了Evoflux,一种推理时演化搜索方法,将紧凑工具使用视为可执行工具工作流的修复。它通过结构化编辑、执行反馈、自适应强度、元引导重设计和多样性剪枝来演化类型化工作流图。在涵盖实时MCP服务器和250个工具的保留MCP-Bench任务上,Evoflux将小型规划器的执行可行性从约3%提高到17-24%。相比之下,在相同搜索挖掘数据上的SFT和SFT+DPO匹配、表现不佳或崩溃至零样本性能以下;ReAct达到更高峰值,但方差和令牌成本更高。这些结果表明,在稀缺的教师轨迹预算下,基于执行的搜索更可靠。

英文摘要

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.
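
The failure classes named above (tool resolution, parameter validation, dependency tracking) can be pictured as static checks on a workflow graph before anything is executed. The sketch below is a hypothetical illustration only: the step dictionary layout, the `catalog` format, and `check_workflow` are invented for this example and are not Evoflux's actual data structures or API.

```python
def check_workflow(steps, catalog):
    """Return a list of human-readable failures; an empty list means these checks pass.

    steps:   ordered list of dicts with "id", "tool", optional "args" and "deps"
    catalog: maps tool name -> list of required parameter names
    (Both layouts are assumptions made for this sketch.)
    """
    failures = []
    seen = set()
    for step in steps:
        sid, tool = step["id"], step["tool"]
        if tool not in catalog:
            # Tool resolution: the planner named a tool the catalog does not offer.
            failures.append(f"{sid}: unknown tool {tool!r}")
        else:
            # Parameter validation: every required parameter must be supplied.
            missing = sorted(set(catalog[tool]) - set(step.get("args", {})))
            if missing:
                failures.append(f"{sid}: missing parameters {missing}")
        for dep in step.get("deps", []):
            # Dependency tracking: a step may only consume outputs of earlier steps.
            if dep not in seen:
                failures.append(f"{sid}: depends on {dep!r}, which has not run yet")
        seen.add(sid)
    return failures
```

A failure list like this is the kind of structured feedback a repair loop could act on, for example by proposing an edit that swaps the unknown tool or adds the missing argument, then re-checking.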

2606.12671 2026-06-12 cs.CV 新提交

SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images

SalArt-VQA: 诊断VLM是否理解生成图像中的显著伪影

Xiaoxiao Sun, Ruotian Zhang, Junzhe Huang, James Burgess, Serena Yeung-Levy

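AI summary: Proposes the SalArt-VQA benchmark, which uses 950 images and 3,681 multiple-choice questions to evaluate VLM understanding of artifacts in generated images along four dimensions (detection, localization, spatial grounding, and defect identification), revealing failure modes hidden behind high detection accuracy.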

详情
Comments
23 pages, 7 figures, 7 tables. Dataset: https://huggingface.co/datasets/salartvqa/SalArt-VQA
AI中文摘要

视觉语言模型(VLM)越来越多地被用于检测AI生成图像是否包含可见伪影,然而它们分析此类伪影的能力仍然知之甚少。正确的图像级决策仍可能隐藏重要失败:模型可能正确标记伪影,但依赖于错误的视觉线索、选择错误的区域,或描述图像中不存在的缺陷。为了直接评估这些行为,我们引入了SalArt-VQA,一个用于细粒度理解AI生成图像中显著伪影的诊断基准。SalArt-VQA包含950张图像和3,681道人工编写的多项选择题,涵盖伪影图像、匹配的真实参考图像和配对的生成参考图像。四种对齐的问题类型评估存在检测、语义定位、空间基础和证据基础的缺陷识别,而参考分割测试了当注释缺陷不存在时的校准和弃权能力。在20个VLM上,SalArt-VQA揭示了图像级检测准确率所隐藏的失败:最强的模型在伪影图像上达到99.37%的检测召回率,但仅在53.26%的图像上正确回答了所有四个伪影侧问题。比较伪影图像与无伪影参考揭示了灵敏度-校准权衡:敏感模型经常做出无根据的伪影声明,而保守模型主要通过遗漏真实伪影来避免误报。这些结果表明,高伪影检测准确率本身并不意味着有基础的伪影理解。SalArt-VQA暴露了这些隐藏的失败模式,并提供了对VLM伪影声明是否得到局部视觉证据支持的细粒度评估。

英文摘要

Vision-language models (VLMs) are increasingly used to detect whether AI-generated images contain visible artifacts, yet their ability to analyze such artifacts remains poorly understood. A correct image-level decision can still hide important failures: a model may correctly flag an artifact while relying on the wrong visual cue, selecting the wrong region, or describing a defect that the image does not support. To evaluate these behaviors directly, we introduce SalArt-VQA, a diagnostic benchmark for fine-grained SALient ARTifact understanding in AI-generated images. SalArt-VQA contains 950 images and 3,681 human-authored multiple-choice questions spanning artifact images, matched real reference images, and paired generated reference images. Four aligned question types evaluate presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification, while the reference splits test calibration and abstention when the annotated defect is absent. Across 20 VLMs, SalArt-VQA reveals failures that image-level detection accuracy hides: the strongest model reaches 99.37% detection recall on artifact images but answers all four artifact-side questions correctly on only 53.26% of images. Comparing artifact images with artifact-free references reveals a sensitivity-calibration tradeoff: sensitive models often make unsupported artifact claims, while conservative models avoid false alarms largely by missing real artifacts. These results show that high artifact detection accuracy alone does not imply grounded artifact understanding. SalArt-VQA exposes these hidden failure modes and provides a fine-grained evaluation of whether VLM artifact claims are supported by local visual evidence.
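
The gap between 99.37% detection recall and 53.26% all-four-correct follows from how the two metrics aggregate: one looks at a single question per image, the other requires every question on that image to be right. A minimal sketch of that difference, with hypothetical record keys standing in for the four question types (this is not the benchmark's scoring code):

```python
# Hypothetical keys for the four aligned question types.
QUESTION_TYPES = ("detect", "localize", "ground", "identify")


def detection_recall(records):
    # Fraction of artifact images where the presence question was answered correctly.
    return sum(r["detect"] for r in records) / len(records)


def all_four_correct(records):
    # Fraction of artifact images where every one of the four questions was correct.
    return sum(all(r[k] for k in QUESTION_TYPES) for r in records) / len(records)
```

Because the joint metric can never exceed the weakest per-question accuracy, a model can score near-perfectly on detection and still fail the stricter criterion on close to half the images.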

2606.12658 2026-06-12 cs.LG q-bio.QM stat.ML 新提交

Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability

基于物理信息的神经网络用于化疗药代动力学:基准测试临床估计器并揭示参数可辨识性

Riya Bisht, Dhruv Agarwal

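AI summary: Applies physics-informed neural networks (PINNs) to chemotherapy pharmacokinetics: on the linear two-compartment model the PINN matches the clinical standard method, on the Michaelis-Menten extension it exposes parameter non-identifiability, and sparse tissue observations partially restore identifiability.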

详情
AI中文摘要

物理信息神经网络(PINN)是生物学中部分观测问题的一个有吸引力的工具,其中控制动力学已知但某些隔室无法测量。化疗药代动力学(PK)是一个清晰的实例:血浆中的药物浓度常规测量,但组织中的浓度——决定肿瘤杀伤和脱靶毒性——无法测量。我们在两个PK问题上将PINN与标准临床基线(非线性最小二乘解析双指数血浆解,以下简称NLS)和物理无关的神经基线(仅数据的MLP)进行基准测试。在线性双室问题上,NLS接近最优;PINN在匹配其性能(小常数因子内)的同时,在单次训练过程中产生组织曲线,而仅数据的MLP在组织上失败约10倍。在Michaelis-Menten扩展(可饱和消除)上,双指数闭式不再存在,因此NLS被错误指定并静默返回无意义的速率常数。PINN反而揭示了一个更深层的事实:Michaelis-Menten双室模型仅从血浆数据不可辨识,PINN通过收敛到k12 -> 0的盆地诚实地报告这一点。添加两个稀疏组织观测在很大程度上解决了可辨识性:在五个随机种子上,PINN恢复k21在真实值的1%以内,Vmax和Km在一个标准差范围内,而k12向正确方向移动(0.02 -> 0.82)但仍低于真实值约2个标准差——这是闭式NLS估计器根本无法尝试的恢复,因为其双指数假设仅描述血浆。我们的主张不是PINN击败NLS。而是PINN提供了一种统一的方案,该方案在教科书问题上与教科书估计器匹配,揭示了教科书估计器隐藏的结构可辨识性,并在单一损失中吸收异构测量。

英文摘要

Physics-Informed Neural Networks (PINNs) are an attractive tool for partial-observation problems in biology, where the governing dynamics are known but some compartments cannot be measured. Chemotherapy pharmacokinetics (PK) is a clean instance: drug concentration in plasma is routinely measured, but concentration in tissue -- which determines tumour kill and off-target toxicity -- is not. We benchmark a PINN against the standard clinical baseline (nonlinear least-squares on the analytical biexponential plasma solution, hereafter NLS) and a physics-agnostic neural baseline (a data-only MLP) on two PK problems. On the linear two-compartment problem, NLS is near-optimal; the PINN matches it to within a small constant factor while also producing the tissue curve in a single training pass, whereas the data-only MLP fails on tissue by roughly 10x. On a Michaelis-Menten extension (saturable elimination), the biexponential closed form no longer exists, so NLS is mis-specified and silently returns meaningless rate constants. The PINN instead exposes a deeper fact: the Michaelis-Menten two-compartment model is non-identifiable from plasma alone, and the PINN reports this honestly by converging to a basin with k12 -> 0. Adding two sparse tissue observations largely resolves identifiability: across five seeds the PINN recovers k21 to within 1% of truth and Vmax, Km to within one standard-deviation bar, while k12 moves in the correct direction (0.02 -> 0.82) but remains ~2 sigma below truth -- a recovery the closed-form NLS estimator cannot attempt at all, because its biexponential ansatz describes only plasma. Our claim is not that PINNs beat NLS. It is that PINNs offer a uniform recipe that ties the textbook estimator on the textbook problem, exposes structural identifiability that the textbook estimator hides, and absorbs heterogeneous measurements within a single loss.
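
For orientation, the linear two-compartment model behind the k10, k12, k21 notation can be simulated in a few lines. The sketch below is a plain forward simulation, not a PINN and not the authors' code. It assumes an IV bolus into plasma, works in drug amounts rather than concentrations (so compartment volumes drop out), and uses a classical RK4 integrator; the function names and any rate values passed to it are illustrative.

```python
def rhs(state, k10, k12, k21):
    """Right-hand side of the linear two-compartment model, in drug amounts."""
    plasma, tissue, eliminated = state
    return (
        -(k10 + k12) * plasma + k21 * tissue,  # plasma: elimination, transfer out, return
        k12 * plasma - k21 * tissue,           # tissue: exchanges with plasma only
        k10 * plasma,                          # running total of eliminated drug
    )


def simulate_two_compartment(dose, k10, k12, k21, t_end, dt=0.01):
    """Integrate from an IV bolus with classical RK4; returns (plasma, tissue, eliminated)."""
    state = (dose, 0.0, 0.0)
    for _ in range(int(round(t_end / dt))):
        s1 = rhs(state, k10, k12, k21)
        s2 = rhs(tuple(x + 0.5 * dt * d for x, d in zip(state, s1)), k10, k12, k21)
        s3 = rhs(tuple(x + 0.5 * dt * d for x, d in zip(state, s2)), k10, k12, k21)
        s4 = rhs(tuple(x + dt * d for x, d in zip(state, s3)), k10, k12, k21)
        state = tuple(
            x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(state, s1, s2, s3, s4)
        )
    return state
```

The tissue curve here is computed from known parameters; the estimation problem in the abstract runs the other way, inferring the parameters and the unobserved tissue curve from plasma samples. A saturable (Michaelis-Menten) variant would replace the linear `k10 * plasma` term with a rate of the form Vmax * C / (Km + C), at which point the biexponential closed form no longer applies.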

2606.12639 2026-06-12 cs.LG q-bio.QM 新提交

The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

度量选择胜者:评估选择翻转未见化学空间中药物反应预测的模型排名

Dhruv Agarwal, Riya Bisht

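AI summary: Using VCPI contest data, shows that model rankings for drug-response prediction invert with the evaluation metric: a simple baseline wins under a proxy metric, but under the true metric deep models significantly beat the linear fingerprint baseline, which the authors describe as, to their knowledge, the first demonstration of the metric-calibration effect on real drug chemistry data.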

详情
AI中文摘要

预测细胞转录组对其从未见过的药物的反应是计算细胞生物学中的一个核心难题:最近的基准测试表明,一旦测试化合物按化学结构留出,复杂模型往往无法击败简单基线。我们研究了一个细胞系和检测方法,即通过DRUG-seq分析的THP-1细胞,由VCPI预测竞赛的活性化合物加权MSE(wMSE)评分。我们提出了一种分阶段方法:该领域一直无法击败的简单基线(未处理对照和平均训练化合物响应);非参数检索(对留出化合物的最近训练化合物进行Tanimoto加权平均);以及一个融合阶段,将冻结的化学嵌入与检索支持特征相结合,以预测相对于均值的残差,并包含不确定性头和基因程序。在发布的VCPI THP-1 drug-seq数据(14,026个训练化合物)上,采用Bemis-Murcko骨架划分,模型排名根据度量标准反转。在逆方差每基因代理度量下,基于Morgan指纹的正则化线性回归似乎胜过了深度模型、检索和ChemBERTa——这是教科书式的“简单基线获胜”结果。但在竞赛的真实活性集度量(每(基因,化合物)的Mejia权重,经官方评分器验证;均值基线0.535 vs 组织者的0.507参考)下,情况反转:深度模型获胜,我们的融合解码器显著优于线性指纹基线(-0.012 wMSE,配对bootstrap p < 10^-4),而代理度量的胜者成为最差的化学感知预测器。选择度量即选择胜者——据我们所知,这是首次在真实留出药物化学数据上证明度量校准效应,该效应此前主要在遗传扰动中建立。我们发布了一个可复现的流水线,连接到官方评分器,可在真实的1064 x 12,995网格上生成有效提交。

英文摘要

Predicting how a cell's transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound's nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 DRUG-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa -- the textbook "simple baselines win" result. But under the contest's true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers' 0.507 reference), that reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy's winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner -- to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1064 x 12,995 grid.
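
How a change of weights alone can invert a ranking is easy to see on a toy example. In the sketch below every number is invented for illustration: the two weight vectors are stand-ins for "a proxy that emphasizes quiet cells" and "a metric that emphasizes active cells", not the contest's actual Mejia weights or the paper's inverse-variance proxy.

```python
def weighted_mse(pred, truth, weights):
    """Weighted mean squared error over flattened (gene, compound) cells."""
    num = sum(w * (p - t) ** 2 for p, t, w in zip(pred, truth, weights))
    return num / sum(weights)


# Toy data: two inactive cells (true response 0) and two active cells.
truth = [0.0, 0.0, 2.0, -2.0]

# "Mean-like" model: predicts no response anywhere.
pred_flat = [0.0, 0.0, 0.0, 0.0]
# "Responsive" model: tracks active cells well, but is noisy on inactive ones.
pred_responsive = [1.0, -1.0, 1.8, -1.8]

# HYPOTHETICAL weightings, chosen only to show the effect.
proxy_weights = [10.0, 10.0, 1.0, 1.0]   # emphasizes inactive cells
active_weights = [0.0, 0.0, 1.0, 1.0]    # scores active cells only
```

Under `proxy_weights` the flat model has the lower error, while under `active_weights` the responsive model wins by a wide margin: same predictions, opposite ranking. This is the mechanism the abstract calls picking the winner by picking the metric.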

2606.12633 2026-06-12 cs.CV cs.LG 新提交

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

ECA:面向开放图像到文本生成的高效持续对齐

Jiangtao Kong, Peijun Zhao, Chun-Fu Chen, Youngwook Do, Shaohan Hu, Tianyi Zhou, Huajie Shao

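AI summary: Proposes ECA, which combines a Mixture of Query module, Fisher Dynamic Expansion, and Dictionary Replay to achieve continual alignment without old data, mitigating catastrophic forgetting and improving incremental learning for open-ended image-to-text generation.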

详情
Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

开放图像到文本生成(OpenITG)的增量学习(IL)使模型能够持续为新的图像生成准确、上下文相关的文本,同时保留先前获得的知识。与先前研究不同,本文处理了一个更实际的场景,其中视觉数据的主要类别随时间推移而演变。在此背景下,我们引入了持续对齐的新概念,它逐步调整预训练VLM中的对齐模块,以保持高质量的跨模态表示。基于这一思想,我们提出了高效持续对齐(ECA),一种用于OpenITG的无样本IL方法。关键挑战是使模型能够获取新的任务特定特征,同时最小化对已建立对齐的干扰,且无需访问先前任务的原始数据。为此,ECA采用了三种核心机制:混合查询(MoQ)模块,用于适应任务特定的查询令牌;Fisher动态扩展(FeDEx),基于Fisher信息矩阵(FIM)度量动态扩展模型结构;以及带有字典重放(DR)的嵌入字典,以保留过去的知识。为了评估ECA的性能,我们构建了四个新的IL OpenITG基准,更好地反映了现实场景。实验结果表明,与基线方法相比,ECA显著缓解了灾难性遗忘并提高了IL性能。代码和基准可在 https://github.com/Snowball0823/ECA 获取。

英文摘要

Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at https://github.com/Snowball0823/ECA.

2606.12609 2026-06-12 cs.LG q-bio.QM 新提交

Viral Proteins Reveal Geometry of Protein Language Models

病毒蛋白质揭示蛋白质语言模型的几何结构

Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang

AI总结 研究蛋白质语言模型在不平衡数据下对病毒蛋白的表示,发现嵌入空间中存在主导的“天然性”轴,该轴按模型困惑度排序序列,且缩放效果因病毒家族而异,但嵌入仍保留病毒特异性信号。

详情
Comments
Accepted at ICML 2026 GenBio Workshop and FM4LS Workshop. Code available at https://github.com/MisteFr/viral-proteins-plms
AI中文摘要

蛋白质语言模型在高度不平衡的数据集上训练,引发了一个问题:它们如何表示代表性不足的生物序列?以病毒蛋白作为跨ESM模型家族的案例研究,我们在嵌入空间中识别出一个主导的天然性轴,该轴与掩码重建困惑度对齐,将序列从建模良好的细胞蛋白通过病毒蛋白排序到打乱和随机序列。缩放效果在不同病毒家族间不均匀地压缩该轴。尽管如此,蛋白质语言模型嵌入保留了病毒特异性信号:病毒蛋白在零样本困惑度和浅层序列特征之上仍然是线性可分的。这些结果共同表明,pLM表示由天然性的一般概念结构化,同时保留了特定于不同生物群体的信息。

英文摘要

Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.

2606.12604 2026-06-12 cs.RO 新提交

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

EgoEngine:从自我中心人类视频到高保真灵巧机器人演示

Yangcen Liu, Shuo Cheng, Xinchen Yin, Woo Chul Shin, Alfred Cueva, Yiran Yang, Zhenyang Chen, Chuye Zhang, Danfei Xu

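AI summary: Proposes the EgoEngine framework, which bridges the visual and action gaps to turn egocentric human videos into high-fidelity robot data, and reports what the authors believe is the first zero-shot dexterous policy learning from such videos.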

详情
AI中文摘要

灵巧操作受限于大规模机器人演示数据的收集成本。自我中心人类视频提供了多样操作行为的可扩展来源,但直接用于机器人学习需要弥合两个差距:人类与机器人观测之间的视觉差距,以及人类运动与机器人可执行动作之间的动作差距。我们提出EgoEngine,一个可扩展的框架,用于将自我中心人类操作视频转化为高保真机器人数据。给定一个自我中心RGB视频,EgoEngine生成:(i) 高保真机器人观测视频,用机器人替换人类,同时保留场景上下文和时间对齐,以及(ii) 在可行性约束下,与任务对齐、可执行的机器人动作轨迹。在仿真和真实机器人上的实验表明,EgoEngine能够将人类视频可扩展地转化为机器人数据,并且据我们所知,首次展示了无需真实机器人演示,从自我中心人类视频进行零样本视觉运动灵巧策略学习。项目网站:https://egoengine.github.io。

英文摘要

Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable actions. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video that replaces the human with a robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: https://egoengine.github.io.

2606.12599 2026-06-12 cs.CL 新提交

Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation

通过波斯谚语条件故事生成实现LLM中的约束语义解压缩

Zahra Habibzadeh, Paria Khoshtab, Amir Mesbah, Yadollah Yaghoobzadeh

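AI summary: Proposes the constrained semantic decompression task, which tests abstraction-to-realization in LLMs through Persian proverb-conditioned story generation; builds the PAND dataset, identifies a decompression gap, and shows that explicit reasoning and iterative refinement partially mitigate it.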

详情
AI中文摘要

将一个密集、抽象的谚语转化为引人入胜且道德忠实的故事需要深厚的文化理解和稳健的语义基础。我们将此问题定义为约束语义解压缩任务,并研究谚语条件故事生成作为大语言模型中抽象到实现的测试平台。聚焦波斯语,我们引入了谚语对齐叙事数据集(PAND),将谚语与人类编写的故事和显式含义配对。通过结合人类校准的LLM-as-a-Judge与结构度量的混合评估框架,我们分析了多种提示机制下的模型行为。我们的发现揭示了一个持续存在的解压缩差距:当前的LLM通常实现强大的表面流畅性,但未能忠实地实例化谚语中编码的潜在道德和因果结构。我们进一步表明,显式推理和迭代细化可以部分缓解这些失败,这表明许多解压缩错误源于将抽象含义转化为叙事形式的困难,而非完全缺乏相关知识。我们提出的任务自然扩展到其他形式的压缩文化知识。

英文摘要

Transforming a dense, abstract proverb into an engaging and morally faithful narrative requires deep cultural understanding and robust semantic grounding. We frame this problem as a constrained semantic decompression task and study proverb-conditioned story generation as a testbed for abstraction-to-realization in large language models (LLMs). Focusing on Persian, we introduce the Proverb Aligned Narrative Dataset (PAND), pairing proverbs with human-written stories and explicit meanings. Using a hybrid evaluation framework that combines human-calibrated LLM-as-a-Judge with structural metrics, we analyze model behavior across multiple prompting regimes. Our findings reveal a persistent decompression gap: current LLMs often achieve strong surface-level fluency while failing to faithfully instantiate the underlying moral and causal structure encoded in proverbs. We further show that explicit reasoning and iterative refinement can partially mitigate these failures, suggesting that many decompression errors arise from difficulties in translating abstract meaning into narrative form rather than a complete lack of relevant knowledge. Our proposed task naturally extends to other forms of compressed cultural knowledge.