arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30200 2026-05-29 cs.AI

Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale

Canran Wang, Yuwen Yang, Zhen Wang, Ming Ma, Ding Yu, Chentai Wang, Keman Huang, Xiaoyong Du

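AI summary: The paper builds a triadic collaboration system and pairs it with an evaluation framework grounded in Systemic Functional Linguistics and a suggestion trajectory tracing pipeline. Using a large-scale empirical dataset of 57,954 essays, it confirms that dividing labor between the LLM as generative engine and the teacher as pedagogical gatekeeper improves writing quality, and it finds a ceiling effect in which linguistic expansion yields diminishing marginal utility.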

Abstract

The double-edged sword of integrating Large Language Models (LLMs) requires an effective triadic collaboration mechanism among LLMs, teachers, and students, especially in K-12 education. By developing a triadic collaboration system to support K-12 writing learning, together with a multidimensional evaluation framework grounded in Systemic Functional Linguistics and a suggestion trajectory tracing pipeline, this paper contributes a large-scale empirical dataset of 57,954 essays from 10,195 students across 120 schools over two years. Our findings confirm the efficacy of this system in improving writing quality through a strategic division of labor: the LLM serves as a generative engine to mitigate teacher burnout, and the teacher acts as a pedagogical gatekeeper and bridge to guarantee feedback quality. While both the LLM and the teacher are critical for skill improvement, we uncover a ceiling effect where excessive linguistic expansion yields diminishing marginal utility. These findings suggest a dynamically adaptive LLM-teacher collaboration as student proficiency increases.

2605.30198 2026-05-29 cs.LG

Active Continual Learning with Metaplastic Binary Bayesian Neural Networks

Kellian Cottart, Théo Ballet, Djohan Bonnet, Damien Querlioz

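AI summary: To address posterior saturation and frozen plasticity in continual learning on edge systems, the paper proposes BiMU, derived from a bounded-memory variational objective. An uncertainty-dependent step size and relaxation toward the prior keep the posterior non-degenerate, enabling buffer-free active querying. BiMU sustains learning on Permuted-MNIST and substantially reduces labels and updates on OpenLORIS-Object.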

Comments
Accepted at ICML 2026
Abstract

Always-on edge systems must keep learning as conditions change under tight compute budgets and must detect unreliable predictions. Bayesian binary neural networks are attractive in this setting, but mean-field Bernoulli posteriors can saturate on long non-stationary streams, wiping out epistemic uncertainty and freezing plasticity. We propose BiMU, derived from a bounded-memory variational objective that balances stability, plasticity, and forgetting. BiMU combines a data term with controlled relaxation toward the prior and an uncertainty-dependent step size that prevents saturation and sustains informative uncertainty. This non-degenerate posterior enables fully online, buffer-free active querying via Monte Carlo disagreement, reducing label queries and backpropagation updates under imbalance. BiMU sustains learning and strong OOD detection on 1000-task Permuted-MNIST, and on OpenLORIS-Object achieves up to 32× label/update savings at matched accuracy under class imbalance and feature compression.
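
Illustrative sketch (not from the paper): one way to read "active querying via Monte Carlo disagreement", assuming a single linear layer with weights in {-1, +1} and a mean-field Bernoulli posterior. The function names, the vote-based disagreement measure, and the threshold are hypothetical choices made for this example and are not BiMU's actual procedure.

```python
import random


def sample_weights(probs, rng):
    """Draw binary weights in {-1, +1}; probs[i] is the assumed P(w_i = +1)."""
    return [1 if rng.random() < p else -1 for p in probs]


def mc_disagreement(probs, x, n_samples=64, seed=0):
    """Fraction of sampled networks that vote against the majority class."""
    rng = random.Random(seed)
    positive_votes = 0
    for _ in range(n_samples):
        w = sample_weights(probs, rng)
        score = sum(wi * xi for wi, xi in zip(w, x))
        positive_votes += score >= 0  # bool counts as 0 or 1
    frac_pos = positive_votes / n_samples
    return min(frac_pos, 1.0 - frac_pos)


def should_query(probs, x, threshold=0.2):
    """Ask for a label only when the sampled networks disagree enough.

    The threshold value is an arbitrary assumption for illustration.
    """
    return mc_disagreement(probs, x) > threshold
```

With a saturated posterior (every probability at 0 or 1) all samples agree, so the disagreement is zero and no label is ever requested. This is the loss of epistemic uncertainty that the abstract says BiMU is designed to prevent.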

2605.30195 2026-05-29 cond-mat.mtrl-sci cs.AI cs.LG

What drives performance in molecular MPNNs? An operator-level factorial benchmark

Panyu Jiao, Shuizhou Chen, Yiheng Shen, Yuyang Wang, Runhai Ouyang, Wei Xie

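AI summary: The paper decomposes molecular MPNNs into three operator families (message-seed initialization, node-edge fusion, and node update), benchmarks 84 configurations on MoleculeNet datasets, finds that message construction rather than update complexity drives performance, and proposes design heuristics.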

Abstract

Message-passing neural networks (MPNNs) are widely used for molecular property prediction, but their deployment as monolithic architectures makes it difficult to identify how specific message-passing operators affect performance. We present an operator-level factorial benchmark that decomposes 2D molecular MPNNs into the three families of message-seed initialization, node-edge fusion, and node update operators. The resulting 84 configurations are benchmarked on ten MoleculeNet datasets under a shared experimental setup and statistical analysis protocol. Across this controlled design, performance variation is associated primarily with message construction rather than update complexity. Message-seed initialization shows significant family-level effects for both regression and classification, node-edge fusion shows a significant family-level effect for regression with descriptive advantages for concatenation-based mixing, and the update family shows no statistically supported effect for either endpoint family. A representation probe into the Quinethazone molecule further demonstrates that concatenation-based mixing can better differentiate chemically distinct heteroatoms and withstand oversmoothing than Hadamard gating. Representative configurations selected separately for classification and regression recover competitive performance relative to established molecular graph neural network (GNN) baselines, ranking numerically best on eight of ten benchmark datasets. These empirical results are interpreted through concise mechanistic analyses of representative node-edge fusion and update operators. Our findings provide empirical design heuristics for molecular MPNNs by turning model design from a search over monolithic architectures into a targeted assessment of where and how chemical information enters the message-passing pipeline.
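
Illustrative sketch (not from the paper): a toy contrast between the two node-edge fusion styles named in the abstract. The function names, the feature sizes, and the weight matrix are assumptions for this example. It shows only one simple way elementwise gating can erase a difference between two atoms and is not the paper's mechanistic analysis.

```python
def hadamard_fusion(node, edge):
    """Gate node features elementwise by edge features (equal lengths)."""
    return [n * e for n, e in zip(node, edge)]


def concat_fusion(node, edge, weight):
    """Concatenate node and edge features, then apply a linear map.

    `weight` has one row per output feature and
    len(node) + len(edge) columns.
    """
    joined = list(node) + list(edge)
    return [sum(w * v for w, v in zip(row, joined)) for row in weight]


# Two hypothetical atoms that differ only in their second feature.
atom_a = [1.0, 0.0]
atom_b = [1.0, 5.0]
bond = [1.0, 0.0]  # a zero in the gate hides the second feature

# Linear map that passes the two node features through unchanged.
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]

gated_a = hadamard_fusion(atom_a, bond)
gated_b = hadamard_fusion(atom_b, bond)
mixed_a = concat_fusion(atom_a, bond, weight)
mixed_b = concat_fusion(atom_b, bond, weight)
```

In this toy case the gated messages for the two atoms are identical, while the concatenation-based messages stay distinct.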

2605.30189 2026-05-29 cs.CR cs.AI cs.CL cs.LG

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Travis Lelle

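AI summary: The paper backdoors LoRA adapters through data poisoning, finds that the backdoor generalizes at the token feature level rather than the structural pattern level, and evaluates two detection methods, one based on behavioral statistics and one based on weight statistics.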

Comments
45 pages, 27 tables. Code and evaluation data: https://github.com/Travis-ML/lora-backdoors. Trained adapter weights available on request
Abstract

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
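
Illustrative sketch (not from the paper): one possible reading of "the cross-module standard deviation of dimension-normalized Frobenius norms". The abstract does not spell out the normalization, so this example assumes each module's update matrix is divided by the square root of its element count. The function names and the dict layout are hypothetical.

```python
import math
import statistics


def normalized_frobenius(matrix):
    """Frobenius norm divided by sqrt(rows * cols).

    The normalization is an assumption; the abstract does not define it.
    """
    rows, cols = len(matrix), len(matrix[0])
    total = sum(v * v for row in matrix for v in row)
    return math.sqrt(total) / math.sqrt(rows * cols)


def cross_module_std(modules):
    """Population standard deviation of normalized norms.

    `modules` maps a module name to its update matrix (nested lists).
    """
    norms = [normalized_frobenius(m) for m in modules.values()]
    return statistics.pstdev(norms)
```

A statistic of this kind needs only the adapter weights, which is consistent with the abstract's remark that the weight-level route works without running the model. Any decision threshold would have to be calibrated per base model, as the abstract notes.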

2605.30187 2026-05-29 cs.AI cs.CY

Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance

Julius Gabelmann, Felix Jahn, Kevin Baum, Sophie van Rossum, Emely Wuenscher, Timo P. Gros, Verena Wolf

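AI summary: The paper proposes an AI chatbot with a modular agentic architecture that guides exercise solving stage by stage and incorporates targeted pedagogical advice, making the learning process more controllable, transparent, and overseeable and supporting responsible AI use in education.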

Comments
12 pages, 2 figures (+ 2 in appendix), accepted at AISoLA 2025 (Track: Responsible and Trusted AI: An Interdisciplinary Perspective)
Abstract

The widespread adoption of AI chatbots in education will drastically change learning, making responsible deployment a critical concern. While large language models (LLMs) might have access to sources discussing insights from educational sciences, they are not particularly inclined to adhere to pedagogical concepts, risking negative effects on the learning process, such as a loss of transfer capabilities, critical thinking, or creativity. In this paper, we introduce an agentic AI chatbot architecture assisting students with exercise solving, specifically designed to contribute to more responsible AI use in education. We base our conceptual development on the identification of several desiderata for responsible LLM-based educational systems, argue for the structural shortcomings inherent in monolithic, out-of-the-box solutions, and instead suggest modularizing the agentic architecture. We propose specific modules for different stages of exercise solving, enabling incorporation of targeted pedagogical advice, guiding students through the learning process in a more controllable, transparent, and overseeable manner.

2605.30179 2026-05-29 cs.LG cs.AI

iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis

Yang Song, Yixuan Zhang, Lingfa Meng, Tongyuan Hu, Haizhou Shi, Hao Wang, Samir Bhatt, Hengguan Huang

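AI summary: The paper proposes iLoRA, a Bayesian graph-conditioned LoRA framework that infers a latent interaction graph from the input to generate input-conditioned LoRA updates, learns prediction and latent interaction structure jointly, and outperforms LoRA and Bayesian adaptation baselines on microbiome diagnosis.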

Comments
Accepted at ICML 2026
Abstract

Parameter-efficient adaptation has made LLMs practical for domain prediction, but standard LoRA still relies on a static low-rank update and does not expose the latent interactions that often drive scientific labels. We introduce iLoRA. To our knowledge, it is the first Bayesian graph-conditioned LoRA framework. It infers a latent interaction graph from the input and uses it to generate input-conditioned LoRA updates. As a result, iLoRA learns prediction and latent interaction structure jointly, rather than training a predictor and applying interaction analysis only post hoc. We instantiate this idea for microbiome diagnosis, where disease state can depend on both species-level abundance and microbe-microbe cross-talk, and evaluate it in two complementary settings: interactive QA with human-annotated graphs, which tests latent structure recovery, and multi-cohort IBD diagnosis, which tests biomedical utility. Across both settings, iLoRA improves over strong LoRA and Bayesian adaptation baselines, recovers graphs aligned with human annotations and cohort-level microbiome associations, and provides calibrated uncertainty with moderate graph-branch overhead.

2605.30175 2026-05-29 astro-ph.HE cs.LG stat.ML

A new completely parameter-free clustering algorithm for unsupervised classification of BATSE gamma-ray bursts

Soumita Modak

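AI summary: The paper proposes a completely parameter-free clustering algorithm, applies it to the BATSE gamma-ray burst sample, and finds two groups (short and long bursts), consistent with the merger-collapsar theory.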

Abstract

Cluster analysis is a widely applied machine learning technique for understanding patterns in the population of gamma-ray bursts (GRBs) and thereby exploring their physical sources. At present, the number of clusters corresponding to distinguishable groups is still disputed, despite numerous attempts with state-of-the-art clustering procedures. This crucial unknown parameter has to be evaluated, either directly or indirectly through other tuning parameters, before an appropriate clustering algorithm can produce clusters of GRBs. While most of the applied algorithms arrived at two physically explained groups, mergers and collapsars, dominated by short and long bursts respectively, other statistical approaches departed from this binary partition. However, the physical basis of any additional cluster(s) has not yet been confirmed. We therefore propose a new algorithm, from a different stream of clustering referred to as "completely parameter-free", which classifies GRBs in a manner that has not been tried so far. It indicates two main groups in the BATSE sample, short- and long-duration bursts, compatible with the merger-collapsar theory.

2605.30174 2026-05-29 cs.CV

LiveSVG: Zero-Shot SVG Animation via Video Generation

Matan Levy, Ran Margolin, Bar Cavia, Dvir Samuel, Yael Pritch, Shmuel Peleg, Alex Rav Acha, Ariel Shamir, Dani Lischinski

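AI summary: The paper proposes LiveSVG, which achieves zero-shot SVG animation by fitting vector geometry directly to a target video generated by a video diffusion model, with no skeleton or category prior. A dual-level motion representation and a sphere-packing recolorization strategy handle complex motion and color ambiguity.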

Comments
Project Page: https://levymsn.github.io/LiveSVG
Abstract

We introduce LiveSVG, a zero-shot approach for generating Scalable Vector Graphics (SVG) animations using video diffusion models. Current SVG animation methods struggle with complex motions: LLM-based code synthesis fails to express fine, non-rigid Bézier deformations, while Score Distillation Sampling (SDS) provides noisy gradients and often requires category-specific priors like skeletons. In contrast, LiveSVG fits vector geometry directly to an explicitly generated target video. Given an input SVG image and a motion prompt, we generate a previewable target video using a frozen image-to-video model, then fit the original SVG to this video via differentiable rendering. Our fitting stage is skeleton-free, utilizing a dual-level motion representation that combines per-group homographies for coarse articulation with per-path Bézier control-point offsets for local deformations. To resolve color-induced correspondence ambiguities during pixel-wise fitting, we introduce a novel sphere-packing recolorization strategy. We also present ChallengeSVG, a benchmark of complex, multi-object scenes that exposes the limitations of prior work. Evaluations demonstrate that LiveSVG significantly outperforms existing methods on both AniClipart and ChallengeSVG, establishing direct reference-video fitting as a practical, robust route to prompt-aligned and fully editable vector animation.
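
Illustrative sketch (not from the paper): how a dual-level motion representation of the kind described above could move the control points of one path in one frame. The composition order (group homography first, then per-point offsets) and all names are assumptions for this example. The fitting loop and the differentiable renderer are omitted.

```python
def apply_homography(h, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (xh / w, yh / w)


def deform_path(control_points, group_h, offsets):
    """Coarse group motion followed by local per-point offsets.

    Applying the homography before the offsets is an assumed ordering.
    """
    moved = [apply_homography(group_h, p) for p in control_points]
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(moved, offsets)]


# A pure translation by (2, 3) plus a small local nudge on one point.
translate = [[1.0, 0.0, 2.0],
             [0.0, 1.0, 3.0],
             [0.0, 0.0, 1.0]]
frame_points = deform_path([(0.0, 0.0)], translate, [(0.5, 0.0)])
```

Because the output is still a list of Bézier control points, the animated result stays an editable vector path, in line with the editability the abstract emphasizes.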

2605.30170 2026-05-29 cs.MM cs.CV cs.LG

Unveiling the Visual Counting Bottleneck in Vision-Language Models

Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan

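AI summary: By decomposing visual counting into three cognitive stages, the paper finds that vision-language models fail at the symbolic mapping stage and proposes the fractured magnitude hypothesis: models learn disjoint, modality-specific statistical manifolds that prevent cross-modal grounding.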

Comments
ICML 2026
Abstract

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on the state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.

2605.30168 2026-05-29 cs.CV

OmniCD: A Foundational Framework for Remote Sensing Image Change Detection Guided by Multimodal Semantics

Chenhao Sun

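AI summary: The paper proposes OmniCD, a framework that unifies remote sensing change detection tasks through multimodal semantic guidance (image and text prompts), combines hierarchical scene retrieval with a style disentanglement mechanism, introduces the large-scale RSITCD dataset, and reports state-of-the-art performance on multiple benchmarks.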

Abstract

Change detection (CD) in remote sensing is vital for applications such as urban monitoring and disaster assessment, yet traditional methods struggle with generalization across diverse scenarios. We present OmniCD, a foundational framework that unifies and enhances remote sensing CD through multimodal semantic guidance. OmniCD incorporates image and text prompts -- such as textual descriptions, semantic maps, and geospatial metadata -- into a unified architecture, supporting tasks from binary CD to zero-shot semantic change understanding. The framework integrates a hierarchical scene retrieval module and a change detection module, reinforced by a style disentanglement mechanism for improved cross-domain robustness. We further introduce RSITCD, a large-scale multimodal dataset with 300K+ annotated image-text pairs. Extensive experiments show that OmniCD achieves state-of-the-art performance across benchmarks, demonstrating strong adaptability and setting a solid foundation for general-purpose CD systems in remote sensing.

2605.30167 2026-05-29 stat.ML cs.CV cs.LG stat.AP

Visual Spatial Learning: Single-Field Spatial Interpolation Using Convolutional Neural Networks

Daniel Tinoco, Raquel Menezes, Carlos Baquero, Alexandra Silva

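AI summary: The paper proposes a CNN-based architecture that learns spatial interpolation directly from a single partially observed field, without external data or prior fields, as an alternative to Kriging.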

Comments
53 pages, 10 figures
Abstract

Predicting a complete spatially correlated field from sparse observations is a fundamental challenge in spatial statistics and environmental modelling. Classical interpolation methods such as Kriging rely on Gaussian process assumptions and variography, which can limit their effectiveness in non-stationary settings and require substantial domain expertise. In this work, we leverage an architecture based on convolutional neural networks (CNNs) for spatial interpolation that is trained and applied on a single partially observed field, without access to external data or prior fields. The model is supervised directly on the observed locations and learns to predict values at unobserved points on a user-defined grid. Unlike Kriging, our method does not require explicit covariance modelling or variogram estimation, and it can flexibly capture local spatial patterns in a data-driven manner. This work demonstrates the potential of CNNs for single-instance spatial interpolation under sparse supervision, offering a practical alternative to classical geostatistical methods, and extending the use of CNNs to a new problem domain.
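
Illustrative sketch (not from the paper): supervising "directly on the observed locations" can be expressed as a loss that ignores unobserved grid cells. The function name and the choice of mean squared error are assumptions for this example. The CNN that produces `pred` is omitted.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over grid cells where mask is 1 (observed).

    `pred`, `target`, and `mask` are equally shaped 2D nested lists.
    Values of `target` at unobserved cells are never read into the loss.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask selects no observed cells")
    return total / count
```

Predictions at unobserved cells do not change the loss, so whatever the network outputs there is shaped only by the spatial structure it learns from the observed cells.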

2605.30162 2026-05-29 cs.AI cs.CR cs.LG

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Caleb DeLeeuw

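AI summary: The paper introduces BioRefusalAudit, which combines behavioral tests with analysis of internal sparse autoencoder features to assess how consistently language models refuse in biosecurity settings, and finds structurally fragile refusals, over-refusal, and differences across architectures.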

Comments
21 pages, 2 figures, 3 tables. Apart Research AIxBio Sprint hackathon paper, April 2026 (Track 3: AI Biosecurity Tools). Code, eval set, and SAEs: github.com/SolshineCode/Deleeuw-AI-x-Bio-hackathon. Reviewer feedback: apartresearch.com/project/biorefusalaudit-auditing-biosecurity-refusal-depth-using-general-and-domainfinetuned-sparse-autoencoders-1fyk
Abstract

Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it. Both Gemma models collapsed to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B showed the only meaningful tier gradient (61-point spread). To probe what drives such over-refusal, we tested a panel of Schedule I but biologically non-toxic compounds (notably psilocybin cultivation, with FDA Breakthrough Therapy status). Some models refused these at rates exceeding genuinely hazardous biology, suggesting refusal tracks legality and cultural salience over CBRN hazard. To measure the internal side, we introduce a divergence score D comparing a model's surface response label to its internal sparse autoencoder (SAE) feature activations. Full D was computed on Gemma 2 2B-IT (Gemma Scope 1) and Gemma 4 E2B-IT (author-trained bio SAE). Two fine-tuned Gemma 2 domain SAEs were released. On Gemma 4, comply and refuse responses separated by a 0.647-point gap with zero overlap (n=75), though this is preliminary, with a narrow catalog, within-sample calibration, and Gemma-family-only SAE coverage. Built over one hackathon weekend on consumer hardware (GTX 1650 Ti Max-Q, plus Colab T4 for SAE training), this preliminary evidence suggests activation-level auditing may surface failure modes invisible to behavioral evaluation, with substantial variation across architectures.

2605.30161 2026-05-29 cs.CV

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park

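AI summary: Using minimal contrastive pairs, the paper finds that vision-language models exhibit vertical-distance entanglement (conflating vertical image position with distance). This perspective bias produces an accuracy gap and intensifies under data scaling, while models with well-separated spatial axes are more robust.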

Abstract

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.

2605.30160 2026-05-29 cs.LG cs.AI

On Distributional Reinforcement Learning in Chaotic Dynamical Systems

James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz

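AI summary: For the high-variance targets and poorly conditioned gradients that reinforcement learning faces in chaotic dynamical systems, the paper argues that distributional RL yields better-conditioned optimization through a distributional Bellman objective measured under the 1-Wasserstein metric.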

Abstract

Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamics arise across scientific and engineering domains, from fluid flows and climate systems to multi-agent systems, where reliable learning is highly desirable. Standard RL methods optimise expected returns through scalar value functions, implicitly averaging over diverging trajectories and entangling trajectory-level instability with the learning objective. We show that under mild statistical stability assumptions, the return distribution evolves more regularly than individual trajectories when measured under the 1-Wasserstein metric, yielding a smoother distributional Bellman objective. By aligning optimisation with this measure-level structure, distributional RL provides better-conditioned learning. We offer a principled explanation for the advantages of distributional methods in chaotic systems and the geometries of RL objectives under chaos.
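
Illustrative sketch (not from the paper): the 1-Wasserstein distance between two empirical return distributions. For one-dimensional samples of equal size with uniform weights, it equals the mean absolute difference between sorted values. The function name and the sample values are assumptions for this example.

```python
def wasserstein_1(xs, ys):
    """1-Wasserstein distance between two equal-size 1D empirical samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical returns from two rollouts of the same two start states.
# Each paired trajectory changes by 10, yet the set of returns is the same.
returns_before = [0.0, 10.0]
returns_after = [10.0, 0.0]

pairwise_gap = sum(
    abs(a - b) for a, b in zip(returns_before, returns_after)
) / len(returns_before)
distribution_gap = wasserstein_1(returns_before, returns_after)
```

In this toy case the per-trajectory gap is 10 while the distributional gap is 0, which illustrates how a return distribution can remain stable even when individual trajectories diverge.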

2605.30159 2026-05-29 cs.AI

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou, Ruiyi Ding, Yongkang Yang, Wence Ji, Wei Xia, Feng Liu

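AI summary: To address the lack of intermediate supervision when training memory policies for long-horizon tasks, the paper proposes Metacognitive Memory Policy Optimization (MMPO), which uses Belief Entropy, a self-supervised proxy, to penalize summaries that induce high epistemic uncertainty and improves long-horizon reasoning.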

Abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
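
Illustrative sketch (not from the paper): an entropy-based penalty on a memory summary, assuming the belief over latent task states is available as a discrete probability distribution. The abstract does not say how Belief Entropy is computed or how it enters the objective, so the Shannon entropy, the linear penalty, and the weight below are all hypothetical.

```python
import math


def belief_entropy(probs):
    """Shannon entropy in nats of a discrete belief over task states."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shaped_reward(outcome_reward, probs, penalty_weight=0.1):
    """Outcome reward minus an entropy penalty.

    The linear form and the default weight are assumptions.
    """
    return outcome_reward - penalty_weight * belief_entropy(probs)
```

A summary that leaves the belief concentrated on one state incurs no penalty, while one that leaves it spread evenly over several states is penalized. This gives a per-summary signal in addition to the sparse trajectory-level outcome.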

2605.30154 2026-05-29 cs.LG

RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

RL2ML: 从强化学习到最大似然的有限rollout替代目标

Yifu Zheng

AI总结 本文提出RL2ML系列有限rollout替代目标,具有闭式无偏梯度估计,连接标准强化学习、类最大似然训练及超越最大似然目标,并揭示群体级更新尺度相变,将剩余自由度转化为一维优化问题。

详情
AI中文摘要

基于正确性的可验证奖励强化学习(RLVR)通过采样输出的二元反馈训练语言模型,但期望意义下被优化的目标与有限rollout组所诱导的随机更新几何常被混为一谈。本文开发了RL2ML,一系列具有闭式、精确无偏梯度估计器的有限rollout替代目标。该系列在固定rollout预算下连续地连接标准强化学习、类最大似然训练及超越最大似然的目标,同时保持估计器与目标的对齐。我们引入群体级更新尺度来刻画rollout组在观察到其经验成功计数后如何被重新加权,揭示了仅凭总体级目标记号无法看出的亚临界-超临界更新尺度相变。基于这一区分,校准的度量增益分析和精确方差分解表明,最佳替代目标的选择既不由与最大似然的接近程度决定,也不单由总体级权重决定,而是共同取决于评估度量、局部敏感性和估计器方差。因此,替代目标系列中剩余的自由度可以表述为一维优化问题,而非被当作无约束的超参数。

英文摘要

Correctness-based Reinforcement Learning with Verifiable Rewards (RLVR) trains language models from binary feedback on sampled outputs, but the objective optimized in expectation and the stochastic update geometry induced by finite rollout groups are often conflated. This paper develops RL2ML, a family of finite-rollout surrogate objectives with a closed-form, exactly unbiased gradient estimator. The family continuously connects standard reinforcement learning, maximum-likelihood-like training, and beyond-maximum-likelihood objectives while preserving estimator-objective alignment under a fixed rollout budget. We introduce the group-level update scale to characterize how a rollout group is reweighted after its empirical success count is observed, revealing a subcritical-supercritical update-scale transition that is hidden by population-level objective notation alone. Building on this distinction, calibrated metric-gain analysis and exact variance decomposition show that the best choice of surrogate objective is determined neither by proximity to maximum likelihood nor by the population-level weight alone. Instead, it depends jointly on the evaluation metric, local sensitivity, and estimator variance. The remaining degree of freedom in the surrogate objective family can therefore be formulated as a one-dimensional optimization problem rather than treated as an unconstrained hyperparameter.

2605.30153 2026-05-29 stat.ML cs.IT cs.LG math.IT math.ST stat.TH

Diffusion Models Are Statistically Optimal for Learning Low-Dimensional Multi-Modal Distributions

扩散模型在学习低维多模态分布时具有统计最优性

Jingda Wu, Changxiao Cai

AI总结 本文证明扩散模型在学习支撑在低维子空间并集上的分布时,样本复杂度仅依赖于内在维度,达到近最优的1-Wasserstein误差率,无需光滑性或有界密度假设。

详情
Comments
accepted to ICML 2026
AI中文摘要

基于分数的扩散模型在学习高维分布,特别是那些具有低维和多模态结构的分布方面,已经展现出显著的实证成功。然而,对其统计效率的理论理解仍然有限。现有理论通常依赖于强正则性假设,例如一致有界密度或全局光滑的分数函数,这些假设无法捕捉此类内在结构。在这项工作中,我们研究了扩散模型在学习支撑在低维子空间并集上的分布时的样本复杂度。假设每个子空间内的数据分布是次高斯的,我们证明扩散模型最多需要$\widetilde{O}(\varepsilon^{-k \vee 2})$个样本即可在1-Wasserstein距离上达到$\varepsilon$误差,其中$k$是内在维度。这一近最优的收敛速率仅依赖于内在维度,并显著改进了先前遭受维度灾难的理论保证。值得注意的是,我们的分析适用于广泛的分布,无需施加光滑性、有界密度或对数凹性假设。总体而言,我们的结果表明,扩散模型能够统计适应内在低维结构,同时自然容纳多模态数据,为其在复杂高维学习任务中的成功提供了严格的理论依据。

英文摘要

Score-based diffusion models have demonstrated remarkable empirical success in learning high-dimensional distributions, particularly those exhibiting low-dimensional and multi-modal structures. However, theoretical understanding of their statistical efficiency remains limited. Existing theories typically rely on strong regularity assumptions, such as uniformly bounded densities or globally smooth score functions, which fail to capture such intrinsic structures. In this work, we study the sample complexity of diffusion models for learning distributions supported on a union of low-dimensional subspaces. Assuming that the data distribution within each subspace is subgaussian, we show that diffusion models require at most $\widetilde{O}(\varepsilon^{-k \vee 2})$ samples to achieve $\varepsilon$ error in 1-Wasserstein distance, where $k$ is the intrinsic dimension. This near-optimal convergence rate depends only on the intrinsic dimension and significantly improves upon prior theoretical guarantees that suffer from the curse of dimensionality. Notably, our analysis applies to a broad collection of distributions without imposing smoothness, bounded-density, or log-concavity assumptions. Overall, our results show that diffusion models can statistically adapt to intrinsic low-dimensional structure while naturally accommodating multi-modal data, offering a rigorous theoretical justification for their success in complex high-dimensional learning tasks.
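To make the stated rate concrete, the worked instance below reads $k \vee 2$ as $\max(k, 2)$, the standard meaning of the notation, and plugs in illustrative values. The chosen $\varepsilon$ and $k$ are assumptions for illustration, not figures from the paper, and the logarithmic factors hidden by $\widetilde{O}$ are ignored.

```latex
\[
  n(\varepsilon) = \widetilde{O}\!\left(\varepsilon^{-(k \vee 2)}\right),
  \qquad k \vee 2 = \max(k, 2).
\]
\[
  \varepsilon = 0.1:\quad
  k = 1 \;\Rightarrow\; \varepsilon^{-2} = 10^{2},\qquad
  k = 2 \;\Rightarrow\; \varepsilon^{-2} = 10^{2},\qquad
  k = 5 \;\Rightarrow\; \varepsilon^{-5} = 10^{5}.
\]
```

In each case the exponent involves only the intrinsic dimension $k$ and never the ambient dimension, which is the sense in which the abstract says the rate avoids the curse of dimensionality.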

2605.30152 2026-05-29 cs.CL cs.AI cs.HC

Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?

主动型智能体真的需要LLM来决定何时唤醒和锚定什么吗?

Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen, Siheng Xiong, Xiaoqian Wang, Jing Gao

AI总结 提出用时间图学习(TGL)模型替代LLM作为主动智能体的触发器,通过图更新而非文本处理用户活动,实现高效、低延迟的触发决策。

详情
Comments
31 pages, 5 figures, 7 tables
AI中文摘要

主动型智能体将用户活动读取为文本,并在每个事件上调用LLM来决定是否行动。但用户活动本质上不是文本:它是由(actor, verb, object, timestamp)元组构成的结构化事件流,操作系统已经以图的形式对其进行维护。将结构渲染为文本再要求LLM恢复它,是系统本不必进行的往返。我们将始终在线的信号视为图更新而非文本,并使用小型时间图学习(TGL)模型作为编码器:一次前向传播即可产生每个事件的触发概率和每个实体的路由分数,只有下游智能体(将小型结构化交接转化为流畅的面向用户的句子)是LLM调用,且仅在触发器触发时调用。TGL在14个骨干模型上均提升了F1(平均+16.7,最高+46.0);在触发架构比较中,单个TGL检查点给出了最强的触发AUC和最稳定的部署阈值。它在GPU服务器上每个事件耗时11.13毫秒,在消费级笔记本电脑上耗时13.99毫秒,分别比两种环境下测试的所有单次前向LLM触发配置快约4-7倍和12-83倍;其BF16驻留内存占用约220 MiB,可与其所消费的隐私敏感活动流一同部署在设备上。

英文摘要

Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4--7x and 12--83x faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes.

2605.30151 2026-05-29 cs.AI

Temporal Stability and Few-Shot Prompting in Math Task Assessment

数学任务评估中的时间稳定性和少样本提示

Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn

AI总结 本研究通过纵向实验评估AI工具在数学任务认知需求分类中的时间稳定性和少样本提示效果,发现提示工程比模型版本更新更能提升性能。

详情
Comments
23 pages, 1 figure
AI中文摘要

随着AI工具日益融入教育环境,其随时间稳定性以及对提示工程技术的响应性成为问题。本纵向研究聚焦于不同AI工具使用任务分析指南(TAG; Stein & Smith, 1998)对数学任务认知需求进行分类的能力。具体而言,考察了这种分类能力是否随(1)模型版本更新和(2)使用示例任务的少样本提示而改变。我们测试了一个通用AI工具(Gemini)和一个教育专用AI工具(Coteach)。选择这些特定工具是因为它们在相关公开基准和先前任务特定测试中表现相对较高。模型在基线时进行测试,在模型版本更新后重新测试,然后再次使用少样本提示(每个认知需求类别两个示例任务)进行测试。结果显示,仅更新模型版本产生了混合效应:Gemini的准确率稳定在58%,而Coteach的准确率从75%下降到50%。然而,少样本提示提高了两个模型的性能:Gemini提高到67%,Coteach恢复到75%的准确率。这些发现表明,提示工程技术可以产生比被动模型改进更大且更可靠的效果,并且版本更新并不总是能提高在专门教育任务上的性能。该研究对教育工作者和研究人员在教育环境中如何选择、评估和实施AI工具具有重要意义。

英文摘要

As AI tools become increasingly integrated into educational contexts, questions arise about both their stability over time and their responsiveness to prompt engineering techniques. This longitudinal study focused on different AI tools' ability to use the Task Analysis Guide (TAG; Stein & Smith, 1998) to classify the cognitive demand of mathematics tasks. In particular, it examined whether this classification ability changed with (1) model version updates over time and (2) few-shot prompting using exemplar tasks. We tested a general-purpose AI tool (Gemini) and an education-specific AI tool (Coteach). The specific tools were selected because of their relatively high performance on relevant published benchmarks and prior task-specific tests. Models were tested at baseline, retested with model version updates, and then tested again using few-shot prompting (two exemplar tasks for each cognitive demand category). Results revealed that newer model versions alone produced mixed effects: Gemini's accuracy remained stable at 58%, while Coteach's accuracy decreased from 75% to 50%. However, few-shot prompting improved both models' performance: Gemini increased to 67% and Coteach recovered to 75% accuracy. These findings demonstrate that prompt engineering techniques can have larger and more reliable effects than passive model improvements, and that version updates may not always improve performance on specialized educational tasks. The study has important implications for how educators and researchers should approach AI tool selection, evaluation, and implementation in educational contexts.

2605.30150 2026-05-29 cs.AI

Anchorless Diversification for Parallel LLM Ideation

无锚点多样化并行LLM创意生成

Fares Nabil Ibrahim, Nafis Saami Azad, Raiyan Abdul Baten

AI总结 研究无锚点方法(如语义方向分层)在并行LLM创意生成中实现候选池多样化,无需依赖种子想法,在多样性、质量和计算效率上优于有锚点基线。

详情
AI中文摘要

大型语言模型越来越多地用于生成创意任务的候选想法池,其中广泛探索是有价值的。当并行推理能在拓宽候选池的同时保持质量和成本效率时,它在此场景下颇具吸引力。我们研究推理时控制以实现候选池多样化,探究无锚点方法是否能与依赖观察到的种子想法的方法相抗衡。在三个创意任务族中,我们在中性和群体参照发散指令下,比较了独立生成和语义方向分层与自我、同伴和代表性锚点基线。群体参照发散是一个强大的低成本基线,在保持质量代理的同时增加了语义多样性。语义方向分层更强:一次规划调用即可组织跨广泛语义方向的生成,产生最佳的多样性-质量-计算前沿。锚点再生在最终池多样性上可能很强,但其优势在完整流水线令牌核算下缩小。这些结果为开放式LLM创意生成建立了实用的无锚点基线。

英文摘要

LLMs are increasingly used to generate candidate-idea pools for creative tasks where broad exploration is valuable. Parallel inference can be attractive in this setting when it broadens the pool while retaining quality and cost efficiency. We study inference-time controls for candidate-pool diversification, asking whether anchorless methods can rival methods that depend on observed seed ideas. Across three creative task families, we compare independent generation and semantic direction stratification with self-, peer-, and representative-anchor baselines, under neutral and population-referential divergent instructions. Population-referential divergence is a strong low-cost baseline, increasing semantic diversity while preserving quality proxies. Semantic direction stratification is stronger: a single planning call organizes generations across broad semantic directions, yielding the best diversity--quality--compute frontier. Anchored regeneration can be strong in final-pool diversity, but its advantage shrinks under full-pipeline token accounting. These results establish practical anchorless baselines for open-ended LLM ideation.

2605.30148 2026-05-29 cs.LG cs.AI

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

克服LLM微调中的遗忘:进化策略方法

Kajetan Schweighofer, Conor F. Hayes, Roberto Dailey, Risto Miikkulainen, Xin Qiu

AI总结 本文发现进化策略微调中的先前任务遗忘实为性能漂移且可恢复,并引入锚定权重衰减(AWD)正则化技术有效稳定先前任务性能,表明遗忘可避免,使ES成为LLM持续学习的可行方法。

详情
AI中文摘要

进化策略(ES)最近作为强化学习(RL)在大语言模型(LLM)微调中的竞争性替代方案出现,通过简单性、可扩展性和仅推理训练提供优势。然而,近期研究表明,在新任务上进行ES微调可能导致对先前任务的遗忘。首先,本文表明先前任务遗忘(1)更好地被描述为性能漂移而非不可逆遗忘,在ES训练过程中先前任务性能通常会恢复;(2)并非ES特有的失败模式,使用RL方法微调时也可能出现。其次,本文分析了这种漂移何时以及为何出现,强调了其对ES训练动态的依赖性,特别是权重空间中弱约束方向上的随机游走行为。第三,基于这些见解,本文引入了锚定权重衰减(AWD)作为一种参数空间正则化技术,将优化约束向初始模型参数。AWD在保持目标任务性能的同时有效稳定了先前任务性能,以更低的计算成本实现了与大型ES种群规模相当的优势。因此,与先前观点相反,本文表明ES下的先前任务遗忘在很大程度上是可以避免的,使ES成为LLM持续学习中一种有前景的方法。

英文摘要

Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) fine-tuning, offering advantages through simplicity, scalability, and inference-only training. However, recent work suggests that ES fine-tuning on new tasks may induce forgetting of prior tasks. First, this paper shows that prior task forgetting (1) is better characterized as performance drift rather than irreversible forgetting, with prior-task performance often recovering during ES training; and (2) is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods. Second, it analyzes when and why such drift arises, highlighting its dependence on ES training dynamics, particularly random walk behavior in weakly constrained directions of the weight space. Third, based on these insights, it introduces Anchored Weight Decay (AWD) as a parameter-space regularization technique that constrains optimization toward the initial model parameters. AWD effectively stabilizes prior-task performance while preserving target-task performance, achieving benefits comparable to large ES population sizes at much lower computational cost. Thus, contrary to previous beliefs, the paper shows that prior-task forgetting under ES is largely avoidable, positioning ES as a promising approach for continual learning in LLMs.
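The abstract describes AWD only as a regularizer that constrains optimization toward the initial parameters. A minimal sketch of that idea follows, assuming a simple multiplicative pull toward a stored copy of the initial weights. The function name, the `decay` coefficient, and the exact update form are hypothetical, not taken from the paper.

```python
def anchored_weight_decay(params, anchor, decay):
    """Pull each parameter a fraction `decay` of the way back to its anchor."""
    return [p - decay * (p - a) for p, a in zip(params, anchor)]

anchor = [0.5, -1.0, 2.0]   # initial (pre-fine-tuning) parameters
params = [0.9, -1.6, 2.0]   # parameters after some drift during fine-tuning

# Applied after an optimizer step, this shrinks the drift from the anchor.
params = anchored_weight_decay(params, anchor, decay=0.5)
print(params)
```

The contrast with standard weight decay is the target of the pull: standard decay shrinks parameters toward zero, whereas this variant shrinks their distance from the starting model, so a parameter that has not moved (the third one here) is left unchanged.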

2605.30144 2026-05-29 cs.AI cs.MA

AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

AgentSchool:基于LLM的多智能体教育模拟系统

Yulei Ye, Wenhao Li, Zhong Wen, Yunshu Huang, Yichen Hu, Zifan Wei, Yige Wang, Xinyu Xie, Haoxuan Yang, Yanjun Huang, Ruijia Li, Hong Qian, Yu Song, Bo Jiang, Bingdong Li, Lijun Li, Bo Zhang, Pinlong Cai, Xingcheng Xu, Shuangye Chen, Xia Hu, Liang He, Aimin Zhou, Jingjing Qu, Jing Shao, Xiangfeng Wang

AI总结 提出AgentSchool,一种LLM驱动的多智能体模拟器,通过可成长的学生智能体(带知识图谱、思维工作流和错误概念)与自适应教师智能体(基于最近发展区)模拟学习过程,支持多尺度模拟,实验验证了其生成差异化掌握轨迹和符合课堂社会理论的行为模式。

详情
Comments
39 pages, 10 figures
AI中文摘要

尽管LLM已迅速部署到课堂中,验证教育AI仍然具有独特的棘手性:干预措施作用于发展中的学习者,其认知和社会轨迹被不可逆地塑造,而现实世界试验缓慢、受伦理约束且受制度限制。基于LLM的教育模拟器已成为潜在的补救措施,但许多模拟器仍将学习简化为角色扮演,并且当仅优化以再现现有课堂时,可能会结构性惩罚教学改革所需的制度创新。在这项工作中,我们介绍了AgentSchool,一种LLM驱动的多智能体模拟器,将学习建模为状态转换而非提示行为。AgentSchool将可成长的学生智能体(配备加权学科知识图谱、思维工作流池和显式错误概念)与自适应教师智能体(在最近发展区内规划、搭建支架和反思)相结合,嵌入可配置的场景生成器(将教学置于正式和非正式学习领域)和多尺度模拟器(解耦交互规模、时间粒度和模拟持续时间)。实验表明,结构化学生智能体比基线模拟器产生更差异化的掌握和错误概念轨迹,而教师智能体比较显示出与基于ZPD的适应一致的骨干依赖模式。此外,AgentSchool生成与课堂社会理论一致的外围参与、小团体形成、攻击者诱导的凝聚力和意见领袖出现的合理轨迹。除了作为教育研究工具的作用外,AgentSchool还将教育构建为在组织压力下进行长时记忆、多智能体协调和未来制度推理的社会意义测试平台。

英文摘要

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.

2605.30140 2026-05-29 cs.CV

AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection

AnomalyAgent: 用于零样本/少样本异常检测的无训练智能体模型

Yi Zhang, Jiawen Zhu, Lele Fu, Guansong Pang

AI总结 提出一种基于多模态大语言模型的无训练智能体框架AnomalyAgent,通过定制工具集和记忆模块实现零样本/少样本异常检测,在逻辑/上下文异常等复杂场景中优于现有方法。

详情
AI中文摘要

受益于视觉语言模型(如CLIP)的泛化能力,许多零样本/少样本异常检测方法已在各种数据集上取得了令人印象深刻的检测性能。然而,它们需要在大规模辅助数据集上进行大量训练以适应异常检测,并且其推理主要依赖于基于视觉-文本嵌入相似度的异常分数,缺乏检测需要深度上下文理解的复杂异常的推理能力。为了解决这一局限性,我们提出了AnomalyAgent,一种新颖的无训练智能体框架,利用多模态大语言模型的先进推理和泛化能力进行异常检测。关键要素包括:1)一个全面的以异常为中心的工具集,能够在零样本设置下实现自适应MLLM驱动的智能体异常推理;2)一个定制的记忆模块,通过少样本上下文参考示例来支撑异常推理。我们将评估从广泛使用的基准测试中检测简单异常(例如,裂纹和凹痕等表面缺陷以及明显病变)扩展到更多样化的异常类型,例如物流和制造环境中的逻辑/上下文异常。大量实验结果表明,我们的AnomalyAgent与无训练的基于VLM的异常检测和通用智能体方法相比,实现了显著更好的性能,突显了其在零样本和少样本异常检测设置中的优越泛化能力。代码实现可在此地址找到。

英文摘要

Benefiting from the generalizability of vision-language models (VLMs) such as CLIP, many zero-/few-shot anomaly detection (AD) approaches have achieved impressive detection performance across various datasets. Nevertheless, they require substantial training on large auxiliary datasets to adapt VLMs to anomaly detection, and their inference largely relies on visual-text embedding similarity-based anomaly scores, lacking reasoning abilities to detect complex anomalies that require in-depth contextual understanding. To address this limitation, we propose AnomalyAgent, a novel training-free, agentic framework that leverages the advanced reasoning and generalization capabilities of multimodal large language models (MLLMs) for anomaly detection. The key ingredients include 1) a comprehensive anomaly-centric toolset that enables adaptive MLLM-driven, agentic anomaly reasoning in zero-shot settings, and 2) a customized memory module that grounds anomaly reasoning with few-shot, in-context reference examples. We extend evaluation beyond the detection of simple anomalies (e.g., surface defects like cracks and dents and clear lesions) in widely used benchmarks to more diverse types of anomalies such as logical/contextual anomalies in logistics and manufacturing settings. Extensive experimental results demonstrate that our AnomalyAgent achieves substantially better performance compared to training-free VLM-based AD and generic agentic methods, highlighting its superior generalization capability in both zero-shot and few-shot anomaly detection settings. The code implementation can be found at this address.

2605.30136 2026-05-29 cs.AI

Enhancing Multi-Agent Communication through Attention Steering with Context Relevance

通过上下文相关性的注意力引导增强多智能体通信

Hongxiang Zhang, Yuan Tian, Tianyi Zhang

AI总结 针对LLM多智能体系统中长对话历史导致信息稀释的问题,提出无训练的上下文管理方法Agent-Radar,利用时空衰减机制动态引导注意力,在五个基准上取得最高7.64个绝对点的提升。

详情
AI中文摘要

基于LLM的多智能体系统通过协作推理在复杂任务上表现出色。然而,这些系统在交互过程中会迅速积累极长的对话历史。随着对话变长,相关信息被无关上下文稀释,导致性能下降。在这项工作中,我们提出了Agent-Radar,一种无需训练的上下文管理方法,通过新颖的时空衰减机制动态引导每个智能体的注意力到相关上下文。实验表明,Agent-Radar在五个不同基准上优于最先进的方法,最高提升7.64个绝对点。此外,分析显示Agent-Radar在智能体数量和交互轮次增加时仍然有效且鲁棒。最后,消融研究表明Agent-Radar的核心组件对性能至关重要,且在不同设置下具有泛化性。

英文摘要

LLM-based multi-agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these systems tend to rapidly accumulate extremely long conversation histories during interaction. As conversations lengthen, relevant information is increasingly diluted by irrelevant context, leading to degraded performance. In this work, we present Agent-Radar, a training-free context management method that dynamically steers each agent's attention toward relevant context with a novel temporal and spatial decay mechanism. Our experiments demonstrate that Agent-Radar outperforms state-of-the-art methods across five different benchmarks, yielding gains of up to 7.64 absolute points. Furthermore, our analysis shows that Agent-Radar remains effective and robust as the number of agents and interaction rounds increases. Finally, the ablation study shows that core components in Agent-Radar are crucial to performance and generalizable in different settings.

2605.30135 2026-05-29 cs.LG cs.AI

DAMEL: Dual-Axis Multi-Expert Learning for Class-Imbalanced Learning

DAMEL: 双轴多专家学习用于类别不平衡学习

Hyuck Lee, Taemin Park, Heeyoung Kim

AI总结 提出双轴多专家学习算法DAMEL,通过表示轴和时间轴上的多专家集成,同时降低预测偏差和方差,有效解决类别不平衡学习问题。

详情
AI中文摘要

针对来自具有长尾分布的真实世界数据的类别不平衡学习所带来的挑战,已有多种算法被提出。这些算法通过重平衡技术减少了预测偏差,但通常以增加预测方差为代价。一些多专家学习算法旨在解决这一方差问题,但涉及复杂的过程。我们提出了一种新的多专家学习算法,称为双轴多专家学习(DAMEL),该算法通过沿表示轴和时间轴使用多个专家来同时降低预测的偏差和方差。沿表示轴,DAMEL拼接多个专家的表示,并同时使用拼接后的表示训练一个辅助的平衡分类器。沿时间轴,DAMEL聚合跨训练时期的网络权重,并在测试时使用这些聚合权重。实验结果表明,DAMEL同时降低了预测的偏差和方差,突显了其在类别不平衡学习中的有效性。

英文摘要

Various algorithms have been proposed to address the challenges posed by class-imbalanced learning from real-world data with long-tailed distributions. While these algorithms reduce prediction bias through rebalancing techniques, they often introduce increased prediction variance as a trade-off. Several multi-expert learning algorithms aim to address this variance but involve complex procedures. We propose a new multi-expert learning algorithm, called the dual-axis multi-expert learning (DAMEL), which reduces both bias and variance of predictions by using multiple experts along both representation and time axes. Along the representation axis, DAMEL concatenates the representations of multiple experts and trains an auxiliary balanced classifier simultaneously with the concatenated representations. Along the time axis, DAMEL aggregates network weights across training epochs, employing these aggregated weights during testing. Experimental results demonstrate that DAMEL reduces both bias and variance of predictions, highlighting its effectiveness in class-imbalanced learning.
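The time-axis aggregation can be sketched as follows. The abstract does not say how the weights are aggregated, so this sketch assumes a uniform running average over per-epoch snapshots. The function name and the toy weight vectors are hypothetical.

```python
def update_running_average(avg_weights, new_weights, num_seen):
    """Fold one more epoch's weights into a uniform running average.

    `num_seen` is the number of snapshots already included in `avg_weights`.
    """
    return [a + (w - a) / (num_seen + 1) for a, w in zip(avg_weights, new_weights)]

epoch_weights = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]  # toy per-epoch snapshots

averaged = epoch_weights[0]
for seen, weights in enumerate(epoch_weights[1:], start=1):
    averaged = update_running_average(averaged, weights, seen)

# `averaged` now holds the element-wise mean of all snapshots; in this
# sketch it is the weight vector that would be used at test time.
print(averaged)
```

The incremental form avoids storing every snapshot: only the current average and a counter are kept between epochs.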

2605.30133 2026-05-29 cs.CL

CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution

CorPipe at CRAC 2026: 多语言共指消解中的空节点与跨语言迁移

Milan Straka

AI总结 本文提出CorPipe 26系统,通过单一模型联合预测空节点、提及和共指链接,在CRAC 2026多语言共指消解共享任务中超越所有其他系统,并在LLM赛道和不受限赛道分别领先2.8和9.5个百分点。

详情
Comments
Accepted to CODI-CRAC 2026
AI中文摘要

我们介绍CorPipe 26,这是我们在CRAC 2026多语言共指消解共享任务中的获胜提交。该共享任务的第五版主要关注生成式LLM与专用系统的比较;此外,还新增了5个数据集和2种新语言。CorPipe 26是CorPipe 25的改进版本,具有一种新变体,可在单个模型中同时预测空节点、提及和共指链接。我们的系统在LLM赛道中优于所有其他提交2.8个百分点,在不受限赛道中优于所有提交9.5个百分点。此外,我们进行了一系列消融实验,涉及不同模型大小、空节点预测方法以及跨语言零样本评估。源代码和训练好的模型可在https://github.com/ufal/crac2026-corpipe公开获取。

英文摘要

We introduce CorPipe 26, our winning submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. The fifth edition of this shared task focuses mainly on the comparison of generative LLMs and specialized systems; additionally, 5 more datasets and 2 new languages are introduced. CorPipe 26 is an improved version of CorPipe 25, with a new variant predicting empty nodes together with mentions and coreference links in a single model. Our system outperforms all other submissions in the LLM track by 2.8 percentage points and all submissions in the unconstrained track by 9.5 percentage points. Furthermore, we perform a series of ablation experiments with different model sizes, empty node prediction methods, and cross-lingual zero-shot evaluation. The source code and the trained models are publicly available at https://github.com/ufal/crac2026-corpipe.

2605.30132 2026-05-29 cs.LG stat.ML

Learning to Extrapolate to New Tasks: A Relational Approach to Task Extrapolation

学习外推到新任务:一种关系型任务外推方法

Adam Ousherovitch, Yixin Wang

AI总结 提出关系型任务外推器(RTE),通过将目标任务分解为锚定任务和变换关系并学习关系算子,实现向未见任务的系统性外推,在函数预测和序列预测中显著优于现有方法。

详情
Comments
ICML 2026
AI中文摘要

现代学习系统擅长内插,但难以泛化到训练分布支持范围之外的未见任务。即使在简单设置中(如处理超出训练范围的任务参数),这种失败也会发生,并且尽管基础模型取得了进展,问题依然存在。为此,我们开发了关系型任务外推器(RTE),一种旨在实现向新任务系统性外推的算法。关键观察是外推本质上是关系型的:外推到未见任务需要学习任务如何相互转换。如果模型在训练期间学习了任务A和B之间的变换,它可以在测试时应用相同的变换来关联已知任务和未见任务。RTE通过将每个目标任务分解为一个已知的锚定任务和一个连接锚定与目标的变换来实现这一思想。然后它学习一个关系算子,将锚定-变换对映射到目标任务的预测。我们在函数预测的多个任务外推场景中实例化RTE,例如目标任务使用超出范围的参数(参数外推)、具有更大的组合深度(长度外推)和/或以未见方式重新组合函数原语(组合外推)。我们进一步将RTE扩展到序列预测,将其集成到基础模型的微调算法中。在实证研究中,我们发现RTE在向新颖、未见任务的外推上显著优于现有方法。

英文摘要

Modern learning systems excel at interpolation but struggle to generalize to unseen tasks outside the training distribution's support. This failure occurs even in simple settings, such as handling task parameters beyond the training range, and persists despite advances in foundation models. To this end, we develop the Relational Task Extrapolator (RTE), an algorithm designed to enable systematic extrapolation to novel tasks. The key observation is that extrapolation is inherently relational: extrapolating to unseen tasks requires learning how tasks transform into one another. If a model learns the transformation between tasks A and B during training, it can apply that same transformation to relate known tasks to unseen ones at test time. RTE operationalizes this idea by decomposing each target task into a known anchor task and a transformation linking the anchor and target. It then learns a relational operator, mapping an anchor-transformation pair to predictions for the target task. We instantiate RTE across multiple task extrapolation regimes in function prediction, e.g. where target tasks use out-of-range parameters (parameter extrapolation), have greater compositional depth (length extrapolation), and/or recombine function primitives in unseen ways (compositional extrapolation). We further extend RTE to sequence prediction, integrating it into fine-tuning algorithms for foundation models. Across empirical studies, we find that RTE substantially outperforms existing approaches on extrapolation to novel, unseen tasks.

2605.30131 2026-05-29 cs.CL cs.CV

CCS: Clinical Consensus Selection for Radiology Report Generation

CCS:放射学报告生成的临床共识选择

Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

AI总结 提出CCS框架,通过采样多个候选报告并选择临床共识最高的一个,以改进放射学报告生成在推理时的质量。

详情
Comments
17 pages, 6 figures
AI中文摘要

放射学报告生成(RRG)通常被表述为单路径生成任务,其中多模态大语言模型(MLLM)产生一个解码报告作为最终输出。虽然最近的进展主要通过扩展训练数据、模型容量和检索机制来推动,但在推理时提高报告质量仍未被充分探索。在这项工作中,我们观察到固定的放射学MLLM在其候选池中通常生成比默认解码选择的报告临床更强的报告,这表明推理时的决策仍然是一个被忽视的瓶颈。为了解决这个问题,我们提出了临床共识选择(CCS),一个解码器无关的推理时选择框架,它采样多个候选报告,并选择在展开池中具有最高临床共识的报告。CCS将基于文本的效用与由图像-报告训练的多模态嵌入器计算的放射学适应效用统一起来,该嵌入器测量超越表面文本相似性的候选一致性。在三个数据集和多个放射学MLLM上,CCS始终优于单路径解码和通用Best-of-N基线,特别是在临床指标上取得了明显提升。进一步分析表明,基于图像的效用形成了与文本共识不同的选择轴,并且在推理时改进RRG仍有很大的提升空间。

英文摘要

Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by scaling training data, model capacity, and retrieval mechanisms, improving report quality at inference time remains underexplored. In this work, we observe that fixed radiology MLLMs often generate clinically stronger reports elsewhere in their candidate pool than the one selected by default decoding, suggesting that inference-time decision making remains an overlooked bottleneck. To address this, we propose Clinical Consensus Selection (CCS), a decoder-agnostic inference-time selection framework that samples multiple candidate reports and selects the one with the highest clinical consensus across the rollout pool. CCS unifies text-based utilities with a radiology-adapted utility computed by an image--report-trained multimodal embedder, which measures candidate agreement beyond surface-level textual similarity. Across three datasets and multiple radiology MLLMs, CCS consistently improves inference-time performance over single-path decoding and generic Best-of-N baselines, with particularly clear gains on clinical metrics. Further analysis shows that image-grounded utility forms a selection axis distinct from textual consensus and that substantial headroom remains for improving RRG at inference time.
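The selection step can be sketched as picking the candidate that agrees most, on average, with the rest of the sampled pool. The sketch below uses token-set Jaccard overlap as a stand-in utility. The function names, the utility, and the toy reports are hypothetical, and the paper's actual utilities (including the image-report-trained embedder) are not reproduced here.

```python
def jaccard(a, b):
    """Token-set overlap between two reports, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def select_by_consensus(candidates, utility=jaccard):
    """Return the candidate with the highest mean utility against the others."""
    if len(candidates) == 1:
        return candidates[0]

    def consensus(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    return candidates[max(range(len(candidates)), key=consensus)]

pool = [
    "mild cardiomegaly no pleural effusion",
    "mild cardiomegaly no effusion",
    "cardiomegaly no pleural effusion",
    "large right pneumothorax",
]
print(select_by_consensus(pool))
```

Because `utility` is a parameter, a different agreement measure (for example, similarity between learned embeddings) could be swapped in without changing the selection logic, which is one way to read the "decoder-agnostic" framing in the abstract.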

2605.30126 2026-05-29 cs.CV cs.AI cs.CL cs.LG

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

PARCEL: 基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解

Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

AI总结 提出PARCEL视觉分词架构,通过池锚定和条件弹性查询重采样解决视觉令牌压缩中的空间与查询表示冲突,在27个基准上提升性能-效率帕累托前沿。

详情
Comments
33 pages, 4 figures
AI中文摘要

大型视觉-语言模型(LVLMs)将视觉输入映射为密集的令牌序列,导致推理时的二次计算瓶颈。弹性视觉令牌压缩通过训练单一模型以在多个视觉令牌预算下运行来解决这一问题。然而,现有方法在激进压缩下表现不佳。空间压缩(如嵌套池化)表现为不完美的低通滤波器,并引起频谱混叠,掩盖了细粒度细节。查询压缩(如嵌套查询重采样)用非局部摘要替代显式的网格对齐令牌,显著降低了空间定位能力。为解决这一表示冲突,我们引入了PARCEL(基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解),一种视觉分词架构,动态分配特征提取的工作。PARCEL将空间池令牌建立为低频布局锚点,并通过池条件查询重采样使弹性查询令牌依赖于这些锚点。这鼓励查询令牌专注于互补的视觉特征,而非冗余的空间映射。在27个基准上的广泛评估表明,PARCEL改进了性能-效率帕累托前沿,在各种视觉令牌预算下持续优于现有的嵌套基线,同时保留了“一次训练,随处部署”的范式。

英文摘要

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.

2605.30117 2026-05-29 cs.AI

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

VLA-Trace: 通过表示与行为追踪诊断视觉-语言-动作模型

Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang, Jiayu Hu, Haozhe Shan, Han Dong, Jinpeng Lu, Yinda Chen, Yi Zhang, Yong Dai, Xiaozhu Ju

AI总结 提出VLA-Trace诊断框架,通过表示演化、因果控制归因和行为表现分析,揭示VLA模型在多模态知识向具身控制转化中的机制,发现不同模型在微调适应、多模态路由和语义遵循上的差异与局限。

详情
AI中文摘要

理解视觉-语言-动作(VLA)模型如何将多模态知识转化为具身控制仍然是一个开放的挑战。我们提出了VLA-Trace,一个渐进式诊断框架,通过从表示动态到因果控制归因再到行为表现的统一证据链来分析VLA模型。它具体结合了跨模态和以检查点漂移为中心的核对齐(CKA)来追踪表示演化,注意力阻断干预来识别模态特定的控制通路,以及 rollout 级别的行为探针来检查基础能力、捷径依赖和语义遵循。在 $π_{0.5}$ 和 OpenVLA 上的实验揭示了三个关键发现。第一,两个模型在 VLA 微调期间表现出不同的模态特定适应动态。第二,它们在动作解码期间依赖于不同的多模态路由策略和层间依赖关系。第三,尽管 VLA 策略在视觉引导的轨迹生成方面表现出色,但在细粒度语义遵循方面仍然有限。这些发现指出了表示保持适应、因果 VLA 回路和组合语义控制的未来方向。

英文摘要

Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $π_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.