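arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.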

2605.13455 2026-05-27 cs.CV

Bayesian In Vivo Tracking of Synapses using Joint Poisson Deconvolution and Diffeomorphic Registration

Shashwat Kumar, Dominic M. Padova, Binish Narang, Gabrielle I. Coste, Austin R. Graves, Richard L. Huganir, Adam S. Charles, Michael I. Miller, Anuj Srivastava

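AI summary: Proposes a template-based Bayesian framework that combines Poisson deconvolution with diffeomorphic registration to jointly perform synapse detection, denoising, fluorescence-intensity inference, tissue-motion correction, and confidence-region estimation, for tracking synapses in low-SNR in vivo microscopy data.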

English abstract

Synapses are densely packed submicron structures that dynamically reorganize during learning and memory formation. Longitudinal in vivo imaging of fluorescently tagged synaptic receptors offers a promising opportunity to study large-scale synaptic dynamics and how these processes are disrupted in neurological disease. However, in vivo imaging with 2-photon microscopy uses low laser power and therefore suffers from low signal-to-noise ratio (SNR) and high shot noise, nonlinear tissue motion between days, nonstationary fluctuations in synaptic fluorescence, and significant blur induced by the microscope point spread function (PSF). Together, these factors make it challenging to detect and track synapses, especially in regions with high synaptic density. This paper presents a novel template-based framework for modeling synapses as varying luminance point sources that move under a nonlinear tissue deformation. Taking a unified Bayesian approach, we apply this model to microscopy data by deriving a posterior that incorporates a diffeomorphic mapping for domain warping, a Gaussian point spread function for the imaging process, and a Poisson observation model for raw photon counts. The Bayesian solution simultaneously: (1) constructs a probabilistic template of synapse locations, (2) denoises and deconvolves the image data, (3) infers fluorescence intensities, (4) performs diffeomorphic image registration to correct for tissue motion, and (5) provides confidence regions for these parameter estimates. We demonstrate the framework on both a 2D+t simulated dataset and a 3D+t longitudinal in vivo microscopy dataset of fluorescent synapses imaged in a mouse over two weeks.
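To make the observation model in the abstract concrete, here is a minimal 1D sketch, not the authors' implementation. It assumes a rigid translation `shift` as a stand-in for the diffeomorphic warp, and all names and numbers (`expected_rate`, the grid, amplitudes, `sigma`, `background`) are hypothetical. Point sources are blurred by a Gaussian PSF to give an expected photon rate, and photon counts are scored with a Poisson log-likelihood.

```python
import math

def expected_rate(grid, positions, amplitudes, shift, sigma=1.5, background=1.0):
    """Expected photon rate per pixel: point sources moved by `shift`
    (a toy stand-in for tissue deformation) and blurred by a Gaussian PSF."""
    rates = []
    for x in grid:
        total = background
        for mu, a in zip(positions, amplitudes):
            total += a * math.exp(-((x - (mu + shift)) ** 2) / (2.0 * sigma ** 2))
        rates.append(total)
    return rates

def poisson_loglik(counts, rates):
    """Log-likelihood of photon counts under independent Poisson pixels."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1)
               for y, lam in zip(counts, rates))

grid = list(range(20))
positions = [5.0, 12.0]      # hypothetical template locations
amplitudes = [50.0, 30.0]    # hypothetical fluorescence intensities
true_shift = 2.0             # hypothetical tissue motion between sessions

# Noise-free counts (rounded rates) keep the example deterministic.
counts = [round(lam) for lam in expected_rate(grid, positions, amplitudes, true_shift)]

ll_registered = poisson_loglik(counts, expected_rate(grid, positions, amplitudes, true_shift))
ll_unregistered = poisson_loglik(counts, expected_rate(grid, positions, amplitudes, 0.0))
```

In this toy setting the registered model explains the counts better than the unregistered one (`ll_registered > ll_unregistered`), which is the signal a joint registration and deconvolution scheme can exploit.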

2604.22546 2026-05-27 cs.CV

ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang

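AI summary: Incomplete annotations in open-vocabulary scene graph generation cause many valid relations to be treated as negatives. ReLIC-SGG addresses this by building a semantic relation lattice that models similarity, entailment, and contradiction among predicates, and by treating unannotated relations as latent variables rather than definite negatives. It infers missing positive relations from visual-language compatibility, graph context, and semantic consistency, uses positive-unlabeled graph learning to reduce false-negative supervision, and applies lattice-guided decoding to produce compact, semantically consistent scene graphs.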

Comments Some errors in the experimental sections

English abstract

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.

2509.09544 2026-05-27 cs.CL

MetaGraph: A Large-Scale Meta-Analysis of GenAI in Financial NLP (2022-2025)

Paolo Pedinotti, Peter Baumann, Nathan Jessurun, Leslie Barrett, Enrico Santus

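AI summary: Proposes MetaGraph, a method that uses ontology-guided LLMs to extract typed knowledge graphs from scientific corpora, and applies it to a structured trend analysis of 681 papers on GenAI in finance. The analysis reveals three phases: early LLM-driven expansion of tasks and datasets, growing attention to limitations and risks, and a shift toward modular, system-oriented approaches.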

Comments 8 pages, appendices, GEM, ACL

2605.12271 2026-05-27 cs.CV

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

Yaofang Liu, Kangning Cui, Meng Chu, Zhaoqing Li, Suiyun Zhang, Jean-Michel Morel, Xiaodong Cun, Haoxuan Che, Rui Liu, Raymond H. Chan

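AI summary: Proposes the visual-to-visual (V2V) generation paradigm and the training-free V2V-Zero framework, which replaces text conditioning with hidden states from a visual page and matches or approaches optimized text-to-image performance on several tasks.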

Comments Project Page: https://yaofang-liu.github.io/V2V_Web

English abstract

Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.

2605.11867 2026-05-27 cs.CV

When Brains Disagree: Biological Ambiguity Underlies the Challenge of Amyloid PET Synthesis from Structural MRI

Louise E. G. Baron, Ross Callaghan, David M. Cash, Philip S. J. Weston, Hojjat Azadbakht, Hui Zhang

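AI summary: Uses controlled experiments to show that the limited performance of structural MRI-to-amyloid PET synthesis stems from biological ambiguity (MRI and PET measure temporally decoupled pathological processes) rather than from model architecture capacity, and that adding multimodal information such as plasma biomarkers resolves the problem.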

Comments MICCAI 2026 accepted paper (no rebuttal)

English abstract

Structural MRI-to-amyloid PET synthesis has been proposed as a non-invasive alternative for amyloid assessment in Alzheimer's disease (AD). However, reported performance of identical models varies widely across studies, and increasingly complex architectures have not led to consistent gains. This inconsistency is thought to be caused by a fundamental biological ambiguity: MRI captures neurodegeneration, while PET measures amyloid pathology - two processes that are often temporally decoupled in AD. As a result, similar MRI patterns may correspond to different amyloid states, creating ambiguous one-to-many mappings. MRI-to-amyloid PET synthesis may therefore be intrinsically ill-posed; however, this idea has yet to be tested scientifically. The aim of this work is to test this hypothesis through two controlled experiments. We first control the training distribution by stratifying paired MRI-PET data by amyloid and neurodegeneration status. Using two standard synthesis models under a controlled design, we show that biologically unambiguous mappings are learnable in isolation, but performance collapses when data ambiguity is introduced. This demonstrates that ambiguity in the data distribution, rather than architectural capacity, constrains performance. Second, we show that introducing orthogonal biological information in the form of plasma biomarkers resolves this ambiguity. When multimodal inputs are incorporated, performance improves and stability is restored. Together, these findings suggest that limited and inconsistent performance in MRI-to-amyloid PET synthesis is explained by intrinsic biological ambiguity, and that stable, meaningful progress requires multimodal integration rather than architectural complexity.

2605.06152 2026-05-27 cs.LG cs.CL math.OC stat.ML

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou

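AI summary: Shows that Slingshot loss spikes in deep neural network training are caused by Numerical Feature Inflation (NFI), a mechanism arising from floating-point precision limits, and explains associated phenomena such as rapid parameter-norm growth and vanishing gradients.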

Comments 28 pages, 13 figures; ICML 2026 Workshop on High-dimensional Learning Dynamics (Spotlight)

English abstract

Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
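The absorption effect described above can be reproduced in a toy setting. The sketch below is an illustration, not the paper's experiment: it emulates float32 by rounding every intermediate through `struct` (the helper names `f32` and `softmax_ce_grad` are hypothetical) and computes the softmax cross-entropy gradient `p - onehot` for a confidently correct example. In float32, adding a term below roughly 2^-24 to 1.0 is absorbed, so the correct-class probability rounds to exactly 1 and its gradient to exactly 0, while the incorrect-class gradients stay positive.

```python
import math
import struct

def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def softmax_ce_grad(logits, label, rnd):
    """Gradient of softmax cross-entropy w.r.t. the logits (p - onehot),
    applying the rounding function `rnd` after every arithmetic step."""
    m = max(logits)
    exps = [rnd(math.exp(rnd(z - m))) for z in logits]
    total = 0.0
    for e in exps:
        total = rnd(total + e)          # small terms are absorbed here in float32
    probs = [rnd(e / total) for e in exps]
    return [rnd(p - (1.0 if k == label else 0.0)) for k, p in enumerate(probs)]

logits = [20.0, 0.0, 0.0]               # correct class 0 leads by a gap of 20
grad32 = softmax_ce_grad(logits, 0, f32)
grad64 = softmax_ce_grad(logits, 0, lambda x: x)

zero_sum_broken = sum(grad32) > 0.0      # float32: gradients no longer cancel
zero_sum_holds = abs(sum(grad64)) < 1e-12  # float64: they still cancel
```

With a logit gap of 20, `exp(-20)` is about 2e-9, well below the float32 absorption threshold, so `grad32[0]` is exactly `0.0` while `grad32[1]` and `grad32[2]` remain positive. In float64 the same computation keeps the gradients summing to approximately zero.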

2604.22274 2026-05-27 cs.CV

CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen

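AI summary: Proposes an open-vocabulary scene graph generation framework based on counterfactual relation verification. It decomposes predicates into soft evidence bases and uses a counterfactual verifier to ensure that relations are supported by visual evidence, thereby improving reliability, interpretability, and generalization.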

Comments This manuscript has been withdrawn by the authors because we found a methodological flaw in the formulation and evaluation of the proposed approach. The issue affects the reliability of the experimental results and the conclusions drawn from them. Therefore, the authors consider the current version unsuitable for citation or further use

English abstract

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.

2511.15572 2026-05-27 cs.CV

From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

Huiyuan Tian, Bonan Xu, Shijian Li

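AI summary: Identifies an encoding mismatch phenomenon and proposes two simple remedies, Lift and WideLast, that substantially improve feature distillation for Vision Transformers in compression settings.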

Comments 22 pages, 22 figures. Accepted at ICML 2026

English abstract

Feature-map knowledge distillation (KD) transfers internal representations well between comparably sized Vision Transformers (ViTs), but it often fails in compression. We revisit this failure and uncover a paradox. Sample-wise SVD shows that each image is highly compressible, which seems to suggest that a narrow student with a linear projector should match the teacher "in principle". However, a dataset-level view contradicts this intuition: PCA shows that the teacher is a union of low-rank subspaces with significant subspace rotation across inputs. We further introduce token-level Spectral Energy Patterns (SEP) and find an architecture-invariant encoding law: tokens spread energy broadly across channel modes even when they live in a low-rank subspace, creating a bandwidth mismatch. We refer to this combined phenomenon as an encoding mismatch. We propose two minimal remedies, Lift or WideLast: (i) Lift retains a lightweight lifting projector at inference to provide wider channels, or (ii) WideLast widens only the student's last block, enabling an input-dependent expansion. On ImageNet-1K, these fixes revive feature KD for ViT compression, improving DeiT-Tiny distilled from CaiT-S24 from 74.86% to 77.53%/78.23% top-1 accuracy, and they also strengthen students trained without distillation. Our analyses clarify when and why feature-map KD fails and how to fix it. Code and raw data are available at https://github.com/thy960112/From-Per-Image-Low-Rank-to-Encoding-Mismatch.

2509.26469 2026-05-27 cs.LG

DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick

Mohammad Hassan Vali, Tom Bäckström, Arno Solin

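AI summary: Proposes DiVeQ, which uses the reparameterization trick to treat quantization as adding an error vector that mimics the quantization distortion, so the forward pass performs hard quantization while gradients still flow. A space-filling variant, SF-DiVeQ, reduces quantization error and makes full use of the codebook. Both improve reconstruction and sample quality on VQ-VAE, VQGAN, and DAC tasks.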

2601.16578 2026-05-27 cs.RO cs.SY eess.SY

Zero-Shot MARL Benchmark in the Cyber-Physical Mobility Lab

Julius Beerwerth, Jianye Xu, Simon Schäfer, Fynn Belderink, Bassam Alrifaee

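AI summary: Builds a reproducible benchmark on the Cyber-Physical Mobility Lab for evaluating sim-to-real transfer of multi-agent reinforcement learning policies for connected and automated vehicles, and identifies two complementary sources of performance degradation.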

2605.09156 2026-05-27 cs.CL cs.AI

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

Ahan Chatterjee, Matthias Schöffel, Matthias Aßenmacher, Marinus Wiedner, Esteban Garces Arias

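AI summary: Proposes an interpretable deep learning framework that analyzes, at the lexical and contextual levels, how the grammatical gender system changed from tripartite (masculine, feminine, neuter) in Latin to bipartite (masculine, feminine) in Occitan, and demonstrates an improved tokenization strategy along with the contributions of morphological features and part-of-speech categories to gender prediction.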

Comments Accepted at NLP4DH @ ACL 2026

English abstract

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.

2605.08455 2026-05-27 cs.LG cs.PL cs.SE

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

Shiyang Li, Haoyang Chen, Mattia Fazzini, Caiwen Ding

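AI summary: Proposes the CUDABEAVER benchmark, which evaluates the ability of LLMs to repair CUDA code using the protocol-conditional metric pass@k(M,C,A), and shows how performance-loss tolerance affects the measured success rate.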

Comments 25 pages, 5 figures

English abstract

Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, corpus C, and protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a slightly stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.
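The abstract does not define pass@k(M,C,A) in detail, so the sketch below only illustrates the general idea under stated assumptions. It uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), and a hypothetical protocol axis `max_slowdown`: an attempt counts as a success only if tests pass and the repaired kernel is no slower than `max_slowdown` times a reference runtime. The attempt data, function names, and thresholds are invented for illustration and are not taken from the benchmark.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n attempts of which c succeeded."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def count_successes(attempts, max_slowdown):
    """Count attempts that pass the tests AND stay within the allowed slowdown
    (hypothetical protocol axis; runtime is given relative to a reference)."""
    return sum(1 for passed, slowdown in attempts
               if passed and slowdown <= max_slowdown)

# Hypothetical attempts for one task: (tests_passed, runtime / reference_runtime)
attempts = [(True, 1.05), (True, 3.2), (False, 1.0), (True, 1.9), (False, 2.5)]

loose = pass_at_k(len(attempts), count_successes(attempts, 4.0), k=1)   # tolerant protocol
strict = pass_at_k(len(attempts), count_successes(attempts, 1.1), k=1)  # strict protocol
```

On this toy data the tolerant protocol counts three of five attempts and the strict one counts a single attempt, which shows how the same fixer outputs can score very differently once performance preservation is part of the success criterion.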

2605.04635 2026-05-27 cs.CV

UniPCB: A Generation-Assisted Detection Framework for PCB Defect Inspection

Huan Zhang, Lianghong Tan, Yichu Xu, Zishan Su, Jiangzhong Cao, Huanqi Wu, Linwei Zhu, Xu Zhang

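AI summary: Proposes the UniPCB framework, which augments data by synthesizing defect samples with a multi-modal condition generator and improves detection with Inverted Residual Shift Attention and a Cross-level Complementary Fusion module, reaching 98.0% mAP@0.5 on DsPCBSD+.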

English abstract

In the Industrial Internet of Things (IIoT), enabling intelligent, real-time Printed Circuit Board (PCB) defect inspection is critical for ensuring product reliability. However, existing IIoT-based visual inspection systems face two compounding challenges: scarce and imbalanced defect samples that limit model training, and insufficient feature representation under complex circuit backgrounds. Existing generation methods rely on single-modality conditions with coarse structural control, while detection methods improve architectures without addressing the data bottleneck. To resolve both challenges jointly, we propose a generation-assisted PCB defect inspection framework that integrates controlled defect synthesis with task-specific defect detection within an IIoT-enabled pipeline. On the generation side, a Multi-modal Condition Generator extracts complementary edge, depth, and text conditions in parallel. A ScaleEncoder then embeds these conditions into the diffusion U-Net at four resolutions, and a Condition Modulation applies FiLM-style spatially-adaptive modulation at each scale, enabling structurally aligned and defect-aware sample synthesis to augment the scarce IIoT dataset. On the detection side, an Inverted Residual Shift Attention couples self-attention with shift-wise convolution to jointly capture global context and local texture, and a Cross-level Complementary Fusion Block generates pixel-level gates for selective cross-level feature fusion. The synthesized samples directly enrich the detection training set, so that improvements in generation compound with improvements in detection. Extensive experiments on DsPCBSD+ demonstrate that UniPCB achieves mAP@0.5 of 98.0% and mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods, while the generation branch attains an FID of 129.61 and SSIM of 0.619, outperforming existing conditional generation approaches.
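For readers unfamiliar with FiLM-style spatially-adaptive modulation, the operation itself is a per-location scale and shift of a feature map. The sketch below shows only that arithmetic on nested lists. In the paper the scale `gamma` and shift `beta` would be predicted from the embedded conditions; here they are hand-written hypothetical values, and `film_modulate` is an illustrative name rather than the authors' module.

```python
def film_modulate(features, gamma, beta):
    """Spatially-adaptive FiLM: out[c][i][j] = gamma[c][i][j] * features[c][i][j] + beta[c][i][j].
    All three inputs are nested lists of shape (channels, height, width)."""
    return [
        [
            [g * f + b for f, g, b in zip(f_row, g_row, b_row)]
            for f_row, g_row, b_row in zip(f_ch, g_ch, b_ch)
        ]
        for f_ch, g_ch, b_ch in zip(features, gamma, beta)
    ]

# One channel, 2x2 feature map with hypothetical modulation parameters.
features = [[[1.0, 2.0], [3.0, 4.0]]]
gamma = [[[1.0, 0.0], [2.0, 1.0]]]   # per-pixel scale
beta = [[[0.0, 5.0], [0.0, -1.0]]]   # per-pixel shift

modulated = film_modulate(features, gamma, beta)
```

Because `gamma` and `beta` vary per pixel, the condition can suppress, amplify, or offset features at specific locations, which is what distinguishes this form from channel-wise FiLM.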

2605.03929 2026-05-27 cs.SD cs.AI cs.LG eess.SP

PHALAR: Phasors for Learned Musical Audio Representations

Davide Marincione, Michele Mancusi, Giorgio Strano, Luca Cerovaz, Donato Crisostomi, Roberto Ribuoli, Emanuele Rodolà

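AI summary: Proposes PHALAR, a contrastive framework that uses learned spectral pooling and a complex-valued head to achieve pitch and phase equivariance. On stem retrieval it uses 50% fewer parameters, trains 7x faster, improves accuracy by roughly 70% in relative terms, and captures robust musical structure.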

Comments Accepted at ICML 2026

2605.07990 2026-05-27 cs.CL cs.AI cs.LG cs.SE

Tool Calling is Linearly Readable and Steerable in Language Models

Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

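AI summary: Finds linear directions inside language models that correspond to tool selection. Intervening along these directions switches the tool call and allows likely errors to be detected in advance; the findings are validated across multiple models and benchmarks.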

Comments 24 pages. ACL ARR May 2026 submission (EMNLP 2026 preferred venue); v2 reflects revised manuscript

English abstract

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As agents take on consequential actions, one bad tool call can do real damage. We currently have no way to look inside the model and catch the mistake before it happens; this paper shows that we can. Inside the model, the choice of tool is carried by a single direction in activation space, one direction per pair of tools. Adding that direction during generation switches which tool the model picks. Across 12 instruction-tuned and 6 base models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), this works at 83-100% accuracy on 4B+ instruction-tuned models on a 15-tool synthetic benchmark and at 77-94% on the real-API benchmark τ-bench airline. The JSON arguments that follow automatically adapt to the new tool's schema, so flipping the name is enough. The same per-tool directions also flag likely errors before they happen: queries where the model is unsure between two tools fail 21x more often than queries where it is not (Gemma 3 27B). This is not just topic injection: random vectors at the same magnitude give a 0% switch rate, and a probe within a single domain (14 airline tools that share one topic) still reads which tool the model will call at top-1 61-89% across five 4B-14B models. Even base models already carry the right tool internally before they can emit it: reading the chosen tool off the model's internal state (cosine readout) recovers 61-82% accuracy on BFCL while base generation lands at 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. Our results cover single-turn, fixed-menu settings; on multi-turn agent loops the same intervention is less stable (matched-baseline gain or loss of up to 30 percentage points with no consistent direction).
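The two operations named in the abstract, adding a per-pair direction and a cosine readout, can be sketched on toy vectors. This is an illustration under assumptions, not the paper's procedure: the direction is taken to be a difference of mean activations (the abstract does not say how directions are estimated), the 3-dimensional "activations" and tool names `send_email` and `book_meeting` are invented, and no real model is involved.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def readout(h, prototypes):
    """Cosine readout: the tool whose mean activation is most aligned with h."""
    return max(prototypes, key=lambda name: cosine(h, prototypes[name]))

# Hypothetical activations collected while the model calls each tool.
acts = {
    "send_email": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "book_meeting": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
prototypes = {name: mean(vs) for name, vs in acts.items()}

# Assumed construction: one direction per tool pair, as a difference of means.
direction = [b - a for a, b in zip(prototypes["send_email"], prototypes["book_meeting"])]

h = [0.95, 0.05, 0.05]                               # state that reads as send_email
steered = [x + 1.0 * d for x, d in zip(h, direction)]  # add the direction with scale 1.0

before = readout(h, prototypes)
after = readout(steered, prototypes)
```

In this toy setup `before` is `send_email` and `after` is `book_meeting`. In a real model the addition would be applied to hidden states during generation, and the steering scale would need tuning.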

2605.07632 2026-05-27 cs.CL cs.AI cs.LG

Post-training makes large language models less human-like

Marcel Binz, Elif Akata, Abdullah Almaatouq, Mohammed Alsobay, Oleksii Ariasov, Franziska Brändle, David Broska, Jason W. Burton, Nuno Busch, Frederick Callaway, Vanessa Cheung, Brian Christian, Julian Coda-Forno, Can Demircan, Vittoria Dentella, Maria K. Eckstein, Noémi Éltető, Michael Franke, Thomas L. Griffiths, Fritz Günther, Susanne Haridi, Sebastian Hellmann, Stefan Herytash, Linus Hof, Eleanor Holton, Isabelle Hoxha, Zak Hussain, Akshay Jagadish, Elif Kara, Valentin Kriegmair, Evelina Leivada, Li Ji-An, Tobias Ludwig, Maximilian Maier, Marcelo G. Mattar, Marvin Mathony, Alireza Modirshanechi, Robin Na, Mariia Nadverniuk, Antonios Nasioulas, Surabhi S. Nath, Helen Niemeyer, Kate Nussenbaum, Sebastian Olschewski, Thorsten Pachur, Stefano Palminteri, Aliona Petrenco, Camille V. Phaneuf-Hadd, Angelo Pirrone, Manuel Rausch, Laura Raveling, Shashank Reddy, Milena Rmus, Evan M. Russek, Tankred Saanum, Kai Sandbrink, Louis Schiekiera, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi, Leah H. Somerville, Mikhail S. Spektor, Xin Sui, Christopher Summerfield, Mirko Thalmann, Anna I. Thoma, Taisiia Tikhomirova, Vuong Truong, Polina Tsvilodub, Konstantinos Voudouris, Kristin Witte, Shuchen Wu, Dirk U. Wulff, Hua-Dong Xiong, Songlin Xu, Lance Ying, Xinyu Zhang, Jian-Qiao Zhu, Eric Schulz

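AI summary: Using the newly introduced Psych-201 dataset, finds that post-training (the process that turns base models into useful assistants) consistently reduces alignment with human behavior, that this misalignment grows in newer model generations, and that persona induction does not improve individual-level predictions.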

英文摘要

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.

2605.07521 2026-05-27 cs.AI

From Feasible to Practical: Pareto-Optimal Synthesis Planning

从可行到实用：帕累托最优合成规划

Friedrich Hastedt, Dongda Zhang, Antonio del Rio Chanona

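AI summary: To address the neglect of multi-objective trade-offs in existing synthesis planning methods, the paper proposes MORetro*, which uses multi-objective A* search to generate a Pareto front capturing trade-offs among criteria such as cost, sustainability, and toxicity.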

Comments Published in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

当前的计算机辅助合成规划（CASP）方法通常将逆合成视为一旦找到单一可行路线即解决，主要关注收敛性或最短路径指标。这种观点与现实实践不符，因为化学家必须平衡成本、可持续性、毒性和总产率等相互竞争的目标。为解决这一问题，我们将合成规划建模为多目标搜索问题，并引入MORetro*算法，该算法生成合成路线的帕累托前沿，以明确捕捉用户定义标准之间的权衡。MORetro*使用加权标量化和基于贝叶斯优化的采样，有效导航组合搜索空间并优先考虑有前景的权衡。基于多目标A*搜索，我们提供了最优性保证，表明对于固定的单步模型，MORetro*在可采纳性条件下恢复真实的帕累托前沿。在多个逆合成基准测试中，MORetro*生成了多样化、高质量的帕累托前沿，发现了单目标方法忽略的解决方案，并使CASP输出更符合工业决策。

英文摘要

Current computer-aided synthesis planning (CASP) methods often treat retrosynthesis as solved once a single feasible route is identified, focusing primarily on convergence or shortest-path metrics. This view is misaligned with real-world practice, where chemists must balance competing objectives such as cost, sustainability, toxicity, and overall yield. To address this, we formulate synthesis planning as a multi-objective search problem and introduce MORetro*, an algorithm that generates a Pareto front of synthesis routes to explicitly capture trade-offs among user-defined criteria. MORetro* uses weighted scalarization and BO-informed sampling to efficiently navigate the combinatorial search space and prioritize promising trade-offs. Building on multi-objective A*-search, we provide optimality guarantees showing that, for a fixed single-step model, MORetro* recovers the true Pareto front under admissibility. Across multiple retrosynthesis benchmarks, MORetro* produces diverse, high-quality Pareto fronts, uncovering solutions overlooked by single-objective approaches and better aligning CASP outputs with industrial decision-making.
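
To make the notion of a Pareto front of routes concrete, here is a minimal non-dominated filter. It is not MORetro* itself (no A* search, scalarization, or BO-informed sampling); the route names and objective values are hypothetical, and every objective is assumed to be minimized.

```python
from typing import Dict, List, Tuple

def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """a dominates b if it is no worse on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(routes: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of routes that no other route dominates, sorted by name."""
    front = []
    for name, objectives in routes.items():
        dominated = any(
            dominates(other, objectives)
            for other_name, other in routes.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

# Hypothetical routes scored as (cost, toxicity, 1 - yield); lower is better.
routes = {
    "route_a": (10.0, 0.8, 0.2),
    "route_b": (12.0, 0.3, 0.3),
    "route_c": (15.0, 0.9, 0.4),  # worse than route_a on every objective
    "route_d": (8.0, 0.9, 0.5),
}

print(pareto_front(routes))  # ['route_a', 'route_b', 'route_d']
```

A single-objective planner that minimizes cost alone would return only `route_d`; the front also keeps `route_a` and `route_b`, which trade cost for lower toxicity or higher yield.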

2603.12647 2026-05-27 cs.CV cs.AI

LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

LR-SGS：用于自动驾驶场景重建的鲁棒激光雷达反射率引导显著高斯泼溅

ZY Chen, F Zhu, H Zhu, DY Kong, XK Kuang, YJ Zhang, CM Jiang

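AI summary: Proposes a salient Gaussian representation that combines LiDAR reflectance with RGB, using structure-aware initialization, reflectance calibration, and joint alignment for efficient and robust self-driving scene reconstruction.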

Comments 8 pages, 7 figures

AI中文摘要

最近的3D高斯泼溅（3DGS）方法已证明了自动驾驶场景重建和新视角合成的可行性。然而，现有方法大多仅依赖相机，或仅将激光雷达用于高斯初始化或深度监督，而点云中包含的丰富场景信息（如反射率）以及激光雷达与RGB之间的互补性尚未被充分利用，导致在具有高自运动和复杂光照等挑战性自动驾驶场景中性能下降。为解决这些问题，我们提出了一种鲁棒且高效的激光雷达反射率引导显著高斯泼溅方法（LR-SGS），用于自动驾驶场景。该方法引入了一种结构感知的显著高斯表示，该表示从激光雷达提取的几何和反射率特征点初始化，并通过显著变换和改进的密度控制来捕捉边缘和平面结构。此外，我们将激光雷达强度校准为反射率，并将其作为光照不变的材料通道附加到每个高斯上，与RGB联合对齐以强制边界一致性。在Waymo Open数据集上的大量实验表明，LR-SGS以更少的高斯和更短的训练时间实现了优越的重建性能。特别是在复杂光照场景下，我们的方法在PSNR上超过OmniRe 1.18 dB。

英文摘要

Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.

2605.07053 2026-05-27 cs.CL cs.AI

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

GSM-SEM: 生成语义变体增强的基准与框架

Jyotika Singh, Fang Tu, Aziza Mirsaidova, Amit Agarwal, Hitesh Laxmichand Patel, Sandip Ghoshal, Miguel Ballesteros, Karan Dua, Yassine Benajiba, Weiyi Sun, Tao Sheng, Graham Horwood, Sujith Ravi, Dan Roth

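AI summary: Proposes the GSM-SEM framework, which generates semantically diverse variants of math problems by modifying entities, attributes, and relationships, reducing memorization bias from fixed test sets; performance drops are confirmed across multiple benchmarks.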

AI中文摘要

像GSM8K这样的基准测试是数学推理的流行度量，但由于对固定测试集的记忆，排行榜上的提升可能夸大真实能力。大多数鲁棒性变体应用表面级别的扰动（释义、重命名、数字交换、干扰项），这些扰动在很大程度上保留了底层事实，而静态发布本身可能随着时间的推移成为记忆目标。我们引入了GSM-SEM，一个可重用且随机的框架，用于生成语义多样化的基准变体，其语义方差显著高于先前方法。GSM-SEM通过修改实体、属性和/或关系来扰动问题陈述，经常改变底层事实，并要求模型在新条件下重新计算解决方案，同时约束生成以保留原始计算/答案和近似问题难度。GSM-SEM在每次运行时生成新的变体，无需重新标注，减少了对静态公共基准评估的依赖，从而降低了记忆偏差。我们将GSM-SEM应用于GSM8K和两个现有的变体系列（GSM-Symbolic和GSM-Plus），生成了GSM8K-SEM、GSM-Symbolic-SEM和GSM-Plus-SEM。评估14个SOTA LLM，我们观察到一致的性能下降，当语义扰动与符号/plus变体结合时下降更大（在GSM-SEM的最大严格配置中平均下降率为28%）。我们公开发布这三个SEM变体作为完全人工验证的数据集。最后，为了展示在GSM风格数学问题之外的适用性，我们将GSM-SEM应用于其他基准，包括BigBenchHard、LogicBench和NLR-BIRD。

英文摘要

Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.

2604.08059 2026-05-27 cs.RO cs.AI

Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

受治理的能力演化：基于AI组件的系统的生命周期兼容性检查与回滚——以具身智能体为例

Xue Qin, Simin Luan, John See, Zeyd Boukhers, Cong Yang, Zhijun Li

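AI summary: For AI-component-based systems, proposes a governed capability evolution framework that enables safe deployment through four classes of compatibility checks and a seven-stage upgrade pipeline, achieving zero unsafe activations in embodied-agent experiments.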

Comments 42 pages, 7 figures, 12 tables

AI中文摘要

由版本化AI组件构建的软件系统越来越需要生命周期治理：当能力模块演化到新版本时，宿主系统必须决定新版本是否可以安全激活、应在何种部署条件下运行、如何监控以及何时回滚。现有的软件部署模式（金丝雀发布、蓝绿部署、特性标志和MLOps管线）解决了这一循环的部分问题，但它们是针对无状态Web服务而非驱动现场AI组件的带状态、策略约束运行时设计的。我们将受治理的能力演化形式化为基于AI组件的系统的一等软件生命周期问题，并提出一个分阶段升级框架，其中每个新能力版本被视为受治理的部署候选，而非立即可执行的替换。该框架引入了四类升级兼容性检查（接口、策略、行为、恢复），并将其组织成七阶段管线（候选验证、沙箱评估、影子部署、门控激活、在线监控、回滚、审计）。我们在带有ROS 2中间件的PyBullet操作测试平台上实现了参考原型，并在15个随机种子的6轮能力升级中进行了评估。朴素升级实现了72.9%的任务成功率，但到最后一轮不安全激活率升至60%；受治理升级保持了可比的成功率（67.4%），同时在所有轮次中保持零不安全激活（Wilcoxon p=0.003）。影子部署揭示了40%的升级回归问题，这些问题是单独沙箱评估无法发现的，并且在79.8%的激活后漂移场景中回滚成功。

英文摘要

Software systems built from versioned AI components increasingly need lifecycle-time governance: when a capability module evolves into a new version, the hosting system must decide whether the new version may be activated safely, under what deployment conditions it should run, how it must be monitored, and when it should be rolled back. Existing software-deployment patterns (canary release, blue-green, feature flags, and MLOps pipelines) address parts of this loop but were designed for stateless web services rather than for stateful, policy-constrained runtimes that drive AI components in the field. We formulate governed capability evolution as a first-class software-lifecycle problem for AI-component-based systems and propose a staged upgrade framework in which every new capability version is treated as a governed deployment candidate rather than an immediately executable replacement. The framework introduces four upgrade compatibility checks (interface, policy, behavioral, recovery) and organizes them into a seven-stage pipeline (candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, rollback, audit). We implement a reference prototype on a PyBullet manipulation testbed with ROS 2 middleware and evaluate it over 6 rounds of capability upgrade with 15 random seeds. Naive upgrade achieves 72.9% task success but drives unsafe activation to 60% by the final round; governed upgrade retains comparable success (67.4%) while maintaining zero unsafe activations across all rounds (Wilcoxon p=0.003). Shadow deployment reveals 40% of upgrade regressions invisible to sandbox evaluation alone, and rollback succeeds in 79.8% of post-activation drift scenarios.

2605.06213 2026-05-27 cs.AI

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

超越固定基准和最坏情况攻击：语言模型的动态边界评估

Haoxiang Wang, Da Yu, Huishuai Zhang

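AI summary: Proposes Dynamic Boundary Evaluation (DBE), which locates boundary items whose pass probability is near 0.5 under random-sampling decoding and builds an evaluation protocol on a unified difficulty scale to address the saturation of fixed benchmarks.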

Comments This submission is being withdrawn because it was submitted without the knowledge and authorization of all co-authors. The authors need to resolve this authorship/authorization issue before any public posting

AI中文摘要

当前评估大型语言模型（LLM）依赖于固定基准，这些基准对所有模型应用相同的测试项，产生天花板和地板效应，掩盖了能力差距。我们认为最具信息量的评估信号位于边界，即在随机采样解码下每个提示的通过概率接近0.5，并提出了动态边界评估（DBE），它主动定位每个模型的边界，并将其置于全局可比的难度尺度上。DBE提供三个产物：(i) 一个校准的题库，涵盖安全性、能力和真实性，其每项难度标签在9个参考LLM上得到验证；(ii) 技能引导的边界搜索（SGBS），一种仅通过API级查询访问即可为目标LLM找到边界项的搜索算法；(iii) 一个评估协议，将新的LLM置于统一的能力尺度上，并在目标超出题库覆盖范围时自适应地扩展评估集。我们在四个类别上实例化DBE，涵盖安全性（有害请求拒绝和过度拒绝）、能力（受限指令遵循）和真实性（多轮谄媚抵抗）。由此产生的评估覆盖更广泛的模型谱系而不饱和，同时与现有数据集兼容。

英文摘要

Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near $0.5$ under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across $9$ reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and grows the evaluation set adaptively when the target falls outside the bank's coverage. We instantiate DBE on four categories spanning safety (harmful request refusal and over-refusal), capability (constrained instruction following), and truthfulness (multi-turn sycophancy resistance). The resulting evaluation covers a broader model spectrum without saturation while remaining compatible with existing datasets.
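
The boundary criterion (per-prompt pass probability near 0.5) can be sketched as a simple filter over repeated samples. This is only an illustration of the criterion, not the SGBS search algorithm; the item names, outcomes, and the tolerance value are hypothetical.

```python
from typing import Dict, List

def pass_rate(outcomes: List[int]) -> float:
    """Empirical pass probability from repeated sampled generations (1 = pass, 0 = fail)."""
    return sum(outcomes) / len(outcomes)

def boundary_items(samples: Dict[str, List[int]], tolerance: float = 0.15) -> List[str]:
    """Items whose empirical pass rate lies within `tolerance` of 0.5, sorted by name."""
    return sorted(
        item
        for item, outcomes in samples.items()
        if abs(pass_rate(outcomes) - 0.5) <= tolerance
    )

# Hypothetical pass/fail outcomes for one target model.
samples = {
    "q1": [1, 1, 1, 1],     # always passes: ceiling effect, uninformative
    "q2": [1, 0, 1, 0],     # pass rate 0.5: on the boundary
    "q3": [0, 0, 0, 0],     # always fails: floor effect, uninformative
    "q4": [1, 1, 0, 0, 1],  # pass rate 0.6: close to the boundary
}

print(boundary_items(samples))  # ['q2', 'q4']
```

Items such as `q1` and `q3` carry little signal about this model, which is the ceiling and floor effect the abstract attributes to fixed benchmarks.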

2511.22882 2026-05-27 cs.LG math.PR

Normalizing Flows on Quotient Manifolds via Boundary Quotients

通过边界商在商流形上的归一化流

William Ghanem, Benjamin Cai

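AI summary: Proposes a boundary-quotient framework for learning densities on manifolds that arise as boundary quotients of simpler domains, constructs normalizing flows on quotient manifolds under discrete group actions, and validates the approach on genus-g surfaces and lens spaces.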

2509.26619 2026-05-27 cs.CL cs.AI

Searching the Internet for Challenging Benchmarks at Scale

在互联网上大规模搜索具有挑战性的基准测试

Wenda Xu, Vilém Zouhar, Parker Riley, Mara Finkelstein, Markus Freitag, Daniel Deutsch

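AI summary: Proposes an automatic framework that models the Internet as a space of topics, casts the search as a multi-armed bandit problem, and uses an epsilon-greedy strategy to efficiently find the most challenging topics, building benchmarks without human curation.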

AI中文摘要

许多静态基准测试开始饱和：随着模型快速改进，它们在固定测试集上获得近乎完美的分数，几乎没有剩余空间来暴露模型的真正弱点——即使是专家策划的挑战集在爬山后也会迅速饱和。我们提出一个完全自动化的框架，在互联网上大规模搜索以构建具有挑战性的基准测试，无需人工筛选。关键洞察是将互联网建模为一个广阔的主题空间，并将搜索形式化为多臂老虎机问题，其中每个主题的难度仅通过昂贵的采样和评估查询来揭示。我们的epsilon-greedy策略在仅探索6%的搜索空间的情况下识别出最具挑战性的主题——相比穷举评估成本降低了100倍。我们在机器翻译和知识问答上进行了验证，确认发现的难度在独立指标（GEMBA-SQA和MetricX）、语言和模型上都是稳健的。

英文摘要

Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses -- and even expert-curated challenge sets quickly saturate after hillclimbing. We present a fully automatic framework that searches the Internet at scale to construct challenging benchmarks without human curation. The key insight is to model the Internet as a vast space of topics and formalize the search as a multi-armed bandit problem, where each topic's difficulty is revealed only through expensive sample-and-evaluate queries. Our epsilon-greedy strategy identifies the most challenging topics while exploring only 6% of the search space -- a 100 times cost reduction over exhaustive evaluation. We validate on machine translation and knowledge question answering, confirming that discovered difficulty is robust across independent metrics (GEMBA-SQA and MetricX), languages, and models.
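
The bandit formulation can be sketched with a generic epsilon-greedy loop over topics. This is a textbook version, not the authors' implementation: the reward is assumed to be a Bernoulli "model failed" signal, and the difficulty values, budget, and function name are hypothetical.

```python
import random
from typing import List, Tuple

def epsilon_greedy_search(
    true_difficulty: List[float],
    budget: int,
    epsilon: float = 0.1,
    seed: int = 0,
) -> Tuple[List[int], List[float]]:
    """Spend `budget` sample-and-evaluate queries across topics (arms).

    true_difficulty[i] is the hidden probability that the evaluated model
    fails on a sample from topic i. Returns per-topic pull counts and the
    estimated difficulty of each topic.
    """
    rng = random.Random(seed)
    n = len(true_difficulty)
    counts = [0] * n
    means = [0.0] * n
    for _ in range(budget):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore a random topic
        else:
            arm = max(range(n), key=lambda i: means[i])  # exploit the hardest topic so far
        # Simulated expensive query: reward 1.0 when the model fails on the sample.
        reward = 1.0 if rng.random() < true_difficulty[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
    return counts, means

counts, means = epsilon_greedy_search([0.0, 1.0, 0.5], budget=200)
```

Because most of the budget goes to topics that already look hard, only a fraction of the topic space needs to be sampled, which is the source of the cost reduction the abstract reports.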

2605.01489 2026-05-27 cs.AI cs.CL

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

SciResearcher: 面向前沿科学推理的深度研究智能体规模化

Tianshi Zheng, Rui Wang, Xiyun Li, Kelvin Kiu Wai Tam, Newt Nguyen Kim Hue Nam, Wei Fan, Yangqiu Song, Tianqing Fang

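AI summary: Proposes the SciResearcher framework, which synthesizes conceptual and computational tasks grounded in academic evidence and trains agents on them, reaching state-of-the-art performance at its parameter scale on benchmarks such as HLE-Bio/Chem-Gold.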

Comments 23 pages, 6 figures, 15 tables

AI中文摘要

前沿科学推理正迅速成为推动AI智能体在自动化科学发现中的关键基础。深度研究智能体为此挑战提供了有前景的方法。这些模型通过后训练处理信息寻求任务（通常通过知识图谱构建或迭代网页浏览来策划）来发展强大的问题解决能力。然而，这些策略在前沿科学中面临固有局限性，因为领域特定知识分散在稀疏且异构的学术来源中，而问题解决需要远超事实回忆的复杂计算和推理。为弥合这一差距，我们引入了SciResearcher，一个用于前沿科学数据构建的全自动智能体框架。SciResearcher综合了基于学术证据的多样化概念和计算任务，同时激发信息获取、工具集成推理和长程能力。利用策划的数据进行监督微调和智能体强化学习，我们开发了SciResearcher-8B，一个在HLE-Bio/Chem-Gold基准上达到19.46%的智能体基础模型，在其参数规模上建立了新的最先进水平，并超越了多个更大的专有智能体。它在SuperGPQA-Hard-Biology和TRQA-Literature基准上进一步取得了13-15%的绝对提升。总体而言，SciResearcher为前沿科学推理的自动数据构建引入了一种新范式，并为未来的科学智能体提供了一条可扩展的路径。

英文摘要

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.

2605.01188 2026-05-27 cs.CL

Compute Optimal Tokenization

计算最优分词

Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer

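AI summary: By training 988 BLT models, the paper studies how compression rate (average bytes per token) affects scaling trends, finding that in compute-optimal configurations the parameter count scales in proportion to data size in bytes, and that the optimal compression rate differs from the one obtained with BPE and decreases with compute.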

AI中文摘要

缩放定律能够优化数据量和语言模型大小的选择，但数据单元——token——对此关系的影响尚未充分探索。本文系统研究了由压缩率（即每token平均文本字节数）控制的token信息粒度如何影响缩放趋势。我们训练了988个潜在分词模型（BLT），参数规模从50M到7B，这些模型可以设置所需的压缩率。这种灵活性使我们能够研究压缩率远超过使用流行BPE分词器得到的每token 4.57字节的作用。实验表明，在计算最优配置中，模型参数量与以字节为单位的数据大小成比例，而不是通常认为的以token为单位（Kaplan et al., 2020; Hoffmann et al., 2022）。此外，我们发现最优压缩率不同于BPE得到的压缩率，并且随计算量增加而降低。这些发现普遍适用于潜在分词和子词分词，以及英语以外的其他语言，指导语言模型开发者选择分词方案以实现最大计算效率。

英文摘要

Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.
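
Compression rate as defined in the abstract (average bytes of text per token) is straightforward to compute. The sketch below assumes UTF-8 byte counts and takes the token list as given; the example segmentations are hypothetical and do not come from any real tokenizer.

```python
from typing import List

def compression_rate(text: str, tokens: List[str]) -> float:
    """Average number of UTF-8 bytes of text represented by each token."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(text.encode("utf-8")) / len(tokens)

def bytes_seen(n_tokens: int, rate: float) -> float:
    """Bytes of text covered by a token budget at a given compression rate."""
    return n_tokens * rate

# Hypothetical segmentations of short strings.
print(compression_rate("hello world", ["hello", " world"]))  # 5.5
print(compression_rate("naïve", ["na", "ï", "ve"]))          # 2.0 ('ï' is 2 bytes in UTF-8)
```

At the 4.57 bytes per token quoted for BPE, a budget of 1,000 tokens covers roughly 4,570 bytes of text, so two models trained on the same number of tokens but at different compression rates see different amounts of data when measured in bytes.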

2603.23985 2026-05-27 cs.LG

Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score

精简你的大语言模型：通过融合任务特定重要性分数的维度级全局剪枝

Jimyung Hong, Jaehyung Kim

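AI summary: Proposes DIET, a training-free dimension-wise structured pruning method that builds a global mask by majority voting over cross-task activation magnitudes, retaining task awareness while avoiding costly training and markedly improving post-pruning accuracy on Gemma-2 models.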

Comments 14 pages, 10 figures. Code available at https://github.com/Jimmy145123/DIET

AI中文摘要

大型语言模型（LLMs）展现了卓越的能力，但其庞大的规模给实际部署带来了重大挑战。结构化剪枝通过移除整个维度或层提供了一种有前景的解决方案，然而现有方法面临关键权衡：任务无关方法无法适应任务特定需求，而任务感知方法需要昂贵的训练来学习任务适应性。我们提出DIET（通过融合任务重要性分数进行维度级全局剪枝），一种无需训练的结构化剪枝方法，结合了维度级粒度与任务感知选择。DIET仅使用每个任务100个样本跨任务分析激活幅度，然后应用多数投票构建单个全局掩码。DIET不需要预计算或训练的高成本。在Gemma-2 2B和9B模型上的七个零样本基准测试实验证明了DIET的有效性；例如，在Gemma-2 2B上20%稀疏度下，与先前最先进的结构化剪枝方法相比，DIET实现了近10%的平均准确率提升。这一优势在不同稀疏度和模型规模下持续存在，使DIET成为结构化LLM剪枝的实用且稳健的选择。

英文摘要

Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions or layers, yet existing methods face critical trade-offs: task-agnostic approaches cannot adapt to task-specific requirements, while task-aware methods require costly training to learn task adaptability. We propose DIET (Dimension-wise global pruning of LLMs via merging Task-wise importance scores), a training-free structured pruning method that combines dimension-level granularity with task-aware selection. DIET profiles activation magnitudes across tasks using only 100 samples per task, then applies majority voting to construct a single global mask. DIET does not require large costs from pre-computation or training. Experiments on seven zero-shot benchmarks using Gemma-2 2B and 9B models demonstrate the effectiveness of DIET; for example, at 20% sparsity on Gemma-2 2B, DIET achieves near 10% average accuracy improvement, compared to previous state-of-the-art structured pruning methods. This advantage persists across various sparsity levels and model scales, positioning DIET as a practical and robust choice for structured LLM pruning.
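
The profile-then-vote step can be sketched as follows. The details are assumptions rather than the DIET implementation: importance is taken to be the mean absolute activation per dimension, each task keeps its top dimensions at the target sparsity, and a dimension survives if more than half of the tasks keep it. The activation values are hypothetical.

```python
from typing import List

def task_mask(activations: List[List[float]], sparsity: float) -> List[bool]:
    """Per-task keep mask: the top (1 - sparsity) fraction of dimensions by mean |activation|."""
    n_dims = len(activations[0])
    importance = [
        sum(abs(row[d]) for row in activations) / len(activations)
        for d in range(n_dims)
    ]
    n_keep = round(n_dims * (1.0 - sparsity))
    ranked = sorted(range(n_dims), key=lambda d: importance[d], reverse=True)
    kept = set(ranked[:n_keep])
    return [d in kept for d in range(n_dims)]

def global_mask(per_task_activations: List[List[List[float]]], sparsity: float) -> List[bool]:
    """Majority vote across tasks: keep a dimension if more than half of the tasks keep it."""
    masks = [task_mask(acts, sparsity) for acts in per_task_activations]
    n_tasks = len(masks)
    n_dims = len(masks[0])
    return [sum(m[d] for m in masks) * 2 > n_tasks for d in range(n_dims)]

# Hypothetical activations: three tasks, one sample each, four hidden dimensions.
tasks = [
    [[5.0, 4.0, 1.0, 0.0]],  # keeps dims 0 and 1
    [[5.0, 0.0, 4.0, 1.0]],  # keeps dims 0 and 2
    [[4.0, 5.0, 0.0, 1.0]],  # keeps dims 0 and 1
]

print(global_mask(tasks, sparsity=0.5))  # [True, True, False, False]
```

One consequence of voting is that the merged mask need not hit the target sparsity exactly; how the actual method resolves that is not stated in the abstract.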

2601.21972 2026-05-27 cs.AI cs.DC cs.MA

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

基于多智能体Actor-Critic的分散式LLM协作学习

Shuo Liu, Tianle Chen, Ryan Amiri, Christopher Amato

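AI summary: For optimizing decentralized LLM collaboration, proposes two multi-agent actor-critic methods (CoLLM-CC and CoLLM-DC); experiments show that on long-horizon or sparse-reward tasks the centralized-critic method outperforms both Monte Carlo methods and the decentralized-critic method.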

AI中文摘要

近期工作探索了通过多智能体强化学习（MARL）优化LLM协作。然而，大多数MARL微调方法依赖于预定义的执行协议，通常需要集中式执行。分散式LLM协作在实践中更具吸引力，因为智能体可以并行运行推理并灵活部署。此外，当前方法使用蒙特卡洛方法进行微调，这存在高方差问题，因此需要更多样本才能有效训练。Actor-Critic方法在MARL中常用于处理这些问题；因此，我们开发了多智能体Actor-Critic（MAAC）方法来优化分散式LLM协作。本文分析了这些MAAC方法何时以及为何有益。我们提出了两种MAAC方法：带有集中式Critic的CoLLM-CC和带有分散式Critic的CoLLM-DC。我们在写作、编码和游戏领域的实验表明，在短时域和密集奖励设置中，蒙特卡洛方法和CoLLM-DC可以达到与CoLLM-CC相当的性能。然而，在长时域或稀疏奖励任务中，它们均不如CoLLM-CC，其中蒙特卡洛方法需要更多样本，而CoLLM-DC难以收敛。

英文摘要

Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues; thus, we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose 2 MAAC approaches, CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.

2605.00412 2026-05-27 cs.AI cs.RO

Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

物理原生世界模型：生成式世界建模的哈密顿视角

Sen Cui, Jingheng Ma

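AI summary: Proposes Hamiltonian World Models, which use a structured latent phase space and Hamiltonian-inspired dynamics to aim for physically reliable, action-controllable, and long-horizon stable future prediction for embodied decision making.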

AI中文摘要

世界模型最近重新成为具身智能、机器人、自动驾驶和基于模型的强化学习的核心范式。然而，当前的世界模型研究通常由三条部分分离的路线主导：强调视觉未来合成的2D视频生成模型、强调空间重建的3D场景中心模型，以及强调抽象预测表示的JEPA类潜变量模型。每条路线都取得了重要进展，但它们仍然难以提供物理可靠、动作可控且长期稳定的预测以支持具身决策。在本文中，我们认为世界模型的瓶颈不再仅仅是它们能否生成逼真的未来，而是这些未来是否物理上有意义且对动作有用。我们提出哈密顿世界模型作为世界建模的一个物理基础视角。关键思想是将观测编码到结构化的潜相空间中，通过带有控制、耗散和残差项的哈密顿动力学演化潜状态，将预测轨迹解码为未来观测，并利用生成的轨迹进行规划。我们讨论了哈密顿结构如何提高可解释性、数据效率和长期稳定性，同时也指出了在涉及摩擦、接触、非保守力和可变形物体的真实机器人场景中的实际挑战。

英文摘要

World models have recently re-emerged as a central paradigm for embodied intelligence, robotics, autonomous driving, and model-based reinforcement learning. However, current world model research is often dominated by three partially separated routes: 2D video-generative models that emphasize visual future synthesis, 3D scene-centric models that emphasize spatial reconstruction, and JEPA-like latent models that emphasize abstract predictive representations. While each route has made important progress, they still struggle to provide physically reliable, action-controllable, and long-horizon stable predictions for embodied decision making. In this paper, we argue that the bottleneck of world models is no longer only whether they can generate realistic futures, but whether those futures are physically meaningful and useful for action. We propose Hamiltonian World Models as a physically grounded perspective on world modeling. The key idea is to encode observations into a structured latent phase space, evolve the latent state through Hamiltonian-inspired dynamics with control, dissipation, and residual terms, decode the predicted trajectory into future observations, and use the resulting rollouts for planning. We discuss how Hamiltonian structure may improve interpretability, data efficiency, and long-horizon stability, while also noting practical challenges in real-world robotic scenes involving friction, contact, non-conservative forces, and deformable objects.
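
As a rough illustration of Hamiltonian-inspired dynamics with control and dissipation terms, the sketch below steps a one-dimensional phase-space state (q, p) with a semi-implicit (symplectic) Euler update. The quadratic Hamiltonian, the linear damping term, and all parameter values are assumptions chosen for the example; the paper's learned latent dynamics and residual terms are not reproduced here.

```python
from typing import Tuple

def hamiltonian(q: float, p: float, m: float = 1.0, k: float = 1.0) -> float:
    """Assumed energy function: H(q, p) = p^2 / (2m) + k q^2 / 2."""
    return p * p / (2.0 * m) + 0.5 * k * q * q

def step(
    q: float, p: float, dt: float,
    m: float = 1.0, k: float = 1.0, gamma: float = 0.0, u: float = 0.0,
) -> Tuple[float, float]:
    """One semi-implicit Euler step of
        dq/dt = dH/dp,   dp/dt = -dH/dq - gamma * p + u
    where gamma * p is the dissipation term and u is the control input."""
    p = p + dt * (-k * q - gamma * p + u)  # momentum update first
    q = q + dt * (p / m)                   # position update uses the new momentum
    return q, p

def rollout(q: float, p: float, n_steps: int, dt: float = 0.01, gamma: float = 0.0) -> Tuple[float, float]:
    """Roll the state forward for n_steps with no control input."""
    for _ in range(n_steps):
        q, p = step(q, p, dt, gamma=gamma)
    return q, p

e0 = hamiltonian(1.0, 0.0)
e_conservative = hamiltonian(*rollout(1.0, 0.0, n_steps=1000))
e_damped = hamiltonian(*rollout(1.0, 0.0, n_steps=1000, gamma=0.5))
```

Without dissipation the energy stays close to its initial value over the whole rollout, which is the kind of long-horizon stability the abstract associates with Hamiltonian structure; with `gamma > 0` the energy decays, standing in for non-conservative effects such as friction.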

2604.27604 2026-05-27 cs.CV cs.CE

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

解码科学实验图像：用于感知、理解和推理的SPUR基准

Junpeng Ding, Zichen Tang, Haihong E, Mengyuan Ji, Yang Liu, Haolin Tian, Haiyang Sun, Pengqi Sun, Yang Xu, Yichen Liu, Haocheng Gao, Zijie Xi, Ruomeng Jiang, Peizhi Zhao, Rongjin Li, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Jintong Chen, Siying Lin

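AI summary: Proposes the SPUR benchmark, which uses 4,264 question-answer pairs to evaluate multimodal large language models on fine-grained perception, cross-panel relation understanding, and expert-level reasoning over scientific experimental images, revealing the gap between current models and expert level.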

Comments Accepted to ACL 2026 Main Conference

AI中文摘要

我们引入了SPUR，一个全面的科学实验图像感知、理解和推理基准，包含来自1084张专家精选图像的4264个问答对。SPUR具有三个关键创新：（1）面板级细粒度感知：评估多模态大语言模型（MLLMs）在六个细粒度面板类型上的三个维度（数值、形态和信息定位）的视觉感知能力；（2）跨面板关系理解：利用平均每样本14.3个面板的复杂图像评估MLLMs解读复杂跨面板关系的能力；（3）专家级推理：跨五个实验范式评估定性和定量推理，以确定模型是否能像人类专家一样从证据中推断结论。对20个MLLMs和四种多模态思维链（MCoT）方法的全面评估表明，当前模型远未达到科学图像解释的专家级要求，凸显了人工智能科学（AI4S）研究的关键瓶颈。

英文摘要

We introduce SPUR, a comprehensive benchmark for scientific experimental image perception, understanding, and reasoning, comprising 4,264 question-answering (QA) pairs derived from 1,084 expert-curated images. SPUR features three key innovations: (1) Panel-Level Fine-Grained Perception: evaluating the visual perception of multimodal large language models (MLLMs) across three dimensions (numerical, morphological, and information localization) on six fine-grained panel types; (2) Cross-Panel Relation Understanding: utilizing complex images with an average of 14.3 panels per sample to evaluate MLLMs' ability to decipher intricate cross-panel relations; (3) Expert-Level Reasoning: assessment of qualitative and quantitative reasoning across five experimental paradigms to determine if models can infer conclusions from evidence as human experts do. Comprehensive evaluation of 20 MLLMs and four multimodal Chain-of-Thought (MCoT) methods reveals that current models fall significantly short of the expert-level requirements for scientific image interpretation, underscoring a critical bottleneck in AI for Science (AI4S) research.

2604.27292 2026-05-27 cs.AI

The Two Boundaries: Why Behavioral AI Governance Fails Structurally

两个边界：为什么行为性AI治理在结构上失败

Alan L. McCann

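AI summary: Presents a formal framework that uses Rice's theorem to show that behavioral AI governance has a structural gap that is undecidable in general, and defines coterminous governance as a testable criterion, with structural governance achieved by separating computation from effect.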

Comments 17 pages, 2 figures. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. v2: corrected cross-reference identifiers for companion papers; updated license

AI中文摘要

每个产生效应的系统都有两个边界：它能做什么（表达能力）和治理覆盖什么（治理）。在几乎所有已部署的AI系统中，这些边界是独立定义的，从而产生三个区域：受治理的能力（唯一有用的区域）、未受治理的能力（风险）以及针对不存在的能力的治理策略（作秀）。三个区域中有两个是失败模式。我们关注效应的治理：AI系统在世界中执行的动作（API调用、数据库写入、工具调用）。这不同于模型输出的治理（内容质量、偏见、公平性），后者在不同层面运作并需要不同的机制。我们提出了一个形式化框架来分析这种结构性差距。Rice定理（1953）证明，对于任何试图行为性地治理效应的图灵完备架构，该差距在一般情况下是不可判定的：没有算法可以决定任意程序的非平凡语义属性，包括属性“该程序的效应符合治理策略”。我们定义了共延治理：一种系统属性，其中表达能力边界等于治理边界。我们证明共延治理需要架构决策（将计算与效应分离），而不是事后添加的治理层。我们表明，在这种分离下的结构治理包含了独立的治理基础设施：治理检查成为执行流水线的一部分，而不是与之并行的第二系统。我们提出共延治理作为任何AI治理系统的可测试标准：要么两个边界可证明相同，要么风险和作秀在结构上不可避免。证明在Coq中机械化（454个定理，36个模块，0个待证）。

英文摘要

Every system that performs effects has two boundaries: what it can do (expressiveness) and what governance covers (governance). In nearly all deployed AI systems, these boundaries are defined independently, creating three regions: governed capabilities (the only useful region), ungoverned capabilities (risk), and governance policies that address non-existent capabilities (theater). Two of the three regions are failure modes. We focus on the governance of effects: actions that AI systems perform in the world (API calls, database writes, tool invocations). This is distinct from the governance of model outputs (content quality, bias, fairness), which operates at a different level and requires different mechanisms. We present a formal framework for analyzing this structural gap. Rice's theorem (1953) proves the gap is undecidable in the general case for any Turing-complete architecture that attempts to govern effects behaviorally: no algorithm can decide non-trivial semantic properties of arbitrary programs, including the property "this program's effects comply with the governance policy." We define coterminous governance: a system property where the expressiveness boundary equals the governance boundary. We show that coterminous governance requires an architectural decision (separating computation from effect) rather than a governance layer added after the fact. We show that structural governance under this separation subsumes separate governance infrastructure: governance checks become part of the execution pipeline rather than a second system running alongside it. We propose coterminous governance as the testable criterion for any AI governance system: either the two boundaries are provably identical, or risk and theater are structurally inevitable. Proofs are mechanized in Coq (454 theorems, 36 modules, 0 admitted).
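
The three regions can be pictured with plain set operations. Treating each boundary as a finite set of named effects is a deliberate simplification for illustration (the undecidability argument concerns arbitrary programs, where no such enumeration is available); the effect names and function names are hypothetical.

```python
from typing import Dict, Iterable, Set

def governance_regions(capabilities: Iterable[str], policies: Iterable[str]) -> Dict[str, Set[str]]:
    """Split effects into the three regions described in the abstract."""
    caps, pols = set(capabilities), set(policies)
    return {
        "governed": caps & pols,  # effects the system can perform and governance covers
        "risk": caps - pols,      # effects the system can perform with no governance
        "theater": pols - caps,   # policies for effects the system cannot perform
    }

def is_coterminous(capabilities: Iterable[str], policies: Iterable[str]) -> bool:
    """True when the expressiveness boundary equals the governance boundary."""
    return set(capabilities) == set(policies)

# Hypothetical effect names for one deployed agent.
capabilities = {"api_call", "db_write", "tool_invoke"}
policies = {"api_call", "db_write", "send_fax"}

regions = governance_regions(capabilities, policies)
print(regions["risk"])     # {'tool_invoke'}
print(regions["theater"])  # {'send_fax'}
print(is_coterminous(capabilities, policies))  # False
```

Under this picture, coterminous governance is the case where both the risk and theater sets are empty.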
