arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.29622 2026-05-29 cs.LG physics.chem-ph

MōLe-Λ: Learning the Coupled-Cluster Response State for Energies, Gradients, and Properties

Andreas Burger, Luca Thiede, Abdulrahman Aldossary, Jorge A. Campos-Gonzalez-Angulo, Alex Zook, Jérôme Florian Gonthier, Alán Aspuru-Guzik

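AI summary: Proposes MōLe-Λ, a model that predicts the coupled-cluster response state by jointly learning left- and right-hand amplitudes, enabling efficient computation of energies, gradients, and a range of molecular properties.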

Comments
ICML 2026 AI4Physics
English abstract

Coupled-cluster (CC) theory is often considered the gold standard of quantum chemistry, but its high computational cost limits routine access to accurate energies, forces and response properties. While the right-hand $T$-amplitudes determine the correlated wavefunction, many practically important observables additionally require the left-hand $Λ$-amplitudes. We introduce MōLe-$Λ$, an extension of Molecular Orbital Learning (MōLe) that predicts the full ground-state coupled-cluster singles and doubles (CCSD) response state by jointly learning right-hand amplitudes $(T_1,T_2)$ and left-hand amplitudes $(Λ_1,Λ_2)$ from localized Hartree--Fock molecular orbitals. Architecturally, MōLe-$Λ$ extends MōLe with $Λ_1$ and $Λ_2$ readouts that mirror the symmetry constraints of the $T_1$ and $T_2$ heads, while preserving the original equivariant orbital encoder, odd sign-equivariant decoding, locality and size-extensivity. The resulting model yields accurate CC-quality energies and forces, while simultaneously recovering dipoles, quadrupoles, polarizabilities, the electron density, and 2-electron observables such as the pair density. We show that MōLe-$Λ$ further extends the speed advantage of MōLe over full CCSD while substantially expanding the accessible properties, providing a route to wavefunction-level surrogate models for correlated quantum chemistry.
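
For context, the reason the left-hand amplitudes matter can be stated with the standard coupled-cluster response (Lagrangian) functional. This is textbook CC theory rather than notation taken from the paper; $|\Phi_0\rangle$ (the Hartree-Fock reference), $H$ (the Hamiltonian), and $O$ (a generic observable) are symbols introduced here for illustration.

```latex
\begin{align}
  E_{\mathrm{CC}} &= \langle \Phi_0 |\, (1 + \Lambda)\, e^{-T} H\, e^{T} \,| \Phi_0 \rangle ,
  & T &= T_1 + T_2, \quad \Lambda = \Lambda_1 + \Lambda_2 , \\
  \langle O \rangle &= \langle \Phi_0 |\, (1 + \Lambda)\, e^{-T} O\, e^{T} \,| \Phi_0 \rangle .
\end{align}
```

At converged $T$ the $\Lambda$ terms in the energy multiply the amplitude equations, which vanish, so the energy needs only $T$. Expectation values such as dipoles and densities keep the $(1+\Lambda)$ factor, which is why a model that predicts only $(T_1,T_2)$ cannot supply them.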

2605.29615 2026-05-29 cs.CV cs.CL

DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?

Linhao Zhang, Aiwei Liu, Yuan Liu, Xiao Zhou

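AI summary: Proposes the DiffSpot benchmark, which generates controlled image pairs through CSS-property mutations to evaluate how well vision-language models detect subtle visual differences in web interfaces; the best model identifies only 40.7% of true changes.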

English abstract

Vision-language models (VLMs) have made strong progress on high-level image-text alignment, yet their ability to perceive subtle visual differences remains limited. We study this problem in rendered web interfaces, where localized visual changes are both a diagnostic test of fine-grained perception and a practical requirement for GUI agents and design tools. We introduce DiffSpot, a code-driven benchmark for open-ended spot-the-difference on web interfaces. DiffSpot constructs controlled image pairs by mutating a single CSS property of a target element in self-contained HTML, re-rendering the page, and recording the changed property, element, and mutation magnitude. A grounding gate retains only pairs whose rendered pixel difference is confined to the target element. The benchmark contains 4,400 pairs, including 3,900 has-diff pairs balanced across 13 CSS-property operators and three difficulty tiers, plus 500 no-diff pairs for hallucination control. Evaluating 13 frontier VLMs zero-shot, we find that even the best model identifies only 40.7% of true changes, with Hard-tier Recall below 23% for every model. DiffSpot further shows that difficulty is strongly property-dependent: across CSS operators, neither pixel magnitude nor CLIP distance reliably predicts Recall.
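
The grounding gate described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the benchmark's code: it assumes renders are small grayscale grids and that the target element's bounding box is known, and the names `changed_pixels`, `grounding_gate`, and the box format are hypothetical. Rejecting pairs with no visible change is also an assumption.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixels, row-major (assumed format)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def changed_pixels(before: Image, after: Image) -> List[Tuple[int, int]]:
    """Return (x, y) coordinates whose pixel values differ between two renders."""
    return [
        (x, y)
        for y, (row_a, row_b) in enumerate(zip(before, after))
        for x, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]


def grounding_gate(before: Image, after: Image, target: Box) -> bool:
    """Keep a pair only if something changed and every change lies inside the target box."""
    left, top, right, bottom = target
    diff = changed_pixels(before, after)
    if not diff:
        return False  # assumption: a mutation with no visible effect is discarded
    return all(left <= x < right and top <= y < bottom for x, y in diff)


# Hypothetical 4x4 renders; the target element occupies the central 2x2 block.
before = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
inside = [[0, 0, 0, 0], [0, 9, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
leaky = [[0, 0, 0, 7], [0, 9, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
box = (1, 1, 3, 3)

print(grounding_gate(before, inside, box))  # change confined to the element
print(grounding_gate(before, leaky, box))   # change spills outside the element
```

A mutation that triggers layout reflow elsewhere on the page would fail this check, which is the kind of pair the gate is meant to filter out.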

2605.29613 2026-05-29 eess.AS cs.SD

Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

Jeong Hun Yeo, Minsu Kim, Hyeongseop Rha, Yong Man Ro

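AI summary: Systematically evaluates three decoding strategies for ASR based on diffusion language models and proposes a negative-log-likelihood-based uncertainty measure to monitor decoding progress. Threshold-based strategies outperform fixed-number decoding in both accuracy and speed, and the static-threshold strategy matches autoregressive decoding accuracy at higher efficiency.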

English abstract

While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
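
The static-threshold scheme can be illustrated with a toy loop. This is a sketch under stated assumptions, not the paper's decoder: the model is replaced by a hypothetical `toy_predict` function, and the fallback that commits the single most confident token when nothing clears the threshold is an assumption added to guarantee progress.

```python
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[str, float]  # (token, confidence in [0, 1])
Predictor = Callable[[List[Optional[str]]], List[Prediction]]


def static_threshold_decode(
    predict: Predictor, length: int, threshold: float
) -> Tuple[List[str], int]:
    """Commit every masked position whose confidence clears a fixed threshold each round."""
    tokens: List[Optional[str]] = [None] * length  # None marks a masked position
    rounds = 0
    while any(t is None for t in tokens):
        rounds += 1
        preds = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is None]
        accepted = [i for i in masked if preds[i][1] >= threshold]
        if not accepted:
            # Assumed fallback: commit the most confident masked position.
            accepted = [max(masked, key=lambda i: preds[i][1])]
        for i in accepted:
            tokens[i] = preds[i][0]
    return [t for t in tokens if t is not None], rounds


# Hypothetical stand-in for a diffusion LM: most tokens are confident early,
# and the hard token becomes confident once its context is committed.
TARGET = ["the", "cat", "sat", "down"]
BASE_CONF = [0.95, 0.97, 0.50, 0.92]


def toy_predict(tokens: List[Optional[str]]) -> List[Prediction]:
    committed = sum(t is not None for t in tokens)
    return [
        (TARGET[i], min(1.0, BASE_CONF[i] + 0.15 * committed))
        for i in range(len(tokens))
    ]


decoded, rounds = static_threshold_decode(toy_predict, len(TARGET), threshold=0.9)
print(decoded, rounds)
```

On this toy input three tokens are committed in the first round and the remaining hard token in the second, whereas a fixed budget of one token per round would need four rounds.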

2605.29612 2026-05-29 cs.MA cs.CL

CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems

Ziyang Ma, Dingyi Zhang, Sichu Liang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou

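AI summary: Proposes CONCAT, a training-free, consensus- and confidence-driven ad hoc teaming framework that clusters initial answers, selects high-confidence leaders, and predicts collaboration benefits using Theory of Mind to organize multi-agent interaction dynamically, markedly improving efficiency and reducing latency.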

English abstract

Although large language model (LLM) based multi-agent systems (MAS) can solve complex tasks and outperform single-agent systems, they incur large computational overhead because of heavy communication between agents. Previous research has trained sparse multi-agent graphs or fine-tuned a planner to orchestrate the workflow better. However, such extra training introduces computational cost and limits MAS to specific domains, compromising their generalizability. In this paper, we propose CONCAT, a training-free multi-agent collaboration framework based on CONsensus and Confidence-driven Ad hoc Teaming to efficiently organize agent interactions. Specifically, agents are clustered based on their initial answers, and the leader of each cluster is selected based on the agents' confidence. Then, a heuristic function based on the Theory of Mind is designed to predict the collaboration benefit between every pair of leaders according to their answers and confidence. Finally, an ad hoc multi-agent network is organized after evicting a percentage of communications based on the predicted benefits. Experiments across three LLMs and three benchmarks show that CONCAT achieves up to 2.02x higher efficiency (accuracy/latency ratio) than LLM-Debate and outperforms training-aware methods such as AgentDropout, while reducing average latency by 50.1% on Qwen2.5-14B-Instruct, without any task-specific training.
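
The pipeline in the abstract (cluster by answer, pick a leader per cluster, score leader pairs, evict low-benefit links) might be sketched as below. The abstract does not give the Theory-of-Mind heuristic, so `placeholder_benefit` is an explicitly hypothetical stand-in (the confidence gap between two leaders), and all names and the agent tuple format are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Agent = Tuple[str, str, float]  # (name, initial answer, self-reported confidence)


def pick_leaders(agents: List[Agent]) -> Dict[str, Agent]:
    """Cluster agents by identical initial answer; the most confident member leads."""
    clusters: Dict[str, List[Agent]] = defaultdict(list)
    for agent in agents:
        clusters[agent[1]].append(agent)
    return {ans: max(members, key=lambda a: a[2]) for ans, members in clusters.items()}


def placeholder_benefit(a: Agent, b: Agent) -> float:
    """Hypothetical stand-in for the paper's heuristic: the leaders' confidence gap."""
    return abs(a[2] - b[2])


def build_network(agents: List[Agent], evict_ratio: float) -> List[Tuple[str, str]]:
    """Keep the highest-benefit leader pairs after evicting a fraction of the links."""
    leaders = list(pick_leaders(agents).values())
    edges = sorted(
        combinations(leaders, 2), key=lambda e: placeholder_benefit(*e), reverse=True
    )
    keep = len(edges) - int(len(edges) * evict_ratio)
    return [(a[0], b[0]) for a, b in edges[:keep]]


agents = [
    ("a1", "42", 0.9), ("a2", "42", 0.6),
    ("a3", "41", 0.5),
    ("a4", "40", 0.8), ("a5", "40", 0.3),
]
print(build_network(agents, evict_ratio=0.34))
```

With five agents and three distinct answers, only three leaders can communicate and one of the three possible links is evicted, which is where the latency saving relative to all-to-all debate would come from.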

2605.29610 2026-05-29 cs.CV cs.AI cs.LG

Learning Context-Conditioned Predicate Semantics via Prototype Feedback

NamGyu Jung, Chang Choi

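AI summary: Proposes AlignG, which uses prototype feedback to infer context-conditioned predicate semantics from an image's relation candidates and adjust relation representations, improving F@100 under SGDet by 1.4 on VG-150 and 2.7 on GQA-200.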

Comments
Accepted at ICML 2026. Code: https://github.com/Namgyu97/AlignG-SGG.pytorch
English abstract

In scene graph generation, a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches address this issue by decomposing predicates into multiple static prototypes or retrieving semantically similar exemplars. However, these strategies keep predicate representations static and cannot reorganize semantics to reflect image-specific evidence, leading to systematic confusions in ambiguous contexts. We propose AlignG, which learns context-conditioned predicate semantics via prototype feedback. AlignG infers context-conditioned predicate semantics from the relation candidates within each image and feeds the adapted semantics back to recalibrate relation representations. The learning objective anchors this adaptation to global semantic centers, preventing semantic drift while still allowing selective reorganization when the scene provides consistent relational cues. Experiments on VG-150 and GQA-200 show consistent improvements over state-of-the-art baselines, with F@100 improvements of +1.4 on VG-150 and +2.7 on GQA-200 under SGDet. We further visualize per-image prototype similarity shifts and observe coherent context-dependent reorganization where prototypes selectively merge or separate predicates according to scene evidence. The code is available at https://github.com/Namgyu97/AlignG-SGG.pytorch.

2605.29607 2026-05-29 cs.LG

Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models

Heqiang Qi, Wei Huang, Mingyuan Bai, Xiangming Meng

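AI summary: Proposes CLAD, which aggregates adjacent high-confidence tokens into clusters and estimates inter-cluster dependencies from self-attention maps, giving training-free cluster-level parallel decoding for masked diffusion language models with 1.77x to 8.47x speedups while preserving task accuracy.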

English abstract

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit this granularity and observe that reliable predictions often emerge as contiguous high-confidence spans, suggesting that the unit of parallel commitment can be larger than a single token. We first group adjacent high-confidence candidates into confidence-induced clusters (CICs) as span-level update units. We then use self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware selection of mutually compatible CICs for parallel commitment. This yields CLAD (Cluster-Level Attention-Guided Decoding), a training-free cluster-level decoder for MDLMs. Experiments on LLaDA and Dream model families across four reasoning and code-generation benchmarks show that CLAD achieves 1.77x--8.47x speedups over Vanilla decoding while maintaining broadly comparable task accuracy in most settings.
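
The two steps named in the abstract, grouping adjacent high-confidence positions into clusters and then selecting mutually compatible clusters, can be sketched as follows. This is an illustration only: the dependency scores are supplied by hand rather than derived from attention maps, and the greedy rule, the `limit` parameter, and all function names are assumptions.

```python
from typing import Dict, FrozenSet, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) positions


def confidence_clusters(conf: Dict[int, float], tau: float) -> List[Span]:
    """Group adjacent masked positions with confidence >= tau into contiguous spans."""
    spans: List[Span] = []
    start = prev = None
    for pos in sorted(p for p, c in conf.items() if c >= tau):
        if prev is not None and pos == prev + 1:
            prev = pos  # extend the current span
            continue
        if start is not None:
            spans.append((start, prev))
        start = prev = pos
    if start is not None:
        spans.append((start, prev))
    return spans


def select_compatible(
    spans: List[Span],
    conf: Dict[int, float],
    dependency: Dict[FrozenSet[Span], float],
    limit: float,
) -> List[Span]:
    """Greedily commit spans by mean confidence, skipping spans that conflict with chosen ones."""

    def mean_conf(span: Span) -> float:
        s, e = span
        return sum(conf[p] for p in range(s, e + 1)) / (e - s + 1)

    chosen: List[Span] = []
    for span in sorted(spans, key=mean_conf, reverse=True):
        if all(dependency.get(frozenset((span, o)), 0.0) <= limit for o in chosen):
            chosen.append(span)
    return sorted(chosen)


# Hypothetical confidences for masked positions (position 5 is already committed).
conf = {2: 0.95, 3: 0.92, 4: 0.40, 6: 0.91, 7: 0.97, 9: 0.93}
spans = confidence_clusters(conf, tau=0.9)
# Hand-written dependency score standing in for an attention-derived estimate.
dependency = {frozenset(((2, 3), (6, 7))): 0.8}
print(spans)
print(select_compatible(spans, conf, dependency, limit=0.5))
```

In this toy case the span (2, 3) is deferred to a later step because it depends strongly on (6, 7), while (6, 7) and (9, 9) are committed together.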

2605.29606 2026-05-29 cs.AI cs.IR

HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering

Joongmin Shin, Gyuho Shim, Jeongbae Park, Jaehyung Seo, Heuiseok Lim

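AI summary: Proposes HiKEY, a hierarchical multimodal retrieval framework built on document hierarchy, which uses document hierarchical parsing and a coarse-to-fine retrieval strategy to address routing failure and evidence fragmentation in large-scale industrial corpora, improving retrieval recall by up to 12.9% and end-to-end QA performance by up to 6.8% on ODQA benchmarks.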

Comments
Accepted to ACL2026 Main
English abstract

Retrieval-augmented generation (RAG) for document-based Open-domain Question Answering (ODQA) on large-scale industrial corpora faces two critical bottlenecks: routing failure in locating the correct document and evidence fragmentation in integrating scattered information. Existing approaches relying on flat text chunks or page-level images inherently struggle to (i) precisely pinpoint the target document among thousands of candidates and (ii) organically connect multimodal evidence, such as tables and figures, within a limited token budget. To address these challenges, we propose HiKEY, a hierarchical tree-based multimodal retrieval framework that elevates document hierarchy to a first-class retrieval signal. Instead of simple chunking, HiKEY reconstructs a logical heterogeneous graph via Document Hierarchical Parsing (DHP), explicitly encoding parent-child relationships. Adopting a hierarchical coarse-to-fine strategy, the framework (1) performs global routing to rapidly prune the search space using hierarchical indexing, and (2) conducts fine-grained retrieval to rank sections by employing a multimodal fusion strategy that captures the most discriminative evidence. Finally, HiKEY assembles a token-efficient evidence subgraph via a hybrid structural-semantic packing strategy. Experiments on ODQA benchmarks demonstrate that HiKEY significantly outperforms page- and chunk-based baselines, improving retrieval recall by up to 12.9% and end-to-end QA performance by up to 6.8%.

2605.29605 2026-05-29 cs.RO

VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models

Dehao Huang, Aoxiang Gu, Chengjie Zhang, Bolin Zou, Wenlong Dong, Zilang Cen, Yue Wang, Hong Zhang

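AI summary: Proposes VLAConf, a one-class discriminative confidence framework that uses frozen pretrained VLA internal representations and a lightweight confidence head to estimate step-wise anomaly scores in a single forward pass, giving efficient, architecture-agnostic task-success confidence estimation.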

Comments
11 pages, 7 figures
English abstract

Confidence estimation for Vision-Language-Action (VLA) models is essential for robots to perform manipulation tasks in the open world, providing crucial signals for risk-sensitive decision-making and failure anticipation. Existing confidence estimation methods typically rely on ensemble-based paradigms or action-token probabilities to predict the likelihood of task success. However, they still encounter challenges in computational efficiency and cross-architecture generalizability. These methods usually require repeated sampling, leading to inference inefficiency, and are restricted to VLA models with discrete action outputs, making them difficult to apply to continuous action spaces. To address this issue, we propose VLAConf, a one-class discriminative confidence framework. By leveraging frozen pretrained VLA internal representations, VLAConf directly estimates step-wise anomaly scores in a single forward pass using a lightweight confidence head, thereby eliminating the overhead of exhaustive resampling. We additionally use step-conditioned modeling to encode rollout-phase information along the manipulation trajectory. Experiments on the LIBERO benchmark demonstrate that VLAConf significantly improves the quality of the confidence signal constructed for post-hoc calibration, outperforming existing baselines by a large margin in inference efficiency. The effectiveness of VLAConf is further validated in real-robot experiments. To access the source code and supplementary videos, visit https://sites.google.com/view/vlaconf.

2605.29602 2026-05-29 cs.CV

CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning

Xiang Fang, Wanlong Fang, Changshuo Wang

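AI summary: Proposes the CogniVerse framework, which combines a cognitive reflection module, a multimodal retrieval module based on Riemannian manifold alignment, and an optimal-transport hierarchical generation module to address noisy retrieval, cross-modal semantic misalignment, and incoherent generation in multimodal retrieval-augmented generation.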

Comments
Accepted in CVPR 2026
English abstract

Multi-modal Retrieval-Augmented Generation (MMRAG) has emerged as a powerful paradigm for enhancing Multimodal Large Language Models in knowledge-intensive question answering by integrating external visual, textual, and structural knowledge. However, existing MMRAG frameworks suffer from critical limitations, including noisy and irrelevant retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation across local and global contexts. We introduce CogniVerse, a novel MMRAG framework that addresses these challenges through a cognitive-inspired, mathematically rigorous approach. Drawing from human-like reasoning, CogniVerse integrates three synergistic components: (1) a Cognitive Reflection Module that dynamically assesses retrieval necessity and filters relevant multi-modal content, reducing noise and computational overhead; (2) a Multi-modal Retrieval Module that aligns embeddings in a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory, ensuring precise and coherent retrieval; and (3) a Hierarchical Generation Module that employs an optimal transport-based loss to balance token-level accuracy and global semantic coherence. Extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency.

2605.29601 2026-05-29 cs.CL cs.AI cs.LG

Training Deliberative Monitors for Black-Box Scheming Detection

Aditya Sinha, Akshat Naik, Victor Gillioz, Simon Storf, Kilian Merkelbach, Rich Barton-Cooper, Axel Højmark, Marius Hobbhahn

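AI summary: Proposes an action-trajectory-based deliberative monitoring method that trains open-weight models by distilling the reasoning of frontier models, detecting agent scheming and sabotage at low cost and with high accuracy.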

English abstract

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable or expensive in deployment. In this work, we study action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without accessing the monitored agent's reasoning or model internals. Our method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. We train on five datasets, and evaluate across six out-of-distribution agentic misalignment benchmarks. We show that applying our method to Qwen3.5-27B yields higher performance than all low-cost frontier models as prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5) and than Gemini 2.5 Pro, while also achieving lower marginal inference cost (token-metered USD per 1,000 evaluations). Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, and Claude Opus 4.6) achieve higher performance but at roughly $16$--$34\times$ higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost--performance Pareto frontier among the monitors we evaluate, providing practical low-cost, low-FPR alternatives to prompted frontier models.

2605.29599 2026-05-29 cs.RO cs.CV

How to Relieve Distribution Shifts in Semantic Segmentation for Off-Road Environments

Ji-Hoon Hwang, Daeyoung Kim, Hyung-Suk Yoon, Dong-Wook Kim, Seung-Woo Seo

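AI summary: Proposes the ST-Seg framework, which uses style expansion and texture regularization to mitigate distribution shifts caused by source-target domain discrepancies and sensor corruption in off-road scenes, improving the robustness of semantic segmentation.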

Journal ref
IEEE Robotics and Automation Letters, vol. 10, issue. 5, pp. 4500-4507, 2025
Comments
8 pages, 6 figures. Accepted to IEEE Robotics and Automation Letters (RA-L). © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
English abstract

Semantic segmentation is crucial for autonomous navigation in off-road environments, enabling precise classification of surroundings to identify traversable regions. However, distinctive factors inherent to off-road conditions, such as source-target domain discrepancies and sensor corruption from rough terrain, can result in distribution shifts that alter the data differently from the trained conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST-Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST-Seg offers an intuitive approach for distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style-augmented learning through a deep texture manifold. Experiments across various distribution-shifted target domains demonstrate the effectiveness of ST-Seg, with substantial improvements over existing methods. These results highlight the robustness of ST-Seg, enhancing the real-world applicability of semantic segmentation for off-road navigation.

2605.29592 2026-05-29 cs.CV

Non-Forgetting Knowledge Allocation with Bi-level Competition for Class-Incremental Learning

Xiang Tan, Run He, Yawen Cui, Mengchen Zhao, Yan Wu, Tianyi Chen, Huiping Zhuang, Xiaonan Luo, Guanbin Li

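AI summary: To address uneven adapter knowledge allocation and forgetting in class-incremental learning with pre-trained models, proposes Non-Forgetting Allocation with Bi-Level Competition (NoFA-BC), which builds a non-forgetting allocator via recursive least squares and introduces intra-task Winner-Takes-All and inter-task Last-Ones-Fall mechanisms to improve adapter utilization.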

English abstract

Class-Incremental Learning (CIL) with pre-trained models (PTMs) aims to sequentially adapt PTMs to new categories without forgetting old knowledge. Built upon PTMs, existing adapter-based methods mainly train models via distinct task-specific adapters and apply a uniform knowledge allocation to each adapter during inference. However, this allocation mechanism ignores task discrepancy and leads to suboptimal utilization of adapters. Also, under the CIL constraint, an allocator is prone to forgetting as tasks evolve. To address these issues, we propose Non-Forgetting Allocation with Bi-Level Competition (NoFA-BC). NoFA-BC constructs a non-forgetting allocator (NFA) by transforming the allocator training into a recursive least-squares problem and achieves an allocator equivalent to that trained with all data. Based on the NFA, a Bi-Level Competition (BLC), including an intra-task Winner-Takes-All (WTA) mechanism and an inter-task Last-Ones-Fall (LOF) elimination, is proposed to provide better allocation of adapter knowledge. WTA extracts the most significant logit within a task to represent the adapter's contribution, and LOF suppresses the irrelevant adapters. With BLC, the participation ratio of each adapter can be tailored for each input. Moreover, a Stability Enhancement (SE) process is incorporated to further improve the performance of old tasks.

2605.29591 2026-05-29 cs.AI

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

Yizhuo Lu, Changde Du, Qingyu Shi, Hang Chen, Jie Peng, Liuyun Jiang, Shuangchen Zhao, Huiguang He

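AI summary: Proposes the Mind-Omni framework, which uses a discrete diffusion paradigm to unify seven encoding and decoding tasks, converts continuous brain signals into discrete tokens with a Brain Tokenizer to enable multimodal interaction, and builds a brain question answering instruction-tuning dataset, matching or exceeding specialized models on several tasks.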

English abstract

Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-task models, which curtails versatility and neglects inter-task synergies. To address this, we propose Mind-Omni, the first versatile framework that unifies seven distinct encoding and decoding tasks through a discrete diffusion paradigm. At its core is a novel Brain Tokenizer that transforms heterogeneous, continuous brain signals into standardized, discrete tokens. This enables direct, token-level interactions for mutual understanding and generation between any two or more modalities within a shared semantic space. To unlock advanced reasoning capabilities, we further curate a specialized Brain Question Answering (BQA) instruction-tuning dataset. Our model not only establishes a new state-of-the-art among multi-task unified frameworks but also provides strong evidence for multi-task synergy. By demonstrating performance competitive with, and at times superior to, larger specialized models, our work offers a powerful new paradigm for neural modeling and paves the way for foundation models of neural activity. The code is publicly available at https://github.com/ReedOnePeck/Mind-Omni.

2605.29587 2026-05-29 q-bio.QM cs.LG

FPLIER: Federated Pathway-Level Information Extractor

Daniele Malpetti, Christian Berchtold, Francesco Gualdi, Marco Scutari, Laura Azzimonti, Francesca Mangili

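AI summary: Proposes FPLIER, a federated learning framework that performs pathway-level factorization on distributed gene expression data through secure aggregation, and shows that privacy risk is governed by the rank of the training expression matrix.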

Comments
Accepted for publication at the ACM BCB '26 conference
English abstract

In transcriptomics, gene-set-aware factorization methods such as the Pathway Level Information Extractor (PLIER) are most effective when trained on large, heterogeneous expression compendia. Yet, many clinically relevant cohorts cannot be pooled into a single dataset due to privacy and governance constraints. We present FPLIER, a federated extension of PLIER that enables distributed training across multiple data holders while incorporating publicly available datasets. Through secure aggregation, FPLIER produces training updates algebraically equivalent to those of a centralized pooled-data approach while keeping expression data local. We evaluate FPLIER across multiple scenarios in two simulated consortia (from the K-CLIER and MultiPLIER studies) and demonstrate stable convergence. We further conduct a systematic analysis of membership inference attacks targeting both intermediate training statistics and the released model. Our results show that privacy risk is governed by the rank of the training expression matrix. Incorporating public data or reducing data dimensionality increases this rank, moving the system toward a full-rank regime in which training and non-training samples become indistinguishable to the attacker, and membership-inference performance approaches random guessing.
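
The claim that securely aggregated updates are algebraically equivalent to pooled ones can be illustrated with generic pairwise additive masking. The abstract does not say which secure aggregation protocol FPLIER uses, so this is a sketch of the general idea only: a single seeded generator simulates the pairwise shared secrets, updates are assumed to be fixed-point integers, and `MODULUS`, `masked_updates`, and `aggregate` are hypothetical names.

```python
import random
from typing import List

MODULUS = 2**61 - 1  # assumed modulus for fixed-point integer arithmetic


def masked_updates(updates: List[List[int]], seed: int = 0) -> List[List[int]]:
    """Add pairwise masks that cancel in the sum, hiding each client's own update."""
    rng = random.Random(seed)  # stands in for secrets shared by each client pair
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] = (masked[i][k] + mask[k]) % MODULUS  # client i adds
                masked[j][k] = (masked[j][k] - mask[k]) % MODULUS  # client j subtracts
    return masked


def aggregate(vectors: List[List[int]]) -> List[int]:
    """Coordinate-wise sum modulo MODULUS, as the server would compute it."""
    return [sum(col) % MODULUS for col in zip(*vectors)]


# Hypothetical per-site sufficient statistics for one training step.
updates = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]
masked = masked_updates(updates)
print(aggregate(masked) == aggregate(updates))
```

The server sees only the masked vectors, yet their sum equals the sum of the true updates, so the resulting training step matches what a centralized computation over the pooled statistics would produce.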

2605.29586 2026-05-29 cs.AI

FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification

Silu Panda

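AI summary: Proposes the FinVerBench benchmark, which uses a four-category error taxonomy and SEC 10-K XBRL data to evaluate LLMs on verifying the numerical consistency of financial statements; calibration and rendering choices significantly affect the results, and the work emphasizes construct validity over a final leaderboard.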

Comments
37 pages, 9 figures
English abstract

We introduce FinVerBench, a benchmark and validity study for financial statement verification: determining whether a set of corporate financial statements is numerically consistent from the information shown to the model. FinVerBench is built from SEC 10-K XBRL filings for 43 S&P 500 companies and defines a four-category error taxonomy covering arithmetic, cross-statement linkage, year-over-year, and magnitude perturbations. We attempt fifteen contemporary LLM evaluations and report fourteen complete runs; a Gemini 2.5 Pro run is excluded from the main comparison because 40/108 gateway calls failed. All binary metrics exclude underdetermined positive instances whose perturbed line item is not rendered, leaving a 105-instance observable diagnostic subset (43 clean, 62 error-injected). Under the original guided-checklist prompt on the unrounded diagnostic subset, nine of fourteen complete LLM runs produce 95-100% false positives on clean statements, while one run achieves 0% observed false positives. Benchmark rendering choices materially affect measured recall: on a realistic rounded variant of the same observable subset, the calibrated model's recall is 79.0% with 0% observed FPR, compared with 100.0% recall on the unrounded diagnostic variant. These results support a construct-validity conclusion rather than a final leaderboard: financial statement verification is not merely arithmetic detection, but calibrated judgment under incomplete observability, prompt-induced assumptions, and realistic numerical rendering. FinVerBench and all code are publicly available.
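
One plausible way rounding interacts with verification is shown in the toy check below. It is not FinVerBench code: the amounts are invented, the identity checked (assets = liabilities + equity) is just one example of an arithmetic constraint, and the function names and tolerance rule are assumptions.

```python
from typing import Sequence


def round_to_unit(value: int, unit: int) -> int:
    """Round a non-negative amount to the nearest multiple of `unit`, expressed in units."""
    return (value + unit // 2) // unit


def is_consistent(parts: Sequence[int], total: int, tolerance: int = 0) -> bool:
    """Check that the parts sum to the total within an absolute tolerance."""
    return abs(sum(parts) - total) <= tolerance


def rounding_tolerance(n_values: int) -> int:
    """Largest whole-unit gap that rounding alone can create across n rounded values."""
    return n_values // 2  # each value is off by at most half a unit


UNIT = 1000  # report in millions when raw amounts are in thousands (assumed)
liabilities, equity, assets = 1_234_400, 2_345_400, 3_579_800  # invented, thousands USD

rounded_parts = [round_to_unit(liabilities, UNIT), round_to_unit(equity, UNIT)]
rounded_assets = round_to_unit(assets, UNIT)
tol = rounding_tolerance(3)

# A clean statement fails an exact check once rounded, and passes with a tolerance.
print(is_consistent(rounded_parts, rounded_assets))
print(is_consistent(rounded_parts, rounded_assets, tol))

# A small injected error is visible unrounded but hidden by rounding.
perturbed_assets = assets - 600
print(is_consistent([liabilities, equity], perturbed_assets))
print(is_consistent(rounded_parts, round_to_unit(perturbed_assets, UNIT), tol))
```

An exact-match checker therefore raises false positives on clean rounded statements, while a tolerance-aware checker necessarily misses perturbations smaller than the rounding noise, which is consistent with recall being lower on a rounded variant than on an unrounded one.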

2605.29585 2026-05-29 cs.CL

World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models

语言中的世界模型:审计视觉语言模型中的物理状态转换承诺

Emmanuelle Bourigault

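AI summary: Proposes the WMW framework, which asks VLMs to output a structured trace (initial state, state transition, resulting state, and answer) and uses a hybrid verifier to check schema validity, state grounding, transition consistency, and answer-trace compatibility, exposing physical-reasoning failures that answer-only evaluation hides.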

详情
Comments
8 pages, 3 figures, 5 tables
AI中文摘要

视觉语言模型(VLM)越来越多地被用于回答关于物理场景的问题,然而大多数评估仅将性能简化为最终答案。这隐藏了模型是否感知到正确的物体、表示了正确的物理状态、预测了合理的转换,或者仅仅因为错误的原因选择了正确的选项。我们引入\wmw,一个用于审计VLM的\emph{语言表达的物理承诺}的评估框架。我们不只对$I,q\mapsto a$进行评分,而是要求模型生成一个类型化轨迹$I,q\mapsto(s_0,Δs,s_1,a)$:初始状态、状态转换、结果状态和答案。然后,一个混合验证器检查模式有效性、状态基础、转换一致性和答案-轨迹兼容性,产生类型化错误标签,如物体、关系、力、转换、时间、单位/尺度和忠实性错误。我们发布\tracebank,一个受控轨迹资源,包含\nSeed个经过模式验证和重新计算验证的合成场景,涵盖\nFamilies个物理家族,\nPairs个最小扰动对比偏好对,验证器代码,审计指南和模型输出。我们在受控和外部物理推理示例上评估\nModels个VLM。\wmw揭示了仅答案评估遗漏的失败:来自中等水平模型的35%的正确答案背后是物理上无效的轨迹。验证器引导的重新排序在不牺牲答案准确性的情况下恢复了高达7个百分点的轨迹有效性,而轨迹级别的偏好调整将隐藏的不一致性相对降低了41%。贡献不是另一个最终答案的物理基准,而是一个可重用的协议,用于衡量VLM所陈述的物理世界是否与其答案同时为真。

英文摘要

Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the right physical state, predicted a plausible transition, or merely selected the right option for the wrong reasons. We introduce WMW, an evaluation framework for auditing the language-expressed physical commitments of VLMs. Instead of scoring only $I,q\mapsto a$, we ask models to produce a typed trace $I,q\mapsto(s_0,Δs,s_1,a)$: an initial state, a state transition, a resulting state, and an answer. A hybrid verifier then checks schema validity, state grounding, transition consistency, and answer-trace compatibility, yielding typed error labels such as object, relation, force, transition, temporal, unit/scale, and faithfulness errors. We release \tracebank, a controlled trace resource with \nSeed schema- and recomputation-validated synthetic scenarios across \nFamilies physics families, \nPairs minimally perturbed contrastive preference pairs, verifier code, audit guidelines, and model outputs. We evaluate \nModels VLMs on both controlled and external physical-reasoning examples. WMW reveals failures that answer-only evaluation misses: 35% of correct answers from mid-tier models are backed by physically invalid traces. Verifier-guided reranking recovers up to 7 percentage points of trace validity without sacrificing answer accuracy, and trace-level preference tuning reduces hidden inconsistency by 41% relative. The contribution is not another final-answer physics benchmark, but a reusable protocol for measuring whether a VLM's stated physical world can be true at the same time as its answer.
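
To make the typed trace $(s_0,Δs,s_1,a)$ concrete, here is a minimal sketch of a trace record and a toy verifier. It is not the released verifier: `Trace`, `verify`, and `answer_from_state` are hypothetical names, states are simplified to dictionaries of numeric quantities, and state grounding is omitted because it requires the image. Only the error labels ("transition", "faithfulness") follow the abstract's vocabulary.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    s0: dict      # initial state, e.g. {"height_m": 2.0}
    delta: dict   # state transition as a per-quantity change
    s1: dict      # resulting state
    answer: str   # final answer given by the model

def verify(trace, answer_from_state):
    """Return a list of typed error labels; an empty list means the trace passed."""
    # Schema validity: the transition and result must refer to known quantities.
    if not set(trace.delta) <= set(trace.s0) or set(trace.s1) != set(trace.s0):
        return ["schema"]
    errors = []
    # Transition consistency: s1 must equal s0 plus delta for every quantity.
    for key, start in trace.s0.items():
        if abs(start + trace.delta.get(key, 0.0) - trace.s1[key]) > 1e-9:
            errors.append("transition")
            break
    # Answer-trace compatibility: the answer must follow from the stated s1.
    if answer_from_state(trace.s1) != trace.answer:
        errors.append("faithfulness")
    return errors

def answer_from_state(state):
    # Toy rule (assumption): the ball rests on the floor once its height reaches zero.
    return "on the floor" if state["height_m"] <= 0 else "in the air"

good = Trace({"height_m": 2.0}, {"height_m": -2.0}, {"height_m": 0.0}, "on the floor")
bad = Trace({"height_m": 2.0}, {"height_m": -2.0}, {"height_m": 1.0}, "on the floor")
print(verify(good, answer_from_state), verify(bad, answer_from_state))
```

The second trace gives the same answer as the first, so answer-only scoring would treat the two alike, yet its stated result contradicts both its own transition and its answer.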

2605.29583 2026-05-29 cs.CV

BitC-3DGS: High-Capacity 3D Gaussian Splatting Watermarking via Bit Compression

BitC-3DGS: 基于位压缩的高容量3D高斯泼溅水印技术

Yuquan Bi, Baosheng Yu, Yingke Lei, Jianwei Yang, Hongsong Wang, Jie Gui, Yuan Yan Tang, James Tin-Yau Kwok

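AI summary: Proposes the BitC-3DGS framework, which uses bit-compressed tokenization, a dual-branch architecture, and a hard-message sampling strategy to overcome the 77-bit message limit imposed by the CLIP text encoder, enabling high-capacity 3DGS watermark embedding and recovery.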

详情
AI中文摘要

高容量水印对于3D高斯泼溅(3DGS)资产嵌入丰富信息(例如所有权、来源和认证码)是必要的,从而在大规模3D资产管线中实现可靠的识别和完整性验证。现有的基于预训练文本编码器的位到令牌水印方法由于CLIP固定的77令牌上下文长度而仅限于77位消息,因为超出此限制的令牌不被学习的位置嵌入支持。为了解决这一限制,我们引入了BitC-3DGS,一种位压缩框架,每个令牌编码多个消息位。它采用位压缩令牌化方案,将同一块内的多个位编码为单个语义令牌。为了恢复压缩信息,它进一步引入了双分支架构用于联合块解压缩和位解码,以及硬消息采样策略以改善解码器训练期间的组合覆盖。在Blender和LLFF数据集上的大量实验证明了BitC-3DGS在高容量水印方面的有效性,实现了高消息恢复精度和渲染保真度。例如,它支持128位消息容量,恢复精度与最近最先进方法中64位消息相当。

英文摘要

High-capacity watermarking is necessary for 3D Gaussian Splatting (3DGS) assets to embed rich information (e.g., ownership, provenance, and authentication codes), enabling reliable identification and integrity verification in large-scale 3D asset pipelines. Existing bit-to-token watermarking methods based on a pre-trained text encoder are limited to 77-bit messages due to CLIP's fixed 77-token context length, as tokens beyond this limit are unsupported by learned positional embeddings. To address this limitation, we introduce BitC-3DGS, a bit-compression framework that encodes multiple message bits per token. It employs a bit-compressed tokenization scheme that encodes multiple bits within the same chunk into a single semantic token. To enable recovery of the compressed information, it further introduces a dual-branch architecture for joint chunk decompression and bit decoding, along with a hard-message sampling strategy to improve combinatorial coverage during decoder training. Extensive experiments on the Blender and LLFF datasets demonstrate the effectiveness of BitC-3DGS for high-capacity watermarking, achieving high message recovery accuracy and rendering fidelity. For example, it supports 128-bit message capacity with recovery accuracy comparable to that of 64-bit messages in recent state-of-the-art methods.
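
The capacity argument (several message bits per token, so a longer message fits in a fixed token budget) can be illustrated with plain bit packing. This is only the combinatorial bookkeeping under stated assumptions: `compress_bits` and `decompress_tokens` are hypothetical helpers, tokens are bare integer ids, and the paper's semantic tokens and learned dual-branch decoder are not modelled.

```python
def compress_bits(bits, chunk_size):
    """Pack each chunk of `chunk_size` bits into one integer token id."""
    if len(bits) % chunk_size != 0:
        raise ValueError("message length must be a multiple of chunk_size")
    tokens = []
    for start in range(0, len(bits), chunk_size):
        value = 0
        for bit in bits[start:start + chunk_size]:
            value = (value << 1) | bit   # most significant bit first
        tokens.append(value)
    return tokens

def decompress_tokens(tokens, chunk_size):
    """Invert compress_bits: expand each token id back into its bits."""
    bits = []
    for token in tokens:
        bits.extend((token >> shift) & 1 for shift in range(chunk_size - 1, -1, -1))
    return bits

message = [(i * 7 + 3) % 2 for i in range(128)]   # arbitrary 128-bit message
tokens = compress_bits(message, chunk_size=2)      # 4 possible ids per token
print(len(message), "bits ->", len(tokens), "tokens")
```

With two bits per token, a 128-bit message needs 64 tokens, which stays under a 77-token context. The price is a vocabulary that grows as 2 to the power of the chunk size, which may be why the abstract emphasises combinatorial coverage during decoder training.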

2605.28711 2026-05-29 cs.LG

Stage-wise Distortion-Perception Traversal in Zero-shot Inverse Problems with Diffusion Models

基于扩散模型的零样本逆问题中的逐阶段失真-感知遍历

Jiawei Zhang, Ziyuan Liu, Leon Yan, Zhenyu Xiao, Yuantao Gu

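AI summary: Proposes MAP-RPS, a stage-wise framework that traverses the distortion-perception tradeoff with a single diffusion model via MAP estimation followed by re-noised posterior sampling, and extends it to the latent space as LMAP-RPS for broader applicability.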

详情
Comments
Accepted by ICML 2026
AI中文摘要

失真-感知(D-P)权衡是贝叶斯逆问题的一个基本现象,它刻画了失真性能与感知质量之间的内在矛盾。在推理时实现D-P权衡的灵活遍历对实际应用至关重要。尽管扩散模型在零样本逆问题求解中取得了近期成功,但在基于扩散的逆算法中实现D-P遍历的高效且原则性策略仍缺乏充分刻画。本文提出一种逐阶段框架,利用单个扩散模型在零样本逆问题中实现D-P遍历。我们提出的方法称为MAP-RPS,首先进行MAP估计阶段,近似MMSE解并提供低失真初始化,随后进行重噪声后验采样阶段,逐步提升感知质量。我们对两个阶段进行了理论分析,验证了所提设计的有效性和正确性。此外,我们将MAP-RPS扩展到潜空间,得到LMAP-RPS,通过利用大规模预训练潜扩散骨干网络,具有更广泛的适用性。大量实验表明,MAP-RPS和LMAP-RPS在各种任务上实现了更有效的D-P遍历,同时作为实际逆问题的高效求解器也表现出强劲性能。

英文摘要

The distortion-perception (D-P) tradeoff is a fundamental phenomenon of Bayesian inverse problems, which characterizes the inherent tension between distortion performance and perceptual quality. Enabling flexible traversal of the D-P tradeoff at inference time is crucial for practical applications. Despite the recent success of diffusion models in zero-shot inverse problem solving, efficient and principled strategies for D-P traversal in diffusion-based inverse algorithms remain inadequately characterized. In this paper, we propose a stage-wise framework for realizing D-P traversal using a single diffusion model in zero-shot inverse problems. Our proposed method, termed MAP-RPS, starts with an MAP estimation stage that approximates the MMSE solution and provides a low-distortion initialization, followed by a re-noised posterior sampling stage that progressively improves perceptual quality. We provide theoretical analyses for both stages, establishing the validity and effectiveness of the proposed design. Furthermore, we extend MAP-RPS to the latent space, yielding LMAP-RPS, which enjoys broader applicability by leveraging large-scale pre-trained latent diffusion backbones. Extensive experiments demonstrate that MAP-RPS and LMAP-RPS enable more effective D-P traversal on various tasks, while also exhibiting strong performance as efficient solvers for real-world inverse problems.

2605.28700 2026-05-29 cs.AI cs.CL

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

认真统计的重要性:对GSM-Symbolic的批判性再评估

Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz-Rodríguez

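AI summary: Re-evaluates 20 open-weight models using generalised linear mixed models with per-question random effects and finds that only half show statistically significant performance changes under the original prompt format. It also identifies a shift toward larger integers in the GSM-Symbolic dataset; controlling for this effect accounts for roughly half of the remaining significant cases, indicating that blanket claims about LLM reasoning are statistically premature and mechanistically misleading.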

详情
Comments
38 pages, 11 figures. Submitted to ACL ARR / EMNLP 2026
AI中文摘要

GSM-Symbolic基准测试(Mirzadeh等人,2025)报告了25个大型语言模型(LLM)在GSM8K问题的模板生成变体上测试时出现一致的性能下降,并得出结论认为这些模型缺乏真正的推理能力。我们认为这一结论建立在不可靠的统计基础上。使用具有每问题随机效应的广义线性混合模型重新评估20个开源模型,我们发现只有一半的模型在原始提示格式下表现出统计上显著的性能变化。此外,我们识别出一个先前未被承认的因素:主要GSM-Symbolic数据集相对于GSM-Base,在问题文本中包含系统性地偏移的大整数分布(K-S统计量=0.12,p<0.001),这与原始作者的声明相矛盾。控制这一大数效应后,大约一半剩余案例的显著性得以解释。在具有统计上显著性能差异的模型中,我们识别出不同的、模型特定的失败模式——包括变量绑定的脆弱性、算术限制和双任务干扰——这强调了关于LLM推理的笼统结论在统计上既不成熟,在机制上也是误导性的。

英文摘要

The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the main GSM-Symbolic dataset contains a systematically shifted distribution of larger integers in problem texts relative to GSM-Base (K-S statistic = 0.12, p < 0.001), contradicting the original authors' claims. Controlling for this large number effect accounts for significance in roughly half the remaining cases. Among models with statistically significant performance deltas, we identify distinct, model-specific failure profiles - including fragility of variable binding, arithmetic limitations, and dual-task interference - underscoring that blanket claims about LLM reasoning are both statistically premature and mechanistically misleading.
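
The K-S statistic quoted above is the largest gap between two empirical cumulative distribution functions. A minimal pure-Python sketch of the two-sample statistic follows; the integer samples are toy values, not the GSM-Base or GSM-Symbolic data, and the p-value computation is left out.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The supremum is attained at an observed value, so checking those suffices.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)   # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)   # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

base_numbers = [1, 2, 3, 4, 5]       # toy stand-in for integers in one dataset
shifted_numbers = [3, 4, 5, 6, 7]    # same shape, shifted toward larger values
print(ks_statistic(base_numbers, shifted_numbers))
```

A statistic of 0 means the two empirical distributions coincide and 1 means they do not overlap, so the reported 0.12 is a modest shift that becomes highly significant only because the samples are large.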

2605.28643 2026-05-29 cs.CL

GraphLit: Learning Text-Enriched Dynamic Character Network Representations for Literary Study

GraphLit:面向文学研究的文本增强动态人物网络表示学习

Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara, Mirella Lapata

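AI summary: Proposes Dynamic Heterogeneous Character Networks (DHCNs) and the self-supervised framework GraphLit, which learns literary representations that incorporate textual context through a masked graph autoencoder and outperforms text-only and graph-only baselines on 12 character-related tasks.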

详情
AI中文摘要

将文学文本表示为图或图序列的方法主要侧重于表示角色互动,而往往忽略了另一个关键方面:角色互动的文本上下文。我们引入了动态异质人物网络(DHCN),将长篇小说组织成时间局部化的异构图,将角色与其文本上下文对齐。我们从Project Gutenberg中提取了约20,000个DHCN,并提出了GraphLit,这是一个自监督学习框架,通过掩码图自编码器目标学习丰富的文学表示。在广泛的12个角色相关任务中,GraphLit优于纯文本和纯图基线,特别是在需要上下文理解的任务上。最后,我们通过研究叙事非线性和动态社会特征之间的联系,展示了DHCN和GraphLit在文学分析中的适用性。

英文摘要

Methods to represent literary texts as graphs or sequences of graphs mainly focus on representing character interactions, and often overlook another crucial aspect: the textual context in which characters interact. We introduce Dynamic Heterogeneous Character Networks (DHCNs), which organize long novels into temporally localized heterogeneous graphs that align characters with their textual contexts. We extract around 20,000 DHCNs from Project Gutenberg, and propose GraphLit, a self-supervised learning framework that learns rich literary representations through a masked graph autoencoder objective. Across a wide range of 12 character-related tasks, GraphLit improves over text-only and graph-only baselines, particularly on tasks requiring contextual understanding. Finally, we demonstrate the applicability of DHCNs and GraphLit for literary analysis by studying the link between narrative non-linearity and dynamic social features.

2605.28418 2026-05-29 cs.LG

Revisiting Metafeatures to Explain Model Differences on Tabular Data

重新审视元特征以解释表格数据上的模型差异

Markus Herre, Andrej Tschalzev, Sascha Marton, Christian Bartelt

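AI summary: Using strict statistical tests and leave-one-dataset-out analysis, the study finds that dataset meta-features do not robustly explain performance differences between model families on tabular data (e.g., neural networks vs. tree models, non-foundation vs. foundation models).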

详情
AI中文摘要

随着表格基础模型的兴起以及传统模型在许多任务上仍表现良好,为表格数据集选择合适模型仍然困难。我们研究数据集元特征是否能解释表格预测任务中模型族之间的性能差距。利用TabArena基准结果,我们分析数据集级别的性能差距,并将其与模型无关的数据集描述符相关联。经过严格统计检验并控制错误发现率后,我们发现:(1) 对于神经网络与树模型的差距,没有元特征能通过错误发现率控制;(2) 对于非基础模型与基础模型的差距,一个关联是稳健的,但在留一数据集预测测试中不能泛化;(3) 对于TabICLv2与TabPFN-2.6,一个稳健关联也改善了留出预测。此外,我们进行了留一数据集分析,发现元特征预测器未能比简单基线有实质性改进。总体而言,我们的结果显示了表格数据集的异质性,并且全局元特征方法不够稳健,无法对51个TabArena数据集提供解释。

英文摘要

With the rise of tabular foundation models alongside traditional models still performing well on many tasks, choosing the right model for a tabular dataset remains difficult. We investigate whether dataset meta-features can explain performance gaps between model families on tabular prediction tasks. Using the TabArena benchmark results, we analyze dataset-level performance gaps and relate them to model-agnostic dataset descriptors. After strict statistical tests with false discovery control, we find that (1) for neural network vs. tree gaps, no meta-feature survives false discovery control, (2) for non-foundation vs. foundation model gaps, one association is robust but does not generalize when tested in leave-one-dataset-out prediction, and (3) for TabICLv2 vs. TabPFN-2.6, one robust association also improves held-out prediction. Furthermore, we conduct a leave-one-dataset-out analysis and find that meta-feature predictors fail to improve meaningfully over a simple baseline. Overall, our results show the heterogeneity of tabular datasets and that global meta-feature approaches are not robust enough to offer explanations on the 51 TabArena datasets.
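
The abstract does not name its false discovery control procedure. As an assumption, the sketch below uses the Benjamini-Hochberg step-up rule, a common choice, to show why an association that looks significant on its own can fail once many meta-features are tested together. The p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    # Find the largest rank k with p_(k) <= alpha * k / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for four meta-feature associations.
p_values = [0.001, 0.04, 0.03, 0.2]
print(benjamini_hochberg(p_values))
```

Three of these toy p-values fall below 0.05 individually, but only the first survives the correction, which is the kind of attrition the abstract reports for most meta-feature associations.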

2605.28368 2026-05-29 cs.LG cond-mat.mtrl-sci physics.app-ph

LEIA: Learned Environment for Interactive Architected Materials

LEIA: 用于交互式架构材料的学习环境

Haiqian Yang, Yuan Cao, Markus J. Buehler

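AI summary: Proposes LEIA, a world model that lets engineers explore architected materials interactively by applying boundary conditions step by step and observing deformation and stress fields in real time, and that enables fast surrogate-guided candidate generation and ranking.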

详情
Comments
22 pages, 10 figures
AI中文摘要

世界模型已经实现了游戏环境和机器人操作的交互式探索,但物理工程仍然超出其能力范围:真实材料表现出非线性本构定律、携带历史依赖的内部状态、经历惯性动力学,并且可能具有跨越多个长度尺度的层次结构。我们提出了LEIA(用于交互式架构材料的学习环境),这是一个世界模型,允许工程师逐步施加边界条件并实时观察由此产生的变形和应力场。LEIA处理大型三维非结构化网格,并对用户指定的加载生成自回归响应。我们引入了MicroPlate,这是一个架构板的基准测试,涵盖微观结构建模的两种模式:通过三维几何显式解析微观结构的架构晶格,以及通过内部自由度隐式建模微观结构变化的均质板。MicroPlate用于评估LEIA以及两种模式下的四种基线方法。最后,我们证明LEIA能够实现高效的候选生成和排序,用于快速代理引导的架构材料新设计搜索,并通过有限元地面实况验证了应力准确的候选排序。

英文摘要

World models have enabled interactive exploration of game environments and robotic manipulation, but physical engineering remains beyond their reach: real materials exhibit nonlinear constitutive laws, carry history-dependent internal state, undergo inertial dynamics, and may possess hierarchical structures spanning multiple length scales. We present LEIA (Learned Environment for Interactive Architected materials), a world model that lets engineers apply boundary conditions step by step and observe the resulting deformation and stress fields in real time. LEIA handles large three-dimensional unstructured meshes and generates autoregressive responses to user-specified loading. We introduce MicroPlate, a benchmark of architected plates spanning two regimes of microstructure modeling: architected lattices that resolve microstructure explicitly through three-dimensional geometry, and a homogeneous plate where microstructural change is modeled implicitly through internal degrees of freedom. MicroPlate is used to assess LEIA alongside four baseline methods across both regimes. Finally, we demonstrate that LEIA enables efficient candidate generation and ranking for fast surrogate-guided search for de novo designs of architected materials, with stress-accurate candidate ranking validated by finite element ground truth.

2605.28327 2026-05-29 stat.ML cs.LG q-fin.RM stat.AP

Insurance Pricing Optimization via Off-Policy Evaluation

通过离线策略评估进行保险定价优化

Sascha Günther, Dimitri Semenovich, Mario V. Wüthrich

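AI summary: Proposes an insurance pricing approach based on off-policy evaluation and stochastic control, using a kernelized inverse propensity score estimator to reduce variance and two policy optimization methods, a data-shared Lasso and neural networks, to compute optimal pricing rules.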

详情
AI中文摘要

传统保险定价依赖于基于风险的原则,确保精算公平和偿付能力,但未明确考虑投保人的价格敏感性。我们将保险定价表述为一个决策问题,并使用离线策略评估和随机控制的工具进行研究。我们提出了一种核化逆倾向得分估计器,该估计器利用动作空间中的局部结构,与经典逆倾向得分估计器相比实现了方差减少。基于这些价值估计,我们研究了策略优化,并提出了两种计算最优定价规则的实用方法:一种可解释的数据共享Lasso公式和一种基于神经网络的灵活策略参数化。通过使用受控的合成旅行保险环境,我们实证验证了理论结果,并表明神经网络在策略优化方面优于现有技术。

英文摘要

Traditional insurance pricing relies on risk-based principles that ensure actuarial fairness and solvency but do not explicitly account for policyholders' price sensitivity. We formulate insurance pricing as a decision-making problem and study it using tools from off-policy evaluation and stochastic control. We propose a kernelized inverse propensity score estimator that exploits local structure in the action space and yields variance reduction compared to the classical inverse propensity score estimator. Building on these value estimates, we investigate policy optimization and present two practical approaches for computing optimal pricing rules: an interpretable data-shared Lasso formulation and a flexible policy parameterization based on neural networks. Using a controlled synthetic travel insurance environment, we empirically confirm the theoretical results and show that neural networks outperform existing techniques for policy optimization.
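
One common way to kernelize an inverse propensity score estimator for a continuous action such as a price is to replace the exact-match indicator with a smoothing kernel around the target policy's action. The sketch below assumes that form; the paper's estimator may differ in its kernel, normalisation, or bandwidth choice. `kernelized_ips`, the log tuple layout, and the toy numbers are all hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernelized_ips(logs, policy, bandwidth):
    """Estimate the value of `policy` from logged pricing data.

    logs: iterable of (context, logged_price, propensity_density, reward),
    where propensity_density is the logging policy's density at logged_price.
    """
    total, count = 0.0, 0
    for context, logged_price, propensity, reward in logs:
        # Logged prices near the target price get weight; distant ones get almost none.
        distance = (policy(context) - logged_price) / bandwidth
        weight = gaussian_kernel(distance) / (bandwidth * propensity)
        total += weight * reward
        count += 1
    return total / count

# Toy log (assumption): context is a risk score, reward is realised profit.
logs = [(0.2, 100.0, 0.01, 12.0), (0.8, 140.0, 0.01, 0.0), (0.5, 120.0, 0.01, 15.0)]
target_policy = lambda risk: 100.0 + 40.0 * risk   # hypothetical pricing rule
print(kernelized_ips(logs, target_policy, bandwidth=10.0))
```

The bandwidth trades bias for variance: a wider kernel borrows information from more distant logged prices, which lowers variance at the cost of smoothing over genuine price sensitivity.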

2605.27809 2026-05-29 cs.LG cs.CR

Density-aware Sample-specific Attack

密度感知的样本特定攻击

Qiyuan Wang, Yao Li, Raymond K. W. Wong

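AI summary: Proposes a bilevel optimization method that strengthens backdoor attacks by steering triggered samples into low-density regions of the clean data distribution, maintaining a high attack success rate under both fine-tuning and pruning defenses.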

详情
AI中文摘要

尽管后门攻击近期取得进展,现有方法仍易受到训练后防御(如微调或剪枝)的影响,这些防御会擦除后门。我们重新审视后门攻击的核心目标,并在受害者训练的贝叶斯最优模型下推导出刻画最优样本特定触发器构建的原则性准则。我们的分析表明,当触发样本被引导至干净数据分布的低密度区域时,攻击成功率和干净准确率保持同时达到最优,这种分布条件一次性控制中毒分布的所有矩,而非少量输入空间汇总统计量。我们引入一个双层优化框架,通过条件时间分数匹配估计密度比,并优化混合模型目标以将触发样本放置在这些稀疏区域。在MNIST、CIFAR-10、GTSRB和TinyImageNet上的广泛评估表明,我们的方法在防御前达到99%以上的攻击成功率,并且在微调防御下,防御后的ASR比最强基线高出50-85个百分点。针对神经元剪枝防御,该方法表现出完全免疫性,在所有剪枝阈值下均未识别出任何需要移除的神经元。这些结果暴露了当前防御范式的根本缺陷,并强调了需要超越干净分布支持域进行防御的必要性。

英文摘要

Despite recent progress in backdoor attacks, existing methods remain susceptible to post-training defenses that erase the backdoor through fine-tuning or pruning. We revisit the core objectives of backdoor attacks and derive principled criteria characterizing optimal sample-specific trigger construction under a Bayes-optimal model of the victim's training. Our analysis reveals that both attack success and clean-accuracy preservation are simultaneously optimized when triggered samples are steered into low-density regions of the clean data distribution, a distributional condition that controls all moments of the poisoned distribution at once rather than a handful of input-space summary statistics. We introduce a bilevel optimization framework that estimates density ratios via conditional time-score matching and optimizes a mixture-model objective to place triggered samples in these sparse regions. Extensive evaluations on MNIST, CIFAR-10, GTSRB, and TinyImageNet demonstrate that our method achieves above 99% attack success rate before defense and retains 50-85 percentage points higher post-defense ASR than the strongest baselines under fine-tuning defenses. Against neuron-pruning defenses, the method exhibits complete immunity, with zero neurons identified for removal across all pruning thresholds. These results expose a fundamental gap in current defense paradigms and underscore the need for defenses that operate beyond the support of the clean distribution.

2605.27696 2026-05-29 cs.CV cs.LG

Structure over Pixels: Learning Variable-Length Visual Programs

结构优于像素:学习可变长度视觉程序

Piotr Wyrwiński, Kacper Dobek, Krzysztof Krawiec

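AI summary: Proposes STROP, a discrete visual tokenizer architecture that learns variable-length visual programs under local rate-distortion supervision against DINOv3 features, replacing pixel reconstruction with structural representations.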

详情
AI中文摘要

离散视觉分词器将图像转换为有序的代码序列,为场景的结构描述提供了自然表示。然而,现有的自适应分词器要么需要事后搜索,要么在预训练速率的离散集合中进行选择,而不是学习与模型和场景耦合的连续每图像序列长度,并且它们通常针对像素重建进行训练,强调纹理而非结构。我们提出STROP,一种离散视觉分词器架构,形成结构场景表示并同时学习图像的视觉程序应该有多长。使用由冻结的DINOv3特征的局部率失真探针监督的四阶段课程,STROP优化了一个专门的长度头,在单次前向传递中估计活动前缀长度。通过绕过像素级重建梯度,码本完全由高层潜在表示的质量塑造。程序长度随场景复杂性增长,组合结构的迹象出现在下游密集预测迁移和对学习代码词汇的直接检查中。

英文摘要

Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate-distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.

2605.27580 2026-05-29 cs.AI q-bio.NC

You Are in Control of Your State: Why Human Outcomes Are Controllable Through Causal State Intervention

你掌控自己的状态:为什么人类结果可以通过因果状态干预来控制

Suraj Biswas, Saurav Gupta, Pritam Mukherjee

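AI summary: Argues that variability in human behaviour stems from a dynamic latent state and that outcomes become controllable through causal state intervention, motivating the framework with six strands of evidence and observational data from more than 200,000 users.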

详情
Comments
20 pages, 12 figures, 37 references. Companion to a prior SSRN preprint on causal architecture for human modelling
AI中文摘要

行为科学和面向人类的人工智能的一个核心谜题是个体内变异性的持续存在。同一个体在相同的可观察输入下,在不同场合产生不同的结果,而不同个体产生不同的结果,且没有可观察的协变量能完全预测。我们认为,这种变异性属于个体的动态潜在状态,并且通过针对决策形成时刻的状态及其权重的干预,人类结果在精确且可操作的意义上是可控的。我们将状态定义为随时间索引的权重向量,其维度决定个体的生物学、生理学和神经心理学如何将下一个事件处理为决策和结果。状态、决策和结果之间的关系是因果性的而非相关性的。权重向量在亚日时间尺度上是动态的。结果可报告的意识通道是一个狭窄的注意瓶颈,其内容本身依赖于状态。综合这些主张,意味着给定事件的结果在干预时的状态轨迹条件下是可控的。我们通过六条已建立的证据链(因果推断、预测处理、稳态应变、注意瓶颈、时间生物学、计算精神病学)以及一个部署的行为平台(涵盖2023年至2026年研究期间超过20万同意用户,跨越四种职业角色)的24个月观察基础来推动该框架。我们推导出七个可检验的预测,列出了六个状态感知系统的操作要求,并讨论了对数字健康、教育、人工智能个性化和个人能动性的影响。

英文摘要

A central puzzle for the behavioural sciences and for human-facing artificial intelligence is the persistence of within-person variability. The same individual, presented with the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts. We argue that this variability belongs in the dynamic latent state of the person, and that human outcomes are controllable in a precise and operational sense through interventions that target the state and its weighting at the moment a decision is being formed. We define a state as the time-indexed weighting vector over the dimensions that govern how an individual's biology, physiology, and neuropsychology process the next event into a decision and an outcome. The relationship between state, decision, and outcome is causal rather than correlational. The weighting vector is dynamic at sub-daily timescales. The conscious channel through which outcomes are reportable is a narrow attentional bottleneck whose contents are themselves state-dependent. Taken together, these claims imply that the outcome of a given event is controllable, conditionally, on the state-trajectory at the time of intervention. We motivate the framework with six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) and a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026). We derive seven testable predictions, list six operational requirements for state-aware systems, and discuss implications for digital health, education, AI personalisation, and personal agency.

2605.27176 2026-05-29 cs.AI

The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

压缩知识图谱假说:哪些图事实对科学假设生成至关重要?

Shashwat Sourav, Viktoriia Baibakova, Sanjay Das, Ran Elgedawy, Maria Mahbub, Emily Herron, Tirthankar Ghosal

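AI summary: By perturbing local knowledge graphs (density, ontology richness, topology, and control structure), the study evaluates how useful knowledge graphs are to different language models for battery-materials hypothesis generation, and proposes a redundancy-aware Compressive KG hypothesis: useful signal is often recoverable from compact subgraphs.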

详情
AI中文摘要

知识图谱(KGs)可以为语言模型提供结构化的科学背景,但目前尚不清楚哪些图事实实际上塑造了生成的假设。我们研究了Mistral-7B、Llama-3.1-70B和Gemini 2.5 Flash在电池材料上的KG引导假设生成。通过改变密度、本体丰富度、拓扑和控制结构来扰动局部KG,并使用提供的图和固定参考指标评估输出。跨模型而言,KG效用是选择性的且依赖于模型:图上下文改变了输出,但无KG输出也从模型先验中恢复了大量图内容。紧凑的top-k子图通常近似于完整KG的行为,包括当声称的结果三元组被排除时。同时,压缩并非唯一依赖于某种语义排序规则,随机和基于拓扑的子集也能恢复大部分信号。这些结果支持一种冗余感知的压缩KG假说:有用的KG信号通常可以从紧凑的、科学结构的子图中恢复,而不是需要完整的局部图。

英文摘要

Knowledge graphs (KGs) can provide structured scientific context to language models, but it remains unclear which graph facts actually shape the generated hypotheses. We study KG-guided hypothesis generation for battery materials across Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash. We perturb local KGs by varying density, ontology richness, topology, and control structure, and evaluate outputs with both provided-graph and fixed-reference metrics. Across models, KG utility is selective and model-dependent: graph context changes outputs, but no-KG outputs also recover substantial graph content from model priors. Compact top-k subgraphs often approximate full-KG behavior, including when claimed-outcome triples are held out. At the same time, compression is not unique to one semantic ranking rule: random and topology-based subsets can also recover much of the signal. These results support a redundancy-aware Compressive KG hypothesis: useful KG signal is often recoverable from compact, scientifically structured subgraphs rather than requiring the full local graph.

2605.27078 2026-05-29 cs.LG cs.AI

Two Speeds of Learning: A Representation-Readout Decomposition of Grokking and Double Descent

两种学习速度:Grokking 和双下降的表征-读出分解

Chi-Ning Chou, Oscar Uzdelewicz, Neng-Chun Chiu, Yao-Yuan Yang, SueYeon Chung

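AI summary: Explains grokking and epoch-wise double descent by decomposing learning dynamics into two competing processes, representation learning in the encoder and readout calibration in the final classifier, and provides a framework for diagnosing spurious generalization.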

详情
AI中文摘要

训练损失和准确率是用于监控深度神经网络训练过程中泛化性能的标准信号。两个有据可查的现象使这一图景复杂化:在 grokking 中,训练损失迅速下降,而测试性能仅在长时间延迟后突然提升;在 epoch-wise 双下降中,训练损失单调下降,而测试损失或误差先升后降。现有解释通常针对特定任务,缺乏一个任务无关的分析框架来诊断和解释这些现象在现实任务和架构中的表现。我们通过分析学习动态背后的两个竞争过程来应对这一挑战:编码器中的表征学习和最终分类器中的读出校准。利用表征几何、神经正切核和线性探测等工具,我们表明这两个过程在整个训练过程中都是活跃的,它们相对速度的波动导致了看似异常的泛化动态。将表征-读出分解应用于各种任务和架构中的 grokking,我们发现读出在 grokking 发生前偏向训练集,而表征学习是渐进但并非缺失的,这与“从懒惰到丰富”的解释相反。该框架进一步提供了区分虚假泛化和真实泛化的诊断特征:在先前报告的 MNIST grokking 示例和 epoch-wise 双下降示例中,看似延迟或非单调的泛化是由非标准训练配方导致的表征退化和读出失调引起的。总之,这些结果确立了表征-读出分解作为一个自上而下的框架,用于理解学习动态并揭示可解释性研究的基础算法。

英文摘要

Training loss and accuracy are the standard signals used to monitor generalization during deep neural network training. Two well-documented phenomena complicate this picture: in grokking, train loss falls rapidly while test performance improves abruptly only after a long delay; in epoch-wise double descent, train loss decreases monotonically while test loss or error rises and falls. Existing accounts are often task-specific, and a task-agnostic analysis framework for diagnosing and explaining these phenomena across realistic tasks and architectures is missing. We address this challenge by analyzing two competing processes that underlie learning dynamics: representation learning in the encoder and readout calibration in the final classifier. Using tools from representational geometry, neural tangent kernels, and linear probing, we show that both processes are active throughout training, with the fluctuations of their relative speed giving rise to seemingly anomalous generalization dynamics. Applying the representation-readout decomposition to grokking across a wide range of tasks and architectures, we find that the readout is train-biased before grokking onset, and representation learning is gradual but not absent, contrary to the lazy-to-rich account. The framework further provides diagnostic signatures distinguishing spurious from genuine generalization: in a previously reported MNIST grokking example and an epoch-wise double descent example, apparent delayed or non-monotone generalization is shown to arise from representation degradation and readout misalignment induced by non-standard training recipes. Together, these results establish the representation-readout decomposition as a top-down framework for understanding learning dynamics and revealing underlying algorithms for interpretability research.

2605.26954 2026-05-29 cs.CL

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

AlbanianLLMSafety:面向阿尔巴尼亚语大语言模型的安全评估数据集

Wajdi Zaghouani, Kholoud K. Aldous, Isra Fejzullaj

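AI summary: Builds the first publicly available LLM safety evaluation dataset for Albanian, a low-resource language, with 2,951 prompts across 11 safety categories, to fill a gap in safety evaluation infrastructure.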

详情
Journal ref
In Proceedings of the SIGUL2026 Workshop co-located with LREC 2026, Palma de Mallorca, Spain, 2026
Comments
Accepted at SIGUL2026 Workshop co-located with LREC2026
AI中文摘要

大语言模型(LLM)的安全评估主要集中于高资源语言,而低资源语言则严重缺乏关注。我们提出了AlbanianLLMSafety,这是首个公开的阿尔巴尼亚语LLM安全评估数据集。阿尔巴尼亚语是一种语言独特的低资源语言,在阿尔巴尼亚、科索沃、北马其顿以及海外侨民中约有750万使用者。该数据集包含2951条提示,涵盖11个安全类别,包括自残、暴力、种族主义内容、儿童剥削和激进化等,平均每个类别268条提示。每条提示均提供阿尔巴尼亚语原文、英语参考译文以及详细的类别标签。该资源填补了低资源语言安全评估基础设施的重大空白,并为开发更安全、更具包容性的LLM提供了重要基准。数据集将根据请求提供,以支持阿尔巴尼亚语社区的安全评估、微调、红队测试和护栏开发。

英文摘要

Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation dataset for LLMs in Albanian, a linguistically distinct low-resource language with approximately 7.5 million speakers across Albania, Kosovo, North Macedonia, and the diaspora. The dataset contains 2,951 prompts spanning 11 safety categories, including self-harm, violence, racist content, child exploitation, and radicalization, with an average of 268 prompts per category. Each prompt is provided in Albanian with an English reference translation and a detailed category label. This resource addresses a significant gap in safety evaluation infrastructure for low-resource languages and provides an essential benchmark for developing safer, more inclusive LLMs. The dataset will be provided upon request to support safety evaluation, fine-tuning, red-teaming, and guardrail development for Albanian-speaking communities.

2605.26947 2026-05-29 cs.CL

KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models

KZ-SafetyPrompts:用于大型语言模型的哈萨克语安全评估提示数据集

Wajdi Zaghouani, Shimaa Amer Ibrahim, Aruzhan Muratbek, Olzhasbek Zhakenov, Adiya Akhmetzhanova

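AI summary: To address the lack of LLM safety evaluation resources for Kazakh, builds a dataset of 5,717 natively written Kazakh prompts across 11 risk categories. A GPT-4o baseline shows refusal rates that vary widely across categories, exposing category-specific safety gaps that English-only evaluation does not capture.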

详情
Journal ref
In Proceedings of the SIGUL2026 Workshop co-located with LREC 2026, Palma de Mallorca, Spain, 2026
Comments
Accepted at the SIGUL2026 Workshop co-located with LREC2026
AI中文摘要

哈萨克语在评估大型语言模型安全行为的资源中代表性不足。我们提出了KZ-SafetyPrompts,这是一个哈萨克语提示数据集,用于涵盖常见风险领域的十一个类别的安全评估,例如自残、暴力、儿童剥削、色情内容、种族主义内容、激进化以及受管制商品或非法活动。该数据集包含5717条以哈萨克语(西里尔字母)原生编写的提示,按类别组织,并附有英文翻译以进行跨语言分析。提示类似于真实的用户查询,通常采用青少年或儿童风格,并以意图提示的形式表述,不包含程序性指令。我们记录了编写协议、标注程序(包括边界案例决策规则)和质量控制步骤(模式标准化、完整性检查和去重)。我们还将这些类别与广泛使用的安全分类法对齐,以支持与现有评估管道的集成。使用GPT-4o的基线结果显示总体拒绝率为28.2%,不同类别间从5.5%到53.8%不等,表明哈萨克语提示暴露了仅英语评估无法捕获的类别特定安全漏洞。

英文摘要

Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas such as self-harm, violence, child exploitation, sexual content, racist content, radicalization, and regulated goods or illegal activities. The dataset contains 5,717 prompts written natively in Kazakh (Cyrillic), organized by category, with English translations for cross-lingual analysis. Prompts resemble realistic user queries, often in a teen or child style, and are phrased as intent prompts without procedural instructions. We document the writing protocol, labeling procedures (including borderline-case decision rules), and quality-control steps (schema standardization, completeness checks, and deduplication). We also align the categories with widely used safety taxonomies to support integration with existing evaluation pipelines. Baseline results with GPT-4o show an overall refusal rate of 28.2%, varying from 5.5% to 53.8% across categories, indicating that Kazakh prompts expose category-specific safety gaps not captured by English-only evaluation.