arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.29074 2026-05-29 cs.CV cs.RO

Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models

Jiyao Zhang, Mingxu Zhang, Yitong Peng, Haoxuan Liu, Chenshuo Wang, Yuxing Long, Haoyang Huang, Dongjiang Li, Nan Duan, Hui Shen, Hao Dong

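AI summary: Introduces the Embodied3DBench benchmark, which systematically evaluates the low-level spatial intelligence of vision language models in 3D environments across six task categories (spatial structural understanding and interaction-oriented perception), and synthesizes 1.3M QA pairs of training data to close the capability gap.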

Abstract

Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments. To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs. We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs. Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.

2605.29068 2026-05-29 cs.AI cs.CL cs.CR cs.LG

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

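AI summary: Proposes COLAGUARD, a model that uses stage-wise training to move multi-step safety reasoning into a continuous latent space, delivering a 12.9x speedup and a 22.4x reduction in tokens while maintaining strong safety performance.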

Abstract

Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose COLAGUARD, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, COLAGUARD improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9x speedup and a 22.4x reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.
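
As a reference for the metric reported above, macro-F1 is the unweighted mean of per-class F1 scores. The following is a minimal sketch in plain Python; the function name `macro_f1` and the toy labels are illustrative assumptions and do not come from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical moderation labels, for illustration only.
y_true = ["safe", "safe", "unsafe", "unsafe"]
y_pred = ["safe", "unsafe", "unsafe", "unsafe"]
score = macro_f1(y_true, y_pred)
```

Because every class counts equally, a rare class with poor recall lowers macro-F1 as much as a common one, which is presumably why it suits imbalanced moderation benchmarks.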

2605.29064 2026-05-29 cs.CL cs.CV cs.HC cs.MA

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception

Neemias da Silva, Myriam Delgado, Rodrigo Minetto, Daniel Silver, Thiago H Silva

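AI summary: Comparing text generated by multimodal large language models under different persona prompts and no-persona settings, the study finds that captions converge, justifications vary systematically with socioeconomic and political attributes, and perception tags show no significant differences.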

Comments 10 pages, 6 figures

Abstract

We study how persona prompting shapes language generated by multimodal large language models in an urban perception setting. Using 59,808 annotations from 1,200 persona-conditioned agents and two no-persona settings, we analyze captions, justifications, and perception tags across personas. Results indicate strong convergence in captions across personas, whereas justifications vary systematically with socioeconomic and political attributes. Perception tags show no statistically significant persona-related differences, though effect trends are observed. Topic analysis further reveals that personas emphasize different evaluative themes when interpreting the same scenes.

2605.29062 2026-05-29 cs.CL

Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies

Abhilekh Borah

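AI summary: Using SovSim, a multi-agent simulation framework that introduces an agent with asymmetric power (boss or king), the study finds that power asymmetry causes severe breakdowns in cooperation and sustainability in LLM societies, with survival rates dropping by up to 87.3% relative to symmetric settings.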

Comments Paper under review

Abstract

Communities can sustainably manage shared resources (commons) through self-governance and cooperative norms, a central finding of Ostrom's theory of self-governance. However, real-world commons (e.g., fisheries, forests, and irrigation systems) are often governed under asymmetric power structures, where certain individuals or institutions possess disproportionate control over resource extraction and collective outcomes. As Large Language Models (LLMs) are increasingly explored as agents in synthetic governance simulations, understanding how LLM societies behave under asymmetric power structures is becoming increasingly important, yet existing evaluations largely ignore such asymmetries. We introduce Sovereignty over the Commons Simulation (SovSim), a generative multi-agent simulation framework that incorporates an agent with asymmetric power (boss or king) into a society of symmetric agents (workers or peasants), where all agents extract from a shared resource, collectively determining its sustainability over time. Across eleven state-of-the-art models, we find that introducing asymmetric power leads to severe breakdowns in cooperation and sustainability, with up to an 87.3% degradation in survival rate relative to symmetric settings.

2605.29055 2026-05-29 cs.AI cs.MA

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

Diego Gosmar, Deborah A. Dahl

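AI summary: Proposes a HOPE-inspired Nested Learning architecture that combines a Continuum Memory System with semantic caching, achieving hallucination mitigation on a hybrid benchmark through a three-stage agentic pipeline while lowering energy use and improving observability.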

Comments 21 pages, 14 figures

Abstract

Hallucination remains a major reliability barrier for production LLM systems, particularly in multi-agent pipelines where unsupported claims can propagate unchecked across stages. This paper adapts a HOPE-inspired Nested Learning architecture with Continuum Memory Systems (CMS) and semantic similarity caching to a hybrid benchmark of 310 prompts combining 217 epistemic-uncertainty prompts and 93 fabrication-induction stress-test prompts. A three-stage agentic pipeline orchestrated via the Open Floor Protocol (OFP) is evaluated with five KPIs -- FCD (Factual Claim Density), FGR (Factual Grounding References), FDF (Fictional Disclaimer Frequency), ECS (Explicit Contextualization Score), and OSR (Observability Score Ratio) -- aggregated into THS (Total Hallucination Score) across five weighting configurations to study mitigation-observability trade-offs. FDF, ECS, OSR, and FGR are subtracted as mitigation signals, so that a more negative THS indicates stronger mitigation. The FrontEndAgent is configured as a high-stochasticity generator (temperature = 1.0) to produce a realistic hallucination baseline, while the SecondLevelReviewer and ThirdLevelReviewer operate as progressive correctors. This asymmetric design yields end-to-end THS reductions of -31.3% to -35.9% across five weighting configurations. Semantic caching achieves 440 cache hits over 930 potential calls (47.3% hit rate), reducing LLM invocations to 490, lowering energy and CO2e footprint, and making multi-stage review pipelines operationally viable at production scale. ExtremeObservability attains the most negative final THS (-0.0709), confirming that observability-heavy configurations reinforce rather than compromise mitigation. These findings suggest that memory-augmented multi-agent designs can jointly improve factual reliability, operational efficiency, and auditability without model retraining.
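
The cache figures above can be checked with simple arithmetic, and the sign convention of THS can be made concrete. The sketch below assumes THS is a linear weighted combination in which FCD is added and the four mitigation signals are subtracted; the abstract states only the signs, so the linear form, the function name `total_hallucination_score`, and the toy KPI values and weights are all assumptions.

```python
# Figures quoted in the abstract.
potential_calls = 930
cache_hits = 440

hit_rate = cache_hits / potential_calls          # fraction of calls served from cache
llm_invocations = potential_calls - cache_hits   # calls that still reach the LLM

def total_hallucination_score(kpis, weights):
    """Assumed linear form: FCD raises the score, mitigation signals lower it."""
    mitigation = sum(weights[k] * kpis[k] for k in ("FGR", "FDF", "ECS", "OSR"))
    return weights["FCD"] * kpis["FCD"] - mitigation

# Hypothetical KPI values and uniform weights, for illustration only.
toy_kpis = {"FCD": 0.5, "FGR": 0.2, "FDF": 0.1, "ECS": 0.3, "OSR": 0.4}
toy_weights = {name: 0.2 for name in toy_kpis}
ths = total_hallucination_score(toy_kpis, toy_weights)
```

Here `hit_rate` rounds to the reported 47.3% and `llm_invocations` equals the reported 490. Under this assumed form, stronger mitigation signals push `ths` further below zero, matching the statement that a more negative THS indicates stronger mitigation.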

2605.29048 2026-05-29 cs.CL

LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English

Lauren Levine, Amir Zeldes

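AI summary: Proposes LLMBridge, an LLM-based system for end-to-end referential bridging resolution that combines heuristic pre/post-processing with the natural language inference ability of LLMs and surpasses prior state-of-the-art methods on three English datasets.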

Abstract

In this paper, we introduce LLMBridge, a new LLM-based system for the task of end-to-end referential bridging resolution in English. Our bridging resolution pipeline combines heuristic pre/post-processing with the natural language inference ability that comes from LLMs. We evaluate our bridging resolution pipeline on three datasets that have been used for referential bridging resolution evaluation in English: ISNotes, BASHI, and GUMBridge. Comparison to previous bridging resolution systems shows that LLMBridge surpasses previous state-of-the-art (SoTA) systems on all three datasets in the challenging End-to-end Evaluation Setting, as well as in the Basic Bridging Resolution Evaluation Setting (gold bridging anaphor given). We also conduct a thorough error analysis of LLMBridge's performance, examining what varieties of bridging remain difficult for LLM-based systems to identify. With this paper, we release the code for the LLMBridge pipeline.

2605.29042 2026-05-29 cs.AI cs.LG

Differentiable Belief-based Opponent Shaping

Aarav G Sane, Karthik Sivachandran, Rohan Paleja

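AI summary: Proposes D-BOS, which shapes opponents' beliefs in hidden-role games through differentiable belief updates and gradient propagation, allowing the optimal strategy to emerge naturally.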

Abstract

Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi-agent reinforcement learning, opponent shaping attempts to replicate this influence, though existing methods typically operate within an opponent's parameter, policy, or value space. Meanwhile, belief-manipulation techniques in hidden-role games often rely on hard-coded objectives, such as deception or belief saturation. We propose Differentiable Belief-based Opponent Shaping (D-BOS), a first-order method that treats each observer's belief as the shaped opponent state and differentiates through $k$-step softmax-Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target for shaping. This allows the optimal strategy to emerge naturally from the environment's reward structure. This belief-space formulation provides an opponent-shaping signal by differentiating through opponent belief updates, and naturally extends to multiple observers by aggregating gradients over their individual inferred belief trajectories. Empirically, D-BOS outperforms PPO and BBM in hidden-role games, with the largest gains in mixed-motive settings.

2605.29041 2026-05-29 cs.AI

Practitioner Beliefs and Behaviors in AI-Enhanced Education: DOT Framework Survey Evidence

David Gibson, M. Elizabeth Azukas, Gerald Knezek

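AI summary: Grounded in the DOT Framework, a cross-sectional survey (n = 72) examines higher education practitioners' beliefs, behaviors, and institutional conditions around AI integration, finding that practitioners support AI as a teaching aid but insist on human oversight, and that design-oriented practice lags behind theory.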

Abstract

This study reports findings from a cross-sectional survey (n = 72) of higher education practitioners examining beliefs, behaviors, and institutional conditions related to artificial intelligence (AI) integration in teaching and learning. Grounded in the DOT Framework, which integrates design thinking and open systems theory, the study investigates AI familiarity, usage patterns, design-oriented practices, and pedagogical beliefs. Exploratory factor analysis of 19 belief items identified a three-factor structure: AI Functional Capabilities, Oversight and Governance, and Instructor Collaboration and Planning (α = .90). Results indicate that practitioners hold favorable views of AI as a pedagogical support while maintaining strong commitments to human oversight and critical evaluation. Reported practices emphasize iterative prompting and content generation, with less consistent use of needs assessment and feedback loops. Institutional barriers including limited policy, training, and infrastructure were widely reported. These findings provide preliminary empirical support for the DOT Framework as a descriptive model of practitioner beliefs and practices, while also highlighting gaps between design-oriented theory and current implementation. The study contributes an initial measurement structure and identifies directions for confirmatory validation and outcome-based research linking AI-supported design practices to instructional quality.
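
The reliability figure above (α = .90) is presumably Cronbach's alpha, although the abstract does not name the coefficient. Under that assumption, a minimal NumPy sketch of the standard formula follows; the function name `cronbach_alpha` and the toy responses are hypothetical and unrelated to the survey data.

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    x = np.asarray(responses, dtype=float)
    k = x.shape[1]                              # number of items
    item_variances = x.var(axis=0, ddof=1)      # sample variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)  # variance of respondents' total scores
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical 5-point Likert responses: 4 respondents, 3 items.
toy_responses = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1]]
alpha = cronbach_alpha(toy_responses)
```

Alpha approaches 1 as items move together across respondents, so a value of .90 would indicate high internal consistency of the belief items.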

2605.29033 2026-05-29 cs.LG

Moment Matching Q-Learning

Yiyan Liang, Sifei Liu, Weitong Zhang

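AI summary: Proposes the Moment Matching Q-Learning (MoMa QL) framework, which uses maximum mean discrepancy (MMD) to match all orders of statistics between the original and target distributions, achieving distribution-level convergence of the conditional score function. It is computationally efficient with comparable performance on D4RL tasks, and shows better adaptability and performance in offline-to-online reinforcement learning by accelerating action sampling for flow-based policies.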

Comments 23 pages, 14 figures, 10 tables, accepted by ICML 2026

Abstract

Score-based and flow-based generative models exhibit remarkable expressive capacity in capturing complex distributions, and have been extensively deployed in tasks ranging from image generation to reinforcement learning. Nevertheless, these models suffer from prolonged inference latency, which imposes a significant computational bottleneck in RL with iterative sampling. To overcome this limitation, we propose a new framework named Moment Matching Q-Learning (MoMa QL), which uses maximum mean discrepancy (MMD), a technique from statistical hypothesis testing, to match all orders of statistics between the original and target distributions. By enforcing strong regularization on all moment statistics, this algorithm guarantees distribution-level convergence for the conditional score function and remains stable under various hyperparameters. Empirically, we show that MoMa QL is more computationally efficient, with comparable if not competitive performance, on various D4RL tasks. Remarkably, by accelerating the action sampling process for flow-based policies, MoMa QL demonstrates superior performance in offline-to-online RL tasks because of its faster and stronger adaptability during online interactive fine-tuning.
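
To make the MMD quantity mentioned above concrete, here is a minimal NumPy sketch of the biased (V-statistic) estimate of squared MMD with a Gaussian kernel. It illustrates the general statistic only; the kernel choice, the bandwidth, and the names `rbf_kernel` and `mmd_squared` are assumptions, and the paper's actual training loss may differ.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets x and y."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

# Toy data: a 2-D Gaussian sample and a shifted copy of it.
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 2))
shifted = samples + 3.0
same_gap = mmd_squared(samples, samples)
shift_gap = mmd_squared(samples, shifted)
```

With a characteristic kernel such as the Gaussian, the population MMD is zero only when the two distributions coincide, which is the sense in which it compares all moments at once.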

2605.29032 2026-05-29 cs.LG stat.ML

Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

Christoph Dann, Yishay Mansour, Mehryar Mohri

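AI summary: To address simulator exploitation in model-based reinforcement learning, proposes strategic robustness as the objective, learns the simulator through a zero-sum minimax game, and provides theoretical guarantees and effective algorithms.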

Abstract

Model-based reinforcement learning (MBRL) agents typically learn world models by minimizing predictive loss. However, powerful RL optimizers inevitably exploit minor model inaccuracies, leading to simulator exploitation and a reality gap where policies succeed in simulation but fail in the real world. We propose that the objective for learning simulators should be strategic robustness rather than predictive accuracy, and formulate this as a zero-sum minimax game between a model player and an adversarial policy player. We provide a comprehensive theoretical analysis: (1) an online learning guarantee showing the game is learnable with sublinear regret bounds; (2) a tractable critic-based simplification bounding the global policy-value gap by the local critic's loss; and (3) an Error-MDP duality, proving that finding the worst-case policy is formally dual to a standard RL problem where the reward is the one-step critic error. This duality yields a provably convergent active data selection algorithm. Experiments on continuous control tasks demonstrate that our approach reduces prediction error in strategically important regions by $1.5$-$2.2\times$ and enables policies trained purely in simulation to match near-optimal real-world performance.

2605.29028 2026-05-29 cs.LG cs.AI

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Yuxiao Yang, Weitong Zhang

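AI summary: Proposes the Q-ALIGN DT framework, which aligns RTG with policy performance in return-conditioned sequence models by ensuring that the Q-value of the output policy is consistent with the input RTG, achieving superior controllability and performance on the D4RL benchmark.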

Comments 28 pages, 13 figures, 20 tables, accepted by ICML 2026

Abstract

Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.
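
For context, the return-to-go at a timestep is the sum of rewards from that step to the end of the trajectory. A minimal sketch follows; the function name `returns_to_go`, the optional discount `gamma`, and the toy rewards are illustrative assumptions rather than details from the paper.

```python
def returns_to_go(rewards, gamma=1.0):
    """Suffix sums of rewards: rtg[t] = rewards[t] + gamma * rtg[t + 1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the tail sum already computed.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Toy trajectory with three rewards.
rtg = returns_to_go([1.0, 0.0, 2.0])  # [3.0, 2.0, 2.0]
```

A return-conditioned model receives such a value as input at each step; the alignment problem described above is that nothing in plain supervised training forces the resulting policy's actual return to match it.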

2605.29027 2026-05-29 cs.AI cs.CL cs.HC

Mind Your Tone: Does Tone Alter LLM Performance?

Om Dobariya, Akhil Kumar

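AI summary: Studies how variations in prompt tone affect the accuracy of large language models on objective multiple-choice questions, finds that tone effects are systematic but highly model-dependent, and presents a routing framework to explain how tone may modulate internal reasoning modes.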

Comments 10 pages, 6 tables, 1 figure. Accepted as a full paper at the Thirty-second Americas Conference on Information Systems (AMCIS 2026), Reno. Follow-up to arXiv:2510.04950

Abstract

The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.

2605.29025 2026-05-29 cs.AI cs.CY cs.HC

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

Aisha Najera, Alvin Moon, Vedant Srinivasan, Rajesh Veeraraghavan

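AI summary: Proposes an Interpretive Audit Pipeline that uses multi-model disagreement to detect interpretive complexity and directs human review toward genuinely ambiguous public input, complementing traditional accuracy-based evaluation.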

Abstract

Federal agencies are deploying large language models (LLMs) to categorize public comment corpora, where the model's organization of the record shapes what policymakers see and which arguments register. Standard evaluation, anchored on stance accuracy against a small validated set, cannot detect when different models produce materially different categorizations of the same public input. We propose an Interpretive Audit Pipeline that treats multi-model disagreement as diagnostic of interpretive complexity and directs human review toward genuinely ambiguous public input. Analyzing 1,260 public comments on a federal USDA docket across four LLMs, we find that inter-model thematic divergence exceeds within-model prompt variation, and that an expert rubric suppresses deep interpretive disagreement without resolving it. In a two-stage labeling study on a stratified 40-comment subsample, four LLMs and a human annotator labeled independently and then revised after seeing the others' labels. Revision behavior varied across labelers, and the human annotator's revisions frequently introduced framings absent from the ensemble's collective output. We argue disagreement-based evaluation is a necessary complement to accuracy metrics for LLM-assisted interpretive coding.
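
One simple way to turn multi-model disagreement into a review queue is to score each comment by the entropy of the labels that the models assign to it. The sketch below is an illustration of that idea only: the abstract does not specify the pipeline's disagreement measure, so the entropy score, the threshold, the function names, and the toy labels are all assumptions.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy in bits of the labels that different models gave one comment."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def flag_for_review(comment_labels, threshold=1.0):
    """Ids of comments whose cross-model label entropy reaches the threshold."""
    return [cid for cid, labels in comment_labels.items()
            if label_entropy(labels) >= threshold]

# Hypothetical stance labels from four models for three comments.
toy_labels = {
    "c1": ["support", "support", "support", "support"],
    "c2": ["support", "oppose", "support", "oppose"],
    "c3": ["support", "oppose", "mixed", "unclear"],
}
flagged = flag_for_review(toy_labels)
```

Comments on which all models agree score zero and are skipped, while an even split sends the comment to a human reviewer.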

2605.29021 2026-05-29 cs.LG

Designing Active Tether-Net Systems for Space Debris Capture with Graph-Learning-Aided Mixed-Combinatorial Optimization

Feng Liu, Achira Boonrath, Gishnu Madhu, Eleonora M. Botta, Souma Chowdhury

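AI summary: For the mixed-combinatorial nonlinear programming (MCNLP) problem of active tether-net system design, which involves continuous, integer, and categorical variables, proposes a graph-neural-network-aided optimization method that reduces the MCNLP to an NLP. It enables joint design of net morphology, MU mass and thruster choices, and aiming points, and converges significantly faster than solving the MCNLP directly.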

Comments Accepted for presentation at 2026 AIAA Aviation Forum

Abstract

Active tether-net systems are a promising solution for capturing large non-cooperative targets, such as space debris, by deploying a flexible net manipulated by maneuverable units (MUs). However, concurrent systematic explorations of design and control choices of the tether-net system to understand its full potential remain limited, partly due to the complex, constrained, nonlinear optimization problem that it presents -- one that involves a mixture of continuous, integer, and categorical variables, with the latter two arising from net connectivity and component choices, respectively. Classical binary encoding methods are often ineffective for solving highly nonlinear and multimodal Mixed Combinatorial Nonlinear Programming problems (MCNLPs) in engineering design, while integer coding approaches can introduce spurious relations among combinations. Given the graph-structured characteristics of the combinatorial space, this paper adopts and extends a new graph-learning-aided optimization approach to solve this MCNLP problem. Here, a Graph Neural Network (GNN) is trained to score (as output) and thereby recommend candidate combinations represented as nodes in a graph, with the continuous variable vector portion of a candidate design given as input. As a result, the MCNLP optimization reduces to an NLP, which can be solved using standard solvers. While this reduction approach is agnostic to the choice of the NLP solver, here a state-of-the-art Particle Swarm Optimization (PSO) algorithm with gradient-based fine-tuning is used as the solver. Demonstrated on the problem of concurrently designing the morphology of the net, the choice of mass and thrusters in the MUs, and the aiming points used by the controller of the tether-net system, the GNN-based recommender is shown to provide significantly faster convergence to similar optimal solutions, compared to direct solution of the MCNLP problem.

2605.29018 2026-05-29 cs.AI cs.CL

Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

Rebecca M. M. Hicke, Kiran Tomlinson

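AI summary: Analyzing the conversational trajectories of about 12,000 Microsoft Bing Copilot users alongside WildChat-4.8M data, the study finds that user behavior is highly sticky, that more active users gravitate toward complex, professionally oriented tasks, and that the WildChat dataset is skewed toward highly proficient "power" users, indicating that existing user behavior is hard to change and revealing user heterogeneity.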

Abstract

Although a growing body of research has begun to describe user--LLM interactions, the picture it paints is largely static; little is known about how individual users change their behavior over time. To address this gap, we analyze the conversational trajectories of $\sim$12,000 randomly sampled Microsoft Bing Copilot users and compare these with data from WildChat-4.8M. While the Copilot data contains significant population-level trends, we find that trends in individual user trajectories are much weaker; user habits prove to be overwhelmingly sticky. We also find stark differences between users of different activity levels: more active users have more successful conversations and use the LLM for more complex and professionally oriented tasks. Some user trends also appear in WildChat-4.8M, but we find evidence that this dataset is significantly skewed towards highly proficient "power" users. Ultimately, our results suggest that existing user behavior is difficult to change and demonstrate the extent of user heterogeneity. Our comparison between datasets highlights that WildChat does not represent typical user-AI interactions, an important caveat for downstream uses of the data.

2605.29012 2026-05-29 cs.CV

Trajectory Constraints for Imaging Inverse Problems

Chaoyan Huang, Haijie Yuan, Saiprasad Ravishankar

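AI summary: Proposes the TRACE framework, which constrains the reconstruction trajectory by coupling adjacent states, stabilizing the reconstruction process of diffusion-based and iterative methods for imaging inverse problems and improving reconstruction quality.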

Comments 20 pages, 10 figures

Abstract

Diffusion-based and iterative methods have become effective tools for solving imaging inverse problems. Their reconstruction process naturally forms a trajectory of intermediate estimates. Although these intermediate estimates define a reconstruction trajectory, most methods do not explicitly regularize the transitions between consecutive states. To address this limitation, we introduce TRACE, a training-free TRAjectory-Constrained rEconstruction framework that stabilizes the reconstruction path by coupling adjacent states along the trajectory. This gives a trajectory-level model that can be interpreted as a sequence of proximal updates. Since the exact proximal update is generally intractable, we approximate it with a neural mapping. This yields a diffusion-like reconstruction process with an explicit coupling between neighboring states. We provide a stability analysis showing that temporal coupling bounds trajectory variation and that this control is preserved under untrained network updates. Experiments on linear and nonlinear image reconstruction tasks show that TRACE improves reconstruction quality. Trajectory-level analyses and ablations confirm that temporal coupling directly affects state transitions along the reconstruction path.
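
The idea of coupling adjacent states through proximal updates can be illustrated on a small linear inverse problem. The sketch below is not TRACE itself: the abstract says the exact proximal update is approximated by a neural mapping, whereas this toy uses the closed-form proximal step for a least-squares data term. The function name `proximal_trajectory`, the coupling weight `lam`, and the random test problem are all assumptions.

```python
import numpy as np

def proximal_trajectory(A, y, x0, lam=1.0, steps=50):
    """Iterate x_{t+1} = argmin_x 0.5*||A x - y||^2 + 0.5*lam*||x - x_t||^2.

    The quadratic penalty couples each state to its predecessor; for this
    linear least-squares data term the minimizer has a closed form.
    """
    n = A.shape[1]
    system = A.T @ A + lam * np.eye(n)   # same matrix at every step
    data_term = A.T @ y
    states = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        states.append(np.linalg.solve(system, data_term + lam * states[-1]))
    return states

# Toy problem (hypothetical data): noiseless measurements of a 5-dimensional signal.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
x_true = rng.normal(size=5)
states = proximal_trajectory(A, A @ x_true, x0=np.zeros(5), lam=1.0)
step_sizes = [np.linalg.norm(b - a) for a, b in zip(states, states[1:])]
```

In this toy setting, raising `lam` tightens the coupling and shortens each step, which gives an intuition for how a temporal coupling term can bound the variation along a trajectory.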

2605.29009 2026-05-29 cs.LG cs.AI

Label-Free Reinforcement Learning via Cross-Model Entropy

无标签强化学习:跨模型熵方法

Matt Gorbett, Hossein Shirazi

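AI summary Proposes Cross-Model Entropy (CME) as a label-free reward signal for reinforcement-learning post-training of large language models; it outperforms baselines on open-ended instruction following.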

AI中文摘要

使用强化学习后训练大语言模型受限于奖励信号。现有方法需要真实可验证的奖励(限制于自动正确性检查领域,如数学、代码执行)或人类偏好标签(收集成本高且易受奖励攻击)。最近的无标签方法用自参考信号(如多数投票或模型自身输出的token熵)替代真实验证器,但可能强化模型自身错误。本文提出跨模型熵(CME),即生成器响应在独立验证器模型下的平均对数似然,作为无标签奖励信号用于强化学习后训练。CME是连续的、无需训练,基于验证器认为不意外的响应可能正确或高质量的准则。由于验证器独立于生成器,该信号无法通过自一致性被操纵。我们将CME集成到GRPO中,不改变训练循环的其他部分,将无标签强化学习扩展到开放指令遵循——自参考信号不适用或不适配的场景。在开放指令遵循(UltraFeedback提示,在AlpacaEval 2.0上评估)上,CME奖励在四个模型家族(Qwen、Llama、Gemma、OLMo)和三种训练范式(预训练、SFT和指令微调)的头对头LLM-as-Judge比较中击败未训练基线,调整平局后的胜率从52.5%到71.4%。代码将在发表后发布。

英文摘要

Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness checks (e.g., mathematics, code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals like majority voting or token entropy over a model's own outputs, but risk reinforcing a model's own errors. In this work we propose Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following -- a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.
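
The reward described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `cme_reward` and `group_advantages` are hypothetical, the log-probabilities are made-up numbers standing in for a verifier model's per-token scores of a generator's response, and the within-group standardization is the normalization commonly used in GRPO-style training, assumed here rather than taken from the paper.

```python
from statistics import mean, pstdev

def cme_reward(verifier_token_logprobs):
    """Mean log-likelihood of a response under the verifier (higher = less surprising)."""
    if not verifier_token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(verifier_token_logprobs) / len(verifier_token_logprobs)

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (assumed GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical verifier log-probs for three responses sampled for one prompt.
group = [
    [-0.2, -0.1, -0.3],        # unsurprising to the verifier
    [-1.5, -2.0, -0.5, -1.0],  # more surprising to the verifier
    [-0.8, -0.6],
]
rewards = [cme_reward(lp) for lp in group]
advantages = group_advantages(rewards)
```

Averaging over tokens rather than summing keeps the reward from simply favoring short responses, which is presumably why the abstract specifies the mean log-likelihood.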

2605.29008 2026-05-29 cs.LG

Causal Intelligence for Constraint-Aware Intervention Design to Induce State Transitions

因果智能:面向状态转换的约束感知干预设计

Zixuan Song, Uwe Mueller, Dimitris V. Manatakis

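AI summary Proposes COAST, which combines causal graph learning with constraint-aware multi-objective optimization to design, from data, interventions that induce system state transitions.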

AI中文摘要

通过有针对性的干预将系统从一个状态驱动到另一个状态是科学中的一个基本挑战,然而大多数预测模型提供的机制洞察有限,且缺乏原则性的决策框架。本文提出COAST(状态转换的因果最优行动),一种用于计算机设计约束干预的因果智能方法,该干预诱导用户定义的状态转换。给定表征源状态和目标状态的数据,COAST学习上下文特定的因果图和结构因果模型,将观测到的分布变化归因于机制层面的因果驱动因素,并引入一种新颖的约束感知多目标优化公式,平衡转换效果、干预复杂性和目标状态稳定性。该方法模块化且领域无关,通过可互换的组件整合特征选择、因果发现、因果建模以及干预识别和评估。在合成基准和真实生物数据集上,COAST恢复了关键的因果驱动因素,并识别出实现期望状态转换的稳健的单目标和多目标干预策略,同时提供透明的机制解释以指导实验验证。

英文摘要

Driving a system from one state to another through targeted interventions is a fundamental challenge in science, yet most predictive models offer limited mechanistic insight and no principled framework for decision-making. Here we present COAST (Causally Optimal Actions for State Transitions), a causal-intelligence approach for the in-silico design of constrained interventions that induce user-defined state transitions. Given data characterizing source and target states, COAST learns context-specific causal graphs and structural causal models, attributes observed distributional shifts to mechanism-level causal drivers, and introduces a novel constraint-aware multi-objective optimization formulation that balances transition efficacy, intervention complexity, and target-state stability. The approach is modular and domain-agnostic, integrating feature selection, causal discovery, causal modeling, and intervention identification and evaluation through interchangeable components. Across synthetic benchmarks and real biological datasets, COAST recovers key causal drivers and identifies robust single- and multi-target intervention strategies that achieve desired state transitions, accompanied by transparent mechanistic rationales to guide experimental validation.

2605.29007 2026-05-29 cs.CL

Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

错误作为透镜:通过合成误解生成探究LLM推理

Xinming Yang, Jun Li

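AI summary Proposes a framework that generates synthetic misconceptions targeted to a five-class error taxonomy adapted from Bloom's taxonomy in order to probe LLM reasoning, and finds that targeted error generation is harder than free-form error generation.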

AI中文摘要

个性化辅导、教师培训和教育研究需要访问\emph{有针对性的}合成误解,但隐私和IRB限制使得真实学生错误的标注语料库稀缺。LLM原则上可以大规模生成合成错误,但对于现代LLM来说,生成任意错误答案很容易,而生成与特定认知失败模式匹配的错误答案则困难得多。我们提出了一个框架,根据改编自修订版Bloom分类学的五类分类法生成有针对性的错误,并在TheoremQA数据集的问题上进行评估。生成代理(GA)根据目标类别起草候选错误解决方案,检查代理(EA)判断草案是否错误且类别一致。该框架提供了一种可重复的方法,用于构建在缺乏真实学生语料库的情况下分层类别的合成错误数据集。作为次要诊断,有针对性的错误生成比自由形式的错误答案生成困难得多,并且答案基础比扩展示例或外部教科书内容贡献更大。

英文摘要

Personalized tutoring, teacher training, and education research need access to \emph{targeted} synthetic misconceptions, but privacy and IRB constraints make labelled corpora of real student errors scarce. LLMs could in principle generate synthetic errors at scale, but producing an arbitrary wrong answer is easy for a modern LLM while producing one that matches a specified cognitive failure mode is much harder. We present a framework that generates errors targeted to a five-class taxonomy adapted from the revised Bloom's taxonomy, evaluated on questions from the TheoremQA dataset. A Generation Agent (GA) drafts a candidate erroneous solution conditioned on a target class, and an Examination Agent (EA) judges whether the draft is incorrect and class-consistent. The framework yields a reusable recipe for building class-stratified synthetic error datasets where authentic student corpora are unavailable. As a secondary diagnostic, targeted error generation is substantially harder than free-form incorrect-answer generation, and answer-grounding contributes more than expanded examples or external textbook content.

2605.29005 2026-05-29 cs.LG cs.AI

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

LoRe: 基于每步交互预算的自适应交互评估路由用于迭代图求解器

Jintao Li, Yong-Yi Wang, Zheng-An Wang, Heng Fan

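AI summary Proposes LoRe, which dynamically routes computation to high-conflict or high-uncertainty interactions so that only a fixed fraction of interactions is evaluated per step, substantially improving the scalability and speed of iterative graph solvers without sacrificing solution quality.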

Comments Accepted at ICML 2026

AI中文摘要

基于扩散的组合优化神经求解器反复重新评估密集的边/因子交互,导致推理时间昂贵且在大规模下常受内存限制。受多体物理计算方法的启发,我们引入LoRe,一种无需训练、推理时即插即用的包装器,强制执行每步交互评估预算:在每次迭代中,它通过动态路由计算到高冲突或高不确定性交互,仅评估固定比例的交互,而不是使用固定的稀疏化(例如静态kNN图或静态掩码)。在完全包含的端到端挂钟时间核算下,LoRe显著提高了最大独立集(MIS)问题的可扩展性,将可行推理扩展到基线内存溢出限制的3倍以上,实现了约8倍的加速和约12倍的峰值内存减少,同时在此范围内保持解质量。在大规模旅行商问题(TSP)上展示了跨任务通用性,并对拓扑变化具有零样本鲁棒性,LoRe在n=1000时实现了约15倍的加速,内存减少44倍,且巡回质量具有竞争力。

英文摘要

Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expensive in wall-clock time and often memory-bound at scale. Inspired by the computational methodologies of many-body physics, we introduce LoRe, a training-free, inference-time drop-in wrapper that enforces per-step interaction-evaluation budgeting: at each iteration, it evaluates only a fixed fraction of interactions by dynamically routing computation to high-conflict or high-uncertainty interactions, instead of using a fixed sparsification (e.g., static kNN graphs or static masks). Under fully inclusive end-to-end wall-clock accounting, LoRe substantially improves scalability on the Maximum Independent Set (MIS) problem, extending feasible inference more than $3\times$ beyond the baseline's out-of-memory limit, delivering a $\sim 8\times$ speedup and a $\sim 12\times$ peak-memory reduction, with solution quality preserved in this regime. Demonstrating cross-task generality on the large-scale Traveling Salesperson Problem (TSP) and zero-shot robustness to topology shifts, LoRe achieves a $\sim 15\times$ speedup at $n=1000$ with a $44\times$ memory reduction and competitive tour quality.
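
The per-step budgeting idea can be illustrated with a small selection routine. This is a sketch under stated assumptions, not LoRe itself: `route_interactions`, `edge_scores`, and `budget_fraction` are hypothetical names, and the scores are placeholder numbers standing in for whatever conflict or uncertainty estimate a solver would supply at each iteration.

```python
import heapq
import math

def route_interactions(scores, budget_fraction):
    """Return the indices of the interactions to evaluate at this step.

    scores: per-interaction priority (e.g. a conflict or uncertainty estimate).
    budget_fraction: fraction of interactions evaluated per step, in (0, 1].
    """
    if not 0.0 < budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in (0, 1]")
    k = max(1, math.ceil(budget_fraction * len(scores)))
    # Pick the k highest-priority interactions, then restore index order.
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)

# Hypothetical scores for eight edges at one solver iteration.
edge_scores = [0.05, 0.90, 0.10, 0.75, 0.20, 0.60, 0.01, 0.30]
selected = route_interactions(edge_scores, budget_fraction=0.25)
```

Because the scores are recomputed at every iteration, the selected subset can change from step to step, which is the contrast the abstract draws with a static kNN graph or a static mask.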

2605.29004 2026-05-29 cs.CV cs.GR

Auditing Training-Free 3D Shape Retrieval with Diffused Geodesic Moments

审计基于扩散测地矩的无训练三维形状检索

Zhicheng Du, Changyue Liu, Wenji Xi, Zhaotian Xie, Zhuo Deng, Ziheng Zhang, Yang Liu, Lan Ma

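AI summary Proposes Diffused Geodesic Moments (DGM) as a training-free shape descriptor, uses a protocol audit to isolate the effects of local signal design, normalization, aggregation, codebook fitting, and metric choice, and shows on the FAUST-Reg and TOSCA datasets that the protocol can dominate the results.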

AI中文摘要

无训练形状描述符的报告检索分数混淆了局部信号设计、归一化、聚合、码本拟合和度量选择,使得孤立组件评估困难。本文将描述符评估重新定义为协议审计。我们引入扩散测地矩(DGM),一种种子条件描述符,计算稀疏隐式热响应,将其转换为距离类场,并通过跨种子和尺度的低阶矩汇总每个顶点。DGM既作为实用的非谱基线,也作为隔离协议效应的工具。在注册的FAUST基准分割(FAUST-Reg)和TOSCA形状集合上,聚合匹配实验表明,基于热核签名特征构建的独立几何矩形状描述符基线(GMSD-HKS)在此实现中获得最高分数(平均精度(mAP)/top-1分别为0.621/0.820和0.865/0.963),波核签名(WKS)仍然是强经典信号,而DGM主要在稀疏求解、非谱部署或对称信息种子帧优先时有用。更广泛的发现是方法论的:输入场和聚合协议可以主导矩公式。本文贡献了可复现的协议级联分析、用于功能映射兼容性的跨形状对齐诊断,以及设计和报告无训练形状描述符的具体建议。

英文摘要

Reported retrieval scores for training-free shape descriptors conflate local signal design, normalization, aggregation, codebook fitting, and metric choices, making isolated component evaluation difficult. This paper reframes descriptor evaluation as a {\em protocol audit}. We introduce Diffused Geodesic Moments (DGM), a seed-conditioned descriptor that computes sparse implicit heat responses, converts them to distance-like fields, and summarizes each vertex by low-order moments across seeds and scales. DGM is used both as a practical non-spectral baseline and as an instrument for isolating protocol effects. On the registered FAUST benchmark split (FAUST-Reg) and the TOSCA shape collection, aggregation-matched experiments show that an independent Geometric Moment Shape Descriptor baseline built on Heat Kernel Signature features (GMSD-HKS) obtains the highest scores in this implementation ($0.621/0.820$ and $0.865/0.963$ mean average precision (mAP)/top-1), Wave Kernel Signature (WKS) remains a strong classical signal, and DGM is useful mainly when sparse solves, non-spectral deployment, or symmetry-informative seed frames are priorities. The broader finding is methodological: the input field and aggregation protocol can dominate the moment formula. The paper contributes a reproducible protocol-cascade analysis, a cross-shape alignment diagnostic for functional-map compatibility, and concrete recommendations for designing and reporting training-free shape descriptors.

2605.29002 2026-05-29 cs.LG cs.DC

FedQHD: Closed-Form Function-Space Federated Reinforcement Learning

FedQHD: 闭式函数空间联邦强化学习

Yuchen Hou, Yongshan Chen, Zhuowen Zou, Calvin Yeung, Mohsen Imani, Tian Lan, Mahdi Imani

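AI summary Proposes FedQHD, a federated Q-learning method using a hyperdimensional random-feature encoder with a linear readout; closed-form aggregation resolves the function-space inconsistency of parameter averaging, and the federation gap is analyzed theoretically.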

AI中文摘要

联邦强化学习使分散的智能体能够在不交换原始轨迹的情况下协作改进策略或价值估计。然而,FedAvg风格的参数平均在函数空间上是不一致的:当客户端使用异构编码器甚至相同的非线性网络时,平均参数不一定对应于任何公共函数空间中客户端价值函数的加权平均。我们提出FedQHD,一种使用超维(随机特征)状态编码器和线性读出层的联邦Q学习方法,使得Q函数在状态上是非线性的,但在可训练参数上是线性的。这种线性结构实现了闭式聚合。使用共享编码器时,函数空间共识更新恰好与局部读出矩阵的加权平均一致。使用异构编码器时,服务器通过在共享锚点状态集上平均客户端的Q值构建全局教师,每个客户端通过单次岭投影将该教师编译到其局部表示中。我们形式化了联邦差距——将联邦教师编译到异构客户端表示时产生的误差——相对于客户端特定的投影。我们证明该差距可分解为子空间错位、锚点集条件和正则化偏差。我们进一步确定锚点与维度比 $m \geq D_i$ 为良态区域,在该区域内差距简化为编码器异质性基底的倍数。在四个连续状态、离散动作控制基准上,FedQHD匹配或优于FedAvg风格基线和基于蒸馏的替代方法,同时需要更少的计算,并且联邦差距对编码器维度的经验依赖性与我们的理论分析一致。

英文摘要

Federated reinforcement learning enables decentralized agents to collaboratively improve policies or value estimates without exchanging raw trajectories. However, FedAvg-style parameter averaging is not function-space consistent: when clients use heterogeneous encoders or even identical nonlinear networks, averaged parameters need not correspond to the weighted average of client value functions in any common function space. We propose FedQHD, a federated Q-learning method using hyperdimensional (random-feature) state encoders with a linear readout, so that Q-functions are nonlinear in state yet linear in trainable parameters. This linear structure enables closed-form aggregation. With a shared encoder, the function-space consensus update coincides exactly with weighted averaging of local readout matrices. With heterogeneous encoders, the server constructs a global teacher by averaging client Q-values on a shared anchor-state set, and each client compiles this teacher into its local representation via a single ridge projection. We formalize the federation gap -- the error incurred when compiling a federated teacher into a heterogeneous client representation -- relative to a client-specific oracle projection. We show that this gap decomposes into subspace misalignment, anchor-set conditioning, and regularization bias. We further identify the anchor-to-dimension ratio $m \geq D_i$ as the well-conditioned regime in which the gap reduces to a multiple of the encoder heterogeneity floor. On four continuous-state, discrete-action control benchmarks, FedQHD matches or outperforms FedAvg-style baselines and distillation-based alternatives while requiring substantially less computation, and the empirical dependence of the federation gap on encoder dimension matches our theoretical analysis.
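
The shared-encoder claim follows from linearity in the trainable parameters, and a toy check makes this concrete. The sketch below is an assumption-laden illustration rather than the FedQHD implementation: the cosine random-feature encoder, the helper names (`make_encoder`, `q_values`, `average_readouts`), the dimensions, and the client weights are all hypothetical.

```python
import math
import random

def make_encoder(state_dim, feature_dim, seed=0):
    """Fixed random-feature encoder phi(s) = cos(P s + b); it is never trained."""
    rng = random.Random(seed)
    proj = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(feature_dim)]
    bias = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(feature_dim)]
    def encode(state):
        return [math.cos(sum(p * s for p, s in zip(row, state)) + b)
                for row, b in zip(proj, bias)]
    return encode

def q_values(readout, features):
    """Q(s, .) = W phi(s): one row of the readout matrix per discrete action."""
    return [sum(w * f for w, f in zip(row, features)) for row in readout]

def average_readouts(readouts, weights):
    """Weighted average of client readout matrices (weights assumed to sum to 1)."""
    n_actions, dim = len(readouts[0]), len(readouts[0][0])
    return [[sum(wt * W[a][d] for wt, W in zip(weights, readouts))
             for d in range(dim)] for a in range(n_actions)]

encode = make_encoder(state_dim=2, feature_dim=16, seed=0)
rng = random.Random(1)
# Two hypothetical clients, each with a 3-action readout over 16 features.
clients = [[[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(3)] for _ in range(2)]
weights = [0.25, 0.75]

phi = encode([0.3, -1.2])
q_from_avg_params = q_values(average_readouts(clients, weights), phi)
avg_of_client_q = [sum(wt * q for wt, q in zip(weights, qs))
                   for qs in zip(*(q_values(W, phi) for W in clients))]
```

With a shared encoder the two lists agree up to floating-point error at every state. The same identity does not hold in general when the averaged parameters sit inside a nonlinear network, which is the inconsistency the abstract attributes to FedAvg-style averaging.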

2605.29001 2026-05-29 cs.LG cs.AI

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

FormInv:数学推理基准中语义不变性的测量协议

Nishal Thomas, Noel Thomas

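AI summary Proposes the FormInv protocol, which detects semantic errors through a cross-model agreement audit and introduces the Semantic Consistency Rate (SCR) and Cochran's Q as metrics, revealing ranking changes and model inconsistencies that standard benchmarks miss.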

Comments 18 pages, 3 figures. Under review for the 3rd AI for Math Workshop (AI4Math), ICML 2026

AI中文摘要

对MathCheck(ICLR 2025)的释义质量审计在129组中检测到4个语义不正确的释义(3.1%);移除它们后,GPT-4o从第2名降至第4名,并将Claude Haiku和DeepSeek V3提升至其之上;这些排名变化对任何单模型评估都是不可见的。跨模型一致性以不到10美元的成本自动发现了这些错误(MathCheck中>=3/4模型;我们的主要评估中>=6/9);在我们自己的数据集中,相同的协议发现47%的自动生成的连接变体释义在语义上不正确。这一缺陷加剧了更深的测量差距:Claude Haiku 4.5达到86%的准确率,但SCR=50%,意味着其一半的定理在语义等价的重新表述下得到不同的答案,而9个模型的总体准确率仅跨越86-96%,但语义一致性率(SCR)跨越50-82%——这是标准基准无法捕捉的32个百分点的差距。形式上,对于9个前沿模型的任何目标排名,存在一个释义族上的权重实现该排名(无免费基准推论),因为没有模型在所有族上帕累托占优——因此选择族的基准设计者隐含地决定了哪个模型获胜。FormInv提供了审计协议(在外部基准上以100%召回率复制)、SCR和每个定理的Cochran's Q作为主要不变性度量,在9个模型上评估了366-811个项目(基于Lean4验证的定理),以及用于情境感知模型选择的FormInvSelector。

英文摘要

A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.
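
The two invariance measures named above can be sketched as follows. The Cochran's Q formula is the standard one for a binary table; everything else is an assumption made for illustration: the helper names, the toy table, the reading of SCR as "same outcome on every paraphrase of an item", and the choice of which axis holds paraphrases are not taken from the paper.

```python
def semantic_consistency_rate(outcomes):
    """Fraction of items whose outcome is identical across all paraphrases."""
    consistent = sum(1 for row in outcomes if len(set(row)) == 1)
    return consistent / len(outcomes)

def cochrans_q(table):
    """Cochran's Q for a binary table: rows are blocks, columns are conditions."""
    k = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    grand = sum(row_totals)
    denom = sum(r * (k - r) for r in row_totals)
    if denom == 0:
        return 0.0  # every row is all-0 or all-1, so there is no disagreement to test
    numer = k * (k - 1) * sum((c - grand / k) ** 2 for c in col_totals)
    return numer / denom

# Hypothetical correctness table: 4 items (rows) x 3 paraphrase variants (columns).
table = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]
scr = semantic_consistency_rate(table)
q = cochrans_q(table)
```

In this toy table accuracy is 50% and SCR is also 50%, but the two need not move together: a model can be accurate on the original phrasing of most items and still flip on restatements, which is the gap the abstract highlights.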

2605.29000 2026-05-29 cs.CL

Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

保留文本的有损文本压缩:策略性删除与LLM重建研究

Yuchun Zou, Junhong Tong, Jun Li

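AI summary Studies lossy semantic text compression, in which text is strategically deleted and then reconstructed by a large language model; comparing several deletion strategies, it finds that word-frequency deletion is a low-cost baseline, that semantic methods have a clear advantage at moderate compression, and that QLoRA fine-tuning yields a strong decoder.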

AI中文摘要

传统的无损文本压缩保留每一个字节,但在实际运行条件下对自然语言的增益通常有限。我们研究有损语义文本压缩,其中编码器策略性地删除部分文本,大语言模型(LLM)从保留的骨架中重建原始内容。我们对一系列删除策略进行基准测试,包括均匀步长删除、词长引导删除(WordLen)、词频引导删除(WordFreq)、LP优化删除(Opt)、基于GPT-2惊奇度的熵删除,以及结合频率和惊奇度信号的混合方法。在BBC新闻数据集上,保留率$r_{keep} \in [0.1,0.9]$的评估显示了三个主要发现。首先,WordFreq是一个强大的低成本基线:尽管仅使用静态频率查找表,它在编码器端速度远快于更昂贵的语义方法,同时仍具有竞争力。其次,语义和混合方法在轻度到中度压缩时提供最明显的增益,而词频删除在最低保留率时通常更鲁棒。第三,QLoRA微调产生一个强大的局部解码器,与Gemini 2.0 Flash竞争,并且在仅解码器比较中通常最强。额外的英文和中文实验表明,整体框架跨领域迁移,而最佳删除规则仍依赖于数据集。

英文摘要

Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study \emph{lossy semantic text compression}, where the encoder strategically deletes parts of the text and a large language model (LLM) reconstructs the original content from the retained skeleton. We benchmark a progression of deletion strategies, including uniform step deletion, word-length-guided deletion (WordLen), word-frequency-guided deletion (WordFreq), LP-optimized deletion (Opt), entropy-based deletion using GPT-2 surprisal, and hybrid methods that combine frequency and surprisal signals. Evaluation on the BBC News dataset across retention rates $r_{keep} \in [0.1,0.9]$ shows three main findings. First, WordFreq is a strong low-cost baseline: despite using only a static frequency lookup, it remains competitive with much more expensive semantic methods while being far faster at the encoder. Second, semantic and hybrid methods provide their clearest gains at mild-to-moderate compression, whereas word-frequency deletion is often more robust at the lowest retention rates. Third, QLoRA fine-tuning yields a strong local decoder that is competitive with Gemini 2.0 Flash and is often strongest in decoder-only comparisons. Additional English and Chinese experiments show that the overall framework transfers across domains, while the best deletion rule remains dataset-dependent.
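
A frequency-guided encoder of the kind described can be sketched with a static lookup table. This is an illustrative guess at the mechanism, not the paper's WordFreq rule: the table `WORD_FREQ`, the function `wordfreq_delete`, and the choice to drop the most frequent words first (on the assumption that they are the easiest for a decoder to restore) are all hypothetical.

```python
import math

# Hypothetical static frequency table; words missing from it are treated as rare.
WORD_FREQ = {"the": 0.05, "of": 0.03, "a": 0.025, "in": 0.02, "on": 0.01, "said": 0.004}

def wordfreq_delete(text, r_keep, freq=WORD_FREQ):
    """Keep roughly r_keep of the words, dropping the most frequent ones first."""
    words = text.split()
    n_keep = max(1, math.ceil(r_keep * len(words)))
    # Rank positions from rarest to most frequent; the stable sort keeps ties in order.
    ranked = sorted(range(len(words)), key=lambda i: freq.get(words[i].lower(), 0.0))
    kept = sorted(ranked[:n_keep])
    return " ".join(words[i] for i in kept)

skeleton = wordfreq_delete("The minister said the budget of the city rose in March", 0.5)
```

The encoder needs only a dictionary lookup and a sort, which is consistent with the abstract's point that this strategy is far cheaper than methods that run a language model to score surprisal.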

2605.28994 2026-05-29 cs.AI

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

BEAMS:用于建模与仿真的AI基准测试与评估

Sara Metcalf, William Schoenberg

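AI summary Introduces the BEAMS Initiative, which builds a human-centered benchmarking framework to evaluate AI tools for modeling and simulation and finds them weak at causal reasoning and quantitative error fixing.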

AI中文摘要

支持现实世界决策的AI工具必须能够构建仿真模型,为其建议提供依据并使其可解释。能够自动化建模实践某些方面的工具必须补充人类专业知识,而非取代它。BEAMS倡议旨在通过建立以人为本的建模与仿真实践的基准,引导AI工具在建模与仿真领域的发展走向负责任和合乎伦理的形式。该倡议利用开放的数字和组织基础设施,协作评估用于建模与仿真的AI工具。倡议托管的开源sd ai项目确保了透明度,并使贡献能够广泛共享。指导小组专注于优先考虑潜在基准,而技术小组则专注于以自动化测试的形式实施基准。针对多个不同评估类别的测试已经实施,并应用于支持定性模型构建、定量模型构建和模型讨论的AI工具。这些测试包括因果翻译、模型迭代、因果推理、一致性、模型行为解释、建议的模型构建步骤以及建议的模型修正。当sd ai项目的引擎与不同的LLM结合时,它们在这些评估上的表现揭示了不同AI工具之间的差异。倡议实施的评估表明,支持AI的建模工具在讨论和基本定性任务上的表现优于因果推理和定量错误修正。没有单一的LLM在所有引擎类型中占据主导地位,这突显了特定任务的重要性以及速度与准确性之间的权衡。倡议的持续努力旨在纳入考虑替代视角和以人为本用例的基准,以解决对偏见的担忧。

英文摘要

AI tools to support real-world decision making must be able to build simulation models that inform their recommendations and render them interpretable. Tools that can automate aspects of modeling practice must complement human expertise, not replace it. The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human-centered modeling and simulation practices. The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation. The open-source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly. A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests. Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion. These include tests for causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes. When engines from the sd ai project are coupled with different LLMs, their performance on these evaluations reveals variability across different AI tools. The evaluations implemented by the initiative demonstrate that AI-enabled modeling tools perform better at discussion and basic qualitative tasks than at causal reasoning and quantitative error fixing. No single LLM dominates across engine types, highlighting the importance of specific tasks and tradeoffs between speed and accuracy. Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human-centered use cases.

2605.28990 2026-05-29 cs.LG

Learning Robust and Task-Invariant Functional Representation from fMRI through Siamese Self-Supervised Learning

通过孪生自监督学习从fMRI中学习鲁棒且任务不变的功能表示

Jiyao Wang, Peiyu Duan, Nicha C. Dvornek, Lawrence H. Staib, Denis Sukhodolsky, Pamela Ventola, James S. Duncan

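AI summary Proposes BrainSimSiam, a lightweight self-supervised framework that uses positive-only pairs to learn robust, generalizable fMRI representations, outperforming fully supervised baselines on multiple downstream tasks and approaching the performance of large-scale models.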

AI中文摘要

功能磁共振成像(fMRI)是研究人脑功能的强大工具。然而,数据采集的高成本和精神病学评定量表固有的主观性常常导致数据集样本量小且标签质量可变,特别是在针对特定神经疾病时。结合fMRI数据固有的高维性,这些限制显著增加了模型过拟合的风险。近年来,通过组合多个数据集开发fMRI基础模型的兴趣日益增长;然而,预训练和微调所需的计算资源往往令人望而却步。我们展示了一个轻量级自监督框架能够产生跨多种下游任务泛化的表示,超越全监督基线,并接近大规模模型的性能。我们引入了BrainSimSiam,一种数据高效的自监督表示学习框架,利用仅正样本对来学习鲁棒且可泛化的特征。我们证明了所学表示在多个下游分类和回归任务中取得了强劲性能,突显了BrainSimSiam在数据有限的神经影像应用中的潜力。

英文摘要

Functional magnetic resonance imaging (fMRI) is a powerful tool for investigating human brain function. However, the high cost of data acquisition and the inherent subjectivity of psychiatric rating scales often lead to datasets with small sample sizes and variable label quality, especially when targeting a specific neurological condition. Combined with the inherently high dimensionality of fMRI data, these limitations substantially increase the risk of model overfitting. Recent years have seen growing interest in developing fMRI foundation models by combining multiple datasets; however, the computational resources needed for pretraining and fine-tuning are often prohibitive. We show that a lightweight self-supervised framework yields representations that generalize across diverse downstream tasks, outperforming fully supervised baselines and approaching the performance of large-scale models. We introduce BrainSimSiam, a data-efficient self-supervised representation learning framework that leverages positive-only data pairs to learn robust and generalizable features. We demonstrate that the learned representations achieve strong performance across multiple downstream classification and regression tasks, highlighting the potential of BrainSimSiam for data-limited neuroimaging applications.

2605.28983 2026-05-29 cs.LG cs.AI math.DS math.RT physics.comp-ph

The Hamilton-Jacobi Theory of Deep Learning

深度学习的哈密顿-雅可比理论

Jose Marie Antonio Miñoza, Erika Fille T. Legara, Christopher P. Monterola

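AI summary Establishes a rigorous correspondence between deep learning and viscous Hamilton-Jacobi equations by identifying neural network training, exactly, as a search through Hamilton-Jacobi initial-value problems; unifies residual networks, Transformers, RNNs, and other architectures; and derives quantitative results including a minimax optimal generalization rate and adversarial robustness.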

AI中文摘要

在本文中,神经网络训练被精确地识别为通过哈密顿-雅可比初值问题的搜索:每个梯度步选择粘性哈密顿-雅可比方程的初始数据,其Hopf-Cole传播子最佳拟合观测值;在推理时,输入是评估该解的空间点,初始条件已编码在权重中。这种对应对于log-sum-exp层是精确的,对于更广泛的架构(残差网络、Transformer和循环架构(RNN、LSTM、SSM))是结构性的,它们离散化同一类哈密顿-雅可比方程,具有依赖于架构的哈密顿量和粘性。一个单一的变形参数ε在交换图中统一了所有四个视角(网络、热带代数、粘性PDE、凸优化),并在Lipschitz条件下封闭。定量结果包括:固定t时的极小极大最优泛化率O(n^{-1/(d+2)});由ε控制的对抗鲁棒性;残差网络的反向传播作为哈密顿系统的协态方程(庞特里亚金最大值原理);通过PDE求积与数据内在维度一致的标度指数;以及闭式O(N)影响函数(softmax归因权重π_j),其熵景观随着ε增加经历折叠分岔,每个分岔合并归因盆地。

英文摘要

In this paper, training a neural network is identified, exactly, as a search through Hamilton--Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton--Jacobi equation whose Hopf--Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton--Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter $\varepsilon$ unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax optimal generalization rate $O(n^{-1/(d+2)})$ for fixed $t$; adversarial robustness controlled by $\varepsilon$; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with data intrinsic dimension via PDE quadrature; and a closed-form $O(N)$ influence function (softmax attribution weights $\pi_j$) whose entropy landscape undergoes fold bifurcations as $\varepsilon$ increases, each merging attribution basins.
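
The role of the deformation parameter can be seen numerically in the log-sum-exp function itself. The sketch below relies only on two standard facts: $\varepsilon \log \sum_j e^{x_j/\varepsilon}$ lies between $\max_j x_j$ and $\max_j x_j + \varepsilon \log N$, and its gradient is the softmax of $x/\varepsilon$. The function names and sample values are hypothetical, and identifying these softmax weights with the paper's attribution weights is an assumption made for illustration.

```python
import math

def soft_max_value(xs, eps):
    """eps * log(sum(exp(x / eps))), computed stably by factoring out max(xs)."""
    m = max(xs)
    return m + eps * math.log(sum(math.exp((x - m) / eps) for x in xs))

def softmax_weights(xs, eps):
    """Gradient of soft_max_value with respect to xs: softmax of xs / eps."""
    m = max(xs)
    exps = [math.exp((x - m) / eps) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

xs = [0.2, 1.0, -0.5, 0.9]
values = {eps: soft_max_value(xs, eps) for eps in (1.0, 0.1, 0.01)}
sharp = softmax_weights(xs, 0.01)
smooth = softmax_weights(xs, 1.0)
```

As `eps` shrinks, the value approaches the hard maximum (the tropical, max-plus limit) and the weights concentrate on a single entry; as it grows, the weights spread out.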

2605.28978 2026-05-29 cs.AI cs.CE

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis

VFEAgent: 面向端到端自动化有限元分析的多模态智能体框架

Jiachen Zhang, Junyi Lao, Chenghao Liu, Siyuan Liu, Shixin Wu, Linsen Zhang, Boyu Wang, Songfang Huang

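AI summary Proposes VFEAgent, a multi-agent system that combines a multimodal vision-language pipeline with a verification-first code synthesis framework to automate finite element modeling and simulation end to end from input images and problem descriptions.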

Comments 9 pages, 3 figures, 2 tables. Equal contribution: Jiachen Zhang and Junyi Lao. Corresponding author: Songfang Huang. Preprint

AI中文摘要

有限元分析(FEA)是现代工程设计的基石。然而,其工作流程本质上复杂且高度依赖领域专业知识。尽管近期有研究将大语言模型(LLM)集成到FEA中,但现有方法在处理多模态输入和执行复杂任务方面存在局限性。为解决这些限制,我们提出了VFEAgent,一个端到端的多智能体系统,旨在直接从输入图像和问题描述中自动化FEA建模和仿真。我们的方法整合了两个核心组件:(1)多模态视觉-语言多智能体流水线,采用ReAct驱动的推理从异构输入中提取结构化的FEA规范;(2)验证优先的代码合成框架,结合了强大的自调试和回退机制,以确保可执行性和物理有效性。我们在各种工程力学场景下系统评估了该系统。结果表明,VFEAgent在生成完整且物理有效的仿真方面取得了高成功率,在可靠性和正确性上优于基于LLM的基线方法。这些发现验证了自动化完整FEA工作流程的可行性,突显了该框架在将工程师从繁琐的手工分析中解放出来的潜力。

英文摘要

Finite Element Analysis (FEA) serves as the cornerstone of modern engineering design. However, its workflow is inherently complex and relies heavily on domain expertise. Although recent efforts have integrated Large Language Models (LLMs) into FEA, existing approaches face limitations in handling multimodal inputs and executing complex tasks. To address these limitations, we propose VFEAgent, an end-to-end multi-agent system designed to automate FEA modeling and simulation directly from input images and problem descriptions. Our methodology integrates two core components: (1) a multimodal vision-language multi-agent pipeline that employs ReAct-driven reasoning to extract structured FEA specifications from heterogeneous inputs and (2) a verification-first code synthesis framework, incorporating robust self-debugging and fallback mechanisms to ensure executability and physical validity. We systematically evaluated the system across various engineering mechanics scenarios. The results demonstrate that VFEAgent achieves a high success rate in generating complete and physically valid simulations, outperforming LLM-based baseline methods in reliability and correctness. These findings validate the feasibility of automating the complete FEA workflow, highlighting the framework's potential to liberate engineers from tedious manual analysis.

2605.28977 2026-05-29 cs.LG cs.AI

Comparing Post-Hoc Explainable AI Methods for Interpreting Black-Box EEG Models in Depression Detection

比较事后可解释AI方法用于解释抑郁症检测中的黑盒脑电图模型

Antonia Šarčević, Nikolina Frid

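AI summary Analyzes the decisions of an InceptionTime architecture for EEG-based depression detection using several post-hoc explainability methods (DeepSHAP, Integrated Gradients, GradCAM, Occlusion, and Permutation Feature Importance); attribution patterns partially converge on frontal, temporal, and posterior regions, especially in the right hemisphere, but differ across methods, underscoring both the usefulness and the limitations of post-hoc explainability.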

AI中文摘要

深度学习的最新进展使得基于脑电图的重度抑郁症分类越来越准确,但高容量模型的决策过程仍然难以解释。本研究调查了应用于训练用于基于脑电图的重度抑郁症检测的InceptionTime架构的多种事后可解释性方法。分析包括基于Shapley、基于梯度和基于扰动的归因方法:DeepSHAP、集成梯度、GradCAM、遮挡和置换特征重要性。在受试者级别的分层5折交叉验证框架内,通过跨脑电图片段和受试者的全局归因聚合进行可解释性分析。评估的方法揭示了部分收敛的归因模式,其中额叶、颞叶和后部脑区(尤其是右半球)反复受到关注。定量比较表明,基于梯度和基于扰动的方法之间具有实质性一致性,而DeepSHAP产生了相对独特的归因分布。同时,可解释性方法之间的差异凸显了方法假设对所得解释的影响。总体而言,结果表明,不同的事后可解释性方法捕捉了基于脑电图的深度学习模型在抑郁症检测中的部分重叠的相关性结构。尽管观察到的归因模式与先前几项关于重度抑郁症的脑电图研究大致一致,但该分析应被视为探索性的,而非确凿的神经生理学生物标志物或临床适用性的证据。该研究强调了事后可解释性在解释精神病学应用中的黑盒脑电图分类器方面的有用性和局限性。

英文摘要

Recent advances in deep learning have enabled increasingly accurate electroencephalography (EEG)-based classification of Major Depressive Disorder (MDD), but the decision-making processes of high-capacity models remain difficult to interpret. This study investigates multiple post-hoc explainability methods applied to an InceptionTime architecture trained for EEG-based MDD detection. The analysis includes Shapley-based, gradient-based, and perturbation-based attribution approaches: DeepSHAP, Integrated Gradients, GradCAM, Occlusion, and Permutation Feature Importance. Explainability analysis was performed within a subject-level stratified 5-fold cross-validation framework using global attribution aggregation across EEG segments and subjects. The evaluated methods revealed partially convergent attribution patterns, with recurring emphasis on frontal, temporal, and posterior EEG regions, particularly in the right hemisphere. Quantitative comparison demonstrated substantial agreement between gradient- and perturbation-based approaches, while DeepSHAP produced comparatively distinct attribution distributions. At the same time, variability between explainability methods highlighted the influence of methodological assumptions on the resulting explanations. Overall, the results suggest that different post-hoc explainability approaches capture partially overlapping relevance structures in EEG-based deep learning models for depression detection. Although the observed attribution patterns are broadly consistent with several previous EEG studies of MDD, the analysis should be interpreted as exploratory rather than evidence of definitive neurophysiological biomarkers or clinical applicability. The study highlights both the usefulness and limitations of post-hoc explainability for interpreting black-box EEG classifiers in psychiatric applications.

2605.28975 2026-05-29 cs.LG

A Training-Time Diagnostic for Generalization via the Log-Alignment Ratio

基于对数对齐比率的训练时泛化诊断

Ali Shehper, Ashish Vaswani

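AI summary Studies the log-alignment ratio (LAR), a measure of parameter-activation alignment that tracks the transition between memorization and generalization by capturing the spread of the weight and activation spectra during training, and that tracks the generalization gap in grokking and language-model pre-training.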

Comments 32 pages, 25 figures

AI中文摘要

我们研究了对数对齐比率(LAR),这是参数化理论中引入的一种参数-激活对齐度量。我们将其重新表述为矩阵归一化奇异值平方的权重谱$p$与输入在其奇异方向上投影的归一化平方的激活谱$q$之间的重叠。我们表明,通过捕捉训练过程中$p$和$q$的扩散,非嵌入LAR在两种不同设置下跟踪记忆与泛化之间的转换。在grokking中,LAR预测学习函数的有效维度:$k \approx n^{2(1-\text{LAR})}$,其中$n$是矩阵的输入维度。在3B参数语言模型预训练中,其与无过拟合基线的偏差跟踪泛化差距,并且其下降速率随着过拟合的接近而增加。LAR可从前向传播过程中可用的量计算,计算开销可忽略,且无需保留验证数据。

英文摘要

We study the log-alignment ratio (LAR), a measure of parameter-activation alignment, introduced in parameterization theory. We reformulate it as the overlap between a weight spectrum $p$ of the normalized squared singular values of a matrix and an activation spectrum $q$ of the normalized squared projections of inputs onto its singular directions. We show that unembedding LAR tracks the transition between memorization and generalization in two different settings by capturing the spread of $p$ and $q$ during training. In grokking, LAR predicts the effective dimension of the learned function: $k \approx n^{2(1-\text{LAR})}$, where $n$ is the input dimension of the matrix. In 3B-parameter language model pre-training, its deviation from a non-overfitting baseline tracks the generalization gap, and its rate of decline increases as overfitting approaches. LAR is computable from quantities available during the forward pass with negligible computational overhead, and requires no held-out validation data.
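
The grokking relation $k \approx n^{2(1-\text{LAR})}$ is easy to evaluate for a feel of its scale. The snippet below only plugs numbers into that formula; the function name `effective_dimension` and the LAR values are hypothetical, and it does not compute LAR itself, whose exact definition is not given in the abstract.

```python
def effective_dimension(n, lar):
    """Effective dimension k ~ n ** (2 * (1 - LAR)) for input dimension n."""
    return n ** (2.0 * (1.0 - lar))

n = 1024
# Hypothetical LAR readings taken at different points in training.
k_by_lar = {lar: effective_dimension(n, lar) for lar in (0.5, 0.75, 1.0)}
```

Under this relation, LAR = 0.5 gives $k = n$, LAR = 0.75 gives $k = \sqrt{n}$ (32 for $n = 1024$), and LAR = 1 gives $k = 1$, so a higher LAR corresponds to a lower-dimensional learned function.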