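arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.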

2604.23267 2026-05-19 cs.CL cs.LG

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

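Bishwamittra Ghosh, Soumi Das, Till Speicher, Qinyuan Wu, Mohammad Aflah Khan, Deepak Garg, Krishna P. Gummadi, Evimaria Terzi

Affiliations: Max Planck Institute for Software Systems; Boston University

AI summary: This paper compares fine-tuning and in-context learning in large language models from a formal language learning perspective. Using precise language boundaries, controlled string sampling, and contamination-free tasks, it finds that fine-tuning outperforms in-context learning on in-distribution generalization, that the two perform comparably on out-of-distribution generalization, and that their inductive biases are similar at partial proficiency but diverge at higher proficiency levels.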

Comments Accepted at ACL 2026 (Main)

Abstract

Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLMs, behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.
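The discriminative test described in the abstract can be sketched as a pairwise comparison of log-probabilities. This is a minimal illustration, not the authors' implementation: the abstract does not say how pairs are formed or aggregated, so the all-pairs success rate, the function name `discriminative_accuracy`, and the log-probability values are all assumptions.

```python
from itertools import product


def discriminative_accuracy(in_lang_logprobs, out_lang_logprobs):
    """Fraction of (in-language, out-of-language) pairs in which the
    in-language string receives the strictly higher log-probability.

    Assumption: all pairs are compared and averaged; the paper may
    aggregate differently.
    """
    pairs = list(product(in_lang_logprobs, out_lang_logprobs))
    if not pairs:
        raise ValueError("need at least one string on each side")
    wins = sum(1 for lp_in, lp_out in pairs if lp_in > lp_out)
    return wins / len(pairs)


# Hypothetical sequence log-probabilities, e.g. summed token log-probs from a model.
in_lang = [-3.2, -4.1, -2.7]
out_lang = [-5.0, -3.9]
score = discriminative_accuracy(in_lang, out_lang)
```

A score of 1.0 would mean every in-language string outranks every out-of-language string; 0.5 is what a model with no preference would reach on average.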

2604.23135 2026-05-19 cs.LG

Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization

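William Feng, Ethan Lou, Aryan Sharma

Affiliations: Yale University

AI summary: This study examines paraphrase-induced failure modes in Lean 4 autoformalization. Applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems, it finds that failures at the code-generation layer dominate paraphrase sensitivity and that the types of failure differ by dataset. The results yield a failure-mode taxonomy for autoformalization and motivate targeted training-time interventions.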

Abstract

Lean 4 autoformalization has become increasingly popular in recent years, with frontier language models and open-weight autoformalizers now producing valid formalizations of mathematical theorems. However, these evaluations often rely on single canonical phrasings of theorems and rarely probe whether outputs are robust to natural variation in inputs, while prior work has shown that semantically equivalent paraphrases often induce divergent formal outputs. We study the structure of these divergences in Lean 4 by applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems. Across four frontier models and three open-weight autoformalizers, we find that paraphrase sensitivity is dominated by failures at the code-generation layer, and that these failures are typed differently by dataset. Furthermore, these patterns generalize to open-weight models, showing that state-of-the-art autoformalizers still struggle to generate valid Lean code. Our results provide a failure-mode taxonomy for autoformalization and motivate training-time interventions targeted at specific compilation failures.
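To make "deterministic paraphrase rules" concrete, the sketch below applies a fixed list of rewrite rules to a problem statement and returns one variant per rule that fires. The three rules and the sample statement are hypothetical; the abstract does not list the rule set the authors actually used.

```python
import re

# Hypothetical rewrite rules; the paper's real rule set is not given in the abstract.
RULES = [
    (re.compile(r"\bProve that\b"), "Show that"),
    (re.compile(r"\bfor all\b"), "for every"),
    (re.compile(r"\bif and only if\b"), "iff"),
]


def paraphrases(statement):
    """Return one paraphrase per rule that changes the statement.

    The rules are plain regex substitutions, so the same input always
    yields the same variants in the same order.
    """
    variants = []
    for pattern, replacement in RULES:
        rewritten = pattern.sub(replacement, statement)
        if rewritten != statement:
            variants.append(rewritten)
    return variants


statement = "Prove that for all real x, x^2 = 0 if and only if x = 0."
variants = paraphrases(statement)
```

Each variant could then be sent to an autoformalizer and the resulting Lean 4 outputs compared against the output for the original phrasing.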

2604.22626 2026-05-19 cs.CL

From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia

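Angelo Maria Sabatini

Affiliations: The BioRobotics Institute; Scuola Superiore Sant’Anna

AI summary: Using a symbolic representation based on vowel-consonant encoding, this paper studies the structural organisation of Dante's Commedia. It finds that a graphemic memory index rises gradually from the Inferno to the Paradiso, indicating a directional shift in local dependency structure; trigram analysis identifies recurrent configurations in lexical environments and links local symbolic dependencies to lexical structure.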

Comments 26 pages, 8 figures, 1 supplementary material; submitted to Journal of Computational Literary Studies

Abstract

This study investigates the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing local persistence and alternation patterns. Across the poem, this index shows a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram analysis identifies a restricted set of recurrent configurations acting as graphemic probes, linking Markov patterns to lexical environments and orthographic phenomena such as apostrophised forms. A complementary classification analysis identifies cantica-specific lexical anchors, showing that local symbolic dependencies reflect both the separation among the three cantiche and a continuous progression across the poem. The results provide an interpretable framework connecting local symbolic structure with higher-level textual organisation.
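The V/C encoding and the four-state chain can be sketched as follows. Two points are assumptions rather than details from the abstract: the four states are taken to be the overlapping symbol pairs VV, VC, CV, and CC, and the vowel inventory is a plain set of unaccented and accented Italian vowels. The graphemic memory index itself is not defined in the abstract, so it is not computed here.

```python
from collections import Counter, defaultdict

# Assumed vowel inventory; every other letter is treated as a consonant.
VOWELS = set("aeiouàèéìíòóùú")


def vc_encode(text):
    """Map each letter to 'V' or 'C'; spaces, apostrophes, and punctuation are dropped."""
    return "".join("V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha())


def pair_chain(symbols):
    """Estimate transition probabilities between overlapping symbol pairs.

    Assumption: the four Markov states are the pairs VV, VC, CV, CC.
    Returns {source_pair: {next_pair: probability}}.
    """
    pairs = [symbols[i:i + 2] for i in range(len(symbols) - 1)]
    counts = defaultdict(Counter)
    for src, dst in zip(pairs, pairs[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


line = "Nel mezzo del cammin di nostra vita"  # opening line of the Inferno
encoded = vc_encode(line)
chain = pair_chain(encoded)
```

Because consecutive pairs overlap by one symbol, each state has at most two successors, which keeps the estimated transition table small even for a full cantica.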

2604.22282 2026-05-19 cs.CL

STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation

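Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu

Affiliations: AI Product Center, Kingsoft Corporation

AI summary: This paper proposes STEM, a framework that reframes multi-hop reasoning as a schema-guided graph search task. It addresses the structural heterogeneity of knowledge graphs and the lack of a global structural perspective in existing reasoning-path retrieval methods, improving the accuracy and evidence completeness of multi-hop reasoning.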

Comments 34 pages, 16 figures, accepted to ACL 2026 (Main Conference, Oral Presentation)

Abstract

Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To integrate global structural information more effectively during graph construction, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.

2604.20155 2026-05-19 cs.CV

GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

Ao Gao, Jingyu Gong, Xin Tan, Zhizhong Zhang, Lizhuang Ma, Yuan Xie

Affiliations: School of Computer Science and Technology, East China Normal University; Shanghai Innovation Institute; Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University; Shanghai Key Laboratory of Computer Software Evaluating and Testing; Department of Computer Science and Engineering, Shanghai Jiao Tong University

AI summary: This paper proposes GSCompleter, a distillation-free plugin that performs metric-aware 3D Gaussian Splatting completion through a stable "Generate-then-Register" workflow, improving completion quality and efficiency and setting new state-of-the-art results on three benchmarks.

Abstract

3D Gaussian Splatting (3DGS) has revolutionized high-fidelity neural rendering with its explicit representation and efficiency. However, reconstructing scenes from sparse viewpoints suffers from severe geometric voids and floaters due to limited coverage. Current scene completion methods typically rely on an iterative "Repair-then-Distill" paradigm, which is computationally intensive, prone to unstable optimization, and susceptible to overfitting. To address these limitations, we propose GSCompleter, a distillation-free plugin that shifts scene completion to a stable "Generate-then-Register" workflow. Specifically, GSCompleter synthesizes visually plausible 2D reference images and explicitly lifts them into 3D Gaussian primitives with a consistent metric scale via a robust Stereo-Anchor View Selection mechanism. These newly generated primitives are then seamlessly integrated into the global scene using a novel Ray-Constrained Registration strategy. By replacing unstable distillation with rapid geometric registration, GSCompleter exhibits superior 3DGS completion performance across three benchmarks, enhancing both quality and efficiency over various baselines and achieving new state-of-the-art (SOTA) results.

2604.18966 2026-05-19 cs.LG cs.AI

Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training

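Yunbo Long, Tejumade Afonja, Guangya Hao, Alexandra Brintrup, Mario Fritz

Affiliations: Department of Engineering, University of Cambridge; CISPA Helmholtz Center for Information Security, Saarbrücken, Germany; The Alan Turing Institute, London

AI summary: This paper studies iterative reward-guided post-training through a generate-score-align protocol and proposes TabGRAA, a group-relative alignment method that improves tabular language models by comparing group-averaged policy/reference log-ratios of high- and low-reward generated groups. On five mixed-type benchmarks it outperforms additional supervised fine-tuning and achieves the best average trade-off between fidelity and downstream utility, while keeping empirical privacy diagnostics close to the supervised baseline.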

Abstract

Tabular language models can generate synthetic tables by modeling rows as token sequences, but they are typically trained once with supervised fine-tuning and then used as static synthesizers. This is limiting because next-token likelihood does not directly optimize the distributional, utility, and indistinguishability properties used to evaluate synthetic data. We study iterative reward-guided post-training for tabular language models through a generate-score-align protocol, where a generator samples synthetic rows, a task-specified reward ranks them, and the model is updated relative to a fixed supervised reference. Within this protocol, we propose TabGRAA (Tabular Group-Relative Advantage Alignment), a group-relative alignment method that compares high- and low-reward generated groups using group-averaged policy/reference log-ratios rather than one-to-one preference pairs. Across five mixed-type benchmarks, TabGRAA improves a GReaT backbone beyond additional supervised fine-tuning and achieves the strongest average trade-off among adapted DPO, KTO, and NPO baselines on fidelity and downstream utility, while maintaining empirical privacy diagnostics near the supervised baseline. Ablations show that the gains depend on meaningful reward ranking and stable group-level updates rather than extra training alone. Reward-substitution and scorer-separation studies further show that the post-training loop can use both classifier-based and classifier-free rewards, and that proper scorer separation is important for preserving the fidelity-utility-privacy trade-off. These results position TabGRAA as a self-improving post-training method for tabular language-model generators, complementary to strong static tabular synthesizers.
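The idea of comparing group-averaged policy/reference log-ratios can be sketched with a small loss function. The abstract does not give the TabGRAA objective, so the logistic (DPO-style) form below, the `beta` scale, and all numeric values are assumptions chosen only to illustrate the group-level comparison.

```python
import math


def mean_log_ratio(group):
    """Average of log pi_policy(row) - log pi_ref(row) over a group of rows."""
    return sum(policy_lp - ref_lp for policy_lp, ref_lp in group) / len(group)


def group_relative_loss(high_group, low_group, beta=0.1):
    """Illustrative loss on the gap between group-averaged log-ratios.

    Assumption: a logistic loss, -log(sigmoid(margin)), stands in for the
    paper's objective. Each group is a list of
    (policy_logprob, reference_logprob) tuples, one per generated row.
    """
    margin = beta * (mean_log_ratio(high_group) - mean_log_ratio(low_group))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Hypothetical log-probabilities for two high-reward and two low-reward rows.
high = [(-10.0, -12.0), (-9.0, -10.0)]  # policy already favours these rows
low = [(-11.0, -10.0), (-8.0, -8.0)]    # policy disfavours or matches the reference
loss = group_relative_loss(high, low)

# When the policy equals the reference everywhere, the margin is zero.
baseline = group_relative_loss([(-5.0, -5.0)], [(-6.0, -6.0)])
```

Averaging within each group before comparing means no individual high-reward row has to be matched with a specific low-reward row, which is the contrast with one-to-one preference pairs that the abstract draws.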

2604.16429 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph

(Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models

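Maksim Zhdanov, Ana Lucic, Max Welling, Jan-Willem van de Meent

Affiliations: AMLab; University of Amsterdam

AI summary: This paper proposes Mosaic, a model that generates ensemble members through learned functional perturbations and operates on native-resolution grids using mesh-aligned block-sparse attention, capturing long-range dependencies at linear cost. At 1.5° resolution it matches or outperforms models trained at finer resolution and achieves state-of-the-art results among 1.5° models.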

Comments Accepted to ICML 2026

Abstract

We introduce Mosaic, a probabilistic weather forecasting model that addresses three failure modes of spectral degradation in ML-based weather prediction: spectral damping (statistical), high-frequency aliasing (architectural), and residual high-frequency leakage (parametric). Mosaic generates ensemble members through learned functional perturbations and operates on native-resolution grids via mesh-aligned block-sparse attention, a hardware-aligned mechanism that captures long-range dependencies at linear cost by sharing keys and values across spatially adjacent queries. At 1.5° resolution with 214M parameters, Mosaic matches or outperforms models trained on 6× finer resolution on key variables and achieves state-of-the-art results among 1.5° models, producing well-calibrated ensembles whose individual members exhibit near-perfect spectral alignment across all resolved frequencies. A 24-member, 10-day forecast takes under 12 s on a single H100 GPU. Code is available at https://github.com/maxxxzdn/mosaic.

2604.15851 2026-05-19 cs.LG cs.AI cs.CR

DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy

Erchi Wang, Pengrun Huang, Eli Chien, Om Thakkar, Kamalika Chaudhuri, Yu-Xiang Wang, Ruihan Wu

Affiliations: Halıcıoğlu Data Science Institute, UC San Diego; Department of Computer Science and Engineering, UC San Diego; Department of Electrical Engineering, National Taiwan University; OpenAI

AI summary: This paper introduces DPrivBench, a benchmark for evaluating the ability of large language models to reason about differential privacy. It finds that current models show substantial gaps on advanced algorithms and points to directions for improving automated differential privacy reasoning.

Abstract

Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and evaluating such methods, and complements existing benchmarks for mathematical reasoning.
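As an example of the kind of question a benchmark instance poses ("does this mechanism satisfy the stated DP guarantee?"), the sketch below checks the textbook Laplace mechanism numerically. It is not taken from DPrivBench: the query values, the output grid, and the helper names are assumptions, and a grid check is only a sanity test, not a proof.

```python
import math


def laplace_logpdf(y, mu, scale):
    """Log-density of a Laplace distribution centred at mu."""
    return -math.log(2 * scale) - abs(y - mu) / scale


def max_privacy_loss(f_x, f_neighbor, scale, outputs):
    """Largest absolute log-density ratio between two neighbouring inputs,
    evaluated over a finite grid of candidate outputs."""
    return max(
        abs(laplace_logpdf(y, f_x, scale) - laplace_logpdf(y, f_neighbor, scale))
        for y in outputs
    )


epsilon = 0.5
sensitivity = 1.0  # a counting query changes by at most 1 between neighbours
outputs = [i / 10 for i in range(-500, 501)]

# Noise scale sensitivity / epsilon: the privacy loss should stay within epsilon.
loss_ok = max_privacy_loss(10.0, 11.0, sensitivity / epsilon, outputs)

# Too little noise (scale 1.0 instead of 2.0): the loss exceeds epsilon.
loss_bad = max_privacy_loss(10.0, 11.0, 1.0, outputs)
```

The analytic argument behind the check is that the log-density ratio equals (|y - f(x')| - |y - f(x)|) / scale, which the triangle inequality bounds by sensitivity / scale.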

2604.15762 2026-05-19 cs.LG

Zero-Shot Scalable Resilience in UAV Swarms: A Decentralized Imitation Learning Framework with Physics-Informed Graph Interactions

Huan Lin, Lianghui Ding

Affiliations: Institute of Image Communication and Network Engineering, School of Integrated Circuits, Shanghai Jiao Tong University

AI summary: This paper proposes a decentralized imitation learning framework that encodes local interactions with a physics-informed graph neural network, enabling robust recovery of UAV swarms under large-scale failures and fragmented topologies.

Abstract

Large-scale Unmanned Aerial Vehicle (UAV) failures can split a UAV swarm network into disconnected sub-networks, making decentralized recovery both urgent and difficult. Centralized recovery methods depend on global topology information and become communication-heavy after severe fragmentation. Decentralized heuristics and multi-agent reinforcement learning methods are easier to deploy, but their performance often degrades when the swarm scale and damage severity vary. We present the Physics-informed Graph Adversarial Imitation Learning algorithm (PhyGAIL), which adopts centralized training with decentralized execution. PhyGAIL builds bounded local interaction graphs from heterogeneous observations, and uses a physics-informed graph neural network to encode directional local interactions as gated message passing with explicit attraction and repulsion. This gives the policy a physically grounded coordination bias while keeping local observations scale-invariant. It also uses scenario-adaptive imitation learning to improve training under fragmented topologies and variable-length recovery episodes. Our analysis establishes bounded local graph amplification, bounded interaction dynamics, and controlled variance of the terminal success signal. A policy trained on 20-UAV swarms transfers directly to swarms of up to 500 UAVs without fine-tuning, and achieves better performance across reconnection reliability, recovery speed, motion safety, and runtime efficiency than representative baselines.

2604.09609 2026-05-19 cs.AI cs.RO

General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging

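Samir H. A. Mohammad, Wouter Mooi, Arkady Zgonnikov

Affiliations: Department of Transport and Planning, Delft University of Technology; Department of Cognitive Robotics

AI summary: This paper studies general-purpose large language models as models of human driver behavior. It embeds two general-purpose LLMs in a simplified one-dimensional merging scenario and compares them with human data through quantitative and qualitative analyses. The models reproduce human-like intermittent operational control and tactical dependence on spatial cues, but neither consistently captures the human response to dynamic velocity cues, and safety performance diverges between the models, so their failure modes need further study before they can be relied on as models of human driving behavior.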

Comments To be published in proceedings of IEEE ITSC 2026

Abstract

Human behavior models are essential as behavior references and for simulating human agents in virtual safety assessment of automated vehicles (AVs), yet current models face a trade-off between interpretability and flexibility. General-purpose large language models (LLMs) offer a promising alternative: a single model potentially deployable without parameter fitting across diverse scenarios. However, what LLMs can and cannot capture about human driving behavior remains poorly understood. We address this gap by embedding two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified one-dimensional merging scenario and comparing their behavior against human data using quantitative and qualitative analyses. Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues, and safety performance diverges sharply between models. A systematic prompt ablation study reveals that prompt components act as model-specific inductive biases that do not transfer across LLMs. These findings suggest that general-purpose LLMs could potentially serve as standalone, ready-to-use human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure their validity as models of human driving behavior.

2604.09450 2026-05-19 cs.LG cs.AI eess.IV

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

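Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu

Affiliations: Beijing Jiaotong University; Dalian University of Technology

AI summary: This paper proposes ECHO, an efficient diffusion-based vision-language model for chest X-ray report generation. Through one-step-per-block diffusion and a Response-Asymmetric Diffusion training strategy, it substantially improves generation efficiency and textual coherence while maintaining clinical accuracy.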

Abstract

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving up to 8× inference speedup with negligible degradation in clinical accuracy.

2604.04932 2026-05-19 cs.CL

Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection

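Yang Li, Qiang Sheng, Zhengjia Wang, Yehan Yang, Danding Wang, Juan Cao

Affiliations: ict.ac.cn (Chinese Academy of Sciences)

AI summary: This paper proposes RACE, a method that models the dual roles of creator and editor for fine-grained LLM-generated text detection, distinguishing text types more precisely and offering a policy-aligned solution for LLM regulation.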

Comments ACL 2026 (Oral)

Abstract

The misuse of large language models (LLMs) requires precise detection of synthetic text. Existing works mainly follow binary or ternary classification settings, which can at best distinguish pure human text, pure LLM text, and collaborative text. This remains insufficient for nuanced regulation, as LLM-polished human text and humanized LLM text often trigger different policy consequences. In this paper, we explore fine-grained LLM-generated text detection under a rigorous four-class setting. To handle such complexities, we propose RACE (Rhetorical Analysis for Creator-Editor Modeling), a fine-grained detection method that characterizes the distinct signatures of creator and editor. Specifically, RACE utilizes Rhetorical Structure Theory (RST) to construct a logic graph for the creator's foundation while extracting Elementary Discourse Unit (EDU)-level features for the editor's style. Experiments show that RACE outperforms 12 baselines in identifying fine-grained types with low false alarms, offering a policy-aligned solution for LLM regulation.

2604.01658 2026-05-19 cs.AI

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang

Affiliations: MIT; NUS; MiniMax; McGill; Stanford; SambaNova; Meta; Singapore-MIT Alliance for Research and Technology; Amazon; Microsoft

AI summary: This paper proposes CORAL, a framework for autonomous multi-agent evolution on open-ended problems, and shows that greater agent autonomy and multi-agent evolution substantially improve open-ended discovery.

Abstract

Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.

2603.27341 2026-05-19 cs.AI cs.CV cs.LG

A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe

Affiliations: Center for Applied AI, Chicago Booth; Surgical Data Science Collective; Children’s National Hospital; Operations Management & Tolan Center for Healthcare, Chicago Booth

AI summary: Using state-of-the-art AI methods available in 2026, this paper studies performance and limitations in surgical tool detection. Even with multi-billion-parameter models and extensive training, current vision-language models fall short on tool detection in neurosurgery, and increasing model size and training time yields only limited gains, indicating that current AI still faces significant obstacles in surgical applications.

Abstract

Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but surgical benchmarks in particular are often missing from prominent medical benchmark suites. Since surgery requires integrating disparate tasks, generally capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.


2603.25723 2026-05-19 cs.CL cs.AI

Natural-Language Agent Harnesses

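Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

Affiliations: Shenzhen International Graduate School, Tsinghua University; Harbin Institute of Technology (Shenzhen)

AI summary: This paper proposes Natural-Language Agent Harnesses (NLAHs), executable natural-language objects that describe the harness policy for a task run, and introduces Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. Experiments show that NLAHs achieve task outcomes comparable to code and prompted realizations on coding, terminal-use, and computer-use benchmarks, while exposing much shorter static harness policies.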

Comments revise paper

English abstract

Agent performance is strongly shaped by the surrounding harness: the external execution system around a model that organizes a task run. Yet this logic is usually buried in tightly coupled controller code, which makes harnesses hard to inspect, compare, transfer, and ablate. This paper asks whether the reusable design pattern of an agent harness can be represented as an executable natural-language object. We introduce Natural-Language Agent Harnesses (NLAHs), editable documents that describe run-level harness policy, and Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. Across coding, terminal-use, and computer-use benchmarks, IHR-executed NLAHs achieve comparable task outcomes to code and prompted realizations, while exposing much shorter static harness policies. Module ablations further show that explicit harness modules are analyzable. These results suggest that agent harnesses can be turned from incidental glue around models into scientific representation objects.


2603.23672 2026-05-19 cs.RO cs.CV

Bio-Inspired Event-Based Visual Servoing for Ground Robots

Maral Mordad, Kian Behzad, Debojyoti Biswas, Noah J. Cowan, Milad Siami

Affiliations: Department of Electrical & Computer Engineering, Northeastern University; Laboratory for Computational Sensing and Robotics, Johns Hopkins University; Department of Mechanical Engineering, Johns Hopkins University

AI summary: This paper proposes a bio-inspired 1D event-based visual servoing framework for ground robots operating in structured environments. Using a Dynamic Vision Sensor and a multi-pattern stimulus, it directly synthesizes a nonlinear state feedback term, achieving efficient, low-latency control.

English abstract

Biological sensory systems are inherently adaptive, filtering out constant stimuli and prioritizing relative changes, likely enhancing computational and metabolic efficiency. Inspired by active sensing behaviors across a wide range of animals, this paper introduces a principled 1D event-based visual servoing framework for ground robots operating in structured environments. Utilizing a Dynamic Vision Sensor (DVS), we demonstrate that by applying a fixed spatial kernel to the asynchronous event stream generated from structured logarithmic intensity-change patterns, the resulting net event flux analytically isolates specific combinations of kinematic states. We establish a generalized theoretical bound for this event rate estimator and show that linear and quadratic spatial profiles isolate the robot's velocity and position-velocity product, respectively. Leveraging these properties, we employ a multi-pattern stimulus to directly synthesize a nonlinear state feedback term entirely without traditional state estimation. To overcome the inescapable loss of linear observability at equilibrium inherent in event sensing, we propose a bio-inspired active sensing limit-cycle controller. Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, extreme low-latency, and computational efficiency of the proposed direct-sensing approach.


2603.23231 2026-05-19 cs.AI

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

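Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, Zhiyu Li, Feiyu Xiong, Enhong Chen, Tong Xu

Affiliations: University of Science and Technology of China; City University of Hong Kong; Northeastern University; MemTensor (Shanghai) Technology Co., Ltd.

AI summary: This paper proposes PERMA, a benchmark that evaluates the long-term consistency of personalized memory agents through event-driven preferences and realistic task environments. It introduces text variability and linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. Experiments show that advanced memory systems extract precise preferences and reduce token consumption, but more robust personalized memory management is still needed.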

English abstract

Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. Existing evaluations of this capability typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events driving user preference evolution. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems extract precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.


2603.22056 2026-05-19 cs.CL

Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

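Stella Eva Tsiapali, Cong-Thanh Do, Kate Knill

Affiliations: University of Cambridge, Department of Engineering; Toshiba Europe Limited

AI summary: This paper studies Dual-Space Knowledge Distillation with key-query matching for large language models with mismatched vocabularies. It analyzes the attention mechanism to reveal strengths and limitations, and proposes a new method based on generative adversarial learning to address the mismatch between key and query distributions.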

English abstract

Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.


2603.21787 2026-05-19 cs.CV

Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTevent

Lokeshwaran Manohar, Moritz Roidl

Affiliations: Chair of Material Handling and Warehousing, TU Dortmund University, Dortmund, Germany

AI summary: This paper evaluates recurrent ReYOLOv8s for industrial multi-class recognition on the MTevent dataset, uses non-recurrent YOLOv8s as a baseline to analyze the effect of temporal memory, and finds that event-domain pretraining yields the larger performance gain.

Comments Accepted at the Neuromorphic Field Robotics and Automation Workshop, ICRA 2026

English abstract

Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTevent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTevent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6% relative improvement over the non-recurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.
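
The relative improvements in this abstract follow directly from the reported mAP50 values. A minimal sketch of the arithmetic, using only the numbers quoted in the abstract (the helper name `relative_change` is ours, not from the paper):

```python
def relative_change(value: float, baseline: float) -> float:
    """Relative change of `value` with respect to `baseline`, as a fraction."""
    return (value - baseline) / baseline


BASELINE = 0.260  # non-recurrent YOLOv8s mAP50, as reported in the abstract

# mAP50 values quoted in the abstract
results = {
    "scratch recurrent (C21)": 0.285,
    "GEN1-initialized (clip length 21)": 0.329,
    "PEDRo-initialized": 0.251,
}

for name, map50 in results.items():
    pct = 100 * relative_change(map50, BASELINE)
    print(f"{name}: {pct:+.1f}% relative to the baseline")
```

The first printed line corresponds to the 9.6% figure stated in the abstract; the PEDRo-initialized model comes out negative, which matches the claim that mismatched pretraining can fall below the baseline.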


2603.18702 2026-05-19 cs.LG

Off-Policy Learning with Limited Supply

Koichi Tanaka, Ren Kishimoto, Bushun Kawagishi, Yusuke Narita, Yasuo Yamamoto, Nobuyuki Shimizu, Yuta Saito

Affiliations: Keio University; Institute of Science Tokyo; Meiji University; Yale University; LY Corporation; Hanjuku-kaso, Co., Ltd.

AI summary: This paper studies off-policy learning in contextual bandits under limited supply and proposes a new method, OPLS, which allocates limited-supply items more efficiently by considering expected rewards relative to other users. Experiments demonstrate its advantage in limited-supply settings.

Comments Published as a conference paper at WWW 2026

English abstract

We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.
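
The depletion effect described in this abstract can be illustrated with a toy allocation problem. The sketch below is not the authors' OPLS estimator. It assumes that the expected rewards `q[user, item]` are known, that users arrive in a fixed order, and that "relative expected reward" can be approximated as the reward minus the item's mean reward over all users (an assumption of ours, chosen only for illustration).

```python
import numpy as np


def allocate(q: np.ndarray, supply: np.ndarray, score: np.ndarray) -> float:
    """Serve users in order; each gets the in-stock item with the best score.

    Returns the total expected reward collected under the true rewards `q`.
    """
    remaining = supply.copy()
    total = 0.0
    for user in range(q.shape[0]):
        masked = np.where(remaining > 0, score[user], -np.inf)
        if not np.isfinite(masked).any():
            continue  # every item is sold out
        item = int(np.argmax(masked))
        remaining[item] -= 1
        total += q[user, item]
    return total


# Hypothetical expected rewards: rows are users, columns are items.
q = np.array([[1.0, 0.9],
              [0.8, 0.1]])
supply = np.array([1, 1])  # one unit of each item

greedy_total = allocate(q, supply, score=q)
relative_total = allocate(q, supply, score=q - q.mean(axis=0))

print("greedy:", greedy_total)
print("relative:", relative_total)
```

In this toy case the greedy rule gives item 0 to the first user, leaving the second user with an item they barely value, whereas the relative rule reserves item 0 for the user who loses most without it.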


2603.13652 2026-05-19 cs.CV

Causal Attribution via Activation Patching

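Amirmohammad Izadi, Mohammadali Banayeeanzade, Alireza Mirrokni, Hosein Hasani, Mobin Bagherian, Faridoun Mehri, Mahdieh Soleymani Baghshah

Affiliations: Sharif University of Technology

AI summary: This paper proposes CAAP, a new causal attribution method that estimates the contribution of image patches to a Vision Transformer's prediction by directly intervening on internal activations, producing more faithful and better-localized attributions.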

English abstract

Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing attribution methods face several limitations, with gradient-based, relevance-propagation, and attention-based methods relying on local approximations, while perturbation or optimization-based methods intervene on inputs, tokens, or surrogates rather than internal patch representations. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers; methods that operate only on input changes, attention weights, or backward relevance signals may therefore provide indirect proxies for patch importance rather than directly testing the predictive effect of contextualized patch representations. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal contribution of patch-associated internal representations on the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing semantic evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP consistently outperforms existing methods in various settings and produces more faithful and localized attributions.
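
The patching step described in this abstract can be sketched on a toy model. This is an illustrative simplification, not the CAAP implementation: the "network" is a stack of hypothetical layers that each add a global token average, the patch is applied at a single layer rather than over a range of layers, and the class score is a linear readout `w` of the mean token.

```python
import numpy as np


def run_layers(h: np.ndarray, n: int, mix: float = 0.1) -> np.ndarray:
    """Apply `n` toy layers; each adds a global average over tokens."""
    for _ in range(n):
        h = h + mix * h.mean(axis=0, keepdims=True)
    return h


def patch_attribution(source, neutral, w, n_layers=3, layer=1):
    """Score each patch by inserting its source activation into a neutral run."""
    src_acts = run_layers(source, layer)   # source activations after `layer` layers
    neu_acts = run_layers(neutral, layer)  # neutral activations at the same depth
    scores = np.zeros(source.shape[0])
    for p in range(source.shape[0]):
        patched = neu_acts.copy()
        patched[p] = src_acts[p]           # intervene on one patch only
        final = run_layers(patched, n_layers - layer)
        scores[p] = final.mean(axis=0) @ w  # target-class score after patching
    return scores


# Four patches with three features; only patch 2 carries class evidence.
w = np.array([1.0, 0.0, 0.0])
source = np.zeros((4, 3))
source[2] = w
neutral = np.zeros((4, 3))

scores = patch_attribution(source, neutral, w)
print(scores)
```

Patching at an intermediate depth means each inserted activation already contains some context from the other tokens, which is the property the abstract contrasts with input-level perturbation.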


2603.12145 2026-05-19 cs.LG cs.AI cs.SE

Automatic Generation of High-Performance RL Environments

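Seth Karten, Rahul Dev Appapogu, Chi Jin

Affiliations: Princeton University; Independent Researcher

AI summary: This paper proposes a closed-loop methodology that produces equivalent high-performance reinforcement learning environments at minimal compute cost, demonstrates three distinct workflows, verifies the absence of a sim-to-sim gap across five environments, and demonstrates the creation of a new environment.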

Comments 20 pages, 5 figures

English abstract

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a closed-loop methodology that produces equivalent high-performance environments for minimal compute cost. Our method uses a generic prompt template, hierarchical verification (property, interaction, and rollout tests), iterative repair, and cross-backend policy transfer to verify no sim-to-sim gap. We demonstrate three distinct workflows across five environments: (1) Direct translation (no prior performance implementation exists) from Game Boy emulator PyBoy to our EmuRust (via Rust IPC) and from Pokemon Showdown to our PokeJAX (via JAX); (2) Translation verified against existing performance implementations via throughput parity with Puffer Pong, MJX and Brax at matched GPU batch sizes; and (3) New environment creation: TCGJax, the first Pokemon TCG Pocket environment, created from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Our closed-loop methodology confirms equivalence for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns.


2603.11689 2026-05-19 cs.AI

Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

Mei Chee Leong, Ying Gu, Hui Li Tan, Liyuan Li, Nancy Chen

Affiliations: Institute for Infocomm Research (I²R), Agency for Science, Technology and Research (A*STAR), Singapore

AI summary: This paper proposes an Explicit Logic Channel for validating and enhancing multimodal large language models on zero-shot tasks, using explicit logical reasoning to improve model explainability and trustworthiness.

English abstract

Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solutions to new tasks in a black-box manner. Validating and understanding the behavior of these models becomes important when applying them to new tasks. We propose an Explicit Logic Channel (ELC), in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be considered as an Implicit Logic Channel. The proposed Explicit Logic Channel, mimicking human logical reasoning, incorporates an LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over the explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves performance in zero-shot tasks over MLLMs, grounded with explicit visual evidence to enhance trustworthiness. Comprehensive experiments are conducted on two representative VLC tasks, i.e., MC-VQA and HC-REC, on three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier families. Our systematic evaluations demonstrate the effectiveness of the proposed ELC and CR for model validation, selection and improvement on MLLMs with enhanced explainability and trustworthiness.


2603.10935 2026-05-19 cs.LG cs.AI cs.CV

Spherical VAE with Cluster-Aware Feasible Regions: Guaranteed Prevention of Posterior Collapse

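Zegu Zhang, Jian Zhang

Affiliations: Independent Researcher

AI summary: This paper proposes a new framework that theoretically guarantees non-collapsed solutions, using spherical shell geometry and cluster-aware constraints to prevent posterior collapse in VAEs, and reports 100% collapse prevention on synthetic and real-world datasets.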

Comments 8 pages, 6 figures

English abstract

Variational autoencoders (VAEs) frequently suffer from posterior collapse, where the latent variables become uninformative as the approximate posterior degenerates to the prior. While recent work has characterized collapse as a phase transition determined by data covariance properties, existing approaches primarily aim to avoid rather than eliminate collapse. We introduce a novel framework that theoretically guarantees non-collapsed solutions by leveraging spherical shell geometry and cluster-aware constraints. Our method transforms data to a spherical shell, computes optimal cluster assignments via K-means, and defines a feasible region between the within-cluster variance $W$ and collapse loss $\delta_{\text{collapse}}$. We prove that when the reconstruction loss is constrained to this region, the collapsed solution is mathematically excluded from the feasible parameter space. Critically, we introduce norm constraint mechanisms that ensure decoder outputs remain compatible with the spherical shell geometry without restricting representational capacity. Unlike prior approaches, our method provides a strict theoretical guarantee with minimal computational overhead without imposing constraints on decoder outputs. Experiments on synthetic and real-world datasets demonstrate 100% collapse prevention under conditions where conventional VAEs completely fail, with reconstruction quality matching or exceeding state-of-the-art methods. Our approach requires no explicit stability conditions (e.g., $\sigma^2 < \lambda_{\max}$) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/spherical-vae-with-Cluster.


2603.03328 2026-05-19 cs.CL cs.AI

StructLens: A Structural Lens for Language Models via Maximum Spanning Trees

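Haruki Sakajo, Frederikus Hudi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

Affiliations: Nara Institute of Science and Technology

AI summary: This paper proposes StructLens, a framework that analyzes the representation structure of language models through maximum spanning trees, revealing how models organize token representations across layers and training stages.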

English abstract

Language exhibits inherent structures, a property that explains both language acquisition and language change. Given this characteristic, we expect language models to manifest their own internal structures as well. While interpretability research has investigated how models compute representations mechanistically through attention patterns and Sparse AutoEncoders, the organization of the resulting representations is overlooked. To address this gap, we introduce StructLens, a framework to analyze representations through a holistic structural view. StructLens constructs maximum spanning trees based on the semantic representations in residual streams, inspired by tree representation in dependency parsing, and provides summaries of token relationships in representation space. We analyze how contiguous tokens are also nearby in representation space and find that middle layers show the strongest local-span organization. Moreover, analysis of pre-training checkpoints reveals that smaller local units become detectable earlier in pre-training, and larger units later. Our findings demonstrate that StructLens provides insights into how models organize token representations across layers and training. Our code is available at https://github.com/naist-nlp/structlens.
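
The tree construction mentioned in this abstract can be sketched with Prim's algorithm over a token-similarity matrix. This is an illustrative sketch, not the StructLens code: the use of cosine similarity as the edge weight and the toy 2-D "representations" are our assumptions, since the abstract does not specify the weighting.

```python
import numpy as np


def cosine_similarity_matrix(reps: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors."""
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    return unit @ unit.T


def maximum_spanning_tree(sim: np.ndarray) -> list:
    """Prim's algorithm, always adding the heaviest edge leaving the tree."""
    n = sim.shape[0]
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or sim[i, j] > best[2]):
                    best = (i, j, sim[i, j])
        edges.append((best[0], best[1]))
        in_tree.append(best[1])
    return edges


# Toy representations for four tokens (one row per token position).
reps = np.array([[1.0, 0.0],
                 [0.9, 0.1],
                 [0.0, 1.0],
                 [0.1, 0.9]])

edges = maximum_spanning_tree(cosine_similarity_matrix(reps))
adjacent = sum(abs(i - j) == 1 for i, j in edges) / len(edges)
print(edges, adjacent)
```

The final ratio counts how many tree edges link neighbouring token positions, a crude stand-in for the kind of local-span summary the abstract describes.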


2603.03308 2026-05-19 cs.CL cs.AI

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

Adi Simhi, Fazl Barez, Martin Tutek, Yonatan Belinkov, Shay B. Cohen

Affiliations: Technion - Israel Institute of Technology; University of Oxford; University of Zagreb, FER; Kempner Institute, Harvard University; University of Edinburgh

AI summary: This study examines how conversational history affects the subsequent behavior of large language models. It proposes the History-Echoes framework, which analyzes conversational-history bias from probabilistic and geometric perspectives, and shows that behavioral persistence manifests as a geometric trap in the latent space.

Comments Accepted to ICML 2026

English abstract

How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History-Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model's trajectory. Code available at https://github.com/technion-cs-nlp/OldHabitsDieHard.
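
The probabilistic view in this abstract models a conversation as a Markov chain over states. A minimal sketch of estimating such a chain, assuming each turn has already been labeled with a discrete behavioral state (here 0 = faithful and 1 = hallucinated, labels of our own choosing). How the paper defines its states and its consistency measure is not given in the abstract, so the diagonal entry used below is only one plausible reading.

```python
import numpy as np


def transition_matrix(states: list, n_states: int) -> np.ndarray:
    """Maximum-likelihood estimate of P[a, b] = P(next = b | current = a)."""
    counts = np.zeros((n_states, n_states))
    for current, nxt in zip(states[:-1], states[1:]):
        counts[current, nxt] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Rows with no observations stay all-zero instead of dividing by zero.
    return np.divide(counts, row_totals,
                     out=np.zeros_like(counts), where=row_totals > 0)


# Hypothetical per-turn labels for one conversation.
turn_states = [0, 0, 1, 1, 1, 0, 1, 1]

P = transition_matrix(turn_states, n_states=2)
persistence = P[1, 1]  # estimated probability of staying in state 1
print(P)
print(persistence)
```

A diagonal entry close to 1 would indicate that once the conversation enters a state it tends to stay there, which is the kind of persistence the abstract attributes to conversational history.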


2603.03190 2026-05-19 cs.AI q-bio.NC

Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity

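Shogo Noguchi, Taketo Akama, Tai Nakamura, Shun Minamikawa, Natalia Polouliakh

Affiliations: Sony Computer Science Laboratories, Inc.

AI summary: This study improves EEG-based music identification by distinguishing acoustic and expectation-related neural network representations as teacher targets, shows that representation learning can be guided by neural encoding, and points toward advances in predictive music cognition and neural decoding.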

Comments 47 pages, 12 figures

English abstract

During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.


2603.03099 2026-05-19 cs.LG cs.AI

Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

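Ruinan Jin, Yingbin Liang, Shaofeng Zou

Affiliations: Department of Electrical and Computer Engineering, The Ohio State University; School of Electrical, Computer and Energy Engineering, Arizona State University

AI summary: This paper identifies second-moment normalization as the key mechanism in Adam and, through a stopping-time/martingale analysis under the classical bounded-variance model, proves that Adam has better high-probability convergence behavior than SGD: Adam's dependence on the confidence parameter $\delta$ is $\delta^{-1/2}$, whereas SGD's is at least $\delta^{-1}$.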

Comments 68 pages

2603.00631 2026-05-19 cs.AI

LiTS: A Modular Framework for LLM Tree Search

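Xinzhe Li, Yaguang Tao

Affiliations: RMIT University

AI summary: This paper presents LiTS, a modular framework for LLM reasoning via tree search, demonstrates its composability on language reasoning, environment planning, and tool-use tasks, and finds that LLM policy diversity in infinite action spaces is the bottleneck for effective tree search.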

Comments ACL 2026 Demo

English abstract

LiTS is a modular Python framework for LLM reasoning via tree search. It decomposes tree search into three reusable components (Policy, Transition, and RewardModel) that plug into algorithms like MCTS and BFS. A decorator-based registry enables domain experts to extend to new domains by registering components, and algorithmic researchers to implement custom search algorithms. We demonstrate composability on MATH500 (language reasoning), Crosswords (environment planning), and MapEval (tool use), showing that components and algorithms are orthogonal: components are reusable across algorithms within each task type, and algorithms work across all components and domains. We also report a mode-collapse finding: in infinite action spaces, LLM policy diversity (not reward quality) is the bottleneck for effective tree search. A demonstration video is available at https://youtu.be/nRGX43YrR3I. The package is released under the Apache 2.0 license at https://github.com/xinzhel/lits-llm, including installation instructions and runnable examples that enable users to reproduce the demonstrated workflows.
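
The decorator-based registry pattern described in this abstract can be sketched in a few lines. Everything below is hypothetical and written only to illustrate the pattern: `REGISTRY`, `register`, the toy component classes, and `bfs` are our own names and do not reflect the actual API of the lits-llm package.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping (component kind, domain name) to a class.
REGISTRY: Dict[Tuple[str, str], type] = {}


def register(kind: str, name: str) -> Callable[[type], type]:
    """Class decorator that stores a component under (kind, name)."""
    def decorator(cls: type) -> type:
        REGISTRY[(kind, name)] = cls
        return cls
    return decorator


@register("policy", "toy")
class ToyPolicy:
    def propose(self, state: str) -> List[str]:
        return ["a", "b"]  # candidate actions for any state


@register("transition", "toy")
class ToyTransition:
    def step(self, state: str, action: str) -> str:
        return state + action  # the new state appends the action


@register("reward", "toy")
class ToyReward:
    def score(self, state: str) -> float:
        return float(state.count("a"))  # toy reward: number of "a" actions


def bfs(root: str, depth: int, domain: str = "toy") -> str:
    """Breadth-first expansion that only talks to registered components."""
    policy = REGISTRY[("policy", domain)]()
    transition = REGISTRY[("transition", domain)]()
    reward = REGISTRY[("reward", domain)]()
    frontier = [root]
    for _ in range(depth):
        frontier = [transition.step(s, a)
                    for s in frontier for a in policy.propose(s)]
    return max(frontier, key=reward.score)


best = bfs("", depth=2)
print(best)
```

Because the search function looks components up by name, a different search algorithm could reuse the same three classes, and a new domain could be added by registering three more, which is the orthogonality the abstract refers to.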

2603.00607 2026-05-19 cs.CV cs.AI

IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

Honghao Cai, Xiangyuan Wang, Jing Li, Yunhao Bai, Tianze Zhou, Haohua Chen, Chao Hui, Changhao Qiao, Runqi Wang, Sijie Xu, Yuyang Hao, Zezhou Cui, Yuyuan Yang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li

Affiliations * Xiaohongshu Inc.; The Chinese University of Hong Kong, Shenzhen; Peking University; Tsinghua University

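AI summary: This paper proposes IdGlow, a framework that uses task-adaptive timestep scheduling and a vision-language model to resolve the stability-plasticity tension in multi-subject generation, improving facial fidelity and commercial-grade aesthetic quality.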

Abstract

Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.
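
The two timestep schedules named in the abstract can be pictured as weight functions over generation progress. The sketch below is only a generic illustration of a linear decay and a temporal gate; the abstract does not give the actual formulas, so the normalized progress variable `t`, the decay endpoints, the window bounds, and both function names are assumptions.

```python
def linear_decay_weight(t: float, start: float = 1.0, end: float = 0.0) -> float:
    """Constraint weight that moves linearly from `start` to `end` as t goes 0 -> 1.

    `t` is an assumed normalized progress variable, not the paper's notation.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return start + (end - start) * t


def gated_weight(t: float, lo: float = 0.3, hi: float = 0.7, strength: float = 1.0) -> float:
    """Identity-injection weight that is active only inside the window [lo, hi].

    The window bounds are placeholders chosen for illustration.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return strength if lo <= t <= hi else 0.0


# Tabulate both weights at a few progress values.
schedule = [(t / 4, linear_decay_weight(t / 4), gated_weight(t / 4)) for t in range(5)]
```

The decay relaxes a constraint gradually over the whole trajectory, while the gate applies conditioning at full strength inside one window and not at all outside it.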
