2605.21479 2026-05-21 cs.CV cs.AI Version update

Mind the Sim-to-Real Gap and Think Like a Scientist

Harsh Parikh, Gabriel Levin-Konigsberg, Dominique Perrault-Joncas, Alexander Volfovsky

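Affiliations: Amazon SCOT; Yale University; Duke University

AI summary: This paper studies how to supplement a simulator with real experiments to reduce the value gap between simulation and reality, proposes the Fisher-SEP method, and illustrates it with two case studies.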

Abstract

Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes the simulator's value error into a calibration--deployment shift that randomization can identify and a parametric residual that no further interaction can reduce. Second, the value gap between the simulator-optimal policy and the optimum splits into a local component, on states the deployed policy already visits, and a reachability component, on states it does not. The reachability component stays bounded away from zero at any horizon under purely passive learning. Third, we propose Fisher-SEP, a simulation-aided experimental policy (SEP) that minimizes the posterior predictive variance of a target policy's value, with reward-only and transition-only specializations. Two case studies illustrate the regimes. In a vending-machine supply chain, front-loaded experimentation overtakes posterior updating once the horizon is long enough to amortize the pilot. In an HIV mobile-testing example with a corridor that separates a well-surveilled region from a poorly-surveilled one, only designed exploration reaches the poorly-surveilled region.

2605.21453 2026-05-21 cs.SE cs.AI Version update

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

Mohamed Almukhtar, Anwar Ghammam, Hua Ming

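Affiliations: University of Michigan-Flint; University of Michigan-Dearborn

AI summary: Analyzing Python refactoring pull requests in the AIDev dataset, this study examines how AI-generated code affects code quality and security. AI commits improved a quality attribute in 22.5% of cases but also introduced new code issues; the paper derives a taxonomy of 24 recurring change operations and stresses the importance of security gating.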

Abstract

As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues, predominantly convention-level violations such as long lines, while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows.

2605.21451 2026-05-21 cs.LG cond-mat.dis-nn cs.AI cs.NE Version update

Approximation Theory for Neural Networks: Old and New

Soumendu Sundar Mukherjee, Himasish Talukdar

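AI summary: This survey reviews the development of approximation theory for neural networks, including classical density results for single-hidden-layer networks, quantitative error bounds, and depth-width trade-offs, and discusses the theoretical properties of newer architectures such as Kolmogorov-Arnold Networks.

Comments: 31 pages, 4 figures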

Abstract

Universal approximation theorems provide a mathematical explanation for the expressive power of neural networks. They assert that, under mild conditions on the activation function, feedforward neural networks are dense in broad function classes, such as continuous functions on compact subsets of $\mathbb{R}^d$, $L^p$ spaces, or Sobolev spaces. Over the past four decades, these qualitative universality results have evolved into a rich quantitative theory addressing approximation rates, parameter efficiency, and the role of architectural features such as depth and width. This survey presents several glimpses into this theory. We review classical density results for single-hidden-layer networks, as well as quantitative bounds that relate approximation error to network size and smoothness assumptions on target functions. Particular emphasis is placed on depth--width trade-offs and on results demonstrating that deeper architectures can achieve superior parameter efficiency for structured function classes. In addition to standard feedforward neural networks, we also review recent developments on Kolmogorov--Arnold Networks (KANs), which offer an alternative architectural paradigm and whose approximation-theoretic properties have begun to attract significant theoretical attention.
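
For orientation, one standard form of the density statement the abstract refers to can be written as follows. This is a sketch of the classical single-hidden-layer result under assumed hypotheses ($K \subset \mathbb{R}^d$ compact, $\sigma$ continuous and not a polynomial); the survey's own statements and notation may differ.

```latex
\[
\forall f \in C(K),\ \forall \varepsilon > 0,\
\exists N \in \mathbb{N},\ a_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^d :
\quad
\sup_{x \in K} \Bigl| f(x) - \sum_{i=1}^{N} a_i\, \sigma\bigl(w_i^{\top} x + b_i\bigr) \Bigr| < \varepsilon .
\]
```

The statement is purely qualitative: it says nothing about how large $N$ must be for a given $\varepsilon$, which is the gap the quantitative theory described in the abstract addresses.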

2605.21443 2026-05-21 cs.CV cs.AI Version update

Talking with the Dead: How People Interact with Generative Ghosts

Jack Manning, Daniel Sullivan, Dylan Thomas Doyle, Anthony T. Pinter, Jed R. Brubaker

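Affiliations: University of Colorado Boulder

AI summary: A qualitative study of how people interact with generative ghosts finds that users favor immediacy over factual accuracy and that the interactions are always collaborative.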

Abstract

We examine how people experience two choices in the design of generative ghosts, AI systems that are trained on data of the dead: representation, where an AI speaks about a deceased person in the third person, and reincarnation, where the AI speaks as the deceased in the first person. Through a qualitative user study with 16 participants, we explore how each shaped authenticity, affect, and risk. Reincarnation was preferred for its immediacy, but participants shared fears of over-reliance. Representation was preferred for engaging with memory over conversational presence, though participants often ignored this distinction, engaging in dialogue despite third-person framing. Across both modes, participants privileged affective resonance over factual fidelity. We conclude by showing how factors such as tone, language, and conversational rhythm -- factors unique to the user's memory of the deceased -- shape interactions with generative ghosts, and argue that those interactions are always collaborative.

2605.21388 2026-05-21 cs.LG cs.AI cs.NA math.NA stat.ML Version update

On the Regularity and Generalization of One-Step Wasserstein-guided Generative Models for PDE-Induced Measures

Likun Lin, Zhongjian Wang, Jack Xin, Zhiwen Zhang

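Affiliations: Department of Mathematics, The University of Hong Kong; Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University; Department of Mathematics, University of California at Irvine

AI summary: This paper studies the regularity and generalization of one-step Wasserstein-guided generative models for PDE-induced probability measures, proving regularity of the transport map and generalization properties of the generative model within a theoretical framework, and supporting the theory with experiments.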

Abstract

Despite the remarkable empirical success of generative models, the available theory on their statistical accuracy in scientific computing remains largely pessimistic. This paper develops a theoretical framework for understanding the regularity of transport maps and the generalization properties of one-step Wasserstein-guided generative models for PDE-induced probability measures. We consider normalized target densities associated with linear elliptic and parabolic equations on bounded domains, as well as diffusion and Fokker--Planck equations on the torus. Under standard structural assumptions, we prove that these target measures satisfy doubling conditions. By combining this fact with regularity theory for optimal transport between doubling measures, we show that the optimal transport map from a uniform source measure to the target measure is Hölder continuous. This regularity yields an approximation-theoretic justification for one-step generative models that learn PDE-induced distributions via a single pushforward map. As a representative instance, we study DeepParticle and derive excess-risk bounds characterizing the discrepancy between the learned map and the population-optimal map. We also establish a robustness estimate under target shift and illustrate the theory with experiments which support the derived rates.

2605.21384 2026-05-21 cs.SE cs.AI cs.CL Version update

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Bingchen Zhao, Dhruv Srikanth, Yuxiang Wu, Zhengyao Jiang

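Affiliations: Weco AI

AI summary: By decomposing software engineering tasks, this study proposes a way to measure reward hacking in long-horizon coding agents: compare pass rates on a visible test suite and a held-out test suite. It introduces the SpecBench benchmark and shows that reward hacking grows markedly with task length.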

Abstract

As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the user's true goal. We study this reward hacking phenomenon by decomposing software engineering tasks into three parts: (i) a natural language description of the specification, (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Based on the specification and the visible validation test suites, a genuine agent would be able to generate a solution that can also pass all of the held-out tests. Therefore, we use the gap in pass rates on these two suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark comprising 30 systems-level programming tasks ranging from short-horizon tasks like building a JSON parser to ultra-long-horizon tasks like building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: while every frontier agent saturates the visible suite, reward hacking persists, with smaller models exhibiting larger gaps on held-out suites. The gap also scales sharply with task length: it grows by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table "compiler" that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuine working systems or merely game the test suites developers hand them.
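
The gap metric described in the abstract can be sketched in a few lines. This is a minimal illustration, not SpecBench's actual harness: the `SuiteResult` type, the `reward_hacking_gap` function, and the pass counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    """Outcome of running one test suite (hypothetical container)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        if self.total <= 0:
            raise ValueError("suite must contain at least one test")
        return self.passed / self.total


def reward_hacking_gap(visible: SuiteResult, held_out: SuiteResult) -> float:
    """Visible pass rate minus held-out pass rate, in percentage points."""
    return 100.0 * (visible.pass_rate - held_out.pass_rate)


# Made-up numbers: the agent saturates the visible suite but fails
# some of the held-out tests that compose the same features.
gap = reward_hacking_gap(SuiteResult(50, 50), SuiteResult(40, 50))
print(f"gap: {gap:.1f} percentage points")
```

A gap near zero is what the abstract expects from a genuine solution; a large positive gap indicates code that passes the visible tests without implementing the specification.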

2605.21372 2026-05-21 cs.CV cs.AI cs.LG cs.RO Version update

Tracking the Ongoing Emergence of Human-like Reasoning in Large Language Models

Paolo Morosi, Nikoleta Pantelidou, Fritz Günther, Elena Pagliarini, Evelina Leivada

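Affiliations: Departament de Filologia Catalana, Universitat Autònoma de Barcelona; Institut für Psychologie, Humboldt-Universität zu Berlin; Institució Catalana de Recerca i Estudis Avançats (ICREA)

AI summary: The study asks whether large language models reason in a human-like way on conditional inference tasks. Humans enrich logical reasoning with pragmatic inferences, whereas model behavior is more variable: some models follow the semantics of conditionals but ignore pragmatic inferences. LLMs thus perform well on semantic accuracy but lack the pragmatic richness of human reasoning.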

Abstract

Humans effortlessly go beyond literal meanings: "If you mow the lawn, I will give you fifty dollars" is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas "If you are hungry, there is pizza in the oven" implies that pizza is available regardless of the hearer's hunger. Large Language Models (LLMs) show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twenty-five LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth table of conditionals but ignore pragmatic inferences, while others deviate from the truth table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

2605.21295 2026-05-21 cs.LG cs.AI cs.HC Version update

TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health

Yuang Fan, Lilin Xu, Millie Wu, Jingping Nie, Qingyu Chen, Yuzhe Yang, Zhuo Zhang, Xin Liu, Subigya Nepal, Xiaofan Jiang, Xuhai "Orson" Xu

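Affiliations: Columbia University; University of North Carolina at Chapel Hill; Yale University; University of California, Los Angeles; Google; University of Virginia

AI summary: This paper proposes TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck: raw signals are abstracted into high-level natural language, from which behavioral outcomes are predicted. The method achieves state-of-the-art cross-cohort generalization in mental-health prediction.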

Abstract

Longitudinal passive sensing enables continuous health prediction, yet models often fail under cross-dataset distribution shifts. Traditional ML overfits cohort-specific artifacts, while Large Language Models (LLMs) struggle to reason reliably over long, heterogeneous time-series. We introduce TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck. The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these abstractions alone. This forces the model to reason over semantic concepts that we argue generalize better than raw numbers. We optimize this process end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), learning outcome-aligned abstractions without gold intermediate annotations. Instantiated on mental-health prediction, TimeSRL achieves state-of-the-art performance on a benchmark designed to stress-test cross-cohort generalization under a rigorous leave-one-dataset-out (LOSO) protocol, reducing mean absolute error (MAE) over strong non-LLM ML and LLM baselines by 3.1--10.1% and 9.5--44.1% for anxiety, and 3.2--9.6% and 27.4--57.6% for depression (all $p$s<0.05). TimeSRL significantly outperforms prior methods in cross-benchmark transfer across different sensing pipelines, rivaling its own within-domain performance without target-domain fine-tuning. These results demonstrate that semantic abstractions are reusable and point to a new direction for generalizable behavior modeling via RL-tuned LLMs.
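
The group-relative step in GRPO can be illustrated with a short sketch. It assumes the usual formulation in which each sampled completion's reward is standardized against the other completions for the same prompt. The `mae_reward` function is a hypothetical stand-in for a verifiable reward; the abstract does not specify TimeSRL's actual reward.

```python
import statistics


def mae_reward(prediction: float, target: float) -> float:
    """Hypothetical verifiable reward: negative absolute error."""
    return -abs(prediction - target)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled predictions of a symptom score whose true value is 7.0.
predictions = [4.0, 7.0, 10.0, 7.0]
rewards = [mae_reward(p, 7.0) for p in predictions]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because advantages are computed relative to the group, no learned value function is needed, and samples that predict the outcome better than their siblings receive positive weight regardless of the absolute reward scale.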

2605.21292 2026-05-21 stat.ML cs.AI cs.LG math.DS Version update

Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

Krishnakumar Balasubramanian

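Affiliations: Department of Statistics, University of California, Davis

AI summary: This paper studies the training dynamics of a two-factor linear transformer model at large learning rates. The analysis shows that large step sizes can change the training attractor rather than merely accelerate convergence; beyond stability thresholds, training may settle into cycles, bounded chaos, or divergence.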

Abstract

Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter $μ$. On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for $0<μ<2$, it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.
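
The qualitative regimes listed in the abstract can be reproduced on a toy problem. The sketch below assumes a stand-in two-factor loss, 0.5 * (a*b - 1)**2, which is not necessarily the paper's normalized model; `gd_step`, `classify`, and the step sizes are illustrative choices. On the balanced slice a == b, gradient descent on this loss becomes the scalar cubic map a -> a - mu * a * (a**2 - 1).

```python
def gd_step(a: float, b: float, mu: float):
    """One gradient-descent step on the toy loss 0.5 * (a*b - 1)**2."""
    residual = a * b - 1.0
    return a - mu * residual * b, b - mu * residual * a


def classify(mu: float, a0: float = 0.5, steps: int = 2000,
             blowup: float = 1e6, tol: float = 1e-8) -> str:
    """Run GD from a balanced start and label the long-run behavior."""
    a = b = a0  # balanced start: a == b is preserved by gd_step
    for _ in range(steps):
        a, b = gd_step(a, b, mu)
        if abs(a) > blowup or abs(b) > blowup:
            return "diverged"  # stop early, before floats overflow
    if abs(a * b - 1.0) < tol:
        return "converged"
    return "bounded, not converged"


for mu in (0.3, 1.2, 2.5):
    print(mu, classify(mu))
```

In this toy, the minimizer a = b = 1 is a stable fixed point only for small step sizes; at intermediate step sizes the iterates stay bounded without settling on the minimizer, and at large step sizes they blow up. The specific thresholds belong to the toy and should not be read as the paper's.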

2605.21272 2026-05-21 cs.CV cs.AI Version update

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec

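Affiliations: Jasper Research

AI summary: This paper introduces the MONET dataset, which provides high-quality text-to-image data through multi-stage filtering and enrichment, lowering the barrier to large-scale reproducible research.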

Abstract

Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approx. 104.9M image--text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.
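
Of the pipeline stages named in the abstract, exact-duplicate removal is the simplest to sketch. The example below is an assumption about how such a stage could look, not MONET's implementation: it hashes raw image bytes with SHA-256 and keeps the first pair seen for each digest. The function name and the toy data are hypothetical.

```python
import hashlib


def drop_exact_duplicates(pairs):
    """Keep the first (image_bytes, caption) pair for each distinct image."""
    seen = set()
    kept = []
    for image_bytes, caption in pairs:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in seen:
            continue  # byte-identical image already kept
        seen.add(digest)
        kept.append((image_bytes, caption))
    return kept


# Toy stand-ins for image files; real inputs would be encoded image bytes.
raw_pairs = [(b"img-A", "a cat"), (b"img-B", "a dog"), (b"img-A", "cat photo")]
clean_pairs = drop_exact_duplicates(raw_pairs)
print(len(raw_pairs), "->", len(clean_pairs))
```

Byte-level hashing only catches identical files. Near-duplicate removal, which the abstract also mentions, needs a similarity measure such as perceptual hashes or embedding distance and is not covered by this sketch.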

2605.21266 2026-05-21 cs.LG cs.AI Version update

Agent Security Is a Systems Problem

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda, Somesh Jha, Johann Rehberger, Kamalika Chaudhuri, Xiaohan Fu, Khawaja Shams, Guy Amir, Jihye Choi, Sarthak Choudhary, Nils Palumbo, Andrey Labunets, Nishit V. Pandya

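Affiliations: Google; University of California San Diego; University of Wisconsin–Madison; EmbraceTheRed; Gray Swan AI; Cornell University

AI summary: This paper argues that agent security should be addressed as a systems problem, with security invariants enforced at the system level rather than relying on model robustness alone. Drawing on techniques from systems security, it sets out core principles for designing agentic systems with predictable security guarantees, and analyzes real-world attacks and the challenges of implementing these principles.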

Abstract

We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.

2605.17618 2026-05-21 cs.AI Version update

CTFExplorer: Evaluating LLM Offensive Agents via a Multi-Target Web CTF Benchmark

Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri

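Affiliations: CISPA - Helmholtz Center for Information Security; NYU Tandon School of Engineering; NYU Abu Dhabi; IIIT Hyderabad

AI summary: This paper introduces CTFExplorer, a multi-target web CTF benchmark for evaluating LLM offensive agents. It asks how to evaluate an agent's strategic reasoning under uncertainty, uses a multi-target environment to test how agents explore, prioritize, and chain attacks, and contributes a framework for evaluating agent behavior.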

Abstract

Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce CTFExplorer, a benchmark suite that shifts offensive security evaluation toward a multi-target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web-based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi-agent setup as a reference agent framework and develop an agent-agnostic evaluation framework that records structured reasoning traces for fine-grained assessment. This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.

2602.06358 2026-05-21 cs.CL cs.AI Version update

SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang

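Affiliations: Institute for Artificial Intelligence, Peking University; Technion, NVIDIA; University of Oxford; School of Electronics Engineering; Computer Science, Peking University

AI summary: This paper proposes SHINE, a scalable in-context hypernetwork that maps diverse, meaningful contexts to high-quality LoRA adapters in a single pass. By reusing the frozen LLM's own parameters and introducing architectural innovations, it overcomes key limitations of prior hypernetworks and achieves strong expressive power with relatively few parameters.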

Abstract

We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLMs). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high-quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly reduces time, computation, and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/MuLabPKU/SHINE

2601.10191 2026-05-21 cs.AI Version update

How does downsampling affect needle electromyography signals? A generalisable workflow for understanding downsampling effects on high-frequency time series

Mathieu Cherpitel, Janne Luijten, Thomas Bäck, Camiel Verhamme, Martijn Tannemaat, Anna Kononova

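Affiliations: Leiden Institute of Advanced Computer Science; Leiden University Medical Centre; Department of Neurology; Amsterdam University Medical Centre

AI summary: This paper studies how downsampling affects high-frequency time-series signals and proposes a generalisable workflow that combines shape-based distortion metrics with classification results from feature-based machine-learning models to quantify how downsampling algorithms and factors affect waveform integrity and predictive performance.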

Abstract

Automated analysis of needle electromyography (nEMG) signals is emerging as a tool to support the detection of neuromuscular diseases (NMDs), yet the signals' high and heterogeneous sampling rates pose substantial computational challenges for feature-based machine-learning models, particularly for near real-time analysis. Downsampling offers a potential solution, but its impact on diagnostic signal content and classification performance remains insufficiently understood. This study presents a workflow for systematically evaluating information loss caused by downsampling in high-frequency time series. The workflow combines shape-based distortion metrics with classification outcomes from available feature-based machine learning models and feature space analysis to quantify how different downsampling algorithms and factors affect both waveform integrity and predictive performance. We use a three-class NMD classification task to experimentally evaluate the workflow. We demonstrate how the workflow identifies downsampling configurations that preserve diagnostic information while substantially reducing computational load. Analysis of shape-based distortion metrics showed that shape-aware downsampling algorithms outperform standard decimation, as they better preserve peak structure and overall signal morphology. The results provide practical guidance for selecting downsampling configurations that enable near real-time nEMG analysis and highlight a generalisable workflow that can be used to balance data reduction with model performance in other high-frequency time-series applications as well.
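
The contrast between plain decimation and a shape-aware scheme can be shown on a toy signal. The abstract does not name the algorithms compared, so the sketch below assumes min-max bucketing as one example of a shape-aware method, and implements decimation as simple subsampling without an anti-aliasing filter. All function names and the signal are illustrative.

```python
def decimate(signal, factor):
    """Keep every `factor`-th sample (no anti-alias filter in this sketch)."""
    return signal[::factor]


def minmax_downsample(signal, factor):
    """Keep the min and max of each bucket of 2*factor samples, in time order."""
    out = []
    bucket = 2 * factor  # two kept samples per bucket matches the target rate
    for start in range(0, len(signal), bucket):
        chunk = signal[start:start + bucket]
        lo = min(range(len(chunk)), key=chunk.__getitem__)
        hi = max(range(len(chunk)), key=chunk.__getitem__)
        for i in sorted({lo, hi}):  # one sample if min and max coincide
            out.append(chunk[i])
    return out


# A flat signal with one narrow spike, loosely mimicking a sharp peak.
signal = [0.0] * 16
signal[5] = 9.0

print(decimate(signal, 4))           # spike falls between kept samples
print(minmax_downsample(signal, 4))  # spike survives as a bucket maximum
```

The spike sits at an index that subsampling skips, so decimation erases it, while the min-max variant retains it with no more output samples. This is the kind of peak-structure preservation the abstract attributes to shape-aware algorithms.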

2510.15949 2026-05-21 q-fin.TR cs.AI Version update

ATLAS: Adaptive Trading with LLM AgentS Through Dynamic Prompt Optimization and Multi-Agent Coordination

Charidimos Papadakis, Angeliki Dimitriou, Giorgos Filandrianos, Maria Lymperaiou, Konstantinos Thomas, Giorgos Stamou

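Affiliations: School of Electrical and Computer Engineering, AILS Laboratory; National Technical University of Athens

AI summary: This paper proposes ATLAS, a framework that uses dynamic prompt optimization and multi-agent coordination to address the adaptation problem of LLMs in financial trading, improving the robustness of trading decisions and execution efficiency.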

详情

AI中文摘要

大型语言模型在金融决策中展现出潜力，但将其作为自主交易代理存在根本性挑战：如何在奖励延迟和市场噪声干扰下适应指令，如何将异质信息流合成连贯决策，以及如何弥合模型输出与可执行市场行动之间的差距。本文提出ATLAS（Adaptive Trading with LLM AgentS），一个统一的多智能体框架，整合市场、新闻和公司基本面的结构化信息以支持稳健的交易决策。在ATLAS中，核心交易智能体在订单感知的动作空间中运作，确保输出对应可执行的市场订单而非抽象信号。该智能体可通过Adaptive-OPRO技术在交易中整合反馈，这是一种新颖的提示优化技术，通过动态适应提示并结合实时随机反馈，随着时间推移提高性能。在特定市场环境的股票研究和多个LLM家族中，Adaptive-OPRO consistently outperforms fixed prompts，而基于反思的反馈未能提供系统性增益。

英文摘要

Large language models show promise for financial decision-making, yet deploying them as autonomous trading agents raises fundamental challenges: how to adapt instructions when rewards arrive late and obscured by market noise, how to synthesize heterogeneous information streams into coherent decisions, and how to bridge the gap between model outputs and executable market actions. We present ATLAS (Adaptive Trading with LLM AgentS), a unified multi-agent framework that integrates structured information from markets, news, and corporate fundamentals to support robust trading decisions. Within ATLAS, the central trading agent operates in an order-aware action space, ensuring that outputs correspond to executable market orders rather than abstract signals. The agent can incorporate feedback while trading using Adaptive-OPRO, a novel prompt-optimization technique that dynamically adapts the prompt by incorporating real-time, stochastic feedback, leading to increasing performance over time. Across regime-specific equity studies and multiple LLM families, Adaptive-OPRO consistently outperforms fixed prompts, while reflection-based feedback fails to provide systematic gains.


2507.21168 2026-05-21 cs.CL cs.AI cs.LG version update

Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

Rafael Rosales, Santiago Miret

Affiliations * Intel Labs

AI summary: This paper compares two diversity approaches for answering binary questions with large language models, model diversity and question interpretation diversity, and finds that question interpretation diversity yields better ensemble accuracy.

Journal ref Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 5116-5128

DOI: 10.63317/43t2yvgid7tw

Abstract

Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on boolq, strategyqa, and pubmedqa show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMa shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.
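
The majority-voting heuristic described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the answer lists are made-up placeholders, and resolving ties to `False` is an assumption that the abstract does not specify.

```python
from collections import Counter


def majority_vote(answers):
    """Return the majority boolean answer from an ensemble.

    Ties resolve to False; this tie-break is an assumption, not from the paper.
    """
    counts = Counter(answers)
    return counts[True] > counts[False]


# Model diversity: several hypothetical models answer the same question.
model_answers = [True, False, True]

# Question interpretation diversity: one model answers several rephrasings
# of the same question (placeholder values).
interpretation_answers = [True, True, False, True, False]

print(majority_vote(model_answers))           # True
print(majority_vote(interpretation_answers))  # True
```

Both ensemble types feed the same voting function; only the source of the individual answers differs, which is the comparison the abstract draws.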


2506.11060 2026-05-21 cs.SE cs.AI version update

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

Ramneet Singh, Sathvik Joel, Abhav Mehrotra, Nalin Wadhwa, Ramakrishna B Bairi, Aditya Kanade, Nagarajan Natarajan

Affiliations * Microsoft Research

AI summary: This paper proposes Code Researcher, a deep research agent for large systems code and commit history that uses multi-step reasoning and global context gathering to resolve crashes in systems code, significantly outperforming existing baselines.

Abstract

Large Language Model (LLM)-based coding agents have shown promising results on coding benchmarks, but their effectiveness on systems code remains underexplored. Due to the size and complexity of systems code, changing a systems codebase first requires researching many pieces of context, derived from the large codebase and its massive commit history. Inspired by recent progress on deep research agents, we design the first deep research agent for code, called Code Researcher, and apply it to the problem of generating patches to mitigate crashes reported in systems code. Code Researcher performs multi-step reasoning about the semantics, patterns, and commit history of code to retrieve all relevant context from the codebase and its commit history. We evaluate Code Researcher on kBenchSyz, a benchmark of Linux kernel crashes, and show that it significantly outperforms strong baselines, achieving a crash-resolution rate (CRR) of 48%, compared to 31.5% by SWE-agent and 31% by Agentless, using OpenAI's GPT-4o model. Scaling up the sampling budget to 10 trajectories increases Code Researcher's CRR to 54%. Code Researcher is also robust to model choices, reaching 67% with the newer Gemini 2.5-Flash model. Through another experiment on open-source multimedia software, we show the generalizability of Code Researcher and also conduct ablations. Our experiments highlight the importance of global context gathering and multi-faceted reasoning for large codebases.


2503.13549 2026-05-21 cs.SE cs.AI version update

A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks

Ronas Shakya, Sam Urmian, Mohammad Khalil

Affiliations * Centre for the Science of Learning & Technology (SLATE); University of Bergen

AI summary: This paper evaluates ChatGPT and DeepSeek on programming tasks, finding that ChatGPT performs better on medium-difficulty tasks while both struggle with hard tasks.

Abstract

The advancement of large language models (LLMs) has created a competitive landscape for AI-assisted programming tools. This study evaluates two leading models, ChatGPT 03-mini and DeepSeek-R1, on their ability to solve competitive programming tasks from Codeforces. Using 29 programming tasks at three difficulty levels (easy, medium, and hard), we assessed both models by their accepted solutions, memory efficiency, and runtime performance. Our results indicate that while both models perform similarly on easy tasks, ChatGPT outperforms DeepSeek-R1 on medium-difficulty tasks, achieving a 54.5% success rate compared to DeepSeek-R1's 18.1%. Both models struggled with hard tasks, highlighting the ongoing challenges LLMs face in handling highly complex programming problems. These findings highlight key differences in both model capabilities and their computational power, offering valuable insights for developers and researchers working to advance AI-driven programming tools.


2501.01793 2026-05-21 cs.LG cs.AI version update

Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation

Mohammad Khalil, Sam Urmian, Ronas Shakya, Qinyi Liu

Affiliations * Centre for the Science of Learning & Technology (SLATE); University of Bergen

AI summary: This paper studies the potential of generative adversarial networks (GANs) and large language models (LLMs) for generating synthetic tabular data, explores whether synthetic data can create artificial students to serve learning analytics models, and evaluates the performance of different generator models.

Abstract

In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality student data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit its availability and usage. Synthetic data offers a promising alternative. We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models. Using the popular GAN model CTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT), we generate synthetic tabular student data. Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real student data. To validate our findings, we apply a comprehensive set of utility evaluation metrics to assess the statistical and predictive performance of the synthetic data and compare the different generator models used, especially the performance of LLMs. Our study aims to provide the learning analytics community with valuable insights into the use of synthetic data, laying the groundwork for expanding the field's methodological toolbox with new approaches for learning analytics data generation.


2501.01785 2026-05-21 cs.LG cs.AI cs.CY version update

Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms

Qinyi Liu, Oscar Deho, Sam Urmian, Mohammad Khalil, Srecko Joksimovic, George Siemens

Affiliations * Centre for the Science of Learning & Technology (SLATE), University of Bergen; University of South Australia

AI summary: This study examines how well synthetic data generation and fairness algorithms balance privacy and fairness. It finds that the DECAF algorithm achieves the best balance between the two but has lower predictive accuracy, and that applying pre-processing fairness algorithms to synthetic data further improves fairness.

Abstract

The increasing use of machine learning in learning analytics (LA) has raised significant concerns around algorithmic fairness and privacy. Synthetic data has emerged as a dual-purpose tool, enhancing privacy and improving fairness in LA models. However, prior research suggests an inverse relationship between fairness and privacy, making it challenging to optimize both. This study investigates which synthetic data generators can best balance privacy and fairness, and whether pre-processing fairness algorithms, typically applied to real datasets, are effective on synthetic data. Our results highlight that the DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness. However, DECAF suffers in utility, as reflected in its predictive accuracy. Notably, we found that applying pre-processing fairness algorithms to synthetic data improves fairness even more than when applied to real data. These findings suggest that combining synthetic data generation with fairness pre-processing offers a promising approach to creating fairer LA models.


2411.09593 2026-05-21 eess.IV cs.AI cs.CV version update

APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

Yibo Li, Jiashuo Yang, Zhi Zheng, Zhiyuan Hu, Yuan Sui, Shizun Wang, Yufei He, Bryan Hooi

Affiliations * National University of Singapore; Beijing University of Posts and Telecommunications

AI summary: This paper proposes APEX, an autonomous policy exploration method for self-evolving LLM agents that builds and maintains an explicit strategy space to address exploration collapse, and it performs strongly on multiple benchmarks.

Abstract

LLM agents have shown strong performance across a wide range of complex tasks, including interactive environments that require long-horizon decision making. But these agents cannot learn on the fly at test time. Self-evolving agents address this by accumulating memory and reflection across episodes rather than requiring model-weight updates. However, these agents often suffer from exploration collapse: as memory grows, behavior concentrates around familiar high-reward routines, reducing the chance of discovering better alternatives. To address this problem, we propose Autonomous Policy EXploration (APEX), which builds and maintains an explicit strategy space through a strategy map-a directed acyclic graph of milestones with prerequisite dependency edges. In APEX, Fork Discovery expands the map with evidence-grounded unexplored directions, while Policy Selection balances exploration and exploitation during planning. Evaluated on nine Jericho text-adventure games and WebArena, a realistic web interaction benchmark, APEX outperforms all baselines. Extensive ablations validate each component's contribution and demonstrate robustness across diverse settings, demonstrating APEX's effectiveness for sustained exploration in self-evolving agents.


2605.21237 2026-05-21 cs.CV cs.AI version update

RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis

Xuan Yang, Xiaohan Yuan, Hao Li, Lingyu Chen, Yanan Liu, Lei Li

Affiliations * School of Biomedical Engineering, National University of Singapore, Singapore; School of Automation, Southeast University, Nanjing, China; School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China; School of Information Science and Engineering, Yunnan University, Kunming, China

AI summary: This paper proposes RePCM, which completes bi-ventricular mesh motion from a single frame and uses region-specific and phenotype-adaptive modeling to improve the accuracy of cardiac motion synthesis in the presence of region- and disease-specific differences caused by cardiovascular disease.

Comments Early Accepted by MICCAI 2026. This is the author's submitted version. 10 pages, 3 figures

Abstract

Cardiac motion over a cardiac cycle is crucial for quantifying regional function and is strongly affected by cardiovascular diseases. Since temporally dense mesh sequences are difficult to obtain in practice, we focus on leveraging the more accessible end-diastolic frame to infer a full-cycle sequence. Due to strong regional and disease-specific differences, traditional methods often oversmooth the data by relying on generative models that are optimized for global patterns. To address this problem, we propose Region-Aware and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis (RePCM) for single frame Bi-ventricular mesh motion completion. In Stage I, a reconstruction network learns vertex wise motion descriptors and clustering yields a data driven functional partition, providing an explicit motion derived region structure. In Stage II, a Region-Specific Injection Module enforces masked, synchronized region exchange within a conditional VAE, preserving localized specific dynamics and restricting cross-region mixing. A Phenotype-Adaptive Mixture-of-Experts prior conditioned on ED shape uses anatomy-guided cues to model latent motion trends and capture inter-disease variability. Experiments on three datasets covering different cardiovascular diseases show consistent gains in geometric and functional metrics and improved preservation of region specific dynamics.


2605.21226 2026-05-21 cs.LG cs.AI version update

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

Mark Boss, Vikram Voleti, Simon Donné, Shimon Vainer

Affiliations * Stability AI

AI summary: OCTOPUS optimizes the Transformer KV cache by jointly quantizing triplets of rotated coordinates. An octahedral parameterization maps each triplet's direction to a square, and Lloyd-Max quantization yields a non-uniform bit allocation, outperforming existing rotation codecs across data modalities.

Abstract

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm through joint quantization of rotated coordinate triplets. Each triplet's direction is mapped to a square via an octahedral parameterization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation depending only on the total dimensionality of the keys. We find the finite-dimensional quality optimum with sweeps to be constant on every real decoder we test. The codec is data-oblivious, online, and deterministic given a seed. Across text, video, and audio, OCTOPUS matches or beats every prior rotation codec at every reported bit width and metric, with a lead that grows as bits drop for extreme compression. Furthermore, a fused Triton implementation reconstructs keys on the fly without materializing the uncompressed key, so the codec adds no decode-time bandwidth or latency over the existing dequantization. Project Page: https://octopus-quant.github.io/
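
The octahedral step mentioned above can be illustrated with the octahedral mapping commonly used for unit vectors in computer graphics. This is a sketch under assumptions: the paper's exact variant may differ, the triplet values are placeholders, and the Lloyd-Max quantization of the two coordinates and the norm is omitted.

```python
import numpy as np


def _sign(x):
    """Sign that maps 0 to +1 so the fold below never collapses a coordinate."""
    return 1.0 if x >= 0 else -1.0


def oct_encode(v):
    """Map a nonzero 3-vector's direction to a point in the square [-1, 1]^2."""
    x, y, z = np.asarray(v, dtype=np.float64) / np.sum(np.abs(v))  # L1-normalize
    if z < 0:  # fold the lower hemisphere over the diagonals
        x, y = (1.0 - abs(y)) * _sign(x), (1.0 - abs(x)) * _sign(y)
    return np.array([x, y])


def oct_decode(e):
    """Invert oct_encode, returning a unit-length direction."""
    x, y = e
    z = 1.0 - abs(x) - abs(y)
    if z < 0:  # undo the fold
        x, y = (1.0 - abs(y)) * _sign(x), (1.0 - abs(x)) * _sign(y)
    d = np.array([x, y, z])
    return d / np.linalg.norm(d)


triplet = np.array([0.3, -1.2, -0.5])   # placeholder rotated-coordinate triplet
norm = np.linalg.norm(triplet)          # stored separately from the direction
restored = norm * oct_decode(oct_encode(triplet))
print(np.allclose(restored, triplet))   # True
```

Without quantization the round trip is exact up to floating-point error. In a codec of the kind the abstract describes, the loss would come only from quantizing the two square coordinates and the norm.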


2605.21225 2026-05-21 cs.LG cs.AI version update

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran

Affiliations * TCS Research, Department of CSE, IIT Madras, India; Department of Computing Science, University of Alberta, Canada; Qatar Computing Research Institute, Hamad Bin Khalifa University, Qatar; Department of Data Science & AI, Wadhwani School of Data Science & AI, IIT Madras, India

AI summary: This study proposes PREFINE, which uses preference-based implicit reward and cost fine-tuning to align policies for safety in continuous control environments, fine-tuning a pre-trained reinforcement learning policy to produce low-cost behavior while retaining high reward.

Comments Accepted at AAMAS 2026 as a full paper

Abstract

We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we consider the more general setting in which costs are provided as preferences. Given a reward-optimized policy and a small dataset of preferred (low-cost) and dispreferred (high-cost) trajectories, our goal is to fine-tune the policy to generate low-cost behaviors while retaining high rewards. Unlike standard RLHF in language models, where preferences are defined over responses to the same prompt, our setting involves trajectory-level preferences in continuous control environments. We introduce PREFINE (Preference-based Implicit Reward and Cost Fine-Tuning for Safety Alignment), a preference-based fine-tuning method that adapts Direct Preference Optimization (DPO), now widely used for LLM fine-tuning, to the sequential decision-making setting. PREFINE constructs policy-sampled counterfactual trajectories to establish meaningful preference contrasts and jointly optimizes for reward retention and safety alignment. Empirically, PREFINE reduces constraint violations and catastrophic failures by over 60% while maintaining original reward behavior. PREFINE produces policies that achieve low-cost, high-reward performance with significantly improved data and computational efficiency compared to full offline RL or imitation learning, bridging preference alignment and safe policy adaptation in continuous domains.
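
For orientation, the standard DPO loss that the abstract says PREFINE adapts can be sketched at the trajectory level as below. This is not PREFINE's objective: the abstract does not give it, and the reward-retention term and counterfactual trajectory construction are not modeled here. The function names, the `beta` value, and the log-probabilities are illustrative assumptions.

```python
import numpy as np


def trajectory_logp(step_logps):
    """Log-probability of a trajectory as the sum of per-step action log-probs."""
    return float(np.sum(step_logps))


def dpo_loss(logp_pref, logp_disp, ref_logp_pref, ref_logp_disp, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    Computes -log(sigmoid(margin)) in a numerically stable way, where the
    margin compares the fine-tuned policy against the frozen reference policy.
    """
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_disp - ref_logp_disp))
    return float(np.logaddexp(0.0, -margin))


# Placeholder per-step log-probs for a low-cost and a high-cost trajectory.
pref = trajectory_logp([-0.2, -0.3, -0.1])
disp = trajectory_logp([-0.9, -1.1, -0.8])
ref_pref = trajectory_logp([-0.5, -0.5, -0.5])
ref_disp = trajectory_logp([-0.5, -0.5, -0.5])

print(dpo_loss(pref, disp, ref_pref, ref_disp))
```

The loss falls as the fine-tuned policy raises the likelihood of the preferred trajectory relative to the reference and lowers that of the dispreferred one.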


2605.21224 2026-05-21 physics.optics cs.AI eess.SP version update

Artificial Intelligence Reshapes Microwave Photonics

Peng Li, Xihua Zou, Jia Ye, Wei Pan, Lianshan Yan

Affiliations * Center for Information Photonics and Communications, School of Information Science and Technology, Southwest Jiaotong University; Key Laboratory of Photonic-Electronic Integration and Communication-Sensing Convergence, Southwest Jiaotong University

AI summary: This paper reviews how artificial intelligence is advancing microwave photonics, where combining AI with microwave photonic technology has enabled advances in signal generation, transmission, processing, and detection.

Comments 13 pages, 12 figures

DOI: 10.1002/lpor.71267

Abstract

As a rapidly emerging interdisciplinary field that intrinsically integrates microwave and photonics, microwave photonics (MWP) provides disruptive solutions to overcome the fundamental bandwidth limitations of conventional electronic systems. By exploiting the inherently ultra-wide bandwidth and low-loss characteristics of photonic technologies, MWP enables the generation, transmission, processing, and detection of microwave, millimeter-wave, and terahertz signals. Representative breakthroughs include fully photonic microwave radar systems, photonic analog-to-digital converters with bandwidth up to 320 GHz, and photonic wireless communication systems achieving data rates as high as 616 Gbit/s. Meanwhile, the rapid growth of artificial intelligence (AI) is reshaping scientific research, engineering, and daily life in unprecedented ways, such as AI for science/engineering and AI co-scientist/assistant. Correspondingly, AI is profoundly reshaping MWP in all aspects, ranging from signal generation and transmission to signal processing and detection. AI has revolutionized the design, simulation, fabrication, testing, deployment, and maintenance of MWP systems, delivering autonomous operation and exceptional efficiency beyond traditional systems. Motivated by these developments, this Review Paper provides the first comprehensive overview of AI-enabled MWP, systematically summarizing the state-of-the-art advances and presenting insights for both the academic community and the broader public.


2605.21213 2026-05-21 quant-ph cs.AI cs.LG math.OC version update

Enhanced Reinforcement Learning-based Process Synthesis via Quantum Computing

Austin Braniff, Fengqi You, Yuhe Tian

Affiliations * Department of Chemical and Biomedical Engineering, West Virginia University; R.F. Smith School of Chemical and Biomolecular Engineering, Cornell University

AI summary: This paper proposes a quantum reinforcement learning approach to process synthesis. A generalized framework poses process synthesis as a Markov decision process, quantum-enhanced RL algorithms improve scalability, and a comparison against a classical RL baseline shows that the quantum methods are competitive.

Abstract

In this work, we present quantum reinforcement learning (RL) as a solution strategy for process synthesis problems. Building on our prior work, we develop a generalized framework that formally poses process synthesis as a Markov decision process and introduces quantum-enhanced RL algorithms to solve it with improved scalability. Earlier implementations of quantum-based RL for process synthesis were limited by qubit requirements, which scaled poorly with problem complexity. This work overcomes this challenge by introducing state encoding algorithms to decouple qubit requirements from problem size. A classical RL-based solution strategy is used as a baseline to benchmark the quantum algorithms under identical training conditions. All algorithms are evaluated across a flowsheet synthesis problem of increasing unit counts to analyze their performance and scalability. Results show that all approaches are capable of identifying the optimal flowsheet designs in small design spaces. For moderate-scale unit counts, quantum approaches demonstrate competitive performance on a per-episode basis and improved efficiency on a per-parameter basis versus the classical RL benchmark. This work provides a foundation for future quantum computing applications within process systems engineering, establishes a controlled benchmark for comparing classical and quantum algorithms, and shows that the proposed quantum variants remain competitive for the process synthesis problem examined in this work.


2605.21198 2026-05-21 cs.SI cs.AI version update

Automating ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models

Fernando Ortega, Raúl Lara-Cabrera, Jorge Dueñas-Lerín, Alejandro de la Torre-Luque, Mercé Salvador Robert, Enrique Baca-García

Affiliations * Department of Sistemas Informáticos, Universidad Politécnica de Madrid, Spain; KNODIS Research Group, Universidad Politécnica de Madrid, Spain; CIBERSAM ISCIII, Spain; Department of Legal Medicine, Psychiatry and Pathology, Complutense University of Madrid, Spain; Hospital Universitario de Móstoles, Universidad Rey Juan Carlos, Spain; Department of Psychiatry, University Hospital Jimenez Díaz Foundation, Madrid, Spain; Department of Psychiatry, University Hospital Rey Juan Carlos, Móstoles, Spain; Department of Psychiatry, General Hospital of Villalba, Madrid, Spain; Department of Psychiatry, University Hospital Infanta Elena, Madrid, Spain; Department of Psychology, Universidad Catolica del Maule, Talca, Chile; Department of Psychiatry, Madrid Autonomous University, Madrid, Spain

AI summary: This study maps free-text descriptions to the International Classification of Diseases (ICD) using NLP and machine learning to automate psychiatric diagnostic analysis. It evaluates text representations ranging from classical frequency-based models to state-of-the-art large language models and shows the advantage of transformer embeddings in capturing implicit semantic cues and nuanced medical terminology.

Abstract

Mental health has become a global priority, leading to a massive administrative burden in the coding of clinical diagnoses. This study proposes the automation of psychiatric diagnostic analysis by mapping free-text descriptions to the International Classification of Diseases (ICD) using Natural Language Processing (NLP) and Machine Learning (ML) techniques. Utilizing a specialized dataset of 145,513 Spanish psychiatric descriptions, various text representation paradigms were evaluated, ranging from classical frequency-based models (BoW, TF-IDF) to state-of-the-art Large Language Models (LLMs) such as e5_large, BioLORD, and Llama-3-8B. Results indicate that transformer-based embeddings consistently outperform traditional methods by capturing implicit semantic cues and nuanced medical terminology. The e5_large model, through end-to-end fine-tuning, achieved the highest performance with a micro-F1 score of 0.866. This research demonstrates that adapting LLMs to specific clinical nomenclature is essential for overcoming the challenges of "long-tail" label distributions and the inherent ambiguity of psychiatric discourse.
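
The micro-averaged F1 metric reported above pools true positives, false positives, and false negatives across all labels before computing F1, so frequent labels dominate the score. The sketch below shows the computation on a toy example. The function name and the code assignments are illustrative only and are not taken from the study's data.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-example label sets.

    Counts are pooled over all examples and labels before computing F1.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy gold and predicted ICD-10 codes for two descriptions (illustrative only).
gold = [{"F32"}, {"F41"}]
pred = [{"F32"}, {"F20"}]
print(micro_f1(gold, pred))  # 0.5
```

When every example has exactly one gold label and one predicted label, as in this toy case, micro-F1 equals plain accuracy.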


2605.21146 2026-05-21 cs.CR cs.AI cs.SE version update

Detecting Trojaned DNNs via Spectral Regression Analysis

Samuele Pasini, Jinhan Kim, Paolo Tonella

Affiliations * Università della Svizzera Italiana

AI summary: This paper proposes MIST, which detects implanted Trojans by analyzing how a model's internal representations change during fine-tuning. It uses pre-activation spectra to flag updates that are inconsistent with a benign reference and achieves high detection accuracy without knowledge of the poisoned data or the trigger.

Abstract

Modern DNNs are repeatedly fine-tuned to incorporate new data and functionality. This evolutionary workflow introduces a security risk when updated data cannot be fully trusted, as adversaries may implant Trojans during fine-tuning. We present MIST, a Trojan detection approach that analyzes how a model's internal representations change during fine-tuning. Rather than attempting to reconstruct trigger conditions, MIST characterizes benign model evolution using pre-activation spectra and flags updates whose spectral deviations are inconsistent with this reference. This framing treats Trojan detection as a regression problem over model updates. An empirical evaluation across four datasets and eight Trojan attacks shows that spectral distances reliably distinguish Trojaned updates from clean fine-tuning. MIST outperforms state-of-the-art detection accuracy after a single update, without requiring any knowledge about the poisoned data or the trigger, and remains effective under multi-step benign evolution, with graceful and bounded degradation. These results indicate that spectral evolution provides a stable and assumption-light signal for detecting malicious model updates.


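2605.21113 2026-05-21 cs.LO cs.AI version update

On the Complexity of Entailment for Cumulative Propositional Dependence Logics

Kai Sauerwald, Juha Kontinen, Arne Meier

Affiliations * University of Helsinki

AI summary: This paper studies the complexity of the entailment problem for cumulative propositional dependence logic and cumulative propositional logic with team semantics, establishing complexity results via relational models.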

Comments arXiv admin note: substantial text overlap with arXiv:2602.21360

2605.21102 2026-05-21 cs.CL cs.AI cs.SE 版本更新

ACL-Verbatim: hallucination-free question answering for research

ACL-Verbatim: 无幻觉的科研问答

Gábor Recski, Szilveszter Tóth, Nadia Verdha, István Boros, Ádám Kovács

发表机构 * TU Wien（维也纳技术大学）； KR Labs（KR实验室）

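AI summary: This study presents ACL-Verbatim, an extractive question answering system that maps user queries to relevant verbatim text spans in research papers. The authors build a new ground-truth dataset and use it to train and evaluate several extractive models. A 150M-parameter ModernBERT model reaches a word-level F1 of 53.6, ahead of the strongest evaluated LLM extractor (48.7).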

Comments 13 pages

AI中文摘要

学术研究者需要高效可靠的工具从可信来源获取高质量信息，但现代AI辅助研究工具仍受大语言模型（LLMs）产生事实不准确或不合逻辑输出（即幻觉）的影响。我们应用提取式问答系统VerbatimRAG到ACL Anthology中的研究论文，直接将用户查询映射到检索文档中的原文文本片段。我们贡献了一个新的真实数据集，用于将用户查询映射到科研论文中的相关文本片段，并利用该数据集训练和评估多种提取模型。人工标注由自然语言处理研究人员完成，基于使用ScIRGen方法生成的合成用户查询，配以由VerbatimRAG检索的论文片段。在该基准上，一个基于我们流水线银色监督训练的150M参数ModernBERT标记分类器在词级F1得分上达到53.6，优于最强的评估LLM提取器（48.7）

英文摘要

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
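The abstract reports word-level F1 without defining it. As a rough illustration only, the sketch below computes a bag-of-words F1 between a predicted span and a reference span, in the style commonly used for extractive QA. The function name `word_f1`, the lowercasing, and the whitespace tokenization are assumptions for this example and may differ from the paper's actual metric.

```python
from collections import Counter


def word_f1(predicted: str, reference: str) -> float:
    """Bag-of-words F1 between two text spans (illustrative definition)."""
    pred_words = predicted.lower().split()
    ref_words = reference.lower().split()
    # If either span is empty, score 1.0 only when both are empty.
    if not pred_words or not ref_words:
        return float(pred_words == ref_words)
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((Counter(pred_words) & Counter(ref_words)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(ref_words)
    return 2 * precision * recall / (precision + recall)
```

For instance, a prediction that shares three of its four words with a four-word reference gets precision and recall of 0.75 each, so its F1 is also 0.75.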

2605.21085 2026-05-21 cs.MA cs.AI cs.LG 版本更新

Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints

分离通信与策略：在带宽限制下的鲁棒多智能体强化学习

Alexi Canesse, Benoît Goupil, Jesse Read, Sonia Vanier

发表机构 * École polytechnique (LIX)（巴黎高等理工学院（LIX））； CNRS（国家科学研究中心）； Institut Polytechnique de Paris（巴黎高等理工学院）； Palaiseau, France（法国Palaiseau）

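AI summary: This paper introduces β, a normalised per-agent bandwidth budget, and SLIM, an architecture that decouples the communication pathway from the policy's latent representation, to improve the robustness and performance of multi-agent reinforcement learning under bandwidth constraints.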

AI中文摘要

通信在多智能体强化学习（MARL）中起到了协调作用，但许多实际应用，例如无人机编队的搜索与救援任务，在严重的带宽限制下运行。许多通信架构仍然存在耦合瓶颈，其中共享的潜在表示用于策略执行和智能体间通信。因此，减少信息量会直接限制策略的潜在空间，通常导致显著的性能下降。我们通过两个贡献来解决这个问题。首先，我们引入β，一个归一化的每智能体带宽预算，将稀疏性、轮次和信息维度统一为一个可比的约束。其次，我们提供SLIM，一个最小的架构，将通信路径与策略的潜在表示分离，使我们能够隔离带宽的影响与策略容量的影响，同时受益于步骤内通信。我们在几个部分可观测的MARL基准上评估了我们的方法，其中通信是至关重要的。我们的方法在状态空间中实现了最先进的性能，并且在有限的通信下表现出可扩展性和鲁棒性，随着带宽的减少，降级仅是轻微的。

英文摘要

Communication enables coordination in multi-agent reinforcement learning (MARL), but many real-world applications, e.g., search-and-rescue with drone swarms, operate under severe bandwidth constraints. Many communication architectures still expose a coupled bottleneck in which a shared latent representation is used for both policy execution and inter-agent communication. Consequently, reducing message size directly limits the policy's latent space, often leading to significant performance degradation. We address this with two contributions. First, we introduce $β$, a normalised per-agent bandwidth budget that unifies sparsity, rounds, and message dimension into a single comparable constraint. Second, we provide SLIM, a minimal architecture that decouples the communication pathway from the policy's latent representation, allowing us to isolate the effect of bandwidth from the effect of policy capacity while benefiting from in-step communication. We evaluate our method on several partially-observable MARL benchmarks, where communication is essential. Our approach achieves state-of-the-art performance and exhibits scalability and robustness under limited communication, with only marginal degradation as bandwidth is reduced.

2605.21082 2026-05-21 cs.AI 版本更新

AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

AutoRPA: 通过基于LLM的代码合成实现高效的GUI自动化

Minghao Chen, Xinyi Hu, Zhou Yu, Yufei Yin

发表机构 * Zhejiang Key Laboratory of Space Information Sensing and Transmission（浙江空间信息感知与传输重点实验室）； School of Computer Science, Hangzhou Dianzi University（杭州电子科技大学计算机科学学院）

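AI summary: This paper proposes AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions, improving the efficiency and reusability of GUI automation while reducing token usage by 82% to 96%.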

Comments Accepted in ICML 2026

AI中文摘要

基于大型语言模型（LLM）的代理在多步骤的图形用户界面（GUI）交互中表现出色。尽管大多数研究集中在提升单任务性能，但实际场景中往往涉及重复的GUI任务，而频繁调用LLM推理（即ReAct范式）效率低下。在LLM之前，传统的机器人流程自动化（RPA）提供运行时效率，但需要大量手动努力来开发和维护。为弥合这一差距，我们提出AutoRPA框架，该框架能够自动将ReAct风格代理的决策逻辑转化为鲁棒的RPA功能。AutoRPA引入了两个核心创新：（1）一个翻译-构建流水线，其中翻译代理将硬编码的ReAct动作转换为软编码的流程，构建代理通过多轨迹检索增强生成合成鲁棒的RPA功能；（2）在代码验证期间的混合修复策略，结合RPA执行与基于ReAct的回退机制进行迭代优化。在多个GUI环境中的实验表明，由AutoRPA生成的RPA功能能够成功解决类似任务，同时减少82%到96%的token使用量，显著提高运行时效率和可重用性。

英文摘要

Large Language Model (LLM) based agents have demonstrated proficiency in multi-step interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve repetitive GUI tasks for which invoking LLM reasoning repeatedly, i.e., the ReAct paradigm, is inefficient. Prior to LLMs, traditional Robotic Process Automation (RPA) offers runtime efficiency but demands significant manual effort to develop and maintain. To bridge this gap, we propose AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions. AutoRPA introduces two core innovations: (1) A translator-builder pipeline, where a translator agent converts hard-coded ReAct actions into soft-coded procedures, and a builder agent synthesizes robust RPA functions via retrieval-augmented generation over multiple trajectories; (2) A hybrid repair strategy during code verification, combining RPA execution with ReAct-based fallback for iterative refinement. Experiments across multiple GUI environments demonstrate that RPA functions generated by AutoRPA successfully solve similar tasks while reducing token usage by 82% to 96%, significantly improving runtime efficiency and reusability.

2605.21061 2026-05-21 cs.CV cs.AI cs.RO 版本更新

Grounding Driving VLA via Inverse Kinematics

通过逆运动学接地驾驶VLA

Junsung Park, Hyunjung Shim

发表机构 * Korea Advanced Institute of Science and Technology（韩国科学技术院）； KimJaeChul AI Graduate School（金 JaeChul人工智能研究生院）

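AI summary: This paper redesigns Driving VLA in the style of an inverse kinematics solver to address the tendency of trajectory prediction to ignore visual tokens. A next-visual-state prediction objective and a separate Inverse Kinematics Network improve visual grounding and trajectory planning performance.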

AI中文摘要

现有驾驶VLA在预测轨迹时大多忽略其视觉token--这一现象我们归因于任务公式结构上不合理的设定而非训练不足。我们证明，当通过逆运动学视角看待轨迹恢复时，需要当前和未来视觉状态作为边界条件；现有VLA仅提供前者，促使模型依赖自身状态和文本指令进行捷径预测。为解决此问题，我们重新设计驾驶VLA，使其风格类似于逆运动学求解器。首先，一个需要LLM预测未来视觉场景的下一视觉状态预测目标提供密集的视觉监督并抑制捷径路径。其次，一个单独的逆运动学网络（基于交叉注意力的条件扩散模型）仅输入当前和未来视觉状态，以在轨迹解码过程中抑制对自身状态和文本捷径的依赖。仅通过这种简单的处方，我们的0.5B规模模型恢复了视觉接地能力，并在闭合回路NAVSIM-v2和nuScenes基准上，其轨迹规划性能可与7B-8B规模的VLA相媲美。进一步的分析表明，这种改进源于恢复了利用视觉特征的能力，效果在动态驾驶场景如转弯时尤为明显。

英文摘要

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens -- a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B--8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

2605.21060 2026-05-21 cs.LG cs.AI stat.ML 版本更新

Divide et Calibra: Multiclass Local Calibration via Vector Quantization

Divide et Calibra: 通过向量量化实现多类局部校准

Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, Andrea Pugnana

发表机构 * University of Pisa（比萨大学）； University of Trento（特伦托大学）； Meta（Meta公司）； Fondazione Bruno Kessler（布鲁诺·凯斯勒基金会）

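AI summary: This paper proposes a compositional approach to multiclass calibration. Vector quantization induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations shares parameters across regions, so the learned heterogeneous calibration maps generalize to sparse regions. Local calibration improves while global calibration and predictive performance remain competitive.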

AI中文摘要

在高风险场景中，准确且校准良好的机器学习（ML）模型是必需的，但有效的多类校准仍然具有挑战性：全局方法假设校准误差在潜在空间中是同质的，而局部方法通常依赖于潜在空间降维，导致信息丢失。为了解决这些问题，我们提出了一种多类校准的复合方法，其中区域特定的校准映射是从共享的码字依赖因素中构建的。我们通过向量量化（VQ）实现这一想法，它诱导了表示空间的结构划分，并利用Dirichlet浓度的参数化实现跨区域参数共享。我们的方法学习了能泛化到稀疏区域的异质校准映射。在基准数据集上的实验显示，在保持竞争性的全局校准和预测性能的同时，显著提高了局部校准性能。

英文摘要

Accurate and well-calibrated Machine Learning (ML) models are mandatory in high-stakes settings, yet effective multiclass calibration remains challenging: global approaches assume calibration errors are homogeneous across the latent space, while local methods often rely on latent-space dimensionality reduction, which leads to information loss. To address these issues, we propose a compositional approach to multiclass calibration, where region-specific calibration maps are constructed from shared codeword-dependent factors. We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions. Our approach learns heterogeneous calibration maps that generalize well even to sparse regions of the latent space. Experiments on benchmark datasets show significant improvements in local calibration while maintaining competitive global calibration and predictive performance.

2605.20998 2026-05-21 cs.CL cs.AI 版本更新

Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis

单次传递、深度选择性阅读用于多方面情感分析

Yan Xia, Zhuangzhuang Pan, Amirrudin Kamsin, Chee Seng Chan

发表机构 * Universiti Malaya, Malaysia（马来大学，马来西亚）； Suzhou University of Technology, China（苏州科技学院，中国）； VinUniversity, Vietnam（文大学，越南）

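AI summary: This paper proposes DABS, a framework that encodes each sentence once to build a reusable, depth-ordered substrate, reducing end-to-end computation for multi-aspect sentiment analysis by up to 60% while keeping performance competitive.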

Comments Accepted at ACL2026 (main). Our solution (DABS) reads the sentence once, then lets each aspect selectively query the right tokens and Transformer depths, cutting redundant computation while preserving ATSA accuracy

AI中文摘要

在多方面句子中，方面术语情感分析（ATSA）面临效率与表达性的根本权衡。现有模型要么为每个方面重新编码句子，要么依赖静态深度表示，导致冗余计算和有限适应性。我们主张Transformer深度是一种昂贵且可查询的资源，并提出DABS，一种单次推断框架，通过一次编码构建可重用的深度有序基底。每个方面则查询此共享表示以选择性地读取相关token和抽象层次，而无需重新编码。这将共享句子编码与轻量级、方面条件化的读取解耦。在四个ATSA基准测试中，DABS实现了具有竞争力的性能，同时在多方面设置（M >= 2）中将端到端计算减少了高达60%。进一步分析表明，自适应深度查询在语言复杂情况如否定和对比中最为有益。代码可在https://github.com/panzhzh/acl-dabs公开获取。

英文摘要

Aspect-Term Sentiment Analysis (ATSA) in multi-aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re-encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single-pass inference framework that encodes each sentence once to construct a reusable, depth-ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re-encoding. This decouples shared sentence encoding from lightweight, aspect-conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end-to-end computation by up to 60% in multi-aspect settings (M >= 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at https://github.com/panzhzh/acl-dabs

2605.20997 2026-05-21 cs.CV cs.AI cs.LG physics.comp-ph 版本更新

Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data

基于TanDEM-X和Landsat数据的混合机器学习模型用于森林高度估计

Islam Mansour, Ronny Haensch, Irena Hajnsek, Konstantinos Papathanassiou

发表机构 * German Aerospace Center (DLR)（德国航空航天中心（DLR））； Institute of Environmental Engineering, ETH Zürich（环境工程研究所，苏黎世联邦理工学院）

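AI summary: This paper proposes a hybrid approach that combines machine learning with a physical model, using TanDEM-X interferometric coherence measurements and optical Landsat data to improve forest height estimation. Extending the feature space reduces height/structure and baseline/terrain-slope ambiguities. RMSE and MAE fall by 13.5% and 16.6%, respectively, relative to the original hybrid model.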

DOI: 10.1109/LGRS.2026.3693644

AI中文摘要

将机器学习（ML）与物理模型（PM）结合，已成为从遥感数据中检索地球物理参数的一种有前途的方法。在此背景下，一种用于从TanDEM-X干涉相干测量中估计森林高度的ML模型最近被提出，该模型通过物理模型约束学习过程。虽然所选特征用于训练和反演以确保解决方案的物理一致性，但它们无法解决数据中的所有高度/结构和基线/地形坡度模糊性。为改进这一点，提出通过扩展特征空间加入光学Landsat数据，以提供关于森林类型或结构的补充信息。扩展的模型被应用于几处Gabon的Lopé国家公园的TanDEM-X数据，并与空中LiDAR测量进行评估。结果表明，与原始混合模型相比，RMSE和MAE分别减少了13.5%和16.6%，证实了多光谱输入的附加价值。

英文摘要

Integrating machine learning (ML) with physical models (PM) has emerged as a promising way of retrieving geophysical parameters from remote sensing data. In this context, an ML model for estimating forest height from TanDEM-X interferometric coherence measurements has recently been proposed that constrains the learning process through a PM. While the features used for training and inversion were selected to ensure the physical consistency of the solutions, they could not resolve all height / structure and baseline / terrain slope ambiguities in the data. To improve this, an extension of the feature space with optical Landsat data is proposed, which can provide complementary information on forest type or structure. The extended model is applied and validated on several TanDEM-X acquisitions over the Gabonese Lopé national park site and assessed against airborne LiDAR measurements. Results show a 13.5% reduction in RMSE and a 16.6% reduction in MAE compared to the original hybrid model, confirming the added value of multispectral inputs.

2605.20994 2026-05-21 cs.CL cs.AI 版本更新

Towards Context-Invariant Safety Alignment for Large Language Models

面向大语言模型的上下文不变安全对齐

Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang

发表机构 * Fudan University, Shanghai, China（复旦大学，上海，中国）； Shanghai Artificial Intelligence Laboratory, Shanghai, China（上海人工智能实验室，上海，中国）

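AI summary: This paper proposes context-invariant safety alignment. It introduces Anchor Invariance Regularization (AIR) to make model behavior consistent across contexts, which makes safety constraints more robust to adversarial framings.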

Comments ICML 2026

AI中文摘要

基于偏好进行的后训练对齐可以将大语言模型与人类意图对齐，但安全行为往往仍然脆弱。一个模型可能在标准提示下拒绝有害请求，但在相同意图被包装在对抗性语言中时却会合规。我们建议，稳健的安全性需要上下文不变的对齐，其中行为取决于底层意图而非表面形式。在对齐中强制不变性是困难的，因为并非所有训练信号都同等可信；对于某些提示变体我们能够获得可验证的反馈（例如多选题），而对于开放性变体我们通常依赖于噪声且可游戏化的奖励代理（例如学习的评判者）。因此，标准对称不变正则化器可以通过降低在可靠变体上的性能来减少跨上下文差异，而不是改进开放性鲁棒性。为了解决这个问题，我们引入了锚点不变正则化（AIR），它将可验证的提示视为锚点，并使用停止梯度目标来正则化开放性变体朝着锚点性能的方向。AIR作为插件辅助损失实现，并通过异质提示分组与基于组的偏好优化（例如GRPO）结合。在安全、道德推理和数学方面，AIR提高了上下文不变性，提升了在分布内组的准确性达12.71%，在分布外一致性提升33.49%，使安全约束对对抗性框架更加稳健。

英文摘要

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.
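The difference between a symmetric invariance penalty and an anchored one with a stop-gradient target can be seen on a toy scalar example. The sketch below is not the AIR loss from the paper, whose exact form the abstract does not give. It assumes a squared-difference penalty between an anchor score and an open-ended score and writes out the gradients by hand; all function names are hypothetical.

```python
def symmetric_grads(anchor_score: float, open_score: float):
    """Gradients of (anchor - open)^2 with respect to both scores."""
    diff = anchor_score - open_score
    return 2.0 * diff, -2.0 * diff  # (d/d anchor, d/d open)


def anchored_grads(anchor_score: float, open_score: float):
    """Same penalty, but the anchor is a stop-gradient target (a constant)."""
    diff = anchor_score - open_score
    return 0.0, -2.0 * diff  # no gradient reaches the anchor


def descent_step(grad_fn, anchor_score, open_score, lr=0.1):
    """One gradient-descent step on the two scores (toy stand-in for training)."""
    g_anchor, g_open = grad_fn(anchor_score, open_score)
    return anchor_score - lr * g_anchor, open_score - lr * g_open
```

Starting from an anchor score of 0.9 and an open-ended score of 0.5, the symmetric step pulls the anchor down to 0.82 while raising the open-ended score to 0.58. The anchored step leaves the anchor at 0.9 and only raises the open-ended score, which is the behavior the abstract describes as regularizing only the open-ended variants.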

2605.20982 2026-05-21 cs.DC cs.AI cs.LG 版本更新

Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

调度操作中的开销诊断：跨架构观测站

Bole Ma, Jan Eitzinger, Harald Koestler, Gerhard Wellein

发表机构 * DeepSeek-V2-Lite MLA ； DeepSeek-MoE-16B MHA ； Qwen3-30B GQA ； Nemotron-30B Mamba-2 ； Qwen3.5-35B GDN

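AI summary: This study tests two workload assumptions that underlie four families of AlltoAll mitigations. Scaling expert parallelism (EP) changes the per-expert max/mean token ratio by at most 5%, and mock-token benchmarks overestimate the routing Gini coefficient and fabricate a batch-size scaling trend. The five architectures split into two stable bands in the same measurement matrix. These bands, rather than the EP degree or the mock-data profile, are the right workload input for AlltoAll-aware interconnect and dispatch design.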

详情

AI中文摘要

AlltoAll调度是MoE专家并行性的主要瓶颈，互连社区对此做出了四种缓解方案：预测样本放置、自适应专家重新布局、分层收集和EP-aware拓扑。这四种方案都基于两个关于工作负载的假设。第一个假设是路由不平衡可以通过系统层纠正。第二个假设是评估它们的mock-token基准忠实代表生产路由。我们引入DODOCO来测试这两个假设。我们对五个MoE检查点进行仪器化，涵盖五个序列混合器设计（DeepSeek-V2-Lite MLA，DeepSeek-MoE-16B MHA，Qwen3-30B GQA，Nemotron-30B Mamba-2，Qwen3.5-35B GDN）在5x6的数据条件下网格以及匹配的EP扫描（4到32个rank在H100s上）；两个假设都失败。扩展EP在每个架构的可测量范围内将每专家最大/均值token比率改变最多5%：straggler是模型路由决策的固有属性，而不是其专家落在rank上的方式。mock tokens高估路由Gini系数高达2.35倍，并制造了一个批次大小缩放趋势，一旦真实文本取代随机ID，该趋势就消失。从相同矩阵中出现第三种模式，意外的是，五种架构分裂成两个稳定的带状分布。MHA和Mamba-2（数据容错）在wikitext上降至Gini 0.105和0，150。MLA和GDN（持续集中）在所有真实文本条件下保持在0.24以上，并在mock中达到0.29到0.38。GQA是中间情况。这些带状分布，而不是EP度或mock数据配置，是AlltoAll-aware互连和调度设计的正确工作负载输入。

英文摘要

AlltoAll dispatch is the dominant bottleneck of MoE expert parallelism, and the interconnect community has responded with four families of mitigations: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology. All four rest on two assumptions about the workload. The first is that routing imbalance is correctable by the system layer. The second is that the mock-token benchmarks evaluating them faithfully represent production routing. We introduce DODOCO to test both assumptions. We instrument five MoE checkpoints spanning five sequence-mixer designs (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) under a 5 by 6 grid of data conditions plus a matched EP scan from 4 to 32 ranks on H100s; both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5% within every architecture's measurable range: the straggler is intrinsic to the routing decision the model makes, not to how its experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that vanishes the moment real text replaces random IDs. A third pattern, unexpected, emerges from the same matrix: the five architectures cleave into two stable bands. MHA and Mamba-2 (data-resilient) drop to Gini 0.105 and 0.150 on wikitext. MLA and GDN (persistently concentrated) stay above 0.24 on every real-text condition and reach 0.29 to 0.38 on mock. GQA is the intermediate case. These bands, not the EP degree or the mock-data profile, are the right workload input to AlltoAll-aware interconnect and dispatch design.
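The two imbalance statistics named in the abstract, the routing Gini coefficient and the per-expert max/mean token ratio, can be computed from per-expert token counts. The sketch below uses the standard mean-absolute-difference form of the Gini coefficient; whether the paper uses the same normalization is an assumption, and the function names are illustrative.

```python
def gini(token_counts):
    """Gini coefficient of per-expert token counts (0 means perfectly balanced)."""
    n = len(token_counts)
    total = sum(token_counts)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs of experts.
    abs_diff = sum(abs(a - b) for a in token_counts for b in token_counts)
    return abs_diff / (2 * n * total)


def max_mean_ratio(token_counts):
    """Load of the busiest expert relative to the average expert load."""
    mean = sum(token_counts) / len(token_counts)
    return max(token_counts) / mean
```

With four experts, a perfectly even routing such as `[5, 5, 5, 5]` gives a Gini of 0 and a max/mean ratio of 1, while routing every token to one expert, `[0, 0, 0, 4]`, gives a Gini of 0.75 and a max/mean ratio of 4.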

2605.20971 2026-05-21 cs.CV cs.AI cs.CR 版本更新

Comparative Evaluation of Deep Learning Models for Fake Image Detection

深度学习模型在虚假图像检测中的比较评估

Akhitha Pakala, Mohammed Mahir Rahman, Shahzad Memon, Tauseef Ahmed

发表机构 * University of East London（东伦敦大学）

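AI summary: This study compares four pretrained CNN architectures for fake image detection under a unified preprocessing and training pipeline. VGG16 achieves the highest accuracy. EfficientNetB0 is more sensitive to fake images but less reliable on real ones. The study highlights the need for balanced datasets, advanced augmentation, and fairness-aware training to build reliable fake image detection systems.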

Comments Accepted at ICCIIoT26 and waiting to be indexed

Journal ref 6th International Conference on Computational Intelligence & Internet of Things (ICCIIoT), 2026

AI中文摘要

随着基于GAN的图像篡改技术日益复杂，数字取证面临重大挑战。本研究比较了四个预训练的CNN架构（VGG16、ResNet50、EfficientNetB0和XceptionNet）在虚假图像检测中的性能，使用统一的预处理和训练流程。通过调整大小、归一化和增强来解决类别不平衡问题并提高泛化能力。模型评估使用了准确性、精确率、召回率、F1分数和ROC-AUC。VGG16在准确性上达到91%，XceptionNet、ResNet50和EfficientNetB0分别达到90%。EfficientNetB0对虚假图像的敏感性更强，但在真实图像上的可靠性较低，反映了由不平衡驱动的偏差。局限性包括数据集不平衡、过拟合和解释性有限，这些因素影响了跨域鲁棒性。本研究提供了一个可重复的基准，并强调了平衡数据集、高级增强和公平性意识训练的必要性，以开发可靠的虚假图像检测系统。

英文摘要

The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures including VGG16, ResNet50, EfficientNetB0, and XceptionNet for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.

2605.20965 2026-05-21 cs.CV cs.AI 版本更新

Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

在不遗忘的情况下寻找正确的视觉证据：通过层间视觉注意力差异减轻LVLMs中的幻觉

Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing, China（东南大学计算机科学与工程学院）； School of Artificial Intelligence, Shenzhen University, Shenzhen, China（深圳大学人工智能学院）； College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China（深圳大学计算机科学与软件工程学院）； Engineering, South China University of Technology, Guangzhou, China（华南理工大学工程学院）； National Engineering Laboratory for Big Data Systems Computing Technology, Shenzhen University, Shenzhen, China（深圳大学大数据系统计算技术国家工程实验室）； Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China（新一代人工智能技术及其交叉应用重点实验室（东南大学），中华人民共和国教育部）

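AI summary: This paper proposes a hallucination mitigation method based on Inter-Layer Visual Attention Discrepancy (ILVAD). It strengthens attention to visual evidence during generation to reduce visual forgetting, so that the model finds the correct visual evidence and retains it.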

Comments Accepted by ICML 2026

AI中文摘要

大型视觉-语言模型（LVLMs）在广泛的视觉-语言任务上表现出色。尽管有进展，它们仍然容易产生幻觉，生成与视觉内容不一致的响应。在本工作中，我们发现LVLMs在对正确的视觉证据关注不足时容易产生幻觉，并在生成过程中逐渐遗忘它。我们实证发现，尽管LVLMs整体对视觉证据关注不足，但在特定层中表现出对正确视觉证据的敏感性，存在显著的层间差异。受此观察启发，我们提出了一种新的幻觉缓解方法，通过层间视觉注意力差异（ILVAD）增强视觉证据。具体来说，我们从早期生成的token到视觉token在各层中获取注意力权重，并识别被反复激活作为视觉证据的token，形成显著性图。然后通过显著性图在生成过程中增强对视觉证据的注意力，以减少视觉遗忘。此外，我们利用显著性图获得生成文本对视觉证据的注意力分数，以选择并强调强烈基于视觉证据的文本token。我们的方法是无训练的，即插即用。在五个最近发布的模型上进行的多个基准评估表明，我们的方法可以在不同架构的LVLMs上一致地缓解幻觉。代码可在https://github.com/ytx-ML/ILVAD上获得。

英文摘要

Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.

2605.20936 2026-05-21 cs.LG cs.AI cs.CL 版本更新

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

DASH：在单个GPU上几分钟内完成的快速可微架构搜索用于混合注意力

Weizhe Chen, Miao Zhang, Junpeng Jiang, Yaping Li, Weili Guan, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））

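AI summary: This study proposes DASH, a fast differentiable architecture search framework for hybrid attention design. It relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and searches over the architecture only, with model and operator weights frozen. On Qwen2.5-3B-Instruct, DASH outperforms existing selector-style hybrid attention baselines and achieves stronger RULER performance than released Jet-Nemotron models while staying competitive on overlapping short-context and general benchmarks. Each search run uses 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, about 0.006% of the PostNAS search tokens reported by Jet-Nemotron.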

Comments 19 pages, 7 figures

AI中文摘要

混合注意力架构正变得越来越重要，用于在保持模型质量的同时提高LLM推理效率，使混合架构设计成为核心问题。现有的设计通常依赖于手动经验规则或基于代理的选择器信号来分配层间操作符。最近的NAS风格系统，如Jet-Nemotron，展示了自动混合架构搜索的潜力。然而，Jet-Nemotron的PostNAS搜索阶段单独使用200B tokens，使得此类搜索流程难以作为混合架构设计的常规方法。我们引入DASH，一种用于混合注意力架构设计的快速可微搜索框架，它将离散的层间注意力操作放置放松为连续的架构logits，准备可重用的教师对齐线性候选，并在模型和操作权重冻结的情况下进行架构仅搜索，以显著提高搜索效率。在Qwen2.5-3B-Instruct上，DASH一致优于现有的所有选择器风格的混合注意力设计基线，表明直接可微搜索可以发现更强的混合架构。此外，DASH在RULER性能上优于已发布的Jet-Nemotron模型，同时在重叠的短上下文和通用基准上保持竞争力。值得注意的是，每个DASH搜索运行仅使用12.3M tokens，并在单个RTX Pro 6000 GPU上仅需约20分钟，对应Jet-Nemotron报告的PostNAS搜索tokens的0.006%。这些结果表明，通过分钟级的可微搜索可以获得高质量的混合注意力架构，为混合架构设计提供了有前景的方向。

英文摘要

Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.
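The idea of relaxing a discrete per-layer operator choice into continuous architecture logits can be sketched in a few lines. The example below is a generic softmax relaxation in the spirit of differentiable architecture search, not DASH's actual implementation. The operator names, the plain-list tensors, and the argmax discretization are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_layer_output(layer_logits, candidate_outputs):
    """Blend the outputs of candidate operators with softmax weights.

    During search only `layer_logits` would be trained; the candidate
    operators themselves stay frozen (as the abstract describes).
    """
    weights = softmax(layer_logits)
    dim = len(candidate_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, candidate_outputs))
        for i in range(dim)
    ]


def discretize(all_layer_logits, op_names):
    """After search, keep the highest-logit operator in each layer."""
    chosen = []
    for logits in all_layer_logits:
        best = max(range(len(logits)), key=logits.__getitem__)
        chosen.append(op_names[best])
    return chosen
```

With equal logits, the mixed output is the plain average of the candidates. After search, a layer with logits `[2.0, -1.0]` over the hypothetical operators `["full_attention", "linear_attention"]` would keep full attention.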

2605.20924 2026-05-21 cs.CL cs.AI 版本更新

Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

Strategy-Induct: 任务级策略诱导用于指令生成

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University, Taiwan（国立台湾大学计算机科学与资讯工程学系）； Institute of Information Science, Academia Sinica, Taiwan（台湾“中央研究院”资讯科学研究所）； AI Research Center (AINTU), National Taiwan University, Taiwan（国立台湾大学人工智能研究中心）

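AI summary: This study proposes Strategy-Induct, a task-level strategy induction method that generates task instructions from only a small set of example questions, without labeled answers, and outperforms existing methods in question-only settings.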

Comments Accepted to Findings of ACL 2026

AI中文摘要

设计有效的任务级提示对于提高大型语言模型（LLMs）的性能至关重要。尽管先前的指令诱导工作表明LLMs可以通过有限的例子推断出更好的指令，但现有方法通常依赖于输入-输出对，而获取标注答案可能困难或成本高昂。为了解决这一限制，我们提出了Strategy-Induct框架，该框架仅从少量示例问题中推导出任务级指令，而无需标注答案。我们的方法首先提示模型为每个问题生成显式的推理策略，形成（策略，问题）对。这些对随后用于诱导一个任务指令，以引导推理。在多个任务和模型规模上的实验表明，Strategy-Induct在仅问题设置中优于最先进的方法。此外，我们观察到在任务指令生成和推理中联合使用LLMs和大型推理模型可能会进一步提高性能。

英文摘要

Designing effective task-level prompts is crucial for improving the performance of Large Language Models (LLMs). While prior work on instruction induction demonstrates that LLMs can infer better instructions with limited examples, existing approaches often rely on input-output pairs, where obtaining labeled answers can be difficult or costly. To address this limitation, we propose Strategy-Induct, a framework that derives task-level instructions solely from a small set of example questions without requiring labeled answers. Our approach first prompts the model to generate explicit reasoning strategies for each question, forming (strategy, question) pairs. These pairs are then used to induce a task instruction that guides reasoning. Experiments across multiple tasks and model scales demonstrate that Strategy-Induct outperforms state-of-the-art methods in question-only settings. Furthermore, we observe that jointly utilizing LLMs and Large Reasoning Models across task instruction generation and inference may lead to further performance improvements.

2605.20923 2026-05-21 cs.LO cs.AI cs.PL 版本更新

Causal Past Logic for Runtime Verification of Distributed LLM Agent Workflows

因果过去逻辑用于分布式大语言模型代理工作流的运行时验证

Benedikt Bollig

发表机构 * Université Paris-Saclay, CNRS, ENS Paris-Saclay, LMF, Gif-sur-Yvette, France（巴黎-萨克雷大学，国家科学研究中心，巴黎-萨克雷高等师范学院，LMF，法国吉夫昂蒂埃）

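AI summary: This paper proposes Causal Past Logic (CPL) for runtime verification of distributed LLM agent workflows. With CPL guards added to the ZipperGen framework, a lifeline evaluates conditions over causally visible events online and uses them to influence control flow, which makes runtime verification part of the coordination language itself.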

Comments 20 pages

AI中文摘要

分布式大语言模型代理工作流不应被当作产生单一顺序日志来监控。在异步执行中，一个决策只能依赖于对其生命线可见的因果事件：某些日志中出现较早的事件可能在本地仍未知。我们扩展ZipperGen代理工作流框架，引入因果过去逻辑（CPL），一种用于条件和while循环中守卫的简洁过去时间时序逻辑。除了标准的过去时间模态如previous和since外，守卫可以检查另一个生命线和所选变量中最新可见的事件。公式是一种源级守卫：它由拥有生命线在线评估，并可以在运行时影响控制流。我们给出了一个具有最新值视图的向量钟监控器，并证明本地计算的监控值与守卫在当前事件处的指称语义一致。因此，运行时验证成为协调语言本身的一部分，而不是执行日志上的事后检查。

英文摘要

Distributed LLM agent workflows should not be monitored as if they produced a single sequential log. In an asynchronous execution, a decision can only depend on events that are causally visible to the lifeline that makes it: an event that appears earlier in some log may still be unknown locally. We extend the ZipperGen agent-workflow framework with Causal Past Logic (CPL), a small past-time temporal logic for guards in conditionals and while loops. In addition to standard past-time modalities such as previous and since, a guard can inspect the latest causally visible event of another lifeline and selected variables stored there. The formula is a source-level guard: it is evaluated online by the owner lifeline and can influence control flow at runtime. We give a vector-clock monitor with latest-value views and prove that the locally computed monitor value coincides with the denotational semantics of the guard at the current event. Thus runtime verification becomes part of the coordination language itself, rather than a post-hoc check over an execution log.
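The notion of a guard that can only see causally visible events can be illustrated with a small vector-clock sketch. This is not the monitor from the paper: it omits the past-time modalities (previous, since) and CPL syntax entirely, and the class `Lifeline`, its methods, and the guard `approved_visible` are hypothetical names chosen for this example.

```python
class Lifeline:
    """A participant that tracks a vector clock and latest-value views."""

    def __init__(self, name, all_names):
        self.name = name
        self.clock = {n: 0 for n in all_names}
        # Latest causally visible variable snapshot of every lifeline.
        self.latest = {n: None for n in all_names}
        self.vars = {}

    def local_event(self, **updates):
        """Record a local event, optionally updating local variables."""
        self.vars.update(updates)
        self.clock[self.name] += 1
        self.latest[self.name] = dict(self.vars)

    def send(self):
        """Sending is an event; the message carries the clock and the views."""
        self.local_event()
        views = {n: (dict(v) if v is not None else None)
                 for n, v in self.latest.items()}
        return dict(self.clock), views

    def receive(self, message):
        """Merge the sender's knowledge, keeping the newer view per lifeline."""
        sender_clock, sender_views = message
        for n in self.clock:
            if n != self.name and sender_clock[n] > self.clock[n]:
                self.clock[n] = sender_clock[n]
                self.latest[n] = sender_views[n]
        self.local_event()


def approved_visible(owner):
    """Guard: has the reviewer's approval reached the owner's causal past?"""
    view = owner.latest["reviewer"]
    return view is not None and view.get("approved") is True
```

If a reviewer lifeline sets `approved=True` but no message has reached the planner yet, `approved_visible(planner)` is false, even though the approval appears earlier in a global log. It becomes true only after the planner receives a message that causally follows the approval.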

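2605.20922 2026-05-21 cs.LG cs.AI cs.CV (version update)

Winfree Oscillatory Neural Network

Jiawen Dai, Yue Song

Affiliations: Shanghai Jiao Tong University; Shanghai Qi Zhi Institute; College of AI, Tsinghua University

AI summary: This paper proposes WONN, an oscillatory neural network based on generalized Winfree dynamics. It evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible hierarchical interaction mechanisms, and achieves competitive, parameter-efficient results on image recognition and complex reasoning tasks.

Comments: Project page: https://jiawen-dai.github.io/WONN_Project_Page/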

Abstract

Oscillations and synchronization are widely believed to play a fundamental role in representation and computation. However, existing machine learning approaches based on synchronization dynamics have largely been confined to specialized settings such as object discovery, with limited evidence of scalability to standard vision benchmarks or logic reasoning tasks. We propose the Winfree Oscillatory Neural Network (WONN), a dynamical neural architecture based on generalized Winfree dynamics. WONN evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible and hierarchical interaction mechanisms instantiated as either fixed trigonometric mappings or learnable neural networks. We evaluate WONN on image recognition and complex reasoning tasks, including CIFAR, ImageNet, Maze-hard, and Sudoku. Across these domains, WONN achieves competitive or superior performance with strong parameter efficiency. In particular, WONN is, to our knowledge, the first synchronization-based oscillatory architecture to scale competitively to ImageNet-1K. Furthermore, on Maze-hard, WONN achieves 80.1% accuracy using only 1% of the parameters of prior state-of-the-art models. These results suggest that structured oscillatory dynamics provide a scalable and parameter-efficient alternative to conventional neural architectures.
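
For readers unfamiliar with Winfree dynamics, the sketch below integrates the classic Winfree model, in which each phase evolves as $\dot\theta_i = \omega_i + \frac{\kappa}{N}\sum_j P(\theta_j)\,R(\theta_i)$ with pulse $P(\theta) = 1 + \cos\theta$ and response $R(\theta) = -\sin\theta$. This is only the textbook model, not WONN itself: the abstract describes generalized dynamics with trigonometric or learnable interactions, and the function name `winfree_step` and the explicit Euler scheme are assumptions made here.

```python
import math


def winfree_step(theta, omega, kappa, dt):
    """One explicit Euler step of the classic Winfree model on the torus (S^1)^d."""
    n = len(theta)
    # Mean pulse emitted by the population, with P(theta) = 1 + cos(theta).
    mean_pulse = sum(1.0 + math.cos(t) for t in theta) / n
    new_theta = []
    for t, w in zip(theta, omega):
        # The phase response R(theta) = -sin(theta) sets how oscillator i reacts.
        dtheta = w + kappa * mean_pulse * (-math.sin(t))
        # Wrap the phase so the state stays on the circle.
        new_theta.append((t + dt * dtheta) % (2.0 * math.pi))
    return new_theta


resting = winfree_step([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], kappa=0.5, dt=0.01)
moving = winfree_step([0.1, 3.0], [1.0, 1.0], kappa=0.5, dt=0.01)
```

With zero natural frequencies and all phases at 0, the response term vanishes and the state does not move, which is a convenient sanity check for the update rule.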

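2605.20915 2026-05-21 cs.CL cs.AI cs.LG (version update)

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

Divyaksh Shukla, Ashutosh Modi

Affiliation: Indian Institute of Technology Kanpur (IIT Kanpur)

AI summary: This paper studies the gap between calibration and decision-rule reliability in generative language models, using a multiple-choice question-answering protocol on the TOFU benchmark. Fine-tuned models show low calibration error, and models after unlearning remain well calibrated while relying more on correlation-based features, which extends the reliability paradox to machine unlearning.

Comments: Accepted at SRW, ACL 2026; 17 pages (9 + 2 + 6)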

Abstract

Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE > 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.
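
Since the abstract reports ECE values, a minimal sketch of the standard Expected Calibration Error may help. It assumes equal-width confidence bins and is not taken from the paper's evaluation code; the function name and the choice of 10 bins are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

A low ECE only says that stated confidence matches observed accuracy on average. It says nothing about which input features produced the prediction, which is the gap the abstract examines with attribution methods.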

2605.20911 2026-05-21 cs.AI cs.LG (version update)

For How Long Should We Be Punching? Learning Action Duration in Fighting Games

Hoang Hai Nguyen, Kurt Driessens, Dennis J. N. J. Soemers

Affiliation: Department of Advanced Computing Sciences, Maastricht University

AI summary: This paper studies reinforcement learning agents in fighting games that learn how long to execute each action as well as which action to take, and examines how dynamically adjusted reaction timing affects performance and behavior patterns.

Comments: Accepted at Computers and Games 2026

Abstract

Fighting games such as Street Fighter II present unique challenges to reinforcement learning (RL) agents due to their fast-paced, real-time nature. In most RL frameworks, agents are hard-coded to make decisions at a fixed interval, typically every frame or every N frames. Although this design ensures timely responses, it restricts the agent's ability to adjust its reaction timing. Acting every frame grants frame-perfect reflexes, which are unrealistic compared to human players, whereas longer fixed intervals reduce computational cost but hinder responsiveness. We consider an alternative decision-making framework in which the agent learns not only what action to take but also for how long to execute it. By jointly predicting both action and duration, the agent can dynamically adapt its responsiveness to different situations in the game. We implement this method using the open-source FightLadder environment with agents trained against scripted built-in bots, systematically testing different frame skip configurations to analyze their influence on performance, responsiveness, and learned behavior. Experiments show that learned timing can match the performance of well-chosen fixed frame skips and encourages repeatable action patterns, but does not ensure robustness on its own. In most cases, we see agents performing best with consistently high frame skip values (i.e., low responsiveness). This strategy makes it easier to learn exploitative strategies where the same action is repeated over and over, which the scripted bots appear to be susceptible to.
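
The decision interface described above, where a policy returns both an action and a duration, can be sketched with a toy loop. `ToyEnv` is a hypothetical stand-in and does not reproduce the FightLadder API; the point is only that a fixed frame skip is the special case of a constant duration.

```python
class ToyEnv:
    """Hypothetical frame-based environment: reward 1 per frame when the action hits the target."""

    def __init__(self, target=1, horizon=60):
        self.target = target
        self.horizon = horizon
        self.frame = 0

    def step(self, action):
        self.frame += 1
        reward = 1.0 if action == self.target else 0.0
        return reward, self.frame >= self.horizon


def run_episode(env, policy):
    """Run one episode; the policy maps the current frame to (action, duration in frames)."""
    total, decisions, done = 0.0, 0, False
    while not done:
        action, duration = policy(env.frame)
        decisions += 1
        # Repeat the chosen action for `duration` frames (at least one).
        for _ in range(max(1, duration)):
            reward, done = env.step(action)
            total += reward
            if done:
                break
    return total, decisions


every_frame = run_episode(ToyEnv(), lambda frame: (1, 1))
skip_eight = run_episode(ToyEnv(), lambda frame: (1, 8))
```

Both policies collect the same return in this toy setting, but the second makes 8 decisions instead of 60, which mirrors the trade-off between responsiveness and decision frequency discussed in the abstract.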

2605.20901 2026-05-21 cs.CV cs.AI (version update)

VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie

Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University

AI summary: This paper presents VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation Challenge at EgoVis 2026. It combines object-centric spatial detection with short-horizon temporal context, injecting the temporal representation into the detection pathway through feature modulation and ROI-level context fusion.

Comments: The champion solution for the Ego4D Short-Term Object Interaction Anticipation Challenge at the CVPR EgoVis Workshop 2026

Abstract

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.

2605.20876 2026-05-21 cs.CL cs.AI (version update)

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

Zihao Cheng, Hongru Wang, Zeming Liu, Xinyi Wang, Xiangrong Zhu, Yuhang Guo, Wei Lin, Jeff Z. Pan, Yunhong Wang

Affiliations: School of Computer Science and Engineering, Beihang University, Beijing, China; Independent Researcher; Beijing Institute of Technology; University of Edinburgh

AI summary: This paper proposes Terminal-World, an automated pipeline that uses agent skills as the central synthesis primitive, jointly encoding what to accomplish, when to apply, and how to execute, so that task instructions, environments, and teacher trajectories are co-derived. From 5,723 constructed training environments the authors train Terminal-World-8B/14B/32B, which outperform terminal-agent baselines across six benchmarks; Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 using only 1.2% of the training data.

Comments: Work in Progress

Abstract

Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

2605.20874 2026-05-21 cs.AI cs.SE (version update)

Governance by Construction for Generalist Agents

Segev Shlomov, Iftach Shoham, Alon Oved, Ido Levy, Sami Marreed, Harold Ship, Offer Akrabi, Sergey Zeltyn, Avi Yaeli, Nir Mashkif

Affiliation: IBM

AI summary: This paper presents a modular policy-as-code layer that composes with a generalist LLM agent, without model fine-tuning, to deliver predictable, auditable, and compliance-aware behavior in compound workflows without rebuilding the agent for each domain.

Abstract

Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain. This demo presents CUGA's policy system, a modular policy-as-code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter). Together, these stages embed governance continuously across the agent's execution pipeline rather than treating it as an afterthought. Using a healthcare scenario and a multi-layered enforcement intervention, the demo shows dynamic playbook injection for structured tool-sequence enforcement, intent guards that block malicious or accidental harmful requests, and human-in-the-loop tool approval checkpoints for potentially destructive actions. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency.

2605.20872 2026-05-21 cs.LG cs.AI cs.GR (version update)

CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

SeungJeh Chung, Geonho Park, Misong Kim, HyeongYeop Kang

Affiliations: IIIXR Lab, Kyung Hee University; IIIXR Lab, Korea University

AI summary: This paper proposes CAdam, which recasts densification as a statistical signal verification problem to address the densification dilemma in generative distillation, substantially reducing the number of Gaussians while preserving visual quality.

Comments: Accepted to SIGGRAPH 2026 Conference Papers. 12 pages, 8 figures

DOI: 10.1145/3799902.3811215

Abstract

Adaptive densification is the engine of 3D Gaussian Splatting (3DGS). However, when transposed to the optimization-based Generative Distillation paradigm, this reconstruction-native mechanism reveals fundamental limitations, resulting in inefficient representations cluttered with redundant primitives. We diagnose this failure as a Densification Dilemma stemming from the stochastic nature of generative guidance: the standard magnitude-based accumulation indiscriminately aggregates transient noise alongside geometric signals, making it difficult to strike a balance between over-densification and under-fitting. To resolve this, we introduce Context-Adaptive Moment Estimation (CAdam), a novel framework that reinterprets densification as a statistically grounded signal verification problem. CAdam leverages the first moment of gradients to exploit the interference principle, where stochastic fluctuations cancel out via destructive interference while consistent geometric drifts accumulate via constructive interference, effectively disentangling the underlying signal from the generative noise floor. This is further augmented by a quantile-based context awareness and an intrinsic Signal-to-Noise Ratio (SNR) gating mechanism, which ensure robust adaptation across optimization stages and enable the soft termination of densification. Extensive experiments across diverse objectives (SDS, ISM, VFDS) and strong generative 3DGS backbones show that CAdam reduces Gaussian count by 85%-97% relative to standard densification while preserving overall comparable perceptual quality. These results highlight signal-aware density control as a practical way to improve memory efficiency in optimization-based generative distillation.

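2605.20868 2026-05-21 cs.LG cs.AI cs.SY eess.SY (version update)

Runtime-Certified Bounded-Error Quantized Attention

Dean Calver

Affiliation: Independent Researcher

AI summary: This paper presents a tiered KV cache architecture that stores INT8 keys and INT4 values in GPU memory while retaining FP16 originals in system RAM, enabling runtime-certified attention. An error decomposition yields per-head, per-step bounds that drive adaptive precision selection and a multi-stage fallback ladder, guaranteeing recovery of the exact dense attention output when required.

Comments: 32 pages, 1 figure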

Abstract

KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA 3.1-8B with contexts up to 128K, the system matches dense FP16 KV quality within noise for language modelling and retrieval tasks, while recovering catastrophic failures observed in naive INT8/INT4 baselines. Value-sensitive tasks at short context expose a controlled trade-off between compression and fidelity, which can be eliminated via tighter value tolerances or FP16-value fallback. The certification is local (per-head, per-step) and does not guarantee end-to-end model correctness, but ensures that each attention computation is either bounded relative to an FP16 reference or exactly recovered via fallback. This reframes KV cache quantization as a runtime-verified computation rather than a fixed approximation. The goal is not raw speedups, but enabling safe deployment of aggressive KV compression under strict quality constraints.
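
The "bound, then fall back" pattern can be illustrated on a single attention logit. The sketch below is a simplified illustration under stated assumptions, not the paper's two-term decomposition: it uses symmetric per-vector INT8 quantization, bounds only the error of one query-key dot product, and all function names are hypothetical.

```python
def quantize_int8(vec):
    """Symmetric per-vector INT8 quantization; returns (integer codes, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def logit_error_bound(query, scale):
    # Round-to-nearest moves each key coordinate by at most scale / 2, so
    # |q.k - q.k_hat| <= (scale / 2) * sum_i |q_i|.
    return 0.5 * scale * sum(abs(q) for q in query)


def certified_logit(query, key_fp, tolerance):
    """Use the quantized key if the bound meets the tolerance, else the full-precision key."""
    codes, scale = quantize_int8(key_fp)
    if logit_error_bound(query, scale) <= tolerance:
        key_hat = [c * scale for c in codes]
        return sum(q * k for q, k in zip(query, key_hat)), "quantized"
    return sum(q * k for q, k in zip(query, key_fp)), "fallback"


query, key = [1.0, -2.0], [0.5, 0.25]
exact = sum(q * k for q, k in zip(query, key))
bound = logit_error_bound(query, quantize_int8(key)[1])
loose = certified_logit(query, key, tolerance=0.01)
strict = certified_logit(query, key, tolerance=0.001)
```

The bound is computed from quantities available at run time (the query and the quantization scale), so the decision to fall back does not require touching the full-precision key unless the tolerance is violated.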

2605.20865 2026-05-21 cs.LG cs.AI (version update)

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

Deokgyu Yoon, Hyungkyu Kang, Joongkyu Lee, Byeongchan Kim, Gyungin Shin, Sungrae Park, Min-hwan Oh

Affiliations: Seoul National University; Upstage

AI summary: This paper proposes N-Step Forward-Trace Policy Optimization (NFPO), which augments the PPO surrogate objective with an N-step forward trace to obtain a tighter policy-improvement bound in reinforcement learning with verifiable rewards.

Abstract

Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the $N$-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next $N-1$ tokens. Building on this idea, we propose $N$-Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the $N$-step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective, offering a principled mechanism for controlling the bias-variance trade-off. Our theoretical analysis shows that, with an appropriate choice of $N$, the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance, supporting our theoretical findings.

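2605.20856 2026-05-21 cs.RO cs.AI cs.LG (version update)

DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

Hanxiang Ren, Pei Zhou, Xunzhe Zhou, Yanchao Yang

Affiliations: Zhejiang University; The University of Hong Kong; TranscEngram

AI summary: DISC decouples instruction from state-conditioned control via policy generation, closing the observation-leakage pathway created by task-state entanglement. It performs strongly on multiple benchmarks, confirming that language-generated policy parameters drive behavior.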

Abstract

Language-conditioned manipulation policies typically process instructions and observations through shared network parameters. This task-state entanglement provides a pathway for observation leakage -- networks learn scene-to-action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task-specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task-awareness must come from the language. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high-dimensional policy weights is itself a challenging problem. We address it with a two-stage hypernetwork whose refinement stage embeds the structure of gradient-based optimization as a feed-forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO-90 and Meta-World, with advantages that widen on complex, long-horizon tasks -- and surpasses the large-scale pretrained $\pi_0$ despite using no external pretraining data. On a real-world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language-generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few-shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: https://github.com/ReNginx/DISC.

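2605.20838 2026-05-21 cs.CV cs.AI (version update)

Interaction Locality in Hierarchical Recursive Reasoning

Yosuke Miyanishi, Tetsuro Morimura

Affiliation: CyberAgent Inc.

AI summary: This paper proposes interaction locality, a framework for measuring whether information flow stays within nearby cells or semantic segments or crosses them. Applied to hierarchical and recursive reasoning models such as HRM and TRM, it provides a reproducible measurement framework for local execution and global planning.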

Abstract

Spatial reasoning requires both location-bound computation and location-invariant structure: agents must make local moves while preserving route, object, or constraint-level plans. We propose interaction locality, a task-geometry-aware framework for measuring whether information flow stays within nearby cells or semantic segments, or crosses them. We instantiate the framework with sparse-autoencoder feature ablations and finite-noise activation patching, with structural Jacobian and attention checks reported in the appendix, and apply it to HRM and TRM, two compact hierarchical and recursive reasoning models, on Maze-Hard, Sudoku Extreme, and ARC-AGI. Across these models, activation patching gives the clearest architectural fingerprint: high-level recurrent states tend to write information within nearby cells or same-segment units, while repeated recursive updates accumulate these local writes into broader solution structure. This pattern holds across maze paths, Sudoku constraints, and ARC-AGI object neighborhoods, with the strongest concentration in TRM. To test whether interaction locality extends beyond toy-yet-challenging grid benchmarks, we also apply it to MTU3D, a large-scale embodied 3D scene-grounding model. In this MTU3D setting, causal spatial locality appears primarily at the transition where visual scene features are handed to the downstream grounding module, rather than uniformly throughout the visual encoder. This contrast suggests that the local-to-global handoff observed in HRM and TRM is tied to explicit recursive reasoning dynamics, while embodied 3D models may concentrate causal spatial structure at module boundaries. Interaction locality turns the intuitive local-execution/global-planning story into a reproducible measurement framework for recursive and embodied spatial reasoning.

2605.20758 2026-05-21 cs.AI cs.CV cs.LG cs.RO (version update)

Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh

Affiliations: Smart Systems Institute, National University of Singapore, Singapore; Faculty of Computing, Harbin Institute of Technology, Harbin, China; School of Computing, National University of Singapore, Singapore

AI summary: This paper proposes a conflict-aware additive guidance method for flow models under compositional rewards. It corrects off-manifold drift by dynamically detecting and resolving gradient conflicts, improving generation fidelity.

Comments: Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ($g^\text{car}$), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate $g^\text{car}$ across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that $g^\text{car}$ effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR-guidance.

2605.20756 2026-05-21 cs.LG cs.AI math.OC stat.ML (version update)

Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

Nikhil Nayak, Julia White, Urchade Zaratiana, Kelton Zhang, Henrijs Princis, Dhruv Atreja, Henry Fawcett, Matthew Thomas, George Hurn-Maloney, Ash Lewis

Affiliation: Fastino Labs

AI summary: This paper studies finite-sample bias in the stochastic update rules of preconditioned optimizers and proposes a single-batch bias-correction framework. Cross-fitted preconditioning and variance-corrected inversion reduce gradient-preconditioner coupling bias and inversion bias, improving optimizer performance.

Comments: 32 pages, 3 figures, 13 tables

Abstract

Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient--preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by $0.15$, $0.07$, and $0.11$ nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.
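
The inversion bias mentioned above follows from the delta method: for a sample mean $\bar X$ with mean $\mu$, $\mathbb{E}[1/\bar X] \approx 1/\mu + \mathrm{Var}(\bar X)/\mu^3$, so the plain inverse is biased upward for positive $\mu$. The sketch below subtracts a plug-in estimate of that leading term in the scalar case. It is a generic illustration under these assumptions, not the paper's estimator for AdamW, Sophia, or Shampoo, and the function names are hypothetical.

```python
from statistics import mean, variance


def naive_inverse(samples):
    """Plain plug-in estimate 1 / mean, biased because inversion is nonlinear."""
    return 1.0 / mean(samples)


def variance_corrected_inverse(samples):
    """Subtract the leading delta-method bias term Var(mean) / mean^3."""
    m = mean(samples)
    # Variance of the sample mean, estimated from the spread across samples
    # (for example, per-microbatch statistics).
    var_of_mean = variance(samples) / len(samples)
    return 1.0 / m - var_of_mean / m ** 3


microbatch_stats = [1.0, 2.0, 3.0]
plain = naive_inverse(microbatch_stats)
corrected = variance_corrected_inverse(microbatch_stats)
```

Here the correction shrinks the estimate from 0.5 to about 0.458. The correction vanishes as the number of samples grows, since the variance of the mean goes to zero.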

2605.20751 2026-05-21 cs.LG cs.AI cs.SY eess.SY 版本更新

PACD-Net: Pseudo-Augmented Contrastive Distillation for Glycemic Control Estimation from SMBG

PACD-Net：用于从SMBG估计血糖控制的伪增强对比蒸馏

Canyu Lei, David Repaske, Jianxin Xie

发表机构 * University of Virginia, School of Data Science, Charlottesville, VA 22903, USA（弗吉尼亚大学数据科学学院）； University of Virginia, Department of Pediatrics, Charlottesville, VA 22903, USA（弗吉尼亚大学儿科系）

AI总结本研究提出PACD-Net，一种自监督对比学习框架，用于从稀疏不规则采样的SMBG数据中估计血糖控制指标，通过伪SMBG样本指导学习并提高模型的准确性和稳定性。

详情

AI中文摘要

有效的糖尿病管理需要持续监测血糖水平。临床上，血糖控制通过目标范围内时间（TIR）、低于范围时间（TBR）和高于范围时间（TAR）等指标评估，这些指标通常由连续葡萄糖监测（CGM）得出。然而，由于CGM成本高且可及性有限，许多患者依赖自我血糖监测（SMBG）。与CGM不同，SMBG提供的测量稀疏且不规则，使得准确估计这些指标具有挑战性。传统监督学习方法在这种稀疏性下表现不佳，导致泛化能力差和性能不稳定。为此，我们提出PACD-Net，一种自监督对比知识蒸馏框架，用于从SMBG估计血糖控制。该方法使用时间覆盖更丰富的伪SMBG样本作为教师信号，指导模型从稀疏观测中学习。此外，多视图对比学习强制不同采样模式下的表征保持一致。模型采用混合Swin Transformer-CNN主干网络，以捕捉稀疏SMBG序列中的时间依赖性。实验结果表明，PACD-Net在基于真实世界SMBG数据估计TAR、TIR和TBR时始终优于现有方法，在极稀疏观测设置下提高了准确性，并增强了稳定性和泛化能力。所提出的框架为临床SMBG解读提供了实用工具，并为从稀疏且不规则采样的传感器数据中学习提供了可推广的方法。

英文摘要

Effective diabetes management requires continuous monitoring of glycemic levels. Clinically, glycemic control is assessed using metrics such as Time in Range (TIR), Time Below Range (TBR), and Time Above Range (TAR), typically derived from continuous glucose monitoring (CGM). However, many patients rely on self-monitoring of blood glucose (SMBG) due to the high cost and limited accessibility of CGM. Unlike CGM, SMBG provides sparse and irregular measurements, making accurate estimation of these metrics challenging. Conventional supervised learning approaches struggle under such sparsity, leading to poor generalization and unstable performance. To address this, we propose PACD-Net, a self-supervised contrastive knowledge distillation framework for estimating glycemic control from SMBG. Pseudo-SMBG samples with richer temporal coverage are used as teacher signals to guide learning from sparse observations. In addition, multi-view contrastive learning enforces representation consistency across diverse sampling patterns. The model adopts a hybrid Swin Transformer-CNN backbone to capture temporal dependencies in sparse SMBG sequences. Experimental results demonstrate that PACD-Net consistently outperforms existing methods in estimating TAR, TIR, and TBR from real-world SMBG data, achieving improved accuracy as well as enhanced stability and generalization under extremely sparse observation settings. The proposed framework provides a practical tool for clinical SMBG interpretation and offers a generalizable approach for learning from sparse and irregularly sampled sensor data in broader applications.

2605.20745 2026-05-21 cs.LG cs.AI cs.CL 版本更新

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

验证器严格性的隐含信号：通过选择性潜在引导控制和改进逐步验证

Yefan Zhou, Yilun Zhou, Austin Xu, Soroush Vosoughi, Shafiq Joty, Jiang Gui

发表机构 * Dartmouth College（达特茅斯学院）； Datadog AI Research（Datadog人工智能研究）； Salesforce AI Research（Salesforce人工智能研究）

AI总结本文研究了通过隐藏状态干预控制验证器严格性的方法，提出VerifySteer通过利用潜在正确性信号进行样本级路由并选择性干预段落边界，从而在ProcessBench和Hard2Verify数据集上优于基线方法，且在推理计算上更高效。

详情

AI中文摘要

生成式验证器已成为逐步验证的一种有前途的范式，但其验证行为往往校准不佳：它们可能过于宽松而漏掉错误步骤，或过于严格而拒绝正确推理。我们将这种过于宽松或过于严格的倾向称为验证器严格性。在本工作中，我们研究是否可以通过隐藏状态干预来控制验证器严格性。我们发现了一个验证特有的隐藏状态信号：在逐步验证中，验证器接受或拒绝某个解题步骤的倾向编码在对应验证段落的边界附近。利用这一信号，我们证明隐藏状态引导可以在无需微调的情况下直接调节验证器严格性。然而，统一引导会在错误检测与正确性认证之间造成权衡。为了解决这个问题，我们提出了VerifySteer，它利用潜在正确性信号进行样本级路由，并选择性地在段落边界进行干预。在ProcessBench和Hard2Verify上的实验表明，VerifySteer优于提示优化和激活引导基线，并且在推理计算量减少4-7倍的情况下与自一致性方法表现相当。VerifySteer还与验证微调互补，可在微调后的验证器之上带来进一步收益。代码可在https://github.com/YefanZhou/VerifySteer获取。

英文摘要

Generative verifiers have emerged as a promising paradigm for step-wise verification, but their verification behavior is often poorly calibrated: they may be under-critical and miss erroneous steps, or over-critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden-state intervention. We uncover a verification-specific hidden-state signal: in step-wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden-state steering can directly modulate verifier strictness without fine-tuning. However, uniform steering induces a trade-off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample-level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self-consistency while requiring 4-7x less inference compute. VerifySteer is also complementary to verification fine-tuning, providing further gains on top of fine-tuned verifiers. The code is available at https://github.com/YefanZhou/VerifySteer.

2605.20744 2026-05-21 cs.LG cs.AI 版本更新

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

可验证黑客行为的环境：迈向大规模评估奖励黑客

Amit Roth, Ankur Samanta, Matan Halevy, Yoav Levine, Yonathan Efroni

发表机构 * Tel Aviv University（特拉维夫大学）； Columbia University（哥伦比亚大学）； Taso Labs（Taso实验室）

AI总结本文提出了一种新的评估方法来衡量奖励黑客，通过在环境中嵌入可检测的奖励黑客机会，使评估更加可靠和自动化，通过TextArena测试床分析了不同语言模型在多样化环境中的奖励黑客行为。

Comments Project Page - https://majoroth.github.io/hack-verifiable-environments/

详情

AI中文摘要

使自主代理与人类意图对齐仍然是现代AI中的核心挑战。这一挑战的一个关键表现是奖励黑客，即代理在评估信号下看似成功，却违反了预期目标。奖励黑客已在多种设置中被观察到，但仍缺乏可靠的大规模测量方法。在本文中，我们引入了一种衡量奖励黑客的新评估范式。以往研究主要通过检查代理轨迹进行事后分析，而我们直接在环境中嵌入可检测的奖励黑客机会。这使得对这些机会的利用在设计上即可验证，从而能够确定性地、自动化地测量代理是否以及如何利用此类漏洞。我们在TextArena中实现了这一方法，并发布了Hack-Verifiable TextArena，一个可以可靠测量奖励黑客的测试床。使用此基准，我们分析了不同语言模型在多样化环境和设置中的奖励黑客行为。我们的代码开源于https://github.com/MajoRoth/hack-verifiable-environments/。

英文摘要

Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Reward hacking has been observed across a wide range of settings, yet methods for reliably measuring it at scale remain lacking. In this work, we introduce a new evaluation paradigm for measuring reward hacking. Whereas prior studies have primarily analyzed it post hoc by inspecting agent trajectories, we instead embed detectable reward hacking opportunities directly into environments. This makes their exploitation verifiable by design, enabling deterministic and automated measurement of whether and how agents exploit such vulnerabilities. We instantiate this approach in $\textit{TextArena}$ and release $\textit{Hack-Verifiable TextArena}$, a testbed in which reward hacking can be measured reliably. Using this benchmark, we analyze reward hacking behavior across language models in diverse environments and settings. We open source the code at https://github.com/MajoRoth/hack-verifiable-environments/.

2605.20742 2026-05-21 cs.AI 版本更新

VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

VBFDD-Agent 用于电动汽车电池故障检测与诊断：电池数字信号的描述性文本建模

Joey Chan, Zhen Chen, Ershun Pan

发表机构 * Department of Industrial Engineering and Management, School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China（工业工程与管理系，机械工程学院，上海交通大学，上海200240，中国）

AI总结本研究提出了一种基于描述性文本建模的电池信号报告方法，用于解决开放源代码电池故障报告数据集稀缺和缺乏统一维护知识表示的问题，通过构建语言语料库来改进电池健康诊断和维护，提出了VBFDD-Agent，整合了描述性电池状态文本、历史案例检索、本地维护手册和大语言模型推理，生成结构化的诊断结果和维护建议。

详情

AI中文摘要

随着电动汽车的迅速普及，锂离子电池的安全性和可靠性已成为关键问题。有效的异常检测对于确保电池安全运行至关重要。然而，随着电池系统和运行场景日益复杂，电池故障诊断和维护需要更强的跨领域适应性和人机协作能力。传统故障检测和诊断方法通常针对特定场景和预定义流程设计，使其在复杂现实应用中效果有限。为了解决开放源代码电池故障报告数据集稀缺和缺乏统一维护知识表示的问题，本研究提出了一种电池信号报告的描述性文本建模方法。监测信号、统计特征、异常记录和状态评估结果被转换为结构化且易于阅读的自然语言描述，形成用于电池健康诊断和维护的语言语料库。基于此语料库，我们提出了VBFDD-Agent，一种用于汽车级电池系统的电池故障检测和诊断代理。VBFDD-Agent整合了描述性电池状态文本、历史案例检索、本地维护手册和大语言模型推理，以生成结构化的诊断结果和维护建议。实验表明，所提出的框架能够基于描述性文本表示准确执行异常监控，并提供灵活、高效且可操作的维护建议。专家评估进一步确认了所生成建议的实用价值。总体而言，VBFDD-Agent将传统电池诊断从标签预测扩展到可解释和以维护为导向的决策支持。

英文摘要

With the rapid proliferation of electric vehicles, the safety and reliability of lithium-ion batteries have become critical concerns. Effective anomaly detection is essential for ensuring safe battery operation. However, as battery systems and operating scenarios become increasingly complex, battery fault diagnosis and maintenance require stronger cross-domain adaptability and human-AI collaboration. Traditional fault detection and diagnosis methods are usually designed for specific scenarios and predefined workflows, making them less effective in complex real-world applications. To address the scarcity of open-source battery fault report corpora and the lack of unified maintenance knowledge representation, this study proposes a descriptive text modeling approach for battery signal reports. Monitoring signals, statistical features, anomaly records, and state assessment results are transformed into structured and readable natural language descriptions, forming a language corpus for battery health diagnosis and maintenance. Based on this corpus, we propose VBFDD-Agent, a vehicle battery fault detection and diagnosis agent for automotive-grade battery systems. VBFDD-Agent integrates descriptive battery-state texts, historical case retrieval, local maintenance manuals, and large language model reasoning to generate structured diagnostic results and maintenance recommendations. Experiments show that the proposed framework can accurately perform anomaly monitoring based on descriptive textual representations and provide flexible, efficient, and actionable maintenance suggestions. Expert evaluation further confirms the practical value of the generated recommendations. Overall, VBFDD-Agent extends traditional battery diagnosis from label prediction to interpretable and maintenance-oriented decision support.

2605.20740 2026-05-21 cs.LG cs.AI cs.CL 版本更新

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

Distribution-Aware Reward: 用于LLM回归的预测分布强化学习

Jungsoo Park, Hyungjoo Chae, Ethan Mendes, Jay DeYoung, Varsha Kishore, Wei Xu, Alan Ritter

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Allen Institute for AI（人工智能研究院）

AI总结本文提出Distribution-Aware Reward，一种基于预测分布的强化学习方法，旨在提升语言模型在回归任务中的预测分布质量，而非仅优化单个解码输出。通过连续排名概率分数评估多个解码样本的分布，并基于每个rollout对分布质量的边际贡献分配信用，从而奖励既准确又适度分散的预测。实验表明，该方法在多个任务中优于监督微调和逐点（pointwise）强化学习基线，尤其在KBSS数据集上Spearman相关性提升6个点。

Comments 21 pages, 5 figures

详情

AI中文摘要

大型语言模型能够从文本、代码和分子字符串等异质输入预测实值量，但大多数训练目标对每个解码出的浮点数独立评分，只改进点估计而不能保证预测分布得到校准。这限制了需要候选排序或不确定性估计的应用。我们引入Distribution-Aware Reward，一种同策略（on-policy）强化学习目标，其主要贡献是训练语言模型为回归任务生成更好的预测分布，而非仅针对标量目标优化单个解码输出。我们的方法将多个解码样本视为经验预测分布，使用连续排名概率分数（CRPS）对其进行评估，并根据每个rollout对分布质量的边际贡献分配留一法（leave-one-out）信用，从而奖励既准确又适度分散的预测。我们在受控的高斯混合任务、代码性能预测和基于SMILES字符串的分子属性预测上评估了该方法。在所有任务中，我们的方法均优于监督微调和逐点（pointwise）强化学习基线，排名相关性提升显著，包括在KBSS上Spearman相关性提升6个点。在MoleculeNet上，它仅使用SMILES字符串，却仍能与强大的基于图的模型和3D分子模型竞争。进一步分析表明，我们的方法缓解了rollout多样性崩溃并改进了不确定性诊断，这表明直接优化预测分布可使语言模型回归更稳健、校准更好。

英文摘要

Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.
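
The scoring step described in the abstract above can be sketched with the standard empirical form of the Continuous Ranked Probability Score. This is an illustrative sketch only: the function names are hypothetical, and the leave-one-out credit shown here (score without a rollout minus score with it) is one plausible reading of "marginal contribution", not necessarily the paper's exact reward.

```python
def empirical_crps(samples, target):
    """CRPS of an empirical predictive distribution (lower is better).

    Uses the ensemble identity CRPS = E|X - y| - 0.5 * E|X - X'|, where X and
    X' are drawn independently from the decoded samples.
    """
    n = len(samples)
    accuracy = sum(abs(x - target) for x in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return accuracy - spread


def leave_one_out_credit(samples, target):
    """Per-rollout credit: how much the CRPS worsens when that rollout is removed.

    Positive credit means the rollout improved the predictive distribution.
    Requires at least two samples so that every leave-one-out set is non-empty.
    """
    full = empirical_crps(samples, target)
    credits = []
    for i in range(len(samples)):
        rest = samples[:i] + samples[i + 1:]
        credits.append(empirical_crps(rest, target) - full)
    return credits
```

With samples `[2.0, 5.0]` and target `2.0`, the accurate rollout receives positive credit and the distant one negative credit, which matches the stated intent of rewarding predictions that are accurate and appropriately dispersed.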

2605.20734 2026-05-21 cs.CR cs.AI 版本更新

An Application-Layer Multi-Modal Covert-Channel Reference Monitor for LLM Agent Egress

面向LLM代理出站流量的应用层多模态隐通道参考监控器

Alfredo Metere

发表机构 * Enclawed, LLC（Enclawed公司）

AI总结本文提出了一种应用层多模态隐通道参考监控器，用于检测和防止LLM代理在消息中泄露数据，通过多阶段文本管道、媒体加密器和残余容量测量来实现对隐通道的监控和管理。

详情

AI中文摘要

发送消息的大型语言模型（LLM）代理可能在消息中泄露数据。目标允许列表和内容扫描器无法判断一个看似无害的负载本身是否是隐通道：被攻陷的代理可以在零宽字符、同形字符、空白字符、base64、JavaScript对象表示法（JSON）键顺序、消息时间或大小中编码比特；而在二进制出站流量中，则可以在最低有效位（LSB）像素平面、单幅图像平均亮度、图像间序列排列、超声波音调或可听频段的声音化数据中编码。我们的出站参考监控器有三项贡献。(i) 一个包含十个容量削减阶段的文本管道、一个按接收端（sink）记账的漏桶容量账本，以及一种分阶段的部署姿态，从第一天起就强制执行无损阶段。(ii) 两个媒体扰码器（一个傅里叶域音频带限器和一个红绿蓝（RGB）图像位深度与平均亮度分桶器），由启动时的加密合法性证明把关：审计员在启动时发布受信任的Ed25519密钥和{kind, data-class}配对；只有带有授权类别有效签名的负载才被豁免。该证明绕开了在真实媒体与被声音化或栅格化为载体的数据之间进行基于内容判别这一难以处理的问题；未签名的媒体默认被视为可疑；一个内容寻址的规范化器关闭了图像间排列通道。(iii) 残余容量定义为嵌入比特与恢复比特之间经Miller-Madow校正的互信息（通道被摧毁时为零），由覆盖文本、图像和音频的十五个可工作编码器组成的对抗性集合进行测量。参考实现将每个可摧毁通道的残余容量降至零，并将唯一一个不毁坏图像就无法摧毁的通道（单幅图像平均亮度）的残余容量降至一个明确给出的上界。

英文摘要

A large language model (LLM) agent that sends messages can leak data inside them. Destination allowlists and content scanners do not police whether an otherwise-benign payload is itself a covert channel: a compromised agent encodes bits in zero-width characters, homoglyphs, whitespace, base64, JavaScript Object Notation (JSON) key ordering, message timing or size -- and, in binary egress, in least-significant-bit (LSB) pixel planes, per-image mean luminance, inter-image sequence permutation, ultrasonic tones, or audible-band sonified data. Our egress reference monitor has three contributions. (i) A text pipeline of ten capacity-reducing stages, a per-sink leaky-bucket capacity ledger, and a staged posture that enforces lossless stages from day one. (ii) Two media scramblers (a Fourier-domain audio band-limiter and a red-green-blue (RGB) image bit-depth and mean-luminance bucketer) gated by a boot-time cryptographic legitimacy attestation: an auditor publishes at boot the trusted Ed25519 keys and {kind, data-class} pairs; only payloads with a verifying signature for an authorized class are exempt. The attestation sidesteps the intractable content-based discrimination between real media and data sonified or rasterized as a carrier; unsigned media is suspect by default; a content-addressed canonicalizer closes the inter-image permutation channel. (iii) Residual capacity is the Miller--Madow corrected mutual information between embedded and recovered bits (zero when destroyed), measured by an adversarial ensemble of fifteen working encoders across text, image and audio. The reference implementation drives residual capacity to zero on every destroyable channel and to a stated bound on the one (per-image mean luminance) that cannot be destroyed without ruining the image.
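
The residual-capacity metric in contribution (iii) of the abstract above relies on the Miller-Madow bias correction, which adds (K - 1) / (2N) nats to a plug-in entropy estimate, where K is the number of observed symbols and N the sample count. The sketch below applies that correction to each entropy term of the mutual information; the function names are hypothetical and the paper's measurement harness is not reproduced here.

```python
import math
from collections import Counter


def mm_entropy_bits(symbols):
    """Plug-in Shannon entropy in bits plus the Miller-Madow correction."""
    n = len(symbols)
    counts = Counter(symbols)
    plugin = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # (K - 1) / (2N) is in nats; dividing by ln 2 converts it to bits.
    return plugin + (len(counts) - 1) / (2 * n * math.log(2))


def mm_mutual_information_bits(embedded, recovered):
    """Miller-Madow corrected I(X; Y) = H(X) + H(Y) - H(X, Y), in bits."""
    joint = list(zip(embedded, recovered))
    return (
        mm_entropy_bits(embedded)
        + mm_entropy_bits(recovered)
        - mm_entropy_bits(joint)
    )
```

When the recovered bits are constant, as after a stage that fully destroys the channel, the estimate is zero. On finite samples from a noisy channel the corrected estimate can come out slightly negative, so a caller may want to clip it at zero.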

2605.20730 2026-05-21 cs.CL cs.AI 版本更新

Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

以分布对齐作为上下文学习中任务向量的设计准则

Jihoon Kwon, Jiwon Choi, Jy-yong Sohn

发表机构 * Seoul National University（首尔国立大学）； Yonsei University（延世大学）

AI总结本文提出通过分布对齐来设计任务向量，引入了NTP距离作为衡量指标，并开发了线性任务向量方法以提升性能和效率。

Comments 9 pages, preprint

详情

AI中文摘要

上下文学习（ICL）使大型语言模型（LLM）能够通过演示适应新任务，但随着上下文长度增加，推理成本也不断上升。任务向量通过将演示压缩为紧凑的隐藏状态表示提供了一种有前途的替代方案，但其质量此前只能通过下游任务准确率来评估。这种间接标准对于如何设计更有效的任务向量提取方法只能提供有限的启示。本文认为，使用任务向量的推理应使其预测分布与ICL的预测分布对齐。为量化这一点，我们引入了$d_{\text{NTP}}$，一个衡量基于任务向量的推理与基于ICL的推理之间下一个标记概率差异的指标。我们的实证分析表明，$d_{\text{NTP}}$可作为性能代理，与下游准确率呈强负相关。受此启发，我们开发了线性任务向量（LTV）方法，它通过一个闭式线性映射来最小化$d_{\text{NTP}}$，该映射通过回归估计演示的效果。在八个分类基准和五个LLM上，LTV始终优于现有任务向量基线，平均准确率提高了9.2%，同时降低了推理延迟。我们进一步证明LTV在回归任务上也优于基线。此外，我们研究了LTV在不同模型规模间的可迁移性，这在任务向量研究中仍处于起步阶段。具体而言，我们通过实验表明，来自较大模型的任务向量可以将较小模型的性能提高6.4%，这表明提取出的任务表示具有新的用途。

英文摘要

In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2\% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales; an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4\%, suggesting a new utility for extracted task representations.

2605.20722 2026-05-21 cs.LG cs.AI 版本更新

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

AGPO: 基于双统计反馈的自适应群体策略优化

Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）

AI总结本文提出AGPO，一种无 critic 的 GRPO 改进方法，通过群体层面的统计信息控制更新幅度和探索。在九个英语和中文数学/STEM 基准上，Qwen2.5-14B 在相同生成 token 预算下优于 PPO/GRPO，达到 GSM8K 67.3% 和 MATH 40.5%。

详情

AI中文摘要

强化学习提升大语言模型推理能力，但 PPO/GRPO 通常使用固定剪切和解码温度，使训练脆弱且调参困难。我们提出自适应群体策略优化（AGPO），一种无 critic 的 GRPO 改进方法，利用群体层面统计信息控制更新幅度和探索。AGPO 使用共享的探针衍生统计状态驱动两个控制器：（i）自适应剪切，根据奖励分散度和偏度、探针投票熵、策略熵和逐步 KL 偏移设置信任区域大小；（ii）双向自适应温度采样，根据与运行基线相对的中心不确定性加热或冷却解码。在九个英语和中文数学/STEM 基准上，使用 AGPO 训练的 Qwen2.5-14B 在相同生成 token 预算下优于 PPO/GRPO，达到 GSM8K 67.3% 和 MATH 40.5%。收益转移到 Llama-3-8B 和 Gemma-2-9B，消融实验确认两个模块互补。我们的实现可在 https://github.com/wandugu/paper_agpo 公开获取。

英文摘要

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.

2605.20713 2026-05-21 cs.CV cs.AI cs.LG 版本更新

SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

SAVER：选择性所需视觉证据用于多模态信息提取

Miaobo Hu, Shuhao Hu, Bokun Wang, Rui Chen, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； University of Chinese Academy of Sciences（中国科学院大学）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）

AI总结该研究提出SAVER框架，通过选择性视觉证据提升多模态命名实体识别和关系抽取的性能，减少计算开销并提高准确性。

详情

AI中文摘要

社交媒体中的多模态信息提取具有挑战性，因为帖子可能附带多张弱相关、冗余甚至具有误导性的图像。在这种情况下，始终开启的多模态融合会浪费计算资源并放大虚假的视觉线索。核心挑战在于决定是否需要为每个候选跨度或已标记的实体对查看视觉信息，以及在需要时，哪一小部分图像能够提供可信的证据。我们提出SAVER，一种按需选择性使用视觉证据的框架，用于多模态命名实体识别（MNER）和多模态关系抽取。SAVER使用共形接地性门（CGG）估计MNER中跨度级的视觉接地性，并由两个已标记的实体推导出实体对级的激活；激活阈值通过共形风格的程序和Clopper-Pearson上界进行校准。当门被激活时，一个子模（submodular）相关性-多样性选择器会挑选一个跨图像的紧凑证据子集，然后由集合Transformer进行聚合。一个受能量模型启发的联合评分头将文本、可选的视觉证据、文本-图像一致性以及稀疏路由结合起来，用于实体类型或关系分类。实验表明，相较于强大的纯文本基线和始终开启的多模态基线，SAVER始终能提高F1，同时降低AURC、提高固定风险水平下的激活覆盖，并降低FLOPs和P90延迟。

动态TMoE：一种针对非平稳时间序列预测的漂移感知动态专家混合框架

Jiawen Zhu, Shuhan Liu, Di Weng, Yingcai Wu

发表机构 * School of Software Technology, Zhejiang University, Ningbo, China ； State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China

AI总结本文提出Dynamic TMoE框架，通过动态构建异构专家和剪枝冗余专家来优化容量，并利用时间记忆路由器确保稳定且上下文感知的专家选择，从而在非平稳时间序列预测中实现更优性能。

Comments 27 pages, 7 figures. Accepted to ICML 2026

详情

AI中文摘要

非平稳时间序列预测面临由演变分布偏移带来的挑战，静态模型难以捕捉这些变化。虽然混合专家（MoE）架构提供了解耦复杂漂移模式的有前景范式，但现有方法受限于固定专家池和无记忆路由，阻碍了其适应突发制度转变的能力。为此，我们提出Dynamic TMoE框架，将架构进化与时间连续性统一在学习阶段。通过最大均值偏差（MMD）检测分布偏移，动态实例化异构专家并剪枝冗余专家以优化容量。此外，时间记忆路由器利用循环状态和异常库确保稳定、上下文感知的专家选择，无需测试时更新。在九个基准测试中的实验表明，该方法实现了最先进的性能，将MSE减少10.4%，MAE减少7.8%。代码可在https://github.com/andone-07/Dynamic-TMoE获取。

英文摘要

Non-stationary time series forecasting is challenged by evolving distribution shifts that static models struggle to capture. While Mixture-of-Experts (MoE) architectures offer a promising paradigm for decoupling complex drift patterns, existing approaches are limited by fixed expert pools and memoryless routing, hampering their ability to adapt to abrupt regime shifts. To address this, we propose Dynamic TMoE, a framework that unifies architectural evolution with temporal continuity during learning phase. By detecting distribution shifts via Maximum Mean Discrepancy (MMD), we dynamically instantiate heterogeneous experts and prune redundant ones to optimize capacity. Additionally, a temporal memory router leverages recurrent states and an anomaly repository to ensure stable, context-aware expert selection without requiring test-time updates. Experiments on nine benchmarks demonstrate state-of-the-art performance, reducing MSE by 10.4% and MAE by 7.8%. Code is available at https://github.com/andone-07/Dynamic-TMoE.
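
The shift-detection signal named in the abstract above, Maximum Mean Discrepancy, can be sketched as follows. This is a minimal illustration assuming scalar observations, an RBF kernel, and a fixed threshold; the function names, the bandwidth, and the threshold value are hypothetical choices rather than the paper's settings.

```python
import math


def rbf(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""

    def mean_kernel(us, vs):
        total = sum(rbf(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def shift_detected(reference, recent, threshold=0.1, bandwidth=1.0):
    """Flag a distribution shift when the squared MMD exceeds a threshold."""
    return mmd_squared(reference, recent, bandwidth) > threshold
```

In a drift-aware mixture of experts, a flag like this could trigger instantiating a new expert for the recent window, while a persistently small value would leave the expert pool unchanged. In practice the threshold is usually calibrated, for example with a permutation test, rather than fixed.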

2605.20668 2026-05-21 cs.CL cs.AI cs.LG 版本更新

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

论人工智能审稿人的局限与机遇：与45位专家科学家共同评审Nature系列论文的审稿意见

Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R. Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther H. R. Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig

发表机构 * Nature（自然）

AI总结本文通过大规模专家标注研究，探讨了AI审稿人在科学同行评审中的能力与局限，发现AI审稿在准确性、显著性和证据充分性方面表现优异，但存在领域知识有限、上下文管理不足等弱点，表明AI审稿是人类审稿的补充而非替代。

Comments Work in progress

详情

AI中文摘要

随着AI能力的提升，AI审稿人开始被用于科学同行评审，但其能力和可信度仍受质疑：许多科学家只是将其视为概率系统，认为它们不具备评估研究的专业能力；另一些研究者则对其成熟度更为乐观，却缺乏具体证据。理解AI审稿人擅长什么、哪里不足以及还有哪些挑战至关重要。然而，现有对AI审稿人的评估主要关注其结论是否与人类一致（例如评分对齐、录用预测），这不足以刻画其能力和局限。在本文中，我们通过一项大规模专家标注研究填补了这一空白：45位物理、生物和健康科学领域的科学家花费469小时，从正确性、重要性和证据充分性三个方面，对2960条单独的批评意见（每条针对论文的一个具体方面）进行评分，这些意见来自针对82篇Nature系列论文的人类撰写和AI生成的审稿意见。在三个维度的综合得分上，由GPT-5.2驱动的审稿代理高于每篇论文评分最高的人类审稿人（60.0% vs. 48.2%，p = 0.009），而全部三个AI审稿人（包括Gemini 3.0 Pro和Claude Opus 4.5）在每个维度上都超过了评分最低的人类审稿人。AI审稿人的准确批评也更常被评为重要且证据充分，并且提出了26%没有任何人类审稿人提及的问题。然而，AI审稿人之间的重叠远高于人类审稿人之间的重叠（跨审稿人配对为21% vs. 3%），并且表现出16种人类审稿人不具有的反复出现的弱点，例如子领域知识有限、缺乏对多个文件的长上下文管理能力，以及对次要问题过于苛刻。总体而言，我们的结果表明当前的AI审稿人是人类审稿人的补充，而非替代。

英文摘要

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

URL PDF HTML ☆

赞 0 踩 0

2605.20649 2026-05-21 eess.SP cs.AI cs.LG 版本更新

AMAR: Lightweight Attention-Based Multi-User Activity Recognition from Wi-Fi CSI

AMAR：基于Wi-Fi CSI的轻量级注意力多用户活动识别

Amirhossein Mohammadi, Hina Tabassum

发表机构 * Department of Electrical Engineering and Computer Science（电气工程与计算机科学系）

AI总结本文提出了一种基于注意力机制的轻量级多用户活动识别框架AMAR，通过将活动识别转化为集合预测问题，利用Transformer架构和边缘-云混合架构，实现了在多用户环境下对并发活动的高精度识别，同时显著减少带宽使用和占用估计误差。

Comments 25 pages, 6 figures, 3 tables

详情

AI中文摘要

基于Wi-Fi的人体活动识别（HAR）利用无线收发器采集的信道状态信息（CSI），已成为一种有前景的无接触感知方法。现有研究主要集中在单用户场景，但实际部署通常涉及多用户环境，其中多个用户的同时运动会产生相互重叠的CSI模式，给传统分类方法带来挑战。为解决这一局限，本文提出了一种基于注意力机制的多用户活动识别（AMAR）框架，将HAR表述为集合预测问题。AMAR中基于Transformer的架构利用可学习的查询嵌入充当专用的活动检测器，使系统能够从复合CSI表示中同时识别多种活动。此外，为应对部署约束，AMAR采用边缘-云拆分架构：边缘设备上的轻量级卷积网络执行初始特征提取，随后通过残差向量量化在保留活动判别信息的同时大幅降低带宽。云端组件通过基于注意力的集合匹配完成最终的活动预测，使系统能够处理不同的占用人数。在教室、会议室和空房间环境中，平均而言，AMAR将完美预测所有并发活动的比例相较于最佳基线提高了近一倍。此外，它的F1分数达到53.4%，而最佳基准为45.6%，并将占用估计误差降低了74%，同时大幅减少了带宽使用。

英文摘要

Wi-Fi-based human activity recognition (HAR) has emerged as a promising approach for contactless sensing, leveraging channel state information (CSI) collected from wireless transceivers. While existing studies have primarily concentrated on single-user scenarios, real-world deployments often involve multi-user settings where concurrent users' movements induce overlapping CSI patterns that challenge conventional classification methods. To address this limitation, this paper introduces an attention-based multi-user activity recognition (AMAR) framework that formulates HAR as a set prediction problem. The transformer-based architecture in AMAR leverages learnable query embeddings acting as specialized activity detectors, enabling the simultaneous identification of multiple activities from composite CSI representations. Moreover, to address deployment constraints, AMAR is designed in an edge-cloud split architecture form where lightweight convolutional networks on edge devices perform initial feature extraction, followed by residual vector quantization that achieves substantial bandwidth reduction while preserving activity-discriminative information. The cloud component performs final activity prediction through attention-based set matching, enabling the system to handle varying occupancy levels. Across classroom, meeting-room, and empty-room environments, on average AMAR nearly doubles the rate of perfectly predicting all concurrent activities compared to the best baseline. Moreover, it achieves an $F_1$-score of 53.4% compared to 45.6% for the best benchmark, and reduces occupancy estimation error by 74%, while minimizing bandwidth substantially.

2605.20648 2026-05-21 cs.RO cs.AI 版本更新

Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition

联合学习谓词和动作使零样本技能组合成为可能

Benedict Quartey, Sebastian Castro, Eric Rosen, Wil Thomason, George Konidaris, Stefanie Tellex

发表机构 * Brown University（布朗大学）； Robotics & AI Institute（机器人与人工智能研究所）

AI总结本文提出了一种联合学习谓词和动作的技能方法，通过闭环视觉运动策略，使机器人能够在不重新训练的情况下实现零样本技能组合。

详情

AI中文摘要

示范学习（LfD）使机器人能够从专家示例中学习复杂行为，但现有方法往往无法在不重新训练的情况下泛化到已知技能的新组合。现代生成式策略仅对动作轨迹的分布建模，因此无法对稳健组合所需的符号结果进行推理。我们提出，技能应联合建模动作轨迹及其所引发的符号结果。为弥补这一差距，我们引入了谓词动作技能（PACTS），这是一类闭环视觉运动策略，它将技能建模为动作轨迹和谓词信念轨迹上的联合生成过程，在单一模型中产生连贯的动作-结果推演（rollout）。联合生成动作和谓词使PACTS能够学习到同时改进动作生成和谓词分类的内部表示。此外，我们将PACTS的在线谓词预测用作排序和监控执行的符号接口，通过规划展示了已学习技能的零样本组合。项目网站：https://planpacts.github.io/

英文摘要

Learning from Demonstration (LfD) enables robots to learn complex behaviors from expert examples, yet existing approaches often fail to generalize to new compositions of known skills without retraining. Modern generative policies model distributions over action trajectories alone, thus are unable to reason about the symbolic outcomes required for robust composition. We propose that skills should jointly model action trajectories and the symbolic outcomes they induce. To address this gap, we introduce Predicate Action Skills (PACTS), a class of closed-loop visuomotor policies that model skills as a joint generative process over action and predicate belief trajectories, producing coherent action-outcome rollouts within a single model. Jointly generating actions and predicates enables PACTS to learn internal representations that improve both action generation and predicate classification. Furthermore, we demonstrate zero-shot composition of learned skills via planning by leveraging online predicate predictions from PACTS as a symbolic interface for sequencing and monitoring execution. Project website: https://planpacts.github.io/

2605.20644 2026-05-21 cs.LG cs.AI cs.RO 版本更新

帕累托优化的肖像生成：用于对齐、真实性和美学的视觉对齐文本监督

Yunlong Wang, Jinjin Shi, Wenbin Gao, Xuran Xu, Runyu Shi, Ying Huang

AI总结本文提出了一种多模态扩散变换器（MM-DiT）的特征监督方法，通过引入轻量级的跨模态对齐机制，隐式提取多粒度的视觉对齐文本表示，以提升文本-图像对齐、真实性和美学质量，从而在Pareto前沿上实现协同改进。

详情

AI中文摘要

文本到图像扩散模型在生成人类肖像时往往面临严重的三重困境：文本-图像对齐、逼真度和人类感知的美学之间相互抑制。监督微调（SFT）是一种有效提升图像生成逼真度的方法，但通常会导致过度拟合训练数据集、破坏预训练图像先验并降低对齐或美学质量。为突破这一瓶颈，我们提出了一种多模态扩散变换器（MM-DiT）的特征监督范式。具体而言，我们引入了一种轻量级的跨模态对齐机制，隐式地从SigLIP 2中提取多粒度的视觉对齐文本表示，并在训练阶段将监督应用于MM-DiT的图像分支，且无额外的推理开销。我们的方法在保持基模型原有泛化能力的同时，注入了视觉对齐的文本指导，避免了SFT导致的退化。此外，我们的方法直接从预训练的视觉基础模型中挖掘隐含的多粒度美学信号，以优化人类感知的美学。在MM-DiT上的广泛实验表明，我们的方法推动了Pareto前沿，并在文本-图像对齐、逼真度和人类感知的美学方面实现了协同改进。

英文摘要

Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the base model's original generalization, avoiding degradation caused by SFT. Furthermore, our method directly mines implicit multi-granularity aesthetic signals from pre-trained vision foundation models to optimize human-perceived aesthetics. Extensive experiments on MM-DiTs show that our method pushes the Pareto frontier and achieves synergistic improvements across text-image alignment, photorealism, and human-perceived aesthetics.


2605.20630 2026-05-21 cs.AI 版本更新

Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines

评估代理计划-执行管道中的时间语义缓存和工作流优化

Alimurtaza Mustafa Merchant, Krish Veera, Sajal Kumar Goyla, Shambhawi Bhure, Dhaval Patel, Kaoutar El Maghraoui

发表机构 * Columbia University（哥伦比亚大学）； IBM ； IBM Research（IBM研究院）

AI总结本文研究了在代理计划-执行管道中时间语义缓存和工作流优化的问题，提出两种互补的优化层以提高效率，并展示了其在工业资产操作工作流中的应用效果。

Comments 13 pages, 8 figures, 3 appendices

详情

AI中文摘要

工业资产操作工作流对延迟敏感，因为单个用户查询可能需要协调传感器数据、工作订单、故障模式、预测工具和领域特定代理。我们在此问题上评估了AssetOpsBench (AOB)，这是一个工业代理基准，其计划-执行管道暴露了工具发现、LLM规划、MCP工具执行和最终总结的重复开销。现有的LLM缓存技术如KV缓存重用和基于嵌入的语义缓存是为聊天机器人服务设计的，并在输出有效性依赖于时间、资产或传感器参数时失效。我们为AOB计划-执行管道提出了两个互补的优化层：一个时间语义缓存和一组结合磁盘支持的工具发现缓存和依赖感知并行步骤执行的MCP工作流优化。MCP工作流优化对应于1.67倍的速度提升，将中位端到端延迟减少了约40.0%，而时间缓存基准在缓存命中时实现了中位数30.6倍的速度提升。除了速度提升外，我们的结果揭示了纯语义缓存在参数丰富的工业查询中的具体失败模式，提供了对MCP支持的代理基准中缓存选择如何与评估正确性相互作用的批判性分析。

英文摘要

Industrial asset operations workflows are latency-sensitive because a single user query may require coordination over sensor data, work orders, failure modes, forecasting tools, and domain-specific agents. We evaluate this problem on AssetOpsBench (AOB), an industrial agent benchmark whose plan-execute pipeline exposes repeated overhead from tool discovery, LLM planning, MCP tool execution, and final summarization. Existing LLM caching techniques such as KV-cache reuse and embedding-based semantic caching were designed for chatbot serving and break down when output validity depends on time, asset, or sensor parameters. We propose two complementary optimization layers for AOB plan-execute pipelines: a temporal semantic cache and a set of MCP workflow optimizations combining disk-backed tool-discovery caching and dependency-aware parallel step execution. The MCP workflow optimizations corresponded to a 1.67x speedup and reduced median end-to-end latency by about 40.0%, while the temporal-cache benchmark achieved a median 30.6x speedup on cache hits. Beyond the speedup, our results expose a concrete failure mode of pure semantic caching for parameter-rich industrial queries, providing a critical analysis of how caching choices interact with evaluation correctness in MCP-backed agent benchmarks.
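To illustrate why a purely semantic cache can return wrong answers for parameter-rich queries, here is a minimal Python sketch of a cache whose key combines the query intent with the asset, sensor, and time window, plus a time-to-live. It is not the authors' implementation: the class name `TemporalSemanticCache`, the key layout, and the use of a normalized intent string as a stand-in for embedding similarity are all assumptions made for illustration.

```python
import time


class TemporalSemanticCache:
    """Toy cache: a hit needs a matching intent AND matching parameters."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl_seconds = ttl_seconds
        self._store = {}

    @staticmethod
    def _key(intent, asset_id, sensor, window):
        # Assumption: a normalized intent label stands in for an
        # embedding-similarity match between two user queries.
        return (intent.strip().lower(), asset_id, sensor, tuple(window))

    def put(self, intent, asset_id, sensor, window, value, now=None):
        now = time.time() if now is None else now
        self._store[self._key(intent, asset_id, sensor, window)] = (value, now)

    def get(self, intent, asset_id, sensor, window, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(self._key(intent, asset_id, sensor, window))
        if entry is None:
            return None  # different asset, sensor, or time window
        value, stored_at = entry
        if now - stored_at > self.ttl_seconds:
            return None  # entry is too old to trust
        return value
```

With this layout, two queries that read almost identically but name different assets or time windows map to different keys, so neither can be served the other's cached answer.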


2605.20626 2026-05-21 cs.CL cs.AI cs.CV 版本更新

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

基于检索的长上下文翻译用于文化图像描述：佛罗里达大学Gators参加2026年美洲自然语言处理共享任务的提交

Aashish Dhawan, Christopher Driggers-Ellis, Dzmitry Kasinets, Daisy Zhe Wang, Christan Grant

发表机构 * University of Florida（佛罗里达大学）

AI总结本文提出了一种基于检索的长上下文翻译方法，用于文化图像描述，通过两阶段流程生成西班牙语中间描述，再利用检索增强的多示例提示生成目标语言描述，显著提升了Bribri、Guaraní和Orizaba Nahuatl语言的描述生成性能，并在共享任务中获得冠军。

2605.20624 2026-05-21 cs.CV cs.AI cs.LG 版本更新

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

用自回归扩散模型加速视频逆问题求解器

Taesung Kwon, Jonghyun Park, Hyungjin Chung, Jong Chul Ye

发表机构 * KAIST（韩国科学技术院）； EverEx

AI总结本文提出自回归视频逆问题求解器（AVIS），通过自回归扩散模型实现流式视频恢复，显著降低初始延迟并提高吞吐量，同时保持高质量的恢复效果，并进一步提出加速变体AVIS Flash，实现更高的吞吐量和更优的效率-性能权衡，为实时部署铺平道路。

Comments Project page is available here: https://avis-project.github.io/

详情

AI中文摘要

从自动化到自主：分层代理原生网络架构（HANA）

Binghan Wu, Shoufeng Wang, Yunxin Liu, Ya-Qin Zhang, Joseph Sifakis, Ye Ouyang

发表机构 * AsiaInfo Technologies Limited（亚洲信息科技有限公司）； Institute for AI Industry Research (AIR)（人工智能产业研究院）； Tsinghua University（清华大学）； Verimag

AI总结本文提出了一种分层多代理参考架构，旨在实现Level 4/5自主网络，通过引入代理自意识，统一战略规划与操作韧性，验证了其在5G核心环境中的有效性。

Comments This manuscript has been accepted by IEEE Networking Letters

Journal ref B. Wu, S. Wang, Y. Liu, Y. -Q. Zhang, J. Sifakis and Y. Ouyang, "From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)," in IEEE Networking Letters, 2026

详情

DOI: 10.1109/LNET.2026.3693226

AI中文摘要

实现Level 4/5自主网络（AN）需要从静态自动化转向代理原生智能。当前的运维依赖僵化的脚本，缺乏应对非正常工况的认知能动性。为此，本文提出了一种可实现高度自治的分层多代理参考架构。该框架以双驱动编排器（Dual-Driven Orchestrator）为核心，协调各专门的执行代理（Executive Agents），并由共享的公共内存（Public Memory）提供统一的领域知识。关键创新在于引入代理自我意识，使系统能够将审慎式的战略治理与反射式的故障恢复协调起来。我们在5G核心网环境中实例化并验证了该架构。案例研究表明，该系统在拥塞条件下仍能维持关键吞吐量，并将平均修复时间（MTTR）减少了86%，证实了其在统一战略规划与运行韧性方面的有效性。

英文摘要

Realizing Level 4/5 Autonomous Networks (AN) demands a shift from static automation to agent-native intelligence. Current operations, reliant on rigid scripts, lack the cognitive agency to handle off-nominal conditions. To address this, this letter proposes a hierarchical multi-agent reference architecture enabling high-level autonomy. The framework features a Dual-Driven Orchestrator that coordinates specialized Executive Agents, supported by a shared Public Memory for unified domain knowledge. A key innovation is the integration of agent self-awareness, which empowers the system to harmonize deliberative strategic governance with reflexive fault recovery. We instantiate and validate this architecture within a 5G Core environment. Case studies demonstrate that the system sustains critical throughput under congestion and reduces Mean Time to Repair (MTTR) by 86%, confirming its efficacy in unifying strategic planning with operational resilience.


2605.20602 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

自我训练不使语言扁平化——它重构了它：表面标记增强而深层语法消失

Ming Liu

发表机构 * Amazon（亚马逊）

AI总结该研究通过实验发现自我训练过程并非使语言扁平化，而是重构了语言结构，表面标记增强而深层语法结构消失，并提出了结构性深度假说来解释这一现象。

Comments 19 pages (14 main + 5 appendix), 8 figures, 3 tables

详情

AI中文摘要

对语言模型自身输出进行连续自我训练通常被描述为一种扁平化过程：多样性下降，分布变窄，文本变得“更像自己”。我们提供的证据表明这种描述并不完整。在对五个模型（GPT-2 124M、Pythia-410M、Pythia-1.4B、OPT-1.3B、Pythia-2.8B）进行的十一代自我训练中，语言并非被均匀扁平化，而是被重构了。表面标记（篇章连接词、模糊限制语、破折号）上升，而中层和深层句法结构（疑问句、插入语、被动语态、虚拟语气）崩溃。我们将这种不对称崩溃形式化为结构性深度假说（SDH）：语言特征的每代衰减率主要由其结构性深度（即它所需的嵌套句法依赖的数量）预测，其次才由其第零代输出频率预测。汇总来自三个架构家族、五个模型的17特征面板（N=85），合并的斯皮尔曼相关系数为rho=0.540（p < 10^{-6}；聚类自助法95% CI [0.434, 0.634]），而频率是一个明显更弱的预测因子（rho=0.225）。匹配的人类文本微调对照实验得到rho=0.039（p=0.88），证实该梯度是自我训练所特有的。我们进一步记录了一个表面复杂性悖论：在底层从句结构消失的同时，总体复杂性代理指标（依存树深度、TTR、词长）全部上升，这对训练数据筛选和LLM文本检测有直接影响。

英文摘要

Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p < 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.
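The pooled statistic above is a Spearman rank correlation between each feature's structural depth and its per-generation decay rate. The following Python sketch shows how such a correlation can be computed with average ranks for ties. The `depth` and `decay` values are hypothetical placeholders, not data from the paper, and the cluster-bootstrap confidence interval is not reproduced here.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical feature panel: depth levels and decay rates are made up.
depth = [0, 0, 1, 2, 2, 3]
decay = [0.01, 0.03, 0.02, 0.08, 0.05, 0.12]
rho = spearman(depth, decay)
```

Because depth takes only a few discrete levels, ties are common, which is why the sketch ranks with averages rather than using the shortcut formula that assumes no ties.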


2605.20577 2026-05-21 cs.AI cs.LG 版本更新

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Mahjax: 一种用于在JAX中进行强化学习的GPU加速麻将模拟器

Soichiro Nishimori, Shinri Okano, Keigo Habara, Sotetsu Koyamada, Eason Yu, Masashi Sugiyama

发表机构 * The University of Tokyo（东京大学）； RIKEN AIP（日本理化学研究院AIP）； Nara Institute of Science and Technology（奈良先端科学技术大学院大学）； Kobe University（神户大学）； Kyoto University（京都大学）； ATR ； The University of Sydney（悉尼大学）

AI总结本文提出Mahjax，一种基于JAX实现的麻将环境，利用GPU加速大规模并行化，以解决麻将游戏中的高维状态空间和随机性问题，为强化学习提供高效的训练平台。

详情

AI中文摘要

立直麻将（Riichi Mahjong）是一种多人、不完全信息的游戏，具有随机性和高维状态空间的特点。这些属性构成了一组独特的挑战，与强化学习中复杂的现实世界决策问题相对应。尽管先前研究严重依赖于从人类对局日志中进行监督学习来预训练策略，但能够从零开始（tabula rasa）学习的算法在通用性上具有更大潜力，AlphaZero系列即为例证。为促进此类研究，我们引入了Mahjax，一个用JAX实现的完全向量化立直麻将环境，可在图形处理器（GPU）上实现大规模的rollout并行化。我们还提供了一个高质量的可视化工具，以简化调试以及与已训练代理的交互。实验结果表明，在八块NVIDIA A100 GPU上，Mahjax在无赤牌（no-red）规则和赤牌（red）规则下分别实现了高达每秒200万步和100万步的吞吐量。此外，我们展示了代理可以通过有效训练提高其相对于基线策略的排名，从而验证了该环境在强化学习中的实用性。

英文摘要

Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning tabula rasa (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.


2605.20563 2026-05-21 cs.MA cs.AI cs.CL cs.LG cs.SE 版本更新

Multi-agent Collaboration with State Management

具有状态管理的多智能体协作

Mengyang Liu, Taozhi Chen, Zhenhua Xu, Xue Jiang, Yihong Dong

发表机构 * Shanghai Jiaotong University（上海交通大学）； Cortices AI ； Emory University（埃默里大学）； Peking University（北京大学）

AI总结本文提出STORM，一种面向多智能体协作的状态管理方法，通过在共享工作区中调解智能体的交互，确保每个智能体在一致的代码库视图上操作，并在写入时检测和解决冲突。STORM在多个LLM上优于基于git-worktree的多智能体基线，且在成本效率上具有竞争力，表明显式状态管理比工作区隔离更有效。

详情

AI中文摘要

潜在过程生成器匹配

Lukas Billera, Hedwig Nora Nordlinder, Ben Murrell

发表机构 * Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet（微生物学、肿瘤和细胞生物学系，Karolinska研究院）

AI总结本文提出了一种潜在过程生成器匹配框架，该框架将观测到的生成状态视为易处理的马尔可夫过程的确定性映像，从而将生成器匹配理论从静态潜变量扩展到随时间变化的潜在条件过程。

Comments 18 pages, 1 figure

详情

AI中文摘要

许多近期的流匹配和扩散式生成模型在训练过程中依赖辅助的随机动力学：通过模拟一个更丰富的过程来定义条件目标，但辅助状态要么在生成时难以采样，要么根本不属于期望的输出。现有的生成器匹配（Generator Matching）理论形式化了对静态潜在随机变量的条件化，而近期几篇论文针对特定的增广状态构造证明了投影结果的若干特例。我们引入潜在过程生成器匹配，这是一个通用框架，将观测到的生成状态视为易处理的马尔可夫过程 $Y_t$ 的确定性映像 $X_t=Φ(Y_t)$。我们证明，在这一设定下，可以在像空间上学习一个随机过程的生成器，该过程与投影过程具有相同的单时刻边缘分布。这推广并涵盖了文献中的离散潜在过程结果，并将生成器匹配从静态潜变量扩展到一大类随时间变化的潜在条件过程。

英文摘要

Many recent flow-matching and diffusion-style generative models rely on auxiliary stochastic dynamics during training: a richer process is simulated to define conditional targets, but the auxiliary state is either intractable to sample at generation time or simply not part of the desired output. Existing Generator Matching theory formalises conditioning on static latent random variables, and several recent papers prove special cases of projection results for particular augmented-state constructions. We introduce latent process generator matching, a general framework that treats the observed generative state as a deterministic image $X_t=Φ(Y_t)$ of a tractable Markov process $Y_t$. We show that in this setting one may learn the generator of a stochastic process on the image space which has the same one-time marginal distributions as the projected process. This generalizes and subsumes the discrete latent process results from the literature, and extends Generator Matching from static latent variables to a rich family of time-dependent latent conditional processes.


2605.20052 2026-05-21 cs.CL cs.AI 版本更新

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

PromptRad: 基于知识的多标签提示微调用于低资源放射报告标注

Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng, Chien-Hung Liao, Hung-Yu Kao

发表机构 * Department of Artificial Intelligence and AI Research Center, Chang Gung University（人工智能系及AI研究中心，长庚大学）； Department of Radiology, Sijhih Cathay General Hospital（放射科，汐止国泰综合医院）； Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital（医学影像与介入科，长庚纪念医院）； Department of Trauma and Emergency Surgery, Chang Gung Memorial Hospital（创伤与急诊外科，长庚纪念医院）； Department of Computer Science, National Tsing Hua University（计算机科学系，国立清华大学）

AI总结本文提出PromptRad，一种基于知识的多标签提示微调方法，用于在低资源环境下进行放射报告标注，通过引入UMLS元词典中的同义词增强类别表示，以更少的标注数据实现优于传统方法的性能。

Comments BioNLP 2026 @ ACL (camera-ready version)

详情

AI中文摘要

自动报告标注有助于从非结构化文本中识别临床发现，并为医学影像研究提供大规模注释。现有的基于规则的标注器难以处理临床报告中的多样化描述，而微调预训练语言模型（PLMs）需要大量标注数据，这些数据在临床环境中通常不可用。在本文中，我们提出PromptRad，一种基于知识的多标签提示微调方法，用于在低资源环境下进行放射报告标注。PromptRad将多标签分类重新表述为掩码语言建模，并将UMLS元词典中的同义词纳入多词提示器以丰富类别表示。通过微调PLM而不增加额外分类层，PromptRad所需的标注数据比传统微调要少得多。在肝CT报告上的实验表明，PromptRad在仅使用32个标注训练示例的情况下，优于基于词典和微调的基线方法，并且在使用远小模型的情况下，性能与GPT-4具有竞争力。进一步分析显示，PromptRad比现有方法更有效地捕捉复杂的否定模式，使其成为数据稀缺临床场景中报告标注的有希望的解决方案。我们的代码可在https://github.com/ila-lab/PromptRad上获得。

英文摘要

Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label prompt-tuning approach for radiology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at https://github.com/ila-lab/PromptRad.


2605.19624 2026-05-21 cs.CV cs.AI 版本更新

Component-Aware Structure-Preserving Style Transfer for Satellite Visual Sim2Real Data Construction

面向组件的结构保持风格迁移用于卫星视觉Sim2Real数据构建

Zongwu Xie, Yonglong Zhang, Yifan Yang, Yang Liu, Baoshi Cao

发表机构 * State Key Laboratory of Robotics and Systems, Harbin Institute of Technology（机器人系统国家重点实验室，哈尔滨工业大学）

AI总结本文提出了一种面向组件的结构保持风格迁移框架，用于卫星视觉的合成到真实数据构建，通过提取真实图像的部件级风格代码并注入到合成图像中，从而提高标注保持的卫星视觉Sim2Real数据生成效果。

详情

AI中文摘要

对于基于相机的卫星视觉感知，Sim2Real数据构建需要图像接近真实域传感器外观同时保留来自模拟的注释。具有可靠姿态标签和组件级遮罩的卫星目标的真实传感器图像难以大规模获取，而合成渲染提供精确的几何注释但存在明显的外观差距。本文提出了一种面向组件的结构保持风格迁移框架用于卫星视觉的合成到真实数据构建。该方法通过校准的真实获取、基于ArUco的相机姿态测量、CAD渲染和组件遮罩构建弱配对的真实-合成样本。然后从未标记的真实图像中提取部件级真实域风格代码，并通过遮罩对齐调节将其注入到对应的合成卫星区域中。为了保持生成图像对下游传感器数据监督的可用性，对抗训练与局部对比一致性、自正则化和边缘保持约束相结合。实验在5000张渲染的卫星图像和100张在校准设置下拍摄的真实图像上进行。真实图像提供目标域外观参考和最终评估图像，而下游的GDRNet姿态估计器仅在合成或翻译的合成图像上进行训练。与代表性图像翻译基线相比，所提方法实现了最小的图像分布差异，FID为54.32，KID为0.048。当翻译数据用于在目标域适应设置下训练GDRNet时，ADD通过率提高到0.260，AUC提高到0.611。这些结果表明，组件级外观迁移可以提高标注保持的卫星视觉Sim2Real数据生成效果。

英文摘要

For camera-based satellite visual sensing, Sim2Real data construction requires images that approach real-domain sensor appearance while retaining the annotations inherited from simulation. Real sensor images of satellite targets with reliable pose labels and component-level masks are difficult to acquire at scale, whereas synthetic rendering provides exact geometric annotations but suffers from a visible appearance gap. This paper presents a component-aware structure-preserving style transfer framework for satellite visual synthetic-to-real data construction. The method builds weakly paired real--synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream sensor-data supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve annotation-preserving satellite visual Sim2Real data generation in the considered calibrated setup.


2605.19503 2026-05-21 cs.RO cs.AI cs.LG 版本更新

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

ARC-RL: 一种受ARC Raiders启发的强化学习游乐场

Carlo Romeo, Andrew D. Bagdanov

发表机构 * Media Integration and Communication Center – University of Florence（媒体整合与通信中心——佛罗伦萨大学）

AI总结本文提出ARC-RL，一个包含四种MuJoCo连续控制环境的强化学习游乐场，这些环境的机器人形态灵感来自ARC Raiders的生物目录，通过统一的观察模板、动作约定和奖励函数，研究不同形态和动画风格约束下的强化学习算法性能。

详情

AI中文摘要

腿式运动的强化学习已经发展成一套由多组件奖励函数和物理引擎基准构成的体系，其中的机器人形态无一例外地源自真实的商用硬件。然而，游戏NPC受到sim-to-real机器人学中不存在的风格约束，并且经常以没有真实机器人对应物的生物形态出现。我们介绍了ARC-RL，一个包含四种MuJoCo连续控制环境的套件，其机器人形态受ARC Raiders的生物图鉴启发：18自由度的高大六足Queen、12自由度的装甲六足Bastion、18自由度的紧凑六足Tick以及12自由度的四足Leaper。这四个机器人共享统一的观测模板、动作约定、仿真节奏，以及一个单一的闭式多组件奖励函数，各形态之间的差异仅体现在一小组权重和参数中。该奖励融合了帐篷形（tent）速度跟踪项、健康存活奖励、相位锁定的步态合规奖励/代价对、动作正则项、三个安全惩罚和一个姿态锚；奖励在任何环节都不引入动作捕捉数据。我们还为每种形态提供手工设计的中枢模式发生器（Central Pattern Generator）演示器，它们既作为固定的专家参考，也作为离线到在线训练的先验数据来源。在此游乐场中，我们进行了一项受控的实证研究，比较标准在线算法（SAC、SPEQ、SOPE-EO）和使用先验数据增强的方法（SACfD、SPEQ-O2O、SOPE），并刻画每种范式如何应对该游乐场的形态多样性和动画式风格约束。源代码可在https://github.com/CarloRomeo427/ARC_RL.git获取。

英文摘要

Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.
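As a rough illustration of a closed-form multi-component reward whose per-morphology variation lives only in weights, the Python sketch below combines a tent-shaped velocity-tracking term, a survive bonus, and a quadratic action regulariser. The abstract does not give the functional forms or weights, so the tent definition, the weight names, and the numbers used here are assumptions; the gait, safety, and posture terms are omitted.

```python
def tent(value, target, half_width):
    """Triangular bump: 1 at the target, falling linearly to 0 at +/- half_width."""
    return max(0.0, 1.0 - abs(value - target) / half_width)


def reward(velocity, target_velocity, healthy, action, weights):
    """Weighted sum of three illustrative reward components."""
    tracking = weights["track"] * tent(velocity, target_velocity, weights["tent_width"])
    survive = weights["survive"] if healthy else 0.0
    action_cost = weights["action"] * sum(a * a for a in action)
    return tracking + survive - action_cost


# Hypothetical per-morphology weights: same formula, different numbers.
QUADRUPED_WEIGHTS = {"track": 1.0, "tent_width": 0.5, "survive": 0.1, "action": 0.01}
HEXAPOD_WEIGHTS = {"track": 1.0, "tent_width": 0.3, "survive": 0.2, "action": 0.005}

r = reward(1.0, 1.0, True, [1.0, -1.0], QUADRUPED_WEIGHTS)
```

Keeping a single formula and swapping only the weight dictionary is what allows results to be compared across morphologies without per-robot reward engineering.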


2605.19376 2026-05-21 cs.AI 版本更新

Generative Recursive Reasoning

生成性递归推理

Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn

发表机构 * KAIST（韩国科学技术院）； Mila – Québec AI Institute（魁北克人工智能研究所）； New York University（纽约大学）； Université de Montréal（蒙特利尔大学）

AI总结本文提出Gram框架，通过将递归潜在推理转化为概率多轨迹计算，解决了传统递归推理模型的确定性问题，实现了条件推理和无条件生成。

详情

AI中文摘要

未来的神经推理系统应如何实现扩展计算？递归推理模型（RRMs）通过使用共享转移函数进行迭代式潜在状态细化，为自回归序列扩展提供了一种有前途的替代方案。然而，现有RRMs大多是确定性的，沿单一潜在轨迹收敛到单一预测。我们引入生成式递归推理模型（GRAM），一个将递归潜在推理转化为概率性多轨迹计算的框架。GRAM将推理建模为随机的潜在轨迹，从而支持多个假设、替代求解策略，以及通过递归深度和并行轨迹采样实现的推理时扩展。这产生了一个潜变量生成模型，既支持通过 $p_θ(y \mid x)$ 进行条件推理，也可在输入固定或缺失时通过 $p_θ(x)$ 进行无条件生成。GRAM采用摊销变分推断训练，在结构化推理和多解约束满足任务上优于确定性的循环和递归基线，同时展示了无条件生成能力。

英文摘要

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_θ(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_θ(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. https://ahn-ml.github.io/gram-website


2605.19138 2026-05-21 cs.RO cs.AI cs.LG 版本更新

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

COBALT: 通过基于云的智能手机远程操作众包机器人学习

Ayush Agarwal, Ansh Gandhi, Jeremy A. Collins, Omar Rayyan, Aryan Sarswat, Ranjani Koushik, Masoud Moghani, Ajay Mandlekar, Animesh Garg

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； University of California, Berkeley（加州大学伯克利分校）； New York University Abu Dhabi (NYUAD)（纽约大学阿布扎比分校）； University of Toronto（多伦多大学）； NVIDIA（英伟达）

AI总结本文提出COBALT平台，通过基于云的远程操作技术，利用智能手机等设备大规模收集高质量的机器人学习数据，提高仿真实验和现实世界中的机器人学习效率。

详情

AI中文摘要

大规模、高质量演示数据的稀缺仍然是扩展机器人操作模仿学习的瓶颈。我们提出了COBALT，一个旨在让机器人学习在仿真和现实世界中都能大规模普及的远程操作平台。通过利用向量化环境，我们可扩展、负载均衡的基础设施支持多个用户在单个GPU上同时进行远程操作，从而显著降低远程操作成本。操作员几乎可以从地球上任何地方使用常见设备连接，包括单部或两部智能手机、VR头显、3D鼠标和键盘。内存数据缓存和高效的视频流使控制与渲染保持同步，可支持数十个并发用户以20 Hz运行，并在每GPU最多8个并发用户时保持低于100毫秒的端到端延迟。我们还展示了在8个GPU上支持256个模拟客户端的稳定运行，凸显了系统跨硬件以及在单个服务器内的扩展能力。我们进行了全面的用户研究，结果显示基于手机的远程操作性能与专用硬件相当或更优，能够更快、更符合人体工学地收集数据。为确保数据质量，COBALT记录一组实时指标以自动过滤次优演示。我们进一步证明，结构化的用户培训课程能显著提高数据收集质量。基于用户研究的洞察，我们通过众包收集了一个大规模、高质量的试点数据集，包含7500多个演示（50多个小时），这些演示在五天内通过智能手机在九个国家收集。我们通过训练最先进的模仿学习算法验证了数据集的质量。更多详情请访问https://cobalt-teleop.github.io/。

英文摘要

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.


2605.18833 2026-05-21 cs.LG cs.AI 版本更新

Automated Big Data Quality Assessment using Knowledge Graph Embeddings

利用知识图谱嵌入进行自动化大数据质量评估

Hadi Fadlallah, Rima Kilany, Mitri Haber, Ali Jaber

发表机构 * Saint-Joseph University（圣约瑟夫大学）； Lebanese University（黎巴嫩大学）

AI总结本文提出了一种基于知识图谱嵌入的自动化大数据质量评估方法，通过整合多样化的知识图谱表示，利用上下文信息生成针对每个情境的全面数据质量评估计划。

Comments 17 pages, 10 figures

Journal ref International Journal of Data Mining, Modelling and Management 17.4 (2025) 383-405

详情

DOI: 10.1504/IJDMMM.2025.150987

AI中文摘要

自动化数据质量评估对于管理大数据至关重要，但现有解决方案在实现准确的上下文感知评估方面面临挑战。本文提出了一种基于知识的新方法，利用知识图谱嵌入来预测输入数据集的上下文表示与知识图谱中相关质量规则和维度之间的缺失边。我们通过整合知识图谱中的多样化表示，从深入的文献研究中获取洞察，从而开发出针对每个情境的全面且上下文特定的数据质量评估计划。利用知识图谱提高了我们对输入数据集上下文的理解，克服了传统方法仅依赖严格匹配并忽视上下文特征的局限性。通过注入数值边属性，我们为每个预测的质量测量分配相应的权重，为输入数据集提供全面的数据质量评估计划。为了评估我们的方法，我们利用AccentureLabs开发和基准测试的AmpliGraph框架。评估涉及使用由黎巴嫩原子能委员会（LAEC-CNRS）提供的现实世界辐射传感器数据集。从该评估中获得的结果证明了我们的解决方案能够为给定的输入数据集生成全面的数据质量评估计划。

英文摘要

Automated data quality assessment is crucial for managing big data, but existing solutions face challenges in achieving accurate context-aware assessment. This paper presents a novel knowledge-based approach to enhance automated data quality assessment. Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. This integration allows us to develop a comprehensive and context-specific data quality assessment plan tailored to each context. Leveraging the knowledge graph improves our understanding of the input dataset's context, overcoming the limitations of traditional methods that rely solely on strict matching and overlook contextual characteristics. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset. To evaluate our approach, we leverage AmpliGraph, a framework developed and benchmarked by AccentureLabs. The evaluation involves employing a real-world radiation sensors dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS). The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset.


2605.18743 2026-05-21 cs.AI 版本更新

WorldString: Actionable World Representation

WorldString: 可操作的世界表征

Kunqi Xu, Jitao Li, Jianglong Ye, Tianshu Tang, Isabella Liu, Sifei Liu, Xueyan Zou

发表机构 * CalTech（加州理工学院）； UC San Diego（加州大学圣迭戈分校）； Tsinghua University - IEI Lab（清华大学-IEI实验室）； NVIDIA（英伟达）

AI总结本文提出WorldString，一种能够直接从点云或RGB-D视频流中学习现实物体状态流形的神经架构，为构建物理世界模型提供基础构建块。

详情

AI中文摘要

受大语言模型中涌现行为的启发，研究社区正在世界模型中探索类似的涌现能力，尤其侧重于对物理世界的建模。在物理世界建模中，物体是构成物理现实的基本原语。从人类到计算机，我们与之交互的几乎一切都是物体。这些物体很少是静态的；它们是可操作的实体，其状态随内在属性而变化。尽管当前方法通过视频生成或动态场景重建来处理物体的动作状态，但没有一种方法以统一、有原则的方式显式建模这一基本元素，从而构建可操作的物体表征。我们提出了WorldString，一种神经架构，能够直接从点云或RGB-D视频流中学习，对现实物体的状态流形进行建模。作为通用的数字孪生，它充当物理世界模型的基础构建块；因此，我们将其命名为WorldString。值得一提的是，其完全可微的结构使其未来能够与策略学习和神经动力学无缝整合。

英文摘要

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world modeling, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Notably, its fully differentiable structure seamlessly enables future integration with policy learning and neural dynamics.


2605.18678 2026-05-21 cs.CV cs.AI 版本更新

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Lance：通过多任务协同实现统一多模态建模

Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang

发表机构 * Intelligent Creation Lab, ByteDance（字节跳动智能创作实验室）

AI总结本文提出Lance，一种轻量级的原生统一模型，支持图像和视频的多模态理解、生成和编辑。该模型通过协同多任务训练的实用范式实现统一多模态建模，基于统一上下文建模和解耦能力路径两个核心原则，通过双流混合专家架构实现联合上下文学习并解耦理解和生成路径。

Comments 34 pages, 14 figures, 10 tables, homepage url: https://lance-project.github.io , code url: https://github.com/bytedance/Lance

详情

AI中文摘要

我们提出了Lance，一种轻量级的原生统一模型，支持图像和视频的多模态理解、生成和编辑。与依赖模型容量扩展或文本-图像主导设计不同，Lance通过协同多任务训练探索统一多模态建模的实用范式。其基于两个核心原则：统一上下文建模和解耦能力路径。具体而言，Lance从头开始训练，并在共享交错的多模态序列上采用双流混合专家架构，实现联合上下文学习的同时解耦理解和生成路径。我们进一步引入模态感知的旋转位置编码以减轻异构视觉标记之间的干扰并提升跨任务对齐。在训练过程中，Lance采用分阶段的多任务训练范式，结合能力导向的目标和自适应数据调度，以加强语义理解和视觉生成性能。实验结果表明，Lance在图像和视频生成方面显著优于现有开源统一模型，同时保留了强大的多模态理解能力。该模型的主页可在https://lance-project.github.io上访问。

英文摘要

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.

2605.17946 2026-05-21 cs.AI cs.CV cs.LG 版本更新

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

SVFSearch: 一种面向游戏垂直领域的多模态知识密集型短视频帧搜索基准

Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

发表机构 * Kuaishou Technology（快手科技）

AI总结本文提出SVFSearch，首个针对中文游戏领域短视频帧搜索的多模态知识密集型基准，通过5000个四选一测试示例和4198个辅助训练示例，评估了从直接问答到计划-行动-重新计划代理等多种方法在短视频帧搜索中的性能。

AI中文摘要

多模态大语言模型越来越多地被用作代理的骨干，用于理解多模态输入、规划检索动作、调用外部工具并对检索到的信息进行推理。然而，现有的基准很少在短视频应用中评估这种能力，在这类应用中，暂停的帧通常在视觉上具有歧义，回答问题需要垂直的、长尾的和快速演变的领域知识。我们引入了SVFSearch，这是首个面向中文游戏领域短视频帧搜索的开放基准。SVFSearch包含5,000个四选一测试示例和4,198个辅助训练示例，每个示例都围绕一个来自真实短视频片段的暂停游戏场景展开。为了支持公平且可复现的评估，SVFSearch提供了一个冻结的离线检索环境，包括一个游戏领域文本语料库、一个按主题关联的图像库以及文本、图像和多模态检索接口，避免了对不受控的网络搜索API的依赖。我们评估了从直接问答和RAG工作流到Plan-Act-Replan代理和学习型搜索模型在内的代表性范式。结果揭示了仅靠模型回答、实际的代理式搜索和oracle知识之间的巨大差距：最好的开源直接问答模型达到66.4%，最好的实际代理达到79.1%，而oracle知识达到95.4%。进一步分析揭示了视觉定位、检索质量、基于证据的推理和工具使用行为中的瓶颈，包括过度搜索、只给答案的捷径和检索引起的误导。

英文摘要

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

2605.17164 2026-05-21 cs.DC cs.AI cs.LG cs.PL 版本更新

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

Charon：一种用于大规模大语言模型训练和推理的统一且细粒度模拟器

Mengtian Yang, Zhekun Zhang, Mingheng Wu, Jianwen Yan, Hanshi Sun, Li-wen Chang

发表机构 * University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结本文提出Charon模拟器，通过统一、模块化和细粒度的方法，准确预测大语言模型性能，实验显示其在不同模型和配置上具有高精度，预测误差低于5.35%，并在实际推理部署中发现提升系统吞吐量的配置，展示了其实际价值。

Comments Accepted by MLSys 2026

2605.16962 2026-05-21 cs.CV cs.AI 版本更新

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

OmniVL-Guard Pro: 用于综合视觉-语言取证的工具增强代理

Jinjie Shen, Zheng Huang, Yuchen Zhang, Yujiao Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China（合肥工业大学计算机科学与信息工程学院）； Wuhan University, Wuhan, China（武汉大学）； Lab for Intelligence and visiON (LION)（智能与视觉实验室）； Xi'an Jiaotong University（西安交通大学）

AI总结该研究提出OmniVL-Guard Pro，一种工具增强的代理，用于综合视觉-语言取证，通过整合多种工具环境和引入新的强化学习方法，实现了开放世界中的线索驱动推理，并在多个任务上达到了最先进的性能。

Comments 29 pages

AI中文摘要

现有的视觉-语言伪造检测和定位方法基于封闭世界范式，假设验证可以由模型单独完成。然而，自包含的MLLMs受限于有限的参数化知识、静态训练语料和有限的感知分辨率，在动态开放世界取证中存在实际上限，特别是在需要外部线索的实时事件验证和需要对局部篡改进行细粒度审查的伪造分割中。为了解决这些限制，我们从扩大自包含模型的规模转向超越模型自身。我们提出了OmniVL-Guard Pro，一种工具增强的代理，将统一取证从封闭世界预测扩展到开放世界的线索驱动推理。OmniVL-Guard Pro整合了一个工具环境，涵盖实时事件搜索、局部裁剪和缩放、边缘异常筛查、人脸检测、视频帧提取以及基于SAM3的分割。为了生成高质量的工具推理轨迹，我们引入了树状结构的自进化工具轨迹生成，通过种子引导、无引导者的自我进化和弱提示的困难样本合成生成多样化的轨迹，得到用于训练的Full-Spectrum Tool Reasoning (FSTR)数据集。我们进一步提出了Checker-Guided Agentic Reinforcement Learning (CGARL)，它提供过程级监督，以惩罚答案正确但推理扭曲的情况。广泛的实验表明，OmniVL-Guard Pro在各种任务上实现了最先进的性能，并表现出强大的零样本泛化能力。FSTR数据集和OmniVL-Guard Pro的代码将在https://github.com/shen8424/OmniVL-Guard-Pro公开发布。

英文摘要

Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics -- particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clues-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.

2605.16428 2026-05-21 cs.IR cs.AI 版本更新

The Impact of AI Search on the Online Content Ecosystem: Evidence from Google and Reddit

人工智能搜索对在线内容生态系统的影响：来自谷歌和Reddit的证据

Peibo Zhang, Ruomeng Cui, Dennis J. Zhang

发表机构 * Goizueta Business School, Emory University（埃默里大学戈伊兹塔商学院）； Olin Business School, Washington University in St. Louis（圣路易斯华盛顿大学奥林商学院）

AI总结本文研究了人工智能搜索对在线内容生态系统的影响，通过分析谷歌AI概述和Reddit平台，发现AI概述提高了适合工作场合（SFW）社区的参与度，但交互式的AI模式在很大程度上消除了这种效果。

AI中文摘要

传统的搜索引擎通过将寻找信息的用户引导到外部网站来补充在线内容平台。能够直接在结果页面上总结答案的生成式人工智能搜索工具的出现，可能使访问来源平台变得可有可无，从而打破这种关系。我们利用谷歌AI概述和最大的在线讨论平台之一Reddit研究这一问题。我们的识别策略利用了谷歌的内容审核政策：适合工作场合（SFW）的Reddit社区被谷歌自然搜索索引并出现在谷歌AI概述中，而不适合工作场合（NSFW）的社区虽然被自然搜索索引，但被禁止在AI概述摘要中引用。使用双重差分设计，我们发现AI概述提高了SFW社区的参与度：相对于NSFW社区，每日评论数增加了12.0%，评论用户数增加了12.3%。这些影响集中在基于经验的讨论（意见、建议和个人经验），而不是基于事实的信息。然而，随后推出的谷歌AI模式允许用户与AI摘要进行对话式交互，在很大程度上消除了基于经验的内容中的这些增益。这些结果表明，人工智能搜索的效果在很大程度上取决于界面设计和内容类型。

英文摘要

Search engines traditionally complement online content platforms by directing users seeking information to external websites. The emergence of generative AI search tools that summarize answers directly on the results page may disrupt this relationship by making visits to source platforms optional. We study this question using Google AI Overviews and Reddit, one of the largest online discussion platforms. Our identification exploits Google's content moderation policy: Safe-for-Work (SFW) Reddit communities are indexed by Google organic search and surfaced in Google AI Overviews, while Not-Safe-for-Work (NSFW) communities, though indexed by organic search, are prohibited from being referenced in AI Overview summaries. Using a difference-in-differences design, we find that AI Overviews increase engagement in SFW communities: daily comments rise by 12.0 percent and the number of commenting users by 12.3 percent relative to NSFW communities. The effects are concentrated in experience-based discussions (opinions, advice, and personal experiences) rather than fact-based information. However, the subsequent introduction of Google AI Mode, which allows users to interact conversationally with the AI summary, largely eliminates these gains in experience-based content. These results suggest that the effects of AI search depend critically on interface design and types of content.

2605.16217 2026-05-21 cs.CL cs.AI cs.IR 版本更新

Argus: Evidence Assembly for Scalable Deep Research Agents

Argus：可扩展深度研究代理的证据组装

Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

发表机构 * MiroMind AI

AI总结 Argus通过将深度研究视为拼图碎片的组装过程，而非并行暴力求解整个答案，提高了大规模信息检索任务的效率和效果。

AI中文摘要

深度研究代理在复杂信息检索任务上取得了显著进展。即使是很长的ReAct风格推演也只探索单一轨迹，而最新的最先进系统通过并行搜索和聚合来扩展推理时计算。然而，深度研究的答案由互补的证据片段组成，并行推演往往重复而非补全这些片段，导致收益递减，同时将聚合上下文推向模型的上限。我们提出Argus，一种代理系统，其中搜索者（Searcher）和导航者（Navigator）合作，将深度研究视为用互补的证据片段组装拼图，而非并行暴力求解整个答案。搜索者通过ReAct风格的交互为给定子查询收集证据轨迹。导航者维护共享的证据图，验证哪些片段仍然缺失，派遣搜索者收集它们，并在补全后的图上推理以生成可追溯来源的最终答案。我们用强化学习训练导航者进行验证、派遣和综合，同时独立训练搜索者，使其保持为标准的ReAct代理。所得到的导航者无需重新训练即可支持单个搜索者或多个并行搜索者的推演。搜索者和导航者均基于35B-A3B MoE骨干，在八个基准上平均，Argus使用单个搜索者时提升5.5分，使用8个并行搜索者时提升12.7分。使用64个搜索者时，其在BrowseComp上达到86.2分，超越了我们测试的所有专有代理，同时导航者的推理上下文保持在21.5K tokens以下。

英文摘要

Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

2605.14259 2026-05-21 cs.AI cs.CL 版本更新

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

面向异构业务系统的超图企业代理推理器

Ling Wang, Xin Liu, Songnan Liu, Jianan Wang, Cheng Cheng, Yihan Zhu, Enyu Li, Yu Xiao, Jiangyong Xie, Duogong Yan, Jiangyi Chen

发表机构 * SUPCON

AI总结本文提出HEAR，一种基于分层超图本体的企业代理推理器，通过基础图层和超边层实现结构化多跳分析，无需重新训练LLM，在供应链任务中准确率最高达到94.7%，并展示出自适应效率。

AI中文摘要

将大语言模型（LLMs）应用于异构企业系统，受到多跳、n元推理中的幻觉和失败的阻碍。现有范式（如GraphRAG、NL2SQL）缺乏这些复杂环境所需的语义基础和可审计执行。我们引入HEAR，一种基于分层超图本体的企业代理推理器。其基础图层虚拟化了具有溯源意识的数据接口，而超边层编码n元业务规则和程序性协议。HEAR运行一个证据驱动的推理循环，动态协调本体工具进行结构化多跳分析，无需重新训练LLM。在供应链任务（包括订单履行阻塞根本原因分析，RCA）上的评估显示，HEAR的准确率最高达到94.7%。关键的是，HEAR展示了自适应效率：利用程序性超边将token成本降至最低，同时利用拓扑探索确保复杂查询的严格正确性。通过以开放权重骨干达到与专有模型相当的性能，并将人工诊断自动化，HEAR为企业智能建立了可扩展、可审计的基础。

英文摘要

Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.

2605.12483 2026-05-21 cs.LG cs.AI 版本更新

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

超越GRPO和在线策略蒸馏：一种经验性稀疏到密集奖励原则用于语言模型后训练

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard

AI总结本文提出了一种经验性的稀疏到密集奖励原则，用于语言模型后训练，通过在教师模型上使用稀疏奖励进行探索和发现，然后通过密集监督将行为压缩到部署模型中，从而在数学问题上实现了优于GRPO的性能。

AI中文摘要

在带标注的可验证训练数据是瓶颈约束的情况下，每个经过检查的示例都应分配给其信息量最大的模型和奖励密度。我们识别出一个支配这种分配的奖励密度原则：稀疏的序列级奖励在能够探索和发现更好行为的模型上最有用，而密集的token级教师监督更适合将该行为压缩到更小的部署模型中。该原则产生了一个简单的分配规则：在上游将稀缺的标注数据用于最强的可用教师，然后将经奖励塑造的行为作为密集监督向下游迁移。我们通过一个四阶段的工作流程（教师RL、forward-KL预热、在线策略蒸馏、可选的桥接后学生RL），在可验证的数学任务上使用Qwen3和Llama模型评估了此规则。在固定的Qwen3-1.7B部署学生规模下，经RL改进的8B教师通过密集桥进行蒸馏，优于在同一学生上直接使用GRPO（MATH上为79.3%对75.9%；AIME 2024上为25.2%对19.8%，avg@16），而从RL之前的同一教师进行迁移则表现更差。组件消融确认了每个阶段都不可或缺：将RL改进的教师替换为原始教师会损失7.8个MATH点，移除forward-KL预热会损失1.7个点，移除在线策略蒸馏会损失3.3个点。教师质量顺序（原始教师迁移 < 直接GRPO < RL教师迁移）在Llama-3.1-8B-Instruct上、以Llama-3.3-70B-Instruct为教师的设置下得到复现。操作上的教训是避免将稀缺的标注数据用于准备最不充分的策略：使用稀疏奖励进行教师端的发现，使用密集迁移进行学生端的压缩，并且仅在桥接之后才使用学生端的稀疏奖励。

英文摘要

In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated to the model and reward density where it is most informative. We identify a reward-density principle that governs this allocation: sparse sequence-level reward is most useful on models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller deployment model. The principle yields a simple allocation rule: use scarce labeled data upstream on the strongest available teacher, then transfer the reward-shaped behavior downstream as dense supervision. We evaluate this rule through a four-stage workflow -- teacher RL, forward-KL warmup, on-policy distillation, optional post-bridge student RL -- on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student (79.3% vs. 75.9% on MATH; 25.2% vs. 19.8% on AIME 2024, avg@16), while transfer from the same teacher before RL underperforms. A component ablation confirms that each stage is load-bearing: replacing the RL-improved teacher with a raw teacher costs 7.8 MATH points, removing the forward-KL warmup costs 1.7, and removing on-policy distillation costs 3.3. The teacher-quality ordering -- raw-teacher transfer < direct GRPO < RL-teacher transfer -- replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher. The operational lesson is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.

2605.12334 2026-05-21 cs.AI 版本更新

Reinforcing VLAs in Task-Agnostic World Models

在任务无关的世界模型中强化视觉-语言-动作

Yucen Wang, Rui Yu, Fengming Zhang, Junjie Lu, Xinyao Qin, Tianxiang Zhang, Kaixin Wang, Li Zhao

发表机构 * Microsoft Research Asia（微软亚洲研究院）； Nanjing University（南京大学）； University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Wuhan University（武汉大学）； University of Technology Sydney（悉尼科技大学）； Tsinghua University（清华大学）

AI总结本文提出RAW-Dream方法，通过分离世界模型学习与下游任务依赖，利用预训练的世界模型和现成的视觉-语言模型，实现零样本推理，从而在无需任务特定数据的情况下提高VLA适应性。

AI中文摘要

通过强化学习（RL）在学习到的世界模型中对视觉-语言-动作（VLA）模型进行后训练，已成为一种无需昂贵的真实世界交互即可适应新任务的有效策略。然而，尽管使用想象轨迹降低了策略训练的样本复杂度，现有方法仍然严重依赖任务特定数据来微调世界模型和奖励模型，从根本上限制了其向未见任务的扩展能力。为了解决这个问题，我们主张世界模型和奖励模型应捕捉可迁移的物理先验，以实现零样本推理。我们提出了RAW-Dream（Reinforcing VLAs in task-Agnostic World Dreams），一种新的范式，将世界模型学习与下游任务依赖完全解耦。RAW-Dream利用在多样化的任务无关行为上预训练的世界模型来预测未来的推演轨迹（rollouts），并利用现成的视觉-语言模型（VLM）生成奖励。由于这两个组件都是任务无关的，VLA可以完全在这种零样本想象中针对任何新任务进行微调。此外，为了减轻世界模型的幻觉，我们引入了双噪声验证机制来过滤掉不可靠的推演轨迹。在模拟和真实世界设置中的广泛实验展示了一致的性能提升，证明通用的物理先验可以有效替代昂贵的任务相关数据，为VLA适应提供了一条高度可扩展的路线。

英文摘要

Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.

2605.12321 2026-05-21 cs.AI cs.CY cs.ET 版本更新

LIDSA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management

LIDSA：无信号自主交叉口管理中的认知仲裁

Abderrahmane Lakas, Mohamed Amine Ferrag, Merouane Debbah

发表机构 * Department of Computer and Network Engineering, United Arab Emirates University, UAE（计算机与网络工程系，阿联酋大学）； Research Institute for Digital Future, Khalifa University, UAE（未来数字研究院，哈利法大学）

AI总结本文提出LIDSA框架，利用大语言模型进行意图驱动的速度建议，以实现无信号的自主交叉口管理，通过与固定周期控制、SCATS、AIM和GLOSA等方法对比，证明LLM在实时交叉口管理中的有效性。

Comments Renamed LISA to LIDSA to avoid naming ambiguity with existing traffic-control software. No technical changes

AI中文摘要

大型语言模型（LLMs）在智能交通系统（ITS）中展现出强大的潜力，特别是在需要情境推理和多智能体协调的任务中。这些能力使它们非常适合协同驾驶，而基于规则的方法在复杂和动态的交通环境中表现不佳。交叉口管理尤其具有挑战性，因为存在相互冲突的通行权需求、异质的车辆优先级以及必须实时解决的车辆特定运动学约束。然而，现有方法通常将LLMs作为基于信号的系统之上的辅助组件，而不是主要决策者。信号控制器仍然不区分车辆，基于预约的方法缺乏意图感知，而最近的基于LLM的系统仍然依赖于信号基础设施。此外，LLM推理延迟限制了其在亚秒级控制场景中的应用。我们提出了LIDSA（LLM-Based Intent-Driven Speed Advisory），一种用于自主交叉口管理的无信号认知仲裁框架。LIDSA利用LLM对车辆声明的意图进行推理，并结合优先级类别、队列压力和能源偏好。我们在不同交通负载下将LIDSA与固定周期控制、SCATS、AIM和GLOSA进行了对比评估。结果表明，LIDSA将平均控制延迟最多降低89.1%，并保持服务水平C，而所有非LLM基线方法均降级到服务水平F。在接近饱和的需求下，相对于固定周期控制，LIDSA将平均等待时间减少了93%，峰值队列长度减少了60.6%。它还将燃料消耗最多降低48.8%，并实现了86.2%的意图满足率，而最好的非LLM方法为61.2%。这些结果表明，基于LLM的推理能够实现实时、无信号的交叉口管理。

英文摘要

Large language models (LLMs) show strong potential for Intelligent Transportation Systems (ITS), particularly in tasks requiring situational reasoning and multi-agent coordination. These capabilities make them well suited for cooperative driving, where rule-based approaches struggle in complex and dynamic traffic environments. Intersection management remains especially challenging due to conflicting right-of-way demands, heterogeneous vehicle priorities, and vehicle-specific kinematic constraints that must be resolved in real time. However, existing approaches typically use LLMs as auxiliary components on top of signal-based systems rather than as primary decision-makers. Signal controllers remain vehicle-agnostic, reservation-based methods lack intent awareness, and recent LLM-based systems still depend on signal infrastructure. In addition, LLM inference latency limits their use in sub-second control settings. We propose LIDSA (LLM-Based Intent-Driven Speed Advisory), a signal-free cognitive arbitration framework for autonomous intersection management. LIDSA uses an LLM to reason over declared vehicle intents, incorporating priority classes, queue pressure, and energy preferences. We evaluate LIDSA against fixed-cycle control, SCATS, AIM, and GLOSA across varying traffic loads. Results show that LIDSA reduces mean control delay by up to 89.1% and maintains Level of Service C while all non-LLM baselines degrade to Level of Service F. Under near-saturated demand, LIDSA reduces mean waiting time by 93% and peak queue length by 60.6% relative to fixed-cycle control. It also lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, compared to 61.2% for the best non-LLM method. These results demonstrate that LLM-based reasoning can enable real-time, signal-free intersection management.

2605.11302 2026-05-21 cs.LG cs.AI cs.CL 版本更新

A Theory of Time-Sensitive Language Generation: Sparse Hallucination Beats Mode Collapse

时间敏感语言生成理论：稀疏幻觉战胜模式崩溃

Atul Ganju, Travis McVoy, Shaddin Dughmi, Shang-Hua Teng

发表机构 * University of Southern California（美国南加州大学）

AI总结本文研究了在全局偏好顺序下语言生成的极限情况，提出了一种时间敏感的语言生成方法，通过稀疏幻觉技术克服了模式崩溃问题，证明了在特定条件下可以实现最优密度。

AI中文摘要

我们研究在字符串的全局偏好顺序下的极限意义下的语言生成（language generation in the limit），该设定由Kleinberg和Wei引入。与以往工作一样，我们追求广度，但额外施加了时效性要求：排名较高的字符串应更早生成。一个字符串只有在截止时间之前生成才被计入，其截止时间由一个函数定义，该函数将字符串在目标语言中的排名映射到必须生成它的时间。这与机器学习中的一个核心考虑一致：在其他条件相同的情况下，归纳偏置倾向于"更简单"或"更合理"的输出。我们证明，对于最终一致的生成器（大多数先前相关工作的主角），时效性生成在很强的意义上是不可能的。在或许是最温和的自然一致性放松之下，即幻觉率随时间消失，我们证明可以绕过这一不可能性结果。特别是，我们可以相对于任何超线性截止函数实现最优密度。我们还通过排除线性截止时间和消失幻觉率下的时效性生成，证明了这一结果是紧的。

英文摘要

We study language generation in the limit under a global preference ordering on strings, as introduced by Kleinberg and Wei. As is done in previous work, we aim for breadth, but impose an additional requirement of timeliness: higher-ranked strings should be generated earlier. A string is then only credited if it is generated before a deadline, where its deadline is defined by a function that maps a string's rank in the target language to the time by which it must be produced. This is in keeping with a central consideration in machine learning, where inductive bias favors ``simpler'' or ``more plausible'' outputs, all else being equal. We show that timely generation is impossible in a strong sense for eventually consistent generators -- the protagonists of most prior related work. Under what is perhaps the mildest natural relaxation of consistency, a hallucination rate that vanishes over time, we show that we can circumvent our impossibility result. In particular, we can achieve optimal density with respect to any superlinear deadline function. We also show this is tight by ruling out timely generation with linear deadlines and vanishing hallucination rate.

2605.11151 2026-05-21 cs.AI cs.RO 版本更新

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

RankQ: 通过自监督动作排名实现离线到在线强化学习

Andrew Choi, Wei Xu

发表机构 * Horizon Robotics（地平线机器人）

AI总结该研究提出RankQ方法，通过自监督多项排名损失增强时序差分学习，以在大状态-动作空间中更准确地学习批评器，从而在稀疏奖励D4RL基准和基于视觉的机器人学习中实现更高效的离线到在线微调。

AI中文摘要

离线到在线强化学习（RL）通过利用预先收集的数据集来提高样本效率。然而，一个关键挑战是在有限的数据集覆盖下，在大规模状态-动作空间中学习准确的批评器。为了减轻价值过估计带来的有害更新，先前方法通过降低分布外（OOD）动作相对于数据集动作的权重来引入悲观主义。虽然有效，但这种方法本质上充当了一个行为克隆锚点，当数据集动作不优时会阻碍后续在线策略改进。我们提出RankQ，一种离线到在线的Q学习目标，通过在时序差分学习中加入自监督的多项排名损失来强制结构化动作排序。通过学习相对动作偏好而不是均匀惩罚未见过的动作，RankQ塑造Q函数，使动作梯度指向高质量的行为。在稀疏奖励D4RL基准中，RankQ的性能与或优于七种先前方法。在基于视觉的机器人学习中，RankQ能够在低数据环境下有效微调预训练的视觉-语言-动作（VLA）模型，平均在模拟成功率上比次优方法高42.7%。在高数据环境下，RankQ在模拟性能上比次优方法提高13.7%，并实现强大的仿真到现实转移，将现实世界立方体堆叠成功率从43.1%提升到88.9%，相对于VLA的初始性能。

英文摘要

Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state--action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 88.9% relative to the VLA's initial performance.

2605.10787 2026-05-21 cs.AI cs.SE 版本更新

何时重新承诺：为长时间视觉-语言推理发现时间抽象

Chen Li, Zhantao Yang, Fangyi Chen, Han Zhang, Anudeepsekhar Bolimera, Marios Savvides

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结本文提出了一种可学习的状态条件化承诺深度方法，用于长时间视觉-语言推理任务，通过动态调整承诺深度，提高了求解率并减少了基本动作数量，优于固定深度基线和现有模型。

AI中文摘要

长时间推理不仅需要决定采取什么行动，还需要决定在下一次观察之前承诺多深。我们将其形式化为"承诺深度"：两次重新规划之间以开环方式执行的原始动作数量。承诺深度在重新规划成本和执行误差累积之间产生权衡，但大多数现有的长时间系统将其固定为手工设计的标量。在本文中，我们将承诺深度视为策略本身的一个可学习、状态条件化的变量。我们在一个模型原生的视觉-语言策略中实现了这一点，该策略联合预测执行什么以及执行多久。在Sliding Puzzle和Sokoban任务中，所得到的自适应策略帕累托支配每一个非退化的固定深度基线，求解率最多高出12.5个百分点，同时每回合使用的原始动作减少约25%。尽管使用7B骨干，我们的方法在两个任务上均优于GPT-5.5和Claude Sonnet，而每个测试的开放权重视觉-语言模型的零样本成功率均为0%。我们进一步给出了理论分析，表明在标准的承诺深度代理目标下，只要局部最优深度随状态变化，状态条件化的承诺就严格优于任何固定深度。

英文摘要

Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as commitment depth: the number of primitive actions executed open-loop between replans. Commitment depth induces a trade-off between replanning cost and compounding execution error, yet most existing long-horizon systems fix it as a hand-designed scalar. In this work, we instead treat commitment depth as a learnable, state-conditioned variable of the policy itself. We instantiate this within a model-native vision-language policy that jointly predicts both what to execute and for how long. Across Sliding Puzzle and Sokoban, the resulting adaptive policy Pareto-dominates every non-degenerate fixed-depth baseline, achieving up to 12.5 percentage points higher solve rate while using approximately 25% fewer primitive actions per episode. Despite using a 7B backbone, our method outperforms GPT-5.5 and Claude Sonnet on both tasks, while every tested open-weight vision-language model achieves 0% zero-shot success. We further present a theoretical analysis showing that, under the standard commitment-depth surrogate, state-conditioned commitment strictly dominates any fixed depth whenever the locally optimal depth varies across states.

2605.07926 2026-05-21 cs.AI 版本更新

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

AgentEscapeBench: 评估LLM代理的领域外工具引导推理能力

Zhengkang Guo, Yiyang Li, Lin Qiu, Xiaohua Wang, Jingwen Xv, Dongyu Ru, Xiaoyu Li, Xiaoqing Zheng, Xuezhi Cao, Xunliang Cai

发表机构 * Fudan University（复旦大学）； Meituan Longcat Team（美团Longcat团队）

AI总结本文提出AgentEscapeBench基准测试，用于评估LLM代理在熟悉的工作流和短程交互之外维持工具引导推理的能力，通过密室逃脱风格的任务测试代理在显式长距离依赖约束下推断、执行和修订新工具使用程序的能力，结果显示代理在依赖深度增加时表现显著下降。

AI中文摘要

随着基于LLM的代理越来越多地依赖外部工具，评估其在熟悉的工作流和短程交互之外维持工具引导推理的能力变得很重要。我们引入了AgentEscapeBench，一个密室逃脱风格的基准测试，用于测试代理是否能够在显式的长距离依赖约束下推断、执行和修订新的工具使用程序。每个任务在工具和物品上定义了一个有向无环依赖图，要求代理调用真实的外部函数、跟踪逐步揭示的隐藏状态、传播中间结果，并提交一个可确定性验证的最终答案。AgentEscapeBench包含分布在五个难度层级的270个实例，并支持全自动评估。对十六个LLM代理和人类参与者的实验表明，随着依赖深度的增加，表现急剧下降：人类的成功率从难度5级的98.3%降至难度25级的80.0%，而最佳模型从90.0%降至60.0%。轨迹分析表明，模型失败主要归因于长距离状态跟踪、线索遵循和中间结果传播的失效。这些发现表明，当前代理通常能够处理局部工具使用，但在深层上下文依赖方面仍存在困难。我们希望AgentEscapeBench可以作为诊断测试平台，用于衡量当前代理的能力，并为未来的训练工作提供参考，以实现更稳健的通用推理、行动和适应能力。

英文摘要

As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.

URL PDF HTML ☆

赞 0 踩 0

2605.07731 2026-05-21 cs.CL cs.AI 版本更新

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

EngGPT2-16B-A3B与可比的意大利及国际开源大语言模型的基准对比

Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman

发表机构 * AIRIC, Politecnico di Milano（AIRIC，米兰理工大学）； DEIB, Politecnico di Milano（DEIB，米兰理工大学）

AI总结本文研究了EngGPT2-16B-A3B在多个基准测试中的性能，与同等规模的开源MoE和密集模型进行比较，展示了其在国际和意大利基准测试中的表现。

AI中文摘要

本报告对ENGINEERING Ingegneria Informatica S.p.A.的EngGPT2MoE-16B-A3B大语言模型进行了基准测试，该模型是一个具有3B活跃参数的16B参数混合专家（MoE）模型。我们在多种代表性基准测试上评估了其性能，并与规模相当的开源MoE模型和密集模型进行了比较。与流行的意大利模型（FastwebMIIA-7B、Minerva-7B、Velvet-14B和LLaMAntino-3-ANITA-8B）相比，EngGPT2MoE-16B-A3B在国际基准测试（ARC-Challenge、GSM8K、AIME24、AIME25、MMLU和HumanEval（HE））上表现相当或更好。它在RULER基准测试的最长上下文设置（32k）中取得了最佳性能。在意大利基准数据集ITALIC上，该模型的表现与其他模型相当或更好，只有Velvet-14B优于它。与规模相当的流行MoE模型相比，新模型在所有考虑的基准测试上的数值均高于DeepSeek-MoE-16B-Chat。它在HE、MMLU、AIME24、AIME25、GSM8K和32k RULER设置上高于Moonlight-16B-A3B，但在BFCL以及部分ARC和ITALIC设置上较低。最后，它在大多数基准测试上低于GPT-OSS-20B，包括HE、MMLU、AIME24、AIME25、GSM8K、ARC、BFCL和RULER 32k。与流行的密集模型相比，EngGPT2MoE-16B-A3B在AIME24和AIME25上的数值高于Llama-3.1-8B-Instruct、Gemma-3-12b-it和Ministral-3-8BInstruct-2512-BF16，但在ITALIC、BFCL和32k上下文的RULER上较低。当汇总所有基准测试指标时，EngGPT2MoE-16B-A3B的表现高于所评估的意大利模型，但低于一些性能最强的国际模型（特别是GPT-5 nano和Qwen3-8B）。总体而言，我们的发现表明新模型是原生意大利大语言模型向前迈出的一步。

英文摘要

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B-parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks and compared against comparably sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well as or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower values on BFCL and some ARC and ITALIC settings. Finally, it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings show the new model to be a step forward for native Italian Large Language Models.

2605.07021 2026-05-21 cs.AI 版本更新

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

行为线索推理：通过监督提高推理的效率和安全性

Christopher Z. Cui, Taylor W. Killian, Prithviraj Ammanabrolu

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）； Brigham Young University

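AI summary: This study proposes Behavior Cue Reasoning, which introduces Behavior Cues to make LLM reasoning more controllable and monitorable. A monitor can then prune up to 50% of wasted reasoning tokens in complex math problem solving, and the success rate when recovering safe actions rises from 46% to 96%.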

AI中文摘要

大语言模型（LLMs）的推理过程在监督方面面临挑战，因为许多不一致的行为往往在推理结束后才显现。为了解决这一问题，我们引入了行为线索推理，使LLM的推理过程更加可控和可监控。行为线索是特殊标记序列，模型在训练过程中被训练为在特定隐含和显式行为之前立即发出，起到双重用途的信号和控制杠杆。在微调较弱的外部监控器时，通过强化学习进行推理监督，仅使用行为线索产生的信息压缩视图就足以让监控器剪枝复杂数学问题解决中多达50%的无效推理token。当在过度约束违反导致失败的环境中利用几乎最优的规则基监控器时，行为线索使从80%的推理轨迹中恢复安全行动，这些轨迹原本会以提出不安全行动而结束，将成功率从46%提升至96%。通过在两个模型家族和三个领域中的评估，我们证明行为线索推理在不降低性能的情况下提高了推理的可监控性和可控性。更广泛地说，我们的工作通过展示被监控模型本身可以被训练得更易于监督来推进可扩展的监督。

英文摘要

Reasoning in Large Language Models (LLMs) poses a challenge for oversight, as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view containing only the information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cues allow for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work advances scalable oversight by demonstrating how the monitored model itself can be trained to reason in ways that are more tractable to oversight. Code: https://github.com/christopherzc/behavior-cues

2605.06139 2026-05-21 cs.LG cs.AI 版本更新

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

列表式策略优化：基于组的RLVR作为LLM响应单纯形上的目标投影

Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji

发表机构 * Department of Automation, Tsinghua University（清华大学自动化系）； LLM Department, Tencent（腾讯LLM部门）

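AI summary: This paper proposes Listwise Policy Optimization (LPO), which performs the target projection explicitly: it restricts the proximal RL objective to the response simplex to expose the implicit target, then projects the policy by exact divergence minimization. Across diverse reasoning tasks and LLM backbones, it improves training performance while preserving optimization stability and response diversity.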

AI中文摘要

可验证奖励的强化学习（RLVR）已成为大语言模型（LLMs）训练后的一种标准方法，以激励推理能力。在现有方法中，基于组的策略梯度很流行，它为每个提示样本生成一组响应，并通过组内优势信号更新策略。本文揭示这些优化策略共享一个共同的几何结构：每种策略隐式地定义了一个目标分布，并通过一阶近似向响应单纯形投影。基于这一见解，我们提出了列表式策略优化（LPO）以显式执行目标投影，通过限制近端RL目标到响应单纯形来解构隐式目标，然后通过精确散度最小化进行策略投影。该框架提供了（i）在列表式目标上单调改进，具有有界、零和和自校正的投影梯度，以及（ii）通过解耦的投影步骤灵活选择散度，具有不同的结构性质。在多样推理任务和LLM基础架构上，LPO在匹配的目标下一致地优于典型的策略梯度基线，同时内在地保持了优化稳定性和响应多样性。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach in the post-training of large language models (LLMs) to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent: it samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to conduct the target projection explicitly. LPO demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
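
The idea of a target distribution on the response simplex can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes standardized group-relative advantages and a KL-regularized proximal objective, whose maximizer over the simplex has the well-known closed form q_i ∝ p_i · exp(A_i / β). The function names, the temperature `beta`, and the toy rewards and probabilities are all hypothetical.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def tilted_target(policy_probs, advantages, beta=1.0):
    """Maximizer of <q, A> - beta * KL(q || p) over the simplex:
    q_i is proportional to p_i * exp(A_i / beta)."""
    logits = [math.log(p) + a / beta for p, a in zip(policy_probs, advantages)]
    return softmax(logits)

def kl(q, p):
    """KL(q || p) for two distributions on the same finite support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Toy group of four responses (illustrative numbers only).
rewards = [1.0, 0.0, 0.0, 1.0]
# Assumption: policy probabilities renormalized over the sampled group.
policy_probs = [0.1, 0.4, 0.3, 0.2]

adv = group_relative_advantages(rewards)
target = tilted_target(policy_probs, adv, beta=1.0)
print("advantages:", adv)
print("target:", target)
print("KL(target || policy):", kl(target, policy_probs))
```

In this toy setting the target shifts probability mass toward the two rewarded responses. A projection step would then fit the policy to `target` under a chosen divergence; that step is omitted here.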

2605.05863 2026-05-21 cs.LG cs.AI 版本更新

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

SOPE: 通过先验数据稳定在线强化学习中的策略评估

Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov

发表机构 * Media Integration and Communication Center – University of Florence（媒体集成与通信中心——佛罗伦萨大学）； SEED – Electronic Arts（SEED——电子艺界）

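AI summary: This paper proposes SOPE, which uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automatic early-stopping mechanism to control the length of offline training phases dynamically, improving baseline performance on continuous control tasks while reducing compute.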

AI中文摘要

将先验数据纳入在线强化学习可以加速训练，但通常需要在高计算成本和长的多阶段训练流水线之间做出艰难的权衡。虽然固定长度的稳定阶段比静态更新计划更具计算效率，但它们需要任务相关的手动调整，可能会导致先验知识的浪费或严重的过拟合。为此，我们提出了SOPE算法，该算法利用与演员对齐的离策略策略评估（OPE）信号作为自动早停机制，动态控制离线训练阶段的长度。通过在当前策略的动作分布下对批评者进行保留验证集的评估，SOPE在离分布收益饱和时精确停止梯度更新，从而消除了手动调度调整的需要。在Minari基准套件的25个连续控制任务上评估，SOPE将基线性能提高了高达45.6%，同时将所需的TFLOPs减少了高达22倍，从而在样本效率和计算效率之间取得了平衡。这些发现表明，自适应的、基于评估的更新计划比依赖静态、详尽的更新计划更有效。

英文摘要

Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases. By evaluating the critic on a held-out validation split under the current policy's action distribution, SOPE halts gradient updates exactly when out-of-distribution benefits saturate, eliminating the need for manual schedule tuning. Evaluated on 25 continuous control tasks from the Minari benchmark suite, SOPE improves baseline performance by up to 45.6% while reducing the required TFLOPs by up to 22x, thus balancing the tradeoff between sample and computational efficiency. These findings demonstrate that adaptive, evaluation-driven update schedules are more effective than relying on static, exhaustive update schedules.
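
The early-stopping idea can be sketched generically. This is not SOPE itself: the stopping rule (no improvement of at least `min_delta` over the last `patience` evaluations), the assumption that a higher signal is better, and the names `should_stop`, `offline_phase`, `evaluate_ope`, and `update_step` are all hypothetical stand-ins for the paper's actor-aligned OPE signal and training loop.

```python
def should_stop(ope_history, patience=3, min_delta=1e-3):
    """True once the last `patience` evaluations failed to beat the
    earlier best by at least `min_delta` (assumes higher is better)."""
    if len(ope_history) <= patience:
        return False
    best_before = max(ope_history[:-patience])
    recent_best = max(ope_history[-patience:])
    return recent_best < best_before + min_delta

def offline_phase(evaluate_ope, update_step, max_steps=1000,
                  eval_every=10, patience=3):
    """Run offline updates until the validation signal saturates."""
    history = []
    for step in range(1, max_steps + 1):
        update_step()                       # one gradient update (stubbed)
        if step % eval_every == 0:
            history.append(evaluate_ope())  # held-out evaluation (stubbed)
            if should_stop(history, patience):
                return step, history
    return max_steps, history

def make_signal(values):
    """Replay a fixed sequence of values as a fake evaluation signal."""
    it = iter(values)
    return lambda: next(it)

# Simulated signal that rises and then saturates at 0.6.
evaluate = make_signal([0.1, 0.3, 0.5, 0.6] + [0.6] * 20)
stopped_at, history = offline_phase(evaluate, update_step=lambda: None)
print(stopped_at, history)
```

The point of the design is that the phase length becomes an output of the run instead of a hand-tuned hyperparameter; `patience` and `min_delta` still need to be chosen, but they tend to transfer across tasks more easily than a fixed step count.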

2605.04128 2026-05-21 cs.GR cs.AI cs.CL cs.CV cs.LG 版本更新

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

JoyAI-Image: 激活统一多模态理解和生成中的空间智能

Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan

发表机构 * Joy Future Academy, JD（joy未来学院，京东）

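AI summary: This paper presents JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. The model couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), so that perception and generation interact through a shared multimodal interface. A scalable training recipe combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and general and spatial editing signals, giving the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments show that JoyAI-Image achieves state-of-the-art or highly competitive performance on understanding, generation, long-text rendering, and editing benchmarks. The bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning moves the model beyond general visual competence toward stronger spatial intelligence.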

Comments Code: https://github.com/jd-opensource/JoyAI-Image

AI中文摘要

我们提出了JoyAI-Image，一种统一的多模态基础模型，用于视觉理解、文本到图像生成和指令引导的图像编辑。JoyAI-Image将空间增强的多模态大语言模型（MLLM）与多模态扩散Transformer（MMDiT）结合，允许感知和生成通过共享的多模态接口进行交互。围绕此架构，我们构建了一个可扩展的训练配方，结合了统一指令微调、长文本渲染监督、空间 grounded 数据以及通用和空间编辑信号。该设计使模型具备广泛的多模态能力，同时增强了几何感知推理和可控视觉合成。在理解、生成、长文本渲染和编辑基准上的实验表明，JoyAI-Image实现了最先进的或高度竞争的性能。更重要的是，增强的理解、可控的空间编辑和新视角辅助推理之间的双向循环使模型超越一般视觉能力，向更强的空间智能发展。这些结果表明，统一视觉模型在下游应用如视觉-语言-动作系统和世界模型中具有前景。

英文摘要

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

2605.03690 2026-05-21 cs.LG cs.AI q-bio.QM 版本更新

Graph Neural Network based Hierarchy-Aware Embeddings of Knowledge Graphs: Applications to Yeast Phenotype Prediction

基于图神经网络的面向层次的知识图谱嵌入：应用于酵母表型预测

Filip Kronström, Alexander H. Gower, Daniel Brunnsåker, Ievgeniia A. Tiukova, Ross D. King

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg（计算机科学与工程系，查尔姆斯理工大学和哥德堡大学）； Department of Life Sciences, Chalmers University of Technology（生命科学系，查尔姆斯理工大学）； Department of Industrial Biotechnology, KTH Royal Institute of Technology（工业生物技术系，皇家理工学院）； Department of Chemical Engineering and Biotechnology, University of Cambridge（化学工程与生物技术系，剑桥大学）

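AI summary: This paper proposes a method for producing hierarchy-aware knowledge graph embeddings using graph neural networks and a semantic loss derived from the underlying ontologies. It applies the method to yeast phenotype prediction, demonstrating its use in predicting gene-knockout effects and in evaluating knowledge graph revisions.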

AI中文摘要

我们提出了一种利用图神经网络和来自底层本体的语义损失来生成层次感知的知识图谱嵌入的方法。该方法生成的嵌入更能反映领域知识。为了展示其效用，我们预测并解释了酵母Saccharomyces cerevisiae中基因敲除的影响，并在没有预测任务的情况下学习知识图谱的盒嵌入。我们进一步展示了盒嵌入如何作为评估知识图谱修订的基础。我们的酵母知识图谱是从社区数据库和本体术语构建的。低维盒嵌入结合图神经网络用于预测双基因敲除的细胞生长。在10折交叉验证中，这些预测的平均R²分数为0.360，显著高于基线比较，证明了高层定性知识对实验结果的影响力。在模型训练中纳入语义损失项提高了其预测性能（R²=0.377），通过将嵌入对齐本体结构。这表明本体中的类层次可以用于定量预测。我们还测试了训练好的模型在三基因敲除上的表现，展示了其对训练数据之外数据的泛化能力。此外，通过识别酵母知识图谱中对细胞生长预测重要的共现关系，我们构建了关于酵母相互作用特征的假说。一个生物实验验证了其中一个发现，揭示了肌醇利用与渗透压压力抗性之间的关联，突显了模型在生物发现中的潜力。

英文摘要

We present a method for finding hierarchy-aware embeddings of knowledge graphs (KGs) using graph neural networks (GNNs) enriched with a semantic loss derived from underlying ontologies. This method yields embeddings that better reflect domain knowledge. To demonstrate their utility, we predict and interpret the effects of gene deletions in the yeast Saccharomyces cerevisiae and learn box embeddings for KGs in the absence of a prediction task. We further show how box embeddings can serve as the basis for evaluating KG revisions. Our yeast KG is constructed from community databases and ontology terms. Low-dimensional box embeddings combined with GNNs are used to predict cell growth for double gene knockouts. Over 10-fold cross validation, these predictions have a mean $R^2$ score of 0.360, significantly higher than baseline comparisons, demonstrating that high-level qualitative knowledge is informative about experimental outcomes. Incorporating semantic loss terms in the training of the models improves their predictive performance ($R^2 = 0.377$) by aligning embeddings with ontology structure. This shows that class hierarchies from ontologies can be exploited for quantitative prediction. We also test the trained models on triple gene knockouts, showing they generalise to data beyond those seen in training. Additionally, by identifying co-occurring relations in the yeast KG important for the cell-growth predictions, we construct hypotheses about interacting traits in yeast. A biological experiment validates one such finding, revealing an association between inositol utilisation and osmotic stress resistance, highlighting the model's potential to guide biological discovery.
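
For readers unfamiliar with the metric, the sketch below shows how a mean $R^2$ over cross-validation folds is typically computed, using the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$. The helper names and the toy numbers are illustrative assumptions, not the authors' evaluation code or data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_r2(folds):
    """Average R^2 over (y_true, y_pred) pairs, one pair per held-out fold."""
    scores = [r2_score(t, p) for t, p in folds]
    return sum(scores) / len(scores)

# Toy numbers for illustration only; they are not data from the paper.
toy_folds = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # perfect predictions: R^2 = 1
    ([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]),  # always predicting the mean: R^2 = 0
]
print(mean_r2(toy_folds))
```

An $R^2$ of 0 corresponds to always predicting the mean of the held-out targets, so a score of 0.360 means the model explains roughly a third of the variance in measured growth.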

2605.01486 2026-05-21 cs.AI 版本更新

MAP-Law: Coverage-Driven Retrieval Control for Multi-Turn Legal Consultation

MAP-Law: 多轮法律咨询中的覆盖驱动检索控制

Qinchuan Cheng, Jiaqi Liu, Ruixuan Xie, Xiaoya Yuan, Yuxin Liu

发表机构 * Xi’an Jiaotong University（西安交通大学）； Sichuan University（四川大学）； Southwestern University of Finance and Economics（西南财经大学）； Northeastern University（东北大学）

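AI summary: This paper proposes a coverage-driven retrieval-control framework for multi-turn legal consultation. It maintains a structured map over user facts, legal elements, retrieval goals, and retrieved evidence, and uses element coverage, evidence validity coverage, and marginal retrieval gain to decide whether to retrieve, clarify, reformulate, or stop. Experiments show that the method achieves element coverage under fixed legal-element schemas.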

AI中文摘要

法律咨询本质上是迭代的：在提供建议之前，系统必须识别相关法律要素，收集缺失的事实和权威，以及确定当前证据是否足够。现有的检索增强型法律代理通常使用固定的检索预算或单次搜索，使其对咨询的演变覆盖状态不敏感。本文介绍了一种针对多轮法律咨询的覆盖驱动检索控制框架。该框架维护用户事实、法律要素、检索目标和检索证据的结构化地图，并利用要素覆盖、证据有效性覆盖和边际检索收益来决定是否检索、澄清、改写或停止。在50个案例的合成中文劳动法咨询试点中，使用DeepSeek V4-Pro动作选择变体，在试点指标下实现了完全测量的要素覆盖，平均需要3.4次检索轮次和7.1个证据片段。诊断分析表明，模型支持的动作选择能够通过小幅增加检索预算恢复规则-政策失败案例，而强制继续主要增加令牌和延迟成本。这些结果表明，法律要素覆盖是适应性法律检索的有用控制信号，在固定模式条件下保持检索控制行为，而非部署层面的法律正确性。

英文摘要

Legal consultation is inherently iterative: before giving advice, a system must identify relevant legal elements, gather missing facts and authorities, and determine whether the current evidence is sufficient. Existing retrieval-augmented legal agents often use fixed retrieval budgets or single-shot search, making them insensitive to the evolving coverage state of a consultation. This paper introduces a coverage-driven retrieval-control framework for multi-turn legal consultation. The framework maintains a structured map over user facts, legal elements, retrieval goals, and retrieved evidence, and uses element coverage, evidence validity coverage, and marginal retrieval gain to decide whether to retrieve, clarify, reformulate, or stop. On a 50-case synthetic Chinese labor-law consultation pilot with fixed legal-element schemas, a DeepSeek V4-Pro action-selection variant achieves full measured element coverage under the pilot metric while requiring 3.4 retrieval rounds and 7.1 evidence snippets on average. Diagnostic analyses show that model-backed action selection recovers rule-policy failure cases with a small retrieval-budget increase, while forced continuation mainly increases token and latency costs. These results suggest that legal-element coverage is a useful control signal for adaptive legal retrieval, while remaining bounded to retrieval-control behavior under synthetic fixed-schema conditions rather than deployment-level legal correctness.
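
The control loop described above might look roughly like the following rule-based sketch. Everything in it is a hypothetical illustration: the abstract names the three signals and the four actions, but the `CoverageState` fields, the thresholds, and the order of the rules are assumptions, not the framework's actual policy.

```python
from dataclasses import dataclass

@dataclass
class CoverageState:
    element_coverage: float    # share of legal elements covered so far (0..1)
    evidence_validity: float   # share of retrieved snippets judged valid (0..1)
    marginal_gain: float       # coverage gained by the latest retrieval round
    missing_user_facts: bool   # True if an element still lacks a user fact

def choose_action(state, target=1.0, min_validity=0.5, min_gain=0.05):
    """Pick the next step from the coverage signals (hypothetical thresholds)."""
    if state.element_coverage >= target and state.evidence_validity >= min_validity:
        return "stop"          # enough elements covered by valid evidence
    if state.missing_user_facts:
        return "clarify"       # ask the user before searching further
    if state.marginal_gain < min_gain:
        return "reformulate"   # the last query stopped paying off
    return "retrieve"          # keep retrieving with the current query

print(choose_action(CoverageState(0.6, 0.9, 0.20, False)))
print(choose_action(CoverageState(1.0, 0.9, 0.00, False)))
```

The abstract's model-backed variant replaces a rule policy of this kind with an LLM that selects the action, which is how it recovers cases where fixed rules fail.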

2604.24697 2026-05-21 cs.AI 版本更新

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

当前智能体能否缩小发现到应用的差距？Minecraft中的一个案例研究

Zhou Ziheng, Huacong Tang, Jinyuan Zhang, Haowei Lin, Bangcheng Yang, Qian Long, Fang Sun, Yizhou Sun, Yitao Liang, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）； Peking University（北京大学）； Amazon（亚马逊）

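AI summary: Using SciCrafter, a Minecraft-based benchmark, this paper examines whether agents can discover causal regularities and apply them to build functional systems (the discovery-to-application loop). Frontier models reach a success rate of about 26% on the task, and the analysis shows that identifying knowledge gaps and raising the right questions is becoming a bottleneck for current AI.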

Comments Preprint, under review. 41 pages. Project page: https://scicrafter-bench.github.io/. Code: https://github.com/scicrafter-bench/scicraft-bench

AI中文摘要

发现因果规律并将其应用于构建功能性系统——发现-应用循环——是通用智能的标志，但评估这一能力受到科学发现与现实世界工程之间巨大复杂性差距的阻碍。我们引入了基于Minecraft的SciCrafter基准测试，通过参数化的红石电路任务来操作化这一循环。智能体必须按照指定的模式（例如同时或按时间序列）点燃灯泡；扩大目标参数会显著增加构建复杂性和所需知识，迫使真正的发现而非依赖记忆中的解决方案。在通用目的代码智能体框架下评估前沿模型，包括GPT-5.2、Gemini-3-Pro和Claude-Opus-4.5，我们发现所有模型均在约26%的成功率处停滞。为了诊断这些失败，我们将循环分解为四个能力——知识差距识别、实验发现、知识整合和知识应用，并设计了针对性的干预措施，其边际贡献作为相应差距的代理。我们的分析表明，尽管通用知识应用能力仍然是所有模型中最大的差距，但对前沿模型而言，知识差距识别开始成为主要障碍——表明瓶颈正从解决正确的问题转变为提出正确的问题。我们发布了SciCrafter作为未来研究AI系统在完整发现-应用循环中导航的诊断探针。

英文摘要

Discovering causal regularities and applying them to build functional systems--the discovery-to-application loop--is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at a success rate of approximately 26%. To diagnose these failures, we decompose the loop into four capacities--knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application--and design targeted interventions whose marginal contributions serve as proxies for the corresponding gaps. Our analysis reveals that although general knowledge application remains the biggest gap across all models, knowledge gap identification is becoming a major hurdle for frontier models, indicating that the bottleneck for current AI is shifting from solving problems right to raising the right problems. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.

2604.22080 2026-05-21 cs.AI 版本更新

Sound Agentic Science Requires Adversarial Experiments

可靠的智能体科学需要对抗性实验

Dionizije Fa, Marko Culjak

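AI summary: This study discusses why adversarial experiments matter in agent-assisted science, points out the limits of conventional approaches to scientific discovery, and proposes that scientific claims generated with agents be evaluated under a falsification-first standard.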

Comments Published at ICLR 2026 Workshop on Agents in the Wild

详情

AI中文摘要

基于大型语言模型的代理正迅速被用于科学数据分析，自动化了以往受限于人类时间和专业知识的任务。这种能力通常被描述为发现的加速，但同时也加速了熟悉的失败模式，即快速生成合理且可反复修改的分析，这些分析易于生成，实际上将假设空间转化为由选择性分析支持的候选主张，优化为可发表的积极结果。与软件不同，科学知识不是通过迭代积累代码和事后统计支持来验证的。单个数据集上的流畅解释或显著结果并不等于验证。因为缺失的证据是一个负空间，那些本应证伪主张的实验和分析从未运行或发表。因此，我们提出，通过代理协助产生的非实验性主张应受证伪优先标准的评估：代理不应主要用于构建最吸引人的叙述，而是应主动寻找主张可能失败的方式。

英文摘要

LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.

2604.20985 2026-05-21 cs.LG cs.AI cs.CR stat.ML 版本更新

Differentially Private Model Merging

差分隐私模型融合

Qichuan Yin, Manzil Zaheer, Tian Li

发表机构 * The University of Chicago（芝加哥大学）； Google DeepMind（谷歌DeepMind）

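AI summary: This paper proposes two post-processing techniques, random selection and linear combination, that produce a final private model satisfying any target differential privacy requirement without additional training. It also analyzes the privacy/utility tradeoffs of both methods on general problems and on private mean estimation.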

AI中文摘要

在机器学习中，推理或部署时间的隐私要求往往由于政策、法规或用户偏好变化而演变。在本文中，我们旨在构建一组模型，以满足任何目标差分隐私（DP）要求，而无需额外训练，给定一组已在相同数据集上训练且具有不同隐私/效用权衡的现有模型。我们提出两种后处理技术，即随机选择和线性组合，以生成最终的私有模型，满足任何目标隐私参数。我们从R'enyi DP和一般问题中的隐私损失分布的角度提供了这些方法的隐私计费，以及在私有均值估计中的精确隐私/效用权衡分析，并比较了这两种机制。实验上，我们展示了我们方法的有效性，并在多个模型和合成及现实世界数据集上验证了我们的分析。

英文摘要

In machine learning, privacy requirements at inference or deployment time often evolve due to changing policies, regulations, or user preferences. In this work, we aim to construct a family of models that can satisfy any target differential privacy (DP) requirement without additional training, given a set of existing models trained on the same dataset with different privacy/utility tradeoffs. We propose two post-processing techniques, namely random selection and linear combination, to generate final private models satisfying any target privacy parameter. We provide privacy accounting of these approaches through the lens of Rényi DP and privacy loss distributions on general problems, as well as on private mean estimation, where we precisely characterize the privacy/utility tradeoffs and compare the two mechanisms. Empirically, we demonstrate the effectiveness of our approaches and validate our analyses on several models and on both synthetic and real-world datasets.

2604.11661 2026-05-21 cs.LG cs.AI 版本更新

Towards Autonomous Mechanistic Reasoning in Virtual Cells

向虚拟细胞中的自主机理推理迈进

Yunhui Jang, Lu Zhu, Jake Fawkes, Alisandra Kaye Denton, Dominique Beaini, Emmanuel Noutahi

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)（韩国科学技术院）； Valence Labs（Valence实验室）； University College London（伦敦大学学院）

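AI summary: This paper proposes a structured explanation formalism for biological reasoning in virtual cells, in which mechanistic action graphs enable systematic verification and falsification. It also introduces VCR-Agent, a multi-agent framework that combines biologically grounded knowledge retrieval with verifier-based filtering to generate and validate mechanistic reasoning.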

AI中文摘要

大型语言模型（LLMs）最近因其在加速科学发现方面的潜力而受到广泛关注。然而，它们在如生物学等开放性科学领域中的应用仍然有限，主要是由于缺乏事实性支撑和可操作的解释。为此，我们引入了一种结构化解释形式化方法，用于虚拟细胞，将生物推理表示为机理动作图，从而实现系统验证和反驳。在此基础上，我们提出了VCR-Agent多智能体框架，该框架整合了生物基础知识检索与基于验证器的过滤方法，以自动生成并验证机理推理。使用该框架，我们发布了VC-TRACES数据集，该数据集由来自Tahoe-100M图谱的验证机理解释组成。实证研究表明，使用这些解释训练可以提高事实准确性，并为下游基因表达预测提供更有效的监督信号。这些结果强调了通过多智能体和严格验证的协同作用，可靠机理推理在虚拟细胞中的重要性。

英文摘要

Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release the VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of a multi-agent framework and rigorous verification.

2604.11530 2026-05-21 cs.CV cs.AI 版本更新

Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

超越注意力分数：基于SVD的视觉令牌修剪用于高效视觉-语言模型

Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa

发表机构 * anoncvlab（匿名计算机视觉实验室）

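AI summary: This paper proposes SVD-Prune, which applies SVD to the vision-token feature matrix and uses statistical leverage scores to select the top tokens, maintaining high performance under extreme vision-token budgets and outperforming existing pruning methods.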

AI-based facial occlusion removal is not suitable for identification

Emily A Cooper, Hany Farid

发表机构 * Herbert Wertheim School of Optometry & Vision Science University of California, Berkeley（赫伯特·韦瑟姆视觉科学学院，加州大学伯克利分校）； School of Information University of California, Berkeley（信息学院，加州大学伯克利分校）

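AI summary: This paper studies the effectiveness and risks of AI-based facial occlusion removal and examines how reliable it is for matching real identities.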

2603.26539 2026-05-21 cs.CL cs.AI 版本更新

How Open Must Language Models be to Enable Reliable Scientific Inference?

语言模型必须多开放才能实现可靠的科学推断？

James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman

Affiliations: Massachusetts Institute of Technology; EleutherAI; University of California San Diego; Rutgers University-Newark; Stony Brook University

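AI summary: This paper examines how the degree of openness of language models affects the reliability of scientific inferences drawn from research on them. It argues that closed models are generally unsuitable for scientific use and proposes a way to systematically identify and mitigate threats to inference.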

2603.25898 2026-05-21 eess.SY cs.AI cs.SE cs.SY 版本更新

On Integrating Resilience and Human Oversight into LLM-Assisted Modeling Workflows for Digital Twins

将韧性与人类监督整合到面向数字孪生的LLM辅助建模工作流中

Lekshmi P, Neha Karanjkar

发表机构 * Indian Institute of Technology Goa（印度理工学院 Goa）

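AI summary: Drawing on work with the FactoryFlow framework, this paper proposes three key design principles for integrating resilience and human oversight into LLM-assisted modeling workflows for Digital Twins: orthogonalize structural modeling and parameter fitting, restrict the model IR to parameterized, pre-validated library components, and use a density-preserving IR. Together these aim to make modeling more robust and interpretable.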

AI中文摘要

LLM辅助建模有潜力快速从粗略描述和传感器数据构建复杂的可执行数字孪生。然而，LLM幻觉的韧性、人类监督以及实时模型适应性仍然是具有挑战性的且常常相互冲突的要求。我们提出了三种关键的设计原则，用于将韧性和监督整合到此类工作流中，这些原则源于我们在FactoryFlow框架上的工作，该框架是一个开源的LLM辅助框架，用于构建制造系统的基于模拟的数字孪生。首先，正交化结构建模和参数拟合。结构描述（组件、连接）是通过LLM从粗略的自然语言转换为中间表示（IR），并进行人工可视化和验证，然后算法转换为最终模型。相比之下，参数推断则在传感器数据流上持续运行，并具有专家可调的控制。第二，限制模型IR到参数化、预验证的库组件的连接，而不是单体模拟代码，从而实现可解释性和错误韧性。第三，最重要的是使用密度保持的IR。当IR描述从紧凑的输入急剧扩展时，幻觉错误会成比例累积。我们提出了Python作为密度保持IR的案例：循环以简洁的方式表达规律性，类捕捉层次结构和组成，结果仍然保持高度可读性，同时利用LLM强大的代码生成能力。一个关键贡献是详细表征了LLM诱导的错误在不同详细程度和复杂度的模型描述中的表现，揭示了IR选择如何关键地影响错误率。这些见解为构建鲁棒和透明的LLM辅助模拟自动化工作流提供了可操作的指导。

英文摘要

LLM-assisted modeling holds the potential to rapidly build executable Digital Twins of complex systems from only coarse descriptions and sensor data. However, resilience to LLM hallucination, human oversight, and real-time model adaptability remain challenging and often mutually conflicting requirements. We present three critical design principles for integrating resilience and oversight into such workflows, derived from insights gained through our work on FactoryFlow, an open-source LLM-assisted framework for building simulation-based Digital Twins of manufacturing systems. First, orthogonalize structural modeling and parameter fitting. Structural descriptions (components, interconnections) are LLM-translated from coarse natural language to an intermediate representation (IR) with human visualization and validation, and the IR is then algorithmically converted to the final model. Parameter inference, in contrast, operates continuously on sensor data streams with expert-tunable controls. Second, restrict the model IR to interconnections of parameterized, pre-validated library components rather than monolithic simulation code, enabling interpretability and error resilience. Third, and most important, use a density-preserving IR. When IR descriptions expand dramatically from compact inputs, hallucination errors accumulate proportionally. We present the case for Python as a density-preserving IR: loops express regularity compactly, classes capture hierarchy and composition, and the result remains highly readable while exploiting LLMs' strong code generation capabilities. A key contribution is a detailed characterization of LLM-induced errors across model descriptions of varying detail and complexity, revealing how IR choice critically impacts error rates. These insights provide actionable guidance for building resilient and transparent LLM-assisted simulation automation workflows.

2603.16513 2026-05-21 cs.LG cs.AI 版本更新

FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

FEAT: 一个线性复杂度的超大规模结构化数据基础模型

Zhenghang Song, Tang Qian, Lu Chen, Yushuai Li, Zhengke Hu, Bingbing Fang, Yumeng Song, Junbo Zhao, Sheng Zhang, Tianyi Li

发表机构 * Zhejiang University（浙江大学）； Ant Group（蚂蚁集团）； Aalborg University（奥尔堡大学）

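AI summary: This paper proposes FEAT, a linear-complexity foundation model for extremely large structured data. A multi-layer dual-axis encoding architecture and an adaptive-fusion bidirectional state-space model enable cross-tuple contextualization in linear time while supporting permutation-invariant representation learning.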

AI中文摘要

结构化数据在医疗、金融和科学数据管理等领域被广泛应用。最近关于结构化数据基础模型（SFMs）的研究旨在支持在这些数据上的数据分析和挖掘任务，但将其应用于现实世界的企业数据库时仍面临可扩展性和泛化能力的挑战。首先，许多SFMs依赖于完全自注意力机制，这引入了O(N²)的计算瓶颈，并限制了可以同时处理的元组数量。其次，直接用线性复杂度序列模型替代注意力可能与结构化数据的排列不变性质相冲突，引入人为的顺序偏差并降低表示质量。此外，仅在合成数据上训练的模型可能难以泛化到现实世界数据库中常见的重尾和异质分布。为了解决这些挑战，我们提出了FEAT，一种用于超大规模结构化数据的线性复杂度基础模型。FEAT用多层双轴编码架构替代二次注意力。它集成了自适应融合双向状态空间模型（AFBM）与卷积门控线性注意力（Conv-GLA），在O(N)时间内实现跨元组上下文化，同时支持排列不变的表示学习。为了提高在现实数据偏斜下的鲁棒性，FEAT进一步采用混合结构因果预训练流水线，具有鲁棒的重建目标。在12个现实世界数据库基准测试中，FEAT在零样本任务上始终优于代表性的SFMs，并且与结构化数据样本长度线性扩展，达到高达50倍的推理延迟提升。

英文摘要

Structured data is widely used in domains such as healthcare, finance, and scientific data management. Recent studies on structured data foundation models (SFMs) aim to support data analysis and mining tasks over such data, but still face scalability and generalization challenges when applied to real-world enterprise databases. First, many SFMs rely on full self-attention, which introduces an O(N^2) computational bottleneck and limits the number of tuples that can be processed jointly. Second, directly replacing attention with linear-complexity sequence models may conflict with the permutation-invariant nature of structured data, introducing artificial order bias and degrading representation quality. Moreover, models trained only on synthetic data may struggle to generalize to the heavy-tailed and heterogeneous distributions commonly found in real-world databases. To address these challenges, we propose FEAT, a linear-complexity foundation model for extremely large structured data. FEAT replaces quadratic attention with a multi-layer dual-axis encoding architecture. It integrates an adaptive-fusion bidirectional state-space model (AFBM) with convolutional gated linear attention (Conv-GLA), enabling cross-tuple contextualization in O(N) time while supporting permutation-invariant representation learning. To improve robustness under real-world data skewness, FEAT further adopts a hybrid structural causal pre-training pipeline with a robust reconstruction objective. Experiments on 12 real-world database benchmarks show that FEAT consistently outperforms representative SFMs on zero-shot tasks and scales linearly with structured-data sample length, achieving up to 50x faster inference latency.

2603.14184 2026-05-21 cs.CV cs.AI 版本更新

Multimodal Optimal Transport for Training-Free Temporal Segmentation in Surgical Robotics

Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Ivan Laptev, Cesare Stefanini

发表机构 * Dept. of Robotics, Mohamed bin Zayed University of AI（机器人系，Mohamed bin Zayed人工智能大学）

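AI summary: This paper proposes TASOT, an annotation-free framework for surgical temporal segmentation. It combines temporally aligned textual descriptions with visual information, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective, and achieves substantial improvements on several public surgical datasets.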

详情

Abstract

Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3, while temporal captions produced by a vision-language model are encoded via CLIP and temporally aligned to individual frames, providing complementary semantic structure to the transport cost. We evaluate TASOT on three public surgical datasets and four benchmark settings spanning laparoscopic and robotic procedures, showing substantial improvements over the strongest zero-shot baselines: +18.9 F1 on Cholec80, +33.7 on AutoLaparo, +23.7 on StrasByPass70, and +4.5 on BernByPass70. These results suggest that fine-grained surgical workflow understanding in robotic settings can be achieved without manual training annotations or surgical-specific pretraining pipelines, offering a promising alternative for real-world robotic surgical systems.

2602.19320 2026-05-21 cs.CL cs.AI version update

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li

Affiliations: University of Texas at Dallas; University of California Davis; Texas A&M University

AI summary: Through a taxonomy and an empirical analysis, this survey examines the architectural and system-level limitations of agentic memory systems. It identifies problems with benchmark saturation, metric validity, backbone-model dependence, and memory-maintenance overhead, and outlines directions for more reliable evaluation and scalable system design.

Abstract

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

2602.18532 2026-05-21 cs.CV cs.AI cs.RO version update

VLANeXt: Recipes for Building Strong VLA Models

Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy

Affiliations: S-Lab, Nanyang Technological University; SenseTime Research; Sun Yat-sen University; Shanghai Jiao Tong University

AI summary: This paper revisits the VLA design space under a unified framework and evaluation setup, systematically analyzing foundational components, perception essentials, and action modelling perspectives. It distills 12 key findings, proposes VLANeXt, a simple and effective VLA model that outperforms prior methods on the LIBERO and LIBERO-plus benchmarks, and provides an easy-to-use codebase.

Comments: Accepted in ICML 2026, Project Page: https://dravenalg.github.io/VLANeXt/

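Abstract (translated from Chinese)

Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging the strong visual and language understanding of vision-language models for general-purpose policy learning. Yet the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space, starting from a simple VLA baseline similar to RT-2, and systematically analyze three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt, which outperforms prior methods on the LIBERO and LIBERO-plus benchmarks and performs well in real-world experiments. We also release a unified, easy-to-use codebase for reproducing our findings, exploring the design space, and developing new VLA variants on a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.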

Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

Affiliations: Department of Computer Science and Information Engineering, National Taiwan University, Taiwan; Institute of Information Science, Academia Sinica, Taiwan; AI Research Center (AINTU), National Taiwan University, Taiwan

AI summary: This study proposes the DIP framework, which generates multiple diverse high-level rationales and induces a final plan from them to improve zero-shot reasoning accuracy, addressing the instability of reasoning paths in conventional chain-of-thought prompting.

Comments: Accepted to Findings of IJCNLP-AACL 2025

2602.04907 2026-05-21 cs.LG cs.AI stat.ME version update

Causal Discovery from Heteroscedastic Stochastic Dynamical Systems under Imperfect Physical Models

Jianhong Chen, Naichen Shi, Xubo Yue

Affiliations: Department of Mechanical & Industrial Engineering; Northeastern University; Department of Industrial Engineering and Management Sciences; Department of Mechanical Engineering; Northwestern University

AI summary: This paper proposes an integrative causal discovery framework that uses partial physical knowledge, encoded in stochastic differential equations, to improve causal graph recovery in dynamical systems, and analyzes its robustness under imperfect physical models.

Comments: 101 pages

Abstract

Causal discovery is a data-driven paradigm for analyzing complex systems, while physics-based models, such as ordinary differential equations (ODEs), provide mechanistic structure for real-world dynamical processes. Integrating these paradigms can improve identifiability, stability, and robustness. However, real dynamical systems often exhibit cyclic interactions and nonstationarity, whereas many causal discovery methods rely on acyclicity, stationarity, or equilibrium assumptions. We propose an integrative causal discovery framework for dynamical systems that leverages partial physical knowledge through stochastic differential equations (SDEs). The drift term encodes known ODE dynamics, while the diffusion term captures unknown causal couplings beyond the prescribed physics. We develop a scalable sparsity-inducing maximum quasi-likelihood estimator with a theoretically justified stabilization technique to improve the optimization landscape. Under mild conditions, we establish causal graph recovery guarantees for both stable and unstable SDEs. We also analyze robustness of our causal graph estimate to ODE misspecification and clarify how the introduced stabilization technique balances numerical stability and statistical recoverability. Experiments on linear SDEs and nonlinear benchmarks, including Lotka-Volterra and Lorenz dynamics with acyclic and cyclic structures, show improved graph recovery and robustness over data-driven baselines. We also demonstrate practical utility on real-world epidemic data by reconstructing stochastic SIR dynamics within our causal discovery framework.

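2602.03004 2026-05-21 cs.LG cs.AI version update

Graph Autoencoder for Process Monitoring

Xiangrui Zhang

Affiliations: School of Information and Control Engineering, China University of Mining and Technology

AI summary: This paper proposes a Causal Graph Spatial-Temporal Autoencoder (CGSTAE) that combines a correlation graph structure learning module based on a spatial self-attention mechanism with a spatial-temporal encoder-decoder module built on graph convolutional long short-term memory (GCLSTM), with the aim of improving the reliability and interpretability of industrial process monitoring.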

Abstract

To improve the reliability and interpretability of industrial process monitoring, this article proposes a Causal Graph Spatial-Temporal Autoencoder (CGSTAE). The network architecture of CGSTAE combines two components: a correlation graph structure learning module based on a spatial self-attention mechanism (SSAM) and a spatial-temporal encoder-decoder module utilizing graph convolutional long short-term memory (GCLSTM). The SSAM learns correlation graphs by capturing dynamic relationships between variables, while a novel three-step causal graph structure learning algorithm is introduced to derive a causal graph from these correlation graphs. The algorithm leverages a reverse perspective of the causal invariance principle to uncover the invariant causal graph from varying correlations. The spatial-temporal encoder-decoder, built with GCLSTM units, reconstructs time-series process data within a sequence-to-sequence framework. The proposed CGSTAE enables effective process monitoring and fault detection through two statistics in the feature space and residual space. Finally, we validate the effectiveness of CGSTAE in process monitoring through the Tennessee Eastman process and a real-world air separation process.

2602.02660 2026-05-21 cs.AI version update

MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, Jinsung Yoon

Affiliations: Google Cloud AI Research

AI summary: This paper proposes the MARS framework, which uses budget-aware planning, modular construction, and comparative reflective memory to address execution cost and performance attribution in complex machine learning engineering tasks, achieving state-of-the-art performance among open-source frameworks on MLE-Bench.

Comments: Paper published at International Conference on Machine Learning (ICML 2026)

Abstract

A critical bottleneck in automating AI research is the execution of complex machine learning engineering (MLE) tasks. MLE differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.

2602.02304 2026-05-21 cs.AI cs.LG version update

Comparing Explanations is Not Enough, Explain the Change: New Standards are Needed to Explain Behavioral Shifts in Large Language Models

Martino Ciaperoni, Marzio Di Vece, Roberto Pellungrini, Luca Pappalardo, Fosca Giannotti, Francesco Giannini

Affiliations: Scuola Normale Superiore; ISTI-CNR; University of Pisa

AI summary: This paper proposes a new XAI approach that aims to explain why and how the behavior of a large language model shifts after an intervention, addressing the inability of existing explanation methods to explain such shifts.

Abstract

Large-scale foundation models exhibit behavioral shifts when subjected to interventions such as scaling, fine-tuning, reinforcement learning with human feedback, or in-context learning. Current explainability methods are structurally ill-suited to explain these shifts, because they either treat models as static objects, as traditional eXplainable AI (XAI) approaches do, or merely compare independent explanations across different checkpoints of a model. As a result, these approaches fail to explain the functional transition between two model instances in which a certain behavior has shifted following an intervention. This gap creates significant governance risks across jurisdictions including the EU AI Act, US state legislation, and Chinese AI regulations, which require documenting causal chains for substantial system modifications. This position paper argues that explaining behavioral shifts in large language models requires a principled approach that treats the shift itself as the primary object of explanation: namely, one that explains how and why an intervention transforms a reference model into an updated model with different behavior. To support this claim, we introduce Comparative XAI (XAI_Δ), a novel XAI paradigm aimed at explaining the difference between two model checkpoints where a behavior has shifted, together with a set of desiderata specifying what XAI_Δ explainers and explanations must satisfy, including comparability, validity, actionability, and monitoring, with the goal of grounding model auditing in explicit, measurable requirements. Finally, we provide preliminary evidence suggesting the need for XAI_Δ in practice through illustrative experiments, compiling the resulting findings into a transition report directly usable for governance and incident documentation.

2602.00933 2026-05-21 cs.SE cs.AI version update

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Chaithanya Bandi, Razvan-Gabriel Dumitru, Ben Hertzberg, Divyansh Agarwal, Geobio Boo, Tejas Polakam, Sami Hassaan, Jeff Da, HiJae Kim, Vipul Gupta, Manasi Sharma, Andrew Park, Martin Dimakis, Ernesto Gabriel Hernandez Montoya, Dan Rambado, Ivan Salazar, Rafael Cruz, MohammadHossein Rezaei, Chetan Rane, Ben Levin, Daniel Yue Zhang, Brad Kenstler, Bing Liu

Affiliations: National University of Singapore; Scale AI

AI summary: This paper presents MCP-Atlas, a benchmark for evaluating tool-use competency against real MCP servers. It contains 1,000 tasks written and verified by human experts, covering 36 real MCP servers and 220 tools, and uses task-level evaluation to characterize how models perform on tool calling and on cognitive abilities.

Comments: 25 pages, 3 figures, 9 tables

Abstract

The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi-step workflows with cross-server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim-level scoring disentangled from agent verbosity or style. We introduce MCP-Atlas, a benchmark for measuring tool-use competency against production MCP servers. MCP-Atlas contains 1,000 natural-language tasks written and verified by human experts spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level rubric, where final answers are scored against atomic factual claims grounded in tool outputs. This answer-centric scoring permits valid alternative tool-call trajectories to receive credit. We pair this with an 11-category diagnostic taxonomy that disentangles tool-call failures from cognitive failures in task understanding, synthesis, parsing, and stopping. Evaluating 20 frontier models from six providers under matched task-level conditions, we find pass rates up to 82.2% at a 0.75 claim coverage threshold and a clear three-tier performance structure. Automated diagnostics show that 63.3% of diagnosed failures are cognitive rather than tool-call related. Notably, several high-performing models fail after successful tool execution due to premature stopping or incorrect synthesis. We release the task schema, containerized harness, claim evaluator, and a 500-task public split, while reserving a 500-task private split to preserve leaderboard integrity. The code is at https://github.com/scaleapi/mcp-atlas.
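
The pass criterion mentioned in the abstract (a task passes when its claim coverage reaches 0.75) can be illustrated with a minimal sketch. This is not the MCP-Atlas evaluator: the `TaskResult` record, the function names, and the choice to treat the threshold as inclusive are all assumptions made for illustration, and deciding whether an answer satisfies a claim is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record: how many rubric claims a final answer satisfied."""
    claims_total: int
    claims_satisfied: int


def claim_coverage(result: TaskResult) -> float:
    """Fraction of a task's atomic claims that the final answer satisfies."""
    if result.claims_total <= 0:
        raise ValueError("a task needs at least one claim")
    return result.claims_satisfied / result.claims_total


def pass_rate(results: list[TaskResult], threshold: float = 0.75) -> float:
    """Share of tasks whose claim coverage meets the threshold (assumed inclusive)."""
    if not results:
        raise ValueError("no results to score")
    passed = sum(claim_coverage(r) >= threshold for r in results)
    return passed / len(results)


# Four toy tasks: coverage 1.0, 0.75, 0.5 and 0.2, so two of the four pass.
results = [TaskResult(4, 4), TaskResult(4, 3), TaskResult(4, 2), TaskResult(5, 1)]
print(pass_rate(results))  # 0.5
```

Because only the final answer is compared against the claims, two agents that reach the same answer through different tool-call trajectories receive the same score under this scheme.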

2601.23086 2026-05-21 cs.AI version update

Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

Nathaniel Mitrani Hadida, Sassan Bhanji, Cameron Tice, Puria Radmard

Affiliations: University of Cambridge; Geodesic Research

AI summary: This paper studies how obfuscation in chain-of-thought (CoT) reasoning generalises, finding that models that learn to obfuscate their reasoning traces carry that obfuscation over to unseen tasks, which reduces model monitorability.

Abstract

Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model's decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise only the model's final actions after closing its CoT. Our findings suggest that current practices of penalising harmful generations may inadvertently lead to a reduction in the broader monitorability of LLMs in unpredictable ways.

2601.04068 2026-05-21 cs.CV cs.AI version update

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

Affiliations: Harbin Institute of Technology; Alibaba Group - Taobao & Tmall Group

AI summary: This paper proposes LocalDPO, a new post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level to improve video generation quality and human preference scores.

Comments: Accepted by CVPR 2026

Abstract

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that efficiently collects preference data, generating preference pairs with a single inference per prompt and eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment. The code is available at https://github.com/1170300714/Local-DPO.

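2601.00473 2026-05-21 cs.LG cs.AI version update

CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing

Jianfei Li, Ines Rosellon-Inclan, Gitta Kutyniok, Jean-Luc Starck

Affiliations: Munich Center for Machine Learning (MCML); University of Tromsø; German Aerospace Center (DLR); Foundation for Research and Technology Hellas (FORTH)

AI summary: This paper proposes CHEM, a method for quantifying and characterizing hallucinated artifacts in image reconstruction models. It localizes hallucination-prone regions using wavelet and shearlet representations, assesses hallucination levels with conformalized quantile regression, and analyzes why U-shaped networks are prone to hallucinated predictions.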

Abstract

Deep learning-based methods have recently achieved significant success in image reconstruction problems. However, challenges have emerged, as these methods may generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a framework for quantifying and characterizing hallucinated artifacts in image reconstruction models. The proposed method, termed the Conformal Hallucination Estimation Metric (CHEM), enables the identification of hallucination-prone regions in model predictions. It leverages wavelet and shearlet representations to localize such regions at the level of image features, and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. A theoretical analysis is provided, characterizing the sensitivity of CHEM to hallucinated artifacts and its relationship to the mean squared error. Building on these insights and adopting a viewpoint grounded in approximation theory, we investigate why U-shaped networks, widely used architectures for image reconstruction, tend to produce hallucination-prone predictions. We assess the effectiveness of the proposed approach on astronomical image deconvolution using the CANDELS dataset with architectures such as U-Net, SwinUNet, and Learnlets, and on natural image super-resolution using the DIV2K dataset with models such as DRUNet, Unfolded DRS, RAM, and DPS.
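
For readers unfamiliar with conformalized quantile regression, the calibration step of the generic split-conformal recipe can be sketched as follows. This is not the CHEM implementation: the function names are hypothetical, the inputs are assumed to be lower and upper quantile predictions from some already-trained model, and the wavelet/shearlet localization described in the abstract is not shown.

```python
import numpy as np


def cqr_calibrate(lo_cal, hi_cal, y_cal, alpha=0.1):
    """Return the conformal correction q for predicted quantile bands.

    lo_cal, hi_cal: lower/upper quantile predictions on a held-out calibration set.
    y_cal: the matching ground-truth values.
    """
    lo_cal, hi_cal, y_cal = (np.asarray(a, dtype=float) for a in (lo_cal, hi_cal, y_cal))
    # Conformity score: distance by which the truth falls outside the band
    # (negative when the truth lies strictly inside it).
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = scores.size
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points for this alpha
    return np.sort(scores)[k - 1]


def cqr_interval(lo_pred, hi_pred, q):
    """Widen (or shrink, if q < 0) the predicted band by the calibrated q."""
    return np.asarray(lo_pred, dtype=float) - q, np.asarray(hi_pred, dtype=float) + q


# Toy calibration set with a constant predicted band [0, 1].
y_cal = [0.5, 0.25, 1.5, -0.25, 0.75]
q = cqr_calibrate(np.zeros(5), np.ones(5), y_cal, alpha=0.2)
lower, upper = cqr_interval([0.0], [1.0], q)
print(q, lower, upper)
```

The guarantee is distribution-free in the sense that it relies only on exchangeability of calibration and test points, not on a parametric noise model; how the resulting band is turned into a hallucination level is specific to the paper and not reproduced here.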

2510.23538 2026-05-21 cs.AI cs.CL cs.CV cs.SE version update

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan

Affiliations: The University of Hong Kong; Shanghai AI Laboratory; Nanjing University; Carnegie Mellon University; Shanghai Innovation Institute

AI summary: This paper proposes JanusCoder, a visual-programmatic interface for code intelligence. By building a large-scale multimodal code corpus and a unified model, it generates code from textual instructions, visual inputs, or both, and shows strong performance on text-centric and vision-centric coding tasks.

Comments: ICLR 2026 Camera Ready Version, with code and data available

Abstract

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.

2510.18034 2026-05-21 cs.CV cs.AI cs.RO version update

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

Affiliations: Professorship of Autonomous Vehicle Systems, TUM School of Engineering and Design, Technical University of Munich, Munich, Germany

AI summary: This paper proposes the SAVANT framework, which uses structured reasoning to improve VLM performance on semantic anomaly detection, enabling more accurate identification of rare anomalous situations in autonomous driving scenes.

Comments: 8 pages, 5 figures

Abstract

Autonomous driving systems remain critically vulnerable to the long tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models, limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline (structured scene description extraction and multi-modal evaluation), existing VLMs improve their scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLMs' absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy, surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, we provide a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://TUM-AVS.github.io/SAVANT/.

2510.17269 2026-05-21 cs.CV cs.AI version update

FineVision: Open Data Is All You Need

Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti

Affiliations: Hugging Face; Technical University of Munich; Stanford University

AI summary: This paper presents FineVision, a high-quality corpus of 24 million samples that unifies more than 200 sources through a semi-automated pipeline, with rigorous data cleaning and human review to ensure quality. Models trained on it perform better across a broad evaluation suite, supporting data-centric research on vision-language models.

Abstract

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.

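2510.14444 2026-05-21 cs.LG cs.AI version update

Where Did the Tokens Go? Understanding Pruning Behavior in STEP at High Resolution

Michal Szczepanski, Martyna Poreba, Karim Haroun

Affiliations: Université Paris-Saclay, CEA, List; I3S, Université Côte d’Azur, CNRS

AI summary: This paper proposes STEP, a framework that combines dynamic patch merging with token pruning to improve efficiency, substantially reducing computational cost and raising throughput on high-resolution semantic segmentation while retaining high accuracy.

Journal ref SN Computer Science 2026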

DOI: 10.1007/s42979-025-04707-6

AI中文摘要

视觉变换器（ViTs）在语义分割任务中实现了最先进的性能，但受到高计算和内存成本的限制。为了解决这一问题，我们提出了STEP（SuperToken和Early-Pruning），一种混合的令牌减少框架，结合动态补丁合并和令牌剪枝，以提高效率而不显著牺牲准确性。STEP的核心是dCTS，一个轻量级的CNN基政策网络，能够灵活地合并为超补丁。编码器块也集成了早期退出，以移除高置信度的超令牌，从而降低计算负载。我们在高分辨率语义分割基准上评估了我们的方法，包括高达1024x1024像素的图像，并显示当仅应用dCTS时，令牌数量可以比标准的16x16像素补丁方案减少2.5倍。这在使用ViT-Large作为骨干时，导致计算成本减少2.6倍，吞吐量增加3.4倍。应用完整的STEP框架进一步提高效率，达到计算复杂度减少4倍，推理速度提高1.7倍，最大精度下降不超过2.0%。通过提出的STEP配置，可以自信地在到达最终编码器层之前停止多达40%的令牌。

英文摘要

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks also integrate early exits to remove high-confidence supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
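
As a rough check on the token counts quoted in this abstract, the sketch below works through the arithmetic for a 1024 x 1024 input. It assumes a square grid of non-overlapping 16 x 16 patches, and the variable names are illustrative only; the 2.5x factor is the one reported for dCTS applied alone.

```python
# Baseline ViT tokenization: non-overlapping 16 x 16 patches (assumed square grid).
image_side = 1024
patch_side = 16
baseline_tokens = (image_side // patch_side) ** 2  # 64 * 64 patches

# Reduction factor quoted in the abstract when dCTS is applied alone.
dcts_reduction = 2.5
tokens_after_dcts = baseline_tokens / dcts_reduction

print(baseline_tokens)    # 4096
print(tokens_after_dcts)  # 1638.4, i.e. roughly 1,600 tokens on average
```

The fractional result is an average implied by the quoted factor, not a count any single image would produce.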

2509.09215 2026-05-21 cs.AI cs.CR version update

Enabling Regulatory Multi-Agent Collaboration: Architecture, Challenges, and Solutions

Qinnan Hu, Yuntao Wang, Yuan Gao, Zhou Su, Linkang Du, Qichao Xu

Affiliations: School of Cyber Science and Engineering; Xi'an Jiaotong University; School of Mechatronic Engineering and Automation; Shanghai University

AI summary: This paper proposes a blockchain-enabled layered architecture for regulatory agent collaboration and designs three key modules for automated accountability, dynamic reputation evaluation, and malicious behavior forecasting, establishing a trustworthy, resilient, and scalable regulatory mechanism.

Comments This work has been submitted to the IEEE for possible publication

AI中文摘要

大型语言模型（LLMs）赋能的自主智能体正在通过实现适应性、多智能体协作改变数字和物理环境。尽管这些智能体在金融、医疗和智能制造等领域提供了显著机会，但其不可预测的行为和异构能力带来了重大治理和问责挑战。在本文中，我们提出了一种基于区块链的分层架构，用于监管智能体协作，包括智能体层、区块链数据层和监管应用层。在此框架内，我们设计了三个关键模块：（i）智能体行为追踪和仲裁模块用于自动化问责，（ii）动态声誉评估模块用于协作场景中的信任评估，（iii）恶意行为预测模块用于早期检测对抗性活动。我们的方法建立了在大规模智能体生态系统中可信、健壮和可扩展的监管机制的系统基础。最后，我们讨论了区块链赋能的监管框架在多智能体系统中的未来研究方向。

英文摘要

Autonomous agents empowered by large language models (LLMs) are transforming both digital and physical environments by enabling adaptive, multi-agent collaboration. While these agents offer significant opportunities across domains such as finance, healthcare, and smart manufacturing, their unpredictable behaviors and heterogeneous capabilities pose substantial governance and accountability challenges. In this paper, we propose a blockchain-enabled layered architecture for regulatory agent collaboration, comprising an agent layer, a blockchain data layer, and a regulatory application layer. Within this framework, we design three key modules: (i) an agent behavior tracing and arbitration module for automated accountability, (ii) a dynamic reputation evaluation module for trust assessment in collaborative scenarios, and (iii) a malicious behavior forecasting module for early detection of adversarial activities. Our approach establishes a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in large-scale agent ecosystems. Finally, we discuss future research directions for blockchain-enabled regulatory frameworks in multi-agent systems.

2509.08010 2026-05-21 cs.CY cs.AI cs.CL cs.HC version update

Measuring and mitigating overreliance to build human-compatible AI

Lujain Ibrahim, Katherine M. Collins, Sunnie S. Y. Kim, Anka Reuel, Max Lamparth, Kevin Feng, Lama Ahmad, Prajna Soni, Alia El Kattan, Merlin Stein, Siddharth Swaroop, Vishakh Padmakumar, Ilia Sucholutsky, Andrew Strait, Diyi Yang, Q. Vera Liao, Umang Bhatt

AI summary: This paper examines the risks of overreliance on large language models and discusses approaches to measuring and mitigating it, so that AI augments rather than undermines human capabilities.

AI中文摘要

大型语言模型（LLMs）通过作为协作的『思想伙伴』而区别于先前的技术，能够在多种任务上更流畅地进行自然语言交互。随着LLMs在医疗、个人建议等不同领域中日益影响关键决策，过度依赖LLMs的风险也随之增加。本文认为，测量和缓解过度依赖必须成为LLMs研究和部署的核心。首先，我们汇总了个体和社会层面的过度依赖风险，包括高风险错误、治理挑战和认知退化。然后，我们探讨了LLMs的特点、系统设计特征和用户认知偏见，这些因素共同引发了关于实际中过度依赖LLMs的严重且独特的问题。我们还审查了历史上的过度依赖测量方法，识别出三个重要的差距，并提出三个有前景的方向来改进测量。最后，我们提出了可以采取的缓解策略，以确保LLMs增强而非削弱人类能力。

英文摘要

Large language models (LLMs) distinguish themselves from previous technologies by functioning as collaborative "thought partners," capable of engaging more fluidly in natural language on a range of tasks. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance (relying on LLMs beyond their capabilities) grows. This paper argues that measuring and mitigating overreliance must become central to LLM research and deployment. First, we consolidate risks from overreliance at both the individual and societal levels, including high-stakes errors, governance challenges, and cognitive deskilling. Then, we explore LLM characteristics, system design features, and user cognitive biases that together raise serious and unique concerns about overreliance on LLMs in practice. We also examine historical approaches for measuring overreliance, identifying three important gaps and proposing three promising directions to improve measurement. Finally, we propose mitigation strategies that can be pursued to ensure LLMs augment rather than undermine human capabilities.

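2509.00303 2026-05-21 cs.DB cs.AI cs.IR version update

Access Paths for Efficient Ordering with Large Language Models

Fuheng Zhao, Jiayue Chen, Yiming Pan, Tahseen Rabbani, Sohaib, Divyakant Agrawal, Amr El Abbadi, Paritosh Aggarwal, Anupam Datta, Dimitris Tsirogiannis

Affiliations: Snowflake Inc.; University of Chicago; UC Los Angeles; UC Santa Barbara

AI summary: This paper proposes an LLM-based semantic ordering operator and systematically studies its physical implementations. After improving existing semantic sorting algorithms and introducing a semantic-aware external merge sort, the study finds that no single implementation is optimal on all datasets. It therefore designs a budget-aware optimizer that uses heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to dynamically select a near-optimal access path. Experiments show that the optimizer matches or exceeds the ranking accuracy of the best static methods on all benchmarks.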

AI中文摘要

在本工作中，我们提出了LLM ORDER BY语义运算符作为一种逻辑抽象，并对其物理实现进行了系统研究。首先，我们对现有的语义排序算法进行了若干改进，并引入了一种语义感知的外部归并排序算法。我们的广泛评估表明，没有单一的实现能在所有数据集上提供普遍最优性。从我们的评估中，我们观察到基于比较的算法中排序成本与排序质量之间存在一种通用的时间尺度关系。基于这些见解，我们设计了一个预算感知的优化器，该优化器利用启发式规则、LLM-as-Judge评估和共识聚合来动态选择LLM ORDER BY的近最优访问路径。在我们的广泛评估中，我们的优化器在所有基准测试中均能实现与最佳静态方法相当或更优的排名准确性。我们相信，这项工作为构建稳健、大规模的LLM驱动分析系统中的语义运算符原则性优化提供了基础性见解。

英文摘要

In this work, we present the LLM ORDER BY semantic operator as a logical abstraction and conduct a systematic study of its physical implementations. First, we propose several improvements to existing semantic sorting algorithms and introduce a semantic-aware external merge sort algorithm. Our extensive evaluation reveals that no single implementation offers universal optimality on all datasets. From our evaluations, we observe a general test-time scaling relationship between sorting cost and the ordering quality for comparison-based algorithms. Building on these insights, we design a budget-aware optimizer that utilizes heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to dynamically select the near-optimal access path for LLM ORDER BY. In our extensive evaluations, our optimizer consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks. We believe that this work provides foundational insights into the principled optimization of semantic operators essential for building robust, large-scale LLM-powered analytic systems.
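
To make the idea of a comparison-based access path concrete, here is a minimal sketch of an external merge sort driven by a pluggable pairwise comparator. It is not the paper's semantic-aware algorithm, whose details the abstract does not give. The `llm_compare` function is a hypothetical stand-in for an LLM judgment (here it simply compares string lengths), and `run_size` is an illustrative parameter.

```python
import heapq
from functools import cmp_to_key

def llm_compare(a, b):
    # Hypothetical stand-in for an LLM pairwise judgment.
    # Returns -1, 0, or 1; here the "semantic" order is just string length.
    return (len(a) > len(b)) - (len(a) < len(b))

def external_merge_sort(items, run_size, compare):
    key = cmp_to_key(compare)
    # Phase 1: sort fixed-size runs independently (each run fits "in memory").
    runs = [sorted(items[i:i + run_size], key=key)
            for i in range(0, len(items), run_size)]
    # Phase 2: k-way merge of the sorted runs using the same comparator.
    return list(heapq.merge(*runs, key=key))

items = ["ccc", "a", "dddd", "bb", "eeeee"]
ordered = external_merge_sort(items, run_size=2, compare=llm_compare)
print(ordered)  # ['a', 'bb', 'ccc', 'dddd', 'eeeee']
```

In a real system each comparator call would cost an LLM invocation, which is why the number of comparisons an access path needs matters for the cost and quality trade-off the abstract describes.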

2508.11354 2026-05-21 cs.CV cs.AI cs.LG version update

FunduSegmenter: Leveraging the RETFound Foundation Model for Joint Optic Disc and Optic Cup Segmentation in Retinal Fundus Images

Zhenyi Zhao, Muthu Rama Krishnan Mookiah, Emanuele Trucco

Affiliations: University of Dundee

AI summary: This paper proposes FunduSegmenter, a model built on the RETFound foundation model that adds a series of novel modules for joint optic disc and optic cup segmentation; experiments show it outperforms existing methods across multiple datasets.

Journal ref Trans. Vis. Sci. Tech. 2026;15(5):14

DOI: 10.1167/tvst.15.5.14

AI中文摘要

目的：本研究首次将RETFound模型应用于视盘（OD）和视杯（OC）的联合分割。RETFound是一个为眼底相机和光学相干断层扫描图像开发的知名基础模型，已在疾病诊断中表现出色。方法：我们提出FunduSegmenter，该模型整合了一系列新颖模块与RETFound，包括预适配器、解码器、后适配器、带有卷积块注意模块的跳跃连接以及视觉Transformer块适配器。该模型在自有数据集GoDARTS以及四个公开数据集IDRiD、Drishti-GS、RIM-ONE-r3和REFUGE上进行了评估，通过内部验证、外部验证和领域泛化实验进行验证。结果：在内部验证中，平均Dice相似系数达到90.51%，优于所有基线方法，其中nnU-Net为82.91%，DUNet为89.17%，TransUNet为87.91%。在所有外部验证实验中，平均结果比最佳基线高约3%，且在领域泛化中也具有竞争力。结论：本研究探讨了RETFound通过学习潜在通用表示在眼底相机图像中进行OD和OC分割的潜力。我们的FunduSegmenter在整体上优于现有最先进基线方法。所提出的模块是通用的，可以扩展到其他基础模型的微调。临床相关性：该模型在分布内和分布外数据上均表现出强大的稳定性与泛化能力，提供了稳定的OD和OC分割。这是许多自动化任务的关键步骤，从设置准确的视网膜坐标到生物标志物发现。代码和训练权重可在：https://github.com/JusticeZzy/FunduSegmenter上获得。

英文摘要

Purpose: This study introduces the first adaptation of RETFound for joint optic disc (OD) and optic cup (OC) segmentation. RETFound is a well-known foundation model developed for fundus camera and optical coherence tomography images, which has shown promising performance in disease diagnosis. Methods: We propose FunduSegmenter, a model integrating a series of novel modules with RETFound, including a Pre-adapter, a Decoder, a Post-adapter, skip connections with Convolutional Block Attention Module and a Vision Transformer block adapter. The model is evaluated on a proprietary dataset, GoDARTS, and four public datasets, IDRiD, Drishti-GS, RIM-ONE-r3, and REFUGE, through internal verification, external verification and domain generalization experiments. Results: An average Dice similarity coefficient of 90.51% was achieved in internal verification, which outperformed all baselines, some substantially (nnU-Net: 82.91%; DUNet: 89.17%; TransUNet: 87.91%). In all external verification experiments, the average results were about 3% higher than those of the best baseline, and our model was also competitive in domain generalization. Conclusions: This study explored the potential of the latent general representations learned by RETFound for OD and OC segmentation in fundus camera images. Our FunduSegmenter generally outperformed state-of-the-art baseline methods. The proposed modules are general and can be extended to fine-tuning other foundation models. Translational Relevance: The model shows strong stability and generalization on both in-distribution and out-of-distribution data, providing stable OD and OC segmentation. This is an essential step for many automated tasks, from setting the accurate retinal coordinate to biomarker discovery. The code and trained weights are available at: https://github.com/JusticeZzy/FunduSegmenter.
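
For readers unfamiliar with the metric reported above, the Dice similarity coefficient between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it for masks represented as sets of pixel coordinates; the function name and the toy masks are illustrative and do not come from the paper's code.

```python
def dice_coefficient(pred_pixels, target_pixels):
    # Masks are given as collections of (row, col) foreground coordinates.
    pred_set, target_set = set(pred_pixels), set(target_pixels)
    if not pred_set and not target_set:
        return 1.0  # convention: two empty masks agree perfectly
    overlap = len(pred_set & target_set)
    return 2 * overlap / (len(pred_set) + len(target_set))

# Toy example: two 3-pixel masks sharing 2 pixels.
pred = {(0, 0), (0, 1), (1, 1)}
target = {(0, 1), (1, 1), (1, 0)}
score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means the masks coincide exactly, and 0.0 means they share no foreground pixels.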

2508.09001 2026-05-21 cs.CL cs.AI cs.LG version update

Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han

Affiliations: Graduate School of Artificial Intelligence; Ulsan National Institute of Science and Technology (UNIST); Ulsan, South Korea

AI summary: This paper proposes the Strict Subgoal Execution (SSE) framework, which uses Frontier Experience Replay (FER) to separate unreachable from admissible subgoals and streamline high-level decision making, yielding more reliable planning on long-horizon tasks.

Comments 10 pages for main, 26 pages for total, Accepted to ICLR 2026

Journal ref International Conference on Learning Representations (ICLR), 2026

AI中文摘要

长horizon目标条件任务对强化学习（RL）提出了根本性挑战，特别是在目标遥远且奖励稀疏的情况下。虽然分层和图基方法提供了部分解决方案，但它们对传统 hindsight relabeling 的依赖往往无法纠正子目标不可行性，导致高层规划效率低下。为此，我们提出严格子目标执行（SSE），一种基于图的分层RL框架，整合前沿经验回放（FER）以分离不可达与可接受的子目标，并优化高层决策。FER利用失败和部分成功转移确定可达性前沿，识别不可靠的子目标，提高子目标可靠性，并减少不必要的高层决策。此外，SSE采用解耦探索策略以覆盖目标空间的未探索区域，并通过路径细化调整边成本以利用观察到的低层失败。在多样化的长horizon基准测试中，SSE在效率和成功率方面均优于现有目标条件和分层RL方法。我们的代码可在 https://jaebak1996.github.io/SSE/ 上获得。

英文摘要

Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, their reliance on conventional hindsight relabeling often fails to correct subgoal infeasibility, leading to inefficient high-level planning. To address this, we propose Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that integrates Frontier Experience Replay (FER) to separate unreachable from admissible subgoals and streamline high-level decision making. FER delineates the reachability frontier using failure and partial-success transitions, which identifies unreliable subgoals, increases subgoal reliability, and reduces unnecessary high-level decisions. Additionally, SSE employs a decoupled exploration policy to cover underexplored regions of the goal space and a path refinement that adjusts edge costs using observed low-level failures. Experimental results across diverse long-horizon benchmarks show that SSE consistently outperforms existing goal-conditioned and hierarchical RL methods in both efficiency and success rate. Our code is available at https://jaebak1996.github.io/SSE/

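2506.17631 2026-05-21 cs.LG cs.AI version update

Time-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting

Zesen Wang, Lijuan Lan, Yonggang Li

Affiliations: Central South University, Changsha, China

AI summary: This paper proposes Time-Prompt, a framework that improves time series forecasting by building a unified prompt paradigm, designing a semantic space embedding and cross-modal alignment module, and efficiently fine-tuning the LLM's parameters; its effectiveness is also validated on carbon emission datasets.

Comments Accepted at IJCNN 2026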

AI中文摘要

时间序列预测旨在建模变量间的时序依赖关系以推断未来状态，对现实世界场景具有重要性和广泛应用。尽管基于深度学习的方法已取得显著进展，但其在长期预测中仍表现不佳。最近研究表明，大型语言模型（LLMs）在时间序列预测中表现出色，但其在该任务中的实用性仍存疑。为此，我们提出Time-Prompt框架，旨在激活LLMs进行时间序列预测。具体而言，我们首先构建了一个统一的提示范式，利用可学习的软提示引导LLM的行为，并利用文本化的硬提示增强时间序列表示。其次，为了增强LLM对预测任务的全面理解，我们设计了一个语义空间嵌入和跨模态对齐模块，以实现时序和文本数据的融合。最后，我们利用时间序列数据高效地微调LLM的参数。此外，我们专注于碳排放领域，旨在为全球碳中和做出贡献。在6个公开数据集和3个碳排放数据集上的综合评估表明，Time-Prompt是一个强大的时间序列预测框架。

英文摘要

Time series forecasting aims to model temporal dependencies among variables for future state inference, holding significant importance and widespread applications in real-world scenarios. Although deep learning-based methods have achieved remarkable progress, they still exhibit suboptimal performance in long-term forecasting. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting, but this progress is still met with skepticism about whether LLMs are truly useful for this task. To address this, we propose Time-Prompt, a framework for activating LLMs for time series forecasting. Specifically, we first construct a unified prompt paradigm with learnable soft prompts to guide the LLM's behavior and textualized hard prompts to enhance the time series representations. Second, to enhance the LLM's comprehensive understanding of the forecasting task, we design a semantic space embedding and cross-modal alignment module to achieve fusion of temporal and textual data. Finally, we efficiently fine-tune the LLM's parameters using time series data. Furthermore, we focus on carbon emissions, aiming to provide a modest contribution to global carbon neutrality. Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate that Time-Prompt is a powerful framework for time series forecasting.

2505.19075 2026-05-21 cs.AI cs.CL cs.LG version update

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye

Affiliations: Graduate School of Artificial Intelligence, Korea Advanced Institute of Science and Technology

AI summary: This paper proposes Universal Reasoner, a composable, plug-and-play reasoning module that provides specialized reasoning capabilities on top of frozen large language models with a shared or aligned token space and exhibits weak-to-strong generalization; experiments show it outperforms existing fine-tuning methods on mathematical reasoning and machine translation.

Comments ICML 2026

英文摘要

Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR), a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalize across domains such as in vision language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods. Code is open-sourced at https://github.com/hangeol/UniR.
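
The additive combination described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the released UniR code: the three-token vocabulary, the logit values, and the function names are all made up, and any per-module weighting the authors may use is omitted.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_logits(backbone_logits, module_logits_list):
    # Sum the frozen backbone's logits with those of each reasoning module.
    combined = list(backbone_logits)
    for module_logits in module_logits_list:
        if len(module_logits) != len(combined):
            raise ValueError("modules must share the backbone's token space")
        combined = [c + m for c, m in zip(combined, module_logits)]
    return combined

# Toy next-token logits over a 3-token vocabulary (hypothetical values).
backbone = [2.0, 1.0, 0.0]
math_module = [0.0, 2.0, 0.0]
translation_module = [0.0, 0.0, 0.5]

combined = combine_logits(backbone, [math_module, translation_module])
probs = softmax(combined)
next_token = max(range(len(probs)), key=probs.__getitem__)
print(combined, next_token)  # [2.0, 3.0, 0.5] 1
```

The length check reflects the shared or aligned token space the abstract requires: logits can only be summed position by position when every module scores the same vocabulary.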

2505.14654 2026-05-21 cs.CV cs.AI cs.CL version update

Beyond Words: Multimodal LLM Knows When to Speak

Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin

Affiliations: Department of Computer Science, Stony Brook University; Atmee AI

AI summary: This paper proposes a multimodal strategy that uses synchronized video, audio, and text cues to improve awareness of response timing in dialogue, thereby improving the response accuracy of large language models in conversation.

Comments Project page: https://github.com/lzk901372/MM-When2Speak

AI中文摘要

基于大语言模型（LLMs）的聊天机器人能够生成流畅的响应，但在何时发言的问题上常常遇到困难，尤其是在对话过程中需要简短及时的听众反应时。我们提出了一种多模态策略，利用同步的视频、音频和文本线索来改进对话中的时间感知能力。该策略将响应时间重新表述为密集响应类型预测任务，使智能体能够在流式约束下决定是否保持沉默、生成简短反应或开始完整响应。因此，我们引入了一个经过精心挑选的多模态数据集，该数据集来自真实世界的双人对话视频，具有时间对齐的多模态数据和细粒度的反应类型注释。此外，我们设计了一种多模态策略MM-When2Speak，在LLM骨干网络上增加了多模态集成模块。在各种模态设置和强大的LLM基线模型上的实验表明，MM-When2Speak在响应类型预测性能上实现了高达3倍的提升，突显了多模态感知在自然和吸引人的对话交互中的重要性。

英文摘要

Chatbots via large language models (LLMs) generate fluent responses but often struggle with when to speak, especially for brief, timely listener reactions during ongoing dialogue. We present a multimodal strategy for LLMs, which leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. Therefore, we introduce a curated multimodal dataset from real-world dyadic conversational videos with temporally aligned modalities and fine-grained reaction type annotations. Moreover, we design a multimodal strategy, MM-When2Speak, with a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.

2504.13048 2026-05-21 cond-mat.mtrl-sci cs.AI version update

Design Topological Materials by Reinforcement Fine-Tuned Generative Model

Haosheng Xu, Dongheng Qian, Zhixuan Liu, Yadong Jiang, Jing Wang

Affiliations: State Key Laboratory of Surface Physics; Department of Physics, Fudan University, Shanghai 200433, China; Shanghai Research Center for Quantum Sciences, Shanghai 201315, China; Institute for Nanoelectronic Devices; Quantum Computing, Fudan University, Shanghai 200433, China; Hefei National Laboratory, Hefei 230088, China

AI summary: This paper proposes designing topological insulators and topological crystalline insulators with a reinforcement fine-tuned generative model, and demonstrates that the approach is effective at generating new topological materials with a full band gap, with Ge₂Bi₂O₆ as a representative topological insulator.

Journal ref Nature Communications (2026)

DOI: 10.1038/s41467-026-73321-8

AI中文摘要

拓扑绝缘体（TIs）和拓扑晶体绝缘体（TCIs）是具有非常规电子性质的材料，其发现对实际应用具有高度价值。然而，特别是具有完整能隙的此类材料仍然稀少。鉴于传统方法在已知材料中扫描候选材料的局限性，我们专注于通过生成模型生成新拓扑材料。具体而言，我们应用强化微调（ReFT）到预训练的生成模型，从而将模型的目标与材料设计目标对齐。我们证明ReFT在增强模型生成TIs和TCIs的能力方面是有效的，且对生成材料的稳定性影响很小。使用微调后的模型，我们成功识别了大量新的拓扑材料，Ge₂Bi₂O₆作为代表性的例子——一个具有0.26 eV完整能隙的TI，是该类材料中已知的最大之一。

英文摘要

Topological insulators (TIs) and topological crystalline insulators (TCIs) are materials with unconventional electronic properties, making their discovery highly valuable for practical applications. However, such materials, particularly those with a full band gap, remain scarce. Given the limitations of traditional approaches that scan known materials for candidates, we focus on the generation of new topological materials through a generative model. Specifically, we apply reinforcement fine-tuning (ReFT) to a pre-trained generative model, thereby aligning the model's objectives with our material design goals. We demonstrate that ReFT is effective in enhancing the model's ability to generate TIs and TCIs, with minimal compromise on the stability of the generated materials. Using the fine-tuned model, we successfully identify a large number of new topological materials, with Ge$_2$Bi$_2$O$_6$ serving as a representative example: a TI with a full band gap of 0.26 eV, ranking among the largest known in this category.

2504.06925 2026-05-21 cs.CV cs.AI version update

Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition

Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales

Affiliations: Biometrics and Data Pattern Analytics Lab, Universidad Autonoma de Madrid; IMDEA Food, CEI UAM+CSIC

AI summary: This paper evaluates six state-of-the-art vision-language models on food recognition at different levels of granularity, proposes a new evaluation metric, and shows the potential of the FoodNExTDB database for dietary assessment.

Comments Accepted at IEEE/CVF Computer Vision and Pattern Recognition Conference workshops 2025 (CVPRw) 10 pages, 4 figures, 2 tables

Journal ref 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1-10

DOI: 10.1109/CVPRW67362.2025.00047

AI中文摘要

基于食品图像的自动饮食评估仍是一个挑战，需要精确的食品检测、分割和分类。视觉-语言模型（VLMs）通过整合视觉和文本推理提供了新的可能性。在本研究中，我们评估了六种最先进的VLMs（ChatGPT、Gemini、Claude、Moondream、DeepSeek和LLaVA），分析它们在不同层次上的食品识别能力。在实验框架中，我们引入了FoodNExTDB，一个独特的食品图像数据库，包含9,263张由专家标注的图像，涵盖10个类别（例如“蛋白质来源”）、62个子类别（例如“家禽”）和9种烹饪风格（例如“烤制”）。总共，FoodNExTDB包括50,000个由七位专家生成的营养标签，这些标签由手动标注所有数据库中的图像生成。此外，我们提出了一种新的评估指标，专家加权召回率（EWR），该指标考虑了不同标注者之间的差异。结果表明，封闭源模型在识别包含单一产品的图像中的食品产品时，性能优于开源模型，达到了超过90%的EWR。尽管有潜力，当前VLMs在细粒度食品识别方面面临挑战，特别是在区分烹饪风格的细微差异和视觉相似的食品项目时，这限制了它们在自动饮食评估中的可靠性。FoodNExTDB数据库在https://github.com/AI4Food/FoodNExtDB上公开可用。

英文摘要

Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.

2503.22693 2026-05-21 q-fin.ST cs.AI cs.CL version update

Bridging Language Models and Financial Analysis

Alejandro Lopez-Lira, Jihoon Kwon, Sangwoon Yoon, Jy-yong Sohn, Chanyeol Choi

Affiliations: University of Florida; Ministry of Justice, Republic of Korea; Yonsei University; LinqAlpha

AI summary: This survey reviews recent advances in language model research and examines their potential applications in finance, aiming to close the gap between research progress and the practical adoption of language models in the financial industry.

Comments 28 pages

AI中文摘要

大规模语言模型（LLMs）的快速进步为自然语言处理领域带来了革命性可能性，特别是在金融领域。金融数据通常嵌套在文本内容、数值表格和视觉图表之间复杂的相互关系中，这对传统方法来说是一个挑战。然而，LLMs的出现为处理和分析这种多维数据提供了更高效和深入的途径。尽管LLMs研究进展迅速，但在金融行业中的实际应用仍存在显著差距，因为金融行业更倾向于谨慎整合和长期验证。这种差异导致新兴LLM技术的实施速度较慢，尽管它们在金融应用中具有巨大潜力。因此，许多最新的LLM技术进展仍未被充分探索或利用。本文旨在通过提供对最近LLM研究进展的全面概述，并探讨其在金融领域的适用性，来弥合这一差距。基于之前的文献综述，我们突出几种新的LLM方法，探讨其独特的功能及其在金融数据分析中的潜在相关性。通过综合广泛研究的见解，本文旨在为研究人员和从业者提供有价值的资源，指出有前途的研究方向，并概述未来进一步推进LLM在金融应用中的机会。

英文摘要

The rapid advancements in Large Language Models (LLMs) have unlocked transformative possibilities in natural language processing, particularly within the financial sector. Financial data is often embedded in intricate relationships across textual content, numerical tables, and visual charts, posing challenges that traditional methods struggle to address effectively. However, the emergence of LLMs offers new pathways for processing and analyzing this multifaceted data with increased efficiency and insight. Despite the fast pace of innovation in LLM research, there remains a significant gap in their practical adoption within the finance industry, where cautious integration and long-term validation are prioritized. This disparity has led to a slower implementation of emerging LLM techniques, despite their immense potential in financial applications. As a result, many of the latest advancements in LLM technology remain underexplored or not fully utilized in this domain. This survey seeks to bridge this gap by providing a comprehensive overview of recent developments in LLM research and examining their applicability to the financial sector. Building on previous survey literature, we highlight several novel LLM methodologies, exploring their distinctive capabilities and their potential relevance to financial data analysis. By synthesizing insights from a broad range of studies, this paper aims to serve as a valuable resource for researchers and practitioners, offering direction on promising research avenues and outlining future opportunities for advancing LLM applications in finance.

2503.08292 2026-05-21 cs.CL cs.AI version update

Do LLMs Triage Like Clinicians? A Dynamic Study of Outpatient Referral

Xiaoxiao Liu, Qingying Xiao, Bingquan Zhang, Junying Chen, Xiangyi Feng, Ziniu Li, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang

Affiliations: Chinese University of Hong Kong, Shenzhen; Bournemouth University; National Health Data Institute, Shenzhen; Shenzhen Research Institute of Big Data

AI summary: This paper studies how large language models perform in dynamic outpatient referral and finds that, in dynamic scenarios, they outperform traditional classifiers by asking effective questions that reduce uncertainty, while offering limited advantages in static scenarios.

AI中文摘要

门诊会诊（OR）是一种核心临床流程，将患者分配到医院部门，在信息不完整且不断演变的情况下进行，但通常被简化为静态分类问题，尽管实际上是交互性的。在本工作中，我们将门诊会诊视为由信息获取和不确定性降低驱动的动态过程。我们分析了基于固定患者信息的静态场景和涉及多轮对话的动态场景，以测试大语言模型（LLMs）是否通过更好的预测或更有效的提问来改善分诊结果。我们的发现表明，LLMs在静态分诊准确性上对传统分类器几乎没有优势，但在动态设置中始终优于它们，通过询问具有辨别性的后续问题来减少候选部门的不确定性。这些结果表明，大语言模型在门诊分诊中的主要价值不在于静态预测，而在于支持交互式、具有不确定性的临床决策。

英文摘要

Outpatient referral (OR) is a core clinical workflow that assigns patients to hospital departments under incomplete and evolving information, yet it is commonly simplified as a static classification problem despite being inherently interactive in practice. In this work, we study outpatient referral as a dynamic process driven by information acquisition and uncertainty reduction. We analyze both static scenarios based on fixed patient information and dynamic scenarios involving multi-turn dialogue, to test whether large language models (LLMs) improve referral outcomes through better prediction or more effective questioning. Our findings show that LLMs offer limited advantages over traditional classifiers in static referral accuracy, but consistently outperform them in dynamic settings by asking discriminative follow-up questions that reduce uncertainty over candidate departments. These results suggest that the primary value of LLMs in outpatient referral lies not in static prediction, but in supporting interactive, uncertainty-aware clinical decision-making.


2502.18915 2026-05-21 cs.CL cs.AI 版本更新

END: Early Noise Dropping for Efficient and Effective Context Denoising

END：早期噪声丢弃以实现高效有效的上下文去噪

Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Fangran Mo, Jinghan Zhang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu, Huasheng Li, Bing Yin

发表机构 * Amazon（亚马逊）

AI总结本文提出END方法，通过在早期层对输入序列进行分割和线性探针，有效识别并丢弃噪声部分，从而提升LLM在不同任务上的性能和效率，同时加深了对LLM内部上下文推理机制的理解。

详情

AI中文摘要

大型语言模型（LLMs）在广泛自然语言处理任务中表现出色，但它们经常受到输入序列中无关或噪声内容的干扰，从而降低输出质量。这个问题影响了长上下文和短上下文场景，如检索增强生成、表格问答和上下文学习。我们发现LLMs可以在生成令牌之前，在早期层中隐式地识别输入序列中是否有有用信息。基于这一见解，我们引入了早期噪声丢弃（END），一种无需微调LLMs的新方法，以缓解此问题。END将输入序列分成块，并在LLMs的早期层上使用线性探针来区分信息丰富和噪声块。通过在过程中早期丢弃噪声块，END保留了关键信息，减少了干扰，并降低了计算开销。广泛的实验表明，END在不同LLMs上多个评估数据集上显著提高了性能和效率。此外，通过探针研究LLMs对输入的隐式理解，这项工作也加深了对LLMs如何内部进行上下文推理的理解。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences, which degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (END), a novel approach to mitigate this issue without requiring fine-tuning the LLMs. END segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, END preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that END significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding of the input with the prober, this work also deepens understanding of how LLMs reason over contexts internally.


2502.12120 2026-05-21 cs.LG cs.AI cs.CL 版本更新

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

LLMs on the Line: 数据决定损失-损失缩放定律

Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

发表机构 * Max Planck Institute for Intelligent Systems（智能系统马克斯·普朗克研究所）； ELLIS Institute Tübingen（图宾根ELLIS研究所）； Tübingen AI Center（图宾根人工智能中心）； University of Tübingen（图宾根大学）

AI总结研究探讨了影响LLM损失-损失缩放定律的主要因素，发现预训练数据决定了缩放趋势，而模型大小、优化超参数、分词器和架构差异对缩放影响有限，因此应精心选择预训练数据以获得最佳下游性能。

Comments ICML 2025 camera-ready version

详情

AI中文摘要

缩放定律指导大型语言模型（LLMs）的发展，通过提供模型大小、令牌和计算量之间的最佳平衡估计。最近，损失-损失缩放定律，即预训练数据集和下游任务之间损失的关系，已成为理解并改进LLM性能和泛化能力的强大工具。在本工作中，我们研究了哪些因素最强烈地影响损失-损失缩放。我们的实验发现，预训练数据决定了缩放趋势。相比之下，模型大小、优化超参数、分词器甚至显著的架构差异，如基于Transformer的模型如Llama和状态空间模型如Mamba之间的差异，通常影响有限。因此，从业者应仔细选择适合的预训练数据集以获得最佳下游性能，而架构和其他设置可以自由优化以提高训练效率。

英文摘要

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.


2502.03752 2026-05-21 cs.LG cs.AI 版本更新

Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning

面向鲁棒的基于技能的元强化学习的自我改进技能学习

Sanghyeon Lee, Sangjun Bae, Yisak Park, Seungyul Han

发表机构 * Graduate School of Artificial Intelligence（人工智能研究生院）； Ulsan National Institute of Science and Technology (UNIST)（蔚山国立科学技术研究院 (UNIST)）

AI总结本文提出Self-Improving Skill Learning (SISL)方法，通过解耦的高层和技能改进策略进行自我指导的技能细化，并利用最大回报重标记进行技能优先级排序，从而在噪声和次优数据下实现鲁棒且稳定的适应，优于其他基于技能的元强化学习方法。

Comments 10 pages main, 27 pages appendix with reference. Accepted to ICLR 2026

Journal ref International Conference on Learning Representations (ICLR), 2026

详情

AI中文摘要

元强化学习（Meta-RL）能够快速适应未见任务，但在长时域环境中面临挑战。基于技能的方法通过将状态-动作序列分解为可重用的技能并采用分层决策来解决这一问题。然而，这些方法对带噪声的离线演示高度敏感，导致技能学习不稳定和性能下降。为此，我们提出Self-Improving Skill Learning (SISL)，它利用解耦的高层策略和技能改进策略进行自我引导的技能细化，同时通过最大回报重标记进行技能优先级排序，使更新集中在与任务相关的轨迹上，从而即使在噪声和次优数据下也能实现鲁棒且稳定的适应。通过减轻噪声的影响，SISL实现了可靠的技能学习，并在多样化的长时域任务上始终优于其他基于技能的元强化学习方法。我们的代码可在 https://epsilog.github.io/SISL 获取。

英文摘要

Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, leading to unstable skill learning and degraded performance. To address this, we propose Self-Improving Skill Learning (SISL), which performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data. By mitigating the effect of noise, SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks. Our code is available at https://epsilog.github.io/SISL.


2502.03545 2026-05-21 cs.GT cs.AI cs.MA cs.SI 版本更新

Proportional Selection in Networks

网络中的比例选择

Georgios Papasotiropoulos, Oskar Skibski, Piotr Skowron, Tomasz Wąs

发表机构 * University of Warsaw（华沙大学）； University of Oxford（牛津大学）

AI总结本文研究了如何从网络中选择k个代表性节点，旨在识别最有影响力节点并确保选择比例反映网络的多样性，提出了两种方法并进行了理论分析和实验验证。

Comments This version has been accepted for publication at IJCAI'26

2502.02844 2026-05-21 cs.LG cs.AI cs.CR cs.MA 版本更新

Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning

狼群对抗攻击用于鲁棒多智能体强化学习

Sunwoo Lee, Jaebak Hwang, Yonghyeon Jo, Seungyul Han

发表机构 * Graduate School of Artificial Intelligence, UNIST, Ulsan, South Korea（人工智能研究生院，UNIST，韩国蔚山）

AI总结本文提出狼群对抗攻击框架，用于对抗多智能体强化学习中的协同对抗攻击，并引入狼群-对抗学习框架来训练鲁棒的MARL策略以防御该攻击。

Comments 9 pages main, 23 pages appendix with reference. Accepted by ICML 2025

Journal ref Proceedings of Machine Learning Research (PMLR), ICML 2025

2502.02834 2026-05-21 cs.LG cs.AI 版本更新

Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks

任务感知虚拟训练：增强元强化学习在分布外任务中的泛化能力

Jeongmo Kim, Yisak Park, Minung Kim, Seungyul Han

发表机构 * Graduate School of Artificial Intelligence, UNIST, Ulsan, South Korea（人工智能研究生院，UNIST，韩国蔚山）

AI总结本文提出Task-Aware Virtual Training方法，通过度量学习提升元强化学习在分布外任务中的泛化能力，采用虚拟任务保持任务特征并利用状态正则化技术减少状态变化环境中的过估计误差。

Comments 9 pages main paper, 20 pages appendices with reference. Accepted to ICML 2025

Journal ref Proceedings of Machine Learning Research (PMLR), ICML 2025

详情

DOI: 10.5555/3780338.3781544

AI中文摘要

元强化学习旨在开发能够泛化到从任务分布中采样的未见任务的策略。尽管基于上下文的元强化学习方法通过任务潜变量改善了任务表示，但它们在分布外（OOD）任务上常常表现不佳。为了解决这个问题，我们提出了Task-Aware Virtual Training（TAVT），一种新的算法，它利用基于度量的表示学习，在训练场景和OOD场景中都能准确捕捉任务特征。我们的方法在虚拟任务中成功保持了任务特征，并采用状态正则化技术来减轻状态变化环境中的过估计误差。数值结果表明，TAVT在多种MuJoCo和MetaWorld环境中显著增强了对OOD任务的泛化能力。我们的代码可在 https://github.com/JM-Kim-94/tavt.git 获取。

英文摘要

Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments. Our code is available at https://github.com/JM-Kim-94/tavt.git.


2410.12771 2026-05-21 cond-mat.mtrl-sci cs.AI physics.comp-ph 版本更新

Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models

开放材料2024（OMat24）无机材料数据集和模型

Luis Barroso-Luque, Muhammed Shuaibi, Xiang Fu, Brandon M. Wood, Misko Dzamba, Meng Gao, Ammar Rizvi, C. Lawrence Zitnick, Zachary W. Ulissi

发表机构 * Fundamental AI Research (FAIR) at Meta（Meta 基础人工智能研究（FAIR））

AI总结本研究提出了一种大规模公开数据集OMat24和预训练模型，旨在解决材料发现中公开训练数据和预训练模型不足的问题，通过密度泛函理论计算和先进模型提升材料科学的AI应用。

Comments 19 pages

详情

AI中文摘要

发现具有理想性能的新材料对于从缓解气候变化到下一代计算硬件的进步至关重要。AI有潜力通过更有效地探索化学空间来加速材料发现和设计，比其他计算方法或试错法更有效。尽管在AI用于材料数据、基准和模型方面取得了显著进展，但一个障碍是缺乏公开可用的训练数据和开放预训练模型。为此，我们提出了Open Materials 2024（OMat24）大规模公开数据集的Meta FAIR发布以及一组预训练模型。OMat24包含超过1.1亿个密度泛函理论（DFT）计算，专注于结构和组成多样性。我们的EquiformerV2模型在Matbench Discovery排行榜上实现了最先进的性能，并能够预测基态稳定性和形成能量，F1分数超过0.9，准确率达到20 meV/atom。我们探讨了模型大小、辅助去噪目标和微调对性能的影响，涵盖了包括OMat24、MPtraj和Alexandria在内的多种数据集。OMat24数据集和模型的公开发布使研究社区能够在此基础上进一步推动AI辅助材料科学的发展。

英文摘要

The ability to discover new materials with desirable properties is critical for numerous applications from helping mitigate climate change to advances in next generation computing hardware. AI has the potential to accelerate materials discovery and design by more effectively exploring the chemical space compared to other computational methods or by trial-and-error. While substantial progress has been made on AI for materials data, benchmarks, and models, a barrier that has emerged is the lack of publicly available training data and open pre-trained models. To address this, we present a Meta FAIR release of the Open Materials 2024 (OMat24) large-scale open dataset and an accompanying set of pre-trained models. OMat24 contains over 110 million density functional theory (DFT) calculations focused on structural and compositional diversity. Our EquiformerV2 models achieve state-of-the-art performance on the Matbench Discovery leaderboard and are capable of predicting ground-state stability and formation energies to an F1 score above 0.9 and an accuracy of 20 meV/atom, respectively. We explore the impact of model size, auxiliary denoising objectives, and fine-tuning on performance across a range of datasets including OMat24, MPtraj, and Alexandria. The open release of the OMat24 dataset and models enables the research community to build upon our efforts and drive further advancements in AI-assisted materials science.


2410.03296 2026-05-21 cs.CL cs.AI 版本更新

A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

抽取式自我解释与人类标注依据（rationales）在文本分类中的系统比较

Stephanie Brandl, Oliver Eberle

发表机构 * Center for Social Data Science（社会数据科学中心）； University of Copenhagen（哥本哈根大学）； Machine Learning Group（机器学习小组）； Technische Universität Berlin（柏林工业大学）

AI总结本文在文本分类任务中系统比较了抽取式自我解释与人类标注依据（rationales），分析了不同任务和语言下的解释质量，发现二者的一致程度高度依赖于文本长度和任务复杂度。

Comments accepted to the Trustworthy NLP Workshop, co-located with ACL 2026

详情

AI中文摘要

指令微调的LLM能够通过生成自我解释向用户说明其输出，而无需应用复杂的可解释性技术。本文分析这种能力能否产生好的解释。我们从对人类而言的合理性（plausibility）出发，评估以输入依据（rationales）形式给出的自我解释。我们研究了三个文本分类任务：情感分类、强迫劳动检测和论断验证。我们纳入了情感分类任务的丹麦语和意大利语翻译版本，并将自我解释与人类标注进行比较。为此，我们为论断验证数据集Climate-Fever收集了人类依据标注。我们进一步评估了人类依据和自我解释依据相对于正确模型预测的忠实度，并通过纳入基于事后归因的解释扩展了研究。我们分析了四个开放权重LLM，发现自我解释与人类依据之间的一致性高度依赖于文本长度和任务复杂度。尽管如此，自我解释能给出忠实的token级依据子集，而事后归因方法则倾向于强调结构性和格式性token，反映出根本不同的解释策略。

英文摘要

Instruction-tuned LLMs are able to provide an explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a good explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.


2409.14839 2026-05-21 cs.AI cs.ET cs.HC 版本更新

Explainable and Human-Grounded AI for Decision Support Systems: The Theory of Epistemic Quasi-Partnerships

可解释且以人为中心的AI用于决策支持系统：知识性准伙伴关系理论

John Dorsch, Maximilian Moll

发表机构 * Faculty of Philosophy, Philosophy of Science and the Study of Religion, Ludwig Maximilian University Munich（哲学、科学哲学与宗教学学院，慕尼黑路德维希-马克西米利安大学）

AI总结本文提出了一种新的理论框架，即知识性准伙伴关系理论（EQP），用于指导开发能够提供人类基础解释（原因、反事实和置信度）的AI决策支持系统，以满足伦理和可解释AI（XAI）的需求。

Comments 20 pages

Journal ref Philosophy of Artificial Intelligence. Synthese Library, vol 533. Springer. 2026

详情

DOI: 10.1007/978-3-032-10073-3_5

AI中文摘要

在人工智能决策支持系统（AI-DSS）的背景下，我们主张满足伦理和可解释AI（XAI）的需求是开发AI-DSS，以向人类决策者提供三种类型的以人为中心的解释：原因、反事实和置信度，这种方法我们称为RCC方法。我们首先回顾了当前的实证XAI文献，探讨了生成模型解释的各种方法（如LIME、SHAP、Anchors）与模型感知可信度和终端用户准确性之间的关系。我们展示了当前关于什么是良好人类基础原因的理论要么无法充分解释这些证据，要么没有为开发提供坚实的伦理建议。因此，我们提出了一种新的理论：知识性准伙伴关系理论（EQP）。最后，我们阐明了采用EQP的动机，并展示了它如何解释实证证据，提供坚实的伦理建议，并导致采用RCC方法。

英文摘要

In the context of AI decision support systems (AI-DSS), we argue that meeting the demands of ethical and explainable AI (XAI) is about developing AI-DSS to provide human decision-makers with three types of human-grounded explanations: reasons, counterfactuals, and confidence, an approach we refer to as the RCC approach. We begin by reviewing current empirical XAI literature that investigates the relationship between various methods for generating model explanations (e.g., LIME, SHAP, Anchors), the perceived trustworthiness of the model, and end-user accuracy. We demonstrate how current theories about what constitutes good human-grounded reasons either do not adequately explain this evidence or do not offer sound ethical advice for development. Thus, we offer a novel theory of human-machine interaction: the theory of epistemic quasi-partnerships (EQP). Finally, we motivate adopting EQP and demonstrate how it explains the empirical evidence, offers sound ethical advice, and entails adopting the RCC approach.


2407.01734 2026-05-21 quant-ph cs.AI 版本更新

Optical Quantum Mixed-State Reconstruction With Multiple Deep Learning Approaches

光学量子混合态重构与多种深度学习方法

Nhan Trong Luu, Tuyen Quang Nguyen, Duong Trung Luu, Thang Cong Truong

发表机构 * College of Communication and Information Technology（通信与信息技术学院）； Can Tho University（芹苴大学）； School of Computer Science（计算机科学学院）； University of Technology Sydney（悉尼科技大学）； Center of Digital Transformation and Communication（数字化转型与通信中心）； The University of Aizu（会津大学）

AI总结本文提出两种基于神经网络的量子态重构方法，用于纯态和混合态的量子态重构，通过利用类别信息实现对纯态和混合态的高精度重构。

Journal ref SN Computer Science (2026)

2406.07125 2026-05-21 cs.CR cs.AI cs.LG 版本更新

CARACAS: vehiCular ArchitectuRe for detAiled Can Attacks Simulation

CARACAS：用于详细CAN攻击模拟的车辆架构

Sadek Misto Kirdi, Nicola Scarano, Franco Oberti, Luca Mannella, Stefano Di Carlo, Alessandro Savino

发表机构 * Politecnico di Torino, Department of Control and Computer Engineering（都灵理工大学控制与计算机工程系）

AI总结本文提出CARACAS，一种用于模拟详细CAN攻击的车辆模型，通过结合Simulink等仿真框架和攻击模型的稳健表示，生成合成数据集以提高IDS的检测能力，重点展示电池电动车的扭矩控制攻击模拟。

Comments 6 pages, 8 figures, TrustAICyberSec workshop - IEEE ISCC 2024

Journal ref Proceedings of the 29th IEEE Symposium on Computers and Communications, ISCC 2024

2605.20534 2026-05-21 cs.LG cs.AI stat.ML 版本更新

Axiomatizing Neural Networks via Pursuit of Subspaces

通过子空间追踪对神经网络进行公理化

Mehmet Yamac, Mert Duman, Ugur Akpinar, Felix Rojas Casadiego, Serkan Kiranyaz, Marcel van Gerven, Moncef Gabbouj

发表机构 * Tampere University, Faculty of ITC, Finland（芬兰坦佩雷大学信息与通信技术学院）； Department of Electrical Engineering, Qatar University, Qatar（卡塔尔大学电气工程系）； Donders Institute, Radboud University, The Netherlands（荷兰拉德堡德大学唐德斯研究所）

AI总结本文提出一个基于几何公理的框架，用于解释神经网络的行为，通过子空间追踪（pursuit of subspaces）假设，统一了表示、计算和泛化在浅层和深层架构中的视角。

Comments 43 pages, 25 figures. Code and additional materials will be released

2605.20529 2026-05-21 cs.CL cs.AI 版本更新

Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks

搭配自举（collocational bootstrapping）：关于人类和神经网络如何学习主谓一致的一个假设

Claire Hobbs, R. Thomas McCoy

发表机构 * Yale University（耶鲁大学）； Cognitive Science Program（认知科学项目）； Dept. of Linguistics（语言学系）； Wu Tsai Institute（吴蔡研究所）

AI总结本文探讨了语言输入中的统计信号如何帮助句法习得，提出搭配自举（collocational bootstrapping）假设，即词语共现规律可以提供句法依存线索，并检验该机制能否支持英语主谓一致的习得。

Comments Accepted to CoNLL

详情

AI中文摘要

语言输入中的统计信号可以通过哪些方式帮助句法习得？本文假设存在一种称为搭配自举（collocational bootstrapping）的机制，即词语共现模式中的规律可以为句法依存关系提供线索。我们研究这种机制能否支持英语主谓一致的习得。首先，我们在主谓搭配可预测程度不同的合成数据集上训练神经网络，以此模拟语言习得。我们发现，存在一个变异性水平的范围，在该范围内这些统计学习器能够稳健地学会主谓一致。随后，我们分析了儿童导向语言中主谓搭配的变异性，发现此类数据的变异性落在我们的计算模拟中支持稳健泛化的范围之内。综合来看，这些结果表明，对于儿童所接收的这类输入，搭配自举是一种可行的学习策略。

英文摘要

In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.


2605.20525 2026-05-21 cs.CV cs.AI cs.CL cs.LG eess.IV 版本更新

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

NeuroQA：面向3D脑部MRI理解的大规模图像关联（image-grounded）基准

Mohammad H. Abbasi, Favour Nerrise, Shaurnav Ghosh, Ridvan Yesiloglu, Yuncong Mao, Bailey Trang, Mohammad Asadi, Merryn Daniel, Gustavo Chau Loo Kung, Ken Chang, Pavan Pinkesh Shah, Adam Turnbull, Kyan Younes, Seena Dehkharghani, Ehsan Adeli

发表机构 * Stanford University（斯坦福大学）

AI总结本文提出NeuroQA，一个大规模的3D脑部MRI视觉问答基准，包含来自12977名受试者的56953个问答对，涵盖5-104岁及五个临床领域，通过3D体积评估11种临床推理技能，并提供可复现的生成脚本和在线排行榜。

Comments 30 pages, dataset and benchmark release

详情

AI中文摘要

我们提出了NeuroQA，一个面向3D脑部磁共振成像（MRI）视觉问答的大规模基准，包含来自12个数据集、12,977名受试者的56,953个问答对。它覆盖5至104岁的年龄范围和五个临床领域：阿尔茨海默病、帕金森病、肿瘤、白质疾病和神经发育。与以往基于2D切片或依赖狭窄诊断标签的医学视觉问答（VQA）工作不同，NeuroQA将每个条目与完整的3D体积配对。它以是/否、多项选择和开放式三种格式评估11种具有临床依据的推理技能。在203个模板中，131个是图像关联型（image-grounded，可通过三平面查看器作答），72个是图像参考型（image-informed，标准答案来自定量体积测量或临床量表）。为消除纯文本捷径，我们应用了答案分布优化，将封闭格式的纯文本准确率从80%以上降至44.6%；图像必要性则通过随基准一同发布的图像关联协议单独评估。一个包含38条规则的确定性流水线和两轮专家审查，将每个问答对与FreeSurfer测量值、元数据或放射学报告字段进行核对，跨模板的同一受试者矛盾为零。我们还开展了临床医生评估：两名临床医生使用三平面查看器独立评估100个冻结的测试条目。在封闭格式（是/否加多项选择）的公开测试条目上，最好的零样本视觉语言模型和有监督的3D CNN基线分别达到47.5%和43.7%的准确率，均低于49.4%的纯文本多数模板下限。NeuroQA采用两级发布方式：对开放获取的数据集公开问答对，对受数据使用协议（DUA）限制的数据集提供可复现的生成脚本，并提供受试者级划分、保留的私有测试集和在线排行榜。

英文摘要

We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from >80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.


2605.20523 2026-05-21 cs.LG cs.AI q-bio.QM 版本更新

Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4, Tabular Foundation Models, and Large Language Models

机器学习增强的非侵入性测试用于MASLD纤维化：浅层-深层神经网络与FIB-4、表格基础模型和大语言模型的比较

Athanasios Angelakis, Gabriele De Vito, Eleni-Myrto Trifylli, Filomena Ferrucci

发表机构 * BioML Lab, RI CODE, UniBw, Munich, Germany（BioML实验室，CODE研究所，慕尼黑联邦国防军大学，德国慕尼黑）； Department of Epidemiology and Data Science, Amsterdam UMC, Amsterdam, Netherlands（流行病学与数据科学系，阿姆斯特丹大学医学中心，荷兰阿姆斯特丹）； Alpha Indicium, Rijswijk, Netherlands（Alpha Indicium，荷兰赖斯韦克）； Department of Computer Science, University of Salerno, Salerno, Italy（计算机科学系，萨勒诺大学，意大利萨勒诺）； GI-Liver Unit, 2nd Department of Internal Medicine, National and Kapodistrian University of Athens, General Hospital of Athens “Hippocratio”, Athens, Greece（胃肠-肝病科，第二内科，雅典国立卡波迪斯特里安大学，雅典“希波克拉底”综合医院，希腊雅典）

AI总结本文研究了机器学习增强的非侵入性测试在MASLD纤维化检测中的应用，比较了浅层-深层神经网络、FIB-4、表格基础模型和大语言模型在不同队列中的性能，发现浅层-深层神经网络在保持FIB-4变量空间的同时提供了更平衡的外部操作性能。

Comments 26 pages, 4 figures, 3 tables. Preprint

详情

AI中文摘要

晚期纤维化是代谢功能障碍相关脂肪性肝病（MASLD）中肝脏相关发病的主要决定因素。FIB-4被广泛用作一线非侵入性检测，但其固定公式可能未充分利用年龄、天冬氨酸转氨酶、丙氨酸转氨酶和血小板计数中包含的诊断信息。我们评估了机器学习增强的非侵入性检测（MLE-NIT）能否在保持FIB-4变量空间的同时改进晚期纤维化的检出。我们使用了来自中国、马来西亚和印度的三个经活检确认的MASLD队列（n=784）。中国队列被划分为486名训练患者和54名内部验证/调参患者；最终性能仅在马来西亚和印度的外部队列上报告。模型使用五个变量：年龄、FIB-4、天冬氨酸转氨酶、血小板计数和丙氨酸转氨酶。我们将FIB-4与浅层-深层神经网络（s-DNN）、TabPFN和gpt-4o-2024-08-06进行了比较。FIB-4在马来西亚和印度的外部ROC-AUC分别为0.75和0.60。TabPFN分别达到0.69和0.66，微调后的GPT-4o分别达到0.75和0.63，s-DNN分别达到0.77和0.67。s-DNN仅包含354个可训练参数，而TabPFN有7,244,554个，但s-DNN提供了更均衡的外部运行特性。校准结果显示s-DNN的Brier分数为0.18和0.22，置换重要性分析表明AST和FIB-4是主导变量。紧凑的非线性MLE-NIT或可在不增加临床数据需求的情况下增强基于FIB-4的纤维化评估。

英文摘要

Advanced fibrosis is a major determinant of liver-related morbidity in metabolic dysfunction-associated steatotic liver disease (MASLD). FIB-4 is widely used as a first-line non-invasive test, but its fixed formula may underuse diagnostic information contained in age, aspartate aminotransferase, alanine aminotransferase, and platelet count. We evaluated whether machine-learning-enhanced non-invasive testing (MLE-NIT) can improve advanced fibrosis detection while preserving this FIB-4 variable space. We used three biopsy-confirmed MASLD cohorts from China, Malaysia, and India (n=784). The Chinese cohort was split into 486 training and 54 internal validation/tuning patients; final performance was reported only on the Malaysian and Indian external cohorts. Models used five variables: age, FIB-4, aspartate aminotransferase, platelet count, and alanine aminotransferase. We compared FIB-4 with a shallow-deep neural network (s-DNN), TabPFN, and gpt-4o-2024-08-06. FIB-4 achieved external ROC-AUCs of 0.75 and 0.60 in Malaysia and India, respectively. TabPFN achieved 0.69 and 0.66, fine-tuned GPT-4o achieved 0.75 and 0.63, and the s-DNN achieved 0.77 and 0.67, respectively. The s-DNN contained only 354 trainable parameters, compared with 7,244,554 for TabPFN, yet provided a more balanced external operating profile. Calibration showed s-DNN Brier scores of 0.18 and 0.22, and permutation importance identified AST and FIB-4 as dominant variables. Compact non-linear MLE-NITs may enhance FIB-4-based fibrosis assessment without increasing clinical data requirements.
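The abstract above refers to the "fixed formula" of FIB-4 without stating it. As a reminder, the standard FIB-4 index is age × AST / (platelet count × √ALT), with age in years, AST and ALT in U/L, and platelets in 10^9/L. The following is a minimal sketch of that baseline score only, not of the paper's s-DNN; the function name `fib4` and its parameter names are illustrative assumptions, not taken from the paper.

```python
import math


def fib4(age_years: float, ast_u_per_l: float, alt_u_per_l: float,
         platelets_1e9_per_l: float) -> float:
    """Standard FIB-4 index: (age * AST) / (platelets * sqrt(ALT)).

    Function and parameter names are hypothetical; units are years,
    U/L for AST and ALT, and 10^9/L for the platelet count.
    """
    values = (age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l)
    if any(v <= 0 for v in values):
        raise ValueError("all inputs must be positive")
    return (age_years * ast_u_per_l) / (
        platelets_1e9_per_l * math.sqrt(alt_u_per_l)
    )


if __name__ == "__main__":
    # Illustrative values only, not patient data from the study.
    print(fib4(age_years=50, ast_u_per_l=40, alt_u_per_l=25,
               platelets_1e9_per_l=200))
```

Because the score is a fixed ratio of the four inputs, a learned model over the same variables can in principle weight them differently, which is the comparison the abstract describes.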


2605.20520 2026-05-21 cs.AI 版本更新

Open-World Evaluations for Measuring Frontier AI Capabilities

面向前沿AI能力的开放世界评估

Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B. Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan

发表机构 * Princeton University（普林斯顿大学）； Cornflower Labs（Cornflower实验室）； Meridian Labs（Meridian实验室）； Stanford University（斯坦福大学）； UK AI Security Institute（英国人工智能安全研究所）； Johns Hopkins University（约翰霍普金斯大学）； Adaption Labs（Adaption实验室）； Australian National University（澳大利亚国立大学）； Golden Gate Institute for AI（金门人工智能研究所）； UW Madison（威斯康星大学麦迪逊分校）； Microsoft Research（微软研究院）； AI Digest（AI摘要）； Georgetown University (CSET)（乔治城大学（CSET））

AI总结本文提出开放世界评估作为一种补充方法，通过小样本定性分析来评估长期、复杂、现实世界任务，以更准确地衡量AI能力，并介绍了CRUX项目作为定期进行此类评估的尝试。

详情

AI中文摘要

基于基准的评估在跟踪前沿AI进展方面仍然很重要。但其可能同时高估和低估实际能力，因为它优先考虑可以精确指定、自动评分、容易优化且预算低、时间短的任务。我们倡导一种互补的评估类别，我们称之为开放世界评估：长期、复杂、现实世界任务通过小样本定性分析而非基准规模自动化来评估。在本文中，我们回顾了最近的开放世界评估，识别了其优势和局限性，并介绍了CRUX（Collaborative Research for Updating AI eXpectations），一个定期进行此类评估的项目。作为第一个实例，我们让一个AI代理开发并发布一个简单的iOS应用程序到Apple App Store。代理仅需一次可避免的手动干预就完成了任务，这表明开放世界评估可以提供关于可能很快普及的能力的早期预警。我们最后提出设计和报告开放世界评估的建议。

英文摘要

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.


2605.20510 2026-05-21 cs.CV cs.AI cs.CY 版本更新

ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society

ShadeBench: 一个用于可持续社会建筑阴影模拟的基准数据集

Longchao Da, Mithun Shivakoti, Xiangrui Liu, T Pranav Kutralingam, Yezhou Yang, Hua Wei

发表机构 * School of Computing and Augmented Intelligence, Arizona State University（计算与增强智能学院，亚利桑那州立大学）； Global Futures Laboratory, Arizona State University（全球未来实验室，亚利桑那州立大学）

AI总结本文提出ShadeBench，一个用于城市阴影理解的综合数据集和基准，通过多模态数据支持阴影生成、分割和3D建筑重建，并提供标准化评估协议和基线方法，为数据驱动的城市气候研究和热适应城市规划提供基础。

Comments 12 pages, 13 figures, 2 tables. Accepted by KDD 2026 AI for Sciences Track

详情

AI中文摘要

由于城市热岛效应的加剧，城市热暴露问题变得越来越严峻。细粒度的阴影模式，尤其是由建筑物引起的阴影，强烈影响行人热暴露和户外活动规划。然而，大规模准确建模和分析城市阴影仍然困难，因为缺乏大规模数据集和系统评估框架。为了解决这一挑战，我们提出了ShadeBench，一个全面的城市阴影理解数据集和基准。ShadeBench包含地理多样的城市场景，具有时间变化的模拟阴影地图和文本描述，以及对齐的卫星图像、建筑骨架表示和3D建筑网格。基于此多模态数据集，ShadeBench支持一系列下游任务，包括阴影生成、阴影分割和3D建筑重建。我们进一步建立了这些任务的标准评估协议和基线方法。通过使大规模和细粒度的阴影分析成为可能，ShadeBench为数据驱动的城市气候研究提供了基础，并支持未来在热适应城市规划和决策中的研究。代码和数据集可在https://darl-genai.github.io/shadebench/上公开获取。

英文摘要

Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban buildings, strongly influence pedestrians' thermal exposure and outdoor activity planning. However, accurately modeling and analyzing urban shade at scale remains difficult because of the lack of large-scale datasets and systematic evaluation frameworks. To address this challenge, we present ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. ShadeBench contains geographically diverse urban scenes with temporally varying simulated shade maps and textual descriptions, together with aligned satellite imagery, building skeleton representations, and 3D building meshes. Built upon this multimodal dataset, ShadeBench supports a range of downstream tasks, including shade generation, shade segmentation, and 3D building reconstruction. We further establish standardized evaluation protocols and baseline methods for these tasks. By enabling scalable and fine-grained shade analysis, ShadeBench provides a foundation for data-driven urban climate research and supports future studies in heat-resilient urban planning and decision-making. The code and dataset are publicly available at https://darl-genai.github.io/shadebench/.


2605.20502 2026-05-21 cs.LG cs.AI cs.CV stat.AP stat.ML 版本更新

Tippett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection

基于Tippett最小值融合的表示空间扩散模型用于多编码器分布外检测

Neelkamal Bhuyan

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结本文提出了一种多编码器融合的表示空间扩散模型，通过统计分析每个编码器对特定分布偏移类型的敏感性，引入EncMin2L门控机制，无需使用OOD标签即可在较低参数成本下提升分布外（OOD）检测性能，同时在四种分布偏移类型上均达到0.94以上的AUROC。

Comments 14 pages

详情

AI中文摘要

我们通过对各编码器的表示空间扩散模型（RDMs）进行多编码器融合，来解决覆盖完整分布偏移谱的分布外（OOD）检测问题，这些偏移包括全局域变化、语义分歧、纹理差异和协变量损坏。我们仅从分布内（ID）数据出发，用统计方法识别每个编码器对特定偏移类型的敏感性，并引入EncMin2L：一种与编码器无关的两级min(⋅)门控，能够在不使用OOD标签的情况下组合并校准各编码器基于扩散的似然检测器，并以低2.3倍的参数成本优于整体式多编码器基线。两项基于ID数据的诊断指标，即η²（类条件F检验）和Δμ（合成损坏下的对数似然偏移），用于量化编码器的专长，而Tippett最小p值组合将各编码器的得分聚合为单一的、校准稳定的OOD信号。EncMin2L在全部四种偏移类型上同时达到≥0.94的AUROC，在重叠的基准上优于最先进的表示空间扩散OOD检测器。

英文摘要

We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts (global domain changes, semantic divergence, texture differences, and covariate corruptions) through a multi-encoder fusion of per-encoder representation-space diffusion models (RDMs). We statistically identify each encoder's sensitivity to specific shift types from ID data alone and introduce EncMin2L, an encoder-agnostic two-level $\min(\cdot)$-gate that combines and calibrates per-encoder diffusion-based likelihood detectors without OOD labels, outperforming monolithic multi-encoder baselines at $2.3\times$ lower parameter cost. Two ID-data diagnostics, $\eta^2$ (class-conditional F-test) and $\Delta\mu$ (log-likelihood shift under synthetic corruptions), quantify encoder specialization, while a Tippett minimum $p$-value combination aggregates per-encoder scores into a single, calibration-stable OOD signal. EncMin2L achieves $\geq 0.94$ AUROC across all four shift types simultaneously, outperforming the state-of-the-art representation-space diffusion OOD detectors across overlapping benchmarks.
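For readers unfamiliar with the Tippett combination named in the abstract above: given k independent p-values, Tippett's method uses the smallest one as the test statistic, and under the null hypothesis the combined p-value is 1 - (1 - p_min)^k. The sketch below illustrates only that textbook rule. It is not the paper's EncMin2L gate, it assumes the per-encoder p-values are already available and independent, and the function name `tippett_combine` is a hypothetical choice.

```python
from typing import Sequence


def tippett_combine(p_values: Sequence[float]) -> float:
    """Combine independent p-values with Tippett's minimum-p rule.

    Under the null, the minimum of k independent Uniform(0, 1) p-values
    satisfies P(min <= t) = 1 - (1 - t)**k, which gives the combined
    p-value. How each p-value is obtained from an encoder score is left
    open here (an assumption of this sketch).
    """
    if len(p_values) == 0:
        raise ValueError("need at least one p-value")
    if any(p < 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    k = len(p_values)
    p_min = min(p_values)
    return 1.0 - (1.0 - p_min) ** k


if __name__ == "__main__":
    # Three hypothetical per-encoder p-values; one encoder flags the input.
    print(tippett_combine([0.40, 0.02, 0.75]))
```

The minimum rule lets a single strongly reacting encoder drive the decision, which matches the abstract's premise that different encoders are sensitive to different shift types; the exponent k corrects for having taken the best of k attempts.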


2605.20477 2026-05-21 cs.LG cs.AI cs.CL 版本更新

LLM Pretraining Shapes a Generalizable Manifold: Insights from Cross-Modal Transfer to Time Series

Alexis Roger, Prateek Humane, Zhenghan Tai, Gwen Legate, Andrei Mircea, Vasilii Feofanov, Irina Rish

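Affiliations: McGill University; Mila - Quebec AI Institute; Université de Montréal; University of Toronto; Concordia University; com (42.com)

AI summary: The study asks whether language-pretrained Transformers can be effective time-series forecasters and explains the mechanism of cross-modal transfer: pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.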

详情

AI中文摘要

语言预训练的Transformer能否成为有效的时序预测器，以及原因是什么？本文表明，跨模态迁移出现是因为语言预训练为时序训练预设了一个可重用的流形。在冻结的LLM状态上进行线性探测可以解码出真实的时序轨迹而无需配对监督，该投影空间中的检索能产生具有竞争力的预测，表明在微调之前就已经存在结构和动态。预训练初始化还提升了优化效果，产生连贯的梯度和高度各向异性的损失景观，不同于随机初始化。微调则起到低维对齐的作用，重用已有的方向而非从头学习时间原始特性，这通过低秩更新、子空间对齐和共享的周期性、趋势和重复特征得到证实。这些结果支持了LLM到时序迁移的几何解释：语言预训练构建了流形，微调将数值动态投影到任务相关方向上。

英文摘要

Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.


2605.20445 2026-05-21 cs.CV cs.AI 版本更新

A Comprehensive Comparison of Deep Learning Architectures for COVID-19 Classification on CT & X-ray Imagery

对用于CT和X光影像中新冠分类的深度学习架构的全面比较

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam

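Affiliations: Department of Computer & Software Engineering, National University of Sciences & Technology, Islamabad, Pakistan; School of Computing & Information Systems, University of Melbourne, Melbourne, Australia

AI summary: The paper compares multiple deep learning architectures and proposes a CNN-based computer-aided diagnosis system that distinguishes COVID-19 from normal lung images, reporting average accuracy of 95 to 98% on X-ray and CT datasets.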

Comments 6 pages, 2 figures, 5 tables

详情

AI中文摘要

新冠是一种造成大量人员伤亡的重大挑战，不仅涉及某些国家，甚至全球也因冠状病毒而遭受影响。使用计算断层扫描（CT）和X光的肺部影像技术是新冠或其他大流行病筛查过程中最有效的工具。如今，技术已通过人工智能取代手动过程，用自动化机器使系统能够模仿人类大脑，通过经验做出明智决策。受此启发，我们的工作提出使用卷积神经网络（CNN）模型设计一个计算机辅助诊断（CAD）系统，以区分新冠和正常肺部影像。我们使用了两组不同的肺部X光影像和两组不同的CT扫描，并利用预训练的多种网络（如VGG（16, 19）、Densenet（121）、Resnet（50, 50 V2, 101 V2）、MobileNet（V2）、Xception Inception（V3, Resnet V2）、EfficientNet（B0）和Nasnet（Large））进行分类。在X光和CT图像数据集上，Resnet和VGG架构显示出能够正确区分新冠和正常图像的能力，平均准确率分别为95至98%。我们在分类数据集上的结果具有竞争力，并优于文献中已报告的发现。

英文摘要

COVID-19 was a significant challenge that caused numerous deaths every day. The outbreak was not confined to any one country; the whole world suffered because of the coronavirus. Imaging of the lungs with computed tomography (CT) and X-rays is the most useful tool for screening COVID-19 or any other pandemic disease. Artificial intelligence is increasingly used to replace manual processes with automated systems that imitate human decision-making by learning from experience. Motivated by this, our work proposes convolutional neural network (CNN) based models for a computer-aided diagnosis (CAD) system that differentiates between COVID-19 and healthy lung images. We used two different sets of lung X-ray images and two different sets of CT scans, and performed classification with a variety of pre-trained networks: VGG (16, 19), DenseNet (121), ResNet (50, 50 V2, 101 V2), MobileNet (V2), Xception, Inception (V3, ResNet V2), EfficientNet (B0), and NASNet (Large). On the X-ray and CT image datasets, the ResNet and VGG architectures differentiated COVID-19 from normal images with average accuracies of 95 to 98 percent. Our results on the classification datasets are competitive with and superior to previously reported findings in the literature.


2605.20442 2026-05-21 cs.HC cs.AI 版本更新

Modeling Emotional Dynamics in Agent-to-Agent Interactions on Moltbook

在Moltbook上代理间交互中情感动态建模

Syed Mhamudul Hasan, Abdur R. Shahid

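Affiliations: Southern Illinois University

AI summary: The paper studies the emotional dynamics of agent-to-agent interactions on Moltbook. It proposes an emotion-aware framework that maps textual interactions to predefined fine-grained emotion categories and extracts structured emotion profiles, and it introduces the emotion-based Persona-Stimulus-Reaction (PSR) domain to assess behavioral reliability.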

详情

AI中文摘要

生成式AI系统越来越多地被用作在线环境中的交互代理，例如名为Moltbook的社会网络。在Moltbook中，大规模的代理AI可以发布、评论并参与由AI驱动的文本生成的活动。然而，这些代理行为特征仍不够理解，特别是在复杂的多代理交互中。在本研究中，我们分析了Moltbook中代理交互的情感动态。我们构建了一个情感感知框架，将文本交互映射到预定义的细粒度情感类别集，从而在代理和交互上下文中提取结构化的情感档案。为进一步评估行为可靠性，我们引入了一个名为Persona-Stimulus-Reaction（PSR）的情感领域，以捕捉在相似上下文中情感响应的一致性。我们的分析显示，代理在不同上下文中表现出不同的情感模式和行为稳定性水平。我们的分析揭示了代理在不同上下文中表现出不同的情感特征，其行为稳定性因交互上下文而异。

英文摘要

Generative AI systems are increasingly deployed as interactive agents in online environments, such as the social network Moltbook. On Moltbook, agentic AIs can post, comment, and engage in activities at scale through AI-generated text. Yet the behavioral characteristics of these agents remain insufficiently understood, particularly in complex multi-agent interactions. In this study, we analyze the emotional dynamics of agent interactions within Moltbook. We construct an emotion-aware framework that maps textual interactions to a predefined set of fine-grained emotional categories, enabling the extraction of structured emotion profiles across agents and interaction contexts. To further evaluate behavioral reliability, we introduce an emotion-based domain called Persona-Stimulus-Reaction (PSR) that captures the alignment of emotional responses across similar contexts. Our analysis reveals that agents exhibit distinct emotional signatures, with levels of behavioral stability that vary across agents and are influenced by interaction context.


2605.20441 2026-05-21 cs.LG cs.AI cs.NE 版本更新

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

Transformer在Grokking中的权重衰减区域：廉价的在线诊断

Lucky Verma

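Affiliations: Independent Researcher

AI summary: The study examines the sharp transitions between memorization, generalization, and collapse in Transformers trained on modular arithmetic, using weight decay as a scalar empirical control parameter for these regimes. It introduces two cheap online diagnostics that track training dynamics from attention activations and complement loss-landscape diagnostics at lower compute cost.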

Comments 28 pages, 11 figures, 5 tables. Code and aggregate JSONs: https://github.com/lucky-verma/grokking-diagnostics. Per-run JSONs: https://huggingface.co/datasets/lucky-verma/grokking-diagnostics-runs. Lean 4/mathlib v4.29.0 formal checks available in the code repository

详情

AI中文摘要

在模运算中训练的Transformer模型表现出记忆、泛化和崩溃之间的尖锐转变。我们证明权重衰减作为这些区域的标量经验控制参数，并引入了两种廉价的在线诊断方法，即平均成对注意力头余弦相似度和熵标准差，这些方法仅通过注意力激活来跟踪训练动态，并在较低计算成本下补充损失景观诊断。在十一种实验条件和三种模型规模（0.82M到85M参数）中，权重衰减轴将记忆、发展性Grokking和崩溃分开。一个接近临界点的逻辑拟合将记忆到发展性的边界定位在λ_c=0.0158（95%置信区间[0.0109, 0.0200]，N=210）；一个幂律拟合给出经验指数ν=0.757（置信区间[0.725, 0.799]）。参考指数ν=1/2和3D伊辛ν≈0.63在我们四格网格下位于此经验置信区间之外，因此我们报告ν为经验值，并将临界点类别的识别推迟到更密集的有限大小缩放工作。一个与地平线匹配的多任务复制（n=280，四个模运算）保留了权重衰减控制模式；在λ=0.05时进行的配对注意力头重新初始化实验改变了阶段2的振幅（Cohen的d=-1.190，n=10，p_t=4.5×10^-3），而匹配的权重范数裁剪则没有。三个跨架构探测（4L MLP，4L LSTM和4L Mamba；每个n=70）在小Transformer注意力模型的模运算中复制了权重衰减控制的转变，具有架构特定的λ_c值。主要诊断主张限于小Transformer注意力模型的模运算；非注意力实验是范围探测，架构广泛、语言模型和临界点类别的主张超出范围。

英文摘要

Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates memorization, developmental grokking, and collapse. A near-transition logistic fit localizes the memorization-to-developmental boundary at $λ_c=0.0158$ (95% CI [0.0109, 0.0200], N=210); a power-law fit gives an empirical exponent $ν=0.757$ (CI [0.725, 0.799]). Reference exponents $ν=1/2$ and 3D Ising $ν\approx 0.63$ lie outside this empirical CI under our four-bin grid, so we report $ν$ as empirical and defer universality-class identification to denser finite-size-scaling work. A horizon-matched multi-task replication (n=280, four modular operations) preserves the weight-decay control pattern; a paired attention-head re-initialization experiment at $λ=0.05$ changes Phase-2 amplitude (Cohen's $d=-1.190$, n=10, $p_t=4.5 \times 10^{-3}$), while matched weight-norm clipping does not. Three cross-architecture probes (4L MLP, 4L LSTM, and 4L Mamba; each n=70) replicate the weight-decay-controlled transition with architecture-specific $λ_c$ values. Main diagnostic claims are scoped to modular arithmetic in small transformer attention models; the non-attention experiments are scope probes, and architecture-wide, language-model, and universality-class claims are out of scope.
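The two diagnostics named in the abstract, mean pairwise attention-head cosine similarity and entropy standard deviation, can be sketched as follows. The abstract does not give exact definitions, so this sketch makes assumptions: cosine similarity is taken between flattened per-head attention maps of one layer, and the entropy statistic is the standard deviation across heads of each head's mean row entropy. Function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import pstdev


def mean_pairwise_head_cosine(attn):
    """attn: list of heads; each head is a list of attention rows (lists of weights).

    Returns the mean cosine similarity over all unordered pairs of heads,
    treating each head's attention map as one flattened vector.
    """
    vectors = [[w for row in head for w in row] for head in attn]
    sims = []
    for a, b in combinations(vectors, 2):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        sims.append(dot / (norm_a * norm_b))
    return sum(sims) / len(sims)


def head_entropy_std(attn):
    """Standard deviation across heads of the mean Shannon entropy of attention rows."""
    def row_entropy(row):
        # zero weights contribute nothing to the entropy
        return -sum(w * math.log(w) for w in row if w > 0.0)

    per_head = [sum(row_entropy(r) for r in head) / len(head) for head in attn]
    return pstdev(per_head)


# Hypothetical two-head, two-token attention pattern.
attn = [
    [[1.0, 0.0], [1.0, 0.0]],   # head 0 always attends to token 0
    [[0.5, 0.5], [0.5, 0.5]],   # head 1 attends uniformly
]
similarity = mean_pairwise_head_cosine(attn)
spread = head_entropy_std(attn)
```

Both quantities need only the attention weights from a forward pass, which is consistent with the abstract's claim that the diagnostics are cheap compared with loss-landscape analysis.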


2605.20440 2026-05-21 cs.LG cs.AI math.RA 版本更新

Group-Algebraic Tensors: Provably-optimal Equivariant Learning and Physical Symmetry Discovery

群代数张量：可证明最优的等变学习与物理对称性发现

Paulina Hoyos, Shashanka Ubaru, Dongsung Huh, Vasileios Kalantzis, Kenneth L. Clarkson, Misha Kilmer, Haim Avron, Lior Horesh

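Affiliations: UT Austin; IBM Research; Independent; Tufts University; Tel-Aviv University

AI summary: The paper proposes a group-algebraic tensor framework in which a finite group G defines the multiplication rule of the tensor algebra, making equivariance an algebraic property rather than an architectural constraint. It rests on three theoretical pillars: (i) an Eckart-Young optimality guarantee for the $\star_G$-SVD; (ii) a Kronecker factorization that composes multiple symmetries; (iii) a 600-line Lean 4 formalization. The framework offers capabilities that equivariant neural networks cannot: a closed-form decomposition of every prediction and data-driven discovery of the best-fitting symmetry group. On QM9 molecular geometry, decomposing over an octahedral subgroup recovers angular-momentum selection rules, demonstrating data-driven physical discovery.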

详情

AI中文摘要

我们引入了$\star_G$张量代数，在其中任何有限群$G$定义乘法规则，使等变性成为代数属性而非架构约束。该框架基于三个机器验证的理论支柱：(i) $\star_G$-SVD的Eckart-Young最优性保证，是首个对称保持张量近似的结果，精确且多项式时间；(ii) 通过Kronecker分解组合多个对称性，通过将$F_G$替换为$F_{G_1} \otimes F_{G_2}$无需架构重设计；(iii) 600行的Lean~4形式化证明了$\star_G$代数。该框架提供了等变神经网络（ENNs）结构无法实现的能力：每个预测的闭式分解，以及数据驱动发现最佳对称群。作为非平凡的实证演示，分解QM9分子几何的八面体子群恢复了角动量选择规则，仅凭数据而非量子力学输入：标量性质由A$_1$主导，偶极子成分由T$_1$主导，各向异性极化率对l=1不敏感，因为秩2迹分解l=0⊕l=2要求，T$_1$/A$_1$预测能力比将向量可观测量与标量可观测量分离了五倍。在完整的QM9（130,831分子）上，$\star_G$-SVD与岭回归提供闭式预测，参数数量比参数匹配的MLP少50-90倍。代数等变性因此补充架构等变性，不是更快、更好、更便宜的替代方案，而是不同的数学能力：可证明最优的对称保持压缩，每irrep可解释性，以及数据驱动的物理发现。

英文摘要

We introduce the $\star_G$ tensor algebra, in which any finite group $G$ defines the multiplication rule, making equivariance an intrinsic algebraic property rather than an architectural constraint. The framework rests on three machine-verified theoretical pillars: (i)~an Eckart-Young optimality guarantee for the $\star_G$-SVD: the first such result for symmetry-preserving tensor approximation, exact and polynomial-time; (ii)~a Kronecker factorization that composes multiple symmetries by replacing $F_G$ with $F_{G_1} \otimes F_{G_2}$ with no architectural redesign; and (iii)~a 600-line Lean~4 formalization of the $\star_G$ algebra. The framework provides capabilities that equivariant neural networks (ENNs) structurally cannot: a closed-form per-irreducible-representation decomposition of every prediction, and data-driven discovery of the symmetry group that best fits a dataset. As a non-trivial empirical demonstration, decomposing QM9 molecular geometry over the chiral octahedral subgroup of SO(3) recovers the Wigner--Eckart selection rules of angular momentum from data alone, with no quantum mechanical input: scalar properties are A$_1$-dominated, dipole components are T$_1$-dominated, the isotropic polarizability is uniquely insensitive to $l\!=\!1$ as the rank-2-trace decomposition $l\!=\!0 \oplus l\!=\!2$ requires, and the T$_1$/A$_1$ predictive-power ratio separates vector observables from scalar observables by a factor of five. On full QM9 (130{,}831 molecules), $\star_G$-SVD with ridge regression provides closed form predictions at $\sim50-90\times$ fewer parameters than parameter-matched MLPs. Algebraic equivariance thus complements architectural equivariance not as a faster-better-cheaper alternative but as a different mathematical affordance: provably-optimal symmetry-preserving compression, per-irrep interpretability, and data-driven physical discovery.


2605.20425 2026-05-21 cs.AI 版本更新

Decoupling Sampling and Training Budget in CT Body Composition Segmentation

Iason Skylitsis, Dimitrios Karkalousos, Ivana Išgum

Affiliations: Amsterdam University Medical Center; University of Amsterdam; Informatics Institute, Faculty of Science; Department of Radiology; Mayo Clinic Rochester

AI summary: The paper adapts episodic sampling from few-shot learning to address class imbalance in medical image segmentation and, by decoupling sampling from the training budget, improves segmentation performance on small datasets.

详情

AI中文摘要

类别不平衡是医学图像分割中的基本挑战，其中频繁类通常在训练中占主导地位，而稀有类被忽视。基于损失的方法通过在批次内重新加权每个像素的损失来缓解不平衡，而采样策略控制哪些图像进入批次。然而，两者均未明确控制批次中出现的类别，导致稀有类的暴露仅部分平衡。在本文中，我们采用少样本学习中的episodic采样，以在完全监督设置中促进类别平衡的批次构造。我们解耦episodic采样与其传统的度量学习上下文，并在CT身体成分分割中评估其效果。我们在九种肌肉和脂肪组织上，从公共SAROS数据集中提取了210次扫描，将episodic采样与随机和加权采样进行比较。训练是在全数据和低数据模式下进行的，此外在匹配训练迭代预算下也进行了额外比较。在全数据训练中，三种策略表现相当（episodic的平均Dice为0.882，随机和加权为0.878）。在低数据训练中，episodic采样优于随机和加权（0.787 vs. 0.758和0.762），这由训练迭代数的12倍差异驱动。在匹配训练预算下，随机和加权过早过拟合，而episodic在达到平台前提高了约三倍的迭代次数。我们的发现识别了训练迭代预算作为采样策略中被低估的混淆因素，推动了小数据集的迭代感知评估协议。此外，episodic采样的残余优势与隐含的类别平衡批次的正则化效应一致，提供了一种低成本、模型无关的解决医学图像分割类别不平衡问题的策略。代码可在https://github.com/iasonsky/episodic-sampling上获得。

英文摘要

Class imbalance is a fundamental challenge in medical image segmentation, where frequent classes typically dominate training at the expense of rare classes. Loss-based approaches mitigate imbalance by reweighting the per-pixel loss within the batch, while sampling strategies control which images enter the batch. Yet neither explicitly controls which classes appear within the batch, leaving rare-class exposure only partially rebalanced. In this work, we adopt episodic sampling from few-shot learning to promote class-balanced batch construction in a fully supervised setting. We decouple episodic sampling from its conventional metric-learning context and evaluate it on body composition segmentation in CT. We compare episodic sampling against random and weighted sampling on nine muscle and adipose tissues, derived from 210 scans of the public SAROS dataset. Training is performed under full- and low-data regimes, with additional comparisons under matched training iteration budgets. Under full-data training, all three strategies performed comparably (mean Dice 0.882 for episodic, 0.878 for random and weighted). Under low-data training, episodic sampling outperformed random and weighted (0.787 vs. 0.758 and 0.762), driven by a 12-fold difference in training iterations. Under matched training budgets, random and weighted overfit earlier, while episodic improved for approximately three times more iterations before plateauing. Our findings identify the training iteration budget as an under-recognized confound in sampling strategies, motivating iteration-aware evaluation protocols for small datasets. Furthermore, the residual advantage of episodic sampling is consistent with an implicit regularization effect of class-balanced batches, offering a low-cost, model-agnostic strategy for class-imbalanced medical image segmentation. Code is available at https://github.com/iasonsky/episodic-sampling.
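Class-balanced batch construction of the kind the abstract describes can be sketched as below. This is an illustrative assumption about how an episode might be built (pick classes first, then images that contain each class); it is not taken from the linked repository, and the names `build_class_index` and `sample_episode` are hypothetical.

```python
import random


def build_class_index(image_classes):
    """Map each class label to the list of image ids in which it appears."""
    index = {}
    for image_id, classes in image_classes.items():
        for c in classes:
            index.setdefault(c, []).append(image_id)
    return index


def sample_episode(class_index, n_classes, k_per_class, rng):
    """Build one batch with exactly k_per_class entries for each of n_classes classes.

    Classes are chosen first, so a rare class is as likely to enter the
    batch as a frequent one; images are drawn only afterwards.
    """
    chosen = rng.sample(sorted(class_index), n_classes)
    batch = []
    for c in chosen:
        pool = class_index[c]
        if len(pool) >= k_per_class:
            picks = rng.sample(pool, k_per_class)
        else:
            # a rare class with too few images is sampled with replacement
            picks = [rng.choice(pool) for _ in range(k_per_class)]
        batch.extend((image_id, c) for image_id in picks)
    return batch


# Hypothetical toy data: class 1 is frequent, classes 0 and 2 are rare.
image_classes = {"scan_a": {0, 1}, "scan_b": {1}, "scan_c": {2}}
index = build_class_index(image_classes)
batch = sample_episode(index, n_classes=3, k_per_class=2, rng=random.Random(0))
```

Because every episode has the same size regardless of dataset size, the number of iterations per epoch differs from random sampling, which is the training-budget confound the abstract highlights.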


2605.20390 2026-05-21 cs.CV cs.AI cs.LG cs.RO 版本更新

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

STELLAR: 为自动驾驶扩展3D感知大模型

Yingwei Li, Xin Huang, Yang Liu, Yang Fu, Alex Zihao Zhu, Chen Song, Junwen Yao, Anant Subramanian, Hao Xiang, Weijing Shi, Yuliang Zou, Tom Hoddes, Zhaoqi Leng, Govind Thattai, Dragomir Anguelov, Mingxing Tan

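Affiliations: Waymo; UCSD

AI summary: The paper studies large-scale training for autonomous-driving perception systems. By extending the input modalities and training large models, it achieves new state-of-the-art performance on the Waymo Open Dataset challenge.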

详情

AI中文摘要

模型扩展通过在多样化数据集上进行大规模训练已显示出显著的成功。然而，尚不清楚相同的范式是否适用于自动驾驶感知系统，因为存在独特的挑战，如融合异构传感器数据和需要复杂的3D空间理解。为弥合这一差距，我们进行了系统分析，研究了规模对这些系统的影响。我们基于稀疏窗口变换器开发了STELLAR模型，扩展了输入模态，包括LiDAR、雷达、相机和地图先验。我们在一个包含5000万驾驶示例的大规模数据集上训练该模型，参数数量高达5亿。我们的大规模实验揭示了模型性能与模型大小、数据和计算之间的经验扩展趋势。所得到的模型在Waymo Open Dataset挑战中建立了新的状态-of-the-art，大幅超越了先前的成果。我们的工作表明，大规模训练是提升自动驾驶感知模型能力极具前景的路径。

英文摘要

Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm applies to autonomous driving perception systems, which pose unique challenges such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study that systematically analyzes the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state of the art on the Waymo Open Dataset challenge, outperforming prior art by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.


2605.20389 2026-05-21 cs.LG cs.AI 版本更新

Nonlocal operator learning for fMRI encoding and decoding tasks

非局部算子学习用于fMRI编码和解码任务

Andreas Kramer, Saugat Acharya, Alice Giola, Emanuele Zappala

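Affiliations: Department of Computer Science, Idaho State University; Department of Mathematics and Statistics, Idaho State University

AI summary: The paper proposes a neural-integral-operator framework for fMRI encoding and decoding tasks, examines the role of nonlocal spatiotemporal context, and experimentally evaluates how longer temporal windows and visual-cortex versus whole-brain recordings affect performance and latent-space geometry.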

Comments 18 pages, 4 figures, 5 tables. Comments are welcome!

详情

AI中文摘要

功能性磁共振成像（fMRI）数据表现出高维时空结构，使得预测和解码变得具有挑战性。在本工作中，我们研究了基于神经积分算子的模型用于fMRI的编码和解码任务，特别强调非局部时空上下文的作用。我们实现了一个潜在的神经积分算子框架，该框架在辅助空间中执行固定点迭代，通过解码器进行分类和刺激预测。我们在两个开源fMRI数据集上评估了我们的模型。我们的实验检验了从fMRI记录中解码刺激以及从刺激表示中编码fMRI动态。主要关注点是时空上下文的影响：我们系统比较了短和长的时间窗口，以及使用视觉皮层与全脑记录，并分析其对性能和潜在空间几何的影响。在不同任务和数据集中，更长的时间窗口通常会改善结果并产生更具结构化的学习表示。在解码实验中，学习的潜在空间通常比原始数据提供更清晰的类别分离。在编码实验中，尽管由于任务难度绝对性能保持中等，但更长的时间窗口仍能产生一致的改进。这些发现表明，神经积分算子为建模fMRI动态提供了一个有前景的框架，并且更广泛的时空上下文对预测和表示学习都是有益的。更广泛地说，结果表明，利用大脑动态中的分布式非局部结构需要专门设计的模型架构来捕捉此类依赖关系。

英文摘要

Functional MRI data exhibit high-dimensional spatiotemporal structure, making both prediction and decoding challenging. In this work, we investigate neural integral-operator-based models for encoding and decoding tasks in fMRI, with particular emphasis on the role of nonlocal spatiotemporal context. We implement a latent neural integral operator framework that performs fixed-point iterations in an auxiliary space, from which classification and stimulus prediction are performed via a decoder. We evaluate our model on two open-source fMRI datasets. Our experiments examine both decoding of stimuli from fMRI recordings and encoding of fMRI dynamics from stimulus representations. A main focus is the effect of spatiotemporal context: we systematically compare short and long temporal windows, as well as the use of visual cortex vs. whole-brain recordings, and analyze their influence on performance and latent-space geometry. Across tasks and datasets, larger temporal windows generally improve results and produce more structured learned representations. In decoding experiments, the learned latent space often provides clearer class separation than the raw data. In encoding experiments, although absolute performance remains moderate due to the difficulty of the task, longer temporal windows still yield consistent gains. These findings suggest that neural integral operators provide a promising framework for modeling fMRI dynamics and that broader spatiotemporal context can be beneficial for both prediction and representation learning. More broadly, the results indicate that exploiting distributed nonlocal structure in brain dynamics requires model architectures specifically designed to capture such dependencies.


2605.20385 2026-05-21 cs.CV cs.AI 版本更新

DEL: Digit Entropy Loss for Numerical Learning in Large Language Models

Zhaohui Zheng, Chenhang He, Shihao Wang, Yuxuan Li, Ming-Ming Cheng, Lei Zhang

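Affiliations: The Hong Kong Polytechnic University; VCIP, College of Computer Science, Nankai University

AI summary: The paper proposes Digit Entropy Loss (DEL) for autoregressive numerical learning in large language models. It reformulates conventional unsupervised entropy optimization, using digit conditional probability and binary cross-entropy to make the optimization supervised, and generalizes integer-based numerical learning to floating-point numbers to improve number-prediction accuracy.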

详情

AI中文摘要

数字预测是大语言模型（LLMs）在数学问题解决和代码生成中的基本能力。广泛采用的最大似然估计（MLE）用于LLM训练并不适合数字预测。最近，惩罚驱动的方法，例如数字标记损失和离散化距离损失，引入了数字距离的归纳偏置，但分别导致了数字分布过度锐化和过度扁平化。在本文中，我们深入分析了LLM的数值学习，并表明现有的数值学习方法在概念上遵循一个准则-距离公式，其中准则项代表优化模式，距离项灌输几何先验。因此，我们提出了Digit Entropy Loss (DEL)用于自回归数值学习，其重新设计传统无监督熵优化的三个关键设计：利用数字条件概率和二元交叉熵将熵优化引导为监督方式；舍弃距离项以避免数值距离的问题；并将整数基于的数值学习推广到浮点数优化，使数值预测更加准确。我们的DEL公式可以结合整数、小数和小数点，将学习目标从单个数字扩展到浮点数领域。在七个数学推理基准测试中使用四个代表性的LLM，包括CodeLlama、Mistral、DeepSeek和Qwen-2.5，进行实验，结果表明DEL在整体预测准确性和数值距离方面均优于其替代方法。源代码在https://github.com/PolyU-VCLab/DEL。

英文摘要

Number prediction is a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we conduct an in-depth analysis of LLM numerical learning and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents the optimization pattern and the distance term instills a geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization through three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization in a supervised manner; dropping the distance term to bypass the issue of numerical distance; and generalizing integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source code is available at https://github.com/PolyU-VCLab/DEL.


2605.20368 2026-05-21 cs.CR cs.AI 版本更新

Security Document Classification with a Fine-Tuned Local Large Language Model: Benchmark Data and an Open-Source System

使用微调的本地大语言模型进行安全文档分类：基准数据和开源系统

Ivan Dobrovolskyi

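Affiliations: MS in AI & ML Engineering, Independent Researcher, Sunnyvale, CA, USA

AI summary: The study presents TorchSight, an open-source local system that classifies security documents with a fine-tuned Qwen 3.5 27B model. The model is trained on 78,358 samples drawn from 13 permissively licensed sources and GPT-4 synthetic data, reaches 95.0% category-level accuracy on the 1,000-document main evaluation, and supports the viability of local models for security document classification.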

详情

AI中文摘要

扫描敏感信息文档的组织面临实际问题。云服务要求数据发送到外部基础设施，而基于规则的工具往往遗漏依赖上下文的威胁。本研究提出了TorchSight，一个围绕微调后的Qwen 3.5 27B模型构建的开源本地系统，用于安全文档分类。模型在78,358个样本和覆盖七个安全类别和51个子类别的GPT-4合成数据上进行训练。在主要评估1,000份文档上，模型达到95.0%的类别级准确率（95%置信区间：93.5-96.2）。在相同提示协议下，测试的商业模型得分为75.4-79.9%。在单独的外部500个保留样本集上，模型达到93.8%的准确率，表明性能超越了主要基准，尽管性能差异取决于数据集组成和困难边界情况。结果表明，微调的本地模型可以在保持文档处理本地控制的同时支持准确的安全文档分类。

英文摘要

Organizations that scan documents for sensitive information face a practical problem. Cloud services require data to be sent to external infrastructure, while rule-based tools often miss threats that depend on context. This study presents TorchSight, an open-source local system for security document classification built around a fine-tuned Qwen 3.5 27B model. The model was trained on 78,358 samples from 13 permissively licensed sources and GPT-4 synthetic data covering seven security categories and 51 subcategories. In the main evaluation on 1,000 documents, the model reached 95.0% category-level accuracy (95% confidence interval: 93.5-96.2). The tested commercial models scored 75.4-79.9% under the same prompting protocol. On a separate external set of 500 held-out samples, the model reached 93.8% accuracy, which suggests that performance extends beyond the main benchmark, although the margin depends on dataset composition and difficult boundary cases. The results show that a fine-tuned local model can support accurate security document classification while keeping document processing under local control.
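The reported interval can be checked with a short calculation. The abstract does not say which interval method was used; the sketch below assumes a Wilson score interval with z = 1.96 and 950 correct out of 1,000 documents (95.0%), and under that assumption it reproduces the reported 93.5-96.2 range.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 95.0% category-level accuracy on 1,000 documents, as stated in the abstract.
low, high = wilson_interval(950, 1000)
print(f"{100 * low:.1f}-{100 * high:.1f}")
```

The Wilson interval is asymmetric around 95.0% because the proportion is close to 1, which a simple normal-approximation interval would not capture.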


2605.20357 2026-05-21 cs.LG cs.AI 版本更新

Consistently Informative Soft-Label Temperature for Knowledge Distillation

一致信息软标签温度用于知识蒸馏

Hoang-Chau Luong, Nghia Van Vo, Kaiqi Zhao, Lingwei Chen

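Affiliations: Rochester Institute of Technology; Oakland University

AI summary: The paper proposes CIST, which assigns sample-wise adaptive temperatures to the teacher and the student. This addresses two problems of conventional fixed-temperature designs, inconsistent entropy of the teacher's soft labels and overly strict teacher-student logit scale matching, and thereby improves knowledge distillation.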

详情

Less Data, Faster Training: Repeating a Smaller Dataset Accelerates Learning Through Sampling Bias

Jingwen Liu, Ezra Edelman, Surbhi Goel, Bingbin Liu

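Affiliations: Columbia University; University of Pennsylvania; Harvard University

AI summary: The study examines the "small-vs-large data gap", in which repeatedly training on fewer samples is more compute-efficient than training on a larger dataset. The speedup arises through layer-wise growth and a sampling-bias mechanism, which supplies a new inductive bias for optimization.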

Comments ICML 2026

2605.20309 2026-05-21 cs.CV cs.AI 版本更新

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

Tiny-Engram: 生成视觉中的触发索引概念表

Runyuan Cai, Yiming Wang, Yu Lin, Xiaodong Zeng

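Affiliations: AutoArk-AI

AI summary: The paper proposes Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary, enabling control over concepts in frozen image and video generators. Each concept is parameterized by entries indexed by registered n-gram matches and modulates text-encoder hidden states only within the matched trigger region, binding a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt.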

详情

AI中文摘要

当前生成视觉模型的个性化方法通常通过连续适配器或权重更新来编码新概念，但对是否以及何时检索概念的控制有限。在本工作中，我们引入Tiny-Engram，一种紧凑的触发索引概念表，为冻结的图像和视频生成器中的视觉记忆提供显式的词汇地址和激活边界。Tiny-Engram将每个概念参数化为一组小的记忆条目，这些条目通过注册的n-gram匹配进行索引，仅在匹配的触发区域调节文本编码器的隐藏状态。在该词汇支持之外，条件路径与冻结的基础模型相同。在单编码器潜在扩散和多编码器扩散-变压器骨干结构上，这种公式将罕见触发短语绑定到目标身份，同时保持周围提示的组合控制。我们进一步在文本条件的视频生成设置中评估相同的表式记忆，其中触发路径可靠地改变生成的主题，但保持在排除的视频提示中精细的身份持续性仍然有限。综合来看，这些结果表明，小型、显式地址的概念表是实现模块化视觉个性化的一种实用途径，尤其在图像生成中证据最强。对于视频扩散，剩余的差距指向更广泛的需求：时间稳定的身份可能依赖于文本侧记忆与不断演变的视觉状态之间的更紧密耦合，这促使未来在记忆注入方面的工作超越文本条件接口。

英文摘要

Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.


2605.20308 2026-05-21 cs.CV cs.AI cs.LG 版本更新

SDM: A Powerful Tool for Evaluating Model Robustness

SDM：评估模型鲁棒性的强大工具

Xinlei Liu, Tao Hu, Jichao Xie, Peng Yi, Hailong Ma, Baolin Li

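Affiliations: Information Engineering University, Zhengzhou, China; Key Laboratory of Cyberspace Endogenous Safety & Security of Henan Province, Zhengzhou, China; Key Laboratory of Cyberspace Security, Ministry of Education of China, Zhengzhou, China; Songshan Laboratory, Zhengzhou, China

AI summary: The paper proposes SDM, a new gradient-based attack that redefines the objective of adversarial example generation to resolve the performance degradation caused by "high-loss non-adversarial examples" in conventional methods, and experiments show advantages in attack performance and cost efficiency.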

Comments 16 pages

Journal ref Forty-third International Conference on Machine Learning (ICML 2026)

2605.20300 2026-05-21 cs.LG cs.AI 版本更新

Robust Subspace-Constrained Quadratic Models for Low-Dimensional Structure Learning

鲁棒的子空间约束二次模型用于低维结构学习

Zheng Zhai, Xiaohui Li

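Affiliations: Department of Statistics, Faculty of Arts and Sciences, Beijing Normal University, Zhuhai

AI summary: The paper proposes a robust subspace-constrained quadratic model (SCQM) for learning low-dimensional structure from high-dimensional data. Built on the subspace-constrained quadratic matrix factorization (SQMF) framework, it accommodates a broad class of noise distributions, including generalized Gaussian and radial Laplace models, and therefore performs reliably under both heavy-tailed and light-tailed noise. The resulting nonconvex problem is solved with a gradient-based algorithm that uses backtracking line search, and the paper adds a sensitivity analysis of the $\ell_p^p$ and $\ell_2$ losses together with numerical experiments on robustness and reconstruction accuracy.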

详情

AI中文摘要

在本文中，我们提出了一种鲁棒的子空间约束二次模型（SCQM），用于从高维数据中学习低维结构。基于子空间约束二次矩阵分解（SQMF）框架，所提出的模型能够适应广泛噪声分布，包括广义高斯和径向拉普拉斯模型。这种泛化能力使该方法在重尾和轻尾噪声下均能保持稳定性能，从而在不同数据场景下显著提高了鲁棒性。为高效解决由此产生的非凸优化问题，我们开发了一种基于梯度的算法，配备回溯线搜索策略以确保稳定和高效的收敛。此外，我们还对$\ell_p^p$和$\ell_2$损失函数进行了敏感性分析，阐明了它们在不同噪声特性下的不同行为。大量数值实验验证了理论分析，并展示了所提方法在鲁棒性和重建准确性方面优于现有方法。

英文摘要

In this paper, we propose a robust subspace-constrained quadratic model (SCQM) for learning low-dimensional structure from high-dimensional data. Building upon the subspace-constrained quadratic matrix factorization (SQMF) framework, the proposed model accommodates a broad class of noise distributions, including generalized Gaussian and radial Laplace models. This generalization enables reliable performance under both heavy-tailed and light-tailed noise, thereby substantially enhancing robustness across diverse data regimes. To efficiently address the resulting nonconvex optimization problem, we develop a gradient-based algorithm equipped with a backtracking line-search strategy that ensures stable and efficient convergence. In addition, we present a sensitivity analysis of the $\ell_p^p$ and $\ell_2$ loss functions, elucidating their distinct behaviors under varying noise characteristics. Extensive numerical experiments corroborate the theoretical analysis and demonstrate that the proposed approach consistently outperforms existing methods in terms of robustness and reconstruction accuracy.
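The abstract mentions a gradient-based algorithm with a backtracking line-search strategy. A generic version of that idea is sketched below on a toy quadratic, not on the SCQM objective. The Armijo sufficient-decrease condition and all parameter values are assumptions chosen for illustration.

```python
def backtracking_gradient_descent(f, grad, x0, step0=1.0, shrink=0.5,
                                  c=1e-4, max_iters=200, tol=1e-12):
    """Gradient descent where each step size is found by backtracking.

    Starting from step0, the step is multiplied by `shrink` until the
    Armijo condition f(x - t*g) <= f(x) - c * t * ||g||^2 holds.
    """
    x = list(x0)
    for _ in range(max_iters):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq < tol:
            break  # gradient is numerically zero: stationary point reached
        fx = f(x)
        step = step0
        while f([xi - step * gi for xi, gi in zip(x, g)]) > fx - c * step * g_norm_sq:
            step *= shrink
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Toy objective with minimizer (3, -1); stands in for the real loss.
def toy_loss(x):
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] + 1.0) ** 2


def toy_grad(x):
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]


solution = backtracking_gradient_descent(toy_loss, toy_grad, [0.0, 0.0])
```

Backtracking removes the need to hand-tune a fixed learning rate, which matters for nonconvex losses such as $\ell_p^p$ where the local curvature can vary widely.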


2605.20299 2026-05-21 cs.LG cs.AI cs.RO 版本更新

Mechanisms of Misgeneralization in Physical Sequence Modeling

物理序列建模中泛化错误的机制

Kento Nishi, Raphael Tang, Karun Kumar, Core Francisco Park, Hidenori Tanaka

Affiliations: Harvard College; Harvard John A. Paulson School of Engineering and Applied Sciences; Comcast AI; CBS-NTT Program in Physics of Intelligence, Harvard University; Physics of Artificial Intelligence Group, NTT Research, Inc., Sunnyvale, CA, USA; Microsoft

AI summary: The paper studies physical misgeneralization in physical sequence modeling, which arises when local errors propagate through a physical measurement. It uses a data deviation kernel to predict which physical quantities gain or lose mass and proposes a kernel-informed intervention.

Comments Preprint. kentonishi.com/physical-misgeneralization

详情

AI中文摘要

生成序列模型通常用于在物理领域规划运动，从机器人到机械模拟。在构建训练此类模型的数据集时，工程师可能会选择演示来指定轨迹在物理量如旅行距离或机械能上的分布。例如，一个构建迷宫导航代理的机器人工程师可能会选择旅行距离覆盖固定范围的演示，希望限制代理的预期功率使用。我们发现标准深度学习可以违反这一意图：每个生成的轨迹在单独看来都合理，但物理量的总体分布是错误的。我们将这种失败称为物理泛化错误，并发展了其机制。通过受控的合成任务，我们发现物理泛化错误出现在局部误差典型于模型类通过物理测量传播到恢复分布时。我们用数据偏差核估计这些误差，并利用它来预测在我们的合成任务和更应用的迷宫导航和双摆运动任务中哪些物理量获得或失去质量。最后，我们的机制性解释有助于识别哪些缓解策略在结构上具有前景，并利用它提出了一种基于核的干预。

英文摘要

Generative sequence models are often trained to plan motion in physical domains, from robotics to mechanical simulations. When constructing a dataset to train such a model, engineers may curate demonstrations to specify how trajectories should be distributed over a physical quantity like travel distance or mechanical energy. For example, a roboticist building a maze navigation agent might choose demonstrations whose travel distances cover a fixed range uniformly, hoping to constrain the agent's expected power usage. We find that standard deep learning can violate this intent: each generated trajectory can seem plausible on its own, but the aggregate distribution over the physical quantity is wrong. We call this failure physical misgeneralization, and develop an account of its mechanism. Using controlled synthetic tasks, we show that physical misgeneralization arises when local errors typical of the model class propagate through the physical measurement to shift the recovered distribution. We estimate these errors with a data deviation kernel, and we use it to predict which physical quantities gain or lose mass in both our synthetic and more applied maze navigation and double-pendulum motion tasks. Finally, our mechanistic interpretation helps identify which mitigation strategies are structurally promising, and we use it to propose a kernel-informed intervention.

2605.20296 2026-05-21 cs.LG cs.AI [updated version]

Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining

Aarash Abro, Muhammad Tahir

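Affiliations: Zeta Labs; Lahore University of Management Sciences

AI summary: This work studies how fine-tuning a language model on a target task degrades capabilities that the training data never explicitly threatened. It proposes a post-hoc repair method that uses only the pretrained and fine-tuned checkpoints, applying spectral repair to recover damaged capabilities while preserving target-task gains.

AI abstract (translated from Chinese)

Fine-tuning a language model on a target task often degrades capabilities that the training data never explicitly threatened. We study this phenomenon, known as catastrophic forgetting, and propose a post-hoc repair that uses only the pretrained checkpoint W_base and its fine-tuned descendant W_ft. The goal is not simply to roll the model back to the base checkpoint, but to recover the capabilities that fine-tuning damaged while keeping the target-task gains and any beneficial held-out improvements. We introduce DG-Hard, a checkpoint-only spectral repair method for the fine-tuning update Δ = W_ft - W_base. DG-Hard treats Δ as a low-rank, task-aligned signal embedded in an IID-like noise residual that gradient descent has no incentive to remove, and applies the Donoho-Gavish hard singular-value threshold to each weight-delta matrix, keeping the structured high-energy part of the update and removing the spectral bulk. Repair thus reduces to a closed-form SVD filtering step with no data-dependent tuning. A central difficulty is evaluation: average accuracy hides per-benchmark failures, while naive recovery scores reward models that simply revert to the base. We therefore introduce a partition-conditional metric that separately tracks healing, preservation, non-damage, and target-task retention. Across 14 (model, task) settings and nine cross-domain held-out benchmarks, DG-Hard achieves the strongest balanced repair among post-hoc baselines. DG-Hard also restores safety alignment degraded by benign fine-tuning along three independent safety axes, despite using no alignment data. These results suggest that part of the capability loss caused by fine-tuning is not an unavoidable consequence of specialization but a removable spectral residue in the weight update itself.

English abstract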

Fine-tuning a language model for a target task routinely degrades capabilities the training data never explicitly threatened. We study this phenomenon, known as catastrophic forgetting, and propose a post-hoc repair solution that uses only the pretrained checkpoint $W_{\mathrm{base}}$ and its fine-tuned descendant $W_{\mathrm{ft}}$. The goal is not merely to revert the model toward the base checkpoint, but to recover capabilities damaged by fine-tuning while preserving both the target-task gains and any beneficial held-out improvements. We introduce DG-Hard, a checkpoint-only spectral repair method for the fine-tuning update $Δ= W_{\mathrm{ft}} - W_{\mathrm{base}}$. DG-Hard treats $Δ$ as a low-rank task-aligned signal embedded in an IID-like noise residual that gradient descent has no incentive to remove, and applies the Donoho-Gavish hard singular-value threshold to each weight-delta matrix, keeping the structured high-energy part of the update and removing the spectral bulk. This reduces repair to a closed-form SVD filtering step requiring no data-dependent tuning. A central difficulty is evaluation: average accuracy hides per-benchmark failures, while naive recovery scores reward models that simply revert toward the base. We therefore introduce a partition-conditional metric that separately tracks healing, preservation, non-damage, and target-task retention. Across $14$ (model, task) settings and nine cross-domain held-out benchmarks, DG-Hard achieves the strongest balanced repair among post-hoc baselines. DG-Hard also restores safety alignment degraded by benign fine-tuning on three independent safety axes, despite using no alignment data. These results suggest that part of fine-tuning-induced capability loss is not an unavoidable consequence of specialization, but a removable spectral residue in the weight update itself.
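
The thresholding rule described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the singular values of one weight-delta matrix have already been computed elsewhere, it uses the unknown-noise form of the Gavish-Donoho rule (threshold = ω(β) times the median singular value) with the published cubic approximation to ω(β), and the function names and the sample spectrum are hypothetical.

```python
from statistics import median

def omega_approx(beta):
    """Cubic approximation to the Gavish-Donoho coefficient (unknown noise level).

    beta is the aspect ratio, shorter side / longer side, so beta lies in (0, 1].
    """
    return 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43

def hard_threshold_singular_values(singular_values, n_rows, n_cols):
    """Zero every singular value at or below omega(beta) * median(singular_values)."""
    beta = min(n_rows, n_cols) / max(n_rows, n_cols)
    tau = omega_approx(beta) * median(singular_values)
    return [s if s > tau else 0.0 for s in singular_values]

# Hypothetical spectrum of a 7x7 update: two strong directions above a noise bulk.
spectrum = [10.0, 8.0, 1.1, 1.0, 1.0, 1.0, 0.9]
filtered = hard_threshold_singular_values(spectrum, 7, 7)
```

In this reading, the repaired weights would be W_base plus the update rebuilt from the surviving singular triplets; the SVD and the reconstruction are left out of the sketch to keep it dependency-free.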

2605.20295 2026-05-21 cs.LG cs.AI [updated version]

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

Jinghe Zhang, Daliang Xu, Chenghua Wang, Weikai Xie, Tao Qi, Yun Ma, Mengwei Xu, Gang Huang

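Affiliations: Qualcomm

AI summary: This paper proposes Quant.npu, a framework that enables efficient mobile NPU inference through fully static quantization. It addresses the incompatibility of conventional post-training quantization with NPU hardware constraints and, on real mobile NPUs, achieves high accuracy with lower inference latency.

AI abstract (translated from Chinese)

Large language models (LLMs) are increasingly deployed on mobile devices, where neural processing units (NPUs) require fully static quantization for optimal inference efficiency. Existing post-training quantization (PTQ) methods, however, rely mainly on dynamic activation quantization, which makes them incompatible with NPU hardware constraints. To bridge the gap between high-fidelity PTQ and NPU-constrained inference, we propose Quant.npu, an integer-only, fully static quantization framework. It combines learnable quantization parameters with rotation matrices, enabling low-bit activation-weight quantization without recomputing quantization parameters at run time. Crucially, we find that the initialization and selective optimization of quantization parameters are essential for stable optimization, because improper initialization and naive joint optimization cause gradient instability that disrupts the optimization of the rotation matrices. We therefore propose a rotation- and bit-width-aware initialization tailored to different activation profiles, and a distribution-aware selective optimization (a two-stage quantization pipeline) for rotated and unrotated tensors. We also introduce a sensitivity-guided adaptive mixed-precision scheme to balance accuracy and inference efficiency. Extensive experiments on real mobile NPUs show that Quant.npu matches state-of-the-art methods in accuracy while reducing inference latency by up to 15.1%.

English abstract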

Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-training quantization (PTQ) methods predominantly rely on dynamic activation quantization, rendering them incompatible with NPU hardware constraints. To bridge the gap between high-fidelity PTQ and NPU-constrained inference, we propose Quant.npu, an integer-only fully static quantization framework. It incorporates learnable quantization parameters and rotation matrices, enabling low-bit activation-weight quantization without recomputing quantization parameters at run time. Crucially, we identify that initialization and selective optimization of quantization parameters are pivotal for optimization stability, as improper initialization and naive joint optimization induce gradient instability that disrupts the optimization of rotation matrices. To address this, we propose a rotation-and-bit-width-aware initialization tailored to diverse activation profiles and a distribution-aware selective optimization (two-stage quantization pipeline) tailored to rotated and unrotated tensors. Furthermore, we introduce a sensitivity-guided adaptive mixed-precision scheme to balance accuracy with inference efficiency. Extensive experiments on real-world mobile NPUs demonstrate that Quant.npu achieves comparable accuracy to state-of-the-art methods, while reducing inference latency by up to 15.1%.
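
To make the static/dynamic distinction concrete, here is a minimal sketch. It is illustrative only and is not Quant.npu's pipeline: symmetric int8 quantization, the helper names, and the sample values are all assumptions. In the dynamic variant the scale is recomputed from each incoming tensor at run time; in the static variant it is fixed ahead of time, which is the property the abstract says NPUs require.

```python
def quantize_static(values, scale, qmin=-128, qmax=127):
    """Quantize with a scale fixed before deployment (no runtime statistics)."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def quantize_dynamic(values, qmax=127):
    """Quantize with a scale recomputed from the tensor itself at run time."""
    # Fall back to a scale of 1.0 when the tensor is all zeros.
    scale = max(abs(v) for v in values) / qmax or 1.0
    return quantize_static(values, scale), scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

# With a fixed scale of 0.5, the outlier 100.0 saturates at the int8 maximum.
static_codes = quantize_static([0.0, 1.0, -1.0, 100.0], scale=0.5)
```

The saturated outlier shows why a fixed scale has to be chosen carefully: a poorly initialized static scale either clips large activations or wastes resolution on small ones.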

2605.20293 2026-05-21 cs.LG cs.AI cs.NE [updated version]

Closed-form predictive coding via hierarchical Gaussian filters

Aleksandrs Baskakovs, Sylvain Estebe, Kenneth Enevoldsen, Kristoffer Nielbo, Chris Mathys, Nicolas Legrand

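Affiliations: Center for Humanities Computing; Aarhus University; Interacting Minds Center

AI summary: This paper implements predictive coding with hierarchical Gaussian filters, restoring precision-weighted message passing to obtain dynamic uncertainty estimates and Hebbian-compatible update rules. Activations, weights, and precisions are learned together under a single free-energy objective, with no global error signal and no iterations or automatic differentiation.

AI abstract (translated from Chinese)

Predictive coding (PC) offers a local, biologically grounded alternative to backpropagation for training artificial neural networks, but it remains slower, and its performance drops sharply as network depth grows. We trace both problems to one simplification: current PC networks fix the precision matrix to the identity, discarding the precision-weighted prediction errors that the variational derivation needs in order to be fast, local, and Bayesian. We express predictive coding networks as deep hierarchical Gaussian filters (HGFs) and restore precision-weighted message passing, which yields dynamic uncertainty estimates and Hebbian-compatible update rules at every layer. The resulting networks learn activations, weights, and precisions together under a single free-energy objective, need no global error signal, and perform inference without iterations or automatic differentiation. On FashionMNIST, our solution approaches backpropagation in per-epoch wall-clock cost while converging in fewer epochs, and it outperforms backpropagation on online, data-efficiency, and concept-drift tasks. We thus show that closed-form variational inference combined with online precision learning provides a tractable foundation for deep predictive coding networks, retaining the biological and interpretive advantages without iterative relaxation or global error signals.

English abstract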

Predictive coding (PC) offers a local and biologically grounded alternative to backpropagation in the training of artificial neural networks, yet to date it remains slower, and its performance degrades sharply as network depth increases. We trace both problems to a single simplification: current PC networks fix the precision matrix to the identity, discarding precision-weighted prediction errors that the variational derivation requires to be fast, local, and Bayesian. We close this gap by expressing predictive coding networks as deep hierarchical Gaussian filters (HGFs) and restoring precision-weighted message passing, yielding dynamic uncertainty estimates and Hebbian-compatible update rules at every layer. The resulting networks can simultaneously learn activations, weights, and precisions under a single free-energy objective, with no global error signal, and resolve inference without requiring iterations or automatic differentiation. On FashionMNIST, our solution approaches backpropagation in epoch-level wall-clock cost while converging in fewer epochs, and outperforms it on online, data-efficiency, and concept-drift tasks. We thus establish that closed-form variational inference with online precision learning provides a tractable foundation for deep predictive coding networks, retaining biological and interpretative advantages, without requiring iterative relaxation or global error signals.
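
The role of precision weighting can be illustrated with the simplest closed-form case: a single Gaussian belief updated by one observation. This is the textbook conjugate update, offered as a sketch and not as the HGF update equations from the paper; the function name and the numbers are assumptions.

```python
def precision_weighted_update(mu, pi_prior, x, pi_obs):
    """One-step Gaussian belief update driven by a precision-weighted prediction error."""
    prediction_error = x - mu
    pi_post = pi_prior + pi_obs  # precisions (inverse variances) add
    mu_post = mu + (pi_obs / pi_post) * prediction_error
    return mu_post, pi_post

# Equal precisions: the belief moves halfway toward the observation.
balanced = precision_weighted_update(mu=0.0, pi_prior=1.0, x=2.0, pi_obs=1.0)
# A confident prior (high precision) moves much less for the same error.
confident = precision_weighted_update(mu=0.0, pi_prior=9.0, x=2.0, pi_obs=1.0)
```

The update is a single closed-form step with no inner relaxation loop, and the step size depends on the ratio of precisions rather than on a fixed learning rate.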

2605.20289 2026-05-21 cs.LG cs.AI [updated version]

Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers

Xinzhe Yuan, Xiang Peng, Bin Gu, Huan Xiong

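Affiliations: IASM, Harbin Institute of Technology, China; School of Artificial Intelligence, Jilin University, China

AI summary: This paper proposes a plug-and-play framework that decomposes Transformer nonlinearities into three primitives (division, exponentiation, and the ℓ2 norm) and realizes spike-friendly approximations with LIF neuron populations and lightweight bit-shift scaling, supporting common Transformer nonlinearities without fine-tuning.

Comments Accepted to ICML 2026. 9 pages main paper, 8 pages appendix, 6 figures, 5 tables. Correspondence to Bin Gu and Huan Xiong

AI abstract (translated from Chinese)

ANN-to-SNN conversion offers a practical, training-free route to spiking large language models. Current pipelines, however, focus mainly on spike-driven implementations of Transformer linear-algebra operations and offer limited support for key nonlinear operators. This gap limits compatibility with neuromorphic-style execution constraints, because such nonlinearities typically require division, exponentiation, or norm computations that standard leaky integrate-and-fire dynamics do not naturally support. To address this, we propose a plug-and-play framework that implements spike-friendly approximations of Transformer nonlinearities and integrates into existing ANN-to-SNN pipelines. Our method decomposes these nonlinear computations into three recurring primitives (division, exponentiation, and the ℓ2 norm) and realizes them through population computation with groups of LIF neurons, combined with lightweight bit-shift scaling to avoid floating-point arithmetic. By composing these primitives as modular operator blocks, the framework supports common Transformer nonlinearities (for example Softmax, SiLU, and normalization) without any fine-tuning. Experiments on a range of LLM Transformers show that selectively replacing the targeted nonlinear operators costs less than 1% accuracy on every evaluated task.

English abstract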

ANN-to-SNN conversion offers a practical, training-free route to spiking large language models. However, current pipelines primarily focus on spike-driven realizations for Transformer linear-algebra operations, while providing limited support for key nonlinear operators. This gap limits compatibility with neuromorphic-style execution constraints, where such nonlinearities typically require division, exponentiation, or norm computations that are not naturally supported by standard leaky integrate-and-fire dynamics. To solve this problem, we propose a plug-and-play framework that implements spike-friendly approximations for Transformer nonlinearities and integrates into existing ANN-to-SNN pipelines. Our method decomposes these nonlinear computations into three recurring primitives (division, exponentiation, and $\ell_2$ norms) and realizes them via population computation using LIF neuron groups, combined with lightweight bit-shift scaling to avoid floating-point arithmetic. By composing these primitives as modular operator blocks, our framework supports common Transformer nonlinearities (e.g., Softmax, SiLU, and normalization) without any fine-tuning. Experiments on a range of LLM Transformers show that selectively replacing the targeted nonlinear operators incurs less than a $1\%$ accuracy drop across all evaluated tasks.

2605.20287 2026-05-21 cs.LG cs.AI cs.CV [updated version]

FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction

Haoyi Zhang, Kairong Guo, Bojie Zhang, Yibo Lin, Runsheng Wang

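Affiliations: School of Integrated Circuits, Peking University, Beijing, China

AI summary: This paper proposes FusionCell, which fuses layout geometry and netlist topology through cross-attention to improve standard-cell performance prediction, capturing the coupling and layout-dependent effects that earlier methods miss by ignoring layout geometry.

AI abstract (translated from Chinese)

Standard cells are the basic building blocks of digital circuits, and their delay and power critically affect chip-level performance. Yet their characterization still relies on slow simulation sweeps, and many fast predictors ignore layout geometry, missing coupling and layout-dependent effects. The challenge is to represent layout geometry and netlist topology jointly, so that a model captures fine-grained spatial detail together with structural connectivity for accurate performance prediction. We introduce FusionCell, a dual-modality predictor that takes routed layout geometry and netlist topology as inputs and fuses them explicitly in a unified model. A DeiT encoder processes three-layer routed layouts, while a graph transformer models heterogeneous device/net graphs. The modalities are integrated through a topology-guided mechanism in which the netlist acts as a structural map that actively queries relevant physical regions of the layout, enabling joint geometric and topological reasoning. We build a 7nm dataset based on the ASAP7 PDK, using automatic tools to generate more than 19,500 cells spanning 149 types, and target six metrics: signal rise/fall delay, transition, and power. Experiments show that FusionCell reduces regression error, with an average MAPE of 0.92 percent, and improves Spearman/Kendall ranking over baselines, while accelerating characterization by orders of magnitude compared with circuit simulation.

English abstract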

Standard cells form the building blocks of digital circuits, so their delay and power critically influence chip-level performance; yet characterization still relies on slow simulation sweeps, and many fast predictors ignore layout geometry, missing coupling and layout-dependent effects. The challenge is to jointly represent layout geometry and netlist topology so models capture fine-grained spatial details together with structural connectivity for accurate performance prediction. We introduce FusionCell, a dual-modality predictor that treats routed layout geometry and netlist topology as inputs and fuses them explicitly in a unified model. A DeiT encoder processes three-layer routed layouts, while a graph transformer models heterogeneous device/net graphs. The modalities are integrated through a topology-guided mechanism, where the netlist acts as a structural "map" to actively query relevant physical regions in the layout for joint geometric and topological reasoning. We build a 7nm dataset based on the ASAP7 PDK with over 19.5k cells spanning 149 types using automatic tools, targeting six metrics: signal rise/fall delay, transition, and power. Experimental results demonstrate that FusionCell reduces regression error, with an average MAPE of 0.92 percent, and improves Spearman/Kendall ranking over baselines, while accelerating the characterization process by orders of magnitude compared to circuit simulation.

2605.20285 2026-05-21 cs.LG cs.AI [updated version]

Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

Brandon Cui, Ximing Lu, Jaehun Jung, Syeda Nahida Akter, Hyunwoo Kim, Yuxiao Qu, David Acuna, Shrimai Prabhumoye, Yejin Choi, Prithviraj Ammanabrolu

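Affiliations: NVIDIA; University of Washington; Carnegie Mellon University; UC San Diego

AI summary: This paper proposes Introspective Training (IXT), which uses the dynamics of later training stages to improve earlier ones and so scale LLM training more efficiently. Experiments show gains in both compute efficiency and performance.

AI abstract (translated from Chinese)

We study how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our core intuition is that the dynamics of later stages (for example post-training) can inform earlier stages (for example pre-training). To this end, we propose Introspective Training (IXT), inspired by offline reward-conditioned reinforcement learning and applicable to any stage of training. IXT uses a thinking reward model to annotate data with natural-language critique feedback, enabling quality-aware training from the earliest stages of the pipeline. Models are then trained by prefix-conditioning the data on the generated feedback, which ensures that not all tokens are treated equally from early in training. Comprehensive experiments on 7.5-12B dense transformer-based LLMs show that our method bends scaling curves, giving up to 2.8x better compute efficiency in general, and reaches performance levels in domains such as math and code that other training methods cannot reach.

English abstract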

We tackle the question of how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our guiding intuition is that the dynamics of later stages of the pipeline, e.g. post-training, can be used to inform earlier stages such as pre-training. To this end, we propose Introspective Training (or IXT), inspired by offline reward-conditioned reinforcement learning and applicable to any stage of training. IXT uses a thinking reward model to annotate data with natural-language, critique-based feedback, enabling quality-aware training from the earliest stages of the pipeline. Models are then trained by prefix-conditioning the data with the generated feedback, ensuring that not all tokens are treated equally starting much earlier in training than usual. Comprehensive experiments on 7.5-12B transformer-based dense LLMs, trained from scratch on up to 18 trillion tokens, show that our method bends scaling curves, resulting in up to 2.8x more compute efficiency generally, and reaches performance levels unachievable for models trained otherwise in domains such as math and code.

2605.20284 2026-05-21 cs.CV cs.AI cs.LG [updated version]

JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

Hyunju Kang, Woohyun Lee, Jaewon Kim, Hogun Park

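Affiliations: Sungkyunkwan University; Seoul National University

AI summary: This paper proposes JUDO, a framework that strengthens multimodal reasoning with domain knowledge and context to address the lack of domain knowledge in models for industrial anomaly detection. Experiments show it outperforms Qwen2.5-VL-7B and GPT-4o on the MMAD benchmark.

Comments Published at ICLR 2026

AI abstract (translated from Chinese)

Industrial anomaly detection has benefited substantially from large multimodal models (LMMs), which extend capabilities beyond detection alone, in particular by improving image understanding through visually grounded reasoning. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work we propose JUDO, the Juxtaposed Domain-Oriented Multimodal Reasoner, a visual and textual reasoning framework that efficiently incorporates domain knowledge and context. In visual reasoning, the model segments the defect region by juxtaposing the query image with normal images, enabling fine-grained comparative visual inspection. We further inject domain knowledge through supervised fine-tuning (SFT) to strengthen context understanding, and then guide domain reasoning through reinforcement learning (GRPO), adopting a domain-oriented reasoning process. Experiments show that JUDO performs strongly on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhanced domain knowledge and context for effective reasoning in anomaly understanding.

English abstract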

Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.

2605.20277 2026-05-21 cs.CV cs.AI [updated version]

Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

Tianwei Lin, Zhongwei Qiu, Jie Cao, Jiang Liu, Wenjie Yan, Bo Zhang, Yu Zhong, Wenqiao Zhang, Yingda Xia, Ling Zhang

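Affiliations: Zhejiang University; DAMO Academy, Alibaba Group; Hupan Lab; University of Electronic Science and Technology of China

AI summary: This paper proposes Trajectory-Integral Feedback GRPO (TIF-GRPO) to improve medical vision-language models on 3D CT analysis. It introduces the Clinical Abnormality Benchmarking Substrate (CABS) to address the mismatch between optimization objectives and clinical rigor, improving abnormality detection and clinical accuracy.

AI abstract (translated from Chinese)

Medical vision-language models (VLMs) have rapidly developed into general-purpose multimodal assistants, but their use in 3D computed tomography (CT) analysis is still limited by a persistent mismatch between optimization objectives and clinical rigor. Current reinforcement learning (RL) paradigms still rely on lexical proxy signals, which leads to "evaluation hallucinations": models optimize linguistic fluency rather than factual clinical correctness, producing diagnostically critical errors. To bridge this gap, we introduce the Clinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a "mechanistic divergence" in standard RL, in which surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose Trajectory-Integral Feedback GRPO (TIF-GRPO), a new framework that integrates control-theoretic principles into policy optimization. By modeling clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards through an integral feedback loop that treats persistent omissions as cumulative state error and hallucinations as excessive control effort. On 3D CT benchmarks, our method significantly improves abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available on GitHub.

English abstract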

Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce "Evaluation Hallucinations", where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the Clinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a "Mechanistic Divergence" in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose Trajectory-Integral Feedback GRPO (TIF-GRPO), a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available on GitHub at https://github.com/ZJU4HealthCare/TIF-GRPO.

2605.20275 2026-05-21 cs.CV cs.AI [updated version]

You Don't Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection

Sana Alamgeer, Ronish Kumar, Awatif Yasmin, Muhammad Irshad, Anne H. H. Ngu

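Affiliations: Texas State University

AI summary: This paper proposes Gated-CNN, a lightweight dual-stream architecture for watch-based fall detection. A gating mechanism makes feature extraction more efficient and yields higher detection accuracy without attention.

AI abstract (translated from Chinese)

Existing wearable fall-detection systems rely on self-attention, which incurs quadratic computational overhead and spreads weight across all time steps. This global weight distribution impairs precise localization of fall signatures within short, fixed-length windows. To overcome this, we propose Gated-CNN, a lightweight dual-stream architecture that processes the accelerometer and gyroscope streams through independent one-dimensional convolutional feature extractors, followed by (i) a sigmoid gating module that selectively suppresses uninformative background activations while amplifying fall-discriminative features; (ii) a global average pooling layer that compresses each stream into a compact fixed-length descriptor; and (iii) a shared classification head that fuses the two descriptors for binary fall prediction. For offline evaluation, we test the model on five wrist-worn inertial measurement unit (IMU) datasets, obtaining average F1-scores of 93%, 93%, 90%, 91%, and 90% on SmartFallMM, WEDA-Fall, FallAllD, UMAFall, and UP-Fall respectively, outperforming Transformer baselines. For real-time evaluation, we deploy the model on a Google Pixel Watch 3 and test it with 12 participants. The model achieves an average F1-score of 97% and an accuracy of 98% with zero missed falls, indicating that sigmoid gating is a structurally better aligned and computationally cheaper alternative to attention for fall detection on commodity smartwatches.

English abstract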

Existing deep learning approaches for wearable fall detection systems rely on self-attention mechanisms that impose quadratic computational overhead, distributing weights across all time steps. This global weight distribution impairs the precise localization of the brief impact signatures that characterize falls within short, fixed-length windows. To overcome this challenge, we propose Gated-CNN, a lightweight dual-stream architecture that processes accelerometer and gyroscope streams through independent one-dimensional convolutional feature extractors, followed by (i) a sigmoid gating module that selectively suppresses uninformative background activations while amplifying fall-discriminative features, (ii) a global average pooling layer that compresses each stream into a compact fixed-length descriptor, and (iii) a shared classification head that fuses both descriptors for binary fall prediction. For offline evaluation, we evaluate the model across five wrist-mounted inertial measurement unit (IMU) datasets, achieving average F1-scores of 93%, 93%, 90%, 91%, and 90% on SmartFallMM, WEDA-Fall, FallAllD, UMAFall, and UP-Fall, outperforming Transformer baselines. For real-time evaluation, we deployed the model on a Google Pixel Watch 3 and tested across 12 participants. The model achieves an average F1-score of 97% and an accuracy of 98% with zero missed falls, showing that sigmoid gating offers a more structurally aligned and computationally efficient alternative to attention for commodity smartwatch-based fall detection.
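
Steps (i) and (ii) of the architecture can be sketched for a single feature channel. This is a toy illustration, not the paper's layer definitions: it assumes the gate is an elementwise sigmoid applied to separately computed gate logits, and the names and numbers are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gated_pool(features, gate_logits):
    """Scale each time step by its sigmoid gate, then average over time."""
    gated = [f * sigmoid(g) for f, g in zip(features, gate_logits)]
    return sum(gated) / len(gated)

# One channel over four time steps: background activity of 1.0 and a brief peak of 5.0.
features = [1.0, 1.0, 5.0, 1.0]
# Hypothetical gate logits that close the gate on background and open it on the peak.
gate_logits = [-10.0, -10.0, 10.0, -10.0]
descriptor = gated_pool(features, gate_logits)
ungated = sum(features) / len(features)
```

With the gate closed on background steps, the pooled descriptor is driven almost entirely by the brief peak, whereas plain averaging mixes the peak with the background. The cost is linear in the window length.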

2605.20274 2026-05-21 cs.GR cs.AI [updated version]

PolycubeNet: A Dual-latent Diffusion Model for Polycube-Based Hexahedral Mesh Generation

Lu He, Qitao Deng, Junjiang Deng, Liangbin Deng, Yanjun Liang, Wenting Yang, Guoqiang Wang, Na Lei

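Affiliations: Dalian University of Technology; Jiangxi University of Science and Technology; School of Mathematical Sciences, Guangxi Minzu University; Caohejing Hi-Tech Park Development Co., Ltd.; Shanghai AI Laboratory

AI summary: This paper proposes an end-to-end framework based on conditional diffusion models for polycube-based hexahedral mesh generation. Its dual-latent conditional diffusion architecture decouples computational complexity from input and output resolution, improving generation efficiency and robustness.

AI abstract (translated from Chinese)

Hexahedral meshes are widely used in simulation pipelines, but generating them automatically for complex CAD geometries remains challenging. Polycube-based hexahedral meshing is a representative approach because of its regular, parameterization-friendly structure, but existing polycube construction methods usually depend on intricate surface segmentation and local heuristics, which can produce artifacts or fail on difficult shapes. In this paper we propose an end-to-end framework for polycube generation based on conditional diffusion models. Given an input geometry represented as a point cloud, our method directly generates the corresponding polycube point cloud, removing the need for explicit surface segmentation or predefined polycube templates. At its core is a dual-latent conditional diffusion architecture that confines the computationally expensive self-attention operations to a fixed-capacity, low-dimensional latent space. This design decouples computational complexity from the resolution of both the input geometry and the output polycube, avoiding the quadratic cost typical of point-cloud self-attention while supporting flexible input and output resolutions. To obtain a hexahedral mesh, the generated polycube is aligned to the input shape through rigid and non-rigid point-cloud registration to establish surface correspondence, followed by a polycube-to-hex pipeline. We also create and release a paired dataset of CAD meshes and their corresponding polycube meshes, together with the core implementation of our model. Experiments show that PolycubeNet generalizes to complex CAD models of arbitrary genus and generates high-quality polycube structures within seconds, improving robustness and efficiency over prior learning-based methods.

English abstract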

Hexahedral meshes are widely used in simulation pipelines, yet automatic generation remains challenging for complex CAD geometries. Polycube-based hexahedral meshing is a representative approach due to its regular, parameterization-friendly structure, but existing polycube construction methods often rely on intricate surface segmentation and local heuristics, which can produce artifacts or fail on difficult shapes. In this paper, we propose an end-to-end framework for polycube generation based on conditional diffusion models. Given an input geometry represented as a point cloud, our method directly produces a corresponding polycube point cloud, eliminating the need for explicit surface segmentation or predefined polycube templates. At the core of our approach is a dual-latent conditional diffusion architecture that confines computationally expensive self-attention operations to a fixed-capacity, low-dimensional latent space. This design effectively decouples computational complexity from the resolution of both the input geometry and the output polycube, thereby avoiding the quadratic cost typical of point cloud self-attention mechanisms while supporting flexible input and output resolutions. To obtain a hexahedral mesh, the generated polycube is aligned to the input shape via rigid and non-rigid point cloud registration to establish surface correspondence, followed by a polycube-to-hex pipeline. We additionally create and release a paired dataset of CAD meshes and their corresponding polycube meshes, together with the core implementation of our model. Experiments show that PolycubeNet generalizes to complex CAD models with arbitrary genus and produces high-quality polycube structures within seconds, improving robustness and efficiency over prior learning-based approaches.

2605.20273 2026-05-21 cs.LG cs.AI [updated version]

Modality-Decoupled Online Recursive Editing

Siyuan Li, Youyuan Zhang, Fangming Liu, Jing Li

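Affiliations: Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, China; Huazhong University of Science and Technology, China

AI summary: This paper proposes M-ORE, a modality-decoupled online recursive editor for lifelong adaptation of multimodal large language models. A unified proximal-projection formulation with a Sherman-Morrison recursion gives constant per-edit overhead; module-wise locality statistics and a fixed orthogonal low-rank edit subspace reduce long-horizon interference and improve reliability, generality, and locality.

AI abstract (translated from Chinese)

Online model editing for multimodal large language models (MLLMs) must handle a continuous stream of corrections under compute and memory budgets, yet editors developed for text-only LLMs often perform poorly on MLLMs. Visually dominant activations skew the statistics that shape updates, causing cross-modal conflict, while sequential writes become entangled in a shared edit space and amplify long-horizon interference between edits. To address these problems, we propose M-ORE, a modality-decoupled online recursive editor for lifelong MLLM adaptation. M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update through a Sherman-Morrison recursion, giving constant per-edit overhead. It maintains module-wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping, and performs continual updates in a fixed orthogonal low-rank edit subspace to mitigate long-horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that M-ORE outperforms strong baselines in reliability, generality, and locality while achieving favorable quality-efficiency scaling. Our code is publicly available at https://github.com/lab-klc/M-ORE.

English abstract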

Online model editing for multimodal large language models (MLLMs) requires assimilating a stream of corrections under tight compute and memory budgets. Yet editors developed for text-only LLMs often degrade on MLLMs: visually dominant activations skew the statistics that shape updates, causing cross-modal conflict, while sequential writes become entangled in a shared edit space and amplify long-horizon interference, causing inter-edit interference. To address these, we propose M-ORE, a modality-decoupled online recursive editor for lifelong MLLM adaptation. M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update with a Sherman-Morrison recursion, yielding constant per-edit overhead. It maintains module-wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping and performs continual updates in a fixed orthogonal low-rank edit subspace via a Sherman-Morrison recursion to mitigate long-horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that our M-ORE method consistently improves reliability, generality, and locality over strong baselines, while achieving favorable quality-efficiency scaling. Our code is publicly available at https://github.com/lab-klc/M-ORE.
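
The Sherman-Morrison identity behind the constant per-edit overhead can be sketched in isolation. The code below is the generic rank-one inverse update, written with plain lists; which matrix M-ORE maintains this way is not stated in the abstract, so the function name and the example are assumptions for illustration only.

```python
def sherman_morrison_update(a_inv, u, v):
    """Return the inverse of (A + u v^T) given A^{-1}, without re-inverting.

    Cost is O(n^2) per update and does not grow with the number of past updates.
    """
    n = len(a_inv)
    a_inv_u = [sum(a_inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    v_a_inv = [sum(v[k] * a_inv[k][j] for k in range(n)) for j in range(n)]
    denom = 1.0 + sum(v[k] * a_inv_u[k] for k in range(n))
    if abs(denom) == 0.0:
        raise ValueError("rank-one update makes the matrix singular")
    return [
        [a_inv[i][j] - a_inv_u[i] * v_a_inv[j] / denom for j in range(n)]
        for i in range(n)
    ]

# A = I, u = v = e1, so A + u v^T = diag(2, 1) and its inverse is diag(0.5, 1).
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = sherman_morrison_update(identity, [1.0, 0.0], [1.0, 0.0])
```

Each incoming edit contributes one rank-one term, so the running inverse can be refreshed in place instead of being recomputed from all edits seen so far.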

2605.20272 2026-05-21 cs.LG cs.AI [updated version]

Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

Nasehatul Mustakim, Lucas Lehnert

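Affiliations: Department of Computer Science, University of Saskatchewan, Saskatoon, Saskatchewan, Canada

AI summary: This paper presents a theoretical model that extends the state abstraction framework to POMDPs and defines a successor-weighted model reduction, enabling cross-scale generalization in reinforcement learning agents. It analyzes how the size of the abstract state space affects generalization.

AI abstract (translated from Chinese)

Although humans readily generalize abstract concepts to more complex or larger tasks, building reinforcement learning (RL) systems with this ability remains elusive. This paper presents the first theoretical model of how out-of-distribution (OOD) generalization can be achieved in RL agents. Our approach considers partially observable Markov decision processes (POMDPs) and assumes that the agent uses an abstraction function to decide which experiences can be treated as equivalent and which must be distinguished. First, we extend the existing state abstraction framework and proof techniques to POMDPs. We then define a successor-weighted model reduction, a variant of model reduction that allows compression into smaller abstract spaces than prior definitions. We derive a bound on the agent's OOD test performance, which defines the conditions under which OOD generalization is achievable. The bound decomposes the agent's performance loss into approximation and estimation errors, revealing how shrinking the abstract state space improves test performance and OOD generalization. Our analysis suggests that constraining the agent to operate over a small, finite set of abstract states is necessary for generalizing to more complex tasks. Our results encourage further research into learning RL architectures that scale across tasks of varying complexity.

English abstract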

While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present the first theoretical model of how such Out-of-Distribution (OOD) generalization can be achieved in RL agents. Our approach considers Partially Observable Markov Decision Processes (POMDPs) and assumes that an intelligent agent uses an abstraction function to determine which experiences can be treated as equivalent and which must be distinguished. First, we extend the existing state abstraction framework and proof techniques to POMDPs. Then, we define a successor-weighted model reduction, a model reduction variant that enables compression into smaller abstract spaces than prior definitions allow. We derive a bound on the agent's OOD test performance, thereby defining the conditions under which OOD generalization is achievable. This bound decomposes an agent's performance loss into approximation and estimation errors, revealing how reducing an agent's abstract state space size improves test performance and OOD generalization. Our analysis suggests that constraining an agent to operate over a small, finite set of abstract states is necessary for achieving generalization to more complex tasks. Our results motivate further research into learning RL architectures that scale across tasks of varying complexity levels.

2605.20270 2026-05-21 cs.LG cs.AI stat.ML [updated version]

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

Hamed Khosravi, Xiaoming Huo

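Affiliations: Georgia Institute of Technology

AI summary: This work proposes Conformal Selective Acting, which provides anytime-valid risk control for deployed RLVR-trained LLMs. It fills an empty cell forced by deployment requirements, maintaining pathwise validity with e-processes on a Bonferroni grid, and demonstrates its effectiveness on multiple benchmarks.

AI abstract (translated from Chinese)

A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with a per-deployment error budget α. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments and no waiting for a long-run average. Existing wrappers cannot provide this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online conformal methods bound only long-run averages; non-exchangeable extensions are only marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: an e-process per threshold, selective risk, anytime-pathwise validity, and a max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper that maintains a Ville-type e-process for each threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk, we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}} \le α + O(N_T^{-1/2})$, (ii) rate-optimal certification matching $Θ(\bar{η}^{-2}\log(1/δ))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks (480 streams), sixteen adversarial distribution-shift cells (160 streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families (10,300 rounds), CSA is the only one of ten compared methods that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.

English abstract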

A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $α$. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}} \le α + O(N_T^{-1/2})$, (ii) rate-optimal certification matching $Θ(\bar{η}^{-2}\log(1/δ))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ($480$ streams), sixteen adversarial distribution-shift cells ($160$ streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ($10{,}300$ rounds), CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.
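
The idea of one e-process per threshold on a Bonferroni grid can be sketched with a generic betting construction. This is not the CSA algorithm: it uses a fixed bet, ignores selection and the RLVR filtration, and assumes losses in [0, 1]; the function name, the bet size, and the sample streams are hypothetical. For each threshold it tests the null hypothesis that the conditional mean loss is at least alpha, and certifies the threshold once the accumulated wealth reaches k / delta, where k is the number of thresholds.

```python
def certified_thresholds(losses_by_threshold, alpha, delta, bet=1.0):
    """Map each threshold to the first round at which risk below alpha is certified.

    Losses must lie in [0, 1] and bet in [0, 1 / (1 - alpha)], so every wealth
    factor is nonnegative. Under the null (mean loss at least alpha) the wealth
    is a nonnegative supermartingale, so by Ville's inequality it reaches
    k / delta with probability at most delta / k.
    """
    k = len(losses_by_threshold)
    level = k / delta  # Bonferroni: each threshold is tested at level delta / k
    result = {}
    for threshold, losses in losses_by_threshold.items():
        wealth = 1.0
        result[threshold] = None
        for t, loss in enumerate(losses, start=1):
            wealth *= 1.0 + bet * (alpha - loss)
            if wealth >= level:
                result[threshold] = t
                break
    return result

# Hypothetical streams: one threshold never errs, the other always errs.
streams = {0.5: [0.0] * 40, 0.9: [1.0] * 40}
certified = certified_thresholds(streams, alpha=0.1, delta=0.05)
```

Because the wealth can be checked after every round, the certificate is valid at whatever round the operator stops, which is the anytime property the abstract emphasizes. A deployment rule would then act at the largest certified threshold.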

2605.20269 2026-05-21 cs.LG cs.AI stat.ML updated version

Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

Hamed Khosravi, Xiaoming Huo

Affiliations * H. Milton Stewart School of Industrial and Systems Engineering; Georgia Institute of Technology

AI summary: This paper studies low-rank linear contextual bandits under subspace drift and proposes SPSC, an algorithm that tracks the drifting subspace while attaining a rank-dependent dynamic regret rate.

Details

Abstract

Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits exploit rank but break under subspace change; non-stationary linear bandits adapt to drift but pay ambient rate $\widetilde{O}(d\sqrt{T})$. We study piecewise-stationary low-rank linear contextual bandits with scalar feedback: $θ_t = B_k^\star w_t$ with rank-$r$ factor $B_k^\star\in\mathbb{R}^{d\times r}$ constant within each of $K$ unknown segments and able to shift at boundaries. Our results are tight along three axes. (i) Identification boundary. With single-play scalar rewards, the moving subspace is recoverable through quadratic functionals of rewards iff three probe-side conditions hold: known noise variance, bounded state-noise coupling, and full-dimensional probe support. Each is necessary in the unrestricted-second-moment problem, and jointly they are sufficient, characterizing the boundary of the solvable region. (ii) Algorithm and dynamic regret. SPSC interleaves isotropic probes with windowed projected ridge-UCB exploitation inside the learned $r$-dimensional subspace; a CUSUM-style variant discovers segment boundaries online. The costed dynamic regret is $\widetilde{O}(r\sqrt{T})+\widetilde{O}(T^{2/3})+O(W\,V_{\mathrm{in}})$, replacing the ambient $d\sqrt{T}$ rate with the intrinsic rank. (iii) Empirics. On eleven benchmarks spanning synthetic, UCI/MovieLens, semi-synthetic clinical, and ZOZOTOWN production-log data, SPSC outperforms non-stationary and low-rank baselines whenever $d-r\gtrsim T^{1/6}$, matching the analytical crossover. To our knowledge, this is the first work to characterize the identification boundary and attain the intrinsic-rank dynamic-regret rate in this setting.
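
The abstract mentions a CUSUM-style variant for finding segment boundaries online but gives no details. As a rough illustration only, the textbook two-sided CUSUM on a stream of prediction residuals looks like the sketch below; the function name and the `drift` and `threshold` defaults are assumptions, not SPSC's detector.

```python
def cusum_alarms(residuals, drift=0.5, threshold=5.0):
    """Two-sided CUSUM over a residual stream; returns the alarm indices."""
    up = down = 0.0
    alarms = []
    for t, r in enumerate(residuals):
        # Accumulate evidence of an upward or downward mean shift,
        # discounting each step by the allowed drift.
        up = max(0.0, up + r - drift)
        down = max(0.0, down - r - drift)
        if up > threshold or down > threshold:
            alarms.append(t)
            up = down = 0.0  # restart after declaring a boundary
    return alarms
```

In a bandit loop the residual would be the gap between observed and predicted reward, and an alarm would trigger a reset of the windowed estimates.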

2605.20268 2026-05-21 cs.LG cs.AI cs.CL updated version

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Paul Quinlan, Jeremy Levasseur, Qingguo Li, Xiaodan Zhu

Affiliations * InertialAI; Department of Electrical and Computer Engineering, Queen’s University; Department of Mechanical and Materials Engineering, Queen’s University

AI summary: This paper proposes Chronicle, a multimodal foundation model trained jointly on language and time series; a single unified architecture shares parameters across both modalities and performs strongly on multiple tasks.

Details

Abstract

Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

2605.20267 2026-05-21 cs.CV cs.AI updated version

Generation of Heterogeneous PET Images from Uniform Organ Activity Maps Using a Pretrained Domain-Adapted Diffusion Model

Suya Li, Kaushik Dutta, Debojyoti Pal, Jingqin Luo, Kooresh I. Shoghi

Affiliations * Mallinckrodt Institute of Radiology, Washington University School of Medicine, St. Louis, USA; Imaging Science Program, McKelvey School of Engineering, Washington University in St Louis, St. Louis, USA; Department of Surgery, Washington University School of Medicine, St. Louis, USA; Department of Biomedical Engineering, Washington University in St Louis, St. Louis, USA

AI summary: This paper proposes a pretrained domain-adapted diffusion model that generates clinically relevant heterogeneous PET images from uniform organ activity maps, using a two-phase training strategy to improve the quantitative accuracy of the synthesized images and tumor segmentation performance.

Comments 18 pages, 7 figures

Details

Abstract

Synthetic PET images are valuable for quantitative imaging workflow development, scalable virtual imaging trials, and deep learning model training, but conventional physics-based simulation approaches are computationally intensive, limited in anatomical variability, and often fail to capture heterogeneous PET uptake. This study developed a pretrained domain-adapted diffusion (PAD) model for anatomy-conditioned PET synthesis from uniform organ activity maps. PAD adopts a natural-image pretrained text-to-image decoder with an upstream conditioning encoder and a downstream PET-domain adapter. A two-phase training strategy was used, with the first phase learning coarse uptake distributions and the second refining local image details. Uniform organ activity maps were generated from CT-based segmentations by assigning each organ its mean uptake from the paired PET image. Evaluation included quantitative accuracy, noise assessment, radiomic analysis, tumor segmentation performance, and a human observer study. PAD-generated images achieved high quantitative accuracy, with concordance correlation coefficients above 0.92 between organ mean SUVs and assigned activity values. The synthesized images showed noise levels and texture characteristics similar to target PET images and produced comparable tumor segmentation performance. In a two-alternative forced-choice observer study, four readers achieved approximately 50% accuracy, indicating visual indistinguishability between synthesized and target images. PAD also generated realistic PET images from XCAT-derived activity maps, demonstrating compatibility with phantom-based anatomical priors. Overall, PAD provides a diffusion-based framework for generating clinically relevant heterogeneous PET images from uniform organ activity maps derived from clinical segmentations or digital phantoms, supporting data augmentation and downstream imaging studies.
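
For readers unfamiliar with the agreement metric reported above, here is a minimal sketch of Lin's concordance correlation coefficient between two paired lists, for example organ mean SUVs and assigned activity values. It uses population (1/n) moments; whether the study used this exact variant is an assumption, and the function name is hypothetical.

```python
def concordance_cc(x, y):
    """Lin's concordance correlation coefficient for paired samples."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Unlike Pearson correlation, the denominator penalizes a mean offset.
    return 2.0 * sxy / (sxx + syy + (mx - my) ** 2)
```

A constant offset between the two lists lowers the coefficient even when the values are perfectly correlated, which is why it suits checks of quantitative agreement.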

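2605.20262 2026-05-21 cs.LG cs.AI updated version

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

Bryce Hinkley, Peyman Najafirad

Affiliations * University of Texas at San Antonio

AI summary: This paper studies selective refusal editing as a three-way control problem and introduces Residual Paving, which separates route selectivity (whether to intervene) from residual-edit capacity (what edit to apply), reducing edit refusal while largely preserving benign behavior and harmful refusals.

Details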

Abstract

We study selective refusal editing as a three-way control problem: induce non-refusal on designated edit prompts while preserving benign behavior and harmful refusals outside the edit set. We introduce Residual Paving, a routed residual editing method for frozen instruction-tuned transformers that separates route selectivity, whether to intervene, from residual-edit capacity, what edit to apply. An early-layer router predicts a scalar gate and expert mixture; when active, prompt-conditioned bottleneck residual experts apply later-layer residual updates while leaving the backbone unchanged. This decomposition supports an oracle-routing diagnostic where only the learned scalar gate is replaced with the held-out edit/keep label, leaving the residual editor and frozen backbone fixed. On the primary Gemma-3-4B-IT held-out split, learned Residual Paving reduces edit refusal from 88.6% to 4.0%, with 95.5% benign distribution preservation and 87.3% harmful distribution preservation. Same-protocol one-direction steering controls are much weaker on edit success, leaving edit refusal at 86.8% for Edit-target ActAdd and 78.9% for DIM-style refusal steering. The remaining failure is off-target harmful-keep degradation: harmful refusal remains below the frozen-base rate, 65.3% vs. 81.6%. Across six backbones, oracle routing improves the keep-side diagnostic score on every reported row, with median gain +12.9 pp, supporting the interpretation that learned route selectivity is the main observed bottleneck. Trajectory diagnostics on two backbones further suggest directed movement toward edit-target continuations rather than generic refusal suppression.

2605.20258 2026-05-21 cs.LG cs.AI cs.CR updated version

It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs

Sangwoo Park, Woongyeong Yeo, Seanie Lee, Yumin Choi, Hyomin Lee, Kangsan Kim, Jinheon Baek, Seong Joon Oh, Sung Ju Hwang

Affiliations * KAIST

AI summary: This paper proposes SELFCI, a complementary self-distillation framework that decouples information suppression from task resolution to address the privacy-utility trade-off in large language models and improve contextual integrity.

Comments 28 pages, 16 figures

Details

Abstract

Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agents handling sensitive workflows, adhering to CI becomes critical. However, even frontier models remain unreliable in making disclosure decisions, and existing mitigation strategies often degrade underlying task performance. To overcome this privacy-utility trade-off, we propose SELFCI, a complementary self-distillation framework that decouples information suppression from task resolution. SELFCI jointly optimizes two independent reverse KL divergences over distinct teacher distributions derived from feedback: one encourages preserving task-relevant information for utility, while the other enforces minimal and appropriate disclosure. This complementary formulation induces a Product-of-Experts (PoE) target, aligning the policy with the intersection of capability and privacy requirements. Empirical evaluations demonstrate that SELFCI, without relying on costly external supervision, consistently outperforms competitive baselines such as online reinforcement learning algorithms (e.g., GRPO). These trends further extend to out-of-domain settings involving agentic workflows and accumulated private context, suggesting that SELFCI provides a practical path toward CI alignment.
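
Why two reverse KL terms lead to a product-of-experts target can be checked on a toy discrete distribution. The sketch below assumes equal weights on the two terms and a three-outcome support, neither of which comes from the paper; under those assumptions the minimizer of KL(pi || p1) + KL(pi || p2) is the normalized geometric mean of p1 and p2.

```python
import math


def reverse_kl(pi, p):
    """KL(pi || p) for discrete distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(pi, p) if a > 0)


def joint_objective(pi, p1, p2):
    """Sum of the two reverse KL terms with equal weights (an assumption)."""
    return reverse_kl(pi, p1) + reverse_kl(pi, p2)


def product_of_experts(p1, p2):
    """Normalized geometric mean of two distributions on the same support."""
    unnorm = [math.sqrt(a * b) for a, b in zip(p1, p2)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

The product puts mass only where both teachers do, which matches the abstract's description of aligning the policy with the intersection of capability and privacy requirements.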

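2605.20257 2026-05-21 cs.LG cs.AI updated version

Instance Discrimination for Link Prediction

Valentin Cuzin-Rambaud, Mathieu Lefort, Rémy Cazabet

AI summary: This paper proposes two new models based on link representations, L-GRACE and L-BGRL, that improve link prediction performance, especially on unattributed graphs, and shows they are competitive in both supervised and self-supervised settings.

Details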

Abstract

Recently, instance discrimination models have emerged as a major solution for self-supervised learning. Having already demonstrated its effectiveness in the image domain, instance discrimination learning is now proving equally convincing in the graph domain, in particular for node classification. However, fewer contributions have tackled the link prediction task. In this contribution, we propose to adapt existing methods to this context. We first provide a rigorous evaluation of existing self-supervised models in the field of link prediction, showing that the main performance depends on the augmentation process (like in computer vision). We then propose a new structural augmentation based on the community structure that is relevant for link prediction. Our main contribution introduces two new models, L-GRACE and L-BGRL, based on link representations instead of node representations, which improve the performance of the existing methods, especially on unattributed graphs, and we show that they perform on par with the state of the art, both in supervised and self-supervised contexts.

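2605.20256 2026-05-21 cs.LG cs.AI updated version

CP-MoE: Consistency-Preserving Mixture of Experts for Continual Learning

Yang Liu, Toan Nguyen, Flora D. Salim

Affiliations * School of Computer Science and Engineering, University of New South Wales

AI summary: This paper proposes CP-MoE, a continual learning framework built around a transient expert; a consistency-preserving routing bias and a transient expert-guided regularisation mechanism reduce parameter interference and forgetting while preserving cross-task knowledge transfer.

Details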

Abstract

Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision-language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steer routing towards more compatible expert selection, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On the SuperNI benchmark, which spans diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On the VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.

2605.20244 2026-05-21 cs.LO cs.AI cs.CL cs.LG cs.SE updated version

Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

Jialin Lu, Soonho Kong, Rodrigo Stehling, Kaiyu Yang, Zhangyang Wang, Weiran Sun, Wuyang Chen

Affiliations * Simon Fraser University; Amazon Web Services; MiroMind; University of Texas at Austin

AI summary: This paper proposes Lean Refactor, a framework that uses retrieval-augmented agentic strategy search for multi-objective, controllable, and version-robust refactoring of Lean proofs; its main contribution is efficient proof optimization driven by a curated database of annotated multi-objective refactoring strategies.

Details

Abstract

We present Lean Refactor, a plug-and-play retrieval-augmented agentic framework for multi-objective, controllable, and version-robust refactoring of Lean proofs. LLM-generated proofs are notoriously correct-but-verbose and brittle across library versions, yet existing refactoring works overlook three practical challenges: 1) Lean refactoring is natively multi-objective (proof length, compilation cost, and version compatibility are often in tension); 2) Lean repositories have fragile compatibility, whereas LLM releases are unaware of Lean/Mathlib versions; 3) Training-based pipelines require repeated fine-tuning with each new LLM release, scaling neither with model churn nor with Lean's release cycle. Lean Refactor steers a frozen agentic LLM with retrievals from a curated database of multi-objective refactoring strategies, each densely annotated with metadata such as supported Lean/Mathlib versions and expected compilation-cost reduction. Experiments show over $70\%$ token-level compression on competition benchmarks, over $20\%$ on research repositories, and up to $60\%$ compilation-time reduction, outperforming prior work and Claude Code. Version-filtered retrieval further improves compression on the target Lean version, and refactored miniF2F proofs exhibit stronger zero-shot version transfer to future Lean releases than their unrefactored counterparts.

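2605.20242 2026-05-21 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph updated version

LEAP: A closed-loop framework for perovskite precursor additive discovery

Xin-De Wang, Zhi-Rui Chen, Ze-Feng Gao, Peng-Jie Guo, Cheng Mu, Zhong-Yi Lu

Affiliations * School of Physics, Renmin University of China; School of Chemistry and Life Resource, Renmin University of China

AI summary: This study proposes LEAP, a framework that combines a large language model with active learning, using literature-derived, mechanism-relevant descriptors and Bayesian optimization to prioritize additives for perovskite solar cells; benchmarks on unseen literature show that the domain-specialized model outperforms general-purpose models in mechanism-consistent reasoning.

Comments 30 pages; 11 figures

Details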

Abstract

Efficient discovery of precursor additives is essential for improving the performance of perovskite solar cells, yet the large chemical space makes conventional trial-and-error screening inefficient. We develop LEAP (LLM-driven Exploration via Active Learning for Perovskites), an expert-in-the-loop closed-loop framework that couples a domain-specialized large language model (LLM) with active learning for iterative additive prioritization. The LLM is trained to extract mechanism-relevant knowledge from the perovskite additive literature and to represent candidate molecules through interpretable descriptors, which are further integrated into a Bayesian optimization workflow for uncertainty-aware prioritization under low-data conditions. Benchmark results on unseen literature show that the domain-specialized model outperforms general-purpose models in mechanism-consistent reasoning. Experimental validation in an expert-in-the-loop proof-of-concept study suggests improved additive prioritization across three screening rounds, leading to average device PCEs of 20.13% and 20.87% for the later-round 6-CDQ- and 2-CNA-treated devices, respectively, compared with 19.25% for the control, with a champion PCE of 21.32%. These results provide preliminary evidence that literature-grounded mechanistic descriptors, when coupled with Bayesian optimization and expert feasibility review, can support mechanism-aware additive prioritization in perovskite photovoltaics.

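2605.20241 2026-05-21 cs.LG cs.AI cs.CL updated version

Network-Based HIV Prevention Interventions via Cascade-Aware Suppression of Transmission

Akseli Kangaslahti, Davin Choo, Milind Tambe, Alastair van Heerden, Cheryl Johnson

Affiliations * Harvard University; University of Witwatersrand; Wits Health Consortium; World Health Organization

AI summary: This paper proposes a network-based approach to HIV prevention that reduces new infections by suppressing transmission cascades. The problem is modeled as a constrained optimization problem and addressed with CAST, a polynomial-time approximation algorithm. Evaluations on real-world HIV networks show that the algorithm is effective.

Details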

Abstract

Treating and preventing Human Immunodeficiency Virus (HIV) remains a critical global health challenge. While antiretroviral therapy provides a path toward viral suppression -- effectively eliminating an individual's transmission risk -- systemic resource constraints limit the reach of intervention efforts. This work addresses the strategic distribution of intensive resources among virally unsuppressed individuals to minimize the expected cascade of new infections within a transmission network. We formalize this challenge as a novel constrained optimization problem where we have resources to "treat" $k$ out of a set $\mathbf{P}$ of virally unsuppressed individuals, and establish its theoretical connections to existing computational literature. We then propose Cascade-Aware Suppression of Transmission (CAST), a polynomial-time $(δ, ε)$-approximation algorithm that achieves a $2\sqrt{|\mathbf{P}|}$ approximation ratio by leveraging connections to the Minimum-$k$-Union (MkU) problem and Hoeffding-style concentration bounds. Extensive evaluations on real-world HIV networks demonstrate that CAST outperforms standard public health and computer science baselines. Furthermore, we show that CAST is empirically robust across diverse infectious disease networks, varied edge probability initializations, and settings involving imperfect network data.
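
The abstract cites Hoeffding-style concentration bounds without saying how CAST applies them. As a generic illustration, the two-sided Hoeffding inequality gives the number of independent Monte Carlo samples needed to estimate a bounded quantity, such as a simulated cascade size rescaled to a known range, within `eps` with probability at least `1 - delta`. The function name and arguments are hypothetical.

```python
import math


def hoeffding_samples(eps, delta, value_range=1.0):
    """Samples needed so the empirical mean is within eps w.p. >= 1 - delta.

    Follows from P(|mean - mu| >= eps) <= 2 * exp(-2 * n * eps**2 / R**2)
    for independent samples bounded in an interval of width R.
    """
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))
```

The sample count grows with the square of the value range and only logarithmically in `1 / delta`, so tightening `eps` is far more expensive than tightening `delta`.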

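2605.20211 2026-05-21 cs.CV cs.AI updated version

RealUserSim: Bridging the Reality Gap in Agent Evaluation through Reality-Grounded User Simulation

Ming Zhu, Juntao Tan, Rithesh Murthy, Jielin Qiu, Liangwei Yang, Wenting Zhao, Silvio Savarese, Shelby Heinecke, Huan Wang

Affiliations * Salesforce AI Research

AI summary: This paper proposes RealUserSim, a user simulation framework grounded in real behavioral data; behavioral profiles extracted from a large set of real human-LLM conversations raise the match rate between simulated and real users, improving the validity of agent evaluation.

Details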

Abstract

LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6-8% against real users), while hand-crafted behavioral directives trigger Directive Amplification, where models hyper-interpret instructions into unnatural behavioral extremes that vary dramatically across simulator models. We present RealUserSim, the first user simulation framework grounded in real behavioral data. From 14,000+ authentic human-LLM conversations (WildChat), we extract 7,275 executable behavioral profiles and use them to ground LLM simulators. A fidelity benchmark (PT3) on 600 conversations across 71+ domains with anti-leakage controls shows that grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on TauBench with 6 simulator models and extensive analysis shows that grounded simulation acts as a realistic stress test, surfacing three failure mechanisms invisible to cooperative simulators (mean -3.2% to -3.5% task success degradation), while Directive Amplification in existing benchmarks produces unrealistic behavior that compromises the validity of agent evaluation.

2605.20203 2026-05-21 cs.HC cs.AI updated version

GrandGuard: Taxonomy, Benchmark, and Safeguards for Elderly-Chatbot Interaction Safety

Changxuan Fan, Xi Yang, Yueyuan Zheng, Bin Zhou, Yuanping Wang, Wenbin Hu, Huihao Jing, Ki Sen Hung, Dazhao Du, Haoran Li, Janet Hui-wen Hsiao, Yangqiu Song

Affiliations * The Hong Kong University of Science and Technology

AI summary: This paper proposes GrandGuard, a framework for assessing and mitigating elderly-specific risks in LLM interactions. It builds a three-level taxonomy with 50 fine-grained risk types and a benchmark of 10,404 labeled prompts and responses, shows that leading LLMs mishandle elderly-specific contextual risks, and presents two safeguards that reach up to 96.2% and 90.9% unsafe-prompt detection accuracy.

Details

Abstract

As older adults increasingly use LLM-based chatbots for companionship and assistance, a safety gap is emerging. Older adults may face vulnerabilities from social isolation, limited digital literacy, and cognitive decline, yet existing safety benchmarks largely target general harms and overlook elderly-specific risks. For example, a prompt such as "how to repair a ceiling light alone in the dark" may be benign for most users but poses a serious fall risk for older adults with mobility limitations. We introduce GrandGuard, the first comprehensive framework for assessing and mitigating elderly-specific contextual risks in LLM interactions. We develop a three-level taxonomy with 50 fine-grained risk types across mental well-being, financial, medical, toxicity, and privacy domains, grounded in real-world incidents, community discussions, and analysis of stakeholder studies. Using this taxonomy, we construct a benchmark of 10,404 labeled prompts and responses, showing that several leading LLMs mishandle elderly-specific contextual risks in over 50% of cases. We mitigate these failures with two safeguards: a fine-tuned Llama-Guard-3 and a policy-enhanced gpt-oss-safeguard-20b, achieving up to 96.2% and 90.9% unsafe-prompt detection accuracy, respectively. GrandGuard lays the groundwork for AI systems that move beyond general safety to support aging populations.

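2605.20202 2026-05-21 cs.CL cs.AI updated version

Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

Rana Muhammad Usman

Affiliations * Independent Researcher

AI summary: This study examines how emotional framing affects the behavior and internal representations of small, locally deployed language models. Using four impossible-constraint coding tasks and eight follow-up framings, it finds that pressure framing produces the strongest behavioral shifts and that the framings have structured internal geometry, and it characterizes response patterns across framings.

Comments 18 pages, 4 figures. Exploratory empirical study with fully local experiments on small open language models. Code and data: https://github.com/ranausmanai/LLMEmotionGeometry

Details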

Abstract

I study whether emotionally framed evaluation follow-ups change both the behavior and the calm-relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible-constraint coding tasks and eight follow-up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight-condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non-baseline conditions, the corresponding calm-relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer-23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand-labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (-0.252). In a separate calm-vs.-pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4-prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt-sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.
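
The cosine figures above compare calm-relative direction vectors. The abstract does not say how those vectors are built; a common construction, assumed here, is the difference between mean activations under a framing and under the calm baseline. The helper names and the list-of-lists activation layout are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def direction(condition_acts, calm_acts):
    """Calm-relative direction: mean(condition) minus mean(calm)."""
    cond = mean_vector(condition_acts)
    calm = mean_vector(calm_acts)
    return [c - b for c, b in zip(cond, calm)]


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With directions built this way, a cosine near 1 means two framings shift activations the same way relative to calm, and a negative value means they shift in opposing directions.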

2605.20200 2026-05-21 cs.HC cs.AI 版本更新

Evaluating multimodal emotion recognition in proactive conversational agents: A user study

评估主动对话代理中的多模态情绪识别：一项用户研究

Adnana Dragut, Raquel Lacuesta, F. Xavier Gaya-Morey, Jose M. Buades-Rubio

发表机构 * I3A (Institute of Engineering Research of Aragon)（阿拉贡工程研究院）； Universidad de Zaragoza（萨拉戈萨大学）

AI总结本文研究了多模态情绪识别在主动对话代理中的应用，通过用户研究验证了视觉和语言分析模块的有效性，发现语言分析比视觉线索更可靠，并探讨了SIAs在情绪引导中的潜力与挑战。

详情

AI中文摘要

SOLAR：一种自优化的开放式自主代理，用于终身学习和持续适应

Nitin Vetcha, Dianbo Liu

发表机构 * Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore（眼科学系，Yong Loo Lin医学院，新加坡国立大学，新加坡）； Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, Karnataka, India（计算与数据科学系，印度科学研究院，班加罗尔，卡纳塔克邦，印度）

AI总结本文提出SOLAR，一种自优化的开放式自主代理，通过参数级元学习实现自我改进，解决了动态真实世界中概念漂移和梯度基适应成本高的问题，展示了在常识、数学、医学、编程、社交和逻辑推理任务上的优越性能。

Comments Accepted at "Association for the Advancement of Artificial Intelligence 2026 Conference" in Streaming Continual Learning Bridge. Published in CEUR Workshop Proceedings (Original version at https://ceur-ws.org/Vol-4183/paper2.pdf)

Journal ref CEUR Workshop Proceedings, Vol. 4183, 2026

详情

AI中文摘要

尽管大型语言模型（LLMs）在许多任务上取得了显著成功，但在动态、真实世界环境中部署时仍然面临瓶颈，主要挑战是概念漂移和基于梯度的适应成本高。传统微调（FT）难以适应非平稳数据流，且会导致灾难性遗忘或需要大量人工数据整理。为了解决这些限制，本文在流式和持续学习范式中提出Self-Optimizing Lifelong Autonomous Reasoner（SOLAR），即一种开放式自主代理，利用参数级元学习实现自我改进，将模型权重视为探索的环境。SOLAR首先在常识知识上巩固一个强先验，使其能够有效地进行迁移学习。通过多级强化学习方法，SOLAR自主发现适应策略，实现对未见领域的高效测试时适应。关键在于SOLAR维护一个不断演进的有效修改策略知识库，隐式地充当情景记忆缓冲区，平衡可塑性（适应新任务）和稳定性（保留元知识）。实验表明，SOLAR在常识、数学、医学、编程、社交和逻辑推理任务上优于强基线，标志着向能够在不断演进的环境中终身适应的自主代理迈出重要一步。

英文摘要

Despite the remarkable success of large language models (LLMs), they still face bottlenecks when deployed in dynamic, real-world settings, the primary challenges being concept drift and the high cost of gradient-based adaptation. Traditional fine-tuning (FT) struggles to adapt to non-stationary data streams without resulting in catastrophic forgetting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self-Optimizing Lifelong Autonomous Reasoner (SOLAR), an open-ended autonomous agent that leverages parameter-level meta-learning to self-improve, treating model weights as an environment for exploration. It initiates the process by consolidating a strong prior over common-sense knowledge, making it effective for transfer learning. By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test-time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta-knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.

2605.20188 2026-05-21 cs.LG cs.AI 版本更新

迈向无模板的蒙特卡洛树搜索可解释性

Siqi Lu, Mirsaleh Bahavarnia, Hiba Baroud, Yixuan Zhang, Hemant Purohit, Ayan Mukhopadhyay

发表机构 * The College of William & Mary（威廉与玛丽学院）； Vanderbilt University（范德比尔特大学）； George Mason University（乔治·梅森大学）

AI总结本文提出了一种无需模板的框架，使大语言模型能够根据记录的搜索轨迹生成基于证据的MCTS决策解释，无需中间形式化表示。

详情

AI中文摘要

概率搜索算法，如蒙特卡洛树搜索（MCTS），在不确定环境下解决顺序决策任务中已证明非常有效。然而，仅凭原始树统计信息对包含基于老虎机的树遍历和基于模拟的价值估计的非对称搜索树进行解释对终端用户来说是困难的。尽管先前的工作需要人工编写的正式逻辑约束，当问题变化时必须更新，我们提出了一种框架，使大型语言模型（LLMs）能够通过记录的搜索轨迹生成基于证据的MCTS决策解释。我们的框架将自然语言问题映射到结构化的意图类别集合中，确定现有树是否包含足够的证据，当需要时触发定向扩展，并使用树统计信息如访问次数、价值估计和风险信息生成解释。实验结果提供了首次证据表明LLMs可以作为概率搜索的端到端解释器，而无需中间形式化表示。

英文摘要

Probabilistic search algorithms, such as Monte Carlo Tree Search (MCTS), have proven very effective in solving sequential decision-making tasks under uncertainty. However, interpreting asymmetric search trees that incorporate bandit-based tree traversal and simulation-based value estimation is difficult for end users based solely on raw tree statistics. While prior work requires hand-crafted formal logic constraints that must be updated when the problem changes, we present a framework that enables large language models (LLMs) to generate evidence-grounded explanations of MCTS decisions from recorded search traces in an end-to-end manner. Our framework maps natural-language questions to a structured set of intent categories, determines whether the existing tree contains sufficient evidence, triggers targeted expansion when needed, and generates explanations using tree statistics such as visit counts, value estimates, and risk information. Experimental results provide the first evidence that LLMs can serve as end-to-end explainers for probabilistic search, without requiring intermediate formal representations.

2605.12770 2026-05-21 cs.LG cs.AI cs.CL 版本更新

WriteSAE: Sparse Autoencoders for Recurrent State

WriteSAE: 用于递归状态的稀疏自编码器

Jack Young

发表机构 * Indiana University（印第安纳大学）

AI总结本文提出WriteSAE，一种用于递归语言模型状态中矩阵更新的稀疏自编码器，通过在递归缓存中替换原始写入操作来提升生成效果，并在多个模型上验证了其有效性。

Comments 26 pages, 14 figures, 21 tables; code at https://github.com/JackYoung27/writesae

详情

AI中文摘要

我们介绍了WriteSAE，一种用于递归语言模型状态中矩阵更新的稀疏自编码器。在Gated DeltaNet、Mamba-2和RWKV-7中，每个token向递归缓存写入一个矩阵形状的更新；残差流SAE具有向量形状的原子，无法直接替换该更新。WriteSAE学习具有与模型自身写入相同形状的秩-1矩阵原子。这使我们能够测试直接替换：在SAE激活原子的位置，我们移除模型的写入，插入由SAE激活缩放的原子，并继续前向传递。在92.4%的评估位置上，原子比删除写入能产生更接近的最终token分布；平均每个原子，该比率是89.8%。对于Gated DeltaNet，一个使用忘记门、读取查询和输出嵌入的公式可以预测结果的logit变化，$R^2 = 0.98$。相同的替换测试在Mamba-2-370M上转移，达到88.1%。在生成中，该公式选择写入方向；将写入方向写入三个连续的缓存位置，其范数为模型写入的3倍，使在未修改模型中初始排名为100-1000的token出现在100%的延续中，比33.3%有所提高。据我们所知，这是首次在状态空间或混合递归层中报告的缓存级引导干预。

英文摘要

We introduce WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state. In Gated DeltaNet, Mamba-2, and RWKV-7, each token writes a matrix-shaped update to a recurrent cache; a residual-stream SAE has vector-shaped atoms and cannot replace that update directly. WriteSAE learns rank-1 matrix atoms with the same shape as the model's own write. This lets us test a direct replacement: at positions where the SAE activates an atom, we remove the model's write, insert the atom scaled by its SAE activation, and continue the forward pass. The atom gives a closer final token distribution than deleting the write on 92.4% of evaluated positions; averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the resulting logit change with $R^2 = 0.98$. The same replacement test transfers to Mamba-2-370M at 88.1%. In generation, the formula chooses a write direction; writing it into three consecutive cache positions at $3\times$ the norm of the model's write makes tokens initially ranked 100--1000 by the unmodified model appear in 100% of continuations, up from 33.3%. To our knowledge this is the first cache-level steering intervention reported in a state-space or hybrid recurrent layer.
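The replacement test described above can be sketched on a toy matrix-valued cache. This is a simplified illustration only: it assumes a plain "decay then add" update, which is not the actual Gated DeltaNet, Mamba-2, or RWKV-7 recurrence, and every function name here is hypothetical rather than taken from the WriteSAE code.

```python
def outer(u, w):
    """Rank-1 matrix u w^T as a list of rows."""
    return [[ui * wj for wj in w] for ui in u]

def scale(matrix, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in matrix]

def add(a, b):
    """Entry-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def step(state, write, forget=1.0):
    """Simplified recurrent update (assumed form): decay the cache, then add the write."""
    return add(scale(state, forget), write)

def substitute_write(state, atom_u, atom_w, activation, forget=1.0):
    """Drop the model's own write and insert a rank-1 atom scaled by its SAE activation."""
    atom = scale(outer(atom_u, atom_w), activation)
    return step(state, atom, forget)

# Toy 2x2 cache and a hypothetical atom with activation 0.5.
state = [[0.0, 0.0], [0.0, 0.0]]
new_state = substitute_write(state, atom_u=[1.0, 0.0], atom_w=[0.0, 2.0], activation=0.5)
```

The point of the sketch is the shape argument: a rank-1 atom has the same matrix shape as the write it replaces, so it can be dropped into the cache update directly, which a vector-shaped residual-stream atom cannot.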

2605.06395 2026-05-21 cs.LG cs.AI eess.SP 版本更新

Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves

通过希尔伯特丛和细胞sheaf实现一致的几何深度学习

Kartik Tandon, Julian Gould, Tanishq Bhatia, Francesca Dominici, Alejandro Ribeiro, Claudio Battiloro

发表机构 * University of Pennsylvania（宾夕法尼亚大学）； Sakana AI ； Northeastern University（东北大学）； Harvard University（哈佛大学）

AI总结本文提出了一种新的卷积学习框架，用于在流形上支持的可能无限维信号，通过希尔伯特丛关联的连接拉普拉斯算子作为卷积算子，引入了称为HilbNets的滤波器和神经网络，并通过两阶段采样过程实现，证明了采样诱导的希尔伯特细胞sheaf的sheaf拉普拉斯收敛于底层连接拉普拉斯，从而在无限维丛设置中推广了Belkin和Niyogi的收敛结果，最终在合成和现实任务中验证了该框架。

Comments 51 pages, 3 figures, 5 tables

详情

AI中文摘要

现代深度学习架构越来越多地面临复杂信号的挑战，这些信号本质上是无限维的，如时间序列、概率分布或算子，并在不规则域上定义。然而，针对这些设置的统一学习理论仍然缺乏。为了开始解决这一差距，我们引入了一种新的卷积学习框架，用于在流形上支持的可能无限维信号。具体来说，我们使用与希尔伯特丛相关的连接拉普拉斯算子作为卷积算子，并推导出滤波器和神经网络，称为HilbNets。我们使HilbNets以及更一般地卷积操作通过两阶段采样过程实现。首先，我们证明采样流形诱导了一个希尔伯特细胞sheaf，这是一个带有希尔伯特特征空间和边耦合规则的广义图结构，并证明其sheaf拉普拉斯在采样密度增加时以概率收敛于底层连接拉普拉斯。值得注意的是，这一结果是Belkin & Niyogi收敛结果在无限维丛设置中的推广，这是几何学习方法的理论基石。其次，我们离散化信号并证明离散化的（可实现的）HilbNets收敛于底层连续架构，并且可以在相同丛的不同采样中转移，从而为学习提供一致性。最后，我们验证了我们的框架在合成和现实任务中的有效性。总体而言，我们的结果通过将经典拉普拉斯框架提升到信号在每个点居住在自身希尔伯特空间的设置中，扩展了几何学习的范围。

英文摘要

Modern deep learning architectures increasingly contend with sophisticated signals that are natively infinite-dimensional, such as time series, probability distributions, or operators, and are defined over irregular domains. Yet, a unified learning theory for these settings has been lacking. To start addressing this gap, we introduce a novel convolutional learning framework for possibly infinite-dimensional signals supported on a manifold. Namely, we use the connection Laplacian associated with a Hilbert bundle as a convolutional operator, and we derive filters and neural networks, dubbed HilbNets. We make HilbNets and, more generally, the convolution operation, implementable via a two-stage sampling procedure. First, we show that sampling the manifold induces a Hilbert Cellular Sheaf, a generalized graph structure with Hilbert feature spaces and edge-wise coupling rules, and we prove that its sheaf Laplacian converges in probability to the underlying connection Laplacian as the sampling density increases. Notably, this result is a generalization to the infinite-dimensional bundle setting of the Belkin & Niyogi convergence result for the graph Laplacian to the manifold Laplacian, a theoretical cornerstone of geometric learning methods. Second, we discretize the signals and prove that the discretized (implementable) HilbNets converge to the underlying continuous architectures and are transferable across different samplings of the same bundle, providing consistency for learning. Finally, we validate our framework on synthetic and real-world tasks. Overall, our results broaden the scope of geometric learning as a whole by lifting classical Laplacian-based frameworks to settings where the signal at each point lives in its own Hilbert space.

2605.03562 2026-05-21 cs.LG cs.AI 版本更新

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

HeadQ: KV-Cache量化中的模型可见失真与分数空间校正

Jorge L. Ruiz Williams

AI总结本文提出HeadQ方法，通过在键侧存储低秩残差侧码并在校准学习的查询基上应用作为加性对数修正，以解决KV缓存量化中的模型可见失真问题，并通过分数空间误差预测注意力KL散度，优于原始键MSE。

Comments Withdrawn by the author because ethical concerns were identified after posting

详情

基于迭代的LLM改进法用于法语临床访谈的转录与说话人识别

Ambre Marie, Thomas Bertin, Guillaume Dardenne, Gwenolé Quellec

发表机构 * LaTIM UMR 1101 INSERM（INSERM LaTIM UMR 1101实验室）； University of Western Brittany（西布列塔尼大学）； University of Rouen Normandy（鲁昂诺曼底大学）

AI总结本研究提出一种多轮LLM后处理架构，通过交替进行说话人识别和词识别流程，提高法语医疗对话的转录准确性和说话人归属，通过两个法语临床数据集的消融研究验证了四种设计选择的效果。

详情

AI中文摘要

法语医疗对话的自动语音识别仍然具有挑战性，自发临床语音的词错误率通常超过30%。本研究提出一种多轮LLM后处理架构，通过交替进行说话人识别和词识别流程来提高转录准确性和说话人归属。在两个法语临床数据集（自杀预防电话咨询和清醒神经外科术前会诊）上的消融研究考察了四种设计选择：模型选择、提示策略、流程顺序和迭代深度。使用Qwen3-Next-80B，Wilcoxon符号秩检验证实了在自杀预防对话上WDER的显著降低（p<0.05，n=18），同时在清醒神经外科术前会诊上保持稳定（n=10），输出失败为零且计算成本可接受（RTF 0.32），表明该方法在离线临床部署中具有可行性，但仍有待在更大语料库上验证。

英文摘要

Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p<0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment, pending validation on larger corpora.

2602.08686 2026-05-21 cs.LG cs.AI 版本更新

CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

CompilerKV: 通过离线经验编译实现风险适应性的键值压缩

Ning Yang, Chengzhi Wang, Yibo Liu, Baoliang Tian, Haijun Zhang

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； University of Electronic Science and Technology of China（电子科技大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； ByteDance（字节跳动）； University of Science and Technology Beijing（北京科技大学）

AI总结本文提出CompilerKV，一种通过离线经验编译实现风险适应性的键值压缩方法，通过离线编译校准语料库中的纠正表，将在线纠正减少到O(1)查找加预算限制，从而在多个模型架构上实现了压缩SOTA，并在不同压力条件下保持最优性能。

详情

AI中文摘要

Prefill-only KV compression freezes a token subset at the end of prefill and decodes from it without further eviction. The retention decision is therefore irreversible, yet existing methods estimate the corrective signals it relies on, per-head reliability and prompt-level compression sensitivity, online from a single noisy prompt. We argue this is the wrong statistical unit: these signals exhibit far higher cross-prompt regularity than within-prompt signal-to-noise. We introduce \textsc{CompilerKV}, a KV-retention policy whose corrective tables are compiled offline from a calibration corpus, reducing online correction after the standard observation-window scan to $O(1)$ lookups plus a budget clamp. We find that compiled retention tables behave as portable architectural priors: rankings transfer across disjoint corpora on four backbones (mean Spearman $\barρ{=}0.90$), and direct model-to-model table transfer costs only $0.4$--$0.8$ LongBench points on average. At a 512-token budget, \textsc{CompilerKV} attains compressed-SOTA on all four backbones, improving over the strongest prefill-only baseline by $+1.67$ points on average (task-bootstrap 95\% CI $[+1.08,+2.37]$). Pressure regimes amplify the gap: under a fixed $512/32k$ cache ratio, CompilerKV remains the strongest compressed method through 128k RULER ($\sim\!73$ vs.\ FullKV $\sim\!79$, SnapKV $\sim\!38$); on 32k NIAH it reaches $0.89$ vs.\ SnapKV $0.42$; and at 32k input, retaining only $1.56\%$ of the prefill KV, batch-16 serving remains feasible where FullKV is OOM.

英文摘要

Prefill-only KV compression freezes a token subset at the end of prefill and decodes from it without further eviction. The retention decision is therefore irreversible, yet existing methods estimate the corrective signals it relies on, per-head reliability and prompt-level compression sensitivity, online from a single noisy prompt. We argue this is the wrong statistical unit: these signals exhibit far higher cross-prompt regularity than within-prompt signal-to-noise. We introduce \textsc{CompilerKV}, a KV-retention policy whose corrective tables are compiled offline from a calibration corpus, reducing online correction after the standard observation-window scan to $O(1)$ lookups plus a budget clamp. We find that compiled retention tables behave as portable architectural priors: rankings transfer across disjoint corpora on four backbones (mean Spearman $\barρ{=}0.90$), and direct model-to-model table transfer costs only $0.4$--$0.8$ LongBench points on average. At a 512-token budget, \textsc{CompilerKV} attains compressed-SOTA on all four backbones, improving over the strongest prefill-only baseline by $+1.67$ points on average (task-bootstrap 95\% CI $[+1.08,+2.37]$). Pressure regimes amplify the gap: under a fixed $512/32k$ cache ratio, CompilerKV remains the strongest compressed method through 128k RULER ($\sim\!73$ vs.\ FullKV $\sim\!79$, SnapKV $\sim\!38$); on 32k NIAH it reaches $0.89$ vs.\ SnapKV $0.42$; and at 32k input, retaining only $1.56\%$ of the prefill KV, batch-16 serving remains feasible where FullKV is OOM.

2602.07832 2026-05-21 cs.LG cs.AI 版本更新

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

rePIRL: 通过逆强化学习学习PRM以提高LLM推理

Xian Wu, Kaijie Zhu, Ying Zhang, Lun Wang, Wenbo Guo

发表机构 * Meta AI ； Department of Computer Science, University of California, Santa Barbara（加州大学圣芭芭拉分校计算机科学系）； Google DeepMind（谷歌DeepMind）； Independent Researcher（独立研究者）

AI总结本文提出rePIRL框架，通过逆强化学习学习高效的PRM，无需依赖专家策略的强假设，解决了传统方法中熵崩溃等固有限制问题，通过双学习过程和定制技术提升LLM推理性能，并在数学和编程任务数据集上验证了其有效性。

详情

AI中文摘要

过程奖励已被广泛用于深度强化学习以提高训练效率、减少方差并防止奖励黑客。在LLM推理中，现有工作也探索了各种解决方案来学习有效的过程奖励模型（PRM），有或无专家策略的帮助。然而，现有方法要么依赖于对专家策略的强假设（例如要求其奖励函数），要么受到固有限制（例如熵崩溃），导致PRM效果有限或泛化能力差。在本文中，我们引入了rePIRL，一种受逆强化学习启发的框架，能够在对专家策略假设最少的情况下学习有效的PRM。具体来说，我们设计了一种双学习过程，交替更新策略和PRM。我们的学习算法具有定制技术，以解决将传统逆强化学习扩展到LLM的挑战。我们理论证明，所提出的学习框架可以统一在线和离线PRM学习方法，证明rePIRL可以在最少假设下学习PRM。在标准化数学和编程推理数据集上的经验评估展示了rePIRL在现有方法上的有效性。我们进一步展示了训练的PRM在测试时训练、测试时扩展以及为训练困难问题提供早期信号的应用。最后，我们通过详细的消融研究验证了我们的训练配方和关键设计选择。

英文摘要

Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRM) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that updates the policy and the PRM interchangeably. Our learning algorithm has customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework can unify both online and offline PRM learning methods, justifying that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show the application of our trained PRM in test-time training, test-time scaling, and providing an early signal for training hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.

2601.18973 2026-05-21 cs.LG cs.AI cs.SY eess.SY quant-ph 版本更新

When Does Adaptation Win? Scaling Laws for Meta-Learning in Quantum Control

何时适应胜出？量子控制中元学习的缩放定律

Nima Leclerc, Chris Miller, Nicholas Brawand

发表机构 * The MITRE Corporation（MITRE公司）

AI总结本文研究了元学习在量子控制中的适应性问题，推导了适应增益的缩放定律，表明适应增益随着梯度步数指数饱和，而随任务方差线性增长，为判断适应的必要性提供了量化标准。

Comments 28 pages, 11 figures

详情

AI中文摘要

量子硬件固有地存在设备异质性和环境漂移，迫使实践者在次优非适应控制器和高成本的设备特定重新校准之间做出选择。我们推导了元学习的缩放定律下限，表明适应增益（任务特定梯度步的预期保真度提升）随着梯度步数指数饱和，而随任务方差线性增长，提供了判断适应是否值得其开销的量化标准。在量子门校准上的验证显示，低方差任务的适应收益微乎其微，但在极端分布外条件（训练噪声的10倍）下，两量子位门的保真度提升超过40%，这对减少云量子处理器上的设备校准时间具有启示。进一步在经典线性二次控制上的验证证实这些定律源于通用优化几何而非量子特定物理。我们还引入了一种少量次预适应协议，能够在3-19%的相对误差范围内，通过N=3-5次探测步估计最优的适应预算。

英文摘要

Quantum hardware suffers from intrinsic device heterogeneity and environmental drift, forcing practitioners to choose between suboptimal non-adaptive controllers or costly per-device recalibration. We derive a scaling law lower bound for meta-learning showing that the adaptation gain (expected fidelity improvement from task-specific gradient steps) saturates exponentially with gradient steps and scales linearly with task variance, providing a quantitative criterion for when adaptation justifies its overhead. Validation on quantum gate calibration shows negligible benefits for low-variance tasks but >40% fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10$\times$ the training noise), with implications for reducing per-device calibration time on cloud quantum processors. Further validation on classical linear-quadratic control confirms these laws emerge from general optimization geometry rather than quantum-specific physics. We further introduce a few-shot pre-adaptation protocol that estimates the optimal adaptation budget from $N{=}3$-5 probe steps within 3-19% relative error across out-of-distribution regimes.
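The qualitative shape of the scaling law can be sketched as a small function. The closed form below is an assumption chosen only to match the two properties stated in the abstract (exponential saturation in the number of gradient steps, linear growth in task variance); it is not the bound derived in the paper, and the parameters `scale` and `rate` are hypothetical.

```python
import math

def adaptation_gain(steps, task_variance, scale=1.0, rate=0.5):
    """Assumed illustrative form: gain = scale * variance * (1 - exp(-rate * steps)).

    - Linear in task_variance.
    - Saturates exponentially as steps grows.
    """
    return scale * task_variance * (1.0 - math.exp(-rate * steps))

# Gains after 0, 1, 5, and 50 gradient steps at unit task variance.
gains = [adaptation_gain(k, task_variance=1.0) for k in (0, 1, 5, 50)]

# Doubling the task variance doubles the gain at a fixed step count.
low_variance = adaptation_gain(5, task_variance=1.0)
high_variance = adaptation_gain(5, task_variance=2.0)
```

Under this assumed form, adaptation is worth its overhead only when the task variance is large enough that the saturated gain exceeds the cost of the extra gradient steps, which mirrors the criterion the abstract describes.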

2510.09060 2026-05-21 cs.AI cs.CV 版本更新

Letting Trajectories Spread: Quality-Preserving Control for Diverse Flow Matching

让轨迹扩散：用于多样化流匹配的质量保持控制

Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Bo An, Ivor Tsang, Yang You

发表机构 * The University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）； Nanyang Technological University（南洋理工大学）； CFAR, Agency for Science, Technology and Research, Singapore（新加坡科技研究局CFAR）； University of California, Santa Barbara（加州大学圣巴巴拉分校）； National University of Singapore（新加坡国立大学）

AI总结本文提出了一种无需训练的推理时控制机制，使流本身具备多样性意识，通过几何上与模式质量寻求方向解耦的引导来鼓励轨迹横向扩散，同时通过时间调度的随机扰动重新引入不确定性，从而在不降低图像细节和提示忠实度的情况下提升多样性。

详情

AI中文摘要

基于流的文本到图像模型遵循确定性轨迹，这使得在有限的采样预算下探索多样模式成本较高。现有方法提高多样性通常依赖于重新训练或降低图像保真度。为了解决这一限制，我们提出了一种无需训练的推理时控制机制，使流本身具备多样性意识。我们的核心见解是通过几何上与模式质量寻求方向解耦的引导来鼓励多样性。我们的方法通过特征空间目标同时鼓励轨迹横向扩散，并通过时间调度的随机扰动重新引入不确定性。关键在于这种扰动被投影为与生成流正交，这是一个几何约束，允许其在不降低图像细节或提示保真度的情况下提升多样性。理论上，我们证明了这种设计单调地增加了一个体积代理，同时近似地保持边际分布，为生成质量的鲁棒性提供了原理性解释。经验上，在多个文本到图像设置下，固定采样预算下，我们的方法在Vendi分数和Brisque等多样性指标上一致优于强基线，同时保持图像质量和对齐。

英文摘要

Flow-based text-to-image models follow deterministic trajectories, making it costly to explore diverse modes under limited sampling budgets. Existing approaches to improving diversity often rely on retraining or degrade image fidelity. To address this limitation, we present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our core insight is to encourage diversity through guidance that is geometrically decoupled from the mode's quality-seeking direction. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Theoretically, we show that this design monotonically increases a volume surrogate while approximately preserving the marginal distribution, providing a principled explanation for the robustness of generation quality. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines, while upholding image quality and alignment.
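The orthogonality constraint on the perturbation can be illustrated with a standard vector projection. This is a minimal sketch, assuming the flow direction at a sampling step is available as a velocity vector; the function name is hypothetical and the paper's time schedule and feature-space objective are not reproduced here.

```python
def project_orthogonal(noise, velocity):
    """Remove from `noise` its component along `velocity`.

    Returns noise - (<noise, velocity> / <velocity, velocity>) * velocity,
    which is orthogonal to `velocity` by construction.
    """
    vv = sum(v * v for v in velocity)
    if vv == 0.0:
        # No flow direction to protect; leave the perturbation unchanged.
        return list(noise)
    coeff = sum(n * v for n, v in zip(noise, velocity)) / vv
    return [n - coeff * v for n, v in zip(noise, velocity)]

# Toy 3-dimensional example (hypothetical numbers).
velocity = [1.0, 0.0, 0.0]
noise = [0.5, -1.0, 2.0]
perturbation = project_orthogonal(noise, velocity)
residual_along_flow = sum(p * v for p, v in zip(perturbation, velocity))
```

Because the projected perturbation has no component along the velocity, it moves samples sideways relative to the flow rather than along it, which is the intuition behind boosting variation without disturbing the quality-seeking direction.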

2510.05942 2026-05-21 cs.CL cs.AI 版本更新

EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

EvalMORAAL: 可解释的链式推理与大语言模型道德对齐的LLM-as-Judge评估

Hadi Mohammadi, Anastasia Giachanou, Robert A. Bagheri

发表机构 * Department of Methodology and Statistics, Utrecht University（方法论与统计学系，乌得勒支大学）

AI总结本文提出EvalMORAAL框架，通过两种评分方法和模型作为裁判的同行评审，评估20个大语言模型的道德对齐情况，发现西方与非西方地区存在显著的道德对齐差距。

Comments Accepted as a poster at *SEM 2026

详情

AI中文摘要

我们提出了EvalMORAAL，一个透明的链式推理（CoT）框架，使用两种评分方法（对数概率和直接评分）以及模型作为裁判的同行评审来评估20个大语言模型的道德对齐。我们在世界价值观调查（55个国家，19个主题）和PEW全球态度调查（39个国家，8个主题）上对模型进行了评估。使用EvalMORAAL，顶级模型与调查响应高度一致（WVS上的皮尔逊相关系数r≈0.90）。然而，我们发现明显的区域差异：西方地区平均r=0.82，而非西方地区平均r=0.61（绝对差距0.21），表明存在持续的区域对齐差距。我们的框架增加了三个部分：（1）为所有模型提供两种评分方法以实现公平比较，（2）带有自我一致性检查的结构化CoT协议，以及（3）一个模型作为裁判的同行评审，使用数据驱动的阈值标记348个冲突。同行一致性与WVS调查对齐相关（r=0.74，p<.001；PEW r=0.39，不显著），支持自动化质量检查。这些结果展示了文化感知AI的真实进展，同时突显了跨区域应用的开放挑战。

英文摘要

We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's $r \approx 0.90$ on WVS). Yet we find a clear regional difference: Western regions average $r=0.82$ while non-Western regions average $r=0.61$ (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment ($r=0.74$, $p<.001$; PEW $r=0.39$, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
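The alignment figures above are Pearson correlations between model scores and survey scores. A minimal sketch of that statistic follows; the toy score lists are hypothetical and are not survey data, and only the two regional averages (0.82 and 0.61) come from the abstract.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-topic scores: the model tracks the survey with a constant offset.
survey_scores = [0.1, 0.4, 0.5, 0.9]
model_scores = [0.2, 0.5, 0.6, 1.0]
r_example = pearson_r(survey_scores, model_scores)

# Regional gap computed from the averages reported in the abstract.
regional_gap = round(0.82 - 0.61, 2)
```

A constant offset leaves the correlation at 1, which is a reminder that Pearson's r measures agreement in relative ordering and spread across topics, not agreement in absolute level.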

2508.16860 2026-05-21 cs.SE cs.AI cs.LG 版本更新

TriagerX: Dual Transformers for Bug Triaging Tasks with Content and Interaction Based Rankings

TriagerX: 用于基于内容和交互的缺陷分类任务的双变换器

Md Afif Al Mamun, Gias Uddin, Lan Xia, Longyu Zhang

发表机构 * University of Calgary（卡尔加里大学）； York University（约克大学）； IBM Canada（IBM加拿大）

AI总结本文提出TriagerX，一种双变换器架构，通过结合内容和交互信息来改进缺陷分类任务的推荐准确性，优于现有最先进方法。

Comments Accepted to IEEE Transactions on Software Engineering (TSE). 17 pages, 15 figures

详情

DOI: 10.1109/TSE.2026.3685573

AI中文摘要

预训练语言模型（PLMs）是基于变换器的架构，可用于缺陷分派任务。与依赖统计特征（例如TF-IDF、词袋）的传统机器学习（ML）模型相比，PLMs更能捕捉标记语义。然而，PLMs仍可能关注缺陷报告中相关性较低的标记，这会影响其有效性。此外，当不考虑开发人员围绕类似缺陷的交互历史时，模型的推荐可能不够优化。我们设计了TriagerX来解决这些限制。首先，为了更可靠地评估标记语义，我们利用双变换器架构。与当前最先进的（SOTA）基线使用单一变换器架构不同，TriagerX从两个变换器中收集推荐，每个变换器通过其最后三层提供推荐。这种设置生成了一个稳健的、基于内容的候选开发人员排名。TriagerX随后通过一种新的基于交互的排名方法来细化此排名，该方法考虑了开发人员与类似已修复缺陷的历史交互。在五个数据集中，TriagerX超越了所有九种基于变换器的方法，包括SOTA基线，通常在Top-1和Top-3开发人员推荐准确性上提高了超过10%。我们与一家大型行业合作伙伴合作，成功将TriagerX部署到他们的开发环境中。合作伙伴同时需要开发人员推荐和组件推荐，其中组件作为团队分配的代理，在开发人员流动或团队变动的情况下特别有用。我们在合作伙伴的数据集上针对这两项任务训练了TriagerX，其在组件推荐上优于SOTA基线最高达10%，在开发人员推荐上最高达54%。

英文摘要

Pretrained Language Models or PLMs are transformer-based architectures that can be used in bug triaging tasks. PLMs can better capture token semantics than traditional Machine Learning (ML) models that rely on statistical features (e.g., TF-IDF, bag of words). However, PLMs may still attend to less relevant tokens in a bug report, which can impact their effectiveness. In addition, the model can be sub-optimal with its recommendations when the interaction history of developers around similar bugs is not taken into account. We designed TriagerX to address these limitations. First, to assess token semantics more reliably, we leverage a dual-transformer architecture. Unlike current state-of-the-art (SOTA) baselines that employ a single transformer architecture, TriagerX collects recommendations from two transformers with each offering recommendations via its last three layers. This setup generates a robust content-based ranking of candidate developers. TriagerX then refines this ranking by employing a novel interaction-based ranking methodology, which considers developers' historical interactions with similar fixed bugs. Across five datasets, TriagerX surpasses all nine transformer-based methods, including SOTA baselines, often improving Top-1 and Top-3 developer recommendation accuracy by over 10%. We worked with our large industry partner to successfully deploy TriagerX in their development environment. The partner required both developer and component recommendations, with components acting as proxies for team assignments-particularly useful in cases of developer turnover or team changes. We trained TriagerX on the partner's dataset for both tasks, and it outperformed SOTA baselines by up to 10% for component recommendations and 54% for developer recommendations.

2506.08277 2026-05-21 q-bio.NC cs.AI cs.CL cs.CV cs.LG 版本更新

Task-conditioned probing of instruction-tuned multimodal LLMs: Region-specific brain alignment patterns under naturalistic stimuli

指令微调多模态大语言模型的任务条件探测：自然主义刺激下的区域特定大脑对齐模式

Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta

发表机构 * Technische Universität Berlin（柏林工业大学）； Rice University（莱斯大学）； AWS AI Labs, Amazon（亚马逊AWS人工智能实验室）； IIT Delhi（印度理工学院德里分校）； University of Wisconsin - Madison（威斯康星大学麦迪逊分校）； Spector Inc（Spector 公司）； IIIT-Hyderabad（海得拉巴国际信息技术学院）； Microsoft（微软）

AI总结本研究探讨了指令微调多模态大语言模型在自然主义刺激下的大脑对齐模式，通过比较不同模型在视频和音频任务中的表现，揭示了指令微调与模型表示的大脑对齐程度之间的关联。

Comments 57 pages, 39 figures

详情

AI中文摘要

近期的体素级多模态脑编码研究显示，多模态大语言模型（MLLMs）的大脑对齐程度高于单模态模型。更近期的研究表明，指令微调多模态（IT）模型能够生成与大脑活动高度对齐的任务特定表示，但大多数先前评估集中在单模态刺激上，或集中在多模态刺激下的非指令微调模型上。我们仍然不清楚指令微调是否与IT-MLLMs围绕功能性任务需求组织其表示相关，还是这些表示仅反映表面语义。为此，我们利用MLLM表示预测自然主义电影观看（带音频的视频）期间记录的fMRI响应，以此估计大脑对齐程度。使用来自六个视频和两个音频IT-MLLMs的指令特定嵌入，跨13个视频任务指令，我们发现指令微调视频MLLMs的大脑对齐程度高于上下文学习（ICL）多模态模型（~9%）、非指令微调多模态模型（~15%）和单模态基线（~20%）。我们对MLLMs在视频和音频任务上的评估以及语言引导的探测，得到了各不相同的任务特定MLLM表示，这些表示在不同大脑区域之间存在差异。我们还发现，ICL模型表现出强语义组织（r=0.78），而IT模型与指令文本语义的耦合较弱（r=0.14），这与任务条件子空间和更高大脑对齐相关的观点一致。这些发现与任务特定指令和更强的大脑-MLLM对齐之间存在关联的结论一致，并为映射两个系统中的联合信息处理开辟了新途径。我们公开了代码 [https://github.com/subbareddy248/mllm_videos]。

英文摘要

Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction-tuning is associated with IT-MLLMs organizing their representations around functional task demands or if they simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs show higher brain alignment than in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Our evaluation of MLLMs across video and audio tasks, and language-guided probing produces distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].

2406.03506 2026-05-21 cs.LG cs.AI 版本更新

Fuzzy Convolution Neural Networks for Tabular Data Classification

模糊卷积神经网络用于表格数据分类

Arun D. Kulkarni

发表机构 * Computer Science Department, University of Texas at Tyler（德克萨斯大学泰勒分校计算机科学系）

AI总结本文提出了一种针对表格数据分类的模糊卷积神经网络（FCNN），通过将特征值映射为模糊隶属度并转换为图像来训练CNN模型，从而在表格数据分类任务中实现有效的学习和优于现有方法的性能。

Comments 10 pages, 16 figures, Submitted to IEEE Access

Journal ref IEEE Access, vol. 12, pp. 151846-151855 (2024)

详情

DOI: 10.1109/ACCESS.2024.3479882

AI中文摘要

近年来，由于在各种领域中表现出色，特别是图像和文本分类任务，卷积神经网络（CNNs）已经引起了广泛关注。然而，它们在表格数据分类中的应用仍然很少被探索。在生物信息学、金融、医学等领域，非图像数据普遍存在。将CNNs适应于分类非图像数据仍然极具挑战性。本文研究了CNNs在表格数据分类中的有效性，旨在弥合传统机器学习方法与深度学习技术之间的差距。我们提出了一种专门针对表格数据的新型框架——模糊卷积神经网络（FCNN），以捕捉特征向量中的局部模式。在我们的方法中，我们将特征值映射到模糊隶属度。模糊隶属度向量被转换为图像，用于训练CNN模型。训练后的CNN模型用于分类未知的特征向量。为了验证我们的方法，我们生成了六个复杂的噪声数据集。我们从每个数据集中随机选择70%的样本用于训练，30%用于测试。数据集还使用了最先进的机器学习算法，如决策树（DT）、支持向量机（SVM）、模糊神经网络（FNN）、贝叶斯分类器和随机森林（RF）进行分类。实验结果表明，我们提出的方法能够有效地从表格数据中学习有意义的表示，实现与现有方法相媲美或更优的性能。总体而言，我们的发现表明，所提出的FCNN模型在表格数据分类任务中具有前景，作为一种可行的替代方案，为在结构化数据分析中利用深度学习提供了新的视角和潜在的机会。

英文摘要

Recently, convolution neural networks (CNNs) have attracted a great deal of attention due to their remarkable performance in various domains, particularly in image and text classification tasks. However, their application to tabular data classification remains underexplored. There are many fields, such as bioinformatics, finance, and medicine, where nonimage data are prevalent. Adapting CNNs to classify nonimage data remains highly challenging. This paper investigates the efficacy of CNNs for tabular data classification, aiming to bridge the gap between traditional machine learning approaches and deep learning techniques. We propose a novel framework, the fuzzy convolution neural network (FCNN), tailored specifically for tabular data to capture local patterns within feature vectors. In our approach, we map feature values to fuzzy memberships. The fuzzy membership vectors are converted into images that are used to train the CNN model. The trained CNN model is used to classify unknown feature vectors. To validate our approach, we generated six complex noisy data sets. We used a randomly selected seventy percent of the samples from each data set for training and thirty percent for testing. The data sets were also classified using state-of-the-art machine learning algorithms such as the decision tree (DT), support vector machine (SVM), fuzzy neural network (FNN), Bayes classifier, and Random Forest (RF). Experimental results demonstrate that our proposed model can effectively learn meaningful representations from tabular data, achieving competitive or superior performance compared to existing methods. Overall, our findings suggest that the proposed FCNN model holds promise as a viable alternative for tabular data classification tasks, offering a fresh perspective and potentially unlocking new opportunities for leveraging deep learning in structured data analysis.
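The first two steps of the pipeline (feature values to fuzzy memberships, memberships to an image) can be sketched as follows. The abstract does not specify the membership functions or the image layout, so this sketch assumes three triangular memberships ("low", "medium", "high") on features scaled to [0, 1] and a layout with one row per feature; all names are hypothetical.

```python
def triangular(x, center, half_width=0.5):
    """Triangular membership: 1 at `center`, falling linearly to 0 at center +/- half_width."""
    return max(0.0, 1.0 - abs(x - center) / half_width)

def fuzzy_memberships(x):
    """Membership degrees of a scaled feature value in (low, medium, high)."""
    return [triangular(x, 0.0), triangular(x, 0.5), triangular(x, 1.0)]

def to_image(feature_vector):
    """Stack per-feature membership vectors into a 2-D grid (assumed layout):
    one row per feature, one column per fuzzy set."""
    return [fuzzy_memberships(x) for x in feature_vector]

# A hypothetical sample with three features already scaled to [0, 1].
sample = [0.0, 0.25, 0.5]
image = to_image(sample)
```

The resulting grid is what a small CNN would consume in this sketch; neighbouring cells encode overlapping fuzzy sets, which is one way local patterns within a feature vector could become visible to convolution filters.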


2305.09620 2026-05-21 cs.CL cs.AI cs.LG version update

AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction

Junsol Kim, Byungkyu Lee

Affiliations: Department of Sociology; University of Chicago; New York University; Chicago, IL; New York, NY

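AI summary: This paper proposes a framework based on large language models that predicts missing responses in repeated cross-sectional surveys by combining embeddings for questions, respondents, and survey periods, compensating for the limited ability of traditional surveys to capture historical change.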

Details

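AI-generated abstract (translated from Chinese)

Nationally representative surveys track public opinion, but they ask only a limited number of questions each year, which limits their potential to capture historical change. To fill this gap, we develop a framework based on large language models (LLMs) that predicts missing responses in repeated cross-sectional surveys by combining embeddings for questions, respondents, and survey periods. We introduce two new applications of LLMs to survey research: retrodiction (predicting year-level missing opinions) and unasked opinion prediction (predicting entirely missing opinions). Using data from the 1972-2021 General Social Survey (GSS), our LLM-based models perform well both in cross-validation and on public opinion measured by other organizations in years when the GSS did not ask the question. These capabilities allow us to recover missing trends and to determine when public attitudes changed, such as the rise in support for same-sex marriage. However, performance on unasked opinion prediction remains limited. We show when our models outperform existing benchmarks, examine which opinions and respondents are more predictable, and evaluate whether our approach reduces the tendency of LLMs to homogenize predicted responses. Our study demonstrates that LLMs and surveys can enhance each other: LLMs expand the potential of surveys, while surveys calibrate LLMs to simulate human opinion.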

English abstract

Nationally representative surveys track public opinion, yet they ask only a limited set of questions each year, limiting their potential to capture historical changes. To fill this gap, we develop a large language model (LLM)-based framework for predicting missing responses in repeated cross-sectional surveys by incorporating embeddings for questions, respondents, and survey periods. We introduce two new applications of LLMs to survey research: retrodiction (predicting year-level missing opinions) and unasked opinion prediction (predicting entirely missing opinions). Using data from the 1972-2021 General Social Surveys, our LLM-based models perform strongly in retrodicting masked GSS opinions through cross-validation and public opinions measured by other organizations in years when the GSS did not ask them. These capabilities enable us to recover missing trends and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. However, performance remains modest for unasked opinion prediction. We show when our models outperform established benchmarks, examine which opinions and respondents are more predictable, and evaluate whether our approach reduces LLMs' tendency to homogenize predicted responses. Our study demonstrates that LLMs and surveys can mutually enhance each other: LLMs broaden survey potential, while surveys calibrate LLMs for simulating human opinions.
