arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.01357 2026-06-09 cs.LG version update

Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through the Lens of Imitation Learning

Shangzhe Li, Xuchao Zhang, Chetan Bansal, Weitong Zhang

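Affiliations: University of North Carolina at Chapel Hill; Microsoft Research

AI summary: This paper models self-play finetuning as a min-max game between the model and a regularized implicit reward player parameterized by the model itself, unifying self-play imitation and preference alignment, and proposes a new algorithm based on the χ² divergence that outperforms existing methods on a range of language model finetuning tasks.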

Comments 26 pages, 6 tables, 5 figures

Abstract

Self-play post-training methods have emerged as an effective approach for finetuning large language models and turning a weak language model into a strong one without preference data. However, the theoretical foundations of self-play finetuning remain underexplored. In this work, we address this gap by connecting self-play finetuning with adversarial imitation learning: we formulate the finetuning procedure as a min-max game between the model and a regularized implicit reward player parameterized by the model itself. This perspective unifies self-play imitation and general preference alignment within a common framework. Under this formulation, we present a game-theoretic analysis showing that self-play finetuning converges to its equilibrium. Guided by this theoretical formulation, we propose a new self-play imitation finetuning algorithm based on the $\chi^2$-divergence variational objective, with bounded rewards and improved stability. Experiments on various language model finetuning tasks demonstrate consistent improvements over existing self-play methods and validate our theoretical insights.
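
For orientation, the textbook variational (Fenchel conjugate) form of the Pearson $\chi^2$ divergence is sketched below. The abstract does not give the paper's objective, so treating this standard form as the one the authors build on is an assumption; $P$, $Q$, and the critic $g$ are generic symbols, not the paper's notation.

```latex
% Pearson chi-squared divergence as an f-divergence with f(t) = (t - 1)^2,
% whose convex conjugate is f^*(u) = u + u^2 / 4.
\chi^2(P \,\|\, Q)
  = \mathbb{E}_{Q}\!\left[\left(\frac{dP}{dQ} - 1\right)^{2}\right]
  = \sup_{g}\;\Big\{ \mathbb{E}_{P}[\,g\,] - \mathbb{E}_{Q}\!\left[\,g + \tfrac{1}{4} g^{2}\,\right] \Big\},
\qquad
g^{\star} = 2\left(\frac{dP}{dQ} - 1\right).
```

The quadratic penalty on $g$ under $Q$ is what keeps the critic from growing without bound in this form, which is at least consistent with the "bounded rewards" the abstract mentions.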

2601.21754 2026-06-09 cs.AI version update

Language-based Trial and Error Falls Behind in the Era of Experience

Haoyu Wang, Guozheng Ma, Shugang Cui, Yilun Kong, Haotian Luo, Li Shen, Mengya Gao, Yichao Wu, Xiaogang Wang, Dacheng Tao

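Affiliations: University of Science and Technology of China

AI summary: To address the high cost of exploration for LLMs in nonlinguistic environments, the paper proposes the SCOUT framework, which uses lightweight models to explore the environment and activates the LLM's world knowledge through SFT and RL, substantially improving performance while reducing compute overhead.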

Abstract

While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate that the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial and error, which is computationally unsustainable for parameter-heavy LLMs operating in a high-dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding LLMs. The collected trajectories are used to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while reducing GPU-hour consumption by about 60%.

2602.00238 2026-06-09 cs.CL cs.AI cs.LG version update

DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

Tianyi Hu, Niket Tandon, Akhil Arora

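Affiliations: Aarhus University; Microsoft Research

AI summary: To address existing RAG systems' neglect of the need for diversity in open-ended information seeking, the paper proposes the Diverge framework, which uses iterative, reflection-guided exploration of diverse viewpoints and diversity-aware retrieval support to raise diversity by roughly 2x while preserving quality.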

Abstract

Existing retrieval-augmented generation (RAG) systems often assume that each query has a single correct answer. This assumption overlooks open-ended information-seeking scenarios where multiple plausible answers are valuable, and where diversity is important for creativity, fairness, and inclusive access to information. We show that standard RAG systems fail to fully use diverse retrieved contexts: simply increasing retrieval diversity does not necessarily lead to diverse generations. To address this limitation, we propose Diverge, a plug-and-play agentic RAG framework that improves the diversity-quality trade-off through iterative, reflection-guided exploration of diverse viewpoints and diversity-aware retrieval support. We further introduce evaluation metrics for characterizing the diversity-quality trade-off in open-ended question answering. Experiments across multiple real-world datasets and backbone LLMs show that Diverge achieves the best trade-off among competitive baselines, increasing diversity by $\sim 2\times$ without noticeable quality degradation. These results reveal a systematic limitation of current RAG systems and show the value of explicit diversity modeling.

2510.02014 2026-06-09 cs.LG version update

Normality Calibration in Semi-supervised Graph Anomaly Detection

Guolei Zeng, Hezhe Qiao, Guoguo Ai, Jinsong Guo, Guansong Pang

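Affiliations: University of Science and Technology of China

AI summary: Proposes the GraphNC framework, which uses a teacher model to calibrate normality jointly in the anomaly score and representation spaces, addressing normality overfitting in semi-supervised graph anomaly detection and reducing false positives.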

Comments Accepted by ICML2026

Abstract

Graph anomaly detection (GAD) has attracted growing interest for its crucial ability to uncover irregular patterns in broad applications. Semi-supervised GAD, which assumes that a subset of annotated normal nodes is available during training, is among the most widely explored application settings. However, the normality learned by existing semi-supervised GAD methods is limited to the labeled normal nodes and tends to overfit the given patterns. This can lead to high detection errors, such as high false positive rates. To overcome this limitation, we propose GraphNC, a graph normality calibration framework that leverages both labeled and unlabeled data to calibrate the normality from a teacher model (a pre-trained semi-supervised GAD model) jointly in the anomaly score and node representation spaces. GraphNC includes two main components: anomaly score distribution alignment (ScoreDA) and perturbation-based normality regularization (NormReg). ScoreDA optimizes the anomaly scores of our model by aligning them with the score distribution yielded by the teacher model. Because the teacher model scores most of the normal nodes and part of the anomalous nodes accurately, the score alignment effectively pulls the anomaly scores of the normal and abnormal classes toward the two ends, resulting in more separable anomaly scores. Nevertheless, some scores from the teacher model are inaccurate. To mitigate the misleading effect of these scores, NormReg is designed to regularize the graph normality in the representation space, making the representations of normal nodes more compact by minimizing a perturbation-guided consistency loss solely on the labeled nodes.

2601.23221 2026-06-09 cs.LG version update

Optimal Fair Aggregation of Crowdsourced Noisy Labels using Demographic Parity Constraints

Gabriel Singer, Samuel Gruffaz, Olivier Vo Van, Nicolas Vayatis, Argyris Kalogeratos

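Affiliations: University of California, Berkeley; Université de Paris; CNRS

AI summary: For fairness in crowdsourced label aggregation, the paper analyzes the fairness gaps of Majority Vote and Optimal Bayesian aggregation within the ε-fairness framework and generalizes a multiclass fairness post-processing algorithm to enforce demographic parity constraints.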

Abstract

As acquiring reliable ground-truth labels is usually costly or infeasible, crowdsourcing and aggregating noisy human annotations is the typical resort. Aggregating subjective labels, though, may amplify individual biases, particularly regarding sensitive features, raising fairness concerns. Nonetheless, fairness in crowdsourced aggregation remains largely unexplored, with no existing convergence guarantees and only limited post-processing approaches for enforcing $\varepsilon$-fairness under demographic parity. We address this gap by analyzing the fairness gap of crowdsourced aggregation methods within the $\varepsilon$-fairness framework, for Majority Vote and Optimal Bayesian aggregation. In the small-crowd regime, we derive an upper bound on the fairness gap of Majority Vote in terms of the fairness gaps of the individual annotators. We further show that the fairness gap of the aggregated consensus converges exponentially fast to that of the ground truth under interpretable conditions. Since the ground truth itself may still be unfair, we generalize a state-of-the-art multiclass fairness post-processing algorithm from the continuous to the discrete setting, which enforces strict demographic parity constraints on any aggregation rule. Experiments on synthetic and real datasets demonstrate the effectiveness of our approach and corroborate the theoretical insights.
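
To make the quantities in this abstract concrete, here is a minimal Python sketch of Majority Vote aggregation and a demographic parity gap for binary labels. The function names, the tie-breaking rule, and the definition of the gap as the largest difference in positive rates between groups are illustrative assumptions; the paper's formal definitions may differ.

```python
from collections import Counter


def majority_vote(annotations):
    """Aggregate one item's annotator labels by majority.

    Assumption: ties are broken toward the smallest label.
    """
    counts = Counter(annotations)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


def demographic_parity_gap(labels, groups, positive=1):
    """Largest difference in positive-label rate between any two groups.

    Assumption: this max-minus-min form stands in for the paper's fairness gap.
    """
    rates = {}
    for g in set(groups):
        members = [y for y, a in zip(labels, groups) if a == g]
        rates[g] = sum(1 for y in members if y == positive) / len(members)
    return max(rates.values()) - min(rates.values())


# Hypothetical data: three annotators per item, one sensitive attribute per item.
votes = [[1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
groups = ["a", "a", "a", "b"]
consensus = [majority_vote(v) for v in votes]
gap = demographic_parity_gap(consensus, groups)
```

Under this reading, an aggregation rule would be called ε-fair when `gap` does not exceed ε.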

2601.22736 2026-06-09 cs.LG cs.AI version update

UA-DCM: Uncertainty-aware Causal Decision Making via Effect Bound Decomposition

Md Musfiqur Rahman, Ziwei Jiang, Hilaf Hasson, Murat Kocaoglu

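Affiliations: Electrical and Computer Engineering, Purdue University; Computer Science, Johns Hopkins University; Cohesity

AI summary: Proposes a new framework that decomposes the range of causal effect values into a part that can be eliminated and a part that cannot, determining whether collecting more samples can help identify the optimal action, and approximates this decomposition with neural causal models.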

Abstract

Causal inference from observational data can provide strong evidence for finding the best action in a decision-making scenario without having to perform expensive randomized trials. The causal effect of an action is often not pointwise identifiable even with infinite data due to unobserved confounding factors. Furthermore, having only finitely many samples adds another layer of uncertainty to causal effect estimation. Several existing methods can be used to obtain upper and lower bounds on the causal effect, ranging from symbolic methods to the more recent neural network-based approaches, which implicitly incorporate both sources of uncertainty. However, these methods do not indicate whether collecting more samples would help identify the best action from observational data, leaving experts in the dark about their data collection strategies. We address this problem with a novel framework that can distinguish the range of causal effect values that might be eliminated by collecting more samples from the range of values that, with high probability, cannot be eliminated with more observational samples. We show that this partitioning can be obtained by solving max-min and min-max optimization problems. We leverage neural causal models to approximately recover this decomposition in practice. We demonstrate via experiments on synthetic and real-world datasets that our algorithm can determine when collecting more samples will not help determine the best action. Our framework can help practitioners decide when to resort to non-observational studies or seek to measure some of the unmeasured confounders for optimal decision-making.

2508.19857 2026-06-09 cs.LG quant-ph version update

Quantum latent distributions in deep generative models

Omar Bacarreza, Thorin Farnsworth, Alexander Makarovskiy, Hugo Wallner, Tessa Hicks, Santiago Sempere-Llagostera, John Price, Robert J. A. Francis-Jones, William R. Clements

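Affiliations: ORCA Computing

AI summary: Studies when and why latent distributions produced by quantum processors improve generative model performance, proves in theory that they can generate data distributions that classical distributions cannot produce efficiently, and verifies on synthetic and molecular datasets the performance advantage arising from quantum interference statistics.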

Comments Accepted at ICML 2026

Abstract

Many successful families of generative models leverage a low-dimensional latent distribution that is mapped to a data distribution. Though simple latent distributions are often used, the choice of distribution has a strong impact on model performance. Recent experiments have suggested that the probability distributions produced by quantum processors, which are typically highly correlated and classically intractable, can lead to improved performance on some datasets. However, when and why latent distributions produced by quantum processors can improve performance, and whether these improvements are connected to quantum properties of these distributions, are open questions that we investigate in this work. We show in theory that, under certain conditions, these "quantum latent distributions" enable generative models to produce data distributions that classical latent distributions cannot efficiently produce. We provide intuition as to the underlying mechanisms that could explain a performance advantage on real datasets. Based on this, we perform extensive benchmarking on a synthetic quantum dataset and the QM9 molecular dataset, using both simulated and real photonic quantum processors. We find that the statistics arising from quantum interference lead to improved generative performance compared to classical baselines, suggesting that quantum processors can play a role in expanding the capabilities of deep generative models.

2601.22211 2026-06-09 cs.LG version update

Latent Spherical Flow Policy for Reinforcement Learning with Combinatorial Actions

Lingkai Kong, Anagha Satish, Hezi Jiang, Akseli Kangaslahti, Andrew Ma, Wenbo Chen, Mingxiao Song, Lily Xu, Milind Tambe

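Affiliations: University of California, Berkeley

AI summary: Proposes LSFlow, which learns a stochastic policy in a compact continuous latent space via spherical flow matching, uses a combinatorial optimization solver to guarantee action feasibility, and introduces a smoothed Bellman operator to handle discontinuous value functions, outperforming baselines by an average of 20.6% across several combinatorial RL tasks.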

Comments ICML'26 Spotlight

Abstract

Reinforcement learning (RL) with combinatorial action spaces remains challenging because feasible action sets are exponentially large and governed by complex feasibility constraints, making direct policy parameterization impractical. Existing approaches embed task-specific value functions into constrained optimization programs or learn deterministic structured policies, sacrificing generality and policy expressiveness. We propose a solver-induced latent spherical flow policy that brings the expressiveness of modern generative policies to combinatorial RL while guaranteeing feasibility by design. Our method, LSFlow, learns a stochastic policy in a compact continuous latent space via spherical flow matching, and delegates feasibility to a combinatorial optimization solver that maps each latent sample to a valid structured action. To improve efficiency, we train the value network directly in the latent space, avoiding repeated solver calls during policy optimization. To address the piecewise-constant and discontinuous value landscape induced by solver-based action selection, we introduce a smoothed Bellman operator that yields stable, well-defined learning targets. Empirically, our approach outperforms state-of-the-art baselines by an average of 20.6% across a range of challenging combinatorial RL tasks.

2601.21996 2026-06-09 cs.CL cs.AI cs.LG version update

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Jianhui Chen, Yuzhang Luo, Liangming Pan

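Affiliations: University of Science and Technology of China

AI summary: Proposes the Mechanistic Data Attribution (MDA) framework, which uses influence functions to trace interpretable units back to specific training samples. Causal validation shows that intervening on high-influence samples significantly modulates the emergence of interpretable heads; the paper also finds that repetitive structural data acts as a mechanistic catalyst and verifies the functional link between induction heads and in-context learning.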

Comments ICML2026 (Oral)

Abstract

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

2601.21816 2026-06-09 cs.LG version update

Nonparametric LLM Evaluation from Preference Data

Dennis Frauen, Athiya Deviyani, Mihaela van der Schaar, Stefan Feuerriegel

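Affiliations: GitHub

AI summary: Proposes DMLRank, a nonparametric statistical framework that compares and ranks large language models from preference data using debiased machine learning. It introduces generalized average ranking scores and offers statistical efficiency, compatibility with black-box methods, integration with pre-trained evaluators, and optimized data collection policies.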

Comments Accepted at ICML 2026

Abstract

Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid uncertainty quantification when flexible machine learning methods are used. In this paper, we propose a nonparametric statistical framework, called DMLRank, for comparing and ranking LLMs from preference data using debiased machine learning (DML). For this, we introduce generalized average ranking scores (GARS), which generalize commonly used ranking models, including the Bradley-Terry model and PageRank/Rank Centrality, and accommodate complex human responses such as ties. DMLRank comes with the following advantages: (i) It produces statistically efficient estimates of GARS ranking scores. (ii) It naturally allows the incorporation of black-box machine learning methods for estimation. (iii) It can be combined with pre-trained LLM evaluators (e.g., using LLM-as-a-judge). (iv) It suggests optimal policies for collecting preference data under budget constraints. We demonstrate these advantages both theoretically and empirically using synthetic and real-world preference datasets. In summary, our framework provides practitioners with powerful, state-of-the-art methods for comparing or ranking LLMs for leaderboards.
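
As background for the parametric baseline named in this abstract, the Bradley-Terry model assigns each model a latent score and turns score differences into win probabilities with a logistic function. The Python sketch below illustrates only that standard model; the score values and model names are hypothetical, and nothing here reproduces GARS or DMLRank.

```python
import math


def bt_win_probability(score_i, score_j):
    """Bradley-Terry probability that item i is preferred over item j.

    Equivalent to exp(score_i) / (exp(score_i) + exp(score_j)).
    """
    return 1.0 / (1.0 + math.exp(score_j - score_i))


def rank_by_score(scores):
    """Return item names ordered from highest to lowest latent score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical latent scores for three models.
scores = {"model_a": 0.8, "model_b": 0.1, "model_c": -0.5}
leaderboard = rank_by_score(scores)
p_a_beats_c = bt_win_probability(scores["model_a"], scores["model_c"])
```

The model has no outcome for ties, and the abstract highlights that GARS accommodates ties.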

2601.21522 2026-06-09 cs.LG cond-mat.dis-nn cs.AI stat.ML version update

More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)

Sagi Meir, Tommer D. Keidar, Noam Levi, Shlomi Reuveni, Barak Hirshberg

Affiliations: School of Chemistry, Tel Aviv University; The Center for Physics and Chemistry of Living Systems, Tel Aviv University; School of Physics and Astronomy, Tel Aviv University; The Center for Computational Molecular and Materials Science, Tel Aviv University

AI summary: To address diminishing returns in large language model inference at a fixed budget, the paper proposes the Reset-and-Discard (ReD) query method, which improves coverage by optimizing the allocation of attempts, and verifies its cost savings on coding, math, and reasoning benchmarks.

Abstract

The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically observed power-law behavior in pass@k leads to sublinear growth of coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a query method for LLMs that increases coverage@cost for a given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs across coding (HumanEval), math (GSM8K), and reasoning (MMLU-Pro) benchmarks demonstrate that ReD substantially reduces the attempts, tokens, and USD cost required to reach a desired coverage, while also offering an efficient way to measure inference power laws. ReD's advantage is maintained with imperfect verifiers, and ReD outperforms the tested allocation baselines.
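
The two metrics defined in this abstract can be illustrated with a short Python sketch. It assumes independent attempts with a fixed per-question success probability and a uniform allocation of k attempts per question; those modelling choices and all function names are assumptions for illustration, and the sketch does not implement ReD itself, whose procedure the abstract does not spell out.

```python
from math import comb


def pass_at_k(p, k):
    """Probability of at least one success in k independent attempts,
    each succeeding with probability p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_estimate(n, c, k):
    """Combinatorial estimate of pass@k from n sampled attempts with c correct:
    one minus the chance that a random size-k subset contains no correct attempt."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage_uniform(success_probs, k):
    """Expected number of distinct questions solved when every question
    receives exactly k attempts (total cost = k * number of questions)."""
    return sum(pass_at_k(p, k) for p in success_probs)


# Hypothetical per-question success probabilities.
probs = [0.5, 0.5, 0.0]
coverage_at_6_attempts = coverage_uniform(probs, 2)  # cost = 2 * 3 attempts
```

In this toy case the third question is never solved, so the attempts spent on it add nothing to coverage: uniform allocation keeps paying for hard questions.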

2601.20503 2026-06-09 cs.CV cs.AI version update

Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI

Jesse Phitidis, Alison Q. Smithard, William N. Whiteley, Joanna M. Wardlaw, Miguel O. Bernabeu, Maria Valdés Hernández

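Affiliations: University of Edinburgh

AI summary: This study systematically evaluates six strategies for training a joint segmentation model for white matter hyperintensities and ischaemic stroke lesions using partially labelled data, and finds that pseudolabelling is the most effective, improving model performance and supporting large-scale clinical research.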

Abstract

White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are key imaging biomarkers of cerebral small vessel disease (SVD) detectable on magnetic resonance imaging (MRI). The development of robust deep learning models to automatically segment and differentiate these pathologies remains challenging. Specifically, WMH and ISL frequently co-occur within the same subject and present as visually confounding hyperintensities on fluid-attenuated inversion recovery (FLAIR) sequences, complicating their accurate delineation. To address the scarcity of fully annotated cohorts, we systematically evaluated six accessible strategies for training a joint WMH and ISL segmentation model using partially labelled data. We aggregated privately held and publicly available datasets to curate a large-scale cohort of 2,052 MRI volumes, of which 1,341 and 1,152 volumes contained ground truth annotations for WMH and ISL, respectively. Our analysis indicates that multiple strategies effectively leverage partially labelled data to enhance overall model performance, with pseudolabelling emerging as the most effective approach. This model exhibited a consistent WMH segmentation policy and successfully detected the majority of FLAIR-positive ISL. These findings demonstrate the viability of using partially labelled data to develop reliable automated segmentation tools, which can support ongoing SVD monitoring and high-throughput biomarker extraction for large-scale clinical research.

2601.19082 2026-06-09 cs.AI cs.CL cs.GT cs.LG cs.MA version update

Payoff scaling shapes cooperation in LLM agents across languages

Trung-Kiet Huynh, Dao-Sy Duy-Minh, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han

Affiliations: Faculty of Information Technology, University of Science (HCMUS), Ho Chi Minh City, Vietnam; Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam; Vietnam National University – Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam; Luxembourg Institute of Science and Technology (LIST), Luxembourg; School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, United Kingdom

AI summary: Using supervised classifiers to identify strategies in the repeated Prisoner's Dilemma, combined with an evolutionary game theory baseline, the paper finds that LLMs become more cooperative as payoffs increase, contrary to the evolutionary prediction, which points to the influence of alignment training and human-like reasoning patterns.

Comments 44 pages, 17 figures, 4 tables

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents that negotiate, coordinate, and act on behalf of users. Whether they cooperate in such settings is no longer just an academic question, but a central issue for AI governance. We approach it from a strategic-behaviour angle, asking how two everyday levers - the size of what is at stake, and the language in which the interaction is described - shape the strategies LLMs adopt in a repeated Prisoner's Dilemma. Rather than reading cooperation off raw action counts, we train supervised classifiers to recognise the canonical strategies of repeated games (always cooperate, always defect, Tit-for-Tat, Win-Stay-Lose-Shift) and use them as a lens onto LLM behaviour. To know what the strategy distribution should look like under the same payoffs, we derive an evolutionary game theory (EGT) baseline and compare it with the LLM data. The two outcomes disagree in a revealing way: as stakes grow, evolutionary theory predicts that defection should take over the population, yet LLMs move in the opposite direction, becoming more cooperative - a signature, we argue, of alignment training and the human-like reasoning patterns LLMs inherit from their training data. We further show that this picture is not particular to frontier-scale, proprietary models: it also occurs with three open-weight smaller LLMs. Overall, our analysis highlights that payoff design and linguistic framing are powerful but under-explored levers for steering LLM behaviour, with direct implications for evaluating, aligning, and governing multi-agent AI systems deployed in high-stakes, multilingual environments.
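
The four canonical strategies named in this abstract have standard textbook definitions, sketched below in Python as functions of the two players' action histories. The encoding ("C"/"D" strings), the opening move of cooperation, and the `play` helper are illustrative assumptions; the paper's classifiers and payoff matrices are not reproduced here.

```python
C, D = "C", "D"


def always_cooperate(own, opp):
    return C


def always_defect(own, opp):
    return D


def tit_for_tat(own, opp):
    """Cooperate first, then copy the opponent's previous move."""
    return opp[-1] if opp else C


def win_stay_lose_shift(own, opp):
    """Cooperate first; repeat the last move after a 'win' (opponent cooperated),
    switch moves after a 'loss' (opponent defected)."""
    if not own:
        return C
    if opp[-1] == C:
        return own[-1]
    return D if own[-1] == C else C


def play(strategy_a, strategy_b, rounds):
    """Run a repeated game and return both action sequences as strings."""
    hist_a, hist_b = [], []
    for _ in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        hist_a.append(a)
        hist_b.append(b)
    return "".join(hist_a), "".join(hist_b)


tft_vs_defector = play(tit_for_tat, always_defect, 4)
wsls_vs_defector = play(win_stay_lose_shift, always_defect, 4)
```

Against a constant defector, Tit-for-Tat settles into defection after its opening move, whereas Win-Stay-Lose-Shift alternates. Differences of this kind in the action sequence are what would let a classifier tell strategies apart from observed play.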

2601.06188 2026-06-09 cs.AI version update

Dynamic Distributed Constraint Optimization and Metareasoning for Continual, Large-Scale Satellite Operations

Itai Zilberstein, Steve Chien

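Affiliations: Carnegie Mellon University; California Institute of Technology; Jet Propulsion Laboratory

AI summary: For dynamic, large-scale satellite scheduling, the paper proposes the dynamic distributed constraint optimization model DCOSP, designs a metareasoning framework that controls when to recompute, and combines it with the D-NSS algorithm to reach near-optimal solutions, clearly outperforming baseline methods.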

Comments An earlier version titled "Large-Scale Continual Scheduling and Execution for Dynamic Distributed Satellite Constellation Observation Allocation" appears as an extended abstract in the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract

As Earth-observing satellite constellations grow in size and capability, distributed onboard control offers a pathway to novel responses and time-sensitive measurements. However, deploying autonomy to satellites requires efficient computation and communication. This work addresses the challenge of scheduling observations for hundreds of satellites in a dynamic, large-scale problem with millions of variables. We present the dynamic multi-satellite constellation observation scheduling problem (DCOSP), a new formulation of dynamic distributed constraint optimization problems (DDCOP) that models integrated scheduling and execution. DCOSP features a novel optimality condition, for which we construct an exact omniscient offline algorithm. Motivated by the strong resource constraints of onboard satellite operations, we introduce a framework to incorporate metareasoning in DDCOPs that controls when agents expend resources to recompute solutions. In addition, we present the dynamic incremental neighborhood stochastic search (D-NSS) algorithm, an incomplete online decomposition-based DDCOP algorithm that repairs localized sub-problems in response to dynamic events. We demonstrate in realistic simulations that D-NSS converges to near-optimal solutions, outperforming standard DDCOP baselines in solution quality, computation time, and message volume, while our metareasoning framework successfully balances resource conservation with utility. As part of the NASA FAME mission, this work lays the foundation for the largest in-space demonstration of distributed multi-agent AI to date.

2601.18585 2026-06-09 cs.CV cs.GR 版本更新

GimmBO: Interactive Generative Image Model Merging via Bayesian Optimization

GimmBO: 基于贝叶斯优化的交互式生成图像模型合并

Chenxi Liu, Selena Ling, Alec Jacobson

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所)

AI总结 针对扩散模型适配器合并中权重选择困难的问题,提出GimmBO框架,利用偏好贝叶斯优化实现交互式探索,通过两阶段BO后端提升高维空间采样效率与收敛性。

Comments Accepted at SIGGRAPH NA 2026

详情
AI中文摘要

基于微调的适配被广泛用于定制扩散图像生成,导致大量社区创建的适配器集合,这些适配器捕捉不同的主题和风格。源自同一基础模型的适配器可以通过权重合并,从而在广阔且连续的设计空间中合成新的视觉结果。为了探索这一空间,当前工作流依赖于手动滑块调优,这种方法扩展性差且使得权重选择困难,即使候选集限制在20-30个适配器。我们提出GimmBO,通过偏好贝叶斯优化(PBO)支持图像生成中适配器合并的交互式探索。受实际使用中的观察(包括稀疏性和受限权重范围)启发,我们引入了一个两阶段BO后端,提高了高维空间中的采样效率和收敛性。我们通过模拟用户和用户研究评估了我们的方法,展示了改进的收敛性、高成功率以及相对于BO和线搜索基线的持续增益,并通过几个扩展进一步展示了框架的灵活性。

英文摘要

Fine-tuning-based adaptation is widely used to customize diffusion-based image generation, leading to large collections of community-created adapters that capture diverse subjects and styles. Adapters derived from the same base model can be merged with weights, enabling the synthesis of new visual results within a vast and continuous design space. To explore this space, current workflows rely on manual slider-based tuning, an approach that scales poorly and makes weight selection difficult, even when the candidate set is limited to 20-30 adapters. We propose GimmBO to support interactive exploration of adapter merging for image generation through Preferential Bayesian Optimization (PBO). Motivated by observations from real-world usage, including sparsity and constrained weight ranges, we introduce a two-stage BO backend that improves sampling efficiency and convergence in high-dimensional spaces. We evaluate our approach with simulated users and a user study, demonstrating improved convergence, high success rates, and consistent gains over BO and line-search baselines, and further show the flexibility of the framework through several extensions.
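The design space described above can be pictured with a small sketch. The abstract says only that adapters "can be merged with weights"; the code below assumes the common case where each adapter contributes an additive weight delta to a shared base matrix, so a merge is a weighted sum of deltas. The function name `merge_adapters` and the toy matrices are hypothetical and are not taken from GimmBO.

```python
import numpy as np

def merge_adapters(base_weight, adapter_deltas, weights):
    """Return base_weight + sum_i weights[i] * adapter_deltas[i].

    Assumption: every adapter is an additive delta on the same base matrix.
    """
    if len(adapter_deltas) != len(weights):
        raise ValueError("one merge weight per adapter is required")
    merged = np.array(base_weight, dtype=float, copy=True)
    for delta, w in zip(adapter_deltas, weights):
        merged += w * np.asarray(delta, dtype=float)
    return merged

# Two toy adapters on a 2x2 base layer; the weight vector is one point
# in the continuous design space that a user or an optimizer explores.
base = np.zeros((2, 2))
deltas = [np.eye(2), np.ones((2, 2))]
merged = merge_adapters(base, deltas, [0.5, 0.25])
```

With 20-30 adapters the weight vector has 20-30 dimensions, which is why tuning it by hand with sliders scales poorly.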

2601.18510 2026-06-09 cs.LG cs.AI 版本更新

Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

即时强化学习:无需梯度更新的LLM智能体持续学习

Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出JitRL框架,通过动态非参数记忆和即时优势估计,无需梯度更新即可实现LLM智能体的测试时策略优化,在WebArena和Jericho上达到训练无关方法最优,且性能超越微调方法,成本降低30倍以上。

详情
AI中文摘要

尽管大型语言模型(LLM)智能体在通用任务上表现出色,但由于部署后权重冻结,它们在持续适应方面存在固有困难。传统的强化学习(RL)提供了一种解决方案,但会带来高昂的计算成本和灾难性遗忘的风险。我们引入了即时强化学习(JitRL),这是一个无需训练的框架,能够在没有任何梯度更新的情况下实现测试时策略优化。JitRL维护一个动态的非参数经验记忆,并检索相关轨迹以即时估计动作优势。这些估计随后用于直接调制LLM的输出logits。我们从理论上证明,这种加法更新规则是KL约束策略优化目标的精确闭式解。在WebArena和Jericho上的大量实验表明,JitRL在训练无关方法中建立了新的最先进水平。关键的是,JitRL在性能上超越了计算昂贵的微调方法(如WebRL),同时将货币成本降低了30倍以上,为持续学习智能体提供了一条可扩展的路径。代码可在https://github.com/liushiliushi/JitRL获取。

英文摘要

While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation because their weights are frozen after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on the fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state of the art among training-free methods. Crucially, JitRL outperforms computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.
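The closed-form claim can be illustrated with the standard KL-regularized result. The sketch assumes the penalized form of the objective, maximizing E_π[A] − β·KL(π‖π₀), whose maximizer is π(a) ∝ π₀(a)·exp(A(a)/β). In logit space this is the additive update logits + A/β. The temperature `beta`, the function names, and the toy numbers are assumptions for illustration; this is not code from the JitRL repository.

```python
import numpy as np

def softmax(logits):
    shifted = np.asarray(logits, dtype=float) - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def adjust_logits(logits, advantages, beta):
    """Additive update: new_logits = logits + advantages / beta."""
    return np.asarray(logits, dtype=float) + np.asarray(advantages, dtype=float) / beta

def kl_regularized_objective(policy, base_policy, advantages, beta):
    """E_policy[A] - beta * KL(policy || base_policy)."""
    kl = np.sum(policy * np.log(policy / base_policy))
    return float(np.dot(policy, advantages) - beta * kl)

base_logits = np.array([1.0, 0.0, -1.0])   # frozen LLM logits (toy values)
advantages = np.array([0.0, 1.0, 0.0])     # advantages estimated from memory (toy values)
beta = 0.5                                  # hypothetical KL temperature

base_policy = softmax(base_logits)
new_policy = softmax(adjust_logits(base_logits, advantages, beta))
```

Because the update touches only the logits, the model weights stay frozen, which matches the training-free setting described in the abstract.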

2601.15727 2026-06-09 cs.LG cs.CL 版本更新

Towards Automated Kernel Generation in the Era of LLMs

面向LLM时代的自动化内核生成

Yang Yu, Peiyu Zang, Chi Hsu Tsai, Haiming Wu, Yixin Shen, Jialing Zhang, Haoyu Wang, Zhiyou Xiao, Jingze Shi, Yuyu Luo, Wentao Zhang, Chunlei Men, Guang Liu, Yonghua Lin

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Beijing Normal University(北京师范大学) Peking University(北京大学) Beijing Institute of Technology(北京理工大学) Cornell University(康奈尔大学) Beijing Jiaotong University(北京交通大学) Renmin University of China(中国人民大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文综述了利用大语言模型(LLM)和智能体系统自动化生成与优化GPU内核的方法,系统梳理了现有方法、数据集和基准,并指出了未来研究方向。

Comments In IJCAI 2026. 9 pages, 1 figure

详情
AI中文摘要

现代AI系统的性能从根本上受限于其底层GPU内核的质量,这些内核将高级算法语义转化为低级硬件操作。实现接近最优的内核需要对硬件架构和编程模型有专家级的理解,使得内核工程成为一个关键但耗时且不可扩展的过程。大语言模型和基于LLM的智能体的最新进展为自动化内核生成和优化开辟了新的可能性。LLM擅长压缩难以形式化的专家级内核知识,而智能体系统通过将内核开发视为迭代、反馈驱动的循环,进一步实现了可扩展的优化。该领域取得了快速进展。然而,该领域仍然分散,缺乏对LLM驱动内核生成的系统视角。本综述通过提供现有方法的结构化概述(涵盖基于LLM的方法和智能体优化工作流程),并系统组织支撑该领域学习和评估的数据集和基准,填补了这一空白。此外,本文进一步概述了关键开放挑战和未来研究方向,旨在为下一代自动化内核优化建立全面的参考。为跟踪该领域,我们在 https://github.com/flagos-ai/awesome-LLM-driven-kernel-generation 维护了一个开源GitHub仓库。

英文摘要

The performance of modern AI systems is fundamentally constrained by the quality of their underlying GPU kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires expert-level understanding of hardware architectures and programming models, making kernel engineering a critical but notoriously time-consuming and non-scalable process. Recent advances in large language models and LLM-based agents have opened new possibilities for automating kernel generation and optimization. LLMs are well-suited to compress expert-level kernel knowledge that is difficult to formalize, while agentic systems further enable scalable optimization by casting kernel development as an iterative, feedback-driven loop. Rapid progress has been made in this area. However, the field remains fragmented and lacks a systematic perspective for LLM-driven kernel generation. This survey addresses this gap by providing a structured overview of existing approaches, spanning LLM-based approaches and agentic optimization workflows, and systematically organizing the datasets and benchmarks that underpin learning and evaluation in this domain. Moreover, key open challenges and future research directions are further outlined, aiming to establish a comprehensive reference for the next generation of automated kernel optimization. To keep track of this field, we maintain an open-source GitHub repository at https://github.com/flagos-ai/awesome-LLM-driven-kernel-generation.

2512.03470 2026-06-09 cs.CV 版本更新

STGBD-Net: Spatio-temporal Gradient Basis Decomposition Network for Infrared Small Target Detection

STGBD-Net:用于红外小目标检测的时空梯度基分解网络

Chen Hu, Mingyu Zhou, Shuai Yuan, Hongbo Hu, Zhenming Peng, Tian Pu, Xiying Li

发表机构 * School of Intelligent Systems Engineering, Sun Yat-sen University(中山大学智能系统工程学院) School of Information and Communication Engineering and the Laboratory of Imaging Detection and Intelligent Perception, University of Electronic Science and Technology of China(电子科技大学信息与通信工程学院和成像检测与智能感知实验室) School of Instrument Science and Opto-Electronics Engineering, Hefei University of Technology(合肥工业大学仪器科学与光电工程学院)

AI总结 针对红外小目标检测中弱目标易被背景杂波淹没的问题,提出基于基分解理论的梯度基分解模块(GDM),将归一化梯度特征作为基向量重构新特征,结合轻量级U-Net实现单帧与多帧检测,在多个基准上达到SOTA性能。

详情
AI中文摘要

红外小目标检测(IRSTD)的一个关键挑战是弱目标信号响应容易被强背景杂波掩盖,经常导致漏检。虽然传统的基于梯度的方法试图捕捉精细细节,但其鲁棒性受到多方向梯度特征静态融合的限制。在本文中,我们从基分解理论的角度重新思考特征融合,并提出一种新颖的框架,将该过程重构为显式且自适应的分解与重建范式。具体而言,我们引入了基分解模块(BDM)及其专门变体——梯度分解模块(GDM),用于IRSTD。GDM将归一化梯度特征视为基向量来重建新特征,从而保持细节结构并突出红外小目标。通过将GDM集成到轻量级的三阶段U-Net中,我们开发了两种统一架构:用于单帧检测的空间梯度基分解网络和用于多帧场景的时空梯度基分解网络。大量实验表明,我们的网络在多个基准上达到了最先进的性能,在检测精度和计算效率之间提供了优越的平衡。我们的代码将在以下网址公开:https://github.com/greekinRoma/IRSTD_HC_Platform

英文摘要

A key challenge in infrared small target detection (IRSTD) is that weak target signal responses are easily obscured by strong background clutter, frequently resulting in missed detections. While traditional gradient-based methods attempt to capture fine details, their robustness is limited by the static fusion of multi-directional gradient features. In this paper, we rethink feature fusion from the perspective of Basis Decomposition Theory and propose a novel framework that reformulates the process into an explicit and adaptive decomposition-and-reconstruction paradigm. Specifically, we introduce the Basis Decomposition Module (BDM) and its specialized variant, the Gradient Decomposition Module (GDM) for IRSTD. GDMs treat the normalized gradient features as basis vectors to reconstruct a new feature, thereby maintaining detailed structures and highlighting infrared small targets. By integrating GDMs into a lightweight three-stage U-Net, we develop two unified architectures: the Spatial Gradient Basis Decomposition Network for single-frame detection and the Spatio-temporal Gradient Basis Decomposition Network for multi-frame scenarios. Extensive experiments demonstrate that our networks achieve state-of-the-art (SOTA) performance across multiple benchmarks, offering a superior balance between detection accuracy and computational efficiency. Our codes will be made public at: https://github.com/greekinRoma/IRSTD_HC_Platform.

2601.10925 2026-06-09 cs.CL 版本更新

Massively Multilingual Joint Segmentation and Glossing

大规模多语言联合分割与标注

Michael Ginn, Lindia Tjuatja, Enora Rice, Ali Marashian, Maria Valentini, Jasmine Xu, Graham Neubig, Alexis Palmer

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 针对现有神经标注模型缺乏形态边界预测导致可解释性差的问题,提出PolyGloss模型,通过联合分割与标注提升准确性和对齐度,并支持低秩适应快速迁移。

Comments 15 pages, 9 figures, accepted to ACL 2026 Long Papers

详情
AI中文摘要

利用神经网络进行自动行间标注预测是加速语言文档记录工作的一种有前景的方法。然而,尽管像GlossLM这样的最先进模型在标注基准测试中取得了高分,但语言学家进行的用户研究发现,这些模型在实际场景中的实用性存在关键障碍。特别是,现有模型通常生成语素级别的标注,但将其分配给整个单词而不预测实际的语素边界,这使得预测的可解释性降低,从而对人类标注者来说不可信。我们首次研究了从原始文本中联合预测行间标注和相应形态分割的神经模型。我们进行实验以确定平衡分割和标注准确性以及两个任务之间对齐的最佳模型训练方式。我们扩展了GlossLM的训练语料库,并预训练了PolyGloss,这是一系列用于联合分割和标注的seq2seq多语言模型,在标注方面优于GlossLM,并在分割、标注和对齐方面击败了各种开源LLM。此外,我们证明了PolyGloss可以通过低秩适应快速适应新数据集。

英文摘要

Automated interlinear gloss prediction with neural networks is a promising approach to accelerate language documentation efforts. However, while state-of-the-art models like GlossLM achieve high scores on glossing benchmarks, user studies with linguists have found critical barriers to the usefulness of such models in real-world scenarios. In particular, existing models typically generate morpheme-level glosses but assign them to whole words without predicting the actual morpheme boundaries, making the predictions less interpretable and thus untrustworthy to human annotators. We conduct the first study on neural models that jointly predict interlinear glosses and the corresponding morphological segmentation from raw text. We run experiments to determine the optimal way to train models that balance segmentation and glossing accuracy, as well as the alignment between the two tasks. We extend the training corpus of GlossLM and pretrain PolyGloss, a family of seq2seq multilingual models for joint segmentation and glossing that outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and alignment. In addition, we demonstrate that PolyGloss can be quickly adapted to a new dataset via low-rank adaptation.

2601.15423 2026-06-09 cs.LG 版本更新

Lattice: A Confidence-Gated Hybrid System for Uncertainty-Aware Sequential Prediction with Behavioral Archetypes

Lattice: 一种基于置信门控的混合系统,用于具有行为原型的不确定性感知序列预测

Lorian Bannis

发表机构 * banlys.com(banlys公司)

AI总结 提出Lattice混合系统,通过二元置信门控条件激活行为原型,在不确定时回退到骨干预测,在MovieLens上LSTM+Lattice的HR@10提升31.7%,Transformer和SASRec也有提升。

Comments v2 (May 2026): Corrected primary estimand; removed misleading SOTA comparisons; backbone-native transformer/SASRec results; gated vs ungated trade-off; IP-conscious reporting; LIGO/finance demoted to appendix. 11 pages, 1 figure. Patent pending. Contact: LorianBannis@banlys.com for benchmark access

详情
AI中文摘要

我们引入了Lattice,一种混合序列预测系统,该系统使用二元置信门控有条件地激活学习到的行为结构。该系统将行为窗口总结为行为原型,并且仅当支持内置信信号超过验证校准阈值时激活基于原型的评分,在不确定时回退到骨干预测。我们的主要估计量是向固定骨干添加Lattice对相同测试行的控制效应。在MovieLens(30个配对种子,全目录排名)上,LSTM+Lattice相比单独LSTM骨干在HR@10上提升了+31.7%(门控)(p远小于10^-20);非门控融合在同一协议下达到+58.7%。我们不声称门控最大化池化准确率。使用骨干原生原型(在每个骨干的嵌入空间中拟合),在相同评估设计下,门控提升分别为+13.3%(Transformer)和+17.0%(SASRec)。先前版本1中约0%的Transformer行反映了无效的跨骨干迁移,而非组合无法帮助更强编码器的证据。Amazon Electronics提供了跨领域支持证据(+124.0%门控,15个种子,高方差)。受控偏移检查(附录)展示了分布偏移下的门控拒绝。独立的SASRec和BERT4Rec分数是上下文参考,而非目标估计量。我们报告组合实现了什么以及何时激活;生产校准和实现细节因专利申请而保密。

英文摘要

We introduce Lattice, a hybrid sequential prediction system that conditionally activates learned behavioral structure using binary confidence gating. The system summarizes behavior windows as behavioral archetypes and activates archetype-based scoring only when an in-support confidence signal exceeds a validation-calibrated threshold, falling back to backbone predictions when uncertain. Our primary estimand is the controlled effect of adding Lattice to a fixed backbone on identical test rows. On MovieLens (30 paired seeds, full-catalog ranking), LSTM+Lattice improves HR@10 by +31.7% (gated) versus the LSTM backbone alone (p much less than 10^-20); ungated fusion reaches +58.7% on the same protocol. We do not claim gating maximizes pooled accuracy. With backbone-native archetypes (fit in each backbone's embedding space), gated lifts of +13.3% (transformer) and +17.0% (SASRec) hold under the same evaluation design. A prior approximately 0% transformer row in version 1 reflected an invalid cross-backbone transfer, not evidence that composition cannot help stronger encoders. Amazon Electronics provides supporting cross-domain evidence (+124.0% gated, 15 seeds, high variance). Controlled shift checks (appendix) illustrate gate refusal under distribution shift. Standalone SASRec and BERT4Rec scores are contextual references, not the target estimand. We report what composition achieves and when it activates; production calibration and implementation details remain proprietary pending patent prosecution.
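The gating rule in the abstract (use archetype-based scoring only when a confidence signal exceeds a calibrated threshold, otherwise keep the backbone prediction) can be sketched as below. The abstract states that implementation details are proprietary, so everything concrete here is an assumption: the linear blend, the `mix` parameter, the threshold value, and the toy scores are hypothetical and only show the shape of a binary gate.

```python
import numpy as np

def gated_scores(backbone_scores, archetype_scores, confidence, threshold, mix=0.5):
    """Blend in archetype scores only when confidence exceeds the threshold.

    Assumption: a simple linear blend stands in for the undisclosed fusion rule.
    """
    backbone = np.asarray(backbone_scores, dtype=float)
    if confidence > threshold:
        archetype = np.asarray(archetype_scores, dtype=float)
        return (1.0 - mix) * backbone + mix * archetype
    return backbone  # uncertain: fall back to the backbone prediction

backbone = [0.2, 0.5, 0.3]
archetype = [0.6, 0.1, 0.3]
confident = gated_scores(backbone, archetype, confidence=0.9, threshold=0.7)
uncertain = gated_scores(backbone, archetype, confidence=0.4, threshold=0.7)
```

The ungated variant reported in the abstract corresponds to always taking the blended branch, which explains why gated and ungated lifts can differ.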

2601.15408 2026-06-09 cs.CV cs.AI cs.CL cs.LG 版本更新

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

CURE:基于课程引导的多任务训练实现可靠的解剖学接地报告生成

Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem

发表机构 * Pontificia Universidad Católica de Chile(智利天主教大学) CENIA iHEALTH KAUST(阿卜杜拉国王科技大学)

AI总结 提出CURE框架,通过课程学习动态调整多任务训练,提升医学报告生成的视觉接地准确性和事实一致性,无需额外数据。

Comments 31 pages, 7 figures, accepted to CVPR 2026 (oral)

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 36279-36289
AI中文摘要

医学视觉语言模型可以自动生成放射学报告,但在精确的视觉接地和事实一致性方面存在困难。现有模型常常将文本发现与视觉证据错误对齐,导致不可靠或弱接地的预测。我们提出CURE,一个错误感知的课程学习框架,无需任何额外数据即可改善接地和报告质量。CURE在短语接地、接地报告生成和解剖学接地报告生成上,使用公共数据集微调多模态指令模型。该方法基于模型性能动态调整采样,强调困难样本以改善空间和文本对齐。CURE将接地准确率提高了+0.35 IoU,报告质量提高了+0.192 CXRFEScore,并将幻觉减少了18.6%。CURE是一个数据高效的框架,增强了接地准确性和报告可靠性。代码可从 https://github.com/PabloMessina/CURE 获取,模型权重可从 https://huggingface.co/pamessina/medgemma-4b-it-cure 获取。

英文摘要

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.35 IoU, boosts report quality by +0.192 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
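Two ideas in the abstract lend themselves to a short sketch: the IoU metric used to report grounding accuracy, and sampling that favors harder examples. The IoU formula for axis-aligned boxes is standard. The sampling rule is an assumption: it uses 1 − IoU as the per-sample error, and the `floor` parameter is hypothetical. CURE's actual curriculum schedule is not described in the abstract.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def sampling_probs(ious, floor=0.05):
    """Assumed rule: sample each example in proportion to its error 1 - IoU.

    The floor keeps already-solved examples from vanishing entirely.
    """
    errors = [max(1.0 - iou, floor) for iou in ious]
    total = sum(errors)
    return [e / total for e in errors]

iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap 1, union 7
probs = sampling_probs([0.9, 0.5, iou])      # the lowest-IoU sample gets the most weight
```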

2601.14871 2026-06-09 cs.RO 版本更新

On-the-fly hand-eye calibration for the da Vinci surgical robot

达芬奇手术机器人的在线手眼标定

Zejian Cui, Ferdinando Rodriguez y Baena

发表机构 * Department of Mechanical Engineering, Imperial College London(帝国理工学院机械工程系) Mechatronics in Medicine Laboratory(医学机电实验室) Hamlyn Centre for Robotics Surgery(机器人外科哈姆林中心)

AI总结 针对达芬奇机器人因编码器误差导致工具定位不准的问题,提出一种在线计算手眼变换矩阵的标定框架,通过特征关联和手眼标定两个模块实现无预训练的关键点匹配,在多种手术场景下显著降低定位误差且时间效率高。

Comments 18 pages, 17 figures, 5 tables

详情
AI中文摘要

在机器人辅助微创手术(RMIS)中,精确的工具定位对于确保患者安全和成功执行任务至关重要。然而,对于诸如达芬奇机器人等缆线驱动机器人,这仍然具有挑战性,因为错误的编码器读数会导致位姿估计误差。在本研究中,我们提出了一种标定框架,通过在线计算手眼变换矩阵来产生精确的工具定位结果。该框架由两个相互关联的算法组成:特征关联模块和手眼标定模块,前者无需预训练即可为单目图像上检测到的关键点提供鲁棒的对应关系,后者通过采用一系列滤波方法提供适应各种手术场景的通用性。为了验证其有效性,我们在公开可用的视频数据集上广泛测试了该框架,这些数据集包含多种手术器械在体外和离体场景下、不同光照条件和不同关键点测量精度下执行任务的情况。结果表明,在所提出的标定框架下,工具定位误差显著降低,精度与其他最先进方法相当,同时时间效率更高。

英文摘要

In Robot-Assisted Minimally Invasive Surgery (RMIS), accurate tool localization is crucial to ensure patient safety and successful task execution. However, this remains challenging for cable-driven robots, such as the da Vinci robot, because erroneous encoder readings lead to pose estimation errors. In this study, we propose a calibration framework to produce accurate tool localization results through computing the hand-eye transformation matrix on-the-fly. The framework consists of two interrelated algorithms: the feature association block and the hand-eye calibration block, which provide robust correspondences for key points detected on monocular images without pre-training, and offer the versatility to accommodate various surgical scenarios by adopting an array of filter approaches, respectively. To validate its efficacy, we test the framework extensively on publicly available video datasets that feature multiple surgical instruments conducting tasks in both in vitro and ex vivo scenarios, under varying illumination conditions and with different levels of key point measurement accuracy. The results show a significant reduction in tool localization errors under the proposed calibration framework, with accuracies comparable to other state-of-the-art methods while being more time-efficient.

2601.14063 2026-06-09 cs.CL cs.AI cs.CY 版本更新

XCR-Bench: Benchmarking Cross-Cultural Reasoning in LLMs via Culture-Specific Items and Hall's Triad

XCR-Bench:通过文化特定项目和霍尔三元组对大型语言模型进行跨文化推理基准测试

Mohsinul Kabir, Tasnim Ahmed, Md Mezbaur Rahman, Shaoxiong Ji, Hassan Alhuzali, Yuechen Jiang, Jimin Huang, Sophia Ananiadou

发表机构 * Department of Computer Science, National Centre for Text Mining, The University of Manchester(计算机科学系,国家文本挖掘中心,曼彻斯特大学) ELLIS Manchester(曼彻斯特ELLIS) School of Computing, Queen’s University, Ontario, Canada(计算学院,加拿大皇后大学) Computer Science, University of Illinois Chicago(计算机科学,伊利诺伊大学芝加哥分校) ELLIS Institute Finland(芬兰ELLIS研究所) University of Turku(图尔库大学) Department of Computer Science and Artificial Intelligence, Umm Al-Qura University, Makkah, Saudi Arabia(计算机科学与人工智能系,乌姆·阿勒·卡拉大学,麦加,沙特阿拉伯)

AI总结 提出XCR-Bench基准,包含4.1k平行句和1098个文化特定项目,结合Newmark框架与霍尔三元组评估LLM跨文化推理,发现模型在深层文化层面表现显著下降,且存在区域和民族宗教偏见。

Comments Under Review

详情
AI中文摘要

大型语言模型(LLM)的跨文化能力需要理解并适应不同文化背景下的文化特定项目(CSI)。然而,由于缺乏高质量且带有平行跨文化句子对的CSI标注语料库,评估该能力的进展仍然有限。我们引入了XCR-Bench,一个跨文化推理基准,包含4.1k个平行句子和1,098个CSI,涵盖三个推理任务。XCR-Bench将Newmark的CSI框架与霍尔文化三元组相结合,从而能够评估从可观察实践到隐性社会规范和价值观等不同文化可见性层面的能力。对八个多语言LLM的实验表明,最先进的模型在识别和适应特定CSI类别方面表现出持续的弱点,揭示了表面召回与显式文化推理之间的差距。在文化敏感类别和更深文化层面上,性能显著下降(p<0.005,8/8模型),并且适应质量在不同目标文化和孟加拉语区域变体之间系统性变化,表明即使在单一语言环境中也存在编码的区域和民族宗教偏见。我们公开发布语料库和代码,以支持未来跨文化NLP的研究。

英文摘要

Cross-cultural competence in large language models (LLMs) requires understanding and adapting Culture-Specific Items (CSIs) across varying cultural contexts. However, progress in evaluating this capability remains limited by the lack of high-quality CSI-annotated corpora with parallel cross-cultural sentence pairs. We introduce XCR-Bench, a Cross(X)-Cultural Reasoning Benchmark containing 4.1k parallel sentences and 1,098 CSIs across three reasoning tasks. XCR-Bench integrates Newmark's CSI framework with Hall's Triad of Culture, enabling evaluation across levels of cultural visibility -- from observable practices to implicit social norms and values. Experiments on eight multilingual LLMs show that state-of-the-art models exhibit consistent weaknesses in identifying and adapting specific categories of CSIs, revealing a gap between surface-level recall and explicit cultural reasoning. Performance declines significantly on culturally sensitive categories and deeper cultural levels (p<0.005, 8/8 models), and adaptation quality varies systematically across target cultures and Bengali regional variants, indicating encoded regional and ethno-religious biases even within a single linguistic setting. We publicly release the corpus and code to support future research on cross-cultural NLP.

2601.10918 2026-06-09 cs.CL 版本更新

Neural Induction of Finite-State Transducers

有限状态转换器的神经归纳

Michael Ginn, Alexis Palmer, Mans Hulden

发表机构 * University of Colorado(科罗拉多大学) New College of Florida(佛罗里达新学院)

AI总结 提出基于循环神经网络隐藏状态几何自动构建无权重有限状态转换器的方法,在形态变化、字素到音素转换等任务上准确率比传统算法最多高出87%。

Comments 15 pages, 8 figures, accepted to ACL 2026 Findings

详情
AI中文摘要

有限状态转换器(FST)是字符串到字符串重写任务的有效模型,通常提供高性能应用所需的效率,但手动构建转换器很困难。在这项工作中,我们提出了一种新方法,根据循环神经网络学习的隐藏状态几何自动构建无权重FST。我们在形态变化、字素到音素预测和历史规范化的真实数据集上评估了我们的方法,表明构建的FST在许多数据集上具有高准确性和鲁棒性,在保留测试集上的准确率大幅优于经典转换器学习算法,最多高出87%。

英文摘要

Finite-State Transducers (FSTs) are effective models for string-to-string rewriting tasks, often providing the efficiency necessary for high-performance applications, but constructing transducers by hand is difficult. In this work, we propose a novel method for automatically constructing unweighted FSTs following the hidden state geometry learned by a recurrent neural network. We evaluate our methods on real-world datasets for morphological inflection, grapheme-to-phoneme prediction, and historical normalization, showing that the constructed FSTs are highly accurate and robust for many datasets, substantially outperforming classical transducer learning algorithms by up to 87% accuracy on held-out test sets.
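For readers unfamiliar with the object being induced, here is a minimal hand-built unweighted FST. It is not the paper's induction method. It only shows what a string-to-string rewriting transducer looks like and why writing one by hand is fiddly. The rule (rewrite "n" as "m" before "p"), the state numbering, and the helper names are illustrative assumptions.

```python
import string

def build_np_to_mp():
    """Two-state transducer: state 1 means an 'n' was read and is being held back."""
    transitions = {}
    for ch in string.ascii_lowercase:
        if ch == "n":
            transitions[(0, ch)] = (1, "")        # hold the n until the next symbol
            transitions[(1, ch)] = (1, "n")       # flush the held n, hold the new one
        elif ch == "p":
            transitions[(0, ch)] = (0, "p")
            transitions[(1, ch)] = (0, "mp")      # held n surfaces as m before p
        else:
            transitions[(0, ch)] = (0, ch)
            transitions[(1, ch)] = (0, "n" + ch)  # held n surfaces unchanged
    final_output = {0: "", 1: "n"}                # flush a held n at end of input
    return transitions, final_output

def apply_fst(transitions, final_output, text, start=0):
    """Run the transducer over text and return the rewritten string."""
    state, out = start, []
    for ch in text:
        state, emitted = transitions[(state, ch)]
        out.append(emitted)
    out.append(final_output[state])
    return "".join(out)

transitions, final_output = build_np_to_mp()
rewritten = apply_fst(transitions, final_output, "inpossible")
```

Even this one-rule transducer needs a delay state and an end-of-input flush, which hints at why automatic construction is attractive for realistic morphology or grapheme-to-phoneme rules.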

2504.02983 2026-06-09 cs.CL cs.CV 版本更新

Hummus: A Dataset of Humorous Multimodal Metaphor Use

Hummus:幽默多模态隐喻使用数据集

Xiaoyu Tong, Zhi Zhang, Pia Sommerauer, Martha Lewis, Ekaterina Shutova

发表机构 * ILLC, University of Amsterdam, the Netherlands(阿姆斯特丹大学逻辑、语言与计算研究所,荷兰) Vrije Universiteit Amsterdam, the Netherlands(阿姆斯特丹自由大学,荷兰)

AI总结 提出幽默多模态隐喻数据集Hummus,基于不一致理论和概念隐喻理论设计标注方案,测试多模态大语言模型在检测和理解幽默多模态隐喻上的表现,发现现有模型仍存在困难。

详情
AI中文摘要

隐喻和幽默有许多共同点,隐喻是最常见的幽默机制之一。本研究关注多模态隐喻的幽默能力,该领域尚未得到足够关注。我们从幽默的不一致理论、概念隐喻理论以及VU阿姆斯特丹隐喻语料库的标注方案中汲取灵感,开发了一种新的用于图像-标题对中幽默多模态隐喻使用的标注方案。我们创建了幽默多模态隐喻使用数据集Hummus,提供了从《纽约客》标题竞赛语料库中抽取的1000个图像-标题对的专家标注。利用该数据集,我们测试了最先进的多模态大语言模型(MLLMs)在检测和理解幽默多模态隐喻使用方面的能力。实验表明,当前MLLMs在处理幽默多模态隐喻时仍然存在困难,特别是在整合视觉和文本信息方面。我们在 github.com/xiaoyuisrain/humorous-multimodal-metaphor-use 发布数据集和代码。

英文摘要

Metaphor and humor share a lot of common ground, and metaphor is one of the most common humorous mechanisms. This study focuses on the humorous capacity of multimodal metaphors, which has not received due attention in the community. We take inspiration from the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus, and develop a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs. We create the Hummus Dataset of Humorous Multimodal Metaphor Use, providing expert annotation on 1k image-caption pairs sampled from the New Yorker Caption Contest corpus. Using the dataset, we test state-of-the-art multimodal large language models (MLLMs) on their ability to detect and understand humorous multimodal metaphor use. Our experiments show that current MLLMs still struggle with processing humorous multimodal metaphors, particularly with regard to integrating visual and textual information. We release our dataset and code at github.com/xiaoyuisrain/humorous-multimodal-metaphor-use.

2601.12263 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

多模态生成式引擎优化:针对视觉-语言模型排序器的排名操纵

Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu

发表机构 * Georgetown University(乔治城大学) University of Southern California(南加州大学) University of Maryland, College Park(马里兰大学学院公园分校) Arizona State University(亚利桑那州立大学)

AI总结 提出多模态生成式引擎优化(MGEO)方法,通过联合优化图像扰动和文本后缀,利用视觉-语言模型内部跨模态知识耦合,实现对产品排名的有效操纵,揭示了多模态基础模型知识基础的脆弱性。

Comments Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM) at ACL 2026

详情
AI中文摘要

视觉-语言模型(VLM)将视觉和文本知识整合到统一表示中,日益成为现代检索和推荐系统的基础。然而,这些模型在对多模态项目进行排序时如何可靠地利用其跨模态知识,以及其知识基础是否可以被颠覆,仍不清楚。在本文中,我们揭示了VLM在多模态产品排序中应用知识的一个基本漏洞:通过多模态生成式引擎优化(MGEO),我们展示了攻击者可以通过联合制作难以察觉的图像扰动和流畅的文本后缀,利用模型内部的跨模态知识耦合,操纵VLM的排序决策。MGEO采用交替优化策略,针对VLM中视觉和语言表示之间的深层交互,实现了远超单模态攻击和由强大商业模型驱动的启发式基线的排名操纵。我们的发现表明,表面内容质量不足以提升排名;相反,需要直接与模型内部知识利用机制对齐。这些结果对多模态基础模型中知识基础的忠实性和鲁棒性提出了重要问题,并激励了未来多模态检索系统防御机制的研究。代码见:https://github.com/glad-lab/MGEO

英文摘要

Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these models utilize their cross-modal knowledge when ranking multimodal items, and whether their knowledge grounding can be subverted. In this paper, we expose a fundamental vulnerability in how VLMs apply multimodal knowledge for product ranking: through Multimodal Generative Engine Optimization (MGEO), we show that an adversary can manipulate a VLM's ranking decisions by jointly crafting imperceptible image perturbations and fluent textual suffixes that exploit the model's internal cross-modal knowledge coupling. Using an alternating optimization strategy, MGEO targets the deep interactions between visual and linguistic representations within the VLM, achieving rank manipulations that substantially exceed those of unimodal attacks and heuristic baselines powered by strong commercial models. Our findings reveal that surface-level content quality is insufficient for rank promotion; instead, direct alignment with the model's internal knowledge utilization mechanism is required. These results raise important questions on the faithfulness and robustness of knowledge grounding in multimodal foundation models, and motivate future work on defense mechanisms for multimodal retrieval systems. Code is available at: https://github.com/glad-lab/MGEO

2601.09285 2026-06-09 cs.LG cond-mat.mtrl-sci 版本更新

Enhancing Spatial Reasoning in Large Language Models for Metal-Organic Frameworks Structure Prediction

增强大型语言模型在金属有机框架结构预测中的空间推理能力

Mianzhi Pan, JianFei Li, Peishuo Liu, Botian Wang, Yawen Ouyang, Yiming Rong, Hao Zhou, Jianbing Zhang

发表机构 * National Key Laboratory for Novel Software Technology(新型软件技术国家重点实验室) Nanjing University(南京大学) Institute of AI Industry Research (AIR)(人工智能产业研究院) Tsinghua University(清华大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Chinese Academy of Sciences(中国科学院大学) ChemBIC(化学和生物医药创新研究院)

AI总结 针对MOF结构预测中原子数量多、复杂度高的问题,提出MOF-LLM框架,通过空间感知持续预训练、结构监督微调和匹配驱动强化学习,增强Qwen-3 8B模型的空间推理能力,实现35.78%匹配率和0.04秒/结构的采样效率。

Comments KDD 2026

详情
AI中文摘要

金属有机框架(MOFs)是多孔晶体材料,在碳捕获和药物输送等领域有广泛应用,但准确预测其三维结构仍然是一个重大挑战。尽管大型语言模型(LLMs)在生成晶体结构方面显示出潜力,但由于MOF单胞中原子数量多导致的结构高度复杂性,LLMs在MOF上的应用受到阻碍。受深度生成模型中块级范式成功启发,我们率先将LLMs应用于该领域,引入了MOF-LLM,这是第一个专门针对块级MOF结构预测的LLM框架。为了有效利用LLMs完成这一3D模块化组装任务,我们的训练范式整合了空间感知持续预训练(CPT)、结构监督微调(SFT)和匹配驱动强化学习(RL)。通过引入显式空间先验并利用软自适应策略优化(SAPO)优化结构稳定性,我们的方法显著增强了Qwen-3 8B模型在MOF结构预测中的空间推理能力。综合实验表明,MOF-LLM实现了最先进的性能,匹配率达到35.78%,同时展现出卓越的采样效率,每个结构仅需0.04秒。

英文摘要

Metal-organic frameworks (MOFs) are porous crystalline materials with broad applications such as carbon capture and drug delivery, yet accurately predicting their 3D structures remains a significant challenge. While Large Language Models (LLMs) have shown promise in generating crystal structures, their application to MOFs is hindered by the high structural complexity that arises from the large number of atoms in the unit cell. Inspired by the success of block-wise paradigms in deep generative models for MOFs, we pioneer the application of LLMs in this domain by introducing MOF-LLM, the first LLM framework specifically adapted for block-level MOF structure prediction. To effectively harness LLMs for this 3D modular assembly task, our training paradigm integrates spatial-aware continual pre-training (CPT), structural supervised fine-tuning (SFT), and matching-driven reinforcement learning (RL). By incorporating explicit spatial priors and optimizing structural stability via Soft Adaptive Policy Optimization (SAPO), our approach substantially enhances the spatial reasoning of a Qwen-3 8B model for MOF structure prediction. Comprehensive experiments demonstrate that MOF-LLM achieves state-of-the-art performance with a match rate of 35.78% while exhibiting superior sampling efficiency of 0.04 seconds per structure.

2601.09085 2026-06-09 cs.LG cs.AI cs.CL cs.IR 版本更新

MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

MMR-GRPO:通过多样性感知奖励重加权加速GRPO风格训练

Kangda Wei, Ruihong Huang

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 提出MMR-GRPO方法,利用最大边际相关性根据完成多样性重加权奖励,减少冗余样本,加速GRPO训练,在保持性能的同时平均减少47.9%训练步数和70.2%时间。

详情
AI中文摘要

组相对策略优化(GRPO)已成为训练数学推理模型的标准方法;然而,它对每个提示依赖多个完成,使得训练计算成本高昂。尽管最近的工作减少了达到峰值性能所需的训练步数,但由于每步成本增加,整体挂钟训练时间通常保持不变甚至增加。我们提出MMR-GRPO,它整合了最大边际相关性,基于完成多样性对奖励进行重加权。我们的关键洞察是,语义冗余的完成贡献有限的学习信号;优先考虑多样化解能产生更有信息量的更新并加速收敛。在三种模型规模(1.5B、7B、8B)、三种GRPO变体和五个数学推理基准上的广泛评估表明,MMR-GRPO在达到相当峰值性能的同时,平均需要减少47.9%的训练步数和70.2%的挂钟时间。这些增益在模型、方法和基准上一致。我们的代码发布在:https://github.com/WeiKangda/MMR-GRPO

英文摘要

Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. Our code is released at: https://github.com/WeiKangda/MMR-GRPO.
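The reweighting idea can be sketched with the classic greedy Maximal Marginal Relevance rule, where each candidate is scored as λ·relevance − (1 − λ)·(maximum similarity to items already selected). The sketch treats the raw reward as the relevance term, uses cosine similarity between completion embeddings, and takes the MMR score itself as the reweighted reward. Those three choices, the function names, and the toy data are assumptions; the released MMR-GRPO code may differ.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def mmr_reweight(rewards, embeddings, lam=0.5):
    """Greedy MMR over one prompt's completions.

    Returns the selection order and the MMR score of each completion,
    used here as a hypothetical reweighted reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    sim = cosine_similarity_matrix(embeddings)
    remaining = list(range(len(rewards)))
    selected = []
    reweighted = np.zeros_like(rewards)
    while remaining:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            # Redundancy: similarity to the closest already-selected completion.
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            score = lam * rewards[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        reweighted[best_idx] = best_score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected, reweighted

# Completions 0 and 1 are semantically identical; completion 2 is different.
order, new_rewards = mmr_reweight([1.0, 1.0, 0.8], [[1, 0], [1, 0], [0, 1]], lam=0.5)
```

In this toy group the duplicate completion ends up with the lowest reweighted reward even though its raw reward ties for the highest, which is the behavior the abstract attributes to redundant samples.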

2601.06649 2026-06-09 cs.LG cs.AI 版本更新

Revisiting Training Scale: An Empirical Study of Token Count, Power Consumption, and Parameter Efficiency

重新审视训练规模:关于令牌计数、功耗和参数效率的实证研究

Joe Dwyer

发表机构 * ECPI University(ECPI大学)

AI总结 通过固定硬件和训练条件的重复测量实验,发现增加训练令牌数会导致训练效率严格单调下降,即使性能有边际提升,也表明能耗效率低下。

详情
AI中文摘要

机器学习研究质疑了训练令牌数的增加是否能在大型语言模型中可靠地产生比例性能提升。基于先前引入能量感知参数效率度量的工作,本研究实证检验了在固定硬件和训练条件下增加训练令牌数的影响。本工作的重要性在于将功耗和执行时长(如功率采样频率所反映的)明确整合到令牌规模分析中,这解决了先前研究强调性能结果而低估计算和能量成本的空白。通过在恒定GPU实例上使用相同模型架构、优化器设置和轮次数的重复测量实验设计,训练了一个11亿参数的TinyLlama模型,使用三个令牌数(500K、1M和2M)。虽然传统性能指标在令牌规模上表现出不一致或递减的回报,但包含功耗和执行时长后,揭示了随着令牌数增加,训练效率严格单调下降。重复测量方差分析表明令牌数对参数效率有强效应,所有配对比较在Bonferroni校正后仍然显著。这些发现表明,即使观察到边际性能提升,增加训练令牌数可能在能量上效率低下,强调了在大型语言模型训练中效率感知评估的重要性。

英文摘要

Research in machine learning has questioned whether increases in training token counts reliably produce proportional performance gains in large language models. Building on prior work introducing an energy-aware parameter efficiency metric, this study empirically examines the effects of increasing training token counts under fixed hardware and training conditions. The significance of this work lies in the explicit integration of power consumption and execution duration, as reflected by the power sampling frequency, into token-scale analysis. This addresses a gap in prior studies emphasizing performance outcomes while underrepresenting computational and energy costs. Using a repeated-measures experimental design on a constant GPU instance with an identical model architecture, optimizer settings, and epoch counts, a 1.1-billion-parameter TinyLlama model was trained at three token counts (500K, 1M, and 2M). While conventional performance metrics exhibited inconsistent or diminishing returns across token scales, the inclusion of power consumption and execution duration revealed a strictly monotonic decline in training efficiency as token count increased. Repeated-measures ANOVA demonstrated a strong effect of token count on parameter efficiency, with all pairwise comparisons remaining significant following Bonferroni correction. These findings indicate that increases in training token counts may be energetically inefficient even when marginal performance improvements are observed, underscoring the importance of efficiency-aware evaluation in large language model training.

2601.06599 2026-06-09 cs.CL cs.AI Updated version

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs


Shivam Adarsh, Maria Maistro, Christina Lioma

Affiliations * University of Copenhagen

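AI summary Studies how context changes truth vectors in LLMs, finding that the vectors are roughly orthogonal in early layers and converge in middle layers, that context increases vector magnitude, and that larger models distinguish relevant from irrelevant context through directional change.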

Comments ACL 2026

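AI abstract (translated from Chinese)

Large language models (LLMs) often encode whether a statement is true as a vector in their residual stream activations. These vectors, also called truth vectors, have been studied in prior work, but how they change when context is introduced remains unexplored. We investigate this by measuring (1) the directional change ($\theta$) between truth vectors with and without context and (2) the relative magnitude of the truth vector after context is added. Across four LLMs and four datasets, we find that: (1) truth vectors are roughly orthogonal in early layers, converge in middle layers, and may stabilize or continue increasing in later layers; (2) adding context generally increases the magnitude of the truth vector, i.e., the separation between true and false representations in activation space is amplified; (3) larger models distinguish relevant from irrelevant context mainly through directional change ($\theta$), whereas smaller models show this distinction through magnitude differences. We also find that context conflicting with parametric knowledge produces larger geometric changes than context aligned with it. To our knowledge, this is the first work to provide a geometric characterization of how context transforms truth vectors in the activation space of LLMs.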

English abstract

Large Language Models (LLMs) often encode whether a statement is true as a vector in their residual stream activations. These vectors, also known as truth vectors, have been studied in prior work; however, how they change when context is introduced remains unexplored. We study this question by measuring (1) the directional change ($\theta$) between the truth vectors with and without context and (2) the relative magnitude of the truth vectors upon adding context. Across four LLMs and four datasets, we find that (1) truth vectors are roughly orthogonal in early layers, converge in middle layers, and may stabilize or continue increasing in later layers; (2) adding context generally increases the truth vector magnitude, i.e., the separation between true and false representations in the activation space is amplified; (3) larger models distinguish relevant from irrelevant context mainly through directional change ($\theta$), while smaller models show this distinction through magnitude differences. We also find that context conflicting with parametric knowledge produces larger geometric changes than parametrically aligned context. To the best of our knowledge, this is the first work that provides a geometric characterization of how context transforms the truth vector in the activation space of LLMs.
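
The two quantities measured in the abstract, the directional change $\theta$ and the relative magnitude, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how a truth vector is extracted, so the sketch assumes it is the difference between the mean activation of true statements and the mean activation of false statements. All function names and the toy three-dimensional activations are hypothetical.

```python
import math


def truth_vector(mean_true, mean_false):
    """Assumed definition: mean activation of true minus false statements."""
    return [t - f for t, f in zip(mean_true, mean_false)]


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def directional_change_deg(v_no_ctx, v_ctx):
    """Angle theta (degrees) between truth vectors without and with context."""
    dot = sum(a * b for a, b in zip(v_no_ctx, v_ctx))
    cos_theta = dot / (norm(v_no_ctx) * norm(v_ctx))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def relative_magnitude(v_no_ctx, v_ctx):
    """Ratio of truth-vector lengths: with context over without context."""
    return norm(v_ctx) / norm(v_no_ctx)


# Toy activations (hypothetical, not data from the paper).
v_plain = truth_vector([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
v_context = truth_vector([0.0, 2.0, 0.0], [0.0, 0.0, 0.0])

theta = directional_change_deg(v_plain, v_context)
ratio = relative_magnitude(v_plain, v_context)
print(theta, ratio)
```

In this toy case the two vectors are orthogonal and the context vector is twice as long, so a large $\theta$ and a ratio above 1 correspond to the "direction changes" and "separation is amplified" readings in the abstract.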