arXivDaily: daily arXiv academic digest, updated Monday through Friday
2601.09869 2026-06-03 cs.AI cs.HC

A Scoping Review of the Ethical Perspectives on Anthropomorphising Large Language Model-Based Conversational Agents

Andrea Ferrario, Rasita Vinay, Matteo Casserini, Alessandro Facchini

Affiliations: Institute of Biomedical Ethics and History of Medicine, University of Zürich; Dalle Molle Institute for Artificial Intelligence (IDSIA), SUPSI; ETH Zürich; Institute for Implementation Science in Health Care, University of Zürich; Department of Management, Technology and Economics, ETH Zürich; Dipartimento Tecnologie Innovative, SUPSI; Management in Networked and Digital Societies (MINDS) Department, Kozminski University

AI summary: This scoping review maps the ethical challenges and opportunities of anthropomorphising LLM-based conversational agents, covering conceptual foundations, ethical issues, and methodology, and proposes a research agenda with design and governance recommendations.

Comments 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT'26)

Abstract

Anthropomorphisation -- the phenomenon whereby non-human entities are ascribed human-like qualities -- has become increasingly salient with the rise of large language model (LLM)-based conversational agents (CAs). Unlike earlier chatbots, LLM-based CAs routinely generate interactional and linguistic cues, such as first-person self-reference and epistemic and affective expressions, that empirical work shows can increase engagement. On the other hand, anthropomorphisation raises ethical concerns, including deception, overreliance, and exploitative relationship framing, while some authors argue that anthropomorphic interaction may support autonomy, well-being, and inclusion. Despite increasing interest in the phenomenon, the literature remains fragmented across domains and varies substantially in how it defines, operationalizes, and normatively evaluates anthropomorphisation. This scoping review maps ethically oriented work on anthropomorphising LLM-based CAs across five databases and three preprint repositories. We synthesize (1) conceptual foundations, (2) ethical challenges and opportunities, and (3) methodological approaches. We find convergence on attribution-based definitions but substantial divergence in operationalization, a predominantly risk-forward normative framing, and limited empirical work that links observed interaction effects to actionable governance guidance. We conclude with a research agenda and design/governance recommendations for ethically deploying anthropomorphic cues in LLM-based conversational agents.

2601.08173 2026-06-03 cs.AI

The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi

Affiliations: Fudan University; Shanghai AI Laboratory; Zhejiang University; Shanghai Innovation Institute; Shanghai Jiao Tong University

AI summary: To address three challenges that multimodal large language models face in dynamic workplace scenarios (task scheduling, active exploration, and continual learning), the authors propose EvoEnv, a dynamic evaluation environment; experiments show that existing agents fall significantly short in these areas.

Abstract

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce EvoEnv, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, EvoEnv evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv

2512.23234 2026-06-03 cs.CV cs.AI

Edge-Aware and Content-Adaptive Infrared Gas Leak Detection for Industrial Safety Monitoring

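Dongsheng Li, Tianli Ma, Siling Wang, Beibei Duan, Song Gao

Affiliations: School of Mechatronic Engineering, Xi’an Technological University; School of Electronic Information Engineering, Xi’an Technological University; Shaanxi Shanhua Coal Chemical Co., Ltd.

AI summary: To tackle the difficulty of detecting infrared gas plumes that are faint, semi-transparent, and weakly bounded, the authors propose the Edge-Aware and Content-Adaptive Feature Fusion Detector (ECAF-Det), which combines a plume-oriented local-global feature enhancement block, a multi-scale edge perception module, and a content-adaptive sparse routing path aggregation network, and significantly improves detection accuracy on the IIG and LangGas datasets.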

Abstract

Infrared gas leak detection is important for industrial safety and environmental monitoring, but automatic detection remains challenging because gas plumes are often faint, small, semi-transparent, and weakly bounded. This paper proposes an Edge-Aware and Content-Adaptive Feature Fusion Detector (ECAF-Det) for weak-plume detection in cluttered thermal scenes. ECAF-Det integrates three task-oriented designs: a plume-oriented local-global feature enhancement block to preserve fine boundary cues and capture long-range contextual continuity; a multi-scale edge perception module that transforms directional gradient and phase-consistency cues into hierarchical edge priors for boundary-sensitive plume representation; and a content-adaptive sparse routing path aggregation network that dynamically regulates multi-scale feature propagation to emphasize informative plume features and suppress redundant background responses. Experiments on the IIG dataset show that ECAF-Det achieves 29.8% AP, 84.3% AP50, and 25.3% small-object AP, improving the RT-DETR-R18 baseline by 3.0, 6.5, and 5.4 percentage points, respectively, with 43.7 GFLOPs and 14.9 M parameters. On the LangGas dataset, ECAF-Det achieves 36.3% AP and 68.5% AP50, demonstrating its generalization to different infrared gas plume appearances. The main AI contribution is edge-aware representation learning with content-adaptive sparse feature routing for weak infrared plume perception. The proposed detector can serve as a visual perception component for early warning and remote inspection in industrial gas leak monitoring.

2504.04942 2026-06-03 cs.AI cs.LO

Lemmanaid: Neuro-Symbolic Lemma Conjecturing

Yousef Alhessi, Sólrún Halla Einarsdóttir, George Granberry, Emily First, Moa Johansson, Sorin Lerner, Nicholas Smallbone

Affiliations: Department of Computer Science and Engineering, University of California, San Diego, USA; Department of Computer Science and Engineering, Chalmers University of Technology & University of Gothenburg

AI summary: Presents LEMMANAID, the first neuro-symbolic lemma conjecturing tool, which generates lemmas by drawing analogies between mathematical theories; by combining a fine-tuned LLM with symbolic methods, it outperforms purely neural and purely symbolic approaches on Isabelle test sets.

Abstract

Mathematicians and computer scientists are increasingly leveraging proof assistants to formalize and check complex proofs, a task that demands substantial expertise. Can we lower the bar by automating the conjecturing of helpful, interesting and novel lemmas? We present the first neuro-symbolic lemma conjecturing tool, LEMMANAID, designed to discover conjectures by drawing analogies between mathematical theories. LEMMANAID uses a fine-tuned LLM to generate lemma templates that describe the shape of a lemma, and symbolic methods to fill in the details. We compare LEMMANAID against the same LLM fine-tuned to generate lemmas directly, as well as a fully symbolic conjecturing method. On test sets from Isabelle's HOL library and Archive of Formal Proofs (AFP), LEMMANAID consistently outperforms both neural and symbolic methods. Using DeepSeek-coder-6.7B as a backend, LEMMANAID discovers 50% (HOL) and 29% (AFP) of the gold standard lemmas, increasing to 55% and 35% when ensembling prompting strategies. In a case study on Octonions, LEMMANAID discovers 79% of the gold standard lemmas, compared to 62% for neural-only and 23% for the state of the art symbolic tool. Furthermore, in a targeted comparison, LEMMANAID discovers more gold standard lemmas than both Claude Opus 4.5 and GPT-5.2. Our results show that LEMMANAID can conjecture a significant number of interesting lemmas across complex formalizations in mathematics and computer science.

2512.10999 2026-06-03 cs.CL

KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering

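Xin Sun, Zhongqi Chen, Xing Zheng, Qiang Liu, Shu Wu, Bowen Song, Zilei Wang, Weiqiang Wang, Liang Wang

Affiliations: University of Science and Technology of China

AI summary: Proposes KBQA-R1, a framework that uses reinforcement learning (GRPO) to model knowledge base question answering as a multi-turn decision process and introduces Referenced Rejection Sampling (RRS) to address the cold-start problem, achieving state-of-the-art performance on several benchmarks.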

Comments ICML 2026

Abstract

Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present KBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions, leveraging Group Relative Policy Optimization (GRPO) to refine its strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution.
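The abstract names Group Relative Policy Optimization (GRPO) but does not spell out its reward handling. As a rough illustration of the group-relative idea only, the sketch below normalizes each rollout's reward against the other rollouts sampled for the same question. The function name, the binary rewards, and the use of the population standard deviation are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    Implementations differ on the exact std estimator and on eps.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical execution feedback for four rollouts of one question:
# 1.0 = the generated logical form executed and matched the answer,
# 0.0 = it did not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative one, so no separate value network is needed in this simplified picture. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal.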

2505.11785 2026-06-03 cs.LG cs.AI stat.ML

Improving Coverage in Combined Prediction Sets with Weighted p-values

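Gina Wong, Drew Prinster, Suchi Saria, Rama Chellappa, Anqi Liu

Affiliations: Johns Hopkins University

AI summary: Proposes a framework for weighted aggregation of prediction sets that assigns a weight to each set, giving flexible control of coverage between $1-2\alpha$ and $1-\alpha$; the framework extends to data-dependent weights and retains finite-sample validity in settings such as mixture-of-experts models.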

Journal ref: AISTATS 2026

Abstract

Conformal prediction quantifies the uncertainty of machine learning models by augmenting point predictions with valid prediction sets. For complex scenarios involving multiple trials, models, or data sources, conformal prediction sets can be aggregated to create a prediction set that captures the overall uncertainty, often improving precision. However, aggregating multiple prediction sets with individual $1-\alpha$ coverage inevitably weakens the overall guarantee, typically resulting in $1-2\alpha$ worst-case coverage. In this work, we propose a framework for the weighted aggregation of prediction sets, where weights are assigned to each prediction set based on their contribution. Our framework offers flexible control over how the sets are aggregated, achieving tighter coverage bounds that interpolate between the $1-2\alpha$ guarantee of the combined models and the $1-\alpha$ guarantee of an individual model depending on the distribution of weights. Importantly, our framework generalizes to data-dependent weights, as we derive a procedure for weighted aggregation that maintains finite-sample validity even when the weights depend on the data. This extension makes our framework broadly applicable to settings where weights are learned, such as mixture-of-experts (MoE), and we demonstrate through experiments in the MoE setting that our methods achieve adaptive coverage.
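To make the two coverage levels in the abstract concrete, here is a minimal sketch of one standard way to combine per-model conformal p-values with fixed weights. It is not claimed to be the authors' procedure. It relies on a known averaging result: for fixed non-negative weights summing to one, twice the weighted mean of valid p-values is itself a valid p-value under arbitrary dependence. Thresholding the plain weighted mean at alpha therefore gives at least 1-2*alpha coverage, and thresholding the doubled mean restores 1-alpha. The function names, labels, and p-values are hypothetical, and data-dependent weights are outside the scope of this sketch.

```python
def weighted_p_value(p_values, weights):
    """Weighted arithmetic mean of per-model conformal p-values."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, p_values)) / total


def combined_prediction_set(p_by_label, weights, alpha, factor=1.0):
    """Keep every label whose scaled weighted p-value exceeds alpha.

    factor=1.0: threshold the weighted mean itself (worst case 1 - 2*alpha).
    factor=2.0: threshold twice the weighted mean (worst case 1 - alpha).
    Assumes the weights are fixed in advance, not learned from the data.
    """
    return {
        label
        for label, p_values in p_by_label.items()
        if factor * weighted_p_value(p_values, weights) > alpha
    }


# Hypothetical p-values from two models for three candidate labels.
p_by_label = {
    "cat": [0.40, 0.30],
    "dog": [0.06, 0.02],
    "fox": [0.15, 0.01],
}
alpha = 0.1

equal_weights = combined_prediction_set(p_by_label, [0.5, 0.5], alpha)
conservative = combined_prediction_set(p_by_label, [0.5, 0.5], alpha, factor=2.0)
single_model = combined_prediction_set(p_by_label, [1.0, 0.0], alpha)
```

Putting all the weight on one model reduces the rule to that model's own prediction set, with its 1-alpha guarantee. This is the intuition behind coverage that depends on how the weights are distributed.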

2507.16003 2026-06-03 cs.CL cs.LG

Learning without training: The implicit dynamics of in-context learning

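Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo

Affiliations: Google

AI summary: Through theoretical analysis and experiments, the paper shows that combining a self-attention layer with an MLP lets a transformer block implicitly modify the MLP weights according to the context, which helps explain how large language models learn in context at inference time without weight updates.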

Abstract

One of the most striking features of Large Language Models (LLMs) is their ability to learn in-context. Namely, at inference time an LLM is able to learn new patterns without any additional weight update when these patterns are presented as examples in the prompt, even if they were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theoretical analysis and experimentation that this simple mechanism may help explain why LLMs demonstrate capabilities of in-context learning, beyond what is captured during training. Specifically, we show that a standard forward pass with context is mathematically equivalent to a forward pass without context but with the MLP weights updated by a minimal low-rank update representing the context.
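The equivalence claimed in the last sentence can be illustrated with a generic linear-algebra identity. The sketch below may not match the paper's exact statement, which could also involve biases or skip connections. It assumes the first MLP layer is a plain matrix `W`, that `a_plain` is the attention output for the query alone, and that `a_ctx` is the attention output for context plus query. All three names and the numbers are hypothetical. The rank-one update `dW = (W (a_ctx - a_plain)) a_plain^T / ||a_plain||^2` then makes the context-free pass reproduce the pass with context.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def rank_one_context_update(W, a_plain, a_ctx):
    """Return dW = (W (a_ctx - a_plain)) a_plain^T / ||a_plain||^2."""
    delta = [c - p for c, p in zip(a_ctx, a_plain)]
    w_delta = matvec(W, delta)
    norm_sq = sum(p * p for p in a_plain)
    # Outer product of w_delta with a_plain, scaled by 1 / ||a_plain||^2.
    return [[wd * p / norm_sq for p in a_plain] for wd in w_delta]


W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]  # hypothetical first MLP layer
a_plain = [1.0, 2.0]  # hypothetical attention output without context
a_ctx = [0.5, 3.0]    # hypothetical attention output with context

dW = rank_one_context_update(W, a_plain, a_ctx)
W_new = [[w + d for w, d in zip(row_w, row_d)] for row_w, row_d in zip(W, dW)]

with_context = matvec(W, a_ctx)            # original weights, context present
without_context = matvec(W_new, a_plain)   # updated weights, no context
```

The two results agree because `dW` applied to `a_plain` equals `W (a_ctx - a_plain)`. The whole effect of the context on this layer is absorbed into a rank-one change of the weights.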

2512.15427 2026-06-03 cs.LG cond-mat.stat-mech math.ST stat.TH

Statistics of Min-max Normalized Eigenvalues in Random Matrices

Hyakka Nakada, Shu Tanaka

Affiliations: Graduate School of Science and Technology; Keio University; Department of Applied Physics and Physico-Informatics; Keio University Sustainable Quantum Artificial Intelligence Center (KSQAIC); Human Biology-Microbiome-Quantum Research Center (WPI-Bio2Q); Green Computing System Research Organization

AI summary: Studies the statistical properties of min-max normalized eigenvalues in random matrices, applying a previously proposed effective distribution to derive a scaling law for the cumulative distribution and the residual error of matrix factorization.

Comments 4 pages, 4 figures

Journal ref: Journal of the Physical Society of Japan, vol. 95, no. 6, pp. 064003, 2026

Abstract

Random matrix theory has played an important role in various areas of pure mathematics, mathematical physics, and machine learning. From a practical perspective of data science, input data are usually normalized prior to processing. Thus, this study investigates the statistical properties of min-max normalized eigenvalues in random matrices. Previously, the effective distribution for such normalized eigenvalues has been proposed. In this study, we apply it to evaluate a scaling law of the cumulative distribution. Furthermore, we derive the residual error that arises during matrix factorization of random matrices. We conducted numerical experiments to verify these theoretical predictions.
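For readers unfamiliar with the term, min-max normalization of a spectrum is usually written as below. The symbols are assumptions chosen for this note, with $\lambda_1, \dots, \lambda_N$ denoting the eigenvalues of an $N \times N$ random matrix. The paper's own notation may differ.

```latex
\[
  \tilde{\lambda}_i
  = \frac{\lambda_i - \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}},
  \qquad
  \lambda_{\min} = \min_{j} \lambda_j,
  \quad
  \lambda_{\max} = \max_{j} \lambda_j,
  \quad
  i = 1, \dots, N .
\]
```

Under this convention every normalized eigenvalue lies in $[0, 1]$, with the smallest mapped to 0 and the largest to 1. Both the numerator and the denominator depend on the random extremes of the spectrum.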

2512.13996 2026-06-03 cs.AI

DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training

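Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohi Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas

Affiliations: University of Electronic Science and Technology of China

AI summary: Proposes DTop-p, a dynamic routing mechanism that learns the Top-p probability threshold with a proportional-integral controller and applies dynamic routing normalization to enable layer-wise expert selection under a global sparsity constraint; it consistently outperforms Top-k and fixed Top-p baselines at FLOPs comparable to Top-k MoE.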

Abstract

Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-$k$ routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. Top-$p$ routing is more adaptive because it selects experts until their cumulative routing probability reaches a threshold, allowing confident tokens to use fewer experts and ambiguous tokens to recruit more. However, we demonstrate that existing naive Top-$p$ implementations with fixed global probability thresholds provide only marginal gains over Top-$k$, suffer from hyperparameter sensitivity, and result in uncontrolled computational costs. In this paper, we propose DTop-$p$, a sparsity-controllable dynamic routing mechanism that learns the Top-$p$ probability threshold with a Proportional-Integral controller and uses dynamic routing normalization to support layer-wise expert selection under a global sparsity constraint. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-$p$ consistently outperforms both Top-$k$ and fixed Top-$p$ baselines while matching the average FLOPs of Top-$k$ MoE. Our analysis confirms that DTop-$p$ exhibits strong scaling properties across expert granularity, total expert capacity, model size, and dataset size, offering a robust and efficient MoE framework for foundation model pre-training.
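The Top-$p$ selection rule described in the abstract, paired with a toy proportional-integral update of the threshold, can be sketched as follows. Only the selection rule comes from the abstract. The controller form, the gains `kp` and `ki`, the clamping bounds, and all names are assumptions made for illustration. The paper's dynamic routing normalization is not modeled here.

```python
def top_p_experts(probs, p):
    """Smallest set of experts whose cumulative routing probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen


class PIThreshold:
    """Toy PI controller steering p toward a target average expert count."""

    def __init__(self, p=0.5, kp=0.05, ki=0.01, lo=0.05, hi=0.99):
        self.base = p          # threshold when the error history is zero
        self.p = p
        self.kp, self.ki = kp, ki
        self.lo, self.hi = lo, hi
        self.integral = 0.0    # running sum of past errors

    def update(self, avg_experts_used, target_experts):
        # Positive error: too few experts are active, so raise the threshold.
        error = target_experts - avg_experts_used
        self.integral += error
        raw = self.base + self.kp * error + self.ki * self.integral
        self.p = min(self.hi, max(self.lo, raw))
        return self.p


confident_token = [0.85, 0.10, 0.03, 0.02]   # hypothetical router outputs
ambiguous_token = [0.30, 0.28, 0.22, 0.20]

few = top_p_experts(confident_token, p=0.75)    # one expert suffices
many = top_p_experts(ambiguous_token, p=0.75)   # three experts are needed
```

With a fixed threshold, the number of active experts (and hence the compute) drifts with the router's confidence. A feedback loop of this kind is one way to hold the average near a budget.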

2512.11213 2026-06-03 cs.AI cs.CL

FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration

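Dongwon Jung, Peng Shi, Muhao Chen, Yi Zhang

Affiliations: University of California, Davis; University of Waterloo; Greenshoe, Inc

AI summary: Proposes FutureWeaver, a framework that uses a dual-level planning architecture and self-induced collaboration modules to optimize test-time compute allocation for multi-agent systems under a fixed budget, significantly improving collaborative performance.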

Abstract

Scaling test-time computation has been shown to significantly improve large language model (LLM) performance without additional training. However, extending these techniques to multi-agent systems remains challenging: existing approaches lack principled mechanisms for allocating compute to enable effective collaboration, scaling coordination itself, or optimizing compute usage under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. It introduces collaboration modules, formalized as modular, callable functions that encapsulate reusable multi-agent workflows and are automatically induced via self-play reflection from recurring interaction patterns. Building on these modules, it employs a dual-level planning architecture that jointly performs short-horizon action selection and long-horizon abstract lookahead to optimize inference trajectories under budget constraints. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization.

2512.07394 2026-06-03 cs.CV

Reconstructing Objects along Hand Interaction Timelines in Egocentric Video

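Zhifan Zhu, Siddhant Bansal, Shashank Tripathi, Dima Damen

Affiliations: University of Bristol, UK; Max Planck Institute for Intelligent Systems, Tübingen, Germany

AI summary: Introduces the ROHIT task: by defining the Hand Interaction Timeline (HIT) and applying the Constrained Optimisation and Propagation (COP) framework, the authors reconstruct rigid-object poses from egocentric video without 3D ground truth and significantly improve reconstruction accuracy.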

Comments webpage: https://zhifanzhu.github.io/objects-along-hit

Abstract

We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object's perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before it is released to be static again w.r.t. the scene. We model these pose constraints over the HIT, and propose to propagate the object's pose along the HIT, enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps, i.e. where the hand is stably holding an object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps including 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.

2512.03627 2026-06-03 cs.AI

MemVerse: Multimodal Memory for Lifelong Learning Agents

Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, Yi Yu, Shuyue Hu, Botian Shi, Ding Wang

Affiliations: Shanghai Artificial Intelligence Laboratory

AI summary: Proposes MemVerse, a model-agnostic, plug-and-play memory framework that combines hierarchical retrieval-based memory with fast parametric recall to address catastrophic forgetting and long-horizon reasoning for agents in multimodal interaction.

Comments 25 pages, 6 figures, 14 tables

Abstract

Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remember. Without reliable memory, agents catastrophically forget past experiences, struggle with long-horizon reasoning, and fail to operate coherently in multimodal or interactive environments. We introduce MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory, enabling scalable and adaptive multimodal intelligence. MemVerse maintains short-term memory for recent context while transforming raw multimodal experiences into structured long-term memories organized as hierarchical knowledge graphs. This design supports continual consolidation, adaptive forgetting, and bounded memory growth. To handle real-time demands, MemVerse introduces a periodic distillation mechanism that compresses essential knowledge from long-term memory into the parametric model, allowing fast, differentiable recall while preserving interpretability. Extensive experiments demonstrate that MemVerse significantly improves multimodal reasoning and continual learning efficiency, empowering agents to remember, adapt, and reason coherently across extended interactions.

2512.03019 2026-06-03 cs.LG cs.AI

Distribution-Calibrated Inference Time Compute for Thinking LLM-as-a-Judge

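Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh

Affiliations: University of California, Berkeley; DeepMind; University of Cambridge

AI summary: To address single-sample noise and inconsistent aggregation when thinking LLMs act as judges, the authors propose a distribution-calibrated aggregation scheme based on the Bradley-Terry-Davidson model that uses polarity (margin among non-ties) and decisiveness (non-tie rate) to separate narrow margins from strong consensus; it substantially reduces MAE, raises pairwise accuracy, and matches or exceeds individual human raters.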

Abstract

Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking--rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
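As a rough illustration of the quantities named in the abstract, the sketch below fits a two-item Davidson model (Bradley-Terry with a tie parameter) to the rating counts for a single item and reports polarity and decisiveness. With one pair and three outcomes the maximum-likelihood fit has a closed form. The function names and the exact formulas for polarity and decisiveness are this note's reading of the parenthetical definitions, not the authors' calibrated aggregation rule.

```python
import math


def davidson_fit(wins_a, wins_b, ties):
    """Closed-form fit of a two-item Davidson model from n judge samples.

    Davidson model: P(A) = r / D, P(B) = 1 / D, P(tie) = nu * sqrt(r) / D,
    with r = pi_A / pi_B and D = r + 1 + nu * sqrt(r).
    Requires at least one win on each side so that r and nu are finite.
    """
    if wins_a <= 0 or wins_b <= 0:
        raise ValueError("closed form needs wins_a > 0 and wins_b > 0")
    n = wins_a + wins_b + ties
    return {
        "strength_ratio": wins_a / wins_b,                  # r
        "tie_param": ties / math.sqrt(wins_a * wins_b),     # nu
        "polarity": (wins_a - wins_b) / (wins_a + wins_b),  # margin among non-ties
        "decisiveness": (wins_a + wins_b) / n,              # non-tie rate
    }


def davidson_probs(strength_ratio, tie_param):
    """Return (P(A preferred), P(B preferred), P(tie)) under the model."""
    root = math.sqrt(strength_ratio)
    denom = strength_ratio + 1.0 + tie_param * root
    return strength_ratio / denom, 1.0 / denom, tie_param * root / denom


# Hypothetical counts from n = 10 independent thinking-rating samples.
fit = davidson_fit(wins_a=6, wins_b=2, ties=2)
p_a, p_b, p_tie = davidson_probs(fit["strength_ratio"], fit["tie_param"])
```

Two items can share the same majority label yet differ sharply in polarity and decisiveness. A plain majority vote discards that information.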

2511.21731 2026-06-03 cs.CL cs.AI

Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition

Diederik Aerts, Jonito Aerts Arguëlles, Lester Beltran, Suzette Geriente, Roberto Leporini, Massimiliano Sassoli de Bianchi, Sandro Sozzo

Affiliations: Center Leo Apostel for Interdisciplinary Studies, Vrije Universiteit Brussel (VUB); Department of Economics, University of Bergamo; Department of Humanities and Cultural Heritage (DIUM) and Centre CQSCS, University of Udine

AI summary: Cognitive tests on large language models reveal significant violations of Bell's inequalities in conceptual combinations and Bose-Einstein statistics in word distributions, indicating that non-classical, quantum-like structures emerge in the conceptual-linguistic domain for both humans and AI and supporting the hypothesis of evolutionary convergence in cognition.

Journal ref: Entropy 28, 622, 2026

Abstract

We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subjects. In the first test, performed with ChatGPT and Gemini, we show that Bell's inequalities are significantly violated, which indicates the presence of a 'non-classical probability model' with probabilities that do not satisfy Kolmogorov's axioms. In the second test, also performed using ChatGPT and Gemini, we identify the presence of 'Bose-Einstein statistics', rather than the intuitively expected 'Maxwell-Boltzmann statistics', in the distribution of the words contained in large-size texts. Interestingly, these findings mirror the results previously obtained in both cognitive tests with human participants and information retrieval tests on large corpora. Taken together, they point to the 'systematic emergence of non-classical quantum-like structures in conceptual-linguistic domains', regardless of whether the cognitive agent is human or artificial. Although LLMs are classified as neural networks for historical reasons, we believe that a more essential form of knowledge organization takes place in the distributive semantic structure of vector spaces built on top of the neural network. It is this meaning-bearing structure that lends itself to a phenomenon of evolutionary convergence between human cognition and language, slowly established through biological evolution, and LLM cognition and language, emerging much more rapidly as a result of self-learning and training. We analyze various aspects and examples that contain evidence supporting the above hypothesis. We also advance a unifying framework that explains the pervasive quantum organization of meaning that we identify.

2511.19995 2026-06-03 cs.CV

CREward: A Type-Specific Creativity Reward Model

CREward:一种类型特定的创造力奖励模型

Jiyeon Han, Ali Mahdavi-Amiri, Hao Zhang, Haedong Jeong

发表机构 * Simon Fraser University(西蒙弗雷泽大学) Sogang University(西江大学)

AI总结 提出首个类型特定的创造力奖励模型CREward,通过几何、材质和纹理三个轴评估创造力,并应用于创造力评估、可解释创造力及创意样本获取。

Comments Accepted to CVPR 2026

详情
AI中文摘要

创造力是一种复杂现象。在表征和评估创造力时,将其视为单一的未分化量显得幼稚且不足。在这项工作中,我们学习了第一个类型特定的创造力奖励模型,称为CREward,它跨越三个创造力“轴”:几何、材质和纹理,使我们能够通过图像形成流程的视角来审视创造力。为了构建我们的奖励模型,我们首先进行人类基准评估,以捕捉人类对各种创意图像中每种类型的创造力感知。然后,我们分析人类判断与大型视觉语言模型(LVLMs)预测之间的相关性,确认LVLMs与人类感知高度一致。基于这一观察,我们收集LVLM生成的标签来训练我们的CREward模型,该模型适用于创意图像的评估和生成。我们探索了CREward的三个应用:创造力评估、可解释创造力以及创意样本获取,用于人类设计灵感和通过低秩适应引导创意生成。

英文摘要

Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity "axes," geometry, material, and texture, to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.

2511.19959 2026-06-03 cs.LG cs.DC

ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models

ParaBlock:面向大语言模型的通信-计算并行块坐标联邦学习

Yujia Wang, Yuanpu Cao, Jinghui Chen

发表机构 * College of Information Sciences and Technology(信息科学与技术学院) Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出ParaBlock方法,通过并行化通信与计算线程,在联邦学习大语言模型时提升通信效率,并理论证明其收敛率与标准方法相同,实验验证其性能与效率优势。

Comments Accepted by TMLR

详情
AI中文摘要

联邦学习作为一种隐私保护训练范式已被广泛研究。最近,联邦块坐标下降方案在训练大规模模型中成为流行选择,因为它允许客户端仅本地训练模型的一个子集而非整个模型。然而,在大语言模型时代,即使单个块也可能包含大量参数,导致显著的通信延迟,特别是对于资源受限的客户端。为了解决联邦训练/微调大语言模型中的这一挑战,我们提出了ParaBlock,一种新颖的方法,它建立两个并行线程分别用于通信和计算,以提高通信效率。我们从理论上证明,所提出的ParaBlock实现了与标准联邦块坐标下降方法相同的收敛率。在通用指令遵循和数学推理任务上微调大语言模型的实证评估证实,ParaBlock不仅保持了强大的性能,而且显著提高了通信效率。

英文摘要

Federated learning (FL) has been extensively studied as a privacy-preserving training paradigm. Recently, federated block coordinate descent scheme has become a popular option in training large-scale models, as it allows clients to train only a subset of the model locally instead of the entire model. However, in the era of large language models (LLMs), even a single block can contain a significant number of parameters, posing substantial communication latency, particularly for resource-constrained clients. To address this challenge in federated training/fine-tuning LLMs, we propose ParaBlock, a novel approach that establishes two parallel threads for communication and computation to enhance communication efficiency. We theoretically prove that the proposed ParaBlock achieves the same convergence rate as the standard federated block coordinate descent methods. Empirical evaluations on fine-tuning LLMs on general instruction following and mathematical reasoning confirm that ParaBlock not only maintains strong performance but also significantly improves communication efficiency.

2503.07265 2026-06-03 cs.CV cs.AI cs.CL

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

WISE: 一种基于世界知识的文本到图像生成语义评估方法

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Fanqing Meng, Kunpeng Ning, Bin Zhu, Li Yuan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对现有文本到图像生成模型缺乏复杂语义理解和世界知识整合评估的问题,提出WISE基准,包含25个子领域的1000个精心设计的提示,并引入WiScore指标评估知识-图像对齐,实验表明当前模型在整合世界知识方面存在显著局限。

Comments Accepted to ICML 2026. We have also released an updated version of the benchmark, WISE_Verified. Please refer to https://github.com/PKU-YuanGroup/WISE for the latest version

详情
AI中文摘要

文本到图像(T2I)模型能够生成高质量的艺术创作和视觉内容。然而,现有研究和评估标准主要关注图像真实性和浅层的文本-图像对齐,缺乏对文本到图像生成中复杂语义理解和世界知识整合的全面评估。为解决这一挑战,我们提出了WISE,这是首个专门用于World Knowledge-Informed Semantic Evaluation(世界知识引导的语义评估)的基准。WISE超越了简单的词-像素映射,通过1000个精心设计的提示,涵盖文化常识、时空推理和自然科学等25个子领域,对模型进行挑战。为了克服传统CLIP指标的局限性,我们引入了WiScore,一种用于评估知识-图像对齐的新型定量指标。通过对20个模型(10个专用T2I模型和10个统一多模态模型)在涵盖25个子领域的1000个结构化提示上进行全面测试,我们的发现揭示了它们在图像生成过程中有效整合和应用世界知识的能力存在显著局限,为下一代T2I模型增强知识整合与应用指明了关键路径。代码和数据可在 https://github.com/PKU-YuanGroup/WISE 获取。

英文摘要

Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.

2511.13020 2026-06-03 cs.CV cs.AI

PHASE: Physiology-Aware Hyperspectral Reconstruction via Object-to-Human Domain Adaptation

PHASE: 通过对象到人体域适应的生理感知高光谱重建

Yufei Wen, Shuxing Zhong, Jingdan Kang, Yuting Zhang, Jintai Chen, Kaishun Wu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) South China University of Technology(华南理工大学)

AI总结 针对现有高光谱重建方法在生理成像中失效的问题,提出PHASE范式,通过生理通道重新解释和生理约束对齐,实现从对象到人体的域适应,仅需1.5%标注数据即可显著提升重建质量。

Comments To KDD26

详情
AI中文摘要

尽管高光谱成像提供了无与伦比的无创生理洞察,但其笨重的硬件、缓慢的采集速度和监管负担严重限制了其临床可用性。一种自然的替代方案是从无处不在的RGB或CASSI测量中重建高光谱信息。然而,现有的为以对象为中心的场景开发的范式依赖于基于反射率的特征对齐,假设光谱相似性保持语义一致性。这一假设在生理成像中不成立,因为视觉上相似的RGB响应可能源于不同且纠缠的生理状态。这种不匹配促使从反射率对齐转向基于共享光-物质相互作用原理的生理感知表示学习——这一转变引入了来自跨通道语义偏移(C1)和基于RGB采集的不可逆信息丢失(C2)的基本挑战。因此,我们设计了PHASE,一种生理感知的高光谱重建范式,通过生理通道重新解释解耦跨通道生理语义,并通过生理约束对齐将重建限制在生理上合理的解,从根本上重新定义了对象到人体的迁移。在两种源到目标迁移协议下,PHASE仅需1.5%的标注监督,在SSIM上一致优于最先进方法最多+2.20,在SAM上最多-3.06。

英文摘要

Although hyperspectral imaging offers unparalleled non-invasive physiological insight, its bulky hardware, slow acquisition, and regulatory burden severely limit its clinical availability. A natural workaround is to reconstruct hyperspectral information from ubiquitous RGB or CASSI measurements. However, existing paradigms, developed for object-centric scenes, rely on reflectance-based feature alignment, assuming that spectral similarity preserves semantic meaning. This assumption breaks down in physiological imaging, where visually similar RGB responses may arise from distinct and entangled physiological states. This mismatch motivates a shift from reflectance alignment to physiology-aware representation learning, grounded in shared light-matter interaction principles -- a shift that introduces fundamental challenges from cross-channel semantic shifts (C1) and irreversible information loss in RGB-based acquisition (C2). We therefore design PHASE, a physiology-aware hyperspectral reconstruction paradigm that fundamentally redefines object-to-human transfer by disentangling cross-channel physiological semantics via Physiological Channel Reinterpretation and restricting reconstruction to physiologically plausible solutions through Physiologically Constrained Alignment. Under two source-to-target transfer protocols, PHASE consistently outperforms state-of-the-art methods by up to +2.20 SSIM and -3.06 in SAM with merely 1.5% labeled supervision.

2511.11346 2026-06-03 cs.LG

Fast and Expressive Multi-Byte Prediction with Probabilistic Circuits

基于概率电路的快速且富有表现力的多字节预测

Andreas Grivas, Lorenzo Loconte, Emile van Krieken, Piotr Nawrot, Yu Zhao, Euan Wielewski, Pasquale Minervini, Edoardo Ponti, Antonio Vergari

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出MTPC框架,利用概率电路编码未来令牌的联合分布,在字节级LLM中实现快速生成,同时保持表现力。

详情
AI中文摘要

多令牌预测(MTP)是一种显著加速大型语言模型(LLM)生成的突出策略,尤其是在字节级LLM中,这些模型无需分词器但速度极慢。然而,许多现有的MTP方法要么假设未来令牌之间独立,牺牲了表现力,要么在窗口内逐个生成令牌,增加了延迟。在这项工作中,我们在概率电路(PC)框架内研究了MTP中表现力与延迟之间的权衡。我们的框架MTPC允许通过选择电路架构来探索编码未来令牌联合分布的不同方式,推广了经典模型,如(层次)混合模型、隐马尔可夫模型和张量网络。我们通过改造现有的字节级LLM(如EvaByte)和字节化的子词模型(如Llama3.2 3B)展示了MTPC的有效性。实验表明,当与推测解码结合时,与具有独立性假设的MTP相比,MTPC显著加速了生成,同时保证保持原始验证器LLM的性能。我们还严格研究了在探索MTPC的可能参数化(如PC架构以及验证器和草稿LLM之间的部分层共享)时,表现力与延迟之间的最优权衡。

英文摘要

Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), especially in byte-level LLMs, which are tokeniser-free but prohibitively slow. However, many existing MTP methods either assume independence between future tokens, sacrificing expressiveness, or generate tokens one at a time within the window, increasing latency. In this work, we investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs). Our framework, MTPC, allows one to explore different ways to encode the joint distributions over future tokens by selecting circuit architectures, generalising classical models such as (hierarchical) mixture models, hidden Markov models, and tensor networks. We show the efficacy of MTPC by retrofitting existing byte-level LLMs, such as EvaByte, and byte-fied subword models, such as Llama3.2 3B. Our experiments show that, when combined with speculative decoding, MTPC substantially speeds up generation compared to MTP with independence assumptions, while guaranteeing to retain the performance of the original verifier LLM. We also rigorously study the optimal trade-off between expressiveness and latency when exploring the possible parameterisations of MTPC, such as PC architectures and partial layer sharing between the verifier and draft LLMs.

2508.21448 2026-06-03 cs.CL

When Models Refuse: Political Steerability and Feature Richness as Measures of Ideological Depth

当模型拒绝时:政治可操控性与特征丰富度作为意识形态深度的度量

Shariar Kabir

发表机构 * Bangladesh University of Engineering and Technology(孟加拉工程与技术大学)

AI总结 本文提出意识形态深度概念,通过可操控性和稀疏自编码器测量的特征丰富度,研究大语言模型拒绝遵循良性指令是否源于能力缺陷而非安全规则。

详情
AI中文摘要

大型语言模型有时会拒绝遵循良性指令,例如拒绝论证某种政治立场或采用指定角色,这种拒绝通常被视为安全护栏在起作用。我们探究这些拒绝是否可能表明**能力缺陷**:模型缺乏从指令角度进行推理所需的内部表示。为此,我们引入**意识形态深度**,该属性包含两个组成部分:(i) 模型遵循政治指令而不*失败*的能力(可操控性),以及 (ii) 其内部政治表示的**特征丰富度**,通过稀疏自编码器测量。使用两个广泛使用的开放权重LLM作为候选,我们比较基于提示和激活操控的干预措施,并用公开可用的SAE探测政治特征。我们发现巨大且系统性的差异:在两个意识形态方向上更具可操控性的模型激活了约**7.3倍**更多的独特政治特征,而另一个模型则以增加拒绝作为回应。从前者模型中因果性地消融一小部分目标政治特征,重现了相同的特征贫乏行为并导致拒绝增加。综合这些结果表明,对良性提示的拒绝可能源于**能力缺陷**而非固定的安全规则,并且意识形态深度是LLM的一个可测量属性,有助于预测模型何时会拒绝。

英文摘要

Large language models (LLMs) sometimes refuse to follow benign instructions, such as declining to argue a political position or adopt a stated persona, and such refusals are commonly read as safety guardrails at work. We ask whether they can instead signal a **capability deficit**: a shortage of the internal representations a model needs to reason from the instructed perspective. To investigate, we introduce **ideological depth**, a property with two components: (i) a model's ability to follow political instructions without *failure* (steerability), and (ii) the **feature richness** of its internal political representations, measured with sparse autoencoders (SAEs). Using two widely used openweight LLMs as candidates, we compare interventions based on prompts and activation-steering, and probe political features with publicly available SAEs. We find large, systematic differences: a model that is more steerable in both ideological directions activates **~7.3x** more distinct political features, while the other model instead responds with increased refusals. Causally ablating a small, targeted set of political features from the former model reproduces the same feature-poor behavior and drives up refusals. Together, these results indicate that refusals on benign prompts can arise from **capability deficits** rather than fixed safety rules, and that ideological depth is a measurable property of LLMs that helps predict when a model will refuse.

2511.10055 2026-06-03 cs.CV

Physical Plausibility Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance

通过 HCM-GRPO 实现物理合理性推理:赋能紧凑模型以获得卓越性能

Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu

发表机构 * Tsinghua University(清华大学) Alibaba Health Information Technology Limited(阿里巴巴健康信息技术有限公司)

AI总结 针对多模态大语言模型在物理合理性推理中数据缺乏和推理能力弱的问题,提出包含大规模数据集和 HCM-GRPO 方法的完整解决方案,以紧凑模型超越大规模开源和闭源模型。

详情
AI中文摘要

近年来,图像生成的性能得到了显著提升。然而,图像筛选的研究很少,且由于缺乏数据以及多模态大语言模型(MLLMs)中物理合理性推理能力较弱,其性能并不令人满意。在这项工作中,我们提出了一个完整的解决方案,从数据和方法论两方面解决这些问题。在数据方面,我们收集了一个包含超过 128k 样本的综合图像筛选数据集,涉及约 640k 张图像。每个样本由一张原始图像和四张生成图像组成。该数据集从四个方面评估物理合理性推理能力:外观变形、物理阴影、放置布局和扩展合理性。关于数据标注,我们研究了多种方法,包括纯人工、全自动和答案驱动的标注,以最经济的方式获取高质量的思维链(CoT)数据。在方法论上,我们将一种硬案例挖掘(HCM)策略与动态比例准确率(DPA)奖励引入到组相对策略优化(GRPO)框架中,称为 HCM-GRPO。与原始 GRPO 相比,这种增强方法展示了更优越的物理合理性推理能力。我们的实验结果表明,即使是像 GPT5.2 和 Gemini3-Pro 这样的最先进的闭源 MLLMs,在物理合理性推理方面也表现出不令人满意的性能。相比之下,通过利用 HCM-GRPO,我们能够以更小的模型超越大规模开源和领先闭源模型的分数。

英文摘要

The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare, and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak physical plausibility reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, comprising about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates the physical plausibility reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior physical plausibility reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT5.2 and Gemini3-Pro, exhibit unsatisfactory performance in physical plausibility reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
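As background for the GRPO framework this abstract builds on, the sketch below shows the group-relative advantage normalisation commonly associated with GRPO: each sampled response's reward is standardised against the other responses in its group. It does not reproduce the Hard Cases Mining strategy or the Dynamic Proportional Accuracy reward, which the abstract does not specify. The function name and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of responses to the same prompt.

    eps guards against division by zero when every response in the group
    receives the same reward (in which case all advantages are zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A group in which all responses score identically yields zero advantages and therefore no learning signal, which is one plausible motivation for mining harder cases, although the abstract does not state the authors' reasoning.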

2511.08247 2026-06-03 cs.CL cs.CY

ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech

ParliaBench: 面向大语言模型生成的议会演讲的评估与基准框架

Marios Koniaris, Argyro Tsipi, Panayiotis Tsanakas

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出ParliaBench基准框架,通过构建英国议会数据集、结合计算指标与LLM评判的评估方法以及两种新型嵌入指标(政治光谱对齐和政党对齐),系统评估LLM生成议会演讲的语言质量、语义连贯性和政治真实性,实验表明微调显著提升多数指标且新指标对政治维度具有强区分力。

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 4797-4818, European Language Resources Association (ELRA), Palma, Mallorca, Spain, May 2026
AI中文摘要

议会演讲生成对大型语言模型提出了超越标准文本生成任务的特定挑战。与通用文本生成不同,议会演讲不仅需要语言质量,还需要政治真实性和意识形态一致性。当前语言模型缺乏针对议会上下文的专门训练,现有评估方法侧重于标准NLP指标而非政治真实性。为此,我们提出了ParliaBench,一个用于议会演讲生成的基准。我们构建了一个来自英国议会的演讲数据集,以实现系统性的模型训练。我们引入了一个评估框架,将计算指标与LLM-as-a-judge评估相结合,用于衡量生成质量在三个维度上的表现:语言质量、语义连贯性和政治真实性。我们提出了两种新颖的基于嵌入的指标——政治光谱对齐和政党对齐——以量化意识形态定位。我们微调了五个大型语言模型(LLM),生成了28k篇演讲,并使用我们的框架对其进行了评估,比较了基线和微调模型。结果表明,微调在大多数指标上产生了统计显著的改进,并且我们的新颖指标对政治维度表现出强大的区分能力。

英文摘要

Parliamentary speech generation presents specific challenges for large language models beyond standard text generation tasks. Unlike general text generation, parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts, and existing evaluation methods focus on standard NLP metrics rather than political authenticity. To address this, we present ParliaBench, a benchmark for parliamentary speech generation. We constructed a dataset of speeches from UK Parliament to enable systematic model training. We introduce an evaluation framework combining computational metrics with LLM-as-a-judge assessments for measuring generation quality across three dimensions: linguistic quality, semantic coherence, and political authenticity. We propose two novel embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning. We fine-tuned five large language models (LLMs), generated 28k speeches, and evaluated them using our framework, comparing baseline and fine-tuned models. Results show that fine-tuning produces statistically significant improvements across the majority of metrics and our novel metrics demonstrate strong discriminative power for political dimensions.

2511.07971 2026-06-03 cs.LG

Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning

低秩曲率用于大语言模型微调中的零阶优化

Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko

发表机构 * University of Wisconsin – Madison(威斯康星大学麦迪逊分校) University of Georgia(佐治亚大学) Hanyang University(汉阳大学)

AI总结 提出LOREN方法,通过低秩块对角预条件器捕捉曲率并利用REINFORCE留一法梯度估计器降低方差,在LLM微调中实现更高精度和更快收敛,同时峰值内存使用降低27.3%。

Comments Accepted to the AAAI Conference on Artificial Intelligence (AAAI-2026)

详情
AI中文摘要

我们引入了LOREN,一种用于微调大型语言模型(LLM)的曲率感知零阶(ZO)优化方法。现有的ZO方法通过随机扰动的有限差分估计梯度,常常遭受高方差和次优搜索方向的问题。我们的方法通过以下方式解决这些挑战:(i)将梯度预条件问题重新表述为自适应估计用于梯度估计的各向异性扰动分布的问题,(ii)通过自然进化策略框架,使用低秩块对角预条件器捕捉曲率,以及(iii)应用REINFORCE留一法(RLOO)梯度估计器来降低方差。在标准LLM基准上的实验表明,我们的方法通过实现更高的准确率和更快的收敛,优于最先进的ZO方法,同时与MeZO-Adam相比,峰值内存使用减少了高达27.3%。

英文摘要

We introduce LOREN, a curvature-aware zeroth-order (ZO) optimization method for fine-tuning large language models (LLMs). Existing ZO methods, which estimate gradients via finite differences using random perturbations, often suffer from high variance and suboptimal search directions. Our approach addresses these challenges by: (i) reformulating the problem of gradient preconditioning as that of adaptively estimating an anisotropic perturbation distribution for gradient estimation, (ii) capturing curvature through a low-rank block diagonal preconditioner using the framework of natural evolution strategies, and (iii) applying a REINFORCE leave-one-out (RLOO) gradient estimator to reduce variance. Experiments on standard LLM benchmarks show that our method outperforms state-of-the-art ZO methods by achieving higher accuracy and faster convergence, while cutting peak memory usage by up to 27.3% compared with MeZO-Adam.
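To make the variance-reduction idea in item (iii) concrete, here is a minimal sketch of a zeroth-order gradient estimate with a leave-one-out baseline. It is not LOREN itself: the perturbations are isotropic Gaussian rather than the learned anisotropic distribution, no preconditioner is used, and the function name and parameters are hypothetical.

```python
import random


def rloo_zo_gradient(f, theta, num_samples=8, sigma=1e-2, rng=None):
    """Estimate the gradient of f at theta from function values only.

    Each sample's baseline is the mean objective value of the *other*
    samples, which removes the large constant term f(theta) from the
    estimate without introducing bias.
    """
    if num_samples < 2:
        raise ValueError("leave-one-out baseline needs at least 2 samples")
    rng = rng or random.Random(0)
    dim = len(theta)
    noises, values = [], []
    for _ in range(num_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        noises.append(eps)
        values.append(f([t + sigma * e for t, e in zip(theta, eps)]))
    total = sum(values)
    grad = [0.0] * dim
    for eps, v in zip(noises, values):
        baseline = (total - v) / (num_samples - 1)  # leave-one-out mean
        weight = (v - baseline) / (sigma * num_samples)
        for j in range(dim):
            grad[j] += weight * eps[j]
    return grad
```

Only forward evaluations of `f` are needed, which is what makes zeroth-order methods attractive when backpropagation memory is the bottleneck.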

2506.08464 2026-06-03 cs.LG

MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature

MAC:一种使用平均激活近似曲率的高效梯度预条件方法

Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko

发表机构 * University of Wisconsin – Madison(威斯康星大学麦迪逊分校) University of Georgia(佐治亚大学) Hanyang University(汉阳大学)

AI总结 提出MAC方法,通过近似KFAC中Fisher信息矩阵的Kronecker因子,降低二阶优化计算负担,并首次将Kronecker分解应用于Transformer注意力层,在多种网络和数据集上优于KFAC等现有方法。

Comments Accepted to the IEEE International Conference on Data Mining (ICDM-2025)

详情
AI中文摘要

用于训练神经网络的二阶优化方法,如KFAC,通过利用损失景观的曲率信息展现出优越的收敛性。然而,这是以高计算负担为代价的。在这项工作中,我们分析了构成KFAC中逐层Fisher信息矩阵(FIM)的两个组件:与激活和预激活梯度相关的Kronecker因子。基于对其特征谱的实证观察,我们提出了它们的有效近似,从而产生了一种计算高效的优化方法,称为MAC。据我们所知,MAC是第一个将Kronecker分解应用于Transformer中注意力层的FIM,并明确将注意力分数整合到预条件中的算法。我们还研究了MAC在非线性神经网络上的收敛性质,并提供了其收敛到全局最小值的两个条件。我们在各种网络架构和数据集上的广泛评估表明,所提出的方法在准确性、端到端训练时间和内存使用方面优于KFAC和其他最先进的方法。

英文摘要

Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of loss landscape. However, it comes at the expense of high computational burden. In this work, we analyze the two components that constitute the layer-wise Fisher information matrix (FIM) used in KFAC: the Kronecker factors related to activations and pre-activation gradients. Based on empirical observations on their eigenspectra, we propose efficient approximations for them, resulting in a computationally efficient optimization method called MAC. To the best of our knowledge, MAC is the first algorithm to apply the Kronecker factorization to the FIM of attention layers used in transformers and explicitly integrate attention scores into the preconditioning. We also study the convergence property of MAC on nonlinear neural networks and provide two conditions under which it converges to global minima. Our extensive evaluations on various network architectures and datasets show that the proposed method outperforms KFAC and other state-of-the-art methods in terms of accuracy, end-to-end training time, and memory usage.

2310.00965 2026-06-03 cs.LG

Node Perturbation Can Effectively Train Multi-Layer Neural Networks

节点扰动可以有效训练多层神经网络

Sander Dalm, Marcel van Gerven, Nasir Ahmad

发表机构 * Donders Institute for Brain, Cognition and Behaviour(唐德斯大脑、认知与行为研究所)

AI总结 通过将节点扰动与方向导数对齐并在每层进行输入去相关,显著提升了节点扰动学习的参数收敛速度和测试性能,接近反向传播。

详情
AI中文摘要

反向传播(BP)仍然是训练深度神经网络参数的主导且最成功的方法。然而,BP依赖于两个计算上不同的阶段,不能提供对生物学习的满意解释,并且可能难以应用于具有不连续性或噪声节点动态的网络训练。相比之下,节点扰动(NP),也称为活动扰动前向梯度,提出通过向网络激活中注入噪声并随后测量引起的损失变化来学习。NP依赖于两次前向(推理)传递,不使用网络导数,并已被提出作为生物系统中学习的模型。然而,标准NP数据效率极低,并且由于其无引导的基于噪声的搜索过程可能不稳定。在这项工作中,我们通过将NP与方向导数相关联并引入输入去相关,发展了一种现代视角。我们发现,与方向导数的更紧密对齐以及每层的输入去相关在理论和实践上增强了NP学习的性能,在参数收敛方面有大幅改进,并且在测试数据上获得更高的性能,接近BP。此外,我们的新公式允许应用于噪声过程本身不可访问的噪声系统,这对于神经形态芯片上的学习特别有意义。

英文摘要

Backpropagation (BP) remains the dominant and most successful method for training parameters of deep neural network models. However, BP relies on two computationally distinct phases, does not provide a satisfactory explanation of biological learning, and can be challenging to apply for training of networks with discontinuities or noisy node dynamics. By comparison, node perturbation (NP), also known as activity-perturbed forward gradients, proposes learning by the injection of noise into network activations, and subsequent measurement of the induced loss change. NP relies on two forward (inference) passes, does not make use of network derivatives, and has been proposed as a model for learning in biological systems. However, standard NP is highly data inefficient and can be unstable due to its unguided noise-based search process. In this work, we develop a modern perspective on NP by relating it to the directional derivative and incorporating input decorrelation. We find that a closer alignment with directional derivatives together with input decorrelation at every layer theoretically and practically enhances performance of NP learning with large improvements in parameter convergence and much higher performance on the test data, approaching that of BP. Furthermore, our novel formulation allows for application to noisy systems in which the noise process itself is inaccessible, which is of particular interest for on-chip learning in neuromorphic systems.
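The standard node perturbation scheme described in this abstract (two forward passes, noise injected into activations, no network derivatives) can be sketched for a single linear layer as follows. This is the plain baseline only: it omits the input decorrelation and the closer directional-derivative alignment that the paper proposes. The squared-error loss, the layer shape, and all function names are assumptions made for illustration.

```python
import random


def forward(W, x):
    """Linear layer output y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def loss(y, target):
    """Squared-error loss, chosen here only for illustration."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def np_weight_gradient(W, x, target, sigma=1e-3, rng=None):
    """One-sample node-perturbation estimate of dL/dW.

    A clean pass and a noisy pass are compared; the loss change, scaled
    by 1/sigma, weights the injected noise to give an estimate of the
    gradient with respect to the layer output, which is then combined
    with the input x.
    """
    rng = rng or random.Random(0)
    y = forward(W, x)
    noise = [rng.gauss(0.0, 1.0) for _ in y]
    clean = loss(y, target)
    noisy = loss([yi + sigma * n for yi, n in zip(y, noise)], target)
    scale = (noisy - clean) / sigma
    return [[scale * n * xi for xi in x] for n in noise]
```

A single estimate is very noisy; averaging many of them approaches the true gradient, which illustrates the data inefficiency the abstract attributes to standard NP.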

2511.04421 2026-06-03 cs.RO

Temporal Action Selection for Action Chunking

用于动作分块的时间动作选择

Yueyang Weng, Xiaopeng Zhang, Yongjin Mu, Yingcong Zhu, Yanjie Li

发表机构 * Guangdong Key Laboratory of Intelligent Morphing Mechanisms and Adaptive Robotics and School of Intelligence Science and Engineering, the Harbin Institute of Technology Shenzhen, China(广东省智能变形机制与自适应机器人重点实验室;哈尔滨工业大学(深圳)智能科学与工程学院)

AI总结 提出时间动作选择(TAS)算法,通过缓存多时间步预测的动作块并动态选择最优动作,在保持决策一致性的同时提升反应性,显著提高任务成功率。

详情
AI中文摘要

动作分块是示范学习(LfD)中广泛采用的方法。通过建模多步动作块而非单步动作,动作分块显著增强了对人类专家策略的建模能力。然而,由于动作分块仅在完整动作块执行后才做出单一决策,由此导致的决策频率降低限制了实时观测的利用,削弱了在动态或嘈杂环境中的反应性。现有解决该问题的尝试主要是在反应性和决策一致性之间进行权衡,未能同时实现两者。为解决这一局限,我们提出了一种新颖算法——时间动作选择(TAS),该算法缓存来自多个时间步的预测动作块,并通过轻量级选择器网络动态选择最优动作。TAS在反应性和决策一致性上实现了平衡优化。跨多个任务及不同基础策略架构的实验表明,TAS显著提高了成功率。此外,将TAS作为基础策略与残差强化学习(RL)相结合,既提升了训练效率,也提高了性能上限。在仿真和物理机器人上的实验均证实了该方法的有效性。

英文摘要

Action chunking is a widely adopted approach in Learning from Demonstration (LfD). By modeling multi-step action chunks rather than single-step actions, action chunking significantly enhances modeling capabilities for human expert policies. However, because action chunking makes a single decision only after a complete action block has been executed, the resulting reduction in decision frequency restricts the utilization of real-time observations, impairing reactivity in dynamic or noisy environments. Existing efforts to address this issue have primarily resorted to trading off reactivity against decision consistency, without achieving both. To address this limitation, we propose a novel algorithm, Temporal Action Selection (TAS), which caches predicted action chunks from multiple timesteps and dynamically selects the optimal action through a lightweight selector network. TAS achieves balanced optimization across both reactivity and decision consistency. Experiments across multiple tasks with diverse base policy architectures show that TAS significantly improves success rates. Furthermore, integrating TAS as a base policy with residual reinforcement learning (RL) improves both training efficiency and the performance ceiling. Experiments in both simulation and physical robots confirm the method's efficacy.

2511.02417 2026-06-03 cs.CV cs.RO

CropCraft: A Procedural World Generator for Robotic Simulation of Agricultural Tasks

CropCraft:用于农业任务机器人仿真的程序化世界生成器

Riccardo Bertoglio, Cyrille Pierre, Johann Laconte, Roland Lenain

发表机构 * Institut National de la Recherche Agronomique(法国国家农业科研院)

AI总结 提出基于Blender和Python的开源程序化世界生成器CropCraft,通过YAML配置生成多样化农田场景,支持间作、葡萄园和杂草田,并生成带标注的3D仿真环境,用于农业机器人感知和导航算法开发。

详情
AI中文摘要

现代农业中农业生态实践的采用要求机器人系统能够在高度多样化和复杂的田间环境中运行。开发和评估此类系统严重依赖仿真,但生成代表农业生态多样性的逼真且可配置的3D环境仍然是一个主要挑战。本文提出了 CropCraft,一个基于 Blender 和 Python 构建的开源程序化世界生成器,旨在生成适用于农业机器人的3D仿真环境。CropCraft 通过简单的 YAML 配置文件生成作物田,支持多种场景,包括间作、葡萄园和杂草丛生的田地。该工具包含一个多生长阶段的3D植物模型库(作物、草和杂草),并使用随机放置算法真实地再现实际田地中观察到的空间变异性。生成的场景可直接导入 Gazebo 仿真器,并包含所有放置元素的地面真值标注,支持感知和导航算法的开发。为了展示 CropCraft 的实际用途,我们将其应用于使用深度学习的作物-杂草语义分割任务。生成了包含10,000张玉米田合成图像的数据集,这些图像具有不同的杂草密度、生长阶段和光照条件,并用于训练多个分割架构。仅使用合成数据训练的模型在真实田间图像上实现了约10%的平均交并比(mIoU)的 sim-to-real 差距,优于先前的先进合成生成方法。我们进一步表明,即使将少量真实图像与合成数据结合,也能提高跨领域的泛化能力,为农业感知任务中合成数据的有效使用提供了新见解。

英文摘要

The adoption of agroecological practices in modern agriculture requires robotic systems capable of operating in highly diverse and complex field environments. Developing and evaluating such systems relies heavily on simulation, yet generating realistic and configurable 3D environments representative of agroecological diversity remains a major challenge. This paper presents CropCraft, an open-source procedural world generator built on Blender and Python, designed to produce 3D simulation environments tailored to agricultural robotics. CropCraft generates crop fields from a simple YAML configuration file, supporting a wide range of scenarios including intercropping, vineyards, and weed-infested fields. The tool includes a library of 3D plant models (crops, grasses, and weeds) at multiple growth stages, and uses stochastic placement algorithms to realistically reproduce the spatial variability observed in real fields. Generated worlds are directly importable into the Gazebo simulator and include ground-truth annotations for all placed elements, supporting both perception and navigation algorithm development. To demonstrate the practical utility of CropCraft, we apply it to the task of crop-weed semantic segmentation using deep learning. A dataset of 10,000 synthetic images of maize fields with varying weed densities, growth stages, and lighting conditions was generated and used to train several segmentation architectures. Models trained exclusively on synthetic data achieve a sim-to-real gap of approximately 10% mean Intersection over Union (mIoU) on real field images, outperforming previous state-of-the-art synthetic generation approaches. We further show that combining even a few real images with synthetic data improves generalization across domains, providing new insights into the effective use of synthetic data for agricultural perception tasks.

2510.23216 2026-06-03 cs.AI cs.LG

Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach

逼真足球模拟中的类人守门:一种样本高效的强化学习方法

Alessandro Sestini, Joakim Bergdahl, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Brady Chen, Fabio Zinno, Michael Jones, Linus Gisslén

发表机构 * University of Edinburgh(爱丁堡大学) KTH Royal Institute of Technology(皇家理工学院) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种样本高效的深度强化学习方法,通过利用预收集数据和增加网络可塑性,在EA SPORTS FC 25中训练出守门员智能体,其扑救率比内置AI高10%,训练速度比标准DRL快50%,且行为更接近人类。

详情
AI中文摘要

尽管多个知名视频游戏已成为深度强化学习(DRL)的测试平台,但该技术很少被游戏行业用于制作真实的AI行为。先前的研究侧重于使用大型模型训练超人类智能体,这对于资源有限、旨在实现类人智能体的游戏工作室来说并不实际。本文提出了一种样本高效的DRL方法,专为在工业环境(如视频游戏行业)中训练和微调智能体而设计。我们的方法通过利用预收集的数据和增加网络可塑性来提高基于价值的DRL的样本效率。我们在EA SPORTS FC 25(当今最畅销的足球模拟游戏之一)中评估了该方法训练守门员智能体的效果。我们的智能体在扑救率上比游戏内置AI高出10%。消融研究表明,与标准DRL方法相比,我们的方法训练智能体速度提高了50%。最后,领域专家的定性评估表明,与手工制作的智能体相比,我们的方法创造了更类人的游戏玩法。作为该方法影响力的证明,该技术已被用于该系列的最新版本中。

英文摘要

While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game's built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testament to the impact of the approach, the method has been adopted for use in the most recent release of the series.

2510.23469 2026-06-03 cs.LG

Towards Fair Graph Prompting: A Dual-Prompt Mechanism for Mitigating Attribute and Structural Bias

面向公平图提示:一种缓解属性与结构偏差的双提示机制

Yuhan Yang, Xingbo Fu, Jundong Li

发表机构 * University of Michigan(密歇根大学) University of Virginia(弗吉尼亚大学)

AI总结 提出自适应双提示框架(ADPrompt),通过自适应特征修正和自适应消息校准两个模块,在适应预训练GNN的同时缓解节点属性与图结构中的偏差,实现公平的节点分类。

详情
AI中文摘要

对未标记图数据进行自监督预训练已成为图神经网络(GNN)的常见范式。然而,预训练目标与下游任务之间通常存在目标差距。为弥补这一差距,图提示方法通过可学习提示将冻结的预训练GNN适应到特定下游任务。尽管有效,但现有大多数图提示方法主要关注提升模型性能,而很大程度上忽略了公平性问题。由于下游图数据在节点属性和图结构中固有地包含偏差,预训练GNN可能在不同人口统计子组之间产生不同的表示。为解决这一局限,我们提出自适应双提示(ADPrompt),一种公平感知的图提示框架,用于适应预训练GNN。ADPrompt包含两个互补组件:自适应特征修正,学习个性化属性提示以在输入层面抑制敏感信息;以及自适应消息校准,引入逐层结构提示以动态调节来自邻居节点的信息传播。通过联合优化这两个模块,ADPrompt在适应预训练GNN的同时缓解了属性级和结构级偏差。在四个基准数据集上采用多种预训练策略的实验表明,ADPrompt在节点分类任务中始终优于七个竞争基线。

英文摘要

Self-supervised pre-training on unlabeled graph data has become a common paradigm for Graph Neural Networks (GNNs). However, an objective gap often remains between pre-training objectives and downstream tasks. To bridge this gap, graph prompting methods adapt frozen pre-trained GNNs to specific downstream tasks through learnable prompts. Despite its effectiveness, most existing graph prompting methods primarily focus on improving model performance and largely overlook fairness concerns. As downstream graph data inherently contains biases in both node attributes and graph structures, pre-trained GNNs may produce representations that differ across demographic subgroups. To address this limitation, we propose Adaptive Dual Prompting (ADPrompt), a fairness-aware graph prompting framework for adapting pre-trained GNNs. ADPrompt incorporates two complementary components: Adaptive Feature Rectification, which learns personalized attribute prompts to suppress sensitive information at the input level, and Adaptive Message Calibration, which introduces layer-wise structure prompts to dynamically regulate information propagation from neighboring nodes. By jointly optimizing these two modules, ADPrompt adapts the pre-trained GNN while mitigating both attribute-level and structural bias. Experiments on four benchmark datasets with multiple pre-training strategies demonstrate that ADPrompt consistently outperforms seven competitive baselines in node classification tasks.

2510.16282 2026-06-03 cs.CL

Instant Personalized Large Language Model Adaptation via Hypernetwork

Zhaoxuan Tan, Zixuan Zhang, Haoyang Wen, Zheng Li, Rongzhi Zhang, Pei Chen, Fengran Mo, Zheyuan Liu, Qingkai Zeng, Qingyu Yin, Meng Jiang

Affiliations University of Notre Dame; Amazon.com Inc; Université de Montréal

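AI summary Proposes the Profile-to-PEFT framework, which uses a hypernetwork to map encoded user profiles directly to adapter parameters, enabling instant personalization without per-user training and outperforming existing methods at lower computational cost.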

Comments accepted to ACL 2026


Abstract

Personalized large language models (LLMs) tailor content to individual preferences using user profiles or histories. However, existing parameter-efficient fine-tuning (PEFT) methods, such as the "One-PEFT-Per-User" (OPPU) paradigm, require training a separate adapter for each user, making them computationally expensive and impractical for real-time updates. We introduce Profile-to-PEFT, a scalable framework that employs a hypernetwork, trained end-to-end, to map a user's encoded profile directly to a full set of adapter parameters (e.g., LoRA), eliminating per-user training at deployment. This design enables instant adaptation, generalization to unseen users, and privacy-preserving local deployment. Experimental results demonstrate that our method outperforms both prompt-based personalization and OPPU while using substantially fewer computational resources at deployment. The framework exhibits strong generalization to out-of-distribution users and maintains robustness across varying user activity levels and different embedding backbones. The proposed Profile-to-PEFT framework enables efficient, scalable, and adaptive LLM personalization suitable for large-scale applications.
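
The core idea, a hypernetwork that turns a profile embedding into LoRA-style adapter weights in a single forward pass, can be sketched as follows. This is an illustrative toy, not the paper's architecture: the single linear hypernetwork layer, the class name `LoRAHypernetwork`, the helper `adapted_forward`, and all dimensions are assumptions, and the weights are random rather than trained.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def reshape(flat, rows, cols):
    """Reshape a flat list into a rows x cols list of lists."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

class LoRAHypernetwork:
    """Hypothetical one-layer hypernetwork: profile embedding -> (A, B)."""

    def __init__(self, profile_dim, d_out, d_in, rank, seed=0):
        rng = random.Random(seed)
        self.d_out, self.d_in, self.rank = d_out, d_in, rank
        n_params = rank * d_in + d_out * rank  # entries of A plus entries of B
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(profile_dim)]
                  for _ in range(n_params)]

    def generate(self, profile):
        """One forward pass yields the low-rank factors A (rank x d_in) and
        B (d_out x rank) for this user; no per-user gradient steps are run."""
        flat = matvec(self.W, profile)
        split = self.rank * self.d_in
        A = reshape(flat[:split], self.rank, self.d_in)
        B = reshape(flat[split:], self.d_out, self.rank)
        return A, B

def adapted_forward(W0, A, B, x, scale=1.0):
    """Frozen base layer plus low-rank update: W0 x + scale * B (A x)."""
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

hyper = LoRAHypernetwork(profile_dim=4, d_out=3, d_in=5, rank=2)
A, B = hyper.generate([0.2, -0.1, 0.7, 0.0])  # stand-in for an encoded profile
W0 = [[0.1] * 5 for _ in range(3)]            # stand-in for a frozen LLM weight
y = adapted_forward(W0, A, B, [1.0, 2.0, 3.0, 4.0, 5.0])
```

Under these assumptions, serving a new user costs one call to `generate`, whereas a per-user adapter scheme such as OPPU would run a separate fine-tuning job for that user.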