arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.11790 2026-05-27 cs.LG

Disentangled Representation Learning through Unsupervised Symmetry Group Discovery

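Barthélémy Dang-Nhu, Louis Annabi, Sylvain Argentieri

Affiliations: Sorbonne Université, CNRS, Institut des Systèmes Intelligents et de Robotique (ISIR)

AI summary: Proposes a method by which an embodied agent autonomously discovers the group structure of its action space through unsupervised interaction with the environment, proves that the true symmetry group decomposition is identifiable under minimal assumptions, and derives two algorithms: one for discovering the group decomposition and one for learning Linear Symmetry-Based Disentangled (LSBD) representations.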

Abstract

Symmetry-based disentangled representation learning leverages the group structure of environment transformations to uncover the latent factors of variation. Prior approaches to symmetry-based disentanglement have required strong prior knowledge of the symmetry group's structure, or restrictive assumptions about the subgroup properties. In this work, we remove these constraints by proposing a method whereby an embodied agent autonomously discovers the group structure of its action space through unsupervised interaction with the environment. We prove the identifiability of the true symmetry group decomposition under minimal assumptions, and derive two algorithms: one for discovering the group decomposition from interaction data, and another for learning Linear Symmetry-Based Disentangled (LSBD) representations without assuming specific subgroup properties. Our method is validated on three environments exhibiting different group decompositions, where it outperforms existing LSBD approaches.

2603.17218 2026-05-27 cs.CL cs.AI cs.GT

Alignment Makes Language Models Normative, Not Descriptive

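Eilam Shapira, Moshe Tennenholtz, Roi Reichart

Affiliations: Technion – Israel Institute of Technology

AI summary: Comparing 120 base-aligned model pairs on more than 10,000 real human decisions, the paper finds that alignment induces a normative bias: it improves prediction on one-shot textbook games but hurts prediction in multi-round strategic games, where it misses descriptive dynamics such as reciprocity and retaliation.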

Abstract

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

2601.10566 2026-05-27 cs.CL cs.LG

Representation-Aware Unlearning via Activation Signatures: From Suppression to Entity-Signature Erasure

Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, K. M. Shadman Wadith, Nazia Tasnim, Farig Sadeque

Affiliations: Computer Science and Engineering, BRAC University, Dhaka, Bangladesh; Boston University, Boston, MA, USA

AI summary: Proposes the ERUF framework, which mines entity-specific activation signatures and suppresses the corresponding directions to achieve representation-level unlearning while jointly delivering surface-level suppression, internal attenuation, and utility preservation.

Comments 16 pages, 4 figures

Abstract

Entity-level unlearning is usually evaluated by what a model says: whether it stops naming the target, refuses a query, or shifts a Truth Ratio distribution. These output-level tests, however, do not show whether a subject's internal representation has been attenuated. We introduce the Entity Representation Unlearning Framework (ERUF), a representation-aware framework that mines subject-specific activation signatures, suppresses the corresponding activation direction, and distills the behavior into LoRA parameters. Among evaluated baselines, ERUF is the only method that jointly achieves surface-level suppression, internal attenuation, and utility preservation. On TOFU forget10, ERUF achieves FQ = 0.99 and MU = 0.62, matching reported oracle utility while approaching oracle forget quality. Across most standard foundation-model settings, ERUF maintains low leakage and low internal target activation, with SMR between 0.00% and 1.10%, EL10 below 0.06, and utility drift below 3%. On Llama-3.1-8B, adversarial entity recovery falls from 63.89% to 20.15%, while name-agnostic recovery decreases by 72.7% to 77.4%. Joint surface/internal diagnostics further reveal scale-dependent behavior in reasoning-prior models that surface metrics alone would miss. We interpret these results as operational evidence of representation-level attenuation, not as a formal guarantee of irreversible deletion.
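
A minimal sketch of what suppressing an activation direction can look like, assuming the entity signature is available as a single direction vector and that suppression is a plain orthogonal projection. The function names and toy vectors are hypothetical illustrations; ERUF's actual signature mining and LoRA distillation steps are not reproduced here.

```python
import math


def projection(hidden, direction):
    """Signed length of `hidden` along `direction` (direction need not be unit-norm)."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm


def suppress_direction(hidden, direction, strength=1.0):
    """Subtract `strength` times the component of `hidden` along `direction`.

    strength=1.0 removes the component entirely; smaller values attenuate it.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - strength * coeff * u for h, u in zip(hidden, unit)]


# Toy data (assumed, not from the paper): a 3-d activation and a signature direction.
hidden = [2.0, 1.0, -1.0]
signature = [1.0, 0.0, 0.0]

edited = suppress_direction(hidden, signature)
print(projection(hidden, signature), projection(edited, signature))
```

Measuring the projection before and after the edit is one simple way to check internal attenuation separately from what the model outputs, which is the distinction the abstract draws.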

2601.03471 2026-05-27 cs.CL cs.AI

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning

Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Ziyang Zhang, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin

Affiliations: Emory University; University of Illinois Chicago; Microsoft

AI summary: Proposes the EpiQAL benchmark, whose three subsets (factual recall, multi-step inference, and conclusion reconstruction under incomplete information) evaluate LLMs on epidemiological reasoning, and finds that current models show limited performance on multi-step inference.

Comments 31 pages, 7 figures, 25 tables

Abstract

Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The three subsets progressively test factual recall, multi-step inference, and conclusion reconstruction under incomplete information, and are constructed through a quality-controlled pipeline combining taxonomy guidance, multi-model verification, and difficulty screening. Experiments on fifteen models spanning open-source and proprietary systems reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence-grounding, inferential reasoning, and conclusion reconstruction.

2601.03079 2026-05-27 cs.CL

Learning to Diagnose and Correct Errors: Towards Moral Sensitivity Acquisition in Large Language Models

Bocheng Chen, Xi Chen, Han Zi, Haitao Mao, Zimo Qi, Xitong Zhang, Kristen Johnson, Guangliang Liu

Affiliations: University of Mississippi; Nanyang Technological University; Northeastern University; AWS AI Labs; Johns Hopkins University; Qualcomm; Michigan State University; Indiana University Indianapolis

AI summary: Proposes a pragmatic inference approach in which large language models acquire moral sensitivity by diagnosing and correcting moral errors, and validates its effectiveness across multiple tasks.

Abstract

Moral sensitivity is the most fundamental capability underlying human moral competence. Although many approaches aim to align large language models (LLMs) with human moral values, they primarily focus on fitting the distributions of morally appropriate texts while overlooking how to enable moral sensitivity acquisition in LLMs. In this paper, we take a step toward addressing the question: How can moral sensitivity be acquired in LLMs? Specifically, we propose a pragmatic inference approach that facilitates moral sensitivity acquisition in LLMs by enabling them to diagnose and correct moral errors. A central strength of our pragmatic inference approach lies in its unified perspective: rather than modeling moral discourses across semantically diverse and complex surface forms, it provides a principled framework for designing pragmatic inference procedures grounded in their inferential load. Empirical evidence demonstrates that our pragmatic approach can enable moral sensitivity acquisition in LLMs and generalizes effectively across tasks.

2603.16870 2026-05-27 cs.CV cs.AI

Demystifying Video Reasoning

Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

Affiliations: SenseTime Research; Nanyang Technological University; University of California, Berkeley; University of California, San Diego; Carnegie Mellon University

AI summary: Shows experimentally that reasoning in video diffusion models emerges mainly along the denoising steps, names this mechanism Chain-of-Steps (CoS), identifies emergent behaviors such as working memory, self-correction, and perception before action, and proposes a training-free ensembling strategy that improves reasoning.

Comments Homepage: https://www.wruisi.com/demystifying_video_reasoning

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

2603.16654 2026-05-27 cs.CL cs.AI cs.LG

Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

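Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li

Affiliations: The University of Tokyo; Yale University; Stanford University; Xiaomi EV; Soongsil University

AI summary: To address the difficulty of diagnosing intermediate-step failures in multi-hop question answering, proposes the Omanic benchmark, which decomposes questions into single-hop sub-questions and analyzes step-level errors, revealing a later-hop bottleneck, a factual knowledge floor, and error propagation; fine-tuning on it improves performance on several reasoning benchmarks.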

Abstract

Evaluating the reasoning abilities of large language models (LLMs) solely from final answers can obscure failures in intermediate steps, especially in multi-hop QA benchmarks without step-level annotations. To address this gap, we introduce Omanic, an open-domain 4-hop QA benchmark designed not only to measure final-answer accuracy but also to diagnose where reasoning breaks down. Omanic contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench), with each evaluation question decomposed into single-hop sub-questions, intermediate answers, and structured graph topologies. Experiments with proprietary and open-source LLMs show that Omanic is challenging, while step-wise analysis reveals a later-hop bottleneck, factual knowledge floor, and error propagation along reasoning chains. Fine-tuning on OmanicSynth transfers to six reasoning and mathematics benchmarks, yielding a 7.41-point average gain and validating its effectiveness as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.
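
To make the idea of step-wise evaluation concrete, here is a small sketch that scores each hop separately and locates the first hop where a chain goes wrong. It assumes each example carries one predicted and one gold answer per hop and uses case-insensitive exact match; the function names and toy chains are hypothetical and do not reflect Omanic's actual data format or metrics.

```python
def _match(predicted, gold):
    """Case-insensitive exact match after trimming whitespace."""
    return predicted.strip().lower() == gold.strip().lower()


def first_failure_hop(predicted_chain, gold_chain):
    """Return the 1-based index of the first wrong hop, or None if every hop matches."""
    for hop, (p, g) in enumerate(zip(predicted_chain, gold_chain), start=1):
        if not _match(p, g):
            return hop
    return None


def per_hop_accuracy(examples):
    """Accuracy at each hop position over (predicted_chain, gold_chain) pairs."""
    n_hops = len(examples[0][1])
    correct = [0] * n_hops
    for predicted_chain, gold_chain in examples:
        for i, (p, g) in enumerate(zip(predicted_chain, gold_chain)):
            correct[i] += _match(p, g)
    return [c / len(examples) for c in correct]


# Toy 4-hop chains (invented for illustration only).
gold = ["Paris", "France", "Euro", "ECB"]
examples = [
    (["Paris", "France", "Euro", "ECB"], gold),
    (["paris", "France", "Franc", "Bank of France"], gold),
]

print(per_hop_accuracy(examples))
print([first_failure_hop(p, g) for p, g in examples])
```

In the second toy chain the error at hop 3 carries into hop 4, which is the kind of error propagation that a final-answer-only score cannot distinguish from a failure at the last hop.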

2603.13853 2026-05-27 cs.CL cs.AI

APEX-Searcher: Refining Credit Assignment with Subgoaling for Agentic Retrieval-Augmented Generation

Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao

Affiliations: University of Chinese Academy of Sciences; MAIS, Institute of Automation, Chinese Academy of Sciences; Wenge Technology Co., Ltd

AI summary: To address ambiguous retrieval paths and sparse end-to-end reinforcement learning rewards in complex multi-hop question answering, proposes APEX-Searcher, which separates credit assignment between planning (optimized with RL) and execution (learned with SFT) and achieves consistent gains across multiple benchmarks.

Abstract

Retrieval-augmented generation (RAG) connects large language models (LLMs) to external knowledge, but single-round retrieval is often insufficient for complex multi-hop questions. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in end-to-end reinforcement learning (RL), which can lead to inaccurate retrieval results and lower performance. We attribute these failures to hierarchical credit entanglement: a single final reward updates planning and execution together, so the model cannot clearly separate plan errors from retrieval errors. We propose APEX-Searcher, which uses a Refining Credit Assignment paradigm: planning is optimized by RL with a plan-level reward, while execution is learned by SFT. Extensive experiments show consistent gains in both multi-hop RAG and task planning across benchmarks.

2603.15500 2026-05-27 cs.AI cs.LG

Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

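Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, Yuqing Yang

Affiliations: Microsoft Research; KAIST; Seoul National University

AI summary: Proposes an information-theoretic framework that separates reasoning into procedural advancement and epistemic verbalization (the token-level externalization of uncertainty), proves that sporadic verbalization restores convergence even without explicit error triggers, and shows experimentally that small-scale SFT suffices to instill or suppress this capability, recasting reasoning as strategic information allocation under uncertainty.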

Abstract

LLMs often exhibit Aha moments such as self-correction after tokens like "Wait," yet the underlying mechanism remains unclear. Standard LLMs collapse mainly through silent divergence, where trajectories drift from the correct answer yet remain locally coherent, so no explicit error triggers reactive self-correction. We introduce an information-theoretic framework that separates reasoning into procedural advancement and epistemic verbalization, the token-level externalization of uncertainty, and prove that sporadic verbalization restores convergence toward the correct answer even without explicit error triggers. Empirically, a minimal doubt cue recovers failed trajectories, and small-scale SFT suffices to instill or suppress this capability, suggesting that strong reasoning hinges less on an extraordinary inner mechanism than on the linguistic habit of externalizing uncertainty. Our framework recasts reasoning as strategic information allocation under uncertainty, offering a new lens for understanding and advancing LLM reasoning.

2603.13282 2026-05-27 cs.LG cs.AI

FedTreeLoRA: Reconciling Statistical and Functional Heterogeneity in Federated LoRA Fine-Tuning

Jieming Bian, Lei Wang, Letian Zhang, Jie Xu

Affiliations: University of Florida, Gainesville, FL 32611; Middle Tennessee State University, Murfreesboro, TN 37132

AI summary: To address the orthogonal yet coupled statistical and functional heterogeneity in federated LoRA fine-tuning, proposes FedTreeLoRA, a tree-structured aggregation framework whose layer-wise alignment balances generalization and personalization.

Comments Accepted by ICML 2026

Abstract

Federated Learning (FL) with Low-Rank Adaptation (LoRA) has become a standard for privacy-preserving LLM fine-tuning. However, existing personalized methods predominantly operate under a restrictive Flat-Model Assumption: they address client-side statistical heterogeneity but treat the model as a monolithic block, ignoring the functional heterogeneity across LLM layers. We argue that these two dimensions, statistical (horizontal) and functional (vertical), are orthogonal in source yet coupled in interaction, implying that the optimal depth of parameter sharing is functionally dependent on client similarity. To address this, we propose FedTreeLoRA, a framework employing tree-structured aggregation for fine-grained, layer-wise alignment. By dynamically constructing an aggregation hierarchy, FedTreeLoRA allows clients to share broad consensus on shallow 'trunks' while progressively specializing on deep 'branches'. Experiments on NLU and NLG benchmarks demonstrate that FedTreeLoRA significantly outperforms state-of-the-art methods by effectively reconciling generalization and personalization.
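
The trunk-and-branch idea can be illustrated with a deliberately simplified sketch: a fixed two-level tree in which layers below a split depth are averaged over all clients and deeper layers are averaged only within a client cluster. The fixed clusters, the single `split_layer`, and the plain unweighted mean are all assumptions made for illustration; the paper describes a dynamically constructed hierarchy that this sketch does not attempt to reproduce.

```python
def mean_vectors(vectors):
    """Element-wise mean of equally sized lists of floats."""
    return [sum(values) / len(values) for values in zip(*vectors)]


def tree_aggregate(client_updates, clusters, split_layer):
    """Aggregate per-layer updates with a shared trunk and per-cluster branches.

    client_updates: {client_id: [layer_0_vector, layer_1_vector, ...]}
    clusters:       list of lists of client ids (assumed given, not learned here)
    split_layer:    layers with index < split_layer are shared by every client
    """
    all_clients = list(client_updates)
    n_layers = len(client_updates[all_clients[0]])
    aggregated = {}
    for cluster in clusters:
        for client in cluster:
            layers = []
            for layer in range(n_layers):
                group = all_clients if layer < split_layer else cluster
                layers.append(mean_vectors([client_updates[c][layer] for c in group]))
            aggregated[client] = layers
    return aggregated


# Toy updates (invented): three clients, two layers, one parameter per layer.
updates = {
    "a": [[1.0], [1.0]],
    "b": [[3.0], [3.0]],
    "c": [[5.0], [9.0]],
}
result = tree_aggregate(updates, clusters=[["a", "b"], ["c"]], split_layer=1)
print(result)
```

Here every client receives the same layer-0 value, while layer 1 differs between the two clusters, which mirrors the shallow consensus and deep specialization described in the abstract.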

2603.12754 2026-05-27 cs.CL

A Method for Learning Large-Scale Computational Construction Grammars from Semantically Annotated Corpora

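Paul Van Eecke, Katrien Beuls

Affiliations: Artificial Intelligence Laboratory, Vrije Universiteit Brussel; Faculté d'informatique, Université de Namur

AI summary: Proposes a method for automatically learning large-scale, broad-coverage computational construction grammars from semantically annotated corpora, yielding networks of tens of thousands of constructions that support frame-semantic analysis of open-domain text and reveal syntactico-semantic usage patterns.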

Comments Accepted for oral presentation at CoNLL 2026

Abstract

We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.

2603.09551 2026-05-27 cs.CV

GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

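Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu, Bo Yang

Affiliations: College of Computer Science and Technology; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University

AI summary: Proposes the GeoSolver framework, which builds the large-scale process-supervision dataset Geo-PRM-2M, trains the process reward model GeoPRM, and combines them with the Process-Aware Tree-GRPO reinforcement learning algorithm to achieve verifiable step-by-step reasoning in remote sensing, reaching state-of-the-art performance on multiple benchmarks and supporting test-time scaling.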

Comments Code: https://github.com/yourname/GeoSolver

Abstract

While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
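
GRPO-style algorithms score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that generic group-relative normalization, using the population standard deviation (an assumption; implementations differ). The tree-structured exploration and faithfulness-weighted rewards of Process-Aware Tree-GRPO are not reproduced, and the reward values are invented.

```python
import statistics


def group_relative_advantages(rewards):
    """Normalize rewards within one group of samples for the same prompt.

    Returns zeros when all rewards are equal, since no sample is then
    better or worse than the group average.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Toy group of four sampled responses with binary rewards (assumed values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

With an outcome-only reward, every step in a response shares the same advantage; a process reward model such as the one described above is what would allow credit to vary across intermediate steps.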

2512.09700 2026-05-27 cs.CV eess.IV

LiM-YOLO: Less is More with Pyramid Level Shift for Ship Detection in Optical Remote Sensing

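Seon-Hoon Kim, Yerin Kim, Hyeji Sim, Youeyun Jung, Okchul Jung, Daewon Chung

Affiliations: University of Science and Technology (UST); Korea Aerospace Research Institute (KARI)

AI summary: For ship detection in optical remote sensing, where small target scales and high aspect ratios dilute spatial features in deep pyramid levels, proposes the LiM-YOLO detector, which shifts the detection head from strides 8, 16, and 32 to strides 4, 8, and 16 through a Pyramid Level Shift Strategy and adds a group-normalized auxiliary projection module, surpassing larger state-of-the-art detectors with 64.1% fewer parameters.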

Comments 16 pages, 6 figures, 9 tables

Abstract

General-purpose object detectors face fundamental structural limitations when applied to ship detection in satellite imagery, where the ship scale distribution is concentrated at small sizes and high aspect ratios. In conventional You Only Look Once architectures, the deepest feature pyramid level (stride 32) compresses narrow vessels into sub-pixel representations, causing severe spatial feature dilution and compromising accurate ship boundary regression. We propose Less is More YOLO, a streamlined detector built upon the extra-large variant of YOLOv9, to address these domain-specific structural conflicts. From a statistical analysis of ship scale distributions across four major benchmarks (SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet), we introduce a Pyramid Level Shift Strategy that shifts the detection head from strides 8, 16, and 32 to strides 4, 8, and 16. This shift satisfies a spatial representability condition derived from the Nyquist-Shannon principle for the narrowest targets, while eliminating the computational redundancy of the deepest pyramid level. To further stabilize training on high-resolution satellite inputs, we incorporate a group-normalized auxiliary projection module that introduces Group Normalization into the projection path, mitigating gradient instability in memory-constrained micro-batch regimes. Validated on these four datasets, our detector attains an mAP_{50-95} of 0.600 with only 21.16 million parameters, a 64.1% reduction from the extra-large YOLOv9 baseline (58.99 million). Despite this compact size, our model surpasses state-of-the-art detectors up to three times larger, validating that a well-targeted pyramid level shift achieves a "Less is More" balance between accuracy and efficiency. The code is available at https://github.com/egshkim/LiM-YOLO.
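
Two pieces of arithmetic in the abstract can be checked quickly. The sketch below computes how many feature-map cells a narrow object spans at each stride and reproduces the reported parameter reduction from the two parameter counts. The 10-pixel ship width is a hypothetical value chosen for illustration, and reading the representability condition as "at least two cells across the narrowest dimension" is our simplification, not a formula taken from the paper.

```python
def cells_across(width_px, stride):
    """Number of feature-map cells spanned by an object `width_px` pixels wide."""
    return width_px / stride


def relative_reduction(baseline, compact):
    """Fractional reduction of `compact` relative to `baseline`."""
    return (baseline - compact) / baseline


ship_width_px = 10  # hypothetical narrow vessel, not a figure from the paper

# Coverage of that vessel at the original (8, 16, 32) and shifted (4, 8, 16) strides.
coverage = {stride: cells_across(ship_width_px, stride) for stride in (4, 8, 16, 32)}

# Parameter counts in millions, as stated in the abstract.
reduction = relative_reduction(58.99, 21.16)

for stride, cells in coverage.items():
    print(f"stride {stride:>2}: {cells:.2f} cells")
print(f"parameter reduction: {reduction:.1%}")
```

At stride 32 the toy vessel covers less than one cell, which is the sub-pixel compression the abstract describes, while stride 4 gives it more than two cells.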

2603.08413 2026-05-27 cs.LG cs.AI

Geometrically Constrained Outlier Synthesis

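Daniil Karzanov, Marcin Detyniecki

Affiliations: AXA AI Research; EPFL, Lausanne, Switzerland; Polish Academy of Science, IBS PAN, Warsaw, Poland

AI summary: Proposes Geometrically Constrained Outlier Synthesis (GCOS), which generates geometrically constrained virtual outliers in the hidden feature space and combines them with contrastive regularization to improve the robustness of image classifiers to out-of-distribution samples.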

Comments 19 pages, accepted to ICML 2026

Abstract

Deep neural networks for image classification often exhibit overconfidence on out-of-distribution (OOD) samples. To address this, we introduce Geometrically Constrained Outlier Synthesis (GCOS), a training-time regularization framework aimed at improving OOD robustness during inference. GCOS addresses a limitation of prior synthesis methods by generating virtual outliers in the hidden feature space that respect the learned manifold structure of in-distribution (ID) data. The synthesis proceeds in two stages: (i) a dominant-variance subspace extracted from the training features identifies geometrically informed, off-manifold directions; (ii) a conformally-inspired shell, defined by the empirical quantiles of a nonconformity score from a calibration set, adaptively controls the synthesis magnitude to produce boundary samples. The shell ensures that generated outliers are neither trivially detectable nor indistinguishable from in-distribution data, facilitating smoother learning of robust features. This is combined with a contrastive regularization objective that promotes separability of ID and OOD samples in a chosen score space, such as Mahalanobis or energy-based. Experiments demonstrate that GCOS outperforms state-of-the-art methods using standard energy-based inference on near-OOD benchmarks, defined as tasks where outliers share the same semantic domain as in-distribution data. As an exploratory extension, the framework naturally transitions to conformal OOD inference, which translates uncertainty scores into statistically valid p-values and enables thresholds with formal error guarantees, providing a pathway toward more predictable and reliable OOD detection.
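
The final step mentioned above, turning a score into a p-value, follows the standard conformal recipe, sketched here under the assumption that a higher nonconformity score means "more unusual" and that the calibration scores come from held-out in-distribution data. The scores and the 0.25 threshold are invented for illustration; the paper's own score function and calibration procedure are not reproduced.

```python
def conformal_p_value(test_score, calibration_scores):
    """Conformal p-value: share of calibration scores at least as extreme as the test score.

    The +1 terms count the test point itself, which is what keeps the
    p-value valid under exchangeability.
    """
    n = len(calibration_scores)
    at_least_as_extreme = sum(s >= test_score for s in calibration_scores)
    return (1 + at_least_as_extreme) / (n + 1)


def is_out_of_distribution(test_score, calibration_scores, alpha):
    """Flag a sample as OOD when its p-value does not exceed `alpha`."""
    return conformal_p_value(test_score, calibration_scores) <= alpha


# Toy nonconformity scores from an in-distribution calibration set (assumed values).
calibration = [0.1, 0.2, 0.3, 0.4]

print(conformal_p_value(0.35, calibration))
print(conformal_p_value(5.0, calibration))
print(is_out_of_distribution(5.0, calibration, alpha=0.25))
```

With only four calibration points the smallest attainable p-value is 1/5, so useful thresholds in practice need a much larger calibration set.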

2603.07211 2026-05-27 cs.LG

CompassDPO: Dynamics-Controlled Direct Preference Optimization for Robust Safety Alignment

Jilong Liu, Yonghui Yang, Pengyang Shao, Wenjian Tao, Hao Zhan, Haokai Ma, Wei Qin, Richang Hong

Affiliations: Hefei University of Technology; National University of Singapore

AI summary: Proposes CompassDPO, which uses the implicit DPO reward margin to control update direction and magnitude without an external reward model, improving robustness on PKU-SafeRLHF and other benchmarks.

Abstract

Direct Preference Optimization (DPO) has become a standard framework for safety alignment, but its reliance on pairwise preference updates makes training sensitive to imperfect supervision. Existing robust DPO methods often address this sensitivity through global loss corrections or external data-level interventions, while largely overlooking how unreliable comparisons distort batch-level optimization dynamics. We propose CompassDPO, a reward-free DPO framework that stabilizes preference optimization through dynamics control. Using the implicit DPO reward margin as a training-time compass, CompassDPO regulates sample influence along two complementary axes: update direction and update magnitude. For directional control, it applies sparse, budgeted, and warm-up delayed loss mixing to attenuate update components that conflict with the emerging preference direction. For magnitude control, it adaptively soft-winsorizes high-loss tail contributions, reducing tail dominance while preserving useful gradients from hard examples. Both mechanisms use only signals available during standard DPO training and require no external reward model or additional supervision. Experiments on PKU-SafeRLHF across four backbones and multiple out-of-distribution safety benchmarks show that CompassDPO consistently improves robustness over vanilla DPO and strong DPO-family baselines, especially under controlled label-flip noise. Code is available at https://anonymous.4open.science/r/CompassDPO-4D00
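The "implicit DPO reward margin" that the abstract uses as a compass is a standard DPO quantity, sketched below for a single preference pair. The log-probabilities and `beta` value are hypothetical. The direction and magnitude controls of CompassDPO itself are not specified in the abstract and are not reproduced here.

```python
import math

def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Implicit reward margin: beta * (chosen log-ratio - rejected log-ratio)."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

def dpo_loss(margin):
    """Standard DPO loss -log(sigmoid(margin)), computed in a stable way."""
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical sequence log-probabilities under the policy and the reference.
m_agree = dpo_margin(-10.0, -14.0, -12.0, -12.0)     # policy agrees with the label
m_conflict = dpo_margin(-14.0, -10.0, -12.0, -12.0)  # policy disagrees with the label
```

A strongly negative margin means the policy currently prefers the response labeled as rejected, which is the kind of pair that a noisy label can produce and that contributes the largest loss.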

2603.03711 2026-05-27 cs.CV

LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing

LDP-Slicing:通过随机位平面切片实现图像的本地差分隐私

Yuanming Cao, Chengqi Li, Wenbo He

发表机构 * McMaster University(麦斯特大学)

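AI summary: Proposes the LDP-Slicing framework, which decomposes pixel values into binary bit-planes and applies a local differential privacy mechanism, combined with a perceptual obfuscation module and a privacy budget allocation strategy, satisfying rigorous pixel-level ε-LDP while keeping images highly useful for downstream tasks.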

详情
AI中文摘要

本地差分隐私(LDP)是隐私保护机器学习的黄金标准信任模型,通过在数据源处保证隐私。然而,由于像素空间的高维性,其在图像数据上的应用长期以来被认为不切实际。典型的LDP机制设计用于低维数据,当应用于高维像素空间时会导致严重的效用退化。本文证明这种效用损失并非LDP固有的,而是源于将其应用于不适当的数据表示。我们引入了LDP-Slicing,一个轻量级、无需训练的框架,解决了这种领域不匹配问题。我们的关键见解是将像素值分解为一系列二进制位平面。这种转换使我们能够直接将LDP机制应用于位级表示。为了进一步加强隐私并保持效用,我们集成了一个感知混淆模块,减轻人类可感知的泄漏,以及一个基于优化的隐私预算分配策略。该流程满足严格的像素级ε-LDP,同时生成对下游任务保持高效用的图像。在人脸识别和图像分类上的大量实验表明,在可比的隐私预算下,LDP-Slicing优于现有的DP/LDP基线,且计算开销可忽略不计。

英文摘要

Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning because it guarantees privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP but stems from applying it to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.
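The bit-plane idea can be sketched as follows, assuming 8-bit grayscale pixels and binary randomized response as the per-bit LDP mechanism. The uniform split of the per-pixel budget across the eight planes is an assumption made for illustration. The paper describes an optimized allocation and a perceptual obfuscation module that are not reproduced here, and all function names are hypothetical.

```python
import numpy as np

def to_bit_planes(img):
    """Split a uint8 image into 8 binary planes, most significant bit first."""
    shifts = np.arange(7, -1, -1, dtype=np.uint8)
    return (img[None, ...] >> shifts.reshape(-1, *([1] * img.ndim))) & 1

def from_bit_planes(planes):
    """Inverse of to_bit_planes: recombine 8 binary planes into uint8 pixels."""
    weights = (1 << np.arange(7, -1, -1)).reshape(-1, *([1] * (planes.ndim - 1)))
    return (planes.astype(np.int64) * weights).sum(axis=0).astype(np.uint8)

def randomized_response(bits, eps, rng):
    """Keep each bit with probability e^eps / (1 + e^eps), otherwise flip it."""
    keep = rng.random(bits.shape) < np.exp(eps) / (1.0 + np.exp(eps))
    return np.where(keep, bits, 1 - bits)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)  # toy image
planes = to_bit_planes(img)

# Assumed uniform split: by sequential composition the per-pixel budget is
# the sum of the per-plane budgets (here 8 planes x 1.0 = 8.0).
eps_per_plane = np.full(8, 1.0)
noisy = np.stack([randomized_response(p, e, rng)
                  for p, e in zip(planes, eps_per_plane)])
private_img = from_bit_planes(noisy)
```

Flipping a high-order bit changes a pixel far more than flipping a low-order bit, which is why a non-uniform budget across planes is a natural thing to optimize.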

2602.13626 2026-05-27 cs.LG

Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

基准泄露陷阱:我们能信任基于LLM的推荐吗?

Mingqiao Zhang, Qiyao Peng, Yinghui Wang, Hongtao Liu, Yumeng Wang

发表机构 * Nanjing University(南京大学) Tianjin University(天津大学) Beijing Institute of Control and Electronic Technology(北京控制与电子技术研究所)

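AI summary: Identifies and studies benchmark data leakage in LLM-based recommender systems, simulating several leakage scenarios to show how leakage distorts performance evaluation.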

详情
AI中文摘要

大语言模型(LLMs)在推荐系统中的广泛应用对评估可靠性提出了严峻挑战。本文识别并研究了一个此前被忽视的问题:基于LLM的推荐中的基准数据泄露。当LLMs在预训练或微调过程中暴露于并可能记忆基准数据集时,就会发生这种现象,导致性能指标被人为夸大,无法反映模型真实性能。为验证这一现象,我们通过在战略混合语料库(包括来自域内和域外的用户-物品交互)上对基础模型进行持续预训练,模拟了多种数据泄露场景。我们的实验揭示了数据泄露的双重效应:当泄露数据与领域相关时,会导致显著但虚假的性能提升,误导性地夸大模型能力;相反,与领域无关的泄露通常会降低推荐准确性,突显了这种污染的复杂性和偶然性。我们的发现表明,数据泄露是基于LLM的推荐中一个关键但此前未被考虑的因素,可能影响模型的真实性能。我们在https://github.com/yusba1/LLMRec-Data-Leakage发布了代码。

英文摘要

The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. This phenomenon occurs when LLMs are exposed to and potentially memorize benchmark datasets during pre-training or fine-tuning, leading to artificially inflated performance metrics that fail to reflect true model performance. To validate this phenomenon, we simulate diverse data leakage scenarios by conducting continued pre-training of foundation models on strategically blended corpora, which include user-item interactions from both in-domain and out-of-domain sources. Our experiments reveal a dual effect of data leakage: when the leaked data is domain-relevant, it induces substantial but spurious performance gains, misleadingly exaggerating the model's capability. In contrast, domain-irrelevant leakage typically degrades recommendation accuracy, highlighting the complex and contingent nature of this contamination. Our findings show that data leakage is a critical, previously unaccounted-for factor in LLM-based recommendation that can distort measured model performance. We release our code at https://github.com/yusba1/LLMRec-Data-Leakage.

2603.03585 2026-05-27 cs.CL cs.AI

Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

Belief-Sim:迈向信念驱动的人口统计错误信息易感性模拟

Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas

发表机构 * University of Michigan - Ann Arbor(密歇根大学安娜堡分校) Texas State University(德克萨斯州立大学)

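AI summary: Proposes the BeliefSim framework, which builds demographic belief profiles from psychology-informed taxonomies and survey priors and, through prompt-based conditioning and post-training adaptation, simulates demographic misinformation susceptibility from beliefs, with alignment of up to 92%.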

Comments Paper Under Review

详情
AI中文摘要

错误信息是一种日益严重的社会威胁,由于潜在信念的差异,不同人口群体对错误信息的易感性各不相同。随着大型语言模型(LLM)越来越多地被用于模拟人类行为,我们研究它们是否能够模拟人口统计错误信息易感性,将信念视为主要驱动因素。我们引入BeliefSim,一个模拟框架,利用心理学信息错误信息分类法和调查先验构建人口信念档案。我们研究了基于提示的条件化和后训练适应,并使用以下方法进行了多方面的评估:(i)易感性对齐和(ii)反事实人口敏感性。在两个数据集和建模策略中,我们表明信念为模拟错误信息易感性提供了强大的先验,对齐度高达92%。

英文摘要

Misinformation is a growing societal threat, and susceptibility to misinformative claims varies across demographic groups due to differences in underlying beliefs. As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor. We introduce BeliefSim, a simulation framework that constructs demographic belief profiles using psychology-informed misinformation taxonomies and survey priors. We study prompt-based conditioning and post-training adaptation, and conduct a multi-fold evaluation using: (i) susceptibility alignment and (ii) counterfactual demographic sensitivity. Across both datasets and modeling strategies, we show that beliefs provide a strong prior for simulating misinformation susceptibility, with alignment up to 92%.

2603.03194 2026-05-27 cs.CL cs.SE

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

BeyondSWE:当前代码代理能否超越单仓库错误修复?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen

发表机构 * GitHub

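AI summary: Proposes the BeyondSWE benchmark, which evaluates code agents on complex software engineering tasks such as cross-repository issue resolution, domain-specific issue resolution, dependency migration, and document-to-repository generation, and finds that existing agents still fall well short at using external information to make precise code changes.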

Comments Benchmark: https://huggingface.co/datasets/AweAI-Team/BeyondSWE. Repo: https://github.com/AweAI-Team/BeyondSWE. Scaffold: https://github.com/AweAI-Team/AweAgent

详情
AI中文摘要

当前的代码代理基准主要评估单个目标仓库内的局部问题解决能力,而许多需要外部知识或更广泛仓库级变更的软件工程任务仍未得到充分测试。我们引入了BeyondSWE,这是一个包含500个实例的基准测试,来自246个真实世界的GitHub仓库,用于评估超越单仓库错误修复的代码代理。BeyondSWE涵盖了四种代表性场景:跨仓库问题解决、领域特定问题解决、依赖驱动的迁移以及文档到仓库的生成,涵盖了更广泛的知识范围和解决范围。我们的评估显示,BeyondSWE远未饱和:基于OpenHands的最佳代理达到了46.12的平均分数,而使用GPT-5.4(xhigh)的最强Codex harness在搜索感知提示下达到了56.65。为了研究外部信息访问是否能缩小这一差距,我们使用SearchSWE作为搜索增强编码的受控诊断基线。搜索访问改善了大多数模型,并对某些任务有显著帮助,但收益仍然有限且不均衡,表明当前代理仍然难以将检索到的信息转化为精确、版本兼容且局部可操作的代码更改。这些结果表明,深度编码搜索仍然是一个开放问题:进展需要代理能够可靠地将外部证据与仓库局部推理和基于执行的验证结合起来。

英文摘要

Current code-agent benchmarks primarily evaluate localized issue resolution within a single target repository, leaving under-tested many software engineering tasks that require external knowledge or broader repository-level changes. We introduce BeyondSWE, a 500-instance benchmark drawn from 246 real-world GitHub repositories to evaluate code agents beyond single-repository bug fixing. BeyondSWE covers four representative settings: cross-repository issue resolution, domain-specific issue resolution, dependency-driven migration, and document-to-repository generation, spanning both broader knowledge scope and broader resolution scope. Our evaluation shows that BeyondSWE remains far from saturated: the best OpenHands-based agent reaches 46.12 average score, while the strongest Codex harness with GPT-5.4 (xhigh) reaches 56.65 under a search-aware prompt. To study whether external information access closes this gap, we use SearchSWE as a controlled diagnostic baseline for search-augmented coding. Search access improves most models and substantially helps some tasks, but the gains remain limited and uneven, showing that current agents still struggle to convert retrieved information into precise, version-compatible, and locally actionable code changes. These results suggest that deep search for coding remains an open problem: progress requires agents that can reliably combine external evidence with repository-local reasoning and execution-based verification.

2601.09001 2026-05-27 cs.CL

Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM

熵哨兵:基于STEM解码熵迹的连续LLM准确性监控

Pedro Memoli Buffa, Luciano Del Corro

发表机构 * Departamento de Matematica, FCEyN Universidad de Buenos Aires(数学系,布宜诺斯艾利斯大学) ELIAS Lab, Departamento de Ingeniería Universidad de San Andres(ELIAS实验室,圣安德烈斯大学)

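AI summary: Proposes using output-entropy traces as an inference-time signal: a lightweight classifier predicts instance correctness, and the predictions are aggregated into domain-level accuracy estimates. The approach is validated for monitoring and data acquisition on STEM reasoning benchmarks.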

详情
AI中文摘要

部署LLM引发两个耦合挑战:(1)监控——在流量和领域漂移时估计模型表现不佳的位置;(2)改进——优先获取数据以缩小最大的性能差距。我们测试推理时信号能否在领域偏移下估计切片级准确性。对于每个响应,我们从最终层下一个词元概率(来自top-$k$ logprobs)计算输出熵迹,并用不同统计量汇总。一个轻量分类器预测实例正确性,平均预测概率得到领域级准确性估计。我们在十个STEM推理基准上进行了详尽的训练/测试组合($k\in\{1,2,3,4\}$;所有$\binom{10}{k}$组合),在来自六个系列(3B--20B)的九个LLM上评估不同分类器模型和特征。估计值通常跟踪保留的基准准确性,并且多个模型显示领域近乎单调的排序,为输出熵迹作为可扩展监控和针对性数据采集的可访问信号提供了证据。

英文摘要

Deploying LLMs raises two coupled challenges: (1) monitoring, i.e., estimating where a model underperforms as traffic and domains drift, and (2) improvement, i.e., prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-$k$ logprobs) and summarize it with different statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions ($k\in\{1,2,3,4\}$; all $\binom{10}{k}$ combinations), using different classifier models and features across nine LLMs from six families (3B to 20B parameters). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains, providing evidence that output-entropy profiles are an accessible signal for scalable monitoring and for targeted data acquisition.
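The pipeline in this abstract (per-token entropy from top-k logprobs, summary statistics, then averaging predicted correctness) can be sketched as below. Renormalizing the top-k mass before computing entropy and the particular statistics chosen are assumptions. The classifier is left out, so its outputs are represented by hypothetical probabilities.

```python
import numpy as np

def token_entropy(topk_logprobs):
    """Entropy (nats) of the renormalized top-k distribution at one decoding step."""
    lp = np.asarray(topk_logprobs, dtype=float)
    p = np.exp(lp - lp.max())
    p /= p.sum()
    nz = p[p > 0]  # skip entries that underflowed to zero
    return float(-(nz * np.log(nz)).sum())

def entropy_profile_features(response_logprobs):
    """Summarize a per-token entropy trace with a few simple statistics."""
    h = np.array([token_entropy(step) for step in response_logprobs])
    return {"mean": float(h.mean()), "max": float(h.max()), "std": float(h.std())}

def domain_accuracy_estimate(predicted_correct_probs):
    """Average instance-level correctness probabilities into a slice-level estimate."""
    return float(np.mean(predicted_correct_probs))

trace = [
    [np.log(0.25)] * 4,           # maximally uncertain step over 4 candidates
    [0.0, -20.0, -20.0, -20.0],   # near-deterministic step
]
features = entropy_profile_features(trace)
estimate = domain_accuracy_estimate([0.9, 0.5, 0.7])  # hypothetical classifier outputs
```

The features would feed a lightweight classifier trained on responses with known correctness, and the mean of its predicted probabilities over a slice of traffic serves as the accuracy estimate for that slice.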

2603.01800 2026-05-27 cs.LG cs.AI stat.ML stat.OT

Phase-Type Variational Autoencoders for Heavy-Tailed Data

Phase-Type变分自编码器用于重尾数据

Abdelhakim Ziani, András Horváth, Paolo Ballarini

发表机构 * Université Paris Saclay, Lab. MICS, CentraleSupélec, Gif-sur-Yvette, France(巴黎萨克雷大学,MICS实验室,CentraleSupélec,法国吉夫-sur-依夫)

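AI summary: Proposes the Phase-Type Variational Autoencoder (PH-VAE), which models the decoder distribution as a latent-conditioned phase-type distribution (the absorption time of a continuous-time Markov chain), adapts flexibly to heavy-tailed behavior, and outperforms Gaussian, Student-t, and extreme-value VAE decoders on synthetic and real-world benchmarks.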

详情
AI中文摘要

重尾分布在现实世界数据中无处不在,其中罕见但极端的事件主导了风险和变异性。然而,标准变分自编码器(VAE)采用简单的解码器分布,如高斯分布,无法捕捉重尾行为,而现有的重尾感知扩展仍然局限于预定义的参数族,其尾部行为是预先固定的。我们提出了Phase-Type变分自编码器(PH-VAE),其解码器分布是一个潜在条件的Phase-Type(PH)分布,定义为连续时间马尔可夫链(CTMC)的吸收时间。这种公式组合了多个指数时间尺度,产生了一个灵活且解析可处理的解码器,它直接从观测数据中调整其有限范围的尾部行为。在合成和真实世界基准上的实验表明,PH-VAE能够准确逼近各种重尾分布,在建模观测到的尾部行为和极端分位数方面显著优于基于高斯、Student-t和极值的VAE解码器。在多变量设置中,PH-VAE通过其共享的潜在表示捕捉了现实中的跨维度尾部依赖性。据我们所知,这是首次将Phase-Type分布整合到深度生成建模中的工作,桥接了应用概率论和表示学习。

英文摘要

Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions, such as Gaussian distributions, that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori. We propose the Phase-Type Variational Autoencoder (PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution, defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible and analytically tractable decoder that adapts its finite-range tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately approximates diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling observed tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.
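A phase-type distribution, as used for the decoder here, is the absorption time of a CTMC with initial phase distribution `alpha` and sub-generator matrix `T`. The sketch below computes its mean from the standard formula and also samples it by simulating the chain. The Erlang-2 parameters are a toy example chosen for illustration, and in PH-VAE these parameters would be produced by the decoder network, which is not shown.

```python
import numpy as np

def ph_mean(alpha, T):
    """Mean absorption time E[X] = -alpha @ inv(T) @ 1 of a phase-type distribution."""
    ones = np.ones(T.shape[0])
    return float(-alpha @ np.linalg.solve(T, ones))

def ph_sample(alpha, T, rng):
    """Simulate the CTMC until absorption and return the elapsed time."""
    n = T.shape[0]
    exit_rates = -T @ np.ones(n)         # rates from each phase into absorption
    state = rng.choice(n, p=alpha)
    t = 0.0
    while True:
        rate = -T[state, state]          # total rate of leaving the current phase
        t += rng.exponential(1.0 / rate)
        # Jump probabilities to the other phases; the last entry is absorption.
        probs = np.append(T[state] / rate, exit_rates[state] / rate)
        probs[state] = 0.0
        nxt = rng.choice(n + 1, p=probs)
        if nxt == n:
            return t
        state = nxt

# Toy example: Erlang-2 with rate 3 (two exponential phases in series).
alpha = np.array([1.0, 0.0])
T = np.array([[-3.0, 3.0],
              [0.0, -3.0]])
mean_exact = ph_mean(alpha, T)
rng = np.random.default_rng(0)
mean_mc = float(np.mean([ph_sample(alpha, T, rng) for _ in range(2000)]))
```

Mixing phases with very different rates is what lets a phase-type distribution mimic heavy tails over a finite range, even though its tail is eventually exponential.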

2602.22190 2026-05-27 cs.LG cs.AI cs.CL

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

GUI-Libra:训练原生GUI代理进行推理与行动——基于动作感知监督和部分可验证强化学习

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baolin Peng, Huan Zhang, Jianfeng Gao, Tong Zhang

发表机构 * UIUC(伊利诺伊大学香槟分校) Microsoft(微软) UNC-Chapel Hill(北卡罗来纳大学教堂山分校)

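AI summary: Proposes the GUI-Libra training recipe, which uses action-aware SFT and KL regularization in partially verifiable RL to address the reasoning-grounding conflict and partial verifiability in long-horizon GUI navigation, substantially improving step-wise accuracy and task completion.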

Comments 57 pages, 17 figures

详情
AI中文摘要

开源原生GUI代理在长程导航任务上仍落后于闭源系统。这一差距源于两个限制:缺乏高质量、动作对齐的推理数据,以及直接采用忽视GUI代理独特挑战的通用后训练流程。我们识别出这些流程中的两个基本问题:(i) 带有CoT推理的标准SFT常损害定位能力,(ii) 逐步RLVR式训练面临部分可验证性,即多个动作可能正确但仅有一个示范动作用于验证。这使得离线逐步指标成为在线任务成功的弱预测器。在本工作中,我们提出GUI-Libra,一种定制化训练方案以应对这些挑战。首先,为缓解动作对齐推理数据的稀缺性,我们引入数据构建和过滤流程,并发布精心整理的81K GUI推理数据集。其次,为调和推理与定位,我们提出动作感知SFT,混合推理后动作和直接动作数据,并重新加权token以强调动作和定位。第三,为在部分可验证性下稳定RL,我们识别出RLVR中KL正则化被忽视的重要性,并证明KL信任域对改善离线到在线可预测性至关重要;我们进一步引入成功自适应缩放以降低不可靠负梯度的权重。在多种Web和移动基准测试中,GUI-Libra一致地提升了步骤准确率和端到端任务完成率。我们的结果表明,精心设计的后训练和数据整理可以在无需昂贵在线数据收集的情况下,释放显著更强的任务解决能力。我们发布数据集、代码和模型,以促进对具备推理能力的GUI代理的数据高效后训练的进一步研究。

英文摘要

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.

2602.19206 2026-05-27 cs.CV

GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

GS-CLIP: 基于几何感知提示与协同视图表示学习的零样本3D异常检测

Zehao Deng, An Liu, Yan Wang

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院) Institute for AI Industry Research (AIR), Tsinghua University(清华大学人工智能产业研究院)

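AI summary: Proposes the GS-CLIP framework, which uses geometry-aware prompts and synergistic view representation learning to detect geometric anomalies in 3D point clouds in the zero-shot setting.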

Comments Accepted by CVPR 2026

详情
AI中文摘要

零样本3D异常检测是一项新兴任务,旨在无需目标训练数据的情况下检测目标数据集中的异常,这在样本稀缺和数据隐私受限的场景中尤为重要。当前方法通过将3D点云投影到2D表示来适配CLIP,但面临挑战:投影会固有地丢失一些几何细节,且依赖单一2D模态导致视觉理解不完整,限制了检测多样异常类型的能力。为解决这些局限,我们提出几何感知提示与协同视图表示学习(GS-CLIP)框架,通过两阶段学习使模型能够识别几何异常。第一阶段,我们动态生成嵌入3D几何先验的文本提示,这些提示包含由我们的几何缺陷蒸馏模块(GDDM)提炼的全局形状上下文和局部缺陷信息。第二阶段,我们引入协同视图表示学习架构,并行处理渲染图像和深度图像,随后通过协同精炼模块(SRM)融合两个流的特征,利用它们的互补优势。在四个大规模公共数据集上的全面实验结果表明,GS-CLIP在检测中取得了优越性能。代码可在 https://github.com/zhushengxinyue/GS-CLIP 获取。

英文摘要

Zero-shot 3D Anomaly Detection is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. Current methods adapt CLIP by projecting 3D point clouds into 2D representations, but they face two challenges: the projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In stage 2, we introduce a Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior performance in detection. Code is available at https://github.com/zhushengxinyue/GS-CLIP.

2602.21636 2026-05-27 cs.CV

Axial-Centric Cross-Plane Attention for 3D Medical Image Classification

轴向中心跨平面注意力用于3D医学图像分类

Doyoung Park, Jinsoo Kim, Lohendran Baskaran

发表机构 * National Heart Centre Singapore, Singapore(新加坡国家心脏中心) CVS.AI, National Heart Research Institute of Singapore, Singapore(CVS.AI、新加坡国家心脏研究院) Independent Researcher, Republic of Korea(韩国独立研究员)

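AI summary: Proposes an axial-centric cross-plane attention architecture that models dependencies between anatomical planes asymmetrically and outperforms existing 3D and multi-plane models on the MedMNIST3D benchmark.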

Comments Submitted to BMVC 2026

详情
AI中文摘要

(缩写版)临床医生通常通过检查多个解剖平面而非依赖体积视图来解释3D医学图像。在临床CT工作流中,轴向平面常作为主要诊断参考,而辅助平面提供互补空间上下文。然而,许多现有3D深度学习方法要么整体处理体积数据,要么对所有平面赋予相同重要性,未能反映这种不对称的轴向中心解释策略。为此,我们提出一种用于3D医学图像分类的轴向中心跨平面注意力架构,该架构建模解剖平面间的不对称依赖关系。该架构使用大规模轴向CT图像预训练的MedDINOv3作为冻结特征提取器,用于轴向、冠状和矢状平面。RICA块和平面内变换器编码器捕获平面特定的位置和上下文信息,而轴向中心跨平面变换器编码器选择性地以互补的辅助表示条件化轴向表示。在MedMNIST3D基准的六个数据集上的实验表明,所提方法在ACC和AUC上持续优于现有3D和多平面模型。轻量级变体AC-Tiny以显著更少的可训练参数实现了竞争性能,表明架构设计对性能提升的贡献大于模型规模增加。消融研究进一步验证了轴向中心查询、QKV分配、定向跨平面融合、无残差交叉注意力和分类头设计的重要性。切片级Grad-CAM可视化表明,模型在所有平面上识别出诊断相关区域。这些发现强调了将架构设计与临床解释工作流对齐对于稳健的3D医学图像分析的价值。

英文摘要

Abridged: Clinicians commonly interpret 3D medical images by examining multiple anatomical planes rather than relying on volumetric views. In clinical CT workflows, the axial plane often serves as the primary diagnostic reference, while the auxiliary planes provide complementary spatial context. However, many existing 3D deep learning approaches either process volumetric data holistically or assign equal importance to all planes, failing to reflect this asymmetric, axial-centric interpretation strategy. To address this, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that models asymmetric dependencies between anatomical planes. The architecture employs MedDINOv3, pretrained on large-scale axial CT images, as a frozen feature extractor for the axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information, while axial-centric cross-plane transformer encoders selectively condition axial representations on complementary auxiliary representations. Experiments on six datasets from the MedMNIST3D benchmark show that the proposed method consistently outperforms existing 3D and multi-plane models in ACC and AUC. A lightweight variant, AC-Tiny, achieves competitive performance with substantially fewer trainable parameters, suggesting that architectural design contributes more to performance gains than increased model scale. Ablation studies further validate the importance of axial-centric querying, QKV allocation, directional cross-plane fusion, residual-free cross-attention, and classification head design. Slice-level Grad-CAM visualizations demonstrate that the model identifies diagnostically relevant regions across all planes. These findings highlight the value of aligning architectural design with clinical interpretation workflows for robust 3D medical image analysis.
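One plausible reading of "axial-centric querying" with "residual-free cross-attention" is that axial tokens supply the queries while coronal and sagittal tokens supply keys and values, with no skip connection around the attention output. The single-head NumPy sketch below illustrates that reading only. The token counts, dimensions, random weights, and the concatenation of auxiliary planes are all assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axial_cross_attention(axial, auxiliary, Wq, Wk, Wv):
    """Single-head cross-attention in which axial tokens query auxiliary tokens."""
    q = axial @ Wq        # (n_axial, d)
    k = auxiliary @ Wk    # (n_aux, d)
    v = auxiliary @ Wv    # (n_aux, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # No residual connection: the output is built only from auxiliary values.
    return attn @ v, attn

rng = np.random.default_rng(0)
d = 16
axial = rng.normal(size=(5, d))      # hypothetical axial slice tokens
coronal = rng.normal(size=(6, d))    # hypothetical coronal slice tokens
sagittal = rng.normal(size=(6, d))   # hypothetical sagittal slice tokens
auxiliary = np.concatenate([coronal, sagittal], axis=0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
fused, attn = axial_cross_attention(axial, auxiliary, Wq, Wk, Wv)
```

The asymmetry is in the role assignment: the output keeps one row per axial token, so the auxiliary planes only contribute context and never act as queries.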

2510.07231 2026-05-27 cs.CL cs.AI

EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models

EconCausal: 面向大语言模型的上下文感知经济推理基准

Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park, Jihee Kim

发表机构 * Graduate School of Data Science, KAIST(韩国科学技术院数据科学研究生院) College of Business, KAIST(韩国科学技术院商学院) Data Science for Humanity Group, MPI-SP(马克斯·普朗克所际数据科学为人类集团) School of Computing, KAIST(韩国科学技术院计算学院) Division of Social Science, HKUST(香港科技大学社会科学系)

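AI summary: Proposes the EconCausal benchmark of 10,490 context-annotated causal triplets extracted from top-tier economics and finance journals, evaluating whether large language models can infer causal direction in a specified context and revise the judgment when the context changes.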

详情
AI中文摘要

社会经济因果效应高度依赖于制度和环境背景。相同的干预措施在不同监管制度、市场条件、时间段或人群中可能产生不同甚至相反的效果。这对大语言模型(LLM)在决策支持角色中提出了挑战:它们能否在指定上下文中推断因果效应的方向,并在上下文变化时修正该判断?为此,我们引入了EconCausal,这是一个大规模基准,包含从顶级经济和金融期刊的2,595项高质量实证研究中提取的10,490个上下文标注因果三元组,通过严格的四阶段流程构建,包括多轮共识、上下文细化和多批评者过滤。跨模型实验表明,LLM往往无法根据上下文调整其预测。虽然顶级模型在固定、显式上下文中达到88%的准确率,但在需要跨上下文修正符号的情况下,准确率下降32.6个百分点(从73.9%降至41.3%),一旦引入误导性的符号证据,准确率降至50%以下。模型还过度承诺于方向性(+/-)符号,仅在13.8%的情况下识别出零效应,且在这些类别上校准不良。数据集和基准公开于 https://anonymous.4open.science/r/econcausal-benchmark-6F12。

英文摘要

Socio-economic causal effects depend heavily on their institutional and environmental contexts. The same intervention can produce different, even opposite, effects across regulatory regimes, market conditions, time periods, or populations. This poses a challenge for large language models (LLMs) in decision-support roles: can they infer the direction of a causal effect under a specified context, and revise that judgment when the context changes? To address this, we introduce EconCausal, a large-scale benchmark of 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies in top-tier economics and finance journals, constructed through a rigorous four-stage pipeline with multi-run consensus, context refinement, and multi-critic filtering. Across models, LLMs often fail to condition their predictions on context. While top models reach 88% accuracy in fixed, explicit contexts, accuracy falls by 32.6 pp on cases that require revising the sign across contexts (73.9% to 41.3%), and drops below 50% once misleading signed evidence is introduced. Models also over-commit to directional (+/-) signs, recognizing null effects only 13.8% of the time while remaining poorly calibrated on these categories. The dataset and benchmark are publicly available at https://anonymous.4open.science/r/econcausal-benchmark-6F12.

2602.18907 2026-05-27 cs.LG cs.CV cs.CY

DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

DeepInterestGR: 利用多模态大语言模型挖掘深度多兴趣用于生成式推荐

Yangchen Zeng, Zhenyu Yu, Zhiyuan Hu, Wenxin Zhang, Jinze Wang, Rongfeng Guo

发表机构 * Southeast University(东南大学)

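AI summary: Proposes the DeepInterestGR framework, which addresses the shallow-interest problem in generative recommendation through multi-LLM interest mining, reward-labeled deep interests, and interest-enhanced item discretization, substantially improving recommendation performance on three Amazon datasets.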

详情
AI中文摘要

我们介绍了DeepInterestGR,一个将深度兴趣挖掘集成到生成式推荐流程中的新颖框架。这解决了“浅层兴趣”问题——现有的生成方法依赖于表面文本特征,未能捕捉潜在的用户动机,限制了个性化深度和推荐可解释性。我们的方法通过结构化推理提示利用多LLM兴趣挖掘(MLIM),通过奖励标记深度兴趣(RLDI)进行质量控制,通过RQ-VAE进行兴趣增强物品离散化(IEID),并结合由兴趣感知奖励引导的两阶段SFT-GRPO训练流程。我们在三个Amazon Review基准(Beauty、Sports、Instruments)上验证了DeepInterestGR,与包括SASRec、BERT4Rec、TIGER、LC-Rec和S-DPO在内的14个最先进基线进行了比较。我们的方法在HR@10上实现了5.8%-8.3%的相对改进,在NDCG@10上实现了7.7%-9.9%的相对改进,跨领域泛化增益达到+24.8%。这些结果证明,融入深度语义兴趣可以有效改进基于SID的生成式推荐。

英文摘要

We introduce DeepInterestGR, a novel framework that integrates deep interest mining into the generative recommendation pipeline. This addresses the "Shallow Interest" problem: existing generative methods rely on surface-level textual features and fail to capture latent user motivations, limiting personalization depth and recommendation interpretability. Our approach leverages Multi-LLM Interest Mining (MLIM) via structured reasoning prompting, Reward-Labeled Deep Interest (RLDI) for quality control, and Interest-Enhanced Item Discretization (IEID) via RQ-VAE, combined with a two-stage SFT-GRPO training pipeline guided by an Interest-Aware Reward. We validate DeepInterestGR on three Amazon Review benchmarks (Beauty, Sports, Instruments), comparing against 14 state-of-the-art baselines including SASRec, BERT4Rec, TIGER, LC-Rec, and S-DPO. Our method achieves 5.8%-8.3% relative improvements on HR@10 and 7.7%-9.9% on NDCG@10 over the strongest baseline, with cross-domain generalization gains of +24.8%. These results provide evidence that incorporating deep semantic interests can effectively improve SID-based generative recommendation.
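The reported metrics can be made concrete with a short sketch of HR@K, NDCG@K, and relative improvement. It assumes a single held-out target item per user, which is the usual setup for next-item recommendation but is not stated in the abstract. The item IDs and scores below are made-up toy values, not results from the paper.

```python
import math

def hit_rate_at_k(ranked_items, target, k=10):
    """1 if the target item appears in the top-k list, else 0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    top = ranked_items[:k]
    if target not in top:
        return 0.0
    rank = top.index(target) + 1
    return 1.0 / math.log2(rank + 1)

def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.08 means 8%."""
    return (new - baseline) / baseline

ranking = ["item_a", "item_b", "item_c"]   # toy ranked list for one user
hr = hit_rate_at_k(ranking, "item_b")
ndcg = ndcg_at_k(ranking, "item_b")
gain = relative_improvement(0.054, 0.050)  # toy scores, not paper results
```

Both metrics are averaged over users in practice. NDCG rewards placing the target near the top, whereas HR only checks membership in the top-k list.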

2602.17822 2026-05-27 cs.RO

Evolution of Safety Requirements in Industrial Robotics: Comparative Analysis of ISO 10218-1/2 (2011 vs. 2025) and Integration of ISO/TS 15066

工业机器人安全要求的演进:ISO 10218-1/2(2011 与 2025 版)比较分析及 ISO/TS 15066 的整合

Daniel Hartmann, Kristýna Hamříková, Aleš Vysocký, Vendula Laciok, Aleš Bernatík

发表机构 * Faculty of Mechanical Engineering, VSB—Technical University of Ostrava(机械工程学院,奥斯特拉瓦技术大学) Faculty of Safety Engineering, VSB—Technical University of Ostrava(安全工程学院,奥斯特拉瓦技术大学)

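AI summary: Compares the ISO 10218:2011 and ISO 10218:2025 standards to analyze how industrial robot safety requirements have evolved in functional safety, cybersecurity, robot classification, and collaborative applications, and how the integration of ISO/TS 15066 establishes a comprehensive framework for the design and operation of modern robotic systems.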

详情
AI中文摘要

工业机器人已成为大型制造企业不可或缺的组成部分。同时,协作机器人日益突出,引入了人机交互的新范式。这些进步促使安全标准进行全面修订,特别是纳入了网络安全和防止未经授权访问网络化机器人系统的要求。本文对 ISO 10218:2011 和 ISO 10218:2025 标准进行了比较分析,考察了其结构、术语、技术要求和附录的演进。分析揭示了功能安全和网络安全方面的显著扩展,引入了机器人和协作应用的新分类,以及技术规范 ISO/TS 15066 的规范性整合。因此,新版本综合了机械、功能和数字安全要求,为现代机器人系统的设计和运行建立了全面框架。

英文摘要

Industrial robotics has established itself as an integral component of large-scale manufacturing enterprises. Simultaneously, collaborative robotics is gaining prominence, introducing novel paradigms of human-machine interaction. These advancements have necessitated a comprehensive revision of safety standards, specifically incorporating requirements for cybersecurity and protection against unauthorized access in networked robotic systems. This article presents a comparative analysis of the ISO 10218:2011 and ISO 10218:2025 standards, examining the evolution of their structure, terminology, technical requirements, and annexes. The analysis reveals significant expansions in functional safety and cybersecurity, the introduction of new classifications for robots and collaborative applications, and the normative integration of the technical specification ISO/TS 15066. Consequently, the new edition synthesizes mechanical, functional, and digital safety requirements, establishing a comprehensive framework for the design and operation of modern robotic systems.

2602.17605 2026-05-27 cs.CV cs.AI cs.CY cs.LG

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

在飞行中主动适应:基于相关性的在线元学习与潜在概念用于地理空间发现

Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly

发表机构 * University of Michigan, Ann Arbor, MI, USA(密歇根大学,安阿伯分校) Washington University in St. Louis, St. Louis, MO, USA(华盛顿大学圣路易斯分校)

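AI summary: Proposes a unified geospatial discovery framework that combines active learning, online meta-learning, and concept-guided reasoning, using concept-weighted uncertainty sampling and relevance-aware meta-batch formation to discover hidden targets efficiently under limited data and dynamic environments.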

详情
AI中文摘要

在环境监测中,数据收集通常成本高昂、稀疏且受紧急公共卫生需求影响。这对于致癌的PFAS(全氟和多氟烷基物质)污染尤其如此,与领域专家和环境组织的讨论强调需要在有限的采样预算下战略性地识别高风险、观测不足的区域。更广泛地说,在灾害响应和公共卫生环境中也出现了类似的挑战,动态环境使得从有限的地面实况中高效发现隐藏目标变得至关重要。然而,稀疏且有偏差的地理空间标签限制了现有基于学习方法(如强化学习)的适用性。为了解决这个问题,我们提出了一个统一的地理空间发现框架,该框架集成了主动学习、在线元学习和概念引导推理。我们的方法引入了两个基于共享的*概念相关性*概念的关键创新,该概念捕捉领域特定因素如何影响目标存在:一个*概念加权不确定性采样策略*,其中不确定性通过从现成概念(如土地覆盖和源距离)学习到的相关性进行调节;以及一个*相关性感知元批次形成策略*,该策略在在线元更新期间促进语义多样性,提高动态环境中的泛化能力。我们在PFAS污染发现任务上评估了我们的框架,这是一个受真实世界启发的环境监测任务,展示了在有限数据和变化条件下鲁棒的目标发现能力。

英文摘要

In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs. This is particularly true for cancer-causing PFAS (per- and polyfluoroalkyl substances) contamination, where discussions with domain experts and environmental organizations highlight the need to strategically identify high-risk, under-observed regions under tight sampling budgets. More broadly, similar challenges arise in disaster response and public health settings, where dynamic environments make it essential to efficiently uncover hidden targets from limited ground truth. Yet sparse and biased geospatial labels limit the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, capturing how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance from readily available concepts such as land cover and source proximity; and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online meta-updates, improving generalization in dynamic environments. We evaluate our framework on PFAS contamination discovery as a real-world-inspired environmental monitoring task, demonstrating robust target discovery under limited data and changing conditions.
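Concept-weighted uncertainty sampling can be sketched as an acquisition score that scales predictive uncertainty by a relevance weight. The multiplicative form, the use of binary entropy as the uncertainty measure, and all numbers below are assumptions for illustration. The abstract only states that uncertainty is modulated by learned relevance.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (nats) of a Bernoulli prediction, clipped for numerical safety."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def select_sites(pred_prob, relevance, budget):
    """Rank candidate sites by uncertainty scaled by concept relevance."""
    score = binary_entropy(pred_prob) * np.asarray(relevance, dtype=float)
    return np.argsort(-score)[:budget]

# Hypothetical predicted contamination probabilities and relevance weights
# (relevance could come from concepts such as land cover or source proximity).
pred_prob = [0.50, 0.50, 0.99, 0.60]
relevance = [0.10, 0.90, 1.00, 0.80]
chosen = select_sites(pred_prob, relevance, budget=2)
```

Sites 0 and 1 are equally uncertain, but only site 1 is selected because its concepts are judged relevant, which is the intended effect of the weighting.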

2602.17443 2026-05-27 cs.CL

AIDG: A Formal Decomposition of Information Extraction and Containment Asymmetries in Multi-Turn LLM Dialogue

AIDG:多轮LLM对话中信息提取与包含不对称性的形式化分解

Adib Sakhawat, Fardeen Sadab, Rakin Shahriar

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

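AI summary: Proposes the AIDG framework, which formalizes multi-turn adversarial dialogue as a partially observable stochastic game, decomposes the extraction and containment roles, and reveals asymmetries such as tightly clustered defensive performance versus widely dispersed offensive performance.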

Comments 20 pages, 5 figures, 13 tables. Includes appendix and supplementary materials

详情
AI中文摘要

多轮LLM评估通常报告为单一胜率标量,混淆了不同能力。我们引入AIDG(对抗信息推断游戏),将多轮对抗对话形式化为双人部分可观察随机博弈(POSG),并沿着搜索者(提取)和持有者(包含)角色分解性能。该分解隔离了三种失败模式:合作先验泄漏、约束推理干扰和低效假设空间遍历。在六个前沿LLM的439场游戏中,防御性能紧密聚类(sigma = 1.9 ELO),而攻击性能差异显著(sigma = 53.3 ELO);确认框架使提取几率比无信息推断高7.75倍(p < 0.00001);约束违规占推断失败的41.3%,与规模无关(rho = 0.0)。我们将包含优于提取的差距定位为局部可解的防御决策与全局耦合的攻击规划的可测量结果,而非令人惊讶的发现,并使用该分解将差距归因于每个模型。所有设计选择,包括轮次衰减加权和Bradley-Terry评级模型,均源自明确假设。

English abstract

Multi-turn LLM evaluation is typically reported as a single win-rate scalar, conflating distinct capabilities. We introduce AIDG (Adversarial Information Deduction Game), formalizing multi-turn adversarial dialogue as a two-player partially observable stochastic game (POSG) and decomposing performance along Seeker (extraction) and Holder (containment) roles. The decomposition isolates three failure modes: cooperative-prior leakage, constraint-reasoning interference, and inefficient hypothesis-space traversal. Across 439 games over six frontier LLMs, defensive performance is tightly clustered (sigma = 1.9 ELO) while offensive performance varies substantially (sigma = 53.3 ELO); confirmation framing increases extraction odds 7.75x over uninformed deduction (p < 0.00001); and constraint violations account for 41.3% of deductive failures, uncorrelated with scale (rho = 0.0). We position the containment-over-extraction gap not as a surprising finding but as a measurable consequence of locally resolvable defensive decisions versus globally coupled offensive planning, and use the decomposition to attribute the gap per model. All design choices, including turn-decay weighting and the Bradley-Terry rating model, are derived from explicit assumptions.
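The abstract names the Bradley-Terry rating model without giving details. As a rough illustration of how such ratings can be fitted from pairwise game outcomes, the sketch below uses the standard minorization-maximization (MM) update. It is not the paper's code: the function names are hypothetical, the turn-decay weighting is omitted, and the conversion to an Elo-like scale via `400 * log10` is an assumption.

```python
import math

def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[i][j] = number of games in which player i beat player j.
    Assumes every player has at least one win and one loss.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p

def to_elo_scale(p):
    """Map strengths to an Elo-like scale centered at 0 (assumed convention)."""
    logs = [400.0 * math.log10(x) for x in p]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]

# Hypothetical round-robin results for three players (10 games per pair).
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
ratings = to_elo_scale(strengths)
```

Under this model the probability that player i beats player j is `p[i] / (p[i] + p[j])`, so a spread of ratings such as the reported sigma values summarizes how differently the models perform in a given role.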

2510.03352 2026-05-27 cs.CV cs.AI cs.LG

Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

基于侧信息的推理时搜索用于扩散模型图像重建

Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

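Affiliation * Department of Electrical and Computer Engineering, Texas A&M University

AI summary: Proposes a plug-and-play, training-free inference-time search framework that incorporates side information into existing diffusion-based inverse problem solvers, significantly improving reconstruction quality.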

Details
AI Chinese abstract

扩散模型已被用作解决逆问题的先验。然而,现有方法通常忽略了能够显著提高重建质量的侧信息,尤其是在严重病态设置中。在这项工作中,我们提出了一种新颖的框架,通过推理时搜索将侧信息以即插即用、无需训练的方式融入现有的基于扩散模型的逆问题求解器。通过在多种逆问题(包括图像修复、超分辨率和几种去模糊任务)以及多种基于扩散模型的逆问题求解器(DPS、DAPS和MPGD)上的大量实验,我们表明,用我们的框架增强每个求解器,其重建质量始终优于相应的原始方法。为了展示我们方法的通用性,我们考虑了多种形式的侧信息,包括参考图像、文本描述和解剖学MRI扫描。代码可在该仓库中获取:https://github.com/mahdi-farahbakhsh/DISS。

English abstract

Diffusion models have been used as priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel framework that incorporates side information into existing diffusion-based inverse problem solvers via inference-time search, in a plug-and-play, training-free manner. Through extensive experiments across a range of inverse problems, including inpainting, super-resolution, and several deblurring tasks, and across multiple diffusion-based inverse problem solvers (DPS, DAPS, and MPGD), we show that augmenting each solver with our framework consistently improves the quality of the reconstructions over the corresponding original method. To demonstrate the generality of our approach, we consider diverse forms of side information, including reference images, textual descriptions, and anatomical MRI scans. The code is available at https://github.com/mahdi-farahbakhsh/DISS.
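The abstract does not describe the search procedure itself. One simple form of inference-time search is best-of-N selection, where several candidate reconstructions are drawn and the one that best agrees with the side information is kept. The sketch below illustrates only that generic idea, not the DISS method: `sample_fn` stands in for a diffusion-based solver, `reward_fn` for a side-information score, and the toy vectors stand in for images.

```python
import random

def best_of_n(sample_fn, reward_fn, n, seed=0):
    """Draw n candidates and return the one with the highest reward."""
    rng = random.Random(seed)
    best, best_reward = None, float("-inf")
    for _ in range(n):
        candidate = sample_fn(rng)      # stand-in for one solver run
        reward = reward_fn(candidate)   # agreement with side information
        if reward > best_reward:
            best, best_reward = candidate, reward
    return best, best_reward

# Toy stand-ins (assumptions): a "reconstruction" is a 4-vector and the
# side information is a reference vector the reconstruction should resemble.
reference = [0.5, 0.5, 0.5, 0.5]

def toy_sampler(rng):
    return [rng.uniform(0.0, 1.0) for _ in range(4)]

def toy_reward(candidate):
    # Negative squared distance to the reference: higher is better.
    return -sum((a - b) ** 2 for a, b in zip(candidate, reference))

single, single_reward = best_of_n(toy_sampler, toy_reward, n=1)
searched, searched_reward = best_of_n(toy_sampler, toy_reward, n=50)
```

Because no model is retrained and the sampler is treated as a black box, this kind of selection is training-free and can wrap any solver, which is the sense in which such a search is plug-and-play.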