arXivDaily: arXiv daily academic digest, updated Monday through Friday
2412.20505 2026-05-27 cs.AI cs.CL cs.LG

LiPUP-MA: A Residential Experience-centric Multi-Agent Framework for Living-in-the-loop Participatory Urban Planning

LiPUP-MA:一种以居住体验为中心的循环参与式城市多智能体规划框架

Hang Ni, Yuzhi Wang, Yizhi Song, Hao Liu

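AI summary Proposes LiPUP-MA, a multi-agent framework that cycles between simulated residential living and experience-driven plan revision, and uses a graph-based experience bank and a spatially-constrained, skill-augmented planner to ground living experience and spatialize feedback in participatory urban planning.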

AI Chinese abstract

参与式城市规划(PUP)日益得到基于LLM的智能体的支持,但现有方法主要依赖于静态偏好 elicitation 和一次性利益相关者讨论,忽视了现实世界规划的周期性——居住生活、经验收集和计划调整持续互动。我们提出循环参与式城市规划(LiPUP),一种在模拟居住生活和经验驱动的计划修订之间交替的闭环范式,同时面临两个关键挑战:将分散的居住经验锚定到具体的城市背景中,以及将主观反馈转化为空间连贯的规划行动。为实例化LiPUP,我们引入LiPUP-MA,一个基于LLM的多智能体框架,它构建了一个以计划为中心的基于图的经验库,用于组织来自生活模拟的基于城市的居住反馈,并配备了一个空间约束的技能增强规划器智能体,通过协调经验、视觉和地理空间证据来修订计划。实验表明,LiPUP-MA在传统的静态规划指标和基于生活的指标上均持续优于基线,而迭代的LiPUP循环进一步提高了计划质量。

English abstract

Participatory Urban Planning (PUP) is increasingly supported by LLM-based agents, yet existing methods largely rely on static preference elicitation and one-shot stakeholder discussions, overlooking the cyclical nature of real-world planning, where residential life, experience collection, and plan adjustment continually interact. We propose Living-in-the-loop Participatory Urban Planning (LiPUP), a closed-loop paradigm that alternates between simulated residential living and experience-driven plan revision, while posing two key challenges: grounding scattered living experience in concrete urban contexts and translating subjective feedback into spatially coherent planning actions. To instantiate LiPUP, we introduce LiPUP-MA, an LLM-based multi-agent framework that constructs a Plan-centric Graph-based Experience Bank to organize urban-grounded residential feedback from living simulation and equips a Spatially-constrained Skill-augmented Planner agent to revise plans by harmonizing experiential, visual, and geospatial evidence. Experiments show that LiPUP-MA consistently outperforms baselines on both conventional static planning metrics and living-based metrics, while iterative LiPUP cycles further improve plan quality.

2510.17790 2026-05-27 cs.CV cs.CL

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

UltraCUA: 一种具有混合动作的计算机使用智能体基础模型

Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan

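AI summary Proposes UltraCUA, a foundation model whose hybrid actions (primitive GUI operations combined with high-level tool execution) overcome computer-use agents' reliance on primitive GUI actions alone. It uses an automated tool pipeline, a synthetic data engine, hybrid-action trajectory collection, and two-stage training, reaching a 22% relative improvement on OSWorld and a 21.7% success rate on WindowsAgentArena.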

AI Chinese abstract

计算机使用智能体面临一个根本限制:它们仅依赖原始GUI动作(点击、键入、滚动),导致脆弱的执行链容易发生级联故障。虽然API驱动的智能体通过结构化接口和工具利用丰富的能力,但计算机使用智能体仍然局限于低层视觉交互。我们提出UltraCUA,一种通过混合动作(无缝统一原始GUI操作与高层工具执行)超越这一限制的基础模型。我们的创新基于四个关键进展。首先,一个自动化管道从软件文档和代码仓库中提取并扩展工具能力。其次,一个合成数据引擎生成超过17,000个可验证任务,捕捉真实世界的计算机使用复杂性。第三,全面的混合动作轨迹收集融合了GUI原语和策略性工具调用。第四,一种两阶段训练方法结合了监督微调和在线强化学习,实现了GUI与API之间的智能动作选择。对我们的7B和32B UltraCUA模型的评估揭示了变革性的性能提升。在OSWorld上,UltraCUA平均实现了22%的相对改进,同时执行速度比现有方法快11%。在WindowsAgentArena上的跨域验证展示了鲁棒的泛化能力,成功率达到21.7%,超过了在Windows上训练的基线。混合动作范式被证明至关重要,在减少错误传播的同时提高了执行效率。这项工作建立了一个可扩展的范式,桥接了原始GUI交互与高层工具智能,为多样环境和复杂现实任务提供了更具弹性和适应性的计算机使用智能体。

English abstract

Computer-use agents face a fundamental limitation: they rely exclusively on primitive GUI actions (click, type, scroll), creating brittle execution chains prone to cascading failures. While API-driven agents harness rich capabilities through structured interfaces and tools, computer-use agents remain constrained to low-level visual interactions. We present UltraCUA, a foundation model that transcends this limitation through hybrid action, seamlessly unifying primitive GUI operations with high-level tool execution. Our innovation rests on four critical advances. First, an automated pipeline extracts and scales tool capabilities from software documentation and code repositories. Second, a synthetic data engine produces 17,000+ verifiable tasks capturing real-world computer-use complexity. Third, comprehensive hybrid action trajectory collection incorporates both GUI primitives and strategic tool calls. Fourth, a two-stage training methodology combines supervised fine-tuning with online reinforcement learning, enabling intelligent action selection between GUI and API. Evaluation with our 7B and 32B UltraCUA models reveals transformative performance gains. On OSWorld, UltraCUA achieves a 22% relative improvement on average while executing 11% faster than existing approaches. Cross-domain validation on WindowsAgentArena demonstrates robust generalization with a 21.7% success rate, surpassing Windows-trained baselines. The hybrid action paradigm proves essential, reducing error propagation while improving execution efficiency. This work establishes a scalable paradigm bridging primitive GUI interactions and high-level tool intelligence, enabling more resilient and adaptable computer-use agents for diverse environments and complex real-world tasks.

2512.04868 2026-05-27 cs.CL cs.AI

SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

SEAL: 面向知识图谱对话问答的自我演进智能体学习

Hao Wang, Jialun Zhong, Changcheng Wang, Zhujun Nie, Zheng Li, Shunyu Yao, Yanzeng Li, Xinchi Li

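AI summary Proposes SEAL, a two-stage semantic parsing framework that uses self-evolving agentic learning to address coreference resolution, contextual dependencies, and complex logical reasoning in conversational question answering over knowledge graphs, achieving state-of-the-art performance on the SPICE benchmark.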

Comments Accepted by Neurocomputing

AI Chinese abstract

基于知识的对话问答(KBCQA)在解决指代消解、上下文依赖建模和执行复杂逻辑推理方面面临持续挑战。现有方法通常存在不准确性和高昂的计算成本,尤其是在处理大规模知识图谱上的复杂查询时。具体而言,大型语言模型(LLM)倾向于为复杂的多跳或聚合查询生成语法无效或语义错位的逻辑形式,而传统的实体-关系链接方法则面临候选空间指数级增长的问题。为了解决这些限制,我们引入了SEAL,一种基于自我演进智能体学习的新型两阶段语义解析框架。在第一阶段,LLM提取一个捕获核心语义的最小S表达式核心,然后通过智能体校准模块进行修正,以纠正语法不一致性并将实体和关系与知识图谱对齐。第二阶段采用基于问题类型预测的模板补全来构建完全可执行的S表达式。关键的是,SEAL包含一种自我演进机制,将局部和全局记忆与反射模块相结合,能够从对话历史和执行反馈中持续适应,而无需显式重新训练。在SPICE基准上的大量实验表明,SEAL在多跳推理、比较和聚合任务中实现了最先进的性能,验证了在结构准确性和计算效率方面的显著提升。

English abstract

Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches often suffer from inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. Specifically, large language models (LLMs) tend to generate syntactically invalid or semantically misaligned logical forms for complex multi-hop or aggregation queries, while conventional entity-relation linking methods face an exponentially growing candidate space. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, an LLM extracts a minimal S-expression core capturing the essential semantics, which is then refined by an agentic calibration module to correct syntactic inconsistencies and align entities and relations with the knowledge graph. The second stage employs template-based completion guided by question-type prediction to construct a fully executable S-expression. Crucially, SEAL incorporates a self-evolving mechanism integrating local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance in multi-hop reasoning, comparison, and aggregation tasks, validating notable gains in both structural accuracy and computational efficiency.

2506.09532 2026-05-27 cs.LG cs.AI cs.CL cs.CV

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Athena: 利用数据高效的过程奖励模型增强多模态推理

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

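AI summary Proposes Athena-PRM, a multimodal process reward model that efficiently generates high-quality process labels from the prediction consistency between weak and strong completers, substantially improving step-level evaluation on complex reasoning problems with only 5,000 samples.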

Comments TMLR 2026, https://openreview.net/forum?id=unWmplHccF

AI Chinese abstract

我们提出了 Athena-PRM,一种多模态过程奖励模型(PRM),旨在评估解决复杂推理问题中每一步的奖励分数。开发高性能的PRM通常需要大量的时间和资金投入,主要因为需要推理步骤的逐步标注。传统的自动标注方法,如蒙特卡洛估计,通常会产生噪声标签并带来巨大的计算成本。为了高效生成高质量的过程标注数据,我们提出利用弱和强完成者之间的预测一致性作为识别可靠过程标签的标准。值得注意的是,Athena-PRM 在仅5000个样本的情况下,在各种场景和基准测试中展现出卓越的效果。此外,我们还开发了两种有效策略来提升PRM的性能:ORM初始化和负数据上采样。我们在三个具体场景中验证了我们的方法:测试时扩展的验证、推理步骤正确性的直接评估以及奖励排序微调。我们的 Athena-PRM 在多个基准测试和场景中持续取得优越性能。值得注意的是,当使用 Qwen2.5-VL-7B 作为策略模型时,Athena-PRM 在 WeMath 上提升了10.2个百分点,在 MathVista 上提升了7.1个百分点(测试时扩展)。此外,Athena-PRM 在 VisualProcessBench 上取得了最先进(SoTA)结果,比之前的 SoTA 高出3.9个F1分数,展示了其准确评估推理步骤正确性的强大能力。另外,利用 Athena-PRM 作为奖励模型,我们通过奖励排序微调开发了 Athena-7B,在五个基准测试上以显著优势超越了基线。

English abstract

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily because reasoning steps must be annotated at the step level. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling of negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward-ranked fine-tuning. Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM improves performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench, outperforming the previous SoTA by 3.9 points in F1 score and showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, using Athena-PRM as the reward model, we develop Athena-7B with reward-ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.
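
The consistency criterion described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each reasoning step comes with boolean outcomes of sampled completions from a weak and a strong completer, and that a step is labeled correct when any completion reaches the right answer; the data format and the function names are illustrative and are not taken from the paper.

```python
def label_from_rollouts(outcomes):
    """Hard label for a step: 1 if any sampled completion reached the correct answer."""
    return 1 if any(outcomes) else 0


def consistent_labels(steps):
    """Keep only steps whose weak- and strong-completer labels agree.

    `steps` is a list of (step_id, weak_outcomes, strong_outcomes) tuples,
    a hypothetical format chosen for this sketch.
    """
    kept = []
    for step_id, weak_outcomes, strong_outcomes in steps:
        weak_label = label_from_rollouts(weak_outcomes)
        strong_label = label_from_rollouts(strong_outcomes)
        if weak_label == strong_label:
            # Agreement is treated as evidence that the label is reliable.
            kept.append((step_id, weak_label))
    return kept
```

Steps on which the two completers disagree are simply dropped here; how the paper treats such steps is not stated in the abstract.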

2512.04085 2026-05-27 cs.CV

Unique Lives, Shared World: Learning from Single-Life Videos

独特生活,共享世界:从单个人生视频中学习

Tengda Han, Sayna Ebrahimi, Dilara Gokay, Li Yang Ku, Maks Ovsjanikov, Iva Babukova, Daniel Zoran, Viorica Patraucean, Joao Carreira, Andrew Zisserman, Dima Damen

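AI summary Proposes the "single-life" learning paradigm, which trains a visual encoder by multi-view self-supervision on egocentric videos captured by one individual. Models trained on different lives develop highly aligned geometric understanding, and the learned representations generalize to downstream tasks with performance comparable to training on diverse web data.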

AI Chinese abstract

我们引入了“单个人生”学习范式,其中我们仅针对一个人拍摄的自我中心视频训练一个独特的视觉模型。我们利用单个人生中自然捕获的多个视角,以自监督方式学习视觉编码器。我们的实验展示了三个关键发现。首先,独立在不同人生上训练的模型发展出高度对齐的几何理解。我们通过在捕获不同人生(包括室内和室外)的不同数据集上训练视觉编码器,并引入一种新的基于交叉注意力的度量来量化不同模型发展的内部表示的功能对齐,来证明这一点。其次,我们展示了单个人生模型学习到可泛化的几何表示,这些表示能有效迁移到下游任务,如未见环境中的深度估计。第三,我们证明,对同一个人一周内最多30小时的数据进行训练,其性能与在30小时多样化网络数据上训练相当,突出了单个人生表示学习的优势。总体而言,我们的结果确立了世界的共享结构既导致了在个人人生上训练的模型的一致性,也为视觉表示学习提供了强大的信号。

English abstract

We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets, each capturing a different life, both indoors and outdoors, and by introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to performance comparable to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.

2511.01724 2026-05-27 cs.CV cs.LG

PRBench: A Standardized Probabilistic Robustness Benchmark

PRBench:标准化概率鲁棒性基准

Yi Zhang, Zheng Wang, Zhen Chen, Wenjie Ruan, Qing Guo, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

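AI summary Proposes the PRBench benchmark, which uses a unified evaluation protocol and theoretical analysis to compare adversarial training and probabilistic-robustness training methods on clean accuracy, robustness, and generalization error.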

AI Chinese abstract

深度学习模型因对不可察觉扰动的脆弱性而闻名。现有研究大多集中于对抗鲁棒性(AR),它通过检查确定性对抗样本(AE)的存在性,在最坏情况下评估模型。相比之下,概率鲁棒性(PR)采用统计视角,衡量在随机扰动下预测保持正确的概率。尽管PR被广泛视为AR的实用补充,但专门用于提升PR的训练方法仍相对未被充分探索,尽管已有初步进展。在少数针对PR的训练方法中,我们发现了三个局限性:(i) 不可比较的评估协议;(ii) 尽管AT能带来PR提升的轶事证据,但与强AT基线的比较有限;(iii) 缺乏统一框架来比较这些方法的泛化能力。因此,我们引入了PRBench,这是第一个专门评估不同鲁棒性训练方法在PR提升上的基准。PRBench使用一套全面的指标,包括干净准确率、PR和AR性能、训练效率以及泛化误差(GE),对最常见的AT和针对PR的训练方法进行实证比较。我们还对不同训练方法的PR性能的GE进行了理论分析。PRBench揭示的主要发现包括:在跨不同超参数设置提升AR和PR性能方面,AT方法比针对PR的训练方法更具通用性,而针对PR的训练方法始终产生更低的GE和更高的干净准确率。包含229个训练模型(覆盖7个数据集和10种模型架构)的排行榜公开于 https://wellzline.github.io/PRBenchLeaderboard/。

English abstract

Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: (i) non-comparable evaluation protocols; (ii) limited comparisons to strong adversarial training (AT) baselines despite anecdotal PR gains from AT; and (iii) no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares the most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide a theoretical analysis of the GE of PR performance across different training methods. The main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 229 trained models across 7 datasets and 10 model architectures is publicly available at https://wellzline.github.io/PRBenchLeaderboard/.
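
The definition of PR given above (the probability that a prediction stays correct under stochastic perturbations) can be estimated by plain Monte Carlo sampling. The sketch below is only an illustration: the uniform perturbation within a box of half-width `radius`, the function names, and the toy classifier are assumptions, not the PRBench evaluation protocol.

```python
import random


def estimate_pr(predict, x, label, radius, n_samples=1000, seed=0):
    """Monte Carlo estimate of P[predict(x + delta) == label].

    delta is drawn uniformly from [-radius, radius] per feature
    (an assumed perturbation distribution).
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        perturbed = [xi + rng.uniform(-radius, radius) for xi in x]
        if predict(perturbed) == label:
            correct += 1
    return correct / n_samples


def toy_classifier(x):
    """Hypothetical classifier: class 1 if the feature sum is positive, else class 0."""
    return 1 if sum(x) > 0 else 0
```

An input far from the decision boundary gets an estimate of 1.0, while an input near the boundary gets a value strictly between 0 and 1, which is the graded picture that a worst-case AR check cannot give.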

2511.17852 2026-05-27 cs.LG stat.ML

Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently

带RL或SFT的Transformer可证明学习稀疏布尔函数,但方式不同

Bochen Lyu, Yiyang Jia, Xiaohao Cai, Zhanxing Zhu

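AI summary Gives a unified analysis of the dynamics of fine-tuning transformers with RL (process rewards) and SFT to learn recursively decomposable k-sparse Boolean functions, proving that both can learn functions such as k-PARITY, k-AND, and k-OR, but that RL learns the whole CoT chain simultaneously while SFT learns it step by step.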

Comments 50 pages, 12 figures

AI Chinese abstract

Transformer可以通过微调获得思维链(CoT)能力来解决复杂的推理任务。强化学习(RL)和监督微调(SFT)是实现这一目标的两种主要方法。在这项工作中,我们专门研究了使用过程奖励的RL和SFT,通过类似于CoT的中间推理步骤,用单层Transformer学习$k$-稀疏布尔函数。特别地,我们考虑可以递归分解为固定2-稀疏布尔函数的$k$-稀疏布尔函数。我们首先以统一的方式分析使用过程奖励的RL微调和SFT的学习动态。这使我们能够识别出Transformer可证明学习这些稀疏布尔函数的充分条件。然后,我们验证了这些条件在三个基本示例(包括$k$-PARITY、$k$-AND和$k$-OR)中成立,从而证明了它们通过RL和SFT的可学习性。值得注意的是,我们揭示了RL和SFT表现出不同的学习行为:RL同时学习整个CoT链,而SFT自然地逐步学习CoT链。总体而言,我们的发现为RL和SFT的底层机制以及它们在触发Transformer的CoT能力方面的差异提供了见解,并表明RL和SFT之间的比较可能需要考虑奖励设计和教师强制(teacher forcing)的使用。

English abstract

Transformers can acquire Chain-of-Thought (CoT) capabilities to solve complex reasoning tasks through fine-tuning. Reinforcement learning (RL) and supervised fine-tuning (SFT) are two primary approaches to this end. In this work, we specifically examine RL with process rewards and SFT for learning $k$-sparse Boolean functions with a one-layer transformer through intermediate reasoning steps akin to CoT. In particular, we consider $k$-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions. We first analyze the learning dynamics of RL fine-tuning with process reward and SFT in a unified way. This allows us to identify sufficient conditions under which the transformer provably learns these sparse Boolean functions. We then verify that these conditions hold for three basic examples, including $k$-PARITY, $k$-AND, and $k$-OR, thus demonstrating their learnability via both RL and SFT. Notably, we reveal that RL and SFT exhibit distinct learning behaviors: RL learns the whole CoT chain simultaneously, whereas SFT naturally learns the CoT chain step by step. Overall, our findings provide insights on the mechanisms underlying RL and SFT and how they differ in triggering the CoT capabilities of transformers, and suggest that the comparison between RL and SFT may need to consider the reward design and the use of teacher forcing.
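
To make the recursive decomposition concrete, the sketch below folds a fixed 2-input Boolean function over the k relevant bits and records every intermediate value, which plays the role of the CoT chain. The left-to-right folding order and the helper names are assumptions for illustration; the paper's exact construction may differ.

```python
def cot_chain(x, relevant, op):
    """Fold a fixed 2-input Boolean op over the relevant bits of x.

    Returns the list of intermediate results; the last entry is the value
    of the k-sparse function, where k = len(relevant).
    """
    bits = [x[i] for i in relevant]
    chain = []
    acc = bits[0]
    for b in bits[1:]:
        acc = op(acc, b)
        chain.append(acc)
    return chain


def xor(a, b):
    return a ^ b


def and_(a, b):
    return a & b


def or_(a, b):
    return a | b
```

With `xor` the final entry is the k-PARITY of the selected bits, with `and_` it is k-AND, and with `or_` it is k-OR; in each case the chain has k - 1 intermediate steps.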

2505.13775 2026-05-27 cs.LG cs.AI

Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

超越语义:无理由中间标记的不合理有效性

Karthik Valmeekam, Vardhan Palod, Kaya Stechly, Atharva Gundawar, Subbarao Kambhampati

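AI summary By training transformer models from scratch on formally verifiable reasoning traces, finds that models trained on correct and on corrupted traces perform similarly and that corrupted traces generalize better on out-of-distribution tasks, challenging the assumption that intermediate tokens reflect or induce predictable reasoning behavior.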

Comments Published in Transactions on Machine Learning Research (TMLR)

AI Chinese abstract

近期大型推理模型的显著成果被解读为思维链(CoT)的胜利,尤其是基于基础LLM采样的CoT训练过程有助于发现新的推理模式。虽然这些轨迹确实有助于模型性能,但其影响机制尚不明确:一些研究赋予其语义,另一些则警告不要将其视为模型内部计算过程的透明忠实代理。为系统探究推导轨迹的终端用户语义作用,我们设置了一项受控研究,从零开始训练Transformer模型于形式可验证的推理轨迹及其导向的解决方案。我们注意到,尽管相比仅解决方案的基线有所提升,但训练于完全正确轨迹的模型在得出正确解决方案时仍可能产生无效推理轨迹。更有趣的是,实验表明,训练于损坏轨迹(其中间推理步骤与所附问题无关)的模型与训练于正确轨迹的模型表现相似,甚至在分布外任务上泛化更好。我们还研究了基于GRPO的RL后训练对轨迹有效性的影响,发现虽然解决方案准确性提高,但轨迹有效性并未随之改善。最后,我们考察了推理轨迹长度是否反映推理时扩展,发现轨迹长度在很大程度上与所解决问题的底层计算复杂度无关。这些结果挑战了中间标记或“思维链”反映或诱导可预测推理行为的假设,并警示不要将此类输出拟人化或过度解读(尽管其表面形式看似合理)为语言模型中类人或类算法行为的证据。

English abstract

Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. While these traces certainly seem to help model performance, it is not clear how they influence it, with some works ascribing semantics to them and others cautioning against relying on them as transparent and faithful proxies of the model's internal computational process. To systematically investigate the role of end-user semantics of derivational traces, we set up a controlled study where we train transformer models from scratch on formally verifiable reasoning traces and the solutions they lead to. We notice that, despite gains over the solution-only baseline, models trained on entirely correct traces can still produce invalid reasoning traces even when arriving at correct solutions. More interestingly, our experiments also show that models trained on corrupted traces, whose intermediate reasoning steps bear no relation to the problem they accompany, perform similarly to those trained on correct ones, and even generalize better on out-of-distribution tasks. We also study the effect of GRPO-based RL post-training on trace validity, noting that while solution accuracy increases, this is not accompanied by improvements in trace validity. Finally, we examine whether reasoning-trace length reflects inference-time scaling and find that trace length is largely agnostic to the underlying computational complexity of the problem being solved. These results challenge the assumption that intermediate tokens or "Chains of Thought" reflect or induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their seemingly plausible surface forms) as evidence of human-like or algorithmic behaviors in language models.

2511.14993 2026-05-27 cs.CV cs.AI cs.LG

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Kandinsky 5.0:图像与视频生成的基础模型系列

Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Nikolai Vaulin, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Julia Agafonova, Ilya Vasiliev, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov

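AI summary Introduces the Kandinsky 5.0 model family, which achieves high-quality generation of high-resolution images and 10-second videos through multi-stage training, self-supervised fine-tuning, and RL-based post-training.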

Comments Website: https://kandinskylab.ai/

AI Chinese abstract

本报告介绍了Kandinsky 5.0,一系列用于高分辨率图像和10秒视频合成的最先进基础模型。该框架包含三个核心模型系列:Kandinsky 5.0 Image Lite——6B参数的图像生成模型系列,Kandinsky 5.0 Video Lite——快速轻量级的2B参数文本到视频和图像到视频模型,以及Kandinsky 5.0 Video Pro——19B参数模型,实现了卓越的视频生成质量。我们全面回顾了数据策展生命周期——包括收集、处理、过滤和聚类——用于多阶段训练流程,该流程涉及广泛的预训练,并融入了质量增强技术,如自监督微调(SFT)和基于强化学习(RL)的后训练。我们还介绍了新颖的架构、训练和推理优化,使Kandinsky 5.0能够在各种任务上实现高生成速度和最先进的性能,如人类评估所示。作为一个大规模、公开可用的生成框架,Kandinsky 5.0充分利用其预训练及后续阶段的全部潜力,以适应广泛的生成应用。我们希望本报告,连同我们开源代码和训练检查点的发布,将大大促进高质量生成模型的研究社区发展和可访问性。

English abstract

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite, a line-up of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages and can be adapted to a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.

2511.14683 2026-05-27 cs.CL

Quadratic Term Correction on Heaps' Law

Heap定律的二次项修正

Oscar Fontanelli, Wentian Li

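AI summary Observing that Heaps' law curves remain slightly concave even in log-log coordinates, proposes a quadratic fit and validates it on twenty English novels, finding a linear coefficient slightly larger than 1, a quadratic coefficient of about -0.02, and a curvature related to a "pseudo-variance".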

Comments 3 figures

AI Chinese abstract

Heap或Herdan定律通过幂律函数表征词类型与词例之间的关系,该函数在线性-线性尺度上是凹的,但在双对数尺度上是直线。然而,即使在双对数尺度上,类型-词例曲线仍轻微凹形,使幂律关系失效。作为下一阶近似,我们通过二十部英文小说(部分从其他语言翻译成英文)证明,双对数尺度下的二次函数能完美拟合类型-词例数据。对log(类型)-log(词例)数据同时包含线性和二次项的回归分析一致地得出线性系数略大于1,二次系数约为-0.02。利用“从袋中有放回地随机抽取彩色球”模型,我们证明双对数尺度的曲率等于一个负的“伪方差”。尽管当词例数量较大时,由于伪权重值较大,伪方差计算可能遇到数值不稳定性,但该形式为词例数量较小时提供了曲率的粗略估计。

English abstract

Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. As the next-order approximation, we show, using twenty English novels or writings (some translated from another language into English), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and a quadratic term consistently lead to a linear coefficient slightly larger than 1 and a quadratic coefficient of around -0.02. Using the "random drawing of colored balls from a bag with replacement" model, we show that the curvature in log-log scale is identical to a "pseudo-variance", which is negative. Although a pseudo-variance calculation may encounter numerical instability when the number of tokens is large, due to the large values of pseudo-weights, this formalism provides a rough estimate of the curvature when the number of tokens is small.
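
The regression described above, log(type) = a + b log(token) + c log(token)^2, can be sketched with ordinary least squares in plain Python. The use of natural logarithms is an assumption (the value of c depends on the log base, which the abstract does not state), and the function names are illustrative.

```python
import math


def solve3(A, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    sol = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(M[i][j] * sol[j] for j in range(i + 1, 3))
        sol[i] = (M[i][3] - tail) / M[i][i]
    return sol


def fit_loglog_quadratic(tokens, types):
    """Least-squares fit of ln(types) = a + b*ln(tokens) + c*ln(tokens)**2."""
    xs = [math.log(n) for n in tokens]
    ys = [math.log(v) for v in types]
    m = sum(xs) / len(xs)
    zs = [x - m for x in xs]  # centering improves numerical conditioning
    s = [sum(z ** k for z in zs) for k in range(5)]
    t = [sum(y * z ** k for z, y in zip(zs, ys)) for k in range(3)]
    A = [[s[i + j] for j in range(3)] for i in range(3)]
    a0, b0, c0 = solve3(A, t)
    # Undo the centering: a0 + b0*(x - m) + c0*(x - m)**2 expanded in powers of x.
    return a0 - b0 * m + c0 * m * m, b0 - 2 * c0 * m, c0
```

A negative fitted c corresponds to the slight concavity of the type-token curve in log-log scale, and c = 0 recovers the pure power law with exponent b.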

2511.14075 2026-05-27 cs.LG cs.AI

CFG-OEC: Classifier Free Guidance with Orthogonal Error Correction

CFG-OEC: 带正交误差校正的无分类器引导

Nakgyu Yang, Yechan Lee, SooJean Han

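AI summary Addresses the error caused by the mismatch between the sampling rule of classifier-free guidance and the training objective in diffusion models, proposing orthogonal error correction (CFG-OEC), which improves sample quality by reducing the interaction term between conditional and unconditional prediction errors, with FID and CLIP score improvements verified on Stable Diffusion.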

AI Chinese abstract

无分类器引导是扩散模型中条件采样的标准方法,但其采样规则与训练中使用的目标不一致。这种不匹配通过条件预测误差和无条件预测误差的相互作用引入了结构性采样误差。我们通过将采样误差分解为基础项和由两个误差对齐决定的交叉项来分析该问题。基于此分析,我们提出了带正交误差校正的无分类器引导(CFG-OEC),这是一种减少交互项的结构性修改。对于无法观测到真实噪声的实际场景,我们引入了一个从模型预测计算得到的代理量,以及一种跨扩散时间步稳定校正的动态方法。在受控环境下的实验验证了我们的理论误差分解和代理量构造。在Stable Diffusion v1.5和Stable Diffusion XL上的图像生成表明,CFG-OEC在多个采样器和引导机制下比CFG和CFG++改进了FID和CLIP分数。

English abstract

Classifier-free guidance is a standard method for conditional sampling in diffusion models, but its sampling rule is not aligned with the objective used in training. This mismatch induces a structural sampling error through the interaction of conditional and unconditional prediction errors. We analyze this issue by decomposing the sampling error into a base term and a cross term determined by the alignment of the two errors. Based on this analysis, we propose CFG with orthogonal error correction (CFG-OEC), a structural modification that reduces the interaction term. For practical settings where ground-truth noise is not observable, we introduce a proxy computed from model predictions and a dynamic method that stabilizes correction across diffusion timesteps. Experiments in a controlled environment validate our theoretical error decomposition and proxy construction. Image generation on Stable Diffusion v1.5 and Stable Diffusion XL shows that CFG-OEC improves FID and CLIP scores over CFG and CFG++ across multiple samplers and guidance regimes.
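
The base-term and cross-term split mentioned above can be checked numerically for the standard CFG rule. The sketch assumes the common convention guided = uncond + w * (cond - uncond) and a squared-error measure; under that assumption the guided error equals w * e_cond + (1 - w) * e_uncond, so its squared norm separates into two squared terms plus an inner-product term. This is plain algebra offered as an illustration, not the paper's exact decomposition, and it does not implement the OEC correction itself.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: uncond + w * (cond - uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def error_terms(eps_true, eps_uncond, eps_cond, w):
    """Split the squared guided error into a base term and a cross term."""
    e_cond = [c - t for c, t in zip(eps_cond, eps_true)]
    e_uncond = [u - t for u, t in zip(eps_uncond, eps_true)]
    base = w ** 2 * dot(e_cond, e_cond) + (1 - w) ** 2 * dot(e_uncond, e_uncond)
    # The cross term depends on how the two prediction errors are aligned.
    cross = 2 * w * (1 - w) * dot(e_cond, e_uncond)
    return base, cross
```

When the two prediction errors are orthogonal, the cross term vanishes and only the base term remains, which is one way to read the "orthogonal" in the method's name.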

2511.07667 2026-05-27 cs.AI

AI-Driven Contribution Evaluation and Conflict Resolution: A Framework & Design for Group Workload Investigation

AI驱动的贡献评估与冲突解决:群体工作量调查的框架与设计

Jakub Slapek, Mir Seyedebrahimi, Jianhua Yang

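AI summary Proposes an AI-enhanced framework and implementation design that integrates heterogeneous artefacts and uses large language models for validation and contextual analysis, to support fair assessment of individual contributions and conflict resolution in teams.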

Comments 20 pages, 8 figures, 8 tables

AI Chinese abstract

团队中个人贡献的公平评估仍然是一个持续的挑战,工作量的冲突和差异可能导致不公平的绩效评估,通常需要人工干预——这是一个成本高昂且困难的过程。我们调查了现有工具的功能,并发现了冲突解决方法和AI集成方面的空白。为了解决这个问题,我们提出了一种新颖的AI增强工具的框架和实现设计,该工具协助争议调查。该框架将异构工件——提交物(代码、文本、媒体)、通信(聊天、电子邮件)、协调记录(会议日志、任务)、同行评估和上下文信息——组织成三个维度,包含九个基准:贡献、互动和角色。客观度量被归一化,按维度聚合,并与不平等度量(基尼指数)配对,以揭示冲突标记。大语言模型(LLM)架构对这些度量进行验证和上下文分析,以生成可解释且透明的咨询判断。我们论证了在当前法规和机构政策下的可行性,并概述了实际分析(情感、任务忠实度、字数/行数等)、偏见防护、限制和实际挑战。

English abstract

The equitable assessment of individual contribution in teams remains a persistent challenge, where conflict and disparity in workload can result in unfair performance evaluation, often requiring manual intervention - a costly and challenging process. We survey existing tool features and identify a gap in conflict resolution methods and AI integration. To address this, we propose a framework and implementation design for a novel AI-enhanced tool that assists in dispute investigation. The framework organises heterogeneous artefacts - submissions (code, text, media), communications (chat, email), coordination records (meeting logs, tasks), peer assessments, and contextual information - into three dimensions with nine benchmarks: Contribution, Interaction, and Role. Objective measures are normalised, aggregated per dimension, and paired with inequality measures (Gini index) to surface conflict markers. A Large Language Model (LLM) architecture performs validated and contextual analysis over these measures to generate interpretable and transparent advisory judgments. We argue for feasibility under current statutory and institutional policy, and outline practical analytics (sentiment, task fidelity, word/line count, etc.), bias safeguards, limitations, and practical challenges.
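
The Gini index used above as an inequality measure can be computed as in the following sketch. The per-member contribution scores in the usage note are hypothetical, and how the framework normalises and aggregates its measures before this step is not specified here.

```python
def gini(values):
    """Gini index of non-negative values: 0 is perfect equality, (n-1)/n is the maximum."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n
```

For four team members with equal scores the index is 0; if one member did all the work, it is 0.75, the maximum for a group of four, which is the kind of value that would surface a conflict marker.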

2511.02525 2026-05-27 cs.LG cs.AI

An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems

一种用于求解带容量约束选址-路径问题的端到端学习方法

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

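AI summary Proposes DRLHQ, an end-to-end method based on deep reinforcement learning with heterogeneous queries, the first to apply an encoder-decoder structure to the capacitated location-routing problem (CLRP) and its open variant (OCLRP). A heterogeneous querying attention mechanism dynamically coordinates location and routing decisions, outperforming traditional methods and existing DRL baselines on synthetic and benchmark datasets.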

AI Chinese abstract

带容量约束的选址-路径问题(CLRPs)是组合优化中的经典问题,需要同时做出选址和路径决策。在CLRPs中,复杂的约束以及各种决策之间的复杂关系使得问题难以求解。随着深度强化学习(DRL)的出现,它已被广泛应用于解决车辆路径问题及其变体,而与CLRPs相关的研究仍有待探索。在本文中,我们提出了带有异构查询的DRL(DRLHQ)来分别求解CLRP和开放CLRP(OCLRP)。我们是首个为CLRPs提出端到端学习方法的工作,遵循编码器-解码器结构。具体而言,我们将CLRPs重新表述为一个针对各种决策量身定制的马尔可夫决策过程,这是一个通用的建模框架,可适用于其他基于DRL的方法。为了更好地处理选址和路径决策之间的相互依赖关系,我们还引入了一种新颖的异构查询注意力机制,旨在动态适应不同的决策阶段。在合成和基准数据集上的实验结果表明,我们提出的方法在求解CLRP和OCLRP时,相较于代表性的传统方法和基于DRL的基线,具有更优的解质量和更好的泛化性能。

English abstract

Capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization that require making location and routing decisions simultaneously. In CLRPs, the complex constraints and the intricate relationships between the various decisions make the problem challenging to solve. Since its emergence, deep reinforcement learning (DRL) has been extensively applied to the vehicle routing problem and its variants, while CLRPs remain underexplored. In this paper, we propose DRL with heterogeneous query (DRLHQ) to solve both the CLRP and the open CLRP (OCLRP). We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate CLRPs as a Markov decision process tailored to the various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency between location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to the different decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate that our approach achieves superior solution quality and better generalization than representative traditional and DRL-based baselines on both the CLRP and the OCLRP.

2510.23486 2026-05-27 cs.LG

Learning to Reason Efficiently with Discounted Reinforcement Learning

通过折扣强化学习高效推理

Alex Ayoub, Kavosh Asadi, Dale Schuurmans, Csaba Szepesvári, Karim Bouyarmane

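AI summary To reduce the computational cost of large reasoning models that consume excessive tokens, proposes penalizing reasoning tokens with discounted reinforcement learning (interpretable as a small token cost), combined with a Blackwell-optimality analysis, to shorten reasoning chains while preserving accuracy.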

AI Chinese abstract

大型推理模型(LRMs)通常消耗过多的token,增加了计算成本和延迟。更广泛地说,在目标到达的序列决策问题中,我们通常希望快速到达目标,而LRM推理可以从这个角度看待。我们挑战了较长响应能提高准确性的假设。通过使用折扣强化学习设置(可解释为小的token成本)惩罚推理token,并分析受限策略类中的Blackwell最优性,我们鼓励简洁而准确的推理,类似于在随机最短路径问题中偏好更短的成功轨迹。实验证实了我们的理论结果,即这种方法在保持准确性的同时缩短了思维链。

English abstract

Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. More broadly, in goal-reaching sequential decision problems we often want to reach the goal quickly, and LRM reasoning can be viewed through this lens. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning, analogous to preferring shorter successful trajectories in a stochastic shortest path problem. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
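
The reading of discounting as a small token cost can be seen in a short sketch. It assumes, for illustration only, a terminal reward of 1 for a correct answer, 0 otherwise, and one discount step per generated token; the paper's exact reward definition may differ.

```python
def discounted_return(num_tokens, correct, gamma):
    """Terminal reward (1 if correct, else 0) discounted once per generated token."""
    reward = 1.0 if correct else 0.0
    return (gamma ** num_tokens) * reward
```

With gamma = 0.999, a correct 100-token answer earns about 0.905 and a correct 400-token answer about 0.670, so shorter successful trajectories are preferred. For small token counts, gamma ** T is close to 1 - (1 - gamma) * T, that is, a cost of roughly 1 - gamma per token.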

2510.10774 2026-05-27 cs.SD cs.AI cs.HC cs.LG

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

ParsVoice: 面向文本到语音合成的大规模多说话人波斯语语音语料库

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

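AI summary Presents ParsVoice, the largest publicly available Persian speech-text corpus, built from long-form audiobooks with a scalable pipeline for training multi-speaker TTS systems, and validates its effectiveness for zero-shot multi-speaker TTS.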

AI Chinese abstract

波斯语在开放的语音-文本资源中仍然严重不足,限制了多说话人文本到语音(TTS)、语音语言建模和低资源语音处理的进展。我们介绍了ParsVoice,这是目前最大的公开波斯语语音-文本语料库,专为训练多说话人TTS系统而设计,同时提供了一个可扩展的流水线,用于从长篇有声读物录音中构建高质量的语音-文本数据。该流水线结合了微调的ParsBERT句子补全分类器、基于ASR的边界优化、标点恢复、说话人识别以及涵盖音频和波斯语特定文本属性的多维质量评估。最终发布的版本包含一个2200小时的TTS就绪子集,包含来自1815个自动识别说话人ID的136万个对齐片段,比之前最大的公开波斯语TTS数据集大25倍以上。为了验证该语料库,我们微调了XTTS,一个直接操作原始波斯语文本(无需音素表示)的零样本多语言TTS模型,实现了自然度MOS为3.6/5,说话人相似度MOS为4.0/5。ParsVoice数据集公开在:https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice。

英文摘要

Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), speech-language modelling, and low-resource speech processing. We introduce ParsVoice, the largest publicly available Persian speech-text corpus tailored for training multi-speaker TTS systems, along with a scalable pipeline to construct high-quality speech-text data from long-form audiobook recordings. The pipeline combines a fine-tuned ParsBERT sentence-completion classifier, ASR-based boundary optimization, punctuation restoration, speaker identification, and a multi-dimensional quality assessment that covers both audio and Persian-specific text properties. The resulting release contains a 2,200-hour TTS-ready subset with 1.36 million aligned segments from 1,815 automatically identified speaker IDs, making it more than 25 times larger than the previously largest open Persian TTS dataset. To validate the corpus, we fine-tune XTTS, a zero-shot multilingual TTS model that operates directly on raw Persian text without phoneme representations, achieving a naturalness MOS of 3.6/5 and speaker similarity MOS of 4.0/5. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.

2509.04310 2026-05-27 cs.AI

EvoEmo: Towards Evolved Emotional Policies for Adversarial LLM Agents in Multi-Turn Price Negotiation

EvoEmo:面向多轮价格谈判中对抗性LLM智能体的进化情感策略

Yunbo Long, Liming Xu, Lukas Beckenbauer, Yuhan Liu, Alexandra Brintrup

AI总结 提出EvoEmo进化强化学习框架,通过将情感状态转移建模为马尔可夫决策过程并采用种群遗传优化,动态优化多轮谈判中的情感表达,显著提升LLM智能体的谈判成功率、效率和买家节省。

详情
AI中文摘要

最近关于大型语言模型(LLM)中思维链(CoT)推理的研究表明,智能体可以参与复杂、多轮的谈判,为智能体AI开辟了新途径。然而,现有的LLM智能体在很大程度上忽略了情感在此类谈判中的功能作用,而是生成被动、偏好驱动的情感反应,使其容易受到对抗方的操纵和策略性利用。为弥补这一差距,我们提出了EvoEmo,一个进化强化学习框架,用于优化谈判中的动态情感表达。EvoEmo将情感状态转移建模为马尔可夫决策过程,并采用基于种群的遗传优化,在多样化的谈判场景中进化出高奖励的情感策略。我们进一步提出了一个评估框架,包含两个基线(原始策略和固定情感策略),用于基准测试情感感知谈判。大量实验和消融研究表明,EvoEmo在成功率、效率和买家节省方面均持续优于两个基线。这些发现强调了适应性情感表达在使LLM智能体更有效地进行多轮谈判中的重要性。代码可在 https://github.com/Yunbo-max/EvoEmo 获取。

英文摘要

Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in complex, multi-turn negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference-driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines -- vanilla strategies and fixed-emotion strategies -- for benchmarking emotion-aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. These findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation. The code is available at https://github.com/Yunbo-max/EvoEmo.
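The population-based search over emotion policies described above can be sketched as a small genetic loop. This is a toy illustration only: the emotion set, the deterministic state-to-next-emotion policy encoding, the mutation and crossover operators, and `toy_fitness` are all hypothetical stand-ins. In the paper, the reward would presumably come from simulated negotiations, which are not modeled here.

```python
import random
from typing import Callable, Dict, List

EMOTIONS: List[str] = ["neutral", "happy", "angry", "sad"]  # assumed emotion set
Policy = Dict[str, str]  # current emotion -> next emotion (assumed encoding)


def random_policy(rng: random.Random) -> Policy:
    return {e: rng.choice(EMOTIONS) for e in EMOTIONS}


def mutate(policy: Policy, rng: random.Random, rate: float = 0.25) -> Policy:
    """Resample each transition with probability `rate`."""
    return {e: (rng.choice(EMOTIONS) if rng.random() < rate else nxt)
            for e, nxt in policy.items()}


def crossover(a: Policy, b: Policy, rng: random.Random) -> Policy:
    """Take each transition from one of the two parents at random."""
    return {e: (a[e] if rng.random() < 0.5 else b[e]) for e in EMOTIONS}


def evolve(fitness: Callable[[Policy], float], generations: int = 20,
           pop_size: int = 12, seed: int = 0) -> Policy:
    rng = random.Random(seed)
    population = [random_policy(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: keep the better half, refill with mutated offspring.
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=fitness)


def toy_fitness(policy: Policy) -> float:
    """Hypothetical reward: counts transitions that return to 'neutral'."""
    return float(sum(1 for nxt in policy.values() if nxt == "neutral"))


best = evolve(toy_fitness)
print(best, toy_fitness(best))
```

Because the better half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.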

2510.09606 2026-05-27 cs.CV

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

SpaceVista:从毫米到公里的全尺度视觉空间推理

Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue

AI总结 本文提出全尺度空间推理解决方案,通过结构化知识系统、尺度感知建模和渐进训练范式,构建SpaceVista-1M数据集(38K视频场景、约1M空间QA对)和SpaceVista-7B模型,在5个基准上展现强泛化能力。

Comments Project Page: https://peiwensun2000.github.io/mm2km/

详情
AI中文摘要

随着当前空间推理探索的兴起,研究人员在理解室内场景方面取得了显著进展,但在机器人技术和自动驾驶等多样化应用中仍面临挑战。本文旨在通过解决两个关键挑战来推进跨不同场景的全尺度空间推理:1)数据集构建严重依赖室内3D扫描和劳动密集型人工标注;2)缺乏有效的全尺度场景建模,常常导致对单个场景的过拟合。本文提出了一种整体解决方案,集成了结构化空间推理知识系统、尺度感知建模和渐进训练范式,据我们所知,这是首次尝试拓宽多模态大语言模型的全尺度空间智能。通过任务特定、专家驱动的自动化流水线,我们在5个空间尺度上整理了超过38K个视频场景,创建了SpaceVista-1M数据集,该数据集包含约100万个空间问答对,涵盖19种不同的任务类型。虽然专家模型可以注入有用的领域知识,但它们不适合用于评估。因此,我们通过手动记录、检索和组装基于视频的数据,构建了一个具有精确标注的全尺度基准。然而,由于潜在的知识冲突,使用SpaceVista-1M进行简单训练往往会产生次优结果。因此,我们引入了SpaceVista-7B,一个空间推理模型,它接受超出语义的密集输入,并使用尺度作为尺度感知专家和渐进奖励的锚点。最后,在包括我们的SpaceVista-Bench在内的5个基准上的广泛评估展示了竞争性能,展现了跨所有尺度和场景的强大泛化能力。我们的数据集、模型和基准将在https://peiwensun2000.github.io/mm2km上发布。

英文摘要

With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. We introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm; to the best of our knowledge, this is the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We therefore build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released at https://peiwensun2000.github.io/mm2km.

2510.09405 2026-05-27 cs.LG

Cross-Receiver Generalization for RF Fingerprint Identification via Feature Disentanglement and Adversarial Training

基于特征解耦与对抗训练的射频指纹识别跨接收机泛化

Yuhao Pan, Xiucheng Wang, Fushuo Huo, Nan Cheng, Wenchao Xu

AI总结 提出一种特征解耦与对抗训练框架,通过分离发射机与接收机特征并抑制接收机信息,解决射频指纹识别中接收机更换导致的性能下降问题。

详情
AI中文摘要

射频指纹识别(RFFI)是无线网络安全的关键技术,利用硬件固有缺陷实现发射机识别。尽管深度神经网络能有效提取判别性射频特征,但在实际部署中,其性能受接收机引入的变异性显著影响。真实场景中,射频信号天然地混合了发射机特定特征与接收机依赖失真,导致模型在相同设备上训练和评估时会捕获接收机相关模式。因此,部署时更换接收机常导致性能显著下降。为解决此问题,我们提出一种跨接收机鲁棒的RFFI框架,明确解耦发射机特定和接收机特定表示。该方法整合对抗域对齐与接收机感知正则化,抑制发射机特征中的残余接收机信息,同时强制接收机特定表示的内部一致性。进一步引入特征分离约束,在潜在空间中解耦两个组件。在多接收机WiFi数据集上的大量实验表明,所提方法在跨接收机评估中持续优于最先进基线,并显著提升对接收机更换的鲁棒性。

英文摘要

Radio frequency fingerprint identification (RFFI) is a key technique for wireless network security, leveraging intrinsic hardware imperfections to enable transmitter identification. Although deep neural networks are effective at extracting discriminative RF features, their performance is significantly affected by receiver-induced variability in practical deployments. In real-world scenarios, RF signals inherently entangle transmitter-specific characteristics with receiver-dependent distortions, leading models to capture receiver-related patterns when training and evaluation are conducted on the same device. Consequently, replacing the receiver during deployment often results in notable performance degradation. To address this issue, we propose a cross-receiver robust RFFI framework that explicitly disentangles transmitter-specific and receiver-specific representations. The proposed method integrates adversarial domain alignment with receiver-aware regularization to suppress residual receiver information in transmitter features while enforcing intra-receiver consistency in receiver-specific representations. A feature separation constraint is further introduced to decouple the two components in the latent space. Extensive experiments on multi-receiver WiFi datasets demonstrate that the proposed method consistently outperforms state-of-the-art baselines under cross-receiver evaluation and significantly improves robustness to receiver replacement.

2510.08932 2026-05-27 cs.LG cs.IR

MATT-CTR: Unleashing a Model-Agnostic Test-Time Paradigm for CTR Prediction with Confidence-Guided Inference Paths

MATT-CTR:一种模型无关的测试时范式,用于通过置信度引导的推理路径进行CTR预测

Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, Xiaoyi Zeng

AI总结 提出一种模型无关的测试时范式MATT,利用特征组合的置信度分数生成多条推理路径并聚合预测,以缓解低置信度特征对CTR预测的影响。

详情
AI中文摘要

近期,越来越多的研究致力于优化CTR模型架构以更好地建模特征交互,或改进训练目标以辅助参数学习,从而获得更好的预测性能。然而,以往的工作主要集中在训练阶段,很大程度上忽视了推理阶段的优化机会。特别是,不常出现的特征组合会降低预测性能,导致不可靠或低置信度的输出。为了释放已训练CTR模型的预测潜力,我们提出了一种模型无关的测试时范式(MATT),该范式利用特征组合的置信度分数来指导生成多条推理路径,从而减轻低置信度特征对最终预测的影响。具体来说,为了量化特征组合的置信度,我们引入了一种层次概率哈希方法来估计不同阶数特征组合的出现频率,这些频率作为对应的置信度分数。然后,以置信度分数作为采样概率,通过迭代采样生成多条实例特定的推理路径,并随后聚合来自多条路径的预测分数以进行稳健预测。最后,广泛的离线实验和在线A/B测试强有力地验证了MATT在现有CTR模型上的兼容性和有效性。

英文摘要

Recently, a growing body of research has focused on either optimizing CTR model architectures to better model feature interactions or refining training objectives to aid parameter learning, thereby achieving better predictive performance. However, previous efforts have primarily focused on the training phase, largely neglecting opportunities for optimization during the inference phase. Infrequently occurring feature combinations, in particular, can degrade prediction performance, leading to unreliable or low-confidence outputs. To unlock the predictive potential of trained CTR models, we propose a Model-Agnostic Test-Time paradigm (MATT), which leverages the confidence scores of feature combinations to guide the generation of multiple inference paths, thereby mitigating the influence of low-confidence features on the final prediction. Specifically, to quantify the confidence of feature combinations, we introduce a hierarchical probabilistic hashing method to estimate the occurrence frequencies of feature combinations at various orders, which serve as their corresponding confidence scores. Then, using the confidence scores as sampling probabilities, we generate multiple instance-specific inference paths through iterative sampling and subsequently aggregate the prediction scores from multiple paths to conduct robust predictions. Finally, extensive offline experiments and online A/B tests strongly validate the compatibility and effectiveness of MATT across existing CTR models.
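The sampling-and-aggregation step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MATT implementation: it uses one confidence score per individual feature rather than the hierarchical probabilistic hashing over feature combinations of various orders, and `toy_model`, `sample_path`, and `predict_with_paths` are hypothetical names.

```python
import random
from typing import Callable, Dict, Optional

Features = Dict[str, Optional[str]]  # a value of None marks a masked feature


def sample_path(features: Features, confidence: Dict[str, float],
                rng: random.Random) -> Features:
    """Keep each feature with probability equal to its confidence score."""
    return {name: (value if rng.random() < confidence.get(name, 0.0) else None)
            for name, value in features.items()}


def predict_with_paths(model: Callable[[Features], float], features: Features,
                       confidence: Dict[str, float], num_paths: int = 8,
                       seed: int = 0) -> float:
    """Average the model's scores over several sampled inference paths."""
    rng = random.Random(seed)
    scores = [model(sample_path(features, confidence, rng)) for _ in range(num_paths)]
    return sum(scores) / len(scores)


def toy_model(features: Features) -> float:
    """Hypothetical stand-in for a trained CTR model."""
    score = 0.1
    if features.get("user") is not None:
        score += 0.2
    if features.get("item") is not None:
        score += 0.3
    return score


instance = {"user": "u42", "item": "i7"}
print(predict_with_paths(toy_model, instance, {"user": 0.9, "item": 0.4}))
```

The trained model is only called, never modified, which is what makes a test-time scheme of this kind model-agnostic.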

2506.23274 2026-05-27 cs.LG cs.AI

Real-Time Progress Prediction in Reasoning Language Models

推理语言模型中的实时进度预测

Hans Peter Lyngsøe Raaschou-Jensen, Constanza Fierro, Anders Søgaard

AI总结 研究通过离散化推理轨迹训练线性探针和微调模型生成0-100%进度估计,实现推理语言模型中的实时进度预测,并在数学推理任务上达到0.161 MAE。

详情
AI中文摘要

最近的推理语言模型,特别是那些采用长潜在思维链的模型,在复杂的智能体任务上表现出色。然而,随着这些模型在越来越长的时间范围内运行,其内部进展对用户变得不透明,使得期望管理和实时监督变得困难。在这项工作中,我们研究了对此类模型进行实时进度预测的可行性。我们首先通过离散化推理轨迹并训练线性探针对推理状态进行分类,测试隐藏状态是否编码进度信息。然后,我们微调模型以在思维链推理过程中生成0-100%的进度估计。我们最强的进度报告检查点在数学推理轨迹上达到了0.161的平均绝对误差,并在此设置中优于位置基线。最后,我们通过测量相同部分展开中隐含进度值的变化程度,量化了进度标签的内在模糊性。这种模糊性在Qwen3-4B中最低,其延续产生的展开离散度最小,表明更大的模型可以通过减少剩余解决方案长度的变化来使进度标签更稳定。

英文摘要

Recent reasoning language models, particularly those that employ long latent chains of thought, achieve strong performance on complex agentic tasks. However, as these models operate over increasingly long time horizons, their internal progress becomes opaque to users, making expectation management and real-time oversight difficult. In this work, we investigate whether real-time progress prediction is feasible for such models. We first test whether hidden states encode progress information by discretizing reasoning trajectories and training a linear probe to classify reasoning states. We then fine-tune models to generate progress estimates from 0-100% during chain-of-thought reasoning. Our strongest progress-reporting checkpoint reaches 0.161 MAE on mathematical reasoning traces and outperforms position baselines in this setting. Finally, we quantify the intrinsic ambiguity of progress labels by measuring how much the implied progress value varies from the same partial rollout. This ambiguity is lowest for Qwen3-4B, whose continuations produce the smallest rollout dispersion, suggesting that larger models can make progress labels more stable by reducing variation in remaining solution length.
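The MAE metric and a position baseline can be made concrete with a small worked example. This is a sketch under assumptions: progress is taken to be the fraction of the final trace length generated so far, and the baseline is assumed to divide the current position by a fixed expected length. The abstract does not define either precisely, and the numbers below are illustrative, not results from the paper.

```python
from typing import List


def mean_absolute_error(predicted: List[float], actual: List[float]) -> float:
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def true_progress(step: int, total_steps: int) -> float:
    """Fraction of the finished trace that has been generated so far."""
    return step / total_steps


def position_baseline(step: int, expected_total: int) -> float:
    """Guess progress from position alone, using an assumed typical length."""
    return min(1.0, step / expected_total)


# One trace that finishes after 400 tokens, checked at four points.
steps = [100, 200, 300, 400]
actual = [true_progress(s, 400) for s in steps]
# The baseline assumes every trace is 800 tokens long, so it underestimates.
baseline = [position_baseline(s, 800) for s in steps]

baseline_mae = mean_absolute_error(baseline, actual)
print(baseline_mae)  # 0.3125
```

On this 0-1 scale, an MAE of 0.161 corresponds to an average error of about 16 percentage points.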

2510.06843 2026-05-27 cs.CL cs.AI

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

自信号驱动的多LLM辩论以实现高效准确的推理

Xuhang Chen, Zhifan Song, Deyi Ji, Shuo Gao, Lanyun Zhu

AI总结 提出一种利用模型级置信度和token级语义焦点两种自信号来自适应引导多LLM辩论过程的方法,在提高准确性的同时减少token消耗。

详情
AI中文摘要

大型语言模型(LLMs)在多样化的应用领域展现了令人印象深刻的能力。最近的工作探索了多LLM智能体辩论(MAD),通过使多个LLM迭代讨论和细化响应来增强性能。然而,现有的MAD方法主要关注利用外部结构(如辩论图)和LLM作为评判者,而忽略了生成过程中出现的自信号(如token logits和注意力)。这种遗漏导致了冗余计算和潜在的性能下降。在本文中,我们将重点转移到多LLM辩论的自信号上,并引入了一种自信号驱动的多LLM辩论(SID),它利用两种类型的自信号:模型级置信度和token级语义焦点,来自适应地引导辩论过程。我们的方法使高置信度智能体能够在模型级别提前退出,并基于注意力机制压缩冗余辩论内容。我们在多个具有挑战性的基准测试上,对各种LLMs和多模态LLMs评估了我们的方法。实验结果表明,我们的方法不仅在准确性上优于现有的MAD技术,而且还减少了token消耗,突显了利用自信号在提高多智能体辩论系统的性能和效率方面的有效性。我们的代码将在 https://github.com/xuhang2019/SID 上提供。

英文摘要

Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at https://github.com/xuhang2019/SID.
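The model-level early exit can be illustrated with a minimal sketch. The confidence measure used here (the mean probability of the generated tokens), the threshold of 0.9, and the function names are assumptions for illustration; the abstract does not specify how SID computes confidence, and the attention-based compression step is not modeled.

```python
from typing import Dict, List, Tuple


def model_confidence(token_probs: List[float]) -> float:
    """Assumed confidence signal: mean probability of the generated tokens."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return sum(token_probs) / len(token_probs)


def split_by_confidence(agent_token_probs: Dict[str, List[float]],
                        threshold: float = 0.9) -> Tuple[List[str], List[str]]:
    """Agents at or above the threshold exit early; the rest keep debating."""
    exited: List[str] = []
    debating: List[str] = []
    for name, probs in agent_token_probs.items():
        if model_confidence(probs) >= threshold:
            exited.append(name)
        else:
            debating.append(name)
    return exited, debating


exited, debating = split_by_confidence({
    "agent_a": [0.95, 0.97],  # confident answer, skips further rounds
    "agent_b": [0.60, 0.70],  # uncertain answer, continues to debate
})
print(exited, debating)
```

Every agent that exits early saves the tokens it would otherwise spend in later debate rounds, which is where the efficiency gain would come from in a scheme like this.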

2510.06381 2026-05-27 cs.LG cs.AI

Monte Carlo Permutation Search

蒙特卡洛排列搜索

Tristan Cazenave

AI总结 提出一种改进GRAVE算法的通用蒙特卡洛树搜索算法MCPS,通过利用路径上所有节点的统计信息,在多种游戏中优于GRAVE,并给出了统计权重公式的数学推导。

详情
AI中文摘要

我们提出蒙特卡洛排列搜索(MCPS),一种改进GRAVE算法的通用蒙特卡洛树搜索(MCTS)算法。当深度强化学习不可行或游戏前可用计算资源有限时(如通用游戏博弈),MCPS具有相关性。MCPS的原理是在节点的探索项中包含从根节点到该节点路径上所有走法的所有模拟的统计信息。我们在多种游戏上测试MCPS:Hex、Go、AtariGo、NoGo和一个Wargame。MCPS几乎总是优于GRAVE。我们还提供了用于加权三种统计来源的公式的数学推导。这些公式是对GRAVE公式的改进,因为它们不再使用GRAVE的偏差超参数。

英文摘要

We propose Monte Carlo Permutation Search (MCPS), a general-purpose Monte Carlo Tree Search (MCTS) algorithm that improves upon the GRAVE algorithm. MCPS is relevant when deep reinforcement learning is not an option or when the computing power available before play is not substantial, such as in General Game Playing. The principle of MCPS is to include in the exploration term of a node the statistics on all the playouts that contain all the moves on the path from the root to the node. We test MCPS on a variety of games: Hex, Go, AtariGo, NoGo and a Wargame. MCPS almost always outperforms GRAVE. We also provide a mathematical derivation of the formulas used for weighting the three sources of statistics. These formulas are an improvement on the GRAVE formula since they no longer use the bias hyperparameter of GRAVE.

2510.05864 2026-05-27 cs.CL cs.CY

On the Sensitivity of Instruction-tuned LLMs to Harmful Sentences in Long Inputs

关于指令微调大语言模型对长输入中有害句子的敏感性

Faeze Ghorbanpour, Alexander Fraser

AI总结 通过构建长输入并系统变化长度、有害比例、显隐性和位置,研究LLM对稀疏嵌入有害句子的敏感性,发现敏感性非单调、随长度下降、早期位置优先、显性危害更易识别。

详情
AI中文摘要

大型语言模型(LLM)越来越多地处理长输入,但当有害句子稀疏地嵌入其中时,其行为仍知之甚少。我们提出了一种敏感性分析,探究LLM如何提取嵌入在长输入中的有害句子。我们通过组合中性和有害句子构建长输入,并系统变化四个因素:输入长度(600–30,000个token)、有害句子比例(0.01–0.50)、危害实现方式(显性与隐性)以及有害句子在输入中的位置(开头、中间、结尾),从而进行受控压力测试评估。针对有毒、冒犯和仇恨内容,以及LLaMA-3.1、Qwen-2.5和Mistral的实验揭示了一致模式:敏感性相对于有害流行率是非单调的,在中等水平达到峰值;敏感性随输入长度增加而下降;位于输入较早位置的有害句子被更强烈地优先处理;显性危害比隐性危害更可靠地被识别。这些发现提供了在受控压力条件下LLM如何优先处理长输入中有害句子的系统视角,突出了安全相关应用中新兴的优势和持续的挑战。

英文摘要

Large language models (LLMs) increasingly operate on long inputs, yet their behavior when harmful sentences are sparsely embedded within such inputs remains poorly understood. We present a sensitivity analysis that probes how LLMs extract harmful sentences embedded in long inputs. We construct long inputs by combining neutral and harmful sentences, and systematically vary four factors: input length (600--30,000 tokens), the proportion of harmful sentences (0.01--0.50), harm realization (explicit vs. implicit), and the position of harmful sentences within the input (beginning, middle, end), enabling a controlled stress-test evaluation. Experiments across toxic, offensive, and hate content, and across LLaMA-3.1, Qwen-2.5, and Mistral, reveal consistent patterns: sensitivity is non-monotonic with respect to harmful prevalence, peaking at moderate levels; sensitivity degrades as input length increases; harmful sentences placed earlier in the input are more strongly prioritized; and explicit harm is more reliably identified than implicit harm. These findings provide a systematic view of how LLMs prioritize harmful sentences in long input under controlled stress conditions, highlighting both emerging strengths and remaining challenges for safety-related use.

2510.05141 2026-05-27 cs.CL

To model human linguistic prediction, make LLMs less superhuman

为了模拟人类语言预测,让大语言模型不那么超人类

Byung-Doh Oh, Tal Linzen

AI总结 本文指出大语言模型因超人类预测能力而无法解释人类阅读行为,主张通过模拟人类记忆来改进模型,并提出新实验方向。

Comments Accepted to Trends in Cognitive Sciences

详情
AI中文摘要

当我们阅读时,我们会预测即将出现的单词;这些预测会影响我们的阅读行为。大语言模型(LLMs)与人类一样,会对即将出现的单词进行预测,它们的成功促使它们被用作人类语言预测的模型。令人惊讶的是,在过去几年中,随着LLMs预测下一个单词的能力提高,它们解释阅读行为的能力却下降了。我们认为这是因为当前的LLMs预测即将出现的单词的能力远高于人类读者。这种“超人类性”是由LLMs大量的训练数据、对训练示例更强的长期记忆以及更强的短期记忆驱动的。我们主张开发具有类人记忆的LLMs,并进行新的实验来衡量人类与LLMs之间的一致性,并概述了实现这些目标的方向。

英文摘要

When we read, we make predictions about upcoming words; these predictions influence our reading behavior. The success of large language models (LLMs), which, like humans, make predictions about upcoming words, has motivated their use as models of human linguistic prediction. Surprisingly, in the last few years, as LLMs' ability to predict the next word has improved, their ability to explain reading behavior has declined. We argue this is because current LLMs can predict upcoming words much better than human readers can. This 'superhumanness' is driven by LLMs' extensive training data, stronger long-term memory of training examples, and stronger short-term memory. We advocate for LLMs with human-like memory and for new experiments to measure the alignment between humans and LLMs, and outline directions towards achieving these goals.

2510.04533 2026-05-27 cs.CV

TAG: Tangential Amplifying Guidance for Hallucination-Resistant Sampling

TAG: 切向放大引导用于抗幻觉采样

Hyunmin Cho, Donghoon Ahn, Susung Hong, Jee Eun Kim, Seungryong Kim, Kyong Hwan Jin

AI总结 提出一种无需训练、与架构无关的即插即用引导方法TAG,通过放大估计分数的切向分量来纠正采样轨迹,减少语义不一致性并提高保真度。

Comments Accepted to ICML 2026 (Regular)

详情
AI中文摘要

扩散模型实现了最先进的图像生成,但经常产生语义不一致或幻觉。现有的推理时引导方法依赖外部信号或架构修改,增加了计算开销。我们提出切向放大引导(TAG),一种无需训练、与架构无关、即插即用的引导方法,仅基于轨迹信号操作。TAG使用中间样本作为投影基,放大估计分数的切向分量以纠正采样轨迹。一阶泰勒分析表明,这会将状态引导至数据流形的高概率区域,减少不一致性并提高保真度,同时为现有采样器增加可忽略的开销。代码可在我们的项目页面(https://hyeon-cho.github.io/TAG/)获取。

英文摘要

Diffusion models achieve state-of-the-art image generation but often produce semantic inconsistencies, or hallucinations. Existing inference-time guidance methods rely on external signals or architectural modifications, adding computational overhead. We propose Tangential Amplifying Guidance (TAG), a training-free, architecture-agnostic, plug-and-play guidance method that operates purely on trajectory signals. TAG uses an intermediate sample as a projection basis and amplifies the tangential components of the estimated score to correct the sampling trajectory. A first-order Taylor analysis shows that this steers the state toward higher-probability regions of the data manifold, reducing inconsistencies and improving fidelity while adding negligible overhead to existing samplers. Code is available at our Project Page (https://hyeon-cho.github.io/TAG/).
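The projection step can be sketched in a few lines. This is one reading of the abstract rather than the authors' code: the score is split into a component parallel to the intermediate sample `x` and an orthogonal (tangential) remainder, and only the remainder is scaled. The amplification factor `eta`, its default value, and the function name are assumptions.

```python
from typing import List


def amplify_tangential(x: List[float], score: List[float], eta: float = 1.5) -> List[float]:
    """Scale the component of `score` orthogonal to `x` by `eta`,
    leaving the component parallel to `x` unchanged."""
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:
        return list(score)  # no direction to project onto
    coeff = sum(si * xi for si, xi in zip(score, x)) / norm_sq
    parallel = [coeff * xi for xi in x]
    tangential = [si - pi for si, pi in zip(score, parallel)]
    return [pi + eta * ti for pi, ti in zip(parallel, tangential)]


# With x along the first axis, only the second coordinate of the score is tangential.
print(amplify_tangential([1.0, 0.0], [2.0, 3.0], eta=2.0))  # [2.0, 6.0]
```

Setting `eta = 1.0` returns the score unchanged, so a correction of this form reduces to the underlying sampler when no amplification is applied.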

2510.01833 2026-05-27 cs.AI cs.CL

Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

先规划后行动:面向LLM推理的高层规划引导强化学习

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Chaoda Song, Zhiqiang Gao, Shufei Zhang, Sumon Biswas

AI总结 提出PTA-GRPO两阶段框架,通过高层规划引导与强化学习联合优化,提升LLM在数学和自然科学推理任务中的准确性和泛化能力。

Comments 19 pages and 5 figures

详情
AI中文摘要

大型语言模型(LLMs)通过思维链(CoT)展现出强大的推理能力,但其token级别的生成倾向于局部决策,缺乏全局规划,常常导致冗余或不准确的推理。现有方法(如基于树的搜索和强化学习)试图解决这一问题,但计算成本高,且仍难以产生可靠的推理轨迹。为应对这些挑战,我们提出先规划后行动增强推理与组相对策略优化(PTA-GRPO),这是一个两阶段框架,旨在联合改进高层规划和细粒度CoT推理。具体而言,在第一阶段,给定LLM负责将CoT推理总结为紧凑的高层指导,然后用于监督微调。接着,我们引入一种指导感知的强化学习方法,联合优化最终输出和指导质量,提升推理效果。我们在数学和自然科学的十个推理基准上,使用五个覆盖多种数据模态的多样化基础模型进行评估。结果表明,PTA-GRPO在模型和任务上持续带来显著改进,展现出强大的有效性和泛化能力。

英文摘要

Large language models (LLMs) demonstrate strong reasoning abilities via Chain-of-Thought (CoT), but their token-level generation encourages local decisions and lacks global planning, often leading to redundant or inaccurate reasoning. Existing methods, such as tree-based search and reinforcement learning (RL), attempt to address this issue but incur high computational costs and still struggle to produce reliable reasoning trajectories. To address these challenges, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to jointly improve high-level planning and fine-grained CoT reasoning. Specifically, in the first stage, a given LLM is responsible for summarizing CoT reasoning into compact high-level guidance, which is then leveraged for supervised fine-tuning. Then, we introduce a guidance-aware reinforcement learning method that jointly optimizes the final output and the quality of guidance, enhancing reasoning effectiveness. We evaluate PTA-GRPO on ten reasoning benchmarks across mathematics and natural sciences, using five diverse base models spanning multiple data modalities. The results show that PTA-GRPO consistently delivers significant improvements across models and tasks, demonstrating strong effectiveness and generalization.

2510.01336 2026-05-27 cs.CL cs.AI cs.LG

HiSpec: Hierarchical Speculative Decoding for LLMs

HiSpec: 分层推测解码用于大语言模型

Avinash Kumar, Sujay Sanghavi, Poulami Das

AI总结 提出HiSpec框架,利用早期退出模型进行低开销中间验证,通过重用键值缓存和隐藏状态提高吞吐量,平均加速1.28倍,最高2.01倍,且不损失准确性。

详情
AI中文摘要

推测解码通过使用较小的草稿模型推测令牌,再由较大的目标模型验证,从而加速LLM推理。验证通常是瓶颈(例如,当3B模型为70B目标模型推测时,验证速度比令牌生成慢4倍),但大多数先前工作只关注加速草稿生成。“中间”验证通过早期丢弃不准确的草稿令牌来减少验证时间,但现有方法在引入中间验证器时会产生大量训练开销,增加内存占用以协调中间验证步骤,并依赖近似启发式方法损害准确性。我们提出分层推测解码(Hierarchical Speculative Decoding,HiSpec),一种高吞吐量推测解码框架,利用早期退出模型进行低开销中间验证。早期退出模型允许令牌通过跳过层遍历提前退出,并经过显式训练,使得选定层的隐藏状态可解释,从而在不显著增加计算和内存开销的情况下,非常适合中间验证。为了进一步提高资源效率,我们设计了一种方法,使HiSpec能够在草稿模型、中间验证器和目标模型之间重用键值缓存和隐藏状态。为了保持准确性,HiSpec定期针对目标模型验证中间验证器接受的草稿令牌。我们在各种代表性基准和模型上的评估表明,与基线单层推测相比,HiSpec平均提高吞吐量1.28倍,最高达2.01倍,且不损失准确性。

英文摘要

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g., verification is 4× slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. "Intermediate" verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose Hierarchical Speculative Decoding (HiSpec), a framework for high-throughput speculative decoding that exploits early-exit (EE) models for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28× on average and by up to 2.01× compared to the baseline single-layer speculation without compromising accuracy.
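The two-tier verification idea can be sketched with a greedy toy. This is a simplified illustration under assumptions, not HiSpec itself: models are represented as hypothetical next-token functions, verification is token-by-token instead of a batched forward pass, the target checks every step instead of periodically, and key-value cache reuse is not modeled.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]  # maps a token prefix to the next token


def accept_prefix(prefix: Sequence[int], draft: Sequence[int],
                  verifier: NextToken) -> List[int]:
    """Return the longest prefix of `draft` that the verifier would also produce."""
    accepted: List[int] = []
    for token in draft:
        if verifier(list(prefix) + accepted) != token:
            break
        accepted.append(token)
    return accepted


def hierarchical_step(prefix: Sequence[int], draft: Sequence[int],
                      intermediate: NextToken, target: NextToken) -> List[int]:
    """Filter draft tokens with the cheap verifier, then confirm with the target."""
    survivors = accept_prefix(prefix, draft, intermediate)
    return accept_prefix(prefix, survivors, target)


def toy_target(prefix: Sequence[int]) -> int:
    """Hypothetical target model: next token is the prefix length modulo 5."""
    return len(prefix) % 5


def toy_intermediate(prefix: Sequence[int]) -> int:
    """Hypothetical early-exit verifier: agrees with the target on short prefixes only."""
    return len(prefix) % 5 if len(prefix) < 3 else 0


# The fourth draft token is wrong and is discarded by the intermediate verifier,
# so the expensive target only has to check three tokens.
print(hierarchical_step([], [0, 1, 2, 9], toy_intermediate, toy_target))  # [0, 1, 2]
```

The final output is always a prefix the target agrees with, so in this toy the intermediate verifier affects only how much work reaches the target, not which tokens are accepted.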

2510.00902 2026-05-27 cs.CV cs.CY cs.HC

Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification

机器学习研究者关于医学图像分类迁移学习的直觉

Yucheng Lu, Hubert Dariusz Zając, Veronika Cheplygina, Amelia Jiménez-Sánchez

AI总结 通过任务调查揭示机器学习从业者选择源数据集的直觉依据,发现选择依赖于任务、社区实践和相似性感知,但相似性与性能并不一致,且缺乏伦理考量。

Comments Under review

详情
AI中文摘要

迁移学习对医学影像至关重要,然而源数据集的选择往往依赖于研究者的直觉而非系统原则,这可能影响算法的泛化能力,进而影响患者预后。本研究通过对机器学习从业者进行基于任务的调查来探究这些决策。与先前对模型和实验设置进行基准测试的工作不同,我们从人机交互(HCI)角度研究从业者如何选择源数据集。我们的发现表明,选择依赖于任务,并受到社区实践、数据集属性、计算(数据嵌入)或感知的视觉或语义相似性的影响。然而,相似性评分与预期性能并不总是一致,挑战了传统的“越相似越好”的观点。此外,伦理和公平性考虑在源数据集选择中基本缺失。参与者常使用模糊术语,这表明需要更清晰的定义和工具使其明确且可用。通过阐明这些启发式方法并引入迁移学习因素的概念框架,本研究为迁移学习中更系统的源选择提供了实用见解。

英文摘要

Transfer learning is crucial for medical imaging, yet the selection of source datasets often relies on researchers' intuition rather than systematic principles, which can impact the generalizability of algorithms and, thus, patient outcomes. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-computer interaction (HCI) perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data-embedding) similarity or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional "more similar is better" view. Moreover, ethical and fairness considerations remain largely absent from source dataset selection. Participants often used ambiguous terminology, which suggests a need for clearer definitions and tools to make them explicit and usable. By clarifying these heuristics and introducing a conceptual framework of transfer learning factors, this work provides practical insights for more systematic source selection in transfer learning.

2509.26600 2026-05-27 cs.CL cs.AI

When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation

当LLM自我基准测试:解构自动评估中的自我偏见

Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch

AI总结 研究LLM自动创建基准测试时存在的自我偏见问题,发现测试集生成和评估两个环节均产生偏见,导致模型偏爱自身输出,并提出了多样性指标以部分缓解该偏见。

详情
AI中文摘要

随着LLM迅速饱和现有基准测试,使用LLM自动创建基准测试(LLM-as-a-benchmark)——即模型生成测试输入(LLM-as-a-testset)并评估输出(LLM-as-an-evaluator)——已成为人工策划的廉价替代方案。我们表明,这种范式存在一个根本问题:LLM生成的基准测试系统性地偏爱创建它们的模型。以机器翻译为主要测试平台,我们发现自我偏见源于两个叠加来源:LLM-as-a-testset和LLM-as-an-evaluator,它们的组合放大了这种效应。关键的是,即使测试数据在显式多样性控制下生成,每个模型的隐式风格倾向也会产生同质的、模型特定的输出,从而抬高其自身分数。使用我们提出的多样性度量增加源文本多样性,可以部分缓解这种偏见。自我偏见足够强,以至于每个模型都将自己排在首位,覆盖了同行共识排序。我们确认该现象扩展到Chatbot Arena任务上的开放式生成。

英文摘要

As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained traction as a cheap alternative to human curation. We show that this paradigm has a fundamental problem: LLM-generated benchmarks systematically favor the model that created them. Using machine translation as our primary testbed, we find that self-bias arises from two additive sources, LLM-as-a-testset and LLM-as-an-evaluator, and their combination amplifies the effect. Crucially, even when test data is generated with explicit diversity controls, each model's implicit stylistic tendencies produce homogeneous, model-specific outputs that inflate its own scores. Increasing source text diversity, using our proposed diversity metric, partially mitigates this bias. Self-bias is strong enough to cause each model to rank itself first, overriding the peer-consensus ordering. We confirm that the phenomenon extends to open-ended generation on the Chatbot Arena task.

2509.21552 2026-05-27 cs.CV cs.CL

Learning GUI Grounding with Spatial Reasoning from Visual Feedback

通过视觉反馈学习具有空间推理能力的 GUI 定位

Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim

AI总结 本文提出将 GUI 定位重构为交互式搜索任务,利用多步在线强化学习训练 GUI-Cursor 模型,通过光标视觉反馈提升空间推理能力,在 GUI 定位和代理任务上超越强基线。

Comments Accepted at ICML 2026

详情
AI中文摘要

图形用户界面(GUI)定位通常被构建为坐标预测任务——给定自然语言指令,生成屏幕上用于点击和按键等操作的坐标。然而,最近的视觉语言模型(VLM)在处理高分辨率和复杂布局的 GUI 图像时,往往无法预测准确的数字坐标。为了解决这个问题,我们将 GUI 定位重构为交互式搜索任务,其中 VLM 生成动作以移动 GUI 中的光标来定位 UI 元素。在每一步,模型确定目标对象,评估光标与目标之间的空间关系,并根据移动历史将光标移近目标。在这个交互过程中,渲染的光标提供视觉反馈,帮助模型将其预测与相应的屏幕位置对齐。我们使用基于密集轨迹奖励函数的多步在线强化学习来训练我们的 GUI 定位模型 GUI-Cursor。实验结果表明,GUI-Cursor 在 GUI 定位和代理任务上超越了强基线,在相同基础模型下实现了更优性能,同时需要更少的训练数据。进一步分析表明,GUI-Cursor 学会在更困难的示例上自适应地执行更多步骤,并在分布外领域获得更好的空间推理能力。

英文摘要

Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing GUI images with high resolutions and complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Experimental results demonstrate that GUI-Cursor surpasses strong baselines in GUI grounding and agentic tasks, achieving superior performance with the same base models while requiring less training data. Further analysis shows that GUI-Cursor learns to adaptively conduct more steps on more difficult examples, and it obtains better spatial reasoning capability on out-of-distribution domains.