arXivDaily: arXiv daily academic digest, updated Monday to Friday
2512.22666 2026-05-27 cs.CV cs.LG

INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading

Mert Ikinci, Luna Toma, Karin U. Loeffler, Leticia Ussem, Daniela Süsskind, Julia M. Weller, Yousef Yeganeh, Martina C. Herwig-Carl, Shadi Albarqouni

Affiliations: Clinic for Diagnostic and Interventional Radiology, University Hospital Bonn, Germany; Department of Ophthalmology, Friedrich-Alexander University Erlangen-Nürnberg, Germany; TUM School of Computation, Information and Technology, Technical University of Munich, Germany; Munich Center for Machine Learning, Germany; Helmholtz AI, Helmholtz Center Munich, Germany

AI summary: Proposes INTERACT-CMIL, a multi-task deep learning framework that jointly predicts five histopathological axes through shared feature learning, combinatorial partial supervision, and an inter-task consistency loss; on a dataset of 486 conjunctival biopsy patches, it achieves relative macro F1 gains of up to 55.1% over CNN and foundation-model baselines.

Journal ref IEEE ISBI 2026

Abstract

Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes (WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia) through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss that enforces cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.
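The abstract does not spell out how partial supervision is implemented, so the following is only a minimal sketch of one common way to train several heads when some axes are unannotated: average the cross-entropy over the axes that carry a label. The function name, the axis keys, and the masking rule are assumptions for illustration; the paper's Combinatorial Partial Supervision and Inter-Dependence Loss may differ.

```python
import math

# Axis names taken from the abstract; the dictionary keys are illustrative.
AXES = ["WHO4", "WHO5", "horizontal_spread", "vertical_spread", "cytologic_atypia"]


def masked_multitask_loss(probs, labels):
    """Average cross-entropy over the axes that have a label (hypothetical helper).

    probs:  dict axis -> list of class probabilities for that head
    labels: dict axis -> class index, or None when the axis is unannotated
    """
    losses = []
    for axis in AXES:
        y = labels.get(axis)
        if y is None:
            # Unannotated axis: it contributes nothing to the loss.
            continue
        losses.append(-math.log(probs[axis][y]))
    if not losses:
        return 0.0
    return sum(losses) / len(losses)
```

A sample annotated on a single axis is then trained only through that head, while the shared features still receive a gradient.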

2512.19602 2026-05-27 cs.CV

No Data? No Problem: Robust Vision-Tabular Learning with Missing Values

Marta Hasny, Laura Daza, Keno Bressem, Maxime Di Folco, Julia Schnabel

Affiliations: School of Computation, Information and Technology, Technical University of Munich, Germany; Institute of Machine Learning in Biomedical Imaging, Helmholtz Munich, Germany; School of Biomedical Engineering and Imaging Sciences, King's College London, UK; Department of Diagnostic and Interventional Radiology, TUM University Hospital, Technical University of Munich, Germany; Munich Center for Machine Learning, Germany

AI summary: Proposes RoVTL, a framework that achieves robust multimodal learning at any level of tabular data availability, from 0% to 100%, through tabular attribute missingness augmentation during contrastive pretraining and a Tabular More vs. Fewer loss during downstream tuning.

Abstract

Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as clinical measurements or demographics. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that remain robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning, where tabular missingness is complemented by a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with a gated cross-attention fusion module, our tuning approach enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. The model weights and code are available at https://github.com/marteczkah/RoVTL.
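The two ingredients named in the abstract can be sketched roughly as follows. This is not the code from the linked repository: `mask_attributes` and `more_vs_fewer_penalty` are hypothetical helpers, and the hinge form of the ranking term is an assumption about what "ranks performance based on the amount of available tabular data" could look like.

```python
import random


def mask_attributes(row, keep_fraction, rng, missing=None):
    """Missingness augmentation: keep a random subset of the tabular attributes.

    keep_fraction is the share of attributes left visible (0.0 to 1.0).
    """
    n_keep = round(keep_fraction * len(row))
    keep = set(rng.sample(range(len(row)), n_keep))
    return [v if i in keep else missing for i, v in enumerate(row)]


def more_vs_fewer_penalty(loss_more, loss_fewer, margin=0.0):
    """Assumed hinge-style ranking term.

    It is positive only when the prediction made with MORE tabular
    attributes has a higher task loss than the one made with FEWER.
    """
    return max(0.0, loss_more - loss_fewer + margin)
```

During tuning, one would evaluate the same sample under two masks, one a superset of the other, and add the penalty to the task loss.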

2512.19332 2026-05-27 cs.LG cs.LO

A Logical View of GNN-Style Computation and the Role of Activation Functions

Pablo Barceló, Floris Geerts, Matthias Lanzinger, Klara Pakhomenko, Jan Van den Bussche

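Affiliations: Institute for Mathematical and Computational Engineering; Pontifical Catholic University of Chile; IMFD; CENIA; Department of Computer Science, University of Antwerp; Institute for Logic and Computation, TU Wien; Data Science Institute, Universiteit Hasselt

AI summary: Studies the computational power of graph neural networks from a logical perspective by defining the language MPLang, focusing on how activation functions (in particular ReLU versus bounded activations) affect numerical and Boolean expressiveness, and proves for the first time that, in the presence of linear layers, ReLU is strictly more expressive for numerical queries than eventually constant activations such as truncated ReLU.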

Abstract

We study the numerical and Boolean expressiveness of MPLang, a declarative language that captures the computation of graph neural networks (GNNs) through linear message passing and activation functions. We begin with A-MPLang, the fragment without activation functions, and give a characterization of its expressive power in terms of walk-summed features. For bounded activation functions, we show that (under mild conditions) all eventually constant activations yield the same expressive power, both numerical and Boolean, and that it subsumes previously established logics for GNNs with eventually constant activation functions but without linear layers. Finally, we prove the first expressive separation between unbounded and bounded activations in the presence of linear layers: MPLang with ReLU is strictly more powerful for numerical queries than MPLang with eventually constant activation functions, e.g., truncated ReLU. This hinges on subtle interactions between linear aggregation and eventually constant non-linearities, and it establishes that GNNs using ReLU are more expressive than those restricted to eventually constant activations and linear layers.
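As a small illustration of the objects involved (not MPLang itself), the sketch below runs activation-free sum aggregation, which counts walks, and contrasts ReLU with a truncated ReLU. The adjacency-list encoding, the function names, and the truncation threshold of 1 are assumptions made for the example.

```python
def walk_counts(adj, k):
    """Number of length-k walks starting at each node.

    adj is an adjacency list; each round is one step of linear
    message passing (sum over neighbours) with no activation.
    """
    x = [1] * len(adj)
    for _ in range(k):
        x = [sum(x[j] for j in adj[i]) for i in range(len(adj))]
    return x


def relu(t):
    """Unbounded activation: grows without limit for large inputs."""
    return max(0.0, t)


def truncated_relu(t):
    """Eventually constant activation: equals 1 for every t >= 1."""
    return min(1.0, max(0.0, t))
```

On the path graph `[[1], [0, 2], [1]]`, one round returns the degrees and further rounds return longer walk counts; `truncated_relu` cannot tell apart any two such counts that are both at least 1, whereas `relu` can.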

2512.17090 2026-05-27 cs.LG cs.AI

How to Square Tensor Networks and Circuits Without Squaring Them

Lorenzo Loconte, Adrián Javaloy, Antonio Vergari

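Affiliations: School of Informatics, University of Edinburgh, UK

AI summary: Proposes parameterizations based on orthogonality and determinism conditions that simplify marginalization in squared tensor networks and circuits, avoiding the extra complexity of squaring while preserving expressiveness and making learning more efficient on distribution estimation tasks.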

Abstract

Squared tensor networks (TNs) and their extension as computational graphs, squared circuits, have been used as expressive distribution estimators that still support closed-form marginalization. However, the squaring operation introduces additional complexity when computing the partition function or marginalizing variables, which hinders their applicability in ML. To solve this issue, canonical forms of TNs are parameterized via unitary matrices to simplify the computation of marginals. However, these canonical forms do not apply to circuits, as they can represent factorizations that do not directly map to a known TN. Inspired by the ideas of orthogonality in canonical forms and determinism in circuits enabling tractable maximization, we show how to parameterize squared circuits to overcome their marginalization overhead. Our parameterizations unlock efficient marginalization even in factorizations different from TNs, but encoded as circuits, whose structure would otherwise make marginalization computationally hard. Finally, our experiments on distribution estimation show how our proposed conditions in squared circuits come with no expressiveness loss, while enabling more efficient learning.
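A toy, single-variable example may help show why orthogonality removes the squaring overhead. For c(x) = Σ_i w_i f_i(x), the partition function of c(x)^2 is in general Σ_{i,j} w_i w_j ⟨f_i, f_j⟩; if the f_i are orthonormal, the cross terms vanish and it reduces to Σ_i w_i^2. The basis and function names below are illustrative assumptions, not the paper's parameterization.

```python
# Three orthonormal functions over the domain {0, 1, 2, 3}
# (scaled Hadamard rows, chosen only for illustration).
BASIS = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
]


def unnormalized_density(weights, x):
    """Squared model: c(x)^2 with c(x) = sum_i w_i * f_i(x)."""
    c = sum(w * f[x] for w, f in zip(weights, BASIS))
    return c * c


def partition_brute_force(weights):
    """Sum the squared model over every state of the domain."""
    return sum(unnormalized_density(weights, x) for x in range(4))


def partition_orthonormal(weights):
    """With an orthonormal basis the cross terms cancel: Z = sum_i w_i^2."""
    return sum(w * w for w in weights)
```

Both functions agree for any weights, but the second never touches the domain, which is the kind of saving that matters once the domain is exponentially large.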

2512.16111 2026-05-27 cs.LG eess.SP

BUILD with Precision: Bottom-Up Inference of Linear DAGs

Hamed Ajorlou, Samuel Rey, Gonzalo Mateos, Geert Leus, Antonio G. Marques

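Affiliations: University of Rochester; Universidad Rey Juan Carlos; Delft University of Technology

AI summary: Proposes BUILD, an algorithm that exploits the distinctive structure of the ensemble precision matrix of observations under a linear Gaussian SEM with equal noise variances to reconstruct the DAG exactly through a deterministic stepwise procedure, and improves robustness with finite data by periodically re-estimating the precision matrix.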

Abstract

Learning the structure of directed acyclic graphs (DAGs) from observational data is a central problem in causal discovery, statistical signal processing, and machine learning. Under a linear Gaussian structural equation model (SEM) with equal noise variances, the problem is identifiable and we show that the ensemble precision matrix of the observations exhibits a distinctive structure that facilitates DAG recovery. Exploiting this property, we propose BUILD (Bottom-Up Inference of Linear DAGs), a deterministic stepwise algorithm that identifies leaf nodes and their parents, then prunes the leaves by removing incident edges to proceed to the next step, exactly reconstructing the DAG from the true precision matrix. In practice, precision matrices must be estimated from finite data, and ill-conditioning may lead to error accumulation across BUILD steps. As a mitigation strategy, we periodically re-estimate the precision matrix (with fewer variables as leaves are pruned), trading off runtime for enhanced robustness. Reproducible results on challenging synthetic benchmarks demonstrate that BUILD compares favorably to state-of-the-art DAG learning algorithms, while offering an explicit handle on complexity.
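The bottom-up idea can be sketched from standard identities, under the assumption that x = Bx + e with B[i][j] the weight of edge j → i and noise variance σ² for every node. The precision matrix is then Θ = (I − B)ᵀ(I − B)/σ², so a leaf has the smallest diagonal entry (exactly 1/σ²) and its row holds −B/σ² for its parents. The function below is an illustrative reconstruction from the true precision matrix, not the authors' implementation; in particular, pruning via a Schur complement is an assumption about how "removing incident edges" can be realized.

```python
def build_from_precision(theta, tol=1e-9):
    """Recover weighted edges (parent, child) -> weight from a precision matrix.

    Assumes a linear SEM with equal noise variances and an exact
    (noise-free) precision matrix given as a list of lists.
    """
    theta = [row[:] for row in theta]          # work on a copy
    active = list(range(len(theta)))
    edges = {}
    while len(active) > 1:
        # A leaf has no children, so its diagonal entry is the minimum.
        leaf = min(active, key=lambda i: theta[i][i])
        d = theta[leaf][leaf]
        rest = [i for i in active if i != leaf]
        # Off-diagonal entries in the leaf's row reveal its parents.
        for k in rest:
            w = -theta[leaf][k] / d
            if abs(w) > tol:
                edges[(k, leaf)] = w
        # Prune the leaf: the Schur complement gives the precision
        # matrix of the remaining variables.
        for a in rest:
            for b in rest:
                theta[a][b] -= theta[a][leaf] * theta[leaf][b] / d
        active = rest
    return edges
```

For the chain 0 → 1 → 2 with weights 0.8 and −0.5 and unit noise variance, Θ is `[[1.64, -0.8, 0.0], [-0.8, 1.25, 0.5], [0.0, 0.5, 1.0]]`, and the sketch peels off node 2 and then node 1. With an estimated precision matrix the minimum-diagonal and nonzero tests become noisy, which is where the periodic re-estimation described in the abstract would come in.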

2512.14561 2026-05-27 cs.CL

Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

Hongli Li, Che Han Chen, Kevin Fan, Chiho Young-Johnson, Soyoung Lim, Yali Feng

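Affiliations: Department of Educational Policy Studies, Georgia State University

AI summary: Synthesizing 65 studies, finds that agreement between large language models and human raters in essay scoring is highly context-dependent and varies substantially both across and within studies.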

Abstract

Despite the growing promise of large language models (LLMs) in automated essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLM-generated scores and human ratings. Agreement levels varied substantially both across and within studies, with reported values spanning a wide range. Overall, the findings suggest that LLM-human agreement is highly context-dependent. Implications, challenges, and directions for future research are discussed.

2512.14140 2026-05-27 cs.CV

SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang, Zhenpeng Zhan

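Affiliations: Global Business Unit, Baidu Inc.

AI summary: Proposes SketchAssist, an interactive sketch assistant that combines instruction-guided editing with line-guided region redrawing; a controllable data generation pipeline and a unified DiT-based framework with a Task-guided Mixture-of-Experts module enable efficient, controllable sketch manipulation and state-of-the-art semantic and structural consistency.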

Abstract

Sketch editing requires jointly handling high-level semantic changes and precise local redrawing, a combination that is particularly challenging for sparse, style-sensitive line art. Unlike natural images, sketches rely on minimal visual cues, making it difficult for existing methods to reconcile global semantic modifications with fine-grained structural control while preserving overall coherence. We present SketchAssist, an interactive sketch assistant that unifies instruction-guided editing with line-guided region redrawing, enabling efficient and controllable sketch manipulation while preserving overall composition. To support this task, we introduce a controllable data generation pipeline that constructs structured edit sequences with precise attribute variations and maintains structural alignment across multi-step modifications, while expanding stylistic diversity via style-preserving transformations. Building on this data, SketchAssist adopts a unified framework based on DiT, using a multi-channel input representation to encode sketches, masks, and guidance signals within a single interface. To further handle different editing modes, we integrate a Task-guided Mixture-of-Experts (T-MoE) into LoRA layers, enabling adaptive control over semantic and structural guidance. Extensive experiments demonstrate state-of-the-art performance on both tasks, achieving strong instruction adherence and improved structural and style consistency compared to recent methods. Together, our method provides a practical and controllable solution for sketch editing.

2512.12413 2026-05-27 cs.AI cs.HC

Understanding Critical Thinking in Generative Artificial Intelligence Use: Development, Validation, and Correlates of the Critical Thinking in AI Use Scale

Gabriel R. Lau, Wei Yan Low, Louis Tay, Ysabel Guevarra, Dragan Gašević, Andree Hartanto

Affiliations: School of Social Sciences, Nanyang Technological University; Interdisciplinary Graduate Programme, Nanyang Technological University; College of Health and Human Sciences, Purdue University; School of Social Sciences, Singapore Management University; Faculty of Information Technology, Monash University

AI summary: Develops and validates the 13-item critical thinking in AI use scale, finds a three-factor structure (Verification, Motivation, and Reflection), and shows that scores are positively associated with openness, extraversion, positive trait affect, and frequency of AI use, and predict more frequent verification strategies and greater veracity-judgement accuracy.

Journal ref Computers in Human Behavior Reports, 22, 101103 (2026)

Abstract

Generative AI tools are increasingly embedded in everyday work and learning, yet their fluency, opacity, and propensity to hallucinate mean that users must critically evaluate AI outputs rather than accept them at face value. The present research conceptualises critical thinking in AI use as a dispositional tendency to verify the source and content of AI-generated information, to understand how models work and where they fail, and to reflect on the broader implications of relying on AI. Across six studies (N = 1365), we developed and validated the 13-item critical thinking in AI use scale and mapped its nomological network. Study 1 generated and content-validated scale items. Study 2 supported a three-factor structure (Verification, Motivation, and Reflection). Studies 3, 4, and 5 confirmed this higher-order model and demonstrated internal consistency, test-retest reliability, strong factor loadings, sex invariance, and convergent and discriminant validity. Studies 3 and 4 further revealed that critical thinking in AI use was positively associated with openness, extraversion, positive trait affect, and frequency of AI use. Lastly, Study 6 demonstrated criterion validity of the scale, with higher critical thinking in AI use scores predicting more frequent and diverse verification strategies, greater veracity-judgement accuracy in a novel and naturalistic ChatGPT-powered fact-checking task, and deeper reflection about responsible AI. Taken together, the current work clarifies why and how people exercise oversight over generative AI outputs and provides a validated scale and ecologically grounded task paradigm to support theory testing and cross-group and longitudinal research on critical engagement with generative AI outputs.

2512.11280 2026-05-27 cs.CL

AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu

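Affiliations: Institute of Information Science, Academia Sinica; Department of Computer Science and Information Engineering, National Taiwan University

AI summary: Proposes AdaSD, a hyperparameter-free adaptive speculative decoding method that dynamically adjusts generation length and acceptance criteria, achieving up to 1.46x speedup while keeping accuracy degradation under 1.8%.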

Abstract

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive components: one to determine when to stop candidate token generation and the other to decide token acceptance, updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 1.46x speedup over vanilla speculative decoding while limiting accuracy degradation to under 1.8%, making it a practical solution for efficient and adaptive LLM inference.
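The two signals the abstract names, token entropy and Jensen-Shannon distance, are standard quantities and can be computed as below. How AdaSD turns them into stopping and acceptance thresholds is not described in the abstract, so this sketch stops at the measurements; the function names and the base-2 logarithm are choices made for the example.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a next-token distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, base 2).

    Ranges from 0 for identical distributions to 1 for disjoint ones.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero numerator contribute nothing; m is positive
        # wherever a is positive, so the ratio is always defined.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))
```

Intuitively, low draft entropy together with a small distance between the draft and target distributions suggests a candidate token that the target model is likely to accept.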

2512.08371 2026-05-27 cs.LG stat.ML

A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research

Simon Chung, Colby J. Vorland, Donna L. Maney, Andrew W. Brown

Affiliations: Department of Biostatistics, University of Arkansas for Medical Sciences; Arkansas Children's Research Institute; Department of Epidemiology and Biostatistics, Indiana University School of Public Health-Bloomington; Department of Psychology, Emory University

AI summary: For multi-label data whose labels vary greatly in frequency and are mutually dependent, proposes a weighted sampling algorithm based on the multivariate Bernoulli distribution that estimates weights for label combinations to reach target distribution characteristics, and shows on Web of Science research articles that it improves the representation of minority categories.

Abstract

Datasets may contain observations with multiple labels. If the labels are not mutually exclusive, and if the labels vary greatly in frequency, obtaining a sample that includes sufficient observations with scarcer labels to make inferences about those labels, and which deviates from the population frequencies in a known manner, creates challenges. In this paper, we consider a multivariate Bernoulli distribution as our underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate multivariate Bernoulli distribution parameters and calculates weights for each label combination. This approach ensures the weighted sampling acquires target distribution characteristics while accounting for label dependencies. We applied this approach to a variety of datasets, including a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve category frequency order, reduce frequency differences between most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
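A heavily simplified sketch of weighting by label combination follows. It swaps the fitted multivariate Bernoulli model for empirical combination frequencies and uses a tempered target (frequency raised to a power `alpha`), both of which are assumptions for illustration rather than the paper's algorithm; `combination_weights` and `alpha` are hypothetical names.

```python
from collections import Counter


def combination_weights(label_sets, alpha=0.5):
    """Importance weight per label combination.

    Weighted sampling with these weights follows a target distribution
    proportional to (observed frequency) ** alpha. With 0 < alpha < 1
    the frequency order of combinations is kept while the gap between
    common and rare combinations shrinks.
    """
    counts = Counter(frozenset(s) for s in label_sets)
    n = len(label_sets)
    observed = {combo: k / n for combo, k in counts.items()}
    z = sum(p ** alpha for p in observed.values())
    target = {combo: (p ** alpha) / z for combo, p in observed.items()}
    return {combo: target[combo] / observed[combo] for combo in observed}
```

Each observation then receives the weight of its own label combination, and a sub-sample can be drawn with, for example, `random.choices(data, weights=...)`.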

2412.20505 2026-05-27 cs.AI cs.CL cs.LG

LiPUP-MA: A Residential Experience-centric Multi-Agent Framework for Living-in-the-loop Participatory Urban Planning

Hang Ni, Yuzhi Wang, Yizhi Song, Hao Liu

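Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong Polytechnic University

AI summary: Proposes LiPUP-MA, a multi-agent framework that alternates between simulated residential living and experience-driven plan revision, using a graph-based experience bank and a spatially constrained, skill-augmented planner to ground living experience and spatialize feedback in participatory urban planning.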

Abstract

Participatory Urban Planning (PUP) is increasingly supported by LLM-based agents, yet existing methods largely rely on static preference elicitation and one-shot stakeholder discussions, overlooking the cyclical nature of real-world planning, where residential life, experience collection, and plan adjustment continually interact. We propose Living-in-the-loop Participatory Urban Planning (LiPUP), a closed-loop paradigm that alternates between simulated residential living and experience-driven plan revision, while posing two key challenges: grounding scattered living experience in concrete urban contexts and translating subjective feedback into spatially coherent planning actions. To instantiate LiPUP, we introduce LiPUP-MA, an LLM-based multi-agent framework that constructs a Plan-centric Graph-based Experience Bank to organize urban-grounded residential feedback from living simulation and equips a Spatially-constrained Skill-augmented Planner agent to revise plans by harmonizing experiential, visual, and geospatial evidence. Experiments show that LiPUP-MA consistently outperforms baselines on both conventional static planning metrics and living-based metrics, while iterative LiPUP cycles further improve plan quality.

2510.17790 2026-05-27 cs.CV cs.CL

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan

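Affiliations: Apple; The University of Hong Kong

AI summary: Proposes UltraCUA, a foundation model whose hybrid action (combining primitive GUI operations with high-level tool execution) overcomes the limitation of computer-use agents that rely only on primitive GUI actions; it uses an automated pipeline, a synthetic data engine, hybrid action trajectory collection, and two-stage training, achieving a 22% relative improvement on OSWorld and a 21.7% success rate on WindowsAgentArena.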

Abstract

Computer-use agents face a fundamental limitation: they rely exclusively on primitive GUI actions (click, type, scroll), creating brittle execution chains prone to cascading failures. While API-driven agents harness rich capabilities through structured interfaces and tools, computer-use agents remain constrained to low-level visual interactions. We present UltraCUA, a foundation model that transcends this limitation through hybrid action, seamlessly unifying primitive GUI operations with high-level tool execution. Our innovation rests on four critical advances. First, an automated pipeline extracts and scales tool capabilities from software documentation and code repositories. Second, a synthetic data engine produces 17,000+ verifiable tasks capturing real-world computer-use complexity. Third, comprehensive hybrid action trajectory collection incorporates both GUI primitives and strategic tool calls. Fourth, a two-stage training methodology combines supervised fine-tuning with online reinforcement learning, enabling intelligent action selection between GUI and API. Evaluation with our 7B and 32B UltraCUA models reveals transformative performance gains. On OSWorld, UltraCUA achieves a 22% relative improvement on average while executing 11% faster than existing approaches. Cross-domain validation on WindowsAgentArena demonstrates robust generalization with a 21.7% success rate, surpassing Windows-trained baselines. The hybrid action paradigm proves essential, reducing error propagation while improving execution efficiency. This work establishes a scalable paradigm bridging primitive GUI interactions and high-level tool intelligence, enabling more resilient and adaptable computer-use agents for diverse environments and complex real-world tasks.

2512.04868 2026-05-27 cs.CL cs.AI

SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

Hao Wang, Jialun Zhong, Changcheng Wang, Zhujun Nie, Zheng Li, Shunyu Yao, Yanzeng Li, Xinchi Li

Affiliations: Institute of Big Data and Artificial Intelligence, China Telecom Research Institute; Wangxuan Institute of Computer Technology, Peking University; School of Artificial Intelligence, China University of Geosciences (Beijing); Center for Cognition and Neuroergonomics, State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University; Institute of Artificial Intelligence and Future Networks, Beijing Normal University

AI summary: Proposes SEAL, a two-stage semantic parsing framework that uses self-evolving agentic learning to address coreference resolution, contextual dependencies, and complex logical reasoning in conversational question answering over knowledge graphs, reaching state-of-the-art performance on the SPICE benchmark.

Comments Accepted by Neurocomputing

Abstract

Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches often suffer from inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. Specifically, large language models (LLMs) tend to generate syntactically invalid or semantically misaligned logical forms for complex multi-hop or aggregation queries, while conventional entity-relation linking methods face an exponentially growing candidate space. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, an LLM extracts a minimal S-expression core capturing the essential semantics, which is then refined by an agentic calibration module to correct syntactic inconsistencies and align entities and relations with the knowledge graph. The second stage employs template-based completion guided by question-type prediction to construct a fully executable S-expression. Crucially, SEAL incorporates a self-evolving mechanism integrating local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance in multi-hop reasoning, comparison, and aggregation tasks, validating notable gains in both structural accuracy and computational efficiency.

2506.09532 2026-05-27 cs.LG cs.AI cs.CL cs.CV

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

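Affiliations: Advanced Micro Devices Inc.; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes Athena-PRM, a multimodal process reward model that efficiently generates high-quality process labels from the prediction consistency between weak and strong completers, and delivers strong step-level evaluation on complex reasoning problems with only 5,000 samples.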

Comments TMLR 2026, https://openreview.net/forum?id=unWmplHccF

Abstract

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward-ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench, outperforming the previous SoTA by 3.9 F1 points and showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, using Athena-PRM as the reward model, we develop Athena-7B with reward-ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.
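The consistency criterion can be pictured with a small sketch: label each reasoning step once with rollouts from a weak completer and once with rollouts from a strong completer, then keep only the steps on which the two labels agree. The hard labeling rule (a step counts as correct if any rollout from it reaches the right answer) and both function names are assumptions for illustration; the abstract does not give the exact estimator.

```python
def mc_step_label(outcomes):
    """Assumed labeling rule: 1 if any rollout continued from this step
    reached a correct final answer, else 0."""
    return 1 if any(outcomes) else 0


def consistent_labels(weak_rollouts, strong_rollouts):
    """Keep only steps whose weak-completer and strong-completer labels agree.

    Both arguments map a step index to a list of booleans, one per
    rollout, saying whether that rollout ended in a correct answer.
    """
    kept = {}
    for step, weak_outcomes in weak_rollouts.items():
        weak = mc_step_label(weak_outcomes)
        strong = mc_step_label(strong_rollouts[step])
        if weak == strong:
            kept[step] = weak
    return kept
```

Steps where the completers disagree are treated as unreliable and dropped, which trades dataset size for label quality.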

2512.04085 2026-05-27 cs.CV

Unique Lives, Shared World: Learning from Single-Life Videos

Tengda Han, Sayna Ebrahimi, Dilara Gokay, Li Yang Ku, Maks Ovsjanikov, Iva Babukova, Daniel Zoran, Viorica Patraucean, Joao Carreira, Andrew Zisserman, Dima Damen

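Affiliations * Google DeepMind

AI summary Proposes the "single-life" learning paradigm, which trains a visual encoder with multi-view self-supervision on egocentric videos captured by a single person. Models trained on different lives develop highly aligned geometric understanding, and the learned representations generalize to downstream tasks with performance comparable to training on diverse web data.

Details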
English abstract

We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets, each capturing a different life, both indoors and outdoors, and by introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.

2511.01724 2026-05-27 cs.CV cs.LG

PRBench: A Standardized Probabilistic Robustness Benchmark

Yi Zhang, Zheng Wang, Zhen Chen, Wenjie Ruan, Qing Guo, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

Affiliations * WMG, University of Warwick; Department of Computer Science, University of Liverpool; College of Computer Science, Nankai University; School of Computing, National University of Singapore

AI summary Proposes the PRBench benchmark, which uses a unified evaluation protocol and theoretical analysis to compare adversarial training and probabilistic-robustness-targeted training methods on clean accuracy, robustness, and generalization error.

Details
English abstract

Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: (i) non-comparable evaluation protocols; (ii) limited comparisons to strong adversarial training (AT) baselines despite anecdotal PR gains from AT; and (iii) no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares the most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide theoretical analysis on the GE of PR performance across different training methods. Main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 229 trained models across 7 datasets and 10 model architectures is publicly available at https://wellzline.github.io/PRBenchLeaderboard/.
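The definition of probabilistic robustness given in this abstract (the probability that a prediction stays correct under stochastic perturbations) can be estimated by Monte Carlo sampling. The sketch below assumes uniform noise in an L-infinity ball and a toy threshold classifier; both are illustrative choices, not PRBench's evaluation protocol.

```python
import random
from typing import Callable, Sequence

def probabilistic_robustness(predict: Callable[[Sequence[float]], int],
                             x: Sequence[float], label: int, radius: float,
                             n_samples: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of P[predict(x + delta) == label],
    with delta drawn uniformly from [-radius, radius]^d (an assumed noise model)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        perturbed = [xi + rng.uniform(-radius, radius) for xi in x]
        if predict(perturbed) == label:
            correct += 1
    return correct / n_samples

def toy_classifier(x: Sequence[float]) -> int:
    # Hypothetical 1-D threshold classifier, used only for illustration.
    return 1 if x[0] > 0.0 else 0

pr_far_from_boundary = probabilistic_robustness(toy_classifier, [1.0], 1, radius=0.5)
pr_on_boundary = probabilistic_robustness(toy_classifier, [0.0], 1, radius=1.0)
```

A point far from the decision boundary keeps its label under every sampled perturbation, while a point on the boundary keeps it roughly half the time, even though an adversarial example exists in both balls only for the second case.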

2511.17852 2026-05-27 cs.LG stat.ML

Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently

Bochen Lyu, Yiyang Jia, Xiaohao Cai, Zhanxing Zhu

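Affiliations * School of Electronics and Computer Science, University of Southampton, United Kingdom; Independent Researcher

AI summary Through a unified analysis of the dynamics of fine-tuning transformers with RL (with process rewards) and SFT to learn recursively decomposable $k$-sparse Boolean functions, this paper proves that both can learn functions such as $k$-PARITY, $k$-AND, and $k$-OR, but RL learns the whole CoT chain simultaneously whereas SFT learns it step by step.

Comments 50 pages, 12 figures

Details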
English abstract

Transformers can acquire Chain-of-Thought (CoT) capabilities to solve complex reasoning tasks through fine-tuning. Reinforcement learning (RL) and supervised fine-tuning (SFT) are two primary approaches to this end. In this work, we specifically examine RL with process rewards and SFT for learning $k$-sparse Boolean functions with a one-layer transformer through intermediate reasoning steps akin to CoT. In particular, we consider $k$-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions. We first analyze the learning dynamics of RL fine-tuning with process reward and SFT in a unified way. This allows us to identify sufficient conditions under which the transformer provably learns these sparse Boolean functions. We then verify that these conditions hold for three basic examples, including $k$-PARITY, $k$-AND, and $k$-OR, thus demonstrating their learnability via both RL and SFT. Notably, we reveal that RL and SFT exhibit distinct learning behaviors: RL learns the whole CoT chain simultaneously, whereas SFT naturally learns the CoT chain step by step. Overall, our findings provide insights on the mechanisms underlying RL and SFT and how they differ in triggering the CoT capabilities of transformers, and suggest that the comparison between RL and SFT may need to consider the reward design and the use of teacher forcing.
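To make the recursive decomposition concrete, the sketch below computes a $k$-sparse Boolean function as a chain of $k-1$ applications of one fixed 2-input function, recording each intermediate value as a CoT-like step. The function name `cot_chain` and the left-to-right chaining order are assumptions for illustration; the paper's exact construction may differ.

```python
import operator
from typing import Callable, List, Sequence

def cot_chain(bits: Sequence[int], support: Sequence[int],
              op: Callable[[int, int], int] = operator.xor) -> List[int]:
    """Apply a fixed 2-input Boolean function along the relevant coordinates.

    Returns the k-1 intermediate results; the last one is the function value.
    op = xor gives k-PARITY, and_ gives k-AND, or_ gives k-OR.
    """
    acc = bits[support[0]]
    steps = []
    for idx in support[1:]:
        acc = op(acc, bits[idx])
        steps.append(acc)
    return steps

x = [1, 0, 1, 1, 0]                 # full input
relevant = [0, 2, 3]                # the k = 3 relevant coordinates
parity_steps = cot_chain(x, relevant, operator.xor)
and_steps = cot_chain(x, relevant, operator.and_)
or_steps = cot_chain(x, [0, 1, 4], operator.or_)
```

Each entry of the returned list corresponds to one intermediate reasoning step that a process reward or a teacher-forced target could supervise.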

2505.13775 2026-05-27 cs.LG cs.AI

Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

Karthik Valmeekam, Vardhan Palod, Kaya Stechly, Atharva Gundawar, Subbarao Kambhampati

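Affiliations * School of Computing and AI; Arizona State University; Amazon AGI; Yale University

AI summary By training transformer models from scratch on formally verifiable reasoning traces, the study finds that models trained on correct and on corrupted traces perform similarly, and that corrupted traces even generalize better on out-of-distribution tasks, challenging the assumption that intermediate tokens reflect or induce predictable reasoning behavior.

Comments Published in Transactions on Machine Learning Research (TMLR)

Details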
English abstract

Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. While these traces certainly seem to help model performance, it is not clear how they influence it, with some works ascribing semantics to them and others cautioning against relying on them as transparent and faithful proxies of the model's internal computational process. To systematically investigate the role of end-user semantics of derivational traces, we set up a controlled study where we train transformer models from scratch on formally verifiable reasoning traces and the solutions they lead to. We notice that, despite gains over the solution-only baseline, models trained on entirely correct traces can still produce invalid reasoning traces even when arriving at correct solutions. More interestingly, our experiments also show that models trained on corrupted traces, whose intermediate reasoning steps bear no relation to the problem they accompany, perform similarly to those trained on correct ones, and even generalize better on out-of-distribution tasks. We also study the effect of GRPO-based RL post-training on trace validity, noting that while solution accuracy increases, this is not accompanied by improvements in trace validity. Finally, we examine whether reasoning-trace length reflects inference-time scaling and find that trace length is largely agnostic to the underlying computational complexity of the problem being solved. These results challenge the assumption that intermediate tokens or "Chains of Thought" reflect or induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their seemingly reasonable forms) as evidence of human-like or algorithmic behaviors in language models.

2511.14993 2026-05-27 cs.CV cs.AI cs.LG

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Nikolai Vaulin, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Julia Agafonova, Ilya Vasiliev, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov

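Affiliations * Kandinsky Lab

AI summary Introduces the Kandinsky 5.0 model family, which achieves high-quality generation of high-resolution images and 10-second videos through multi-stage training, self-supervised fine-tuning, and RL-based post-training.

Comments Website: https://kandinskylab.ai/

Details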
English abstract

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite, a line-up of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.

2511.14075 2026-05-27 cs.LG cs.AI

CFG-OEC: Classifier Free Guidance with Orthogonal Error Correction

Nakgyu Yang, Yechan Lee, SooJean Han

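Affiliations * School of Electrical Engineering, Korea Advanced Institute of Science

AI summary To address the error caused by the mismatch between the sampling rule of classifier-free guidance and the training objective in diffusion models, proposes orthogonal error correction (CFG-OEC), which improves sampling quality by reducing the interaction term between the conditional and unconditional prediction errors; improvements in FID and CLIP scores are verified on Stable Diffusion.

Details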
English abstract

Classifier-free guidance is a standard method for conditional sampling in diffusion models, but its sampling rule is not aligned with the objective used in training. This mismatch induces a structural sampling error through the interaction of conditional and unconditional prediction errors. We analyze this issue by decomposing the sampling error into a base term and a cross term determined by the alignment of the two errors. Based on this analysis, we propose CFG with orthogonal error correction (CFG-OEC), a structural modification that reduces the interaction term. For practical settings where ground truth noise is not observable, we introduce a proxy computed from model predictions and a dynamic method that stabilizes correction across diffusion timesteps. Experiments in a controlled environment validate our theoretical error decomposition and proxy construction. Image generation on Stable Diffusion v1.5 and Stable Diffusion XL shows that CFG-OEC improves FID and CLIP scores over CFG and CFG++ across multiple samplers and guidance regimes.
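One generic way a base term and a cross term can arise is sketched below. It assumes the common guidance rule eps_u + w * (eps_c - eps_u) and, as a simplification, that both predictions target the same true noise with additive errors; under those assumptions the squared error of the guided prediction splits into a sum of squared-error terms plus an inner-product term that vanishes when the two errors are orthogonal. This is an illustration only, and the paper's actual decomposition and correction may differ.

```python
from typing import List, Sequence, Tuple

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cfg_combine(eps_uncond: Sequence[float], eps_cond: Sequence[float],
                w: float) -> List[float]:
    """Common classifier-free guidance rule: eps_u + w * (eps_c - eps_u)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def error_terms(err_uncond: Sequence[float], err_cond: Sequence[float],
                w: float) -> Tuple[float, float]:
    """Split ||(1 - w) * err_u + w * err_c||^2 into (base, cross)."""
    base = (1 - w) ** 2 * dot(err_uncond, err_uncond) + w ** 2 * dot(err_cond, err_cond)
    cross = 2 * w * (1 - w) * dot(err_uncond, err_cond)
    return base, cross

true_noise = [0.0, 0.0]            # hypothetical ground-truth noise
w = 2.0                            # hypothetical guidance scale

def guided_sq_error(err_u: Sequence[float], err_c: Sequence[float]) -> float:
    eps_u = [t + e for t, e in zip(true_noise, err_u)]
    eps_c = [t + e for t, e in zip(true_noise, err_c)]
    guided = cfg_combine(eps_u, eps_c, w)
    diff = [g - t for g, t in zip(guided, true_noise)]
    return dot(diff, diff)

orth_base, orth_cross = error_terms([1.0, 0.0], [0.0, 1.0], w)      # orthogonal errors
aligned_base, aligned_cross = error_terms([1.0, 0.0], [1.0, 0.0], w)  # aligned errors
```

In this toy setting the cross term is zero for orthogonal errors and nonzero for aligned ones, which is the quantity an orthogonalizing correction would target.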

2511.07667 2026-05-27 cs.AI

AI-Driven Contribution Evaluation and Conflict Resolution: A Framework & Design for Group Workload Investigation

Jakub Slapek, Mir Seyedebrahimi, Jianhua Yang

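Affiliations * University of Warwick; Warwick Manufacturing Group

AI summary Proposes an AI-enhanced framework and implementation design that integrates heterogeneous artefacts and uses large language models for validation and contextual analysis, addressing the difficulty of fairly assessing individual contributions and resolving conflicts in teams.

Comments 20 pages, 8 figures, 8 tables

Details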
English abstract

The equitable assessment of individual contribution in teams remains a persistent challenge, where conflict and disparity in workload can result in unfair performance evaluation, often requiring manual intervention, a costly and challenging process. We survey existing tool features and identify a gap in conflict resolution methods and AI integration. To address this, we propose a framework and implementation design for a novel AI-enhanced tool that assists in dispute investigation. The framework organises heterogeneous artefacts, namely submissions (code, text, media), communications (chat, email), coordination records (meeting logs, tasks), peer assessments, and contextual information, into three dimensions with nine benchmarks: Contribution, Interaction, and Role. Objective measures are normalised, aggregated per dimension, and paired with inequality measures (Gini index) to surface conflict markers. A Large Language Model (LLM) architecture performs validated and contextual analysis over these measures to generate interpretable and transparent advisory judgments. We argue for feasibility under current statutory and institutional policy, and outline practical analytics (sentiment, task fidelity, word/line count, etc.), bias safeguards, limitations, and practical challenges.
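The step of normalising measures and pairing them with a Gini index can be sketched as follows. The member names, the line counts, and the choice of share-based normalisation are hypothetical; the abstract does not specify how the framework normalises or aggregates.

```python
from typing import Dict, Sequence

def gini(values: Sequence[float]) -> float:
    """Gini index of non-negative values: 0 means perfectly equal,
    (n - 1) / n means one member holds everything."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

def to_shares(raw: Dict[str, float]) -> Dict[str, float]:
    """Assumed normalisation: each member's share of the team total."""
    total = sum(raw.values())
    return {name: (value / total if total else 0.0) for name, value in raw.items()}

# Hypothetical lines-of-code counts for a four-person team.
lines_committed = {"ana": 0, "ben": 0, "chen": 0, "dara": 10}
shares = to_shares(lines_committed)
imbalance = gini(list(shares.values()))
balanced = gini([1.0, 1.0, 1.0, 1.0])
```

A high index on one measure is only a marker that warrants investigation; it does not by itself establish an unfair workload.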

2511.02525 2026-05-27 cs.LG cs.AI

An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

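Affiliations * National Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology

AI summary Proposes an end-to-end method based on deep reinforcement learning with heterogeneous query (DRLHQ), the first to apply an encoder-decoder structure to the capacitated location-routing problem (CLRP) and its open variant (OCLRP). A heterogeneous querying attention mechanism dynamically coordinates location and routing decisions, and the method outperforms traditional methods and existing DRL baselines on synthetic and benchmark datasets.

Details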
English abstract

Capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization that require making location and routing decisions simultaneously. In CLRPs, the complex constraints and the intricate relationships between various decisions make the problem challenging to solve. Deep reinforcement learning (DRL) has been extensively applied to the vehicle routing problem and its variants, while research on CLRPs remains largely unexplored. In this paper, we propose DRL with heterogeneous query (DRLHQ) to solve both the CLRP and the open CLRP (OCLRP). We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate the CLRPs as a Markov decision process tailored to various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency across location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to various decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate superior solution quality and better generalization performance of our proposed approach over representative traditional and DRL-based baselines in solving both CLRP and OCLRP.

2510.23486 2026-05-27 cs.LG

Learning to Reason Efficiently with Discounted Reinforcement Learning

Alex Ayoub, Kavosh Asadi, Dale Schuurmans, Csaba Szepesvári, Karim Bouyarmane

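Affiliations * Amazon; University of Alberta

AI summary To address the high computational cost of large reasoning models consuming too many tokens, proposes penalizing reasoning tokens with discounted reinforcement learning (interpretable as a small token cost), combined with an analysis of Blackwell optimality, to shorten reasoning chains while preserving accuracy.

Details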
English abstract

Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. More broadly, in goal-reaching sequential decision problems, we often want to reach the goal quickly, and LRM reasoning can be viewed through this lens. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning, analogous to preferring shorter successful trajectories in a stochastic shortest path problem. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
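The link between discounting and a small per-token cost can be seen in a minimal sketch. It assumes a terminal reward of 1 for a correct answer, discounted once per generated token; this reward design and the function names are assumptions for illustration, not necessarily the paper's setup.

```python
def discounted_return(correct: bool, num_tokens: int, gamma: float) -> float:
    """Assumed reward: 1 for a correct final answer, discounted by gamma per token."""
    return gamma ** num_tokens if correct else 0.0

def token_cost_return(correct: bool, num_tokens: int, gamma: float) -> float:
    """First-order view: gamma^T is about 1 - (1 - gamma) * T when (1 - gamma) * T is small,
    so discounting acts like a per-token cost of (1 - gamma)."""
    return 1.0 - (1.0 - gamma) * num_tokens if correct else 0.0

gamma = 0.999                       # hypothetical discount factor
short_ok = discounted_return(True, 100, gamma)
long_ok = discounted_return(True, 400, gamma)
long_wrong = discounted_return(False, 400, gamma)
approx_gap = abs(short_ok - token_cost_return(True, 100, gamma))
```

Under this reward, a short correct response scores higher than a long correct one, and any correct response still scores higher than an incorrect one, so brevity is encouraged without trading away correctness.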

2510.10774 2026-05-27 cs.SD cs.AI cs.HC cs.LG

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

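Affiliations * School of Electrical and Computer Engineering, University of Tehran; Institute for Research in Fundamental Sciences (IPM)

AI summary Presents ParsVoice, currently the largest publicly available Persian speech-text corpus, built with a scalable pipeline that constructs high-quality data from long-form audiobooks for training multi-speaker TTS systems, and validates its effectiveness for zero-shot multi-speaker TTS.

Details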
English abstract

Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), speech-language modelling, and low-resource speech processing. We introduce ParsVoice, the largest publicly available Persian speech-text corpus tailored for training multi-speaker TTS systems, along with a scalable pipeline to construct high-quality speech-text data from long-form audiobook recordings. The pipeline combines a fine-tuned ParsBERT sentence-completion classifier, ASR-based boundary optimization, punctuation restoration, speaker identification, and a multi-dimensional quality assessment that covers both audio and Persian-specific text properties. The resulting release contains a 2,200-hour TTS-ready subset with 1.36 million aligned segments from 1,815 automatically identified speaker IDs, making it more than 25 times larger than the previously largest open Persian TTS dataset. To validate the corpus, we fine-tune XTTS, a zero-shot multilingual TTS model that operates directly on raw Persian text without phoneme representations, achieving a naturalness MOS of 3.6/5 and speaker similarity MOS of 4.0/5. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.

2509.04310 2026-05-27 cs.AI

EvoEmo: Towards Evolved Emotional Policies for Adversarial LLM Agents in Multi-Turn Price Negotiation

Yunbo Long, Liming Xu, Lukas Beckenbauer, Yuhan Liu, Alexandra Brintrup

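Affiliations * Department of Engineering, University of Cambridge; Rotman School of Management, University of Toronto; TUM School of Management, Technical University of Munich; The Alan Turing Institute, London, UK

AI summary Proposes EvoEmo, an evolutionary reinforcement learning framework that models emotional state transitions as a Markov Decision Process and uses population-based genetic optimization to dynamically optimize emotional expression in multi-turn negotiation, markedly improving the negotiation success rate, efficiency, and buyer savings of LLM agents.

Details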
English abstract

Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in complex, multi-turn negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference-driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines (vanilla strategies and fixed-emotion strategies) for benchmarking emotion-aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. These findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation. The code is available at https://github.com/Yunbo-max/EvoEmo.

2510.09606 2026-05-27 cs.CV

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue

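Affiliations * Multimedia Laboratory, The Chinese University of Hong Kong; Beijing University of Posts; Hong Kong University of Science

AI summary Proposes an all-scale spatial reasoning solution that combines a structured knowledge system, scale-aware modeling, and a progressive training paradigm, and builds the SpaceVista-1M dataset (38K video scenes, about 1M spatial QA pairs) and the SpaceVista-7B model, which shows strong generalization on 5 benchmarks.

Comments Project Page: https://peiwensun2000.github.io/mm2km/

Details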
English abstract

With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as the first attempt to broaden the all-scale spatial intelligence of MLLMs to the best of our knowledge. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We then build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to the potential knowledge conflict. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released on https://peiwensun2000.github.io/mm2km .

2510.09405 2026-05-27 cs.LG

Cross-Receiver Generalization for RF Fingerprint Identification via Feature Disentanglement and Adversarial Training

Yuhao Pan, Xiucheng Wang, Fushuo Huo, Nan Cheng, Wenchao Xu

Affiliations * Division of Integrative Systems and Design, Hong Kong University of Science and Technology, Hong Kong, China; State Key Laboratory of ISN and School of Telecommunications Engineering, Xidian University, Xi’an 710071, China; School of Cyber Science and Engineering, Southeast University, Nanjing, China

AI summary Proposes a feature disentanglement and adversarial training framework that separates transmitter and receiver features and suppresses receiver information, addressing the performance degradation caused by receiver replacement in RF fingerprint identification.

Details
English abstract

Radio frequency fingerprint identification (RFFI) is a key technique for wireless network security, leveraging intrinsic hardware imperfections to enable transmitter identification. Although deep neural networks are effective at extracting discriminative RF features, their performance is significantly affected by receiver-induced variability in practical deployments. In real-world scenarios, RF signals inherently entangle transmitter-specific characteristics with receiver-dependent distortions, leading models to capture receiver-related patterns when training and evaluation are conducted on the same device. Consequently, replacing the receiver during deployment often results in notable performance degradation. To address this issue, we propose a cross-receiver robust RFFI framework that explicitly disentangles transmitter-specific and receiver-specific representations. The proposed method integrates adversarial domain alignment with receiver-aware regularization to suppress residual receiver information in transmitter features while enforcing intra-receiver consistency in receiver-specific representations. A feature separation constraint is further introduced to decouple the two components in the latent space. Extensive experiments on multi-receiver WiFi datasets demonstrate that the proposed method consistently outperforms state-of-the-art baselines under cross-receiver evaluation and significantly improves robustness to receiver replacement.

2510.08932 2026-05-27 cs.LG cs.IR

MATT-CTR: Unleashing a Model-Agnostic Test-Time Paradigm for CTR Prediction with Confidence-Guided Inference Paths

Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, Xiaoyi Zeng

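Affiliations * Alibaba Group

AI summary Proposes MATT, a model-agnostic test-time paradigm that uses confidence scores of feature combinations to generate multiple inference paths and aggregate their predictions, mitigating the impact of low-confidence features on CTR prediction.

Details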
English abstract

Recently, a growing body of research has focused on either optimizing CTR model architectures to better model feature interactions or refining training objectives to aid parameter learning, thereby achieving better predictive performance. However, previous efforts have primarily focused on the training phase, largely neglecting opportunities for optimization during the inference phase. Infrequently occurring feature combinations, in particular, can degrade prediction performance, leading to unreliable or low-confidence outputs. To unlock the predictive potential of trained CTR models, we propose a Model-Agnostic Test-Time paradigm (MATT), which leverages the confidence scores of feature combinations to guide the generation of multiple inference paths, thereby mitigating the influence of low-confidence features on the final prediction. Specifically, to quantify the confidence of feature combinations, we introduce a hierarchical probabilistic hashing method to estimate the occurrence frequencies of feature combinations at various orders, which serve as their corresponding confidence scores. Then, using the confidence scores as sampling probabilities, we generate multiple instance-specific inference paths through iterative sampling and subsequently aggregate the prediction scores from multiple paths to conduct robust predictions. Finally, extensive offline experiments and online A/B tests strongly validate the compatibility and effectiveness of MATT across existing CTR models.
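The idea of using confidence scores as sampling probabilities and then aggregating over several inference paths can be sketched in a few lines. This simplified version samples individual features rather than feature combinations of several orders, omits the hierarchical probabilistic hashing step, and averages the path scores; `matt_like_predict`, the toy model, and the feature names are all hypothetical.

```python
import random
from typing import Callable, Dict

def matt_like_predict(model: Callable[[Dict[str, str]], float],
                      features: Dict[str, str],
                      confidence: Dict[str, float],
                      num_paths: int = 8, seed: int = 0) -> float:
    """Average the model's score over sampled inference paths.

    Each path keeps a feature with probability equal to its confidence score,
    so rarely seen (low-confidence) features influence fewer paths.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(num_paths):
        path = {name: value for name, value in features.items()
                if rng.random() < confidence[name]}
        scores.append(model(path))
    return sum(scores) / len(scores)

def toy_ctr_model(path: Dict[str, str]) -> float:
    # Hypothetical stand-in for a trained CTR model: more features, higher score.
    return 0.2 * len(path)

features = {"user_id": "u1", "item_id": "i9", "rare_tag": "t404"}
all_trusted = matt_like_predict(toy_ctr_model, features,
                                {"user_id": 1.0, "item_id": 1.0, "rare_tag": 1.0})
rare_dropped = matt_like_predict(toy_ctr_model, features,
                                 {"user_id": 1.0, "item_id": 1.0, "rare_tag": 0.0})
```

Because the wrapper only calls the model on modified inputs, it leaves the trained model's parameters untouched, which is what makes a test-time approach of this kind model-agnostic.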

2506.23274 2026-05-27 cs.LG cs.AI

Real-Time Progress Prediction in Reasoning Language Models

推理语言模型中的实时进度预测

Hans Peter Lyngsøe Raaschou-Jensen, Constanza Fierro, Anders Søgaard

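Affiliations: Department of Computer Science, University of Copenhagen

AI summary: Studies real-time progress prediction in reasoning language models by training linear probes on discretized reasoning trajectories and fine-tuning models to generate 0-100% progress estimates, reaching 0.161 MAE on mathematical reasoning traces.

Details
AI Chinese abstract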

最近的推理语言模型,特别是那些采用长潜在思维链的模型,在复杂的智能体任务上表现出色。然而,随着这些模型在越来越长的时间范围内运行,其内部进展对用户变得不透明,使得期望管理和实时监督变得困难。在这项工作中,我们研究了对此类模型进行实时进度预测的可行性。我们首先通过离散化推理轨迹并训练线性探针对推理状态进行分类,测试隐藏状态是否编码进度信息。然后,我们微调模型以在思维链推理过程中生成0-100%的进度估计。我们最强的进度报告检查点在数学推理轨迹上达到了0.161的平均绝对误差,并在此设置中优于位置基线。最后,我们通过测量相同部分展开中隐含进度值的变化程度,量化了进度标签的内在模糊性。这种模糊性在Qwen3-4B中最低,其延续产生的展开离散度最小,表明更大的模型可以通过减少剩余解决方案长度的变化来使进度标签更稳定。

English abstract

Recent reasoning language models, particularly those that employ long latent chains of thought, achieve strong performance on complex agentic tasks. However, as these models operate over increasingly long time horizons, their internal progress becomes opaque to users, making expectation management and real-time oversight difficult. In this work, we investigate whether real-time progress prediction is feasible for such models. We first test whether hidden states encode progress information by discretizing reasoning trajectories and training a linear probe to classify reasoning states. We then fine-tune models to generate progress estimates from 0-100% during chain-of-thought reasoning. Our strongest progress-reporting checkpoint reaches 0.161 MAE on mathematical reasoning traces and outperforms position baselines in this setting. Finally, we quantify the intrinsic ambiguity of progress labels by measuring how much the implied progress value varies from the same partial rollout. This ambiguity is lowest for Qwen3-4B, whose continuations produce the smallest rollout dispersion, suggesting that larger models can make progress labels more stable by reducing variation in remaining solution length.
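The quantities named in the abstract (progress labels, discretized reasoning states, a position baseline, and MAE) can be made concrete with a small sketch. The definitions below are assumptions for illustration: progress is taken as the fraction of the trace generated so far, the bins are equal-width, and the position baseline divides the current position by an expected trace length. The abstract does not state which definitions the paper uses.

```python
from statistics import fmean


def progress_labels(trace_len):
    """Progress after each token, as a fraction of the full trace length."""
    return [(t + 1) / trace_len for t in range(trace_len)]


def discretize(progress, n_bins=10):
    """Map a progress value in [0, 1] to a class index for a probe to predict."""
    return min(int(progress * n_bins), n_bins - 1)


def position_baseline(position, expected_len):
    """Guess progress from token position alone, capped at 100%."""
    return min(1.0, (position + 1) / expected_len)


def mean_absolute_error(predicted, actual):
    return fmean(abs(p - a) for p, a in zip(predicted, actual))


# A trace of 4 tokens scored by a baseline that expects 8 tokens.
actual = progress_labels(4)
predicted = [position_baseline(t, expected_len=8) for t in range(4)]
baseline_mae = mean_absolute_error(predicted, actual)
```

In this toy case the baseline underestimates progress at every step because the trace ends earlier than expected. A position baseline cannot correct for that kind of error, whereas a predictor that reads the model's hidden states or outputs might.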

2510.06843 2026-05-27 cs.CL cs.AI

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

自信号驱动的多LLM辩论以实现高效准确的推理

Xuhang Chen, Zhifan Song, Deyi Ji, Shuo Gao, Lanyun Zhu

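Affiliations: University of Cambridge; Sorbonne Université; University of Science and Technology of China; Beihang University; Nanyang Technological University

AI summary: Proposes a method that uses two kinds of self-signals, model-level confidence and token-level semantic focus, to adaptively guide the multi-LLM debate process, improving accuracy while reducing token consumption.

Details
AI Chinese abstract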

大型语言模型(LLMs)在多样化的应用领域展现了令人印象深刻的能力。最近的工作探索了多LLM智能体辩论(MAD),通过使多个LLM迭代讨论和细化响应来增强性能。然而,现有的MAD方法主要关注利用外部结构(如辩论图)和LLM作为评判者,而忽略了生成过程中出现的自信号(如token logits和注意力)。这种遗漏导致了冗余计算和潜在的性能下降。在本文中,我们将重点转移到多LLM辩论的自信号上,并引入了一种自信号驱动的多LLM辩论(SID),它利用两种类型的自信号:模型级置信度和token级语义焦点,来自适应地引导辩论过程。我们的方法使高置信度智能体能够在模型级别提前退出,并基于注意力机制压缩冗余辩论内容。我们在多个具有挑战性的基准测试上,对各种LLMs和多模态LLMs评估了我们的方法。实验结果表明,我们的方法不仅在准确性上优于现有的MAD技术,而且还减少了token消耗,突显了利用自信号在提高多智能体辩论系统的性能和效率方面的有效性。我们的代码将在 https://github.com/xuhang2019/SID 上提供。

English abstract

Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at https://github.com/xuhang2019/SID.
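The two self-signals described above can be illustrated with a short sketch. It rests on assumptions the abstract does not confirm: model-level confidence is computed here as the mean of the top softmax probability over generated tokens, the early-exit threshold of 0.9 is arbitrary, and compression keeps the most-attended tokens in their original order. The function names are hypothetical and do not come from the SID repository.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def model_confidence(step_logits):
    """Mean top-token probability across all generation steps."""
    return sum(max(softmax(logits)) for logits in step_logits) / len(step_logits)


def should_exit_early(step_logits, threshold=0.9):
    """A confident agent skips further debate rounds."""
    return model_confidence(step_logits) >= threshold


def compress_by_attention(tokens, attention, keep_ratio=0.5):
    """Keep the most-attended tokens, preserving their original order."""
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:k])]
```

Both functions take per-token quantities (logits, attention weights) that an LLM produces as a side effect of generation. That is why such signals add little extra computation compared with calling a separate judge model.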