arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11983 2026-05-13 cs.LG stat.ML

QDSB: Quantized Diffusion Schrödinger Bridges

Tobias Fuchs, Florian Kalinke, Nadja Klein

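AI summary: Learning generative models when the source and target distributions are specified only through unpaired samples is becoming increasingly important. This paper proposes quantized diffusion Schrödinger bridges (QDSB), a method for accelerating the training of simulation-free Schrödinger bridges. It computes the endpoint coupling on anchor-quantized distributions and maps the result back to the original data points through cell-wise sampling, which reduces computational cost while keeping the global transport structure stable. Experiments show that QDSB substantially improves training efficiency while preserving sample quality.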

Abstract

Learning generative models in settings where the source and target distributions are only specified through unpaired samples is gaining in importance. Here, one frequently used class of models is Schrödinger bridges (SBs), which represent the most likely evolution between the two endpoint distributions. To accelerate training, simulation-free SBs avoid the path simulation of the original SB models. However, learning simulation-free SBs requires paired data; a coupling of the source and target samples is obtained as the solution of the entropic optimal transport (OT) problem. As obtaining the optimal global coupling is infeasible in many practical cases, the entropic OT problem is iteratively solved on minibatches instead. Still, the repeated cost remains substantial and the locality can distort the global transport geometry. We propose quantized diffusion Schrödinger bridges (QDSB), which compute the endpoint coupling on anchor-quantized endpoint distributions and lift the resulting plan back to original data points through cell-wise sampling. We show that the regularized optimal coupling is stable w.r.t. anchor quantization, with an error controlled by the quality of the anchor approximation. In real-world experiments, QDSB matches the sample quality of existing baselines while requiring substantially less time. Code and data are available at github.com/mathefuchs/qdsb.
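
The quantize, couple, and lift steps named in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration and not the authors' implementation (their code is at the URL above): the anchors are taken as given, nearest-anchor assignment and uniform sampling within a cell are assumptions, and all function names are hypothetical.

```python
import math
import random

def quantize(points, anchors):
    """Assign each point to its nearest anchor; return index cells and anchor weights."""
    cells = [[] for _ in anchors]
    for i, x in enumerate(points):
        k = min(range(len(anchors)),
                key=lambda a: sum((p - q) ** 2 for p, q in zip(x, anchors[a])))
        cells[k].append(i)
    weights = [len(c) / len(points) for c in cells]
    return cells, weights

def sinkhorn(a, b, cost, eps=0.5, iters=200):
    """Entropic OT plan between histograms a and b via Sinkhorn iterations."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * len(a), [1.0] * len(b)
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(len(b))) for i in range(len(a))]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(len(a))) for j in range(len(b))]
    return [[u[i] * K[i][j] * v[j] for j in range(len(b))] for i in range(len(a))]

def sample_pair(plan, src_cells, tgt_cells, rng):
    """Draw an anchor pair from the plan, then lift it by sampling one point per cell."""
    pairs = [(i, j) for i in range(len(plan)) for j in range(len(plan[0]))]
    i, j = rng.choices(pairs, weights=[plan[i][j] for i, j in pairs], k=1)[0]
    return rng.choice(src_cells[i]), rng.choice(tgt_cells[j])

# Toy 1D data: two source clusters and two target clusters (made-up values).
src = [[0.0], [0.1], [5.0], [5.1]]
tgt = [[1.0], [1.1], [6.0], [6.1]]
src_anchors = [[0.05], [5.05]]
tgt_anchors = [[1.05], [6.05]]

src_cells, a = quantize(src, src_anchors)
tgt_cells, b = quantize(tgt, tgt_anchors)
cost = [[sum((p - q) ** 2 for p, q in zip(s, t)) for t in tgt_anchors]
        for s in src_anchors]
plan = sinkhorn(a, b, cost)  # coupling on anchors only, not on all points
i_src, i_tgt = sample_pair(plan, src_cells, tgt_cells, random.Random(0))
```

The OT problem here has size (number of source anchors) x (number of target anchors) rather than one entry per pair of data points, which is where the cost saving in this sketch comes from.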

2605.11978 2026-05-13 cs.CL

On Predicting the Post-training Potential of Pre-trained LLMs

Xiaoyuan Li, Yubo Ma, Kexin Yang, Moxin Li, Keqin Bao, Wenie Wang, Fuli Feng, Dayiheng Liu

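AI summary: This paper studies how to predict, at the pre-training stage, how a large language model will perform after post-training, in order to make model selection more efficient. The authors propose RuDE (Rubric-based Discriminative Evaluation), which builds fine-grained contrastive pairs to assess a base model's plasticity, guided by a 4C Taxonomy. Experiments show a correlation above 90% with post-training performance, and validation via reinforcement learning supports the framework's effectiveness, offering a compute-efficient approach to foundation model development.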

Comments Under Review

Abstract

The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to inefficient model selection. We address this by introducing a new task of predicting post-training potential - forecasting a base model's performance before post-training. We propose RuDE (Rubric-based Discriminative Evaluation), a unified framework that bypasses the generation gap of base models by leveraging response discrimination. Guided by our systematic 4C Taxonomy, RuDE constructs controlled contrastive pairs across diverse domains by fine-grained rubric violations. Extensive experiments demonstrate a correlation greater than 90% with post-training performance. Crucially, validation via Reinforcement Learning (RL) confirms that RuDE effectively identifies high-potential smaller models that outperform larger counterparts, offering a compute-efficient mechanism for foundation model development.

2605.11977 2026-05-13 cs.CV

Optimizing 4D Wires for Sparse 3D Abstraction

Dong-Yi Wu, Tong-Yee Lee

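AI summary: This paper presents a unified framework for 3D geometric abstraction based on a single continuous 4D wire, a B-spline parameterized by spatial coordinates and variable width. Whereas earlier methods use many independent curve segments and tend to produce fragmented structures, this method enforces global topological coherence and yields cleaner, more structurally coherent 3D abstractions. The authors introduce a differentiable rendering pipeline that supports gradient-based optimization and report higher semantic fidelity and structural coherence on tasks such as image-to-3D abstraction and multi-view wire art generation.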

Abstract

We present a unified framework for 3D geometric abstraction using a single continuous 4D wire, parameterized as a B-spline with spatial coordinates and variable width $(x,y,z,w)$. Existing approaches typically represent shapes as collections of many independent curve segments, which often leads to fragmented structures and limited physical realizability. In contrast, we show that a single continuous spline is sufficiently expressive to capture complex volumetric forms while enforcing global topological coherence. By imposing continuity, our method transforms 3D sketching from a local density-accumulation process into a global routing problem, providing a strong inductive bias toward cleaner aesthetics and improved structural coherence. To enable gradient-based optimization, we introduce a differentiable rendering pipeline that efficiently rasterizes variable-width curves with bounded projection error. This formulation supports robust optimization using modern guidance signals such as Score Distillation Sampling (SDS) or CLIP. We demonstrate applications including image-to-3D abstraction, multi-view wire art generation, and differentiable stylized surface filling. Experiments show that our unified representation produces structures with higher semantic fidelity and improved structural coherence compared to approaches based on collections of discrete curves.
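
To make the $(x,y,z,w)$ parameterization concrete, the sketch below evaluates one continuous curve whose control points carry a width as a fourth coordinate. It assumes a uniform cubic B-spline; the abstract does not state the degree or knot vector the authors use, and the function names and control points are hypothetical.

```python
def cubic_bspline_point(ctrl, seg, t):
    """Evaluate a uniform cubic B-spline on segment `seg` at local parameter t in [0, 1]."""
    basis = (
        (1 - t) ** 3 / 6,
        (3 * t ** 3 - 6 * t ** 2 + 4) / 6,
        (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6,
        t ** 3 / 6,
    )
    pts = ctrl[seg:seg + 4]
    # Blend all four coordinates (x, y, z, w) with the same basis weights.
    return tuple(sum(b * p[d] for b, p in zip(basis, pts)) for d in range(4))

def sample_wire(ctrl, per_segment=8):
    """Sample the whole wire; consecutive segments join smoothly by construction."""
    samples = []
    for seg in range(len(ctrl) - 3):
        for k in range(per_segment):
            samples.append(cubic_bspline_point(ctrl, seg, k / per_segment))
    samples.append(cubic_bspline_point(ctrl, len(ctrl) - 4, 1.0))
    return samples

# Five made-up control points: (x, y, z, width).
ctrl = [
    (0.0, 0.0, 0.0, 1.0),
    (1.0, 0.0, 0.0, 1.0),
    (1.0, 1.0, 0.0, 0.5),
    (0.0, 1.0, 1.0, 0.5),
    (0.0, 0.0, 1.0, 1.0),
]
wire = sample_wire(ctrl, per_segment=8)
```

Because the basis weights are non-negative and sum to one, the interpolated width always stays between the smallest and largest control-point width, so a single curve can thicken and thin along its length without separate segments.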

2605.11974 2026-05-13 cs.LG

Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

Xu Chu, Guanyu Wang, Zhijie Tan, Xinrong Chen, Ziyu Li, Tong Mo, Weiping Li

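AI summary: Large language models (LLMs) are biased by the order of input elements, which limits their use in settings such as in-context learning and retrieval-augmented generation. To address this, the paper proposes Dual Group Advantage Optimization (DGAO), a reinforcement-learning method that balances an intra-group accuracy advantage against an inter-group stability advantage to improve accuracy and order stability together. The paper also introduces two new metrics, Consistency Rate and Overconfidence Rate, for more comprehensive evaluation. Experiments show that DGAO substantially improves order fairness without sacrificing task performance.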

Abstract

Large Language Models (LLMs) suffer from order bias, where their performance is affected by the arrangement order of input elements. This unfairness limits the model's applications in scenarios such as in-context learning and Retrieval-Augmented Generation (RAG). Recent studies attempt to obtain optimal or suboptimal arrangements based on statistical results or using dataset-based search, but these methods increase inference overhead while leaving the model's inherent order bias unresolved. Other studies mitigate order sensitivity through supervised fine-tuning using augmented training sets with multiple order variants, but often at the cost of accuracy, trapping the model in consistent yet incorrect hallucinations. In this paper, we propose Dual Group Advantage Optimization (DGAO), which aims to improve model accuracy and order stability simultaneously. DGAO calculates and balances intra-group relative accuracy advantage and inter-group relative stability advantage, rewarding the policy model for generating order-stable and correct outputs while penalizing order-sensitive or incorrect responses. This marks the first time reinforcement learning has been used to mitigate LLMs' order sensitivity. We also propose two new metrics, Consistency Rate and Overconfidence Rate, to reveal the pseudo-stability of previous methods and guide more comprehensive evaluation. Extensive experiments demonstrate that DGAO achieves superior order fairness while improving performance on RAG, mathematical reasoning, and classification tasks. Our code is available at: https://github.com/Hyalinesky/DGAO.

2605.11972 2026-05-13 cs.RO cs.AI cs.ET cs.SY eess.SY

Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation

Mohammad Khoshkdahan, John Pravin Arockiasamy, Andy Flores Comeca, Alexey Vinel

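AI summary: Targeting collisions at non-line-of-sight intersections, this work proposes a traffic moderation system that combines collective perception with a cooperative robot. The system fuses perception from a dual-camera unit and V2X messages to monitor the road in real time, and when it detects a potential collision risk the robot issues a STOP gesture to keep a vehicle from merging unsafely. Experiments show that the approach improves traffic safety under non-line-of-sight conditions, filling the gap that existing V2X technology leaves for unconnected vehicles.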

Comments Accepted for publication in the Proceedings of the 2026 IEEE Vehicular Technology Conference (VTC2026-Spring)

Abstract

Collisions at non-line-of-sight (NLOS) intersections remain a major safety concern because drivers have limited visibility of approaching traffic. V2X-based warnings can reduce these risks, yet many vehicles are not equipped with V2X and drivers may ignore in-vehicle alerts. Collective perception (CP) can compensate for low V2X penetration by extending the awareness of connected vehicles, but it cannot influence unconnected vehicles. To fill this gap, our work introduces a complementary concept that adds a cooperative humanoid robot as an active traffic moderator capable of physically stopping a vehicle that attempts to merge into an unseen traffic stream. The system operates on two parallel perception pathways. A dual-camera infrastructure unit detects the position, speed and motion of approaching vehicles and transmits this information to the robot as a collective perception message (CPM). The robot also receives cooperative awareness messages (CAM) from connected vehicles through its onboard V2X unit and can act as a relay for decentralized environmental notification messages (DENM) when safety events originate elsewhere along the road. A fusion module combines these streams to maintain a robust real-time view of the main road. A Zone of Danger (ZoD) is defined and used to predict whether an approaching vehicle creates a collision risk for a merging road user. When such a risk is detected, the robot issues a human-like STOP gesture and blocks the merging path until the hazard disappears. The full system was deployed at the Future Mobility Park (FMP) in Rotterdam. Experiments show that the combined vision and V2X perception allows the robot to detect approaching vehicles early, predict hazards reliably and prevent unsafe merges in real-world NLOS conditions.

2605.11967 2026-05-13 cs.CV

H2G: Hierarchy-Aware Hyperbolic Grouping for 3D Scenes

ByungHa Ko, Youngmin Lee, Dong Hwan Kim

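AI summary: This paper proposes H2G, a hierarchy-aware hyperbolic grouping method for multi-granularity grouping of 3D scenes without semantic labels. It converts similarity cues from 2D foundation models into hierarchical supervision and embeds that supervision in a hyperbolic feature field, which is better suited to tree-like structures. A hierarchy-aware objective jointly models fine-grained parts, object structure, and hierarchical ordering, so that multiple grouping levels are represented in a single feature space.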

Abstract

Hierarchical 3D grouping aims to recover scene groups across multiple granularities, from fine object parts to complete objects, without relying on semantic labels or a fixed vocabulary. The main challenge is to transform 2D foundation-model cues into coherent hierarchy supervision and embed that hierarchy in a 3D representation. We propose H2G, a hyperbolic affinity field for hierarchical 3D grouping. Our method derives semantically organized tree supervision by interpreting foundation-model affinities through Dasgupta's objective for similarity-based hierarchical clustering. This supervision is distilled into a single Lorentz hyperbolic feature field, whose geometry is well suited for tree-like branching structures. A hierarchy-aware objective aligns the field with fine-level assignments, coarse object structure, compact feature clusters, and LCA (Lowest Common Ancestor) ordering. This formulation represents multiple grouping levels in one feature space, enabling semantic hierarchical grouping grounded in 2D foundation-model knowledge.
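
For readers unfamiliar with the Lorentz model mentioned above, the sketch below shows the standard hyperboloid distance that such a feature field would be measured in. It assumes curvature -1 and lifts Euclidean feature vectors onto the hyperboloid; how H2G itself parameterizes or trains its field is not specified in the abstract, and the function names are hypothetical.

```python
import math

def lift(v):
    """Map a Euclidean vector v to the hyperboloid {x : <x, x>_L = -1, x0 > 0}."""
    return (math.sqrt(1.0 + sum(c * c for c in v)),) + tuple(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L); clamped at 1 against rounding error."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift((0.0, 0.0))
p = lift((math.sinh(1.0), 0.0))   # a point at hyperbolic distance 1 from the origin
q = lift((0.3, 0.4))
d_origin_p = lorentz_distance(origin, p)
```

Distances in this space grow exponentially with radius, which is why tree-like hierarchies with many branches can be embedded with low distortion.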

2605.11964 2026-05-13 cs.CL

Enhancing Target-Guided Proactive Dialogue Systems via Conversational Scenario Modeling and Intent-Keyword Bridging

Maodong Li, Yancui Li, Fang Kong

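AI summary: This work aims to improve the ability of target-guided proactive dialogue systems to steer conversations toward pre-defined targets such as keywords or topics. It jointly models user profiles and domain knowledge as conversational scenarios and introduces intent-keyword bridging to predict keywords for upcoming dialogue turns, giving the system higher-level and more flexible guidance for generation. Experiments show substantial improvements in proactivity, fluency, and informativeness, narrowing the gap with real conversations.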

Comments 21 pages, 9 Figures, 18 Tables

Abstract

A target-guided proactive dialogue system aims to steer conversations proactively toward pre-defined targets, such as designated keywords or specific topics. During guided conversations, dynamically modeling conversational scenarios and intent keywords to guide system utterance generation is beneficial; however, existing work largely overlooks this aspect, resulting in a mismatch with the dynamics of real-world conversations. In this paper, we jointly model user profiles and domain knowledge as conversational scenarios to introduce a scenario bias that dynamically influences system utterances, and employ intent-keyword bridging to predict intent keywords for upcoming dialogue turns, providing higher-level and more flexible guidance. Extensive automatic and human evaluations demonstrate the effectiveness of conversational scenario modeling and intent-keyword bridging, yielding substantial improvements in proactivity, fluency, and informativeness for target-guided proactive dialogue systems, thereby narrowing the gap with real-world interactions.

2605.11963 2026-05-13 cs.CV

What Does It Mean for a Medical AI System to Be Right?

Antony Gitau

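AI summary: This paper examines what it means for a medical AI system to be "right", using the automatic classification of plasma cells in bone marrow smears for multiple myeloma diagnosis as a case study. The author argues that correctness in medical AI is not determined by benchmark performance alone but is a multi-dimensional concept involving data labeling, model explainability, the clinical relevance of evaluation metrics, and the distribution of accountability in human-AI collaboration. Drawing on philosophy of science and research ethics, the paper highlights the instability of ground-truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and automation bias in high-pressure settings.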

Comments Part of a PhD ethics course

Abstract

This paper examines what it means for a medical AI system to be right by grounding the question in a specific clinical context: the automatic classification of plasma cells in digitized bone marrow smears for the diagnosis of multiple myeloma. Drawing on philosophy of science and research ethics, the paper argues that correctness in medical AI is not a singular property reducible to benchmark performance, but a multi-dimensional concept involving the availability of expertly labeled medical datasets, the explainability and interpretability of model outputs, the clinical meaningfulness of evaluation metrics, and the distribution of accountability in human-AI workflows. As such, the paper develops this argument through four interrelated themes: the instability of ground truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and the risk of automation bias in time-pressured clinical settings.

2605.11960 2026-05-13 cs.CV

Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

Gengluo Li, Shangpin Peng, Xingyu Wan, Chengquan Zhang, Hao Feng, Xin Xu, Pian Wu, Bang Li, Zengmao Ding, Yongge Liu, Yipei Ye, Yang Yang, Zhan Shu, Guojun Yan, Zhe Li, Can Ma, Weiping Wang, Yu Zhou, Han Hu

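AI summary: This work introduces Chronicles-OCR, described as the first comprehensive benchmark for evaluating the cross-temporal perception of vision large language models, focusing on the visual challenges posed by the evolution of Chinese characters across seven scripts. The dataset contains 2,800 strictly balanced images on media ranging from tortoise shells to paper. Using a proposed Stage-Adaptive Annotation Paradigm, it defines tasks including cross-period character spotting and ancient text parsing, with the goal of exposing the limitations of current models on historical scripts and encouraging more robust, evolution-aware perception research.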

Abstract

Vision Large Language Models (VLLMs) have achieved remarkable success in modern text-rich visual understanding. However, their perceptual robustness in the face of the continuous morphological evolution of historical writing systems remains largely unexplored. Existing ancient text datasets typically focus on isolated historical periods, failing to capture the systematic visual distribution shifts spanning thousands of years. To bridge this gap and empower Digital Humanities, we introduce Chronicles-OCR, the first comprehensive benchmark specifically designed to evaluate the cross-temporal visual perception capabilities of VLLMs across the complete evolutionary trajectory of Chinese characters, known as the Seven Chinese Scripts. Curated in collaboration with top-tier institutional domain experts, the dataset comprises 2,800 strictly balanced images encompassing highly diverse physical media, ranging from tortoise shells to paper-based calligraphy. To accommodate the drastic morphological and topological variations across different historical stages, we propose a novel Stage-Adaptive Annotation Paradigm. Based on this, Chronicles-OCR formulates four rigorous quantitative tasks: cross-period character spotting, fine-grained archaic character recognition via visual referring, ancient text parsing, and script classification. By isolating visual perception from semantic reasoning, Chronicles-OCR provides an authoritative platform to expose the limitations of current VLLMs, paving the way for robust, evolution-aware historical text perception. Chronicles-OCR is publicly available at https://github.com/VirtualLUOUCAS/Chronicles-OCR.

2605.11959 2026-05-13 cs.CV cs.CL

Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models

Maham Nazir, Muhammad Aqeel, Richong Zhang, Francesco Setti

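AI summary: This paper studies multimodal abstractive summarization of instructional videos with vision-language models. The authors propose ClipSum, a framework that uses frozen visual features from pre-trained CLIP together with explicit temporal modeling and dimension-adaptive fusion. On YouCook2, ClipSum achieves a higher ROUGE-1 score than traditional approaches, supporting the importance of semantic alignment for cross-modal tasks.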

Comments Accepted to ICPR 2026

Abstract

Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine-tuned CLIP (32.3%), showing that preserving pre-trained alignment is more valuable than task-specific adaptation. https://github.com/aqeeelmirza/clipsum

2605.11951 2026-05-13 cs.RO

From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

Sheng Xu, Ruixing Jin, Huayi Zhou, Bo Yue, Guanren Qiao, Yunxin Tai, Yueci Deng, Kui Jia, Guiliang Liu

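AI summary: Although robotic manipulation has advanced considerably, task failures are inevitable in dynamic, unstructured environments, and reliable execution remains difficult. This paper proposes AgentChord, an agentic system that builds a task graph with anticipatory recovery branches so that potential failures are planned for in advance and handled quickly. Several specialized agents cooperate to run the system, which substantially improves the success rate and execution efficiency of manipulation tasks and strengthens the reliability and autonomy of real-world robotic systems.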

Comments 18 pages, accepted to RSS 2026

Abstract

Although robotic manipulation has made significant progress, reliable execution remains challenging because task failures are inevitable in dynamic and unstructured environments. To handle such failures, existing frameworks typically follow a stepwise detect-reason-recover pipeline, which often incurs high latency and limited robustness due to delayed reasoning and reactive planning. Inspired by the human capability to anticipate and proactively plan for potential failures, we introduce AgentChord, an agentic system that models a manipulation task as a directed task graph. Before execution, this graph is enriched with anticipatory recovery branches that specify context-aware corrective behaviors, enabling immediate and targeted responses when failures occur. Specifically, AgentChord operates through a choreography of specialized agents: a composer that structures the nominal task graph, an arranger that augments the graph with anticipatory recovery branches, and a conductor that compiles and coordinates executable transitions using low-latency monitors to detect deviations and trigger pre-compiled recoveries without re-planning. Empirical studies on diverse long-horizon bimanual manipulation tasks demonstrate that AgentChord substantially improves success rates and execution efficiency, advancing the reliability and autonomy of real-world robotic systems. The project page is available at: https://shengxu.net/AgentChord/.
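
The idea of a task graph with recovery branches attached before execution can be illustrated with a toy executor. This is only a sketch of the general pattern, not AgentChord's design: the graph schema, the node names, and the `run_task_graph` function are all hypothetical, and the monitor is reduced to a boolean callback.

```python
def run_task_graph(graph, start, execute, max_steps=100):
    """Follow nominal edges; on failure, jump to the pre-attached recovery node.

    No re-planning happens at run time: the recovery edge was written into the
    graph before execution started.
    """
    node, trace = start, []
    for _ in range(max_steps):          # guard against cyclic recovery loops
        if node is None:
            break
        trace.append(node)
        spec = graph[node]
        succeeded = execute(node)       # stands in for a low-latency monitor
        node = spec.get("next") if succeeded else spec.get("recover")
    return trace

# Nominal plan: grasp -> lift. A recovery branch for "grasp" is added up front.
graph = {
    "grasp": {"next": "lift", "recover": "regrasp"},
    "regrasp": {"next": "lift"},
    "lift": {"next": None},
}

def flaky_execute(node):
    """Pretend the first grasp slips; every other step succeeds."""
    return node != "grasp"

trace_with_failure = run_task_graph(graph, "grasp", flaky_execute)
trace_nominal = run_task_graph(graph, "grasp", lambda node: True)
```

In this toy run the failed grasp is followed immediately by the prepared `regrasp` step, which mirrors the abstract's point that a pre-compiled recovery avoids a detect-reason-recover round trip.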

2605.11939 2026-05-13 cs.CV

Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models

Boyang Guo, Liang Li, Lin Peng, Yuhan Gao, Xichun Sheng, Chenggang Yan

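AI summary: This paper proposes cluster-aware neural collapse prompt tuning (CPT), which aims to improve the generalization of vision-language models on long-tailed datasets. It constructs a cluster-invariant space and adds neural-collapse-driven discriminability optimization, making tail classes easier to separate while preserving overall generalization. Experiments show that CPT outperforms existing methods on multiple datasets, with especially strong results on long-tail classes.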

Abstract

Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization. First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features. This computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM. Second, we introduce neural-collapse-driven discriminability optimization with three losses: textual Equiangular Tight Frame (ETF) separation loss, class-wise convergence loss, and rotation stabilization loss. These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment. Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes.
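
The Equiangular Tight Frame referred to above has a simple closed form in the neural collapse literature: the simplex ETF, in which K unit vectors all have pairwise cosine -1/(K-1). The sketch below constructs it and checks that property. It shows only the target geometry; the exact form of CPT's three losses is not given in the abstract, and the function names are hypothetical.

```python
import math

def simplex_etf(k):
    """Rows m_i = sqrt(k / (k - 1)) * (e_i - 1/k * ones): k unit vectors in R^k."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

etf = simplex_etf(4)
self_similarity = dot(etf[0], etf[0])    # unit norm
cross_similarity = dot(etf[0], etf[1])   # equals -1 / (k - 1)
```

With equal pairwise angles, no class is closer to any other class than the rest, which is the maximally separated configuration a separation loss of this kind would push class prototypes toward.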

2605.11936 2026-05-13 cs.AI

From Noise to Diversity: Random Embedding Injection in LLM Reasoning

Heejun Kim, Seungpil Lee, Jewon Yeom, Jaewon Sok, Seonghyeon Park, Jeongjae Park, Taesup Kim, Sundong Kim

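AI summary: This work studies Random Soft Prompts (RSPs), random embeddings injected into LLM inputs, to separate how much of the soft-prompt effect comes from trained content and how much from the act of injection itself. By appending randomly drawn embedding vectors to the input, RSPs reach accuracy comparable to optimized soft prompts on math reasoning tasks without any training. The study shows that RSPs increase the diversity of early generated tokens and, combined with temperature sampling, improve Pass@N; it also extends the mechanism to training, showing that it applies to both inference and training.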

Comments 30 pages, 5 figures, 6 tables. Under review

Abstract

Recent soft prompt research has tried to improve reasoning by inserting trained vectors into LLM inputs, yet whether the gain comes from the learned content or from the act of injection itself has not been carefully separated. We study Random Soft Prompts (RSPs), which drop the training step entirely and append a freshly drawn sequence of random embedding vectors to the input. Each RSP vector is sampled from an isotropic Gaussian fitted to the entrywise mean and variance of the pretrained embedding table; the sequence carries no learned content, and yet reaches accuracy comparable to optimized soft prompts on math reasoning benchmarks in several settings. The mechanism unfolds in two stages: because attention has to absorb a never-seen-before random position, the distribution over the first few generated tokens flattens and reasoning trajectories branch, and as generation continues this influence dilutes naturally so the response commits to a single completion. We show that during inference RSPs lift early-stage token diversity and, combined with temperature sampling, widen Pass@N, the probability that at least one out of N attempts is correct. Beyond inference, we carry the same effect into DAPO training and demonstrate practical gains. Our contributions are: (i) RSP isolates the simplest form of soft prompt -- training-free, freshly resampled -- providing a unified lens for the structural effect of injection that variants otherwise differing in training and form all share; (ii) a theoretical and empirical validation of the underlying mechanism; and (iii) an extension from inference to training.
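
The sampling step described in the abstract can be sketched as below. It reads "entrywise mean and variance" as a single scalar mean and variance computed over every entry of the embedding table, which is an interpretation rather than something the abstract spells out; the tiny table, the prompt length, and the function names are hypothetical.

```python
import math
import random

def fit_entrywise_gaussian(table):
    """Scalar mean and variance over all entries of the embedding table."""
    entries = [x for row in table for x in row]
    mean = sum(entries) / len(entries)
    var = sum((x - mean) ** 2 for x in entries) / len(entries)
    return mean, var

def sample_rsp(table, length, rng):
    """Draw `length` fresh random vectors from N(mean, var * I)."""
    mean, var = fit_entrywise_gaussian(table)
    std = math.sqrt(var)
    dim = len(table[0])
    return [[rng.gauss(mean, std) for _ in range(dim)] for _ in range(length)]

def append_rsp(input_embeds, table, length, rng):
    """Append a freshly sampled random soft prompt to the input embeddings."""
    return input_embeds + sample_rsp(table, length, rng)

# Toy 2-token, 2-dimensional "embedding table" and a 2-token input.
table = [[0.0, 2.0], [2.0, 0.0]]
prompt_embeds = [[0.1, 0.2], [0.3, 0.4]]
mean, var = fit_entrywise_gaussian(table)
extended = append_rsp(prompt_embeds, table, length=3, rng=random.Random(0))
```

Calling `append_rsp` again with a different random state gives a different suffix for the same prompt, which corresponds to the "freshly resampled" property the abstract emphasizes.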

2605.11934 2026-05-13 cs.CV

Interactive State Space Model with Cross-Modal Local Scanning for Depth Super-Resolution

Chen Wu, Ling Wang, Zhuoran Zheng, Xiangyu Chen, Jingyuan Xia, Weidong Jiang, Jiantao Zhou

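AI summary: This paper addresses guided depth super-resolution (GDSR), which reconstructs a high-resolution depth map from a low-resolution one under the guidance of a high-resolution RGB image. To resolve the tension in existing methods between cross-modal modeling efficiency and semantic interaction, the authors propose a GDSR framework built on an Interactive State Space Model. A cross-modal local scanning mechanism enables fine-grained semantic interaction between RGB and depth features, and the Mamba architecture provides global modeling with linear complexity, improving both efficiency and reconstruction quality.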

Comments ISCAS2026

Abstract

Guided depth super-resolution (GDSR) reconstructs HR depth maps from LR inputs with HR RGB guidance. Existing methods either model each modality independently or rely on computationally expensive attention mechanisms with quadratic complexity, hindering the establishment of efficient and semantically interactive joint representations. In this paper, we observe that feature maps from different modalities exhibit semantic-level correlations during feature extraction. This motivates us to develop a more flexible approach enabling dense, semantically-aware deep interactions between modalities. To this end, we propose a novel GDSR framework centered around the Interactive State Space Model. Specifically, we design a cross-modal local scanning mechanism that enables fine-grained semantic interactions between RGB and depth features. Leveraging the Mamba architecture, our framework achieves global modeling with linear complexity. Furthermore, a cross-modal matching transform module is introduced to enhance interactive modeling quality by utilizing representative features from both modalities. Extensive experiments demonstrate competitive performance against state-of-the-art methods.

2605.11931 2026-05-13 cs.CV

Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

Qihuang Zhong, Liang Ding, Wenjie Xuan, Juhua Liu, Bo Du, Dacheng Tao

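AI summary: This paper studies how self-improvement training can strengthen the reasoning of multimodal large language models (MLLMs). To counter data imbalance and language prior bias in existing methods, it proposes VISTA, a vision-aware self-improvement training framework that uses a prefix resampling strategy and a vision-aware attention score to increase the model's attention to and use of visual information. Experiments show that VISTA substantially improves the multimodal reasoning performance of MLLMs across a range of downstream tasks.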

Comments Accepted by ICML 2026

Abstract

Post-training with explicit reasoning traces is a common way to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse partially correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances the multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.

2605.11928 2026-05-13 cs.AI

When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

Xiaolin Zhou, Aojie Yuan, Zheng Luo, Zipeng Ling, Xixiao Pan, Yicheng Gao, Haiyue Zhang, Jiate Li, Shuli Jiang, Prince Zizhuang Wang, Zixuan Zhu, Jinbo Liu, Ryan A. Rossi, Hua Wei, Xiyang Hu

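AI summary: This work targets the sim-to-real gap that tool-use language agents face in real deployments. It introduces RobustBench-TC, a benchmark covering 22 perturbation types organized by the components of a partially observable Markov decision process (POMDP). It also proposes ToolRL-DR, a domain-randomization reinforcement learning recipe that trains on perturbation-augmented trajectories and substantially improves robustness to perturbations of observations, reward-relevant metadata, and transition dynamics, including gains on perturbations never seen in training.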

Comments Dataset, code, and benchmark leaderboard are available at https://github.com/WillChow66/robustbench-tc-release.git and https://huggingface.co/spaces/willchow66/robustbench-tc-leaderboard

Abstract

Tool-use language agents are evaluated on benchmarks that assume clean inputs, unambiguous tool registries, and reliable APIs. Real deployments violate all these assumptions: user typos propagate into hallucinated tool names, a misconfigured request timeout can stall an agent indefinitely, and duplicate tool names across servers can freeze an SDK. We study these failures as a sim-to-real gap in the tool-use partially observable Markov decision process (POMDP), where deployment noise enters through the observation, action space, reward-relevant metadata, or transition dynamics. We introduce RobustBench-TC, a benchmark with 22 perturbation types organized by these four POMDP components, each grounded in a verified GitHub issue or documented tool-calling failure. Across 21 models from 1.5B to 32B parameters (including the closed-source o4-mini), the robustness profile is sharply uneven: observation perturbations reduce accuracy by less than 5%, while reward-relevant and transition perturbations reduce accuracy by roughly 40% and 30%, respectively; scale alone does not close these gaps. We then propose ToolRL-DR, a domain-randomization reinforcement learning (RL) recipe that trains a tool-use agent on perturbation-augmented trajectories spanning the three statically encodable POMDP components. On a 3B backbone, ToolRL-DR-Full retains roughly three-quarters of clean accuracy and reaches an aggregate perturbed accuracy comparable to open-source 14B function-calling baselines while substantially narrowing the gap to o4-mini. It closes approximately 27% of the Transition gap despite never seeing transition perturbations in training, suggesting that RL on adversarial static tool-use inputs induces a more persistent retry policy that transfers to unseen runtime failures. The dataset, code and benchmark leaderboard are publicly available.

2605.11927 2026-05-13 cs.CV

RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation

Qi Zhao, Jun Chen, Ivor Tsang, Guang Dai

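AI summary: RealDiffusion is a framework with a physics-informed attention mechanism for multi-character storybook generation. It addresses the difficulty diffusion models have in balancing narrative dynamism against character coherence when generating image sequences. The method uses heat diffusion as a dissipative prior together with a region-aware stochastic process, which suppresses attribute drift and keeps identity stable across frames, and it models feature evolution as a configurable physical system to regularize spatio-temporal relationships. Experiments show that RealDiffusion substantially improves character coherence while preserving narrative dynamism, outperforming state-of-the-art methods.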

Comments CVPR2026

Abstract

While modern diffusion models excel at generating diverse single images, extending this to sequential generation reveals a fundamental challenge: balancing narrative dynamism with multi-character coherence. Existing methods often falter at this trade-off, leading to artifacts where characters lose their identity or the story stagnates. To resolve this critical tension, we introduce RealDiffusion, a unified framework designed to reconcile robust coherence with narrative dynamism. Heat diffusion serves as a dissipative prior that averages neighboring features along the sequence and removes high-frequency noise within the subject region. This suppresses attribute drift and stabilizes identity across frames. A region-aware stochastic process then introduces small perturbations that explore nearby modes and prevent collapse so the story maintains pose change and scene evolution. We thus introduce a lightweight, training-free Physics-informed Attention mechanism that injects controllable physical priors into the self-attention layers during inference. By modeling feature evolution as a configurable physical system, our method regularizes spatio-temporal relationships without suppressing intentional, prompt-driven changes. Extensive experiments demonstrate that RealDiffusion achieves substantial gains in character coherence while preserving narrative dynamism, outperforming state-of-the-art approaches. Code is available at https://github.com/ShmilyQi-CN/RealDiffusion.
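
The sentence about heat diffusion averaging neighboring features along the sequence corresponds, in its simplest form, to an explicit step of the discrete 1D heat equation. The sketch below shows that generic smoothing step only; it is not RealDiffusion's attention mechanism, it omits the region masking and the stochastic perturbation, and the function name, step size, and toy features are assumptions.

```python
def heat_step(seq, alpha=0.25):
    """One explicit heat-equation step along the frame axis.

    Each frame feature moves toward the average of its two neighbours:
    f_t <- f_t + alpha * (f_{t-1} - 2 * f_t + f_{t+1}),
    with reflecting boundaries at both ends (alpha <= 0.5 keeps it stable).
    """
    n = len(seq)
    out = []
    for t in range(n):
        prev = seq[max(t - 1, 0)]
        nxt = seq[min(t + 1, n - 1)]
        out.append([c + alpha * (p - 2 * c + q)
                    for p, c, q in zip(prev, seq[t], nxt)])
    return out

# Five frames with a one-dimensional feature; frame 2 has drifted away.
frames = [[0.0], [0.0], [4.0], [0.0], [0.0]]
smoothed = heat_step(frames, alpha=0.25)
```

The outlier in the middle frame is pulled toward its neighbours while the total over the sequence is preserved, which is the dissipative behaviour the abstract attributes to the heat-diffusion prior.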

2605.11920 2026-05-13 cs.AI

Domain Restriction via Multi SAE Layer Transitions

Elias Shaheen, Avi Mendelson

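AI summary: This paper studies how to identify and restrict the domain in which a large language model (LLM) is used by analyzing transitions between sparse autoencoder (SAE) representations across multiple layers of its internal processing. The authors propose lightweight methods based on these layer-to-layer SAE dynamics that distinguish out-of-domain (OOD) inputs effectively, improving the controllability of domain-specific deployments. Experiments on several models show that the approach captures fine-grained input details.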

Abstract

The general-purpose nature of Large Language Models (LLMs) presents a significant challenge for domain-specific applications, often leading to out-of-domain (OOD) interactions that undermine the provider's intent. Existing methods for detecting such scenarios treat the LLM as an uninterpretable black box and overlook the internal processing of inputs. In this work, we show that layer transitions provide a promising avenue for extracting a domain-specific signature. Specifically, we present several lightweight ways of learning on internal dynamics, encoded using a sparse autoencoder (SAE), that exhibit great capability in distinguishing OOD texts. Building on SAE representation transitions enables us to better interpret how the LLM's internal processing of an input evolves and shed light on its decisions. We provide a comprehensive analysis of the method and benchmark it with the gemma-2 2B and 9B models. Our results emphasize the efficacy of the internal process in capturing fine-grained input-related details.

2605.11919 2026-05-13 cs.LG

STAGE: Tackling Semantic Drift in Multimodal Federated Graph Learning

Zekai Chen, Xun Wu, Xunkai Li, Yihan Sun, Rong-Hua Li, Guoren Wang

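AI summary: This paper studies semantic drift in multimodal federated graph learning (MM-FGL): because of modality differences, clients produce inconsistent semantic representations before training begins, which undermines collaboration. To address this, the authors propose the STAGE framework, which builds a shared semantic space that converts heterogeneous modality features into comparable representations and controls how they propagate over local graph structures, improving semantic alignment and reducing the risk of amplifying inconsistency. Experiments show that STAGE achieves leading performance across multiple tasks and datasets while reducing per-round communication overhead.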

English abstract

Federated graph learning (FGL) enables collaborative training on graph data across multiple clients. As graph data increasingly contain multimodal node attributes such as text and images, multimodal federated graph learning (MM-FGL) has become an important yet substantially harder setting. The key challenge is that clients from different modality domains may not share a common semantic space: even for the same concept, their local encoders can produce inconsistent representations before collaboration begins. This makes direct parameter coordination unreliable and further causes two downstream problems: forcing heterogeneous client representations into a naively shared semantic space may create false semantic agreement, and graph message passing may amplify residual inconsistency across neighborhoods. To address this issue, we propose STAGE, a protocol-first framework for MM-FGL. Instead of relying on direct parameter averaging, STAGE builds a shared semantic space that first translates heterogeneous multimodal features into comparable representations and then regulates how these representations propagate over local graph structures. In this way, STAGE not only improves cross-client semantic calibration, but also reduces the risk of inconsistency amplification during graph learning. Extensive experiments on 8 multimodal-attributed graphs across 5 graph-centric and modality-centric tasks show that STAGE consistently achieves state-of-the-art performance while reducing per-round communication payload.

2605.11913 2026-05-13 cs.CV

Vector Scaffolding: Inter-Scale Orchestration for Differentiable Image Vectorization

Jaerin Lee, Kanggeon Lee, Kyoung Mu Lee

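AI summary: This paper proposes Vector Scaffolding, a new hierarchical optimization framework that addresses topology collapse in differentiable image vectorization. Conventional methods tend to distort structure during pixel-level optimization; this method introduces Interior Gradient Aggregation, Progressive Stratification, and Rapid Inflation Scheduling to stabilize the learning of multi-scale curve mixtures, substantially improving optimization efficiency and image quality. Experiments show that it outperforms prior work in both optimization speed and image fidelity.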

Comments 22 pages, 12 figures

English abstract

Differentiable vector graphics have enabled powerful gradient-based optimization of vector primitives directly from raster images. However, existing frameworks formulate this as a flat optimization problem, forcing hundreds to thousands of randomly initialized curves to blindly compete for pixel-level error reduction. This disordered optimization leads to topology collapse, where macroscopic structures are distorted by internal high-frequency noise, resulting in a redundant and uneditable "polygon soup" that limits practical editability. To address this limitation, we propose Vector Scaffolding, a novel hierarchical optimization framework that shifts from flat pixel-matching to structured topological construction tailored for vector graphics. By identifying a key cause of topology collapse as the mathematical imbalance between area and boundary gradients, we introduce Interior Gradient Aggregation to stabilize the learning dynamics of multi-scale curve mixtures. Upon this stabilized landscape, we employ Progressive Stratification and Rapid Inflation Scheduling to progressively densify vector primitives with extremely high learning rates ($\times 50$). Experiments demonstrate that our approach accelerates optimization by $2.5\times$ while simultaneously improving PSNR by up to 1.4 dB over the previous state of the art.

2605.11910 2026-05-13 cs.AI

Rethinking Positional Encoding for Neural Vehicle Routing

Chuanbo Hua, Federico Berto, Andre Hottung, Nayeli Gast Zepeda, Yining Ma, Zihan Ma, Paula Wong-Chung, Changhyun Kwon, Cathy Wu, Kevin Tierney, Jinkyoo Park

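AI summary: This paper studies the design of positional encoding (PE) for neural vehicle routing problems (VRPs) and points out that positional encodings from natural language processing do not fit the structural properties of VRPs. The authors identify three structural properties that a positional encoding should respect and, based on geometric grounding, design a hierarchical anisometric positional encoding that combines a circularly consistent in-route encoding with a depot-anchored angular cross-route encoding. Experiments show that this geometry-grounded positional encoding outperforms conventional index-based encodings across multiple VRP variants.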

English abstract

Transformer-based models have become the dominant paradigm for neural combinatorial optimization (NCO) of vehicle routing problems (VRPs), yet the role of positional encoding (PE) in these architectures remains largely unexplored. Unlike natural language, where tokens are uniformly spaced on a line, routing solutions exhibit several properties that render standard NLP positional encodings inadequate. In this work, we formalize three such structural properties that a routing-aware PE should respect, namely anisometric node distances, cyclic and direction-aware topology, and hierarchical depot-anchored global multi-route structure, combining them with a unifying design principle of geometric grounding. Guided by these criteria, we analyze and compare PE methods spanning NLP, graph-transformer, and routing-specific families, and propose a hierarchical anisometric PE that combines a distance-indexed, circularly consistent in-route encoding with a depot-anchored angular cross-route encoding. Extensive experiments across diverse VRP variants demonstrate that geometry-grounded PE consistently outperforms index-based alternatives, with gains that transfer across problem variants, model architectures, and distribution shifts.

2605.11908 2026-05-13 cs.LG

Delightful Gradients Accelerate Corner Escape

Jincheng Mei, Ian Osband

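AI summary: This paper studies the slow convergence of softmax policy gradient near sub-optimal corners and analyzes Delightful Policy Gradient (DG), which gates each policy-gradient term by the product of advantage and action surprisal, mitigating corner trapping. Theoretical analysis shows that DG converges globally to the optimal policy at an asymptotic $O(1/t)$ rate in multi-armed bandits and tabular MDPs. The mechanism can fail under shared function approximation, but experiments show that DG still recovers from bad initializations faster than standard policy gradient on MNIST contextual bandit tasks.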

Comments Preprint

English abstract

Softmax policy gradient converges at $O(1/t)$, but its transient behavior near sub-optimal corners of the simplex can be exponentially slow. The bottleneck is self-trapping: negative-advantage actions reinforce the corner policy and can initially push the optimal action backward. We study Delightful Policy Gradient (DG), which gates each policy-gradient term by the product of advantage and action surprisal. For $K$-armed bandits, we prove that the zero-temperature limit of DG removes this corner-trapping mechanism on a quantitative sector near any sub-optimal corner, yielding a first-exit escape bound logarithmic in the initial probability ratio. At every fixed temperature, the same local mechanism persists because harmful actions are polynomially suppressed as they become rare. A key structural insight is that every action better than the corner action is an ally: its contribution to escape is non-negative. Combining corner instability with a monotonic value improvement identity, we prove that DG converges globally to the optimal policy in both bandits and tabular MDPs at an asymptotic $O(1/t)$ rate. We also show, via an exact counterexample, that this tabular mechanism can fail under shared function approximation. In MNIST contextual bandits with a shared-parameter neural network, DG nevertheless recovers from bad initializations faster than standard policy gradient, suggesting that the counterexample marks a boundary of the theory rather than a practical prohibition.
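
To make the gating idea concrete, here is a minimal sketch for a K-armed bandit using exact expectations. The abstract only states that each policy-gradient term is gated by the product of advantage and action surprisal; the sigmoid form of the gate, the temperature name `eta`, and all function names below are assumptions made for illustration and may differ from the authors' definitions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def expected_pg(logits, rewards):
    """Exact expected softmax policy gradient w.r.t. the logits of a K-armed bandit."""
    pi = softmax(logits)
    adv = rewards - pi @ rewards          # advantage of each action
    w = pi * adv                          # probability-weighted advantage
    return w - w.sum() * pi               # sum_a w_a * (e_a - pi)

def expected_gated_pg(logits, rewards, eta=0.1):
    """Same update, with each action's term scaled by an assumed sigmoid gate."""
    pi = softmax(logits)
    adv = rewards - pi @ rewards
    surprisal = -np.log(pi)               # large for rare actions
    # Hypothetical gate: sigmoid(advantage * surprisal / eta), clipped for stability.
    gate = 1.0 / (1.0 + np.exp(-np.clip(adv * surprisal / eta, -50.0, 50.0)))
    w = pi * gate * adv
    return w - w.sum() * pi

# Policy concentrated on a sub-optimal corner (action 0); action 1 is optimal.
logits = np.array([5.0, 0.0, 0.0])
rewards = np.array([0.5, 1.0, 0.0])
plain = expected_pg(logits, rewards)
gated = expected_gated_pg(logits, rewards)
```

Under this assumed gate, the term of the rare negative-advantage action (action 2) is almost switched off, while the rare positive-advantage action keeps nearly its full weight. The sketch illustrates the mechanism only and does not reproduce the paper's analysis or rates.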

2605.11906 2026-05-13 cs.CL

YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning

Yifan Le

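AI summary: This study proposes YFPO, a neuron-guided preference optimization framework intended to improve the mathematical reasoning of large language models. Unlike conventional methods that rely only on external preference data, YFPO uses AttnLRP to identify neurons related to mathematical reasoning and builds an auxiliary reward from the difference in their activations between preferred and dispreferred responses, combining internal neuron-level signals with external preference learning. Preliminary experiments suggest that the method can occasionally improve reasoning performance, pointing to a direction for more fine-grained and interpretable post-training.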

Comments 10 pages, 2 figures. Work in progress

English abstract

Preference optimization has become an important post-training paradigm for improving the reasoning abilities of large language models. Existing methods typically rely on externally constructed preference data, using preferred and dispreferred responses as sample-level supervision. However, such external signals rarely make explicit use of capability-related information contained in the model's internal representations. For mathematical reasoning, certain neuron groups may exhibit activation patterns associated with mathematical knowledge, symbolic manipulation, or logical reasoning. Similar to reflexive behavioral signals, these internal activations may provide a coarse indication of whether the model is engaging math-related capabilities. We introduce YFPO, short for Yoked Feature Preference Optimization, a preliminary neuron-guided preference optimization framework for mathematical reasoning. YFPO first uses AttnLRP to identify math-related neurons, and then constructs an auxiliary reward from their activation margin between preferred and dispreferred responses. This design augments external preference learning with internal neuron-level signals. We conduct preliminary experiments on a small-scale language model using GSM8K as the main benchmark. Results suggest that neuron-level signals can interact with preference optimization and occasionally improve reasoning performance, offering a promising direction for more fine-grained and interpretable reasoning-oriented post-training.

2605.11905 2026-05-13 cs.AI

Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

Shuo Xu, Jiakun Zhang, Junyu Lai, Chun Cao, Jingwei Xu

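AI summary: This paper revisits supervision granularity and proposes segment-level supervision over proof trajectories for training LLM-based theorem provers. The method builds training data from locally coherent proof segments, which preserves global structural information while avoiding the fragmentation caused by fine-grained step prediction. Experiments on miniF2F show that it consistently outperforms step-level and whole-proof generation baselines and improves both the success rate and the inference efficiency of existing provers.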

Comments 22 pages, 4 figures, 6 tables

English abstract

Automated theorem proving with large language models in Lean 4 is commonly approached through either step-level tactic prediction with tree search or whole-proof generation. These two paradigms represent opposite granularities for constructing supervised training data: the former provides dense local signals but may fragment coherent proof processes, while the latter preserves global structure but requires complex end-to-end generation. In this paper, we revisit supervision granularity as a training set construction problem over proof trajectories and propose segment-level supervision, a training data construction strategy that extracts locally coherent proof segments for training policy models. We further reuse the same strategy at inference time to trigger short rollouts for existing step-level models. When trained with segment-level supervision on STP, LeanWorkbook, and NuminaMath-LEAN, the resulting policy models achieve proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, respectively, consistently outperforming both step-level and whole-proof baselines. Goal-aware rollout further improves existing step-level provers while reducing inference costs. It increases the proof success rate of BFS-Prover-V2-7B from 68.77% to 70.74% and that of InternLM2.5-StepProver from 59.59% to 60.33%, showing that appropriate supervision granularity better aligns model learning with proof structure and search. Code and models are available at https://github.com/NJUDeepEngine/SEG-ATP.

2605.11904 2026-05-13 cs.CV cs.AI

Beyond Point-wise Neural Collapse: A Topology-Aware Hierarchical Classifier for Class-Incremental Learning

Huiyu Yi, Zhiming Xu, Dunwei Tu, Zhicheng Wang, Baile Xu, Furao Shen

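AI summary: Targeting the weak performance of the conventional Nearest Class Mean (NCM) classifier in class-incremental learning (CIL) caused by feature drift and non-linear structure, this paper proposes HC-SOINN, a topology-aware hierarchical classifier. It captures the topological structure of class manifolds through a "local-to-global" representation and introduces the Structure-Topology Alignment via Residuals (STAR) method to adapt precisely to complex non-linear feature drift. Experiments show that it improves classification performance when integrated into several state-of-the-art methods, demonstrating robustness and generalization.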

Comments Accepted by ICML 2026

English abstract

The Nearest Class Mean (NCM) classifier is widely favored in Class-Incremental Learning (CIL) for its superior resistance to catastrophic forgetting compared to Fully Connected layers. While Neural Collapse (NC) theory supports NCM's optimality by assuming features collapse into single points, non-linear feature drift and insufficient training in CIL often prevent this ideal state. Consequently, classes manifest as complex manifolds rather than collapsed points, rendering the single-point NCM suboptimal. To address this, we propose Hierarchical-Cluster SOINN (HC-SOINN), a novel classifier that captures the topological structure of these manifolds via a "local-to-global" representation. Furthermore, we introduce the Structure-Topology Alignment via Residuals (STAR) method, which employs a fine-grained pointwise trajectory tracking mechanism to actively deform the learned topology, allowing it to adapt precisely to complex non-linear feature drift. Theoretical analysis and Procrustes distance experiments validate our framework's resilience to manifold deformations. We integrated HC-SOINN into seven state-of-the-art methods by replacing their original classifiers, achieving consistent improvements that highlight the effectiveness and robustness of our approach. Code is available at https://github.com/yhyet/HC_SOINN.
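
For readers unfamiliar with the baseline being replaced, the following is a minimal sketch of a plain Nearest Class Mean classifier: each class is summarized by the mean of its feature vectors, and a query is assigned to the class with the closest mean. It shows only the single-point NCM that the abstract calls suboptimal, not HC-SOINN or STAR; the function names and the use of squared Euclidean distance are illustrative choices.

```python
import numpy as np

def fit_class_means(features, labels):
    """Return the sorted class ids and one mean feature vector per class."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, means

def predict_ncm(features, classes, means):
    """Assign each feature vector to the class whose mean is nearest."""
    # Squared Euclidean distances, shape (n_samples, n_classes).
    dists = ((features[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

# Toy 2-D features for two classes.
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_y = np.array([0, 0, 1, 1])
classes, means = fit_class_means(train_x, train_y)
pred = predict_ncm(np.array([[0.2, 0.4], [5.5, 5.1]]), classes, means)
```

Because each class is reduced to one point, this classifier cannot represent a class whose features spread along a curved manifold, which is the limitation the abstract describes.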

2605.11900 2026-05-13 cs.CV

Mobile Traffic Camera Calibration from Road Geometry for UAV-Based Traffic Surveillance

Alexey Popov, Natalia Trukhina, Vadim Vashkelis

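AI summary: This paper studies how to calibrate UAV traffic video using road geometry to produce a bird's-eye-view (BEV) representation usable for traffic analysis. A homography from image coordinates to ground-plane coordinates is estimated from visible road features such as lane markings and road borders; vehicle detections are then projected into BEV to estimate vehicle trajectories, speed, heading, and 3D cuboids. The method is validated on the UAVDT dataset, demonstrating that interpretable traffic analytics can be produced from monocular UAV video, and the paper also notes limitations such as the sensitivity of far-field vehicles to homography errors and the limited reliability of automatic calibration.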

English abstract

Unmanned aerial vehicles (UAVs) can provide flexible traffic surveillance where fixed roadside cameras are unavailable, costly, or impractical. However, raw UAV video is difficult to use for traffic analytics because vehicle motion is observed in perspective image coordinates rather than in a stable metric road coordinate system. This paper presents a lightweight pipeline for converting monocular oblique UAV traffic video into a local metric bird's-eye-view (BEV) representation. Visible road geometry, including lane markings, road borders, and crosswalks, is used to estimate a road-plane homography from image coordinates to metric ground-plane coordinates. Vehicle observations from dataset annotations or detectors are then projected to BEV using estimated ground contact points. The resulting trajectories support estimation of vehicle direction, speed, heading, and dynamic 3D cuboids on the road plane. We evaluate the pipeline on UAVDT using ground-truth annotations to isolate calibration and geometric reconstruction from detector and tracker errors. For sequence M1401, 40 sampled frames from img000001-img000196 produce 632 metric cuboid instances across 23 tracks. Results show that road-geometry calibration can transform monocular UAV footage into interpretable traffic-camera-style analytics, including BEV tracks and synchronized 3D cuboid visualizations. They also reveal key limitations: far-field vehicles are sensitive to homography errors, manual validation is currently more reliable than fully automatic calibration, and the single-plane assumption limits performance in non-planar or ambiguous road regions. The proposed pipeline provides a practical foundation for deployable UAV traffic cameras and future real-time traffic digital-twin systems.
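
The core geometric step, mapping image points to metric ground-plane coordinates through a road-plane homography, can be sketched as follows. This is a generic direct linear transform (DLT) from four point correspondences, not the authors' pipeline; the pixel coordinates, the 3.5 m x 20 m lane segment, and the 0.5 s frame gap are made-up values for illustration.

```python
import numpy as np

def estimate_homography(img_pts, ground_pts):
    """Estimate H (image -> ground plane) from >= 4 correspondences via DLT."""
    rows = []
    for (u, v), (x, y) in zip(img_pts, ground_pts):
        rows.append([-u, -v, -1, 0, 0, 0, x * u, x * v, x])
        rows.append([0, 0, 0, -u, -v, -1, y * u, y * v, y])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)              # null-space vector of the system
    return H / H[2, 2]                    # assumes H[2, 2] is not ~0

def project_to_ground(H, img_pts):
    """Apply H to pixel points and dehomogenize to metric (x, y)."""
    pts = np.asarray(img_pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical pixel corners of one lane segment and its assumed metric size.
img_pts = [(400, 700), (880, 700), (560, 400), (720, 400)]
ground_pts = [(0.0, 0.0), (3.5, 0.0), (0.0, 20.0), (3.5, 20.0)]
H = estimate_homography(img_pts, ground_pts)

# Ground contact point of one vehicle in two frames 0.5 s apart (made-up values).
track = project_to_ground(H, [(640, 650), (640, 600)])
speed_mps = np.linalg.norm(track[1] - track[0]) / 0.5
```

Small pixel errors near the horizon translate into large metric errors after dehomogenization, which is consistent with the far-field sensitivity noted in the abstract.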

2605.11898 2026-05-13 cs.CV

Few-Shot Synthetic Data Generation with Diffusion Models for Downstream Vision Tasks

Daniil Dushenev, Nazariy Karpov, Daniil Zinovjev, Alexander Gorin, Konstantin Kulikov

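AI summary: To address the shortage of rare-class samples in visual recognition, this paper proposes a lightweight synthetic data generation method based on diffusion models. It fine-tunes a LoRA adapter on only a few real samples (20-50 images) to generate synthetic training data, improving rare-class recall and F1. Experiments in two different domains, chest X-ray pathology classification and industrial surface defect detection, validate the method's effectiveness and scalability in data-scarce settings.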

Comments 5 pages, 3 figures, 1 table. Accepted at SynData4CV Workshop @ CVPR 2026

English abstract

Class imbalance is a persistent challenge in visual recognition, particularly in safety-critical domains where collecting positive examples is expensive and rare events are inherently underrepresented. We propose a lightweight synthetic data augmentation pipeline that fine-tunes a LoRA adapter on as few as 20-50 real images of a rare class and uses a pretrained diffusion model to generate synthetic samples for training. We systematically vary the synthetic-to-real ratio and evaluate the approach across two structurally different domains: chest X-ray pathology classification (NIH ChestX-ray14) and industrial surface crack detection (Magnetic Tile Defect dataset). All evaluations are performed on held-out sets of real images only. Across both domains, synthetic augmentation consistently improves rare-class recall and F1 compared to training with real data alone. Performance improves with moderate synthetic augmentation and shows diminishing returns as the synthetic ratio increases. These results suggest that LoRA-adapted diffusion models provide a simple and scalable mechanism for augmenting rare classes, enabling effective learning in data-scarce scenarios across heterogeneous visual domains.
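
As background on why a LoRA adapter can be fitted on so few images, the sketch below shows the standard low-rank update on a single linear layer: the frozen weight `W` is augmented by `(alpha / rank) * B @ A`, and only `A` and `B` would be trained. The class name, layer sizes, and hyperparameter values are illustrative assumptions; the abstract does not specify the authors' configuration.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank correction."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                 # frozen, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))    # trainable
        self.B = np.zeros((d_out, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T                   # rank-limited path
        return base + self.scale * update

    def num_trainable(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 128))
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 128))
y0 = layer(x)
```

With `B` initialized to zero the adapted layer starts out identical to the frozen one, and here it has 768 trainable values against 8192 in the full weight matrix.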

2605.11893 2026-05-13 cs.AI

Toward Modeling Player-Specific Chess Behaviors

Loris Sogliuzzo, Aloïs Rautureau, Eric Piette

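AI summary: Although AI has reached superhuman strength in chess, modeling the individual decision-making styles of human players remains a challenge. This paper proposes an architecture based on the Maia-2 model that uses champion-specific embeddings and a limited Monte Carlo Tree Search (MCTS) to enrich tactical exploration, in order to better capture the playing styles of historical champions. It introduces a behavioral evaluation metric based on the Jensen-Shannon divergence, using an autoencoder and UMAP dimensionality reduction to compare the behavior of players and AI models; experiments indicate that adding MCTS improves stylistic alignment under this metric even though it lowers move accuracy.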

English abstract

While artificial intelligence has achieved superhuman performance in chess, developing models that accurately emulate the individualized decision-making styles of human players remains a significant challenge. Existing human-like chess models capture general population behaviors based on skill levels but fail to reproduce the behavioral characteristics of specific historical champions. Furthermore, the standard evaluation metric, move accuracy, inherently penalizes natural human variance and ignores long-term behavioral consistency, leading to an incomplete assessment of stylistic fidelity. To address these limitations, an architecture is proposed that adapts the unified Maia-2 model to champion-specific embeddings, further enhanced by the integration of a limited Monte Carlo Tree Search (MCTS) process to enrich tactical exploration during move selection. To robustly evaluate this approach, a novel behavioral metric based on the Jensen-Shannon divergence is introduced. By compressing high-dimensional board representations into a latent space using an AutoEncoder and Uniform Manifold Approximation and Projection (UMAP), move distributions are discretized on a common grid to compare behavioral similarities. Results across 16 historical world champions indicate that while integrating MCTS decreases standard move accuracy, it improves stylistic alignment according to the proposed metric, substantially reducing the average Jensen-Shannon divergence. Ultimately, the proposed metric successfully discriminates between individual players and provides promising evidence toward more comprehensive evaluations of behavioral alignment between players and AI models.
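
The comparison step, discretizing two sets of low-dimensional points on a common grid and measuring their Jensen-Shannon divergence, can be sketched as below. The autoencoder and UMAP stages are omitted and replaced by synthetic 2-D points; the grid size, bounds, and function names are assumptions for illustration rather than the authors' settings.

```python
import numpy as np

def histogram_on_grid(points, bins, bounds):
    """Normalized 2-D histogram of points on a shared grid (points must fall inside bounds)."""
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=bins, range=bounds)
    return hist / hist.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical, 1 for disjoint distributions."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                      # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Synthetic stand-ins for the embedded moves of a player and of a model.
rng = np.random.default_rng(0)
player = rng.uniform(0.0, 1.0, size=(500, 2))
model = rng.uniform(0.0, 1.0, size=(500, 2))
bounds = [[0.0, 1.0], [0.0, 1.0]]
p = histogram_on_grid(player, bins=8, bounds=bounds)
q = histogram_on_grid(model, bins=8, bounds=bounds)
jsd = js_divergence(p, q)
```

Using base-2 logarithms keeps the value in [0, 1], so scores for different player-model pairs are directly comparable as long as they share the same grid.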

2605.11889 2026-05-13 cs.LG cs.AI

Incentivizing Truthfulness and Collaborative Fairness in Bayesian Learning

Rachael Hwee Ling Sim, Jue Fan, Xiao Tian, Xinyi Xu, Patrick Jaillet, Bryan Kian Hsiang Low

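AI summary: This study examines how to incentivize data sources to provide truthful data and achieve collaborative fairness in Bayesian learning. Because existing methods cannot guarantee data truthfulness, the authors propose a mechanism that uses semivalues (e.g., the Shapley value) to ensure fairness together with a truthfulness-incentivizing data valuation function based on a validation set unknown to the sources. At equilibrium the mechanism guarantees both collaborative fairness and truthfulness, as supported by theoretical analysis and experiments.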

Comments Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper

English abstract

Collaborative machine learning involves training high-quality models using datasets from a number of sources. To incentivize sources to share data, existing data valuation methods fairly reward each source based on its data submitted as is. However, as these methods neither verify nor incentivize data truthfulness, the sources can manipulate their data (e.g., by submitting duplicated or noisy data) to artificially increase their valuations and rewards or prevent others from benefiting. This paper presents the first mechanism that provably ensures (F) collaborative fairness and incentivizes (T) truthfulness at equilibrium for Bayesian models. Our mechanism combines semivalues (e.g., Shapley value), which ensure fairness, and a truthful data valuation function (DVF) based on a validation set that is unknown to the sources. As semivalues are influenced by others' data, we introduce an additional condition to prove that a source can maximize its expected data values in coalitions and semivalues by submitting a dataset that captures its true knowledge. Additionally, we discuss the implications and suitable relaxations of (F) and (T) when the mediator has a limited budget for rewards or lacks a validation set. Our theoretical findings are validated on synthetic and real-world datasets.
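
Since the mechanism builds on semivalues, a short sketch of the best-known one, the Shapley value, may help. The code computes exact Shapley values by enumerating coalitions; the value function used here (number of distinct data points in a coalition's pooled data) is a toy stand-in chosen for illustration and is not the paper's data valuation function.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player for a coalition value function."""
    n = len(players)
    result = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that do not contain player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        result[i] = total
    return result

# Toy datasets: source C submits an exact duplicate of source A's data.
data = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2, 3}}

def distinct_points(coalition):
    """Toy value: number of distinct data points pooled by the coalition."""
    return len(set().union(*(data[p] for p in coalition)))

phi = shapley_values(list(data), distinct_points)
```

The values sum to the grand coalition's value, and the duplicate source C receives the same share as A. Exact enumeration costs exponential time in the number of sources, so it is only practical for small collaborations.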

2605.11887 2026-05-13 cs.CL cs.LG

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

Boyi Deng, Xu Wang, Yaoning Wang, Yu Wan, Yubo Ma, Baosong Yang, Haoran Wei, Jialong Tang, Huan Lin, Ruize Gao, Tianhao Li, Qian Cao, Xuancheng Ren, Xiaodong Deng, An Yang, Fei Huang, Dayiheng Liu, Jingren Zhou

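AI summary: Qwen-Scope is an open-source suite of sparse autoencoders (SAEs) built on the Qwen model family that aims to turn sparse features in large language models into practical development tools. The study shows that SAEs can be used beyond post-hoc analysis, for inference-time steering, evaluation analysis, data workflows, and post-training optimization, providing reusable interfaces for diagnosing, controlling, and improving models. The work supports mechanistic interpretability research on large language models and helps connect model internals to downstream behavior.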

English abstract

Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This opacity motivates a growing body of research in mechanistic interpretability, with sparse autoencoders (SAEs) emerging as one of the most promising tools for decomposing model activations into sparse, interpretable feature representations. We introduce Qwen-Scope, an open-source suite of SAEs built on the Qwen model family, comprising 14 groups of SAEs across 7 model variants from the Qwen3 and Qwen3.5 series, covering both dense and mixture-of-expert architectures. Built on top of these SAEs, we show that SAEs can go beyond post-hoc analysis to serve as practical interfaces for model development along four directions: (i) inference-time steering, where SAE feature directions control language, concepts, and preferences without modifying model weights; (ii) evaluation analysis, where activated SAE features provide a representation-level proxy for benchmark redundancy and capability coverage; (iii) data-centric workflows, where SAE features support multilingual toxicity classification and safety-oriented data synthesis; and (iv) post-training optimization, where SAE-derived signals are incorporated into supervised fine-tuning and reinforcement learning objectives to mitigate undesirable behaviors such as code-switching and repetition. Together, these results demonstrate that SAEs can serve not only as post-hoc analysis tools, but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models. By open-sourcing Qwen-Scope, we aim to support mechanistic research and accelerate practical workflows that connect model internals to downstream behavior.
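
The idea behind inference-time steering with SAE feature directions can be sketched in a few lines. The example assumes a plain ReLU sparse autoencoder with randomly initialized placeholder weights; the actual Qwen-Scope SAE architecture, weights, and loading API are not described in the abstract, so every name and shape below is hypothetical.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Map an activation vector to non-negative feature activations (ReLU SAE assumed)."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Reconstruct the activation from feature activations."""
    return f @ W_dec + b_dec

def steer(x, W_dec, feature_idx, strength):
    """Shift an activation along one feature's unit-norm decoder direction."""
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return x + strength * direction

# Placeholder weights standing in for a trained SAE (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0.0, 0.1, (n_features, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)              # one residual-stream activation
features = sae_encode(x, W_enc, b_enc)
recon = sae_decode(features, W_dec, b_dec)
steered = steer(x, W_dec, feature_idx=3, strength=2.0)
```

Steering of this kind changes only the activation passed to later layers, which is why the abstract can describe it as controlling behavior without modifying model weights.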