arXivDaily: daily arXiv academic digest, updated Monday through Friday
2604.12812 2026-05-12 cs.AI

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

Hao Yan, Yuliang Liu, Xingchen Liu, Yuyi Zhang, Minghui Liao, Jihao Wu, Wei Chen, Xiang Bai

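AI summary: Existing multimodal large language models degrade markedly on long-document understanding as document length grows. To address this, the paper proposes a structured analysis, localization, and reasoning workflow, uses a two-stage training framework to improve the model's ability to locate key evidence and its reasoning accuracy, and introduces an evidence-guided resolution allocation strategy to cope with the memory limits of training on multi-page documents. Experiments show that DocSeeker performs strongly on both in-domain and out-of-domain tasks, generalizes robustly from short-document training to ultra-long documents, and is compatible with visual retrieval-augmented generation systems.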

Comments CVPR 2026 Highlight

Abstract

Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured Analysis, Localization and Reasoning workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate the memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.

2604.12027 2026-05-12 cs.RO

3DRO: Lidar-level SE(3) Direct Radar Odometry Using a 2D Imaging Radar and a Gyroscope

Cedric Le Gentil, Daniil Lisus, Timothy D. Barfoot

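AI summary: This paper presents 3DRO, a 3D direct radar odometry method based on a 2D imaging radar and a gyroscope, for six-degree-of-freedom motion estimation in SE(3). The method keeps the 2D velocity estimates of the DRO framework and combines them with the gyroscope's 3D rotation measurements on SO(3) to estimate the 3D pose accurately. Experiments on the Boreas-RT dataset show odometry accuracy comparable to lidar.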

Comments Accepted for presentation at the ICRA 2026 Workshop on Radar in Robotics (poster: https://drive.google.com/file/d/1P_iBrGxPiZL644B-dHxbvdY-UJUzd4Kp/view )

Abstract

Recently, the robotics community has regained interest in radar-based perception and state estimation. A 2D imaging radar provides dense 360° information about the environment. Despite the radar antenna's cone of emission and reception, the collected data is generally assumed to be limited to the plane orthogonal to the radar's spinning axis. Accordingly, most methods based on 2D imaging radars only perform SE(2) state estimation. This paper presents 3DRO, an extension of the SE(2) Direct Radar Odometry (DRO) framework to perform state estimation in SE(3). While still assuming planarity of the data through DRO's 2D velocity estimates, it integrates 3D gyroscope measurements over SO(3) to estimate SE(3) ego motion. While simple, this approach provides lidar-level odometry accuracy as demonstrated using 643 km of data from the Boreas-RT dataset.
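
The idea of combining planar velocity estimates with 3D gyroscope integration can be illustrated with a minimal dead-reckoning sketch. This is not the authors' implementation. It assumes that a body-frame planar velocity (vx, vy) is already available at each step (as DRO would estimate it), that the gyroscope rates are bias-free, and that a simple first-order integration step is adequate. The function names `so3_exp` and `integrate_se3` are hypothetical.

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def so3_exp(w):
    """Rodrigues' formula: map a rotation vector w (rad) to a 3x3 rotation matrix."""
    theta = math.sqrt(sum(c * c for c in w))
    K = [[0.0, -w[2], w[1]],
         [w[2], 0.0, -w[0]],
         [-w[1], w[0], 0.0]]          # skew-symmetric matrix of w
    if theta < 1e-12:
        a, b = 1.0, 0.5               # small-angle limits of the coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    K2 = matmul3(K, K)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

def integrate_se3(gyro, planar_vel, dt):
    """Dead-reckon an SE(3) pose from 3D angular rates and 2D body velocities.

    gyro:       list of (wx, wy, wz) body-frame angular rates in rad/s
    planar_vel: list of (vx, vy) body-frame velocities in m/s (assumed given)
    Returns the final rotation matrix R and position p in the start frame.
    """
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    p = [0.0, 0.0, 0.0]
    for w, (vx, vy) in zip(gyro, planar_vel):
        v_body = (vx, vy, 0.0)        # planarity assumption: no body-frame z velocity
        for i in range(3):            # translate using the current orientation
            p[i] += dt * sum(R[i][k] * v_body[k] for k in range(3))
        R = matmul3(R, so3_exp([c * dt for c in w]))  # rotate on SO(3)
    return R, p

# Pitch by 0.1 rad about the body y axis, then drive forward for one second.
R_end, p_end = integrate_se3([(0.0, -0.1, 0.0), (0.0, 0.0, 0.0)],
                             [(0.0, 0.0), (1.0, 0.0)], dt=1.0)
```

In this toy run the position gains a nonzero z component even though the velocity is planar in the body frame, which is the effect an SE(2)-only estimator cannot represent.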

2604.11808 2026-05-12 cs.CV

Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

Xingjian Ran, Shujie Zhang, Weipeng Zhong, Li Luo, Bo Dai

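AI summary: Generating high-fidelity indoor 3D scenes remains a major challenge because data is scarce and complex spatial relations are hard to model. This paper proposes Pair2Scene, a procedural scene generation framework that learns local object relations and combines these local rules with scene hierarchies and physics-based algorithms to capture two key kinds of inter-object interaction: support relations and functional relations. The model is trained on the authors' curated 3D-Pairs dataset; at inference it is applied recursively with collision-aware rejection sampling to generate complex scenes that are physically and semantically plausible, significantly outperforming existing methods.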

Comments ICML 2026

Abstract

Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond training distribution to dense scenes or rely on LLMs/VLMs that lack the ability for precise spatial reasoning. Building on top of the observation that object placement relies mainly on local dependencies instead of information-redundant global distributions, in this paper, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules mainly capture two types of inter-object relations, namely support relations that follow physical hierarchies, and functional relations that reflect semantic links. We model these rules through a network, which estimates spatial position distributions of dependent objects conditioned on position and geometry of the anchor ones. Accordingly, we curate a dataset 3D-Pairs from existing scene data to train the model. During inference, our framework can generate scenes by recursively applying our model within a hierarchical structure, leveraging collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond training data while maintaining physical and semantic plausibility.
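
Collision-aware rejection sampling, as mentioned in this abstract, can be sketched in a few lines. The example below is a flat 2D simplification and not the paper's method: a fixed Gaussian (`sample_offset`) stands in for the learned position distribution, objects are axis-aligned footprints, and there is no recursion over a scene hierarchy. All names and numbers are hypothetical.

```python
import random

def overlaps(a, b):
    """Axis-aligned 2D footprint test; each box is (cx, cy, width, depth)."""
    return (abs(a[0] - b[0]) < (a[2] + b[2]) / 2 and
            abs(a[1] - b[1]) < (a[3] + b[3]) / 2)

def sample_offset(rng, mean, std):
    """Stand-in for the learned position distribution (a Gaussian here)."""
    return rng.gauss(mean[0], std), rng.gauss(mean[1], std)

def place_dependents(anchor, specs, rng, max_tries=200):
    """Place dependent objects around an anchor with rejection sampling.

    anchor: (cx, cy, width, depth) of the already-placed object
    specs:  list of (width, depth, mean_offset) for each dependent object
    Returns the placed boxes; objects that never fit are skipped.
    """
    placed = [anchor]
    for width, depth, mean in specs:
        for _ in range(max_tries):
            dx, dy = sample_offset(rng, mean, std=0.3)
            box = (anchor[0] + dx, anchor[1] + dy, width, depth)
            if not any(overlaps(box, other) for other in placed):
                placed.append(box)     # accept the first collision-free sample
                break
    return placed[1:]

rng = random.Random(0)
table = (0.0, 0.0, 1.6, 0.9)
chairs = [(0.5, 0.5, (0.0, 0.9)), (0.5, 0.5, (0.0, -0.9)),
          (0.5, 0.5, (1.2, 0.0)), (0.5, 0.5, (-1.2, 0.0))]
layout = place_dependents(table, chairs, rng)
```

Each chair is sampled near its preferred offset from the table, and any sample that intersects an already placed footprint is discarded and redrawn.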

2604.11087 2026-05-12 cs.LG

CausalGaze: Unveiling Hallucinations via Counterfactual Graph Intervention in Large Language Models

Linggang Kong, Lei Wu, Yunlong Zhang, Xiaofeng Zhong, Zhen Wang, Yongjie Wang, Yao Pan

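AI summary: Despite breakthroughs in large language models (LLMs), hallucination remains a key bottleneck for their use in high-stakes domains. To address existing methods' reliance on static signals and neglect of causal mechanisms, this paper proposes CausalGaze, a new hallucination detection framework based on structural causal models that uses counterfactual interventions to expose the model's internal causal reasoning paths and improve interpretability. Experiments show strong detection performance across several datasets and mainstream models, including a 3.3% AUROC improvement on TruthfulQA.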

Comments Accepted as ACL2026 Findings

Abstract

Despite the groundbreaking advancements made by large language models (LLMs), hallucination remains a critical bottleneck for their deployment in high-stakes domains. Existing classification-based methods mainly rely on static and passive signals from internal states, which often capture noise and spurious correlations while overlooking the underlying causal mechanisms. To address this limitation, we shift the paradigm from passive observation to active intervention by introducing CausalGaze, a novel hallucination detection framework based on structural causal models (SCMs). CausalGaze models LLMs' internal states as dynamic causal graphs and employs counterfactual interventions to disentangle causal reasoning paths from incidental noise, thereby enhancing model interpretability. Extensive experiments across four datasets and three widely used LLMs demonstrate the effectiveness of CausalGaze, especially achieving a 3.3% improvement in AUROC on the TruthfulQA dataset compared to state-of-the-art baselines.

2604.08178 2026-05-12 cs.AI

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

Jiaxuan Wang, Yulan Hu, Wenjin Yang, Zheng Pan, Xin Li, Lan-Zhe Guo

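AI summary: This paper presents Plan-RewardBench, a benchmark for evaluating trajectory-level reward models (RMs) for agents, addressing the lack of evaluations designed for RM capabilities in tool-integrated environments. The benchmark covers four task categories and contains validated positive trajectories and confusable negative trajectories, generated in several ways to test how well models tell them apart. Experiments show that existing reward models degrade significantly on long trajectories, underscoring the need for trajectory-level reward modeling in agentic systems.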

Comments accepted to ACL 2026 main conference

Abstract

In classical Reinforcement Learning from Human Feedback (RLHF), Reward Models (RMs) serve as the fundamental signal provider for model alignment. As Large Language Models evolve into agentic systems capable of autonomous tool invocation and complex reasoning, the paradigm of reward modeling faces unprecedented challenges -- most notably, the lack of benchmarks specifically designed to assess RM capabilities within tool-integrated environments. To address this gap, we present Plan-RewardBench, a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred versus distractor agent trajectories in complex tool-using scenarios. Plan-RewardBench covers four representative task families -- (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery -- comprising validated positive trajectories and confusable hard negatives constructed via multi-model natural rollouts, rule-based perturbations, and minimal-edit LLM perturbations. We benchmark representative RMs (generative, discriminative, and LLM-as-Judge) under a unified pairwise protocol, reporting accuracy trends across varying trajectory lengths and task categories. Furthermore, we provide diagnostic analyses of prevalent failure modes. Our results reveal that all three evaluator families face substantial challenges, with performance degrading sharply on long-horizon trajectories, underscoring the necessity for specialized training in agentic, trajectory-level reward modeling. Ultimately, Plan-RewardBench aims to serve as both a practical evaluation suite and a reusable blueprint for constructing agentic planning preference data.

2604.08153 2026-05-12 cs.RO

Semantic-Aware UAV Command and Control for Efficient IoT Data Collection

Assane Sankara, Daniel Bonilla Licea, Hajar El Hammouti

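AI summary: This paper studies how semantic-aware UAV command and control (C&C) can make image data collection from IoT devices more efficient. It proposes a framework that combines semantic communication with UAV trajectory control: deep joint source-channel coding produces a semantically compressed representation of each image, and an adaptive flight policy based on double deep Q-learning optimizes the UAV trajectory to maximize image reconstruction quality. Experiments show that the method outperforms conventional approaches in device coverage and semantic reconstruction quality.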

Comments Accepted for publication at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). v2: added clarification on the DDQN implementation and TSP algorithm

Abstract

Unmanned Aerial Vehicles (UAVs) have emerged as a key enabler technology for data collection from Internet of Things (IoT) devices. However, effective data collection is challenged by resource constraints and the need for real-time decision-making. In this work, we propose a novel framework that integrates semantic communication with UAV command-and-control (C&C) to enable efficient image data collection from IoT devices. Each device uses Deep Joint Source-Channel Coding (DeepJSCC) to generate a compact semantic latent representation of its image to enable image reconstruction even under partial transmission. A base station (BS) controls the UAV's trajectory by transmitting acceleration commands. The objective is to maximize the average quality of reconstructed images by maintaining proximity to each device for a sufficient duration within a fixed time horizon. To address the challenging trade-off and account for delayed C&C signals, we model the problem as a Markov Decision Process and propose a Double Deep Q-Learning (DDQN)-based adaptive flight policy. Simulation results show that our approach outperforms baseline methods such as greedy and traveling salesman algorithms, in both device coverage and semantic reconstruction quality.

2604.06774 2026-05-12 cs.LG cs.AI math.FA

Sparse-Aware Neural Networks for Nonlinear Functionals: Mitigating the Exponential Dependence on Dimension

Jianfei Li, Shuo Huang, Han Feng, Ding-Xuan Zhou, Gitta Kutyniok

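AI summary: This paper studies how sparsity can be used to tackle the high-dimensional difficulty of learning nonlinear functionals on function spaces. The authors propose a framework that combines convolutional architectures with fully connected networks to extract sparse features from finitely many samples and approximate nonlinear functionals effectively. Building on universal discretization, they prove that sparse approximators allow stable recovery of functions from discrete samples under both deterministic and random sampling schemes. This improves approximation efficiency and reduces sample requirements across a range of function spaces, and it offers new theoretical insight into mitigating the curse of dimensionality in high-dimensional learning.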

Abstract

Deep neural networks have emerged as powerful tools for learning operators defined over infinite-dimensional function spaces. However, existing theories frequently encounter difficulties related to dimensionality and limited interpretability. This work investigates how sparsity can help address these challenges in functional learning, a central ingredient in operator learning. We propose a framework that employs convolutional architectures to extract sparse features from a finite number of samples, together with deep fully connected networks to effectively approximate nonlinear functionals. Using universal discretization methods, we show that sparse approximators enable stable recovery from discrete samples. In addition, both the deterministic and the random sampling schemes are sufficient for our analysis. These findings lead to improved approximation rates and reduced sample sizes in various function spaces, including those with fast frequency decay and mixed smoothness. They also provide new theoretical insights into how sparsity can alleviate the curse of dimensionality in functional learning.

2604.06473 2026-05-12 cs.LG

MICA: Multivariate Infini Compressive Attention for Time Series Forecasting

Willa Potosnak, Nina Żukowska, Michał Wiliński, Dan Howarth, Ignacy Stępka, Mononito Goswami, Artur Dubrawski

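AI summary: This paper proposes MICA, a multivariate time series forecasting model that addresses the excessive computational cost Transformers incur from cross-channel attention on high-dimensional time series. MICA extends efficient compressive attention from the sequence dimension to the channel dimension, modeling inter-channel dependencies while keeping computational complexity linear in both channel count and context length. Experiments show that MICA significantly reduces forecast error on multiple benchmarks and outperforms existing deep Transformer and MLP models on multivariate forecasting, confirming its scalability advantage.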

Abstract

Multivariate forecasting with Transformers faces a core scalability challenge: modeling cross-channel dependencies via attention compounds attention's quadratic sequence complexity with quadratic channel scaling, making full cross-channel attention impractical for high-dimensional time series. We propose Multivariate Infini Compressive Attention (MICA), an architectural design to extend channel-independent Transformers to channel-dependent forecasting. By adapting efficient attention techniques from the sequence dimension to the channel dimension, MICA adds a cross-channel attention mechanism to channel-independent backbones that scales linearly with channel count and context length. We evaluate channel-independent Transformer architectures with and without MICA across multiple forecasting benchmarks. MICA reduces forecast error over its channel-independent counterparts by 5.4% on average and up to 25.4% on individual datasets, highlighting the importance of explicit cross-channel modeling. Moreover, models with MICA rank first among deep multivariate Transformer and MLP baselines. MICA models also scale more efficiently with respect to both channel count and context length than Transformer baselines that compute attention across both the temporal and channel dimensions, establishing compressive attention as a practical solution for scalable multivariate forecasting.

2604.05064 2026-05-12 cs.LG cs.AI

Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series

Annita Vapsi, Penghang Liu, Saheed Obitayo, Aakriti, Manoj Cherukumalli, Prathamesh Patil, Amit Varshney, Nicolas Marchesotti, Elizabeth Fons, Vamsi K. Potluru, Manuela Veloso

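AI summary: Synthetic data used to train time series foundation models lacks realistic dynamic correlations between variables. To address this, the study proposes DynLMC, a dynamic linear model of coregionalization that generates multivariate time series with time-varying correlation structure and cross-channel lag relationships. Experiments show that fine-tuning foundation models on DynLMC-generated data significantly improves zero-shot forecasting across multiple benchmarks, confirming the importance of dynamic modeling for generalization.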

Comments ICLR 2026 Workshop on Time Series in the Age of Large Models

Abstract

Synthetic data is essential for training foundation models for time series (FMTS), but most generators assume static correlations, and are typically missing realistic inter-channel dependencies. We introduce DynLMC, a Dynamic Linear Model of Coregionalization, that incorporates time-varying, regime-switching correlations and cross-channel lag structures. Our approach produces synthetic multivariate time series with correlation dynamics that closely resemble real data. Fine-tuning three foundational models on DynLMC-generated data yields consistent zero-shot forecasting improvements across nine benchmarks. Our results demonstrate that modeling dynamic inter-channel correlations enhances FMTS transferability, highlighting the importance of data-centric pretraining.

2604.03928 2026-05-12 cs.LG cs.AI cs.CV stat.ML

Supervised Dimensionality Reduction Revisited: Why LDA on Frozen CNN Features Deserves a Second Look

Indar Kumar, Girish Karhana, Sai Krishna Jasti, Ankit Hemant Lade

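AI summary: This paper revisits the effectiveness of supervised dimensionality reduction, in particular linear discriminant analysis (LDA), on frozen pretrained convolutional neural network features. It compares several dimensionality-reduction strategies across multiple vision tasks and finds that LDA significantly improves classification accuracy and sharply reduces feature dimensionality on coarse-grained tasks but performs worse on fine-grained tasks. The experiments indicate that LDA works well when class-level structure is pronounced and can backfire on tasks that need subtle distinctions, giving practical guidance for applying dimensionality reduction in frozen-feature classification pipelines.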

Comments 11 pages, 5 figures, 5 tables. Code available at https://github.com/IndarKarhana/lda-image-classification

Abstract

Frozen pretrained image representations are widely used for transfer learning: a backbone is kept fixed, feature vectors are extracted, and a lightweight classifier is trained on top. This pipeline usually feeds the full feature vector to the classifier, even when the target task has far fewer classes than the pretraining task. We revisit a classical alternative: supervised dimensionality reduction with Linear Discriminant Analysis (LDA) before linear probing. We evaluate ten dimensionality-reduction strategies on frozen features from six backbones -- ResNet-18, ResNet-50, MobileNetV3-Small, EfficientNet-B0, ViT-B/16, and DINOv2-ViT-S/14 -- across CIFAR-100, Tiny ImageNet, and CUB-200-2011. Under a fixed logistic-regression protocol, LDA improves accuracy over full features in 11 of 12 coarse-grained configurations, with gains up to 4.5 percentage points while reducing feature dimensionality by 48-87%. The same projection consistently hurts on fine-grained CUB-200, where full features win across all six backbones. This establishes a practical boundary condition: LDA is useful when class-level structure is coarse enough to be captured by mean-separating directions, but it can discard subtle cues needed for fine-grained recognition. We also compare LDA with PCA, PCA+LDA, regularized LDA, Local Fisher Discriminant Analysis, Neighbourhood Components Analysis, and three lightweight LDA extensions. The results show that plain LDA offers the best accuracy-cost tradeoff for most coarse-grained settings, while more complex supervised reduction methods rarely justify their additional cost. Overall, the study provides concrete guidance for when post-hoc supervised projection should, and should not, be inserted into frozen-feature image classification pipelines.

2604.03883 2026-05-12 cs.LG cs.AI cs.SY eess.SY stat.ML

Regime-Calibrated Fleet Repositioning with a Spatial Queue-Regret Decomposition

Indar Kumar, Akanksha Tiwari

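AI summary: This paper studies how ride-hailing and autonomous mobility-on-demand operators reposition idle supply before future demand is fully observed, and proposes a predict-then-optimize approach calibrated on historical demand patterns. The core components are a similarity gate trained to reduce demand error, pickup spatial mismatch, and queue shortage risk, and a spatial queue-regret decomposition that uses a stable queueing surrogate to analyze how demand-field error affects waiting time. Experiments on multiple New York City scenarios show that the method lowers mean wait time relative to hand-tuned similarity and a distributional baseline.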

Comments 13 pages, 4 figures, 8 tables. Code: https://github.com/IndarKarhana/regime-calibrated-dispatch

Abstract

Ride-hailing and autonomous mobility-on-demand operators reposition idle supply before future demand is fully observed. We study a retrieval-calibrated predict-then-optimize approach for this problem: historical demand regimes are matched to the current query block, combined into a calibrated demand prior, and passed to a fleet-balancing controller. The paper makes three contributions. First, we train a leakage-safe similarity gate whose objective penalizes demand error, pickup spatial mismatch, and queue shortage risk rather than retrieval rank alone. Second, we develop a spatial queue-regret decomposition for a stable queueing surrogate, linking demand-field error to wait through queueing sensitivity, allocator sensitivity, and Wasserstein pickup mismatch. Third, we evaluate learned retrieval and external-style rebalancing baselines in a common simulator. In the calibrated-demand gate experiment, across eight New York City scenarios and ten seeds, the spatial gate reduces mean wait to 82.3s, compared with 85.3s for hand-tuned similarity and 85.8s for a distributional-only baseline. In a separate replay-demand controller comparison, a scenario chance-MPC analog and a share-target transportation LP improve on Wen-style rebalancing (92.2s/92.2s vs. 100.1s), a reduced GPR chance-MPC comparator is intermediate at 94.4s, and an oracle MPC diagnostic is 91.3s.

2604.02151 2026-05-12 cs.LG

Auction-Based Online Policy Adaptation for Evolving Objectives

Guruprerana Shabadi, Kaushik Mallik

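AI summary: This paper studies multi-objective reinforcement learning in which the objectives change dynamically, and proposes an auction-based framework for online policy adaptation. Each objective gets its own local policy, and an auction coordinates execution: policies bid according to the urgency of the current state, and the highest bidder chooses the action, giving a dynamic trade-off among objectives. When objectives change, the system adapts quickly by adding or removing the corresponding policies, and objectives of the same type can reuse a parameterized policy, which speeds up adaptation at runtime. Experiments show that the method outperforms conventional monolithic policies on several tasks.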

Comments 22 pages, 8 figures

Abstract

We consider multi-objective reinforcement learning problems where objectives come from an identical family -- such as the class of reachability objectives -- and may appear or disappear at runtime. Our goal is to design adaptive policies that can efficiently adjust their behaviors as the set of active objectives changes. To solve this problem, we propose a modular framework where each objective is supported by a selfish local policy, and coordination is achieved through a novel auction-based mechanism: policies bid for the right to execute their actions, with bids reflecting the urgency of the current state. The highest bidder selects the action, enabling a dynamic and interpretable trade-off among objectives. Going back to the original adaptation problem, when objectives change, the system adapts by simply adding or removing the corresponding policies. Moreover, as objectives arise from the same family, identical copies of a parameterized policy can be deployed, facilitating immediate adaptation at runtime. We show how the selfish local policies can be computed by turning the problem into a general-sum Markov game, where the policies compete against each other to fulfill their own objectives. To succeed, each policy must not only optimize its own objective, but also reason about the presence of other goals and learn to produce calibrated bids that reflect relative priority. Under mild assumptions, we prove the existence of Nash equilibria where dishonest bidding leads to suboptimal outcomes, and the most urgent objectives win control automatically. In our implementation, the policies are trained concurrently using proximal policy optimization (PPO). We evaluate on two Atari games and a gridworld-based path-planning task with dynamic targets. Our method achieves substantially better performance than monolithic policies trained with PPO.
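
The coordination step (policies bid, the highest bidder acts, objectives are added or removed at runtime) can be sketched on a toy grid. This is only an illustration under stated assumptions: in the paper the bids are learned with PPO in a general-sum Markov game, whereas here `ReachPolicy.bid` is a hand-written urgency heuristic based on a hypothetical `deadline`, and the action rule is a greedy step toward the target.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

class ReachPolicy:
    """Selfish local policy for one reachability objective (hypothetical)."""

    def __init__(self, target, deadline):
        self.target = target
        self.deadline = deadline      # steps remaining to reach the target

    def bid(self, state):
        # Hand-written urgency: less slack between deadline and distance -> higher bid.
        slack = self.deadline - manhattan(state, self.target)
        return 1.0 / (1.0 + max(slack, 0))

    def action(self, state):
        # Greedy step toward this policy's own target.
        dx = self.target[0] - state[0]
        dy = self.target[1] - state[1]
        if dx != 0:
            return (1 if dx > 0 else -1, 0)
        if dy != 0:
            return (0, 1 if dy > 0 else -1)
        return (0, 0)

def auction_step(state, policies):
    """The highest bidder chooses the action for this step."""
    winner = max(policies, key=lambda p: p.bid(state))
    return winner, winner.action(state)

policies = [ReachPolicy(target=(5, 0), deadline=20),
            ReachPolicy(target=(0, 3), deadline=4)]
winner, move = auction_step((0, 0), policies)   # the tight-deadline policy wins

# A new objective appears at runtime: adapting means appending one more policy.
policies.append(ReachPolicy(target=(-1, 0), deadline=1))
winner2, move2 = auction_step((0, 0), policies)
```

Because every objective comes from the same family, the same `ReachPolicy` class is reused with different parameters, which mirrors the abstract's point about deploying identical copies of a parameterized policy.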

2604.01824 2026-05-12 cs.CV

STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall, Mohsen Fayyaz

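AI summary: STRIVE is a structured spatiotemporal reinforcement learning framework for video question answering that targets the low reward variance and unstable policy updates of existing methods. It constructs multiple spatiotemporal variants of each input video and normalizes jointly across text generations and visual variants, enriching the reward signal and stabilizing policy updates. STRIVE also introduces an importance-based sampling mechanism that keeps exploration semantically relevant while preserving temporal coverage; experiments show it outperforms existing reinforcement learning methods on multiple video reasoning benchmarks.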

Abstract

We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
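
The low-reward-variance problem and the effect of joint normalization can be seen in a small numeric sketch. It assumes the usual group-relative advantage form (reward minus group mean, divided by group standard deviation); the abstract does not give STRIVE's exact formula, so the functions and reward values below are illustrative only.

```python
from statistics import mean, pstdev

def normalize(rewards, eps=1e-6):
    """Group-relative advantage: center and scale rewards within one group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def per_variant_advantages(reward_grid):
    """Normalize each video variant's responses separately."""
    return [normalize(row) for row in reward_grid]

def joint_advantages(reward_grid):
    """Normalize over all (variant, response) pairs as one group."""
    flat = normalize([r for row in reward_grid for r in row])
    width = len(reward_grid[0])       # assumes every variant has the same number of responses
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# Rows: two hypothetical spatiotemporal variants; columns: four sampled responses.
rewards = [[1.0, 1.0, 1.0, 1.0],   # every response correct -> no within-row signal
           [0.0, 0.0, 1.0, 0.0]]
separate = per_variant_advantages(rewards)
joint = joint_advantages(rewards)
```

With separate normalization the first variant yields all-zero advantages because its rewards are identical; with joint normalization the same responses receive positive advantages relative to the harder variant.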

2603.28254 2026-05-12 cs.LG stat.ML

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, Ganzhao Yuan

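AI summary: This paper proposes MuonEq, a lightweight pre-orthogonalization equilibration method that improves orthogonalized-update strategies for optimizing matrix parameters. It normalizes the rows or columns of the momentum matrix before orthogonalization, which improves the geometry seen by the orthogonalization step and benefits training. Experiments show that MuonEq outperforms the original method when pretraining several large language models, with faster convergence and lower validation perplexity.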

Abstract

Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions typically either rescale updates after orthogonalization or use heavier whitening-based preconditioners before it. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon with three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). By rebalancing the momentum matrix before finite-step Newton--Schulz orthogonalization, MuonEq improves the geometry seen by orthogonalization. We show that finite-step orthogonalization is governed by the input spectrum, especially stable rank and condition number, and that row/column normalization acts as a zeroth-order surrogate for whitening. For hidden matrix weights, R is the default variant. Theoretically, MuonEq (R) retains the standard $\widetilde{\mathcal O}(T^{-1/4})$ Muon-type nonconvex stationarity guarantee with decoupled weight decay and a horizon-free diminishing learning-rate schedule, and extends it to finite-step NS5 up to an explicit inexactness constant. In LLaMA2 pretraining on C4, MuonEq (R) consistently outperforms Muon on 130M, 350M, and 1B models, with faster convergence and lower validation perplexity. The code is available at the MuonEq codebase: https://github.com/MaeChd/muon-eq
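
A toy example can show why balancing rows before a finite number of Newton--Schulz steps matters. The sketch below is not the MuonEq code: it uses the classical cubic Newton--Schulz iteration, which may differ from the polynomial used in practical Muon implementations, applies a plain unit-norm row scaling as the R variant, and all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def row_normalize(m, eps=1e-12):
    """R variant: scale every row of the momentum matrix to unit L2 norm."""
    return [[x / (math.sqrt(sum(v * v for v in row)) + eps) for x in row]
            for row in m]

def newton_schulz(m, steps=5):
    """Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.

    The input is first divided by its Frobenius norm so that all singular
    values lie in (0, 1], inside the iteration's convergence region.
    """
    fro = math.sqrt(sum(x * x for row in m for x in row))
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(x, xxt_x)]
    return x

def orthogonality_error(x):
    """Largest entrywise deviation of X X^T from the identity."""
    gram = matmul(x, transpose(x))
    return max(abs(gram[i][j] - (1.0 if i == j else 0.0))
               for i in range(len(gram)) for j in range(len(gram)))

# Toy momentum matrix with badly imbalanced row scales.
momentum = [[100.0, 20.0, 0.0], [0.01, 0.03, 0.02]]
plain = newton_schulz(momentum, steps=5)
equilibrated = newton_schulz(row_normalize(momentum), steps=5)
err_plain = orthogonality_error(plain)
err_equilibrated = orthogonality_error(equilibrated)
```

With the same five steps, the row-normalized input ends up far closer to an orthogonal matrix than the raw one, because the tiny singular value of the unbalanced matrix grows too slowly under a finite-step iteration.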

2603.27977 2026-05-12 cs.AI

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

Yifan Wang, Bolian Li, David Cho, Ruqi Zhang, Fanping Sui, Ananth Grama

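AI summary: This study proposes SARL, a label-free reinforcement learning framework intended to improve the general reasoning ability of large reasoning models. Unlike conventional methods that focus on the outcome of reasoning, SARL rewards the structural topology of the reasoning process, guiding the model toward more sensible and coherent reasoning paths. Experiments show that SARL outperforms existing label-free reinforcement learning methods on multiple math and open-ended tasks, with more stable and more exploratory training.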

Abstract

Reinforcement learning is critical to improving large reasoning models, but its success relies heavily on verifiable rewards (RLVR), making it hard to use in open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimizing solely toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce Structure-Aware Reinforcement Learning (SARL), a label-free framework that constructs per-response reasoning maps from intermediate thinking steps and rewards their reasoning topology. SARL shifts supervision from destination to path, encouraging reasoning trajectories that are both locally coherent and globally efficient. On verifiable math tasks, SARL outperforms prior label-free RL baselines and even exceeds RL methods with ground truth supervision, with average gains of +9.1% under PPO and +11.6% under GRPO across four math benchmarks, with particularly large improvements on AIME25 (+35.5% with PPO and +44.7% with GRPO). On non-verifiable open-ended tasks, SARL achieves average gains of +34.6% under PPO and +30.4% under GRPO on WildBench across five task categories, outperforming prior label-free RL methods and DPO, which relies on additional preference labels. Beyond strong performance, SARL exhibits substantially lower KL divergence and higher policy entropy, indicating more stable and exploratory training dynamics. Code and data are available at https://github.com/cacayaya/SARL.

2603.25561 2026-05-12 cs.LG

An Integrative Genome-Scale Metabolic Modeling and Machine Learning Framework for Predicting and Optimizing Single-Cell Protein Production in Saccharomyces cerevisiae

Neha K. Nair, Aaron D'Souza

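AI summary: This study proposes a computational framework that combines a genome-scale metabolic model (GEM) with machine learning to predict and optimize single-cell protein (SCP) yield in Saccharomyces cerevisiae. By integrating the Yeast9 metabolic model with random forests, a variational autoencoder, and Bayesian optimization, it identifies key metabolic reactions affecting SCP synthesis and achieves a substantial increase in biomass flux. The framework offers a systematic prediction and optimization strategy for SCP production and has significant value for industrial application.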

Comments 22 pages, 7 figures, and 4 tables

Abstract

Saccharomyces cerevisiae is increasingly recognised as a key source for single-cell protein (SCP) production, a rising solution to global protein-supply challenges. This study presents a computational framework combining the Yeast9 genome-scale metabolic model (GEM) with machine learning and optimisation to predict and enhance biomass flux for SCP yield. The Yeast9 GEM, comprising 4,131 reactions, 2,806 metabolites, and 1,161 genes, was simulated using flux balance analysis (FBA) across 2,000 Latin Hypercube-sampled flux profiles. Random Forest and XGBoost regressors achieved R² values of 0.9999760 and 0.9997702, respectively. A variational autoencoder (VAE) identified four metabolic clusters with mean biomass fluxes of 0.472, 0.493, 0.527, and 0.505 gDW/hr. SHAP-based feature attribution identified twenty key reactions in glycolysis, the TCA cycle, and amino-acid biosynthesis; 18/20 (90%) were confirmed essential by in silico knockout. Bayesian optimisation produced a 12.13-fold improvement in biomass flux (0.0858 to 1.041 gDW/hr) at glucose = -20.0, oxygen = -20.0, and ammonium = -8.9 mmol/gDW/hr. A generative adversarial network (GAN) generated novel flux configurations (variance = 0.124); stoichiometric feasibility verification returned 0/100 feasible profiles due to incomplete generator convergence, reported as a limitation. Pareto front analysis identified an optimal SCP operating point at 0.0858 gDW/hr biomass flux with amino-acid biosynthesis score of 1000.029 mmol/gDW/hr.

2603.25074 2026-05-12 cs.CV

Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers

Nanxiang Jiang, Zhaoxin Fan, Baisen Wang, Daiheng Gao, Junhang Cheng, Jifeng Guo, Yalan Qin, Yeying Jin, Hongwei Zheng, Faguo Wu, Wenjun Wu

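AI summary: Z-Erase is a concept erasure method designed for single-stream diffusion transformers (such as Z-Image) that aims to safely remove unwanted concepts from text-to-image models. It proposes a stream-disentangled concept erasure framework and a Lagrangian-guided adaptive erasure modulation algorithm, which resolve the generation collapse caused by applying conventional erasure methods directly to single-stream models, and it achieves state-of-the-art performance on multiple tasks.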

Abstract

Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recent emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to generation collapse. To bridge this gap, we introduce Z-Erase, the first concept erasure method tailored for single-stream T2I models. To guarantee stable image generation, Z-Erase first proposes a Stream Disentangled Concept Erasure Framework that decouples updates and enables existing methods on single-stream models. Subsequently, within this framework, we introduce Lagrangian-Guided Adaptive Erasure Modulation, a constrained algorithm that further balances the sensitive erasure-preservation trade-off. Moreover, we provide a rigorous convergence analysis proving that Z-Erase can converge to a Pareto stationary point. Experiments demonstrate that Z-Erase successfully overcomes the generation collapse issue, achieving state-of-the-art performance across a wide range of tasks.

2603.21362 2026-05-12 cs.AI cs.CL

AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

Liang Ding

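AI summary: AdaRubric is a task-adaptive evaluation framework that addresses the poor task relevance of LLM agent evaluation. It automatically generates task-specific evaluation criteria from the task description and combines them with confidence-weighted per-dimension scoring and dense reward signals to improve evaluation accuracy and reliability. Experiments show that AdaRubric significantly outperforms existing methods on multiple benchmarks and generalizes zero-shot to new domains and multimodal agent tasks.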

Comments KnowFM @ ACL 2026

Abstract

Evaluating LLM agent trajectories is fundamentally task-specific: a code-debugging agent should be judged on Correctness and Error Handling, not on Fluency or Safety. Yet the dominant paradigm -- LLM-as-Judge with a fixed rubric -- applies the same static dimensions regardless of task, producing systematic mis-evaluation. We present AdaRubric, a framework that (i) adaptively generates task-specific evaluation rubrics from task descriptions via LLM, (ii) evaluates agent trajectories step-by-step with confidence-weighted, per-dimension scoring, and (iii) produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter that provably prevents dimension-level quality masking, yield high-quality DPO preference pairs. On WebArena, ToolBench, and AgentBench, AdaRubric achieves Pearson r = 0.79 human correlation (+0.15 over the strongest baseline), with strong reliability (Krippendorff's alpha = 0.83). DPO models trained on AdaRubric-generated pairs improve task success by +6.8-8.5% over the best baseline. AdaRubric also generalises zero-shot to unseen domains (SWE-bench) and extends to multimodal agents (VisualWebArena, OSWorld) without modification. Our code is available at: github.com/alphadl/AdaRubrics

2603.21357 2026-05-12 cs.AI cs.CL

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Liang Ding

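AI summary This paper proposes AgentHER, which applies hindsight experience replay (HER) to the natural-language trajectories of LLM agents, relabeling failed trajectories as correct demonstrations for alternative goals to improve training efficiency. A four-stage pipeline converts discarded failed trajectories into training data for supervised fine-tuning, direct preference optimization, and ShareGPT, significantly improving model performance on WebArena and ToolBench. Experiments show higher sample efficiency and performance gains across multiple models, and robustness mechanisms effectively reduce label noise.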

Details
English abstract

LLM-agent training pipelines routinely discard failed trajectories even though GPT-4o achieves only 14-20% on WebArena and below 55% pass@1 on ToolBench; even specialised systems at 50-65% leave the majority of trajectories unused. We introduce AgentHER, which recovers this lost signal by adapting Hindsight Experience Replay (HER) to natural-language agent trajectories: a trajectory that fails goal A is often a correct demonstration for an achievable alternative goal B. AgentHER realises this through a four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging) that converts discarded failures into SFT, DPO, and ShareGPT training data. On WebArena and ToolBench under a strict task-disjoint held-out protocol, AgentHER improves over success-only SFT by +7.6-11.4% across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), achieves 2x sample efficiency, and beats the strongest experience-centric baseline (Agent Workflow Memory) by +3.0-6.2%. Two robustness mechanisms, failure-severity weighting and cross-model multi-judge verification (gpt-4o-mini paired with Qwen2.5-72B-Instruct), reduce label noise from 5.9% to 2.9% and raise human-rated relabeling precision to 97.1% on WebArena and 96.0% on ToolBench. A full system-cost audit shows the entire relabeling pipeline costs 2.98 and 26 wall-clock minutes for 3,000 trajectories, i.e. 1.4 x 10^-3 per accepted pair. Code: https://github.com/alphadl/AgentHER
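
The relabeling-with-confidence-gating idea can be sketched as follows. This is a minimal illustration under stated assumptions: the `Trajectory` and `Relabel` types, the `relabel_failure` helper, and the 0.8 threshold are hypothetical, and the relabel proposal is passed in rather than produced by an LLM judge.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    goal: str          # goal the agent was asked to achieve
    steps: List[str]   # actions and observations as text
    succeeded: bool


@dataclass
class Relabel:
    new_goal: str      # alternative goal the trajectory actually achieved
    confidence: float  # judge confidence in [0, 1]


def relabel_failure(traj: Trajectory, proposal: Optional[Relabel],
                    min_confidence: float = 0.8) -> Optional[Trajectory]:
    """Return the failure relabeled as a success, or None if it is gated out."""
    if traj.succeeded or proposal is None:
        return None  # nothing to relabel
    if proposal.confidence < min_confidence:
        return None  # confidence gate: drop uncertain relabels
    return Trajectory(goal=proposal.new_goal, steps=traj.steps, succeeded=True)
```

Only the goal changes; the steps are reused unchanged, which is what lets a failure for goal A serve as a demonstration for goal B.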

2603.16869 2026-05-12 cs.CV

SegviGen: Repurposing 3D Generative Model for Part Segmentation

Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie, Shaohua Hou, Keqing Fan, Pan Hu, Sheng Wang, Buyu Li, Lu Sheng

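AI summary This paper proposes SegviGen, a framework that repurposes pretrained 3D generative models for efficient 3D part segmentation. The method exploits the structural priors encoded in the generative model and guides segmentation through a distinctive part colorization strategy, avoiding the cross-view inconsistency and blurred boundaries of conventional methods. Experiments show that SegviGen outperforms the prior best methods by 40% on interactive segmentation and by 15% on full segmentation while requiring very little labeled data, demonstrating that pretrained 3D generative models transfer strongly to part segmentation.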

Comments Project page: https://fenghora.github.io/SegviGen-Page/

Details
English abstract

We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in a pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at https://fenghora.github.io/SegviGen-Page/.

2603.14937 2026-05-12 cs.LG cs.CL

LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs

Ying Zhang, Hang Yu, Haipeng Zhang, Peng Di

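AI summary This paper studies how to reason more effectively over structural relations in text-rich graphs and proposes a new approach that treats the large language model (LLM) as a graph kernel. The core method, RAMP, uses raw-text-anchored message passing so that the LLM itself serves as the graph's aggregation operator, avoiding the compression and loss of textual information in conventional methods. It handles discriminative and generative tasks under a unified generative framework. Experiments show that it performs well at bridging graph propagation and deep text reasoning, offering new ideas for applying LLMs to graph learning.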

Comments 23 pages, 5 figures

Details
English abstract

Text-rich graphs, which integrate complex structural dependencies with abundant textual information, are ubiquitous yet remain challenging for existing learning paradigms. Conventional methods and even LLM-hybrids compress rich text into static embeddings or summaries before structural reasoning, creating an information bottleneck and detaching updates from the raw content. We argue that in text-rich graphs, the text is not merely a node attribute but the primary medium through which structural relationships are manifested. We introduce RAMP, a Raw-text Anchored Message Passing approach that moves beyond using LLMs as mere feature extractors and instead recasts the LLM itself as a graph-native aggregation operator. RAMP exploits the text-rich nature of the graph via a novel dual-representation scheme: it anchors inference on each node's raw text during each iteration while propagating dynamically optimized messages from neighbors. It further handles both discriminative and generative tasks under a single unified generative formulation. Extensive experiments show that RAMP effectively bridges the gap between graph propagation and deep text reasoning, achieving competitive performance and offering new insights into the role of LLMs as graph kernels for general-purpose graph learning.

2603.14107 2026-05-12 cs.LG cs.AI cs.CE cs.ET cs.NE

ST-ResGAT: Explainable Spatio-Temporal Graph Neural Network for Road Condition Prediction and Priority-Driven Maintenance

Mohsin Mahmud Topu, Azmine Toushik Wasi, Mahfuz Ahmed Anik, MD Manjurul Ahsan

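AI summary This paper proposes ST-ResGAT, an explainable spatio-temporal graph neural network for predicting road conditions and deriving priority-driven maintenance strategies. The method combines residual graph-attention encoding with GRU temporal aggregation to forecast pavement deterioration accurately and to produce ASTM-compliant maintenance priorities directly. Experiments show that ST-ResGAT achieves excellent predictive performance on a real-world dataset, and graph explanation techniques confirm that the model's decisions are consistent with engineering theory, offering a practical and sustainable solution for intelligent infrastructure management in high-risk, low-resource regions.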

Comments 40 Pages. 10 Tables. 8 Figures

Details
Journal ref
Intelligent Transportation Infrastructure, 2026
English abstract

Climate-vulnerable road networks require a paradigm shift from reactive, fix-on-failure repairs to predictive, decision-ready maintenance. This paper introduces ST-ResGAT, a novel Spatio-Temporal Residual Graph Attention Network that fuses residual graph-attention encoding with GRU temporal aggregation to forecast pavement deterioration. Engineered for resource-constrained deployment, the framework translates continuous Pavement Condition Index (PCI) forecasts directly into the American Society for Testing and Materials (ASTM)-compliant maintenance priorities. Using a real-world inspection dataset of 750 segments in Sylhet, Bangladesh (2021-2024), ST-ResGAT significantly outperforms traditional non-spatial machine learning baselines, achieving exceptional predictive fidelity (R² = 0.93, RMSE = 2.72). Crucially, ablation testing confirmed the mathematical necessity of modeling topological neighbor effects, proving that structural decay acts as a spatial contagion. Uniquely, we integrate GNNExplainer to unbox the model, demonstrating that its learned priorities align perfectly with established physical engineering theory. Furthermore, we quantify classification safety: achieving 85.5% exact ASTM class agreement and 100% adjacent-class containment, ensuring bounded, engineer-safe predictions. To connect model outputs to policy, we generate localized longitudinal maintenance profiles, perform climate stress-testing, and derive Pareto sustainability frontiers. ST-ResGAT therefore offers a practical, explainable, and sustainable blueprint for intelligent infrastructure management in high-risk, low-resource geological settings.
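
To make the two classification-safety figures concrete, the sketch below maps continuous PCI values to rating bands and computes an exact-class rate and a within-one-class rate. It assumes the seven-band PCI scale commonly cited from ASTM D6433 and reads "adjacent-class containment" as "predicted band at most one band away from the true band"; the band boundaries, function names, and that interpretation are assumptions, not details confirmed by the abstract.

```python
from typing import List, Tuple

# Lower bounds of PCI rating bands, best to worst. The seven-band scale is the
# one commonly cited from ASTM D6433; boundary handling here is an assumption.
BANDS: List[Tuple[float, str]] = [
    (85, "Good"), (70, "Satisfactory"), (55, "Fair"), (40, "Poor"),
    (25, "Very Poor"), (10, "Serious"), (0, "Failed"),
]


def pci_class(pci: float) -> int:
    """Map a PCI value in [0, 100] to a band index (0 = Good, 6 = Failed)."""
    for index, (lower, _label) in enumerate(BANDS):
        if pci > lower:
            return index
    return len(BANDS) - 1  # a PCI of exactly 0 falls in the worst band


def class_agreement(true_pci: List[float], pred_pci: List[float]) -> Tuple[float, float]:
    """Return (exact-class rate, within-one-class rate) over paired PCI values."""
    gaps = [abs(pci_class(t) - pci_class(p)) for t, p in zip(true_pci, pred_pci)]
    exact = sum(gap == 0 for gap in gaps) / len(gaps)
    adjacent = sum(gap <= 1 for gap in gaps) / len(gaps)
    return exact, adjacent
```

Calling `class_agreement(true_values, forecasts)` on held-out segments would yield the pair of rates; by construction the within-one-class rate is never lower than the exact rate.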

2603.13131 2026-05-12 cs.AI

MineEvolve: Self-Evolution with Accumulated Knowledge for Long-Horizon Embodied Minecraft Agents

Zhengwei Xie, Zhisheng Chen, Ziyan Weng, Jinhan Li, Chenglong Li, Zikai Xiao, Jingwei Song, Jinhao Jing, Vireo Zhang, Kun Wang

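AI summary This paper proposes MineEvolve, a knowledge-driven self-evolution framework that aims to improve the autonomous learning and task execution of Minecraft agents in long-horizon environments. The framework converts feedback from execution into structured behavioral knowledge, helping the agent adjust its strategy automatically when it fails or stagnates and thereby refine its behavior step by step. Experiments show that MineEvolve significantly improves the long-horizon performance of several language-model planners, with more pronounced gains on tasks with complex dependencies.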

Details
English abstract

Long-horizon embodied intelligence requires agents to improve through interaction, not merely to execute plans generated from static goals. A central challenge is therefore to transform past executions into knowledge that can shape future decisions. Minecraft provides a representative testbed for this problem, where tasks such as crafting tools, building redstone components, and obtaining diamond equipment involve long prerequisite chains and are frequently disrupted by missing tools, blocked paths, GUI failures, or stagnant execution. To this end, we propose MineEvolve, a knowledge-driven self-evolution framework that converts execution feedback into actionable behavioral knowledge. MineEvolve first uses ❶ Monitor to convert each subgoal execution into typed feedback, including state changes, inventory changes, failure types, progress signals, and stagnation indicators. ❷ Inducer then derives reusable skills from successful executions and remedies from failed or stagnant executions. ❸ Curator validates, merges, filters, and retrieves these knowledge entries, while ❹ Adaptor uses them to repair the unfinished part of the plan under repeated failures or stagnation. Experiments on the Minecraft MCU long-horizon task suite show that MineEvolve consistently improves performance across multiple language-model planners, with larger gains on high-dependency task groups. Ablation and knowledge-accumulation studies further demonstrate that converting execution signals into structured behavioral knowledge is an effective path toward self-evolving embodied agents in long-horizon environments. Our code is available at https://github.com/xzw-ustc/MC-MineEvolve.

2603.07686 2026-05-12 cs.RO cs.CV

UniUncer: Unified Dynamic Static Uncertainty for End to End Driving

Yu Gao, Jijun Wang, Zongzheng Zhang, Anqing Jiang, Yiru Wang, Yuwen Heng, Shuo Wang, Hao Sun, Zhangfeng Hu, Hao Zhao

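AI summary This paper proposes UniUncer, a unified dynamic-static uncertainty framework for end-to-end autonomous driving that aims to improve how the system perceives and responds to environmental uncertainty. By converting deterministic models into probabilistic regression models and introducing an uncertainty-fusion module and an uncertainty-aware gating mechanism, it jointly models and exploits the uncertainty of static map elements and dynamic traffic participants. Experiments show that UniUncer improves trajectory prediction and driving decisions on multiple benchmark datasets at minimal computational cost.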

Comments Accepted to ICRA 2026

Details
English abstract

End-to-end (E2E) driving has become a cornerstone of both industry deployment and academic research, offering a single learnable pipeline that maps multi-sensor inputs to actions while avoiding hand-engineered modules. However, the reliability of such pipelines strongly depends on how well they handle uncertainty: sensors are noisy, semantics can be ambiguous, and interaction with other road users is inherently stochastic. Uncertainty also appears in multiple forms: classification vs. localization, and, crucially, in both static map elements and dynamic agents. Existing E2E approaches model only static-map uncertainty, leaving planning vulnerable to overconfident and unreliable inputs. We present UniUncer, the first lightweight, unified uncertainty framework that jointly estimates and uses uncertainty for both static and dynamic scene elements inside an E2E planner. Concretely: (1) we convert deterministic heads to probabilistic Laplace regressors that output per-vertex location and scale for vectorized static and dynamic entities; (2) we introduce an uncertainty-fusion module that encodes these parameters and injects them into object/map queries to form uncertainty-aware queries; and (3) we design an uncertainty-aware gate that adaptively modulates reliance on historical inputs (ego status or temporal perception queries) based on current uncertainty levels. The design adds minimal overhead and drops throughput by only ~0.5 FPS while remaining plug-and-play for common E2E backbones. On nuScenes (open-loop), UniUncer reduces average L2 trajectory error by 7%. On NavsimV2 (pseudo closed-loop), it improves overall EPDMS by 10.8% and yields notable stage-two gains in challenging, interaction-heavy scenes. Ablations confirm that dynamic-agent uncertainty and the uncertainty-aware gate are both necessary.
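
For readers unfamiliar with Laplace regression heads as in item (1), the standard per-coordinate training objective is the Laplace negative log-likelihood, log(2b) + |y - mu| / b, where mu is the predicted location and b the predicted scale. The sketch below shows only this textbook formula; whether UniUncer uses exactly this loss, and the name `laplace_nll`, are assumptions.

```python
import math


def laplace_nll(target: float, loc: float, scale: float) -> float:
    """Negative log-likelihood of target under a Laplace(loc, scale) distribution."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return math.log(2.0 * scale) + abs(target - loc) / scale
```

A larger predicted scale softens the penalty for a large error but costs more when the prediction is accurate, which is what pushes the scale to track actual uncertainty.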

2603.02678 2026-05-12 cs.LG cs.ET cs.HC stat.ME stat.ML

Causal Discovery Should Embrace the Wisdom of the Crowd

Ryan Feng Lin, Yuantao Wei, Huiling Liao, Xiaoning Qian, Shuai Huang

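AI summary This paper proposes a new paradigm of causal learning based on the "wisdom of the crowd", arguing that a global causal structure can be built by integrating scattered and possibly noisy causal knowledge contributed by many people. The study draws on crowdsourcing platforms, expert knowledge elicitation and aggregation techniques, and LLM-augmented information acquisition to build a crowd-based causal learning framework spanning knowledge elicitation, modeling, aggregation, and optimization. The approach opens a new research direction for causal learning and brings both opportunities and challenges for interdisciplinary collaboration.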

Details
English abstract

This paper argues for recognizing an emerging paradigm of causal learning by wisdom of the crowd. Recent developments in government, industry, and research point to the rise of decentralized and crowd-based approaches within causal modeling, where causal knowledge distributed across many contributors can be systematically elicited and integrated with causal learning workflows. In this paradigm, causal learning becomes a distributed decision-making problem: each participant contributes partial and potentially noisy knowledge, while collective contributions help construct a global causal structure. This direction is enabled by advances in crowdsourcing platforms, expert knowledge elicitation, aggregation techniques, and large language model (LLM)-augmented information acquisition. Its promise is increasingly visible in early research and emerging real-world practices. Building on this momentum, we outline a framework for crowd-based causal learning spanning elicitation, modeling, aggregation, and optimization. We further discuss the opportunities and challenges introduced by this paradigm and call for interdisciplinary collaboration across causal learning, collective intelligence, human-AI interaction, and decision science.

2603.00918 2026-05-12 cs.CV cs.AI

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Seungwook Kim, Minsu Cho

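AI summary This paper proposes SOLACE, a post-training framework for improving the quality of text-to-image generation. The model re-noises its own generated images and measures how accurately it recovers the noise, producing an intrinsic self-confidence signal that serves as the reinforcement learning reward without external reward models or human annotation. Experiments show consistent gains in compositional generation, text rendering, and text-image alignment, and SOLACE can be combined with external rewards for complementary improvements.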

Comments 22 pages, accepted to CVPR 2026. Project page https://wookiekim.github.io/SOLACE/

Details
English abstract

Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to improve human preference alignment, factuality, and aesthetics. We introduce SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework that replaces external reward supervision with an internal self-confidence signal: we re-noise the model's own outputs and measure how accurately it recovers the injected noise, treating low reconstruction error as high self-confidence. SOLACE converts this intrinsic signal into scalar rewards for reinforcement learning, requiring no external reward models, annotators, or preference data. By reinforcing high-confidence generations, SOLACE delivers consistent gains in compositional generation, text rendering, and text-image alignment. Integrating SOLACE with external rewards yields complementary improvements while alleviating reward hacking.
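
The re-noise-and-recover signal can be sketched in a few lines. This toy version assumes a standard variance-preserving noising step, x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, and uses the negative mean squared noise-prediction error as the reward; the function name, the noising schedule, and the reward scaling are assumptions, and `predict_noise` stands in for the generative model.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def self_confidence_reward(x0: Vector,
                           predict_noise: Callable[[Vector, float], Vector],
                           alpha_bar: float,
                           rng: random.Random) -> float:
    """Re-noise x0, ask the model to recover the noise, return negative MSE."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    eps_hat = predict_noise(x_t, alpha_bar)
    mse = sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)
    return -mse  # low reconstruction error means high self-confidence
```

A predictor that recovers the injected noise exactly receives a reward of about zero, the maximum; worse recovery gives a more negative reward.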

2603.00166 2026-05-12 cs.CV cs.AI

Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?

Hongyu Li, Kuan Liu, Yuan Chen, Juntao Hu, Huimin Lu, Guanjie Chen, Xue Liu, Guangming Lu, Hong Huang

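AI summary This paper examines the "Paradox of Simplicity" that generative AI exhibits on simple tasks: models excel at generating complex scenes yet struggle with simple tasks such as producing a pure color image. The study introduces the concept of "AI Obedience", builds a hierarchical evaluation framework, and designs Violin, the first systematic benchmark for evaluating a model's ability to transition from probabilistic approximation to pixel-level determinism. Experiments show that closed-source models outperform open-source ones on deterministic tasks and that this performance correlates with natural image generation ability, providing a foundational framework and tools for understanding instruction alignment.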

Details
English abstract

Recent advances in generative AI have shown human-level performance in complex content creation. However, we identify a "Paradox of Simplicity": models that can render complex scenes often fail at trivial, low-entropy tasks, such as generating a uniform pure color image. We argue this is a systemic failure related to uncontrollable emergent abilities. As models scale, strong priors for aesthetics and complexity override deterministic simplicity, creating an "aesthetic bias" that hinders the model's transition from data simulation to true intellectual abstraction. To better investigate this problem, we formalize the concept of AI Obedience, a hierarchical framework that grades a model's ability to transition from probabilistic approximation to pixel-level determinism (Levels 1 to 5). We introduce Violin, the first systematic benchmark designed to evaluate Level 4 Obedience through three deterministic tasks: color purity, image masking, and geometric shape generation. Using Violin, we evaluate several state-of-the-art models and reveal that closed-source models generally outperform open-source ones in deterministic precision. Interestingly, performance on our benchmark correlates with performance on natural image generation benchmarks. Our work provides a foundational framework and tools for achieving better alignment between human instructions and model outputs.

2602.22508 2026-05-12 cs.AI

Metacognitive Behavioral Tuning of Large Language Models for Multi-Hop Question Answering

Ik-hwan Kim, Hyeongrok Han, Mingi Jung, Sangwon Yu, Jinseok Hong, Sang Hun Kim, Yoonyoung Choi, Sungroh Yoon

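AI summary This paper studies why large language models give wrong answers on multi-hop question answering even when a correct intermediate conclusion is already present, attributing the problem to insufficient self-regulation. It proposes Metacognitive Behavioral Tuning (MBT), a post-training framework that injects a five-phase metacognitive structure to strengthen self-regulation during reasoning. Experiments show that MBT achieves the highest accuracy-efficiency score on multiple benchmark datasets while significantly shortening reasoning traces and reducing redundancy, validating the effectiveness of its structural prior.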

Comments 41 pages

Details
English abstract

Large Language Models (LLMs) often produce incorrect answers on multi-hop question answering even when the reasoning trace already contains a correct intermediate conclusion. We attribute this gap to weak self-regulation rather than insufficient reasoning capacity. Without explicit regulation, valid intermediate conclusions are overridden by continued exploration or left unrecognized as logically sufficient. We propose Metacognitive Behavioral Tuning (MBT), a post-training framework that injects a five-phase metacognitive structure into reasoning traces. The five phases are understanding and filtering, planning, execution and monitoring, self-correction, and verification. MBT has two formulations. MBT-S synthesizes new metacognitive traces from scratch, while MBT-R rewrites the student's own traces into a metacognitive form. Across HotpotQA, MuSiQue, and 2WikiMultiHopQA, MBT attains the highest Accuracy-Efficiency Score (AES) across model scales. MBT lifts task accuracy while keeping traces short and stable, with mean response length on MuSiQue an order of magnitude shorter than baseline methods and degeneration counts reduced by a similar margin. A matched-control study further confirms that the gain stems from the five-phase structural prior itself. To qualitatively assess the regulatory behavior of reasoning traces, we introduce two new metrics, the Reach-Redundancy Profile (RRP) and the length-aware Metacognitive Quality Index (MQI). RRP captures when the answer is reached and how much of the trace is redundant, and MQI quantifies how richly the five phases appear. Under both metrics, MBT achieves the earliest answer arrival, the lowest redundancy, and the richest phase-level behavior across model scales.

2602.21581 2026-05-12 cs.CV

MultiAnimate: Pose-Guided Image Animation Made Extensible

Yingcheng Hu, Haowen Gong, Chuanguang Yang, Zhulin An, Yongjun Xu, Songhua Liu

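AI summary This paper proposes MultiAnimate, an extensible multi-character image animation framework that addresses identity confusion and implausible occlusions in pose-guided multi-character video generation. Built on modern Diffusion Transformers (DiTs), the method introduces two key components, an Identifier Assigner and an Identifier Adapter, that capture per-person positional information and inter-character spatial relationships, improving the model's flexibility and generalization. Experiments show state-of-the-art performance on multi-character image animation, surpassing existing diffusion-based models.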

Comments Accepted to CVPR 2026. Project page at https://hyc001.github.io/MultiAnimate/

Details
English abstract

Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components, Identifier Assigner and Identifier Adapter, which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.

2602.17283 2026-05-12 cs.CL cs.AI

Towards Cross-lingual Values Judgment: A Consensus-Pluralism Perspective

Yukun Chen, Xinyu Zhang, Boyi Deng, Jialong Tang, Yu Wan, Fei Huang, Yuxi Zhou, Baosong Yang, Yiming Li

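AI summary As large language models are deployed worldwide, existing evaluations of their multilingual capabilities focus mainly on factual task performance and neglect the ability to judge the deep-level values of content across languages. Starting from two core challenges, cultural diversity and disciplinary complexity, this paper proposes a two-stage human-AI collaborative annotation framework and builds X-Value, the first cross-lingual values judgment benchmark, with 4,750 question-answer pairs in 14 languages and 12 fine-grained annotation metadata items. It systematically evaluates 17 LLMs on cross-lingual values judgment, revealing performance disparities across categories and languages and underscoring the urgent need to improve models' values judgment.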

Details
English abstract

As large language models (LLMs) are employed worldwide, existing evaluation paradigms for their multilingual capabilities primarily focus on factual task performance, neglecting the ability to judge content's deep-level values across multiple languages. To bridge this gap, we first reveal two primary challenges in constructing values judgment benchmarks, cultural diversity and disciplinary complexity, and propose a novel two-stage human-AI collaborative annotation framework to alleviate them. This framework identifies the issue scope and nature, establishes specific annotation criteria, and utilizes multiple LLMs for final review. Building upon this framework, we introduce X-Value, the first Cross-lingual Values Judgment Benchmark designed to evaluate the capability of LLMs in judging deep-level values of content. X-Value comprises 4,750 Question-Answer pairs across 14 languages, covering 7 major global issue categories, and provides 12 granular annotation metadata to facilitate a rigorous evaluation of model performance. Systematic evaluations of X-Value are conducted across 17 LLMs using distinct prompting strategies. Multi-dimensional analysis of accuracy and F1-scores reveals their limitations in cross-lingual values judgment and indicates performance disparities across categories and languages. This work highlights the urgent need to improve the underlying, values-aware content judgment capability of LLMs. Samples of X-Value are available at https://huggingface.co/datasets/Whitolf/X-Value.