arXivDaily: arXiv daily academic digest, updated Monday to Friday
All subject categories: 2083
2605.13087 2026-05-14 cs.CL cs.AI

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Kush Juvekar, Kavya Manohar, Aditya Srinivas Menon, Arghya Bhattacharya, Kumarmanas Nethil

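AI summary: This study addresses fine-tuning multilingual speech recognition models for low-resource languages and proposes the Vividh-ASR benchmark for evaluating Hindi and Malayalam recognition across scenarios of differing complexity. By analyzing learning-rate timing and curriculum ordering, it finds that early large parameter updates and a hard-to-easy curriculum improve model performance, particularly on spontaneous speech. Building on these findings, the authors propose reverse multi-stage fine-tuning (R-MFT), which lets a parameter-efficient 244M Whisper model match or exceed conventionally fine-tuned 769M models.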

Comments Submitted to Interspeech 2026

English abstract

Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.

2605.13086 2026-05-14 cs.RO

Object Manipulation of the Variable Topology Truss system

Andrew Jang-Ho Bae, Myeongjin Choi, Haorui Li, Mark Yim, TaeWon Seo

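AI summary: This paper proposes an object manipulation strategy for the Variable Topology Truss (VTT) system, which consists of actuated truss members connected by passive spherical joints. To enable manipulation, it introduces a hybrid control framework that regulates position and force concurrently without explicit decoupling. Experiments evaluate force tracking on a single member module and on the full VTT system, and object manipulation is demonstrated in two representative configurations, showing consistent and reliable position and force tracking.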

Comments 15 pages, 14 figures

English abstract

This paper presents an object manipulation strategy for the Variable Topology Truss (VTT) system, a truss robot that comprises actuated truss members connected by passive spherical joints. Although truss robots were originally proposed as rapidly deployable manipulators, manipulation strategy has not been studied thoroughly. To enable manipulation, we introduce a hybrid control framework that regulates position and force concurrently without explicit decoupling. At the actuator level, each member employs a sensor-based force feedback controller to generate the desired axial forces despite high actuator friction. At the task level, the forces applied at the end-effector nodes are produced by computing the required member forces using a static model of the VTT. We evaluate force-tracking performance through experiments on both a single member module and the full VTT system. Finally, we demonstrate object manipulation using two representative configurations and quantitatively assess combined position and force tracking performance. Experimental results confirm that the proposed approach enables consistent and reliable object manipulation with the VTT system.
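The task-level step in this abstract, computing member forces from a desired end-effector force with a static model, can be sketched for a single planar node. This is a minimal illustration and not the authors' controller: the two-member 2D node, the function name `member_forces`, and the sign convention (a positive force pushes the node along the member's unit direction) are all assumptions.

```python
def member_forces(u1, u2, force):
    """Solve f1*u1 + f2*u2 = force for the axial forces (f1, f2).

    u1, u2: unit direction vectors of the two members meeting at the node
            (hypothetical planar setup, not the actual VTT geometry).
    force:  desired force at the node, as (fx, fy).
    """
    det = u1[0] * u2[1] - u2[0] * u1[1]
    if abs(det) < 1e-12:
        raise ValueError("members are collinear; the force cannot be resolved")
    # Cramer's rule on the 2x2 static equilibrium system.
    f1 = (force[0] * u2[1] - u2[0] * force[1]) / det
    f2 = (u1[0] * force[1] - force[0] * u1[1]) / det
    return f1, f2


# Two symmetric members sharing a purely vertical load.
f1, f2 = member_forces((0.6, 0.8), (-0.6, 0.8), (0.0, 1.6))
```

In a full truss the same equilibrium condition would be written for every node and solved jointly; the per-member force controller described in the abstract would then track each resulting axial force.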

2605.13083 2026-05-14 cs.RO

TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

Jianyi Zhou, Ziteng Gao, Feiyang Hong, Zirui Liu, Guannan Zhang, Weisheng Dai, Ruichen Zhen, Chuqiao Lyu, Haotian Wu, Yinian Mao, Xushi Wang, Yuxiang Jiang, Wenbo Ding, Shuo Yang

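AI summary: This paper introduces EgoTouch, a large-scale dataset, and TouchAnything, a framework for estimating tactile information during bimanual object manipulation from egocentric video. To address the lack of tactile signals in existing datasets, the authors use wearable tactile sensors to record synchronized multi-view video, bimanual 3D hand pose, and pressure maps, building a dataset of 208 manipulation tasks. On top of this dataset they design a multi-view vision-to-touch prediction framework; experiments show that adding wrist-view information improves tactile prediction.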

English abstract

Egocentric human video data, which captures rich human-environment interactions and can be collected at scale, has become a key driver of embodied intelligence research. However, existing egocentric datasets typically lack tactile sensing, a critical modality that provides direct cues about contact, force, and pressure in human-object interaction. Without such signals, models struggle to learn physically grounded representations of real-world interaction dynamics. While tactile sensors provide these cues, deploying high-quality tactile hardware at scale remains expensive and cumbersome. This raises a central question: can tactile feedback be inferred directly from visual observations, enabling scalable tactile supervision for egocentric video data and supporting physically grounded embodied learning? To enable research in this direction, we introduce EgoTouch, a large-scale multi-view egocentric dataset with dense tactile supervision for bimanual hand-object interaction. EgoTouch comprises 208 manipulation tasks spanning 1,891 episodes in diverse indoor and outdoor environments, with synchronized multi-view RGB (head-mounted egocentric and dual wrist-mounted cameras), bimanual 3D hand pose, and continuous pressure maps from wearable tactile sensors. Building on EgoTouch, we introduce TouchAnything, a baseline multi-view vision-to-touch prediction framework that uses the egocentric view as the primary input and flexibly leverages available wrist-mounted views at inference time. Experiments show that incorporating wrist-mounted views generally improves tactile prediction over egocentric-only input, achieving up to 5.0% relative improvement in Contact IoU and 6.1% relative improvement in Volumetric IoU. We will publicly release the dataset, code, and benchmark.

2605.13080 2026-05-14 cs.CV

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

Junha Song, Byeongho Heo, Geonmo Gu, Jaegul Choo, Dongyoon Han, Sangdoo Yun

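AI summary: This paper studies how multimodal large language models can attend more efficiently to key image regions in visual description tasks. The authors propose Gaze Attention, a new attention mechanism that groups visual embeddings into compact gaze regions and dynamically selects task-relevant regions for attention computation, reducing redundant computation and sharpening focus. To preserve global context, they also introduce learnable context tokens. Experiments show that the method performs well on image and video understanding tasks while substantially reducing the number of visual key-value entries used.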

English abstract

When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we introduce Gaze Attention, a novel mechanism that enables MLLMs to selectively attend to task-relevant visual regions during generation. Specifically, we spatially group visual embeddings, stored as key-value caches, into compact gaze regions, each represented by a lightweight descriptor. At each decoding step, the model dynamically selects the most relevant regions and restricts attention to them, reducing redundant computation while enhancing focus. To mitigate the loss of global context caused by localized attention, we further propose learnable context tokens appended to each image or frame, allowing the model to maintain holistic visual awareness. Extensive experiments on image and video understanding benchmarks demonstrate that Gaze Attention matches or surpasses dense-attention baselines, while using up to 90% fewer visual KV entries in the attention computation.
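The select-then-attend idea in this abstract can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: regions are contiguous 1-D chunks rather than spatial groups, the region descriptor is taken to be the mean key (the abstract only says "lightweight descriptor"), and `gaze_attention` is a hypothetical name.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def gaze_attention(query, keys, values, region_size, top_k):
    # Group KV entries into contiguous regions (stand-in for spatial grouping).
    regions = [list(range(i, min(i + region_size, len(keys))))
               for i in range(0, len(keys), region_size)]
    # Assumed descriptor: the mean key of each region.
    scores = [dot(query, mean_vec([keys[i] for i in r])) for r in regions]
    chosen = sorted(range(len(regions)), key=lambda r: scores[r], reverse=True)[:top_k]
    idx = sorted(i for r in chosen for i in regions[r])
    # Softmax attention restricted to the selected KV entries.
    logits = [dot(query, keys[i]) for i in idx]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(len(values[0]))]
    return out, idx


keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
out, idx = gaze_attention([0.0, 1.0], keys, values, region_size=2, top_k=1)
```

Here only the second region is attended to, so half of the KV entries are skipped; the learnable context tokens mentioned in the abstract are not modeled.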

2605.13079 2026-05-14 cs.LG cs.AI

Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

Tien-Phat Nguyen, Truong Nguyen, Minh-Phuc Truong, Tuc Nguyen, James Bailey, Trung Le

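AI summary: This paper studies why the Muon optimizer succeeds, showing that its core mechanism is spectral flattening, achieved by orthogonalizing the momentum buffer, which raises learning-rate tolerance and speeds convergence. The authors prove that Muon's maximal stable step size scales with the average singular value of the gradient rather than the largest, which bottlenecks standard gradient descent. Viewing Muon as a preconditioned gradient method, they further show that its gain in convergence efficiency is controlled by the spectrum of the gradient covariance. Experiments show that Muon stays stable at larger learning rates and reaches accuracy targets sooner than SGD.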

English abstract

Muon orthogonalizes the momentum buffer before each update, replacing its singular values with ones via Newton-Schulz iterations. This simple change lets Muon tolerate far larger learning rates and converge faster than other optimizers, but why? We show that the mechanism is spectral flattening, and develop two results around it. First, we prove that Muon's maximal stable step size scales with the average singular value of the gradient rather than the largest, which bottlenecks standard gradient descent. Second, we recast Muon as a preconditioned gradient method and show, under a Kronecker-factored curvature model, that it improves the effective convergence factor, with the improvement controlled by the spectrum of the gradient covariance. Extensive experiments validate both results: Muon remains stable at learning rates that cause SGD to diverge within the first few iterations, and reaches accuracy milestones several epochs earlier even at identical step sizes. Taken together, our results offer a principled, geometric explanation for Muon's empirical success.
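The orthogonalization step named in this abstract can be illustrated with a small pure-Python sketch. It uses the classic cubic Newton-Schulz iteration, which is an assumption made here for simplicity; Muon implementations may use a different polynomial and run on tensors rather than nested lists. The helper names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=30):
    """Drive every singular value of g toward 1 (spectral flattening)."""
    # Frobenius normalization puts all singular values in (0, 1].
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        # Cubic update X <- 1.5 X - 0.5 X X^T X; each singular value s maps to 1.5 s - 0.5 s^3.
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, cubic)]
    return x


momentum = [[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]]
flat = newton_schulz(momentum)
gram = matmul(transpose(flat), flat)  # close to the 2x2 identity
```

Because the result has orthonormal columns, the update has the same size along every singular direction, which is the flattening the abstract credits for the larger stable step size.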

2605.13076 2026-05-14 cs.CL cs.FL cs.SE

TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints

Yoshio Kato, Shuhei Tarashima

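AI summary: TruncProof is a new grammar-constrained generation method for producing syntactically valid JSON under a token-length limit. It uses properties of LL(1) parsers to efficiently estimate, during decoding, the minimum number of tokens needed to complete a valid JSON output, so that the result is both grammatical and within the preset length limit. Experiments show that TruncProof generates syntactically correct JSON even under strict token constraints, and that combining it with advanced decoding strategies also yields semantically accurate output.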

Comments Main paper (8 pages). Accepted at the International Joint Conference on Neural Networks (IJCNN 2026)

English abstract

The LLM-based generation of machine-readable outputs such as JSON has attracted significant attention for integration with external systems. However, existing approaches cannot strictly enforce the maximum number of tokens to be generated, leading to infinite generation or truncated outputs that cause a system malfunction. To address this limitation, we propose TruncProof, a novel grammar-constrained generation method that enables LLMs to produce grammatically valid JSONs while adhering to a predefined token limit. By leveraging the properties of LL(1) parsers, TruncProof efficiently approximates the minimum number of tokens required to complete a grammatically valid output at each decoding step. Experiments on the Text-to-JSON instruction tasks demonstrate that TruncProof successfully generates syntactically correct outputs even under strict token constraints. Furthermore, we show that TruncProof can be effectively combined with advanced decoding strategies, resulting in outputs that are not only grammatically valid but also semantically accurate.
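The budget check at the heart of this idea can be shown on a toy grammar. The sketch below is not TruncProof: it handles only nested brackets, assumes every symbol costs exactly one token (real tokenizers do not behave this way), and uses a plain stack instead of LL(1) parser state. The names `min_completion` and `allowed` are hypothetical.

```python
CLOSERS = {"{": "}", "[": "]"}


def min_completion(stack):
    """Fewest symbols needed to finish: one closer per open container."""
    return len(stack)


def allowed(symbol, stack, remaining):
    """Permit `symbol` only if the output can still be closed within budget.

    stack:     currently open containers, innermost last.
    remaining: tokens left, including the one `symbol` would consume.
    """
    new_stack = list(stack)
    if symbol in CLOSERS:
        new_stack.append(symbol)
    elif symbol in CLOSERS.values():
        if not new_stack or CLOSERS[new_stack[-1]] != symbol:
            return False  # mismatched closer is never grammatical
        new_stack.pop()
    return min_completion(new_stack) <= remaining - 1


can_nest = allowed("[", ["{"], remaining=3)     # room left for "]" and "}"
cannot_nest = allowed("[", ["{"], remaining=2)  # would overrun the budget
```

A decoder would apply such a check as a mask over candidate tokens at every step, so that generation can always terminate in a valid state before the limit is reached.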

2605.13068 2026-05-14 cs.LG

Local Inverse Geometry Can Be Amortized

Aaditya L. Kachhadiya

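AI summary: This paper studies learning local inverse geometry for nonlinear inverse problems, proposing a framework that replaces traditional curvature-aware optimization with a learned, reusable reverse operator. The core method builds a bidirectional surrogate, Deceptron, and pairs it with the iterative solver D-IPG, using a Jacobian Composition Penalty (JCP) to train the reverse Jacobian to approximate a local left inverse of the forward Jacobian. Experiments on several PDE inverse-problem benchmarks show that the method outperforms standard baselines, with lower solve cost and comparable or better recovery quality.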

Comments Preprint. 21 pages, 8 figures, 8 tables. Code available at https://github.com/AadityaKachhadiya/deceptron

English abstract

Nonlinear inverse problems often trade inexpensive but fragile first-order updates against curvature-aware methods such as Gauss-Newton and Levenberg-Marquardt, which obtain stronger directions by repeatedly solving Jacobian-based linearized systems. We propose a learned alternative: amortize local inverse geometry into a reusable reverse operator. Our framework learns a bidirectional surrogate, Deceptron, and deploys it through D-IPG (Deceptron Inverse-Preconditioned Gradient), an iterative solver that pulls residual-corrected measurement-space proposals back to latent space. The key mechanism is a Jacobian Composition Penalty (JCP), which trains the reverse Jacobian to act as a local left inverse of the forward Jacobian; its runtime counterpart, RJCP, measures the same inverse-consistency error along optimization trajectories. We prove that D-IPG is first-order equivalent to damped Gauss-Newton under local pseudoinverse consistency, with deviation controlled by composition error and conditioning. Across seven PDE inverse-problem benchmarks, D-IPG outperforms standard baselines, achieves 94.8% mean success across the six-problem reliability suite, and reaches comparable or better recovery quality at up to 77x lower inference-time solve cost on the main benchmarks.
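The left-inverse condition behind the Jacobian Composition Penalty can be made concrete with a small sketch. The abstract does not give the exact loss, so the squared Frobenius norm of `J_rev @ J_fwd - I` used below is an assumption, as are the function names; the matrices are hand-picked toy Jacobians rather than outputs of a trained model.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def composition_penalty(j_rev, j_fwd):
    """Squared Frobenius distance between J_rev @ J_fwd and the identity (assumed form)."""
    prod = matmul(j_rev, j_fwd)
    return sum((prod[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(len(prod)) for j in range(len(prod[0])))


# Forward map from a 2-D latent to a 3-D measurement space.
j_fwd = [[2.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
good_rev = [[0.5, 0.0, 0.0], [0.0, 0.25, 0.0]]  # exact left inverse
bad_rev = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # ignores the forward scaling

good = composition_penalty(good_rev, j_fwd)
bad = composition_penalty(bad_rev, j_fwd)
```

A penalty of zero means the reverse map undoes the forward map to first order, which is the condition under which the abstract states D-IPG matches damped Gauss-Newton.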

2605.13067 2026-05-14 cs.RO cs.AI

When Absolute State Fails: Evaluating Proprioceptive Encodings for Robust Manipulation

Maxime Alvarez, Ryo Watanabe, Paul Crook, Afshin Zeinaddini Meymand, Suvin Kurian, Pablo Ferreiro, Genki Sano

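AI summary: As end-to-end robot policies are increasingly deployed on real tasks, the gap between training and inference conditions becomes a major challenge. This paper studies how the encoding of the robot's proprioceptive state can improve in-distribution and out-of-distribution performance, especially robustness to unseen test conditions. It finds that a simple episode-wise relative frame outperforms the baselines in real-robot experiments, suggesting a practical path to using data collected under varying frames of reference to improve generalization.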

Comments Accepted to ICRA 2026 Workshop: From Data to Decisions

English abstract

As end-to-end robotic policies are progressively deployed in the real world to solve real tasks, they face a gap between the training and inference conditions. Scaling the amount and diversity of the training data has shown some success in improving zero-shot generalization, yet robots still fail when faced with new, unseen test conditions. For instance, while robots with fixed frames of reference are common, those with moving frames pose a greater challenge for deployment. To address this specific instance of the issue, we present a study of strategies for encoding the robot's proprioceptive state to improve both in- and out-of-distribution performance at test time. Through a systematic study of joint representations, we find that a simple episode-wise relative frame provides the best trade-off between task performance and robustness, outperforming the baselines in extensive real-robot experiments conducted in a realistic test environment. The results suggest a practical path to leveraging data collected by robots with varying frames of reference and deployment to unseen test configurations.
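One plausible reading of an episode-wise relative frame is to express each proprioceptive state relative to the first state of its episode. The sketch below assumes exactly that, with plain subtraction on position-like values; orientations would need proper frame composition, and the paper's actual encoding may differ. `to_episode_relative` is a hypothetical name.

```python
def to_episode_relative(states):
    """Express every state relative to the first state of its episode."""
    origin = states[0]
    return [[s - o for s, o in zip(state, origin)] for state in states]


# The same motion recorded from two different base placements.
episode_a = [[10.0, 2.0], [10.5, 2.0], [11.0, 2.5]]
episode_b = [[-3.0, 7.0], [-2.5, 7.0], [-2.0, 7.5]]

relative_a = to_episode_relative(episode_a)
relative_b = to_episode_relative(episode_b)
```

Both episodes map to the same relative sequence, which is why such an encoding can remain usable when the robot's base frame moves between training and deployment.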

2605.13063 2026-05-14 cs.LG

Ergodic Trajectory Design by Learned Pushforward Maps: Provable Coverage via Conditional Flow Matching

Ehsan Aghazadeh, Masoud Malekzadeh, Ahmad Ghasemi, Hossein Pishro-Nik

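AI summary: This paper studies how to design continuous trajectories whose time-averaged occupancy provably matches a prescribed spatial density, the "ergodic coverage" problem, which matters for UAV data collection, robotic exploration, and mobile monitoring. The authors propose a framework called epushforward that decouples ergodicity from density matching: optimal-transport conditional flow matching learns an offline map that transports a uniformly ergodic trajectory on a simple annular domain onto the target density. Once trained, the map supports an unbounded number of trajectories and multi-agent systems, naturally accommodates many differentiable operational constraints, and comes with theoretical coverage guarantees.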

English abstract

Designing continuous trajectories whose time-averaged occupancy provably matches a prescribed spatial density (the ergodic coverage problem) is central to UAV-assisted data collection and sensing, robotic exploration, and mobile monitoring. For flying agents in particular, this challenge is acute: trajectories must balance coverage fidelity against tight energy budgets, no-fly zones, and acceleration limits. Existing methods either re-optimize each trajectory online (with cost growing in the horizon and re-running for every target, agent, and realization) or rely on bespoke analytical constructions that must be re-derived for each new constraint. We propose a epushforward framework that decouples ergodicity from density matching: an analytic latent trajectory provides exact uniform ergodicity on a simple annular domain, and a single map, learned offline by optimal-transport conditional flow matching, transports this latent occupancy onto the prescribed target density. The composed trajectory is then asymptotically ergodic with respect to the learned pushforward distribution, with deviation from the target controlled by the flow-matching training loss. Once trained for a given target density and constraint set, the map serves an unbounded number of trajectories and a multi-agent fleet without per-agent retraining, and many differentiable operational constraints (no-fly zones, acceleration ceilings, or fairness penalties) enter as additive soft penalties in the training loss without re-deriving the design. We prove three results (an acceleration-energy bound, an $O(1/\sqrt{K})$ ergodic convergence rate in the number of trajectory cycles $K$, and an approximation-error bound) that combine into an end-to-end coverage bound estimable from CFM training diagnostics (certified given an architectural Lipschitz bound on $v_\theta$).
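The decoupling described in this abstract, a uniformly ergodic latent trajectory composed with a density-matching map, can be illustrated in one dimension. Everything below is a stand-in chosen for this sketch: a golden-ratio rotation on [0, 1) replaces the annular latent trajectory, and an analytic inverse-CDF map replaces the map learned by conditional flow matching.

```python
import math


def latent_point(k):
    """k-th point of a golden-ratio rotation on [0, 1): equidistributed over time."""
    golden = (math.sqrt(5.0) - 1.0) / 2.0
    return (k * golden) % 1.0


def pushforward(u):
    """Analytic map sending the uniform density on [0, 1] to the density 2x."""
    return math.sqrt(u)


K = 10000
xs = [pushforward(latent_point(k)) for k in range(1, K + 1)]
time_average = sum(xs) / K                        # approaches E[x] = 2/3 under density 2x
share_upper_half = sum(x > 0.5 for x in xs) / K   # approaches 1 - 0.25 = 0.75
```

The time averages of the mapped trajectory approach the target density's statistics even though the latent trajectory knows nothing about the target, which is the separation of concerns the framework relies on.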

2605.13062 2026-05-14 cs.CV

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Xuehai Bai, Yang Shi, Yi-Fan Zhang, Xuanyu Zhu, Yuran Wang, Yifan Dai, Xinyu Liu, Yiyan Ji, Xiaoling Gu, Yuanxing Zhang

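AI summary: Recent image editing models have made notable progress in instruction following, multimodal understanding, and complex visual editing, but existing benchmarks often fail to reflect human judgment, especially for frontier models, because of limited task difficulty and coarse-grained evaluation. To address this, the paper proposes Edit-Compass and EditReward-Compass, a unified evaluation benchmark for image editing and reward models. Edit-Compass contains 2,388 carefully annotated instances covering six progressively harder task categories and uses a fine-grained multidimensional evaluation framework; EditReward-Compass contains 2,251 preference pairs that simulate reward modeling during practical RL optimization.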

English abstract

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.

2605.13059 2026-05-14 cs.CV

BrainAnytime: Anatomy-Aware Cross-Modal Pretraining for Brain Image Analysis with Arbitrary Modality Availability

Guangqian Yang, Tong Ding, Wenlong Hou, Yue Xun, Ye Du, Qian Niu, Shujun Wang

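AI summary: This paper proposes BrainAnytime, a unified pretraining framework for brain image analysis under arbitrary modality availability. Within a shared 3D masked autoencoder, the method learns structural-molecular correspondences between MRI and PET through cross-modal distillation and focuses on disease-vulnerable anatomy through atlas-guided curriculum masking. Experiments show that BrainAnytime outperforms existing models in most clinical modality settings, notably improving average accuracy on Alzheimer's disease classification tasks.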

Comments Early accepted by MICCAI 2026

English abstract

Clinical diagnostic workups typically follow a modality escalation pathway: after initial clinical evaluation, clinicians begin with routine structural imaging (e.g., MRI), selectively add sequences such as FLAIR or T2 to refine the differential, and reserve molecular imaging (e.g., amyloid-PET) for cases that remain uncertain after standard evaluation. Consequently, patients are observed with heterogeneous and often incomplete modality subsets. However, most current AI models assume fixed data modalities as the model inputs. In this paper, we present BrainAnytime, a unified pretraining framework pretrained on 34,899 3D brain scans from five datasets that support brain image analysis under arbitrary modality availability spanning multi-sequence MRI and amyloid-PET. A single model accepts whatever imaging is available, from a lone T1 scan to a full multimodal workup. Pretraining learns structural-molecular correspondences between MRI and PET via cross-modal distillation (RCMD) and prioritizes disease-vulnerable anatomy via atlas-guided curriculum masking (PACM), all within a shared 3D masked autoencoder (Multi-MAE3D). Across four downstream tasks and five clinically motivated modality settings, BrainAnytime largely outperforms modality-specific models, missing-modality baselines, and large-scale brain MRI pretrained foundation models on most modality settings. Notably, it surpasses the strongest missing-modality baselines with relative improvements of 6.2% and 7.0% in average accuracy on CN vs. AD and CN vs. MCI classification, respectively. Code is available at https://github.com/SDH-Lab/BrainAnytime.

2605.13058 2026-05-14 cs.RO

MUJICA: Multi-skill Unified Joint Integration of Control Architecture for Wheeled-Legged Robots

Yuqi Li, Peng Zhai, Yueqi Zhang, Xiaoyi Wei, Quancheng Qian, Zhengxu He, Qianxiang Yu, Lihua Zhang

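AI summary: This paper proposes MUJICA, a unified control architecture for wheeled-legged robots that addresses the coordination of wheeled driving and legged control on complex terrain. A single policy integrates multiple low-level skills, such as omnidirectional movement, high-platform climbing, and fall recovery, trained jointly with accurate DC-motor constraint modeling, and a proprioception-based high-level skill selector provides adaptive responses to the environment. Experiments show that MUJICA significantly improves adaptability and task success for wheeled-legged robots in unstructured environments.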

English abstract

Wheeled-legged robots hold promise for traversing complex terrains and offer superior mobility compared to legged robots. However, wheeled-legged robots must effectively balance both wheeled driving and legged control. Furthermore, due to noisy proprioceptive sensing and real-world motor constraints, realizing robust and adaptive locomotion at peak performance of motors remains challenging. We propose the Multi-skill Unified Joint Integration of Control Architecture (MUJICA), a unified, fully proprioceptive control framework for wheeled-legged robots that integrates diverse low-level skills (including omnidirectional moving, high platform climbing, and fall recovery) within a single policy. All skills, distinguished by unique indicator variables, are trained jointly with accurate DC-motor constraint modeling. Additionally, a high-level skill selector is learned to dynamically choose the optimal skill based solely on proprioception, enabling adaptive responses to the surrounding environment. Therefore, MUJICA enhances sim-to-real robustness and enables seamless transitions across diverse locomotion modes, facilitating autonomous adjustment to the environment. We validate our framework in both simulation and real-world experiments on the Unitree Go2-W robot, demonstrating significant improvements in adaptability and task success in unstructured environments.

2605.13055 2026-05-14 cs.CL cs.CY

The Cost of Perfect English: Pragmatic Flattening and the Erasure of Authorial Voice in L2 Writing Supported by GenAI

Ao Liu, Shanhua Zhu

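AI summary: This study examines "pragmatic flattening", the systematic erasure of culturally specific politeness and authorial stance, that generative AI (GenAI) may cause when assisting non-native writers such as Chinese B2-level university students. Comparing argumentative essays before and after GenAI polishing, it finds that although the models perform well at the grammatical and semantic levels, they diverge markedly on pragmatic dimensions such as dialogic engagement and epistemic stance, replacing the author's distinctive voice with homogenized English. The study argues for Critical AI Literacy education to help multilingual writers use GenAI to improve their language while preserving pragmatic diversity and rhetorical agency.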

Comments 16 pages, 2 figures

English abstract

The integration of Generative AI (GenAI) into language learning offers second language (L2) writers powerful tools for text optimization. However, pursuing native-like fluency often sacrifices sociopragmatic diversity. Investigating "pragmatic flattening" - the systematic erasure of culturally preferred politeness and authorial stance - this study conducts a comparative analysis of argumentative essays by Chinese B2-level university students from the ICNALE corpus. The original texts were polished via the APIs of four leading Large Language Models at a zero-temperature setting for reproducibility. Findings reveal a nuanced "dimensional divergence" within the Semantic Preservation Paradox. While models corrected lexicogrammatical errors and retained propositional meaning, sociopragmatic interventions were bifurcated. In the interactive dimension, all models showed a drastic collapse of dialogic engagement markers, turning negotiated discourse into monologic assertions. Conversely, in the epistemic stance dimension, models showed architecture-based variability: some aggressively scrubbed epistemic markers, while others reinforced tentative hedging as decontextualized algorithmic caution. This confirms that while GenAI enhances accuracy, it systematically overwrites L2 writers' unique rhetorical identities into a homogenized Anglo-American paradigm. We argue that future instruction must move beyond error correction, advocating for Critical AI Literacy to empower multilingual writers to use GenAI for linguistic enhancement while safeguarding sociopragmatic diversity and rhetorical agency.

2605.13054 2026-05-14 cs.LG cs.AI

Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

Minung Kim, Jeongmo Kim, Gwanwoo Choi, Seungyul Han

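AI summary: This paper studies cross-domain offline reinforcement learning, adapting a policy from a source domain to a target domain using only pre-collected data, particularly when target-domain data is extremely limited. To handle the distribution mismatch between domains, the authors propose Target-aligned Coverage Expansion (TCE), a framework that uses theoretical analysis to decide how source data should be used: by directly incorporating target-near transitions or by expanding state coverage through target-aligned generation. Experiments show that TCE consistently outperforms existing cross-domain offline RL methods across diverse cross-domain environments.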

English abstract

Cross-domain offline reinforcement learning aims to adapt a policy from a source domain to a target domain using only pre-collected datasets, where environment dynamics may differ. A key challenge is to leverage source data while reducing distributional mismatch, particularly when the target dataset is extremely limited. To address this, we propose Target-aligned Coverage Expansion (TCE), a framework that decides how source data should be used, either by directly incorporating target-near transitions or by expanding state coverage through target-aligned generation, guided by theoretical analysis. TCE builds on a dual score-based generative model to synthesize target-consistent transitions over an expanded state region. Extensive experiments across diverse cross-domain environments show that TCE consistently outperforms state-of-the-art cross-domain offline RL baselines.

2605.13049 2026-05-14 cs.CV

Uncertainty-aware Spatial-Frequency Registration and Fusion for Infrared and Visible Images

Xingyuan Li, Haoyuan Xu, Xingyue Zhu, Jun Ma, Yang Zou, Zhiying Jiang, Jinyuan Liu

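AI summary: Infrared and visible image fusion (IVIF) is useful in challenging environments, but fusion under unregistered conditions suffers from inherent misalignment. Existing methods mostly predict deformation parameters coarse-to-fine or estimate multi-scale deformation fields, overlooking cumulative registration errors that degrade fusion quality. This paper proposes SFRF, a framework for registration and fusion across the spatial and frequency domains that incorporates uncertainty estimation and infrared thermal radiation distribution consistency to handle error accumulation in a unified way, improving robustness across both domains. With multi-scale iterative registration and a dual-branch spatial-frequency fusion module, it aims for more accurate alignment and higher-quality image reconstruction.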

Comments 10 pages, 5 figures, 4 tables

English abstract

Infrared and Visible Image Fusion (IVIF) has shown promise in visual tasks under challenging environments, but fusion under unregistered conditions faces inherent misalignments. Current studies to solve them either predict the deformation parameters coarse-to-fine (i.e., coarse registration and fine registration) or estimate the deformation fields in multi-scales for registration. Though straightforward, they overlook the cumulative errors in registration, which contaminate the fusion stage and severely deteriorate the resulting images. We introduce the Spatial-Frequency Registration and Fusion (SFRF) framework, which incorporates uncertainty estimation and infrared thermal radiation distribution consistency into a unified pipeline to handle the error accumulation for robust registration and fusion across both spatial and frequency domains. Specifically, SFRF constructs a Multi-scale Iterative Registration (MIR) framework that iteratively refines the deformation field across scales, leveraging uncertainty estimation at each stage to mitigate error accumulation and enhance alignment accuracy dynamically. To ensure the accurate alignment of infrared thermal distributions during registration, thermal radiation distribution consistency is employed as a frequency-domain supervisory signal, promoting global consistency in the frequency domain. Based on the spatial-frequency alignment, SFRF further adopts a Dual-branch Spatial-Frequency Fusion (DSFF) module, which incorporates spatial geometric features and frequency distribution information to reconstruct visually appealing images. SFRF achieves impressive performance across diverse datasets.

2605.13047 2026-05-14 cs.CV cs.AI

Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

Ziqi Wen, Parsa Madinei, Miguel P. Eckstein

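AI summary: This study examines how vision-language models (VLMs) differ from human perception in high-level semantic scene understanding. The authors propose Counterfactual Semantic Saliency (CSS), a black-box, model-agnostic method that quantifies an object's importance by measuring the semantic shift caused by removing it from the scene. Results show that, relative to humans, VLMs over-rely on large objects, objects at the image center, and high-saliency objects, while relying less on people in the scene, revealing a clear gap between model and human semantic understanding.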

English abstract

Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.
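The ablate-and-compare idea in this abstract can be sketched as follows. The abstract does not specify how semantic shift is measured, so this example assumes cosine distance between embeddings of the scene descriptions produced before and after an object is removed; the embedding vectors and function names are hypothetical placeholders.

```python
import math


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def semantic_shift(original_emb, ablated_emb):
    """Assumed shift measure: cosine distance between description embeddings."""
    return 1.0 - cosine(original_emb, ablated_emb)


def rank_objects(original_emb, ablated_embs):
    """Order objects by how much removing each one changes the description."""
    shifts = {name: semantic_shift(original_emb, emb)
              for name, emb in ablated_embs.items()}
    return sorted(shifts, key=shifts.get, reverse=True)


# Placeholder embeddings of the description with each object removed.
original = [1.0, 0.0, 0.0]
ablated = {"person": [0.0, 1.0, 0.0], "cup": [1.0, 0.1, 0.0]}
ranking = rank_objects(original, ablated)
```

Running the same ranking on model descriptions and on human descriptions yields two saliency orderings whose disagreement is the kind of gap the paper reports.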

2605.13046 2026-05-14 cs.AI

An Agentic LLM-Based Framework for Population-Scale Mental Health Screening

Giuliano Lorenzoni, Paulo Alencar, Donald Cowan

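AI summary: This paper proposes an agentic large language model (LLM) framework for population-scale mental health screening. Each processing stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation, enabling processing of unstructured clinical information and patient-specific adaptation. A proof of concept in transcript-based depression detection shows that the framework converges to stable configurations, controls evaluation cost, and avoids regressions, pointing toward trustworthy, reproducible, and adaptable screening over large clinical datasets.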

Comments 8 pages, conference paper presented at IEEE BigData 2025, Macau

English abstract

Mental health disorders affect millions worldwide, and healthcare systems are increasingly overwhelmed by the volume of clinical data generated from electronic records, telemedicine platforms, and population-level screening programs. At the same time, the emergence of novel AI-based approaches in healthcare calls for intelligent frameworks capable of processing domain-specific unstructured clinical information while adapting to patient-specific needs. This paper proposes an agentic framework for building robust LLM-based pipelines, where each stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation. Stages are incrementally locked once validated, ensuring that later adaptations cannot overwrite configurations without demonstrated improvement. The proposed framework evolves from feature-level exploration, through proxy-based tuning and freeze/rollback mechanisms, to full orchestration by an Orchestrator Agent that coordinates preprocessing, retrieval, selection, diversity, threshold optimization, and decoding. A proof-of-concept in transcript-based depression detection demonstrates that the framework converges to stable configurations, such as cosine similarity, dynamic Top-k, and threshold 0.75, while controlling evaluation costs and avoiding regressions. These results highlight the potential of agentic AI to enable population-level mental health screening over large clinical datasets, addressing critical challenges in trustworthiness, reproducibility, and adaptability required in healthcare environments.
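The freeze/rollback rule described in this abstract, where a validated stage cannot be overwritten without demonstrated improvement, can be sketched without any agent framework. The class below is a hypothetical illustration: `LockedStage`, its fields, and the use of a single scalar proxy score are assumptions, and no LangChain API is involved.

```python
class LockedStage:
    """Keeps the best validated configuration for one pipeline stage."""

    def __init__(self, name):
        self.name = name
        self.config = None
        self.score = float("-inf")

    def propose(self, config, score):
        """Accept a new configuration only if its proxy score improves."""
        if score > self.score:
            self.config, self.score = dict(config), score
            return True
        # Rollback: the proposal is discarded and the locked config stays.
        return False


stage = LockedStage("threshold_optimization")
first = stage.propose({"similarity": "cosine", "threshold": 0.75}, score=0.80)
second = stage.propose({"similarity": "cosine", "threshold": 0.50}, score=0.70)
```

After both proposals the stage still holds the threshold of 0.75, since the second candidate scored lower on the proxy metric; the score values here are made up for the example.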

2605.13045 2026-05-14 cs.LG cs.CL

Large Language Models Lack Temporal Awareness of Medical Knowledge

Zihan Guan, Qiao Jin, Guangzhi Xiong, Fangyuan Chen, Mengxuan Hu, Qingyu Chen, Yifan Peng, Zhiyong Lu, Anil Vullikanti

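AI summary: Existing methods for evaluating the medical knowledge of large language models (LLMs) mostly rely on static, exam-style benchmarks that do not reflect how medical knowledge changes over time. The authors build TempoMed-Bench, the first benchmark for evaluating LLM temporal awareness in the medical domain, and show that LLMs fall short on time-specific medical knowledge: performance declines gradually over time, outdated historical knowledge is recalled less accurately, and predictions are temporally inconsistent. The work identifies a key challenge in temporal awareness of medical knowledge and motivates future research.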

Comments 35 pages, 18 figures

English abstract

The existing methods for evaluating the medical knowledge of Large Language Models (LLMs) are largely based on atemporal examination-style benchmarks, while in reality, medical knowledge is inherently dynamic and continuously evolves as new evidence emerges and treatments are approved. Consequently, evaluating medical knowledge without a temporal context may provide an incomplete assessment of whether LLMs can accurately reason about time-specific medical knowledge. Moreover, most medical data are historical, requiring the models not only to recall the correct knowledge, but also to know when that knowledge is correct. To bridge the gap, we built TempoMed-Bench, the first-of-its-kind benchmark for evaluating the temporal awareness of the LLMs in the medical domain through evolving guideline knowledge. Based on the TempoMed-Bench, our evaluation analysis first reveals that LLMs lack temporal awareness in medical knowledge through the key findings: (1) model performance on up-to-date medical knowledge exhibits a gradual linear decline over time rather than a sharp knowledge-cutoff behavior, suggesting that parametric medical knowledge is not strictly bounded by knowledge cutoffs; (2) LLMs consistently struggle more with recalling outdated historical medical knowledge than with up-to-date recommendations: accuracy of historical knowledge is only 25.37%-53.89% of up-to-date knowledge, indicating potential knowledge forgetting effects during training; and (3) LLMs often exhibit temporally inconsistent behaviors, where predictions fluctuate irregularly across neighboring years. We also show that the temporal awareness problem is a challenge that cannot be easily solved when integrated with agentic search tools (-3.15%-14.14%). This work highlights an important yet underexplored challenge and motivates future research on developing LLMs that can better encode time-specific medical knowledge.

2605.13043 2026-05-14 cs.CL

Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models

Yejin Lee, Yo-Sub Han

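AI summary Diffusion language models (DLMs) generate text through iterative denoising and bidirectional refinement, but harmful content produced at intermediate denoising steps can propagate through later steps and make the final output unsafe. To address this, the paper proposes an inference-time defense framework based on step-wise intervention during denoising: a contrastive safety direction (SGD) is used to detect harmful semantics, after which the affected tokens are remasked and denoising resumes with adaptive steering, improving safety without sacrificing generation quality. Experiments show that the method substantially lowers the jailbreak success rate while keeping generation quality close to that of the original model.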

Comments 17 pages, 3 figures

Details
English abstract

Diffusion Language Models (DLMs) provide a promising alternative to autoregressive language models by generating text through iterative denoising and bidirectional refinement. However, this iterative generation paradigm also introduces unique safety vulnerabilities when harmful tokens generated at intermediate denoising steps propagate through subsequent refinement processes and eventually induce unsafe outputs. While there have been a few attempts to remedy this issue, they either fail to generate safe outputs or generate safe yet low-quality outputs. This motivates us to propose an inference-time defense framework based on step-wise intervention during the denoising process, which improves safety without compromising output quality. The key component of our framework is a contrastive safety direction (SGD), a latent direction that captures the semantic boundary between harmful and safe generations. We leverage SGD to assess the alignment of generated tokens with harmful semantics at each denoising step. When harmful alignment is detected, our method remasks the corresponding tokens and resumes the denoising process with adaptive steering, where the steering strength is modulated according to the estimated degree of harmfulness. As a plug-and-play module, our method circumvents the need for additional fine-tuning and can be directly incorporated into off-the-shelf diffusion models. The experimental results show that our approach reduces jailbreak success rates to 0.64% while preserving generation quality close to the original model performance. This confirms the effectiveness of step-wise intervention for safe diffusion language model generation. Our code is available at https://github.com/leeyejin1231/DLM_Steering_Remasking.
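The detect, remask, and steer loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the cosine-similarity score, the `threshold` and `max_strength` parameters, the linear strength rule, and the function names are all hypothetical, and hidden states are assumed to be non-zero vectors.

```python
import numpy as np

def harmful_alignment(hidden, direction):
    """Cosine similarity between each token's hidden state and a safety direction."""
    d = direction / np.linalg.norm(direction)
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    return h @ d

def remask_and_steer(tokens, hidden, direction, mask_id,
                     threshold=0.5, max_strength=2.0):
    """Remask tokens aligned with the harmful direction and steer them away.

    Returns the new token ids, the steered hidden states, and the scores.
    """
    scores = harmful_alignment(hidden, direction)
    flagged = scores > threshold
    # Flagged positions go back to the mask token so they can be re-denoised.
    new_tokens = np.where(flagged, mask_id, tokens)
    # Hypothetical rule: steering strength grows linearly with the score.
    strength = max_strength * np.clip(scores, 0.0, 1.0) * flagged
    d = direction / np.linalg.norm(direction)
    steered = hidden - strength[:, None] * d
    return new_tokens, steered, scores
```

In this sketch, unflagged positions are left untouched, which mirrors the abstract's point that intervention is applied only where harmful alignment is detected.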

2605.13041 2026-05-14 cs.CV

EgoForce: Robust Online Egocentric Motion Reconstruction via Diffusion Forcing

Inwoo Hwang, Donggeun Lim, Hojun Jang, Young Min Kim

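AI summary EgoForce is a framework for online reconstruction of long-term full-body motion from noisy egocentric input. It adopts a diffusion-based model with a temporally asymmetric noise schedule to cope with the sparse and noisy observations of real-time applications. By modeling temporally evolving uncertainty and denoising incrementally, EgoForce generates stable, coherent full-body motion under strict causal constraints; experiments show that it outperforms existing online and offline methods in challenging egocentric scenarios.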

Comments Project page: https://inwoohwang.me/EgoForce

Details
English abstract

With recent advances in embodied agents and AR devices, egocentric observations are readily available as input for real-world interactive online applications. However, egocentric viewpoints can only sporadically observe hands, in addition to the estimated head trajectory. We propose EgoForce, an online framework for reconstructing long-term full-body motion from noisy egocentric input. While existing generative frameworks can robustly handle noisy and sparse measurements, they assume a fixed-length observation window is available and are thus not suitable for real-time applications. Faster inference often relies on autoregressive prediction, sacrificing robustness. In contrast, we adopt a diffusion-based method with a temporally asymmetric noise schedule inspired by Diffusion Forcing. Specifically, our approach models temporally evolving uncertainty and incrementally denoises states as new streaming observations arrive. Combined with a noise-robust imputation strategy, EgoForce progressively generates stable and coherent full-body motion under strict causal constraints. Experiments demonstrate that our online framework outperforms existing online and offline methods, enabling long-horizon, full-body motion reconstruction in challenging egocentric scenarios.

2605.13038 2026-05-14 cs.CV cs.AI

CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy

Liangjing Shao, Beilei Cui, Hongliang Ren

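AI summary This paper presents CoGE, an online monocular geometric estimation framework for colonoscopy that targets the difficulty of depth estimation and scene reconstruction in real procedures. It introduces an illumination-aware supervision module based on Retinex theory and a structure-aware perception module based on wavelet decomposition to handle illumination differences and structural feature extraction in colonoscopy scenes. Experiments show that CoGE, trained only on simulated data, achieves state-of-the-art geometric estimation in both simulated and real scenes.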

Comments Early Accepted by MICCAI 2026

Details
English abstract

Geometric estimation, including depth estimation and scene reconstruction, is a crucial technique for colonoscopy that can provide surgeons with 3D spatial perception and navigation. However, geometric ground truth in colonoscopy is difficult to obtain due to the narrow and enclosed space of the colon, and there is a large feature gap between simulated and real data caused by artifacts and illumination. In this paper, we present CoGE, a novel framework for online monocular geometric estimation during colonoscopy. First, we propose an illumination-aware supervision module based on the Retinex theory to address illumination diversity in different colonoscopy scenes. Second, we propose a structure-aware perception module based on wavelet decomposition to extract common structural and local features of the colon. Both quantitative and qualitative results demonstrate that the proposed model, trained solely on simulated data, achieves state-of-the-art performance in geometric estimation for both simulated and realistic scenes.

2605.13037 2026-05-14 cs.AI

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

Yuxin Liu, Ziang Ye, Yueqing Sun, Mingye Zhu, Jinwei Xiao, Zhuowen Han, Qi GU, Xunliang Cai, Lei Zhang

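AI summary Current interactive LLM agents rely on goal-conditioned stepwise planning, in which environmental understanding is acquired reactively during execution, leading to delayed environmental perception and an epistemic bottleneck. The paper proposes the Map-then-Act Paradigm (MAP), which builds a cognitive map of the environment before execution through three stages: global exploration, task-specific mapping, and knowledge-augmented execution, thereby improving task execution efficiency. Experiments show significant gains across multiple benchmarks, and training on MAP-2K, a dataset of map-then-act trajectories, outperforms training on expert trajectories, suggesting that understanding the environment matters more than imitation.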

Details
English abstract

Current interactive LLM agents rely on goal-conditioned stepwise planning, where environmental understanding is acquired reactively during execution rather than established beforehand. This temporal inversion leads to Delayed Environmental Perception: agents must infer environmental constraints through trial-and-error, resulting in an Epistemic Bottleneck that traps them in inefficient failure cycles. Inspired by human affordance perception and cognitive map theory, we propose the Map-then-Act Paradigm (MAP), a plug-and-play framework that shifts environment understanding before execution. MAP consists of three stages: (1) Global Exploration, acquiring environment-general priors; (2) Task-Specific Mapping, constructing a structured cognitive map; and (3) Knowledge-Augmented Execution, solving tasks grounded on the map. Experiments show consistent gains across benchmarks and LLMs. On ARC-AGI-3, MAP enables frontier models to surpass near-zero baseline performance in 22 of 25 game environments. We further introduce MAP-2K, a dataset of map-then-act trajectories, and show that training on it outperforms expert execution traces, suggesting that understanding environments is more fundamental than imitation.

2605.13034 2026-05-14 cs.CV cs.IR

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

Zhuofan Shi, Peilun Jia, Baoqin Sun, Haiyang Shen, Sixiong Xie, Yun Ma, Xiang Jing

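AI summary ViDR is a multimodal deep research framework that uses source figures as evidence to produce detailed, well-grounded research reports. It treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, and combines context-aware filtering, outline-aware reranking, and vision-language model analysis to improve the accuracy and relevance of figure evidence. ViDR also introduces the MMR Bench+ evaluation benchmark; experiments show that it outperforms mainstream existing models in report quality, figure integration, and verifiability, underscoring the importance of source visual evidence in multimodal deep research.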

Details
English abstract

Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with section-specific evidence. ViDR further validates visual references to reduce hallucinated or misplaced figures. We also introduce MMR Bench+, a benchmark for evaluating visual evidence use in deep research reports, covering source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation. Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines. These results suggest that source visual evidence is important for multimodal deep research, as it strengthens evidential grounding, visual support, and report verifiability.

2605.13030 2026-05-14 cs.LG cs.AI

FeatCal: Feature Calibration for Post-Merging Models

Yanggan Gu, Shuo Cai, Zihao Wang, Wenjun Wang, Yuanyi Wang, Pengkai Wang, Sirui Huang, Su Lu, Jianmin Wu, Hongxia Yang

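AI summary FeatCal is a feature calibration method for the performance drop that follows model merging. By analyzing the feature drift between the merged model and the expert models, it derives a layer-by-layer calibration strategy that effectively improves the merged model. Using a small amount of calibration data, it adjusts the model weights layer by layer in closed form, with no gradient descent or extra modules, preserving the benefits of merging while significantly improving task performance. Experiments show that FeatCal outperforms existing calibration methods on multiple benchmarks, with better sample efficiency and lower calibration cost.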

Details
English abstract

Model merging combines task experts into one model and avoids joint training, retraining, or deploying many expert models, but the merged model often still underperforms task experts. We study this performance gap through feature drift, the difference between features produced by the merged model and by the expert on the same input. Our theory decomposes this drift into upstream propagation and local mismatch, tracks how it propagates and combines through later layers in forward order, and links final feature drift to output drift. This view motivates FeatCal, which uses a small calibration set to calibrate the merged model weights layer by layer in forward order, reducing feature drift while staying close to merged weights and preserving the benefits of model merging. FeatCal uses an efficient closed-form solution to update model weights, with no gradient descent, iterative optimization, or extra modules. On the main CLIP and GLUE benchmarks, FeatCal beats Surgery and ProbSurgery, the closest post-merging calibration baselines: 85.5% vs. 77.0%/78.8% on CLIP-ViT-B/32 Task Arithmetic (TA) and 85.2% vs. 83.7%/82.2% on FLAN-T5-base GLUE. On CLIP-ViT-B/32, 8 examples per task reach 82.9%, and 256 examples per task take 53 seconds, about 4x faster than both baselines, showing better sample efficiency and lower calibration cost.
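To make the idea of a closed-form, per-layer calibration concrete, here is a minimal sketch for a single linear layer. It assumes a ridge-style objective, `||X W - Y_expert||^2 + lam * ||W - W_merged||^2`, which is one plausible way to reduce feature drift while staying close to the merged weights; the abstract does not state FeatCal's actual objective, so the formula, the function name, and the `lam` parameter are all assumptions.

```python
import numpy as np

def calibrate_layer(X, Y_expert, W_merged, lam=1.0):
    """Closed-form minimizer of ||X W - Y_expert||^2 + lam * ||W - W_merged||^2.

    X:        (n, d_in) layer inputs from the calibration set
    Y_expert: (n, d_out) target features produced by the task expert
    W_merged: (d_in, d_out) weights of the merged model
    """
    d_in = X.shape[1]
    A = X.T @ X + lam * np.eye(d_in)      # regularized Gram matrix
    B = X.T @ Y_expert + lam * W_merged   # pulls the solution toward W_merged
    return np.linalg.solve(A, B)
```

Under this assumed objective, the calibrated weights never have larger feature drift on the calibration set than the merged weights, and a large `lam` recovers the merged weights almost exactly. Applying the function layer by layer in forward order would follow the ordering the abstract describes.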

2605.13028 2026-05-14 cs.RO cs.SY eess.SY

Local Conformal Calibration of Dynamics Uncertainty from Semantic Images

Luís Marques, Dmitry Berenson

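AI summary This paper proposes OCULAR, a conformal-prediction-based algorithm that locally calibrates dynamics uncertainty from semantic images, providing uncertainty quantification guarantees for unseen test environments. Using data from visually similar environments, it provably calibrates a linear Gaussian dynamics model of arbitrary fidelity, guaranteeing that the prediction regions contain future system states with at least a user-set probability despite random disturbances and model bias. The method needs no strong assumptions about the true system dynamics and can distinguish how uncertainty differs across inputs, which supports probabilistically safe planning; its effectiveness is validated in several experimental settings.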

Comments 26 pages, 8 figures. Accepted to the 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR) 2026

Details
English abstract

We introduce Observation-aware Conformal Uncertainty Local-Calibration (OCULAR), a conformal prediction-based algorithm that uses perception information to provide uncertainty quantification guarantees for unseen test-time environments. While previous conformal approaches lack the ability to discriminate between state-action space regions leading to higher or lower model mismatch, and require environment-specific data, our method uses data collected from visually similar environments to provably calibrate a given linear Gaussian dynamics model of arbitrary fidelity. The prediction regions generated from OCULAR are guaranteed to contain the future system states with, at least, a user-set likelihood, despite both aleatoric and epistemic uncertainty -- i.e., uncertainty arising from both stochastic disturbances and lack of data. Our guarantees are non-asymptotic and distribution-free, not requiring strong assumptions about the unknown real system dynamics. Our calibration procedure enables distinguishing between observation-velocity-action inputs leading to higher and lower next-state-uncertainty, which is helpful for probabilistically-safe planning. We numerically validate our algorithm on a double-integrator system subject to random perturbations and significant model mismatch, using both a simplified sensor and a more realistic simulated camera. Our approach appropriately quantifies uncertainty both when in-distribution and out-of-distribution, being comparatively volume-efficient to baselines requiring environment-specific data.
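For readers unfamiliar with conformal calibration, the sketch below shows standard split conformal prediction with a single global radius. It is not OCULAR's local, observation-aware procedure, which the abstract does not specify in detail; the function name and the use of scalar residuals (for example, norms of next-state prediction errors) are assumptions. The coverage guarantee of this textbook recipe holds when calibration and test residuals are exchangeable.

```python
import math
import numpy as np

def conformal_radius(calib_residuals, alpha=0.1):
    """Radius r such that a new residual is <= r with probability >= 1 - alpha."""
    scores = np.sort(np.asarray(calib_residuals, dtype=float))
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        # Too few calibration points to certify this alpha with a finite radius.
        return math.inf
    return float(scores[k - 1])
```

A prediction region is then the ball of this radius around the model's predicted next state. A local method would let the radius depend on the input, which is what allows high-uncertainty and low-uncertainty regions to be told apart.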

2605.13027 2026-05-14 cs.CV

PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution

Zihang Xu, Xiaoyang Liu, Zheng Chen, Yulun Zhang, Xiaokang Yang

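AI summary This paper proposes PRISM, a diffusion-based text image super-resolution method that addresses the reliability and structural accuracy of generated text detail under severe degradation. It introduces Flow-Matching Prior Rectification (FMPR) and a Structure-guided Uncertainty-aware Residual Encoder (SURE), which respectively improve the reliability of the global text prior and the precision of local stroke boundaries. Experiments show that PRISM achieves state-of-the-art performance on both synthetic and real-world datasets, with millisecond-level inference.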

Comments Code is available at https://github.com/faithxuz/PRISM

Details
English abstract

Text image super-resolution (Text-SR) requires more than visually plausible detail synthesis: slight errors in stroke topology may alter character identity and break readability. Existing methods improve text fidelity with stronger recognition-based or generative priors, yet they still face two unresolved challenges under severe degradation: the text condition extracted from low-quality inputs can itself be unreliable, and a plausible global prior does not fully determine fine-grained stroke boundaries. We present PRISM, a single-step diffusion-based Text-SR framework that addresses these two challenges through Flow-Matching Prior Rectification (FMPR) and a Structure-guided Uncertainty-aware Residual Encoder (SURE). FMPR constructs a privileged training-time prior from paired low-quality/high-quality latents and learns a flow matching that transports degraded embeddings toward this restoration-oriented prior space, yielding more accurate and reliable global text guidance. SURE further predicts uncertainty-aware structural residuals to selectively absorb reliable local boundary evidence while suppressing ambiguous stroke cues. Together, these components enable explicit global prior rectification and local structure refinement within a single diffusion restoration pass. Experiments on both synthetic and real-world benchmarks show that PRISM achieves state-of-the-art performance with millisecond-level inference. Our dataset and code will be available at https://github.com/faithxuz/PRISM.

2605.13026 2026-05-14 cs.LG cs.AI cs.CL

Understanding and Accelerating the Training of Masked Diffusion Language Models

Chunsan Hong, Sanghyun Lee, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Yuki Mitsufuji, Seungryong Kim, Jong Chul Ye

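AI summary This paper studies why masked diffusion language models (MDMs) train slowly and proposes an effective way to accelerate training. The analysis finds that the locality bias of language is the main cause of slow training, and the authors propose a training strategy based on bell-shaped time sampling that markedly improves training efficiency. Experiments show that, while preserving final performance, the method speeds up MDM training on the LM1B benchmark by about 4x and also yields faster improvements in generative perplexity and downstream task performance.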

Comments Preprint

Details
English abstract

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
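As a rough illustration of bell-shaped time sampling, the sketch below draws diffusion times from a symmetric Beta distribution instead of the uniform distribution and then masks tokens with probability equal to the sampled time. The abstract does not say which bell-shaped distribution the authors use, so the Beta choice, the `concentration` parameter, the linear masking schedule, and the function names are assumptions; any loss reweighting the actual recipe may require is not covered.

```python
import numpy as np

def sample_times(batch_size, concentration=3.0, rng=None):
    """Draw times in [0, 1] from Beta(c, c), which is bell-shaped for c > 1."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.beta(concentration, concentration, size=batch_size)

def mask_tokens(tokens, t, mask_id, rng):
    """Mask each token of sequence i independently with probability t[i]."""
    drop = rng.random(tokens.shape) < t[:, None]
    return np.where(drop, mask_id, tokens)
```

Compared with uniform sampling, this concentrates training examples at intermediate masking ratios and draws fewer nearly clean or nearly fully masked sequences.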

2605.13025 2026-05-14 cs.LG cs.GT

Offline Two-Player Zero-Sum Markov Games with KL Regularization

Claire Chen, Yuheng Zhang, Xinyu Liu, Zixuan Xie, Shuze Daniel Liu, Nan Jiang

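AI summary This paper studies learning Nash equilibria in offline two-player zero-sum Markov games. Unlike existing methods that rely on explicit pessimism to handle distribution shift, the authors prove that KL regularization alone suffices to stabilize learning and guarantee convergence. They introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast $\widetilde{\mathcal{O}}(1/n)$ convergence rate under unilateral concentrability, and design SOS-MD, a practical model-free algorithm based on least-squares value estimation and iterative self-play updates, whose last iterate attains the same statistical rate up to an optimization error that vanishes with the number of self-play iterations $T$.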

Details
English abstract

We study the problem of learning Nash equilibria in offline two-player zero-sum Markov games. While existing approaches often rely on explicit pessimism to address distribution shift, we show that KL regularization alone suffices to stabilize learning and guarantee convergence. We first introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast $\widetilde{\mathcal{O}}(1/n)$ convergence rate under \textit{unilateral concentrability}, improving over the standard $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rates in unregularized settings. We then propose Sequential Offline Self-play Mirror Descent (SOS-MD), a practical model-free algorithm based on least-squares value estimation and iterative self-play updates. We prove that the last iterate of SOS-MD attains the same $\widetilde{\mathcal{O}}(1/n)$ statistical rate up to a vanishing optimization error of order $\widetilde{\mathcal{O}}(1/\sqrt{T})$ in the number of self-play iterations $T$.
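As background on why KL regularization tends to produce well-behaved policy updates, the standard single-agent identity below may help. It is a textbook fact and not a result taken from the paper; the symbols $Q$, $\beta > 0$, and $\pi_{\mathrm{ref}}$ are illustrative assumptions, and the two-player analysis is not reproduced here.

$$
\pi^\star(\cdot \mid s) = \arg\max_{\pi} \Big\{ \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big] - \beta\, \mathrm{KL}\big(\pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\big) \Big\}
\quad\Longrightarrow\quad
\pi^\star(a \mid s) = \frac{\pi_{\mathrm{ref}}(a \mid s)\, \exp\big(Q(s,a)/\beta\big)}{\sum_{a'} \pi_{\mathrm{ref}}(a' \mid s)\, \exp\big(Q(s,a')/\beta\big)}
$$

Because $\pi^\star$ is an exponential reweighting of $\pi_{\mathrm{ref}}$, it stays within the support of $\pi_{\mathrm{ref}}$ for finite $Q$, which gives some intuition for how a KL term can limit the effect of distribution shift without an explicit pessimism penalty.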

2605.13021 2026-05-14 cs.LG cs.AI

Rethinking Efficient Graph Coarsening via a Non-Selfishness Principle

Xu Bai, Bin Lu, Kun Zhang, Shengbo Chen, Xinbing Wang, Chenghu Zhou, Meng Jin

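AI summary This paper proposes NOPE, an efficient graph coarsening method based on a non-selfishness principle, to address the high computational and memory overhead that independent node matching causes in conventional graph coarsening. By prioritizing the collective interference of the neighborhood, it achieves linear memory consumption and near-linear computational complexity. A faster variant, NOPE*, reduces the interference evaluation cost from O(δ·d) to O(d) under a local isotropy assumption, markedly improving efficiency on high-degree nodes. Experiments show that NOPE* is 1.8 to 10 times faster than NOPE and performs well on graph learning tasks, even outperforming LLM-based graph reasoning.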

详情
英文摘要

Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most existing methods rely on pairwise similarity matching, where each node independently searches for its best partner based on global information. This selfishness matching paradigm incurs substantial computational and memory overhead. To address this problem, we shift to a non-selfishness principle that prioritizes the collective interference of the neighborhood in coarsening, and propose an efficient method named NOPE, which achieves linear memory consumption and near-linear computational complexity in the number of nodes. Furthermore, we derive a faster variant, NOPE*, which reduces the $O(\delta \cdot d)$ interference evaluation to $O(d)$ based on the local isotropy assumption, and consequently alleviates the computational bottleneck for high-degree nodes. Experimental results show that NOPE* achieves a $1.8$-$10\times$ speedup over NOPE and surpasses almost all baselines with 1-3 orders of magnitude acceleration. Meanwhile, learning on coarsened graphs yields performance comparable to learning on the original graphs, and can even outperform LLM-based graph reasoning owing to compact graph information. The code is available at https://github.com/dazonglian/NOPE-main.

2605.13018 2026-05-14 cs.CV

OCH3R: Object-Centric Holistic 3D Reconstruction

Yi Du, Yang You, Xiang Wan, Leonidas Guibas

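AI summary OCH3R is a unified object-centric 3D reconstruction framework that, from a single RGB image, simultaneously predicts the 6D poses and detailed 3D reconstructions of all objects in the scene. Its core is a transformer architecture that predicts per-pixel category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians per object, enabling end-to-end inference in a single pass. By transforming the predicted Gaussians into canonical space and aligning them with pre-rendered ground truth, the method avoids costly per-image annotation and significantly improves reconstruction accuracy and inference efficiency.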

Details
English abstract

Object-centric scene understanding is a fundamental challenge in computer vision. Existing approaches often rely on multi-stage pipelines that first apply pre-trained segmentors to extract individual objects, followed by per-object 3D reconstruction. Such methods are computationally expensive, fragile to segmentation errors, and scale poorly with scene complexity. We introduce OCH3R, a unified framework for Object-Centric Holistic 3D Reconstruction from a single RGB image. OCH3R performs one forward pass to simultaneously predict all object instances with their 6D poses and detailed 3D reconstructions. The key idea is a transformer architecture that predicts per-pixel attributes, including CLIP-based category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians representing each object. To supervise these Gaussian reconstructions, we transform them into canonical space using the predicted 6D poses and align them with pre-rendered canonical ground truth, avoiding costly per-image Gaussian label generation. On standard indoor benchmarks, OCH3R achieves state-of-the-art performance across monocular depth estimation, open-vocabulary semantic segmentation, and RGB-only category-level 6D pose estimation, while producing high-fidelity, editable per-object reconstructions. Crucially, inference is fully feed-forward and scales independently of the number of objects, offering orders-of-magnitude speedups over conventional multi-stage pipelines in cluttered scenes.