arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08787 2026-05-12 cs.CV

Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models

Mashrafi Monon, Umaima Rahman, Asif Hanif, Numan Saeed, Mohammad Yaqub

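AI summary: This paper proposes CT-SpatialVQA, a new benchmark for evaluating the semantic-spatial understanding of 3D medical vision-language models. Built from 1601 radiology reports and CT volumes, the benchmark comprises 9077 clinically relevant question-answer pairs that require anatomical localization, laterality recognition, structural comparison, and reasoning about 3D inter-structure relations. Experiments show that existing models perform poorly on these tasks, averaging only 34% accuracy, which underscores the need for stronger integration of 3D medical evidence in clinically trustworthy applications.

Abstract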

Recent advances in 3D medical vision-language models have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question-answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs for clinically reliable decision support. To address this gap, we introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer (QA) pairs derived directly from 1601 radiology reports and CT volumes, which are validated via a robust LLM-assisted pipeline with a 95% human consensus agreement rate. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We also introduce a standardized evaluation protocol and benchmark eight 3D medical VLMs, finding severe degradation on semantic-spatial reasoning tasks, averaging 34% accuracy and often below random, highlighting the need for deeper integration of volumetric evidence for trustworthy clinical use.

2605.08784 2026-05-12 cs.CV

SimplePoster: A Simple Baseline for Product Poster Generation

Benlei Cui, Fangao Zeng, Weitao Jiang, Yuwen Zhai, Haiwen Hong, Longtao Huang, Hui Xue, Wenxiang Shang, Pipei Huang

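AI summary: This paper proposes SimplePoster, a simple yet effective framework for product poster generation that addresses the challenges of preserving product appearance and precisely controlling dense, multi-line text layouts. Unlike prior methods that rely on complex modules such as ControlNet and OCR encoders, SimplePoster achieves high-fidelity subject preservation and accurate text rendering without external controllers, using full-parameter fine-tuning and character-level position encoding. Experiments show that SimplePoster outperforms existing methods in both subject preservation rate and text rendering accuracy.

Comments CVPR 2026

Abstract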

Product poster generation poses distinct challenges beyond general poster design, requiring both faithful preservation of product appearance and precise control over dense, multi-line text layouts. Prior methods typically adopt inpainting frameworks augmented with auxiliary modules such as ControlNet and OCR encoders. However, these approaches introduce architectural complexity and computational overhead while still suffering from text errors and subject extension artifacts. We present SimplePoster, a simple yet effective inpainting-based framework that achieves faithful subject preservation and accurate, position-controllable text rendering without external controllers. Our approach builds on two observations: (1) full-parameter fine-tuning of the base model effectively suppresses subject extension, outperforming ControlNet-based alternatives; and (2) a zero-cost character-level position encoding enables geometry-aware text generation without dedicated layout modules. Experiments show that SimplePoster achieves a 98.7% subject preservation rate, compared to 55.2% for SeedEdit 3.0 and 85.3% for PosterMaker, while also improving text rendering accuracy. Code, models, the benchmark, and part of the training data will be available at https://github.com/Alibaba-YuFeng/SIMPLEPOSTER

2605.08781 2026-05-12 cs.CV

Contour-Native Bridge Defect Detection and Compact Digital Archiving with Frequency-Supervised Fourier Contours

Jin Liu, Wang Wang, Hongxu Pu, Zhen Cao, Yasong Wang, Hu Wang, Kunming Luo

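AI summary: This paper studies how to represent bridge defect detection results as more compact, recoverable contour vectors, replacing coarse bounding boxes and storage-heavy raster masks. It proposes Frequency-Supervised Fourier Series Detection (FS-FSD), which directly regresses Fourier contour descriptors and evaluates boxes, masks, and contours under a unified polygon-space protocol. Experiments on a large set of UAV-collected bridge images show higher polygon-space detection accuracy and better matched true-positive geometric quality, giving a more efficient and precise representation of defect boundaries for engineering review and downstream information workflows.

Comments 46 pages, 13 figures

Abstract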

AI-assisted bridge defect inspection often produces bounding boxes with crude geometry or raster masks that are costly to store, transmit, and reuse. This study investigates how detected defects can be represented as compact, recoverable contour-level vector records in image space. We propose Frequency-Supervised Fourier Series Detection (FS-FSD), which directly regresses Fourier contour descriptors and evaluates boxes, masks, and contours under a unified polygon-space protocol. On 3,767 UAV-collected bridge images with 42,346 defect instances, FS-FSD achieves higher polygon-space accuracy and better matched-TP geometric quality than representative detection, segmentation, and contour baselines. These results show that, compared with bounding boxes and raster masks, Fourier contour records preserve defect-boundary geometry in a more compact, recoverable, and shareable form for engineering review and downstream information workflows. Future work will study the modeling of multi-region, fragmented, and adjacent bridge-defect boundaries and extend the framework toward long-term bridge-defect tracking and lifecycle-oriented management.
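
To make the idea of a compact, recoverable contour record concrete, the following is a minimal sketch of classical complex Fourier descriptors for a closed contour. It is not the FS-FSD model: the abstract does not state its exact parameterization, so the function names, the descriptor order, and the test circle here are all illustrative assumptions.

```python
import cmath
import math


def fourier_descriptors(points, order):
    """Return complex Fourier coefficients c_k for k in [-order, order].

    `points` is a closed contour sampled as (x, y) pairs; each point is
    treated as the complex number x + iy (a standard textbook choice,
    assumed here rather than taken from the paper).
    """
    n = len(points)
    z = [complex(x, y) for x, y in points]
    return {
        k: sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n
        for k in range(-order, order + 1)
    }


def reconstruct(coeffs, samples):
    """Recover `samples` contour points from the stored coefficients."""
    contour = []
    for s in range(samples):
        u = s / samples
        p = sum(c * cmath.exp(2j * math.pi * k * u) for k, c in coeffs.items())
        contour.append((p.real, p.imag))
    return contour


# Hypothetical defect outline: a circle of radius 2 centred at (5, 3).
circle = [
    (5 + 2 * math.cos(2 * math.pi * t / 64), 3 + 2 * math.sin(2 * math.pi * t / 64))
    for t in range(64)
]
desc = fourier_descriptors(circle, order=3)      # 7 complex numbers
recovered = reconstruct(desc, samples=64)
max_err = max(math.dist(p, q) for p, q in zip(circle, recovered))
```

In this toy case seven complex coefficients stand in for 64 boundary points, which illustrates why such a record can be smaller than a raster mask while still being recoverable as a polygon.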

2605.08778 2026-05-12 cs.AI cs.LG cs.MA

Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking

Zhida He, Xiaoyu Wen, Han Qi, Ziyuan Zhou, Peng Yu, Xingcheng Xu, Dongrui Liu, Xia Hu, Chaochao Lu, Qiaosheng Zhang

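AI summary: This study addresses the credit assignment problem in reinforcement-learning-based multi-turn LLM jailbreak methods and proposes TRACE, a turn-aware credit assignment framework. Conventional methods use coarse trajectory-level reward signals, which misjudge the contribution of individual turns; TRACE assigns credit more precisely through turn-level semantic masking and targeted penalties, improving attack effectiveness and efficiency. Experiments show that TRACE outperforms existing methods in attack success rate, transferability, and efficiency, and yields a better safety-utility balance when reused for defense alignment.

Comments 41 pages, 10 figures

Abstract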

Deploying LLMs in multi-turn dialogues facilitates jailbreak attacks that distribute harmful intent across seemingly benign turns. Recent training-based multi-turn jailbreak methods learn long-horizon attack strategies from interaction feedback, but often rely on coarse trajectory-level outcome signals that broadcast uniformly to every turn. However, we find that turn-level contributions in multi-turn jailbreaking are non-uniform, phase-dependent, and target-specific. Such coarse outcome supervision induces a credit assignment problem, leading to over-rewarding redundant turns in successful trajectories and under-crediting useful intermediate turns in failed ones. To address this, we propose TRACE, a turn-aware credit assignment framework for reinforcement learning (RL)-based multi-turn jailbreaking. For successful trajectories, TRACE estimates turn-level contributions via leave-one-turn-out semantic masking; for failed ones, TRACE assigns penalties based on prompt harmfulness and semantic relevance, with an additional local refusal-aware penalty. Furthermore, we reuse the attack-side credit signal for multi-turn defense alignment. Extensive experiments on open-source and closed-source targets show that TRACE achieves strong overall performance in effectiveness, transferability, and efficiency, yielding about a 25% relative improvement in attack success rate over the strongest RL baseline while also improving the safety-utility balance when reused for defense alignment.

2605.08776 2026-05-12 cs.AI

Reasoning Compression with Mixed-Policy Distillation

Han Yang, Mingyan Wu, Bailan He, Zeyu Cao, Sikuan Yan, Kevin Qinghong Lin, Zifeng Ding

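AI summary: This paper studies how to compress the reasoning trajectories generated by large language models, without sacrificing reasoning performance, to make small-model reasoning more efficient. The authors propose Mixed-Policy Distillation (MPD), which transfers concise reasoning behavior from a large model to a small one and avoids the limitations of explicit length constraints. Experiments show that MPD reduces token usage while improving the small model's performance on multiple reasoning tasks, offering an effective approach to efficient small-model reasoning.

Abstract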

Reasoning-centric large language models (LLMs) achieve strong performance by generating intermediate reasoning trajectories, but often incur excessive token usage and high inference-time decoding cost. We observe that, when solving the same problems, larger reasoning models can often produce more concise traces, whereas smaller reasoning models tend to generate longer and more redundant trajectories. This is especially problematic in real-world deployment, where memory, latency, and serving-cost constraints often favor smaller models. Our observations suggest that reasoning compression can be transferred from large models to small ones rather than enforced through explicit length constraints. Based on this insight, we propose Mixed-Policy Distillation (MPD), a reasoning compression framework that transfers concise reasoning behavior from a larger-sized teacher to a smaller student by distilling teacher-compressed student trajectories. Unlike on-policy distillation, which aligns the student with teacher distributions over verbose student trajectories, or off-policy distillation, which relies on teacher-generated trajectories and may suffer from distribution mismatch, MPD combines the strengths of both. Given a student-sampled trajectory, the teacher rewrites it into a more concise reasoning trace, and the student is trained via KL-based alignment on the compressed trajectory. This preserves student-policy exploration while injecting teacher-guided compression. Experiments on Qwen3-1.7B show that MPD reduces token usage by up to 27.1% while improving performance across multiple reasoning benchmarks, demonstrating an effective approach to efficient small-model reasoning.

2605.08774 2026-05-12 cs.RO cs.LG

ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

Youhe Feng, Hansen Shi, Haoyang Li, Xinlei Guo, Yang Wang, Chengyang Zhang, Jinkai Zhang, Xiaohan Zhang, Jie Tang, Jing Zhang

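AI summary: This paper proposes ProcVLM, a progress-aware vision-language model for dense reward learning in robotic manipulation. Unlike conventional methods that rely on final outcomes or temporal interpolation, ProcVLM estimates task progress from procedural structure and intra-stage visual change, and adopts a reasoning-before-estimation paradigm that first infers the remaining actions and then estimates progress. The authors build ProcCorpus-60M, a dataset with 60 million annotated frames, and show on multiple benchmarks that ProcVLM is superior at task progress estimation and manipulation reasoning, providing a more precise dense reward signal for downstream policy optimization.

Abstract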

Long-horizon robotic manipulation requires dense feedback that reflects how a task advances through its procedural stages, not merely whether the final outcome is successful. Existing reward models often rely on trajectory-level success labels or time-based interpolation, which can conflate elapsed time with true task progress and therefore fail to capture unfinished steps, stagnation, and failure states. We present ProcVLM, a progress-aware vision-language model that learns procedure-grounded progress as a dense reward signal for manipulation. Rather than deriving progress from terminal outcomes or temporal proxies, ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, and further adopts a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Specifically, we construct this supervision by synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing each budget based on intra-subtask visual change. To train ProcVLM at scale, we build a standardized procedural supervision synthesis pipeline and construct ProcCorpus-60M from 30 embodied datasets with 60M annotated frames, from which we derive ProcVQA for procedure-aware pretraining, with progress estimation as the central task alongside action segmentation and future planning. Experiments on ProcVQA and reward-model benchmarks show that ProcVLM improves embodied procedural reasoning and yields more discriminative trajectory-internal progress estimates than representative baselines, supporting its use as a dense reward model for downstream reward-guided policy optimization. Project page: https://procvlm.github.io/
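
The supervision recipe described above (a progress budget per subtask, spread over frames according to intra-subtask visual change) can be sketched as follows. This is one plausible reading, not the authors' pipeline: the budget values, the per-frame change magnitudes, and the function name `progress_labels` are hypothetical.

```python
def progress_labels(subtasks):
    """Turn per-subtask budgets and per-frame visual change into progress in [0, 1].

    `subtasks` is a list of (budget, changes) pairs, where `changes` holds one
    non-negative visual-change magnitude per frame. Both inputs are assumed
    to come from an upstream annotation step that is not modelled here.
    """
    total_budget = sum(budget for budget, _ in subtasks)
    labels = []
    completed = 0.0  # progress contributed by subtasks that are already finished
    for budget, changes in subtasks:
        share = budget / total_budget
        total_change = sum(changes)
        fraction = 0.0  # fraction of this subtask's budget spent so far
        for change in changes:
            if total_change > 0:
                fraction += change / total_change
            else:
                fraction += 1.0 / len(changes)  # no visual signal: fall back to uniform
            labels.append(completed + share * fraction)
        completed += share
    return labels


# Two hypothetical subtasks; the second stalls for two frames before finishing.
labels = progress_labels([
    (1, [0.0, 2.0, 2.0]),
    (3, [1.0, 0.0, 0.0, 3.0]),
])
```

Frames with zero visual change leave the label flat, which is the behavior that time-based interpolation cannot express: elapsed time advances during stagnation, but the progress label does not.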

2605.08769 2026-05-12 cs.AI

EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

Chengdong Xu, Kaiqiang Ke, Ziheng Liu, Jiaqi Wei, Zibo Shao, Weile Guo, Chao Yu

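AI summary: This paper proposes EvoMAS, a framework that dynamically constructs multi-agent system workflows during task execution. It models workflow construction as a meta-level sequential decision problem, explicitly builds the task state through a Planner-Evaluator-Updater pipeline, and uses a learned Workflow Adapter to generate stage-specific layered workflows from a fixed pool of candidate agents. Experiments show that EvoMAS outperforms single-agent baselines and existing automated multi-agent workflow design methods on multiple benchmarks, demonstrating that it adapts to changing task state and improves collaboration efficiency in dynamic task settings.

Comments 22 pages, 8 figures

Abstract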

Large language model (LLM)-based multi-agent systems have shown strong potential on complex tasks through agent specialization, tool use, and collaborative reasoning. However, most automated multi-agent system design methods still follow a one-shot paradigm: a workflow is optimized or selected before execution and then reused unchanged throughout the task. This static coordination strategy is ill-suited for long-horizon tasks whose subgoals, intermediate evidence, and information needs evolve over multiple execution stages. We propose EvoMAS, a framework for execution-time multi-agent workflow construction. EvoMAS formulates workflow construction as a meta-level sequential decision problem along a single task trajectory. At each stage, it constructs an explicit task state through a Planner-Evaluator-Updater pipeline and uses a learned Workflow Adapter to instantiate a stage-specific layered workflow from a fixed pool of candidate agents. The adapter is trained with policy gradients using sparse, verifiable terminal task success as the main supervision signal, while evaluator-based process reward is analyzed separately under very-hard sparse-reward settings. Experiments on GAIA, HLE, and DeepResearcher show that EvoMAS outperforms single-agent baselines and recent automated multi-agent workflow design methods. Our analyses further show that explicit task-state construction and learned workflow adaptation provide complementary benefits. Additional results indicate that process reward is most useful when terminal success is extremely sparse, and qualitative case studies illustrate that EvoMAS adapts agent coordination as the task state evolves.

2605.08765 2026-05-12 cs.LG cs.AI

Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning

Renjie Gu, Jiazhen Du, Yihua Zhang, Sijia Liu

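AI summary: This study examines dishonest behaviors that large language models (LLMs) can exhibit when unlearning harmful training data, such as hallucination and inconsistent behavior. It proposes a formal definition of unlearning honesty and designs a suite of metrics covering utility, honesty on retained knowledge, forgetting effectiveness, and more. Based on experimental analysis, the authors propose ReVa, a representation-alignment method that fine-tunes feature-randomized unlearned models and markedly improves the rejection rate on forgotten knowledge as well as honesty on retained knowledge.

Comments Accepted by ACL 2026

Abstract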

Unlearning in large language models (LLMs) aims to remove harmful training data while preserving overall utility. However, we find that existing methods often hallucinate, generate abnormal token sequences, or behave inconsistently, raising safety and trust concerns. According to prior literature on LLM honesty, such behaviors are often associated with dishonesty. This motivates us to investigate the notion of honesty in the context of model unlearning. We propose a formal definition of unlearning honesty, which includes: (1) preserving both utility and honesty on retained knowledge, and (2) ensuring effective forgetting while encouraging the model to acknowledge its limitations and respond consistently to questions related to forgotten knowledge. To systematically evaluate the honesty of unlearning, we introduce a suite of metrics that cover utility, honesty on the retained set, effectiveness of forgetting, rejection rate, and refusal stability in Q&A and MCQ settings. Evaluating 9 methods across 3 mainstream families shows that all current methods fail to meet these standards. After experimental and theoretical analyses, we present ReVa, a representation-alignment procedure that fine-tunes feature-randomized unlearned models to better acknowledge forgotten knowledge. On Q&A tasks from the forget set, ReVa achieves the highest rejection rate after two rounds of interaction, nearly doubling the performance of the second-best method. Remarkably, it also improves honesty on the retained set. We release our data and code at https://github.com/renjiegu.

2605.08764 2026-05-12 cs.LG cs.CV eess.IV

Anchoring the Eigengap: Cross-Modal Spectral Stabilization for Sample-Efficient Representation Learning

Nikhil J. Dhinagar, Vidhi Chhatbar, Chirag Jagad, Pavithra Senthilkumar, Sophia I. Thomopoulos, Mahir H. Khan, Sook-Lei Liew, the ENIGMA-Stroke Recovery Working Group, Paul M. Thompson

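AI summary: This paper investigates the root cause of the performance degradation of deep vision models under data scarcity, attributing it to finite-sample noise that corrupts the embedding covariance matrix, compresses the eigengap, and limits the number of recoverable signal modes. The authors develop a spectral theory of finite-sample representation learning that quantifies the recoverable dimension $K(N)$, and use perturbation theory and concentration inequalities to derive a criterion for reliable eigenmodes. The study further shows that multimodal learning (for example, vision-language models) suppresses noise directions and preserves the eigengap through low-rank constraints, improving data efficiency and classification performance, with notable advantages in few-shot settings such as medical imaging.

Abstract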

Deep vision models degrade sharply in low-data regimes, particularly in medical imaging where labeled samples are scarce. We show this arises not merely from overfitting but from a geometric failure: finite-sample noise corrupts the embedding covariance, collapsing the eigengap and limiting the number of recoverable signal-bearing modes. We develop a spectral theory of finite-sample representation learning that quantifies the recoverable dimension $K(N)$, the number of eigenmodes that can be stably estimated from $N$ samples. Using perturbation theory and concentration bounds, we show that only modes with eigenvalues above the noise floor $\|\hat{\Sigma} - \Sigma\|_{\mathrm{op}} \sim \sqrt{D/N}$ are reliable, yielding a truncated Mahalanobis energy that governs classification performance. Under a power-law spectral model, this energy can be approximated by a truncated Riemann zeta function, linking eigenvalue decay to data efficiency and AUC. Within this framework, multimodal learning acts as spectral stabilization: vision-language models impose low-rank constraints that suppress noise-dominated directions and preserve the eigengap, increasing $K(N)$ under data scarcity. Across MNIST and multi-disease neuroimaging, we show that multimodal training maintains more stable modes and improves class separation, even when unimodal models achieve comparable few-shot accuracy. These results identify spectral collapse as a fundamental bottleneck in low-data learning. We use truncated Mahalanobis energy and $K(N)$ to diagnose encoder quality, and introduce zeta-based spectral filtering as a principled approach to improve data efficiency.
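
As a rough illustration of the recoverable dimension, the sketch below counts the eigenvalues that sit above a noise floor proportional to sqrt(D/N). The abstract gives only the scaling, so the constant `c`, the power-law exponent, and the function names are assumptions made for this example.

```python
import math


def recoverable_dimension(eigenvalues, n_samples, dim, c=1.0):
    """Count eigenmodes above the noise floor c * sqrt(dim / n_samples).

    The constant `c` is a placeholder; the abstract states only that the
    floor scales like sqrt(D/N).
    """
    noise_floor = c * math.sqrt(dim / n_samples)
    return sum(1 for lam in eigenvalues if lam > noise_floor)


def power_law_spectrum(dim, alpha):
    """Hypothetical power-law spectrum lambda_k = k^(-alpha), k = 1..dim."""
    return [k ** -alpha for k in range(1, dim + 1)]


spectrum = power_law_spectrum(dim=64, alpha=1.0)
k_small = recoverable_dimension(spectrum, n_samples=100, dim=64)     # floor = 0.8
k_large = recoverable_dimension(spectrum, n_samples=10_000, dim=64)  # floor = 0.08
```

With these toy numbers only the leading mode survives at N = 100, while twelve modes clear the floor at N = 10,000, which is the qualitative dependence of $K(N)$ on sample size that the abstract describes.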

2605.08762 2026-05-12 cs.SD cs.LG

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

Tao Yu, Yiming Ding, Shenghua Chai, Minghui Zhang, Zhongtian Luo, Xinming Wang, Xinlong Chen, Zhaolu Kang, Junhao Gong, Yuxuan Zhou, Haopeng Jin, Zhiqing Cui, Jiabing Yang, YiFan Zhang, Hongzhu Yi, Zheqi He, Xi Yang, Yan Huang, Liang Wang

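AI summary: Current omni-modal benchmarks mainly evaluate models in settings where multiple modalities are provided simultaneously, while the ability to start from audio and actively search for cross-modal evidence remains little studied. This paper proposes Omni-DeepSearch, an audio-driven omni-modal deep search benchmark that requires models to extract clues from given audio clips and a related question, invoke text, image, and video search tools, and perform multi-hop reasoning to produce short, objective, and verifiable answers. The benchmark contains 640 samples covering four retrieval target modalities and four audio content types, with a multi-stage filtering pipeline that ensures task difficulty. Experiments show that the strongest current model reaches only 43.44% average accuracy, highlighting the research value of this direction.

Comments 43 pages

Abstract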

Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search. Given one or more audio clips and a related question, models must infer useful clues from audio, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. Omni-DeepSearch contains 640 samples across 15 fine-grained categories, covering four retrieval target modalities and four audio content types. A multi-stage filtering pipeline ensures audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness. Experiments on recent closed-source and open-source omni-modal models show that this task remains highly challenging: the strongest evaluated model, Gemini-3-Pro, achieves only 43.44% average accuracy. Further analyses illustrate key bottlenecks in audio entity inference, query formulation, tool-use reliability, multi-hop retrieval, and cross-modal verification. These results highlight audio-driven omni-modal deep search as an important and underexplored direction for future multimodal agents.

2605.08760 2026-05-12 cs.LG cs.DC

FedGMI: Generative Model-Driven Federated Learning for Probabilistic Mixture Inference

Qijun Hou, Yuchen Shi, Pingyi Fan, Khaled B. Letaief

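AI summary: This paper studies the performance degradation caused by data heterogeneity in federated learning and proposes FedGMI, a generative-model-based federated learning framework for probabilistic mixture inference. The method uses variational autoencoders to model the shared inherent distributions and infers the mixture components of each client's data, enabling structured collaborative learning while retaining personalization. Experiments show that FedGMI effectively discriminates the inherent distributions and accurately estimates mixture proportions, and that it maintains good performance under communication cost constraints.

Abstract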

Federated Learning (FL) facilitates collaborative model training across decentralized clients while preserving data privacy by avoiding raw data exchange. Despite its potential, FL performance is often compromised by data heterogeneity across clients. To address this, Clustered Federated Learning (CFL) groups clients with similar data distributions to improve model performance, but is constrained by intra-cluster heterogeneity. Conversely, Personalized Federated Learning (PFL) tailors models to individual clients, but usually neglects the underlying structural similarities among clients. In this work, we investigate a probabilistic mixture (PM) scenario, where each client's local data distribution is modeled as a convex combination of several shared inherent distributions. To effectively model this structure, we propose FedGMI, a framework that utilizes Variational Autoencoders (VAEs) as generative density estimators to represent these inherent distributions and infer the mixture components of clients' local data distributions. This approach enables structured personalization without sacrificing the benefits of collaborative learning. Extensive experiments demonstrate that FedGMI effectively characterizes and discriminates the inherent distributions, and accurately estimates mixture proportions. Furthermore, FedGMI maintains robust performance even under communication cost constraints.
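
The probabilistic mixture scenario can be illustrated with a small sketch of how one client's mixture proportions might be estimated once the shared densities are known. This is a generic EM update over mixture weights, not FedGMI itself: fixed 1D Gaussians stand in for the VAE density estimators, and every name and number below is an assumption.

```python
import math
import random


def make_gaussian_density(mean, std):
    """Return a 1D Gaussian pdf; a stand-in for a learned density estimator."""
    def pdf(x):
        z = (x - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
    return pdf


def estimate_mixture_weights(samples, densities, iterations=50):
    """EM over mixture weights only; the component densities stay fixed."""
    k = len(densities)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        totals = [0.0] * k
        for x in samples:
            joint = [w * p(x) for w, p in zip(weights, densities)]
            norm = sum(joint)
            for j in range(k):
                totals[j] += joint[j] / norm  # responsibility of component j for x
        weights = [t / len(samples) for t in totals]
    return weights


# Hypothetical client whose local data is 30% component A and 70% component B.
rng = random.Random(0)
densities = [make_gaussian_density(-4.0, 1.0), make_gaussian_density(4.0, 1.0)]
samples = [
    rng.gauss(-4.0, 1.0) if rng.random() < 0.3 else rng.gauss(4.0, 1.0)
    for _ in range(2000)
]
estimated = estimate_mixture_weights(samples, densities)
```

The estimated weights form a convex combination (non-negative, summing to one), which matches the modeling assumption stated in the abstract.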

2605.08757 2026-05-12 cs.RO

A Visuo-Tactile Data Collection System with Haptic Feedback for Coarse-to-Fine Imitation Learning

Yeseung Kim, Nayoung Oh, Jun Park, Teetat Thamronglak, Daehyung Park

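AI summary: This paper presents a data collection system that combines visual and tactile feedback to generate temporally structured, contact-rich demonstrations for imitation learning. The system uses a direct-drive gripper to preserve the operator's natural haptic feedback, integrates visual and tactile sensors to capture images and contact geometry, and provides a handle button for real-time annotation of task-critical regions. By fusing in-hand force perception with temporal annotation, it produces multimodal datasets suited to coarse-to-fine learning algorithms and supports the development of high-quality manipulation policies.

Abstract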

We present a visuo-tactile data-collection system that generates temporally structured, contact-rich demonstrations for imitation learning. Conventional systems often decouple the operator from contact forces, which hinders the demonstration of subtle force modulation. Our system introduces a direct-drive gripper that the operator actuates with the fingers, preserving natural haptic feedback. Integrated visual sensors and custom tactile arrays capture image streams and contact geometry. A handle-mounted push button enables the operator to annotate the task's temporal structure in real time by marking task-critical regions. By fusing in-hand force perception with in-situ temporal annotation, the system produces multimodal datasets designed for coarse-to-fine learning algorithms that exploit structural task knowledge, enabling the development of high-quality manipulation policies.

2605.08756 2026-05-12 cs.AI cs.NE

AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

Haoze Lv, Ning Lu, Ziang Zhou, Shengcai Liu

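AI summary: This paper proposes AHD Agent, a novel multi-turn framework that aims to improve the efficiency and effectiveness of automatic heuristic design (AHD) for complex combinatorial optimization problems. By integrating tool calling, the framework lets a large language model decide proactively whether to generate a heuristic or invoke a tool to retrieve key information from the environment, enabling more efficient exploration. The study introduces an agentic reinforcement learning training system that uses an environment synthesis pipeline to improve the model's generalization. Experiments show strong results across multiple domains, with performance comparable to much larger models and far fewer evaluations.

Comments 10 pages, 7 figures for main content

Abstract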

Automatic heuristic design (AHD) has emerged as a promising paradigm for solving NP-hard combinatorial optimization problems (COPs). Recent works show that large language models (LLMs), when integrated into well-designed frameworks (i.e., LLM-AHD), can autonomously discover high-performing heuristics. However, existing LLM-AHD frameworks typically treat LLMs as passive generators within fixed workflows, where the model generates heuristics from manually designed, limited context. Such context may fail to capture state-dependent information (e.g., specific failure modes), leading to inefficient trial-and-error exploration. To overcome these limitations, we propose AHD Agent, a novel tool-integrated, multi-turn framework that empowers LLMs to proactively decide whether to generate heuristics or invoke tools to retrieve targeted evidence from the solving environment. To effectively train such a dynamic decision-making agent, we introduce an agentic reinforcement learning (RL) system, which leverages a novel environment synthesis pipeline to optimize a compact model's generalizable AHD capabilities. Experiments across eight diverse domains, including four held-out tasks, demonstrate that our 4B-parameter agent matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations. Model and inference scaling analysis further reveals that AHD Agent offers an effective trajectory toward truly autonomous heuristic design.

2605.08755 2026-05-12 cs.LG

LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss

Euntae Choi, Sumin Song, Sungjoo Yoo

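AI summary: Large reasoning models (LRMs) reach near competition-level accuracy on math and coding tasks through long autoregressive decoding, which makes per-token decoding cost the main deployment bottleneck. This paper proposes LAQuant, a layer-wise weight quantization method with no online-transform overhead that combines reasoning-domain calibration with a one-layer lookahead loss to counter the accuracy loss that quantized models suffer on long decoding. Experiments show that LAQuant substantially speeds up decoding while retaining high reasoning accuracy.

Abstract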

Large reasoning models (LRMs) reach competition-level math and coding accuracy via long autoregressive decoding, making per-token decoding cost a primary deployment concern. Weight quantization is the standard tool for acceleration, but representative recipes -- including state-of-the-art end-to-end (E2E) QAT -- lose accuracy on long-decoding reasoning benchmarks despite preserving perplexity and short-decode accuracy. Through a systematic gradient-direction analysis, we identify two factors driving this gap: (i) KV-cache fidelity preservation under the QAT loss, which E2E supervision attenuates via the softmax Fisher metric; and (ii) Hessian-subspace alignment between calibration data and the deployment distribution. We propose LookAhead Quantization (LAQuant), a layer-wise weight-only QAT method that addresses both factors without online-transform overhead by combining reasoning-domain calibration with a one-layer lookahead loss whose implicit cross-layer co-adaptation preserves the next-layer residual stream. For Qwen3-4B under W3G128 quantization, LAQuant improves AIME25 Pass@1 over ParoQuant by 15.11pp (1.93pp over ParoQuant++ at matched calibration) while achieving a 3.42x decoding speedup over FP16 on RTX A6000, compared with ParoQuant's 3.01x.
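
For readers unfamiliar with the W3G128 setting, the sketch below shows plain group-wise round-to-nearest weight quantization, reading the label as 3-bit weights with a group size of 128 (a common convention, assumed here). It is a baseline illustration only and does not implement LAQuant's lookahead loss or any QAT step; the function names and the random weight row are hypothetical.

```python
import random


def quantize_group(weights, bits):
    """Asymmetric round-to-nearest quantization of one weight group."""
    levels = (1 << bits) - 1            # 3 bits -> integer codes 0..7
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to floating-point weights."""
    return [c * scale + lo for c in codes]


def fake_quantize(weights, bits=3, group_size=128):
    """Quantize then dequantize a weight row, one group at a time."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, lo = quantize_group(group, bits)
        restored.extend(dequantize_group(codes, scale, lo))
    return restored


rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(512)]   # hypothetical weight row
restored = fake_quantize(row)
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

Each group stores its own scale and offset, so the reconstruction error of any weight is bounded by half a quantization step of its group. Methods such as the one in the abstract aim to choose quantized weights that preserve downstream accuracy better than this nearest-value baseline.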

2605.08753 2026-05-12 cs.CV stat.ML

Simultaneous Monitoring of Shape and Surface Color via 4D Point Clouds: A Registration-free Approach

Mariafrancesca Patalano, Giovanna Capizzi, Kamran Paynabar

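AI summary: This paper proposes SMAC, a registration-free 4D point cloud framework for simultaneously monitoring changes in an object's shape and surface color. The method uses the spectral properties of the Laplace-Beltrami operator to capture the relationship between shape and color, and a combined monitoring scheme to detect shape deformations and color anomalies. It also introduces a spatially aware post-signal diagnostic procedure to localize the source of an anomaly. The approach is computationally efficient and needs neither registration nor mesh reconstruction, and experiments show strong performance in detecting subtle defects.

Comments 38 pages, 11 figures

Abstract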

Advanced manufacturing technologies allow for the production of intricate parts featuring high shape complexity and spatially-varying material composition. Data fusion of point clouds with chromatic attributes provides 4D point clouds, a compact and informative representation that encodes both shape and material information. In this paper, we present a registration-free framework for Simultaneous Monitoring of shApe and Color (SMAC) via 4D point clouds. The proposed framework leverages Laplace-Beltrami operator spectral properties to capture and monitor geometric features and the relationship between shape and surface color. A combined monitoring scheme is proposed to effectively detect shape deformations and color anomalies, along with a spatially-aware post-signal diagnostic procedure to determine the source of change and localize color anomalies. Importantly, neither component relies on registration or mesh reconstruction, eliminating error-prone and computationally expensive preprocessing steps. A Monte Carlo simulation study and a case study on functionally graded materials demonstrate that SMAC achieves effective detection performance, particularly for subtle defects, while providing diagnostic capabilities to identify the source and location of anomalies.

2605.08750 2026-05-12 cs.LG cs.AI cs.CL cs.MA

Communicating Sound Through Natural Language

Emanuele Rossi, Emanuele Rodolà

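AI summary: This study proposes lexical acoustic coding (LAC), a framework for transmitting sound through natural language that uses pre-trained language models as sender and receiver agents to encode and decode sound. The sender converts a waveform into interpretable acoustic descriptors, quantizes them with feature-specific vocabularies, and generates English text; the receiver parses the text and reconstructs the waveform. The method preserves acoustic structure while keeping the transmitted sound information interpretable and editable, showing the potential of natural language as a carrier for sound.

Comments Includes link to demo page

Abstract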

Natural language is widely used to describe, prompt, and control audio systems, but rarely serves as the representation carrying audio itself. We introduce lexical acoustic coding (LAC), a framework in which pre-trained LLM sender and receiver agents transmit sound through natural language. Under fixed system prompts, the agents write their own analysis and synthesis code, communicating only through a lexical sentence, shared vocabulary, and optional symbolic music structure. The sender analyzes an input waveform into interpretable, non-learned acoustic descriptors, quantizes each with a feature-specific interval vocabulary, and verbalizes the lexical code as English. The receiver parses the sentence back into lexical-acoustic constraints and renders a waveform through closed-loop refinement. The transmitted text serves as both a rich caption and as the transport representation itself. We frame LAC as a finite-rate lossy quantizer, exposing trade-offs between vocabulary size, rate, and fidelity. Experiments on short sounds and symbolic music transfer show that plain text preserves measurable acoustic structure while remaining interpretable, editable, and native to LLM-mediated communication.
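
The quantize-and-verbalize step can be pictured with a toy single-feature codec. The interval edges, the words, the sentence template, and the midpoint decoding rule below are all invented for illustration; the paper's actual vocabularies and agent-written code are not described in the abstract.

```python
import bisect
import math

# Hypothetical interval vocabulary for one descriptor (pitch in Hz).
PITCH_RANGE = (50.0, 8000.0)
PITCH_EDGES = [200.0, 800.0, 2000.0]
PITCH_WORDS = ["low", "mid", "high", "very high"]


def encode(pitch_hz):
    """Sender side: quantize the descriptor and verbalize it as English."""
    word = PITCH_WORDS[bisect.bisect_right(PITCH_EDGES, pitch_hz)]
    return f"a {word} pitch"


def decode(sentence):
    """Receiver side: parse the word and return the interval midpoint."""
    word = sentence[len("a "):-len(" pitch")]
    index = PITCH_WORDS.index(word)
    bounds = [PITCH_RANGE[0]] + PITCH_EDGES + [PITCH_RANGE[1]]
    return 0.5 * (bounds[index] + bounds[index + 1])


def bits_per_feature(words):
    """Rate of a uniform code over the vocabulary, in bits."""
    return math.log2(len(words))


sentence = encode(440.0)        # "a mid pitch"
estimate = decode(sentence)     # 500.0, the midpoint of [200, 800)
rate = bits_per_feature(PITCH_WORDS)
```

A four-word vocabulary carries two bits for this feature and recovers the pitch only to within its interval; a larger vocabulary would raise the rate and tighten the reconstruction, which is the size, rate, and fidelity trade-off the abstract refers to.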

2605.08749 2026-05-12 cs.LG

The Wristband Gaussian Loss: Deterministic, Composable Latents via a Sphere-Interval Decomposition

Mikhail Parakhin, André M. Carvalho, Patrick Haluptzok

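AI summary: This paper proposes the Wristband Gaussian Loss, a deterministic batch loss for Gaussianizing point embeddings without sampling, KL divergence terms, or iterative transport. The method maps each point to a direction and a CDF-transformed radius, embedding it in the product of a sphere and an interval, and proves that the map yields a uniform pushforward measure when the source is Gaussian. Experiments show strong results on several benchmarks, especially in high dimensions, and the loss can be combined with learnable-key attention to build a deterministic Gaussian autoencoder that handles both independent and dependent factors.

Comments preprint

Abstract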

We present the Wristband Gaussian Loss, a deterministic batch loss for Gaussianizing point embeddings without sampling, KL terms, or iterative transport. Each $x \in \mathbb{R}^d$ is mapped to a direction $u = x/\|x\|$ and a CDF-transformed radius $t = F_{\chi^2_d}(\|x\|^2)$ on the wristband $S^{d-1} \times [0,1]$. We prove (and machine-verify in Lean 4) that for $d \ge 2$ the pushforward wristband map equals $\sigma_{d-1} \otimes \mathrm{Unif}[0,1]$ iff the source is $\mathcal{N}(0, I_d)$, and that the Neumann-reflected wristband repulsion energy is uniquely minimized at the uniform target. We compute this reflected-kernel objective in two ways: a nearest three-image pairwise truncation at $O(N^2 d)$, and a spectral Neumann path joining angular and radial Mercer modes (spherical-harmonic and cosine) at $O(N d K)$, with empirically matched gradients. A 1D Wasserstein radial term and a moment penalty serve as finite-sample accelerators with the same optimum, and Monte-Carlo null calibration turns the components into a single standardized statistic. We evaluate direct point-cloud Gaussianization with a calibrated barycentric $W_2$ score: a deterministic Gaussian reference batch is built by recursive Hungarian averaging, with each method reported as a $z$-score against same-size Gaussian batches. On the axis-uniform X benchmark, Wristband is competitive in 2D and gives the best 10D score. On a harder radial-angular-copula impostor whose Gaussian radial and angular marginals are correct but dependent, Wristband gives the best 10D and 128D scores. Coupled with learnable-key Euclidean attention and exact invertible flows, the resulting Deterministic Gaussian Autoencoder delivers a Gaussian-latent interface for counterfactual sampling with independent factors and a context/residual construction for dependent factors.
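
The wristband map itself is simple enough to sketch directly from the definition in the abstract. To stay within the standard library, this version uses the closed-form chi-square CDF that holds only for even degrees of freedom, which is a restriction of this example and not of the method; the function names and the sample batch are assumptions, and the loss terms are not implemented.

```python
import math
import random


def chi2_cdf_even(x, d):
    """Chi-square CDF for even d: 1 - exp(-x/2) * sum_{k < d/2} (x/2)^k / k!."""
    if d % 2 != 0:
        raise ValueError("this closed form covers even d only")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, d // 2):
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total


def wristband_map(x):
    """Map x in R^d to (u, t): unit direction and CDF-transformed squared radius."""
    sq_norm = sum(v * v for v in x)
    if sq_norm == 0.0:
        raise ValueError("direction is undefined at the origin")
    norm = math.sqrt(sq_norm)
    u = [v / norm for v in x]
    t = chi2_cdf_even(sq_norm, len(x))
    return u, t


# For a standard Gaussian batch, t should look uniform on [0, 1].
rng = random.Random(0)
batch = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5000)]
radii = [wristband_map(x)[1] for x in batch]
mean_t = sum(radii) / len(radii)
```

On this Gaussian batch the transformed radii average close to 0.5, as a uniform variable on [0, 1] would, which is a quick sanity check of the radial half of the stated equivalence.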

2605.08746 2026-05-12 cs.LG math.DS math.OC

The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning

James Hazelden, Laura Driscoll, Eli Shlizerman, Eric Shea-Brown

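AI summary This paper studies the structure of the Global Empirical Neural Tangent Kernel (NTK) that arises when neural networks are trained with gradient descent, and shows its central role in parameter updates and state evolution. By treating the model state as the solution to a single global implicit constraint, the authors factor the NTK into a product of two operators, describing parameter-to-state and state-to-state relationships respectively, and prove that for a range of models including RNNs and Transformers the NTK has a computable core structure, which reveals its limited effective rank and self-referential bias. The study further shows that this structure constrains gradient descent learning, so that models tend to learn within dominant modes of hidden-state and input activity, and it provides a theoretical basis for understanding the emergence of low-rank representations.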

Comments Submitted to TMLR

Details
English abstract

In training a neural network with gradient descent (GD), each iteration induces a linear operator that governs first-order updates to a model's internal state variables. We define this operator as the Global Empirical Neural Tangent Kernel (NTK). In finite-width networks, the NTK is typically intractable to form, leading prior work to focus on restrictive settings such as tracking outputs only or taking infinite-width limits. Here, we study the structure of the NTK for a range of models. Formulating the model state as the solution to a single global implicit constraint, we derive the NTK as a product of two operators: K, accounting for immediate parameter-to-state interactions, and P, describing internal state-to-state dependencies. For a broad class of weight-based models, including RNNs and transformers, we prove a universal Kronecker-core theorem showing that K admits an exact, computable form given by the Gram matrix of weight-site variables. This core structure reveals that the NTK is structurally bottlenecked, constraining its effective rank and giving rise to a self-referential bias whereby GD preferentially learns within dominant modes of joint hidden and input activity. For recurrent models, we examine the spectrum of the NTK and show when it is biased and low-rank in space or time under the proposed decomposition. We further demonstrate that model dynamics at initialization bias the NTK, restricting learning and preventing task components from being learned effectively. Finally, we show that the NTK associated with a self-attention transformer is likewise structurally constrained to be low-rank. Overall, we show that the NTK possesses tractable structure that explains GD bias toward task solutions and the emergence of low-rank representations. To enable use of the NTK as a practical metric, we build kpflow, a library relying on randomized matrix-free numerical linear algebra.

2605.08742 2026-05-12 cs.CL cs.AI

Narrative Landscape: Mapping Narrative Dispositions Across LLMs

Donghoon Jung, Jiwoo Choi, Songeun Chae, Seohyon Jung

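AI summary This paper proposes a quantitative framework for characterizing stable, model-specific properties in the outputs of large language models under repeated, controlled elicitation. Using a structured narrative constraint-selection task run across six frontier models and three instruction types, the study defines a model's narrative disposition along two dimensions, "consistency" and "diversity", and introduces "Narrative Landscape", a PCA-based visualization that maps the selection profiles of different models into a shared space for comparison. The results reveal clear differences between model families along a rigidity-exploration spectrum, and show that instruction type changes the geometry of the selection space: even when scalar metrics are similar, the selection topologies can differ fundamentally.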

Comments Accepted to NLP4DH 2026, camera-ready version

Details
English abstract

This study proposes a quantitative framework for profiling LLM dispositions as stable, model-specific regularities in output under repeated, controlled elicitation. Using a structured narrative constraint-selection task administered across six frontier models and three instruction types, we operationalize disposition through two dimensions: "consistency", measured as cross-replication selection overlap via Jaccard similarity, and "diversity", measured as dispersion across options via the inverse Simpson index. We further introduce Narrative Landscape, a PCA-based visualization that maps each model's selection profile into a shared space for direct comparison. Results reveal a clear rigidity-exploration spectrum across model families and show that instruction types shift the geometry of selection spaces even when scalar metrics appear similar, indicating that comparable scores can mask qualitatively distinct selection topologies.
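
The two metrics named in the abstract have standard definitions, sketched below in Python. How the paper aggregates them across replications and models is not stated in the abstract, so the pairwise averaging in `mean_pairwise_jaccard` is an assumption, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two selection sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty selections count as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(replications):
    """Consistency proxy: average Jaccard overlap over all pairs of runs."""
    pairs = list(combinations(replications, 2))
    if not pairs:
        raise ValueError("need at least two replications")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def inverse_simpson(selections):
    """Diversity proxy: 1 / sum_i p_i^2 over the option frequencies p_i."""
    counts = Counter(selections)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one selection")
    return 1.0 / sum((c / total) ** 2 for c in counts.values())


if __name__ == "__main__":
    # Hypothetical option labels chosen in three repeated runs.
    runs = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
    print(mean_pairwise_jaccard(runs))
    print(inverse_simpson(["a", "b", "a", "b", "a", "c"]))
```

The inverse Simpson index equals the number of options when selections are spread evenly and drops to 1 when a single option is always chosen, which makes it a natural "effective number of options" reading of diversity.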

2605.08741 2026-05-12 cs.CL

Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

Zhengyang Zhao, Lu Ma, Wentao Zhang

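AI summary This study proposes On-Policy Harness Self-Distillation (OPHSD), which aims to improve large language models on complex reasoning tasks by exploiting inference-time auxiliary workflows (harnesses). The method uses the harness-augmented model as the teacher for self-distillation, folding the extra supervisory signal from the harness into the student model and thereby improving its standalone reasoning ability. Experiments show that OPHSD outperforms existing methods on multiple tasks, and that a harness can deliver its value during training without having to remain attached at inference time.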

Details
English abstract

Inference-time harnesses substantially improve large language models on complex reasoning tasks. However, the intrinsic capabilities of the underlying model remain unchanged by the addition of these external workflows. To bridge this gap, we introduce On-Policy Harness Self-Distillation (OPHSD), which employs the harness-augmented current model as a teacher for self-distillation, thereby introducing extra supervisory signals from the harness beyond training data. OPHSD internalizes task-specific harness capabilities into the student model, yielding robust generalizability and strong standalone performance across diverse reasoning tasks. Evaluated with a draft-verify harness for text classification and a plan-solve harness for mathematical reasoning, OPHSD consistently outperforms strong baselines (e.g., +10.83% over OPSD on HMMT25). Our analysis further indicates that reattaching the harness during inference yields no additional benefits and can even degrade performance, suggesting that complex harnesses need not always be permanent fixtures; instead, they can serve as temporary training scaffolds whose benefits are permanently fed back into the base model. Our code and training data are available at https://github.com/zzy1127/OPHSD-On-Policy-Harness-Self-Distillation.

2605.08740 2026-05-12 cs.LG cs.AI

Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

Nilesh Sarkar, Dawar Jyoti Deka

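AI summary This study examines the causal dimensionality of residual-stream representations in Transformer models and proposes a new measure, kappa(L, M, T), that quantifies the capacity of a layer for causal influence. Experiments show that as sparse autoencoder (SAE) width increases, representational capacity grows substantially while causal capacity grows more slowly, revealing a separation between representation and causation. The study also shows that kappa stays stable as model scale changes and varies in a structured way across network depth, offering a new perspective on the internal mechanisms of Transformers.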

Comments 9 pages, 17 figures, 14 tables (excluding references and appendices). Companion short paper under review at the ICML 2026 Mechanistic Interpretability Workshop. Code: https://anonymous.4open.science/r/NeurIPS-Causal-Capacity-in-SAEs-7D20/

Details
English abstract

Sparse autoencoders (SAEs) decompose transformer residual streams into interpretable feature dictionaries, yet the relationship between SAE width and causal influence on model output has not been systematically characterised. We introduce causal dimensionality kappa(L, M, T), defined as the effective rank of the expected Jacobian outer product at layer L, and show it can be estimated via the SAE width sweep paired with attribution patching. Across seven SAE widths from 16,384 to 1,048,576 features on Gemma-2-2B layer 12, representational capacity grows 15.6x while causal capacity grows only 4.35x: a robust separation we term the representational-causal wedge. A saturating fit yields kappa-hat approximately 1,990 with kappa-hat / d_model = 0.86 and participation-ratio lower bound kappa_PR approximately 280. Crucially, kappa is invariant to model scaling: Gemma-2-9B and Gemma-2-2B yield identical N_causal = 328 at the same SAE width despite a 3.46x parameter increase (the count is forced to 2% of SAE width by calibration; the substantive empirical claim is shape invariance of the AtP score distribution under matched seq=512 conditions). Across eight network depths kappa is constant while the absolute attribution threshold drops 20x from layer 1 to layer 23. Five controls (architecture invariance, threshold robustness, geometric privilege, synthetic ground-truth recovery, and a four-cell encoder/decoder ablation) pin down what kappa measures and what it does not. Our findings establish kappa as a measurable, model-intrinsic property of transformer layers: sub-linearly recoverable by SAE width, invariant to model scaling, and structured across network depth.
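
The abstract mentions a participation-ratio lower bound on kappa. The participation ratio of a spectrum is a standard effective-rank measure, sketched below; the paper's actual estimator (width sweep plus attribution patching) is not reproduced, and the function name and example spectra are hypothetical.

```python
def participation_ratio(eigenvalues):
    """Participation ratio (sum lambda_i)^2 / sum lambda_i^2.

    Equals n for n equal nonzero eigenvalues and 1 when a single
    eigenvalue carries all of the mass.
    """
    vals = [float(v) for v in eigenvalues]
    if any(v < 0.0 for v in vals):
        raise ValueError("eigenvalues of a PSD matrix must be non-negative")
    s1 = sum(vals)
    s2 = sum(v * v for v in vals)
    if s2 == 0.0:
        raise ValueError("spectrum is identically zero")
    return s1 * s1 / s2


if __name__ == "__main__":
    # Hypothetical spectra of a positive semi-definite matrix.
    print(participation_ratio([1.0, 1.0, 1.0, 1.0]))  # flat spectrum
    print(participation_ratio([2.0, 1.0]))            # uneven spectrum
```

Because the participation ratio never exceeds the number of nonzero eigenvalues, it is a conservative summary of dimensionality, which is consistent with the abstract reporting it as a lower bound.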

2605.08739 2026-05-12 cs.CV

ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting

Luchao Wang, Kaimin Liao, Qian Ren, Hua Wang, Zhi Chen, Yaohua Tang

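AI summary This paper proposes ReorgGS, a method for addressing parameterization degeneration in converged 3D Gaussian Splatting (3DGS) models. It treats the existing Gaussian set as an empirical probability field, resamples centers, and estimates anisotropic covariances, rebuilding a better-structured distribution that improves gradient accessibility for subsequent optimization. Unlike methods that simply reset opacity, ReorgGS rebuilds the distribution and visibility structure of the Gaussians, reducing redundant overlap while preserving scene representation capacity and improving both optimization quality and rendering efficiency.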

Details
English abstract

A converged 3D Gaussian Splatting (3DGS) model may approximate the target scene while remaining poorly parameterized for further optimization. We identify this failure mode as parameterization degeneration: high-opacity floaters attenuate gradients to true surfaces through alpha compositing, and redundant overlapping clusters create strongly coupled parameter blocks with nearly collinear Jacobian responses. These effects explain why continued optimization can plateau even when the model still contains removable artifacts. We propose ReorgGS, an equivalent distribution reorganization method for converged 3DGS models. ReorgGS treats the existing Gaussian set as an empirical probability field, resamples centers from it, estimates local anisotropic covariances with kNN, initializes low opacity, and continues optimization with the original 3DGS renderer and loss. Unlike opacity reset, which only rescales opacity on the old overlap graph, ReorgGS rebuilds centers, covariances, and visibility structure, thereby changing the graph itself. Our analysis shows that distributional equivalence is not optimization equivalence. The reorganized model preserves scene support while improving gradient accessibility under alpha compositing and reducing opacity-weighted overlap, thereby weakening local parameter coupling during subsequent optimization. Under the same additional optimization budget, ReorgGS improves fitting quality at a fixed Gaussian count, suppresses persistent floaters, and reduces rendering overhead from redundant overlap.

2605.08737 2026-05-12 cs.LG cs.CL

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

Xin Li, Hao Jiang, Annan Wang, Yichi Zhang, Chau Yuen

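AI summary This paper studies the "extrapolation cliff" that appears in on-policy distillation (OPD) on near-deterministic structured-output tasks when the reward-extrapolation coefficient exceeds a threshold. By analyzing a single-position Bernoulli reduction, the authors derive a closed-form safety threshold determined by the teacher's modal probability, the initial (warm-start) mass, and the importance-sampling clip strength, and show that beyond this threshold the model's output format shifts from structure-preserving to collapsing. Experiments on the Amazon Fashion dataset show that ListOPD, operated just below the threshold, lets a smaller Qwen3 student with only one-fifth of the baseline's parameters match the larger model on structured-output tasks.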

Details
English abstract

On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can surpass the teacher in-domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests (a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction) fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity sharply changes at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.

2605.08735 2026-05-12 cs.CV

CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

Joowon Kim, Seungho Shin, Joonhyung Park, Eunho Yang

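AI summary This paper proposes CollabVR, a collaborative video reasoning framework that addresses long-horizon drift and mid-clip simulation errors in video generation models (VGMs) on multi-step tasks. The method couples a vision-language model (VLM) with the VGM at the step level: at each step the VLM plans an action and then inspects and corrects the clip the VGM generates, improving the accuracy and robustness of reasoning. Experiments show that CollabVR significantly outperforms existing methods on multiple benchmarks, with the largest gains on complex tasks, and that it brings further improvements when combined with a reasoning-fine-tuned VGM.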

Details
English abstract

Recent "Thinking with Video" approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift on multi-step tasks and mid-clip simulation errors that compound. Both stem from the absence of explicit reasoning built upon the VGM's short-horizon visual prior, a role naturally filled by Vision-Language Models (VLMs), but where to place the VLM is non-trivial: upfront plans commit before any frame is generated and post-hoc critiques over whole videos intervene too late. We propose VLM-VGM Collaborative Video Reasoning (CollabVR), a closed-loop framework that couples the VLM with the VGM at step-level granularity: the VLM plans the immediate next action, inspects the clip the VGM generates, and folds the verifier's diagnosis directly into the next action prompt to repair detected failures. On Gen-ViRe and VBVR-Bench, CollabVR improves both open-source and closed-source VGMs over single-inference, Pass@$k$, and prior test-time scaling baselines at matched compute, with the largest gains on the hardest tasks. It also yields further improvements on top of a reasoning-fine-tuned VGM, indicating that step-level VLM supervision is orthogonal to and stackable with reasoning-oriented fine-tuning. We provide video samples and additional qualitative results at our project page: https://joow0n-kim.github.io/collabvr-project-page.

2605.08734 2026-05-12 cs.LG cs.AI cs.CL

AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation

Ziyun Liu, Fengmiao Bian, Jian-Feng Cai

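AI summary This paper proposes AdaPreLoRA, a low-rank adaptation method that addresses the breakdown of preconditioning in standard LoRA parameter updates caused by the rank-deficient Jacobian. The method combines Adafactor preconditioning with factor-space optimization: it introduces a diagonal Kronecker preconditioner in weight space and selects the update direction in factor space that minimizes a weighted imbalance, yielding more accurate parameter updates. Experiments show that AdaPreLoRA performs well on multiple natural language processing and diffusion-model tasks while keeping memory efficiency comparable to existing LoRA optimizers.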

Comments 27 pages

Details
English abstract

Low-Rank Adaptation (LoRA) reparameterizes a weight update as a product of two low-rank factors, but the Jacobian $J_G$ of the generator mapping the factors to the weight matrix is rank-deficient, so the factor-space preconditioner $J_G^* F_t J_G$ induced by any $W$-space preconditioner $F_t$ is singular, and consequently the standard chain rule cannot be uniquely inverted to map a preconditioned $W$-space direction back to a factor-space update. We cast existing LoRA optimizers in a unified framework parameterized by two choices: (i) which invertible surrogate for $J_G^* F_t J_G$ to use, and (ii) which $F_t$ on $W$ to use. Existing methods occupy four families along these axes: factor-space adaptive updates, block-diagonal surrogates for $J_G^* J_G$, Frobenius-residual pseudoinverse methods, and Riemannian manifold constraints. Within this design space, a gradient-statistics-aware $F_t$ paired with a closed-form factor-space solve at $O((m+n)r)$ memory remains underexplored. We propose AdaPreLoRA, which fills this gap by adopting the Adafactor diagonal Kronecker preconditioner $H_t$ on $W$ and selecting from the resulting factor-space solution family the element minimizing an $H_t$-weighted imbalance between the two factor contributions; by construction, the resulting factor update is the closest LoRA approximation to the preconditioned $W$-space direction under the $H_t$-weighted norm. Across GPT-2 (E2E), Mistral-7B and Qwen2-7B (GLUE, ARC, GSM8K), and diffusion-model personalization, AdaPreLoRA is competitive with or improves over a representative set of LoRA optimizers while keeping peak GPU memory at the LoRA optimizer level.
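
For readers unfamiliar with the Adafactor preconditioner referred to above, the following Python sketch shows its factored second-moment estimate for a single gradient matrix. It is a simplified illustration only: it omits Adafactor's exponential moving averages and update clipping, it does not show how AdaPreLoRA maps the preconditioned direction back to the LoRA factors, and the function names are hypothetical.

```python
import math


def factored_second_moment(grad, eps=1e-30):
    """Rank-1 estimate of the elementwise squared gradient.

    Keeps only row sums r (length m) and column sums c (length n) of
    G^2 + eps, and reconstructs V_hat[i][j] = r[i] * c[j] / sum(r).
    Storage for the statistics is O(m + n) instead of O(m * n).
    """
    sq = [[g * g + eps for g in row] for row in grad]
    row_sums = [sum(row) for row in sq]
    col_sums = [sum(col) for col in zip(*sq)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


def precondition(grad, eps=1e-30):
    """Scale each gradient entry by 1 / sqrt(V_hat) (diagonal preconditioning)."""
    v_hat = factored_second_moment(grad, eps)
    return [
        [g / math.sqrt(v) for g, v in zip(g_row, v_row)]
        for g_row, v_row in zip(grad, v_hat)
    ]


if __name__ == "__main__":
    # Hypothetical 2 x 2 gradient whose squared entries form a rank-1 matrix,
    # so the factored estimate reproduces them exactly.
    grad = [[1.0, 2.0], [2.0, 4.0]]
    print(factored_second_moment(grad, eps=0.0))
    print(precondition(grad, eps=0.0))
```

The reconstruction is an outer product of row and column statistics, which is why the resulting diagonal preconditioner can be described as having Kronecker structure.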

2605.08733 2026-05-12 cs.LG cs.IT math.IT

Generative Actor-Critic with Soft Bridge Policies

Ke He, Le He, Shunpu Tang, Yafei Wang, Lisheng Fan

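AI summary This paper studies how to train generative policies effectively in maximum-entropy online reinforcement learning and proposes soft generative actor-critic (SoftGAC). To address the unavailable marginal action densities and high inference cost that conventional generative models face during training, SoftGAC builds a structured bridge from a fixed latent variable to an action latent variable, so that the maximum-entropy objective can be recast as an analytically tractable path-wise relative-entropy objective. Experiments show that SoftGAC retains low-latency action generation while outperforming existing generative-policy baselines on multiple continuous-control tasks.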

Details
English abstract

Expressive generative policies such as diffusion and flow models are appealing for MaxEnt online reinforcement learning because of their ability to model multimodal and highly non-Gaussian action distributions. However, training effective soft generative policies faces two obstacles that often arise together. First, marginal action densities are often unavailable, so existing methods typically rely on entropy bounds, heuristic proxies or approximations. Second, iterative shared-parameter samplers raise inference cost and require backpropagation through time over repeated network evaluations, increasing memory cost and destabilizing policy optimization. These obstacles motivate us to seek a generative policy that exposes a tractable MaxEnt objective while requiring only a single sampled actor forward pass for action generation. To this end, we propose soft generative actor-critic (SoftGAC), whose actor defines a stochastic bridge from a fixed base latent to a terminal action latent in pre-tanh space. This structured bridge allows us to lift the MaxEnt objective as an analytically tractable path-wise relative-entropy objective against a high-entropy reference process. In practical finite-step implementation, this relative entropy reduces exactly to sampled transition control energy and thus provides principled soft regularization. Moreover, we keep the single-pass actor lightweight by using small step-specific bridge transitions, each evaluated only once per sampled action, while maintaining a parameter budget comparable to strong actor baselines. Extensive experiments on challenging continuous-control benchmarks show that SoftGAC attains higher or competitive returns than strong generative policy baselines, including diffusion and flow-matching policies, while staying in the low-latency regime of one-pass actors and showing considerable improvements in the compute-return tradeoff.

2605.08730 2026-05-12 cs.LG cs.CR

Classification-Head Bias in Class-Level Machine Unlearning: Diagnosis, Mitigation, and Evaluation

Weidong Zheng, Kongyang Chen, Yuanwei Guo, Yatie Xiao

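AI summary This paper studies classification-head bias in class-level machine unlearning and shows that existing methods may achieve forgetting of the target classes simply by lowering the corresponding bias terms in the classification head, without actually removing the model's dependence on those classes. The authors propose a diagnostic baseline called BiasShift, design two bias-aware mechanisms to mitigate excessive bias suppression, and introduce several bias-oriented evaluation metrics. Experiments show that the proposed methods improve the stability of the bias distribution while maintaining unlearning performance.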

Details
English abstract

Class-level machine unlearning aims to remove the influence of specified classes while preserving model utility on retained classes. Existing methods are commonly evaluated by retain-set accuracy, forget-set accuracy, and unlearning time, but these metrics provide limited insight into how forgetting is achieved internally. In this paper, we reveal a bias-dominated shortcut in class-level unlearning: the prediction of forgotten classes can be suppressed by decreasing the corresponding bias terms in the final classification head. We first analyze the gradient dynamics of classification-head biases under softmax cross-entropy training, explaining why retain-set-only optimization tends to reduce the biases of absent classes. Based on this observation, we introduce BiasShift as a diagnostic baseline, showing that simple bias manipulation can satisfy conventional unlearning metrics while leaving abnormal bias patterns that reveal forgotten labels. To mitigate excessive forgotten-class bias suppression, we propose two bias-aware mechanisms, namely Two-Stage Bias Gradient Reversal Mechanism (TS-BGRM) and Lower-Bound Hinge Regularization (LB-HR). We further introduce three bias-oriented metrics, including Bias Stability Coefficient (BSC), Median Bias Gap (MBG), and Minimal Bias Score (MBS), to quantify bias dependence and potential leakage. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that the proposed methods maintain competitive unlearning performance while producing more stable bias distributions. We have released our code at https://github.com/zwd2024/Beyond-the-Shadow-of-Bias-From-Classification-Head-Bias-to-Parameter-Redistribution.
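
The gradient argument in the abstract follows from a standard identity: for a linear head with logits $z_k = w_k^\top h + b_k$ and softmax cross-entropy loss, $\partial L / \partial b_k = p_k - \mathbb{1}[k = y]$. The Python sketch below illustrates it on a hypothetical three-class retain batch in which class 2 never appears as a label; the logits are made-up numbers and the function names are not from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bias_gradient(logits, label):
    """Per-sample gradient of softmax cross-entropy w.r.t. the head biases."""
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


def mean_bias_gradient(batch):
    """Average bias gradient over a batch of (logits, label) pairs."""
    n = len(batch)
    grad = [0.0] * len(batch[0][0])
    for logits, label in batch:
        for k, g in enumerate(bias_gradient(logits, label)):
            grad[k] += g / n
    return grad


if __name__ == "__main__":
    # Hypothetical retain-set batch: class 2 is the forgotten class
    # and therefore never occurs as a label.
    retain_batch = [
        ([2.0, 0.5, 0.1], 0),
        ([0.3, 1.5, 0.2], 1),
        ([1.0, 1.0, 0.4], 0),
    ]
    grad = mean_bias_gradient(retain_batch)
    # grad[2] is a mean of softmax probabilities, so it is strictly positive
    # and a gradient-descent step lowers the bias of the absent class.
    print(grad)
```

Since the absent class contributes only the positive $p_k$ term, every retain-only descent step pushes its bias down, which matches the shortcut the abstract describes.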

2605.08729 2026-05-12 cs.CV cs.GR cs.MM cs.SD

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

Shihao Cheng, Jiaxu Zhang, Quanyue Song, Shansong Liu, Zhizhi Guo, Xiaolei Zhang, Chi Zhang, Xuelong Li, Zhigang Tu

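AI summary Unison is a unified framework for the alignment problem caused by the asynchronous characteristics of motion, speech, and sound in human-centric video generation. It uses a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components, and applies bidirectional audio cross-attention with semantic-conditioned gating to improve acoustic clarity and reduce speech dominance. Unison also proposes a bidirectional cross-modal forcing strategy that synchronizes motion and audio through decoupled denoising schedules, markedly improving audio perceptual quality and cross-modal synchronization in the generated videos.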

Details
English abstract

Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.

2605.08727 2026-05-12 cs.CV cs.AI cs.LG

Control Your View: High-Resolution Global Semantic Manipulation in Learned Image Compression

Jiaming Liang, Chi-Man Pun, Weisi Lin, Greta Seng Peng Mok

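AI summary This paper studies high-resolution global semantic manipulation (GSM) in learned image compression systems and notes that existing methods have limited effect at high resolution. Through theoretical and empirical analysis, the authors show that a high-resolution GSM attack passes through three stages, Lazying, Oscillating, and Refining, and propose a Periodic Geometric Decay step-size schedule that enables $\ell_{\infty}$-bounded high-resolution GSM. Building on this, they modify PGD to obtain PGD$^{2}$-GSM, which is the first to achieve stable and efficient high-resolution GSM on the Kodak dataset, exposing a new security threat to learned image compression systems.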

Details
English abstract

Learned image compression (LIC) integrates deep neural networks (DNNs) to map high-dimensional images into compact latent representations, reducing redundancy and achieving superior rate-distortion (RD) performance in benign settings. Unfortunately, due to inherent vulnerabilities in DNNs, LIC systems are susceptible to adversarial perturbations that lead to downstream deterioration, compression rate degradation, untargeted distortion, and both local semantic manipulation (LSM) and low-resolution ($3\times28\times28$) global semantic manipulation (GSM). However, high-resolution GSM remains unexplored due to its intractability. Notably, the existing projected gradient descent (PGD) method achieves near-perfect white-box attacks for classification, segmentation, and other tasks, yet fails to generalize to high-resolution GSM. Our theoretical and empirical analyses reveal that well-performing GSM drives adversarial examples from the Identity Region to the Amplification Region through the Lazying-Oscillating-Refining stages. General $\ell_{\infty}$-bounded attacks fail on high-resolution GSM because their step-size schedules cannot accommodate both the Oscillating and Refining stages. Based on this, we propose the Periodic Geometric Decay schedule that enables $\ell_{\infty}$-bounded high-resolution GSM. To verify our approach, we integrate it with PGD, yielding a minimal variant, PGD$^{2}$-GSM. Extensive experiments on the Kodak dataset ($3\times768\times512$) demonstrate that our PGD$^{2}$-GSM is the first to stably achieve high-resolution GSM, thereby exposing a novel threat to LIC systems. Code is available at https://github.com/chinaliangjiaming/PGD2-GSM.
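
As background for the step-size discussion above, the sketch below shows a generic signed-gradient PGD loop under an $\ell_{\infty}$ constraint with a pluggable step-size schedule. The abstract does not give the formula of the Periodic Geometric Decay schedule, so `periodic_geometric_decay` is only one plausible reading of the name (geometric decay with periodic restarts) and should be treated as an assumption; the quadratic toy loss stands in for a real LIC model, and the pixel-range clipping used for images is omitted.

```python
def sign(v):
    """Sign of a scalar as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def periodic_geometric_decay(step0, gamma, period):
    """Hypothetical schedule: step0 * gamma^(t mod period), restarting each period."""
    def schedule(t):
        return step0 * gamma ** (t % period)
    return schedule


def pgd_linf(x0, grad_fn, eps, schedule, num_steps):
    """Minimise a loss inside the l_inf ball of radius eps around x0."""
    x = list(x0)
    for t in range(num_steps):
        g = grad_fn(x)
        alpha = schedule(t)
        # Signed-gradient descent step.
        x = [xi - alpha * sign(gi) for xi, gi in zip(x, g)]
        # Projection back onto the l_inf ball around the clean input.
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
    return x


if __name__ == "__main__":
    # Toy objective: squared distance to a hypothetical target point.
    target = [1.0, -0.05]
    grad_fn = lambda x: [2.0 * (xi - ti) for xi, ti in zip(x, target)]
    schedule = periodic_geometric_decay(step0=0.05, gamma=0.5, period=4)
    x_adv = pgd_linf([0.0, 0.0], grad_fn, eps=0.1, schedule=schedule, num_steps=20)
    print(x_adv)
```

In this toy run the first coordinate is driven to the boundary of the $\ell_{\infty}$ ball while the second settles inside it, which shows the role of the projection step regardless of which schedule is plugged in.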

2605.08724 2026-05-12 cs.CV

SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment

Weiren Zhao, Yi Dong, Cheng Chen

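AI summary This paper proposes SynerMedGen, a framework that unifies medical multimodal understanding and generation through task alignment, addressing the separation of understanding and generation objectives in existing models. The method introduces three generation-aligned understanding tasks and a two-stage training strategy, so that generation-beneficial representations learned during understanding training can support medical image synthesis. Experiments show that SynerMedGen performs well on multiple medical image generation tasks and generalizes well; the authors also release the SynerMed dataset, containing 1M paired synthesis samples and 2M generation-derived understanding instances, to support related research.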

Comments Accepted by ICML 2026

Details
English abstract

Unifying multimodal understanding and generation is a compelling frontier that is beginning to emerge in the medical field. However, the limited existing unified medical models typically treat understanding and generation as disjoint objectives, lacking a meaningful functional synergy. In this work, we identify and address a critical question in unified medical modeling: what form of understanding truly benefits generation. We present SynerMedGen, a unified framework built on the proposed principle of generation-aligned understanding, which synergizes understanding objectives with generation tasks via task alignment. SynerMedGen introduces three generation-aligned understanding tasks and a two-stage training strategy that transfers generation-beneficial representations learned during understanding training to medical image synthesis. Remarkably, even with understanding training alone, our SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks and demonstrates robust generalization to unseen datasets. When combined with generation training, SynerMedGen consistently outperforms state-of-the-art specialized medical image synthesis models as well as recent unified medical models. We also release a large-scale dataset named SynerMed consisting of 1M paired synthesis samples and 2M generation-derived understanding instances to support further research on understanding-generation synergy. Our project can be accessed at https://github.com/Mhilab/SynerMedGen.