arXivDaily: arXiv daily academic digest, updated Monday to Friday

2605.14555 2026-05-15 cs.SD cs.AI

Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

Shuyang Cui, Zhi Zhong, Qiyu Wu, Zachary Novack, Woosung Choi, Keisuke Toyama, Kin Wai Cheuk, Junghyun Koo, Yukara Ikemiya, Christian Simon, Chihiro Nagashima, Shusuke Takahashi

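Affiliations Sony Group Corporation; Sony AI

AI summary This paper proposes "Break-the-Beat!", a controllable MIDI-to-drum audio synthesis model that addresses the lack of fine-grained control when generating drum loop audio in digital music production. By adding a content encoder and a hybrid conditioning mechanism, it fine-tunes a pre-trained text-to-audio model so that drum audio can be generated with the timbre of a reference audio clip. Experiments show that the method performs well on audio quality, rhythmic alignment, and beat continuity, giving music producers an efficient and controllable creative tool.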

Abstract

Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing "Break-the-Beat!", a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine-tuning a pre-trained text-to-audio model with our proposed content encoder and an effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target-reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offers producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break-the-beat/

2605.14553 2026-05-15 cs.LG cs.AI

Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits

Donghao Li, Chengshuai Shi, Weijuan Ou, Cong Shen, Jing Yang

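Affiliations University of Virginia; Princeton University; Southern University of Science and Technology

AI summary This paper studies multi-objective prompt selection, which aims to efficiently identify the prompts that perform best across multiple performance metrics. The authors cast the problem in the pure-exploration bandit framework and introduce an efficient algorithm for structured bandits, with theoretical guarantees on the identification error in the linear case. Experiments show that the approach significantly outperforms baselines across multiple large language models, providing a principled and efficient solution for multi-objective prompt optimization.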

Comments Published as a conference paper at ICLR 2026

Abstract

Prompt engineering has become central to eliciting the capabilities of large language models (LLMs). At its core lies prompt selection -- efficiently identifying the most effective prompts. However, most prior investigations overlook a key challenge: the inherently multi-faceted nature of prompt performance, which cannot be captured by a single metric. To fill this gap, we study the multi-objective prompt selection problem under two practical settings: Pareto prompt set recovery and best feasible prompt identification. Casting the problem into the pure-exploration bandits framework, we adapt provably efficient algorithms from multi-objective bandits and further introduce a novel design for best feasible arm identification in structured bandits, with theoretical guarantees on the identification error in the linear case. Extensive experiments across multiple LLMs show that the bandit-based approaches yield significant improvements over baselines, establishing a principled and efficient framework for multi-objective prompt optimization.
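
As a rough illustration of the Pareto prompt set recovery setting, the sketch below filters a set of prompts down to those that no other prompt dominates. The prompt names and scores are made up, higher is assumed to be better on every metric, and the scores are treated as known exactly, whereas the bandit algorithms in the paper have to estimate them from noisy evaluations.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the (sorted) names of prompts that no other prompt dominates."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    )


# Hypothetical (accuracy, brevity) scores; higher is better for both.
scores = {
    "prompt_a": (0.90, 0.40),
    "prompt_b": (0.80, 0.70),
    "prompt_c": (0.70, 0.60),  # dominated by prompt_b on both metrics
    "prompt_d": (0.60, 0.90),
}
front = pareto_set(scores)
print(front)
```

Here `prompt_c` drops out because `prompt_b` beats it on both metrics, while the other three each win on at least one metric against every rival.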

2605.14551 2026-05-15 cs.LG

SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies

Hao Li, Lu Zhang, Liu Chong, Yankai Chen, Pengyang Wang, Yingjie Zhou

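Affiliations Sichuan University; Chengdu University of Information Technology; McGill University; University of Macau

AI summary This paper proposes SeesawNet, a new network architecture for non-stationary time series forecasting that balances the modeling of dependencies common across samples against instance-specific dependencies. Its core is Adaptive Stationary-Nonstationary Attention (ASNA), which extracts common dependencies from normalized sequences and specific dependencies from raw sequences, then fuses them adaptively according to each sample's non-stationarity. Experiments show that SeesawNet outperforms existing state-of-the-art methods on multiple real-world datasets.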

Comments Accepted by IJCAI-ECAI 2026, the 35th International Joint Conference on Artificial Intelligence. Code is at https://github.com/dreamone-Lee/SeesawNet

Abstract

Instance normalization (IN) is widely used in non-stationary multivariate time series forecasting to reduce distribution shifts and highlight common patterns across samples. However, IN can over-smooth instance-specific structural information that is essential for modeling temporal and cross-channel heterogeneity. While prior methods further suppress distribution discrepancies or attempt to recover temporal specific dependencies, they often ignore a central tension: how to adaptively model common and instance-specific dependency based on each instance's non-stationary structures. To address this dilemma, we propose SeesawNet, a unified architecture that dynamically balances common and instance-specific dependency modeling in both temporal and channel dimensions. At its core is Adaptive Stationary-Nonstationary Attention (ASNA), which captures common dependencies from normalized sequences and specific dependencies from raw sequences, and adaptively fuses them according to instance-level non-stationarity. Built upon ASNA, SeesawNet alternates dedicated temporal and channel relationship modeling to jointly capture long-range and cross-variable dependencies. Extensive experiments on multiple real-world benchmarks demonstrate that SeesawNet consistently outperforms state-of-the-art methods.
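
To make the trade-off behind instance normalization concrete, here is a minimal sketch for a single univariate window. It is not SeesawNet or ASNA; it only shows that two windows with the same shape but very different levels become indistinguishable after normalization, which is the kind of instance-specific information the abstract says can be lost. The window values are invented for illustration.

```python
import math
from typing import List


def instance_normalize(window: List[float], eps: float = 1e-8) -> List[float]:
    """Subtract the window's own mean and divide by its own standard deviation."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    std = math.sqrt(var + eps)
    return [(x - mean) / std for x in window]


# Two hypothetical windows: same trend, levels 100 units apart.
low = [1.0, 2.0, 3.0, 4.0]
high = [101.0, 102.0, 103.0, 104.0]

norm_low = instance_normalize(low)
norm_high = instance_normalize(high)

# After normalization the level difference is gone.
max_gap = max(abs(a - b) for a, b in zip(norm_low, norm_high))
print(max_gap)
```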

2605.14550 2026-05-15 cs.LG

Multi-Dimensional Model Integrity and Responsibility Assessment Index and Scoring Framework

Phuc Truong Loc Nguyen, Thanh Hung Do, Truong Thanh Hung Nguyen, Hung Cao

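Affiliations Friedrich-Alexander-Universität Erlangen-Nürnberg; University of New Brunswick

AI summary This paper proposes MIRAI, a unified evaluation framework that assesses the integrity and responsibility of AI models in high-stakes tabular domains across five dimensions (explainability, fairness, robustness, privacy, and sustainability) and combines them into a single score. Normalized, direction-aligned dimension scores allow direct comparison between models with different architectures and computational complexity. Experiments show that models with high predictive performance are not necessarily better in overall responsibility, and that some simple models achieve a better cross-dimensional balance than complex deep tabular models, offering a practical basis for responsible model selection in regulated settings.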

Comments Accepted to the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026)

Abstract

Artificial intelligence in high-stakes tabular domains cannot be evaluated by predictive performance alone, yet current practice still assesses explainability, fairness, robustness, privacy, and sustainability mostly in isolation. We propose the Model Integrity and Responsibility Assessment Index (MIRAI), a unified evaluation framework that measures tabular models across these five dimensions under a controlled comparison setting and aggregates them into a single score. MIRAI combines established metrics through normalized and direction-aligned dimension scores, which enables direct comparison across models with different architectural and computational profiles. Experiments on healthcare, financial, and socioeconomic datasets show that higher predictive performance does not necessarily imply better overall integrity and responsibility. In several cases, simpler models achieve a stronger cross-dimensional balance than more complex deep tabular architectures. MIRAI provides a compact and practical basis for responsible model selection in regulated settings.
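
The idea of normalized, direction-aligned dimension scores can be sketched as follows. This is an assumption-laden illustration, not MIRAI's actual formula: the abstract does not give the metrics, the normalization, or the weights, so the min-max scaling, the equal-weight mean, the model names, and the raw values below are all hypothetical.

```python
from typing import Dict


def aligned_scores(raw: Dict[str, float], higher_is_better: bool) -> Dict[str, float]:
    """Min-max scale one metric across models to [0, 1], flipping it if lower is better."""
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    out = {}
    for model, value in raw.items():
        scaled = 0.5 if span == 0 else (value - lo) / span
        out[model] = scaled if higher_is_better else 1.0 - scaled
    return out


def composite(dimensions: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Equal-weight mean of the aligned dimension scores (an assumed aggregation)."""
    models = list(next(iter(dimensions.values())).keys())
    return {m: sum(d[m] for d in dimensions.values()) / len(dimensions) for m in models}


# Hypothetical raw metrics for three models.
fairness_gap = {"logreg": 0.02, "gbdt": 0.06, "deep_tab": 0.10}   # lower is better
robust_acc = {"logreg": 0.70, "gbdt": 0.80, "deep_tab": 0.75}     # higher is better

dims = {
    "fairness": aligned_scores(fairness_gap, higher_is_better=False),
    "robustness": aligned_scores(robust_acc, higher_is_better=True),
}
index = composite(dims)
best = max(index, key=index.get)
print(best, index)
```

Flipping lower-is-better metrics before averaging is what makes the dimensions comparable: after alignment, 1.0 always means "best among the compared models".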

2605.14548 2026-05-15 cs.CV

Local Spatiotemporal Convolutional Network for Robust Gait Recognition

Xiaoyun Wang, Cunrong Li, Wu Wang

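Affiliations School of Mechanical and Electrical Engineering, Osh State University

AI summary This paper studies how to robustly recognize gait features from video sequences in order to improve the accuracy and stability of gait recognition. To address the heavy computational demands of existing methods and their difficulty in capturing the intrinsic motion patterns across consecutive frames, the authors propose the Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet effective model. A Global Bidirectional Spatial Pooling mechanism and a local spatiotemporal convolutional layer let standard 2D convolutional networks extract spatiotemporal gait features. The method reduces computational complexity while improving robustness to covariates such as viewpoint changes and clothing variations.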

Abstract

Gait recognition, as a promising biometric technology, identifies individuals through their unique walking patterns and offers distinctive advantages including non-invasiveness, long-range applicability, and resistance to deliberate disguise. Despite these merits, capturing the intrinsic motion patterns concealed within consecutive video frames remains challenging due to the complexity of video data and the interference of external covariates such as viewpoint changes, clothing variations, and carrying conditions. Existing approaches predominantly either rely on static appearance features extracted from individual silhouette frames or employ complex sequential models (e.g., LSTM, 3D convolutions) that demand substantial computational resources and sophisticated training strategies. To address these limitations, we propose a Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet highly effective dual-branch architecture that endows standard two-dimensional convolutional networks with the capacity to extract temporal information. Specifically, we introduce a Global Bidirectional Spatial Pooling (GBSP) mechanism that reduces the dimensionality of gait tensors by decomposing spatial features into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in standard 2D convolution operations. Building upon this, we design a Local Spatiotemporal Convolutional (LSTC) layer that jointly processes temporal and spatial dimensions, allowing the network to adaptively learn strip-based gait motion patterns. We further extend this formulation with asymmetric convolution kernels that independently attend to the temporal, spatial, and joint spatiotemporal domains, thereby enriching the extracted feature representations.
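
A small sketch may help picture the strip-based decomposition. It pools each frame along its width to get one value per horizontal strip and along its height to get one value per vertical strip, so a sequence of T frames becomes a T×H and a T×W array that an ordinary 2D convolution could slide over in time and space. Mean pooling and the tiny frames are assumptions for illustration; the abstract does not state which pooling operator GBSP uses.

```python
from typing import List, Tuple

Frame = List[List[float]]


def strip_pool(frame: Frame) -> Tuple[List[float], List[float]]:
    """Return (horizontal strips, vertical strips) via mean pooling (an assumed operator)."""
    h, w = len(frame), len(frame[0])
    horizontal = [sum(row) / w for row in frame]                               # length H
    vertical = [sum(frame[r][c] for r in range(h)) / h for c in range(w)]      # length W
    return horizontal, vertical


# Two hypothetical 2x3 silhouette frames.
sequence: List[Frame] = [
    [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]],
]

# Stack per-frame strips: time becomes the first axis of a 2D array.
time_by_rows = [strip_pool(f)[0] for f in sequence]   # shape T x H
time_by_cols = [strip_pool(f)[1] for f in sequence]   # shape T x W
print(time_by_rows, time_by_cols)
```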

2605.14546 2026-05-15 cs.LG

Discovering Physical Directions in Weight Space: Composing Neural PDE Experts

Pengkai Wang, Pengwei Liu, Yuanyi Wang, Guanyu Chen, Xingyu Ren, Xiaolong Li, Zhongkai Hao, Yuting Kong, Qixin Zhang, Dong Ni

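Affiliations The Hong Kong Polytechnic University; Zhejiang University; Tsinghua University; Nanyang Technological University

AI summary This study asks whether reusable physical structure forms in weight space when a shared neural partial differential equation (PDE) operator is fine-tuned to different physical regimes. It finds that the fine-tuning updates can be decomposed into a shared adaptation and a direction aligned with the physical parameter, revealing physical directions in weight space. Building on this, the authors propose Calibration-Conditioned Merge (CCM), a post-hoc method that composes regime experts along the physical direction using physical metadata or an initial observation prefix, substantially improving prediction in out-of-distribution regimes.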

Abstract

Recent advances in neural operators have made partial differential equation (PDE) surrogate modeling increasingly scalable and transferable through large-scale pretraining and in-context adaptation. However, after a shared operator is fine-tuned to multiple regimes within a continuous physical family, it remains unclear whether the resulting weight-space updates merely form isolated regime experts or reveal reusable physical structure. Starting from a shared family anchor, we fine-tune low- and high-regime endpoint experts and show that their updates can be separated into a family-shared adaptation and a direction aligned with the underlying physical parameter. This separation reinterprets endpoint experts as finite-difference probes of a local physical direction in weight space, explaining why static averaging can interpolate between regimes but attenuates endpoint-specific physics. Building on this perspective, we propose Calibration-Conditioned Merge (CCM), a post-hoc coordinate readout method for composing neural PDE experts along this physical direction. Given physical metadata, a calibrated coordinate mapping, or a short observed rollout prefix, CCM infers the target composition coordinate and deploys a single merged checkpoint for the remaining rollout. We evaluate CCM on the reaction-diffusion system, viscosity-parameterized two-dimensional Navier-Stokes equations, and radial dam-break dynamics. Across these benchmarks, CCM achieves its strongest gains in extrapolative regimes, reducing out-of-distribution rollout error relative to the family anchor by 54.2%, 42.8%, and 13.8%, respectively. Further experiments across FNO scales, a DPOT-style backbone, and ablations confirm that endpoint fine-tuning is not arbitrary checkpoint drift, but reveals a calibratable physical direction for training-free transfer across PDE regimes.
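
The separation into a shared adaptation and a physical direction can be pictured with flat weight vectors. The sketch below is one plausible reading, not the paper's stated formulation: it assumes the shared part is the mean of the two endpoint updates, the direction is half their difference, and a scalar coordinate `alpha` moves along that direction. With these assumptions, `alpha = -1` and `alpha = +1` recover the two experts, `alpha = 0` is the static average, and `|alpha| > 1` extrapolates. All weights are toy numbers, and how CCM actually infers the coordinate is not modelled here.

```python
from typing import List, Tuple

Vector = List[float]


def decompose(anchor: Vector, low: Vector, high: Vector) -> Tuple[Vector, Vector]:
    """Split the two endpoint updates into a shared part and a direction (assumed definitions)."""
    d_low = [l - a for l, a in zip(low, anchor)]
    d_high = [h - a for h, a in zip(high, anchor)]
    shared = [(x + y) / 2 for x, y in zip(d_low, d_high)]
    direction = [(y - x) / 2 for x, y in zip(d_low, d_high)]
    return shared, direction


def merge(anchor: Vector, shared: Vector, direction: Vector, alpha: float) -> Vector:
    """Compose one checkpoint at coordinate alpha along the direction."""
    return [a + s + alpha * d for a, s, d in zip(anchor, shared, direction)]


# Toy two-parameter "checkpoints".
anchor = [0.0, 1.0]
low_expert = [1.0, 1.0]
high_expert = [3.0, 2.0]

shared, direction = decompose(anchor, low_expert, high_expert)
averaged = merge(anchor, shared, direction, alpha=0.0)       # static average of the experts
extrapolated = merge(anchor, shared, direction, alpha=1.5)   # beyond the high-regime expert
print(averaged, extrapolated)
```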

2605.14544 2026-05-15 cs.AI

Complacent, Not Sycophantic: Reframing Large Language Models and Designing AI Literacy for Complacent Machines

Federico Germani, Giovanni Spitale

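Affiliations Institute for Data Science and Artificial Intelligence, Boğaziçi University

AI summary This paper re-examines the behaviour of large language models (LLMs), arguing that describing them as "sycophantic" is conceptually misleading and that their behaviour is better understood as "complacency": a tendency to agree with user input because training data, reward signals, and design favour agreement over correction. The authors stress that models themselves have no sycophantic motives and that their behaviour depends on developers' intentions and system design. They therefore argue that AI literacy education should help users recognise and counter the confirmation bias that models may reinforce.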

Abstract

Large language models are often described as sycophantic, in the sense that they appear to flatter users or mirror their beliefs. We argue that this label is conceptually misleading: sycophancy implies motives and strategic intent, which LLMs do not possess. Their behaviour is better understood as complacency, a structural tendency to agree with user input because training data, reward signals and design favour agreement and reinforcement over correction. We argue that this distinction matters. Whether developers act sycophantically or not, models themselves never are sycophants; they can only be made more or less complacent. This reframing locates agency in developers and institutions, not in the model. Because complacent models reinforce users' prior beliefs, we argue that AI literacy educational approaches should particularly focus on strategies to counter confirmation bias.

2605.14543 2026-05-15 cs.LG cs.AI

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

Shuhao Chen, Weisen Jiang, Changmiao Wang, Xiaoqing Wu, Xuanren Shi, Yu Zhang, James T. Kwok

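Affiliations The Hong Kong University of Science and Technology; Southern University of Science and Technology; The Chinese University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; Shenzhen University General Hospital

AI summary RxEval is a prescription-level benchmark for evaluating the medication recommendation ability of large language models (LLMs), designed to address the shortcomings of existing benchmarks on fine-grained medication recommendation. Using multiple-choice questions, it requires models to select specific medication-dose-route combinations from real prescriptions and generated distractors, given a detailed patient profile and a time-ordered clinical trajectory. Experiments show that RxEval discriminates well between models and that even state-of-the-art models still struggle to understand and reason over real clinical information.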

Abstract

Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
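
The gap between F1 and Exact Match reported above is easier to see on a single question. The sketch below scores one multi-select answer with set-based F1 and exact match; the option labels are invented, and the assumption that RxEval computes its metrics per question in exactly this way goes beyond what the abstract states.

```python
from typing import Set, Tuple


def f1_and_exact_match(predicted: Set[str], gold: Set[str]) -> Tuple[float, bool]:
    """Set-based F1 and exact match for one multi-select question."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0, predicted == gold
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, predicted == gold


# Hypothetical option labels, each standing for a medication-dose-route triple.
gold = {"A", "C", "D"}
predicted = {"A", "C", "E"}   # two correct picks, one distractor, one miss

f1, exact = f1_and_exact_match(predicted, gold)
print(round(f1, 3), exact)
```

A model can therefore earn substantial partial credit under F1 while still failing Exact Match, which requires the whole selected set to be right.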

2605.14542 2026-05-15 cs.AI

VerbalValue: A Socially Intelligent Virtual Host for Sales-Driven Live Commerce

Yuyan Chen

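Affiliations Cornell University

AI summary This study proposes VerbalValue, a virtual live-commerce host that aims to raise sales conversion through stronger verbal ability. Its core approach comprises building a product knowledge base and a sales terminology lexicon, collecting and annotating a large set of live-commerce interactions, and fine-tuning a large language model on that data to produce more empathetic and persuasive responses. Experiments show that the model outperforms several mainstream large models on informativeness, factual correctness, and viewer engagement, demonstrating notable potential for commercial application.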

Comments Accepted to the CVPR 2026 HiGen Workshop

Abstract

A skilled live-commerce host is not merely a narrator, but a sales agent who converts viewer curiosity into purchase intent through expert product knowledge, emotionally intelligent response tactics, and entertainment that serves as a vehicle for product exposure. Yet no existing AI system replicates this: conversational recommenders treat recommendation as a terminal act, while general-purpose LLMs hallucinate product claims and default to generic promotional templates that fail to engage or persuade. We present VerbalValue, a sales-conversion-oriented virtual host that turns exceptional verbal ability into real commercial value, built on three contributions. First, we construct a domain knowledge base of product specifications and a curated sales terminology lexicon that anchor product-related responses in verified expertise. Second, we collect and annotate 1,475 live-commerce interactions spanning diverse viewer intents. Third, we fine-tune a large language model on this data to deliver empathetic, commercially oriented responses, adapting to viewer intent through empathetic amplification, evidence-backed rebuttal, and humor-mediated deflection. Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 23% on informativeness and 18% on factual correctness, with consistent advantages in tactfulness and viewer engagement.

2605.14539 2026-05-15 cs.CL

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

Mengjie Ren, Jie Lou, Boxi Cao, Xueru Wen, Hongyu Lin, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu

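Affiliations Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xiaohongshu Inc

AI summary This paper proposes CIPO, a correction-oriented policy optimization method that addresses the inefficient learning caused by sparse rewards and weak credit assignment in Reinforcement Learning with Verifiable Rewards (RLVR). It converts the model's own failed trajectories into correction-oriented supervision without relying on external information, improving both error-correction ability and learning effectiveness. Experiments show that CIPO significantly outperforms existing methods on multiple mathematical reasoning and code generation benchmarks and strengthens the model's intrinsic reasoning capacity.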

Comments Work in progress

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
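
For readers unfamiliar with pass@K, the sketch below implements the commonly used unbiased estimator: given n sampled solutions of which c are correct, it is the probability that at least one of k solutions drawn without replacement is correct. Whether this paper uses exactly this estimator is an assumption; the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k) from n samples with c correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples for one problem, 2 of them correct.
p1 = pass_at_k(n=10, c=2, k=1)
p5 = pass_at_k(n=10, c=2, k=5)
print(round(p1, 4), round(p5, 4))
```

A model that merely concentrates probability on answers it already gets right tends to raise pass@1 without raising pass@K at larger K, which is why the abstract reads stronger pass@K gains as evidence of improved underlying capacity.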

2605.14535 2026-05-15 cs.LG

Exploring Geographic Relative Space in Large Language Models through Activation Patching

Stef De Sabbata, Rahul Baiju, Stefano Mizzaro, Kevin Roitero

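Affiliations School of Geography, Geology and the Environment, University of Leicester, UK; Department of Mathematics, Computer Science and Physics, University of Udine, Italy

AI summary This paper examines how large language models (LLMs) process relative geographic space internally, using activation patching to reveal the underlying mechanisms by which they handle geographic relations. The study aims to improve understanding of LLM behaviour on geographic tasks and to support the safe and effective use of such models.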

Abstract

The increased use of Large Language Models (LLMs) in geography raises substantial questions about the safety of integrating these tools across a wide range of processes and analyses, given our very limited understanding of their inner workings. In this extended abstract, we examine how LLMs process relative geographic space using activation patching, an emerging tool for mechanistic interpretability.

2605.14534 2026-05-15 cs.CV cs.AI cs.MM

PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

Fuhao Li, Shaofeng You, Jiagao Hu, Yu Liu, Yuxuan Chen, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan

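Affiliations MiLM Plus, Xiaomi Inc.

AI summary Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many and existing metrics often disagree with human perception. To address this, the paper proposes the RC (Removal Coherence) metrics, RC-S and RC-T, which measure the perceptual coherence of removed regions in the spatial and temporal dimensions respectively, and builds the PROVE-Bench benchmark to support community evaluation. Experiments show that the RC metrics align more closely with human perception than existing methods across a range of image and video benchmarks.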

Comments Project Page: https://xiaomi-research.github.io/prove/

Abstract

Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for RC metrics and PROVE-Bench is publicly available at: https://github.com/xiaomi-research/prove/.

2605.14527 2026-05-15 cs.LG cond-mat.mtrl-sci physics.comp-ph

Lang2MLIP: End-to-End Language-to-Machine Learning Interatomic Potential Development with Autonomous Agentic Workflows

Wenwen Li, Yuki Orimo, Nontawat Charoenphakdee

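Affiliations Preferred Networks, Inc.

AI summary This paper proposes Lang2MLIP, a multi-agent framework that develops machine learning interatomic potentials (MLIPs) end to end from natural-language input, lowering the barrier for non-experts. It formulates MLIP development as a sequential decision-making problem in which a decision-making agent driven by a large language model automatically selects actions to improve the model, with no predefined pipeline and with the ability to self-correct. The approach is evaluated on a multi-component solid electrolyte interphase system, and the results suggest that LLM-based multi-agent systems are a promising route to automating MLIP development.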

Comments 31 pages, 12 figures

Abstract

Developing machine learning interatomic potentials (MLIPs) for complex materials systems remains challenging because it requires expertise in atomistic simulations, machine learning, and workflow design, as well as iterative active learning procedures. Existing automated pipelines typically assume a fixed sequence of stages or depend on domain experts, which limits their adaptability to heterogeneous materials systems where the optimal curriculum is not known in advance. To lower the barrier to developing MLIPs for non-experts, we propose Lang2MLIP, a multi-agent framework that takes natural-language input and formulates end-to-end MLIP development as a sequential decision-making problem solved by large language models (LLMs). At each step, a decision-making agent observes the current dataset, model, evaluation results, and execution log, and then automatically selects an appropriate action to improve the model. This removes the need for a predefined pipeline and enables the agent to self-correct by revisiting earlier subsystems when new failures arise. We evaluate this approach on a solid electrolyte interphase (SEI) system with multiple components and interfaces. These results suggest that LLM-based multi-agent systems are a promising direction for automating MLIP development and making it more accessible to non-experts.

2605.14525 2026-05-15 cs.CV

From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper

Ling Li, Changjie Chen, Yuyan Wang, Jiaqing Lyu, Kenglun Chang, Yiyun Chen, Zhidong Deng

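Affiliations Department of Computer Science, THUAI, BNRist, Tsinghua University, Beijing, China; Dalian University of Technology, Dalian, China; Apple, Beijing, China; Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; University of Manchester, Manchester, UK

AI summary In multi-view 3D human pose estimation, conventional methods usually rely on images captured from different views at the same instant to predict the pose at that instant, overlooking the rich temporal dependencies between adjacent frames. This paper proposes a new input scheme, sparse interleaved input, in which different views are captured at different time points so that the model can exploit rich spatio-temporal information and improve performance. The scheme can raise the output pose frame rate with multiple cameras, beyond the single-view frame rate limit, while also reducing data redundancy. The paper introduces the DenseWarper model, which uses epipolar geometry for efficient spatio-temporal heatmap exchange and achieves state-of-the-art performance on multiple datasets, surpassing conventional dense-input methods.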

Abstract

In multi-view 3D human pose estimation, models typically rely on images captured simultaneously from different camera views to predict a pose at a specific moment. While providing accurate spatial information, this traditional approach often overlooks the rich temporal dependencies between adjacent frames. To address this, we propose a novel input method for 3D human pose estimation: the sparse interleaved input. This method leverages images captured from different camera views at various time points (e.g., View 1 at time $t$ and View 2 at time $t+\delta$), allowing our model to capture rich spatio-temporal information and effectively boost performance. More importantly, this approach offers two key advantages: First, it can theoretically increase the output pose frame rate by N times with N cameras, thereby breaking through single-view frame rate limitations and enhancing the temporal resolution of the production. Second, using a sparse subset of available frames, our method can reduce data redundancy and simultaneously achieve better performance. We introduce the DenseWarper model, which leverages epipolar geometry for efficient spatio-temporal heatmap exchange. We conducted extensive experiments on the Human3.6M and MPI-INF-3DHP datasets. Results demonstrate that our method, utilizing only sparse interleaved images as input, outperforms traditional dense multi-view input approaches and achieves state-of-the-art performance. The source code for this work is available at: https://github.com/lingli1724/DenseWarper-ICLR2026
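
The frame-rate argument can be checked with a small timing sketch. It assumes the N cameras are staggered evenly within one camera period, which is only one possible reading of "various time points"; the camera count and frame rate are illustrative. Exact fractions are used so that the capture gaps compare cleanly.

```python
from fractions import Fraction
from typing import List, Tuple


def interleaved_schedule(
    num_cameras: int, camera_fps: int, num_cycles: int
) -> List[Tuple[Fraction, int]]:
    """Return (timestamp in seconds, camera index) pairs for evenly staggered cameras."""
    period = Fraction(1, camera_fps)    # time between two frames of the same camera
    offset = period / num_cameras       # assumed even stagger between cameras
    return [
        (cycle * period + cam * offset, cam)
        for cycle in range(num_cycles)
        for cam in range(num_cameras)
    ]


# Illustrative setup: 4 cameras, each running at 25 frames per second.
schedule = interleaved_schedule(num_cameras=4, camera_fps=25, num_cycles=2)
gaps = [later[0] - earlier[0] for earlier, later in zip(schedule, schedule[1:])]
effective_fps = 1 / gaps[0]
print(effective_fps)
```

With one view arriving every 1/100 s, a model that outputs a pose per incoming view would run at 4 × 25 = 100 poses per second, matching the N-times claim in the abstract.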

2605.14521 2026-05-15 cs.LG

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

Yuxin Guo, Yihao Yue, Yunhao Ni, Yizhou Ruan, Jie Luo, Wenjun Wu, Lei Huang

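Affiliations State Key Laboratory of Complex and Critical Software Environment (SKLCCSE); Beihang University; Hangzhou International Innovation Institute

AI summary This paper studies how to replace layer normalization (LN) in deep neural networks with the computationally cheaper RMSNorm without changing the model function. The core idea is to fold LN's centering operation into the upstream linear layer through the column-centered constraint (CCC) and column-based weight centering (CBWC), yielding an equivalent replacement. The method applies to a variety of deep network architectures, and experiments show inference speedups of 2% to 12% across multiple tasks while preserving predictive performance.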

Comments 33 pages, 21 figures

Abstract

Layer normalization (LN) is a fundamental component in modern deep learning, but its per-sample centering and scaling introduce non-negligible inference overhead. RMSNorm improves efficiency by removing the centering operation, yet this may discard benefits associated with centering. This paper proposes a framework to determine whether an LN in an arbitrary DNN can be replaced by RMSNorm without changing the model function. The key idea is to fold LN's centering operation into upstream general linear layers by enforcing zero-mean outputs through the column-centered constraint (CCC) and column-based weight centering (CBWC). We extend the analysis to arbitrary DNNs, define such LNs as foldable LNs, and develop a graph-based detection algorithm. Our analysis shows that many LNs in widely used architectures are foldable, enabling exact inference-time conversion and end-to-end acceleration of 2% to 12% without changing model predictions. Experiments across multiple task families further show that, when exact equivalence is partially broken in practical training settings, our method remains competitive with vanilla LN while improving efficiency.
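
The folding idea can be verified numerically on a single linear layer. In the sketch below, the weight matrix has one row per output unit; subtracting each column's mean over the output units (and the mean of the bias) makes the layer's output zero-mean for any input, so RMSNorm on the centered layer matches LN on the original one. This is a minimal illustration with toy numbers and without the learnable gain and bias of the normalization layers, not the paper's CBWC implementation or its graph-based detection.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def linear(W: Matrix, b: Vector, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def layer_norm(y: Vector, eps: float = 1e-6) -> Vector:
    mean = sum(y) / len(y)
    var = sum((v - mean) ** 2 for v in y) / len(y)
    return [(v - mean) / math.sqrt(var + eps) for v in y]


def rms_norm(y: Vector, eps: float = 1e-6) -> Vector:
    mean_square = sum(v * v for v in y) / len(y)
    return [v / math.sqrt(mean_square + eps) for v in y]


def center_columns(W: Matrix, b: Vector) -> Tuple[Matrix, Vector]:
    """Subtract, per input column, the mean over output units; center the bias too."""
    d_out, d_in = len(W), len(W[0])
    col_means = [sum(W[r][c] for r in range(d_out)) / d_out for c in range(d_in)]
    W_centered = [[W[r][c] - col_means[c] for c in range(d_in)] for r in range(d_out)]
    b_mean = sum(b) / d_out
    return W_centered, [bi - b_mean for bi in b]


# Toy layer with 2 inputs and 3 outputs.
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b = [0.1, -0.2, 0.3]
x = [0.7, -1.3]

W_c, b_c = center_columns(W, b)
ln_out = layer_norm(linear(W, b, x))          # original layer followed by LN
rms_out = rms_norm(linear(W_c, b_c, x))       # centered layer followed by RMSNorm
max_diff = max(abs(p - q) for p, q in zip(ln_out, rms_out))
print(max_diff)
```

Because the centering is done once on the weights, the per-sample mean subtraction disappears from inference, which is where the reported speedup would come from.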

2605.14518 2026-05-15 cs.CV cs.LG

ArcGate: Adaptive Arctangent Gated Activation

Avik Bhattacharya, Siddhant Dnyanesh Gole, Subhasis Chaudhuri, Alejandro C. Frery, Biplab Banerjee

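Affiliations Microwave Remote Sensing Lab Center of Studies in Resources Engineering; Centre of Machine Intelligence and Data Science; Department of Electrical Engineering; School of Mathematics and Statistics; Center of Studies in Resources Engineering

AI summary This paper proposes ArcGate, a new adaptive arctangent gated activation function that generates diverse activation shapes through a three-stage non-linear transformation. Unlike conventional fixed-shape activations such as ReLU and GELU, it has seven learnable parameters per network layer, allowing the non-linearity to be optimized for the feature hierarchy and data distribution. Experiments on several remote sensing datasets show ArcGate's advantages, especially stronger robustness under noise, and reveal how its parameters evolve with network depth, suggesting that ArcGate is a general, adaptive activation function suited to high-resolution earth observation tasks.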

Abstract

Activation functions are central to deep networks, influencing non-linearity, feature learning, convergence, and robustness. This paper proposes the Adaptive Arctangent Gated Activation (ArcGate) function, a flexible formulation that generates a broad spectrum of activation shapes via a three-stage non-linear transformation. Unlike conventional fixed-shape activations such as ReLU, GELU, or SiLU, ArcGate uses seven learnable parameters per layer, allowing the neural network to autonomously optimize its non-linearity to the specific requirements of the feature hierarchy and data distribution. We evaluate ArcGate using ResNet-50 and Vision Transformer (ViT-B/16) architectures on three widely used remote sensing benchmarks: PatternNet, UC Merced Land Use, and the 13-band EuroSAT MSI multispectral dataset. Experimental results show that ArcGate consistently outperforms standard baselines, achieving a peak overall accuracy of 99.67% on PatternNet. Most notably, ArcGate exhibits superior structural resilience in noisy environments, maintaining a 26.65% performance lead over ReLU under moderate Gaussian noise (standard deviation 0.1). Analysis of the learned parameters reveals a depth-dependent functional evolution, where the model increases gating strength in deeper layers to enhance signal propagation. These findings suggest that ArcGate is a robust and adaptive general node activation function for high-resolution earth observation tasks.

2605.14517 2026-05-15 cs.CL cs.AI

Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

Gang Peng

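Affiliations Huizhou Lateni AI Technology Co., Ltd.; Huizhou University

AI summary This study proposes a dimension-level intent fidelity evaluation framework for assessing in finer detail how well large language models preserve structural form and user intent. Through a structured prompt ablation experiment, it analyses 2,880 outputs across three languages, three task domains, and six models, revealing a systematic gap between holistic scores and dimension-level intent deficits. The experiments show that relying on holistic evaluation alone can mask a model's shortcomings on specific intents, whereas dimension-level evaluation reflects model quality more accurately and is an important complement when evaluating models on user-specific tasks.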

Comments Preprint. 30 tasks, 3 languages, 6 LLMs, 2,880 outputs; includes human evaluation and structured prompt ablation

Abstract

Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs, that separately measures structural recovery and intent fidelity for each semantic dimension. This framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits; among English-language outputs, this proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that dimensional fidelity scores track human judgements more reliably than holistic scores do. A public-private decomposition of 2,520 ablation cells characterises when models successfully compensate for missing intent and when they fail, while proxy annotation distinguishes prior inferability from default recoverability. A weight-perturbation experiment shows that moderate misalignment is typically absorbed, whereas severe dimensional inversion is consistently harmful. These findings demonstrate that dimension-level intent fidelity evaluation is a necessary complement to holistic assessment when evaluating LLM outputs for user-specific tasks.

2605.14513 2026-05-15 cs.CV cs.AI

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

Xuzhe Zheng, Yuexiao Ma, Jing Xu, Xiawu Zheng, Rongrong Ji, Fei Chao

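Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

AI summary This paper proposes HASTE, a training-free method for accelerating video diffusion. It targets the efficiency-quality trade-off that existing sparse attention mechanisms face in video generation because of quadratic complexity and fixed thresholds. The method introduces a head-wise adaptive framework with two modules, temporal mask reuse and error-guided budgeted calibration, which reduce mask-prediction overhead and improve how sparsity is allocated across attention heads. Experiments show that HASTE substantially speeds up inference while maintaining video quality.

Details
English abstract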

Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to 1.93 times speedup at 720P while maintaining competitive video quality and similarity metrics.
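
To make the top-$p$ idea concrete, here is a minimal pure-Python sketch of a top-$p$ key mask for a single query, with a different threshold per head. It assumes the usual nucleus-style definition (keep the smallest set of keys whose softmax mass reaches $p$). The threshold values and helper names are hypothetical, and neither Temporal Mask Reuse nor the calibration procedure is reproduced.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_mask(scores, p):
    """Keep the smallest set of keys whose softmax mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [False] * len(probs), 0.0
    for i in order:
        keep[i] = True
        mass += probs[i]
        if mass >= p:
            break
    return keep


def sparse_attention_row(scores, values, p):
    """Attention output for one query, renormalised over the kept keys only."""
    keep = top_p_mask(scores, p)
    kept_scores = [s for s, k in zip(scores, keep) if k]
    kept_values = [v for v, k in zip(values, keep) if k]
    weights = softmax(kept_scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, kept_values)) for d in range(dim)]


# Hypothetical per-head thresholds: a peaked head tolerates a lower p.
head_thresholds = {"head_0": 0.5, "head_1": 0.95}
scores = [4.0, 1.0, 0.0, -1.0]
masks = {h: top_p_mask(scores, p) for h, p in head_thresholds.items()}
```

With these scores the first key alone carries about 93% of the mass, so `head_0` keeps one key and `head_1` keeps two. A shared threshold would force both heads to the same budget.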

2605.14500 2026-05-15 cs.SD cs.HC eess.IV

Physics-Based iOCT Sonification for Real-time Interaction Awareness in Subretinal Injection

Luis D. Reyes Vargas, Veronica Ruozzi, Andrea K. M. Ross, Shervin Dehghani, Michael Sommersperger, Koorosh Faridpooya, Mohammad Ali Nasseri, Merle Fairhurst, Nassir Navab, Sasan Matinfar

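Affiliations * Computer Aided Medical Procedures; TUM Klinikum Rechts der Isar; Rotterdam Eye Hospital; Centre for Tactile Internet with Human-in-the-Loop; Munich Center for Machine Learning; Chair for Social Affective Touch

AI summary This paper proposes a physics-model-based, real-time iOCT auditory feedback framework to improve real-time interaction awareness during subretinal injection. The method maps retinal layer information acquired from iOCT to sound, so that surgeons can perceive needle position and retinal deformation by ear, which reduces visual load and improves surgical precision. Experiments show that the method significantly outperforms existing methods in retinal layer identification and deformation detection and has substantial potential for clinical use.

Details
English abstract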

Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhance depth perception during cannula advancement, intraoperative optical coherence tomography (iOCT) offers high-resolution cross-sectional visualization of needle-tissue interaction; however, interpreting these images requires sustained visual attention alongside the en face microscope view, thereby increasing cognitive load during critical phases and placing additional demands on the surgeon's proprioceptive control. In this paper, we propose a structured, real-time sonification framework designed for extensible mapping of iOCT-derived anatomical features into perceptual auditory feedback. The method employs a physics-inspired acoustic model driven by segmented retinal layers from a stream of iOCT B-scans, with needle motion and injection-induced retinal layer displacements serving as excitation inputs to the sound model, enabling perception of tool position and retinal deformation. In a controlled user study (n=34), the proposed sonification achieved high retinal layer identification accuracy and robust detection of retinal deformation-related events, significantly outperforming a state-of-the-art baseline in overall event identification (83.4% vs. 60.6%, p < 0.001), with gains driven primarily by enhanced detection of injection-induced retinal deformation. Evaluation by experts (n=4) confirmed the clinical relevance and potential intraoperative applicability of the method. These results establish structured iOCT sonification as a viable complementary modality for real-time surgical guidance in subretinal injection.

2605.14497 2026-05-15 cs.LG cs.AI

ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization

Letian Yang, Xu Liu, Yiqiang Lu, Jian Liu, Weiqiang Wang, Shuai Li

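Affiliations * Shanghai Jiao Tong University; Ant Group

AI summary This paper proposes ROAD, an offline-to-online reinforcement learning framework that mixes data adaptively through bi-level optimization to address the non-stationary distribution shift between offline data and the online policy. The method models data selection as a bi-level optimization process, with the outer level optimizing policy performance and the inner level performing conventional Q-learning updates, and introduces a multi-armed bandit mechanism for dynamic data replay. Experiments show that ROAD outperforms existing methods on multiple datasets and achieves better stability and long-term performance without manual tuning.

Comments 20 pages, 9 figures, 7 tables. Accepted to IJCAI 2026

Details
English abstract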

Offline-to-online reinforcement learning harnesses the stability of offline pretraining and the flexibility of online fine-tuning. A key challenge lies in the non-stationary distribution shift between offline datasets and the evolving online policy. Common approaches often rely on static mixing ratios or heuristic-based replay strategies, which lack adaptability to different environments and varying training dynamics, resulting in a suboptimal trade-off between stability and asymptotic performance. In this work, we propose Reinforcement Learning with Optimized Adaptive Data-mixing (ROAD), a dynamic plug-and-play framework that automates the data replay process. We identify a fundamental objective misalignment in existing approaches. To tackle this, we formulate the data selection problem as a bi-level optimization process, interpreting the data mixing strategy as a meta-decision governing the policy performance (outer-level) during online fine-tuning, while the conventional Q-learning updates operate at the inner level. To make it tractable, we propose a practical algorithm using a multi-armed bandit mechanism. This is guided by a surrogate objective approximating the bi-level gradient, which simultaneously maintains offline priors and prevents value overestimation. Our empirical results demonstrate that this approach consistently outperforms existing data replay methods across various datasets, eliminating the need for manual, context-specific adjustments while achieving superior stability and asymptotic performance.
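
The bandit view of data mixing can be sketched as follows. This is a generic UCB1 bandit over a few candidate offline-data ratios, not the ROAD algorithm: the abstract does not say which bandit rule is used, and the reward fed to `update` stands in for the paper's surrogate objective. All names (`MixRatioBandit`, `sample_batch`, `true_gain`) are hypothetical.

```python
import math
import random


class MixRatioBandit:
    """UCB1 bandit whose arms are candidate offline-data ratios."""

    def __init__(self, ratios, exploration=0.2):
        self.ratios = list(ratios)
        self.counts = [0] * len(self.ratios)
        self.means = [0.0] * len(self.ratios)
        self.exploration = exploration

    def select(self):
        # Play every arm once before using confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        ucb = [
            m + self.exploration * math.sqrt(2 * math.log(total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        # Incremental running mean of the observed reward for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def sample_batch(offline, online, ratio, batch_size, rng):
    """Draw a batch with roughly `ratio` of its samples from the offline buffer."""
    n_off = round(ratio * batch_size)
    return rng.choices(offline, k=n_off) + rng.choices(online, k=batch_size - n_off)


# Toy stand-in for the training signal: each ratio has a fixed "gain".
true_gain = {0.0: 0.2, 0.25: 0.5, 0.5: 0.8, 0.75: 0.4}
bandit = MixRatioBandit(true_gain.keys())
for _ in range(300):
    arm = bandit.select()
    bandit.update(arm, true_gain[bandit.ratios[arm]])
best_ratio = bandit.ratios[max(range(len(bandit.counts)), key=bandit.counts.__getitem__)]
```

In a real training loop the chosen ratio would drive `sample_batch` at each update step, and the reward would be recomputed as the policy changes, which is what lets the ratio track non-stationary dynamics.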

2605.14494 2026-05-15 cs.AI cs.LG

Learning Scenario Reduction for Two-Stage Robust Optimization with Discrete Uncertainty

Tianjue Lin, Jianan Zhou, Jieyi Bi, Yaoxin Wu, Wen Song, Zhiguang Cao, Jie Zhang

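Affiliations * Nanyang Technological University; Eindhoven University of Technology; Shandong University; Singapore Management University

AI summary This paper studies two-stage robust optimization with discrete uncertainty, which is hard to solve because of its high computational complexity. To address this, the authors propose NeurPRISE, a neural surrogate model based on a graph neural network and a Transformer. It learns a scenario selection policy by imitation learning from the problem-driven scenario reduction method PRISE, greatly improving computational efficiency while preserving solution quality. Experiments show that NeurPRISE performs and scales well on several two-stage robust optimization problems and has strong zero-shot generalization.

Details
English abstract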

Two-Stage Robust Optimization (2RO) with discrete uncertainty is challenging, often rendering exact solutions prohibitive. Scenario reduction alleviates this issue by selecting a small, representative subset of scenarios to enable tractable computation. However, existing methods are largely problem-agnostic, operating solely on the uncertainty set without consulting the feasible region or recourse structure. In this paper, we introduce PRISE, a problem-driven sequential lookahead heuristic that constructs reduced scenario sets by evaluating the marginal impact of each scenario. While PRISE yields high-quality scenario subsets, each selection step requires solving multiple subproblems, making it computationally expensive at scale. To address this, we propose NeurPRISE, a neural surrogate model built on a GNN-Transformer backbone that encodes the per-scenario structure via graph convolution and captures cross-scenario interactions through attention. NeurPRISE is trained via imitation learning with a gain-aware ranking objective, which distills marginal gain information from PRISE into a learned scoring function for scenario ranking and selection. Extensive results on three 2RO problems show that NeurPRISE consistently achieves competitive regret relative to comprehensive methods, maintains strong scalability with varying numbers of scenarios, and delivers 7-200x speedup over PRISE. NeurPRISE also exhibits strong zero-shot generalization, effectively handling instances with larger problem scales (up to 5x), more scenarios (up to 4x), and distribution shifts.
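
A toy version of problem-driven greedy scenario selection may help fix ideas. The sketch assumes a finite set of first-stage decisions and a precomputed table `cost[x][s]` (total cost of decision `x` under scenario `s`), which replaces the subproblem solves a real 2RO instance would need. The selection rule shown, adding the scenario whose inclusion yields the best full-set worst case, is an illustrative guess at "marginal impact", not the PRISE procedure itself.

```python
def robust_choice(cost, scenarios):
    """First-stage decision minimising the worst-case cost over `scenarios`."""
    return min(range(len(cost)), key=lambda x: max(cost[x][s] for s in scenarios))


def worst_case(cost, x, scenarios):
    """Worst-case cost of decision x over `scenarios`."""
    return max(cost[x][s] for s in scenarios)


def greedy_reduce(cost, k):
    """Greedily pick k scenarios, one lookahead step at a time."""
    all_s = list(range(len(cost[0])))
    chosen = []
    while len(chosen) < k:
        best_s, best_val = None, float("inf")
        for s in all_s:
            if s in chosen:
                continue
            # Decision induced by the candidate subset, judged on the full set.
            x = robust_choice(cost, chosen + [s])
            val = worst_case(cost, x, all_s)
            if val < best_val:
                best_s, best_val = s, val
        chosen.append(best_s)
    return chosen


# Rows are decisions, columns are scenarios (toy numbers).
cost = [
    [1, 9, 2],
    [5, 5, 5],
    [2, 3, 8],
]
subset = greedy_reduce(cost, 2)
x_reduced = robust_choice(cost, subset)
x_full = robust_choice(cost, [0, 1, 2])
```

On this table the two selected scenarios already recover the full-set robust decision, so the reduced problem has zero regret. Each greedy step evaluates every remaining scenario, which is the cost a learned scorer is meant to avoid.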

2605.14489 2026-05-15 cs.LG cs.SY eess.SY

A Novel Schur-Decomposition-Based Weight Projection Method for Stable State-Space Neural-Network Architectures

Sergio Vanegas, Lasse Lensu, Fredy Ruiz

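Affiliations * Computational Engineering; LUT University; DEIB; Politecnico di Milano

AI summary This paper proposes a weight projection method based on the Schur decomposition to ensure the stability of linear discrete-time state-space neural-network layers. The method dynamically projects the quasi-upper-triangular factor of the state matrix's real Schur decomposition onto the nearest stable matrix, guaranteeing asymptotic stability while preserving model accuracy. Experiments show identification accuracy and convergence speed comparable to state-of-the-art methods on synthetic linear systems, and good training behaviour in nonlinear neural-network architectures on real-world datasets.

Comments 32 pages, 13 figures. Source code at https://codeberg.org/sergiovaneg/SchurSS

Details
English abstract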

Building black-box models for dynamical systems from data is a challenging problem in machine learning, especially when asymptotic stability guarantees are required. In this paper, we introduce a novel stability-ensuring and backpropagation-compatible projection scheme based on the Schur decomposition for the state matrix of linear discrete-time state-space layers, as well as an alternative pre-factorized formulation of the methodology. The proposed methods dynamically project the quasi-triangular factor of the state matrix's real Schur decomposition onto its nearest stable peer, ensuring stable dynamics with minimal overparameterization. Experiments on synthetic linear systems demonstrate that the method achieves accuracy and convergence rates comparable to those of state-of-the-art stable-system identification techniques, despite a marginal increase in computational complexity. Furthermore, the lower weight count facilitates convergence during training without sacrificing accuracy in stacked neural-network architectures with static nonlinearities targeting real-world datasets. These results suggest that the Schur-based projection provides a numerically robust framework for identifying complex dynamics on par with the State of the Art while satisfying strict asymptotic-stability requirements.
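
For orientation, the standard facts behind this construction can be written as follows. The symbols $B$, $C$, $D$, $u_k$, and $y_k$ are generic state-space notation assumed here. The abstract itself only names the state matrix and its real Schur factor, and the projection step is not reproduced.

```latex
\begin{aligned}
x_{k+1} &= A x_k + B u_k, \qquad y_k = C x_k + D u_k, \\
A &= Q T Q^{\top}, \qquad Q^{\top} Q = I, \quad T \text{ quasi-upper-triangular}, \\
\text{asymptotic stability} &\iff \rho(A) = \max_i \lvert \lambda_i(T) \rvert < 1 .
\end{aligned}
```

Because $T$ shares its eigenvalues with $A$ and they can be read off its $1 \times 1$ and $2 \times 2$ diagonal blocks, a stability constraint can be imposed on $T$ rather than on $A$ directly.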

2605.14488 2026-05-15 cs.AI

Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)

Assaf Gerner, Netta Madvil, Nadav Barak, Alex Zaikman, Jonatan Liberman, Liron Hamra, Rotem Brazilay, Shay Tsadok, Yaron Friedman, Neal Harow, Noam Bresler, Shir Chorev, Philip Tannor, Lior Rokach

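Affiliations * Deepchecks, Ramat Gan, Israel; Ben-Gurion University, Beer Sheva, Israel

AI summary This paper introduces Deepchecks, a comprehensive framework for evaluating retrieval-augmented generation (RAG) systems. Through a multi-faceted approach, root cause analysis, and production monitoring, the framework addresses the complex challenges of RAG evaluation. It aims to keep evaluation results aligned with application-specific requirements, improving system reliability, relevance, and user satisfaction.

Details
English abstract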

Large Language Models (LLMs) augmented with Retrieval-Augmented Generation (RAG) techniques are revolutionizing applications across multiple domains, such as healthcare, finance, and customer service. Despite their potential, evaluating RAG systems remains a complex challenge due to the stochastic nature of generated outputs and the intricate interplay between retrieval and generation components. This paper introduces Deepchecks, a comprehensive framework tailored for evaluating RAG applications. The Deepchecks evaluation framework addresses RAG application evaluation through a multi-faceted approach, root cause analysis, and production monitoring. By ensuring alignment with application-specific requirements, the Deepchecks framework provides a robust foundation for assessing reliability, relevance, and user satisfaction in RAG systems.

2605.14487 2026-05-15 cs.CV cs.AI

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

Jiahao Tian, Yiwei Wang, Gang Yu, Chi Zhang

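Affiliations * AGI Lab, Westlake University; University of California at Merced; StepFun

AI summary This paper studies error accumulation and context loss in long-horizon autoregressive video generation and proposes Head Forcing, a training-free framework. The method identifies the distinct functions of attention heads in diffusion transformers and assigns tailored key-value cache strategies to heads for local detail refinement, structural stabilization, and long-range context aggregation, improving generation quality and efficiency. Experiments show that, without added training cost, the method substantially extends video generation length, supports multi-prompt interactive synthesis, and outperforms existing baselines.

Details
English abstract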

Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: https://jiahaotian-sjtu.github.io/headforcing.github.io/.
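
One way to picture head-specific KV cache policies is as a function from head type to the token positions that stay cached. The retention rules below (a recent window, a few initial anchor tokens, and strided sampling as a crude stand-in for hierarchical memory) are illustrative assumptions. The abstract does not specify them, and episodic updates and RoPE re-encoding are not modelled.

```python
def kept_positions(head_type, n_tokens, window=4, n_anchor=2, stride=4):
    """Return the sorted token positions a head of the given type keeps cached."""
    recent = set(range(max(0, n_tokens - window), n_tokens))
    if head_type == "local":
        # Detail refinement: only the most recent tokens.
        keep = recent
    elif head_type == "anchor":
        # Structural stabilisation: earliest tokens plus the recent window.
        keep = set(range(min(n_anchor, n_tokens))) | recent
    elif head_type == "memory":
        # Long-range context: a coarse strided summary plus the recent window.
        keep = set(range(0, n_tokens, stride)) | recent
    else:
        raise ValueError(f"unknown head type: {head_type}")
    return sorted(keep)


cache_plan = {t: kept_positions(t, 12) for t in ("local", "anchor", "memory")}
```

With 12 cached tokens, the local head keeps 4 positions while the anchor and memory heads keep 6 each, so cache size depends on head role rather than on sequence length alone.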

2605.14486 2026-05-15 cs.CV

Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

Yiheng Li, Yang Yang, Zichang Tan, Gao Li, Zhen Lei, Wenhao Wang

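Affiliations * School of Artificial Intelligence, University of Chinese Academy of Sciences; MAIS, Institute of Automation, Chinese Academy of Sciences; Sangfor Technologies Inc.; China Mobile Financial Technology Co., Ltd.; CAIR, HKSIS, Chinese Academy of Sciences; SCSE, FIE, M.U.S.T, Macau, China; Vast Intelligence Lab, Sydney, Australia

AI summary As misuse of AI-generated images grows, broadly applicable image detection techniques are urgently needed. This paper proposes a GAN-based upsampling method that produces fake images aligned with reconstruction-based ones but with more diverse artifact patterns, compensating for the limited diversity of existing methods. To handle the domain shift between generation methods, the study introduces the Separate Expert Fusion (SEF) framework, which fuses complementary features through domain-specific expert models and a gating network, markedly improving detection performance and generalization across many generative methods.

Comments preprint

Details
English abstract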

As the misuse of AI-generated images grows, generalizable image detection techniques are urgently needed. Recent state-of-the-art (SOTA) methods adopt aligned training datasets to reduce content, size, and format biases, empowering models to capture robust forgery cues. A common strategy is to employ reconstruction techniques, e.g., VAE and DDIM, which show remarkable results in diffusion-based methods. However, such reconstruction-based approaches typically introduce limited and homogeneous artifacts, which cannot fully capture diverse generative patterns, such as those of GAN-based methods. To complement reconstruction-based fake images with aligned yet diverse artifact patterns, we propose a GAN-based upsampling approach that mimics GAN-generated fake patterns while preserving content, size, and format alignment. This naturally results in two aligned but distinct types of fake images. However, due to the domain shift between reconstruction-based and upsampling-based fake images, direct mixed training causes suboptimal results, where one domain disrupts feature learning of the other. Accordingly, we propose a Separate Expert Fusion (SEF) framework to extract complementary artifact information and reduce inter-domain interference. We first train domain-specific experts via LoRA adaptation on a frozen foundational model, then conduct decoupled fusion with a gating network to adaptively combine expert features while retaining their specialized knowledge. Rather than merely benefiting GAN-generated image detection, this design introduces diverse and complementary artifact patterns that enable SEF to learn a more robust decision boundary and improve generalization across broader generative methods. Extensive experiments demonstrate that our method yields strong results across 13 diverse benchmarks. Code is released at: https://github.com/liyih/SEF_AIGC_detection.
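
The gated fusion step can be sketched as a softmax-weighted sum of expert feature vectors. This is a generic mixture-style gate written in pure Python. In SEF the gate logits would come from a learned gating network and the features from LoRA-adapted experts, neither of which is modelled here, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(expert_features, gate_logits):
    """Combine per-expert feature vectors using softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_features[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, expert_features))
        for d in range(dim)
    ]


# Two toy experts (e.g. reconstruction-domain and upsampling-domain features).
reconstruction_feat = [1.0, 0.0]
upsampling_feat = [0.0, 1.0]
fused_equal = gated_fusion([reconstruction_feat, upsampling_feat], [0.0, 0.0])
fused_skewed = gated_fusion([reconstruction_feat, upsampling_feat], [10.0, 0.0])
```

Equal logits average the two experts, while a strongly skewed gate passes one expert through almost unchanged, which is how a gate can keep each expert's specialised features intact.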

2605.14483 2026-05-15 cs.AI

LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

Xudong Chen, Yixin Liu, Hua Wei, Kaize Ding

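Affiliations * GitHub

AI summary LEMON is a large-language-model-based multi-agent orchestrator that uses counterfactual reinforcement learning to generate executable multi-agent orchestration specifications. By integrating task-specific roles, duty assignment, capacity levels, and dependency structure, the method improves the system's overall execution efficiency and solution quality. LEMON performs well on six reasoning and coding benchmarks, achieving the best performance among current multi-agent orchestration methods.

Comments Submitted to NeurIPS 2026

Details
English abstract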

Large language models (LLMs) have become a strong foundation for multi-agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution-level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (Learning Executable Multi-agent OrchestratioN via Counterfactual Reinforcement Learning), an LLM-based orchestrator that generates an executable orchestration specification. The specification integrates task-specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration-level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks, including MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval, show that LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods. Our code is available at https://anonymous.4open.science/r/LEMON-B23C.
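
The credit-assignment idea, a sequence-level advantage for every token plus a contrast term only on edited spans, can be sketched as below. The additive combination, the `weight` factor, and the sign convention (original reward minus reward after the edit) are assumptions made for illustration. The abstract does not give the exact objective.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardise rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(n_tokens, seq_adv, edits, weight=1.0):
    """Per-token advantages for one sampled specification.

    edits: list of ((start, end), reward_original, reward_after_edit), where
    [start, end) is the token span of the edited role, capacity, or dependency field.
    """
    adv = [seq_adv] * n_tokens
    for (start, end), r_orig, r_edit in edits:
        contrast = r_orig - r_edit  # positive if the original field helped
        for t in range(start, end):
            adv[t] += weight * contrast
    return adv


seq_advs = group_advantages([1.0, 0.0])
per_token = token_advantages(5, 1.0, [((1, 3), 1.0, 0.25)])
```

Only tokens 1 and 2 receive the extra 0.75, so the update that follows credits the field whose edit changed the reward instead of spreading that signal across the whole specification.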

2605.14477 2026-05-15 cs.LG

Test-Time Learning with an Evolving Library

Weijia Xu, Alessandro Sordoni, Chandan Singh, Zelalem Gero, Michel Galley, Xingdi Yuan, Jianfeng Gao

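Affiliations * Microsoft Research

AI summary This paper proposes EvoLib, a test-time learning framework that lets large language models accumulate, reuse, and evolve knowledge across problem instances without updating parameters or relying on external supervision. The method maintains a shared knowledge library, automatically extracts modular skills and reflective insights from the model's own inference trajectories, and introduces a mechanism that balances immediate utility against long-term value, enabling continual refinement and generalization of knowledge. Experiments show that EvoLib significantly outperforms existing test-time learning methods in mathematical reasoning, code generation, and multi-turn agentic environments.

Details
English abstract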

We introduce EvoLib, a test-time learning framework that enables large language models to accumulate, reuse, and evolve knowledge across problem instances without parameter updates or external supervision. Instead of adapting model parameters, our approach maintains a shared library of knowledge abstractions, including modular skills and reflective insights, automatically extracted from the model's own inference trajectories. To support continual improvement, we introduce a principled weighting and consolidation mechanism that jointly optimizes for immediate utility and long-term value. This allows simple, instance-specific abstractions to evolve into more general and reusable ones over time. Across challenging benchmarks in mathematical reasoning, code generation, and multi-turn agentic environments, EvoLib improves substantially over the top test-time scaling and learning methods without ground-truth feedback.

2605.14475 2026-05-15 cs.CV

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

Jiashun Zhu, Ronghao Fu, Jiasen Hu, Nachuan Xing, Xu Na, Xiao Yang, Zhiwen Lin, Weipeng Zhang, Lang Sun, Zhiheng Xue, Haoran Liu, Weijie Zhang, Bo Yang

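Affiliations * College of Computer Science and Technology, Jilin University, China

AI summary GeoVista is a visually grounded active perception framework for understanding ultra-high-resolution remote sensing images. It addresses the tendency of existing methods to lose global context, revisit regions, or miss key regions when exploring large scenes. The method builds a global exploration plan, verifies candidate regions across multiple branches, and uses explicit evidence-state management to aggregate and de-duplicate information across regions. GeoVista introduces the APEX-GRO trajectory corpus and an Observe-Plan-Track mechanism, improving semantic understanding and question answering on remote sensing images and achieving state-of-the-art results on several benchmarks.

Details
English abstract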

Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista
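
The role of an explicit evidence state in avoiding double counting can be sketched with a small container that rejects near-duplicate detections. Boxes are assumed to be `(x1, y1, x2, y2)` in normalised image coordinates, and duplicates are detected with an IoU threshold. Both choices, and the `EvidenceState` class itself, are illustrative assumptions and not GeoVista's actual representation.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class EvidenceState:
    """Evidence collected across region branches, with de-duplication."""

    def __init__(self, iou_threshold=0.5):
        self.iou_threshold = iou_threshold
        self.items = []  # list of (label, box)

    def add(self, label, box):
        """Record a detection; return False if it duplicates existing evidence."""
        for seen_label, seen_box in self.items:
            if seen_label == label and iou(seen_box, box) >= self.iou_threshold:
                return False
        self.items.append((label, box))
        return True

    def count(self, label):
        return sum(1 for seen_label, _ in self.items if seen_label == label)


state = EvidenceState()
first = state.add("plane", (0.10, 0.10, 0.20, 0.20))   # found in branch A
repeat = state.add("plane", (0.11, 0.10, 0.20, 0.20))  # same object seen from branch B
other = state.add("plane", (0.50, 0.50, 0.60, 0.60))   # a different object
```

Because coordinates are normalised to the full image, detections from crops at different zoom levels can be compared directly, and the overlapping second detection is not counted again.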

2605.14467 2026-05-15 cs.LG

Focused PU learning from imbalanced data

Elias Zavitsanos, Georgios Paliouras

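Affiliations * Institute of Informatics and Telecommunications

AI summary This paper proposes a new positive-unlabeled (PU) learning method for highly imbalanced datasets, aimed at classification problems with limited labeled data in real-world settings such as disease gene identification and fraud detection. The method introduces a focused empirical risk estimator that trains binary classifiers on positive and unlabeled examples, improving classification performance on imbalanced data. Experiments show strong performance on a range of imbalanced datasets and practical value in real applications such as financial misstatement detection.

Details
English abstract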

We propose a new method of learning from positive and unlabeled (PU) examples in highly imbalanced datasets. Many real-world problems, such as disease gene identification, targeted marketing, fraud detection, and recommender systems, are hard to address with machine learning methods due to limited labeled data. Often, training data comprises positive and unlabeled instances, with the unlabeled set typically dominated by negative instances but also including several positive ones. While PU learning is well-studied, few methods address imbalanced settings or hard-to-detect positive examples that resemble negative ones. Our approach uses a focused empirical risk estimator, incorporating both positive and unlabeled examples to train binary classifiers. Empirical evaluations demonstrate state-of-the-art performance on imbalanced datasets under two labeling mechanisms: selected completely at random (SCAR) and selected at random (SAR). Beyond these controlled experiments, we demonstrate the value of the proposed method in the real-world application of financial misstatement detection.
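
For readers new to PU risk estimation, the sketch below implements the widely used non-negative PU risk estimator, which rewrites the negative-class risk using only positive and unlabeled samples and clamps it at zero. It is shown as background only. The paper's focused estimator is not described in the abstract and is not reproduced here, and the class prior `prior` is assumed to be known.

```python
import math
from statistics import mean


def sigmoid_loss(score, label):
    """Sigmoid loss 1 / (1 + exp(label * score)) for label in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))


def nn_pu_risk(pos_scores, unl_scores, prior, loss=sigmoid_loss):
    """Non-negative PU risk: prior * R_p^+ + max(0, R_u^- - prior * R_p^-)."""
    r_p_pos = mean([loss(s, +1) for s in pos_scores])  # positives treated as positive
    r_p_neg = mean([loss(s, -1) for s in pos_scores])  # positives treated as negative
    r_u_neg = mean([loss(s, -1) for s in unl_scores])  # unlabeled treated as negative
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)


# An uninformative classifier (score 0 everywhere) with class prior 0.5.
risk_flat = nn_pu_risk([0.0], [0.0], prior=0.5)
# Unlabeled points scored as confidently negative: the second term is clamped at 0.
risk_clamped = nn_pu_risk([0.0], [-100.0], prior=0.5)
```

The clamp matters most when positives are rare or hard to detect, because the estimated negative risk can otherwise go below zero and let a flexible model overfit.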

2605.14465 2026-05-15 cs.AI

From Table to Cell: Attention for Better Reasoning with TABALIGN

Tung Sum Thomas Kwok, Zeyong Zhang, Xinyu Wang, Chunhe Wang, Xiaofeng Lin, Hanwei Wu, Lei Ding, Guang Cheng, Zhijiang Guo

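Affiliations * University of California, Los Angeles; New Jersey Institute of Technology; McGill University; Université de Montréal; University of Manitoba; SimpleWay; The Hong Kong University of Science and Technology (Guangzhou)

AI summary This study addresses multi-step reasoning over structured tables and proposes TABALIGN, a new framework that tackles the lack of an explicit cell-alignment mechanism between planning and execution during reasoning. Its core method uses a bidirectionally denoising diffusion language model (DLM) as the planner, which emits reasoning steps as binary cell masks, together with a lightweight verifier, TABATTN, which scores each step against a large set of human-verified attention standards. Experiments show that TABALIGN significantly improves reasoning accuracy on multiple benchmarks and speeds up downstream reasoning execution.

Details
English abstract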

Multi-step LLM reasoning over structured tables fails because planning and execution share no explicit cell-grounding contract. Existing methods constrain the planner to a left-to-right factorization at odds with table permutation invariance, and score intermediate states by generated content alone, overlooking cell grounding. We conduct a pilot study showing that diffusion language models (DLMs) produce more human-aligned and permutation-stable cell attention on tables than autoregressive models, with a 40.2% median reduction in attention-AUROC variability under row reordering. Motivated by this, we propose TABALIGN, a planned table reasoning framework that operationalizes the contract. TABALIGN pairs a masked DLM planner, whose bidirectional denoising emits plan steps as binary cell masks, with TABATTN, a lightweight verifier trained on 1,600 human-verified attention standards to score each step by its attention overlap with the plan-designated mask. Across eight benchmarks covering table question answering and fact verification, TABALIGN improves average accuracy by 15.76 percentage points over the strongest open-source baseline at comparable 8B-class scale, with a matched-backbone ablation attributing 2.87 percentage points of this gain to the DLM planner over an AR planner on a fixed reasoner. Cleaner DLM plans also accelerate downstream reasoning execution by 44.64%.
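
The notion of scoring a step by its attention overlap with a binary cell mask can be illustrated with a hand-written overlap measure: the share of attention mass that lands on the plan-designated cells. TABATTN is a trained verifier, so this function is only a stand-in, and the matrices and names are hypothetical.

```python
def attention_overlap(attention, mask):
    """Share of total attention mass that falls on cells where mask is 1."""
    total = sum(sum(row) for row in attention)
    if total == 0:
        return 0.0
    hit = sum(
        a
        for att_row, mask_row in zip(attention, mask)
        for a, m in zip(att_row, mask_row)
        if m
    )
    return hit / total


# A 2x2 table: the plan step designates the top-left cell.
attention = [[0.5, 0.25], [0.25, 0.0]]
plan_mask = [[1, 0], [0, 0]]
score = attention_overlap(attention, plan_mask)

# Reordering table rows permutes attention and mask together, leaving the score unchanged.
score_reordered = attention_overlap(attention[::-1], plan_mask[::-1])
```

A set-based overlap like this is unaffected by row order, which is the kind of permutation stability the pilot study measures for cell attention.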