arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2605.09290 2026-05-12 cs.LG

From Regression to Inference: Meta-Learning Predictors for Neural Architecture Search

Liping Deng, MingQing Xiao

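AI summary: This paper studies the generalization of performance predictors in prediction-based neural architecture search (NAS) and proposes a meta-learning method based on the Convolutional Neural Process (ConvNP) that models performance prediction as a conditional function inference problem. Unlike conventional regression, the method meta-learns to generalize from a small number of samples, improving prediction accuracy on unseen architectures. Experiments show that it markedly improves architecture selection on several NAS benchmarks and reaches state-of-the-art results.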

Abstract

Prediction-based approaches are widely used in neural architecture search (NAS), where a predictor estimates the performance of candidate architectures to guide selection. However, existing predictors are typically trained via supervised regression on limited samples, leading to overfitting and poor generalization to unseen architectures. In this work, we propose a fundamentally different formulation that models performance prediction as a conditional function inference problem using a Convolutional Neural Process (ConvNP) with meta-learning capabilities. Instead of fitting a fixed mapping to limited samples, our approach meta-learns to infer performance from partial observations by training with context-target splits across a group of synthesized tasks, explicitly optimizing for generalization under data scarcity and aligning the training procedure with the deployment setting in NAS. We further design simple yet effective meta-features for cell-based architectures and evaluate our method on NAS-Bench-101 and NAS-Bench-201. Extensive experiments show that our approach consistently improves top-K ranking quality and achieves state-of-the-art architecture selection using limited samples.

2605.09288 2026-05-12 cs.LG cs.AI cs.CE cs.CV cs.NA math.NA

MC$^2$: Monte Carlo Correction for Fast Elliptic PDE Solving

Ethan Hsu, Hong Meng Yam, Ivan Ge

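AI summary: This paper proposes MC², a hybrid solver that combines the Walk-on-Spheres Monte Carlo method with a neural network to solve elliptic partial differential equations (PDEs) efficiently. It treats a low-budget Monte Carlo solution as a structured estimator and trains a neural network to correct it in a single forward pass, recovering a high-accuracy solution at much higher speed. The paper also releases PDEZoo, a standardized benchmark of two million elliptic PDEs, to support research on PDE solving under limited compute.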

Abstract

Partial differential equation (PDE) solvers underpin scientific computing, but real-world deployment is bounded by compute. Classical Monte Carlo solvers such as Walk-on-Spheres (WoS) are unbiased and geometry-agnostic but are slow. Learned solvers are fast but biased and brittle under distribution shift. We present MC$^2$, a hybrid WoS-Neural Network (WoS-NN) PDE solver that treats a low-budget Monte Carlo solution as a structured estimator of the true field and learns a single-pass neural correction to recover a high-fidelity solution. MC$^2$ matches the accuracy of solutions using over $1000\times$ more Monte Carlo compute, outperforming all evaluated classical, denoising, and neural-operator baselines. To enable reproducible study of finite-compute PDE solving, we additionally release PDEZoo, the largest standardized elliptic PDE benchmark to date: 2M PDEs spanning five elliptic families and unlimited geometric compositions, with analytic ground truth and multi-budget Monte Carlo trajectories. Together, MC$^2$ and PDEZoo (1) empirically establish that finite-sample Monte Carlo error is structured, learnable, and correctable in a single forward pass, (2) show that we can solve PDEs $\sim 1000\times$ faster than with just WoS, and (3) provide the evaluation infrastructure the field has so far lacked.
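
To make the Walk-on-Spheres baseline concrete, the following is a minimal sketch of the classical estimator for the Laplace equation on the unit disk. It illustrates only the Monte Carlo component described in the abstract; the neural correction of MC$^2$ is not shown. The domain, the function name `walk_on_spheres`, and all parameter defaults are illustrative assumptions, not details from the paper.

```python
import math
import random


def walk_on_spheres(x, y, boundary_value, n_walks=2000, eps=1e-3, seed=0):
    """Estimate u(x, y) for Laplace's equation on the unit disk.

    ASSUMPTION: unit-disk domain and Dirichlet data `boundary_value(bx, by)`
    are chosen for illustration only.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walks):
        px, py = x, y
        while True:
            # Radius of the largest circle centred at the walker that fits in the disk.
            r = 1.0 - math.hypot(px, py)
            if r < eps:
                break  # close enough to the boundary: stop the walk
            # Jump to a uniformly random point on that circle.
            theta = rng.uniform(0.0, 2.0 * math.pi)
            px += r * math.cos(theta)
            py += r * math.sin(theta)
        # Project onto the boundary and read off the boundary data.
        norm = math.hypot(px, py)
        total += boundary_value(px / norm, py / norm)
    return total / n_walks


if __name__ == "__main__":
    # With boundary data g(x, y) = x, the harmonic solution is u(x, y) = x.
    print(walk_on_spheres(0.3, 0.2, lambda bx, by: bx))
```

The estimate is noisy at small `n_walks`, which is the finite-sample error that the abstract argues is structured enough to be corrected by a learned model.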

2605.09285 2026-05-12 cs.CL

BetaEdit: Null-Space Constrained Sequential Model Editing

Bingqing Liu, Wei Liu, Yuhua Li

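AI summary: This paper proposes BetaEdit, a model editing method that addresses the knowledge leakage and performance degradation of null-space-based editing methods during sequential editing. After analyzing why history-aware updates help, the authors present a null-space editing framework that incorporates history information, controlling knowledge leakage and improving editing quality. Experiments show that BetaEdit outperforms existing methods on large-scale sequential editing, with better editing performance and general capabilities.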

Abstract

Null-space-based methods have garnered considerable attention in model editing by constraining updates to the null space of the pre-existing knowledge representation, thereby preserving the model's original behavior. However, in practice these methods rely on an approximate null space--leading to knowledge leakage--and further suffer from severe performance degradation during sequential editing. Recent work shows that history-aware editing strategies can empirically mitigate this decline, yet the underlying reason remains unclear. In this paper, we first expose the knowledge leakage inherent in existing null-space approaches and then analyze why history-aware updates effectively preserve both editing performance and general capabilities during long-horizon editing. Building on these insights, we propose BetaEdit, a refined framework that effectively controls the knowledge leakage and integrates history-aware updates into the null-space paradigm. Extensive experiments on three large language models across two standard benchmarks show that BetaEdit consistently outperforms prior methods in the challenging regime of massive-scale sequential editing. Code is available at: https://github.com/lbq8942/BetaEdit.
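
The null-space constraint mentioned above can be sketched in a few lines: a weight update is projected so that it has no effect on a set of preserved key vectors. This is a generic illustration with an exact null space, not BetaEdit itself; the abstract notes that practical methods rely on an approximate null space, which is where leakage arises. All function names here are hypothetical.

```python
def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_update_to_null_space(delta_w, preserved_keys):
    """Remove from each row of delta_w its component in span(preserved_keys).

    Afterwards delta_w @ k == 0 for every preserved key k, so outputs for
    those keys are unchanged by the update.
    """
    basis = orthonormal_basis(preserved_keys)
    projected = []
    for row in delta_w:
        r = list(row)
        for b in basis:
            coeff = sum(ri * bi for ri, bi in zip(r, b))
            r = [ri - coeff * bi for ri, bi in zip(r, b)]
        projected.append(r)
    return projected


def apply_weight(weight, key):
    """Matrix-vector product weight @ key."""
    return [sum(wi * ki for wi, ki in zip(row, key)) for row in weight]


if __name__ == "__main__":
    keys = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]   # hypothetical preserved keys
    delta = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # hypothetical raw update
    safe = project_update_to_null_space(delta, keys)
    print(safe, [apply_weight(safe, k) for k in keys])
```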

2605.09284 2026-05-12 cs.LG cs.AI cs.CE physics.app-ph physics.comp-ph

Semi-Supervised Neural Super-Resolution for Mesh-Based Simulations

Jiyeon Kim, Youngjoon Hong, Won-Yong Shin

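AI summary: This paper proposes SuperMeshNet, a semi-supervised neural super-resolution framework that improves the computational efficiency of mesh-based simulations. It combines a small amount of paired low-resolution/high-resolution data with abundant unpaired low-resolution data and uses message passing neural networks (MPNNs) to reconstruct high-resolution solutions, reducing the dependence on high-resolution supervision. Experiments show that SuperMeshNet achieves a lower root mean square error than the fully supervised approach while using less high-resolution data.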

Comments International Conference on Machine Learning (ICML 2026) (to appear) (Please cite our conference version.)

Abstract

Mesh-based simulations provide high-fidelity solutions to partial differential equations (PDEs), but achieving such accuracy typically requires fine meshes, leading to substantial computational overhead. Super-resolution techniques aim to mitigate this cost by reconstructing high-resolution (HR), high-fidelity solutions from low-cost, low-resolution (LR) counterparts. However, training neural networks for super-resolution often demands large amounts of expensive HR supervision data. To address this challenge, we propose SuperMeshNet, an HR data-efficient super-resolution framework for mesh-based simulations aided by message passing neural networks (MPNNs). At its core, SuperMeshNet introduces complementary learning, a semi-supervised approach that effectively leverages both 1) a small amount of paired LR-HR data and 2) abundant unpaired LR data via two jointly trained, complementary MPNN-based models. Additionally, our model is enriched by inductive biases, which are empirically shown to further improve super-resolution performance. Extensive experiments demonstrate that SuperMeshNet requires 90% less HR data to achieve even lower root mean square error (RMSE) than that of the fully supervised benchmark without the inductive biases. The source code and datasets are available at https://github.com/jykim-git/SuperMeshNet.git.

2605.09283 2026-05-12 cs.AI cs.CL

A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web

Shusaku Egami, Masahiro Hamasaki

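AI summary: With the development of large language models and the AI agents built on them, the Web is shifting from a human-centric Web to an "Agentic Web" driven by AI agents. However, there is currently no mechanism for verifying the reliability, reproducibility, and compliance of AI-generated content (AIGC) at generation time, which can lead to content misuse and compliance risks. This paper proposes a prompt-aware structuring framework that automatically attaches structured metadata to AIGC at generation time, including modularized prompts, contexts, model information, hyperparameters, and confidence, and combines it with verifiable credentials to support reliable assessment and safe reuse of AIGC.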

Comments 5 pages, 2 figures, Accepted at FAAW@WWW2026

Abstract

The evolution of Large Language Models (LLMs) and the software agents built on them (AI agents) marks a turning point in the transition from a human-centric Web to an "Agentic Web" driven by AI agents. However, for AI-Generated Content (AIGC), which is expected to dominate the Web, there is currently no mechanism for agents to verify its reliability, reproducibility, or license compliance during generation. This lack of transparency risks causing chained hallucinations and compliance violations through the reuse of AIGC. Consequently, a framework to manage the provenance and generation conditions of AIGC is essential. In this paper, we present a framework that automatically attaches structured metadata to AIGC at generation time, including modularized prompts, contexts, thoughts, model information, hyperparameters, and confidence. The metadata is enveloped together with verifiable credentials to support the reliable assessment and reuse of AIGC. This framework enables efficient curation of structured AIGC and facilitates its safe use for applications such as fine-tuning and knowledge distillation.

2605.09281 2026-05-12 cs.LG

TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling

Hongyaoxing Gu, Xinzhe Chen, Lijuan Hu, Fangfang Liu

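AI summary: This paper proposes TileQ, an efficient low-rank quantization method for compressing Mixture-of-Experts (MoE) models. It uses 2D-tiling structured low-rank quantization, sharing low-rank factors across input and output dimensions, and compresses the model without fine-tuning. Experiments show that TileQ substantially reduces additional memory usage and inference latency while preserving state-of-the-art accuracy.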

Abstract

Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet the massive number of expert parameters poses significant challenges for deployment. While low-rank quantization offers a promising route to compress MoE models, existing methods still incur non-negligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ cuts additional memory usage by up to $10\times$ and reduces inference latency to $\sim$5% while preserving state-of-the-art accuracy.

2605.09278 2026-05-12 cs.AI

EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

Yuqiao Meng, Sakshi Sunil Narvekar, Luoxi Tang, Rupali Rajendra Vaje, Yingxue Zhang, Muchao Ye, Zhaohan Xi

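AI summary: Multi-agent debate (MAD) systems rely on shared memory for long-horizon reasoning, which introduces the risk of memory contamination; existing methods depend on heuristics or LLM judgments and struggle to filter errors effectively. This paper models memory updating as a zero-trust game and proposes EquiMem, a mechanism that quantifies the trustworthiness of each memory update at inference time from agents' retrieval queries and traversal paths, without relying on LLM judgments. The method applies to both embedding-based and graph-based memory and shows better protection and robustness across benchmarks and architectures.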

Abstract

Multi-agent debate (MAD) systems increasingly rely on shared memory to support long-horizon reasoning, but this convenience opens a critical vulnerability: a single corrupted entry can contaminate the downstream memory-augmented reasoning, and debate alone fails to filter such errors. Existing safeguards filter entries via heuristics or LLM-based validation, yet they rely on AI judgments that share the same failure modes and overlook the cross-agent dynamics of MAD. We address this gap by formulating memory updating in MAD as a zero-trust memory game, in which no agent is assumed honest and the game's equilibrium serves as an indicator of optimal memory trust. Guided by this equilibrium, we propose EquiMem, an inference-time calibration mechanism that quantifies each update algorithmically against the shared memory state, using agents' existing retrieval queries and traversal paths as evidence rather than soliciting any LLM judgment. EquiMem instantiates calibration for both embedding- and graph-based memory, and across diverse benchmarks, MAD frameworks, and memory architectures, it consistently outperforms existing safeguards, remains robust under adversarial agents, and incurs negligible inference overhead.

2605.09276 2026-05-12 cs.LG cs.CV

Uncertainty-Aware Token Importance Estimation in Spiking Transformers

Wenxuan Liu, Zecheng Hao, Tong Bu, Yuran Wang, Zhaofei Yu

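AI summary: This paper studies how to estimate token importance more accurately in spiking transformers in order to reduce redundant computation and improve inference efficiency. Existing methods rely mainly on response-based cues such as activation magnitude or firing statistics, which do not reflect how a token's uncertainty evolves over time. The authors propose Uncert, a training-free, plug-and-play framework that models token-wise class evidence and analyzes its temporal uncertainty patterns, providing a new basis for token importance estimation. Experiments show good accuracy-efficiency trade-offs on both static and neuromorphic benchmarks, most notably for token pruning.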

Abstract

Spiking transformers have shown strong potential for neuromorphic vision, yet their token processing across multiple spiking steps still introduces substantial redundancy and inference cost. Existing token reduction methods mainly rely on response-based cues, such as activation magnitude, firing statistics, or feature similarity. Although effective, these criteria do not explicitly characterize token importance from the perspective of temporally evolving class evidence. In spiking transformers, token representations are progressively formed across multiple spiking steps rather than determined at a single instant, suggesting that token importance should be evaluated not only by instantaneous responses but also by temporal uncertainty patterns. Our key observation is that tokens exhibit heterogeneous uncertainty trajectories over time, and that their temporally aggregated uncertainty statistics provide an effective cue for distinguishing informative tokens from redundant ones. Motivated by this, we propose Uncert, a training-free, plug-and-play token importance estimation framework for spiking transformers. Specifically, Uncert models token-wise class evidence with a Dirichlet distribution and summarizes each token's temporal uncertainty using its mean and fluctuation across spiking steps, yielding an uncertainty-aware importance score for token reduction during inference. Experiments on both static and neuromorphic benchmarks show that Uncert achieves favorable accuracy and efficiency tradeoffs, with the most consistent gains observed under token pruning. Further analysis reveals a clear empirical connection between temporal uncertainty patterns and token contribution, offering new insights into token dynamics in spiking transformers.
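
As a rough illustration of the uncertainty statistics described above, the sketch below computes a Dirichlet-based uncertainty per spiking step and then its mean and fluctuation over time. It assumes the common evidential convention $\alpha = \text{evidence} + 1$ with uncertainty $u = K / \sum_k \alpha_k$; whether Uncert uses exactly this definition is an assumption, and how the two statistics are combined into an importance score is not stated in the abstract, so no score is defined here. Function names are hypothetical.

```python
import statistics


def dirichlet_uncertainty(evidence):
    """Uncertainty mass of a Dirichlet with alpha = evidence + 1.

    ASSUMPTION: u = K / sum(alpha), the usual evidential-learning convention.
    """
    k = len(evidence)
    return k / (sum(evidence) + k)


def token_uncertainty_stats(evidence_per_step):
    """Mean and fluctuation (population std) of one token's uncertainty
    across spiking steps. `evidence_per_step` is a list of per-class
    non-negative evidence vectors, one per step.
    """
    us = [dirichlet_uncertainty(e) for e in evidence_per_step]
    return statistics.mean(us), statistics.pstdev(us)


if __name__ == "__main__":
    # Hypothetical token with 3 classes observed over 2 spiking steps.
    print(token_uncertainty_stats([[0, 0, 0], [9, 0, 0]]))
```

A token with no evidence has uncertainty 1, and the value shrinks as evidence accumulates, so the trajectory over steps reflects how quickly class evidence forms for that token.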

2605.09275 2026-05-12 cs.LG

DiffATS: Diffusion in Aligned Tensor Space

Jinhua Lyu, Tianmin Yu, Brian Kim, Lizhuo Zhou, Chanwook Park, Naichen Shi

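AI summary: This paper proposes DiffATS, a generative model for efficiently modeling high-resolution spatiotemporal fields. It constructs data-adaptive tensor primitives, which avoids dependence on pretrained compression autoencoders, and addresses the non-uniqueness of factors in tensor decomposition. Using orthogonal Procrustes alignment, the model obtains compact, directly decodable generative representations and achieves strong generation results on images, videos, and PDE solutions, with data compression of up to 210×.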

Abstract

Direct diffusion modeling of high-resolution spatiotemporal fields is computationally challenging. Parameter-efficient primitives address this by representing high-dimensional data with a compact set of parameters. In this paper, we construct data-dependent tensor primitives without pretrained compression autoencoders. Our construction starts from Tucker decomposition, which captures low-rank multilinear structure through a core tensor and mode-wise factors. However, Tucker factors are non-unique: the same tensor can be represented by different rotated factors, which complicates generative modeling. We address this issue with orthogonal Procrustes (OP) alignment. Specifically, we select medoid anchor matrices from the data and align the factor matrices to resolve the gauge ambiguity. This yields matrix Grassmannian primitives and tensor Grassmannian primitives that are compact, data-adaptive, and directly decodable by explicit multilinear reconstruction. Theoretically, we prove that the proposed primitive maps are homeomorphisms between low-rank tensors and their corresponding primitive spaces, certifying that the representations are non-degenerate and topologically faithful. Building on these primitives, we propose *Diffusion in Aligned Tensor Space* (DiffATS), a generative framework that trains diffusion models directly on aligned tensor primitives. Across images, videos, and PDE solutions, DiffATS achieves strong unconditional and conditional generation performance while compressing original data by $3.9\times$ to $210\times$, without relying on any pretrained deep compression autoencoders.

2605.09272 2026-05-12 cs.AI cs.CL cs.CV

Towards Conversational Medical AI with Eyes, Ears and a Voice

Meet Shah, Jason Gusdorf, Anil Palepu, Chunjong Park, Jack W. O'Sullivan, Vishnu Ravi, Tim Strother, Pavel Dubov, Aliya Rysbek, Toshiyuki Fukuzawa, Yana Lunts, Jan Freyberg, Michael B. Chang, Aniruddh Raghu, David Stutz, Devora Berlowitz, Eliseo Papa, Taylan Cemgil, JD Velasquez, Jack Chen, Arthur Chen, Doug Fritz, Charlie Taylor, Katya Tregubova, Jing Rong Lim, Richard Green, Sara Mahdavi, Mahvish Nagda, Jihyeon Lee, Craig Schiff, Liviu Panait, Sukhdeep Singh, Valentin Liévin, David G. T. Barrett, Hannah Gladman, Anna Cupani, Francesca Pietra, Uchechi Okereke, Katherine Tong, Clemens Meyer, Erwan Rolland, Mili Sanwalka, Michael D. Howell, Shixiang Shane Gu, Bibo Xu, Euan A. Ashley, S. M. Ali Eslami, Gregory Wayne, Pushmeet Kohli, Vivek Natarajan, Adam Rodman, Alan Karthikesalingam, Ryutaro Tanno

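AI summary: This study presents AI co-clinician, a new conversational medical AI system that processes audio-visual data from doctor-patient conversations in real time to assist clinical decision-making. Built on Gemini's low-latency audio and video processing, it uses a dual-agent architecture that balances deep clinical reasoning with the low-latency responses needed for natural dialogue. Experiments show that AI co-clinician approaches primary care physicians on several key evaluation dimensions and significantly outperforms GPT-Realtime on the general criteria, but gaps remain in physical examination and disease-specific reasoning, underscoring the importance of audio-visual information in medical consultation.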

Comments Video examples are available on YouTube: https://youtu.be/y5Vaa_SN1t0, https://youtu.be/dC4icb75vLQ, and https://youtu.be/E7iEvWo-E6c

Abstract

The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed "TelePACES" evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study (n = 120 encounters) with 10 internal medicine residents as patient actors, we compared AI co-clinician with primary care physicians (PCPs), GPT-Realtime, and a baseline agent. AI co-clinician approached PCPs in key TelePACES dimensions, including management plans and differential diagnosis, while significantly outperforming GPT-Realtime across all general criteria. While our agent demonstrated parity with PCPs in case-specific triage measures, physicians maintained superior overall performance in case-specific assessments. Although AI co-clinician marks a significant advance in real-time telemedical AI, gaps remain in physical examination and disease-specific reasoning. Our work shows that text-only approaches fail to capture the true challenges of medical consultation and suggests that high-stakes real-time diagnostic AI is most safely advanced in collaborative, triadic models where AI can be a supportive co-clinician for doctors and patients.

2605.09269 2026-05-12 cs.CL cs.CV

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Rui Liu, Dian Yu, Zhenwen Liang, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang

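AI summary: DeltaRubric is a generative approach to reward modeling for multimodal large language models that targets the bias of existing evaluators when judging fine-grained visual details. It decomposes evaluation into a planning step and a verification step: the model dynamically generates an instance-specific checklist and then verifies it against the image and question, improving the accuracy and reliability of evaluation. Experiments show that DeltaRubric markedly improves reward modeling on several benchmarks, validating its effectiveness on multimodal tasks.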

Abstract

Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce DeltaRubric, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Transitioning into a Checklist Verifier, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by +22.6 (4B) and +18.8 (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.

2605.09268 2026-05-12 cs.CL cs.AI

Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs

Aditya Sinha, Harald Steck, Vito Ostuni, Matteo Rinaldi

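AI summary: This paper studies the challenges large language models (LLMs) face in handling context switches in multi-turn dialogue, in particular that models struggle to recognize when a user refines a request or switches topic and tend to carry over irrelevant earlier context. The authors build synthetic benchmarks from real-world datasets and test the zero-shot performance of ten LLMs of different types, finding that only some reasoning-capable or explicitly instructed models detect context switches accurately, while most models show position bias and rely on stale context. The results offer takeaways for improving the long-term robustness of LLMs in multi-turn dialogue.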

Comments Accepted to the ICBINB Workshop @ ICLR 2026

Abstract

Users interacting with Large Language Models (LLMs) in a multi-turn conversation routinely refine their requests or pivot to new topics. LLMs, however, often miss these topic shifts and carry over irrelevant context from previous turns, leading to inaccurate responses. In this paper, we stress-test the multi-turn understanding of LLMs and study the following two sub-tasks: (1) detecting whether the user pivots or refines in the current turn, and (2) shortlisting relevant context from previous turns. To this end, we construct synthetic benchmarks based on real-world datasets from varied domains, so as to simulate context shifts of different levels of difficulty. We then evaluate the zero-shot performance of ten LLMs (open-weight, closed-source, and reasoning), and demonstrate that only some reasoning and strongly instructed LLMs are accurate in detecting pivots; open-weight LLMs struggle with the task and frequently carry stale context even with explicit cues; and all models suffer from a position bias. Based on the results, we discuss key takeaways for improving long-term robustness in multi-turn capabilities for LLMs.

2605.09262 2026-05-12 cs.CV cs.CL

Reinforcing Multimodal Reasoning Against Visual Degradation

Rui Liu, Dian Yu, Haolin Liu, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang

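AI summary: This study addresses the drop in reasoning ability of multimodal large language models under real-world visual degradations (such as blur and compression artifacts) and proposes ROMA, a reinforcement-learning fine-tuning framework. Using a dual-forward-pass strategy, a distributional consistency constraint, and correctness-conditioned regularization, it improves robustness to visual degradation without hurting performance on clean inputs. Experiments show that ROMA outperforms existing methods on several multimodal reasoning benchmarks, improving reasoning accuracy on both seen and unseen degradations.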

Abstract

Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.

2605.09259 2026-05-12 cs.SD cs.AI

Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

Leduo Chen, Junchuan Zhao, Shengchen Li

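AI summary: This paper studies flexible timbre transfer in multi-instrument mixtures, that is, converting the timbre of each voice to a target instrument while preserving the original melody and rhythm. The authors propose MixtureTT, which they describe as the first system to perform per-stem timbre transfer directly from a multi-instrument mixture; it processes all stems simultaneously through a shared diffusion process, avoiding the error accumulation and timbral incoherence of separate-then-transfer pipelines. Experiments show that MixtureTT outperforms single-instrument methods on both objective and subjective metrics, confirming the importance of cross-stem modeling for mixture-level timbre transfer.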

Abstract

Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipelines that propagate source separation artifacts and produce incoherent synthesized timbres across stems. This paper proposes MixtureTT, to the best of our knowledge the first system for flexible per-stem timbre transfer directly from a polyphonic mixture. Given a mixture and a separate timbre reference for each target voice, MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. By modeling the dependencies between per-stem content and cross-stem harmony, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by a factor equal to the number of stems, and yields more coherent multi-stem outputs. Despite operating under a strictly harder input condition, evaluations on the SATB choral dataset show that MixtureTT outperforms single-instrument baselines on both objective and subjective metrics, demonstrating the necessity of dedicated multi-instrument timbre transfer over naive separate-then-transfer pipelines. This work thus confirms that cross-stem modeling is essential for mixture-level timbre transfer, as the proposed joint setting consistently exceeds an equivalent single-stem ablation.

2605.09258 2026-05-12 cs.CV cs.AI

Monocular Biomechanical Tracking of Fingers with Inverse Kinematics to Foundation Models

R. James Cotton, Pouyan Firouzabadi, Wendy Murray

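AI summary: This study addresses accurate tracking of finger biomechanics from monocular video and proposes a method that combines the SAM 3D Body foundation model with inverse kinematics optimization to extract anatomically constrained finger joint angles from single-view video. Porting the model to JAX and integrating it with MuJoCo-MJX enables efficient GPU-accelerated optimization, and a new mapping is established between Momentum Human Rig outputs and biomechanical model markers. Experiments show joint angle errors of about 10 degrees and hand position errors of about 6 mm across a variety of hand poses and object manipulation tasks, with good consistency across viewpoints and robustness, opening a new route to video-based quantitative analysis of hand movement.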

Comments Accepted to EMBC 2026

Abstract

Accurate hand and finger tracking from video has significant clinical applications for monitoring activities of daily living and measuring range of motion, yet monocular video approaches for obtaining hand biomechanics remain under-developed. We present a method that combines the SAM 3D Body foundation model with inverse kinematics optimization in a full-body biomechanical model to extract anatomically-constrained finger joint angles from single-view video. We port SAM 3D Body from PyTorch to JAX for integration with MuJoCo-MJX, enabling GPU-accelerated optimization, and develop a novel mapping between the Momentum Human Rig (MHR) outputs and biomechanical model markers. Validation against 8-camera multiview reconstruction on 4,590 frames from 7 participants performing a variety of hand poses and object manipulation tasks shows finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm, after Procrustes alignment. Results were consistent across camera viewpoints and robust to different methods for producing reference values from multiview video. This work extends monocular biomechanical analysis to detailed finger tracking, expanding access to quantitative characterization of hand movement from readily available video.

2605.09256 2026-05-12 cs.LG cs.AI stat.ML

Improving Generalization by Permutation Routing Across Model Copies

Shuhei Kashiwamura, Timothee Leleu

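AI summary: This paper proposes using the $M$-cover transform to improve the generalization of machine learning models. The method replicates a model $M$ times and uses a structured mixing kernel $Q$ to route model parameters across copies by permutation, so that local learning information is transported between copies instead of relying on conventional parameter averaging or an explicit attractive force. This structured message sharing provides a mechanism for improving generalization and applies to model classes ranging from perceptrons to multilayer perceptrons.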

Abstract

We introduce a use of the \(M\)-cover (or \(M\)-layer) transform for machine learning. The method replicates a model \(M\) times, but instead of coupling the copies through parameter averaging or an explicit attractive force, as in replicated SGD or Elastic SGD, it rewires the contexts in which local learning messages are computed. Each local loss is evaluated on a routed model whose parameters are drawn from different copies according to permutations sampled from a structured mixing kernel \(Q\). Training then uses the original local update rule, while the resulting learning messages are redistributed across the copies through these routed computational paths. Thus \(Q\) defines a topology for message transport and controls the long-loop structure of the lifted factor graph. We formulate this construction for perceptrons, committee machines, and multilayer perceptrons, showing that the same principle applies from discrete models to differentiable neural networks. The resulting framework provides a mechanism for improving generalization through structured message sharing rather than replica collapse or parameter-space coupling.
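
The routing idea can be sketched as follows: for each parameter block, a permutation of the $M$ copies decides which copy supplies that block to each routed model. This is only a schematic reading of the abstract. The mixing kernel used here (identity with probability `1 - mix_prob`, otherwise a uniformly random permutation) is a hypothetical stand-in for the structured kernel $Q$, and the function names are invented for illustration.

```python
import random


def sample_routing(num_copies, num_blocks, mix_prob, rng):
    """Sample one permutation of the copies per parameter block.

    ASSUMPTION: a toy kernel that keeps the identity permutation with
    probability 1 - mix_prob and otherwise shuffles uniformly.
    """
    routing = []
    for _ in range(num_blocks):
        perm = list(range(num_copies))
        if rng.random() < mix_prob:
            rng.shuffle(perm)
        routing.append(perm)
    return routing


def routed_model(copies, routing, m):
    """Assemble routed model m: block b is taken from copy routing[b][m]."""
    return [copies[routing[b][m]][b] for b in range(len(routing))]


if __name__ == "__main__":
    rng = random.Random(0)
    # Three copies of a model with four parameter blocks (labels only).
    copies = [[f"copy{c}-block{b}" for b in range(4)] for c in range(3)]
    routing = sample_routing(num_copies=3, num_blocks=4, mix_prob=1.0, rng=rng)
    for m in range(3):
        print(routed_model(copies, routing, m))
```

Because each block uses a permutation, every copy's block appears in exactly one routed model, so a local update computed on a routed model can be sent back to the copy that supplied each block.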

2605.07922 2026-05-12 cs.LG

Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

Tue M. Cao, Hoang X. Nhat, Raed Alharbi, Phi Le Nguyen, My T. Thai

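AI summary: This paper proposes Tree SAE, a new method for learning hierarchical feature structures in sparse autoencoders. It introduces a new reconstruction condition and combines activation and reconstruction constraints, overcoming the false positives that existing methods produce when semantically unrelated concepts are paired. Experiments show that Tree SAE significantly surpasses existing methods at learning hierarchical feature pairs while remaining competitive with the state of the art on several benchmarks, and that it can be used to analyze the complex hierarchical concept structures in large language models.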

Comments 21 pages

Abstract

Learning hierarchical features in Sparse Autoencoders (SAEs) is essential for capturing the structured nature of real-world data and mitigating issues like feature absorption or splitting. Existing works attempt to identify hierarchical relationships within independent feature sets by relying on activation coverage, the assumption that a child feature should only activate when its parent feature activates. However, we demonstrate that this condition alone is insufficient; that is, it often produces false positives where parent and child concepts are semantically unrelated. To address this, we introduce a novel reconstruction condition that enforces a deeper functional link between hierarchical levels. By combining both activation and reconstruction constraints, we propose the Tree SAE, a model designed to learn hierarchical structures directly from within the feature set. Our results demonstrate that Tree SAEs significantly surpass existing SAEs at learning hierarchical pairs while maintaining performance competitive with the state of the art on several key benchmarks. Finally, we demonstrate the practical utility of our Tree SAE in mapping the geometry of child feature subspaces and uncovering the complex hierarchical concept structures encoded within large language models.
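
The activation coverage condition can be made concrete with a small check: among the samples where the child feature fires, what fraction also have the parent firing? The sketch below is a generic illustration with a hypothetical function name and threshold; it covers only the activation condition that the abstract argues is insufficient on its own, not the reconstruction condition of Tree SAE.

```python
def activation_coverage(parent_acts, child_acts, threshold=0.0):
    """Fraction of child-active samples on which the parent is also active.

    Returns None if the child never activates. A value of 1.0 means the
    child only fires when the parent fires.
    """
    child_active = [c > threshold for c in child_acts]
    n_child = sum(child_active)
    if n_child == 0:
        return None
    both = sum(
        1 for p, c in zip(parent_acts, child_active) if c and p > threshold
    )
    return both / n_child


if __name__ == "__main__":
    parent = [1.0, 1.0, 0.0, 1.0]
    print(activation_coverage(parent, [0.5, 0.0, 0.0, 0.2]))  # child inside parent
    print(activation_coverage(parent, [0.5, 0.0, 0.3, 0.0]))  # child partly outside
```

Full coverage can arise by coincidence, for example when the parent is active on most samples, which is consistent with the false positives the abstract describes.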

2605.07910 2026-05-12 cs.CV

One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction

Yulong Chen, Xiaoyun Dong, Haoyu Zhang, Zongxian Yang, Lewei Xie, Xinke Li, Yifan Zhang, Kai Wang, Jianping Wang

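AI summary: This paper studies the reconstruction of dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data and points out that existing Gaussian Scene Graph methods, which assume synchronized observations, cannot handle the temporal asynchrony between vehicle and infrastructure cameras, causing severe ghosting on dynamic objects. The authors propose a decoupled spatio-temporal Gaussian Scene Graph (DUST) that keeps separate pose trajectories per source for each agent while sharing a unified appearance representation, eliminating cross-source interference and achieving significant gains on the V2X-Seq dataset.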

Abstract

Reconstructing dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data is fundamentally complicated by temporal asynchrony: vehicle and infrastructure cameras operate on independent clocks, capturing the same dynamic agents, such as cars and pedestrians, at different physical times. Existing Gaussian Scene Graph methods implicitly assume synchronized observations and assign a single pose per agent per frame, an assumption that breaks in cooperative settings, where the resulting gradient conflicts cause severe ghosting on dynamic agents. We identify this as a representation-level failure, not an optimization artifact: we prove that any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset. To resolve this, we propose the DUST (DecoUpled Spatio-Temporal) Gaussian Scene Graph for 4D Cooperative Driving Reconstruction. The DUST Gaussian Scene Graph shares a canonical Gaussian set per agent for appearance consistency, while maintaining decoupled pose trajectories aligned to each source's true capture timestamps. We prove that this decoupling makes the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely. To make DUST practical, we further introduce a static anchor-based pose correction pipeline that corrects spatial misalignment between vehicle and infrastructure annotations, and a pose-regularized joint optimization scheme that prevents trajectory jitter and drift during early training. On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance, improving dynamic-area PSNR by 3.2 dB over the strongest baseline and reducing Fréchet Video Distance by 37.7%, while remaining robust under larger temporal asynchrony.

2605.07649 2026-05-12 cs.CV cs.AI cs.RO

Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

Berkehan Ünal, Hauke Dierend, Dren Fazlija, Christopher Plachetka

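AI summary This paper studies how vision-language models (VLMs) can provide zero-shot perception of the Operational Design Domain (ODD) to support safety-critical applications such as automated driving systems. Through experiments on a custom dataset and Mapillary Vistas, the authors evaluate four VLMs on zero-shot classification and detection and analyze the effect of different optimization strategies. The study finds that definition-anchored chain-of-thought prompting combined with persona decomposition performs best, offering a practical route toward transparent and effective ODD perception.

Comments 8 pages, 4 figures

Details
Abstract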

Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.

2605.07579 2026-05-12 cs.LG cs.AI cs.CL

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Yunho Choi, Jongwon Lim, Woojin Ahn, Minjae Oh, Jeonghoon Shim, Yohan Jo

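AI summary This paper proposes POISE, a new method for reinforcement learning with verifiable rewards in large reasoning models. Its core idea is to estimate the baseline from internal-state signals that the policy model already computes during its forward pass, which greatly reduces compute cost. A lightweight probe predicts the verifiable reward from the hidden states and the generated trajectory and is optimized together with the policy during training. Experiments show that POISE performs well on math reasoning tasks, is more compute-efficient than existing methods, and has a value estimator that performs close to a separate large value model.

Comments Under Review; Project Page: https://elijah0430.github.io/poise/

Details
Abstract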

Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the sampling overhead of detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
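
The cross-rollout construction mentioned in the abstract can be illustrated with a toy calculation. The sketch below is only one possible reading, not the authors' implementation: it assumes each prompt has at least two rollouts, that a probe has already produced one scalar value estimate per rollout, and that rollout i borrows its baseline from the next rollout in the list. The function name `cross_rollout_advantages` and the pairing rule are hypothetical.

```python
def cross_rollout_advantages(rewards, probe_values):
    """Toy cross-rollout baseline (assumed pairing rule, not from the paper).

    rewards[i]      -- verifiable reward of rollout i for one prompt
    probe_values[i] -- value predicted by a probe from rollout i's internal states
    Rollout i is scored against the estimate made from rollout (i + 1) % n,
    so its baseline never depends on its own trajectory features.
    """
    n = len(rewards)
    if n < 2 or n != len(probe_values):
        raise ValueError("need at least two rollouts with one probe value each")
    return [rewards[i] - probe_values[(i + 1) % n] for i in range(n)]


if __name__ == "__main__":
    # Two rollouts of the same prompt: one correct (reward 1), one incorrect (reward 0).
    print(cross_rollout_advantages([1.0, 0.0], [0.75, 0.25]))
```

In this toy setting each advantage is a reward minus a baseline computed from a different rollout, which is the separation the abstract relies on to keep the gradient estimate unbiased.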

2605.07399 2026-05-12 cs.CV

GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

Yu Pan, Andi Zhang, Yi Wang, Sibei Yang, Wenjie Wang

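AI summary This paper studies the safety of diffusion vision-language models (dVLMs) under jailbreak attacks and reveals that their apparent robustness to conventional Fixed Prefix Optimization (FPO) attacks is illusory. The authors propose a new jailbreak method based on Global Probability Optimization (GPO), which manipulates the denoising trajectory of diffusion models to bypass guardrails, and further develop GPO-V, the first visual-modality jailbreak framework for dVLMs. Experiments show that GPO-V produces stealthy perturbations that transfer across models, exposing a critical security gap in non-sequential generative architectures and underscoring the urgency of safety alignment for dVLMs.

Details
Abstract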

Diffusion Vision-Language Models (dVLMs), built upon the non-causal foundations of Diffusion Large Language Models (dLLMs), have demonstrated remarkable efficacy in multimodal tasks by departing from the traditional autoregressive generation paradigm. While dVLMs appear inherently robust against conventional jailbreak tactics, which we categorize as Fixed Prefix Optimization (FPO) (e.g., anchoring responses with "Sure, here is"), this perceived resilience is deceptive. Our investigation into the safety landscape of dVLMs reveals a unique refusal pattern: Immediate Refusal and Progressive Refusal. We find that while FPO-based attacks often fail by triggering the latter, the progressive refinement process itself uncovers a novel, latent attack surface. To exploit this vulnerability, we propose Global Probability Optimization (GPO), a general jailbreak paradigm designed specifically for the denoising trajectory of masked diffusion models. Unlike prefix-based methods, GPO manipulates the global generative dynamics to bypass guardrails in diffusion language models. Building on this, we introduce GPO-V, the first visual-modality jailbreak framework tailored for dVLMs. Empirical results demonstrate that GPO-V produces stealthy perturbations with exceptional cross-model transferability, revealing a critical security gap in non-sequential generative architectures. Our findings underscore the critical urgency of addressing safety alignment in dVLMs. These results necessitate an immediate and fundamental re-evaluation of current defense paradigms to mitigate the unique risks of diffusion-based generation. Our code is available at: https://anonymous.4open.science/r/GPO-V-0250.

2605.07357 2026-05-12 cs.AI

GraphReAct: Reasoning and Acting for Multi-step Graph Inference

Xingtong Yu, Zhongwei Kuai, Chang Zhou, Xuanting Xie, Renhe Jiang, Xikun Zhang, Hong Cheng, Xinming Zhang, Yuan Fang

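AI summary This paper proposes GraphReAct, a graph reasoning-acting framework for multi-step graph inference. The method combines topological and semantic information in graph-structured data through two complementary retrieval actions, topological retrieval and semantic retrieval, which dynamically expand the reasoning context, and adds a context refinement action that progressively compresses the accumulated information. Experiments show that GraphReAct outperforms existing methods on six benchmark datasets, validating its effectiveness for graph learning.

Comments Under review

Details
Abstract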

Reasoning-acting frameworks enhance large language models (LLMs) by interleaving reasoning with actions for dynamic information acquisition. However, extending this paradigm to graph learning remains underexplored. Graph data is inherently structured, with information distributed across nodes and edges and encoded through both topology and latent representations. As a result, effective reasoning over graphs requires not only retrieving informative evidence from the graph, but also progressively refining the accumulated context during multi-step inference. In this work, we propose GraphReAct, a graph reasoning-acting framework that enables step-by-step inference over graph-structured data. Specifically, we design a graph-based action space with two complementary retrieval actions: topological retrieval, which captures local structural dependencies, and semantic retrieval, which accesses non-local but relevant evidence in the representation space. These actions dynamically expand the reasoning context. To further support multi-step reasoning, we introduce another type of action, context refinement, which distills and reorganizes accumulated information into a compact representation. By interleaving reasoning with both retrieval and refinement actions, our framework enables a progressive transition from context expansion to compression. Extensive experiments on six benchmark datasets demonstrate that GraphReAct consistently outperforms state-of-the-art methods, validating the effectiveness of reasoning-acting for graph learning.

2605.07237 2026-05-12 cs.CL

Teaching Language Models to Think in Code

Hyeon Hwang, Jiwoo Lee, Jaewoo Kang

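AI summary This paper proposes ThinC, a new framework in which a language model reasons through code rather than treating code as a tool invoked by natural-language instructions. Reasoning proceeds through code blocks linked by their execution results, which reduces the interference and errors of natural-language reasoning. Experiments show that ThinC performs strongly on several competition-level math benchmarks, even surpassing much larger models, and that its reasoning relies heavily on code execution results and is robust.

Comments Preprint

Details
Abstract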

Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.

2605.07203 2026-05-12 cs.CV

From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting

Chamuditha Jayanga Galappaththige, Jason Lai, Timothy Patten, Donald Dansereau, Niko Suenderhauf, Dimity Miller

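AI summary This paper studies scene change detection based on Gaussian Splatting and proposes comparing scenes directly in the space of native Gaussian parameters instead of the conventional render-then-compare approach. By analyzing native Gaussian attributes (position, anisotropic covariance, and color), the authors show that these attributes alone carry sufficient change signal, and they introduce anisotropic models of geometric and photometric drift together with a per-Gaussian observability term to address the under-constrained nature of the representation. The method offers advantages in multi-view consistency and in distinguishing change types, and it outperforms prior methods by about 17% on real-world datasets.

Comments Project Page: https://chumsy0725.github.io/GS-DIFF/

Details
Abstract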

Scene change detection methods built on Gaussian Splatting universally follow a render-then-compare paradigm: the pre-change scene is rendered into 2D and compared against post-change images via pixel or feature residuals. Change detection with Gaussian Splatting has thus been treated as a question about pixels; we treat it as a question about primitives. We provide direct evidence that native primitive attributes alone -- position, anisotropic covariance, and color -- carry sufficient signal for scene change detection. What makes primitive-space comparison hard is the under-constrained nature of the Gaussian Splatting representation: independent optimizations yield primitive solutions whose count, positions, shapes, and colors differ even where nothing has changed. We address this challenge with anisotropic models of geometric and photometric drift, complemented by a per-primitive observability term that reflects the extent to which each Gaussian is constrained by the camera geometry. Operating directly on primitives gives our method, GS-DIFF, two properties that distinguish it from render-then-compare methods. First, change maps are multi-view consistent by construction, where prior work had to learn this through an additional optimization objective. Second, geometric and appearance changes are scored separately, identifying not just where but what kind of change occurred, distinguishing structural changes (e.g., an added object) from surface-level ones (e.g., a color change) without supervision or external model dependencies. On real-world benchmarks, GS-DIFF surpasses the prior state-of-the-art approach by $\sim$17% in mean Intersection over Union.

2605.07202 2026-05-12 cs.AI

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

Dongming Wu, Junwen Li, Ming Lu, Gang Wang, Ting Chen

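AI summary This paper proposes AIDA (Autonomous Insight Discovery Agent), the first end-to-end autonomous exploration framework for complex business environments, aimed at the challenges LLMs face in turning fragmented enterprise data into actionable insights. AIDA builds a highly flexible retail environment with 200+ metrics and 100+ dimensions and integrates a proprietary domain-specific language (DSL) that couples semantic reasoning with precise SQL execution. Through a reinforcement learning system, AIDA models business analysis as a cumulative reasoning process guided by the Pareto Principle; experiments show it significantly outperforms workflow-based agents in environmental perception and in in-depth, multi-perspective analysis.

Details
Abstract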

Transforming fragmented enterprise data into actionable insights remains a significant challenge for LLMs, constrained by complex database schemas, limitations in dynamic SQL generation, and the need for deep multi-dimensional analysis. In this paper, we propose AIDA (Autonomous Insight Discovery Agent), the first end-to-end framework designed for autonomous exploration in complex business environments. We establish a highly flexible instant retail environment encompassing 200+ metrics and 100+ dimensions, and integrate a proprietary Domain-Specific Language (DSL) that bridges semantic reasoning with precise SQL execution. Our reinforcement learning system subsequently formulates business analysis as a Pareto Principle-guided cumulative reasoning process. Experimental results demonstrate that AIDA significantly outperforms workflow-based agents, and extensive evaluations further reveal that AIDA achieves superior environmental perception and more in-depth analysis from diverse perspectives. Our work ultimately establishes the transformative potential of autonomous intelligence for industrial-scale business intelligence systems.

2605.07024 2026-05-12 cs.LG cs.AI

Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Mahdi Erfanian, Nelson Daniel Troncoso, Aashna Garg, Amabel Gale, Xiaoyu Liu, Pareesa Ameneh Golnari, Shengyu Fu

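AI summary This paper proposes Delulu, a verified multi-lingual benchmark for detecting hallucinations in Fill-in-the-Middle (FIM) code generation tasks. An adversarial pipeline generates and filters a high-quality dataset of 1,951 samples covering 7 languages and 4 hallucination types, and Docker containers verify compilation and runtime errors. The evaluation covers 11 open-weight FIM models; even the strongest reaches only 84.5% pass@1, indicating that hallucination in FIM tasks is intrinsically difficult rather than a flaw of a particular model family.

Details
Abstract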

Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks -- plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet introduce runtime errors. We introduce Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. We evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate, plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. We release the benchmark, containers, and evaluation framework at https://github.com/microsoft/delulu.

2605.06969 2026-05-12 cs.CV

Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

Yuchen Guo, Junli Gong, Yao Lu, Xintong Xu, Yiuming Cheung, Weifeng Su

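AI summary This study aims to improve the accuracy of quality assessment for infrared-visible image fusion (IVIF). To address existing methods' over-reliance on hand-crafted features and full-reference metrics, it proposes FuScore, a new assessment method based on a multimodal large language model (MLLM). The method has the MLLM produce continuous quality scores rather than discrete level predictions, enabling fine-grained discrimination among images of similar quality; it builds soft labels from agreement across multiple sub-dimensions and adds a tripartite objective to make the assessment more comprehensive and robust. Experiments show that FuScore achieves state-of-the-art correlation with human visual preferences.

Details
Abstract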

Infrared-Visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches tend to over-optimize both hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, neither leveraging the reasoning of Multimodal Large Language Models (MLLMs) nor encoding per-image perceptual ambiguity in their supervision. Naively introducing MLLMs with discrete one-hot supervision, however, likewise forces fused images of similar quality into different rating levels. To address this, we introduce FuScore, which utilizes an MLLM to mimic human visual perception by producing a continuous quality score, rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective combining per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.

2605.06763 2026-05-12 cs.LG

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

Mohsen Dehghankar, Abolfazl Asudeh

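AI summary This study reformulates sparse attention as a halfspace range searching problem, aiming to improve LLM inference efficiency while guaranteeing full recall of critical key-value entries. To this end, the authors propose Louver, a new index structure that achieves zero false negatives in both theory and practice and is lightweight, easy to integrate, and hardware-optimized. Experiments show that Louver outperforms existing sparse attention methods in accuracy and runtime and is even faster than highly optimized dense attention implementations.

Details
Abstract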

Sparse attention improves LLM inference efficiency by selecting a subset of key-value entries, but at the cost of potential accuracy degradation. In particular, omitting critical KV entries can induce substantial errors in model outputs. Existing methods typically operate under fixed or adaptive token budgets and provide empirical robustness or partial theoretical guarantees, yet they do not ensure zero false negatives in decoding steps, particularly since the set of relevant tokens is both query- and step-dependent. Our empirical observations confirm that missing even one critical key can lead to sharp error spikes, especially in long reasoning tasks where the set of important tokens varies throughout decoding. This observation motivates the need for indexing methods that dynamically adapt to these variations across decoding steps while guaranteeing a full recall of the relevant keys above a certain threshold. We address this challenge by reformulating sparse attention as the halfspace range searching problem. However, existing range searching indices are not suitable for modern LLM inference due to their computational and implementation overheads. To overcome this, we introduce Louver, a novel index structure tailored for efficient KV cache retrieval. Louver (i) guarantees zero false negatives with respect to a specified threshold in both theory and practice, (ii) is lightweight to integrate into existing LLM pipelines, and (iii) incorporates hardware-aware optimizations for both CPU and GPU executions. Our experiments demonstrate that Louver outperforms prior sparse attention methods in both accuracy and runtime, and is faster than highly optimized dense attentions such as FlashAttention. These results highlight that recall guarantees are a critical and overlooked dimension of sparse attention, and open a new direction for building theoretically grounded, efficient KV cache indices.
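
The halfspace range searching view can be made concrete with a small example: for a query q and threshold tau, the goal is to return every key k with q·k >= tau, without scanning keys that provably cannot qualify. The sketch below is not Louver; it is a deliberately simple block index, assumed here for illustration, that prunes with the Cauchy-Schwarz bound q·k <= |q||k|. The names `build_blocks` and `halfspace_query` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_blocks(keys, block_size):
    """Group keys into fixed-size blocks and record each block's largest key norm."""
    blocks = []
    for start in range(0, len(keys), block_size):
        ids = list(range(start, min(start + block_size, len(keys))))
        max_norm = max(math.sqrt(dot(keys[i], keys[i])) for i in ids)
        blocks.append((ids, max_norm))
    return blocks


def halfspace_query(keys, blocks, q, tau):
    """Return (hits, scanned): indices with dot(q, key) >= tau and the number of keys scored."""
    q_norm = math.sqrt(dot(q, q))
    hits, scanned = [], 0
    for ids, max_norm in blocks:
        # Cauchy-Schwarz: dot(q, k) <= |q| * |k| <= |q| * max_norm for every k in the block,
        # so a block whose bound is below tau cannot contain a qualifying key.
        # (A production index would add a small slack for floating-point rounding.)
        if q_norm * max_norm < tau:
            continue
        for i in ids:
            scanned += 1
            if dot(q, keys[i]) >= tau:
                hits.append(i)
    return hits, scanned


if __name__ == "__main__":
    keys = [[0.1, 0.0], [0.0, 0.2], [3.0, 4.0], [1.0, 0.0], [0.05, 0.05], [0.1, 0.1]]
    blocks = build_blocks(keys, block_size=2)
    print(halfspace_query(keys, blocks, q=[1.0, 1.0], tau=1.0))
```

Pruning only on an upper bound is what yields zero false negatives in this toy index: a skipped block can never hide a key above the threshold, at the cost of sometimes scanning blocks that contain no hits.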

2605.06681 2026-05-12 cs.LG cs.CV

A Hierarchical Ensemble Pipeline for Anomaly Detection in ESA Satellite Telemetry

Lorenzo Riccardo Allegrini, Geremia Pompei

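AI summary This paper proposes a hierarchical ensemble pipeline for anomaly detection in European Space Agency (ESA) satellite telemetry data. The method combines shapelet extraction, statistical feature analysis, per-channel modeling, intra-channel stacking, and cross-channel aggregation, and is trained and validated with time-series cross-validation and a two-level masking strategy to prevent information leakage. Results show strong generalization on the ESA-ADB benchmark and effective detection of subtle anomalies in realistic satellite telemetry.

Comments 15 pages, 3 figures, 1 table. Submitted to the ML4ITS workshop at the ECML PKDD 2025 conference. Awarded 2nd place in the final round of the Spacecraft Anomaly Challenge on ESA dataset. (Ranked 1st on the Kaggle public leaderboard and 3rd on the private leaderboard)

Details
Journal ref
Communications in Computer and Information Science 2842 (2026) Chapter 7
Abstract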

A hierarchical ensemble pipeline is introduced to address anomaly detection in multivariate telemetry data provided by the European Space Agency (ESA). The method integrates shapelet-based and statistical feature extraction, per-channel modeling, intra-channel stacking, and a final cross-channel aggregation. The pipeline is trained and validated using time-series cross-validation and two-level masking strategies to prevent information leakage. Results on the European Space Agency Anomaly Detection Benchmark (ESA-ADB) challenge demonstrate strong generalization, highlighting the effectiveness of hierarchical modeling in detecting subtle anomalies in realistic satellite telemetry.
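
Time-series cross-validation, as mentioned in the abstract, keeps every validation sample strictly later than the training samples so that no future information leaks into training. The sketch below shows a generic expanding-window split; the abstract does not specify the exact scheme or the masking strategy, so the fold layout and the name `expanding_window_splits` are assumptions made for illustration.

```python
def expanding_window_splits(n_samples, n_folds):
    """Yield (train_indices, val_indices) pairs with the training window growing over time.

    Samples are assumed to be ordered chronologically. Each validation block starts
    where its training window ends, so validation data is always in the future.
    """
    fold = n_samples // (n_folds + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of folds")
    splits = []
    for k in range(1, n_folds + 1):
        train_end = k * fold
        # The last fold absorbs any leftover samples at the end of the series.
        val_end = n_samples if k == n_folds else train_end + fold
        splits.append((list(range(train_end)), list(range(train_end, val_end))))
    return splits


if __name__ == "__main__":
    for train, val in expanding_window_splits(n_samples=10, n_folds=3):
        print(train, val)
```

A shuffled k-fold split would mix past and future samples in the same training set, which is the kind of leakage this ordering avoids.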

2605.06663 2026-05-12 cs.CL

EMO: Pretraining Mixture of Experts for Emergent Modularity

Ryan Wang, Akshita Bhagia, Sewon Min

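AI summary This paper proposes EMO, a pretrained Mixture-of-Experts (MoE) model designed for modular deployment, so that different tasks can use expert subsets independently or in combination without human-defined priors. EMO encourages content from similar domains to use similar experts and relies on document boundaries during pretraining, so that expert groupings emerge naturally during training. Experiments show that EMO matches standard MoE performance as a full model while substantially reducing the number of experts needed at inference, and that its expert subsets specialize at the semantic level (e.g., math or code) rather than at the syntactic level seen in standard MoEs.

Details
Abstract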

Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts models (MoEs) seemingly offer a potential alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity (the independent use and composition of expert subsets) without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint enables coherent expert groupings to emerge during pretraining using document boundaries alone. We pretrain a 1B-active, 14B-total EMO on 1T tokens. As a full model, it matches standard MoE performance. Crucially, it enables selective expert use: retaining only 25% (12.5%) of experts incurs just a 1% (3%) absolute drop, whereas standard MoEs break under the same setting. We further find that expert subsets in EMO specialize at semantic levels (e.g., domains such as math or code), in contrast to the low-level syntactic specialization observed in standard MoEs. Altogether, our results demonstrate a path toward modular, memory-efficient deployment of large, sparse models and open new opportunities for composable architectures.
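
The document-level constraint described in the abstract, where all tokens of a document choose experts from one shared pool, can be sketched as follows. The abstract does not say how the pool is chosen, so this example assumes the pool is the set of experts with the highest summed router scores over the document; that rule, the function name `route_document`, and the parameters `pool_size` and `top_k` are hypothetical.

```python
def route_document(router_scores, pool_size, top_k):
    """Route the tokens of one document through a shared expert pool.

    router_scores[t][e] is the router score of expert e for token t.
    Returns (pool, assignments): the shared pool and, per token, the chosen experts.
    """
    num_experts = len(router_scores[0])
    # Assumed pool rule: experts with the largest total score across the document.
    totals = [sum(tok[e] for tok in router_scores) for e in range(num_experts)]
    pool = sorted(range(num_experts), key=lambda e: -totals[e])[:pool_size]
    assignments = []
    for tok in router_scores:
        # Each token picks its top-k experts, but only among the shared pool.
        chosen = sorted(pool, key=lambda e: -tok[e])[:top_k]
        assignments.append(sorted(chosen))
    return sorted(pool), assignments


if __name__ == "__main__":
    scores = [
        [0.9, 0.1, 0.5, 0.0],
        [0.2, 0.1, 0.8, 0.1],
        [0.6, 0.0, 0.3, 0.9],  # expert 3 scores highest here but is outside the pool
    ]
    print(route_document(scores, pool_size=2, top_k=1))
```

The third token illustrates the constraint: its individually preferred expert is excluded because the document as a whole favors other experts, which is what lets a domain later be served by a small expert subset.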