arXivDaily: daily arXiv digest, updated Monday through Friday
2603.29475 2026-05-14 cs.LG

Survival In-Context: Amortized Bayesian Survival Analysis via Prior-Fitted Networks

Dmitrii Seletkov, Paul Hager, Georgios Kaissis, Rickmer Braren, Daniel Rueckert, Raphael Rehms

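Affiliations: Institute of Diagnostic and Interventional Radiology, Technical University of Munich, Germany; Chair for AI in Healthcare and Medicine, Technical University of Munich, Germany; Hasso Plattner Institute for Digital Engineering, University of Potsdam, Germany; University Hospital Hamburg-Eppendorf, Germany; Department of Computing, Imperial College London, UK; Munich Center for Machine Learning (MCML), Germany

AI summary: This paper proposes Survival In-Context (SIC), a prior-fitted survival analysis model that targets the small sample sizes, censoring, and covariate heterogeneity that complicate survival analysis in medicine and related fields. The method builds a controllable framework for generating a survival prior and pretrains on synthetic data, yielding individualized survival predictions without task-specific training or hyperparameter tuning. Experiments show that SIC is competitive with or better than classical and deep survival models on multiple real-world survival datasets, particularly small and medium-sized ones, which illustrates the potential of the prior-fitted paradigm for survival analysis.

Abstract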

Survival analysis is crucial for many medical applications, but remains challenging for modern machine learning due to limited data, censoring, and the heterogeneity of tabular covariates. While the prior-fitted paradigm, which relies on pretraining models on large collections of synthetic datasets, has recently facilitated tabular foundation models for classification and regression, its suitability for time-to-event modeling remains unclear. We propose a flexible survival data generation framework that defines a rich survival prior with explicit control over covariates and time-event distributions. Building on this prior, we introduce Survival In-Context (SIC), a prior-fitted in-context learning model for survival analysis that is pretrained exclusively on synthetic data. SIC is trained to approximate Bayesian posterior predictive inference under the synthetic survival prior, enabling individualized survival prediction in a single forward pass, requiring no task-specific training or hyperparameter tuning. Across a broad evaluation on real-world survival datasets, SIC achieves competitive or superior performance compared to classical and deep survival models, particularly in small and medium-sized data regimes, highlighting the promise of a prior-fitted paradigm for survival analysis. The code and pretrained models will be made available upon publication.

2603.27910 2026-05-14 cs.AI cs.IR cs.MA

GAAMA: Graph Augmented Associative Memory for Agents

Swarna Kamal Paul, Shubhendu Sharma, Nitin Sareen

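Affiliations: Nagarro

AI summary: GAAMA is a graph-augmented associative memory system for agents that addresses long-term memory retention across multi-session interactions. It builds a structured knowledge graph of episode, fact, reflection, and concept nodes and combines cosine-similarity retrieval with edge-type-aware Personalized PageRank, avoiding the loss of structural relationships and the mega-hub effect seen in earlier approaches. Experiments show that GAAMA outperforms existing methods on multiple tasks, with the advantage most pronounced in long dialogues.

Abstract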

AI agents that interact with users across multiple sessions require persistent long-term memory to maintain coherent, personalized behavior. Current approaches either rely on flat retrieval-augmented generation (RAG), which loses structural relationships among memories, or use entity-centric knowledge graphs that suffer from mega-hub effects in conversational data, diluting graph-based relevance propagation. We propose GAAMA, a graph-augmented associative memory for agents that constructs a concept-mediated knowledge graph through a three-step pipeline: (1) verbatim episode preservation, (2) LLM-based extraction of atomic facts and topic-level concept nodes, and (3) synthesis of higher-order reflections. The resulting graph uses four node types (episode, fact, reflection, concept) connected by five structural edge types, with concept nodes providing cross-cutting traversal paths that avoid the mega-hub problem of entity-centric designs. Retrieval combines cosine-similarity-based k-nearest neighbor search with edge-type-aware Personalized PageRank (PPR) through an additive scoring function. We further introduce GRAFT (Graph Repair by Augmenting Facts & Topology), a post-retrieval corrective layer that diagnoses retrieval failures and surgically repairs the knowledge graph. On LoCoMo-10 (1,540 questions, 10 multi-session conversations), GAAMA achieves 79.1% mean reward, a +4.2 pp improvement over a tuned RAG baseline, the strongest comparator. On MemoryArena, GAAMA outperforms full-context baselines across three tasks, Group Travel (+0.4 pp), Web Shopping (+3.4 pp), and Progressive Search (+0.7 pp), with advantages growing monotonically with dialogue length. Notably, GAAMA delivers consistent performance across all categories, matching the best competing method in each, whereas every competitor degrades in at least one category.
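
The retrieval step described in this abstract (cosine k-nearest-neighbor seeds plus edge-type-aware Personalized PageRank, combined additively) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the GAAMA implementation: the node names, the two edge types and their weights, the restart probability `alpha`, and the mixing weight `lam` are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def personalized_pagerank(nodes, edges, type_weight, seeds, alpha=0.15, iters=50):
    """Power iteration with restart; each edge is weighted by its type (assumed weights)."""
    neighbors = {n: [] for n in nodes}
    for src, dst, etype in edges:          # edges are treated as undirected here
        w = type_weight[etype]
        neighbors[src].append((dst, w))
        neighbors[dst].append((src, w))
    total_seed = sum(seeds.values()) or 1.0
    restart = {n: seeds.get(n, 0.0) / total_seed for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            total_w = sum(w for _, w in neighbors[n])
            if total_w == 0.0:             # isolated node: return its mass to the seeds
                for m in nodes:
                    nxt[m] += (1 - alpha) * rank[n] * restart[m]
                continue
            for m, w in neighbors[n]:
                nxt[m] += (1 - alpha) * rank[n] * w / total_w
        rank = nxt
    return rank

def retrieve(query, embeddings, edges, type_weight, k=2, lam=1.0):
    """Additive score: cosine similarity + lam * PPR seeded at the top-k similar nodes."""
    sims = {n: cosine(query, e) for n, e in embeddings.items()}
    top_k = sorted(sims, key=sims.get, reverse=True)[:k]
    seeds = {n: max(sims[n], 0.0) for n in top_k}
    ppr = personalized_pagerank(list(embeddings), edges, type_weight, seeds)
    scores = {n: sims[n] + lam * ppr[n] for n in embeddings}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy memory graph.
embeddings = {
    "episode:trip": [1.0, 0.0],
    "fact:likes_hiking": [0.8, 0.6],
    "concept:travel": [0.0, 1.0],
    "fact:isolated": [0.0, 1.0],
}
edges = [
    ("episode:trip", "fact:likes_hiking", "derived_from"),
    ("fact:likes_hiking", "concept:travel", "about"),
]
type_weight = {"derived_from": 1.0, "about": 0.5}
ranking = retrieve([1.0, 0.0], embeddings, edges, type_weight)
```

In this toy graph the concept node and the isolated fact have the same (zero) cosine similarity to the query, but only the concept node is reachable from the seeds, so the graph term ranks it higher.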

2603.24649 2026-05-14 cs.CV

MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows

Weixiang Shen, Chengzhi Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu, Xiao Han, Zongyue Li, Jingpei Wu, Min Xu, Daguang Xu, Yueming Jin, Benedikt Wiestler, Daniel Rueckert, Jiazhen Pan

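Affiliations: Technical University of Munich; TUM University Hospital; LMU Munich; Imperial College London; University of Oxford; Carnegie Mellon University; NVIDIA; National University of Singapore; Munich Center for Machine Learning

AI summary: The study argues that current medical imaging benchmarks focus too heavily on pre-selected 2D images and fail to reflect the complex tasks of real clinical workflows. The authors therefore introduce MedFlowBench, a full-study benchmark for medical imaging, and MedOpenClaw, a controlled runtime for medical imaging software in which vision-language models are evaluated on complete studies. Experiments show that scoring the final answer alone overestimates model performance: in realistic tasks a model must also produce auditable evidence to complete complex workflows correctly.

Comments: 33 pages

Abstract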

Medical imaging benchmarks often evaluate VLMs on pre-selected 2D images, slices, crops, or patches, making evaluation closer to visual recognition. Real clinical workflows impose a different burden: readers must search through complete studies, operate imaging software, navigate across slices and magnifications, and document visual evidence that can be audited. We argue that this evidence-producing workflow is a critical missing evaluation axis for medical imaging agents. To study it, we introduce MedFlowBench, a full-study benchmark for VLM agents, together with MedOpenClaw, a controlled and replayable runtime in which agents operate medical imaging viewers such as 3D Slicer and QuPath. In each episode, an agent inspects a complete radiology study or whole-slide pathology image, returns a task answer, and submits structured evidence, including key slices, coordinates, regions of interest, or lesion-state fields. This evidence is automatically checked against withheld masks, annotations, and labels. Across evaluated models, final answer-only scoring gives an overly optimistic picture: when answers must also be supported by correct evidence, performance drops substantially on complex workflows. We further find that adding image-analysis tools does not by itself solve the problem. Tools help when they make a complex procedure simple and reliable, but agents still struggle when they must choose inputs, manage viewer state, and verify intermediate outputs over multiple steps. MedFlowBench exposes whether medical imaging agents can produce auditable evidence from complete studies, rather than plausible answers from selected images.
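
To make the gap between answer-only and evidence-gated scoring concrete, here is a small sketch. The `Episode` fields, the bounding-box representation of the withheld annotation, and the point-in-box check are assumptions made for illustration; the benchmark's actual evidence schema and checkers are not described at this level of detail in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    predicted_answer: str
    true_answer: str
    evidence_point: tuple   # (slice, row, col) submitted by the agent (hypothetical format)
    lesion_box: tuple       # withheld ground truth: (s0, s1, r0, r1, c0, c1), inclusive

def answer_correct(ep):
    return ep.predicted_answer == ep.true_answer

def evidence_correct(ep):
    """The submitted point must fall inside the withheld lesion box."""
    s, r, c = ep.evidence_point
    s0, s1, r0, r1, c0, c1 = ep.lesion_box
    return s0 <= s <= s1 and r0 <= r <= r1 and c0 <= c <= c1

def score(episodes):
    """Return (answer-only accuracy, accuracy when evidence must also be correct)."""
    n = len(episodes)
    answer_only = sum(answer_correct(e) for e in episodes) / n
    gated = sum(answer_correct(e) and evidence_correct(e) for e in episodes) / n
    return answer_only, gated

episodes = [
    Episode("malignant", "malignant", (12, 40, 55), (10, 14, 30, 60, 50, 70)),
    Episode("benign", "benign", (3, 10, 10), (2, 5, 5, 20, 5, 20)),
    Episode("malignant", "malignant", (30, 5, 5), (10, 14, 30, 60, 50, 70)),  # right answer, wrong location
    Episode("benign", "malignant", (11, 45, 60), (10, 14, 30, 60, 50, 70)),   # wrong answer
]
answer_only_acc, evidence_gated_acc = score(episodes)
```

The third episode is the interesting case: the answer is right but the cited location is outside the lesion, so it counts under answer-only scoring and not under evidence-gated scoring.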

2603.24002 2026-05-14 cs.LG

Stochastic Dimension-Free Zeroth-Order Estimator for High-Dimensional and High-Order PINNs

Zhangyong Liang, Huanhuan Gao

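Affiliations: Tianjin University National Center for Applied Mathematics; Jilin University School of Mechanical and Aerospace Engineering

AI summary: This paper targets the excessive computational complexity and memory consumption of training high-dimensional, high-order physics-informed neural networks (PINNs) and proposes SDZE, a dimension-free zeroth-order estimator. The method uses Common Random Numbers Synchronization to eliminate the variance explosion in zeroth-order optimization, and combines it with an implicit matrix-free subspace projection that substantially reduces both the variance of parameter exploration and the memory footprint. Experiments show that SDZE can efficiently train 10-million-dimensional PINNs on a single GPU, with large gains in speed and memory efficiency.

Comments: arXiv admin note: text overlap with arXiv:2412.00088, arXiv:2410.08989, arXiv:2307.12306 by other authors

Abstract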

Physics-Informed Neural Networks (PINNs) for high-dimensional and high-order partial differential equations (PDEs) are primarily constrained by the $\mathcal{O}(d^k)$ spatial derivative complexity and the $\mathcal{O}(P)$ memory overhead of backpropagation (BP). While randomized spatial estimators successfully reduce the spatial complexity to $\mathcal{O}(1)$, their reliance on first-order optimization still leads to prohibitive memory consumption at scale. Zeroth-order (ZO) optimization offers a BP-free alternative; however, naively combining randomized spatial operators with ZO perturbations triggers a variance explosion of $\mathcal{O}(1/\varepsilon^2)$, leading to numerical divergence. To address these challenges, we propose the Stochastic Dimension-free Zeroth-order Estimator (SDZE), a unified framework that achieves dimension-independent complexity in both space and memory. Specifically, SDZE leverages Common Random Numbers Synchronization (CRNS) to algebraically cancel the $\mathcal{O}(1/\varepsilon^2)$ variance by locking spatial random seeds across perturbations. Furthermore, an implicit matrix-free subspace projection is introduced to reduce parameter exploration variance from $\mathcal{O}(P)$ to $\mathcal{O}(r)$ while maintaining an $\mathcal{O}(1)$ optimizer memory footprint. Empirical results demonstrate that SDZE enables the training of 10-million-dimensional PINNs on a single NVIDIA A100 GPU, delivering significant improvements in speed and memory efficiency over state-of-the-art baselines.
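
The effect of locking random seeds across the two perturbed evaluations can be seen in a toy experiment. The sketch below is not the SDZE estimator: the quadratic Monte Carlo loss, the fixed perturbation direction, and all function names and constants are assumptions chosen only to show why a two-point zeroth-order difference has variance on the order of $1/\varepsilon^2$ with independent sampling and bounded variance with common random numbers.

```python
import random
import statistics

def stochastic_loss(theta, seed, batch=32):
    """Monte Carlo loss: mean of (z . theta)^2 over `batch` Gaussian draws z."""
    rng = random.Random(seed)  # the seed fixes which random points are sampled
    total = 0.0
    for _ in range(batch):
        proj = sum(t * rng.gauss(0.0, 1.0) for t in theta)
        total += proj * proj
    return total / batch

def zo_directional_estimate(theta, u, eps, seed_plus, seed_minus):
    """Two-point finite difference of the stochastic loss along direction u."""
    plus = [t + eps * d for t, d in zip(theta, u)]
    minus = [t - eps * d for t, d in zip(theta, u)]
    return (stochastic_loss(plus, seed_plus) - stochastic_loss(minus, seed_minus)) / (2.0 * eps)

def estimator_variance(eps, shared_seed, trials=200, dim=8):
    """Empirical variance of the estimate with shared or independent sampling seeds."""
    theta = [1.0] * dim
    u = [dim ** -0.5] * dim
    estimates = []
    for t in range(trials):
        seed_plus = 2 * t
        seed_minus = seed_plus if shared_seed else 2 * t + 1
        estimates.append(zo_directional_estimate(theta, u, eps, seed_plus, seed_minus))
    return statistics.pvariance(estimates)

var_shared = estimator_variance(1e-3, shared_seed=True)        # common random numbers
var_independent = estimator_variance(1e-3, shared_seed=False)  # fresh samples per evaluation
```

With a shared seed the sampling noise is identical in both evaluations and cancels in the difference, so only the $\varepsilon$-dependent signal is divided by $2\varepsilon$. With independent seeds the full sampling noise of the loss is divided by $2\varepsilon$, which is what blows up as $\varepsilon$ shrinks.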

2603.23777 2026-05-14 cs.RO cs.AI cs.SY eess.SY

Human-in-the-Loop Pareto Optimization: Trade-off Characterization for Assist-as-Needed Training and Performance Evaluation

Harun Tolasa, Volkan Patoglu

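Affiliations: Faculty of Engineering and Natural Sciences

AI summary: In human motor skill training and rehabilitation there is an inherent trade-off between task difficulty and user performance, and characterizing it accurately is essential for evaluating performance and designing assist-as-needed (AAN) protocols. This paper proposes a human-in-the-loop Pareto optimization method that combines a quantitative performance metric with a qualitative challenge metric to characterize, systematically and efficiently, the trade-off between task performance and perceived challenge level. A user study and three use cases show that the method can be used to design and evaluate AAN training protocols and to fairly assess individual training progress and between-user performance differences under different assistance levels.

Comments: Under review for publication in IEEE Transactions on Haptics

Abstract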

During human motor skill training and physical rehabilitation, there is an inherent trade-off between task difficulty and user performance. Characterizing this trade-off is crucial for evaluating user performance, designing assist-as-needed (AAN) protocols, and assessing the efficacy of training protocols. In this study, we propose a novel human-in-the-loop (HiL) Pareto optimization approach to characterize the trade-off between task performance and the perceived challenge level of motor learning or rehabilitation tasks. We adapt Bayesian multi-criteria optimization to systematically and efficiently perform HiL Pareto characterizations. Our HiL optimization employs a hybrid model that measures performance with a quantitative metric, while the perceived challenge level is captured with a qualitative metric. We demonstrate the feasibility of the proposed HiL Pareto characterization through a user study. Furthermore, we present the utility of the framework through three use cases in the context of a manual skill training task with haptic feedback. First, we demonstrate how the characterized trade-off can be used to design a sample AAN training protocol for a motor learning task and to evaluate the group-level efficacy of the proposed AAN protocol relative to a baseline adaptive assistance protocol. Second, we demonstrate that individual-level comparisons of the trade-offs characterized before and after the training session enable fair evaluation of training progress under different assistance levels. This evaluation method is more general than standard performance evaluations, as it can provide insights even when users cannot perform the task without assistance. Third, we show that the characterized trade-offs also enable fair performance comparisons among different users, as they capture the best possible performance of each user under all feasible assistance levels.
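
As a rough illustration of what a characterized trade-off looks like, the sketch below extracts the non-dominated (Pareto-optimal) set from a handful of measured operating points. It covers only this final filtering step, not the Bayesian multi-criteria optimization that chooses which assistance levels to test. The data, the two-tuple layout `(performance, perceived_challenge)`, and the convention that both values are to be maximized are assumptions for illustration.

```python
def dominates(a, b):
    """a dominates b if it is at least as good in every objective and better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical measurements at different assistance levels:
# (normalized task performance, perceived challenge rating).
measurements = [(0.9, 1), (0.8, 2), (0.7, 2), (0.6, 4), (0.5, 3), (0.3, 5)]
front = pareto_front(measurements)
```

Here `(0.7, 2)` and `(0.5, 3)` drop out because another operating point offers at least the same challenge with better performance; the remaining points form the trade-off curve.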

2603.22273 2026-05-14 cs.LG

Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

Zakaria Mhammedi, James Cohan

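Affiliations: Google Research, NYC

AI summary: This paper proposes a new approach that decouples exploration from policy optimization to tackle hard exploration in reinforcement learning. It uses an uncertainty-guided tree search that does not rely on the conventional reinforcement learning framework during exploration, which markedly improves exploration efficiency. Experiments show strong results on several hard-exploration tasks, and the exploration trajectories can be turned into high-performing policies through supervised learning, without domain knowledge or expert demonstrations.

Abstract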

The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new approach that explicitly decouples exploration from policy optimization and bypasses RL entirely during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard exploration benchmarks. Further, we demonstrate that the trajectories discovered during exploration can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art performance by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.

2603.22267 2026-05-14 cs.CL cs.AI eess.AS

TiCo: Time-Controllable Spoken Dialogue Model

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass

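Affiliations: MIT (Massachusetts Institute of Technology); NTU (National Taiwan University); NTU AI-CoRE

AI summary: This paper presents TiCo, a time-controllable spoken dialogue model that generates spoken responses of controllable duration in response to time-constrained instructions (e.g., "generate a response of about 15 seconds"). To address the lack of time awareness in existing models, the study introduces TiCo-Bench, the first benchmark for time controllability, and uses Spoken Time Markers (STM) to help the model estimate elapsed time during generation and adjust its content to meet the target duration. Experiments show that TiCo, post-trained efficiently without question-answer paired data through self-generation and reinforcement learning with verifiable reward, substantially improves duration control accuracy while preserving response quality.

Abstract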

We introduce TiCo, a time-controllable spoken dialogue model (SDM) that follows time-constrained instructions (e.g., "Please generate a response lasting about 15 seconds") and generates spoken responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions. To systematically evaluate this, we introduce TiCo-Bench, the first benchmark for time-controllable instruction following in SDMs, on which existing open-source and commercial models frequently fail to satisfy explicit time constraints. TiCo addresses this limitation by enabling an SDM to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is post-trained efficiently without question-answer paired data, relying on self-generation and reinforcement learning with verifiable reward. Experimental results show that TiCo reduces duration error by 2.7x over its backbone and 1.6x over the strongest baseline, while preserving response quality.

2603.19185 2026-05-14 cs.LG

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori, John Jewell, Sana Ayromlou, Wei Pang, Veronica Chatrath, Gauri Sharma, Deval Pandya

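Affiliations: Vector Institute

AI summary: This paper examines the privacy protection offered by synthetic tabular data generated with diffusion models, in particular its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, the challenge explored multiple target models for MIAs and led to black-box and white-box attacks tailored to these diffusion models, providing a comprehensive experimental basis for assessing their privacy efficacy. The work offers a useful reference for understanding the potential and limits of generative models with respect to privacy.

Comments: 4 pages, 1 table

Abstract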

Synthetic data is often perceived as a silver-bullet solution to data anonymization and privacy-preserving data publishing. Drawn from generative models like diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Recent developments in diffusion models have been effective on a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored. The MIDST challenge sought a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, multiple target models were explored for MIAs, including diffusion models for single tables of mixed data types and multi-relational tables with interconnected constraints. As a key outcome, MIDST inspired the development of novel black-box and white-box MIAs tailored to these target diffusion models, enabling a comprehensive evaluation of their privacy efficacy. The MIDST GitHub repository is available at https://github.com/VectorInstitute/MIDST

2603.05093 2026-05-14 cs.LG cs.AI cs.CV

From Baselines to Transport Geodesics: Axiomatic Attribution via Optimal Generative Flows

Cenwei Zhang, Lin Zhu, Manxi Lin, Lei You

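Affiliations: Shanghai Jiao Tong University; Aalto University; Alibaba; Technical University of Denmark

AI summary: This paper studies the choice of path in feature attribution and proposes an attribution method based on optimal generative flows. Rather than relying on hand-designed paths or the sensitivity geometry of the model, the authors select the attribution path from the data-generating process by minimizing kinetic action over the transport, which yields more stable and structured explanations. They prove that the Aumann-Shapley integral is the unique attribution rule for a fixed path and approximate the ideal with Rectified Flow and related methods; experiments show that the new method improves attribution stability while keeping deletion faithfulness competitive.

Comments: 10 figures, 31 pages

Abstract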

Feature attributions often hide a critical modeling choice: they explain a prediction along a counterfactual path from a reference state to an input. Different baselines, interpolations, and generative trajectories define different paths and can therefore produce different explanations. We study this path ambiguity as a modeling problem. Our central question is whether the path can be chosen by the data-generating transport process, rather than by a hand-designed interpolation or by the sensitivity geometry of the model being explained. We separate attribution into fixed-path credit allocation and path selection. For a fixed path, we prove that the Aumann-Shapley line integral is the unique attribution rule under standard fixed-path axioms and explicit coordinate-trace regularity. For path selection, we minimize kinetic action over flows that transport a reference distribution to the data distribution, yielding a transport-geodesic attribution principle. We approximate this ideal with Rectified Flow and Reflow and derive stability bounds linking vector-field error to attribution error. Experiments show that lower-action, transport-consistent paths produce more stable and structured explanations, preserving competitive deletion faithfulness, without claiming data-manifold membership. Our code is available at https://github.com/cenweizhang/OTFlowSHAP.
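
The path ambiguity described in this abstract can be reproduced on a two-variable toy model. The sketch below numerically evaluates a fixed-path line-integral attribution, $a_i = \int_0^1 \partial_i f(\gamma(t))\,\gamma_i'(t)\,dt$, along two different paths with the same endpoints. The model `f(x) = x0*x1 + x0^2`, its hand-written gradient, and both paths are assumptions for illustration; this is not the paper's transport-geodesic path and involves no generative flow.

```python
def model(x):
    return x[0] * x[1] + x[0] ** 2

def gradient(x):
    """Analytic gradient of the toy model."""
    return [x[1] + 2.0 * x[0], x[0]]

def path_attribution(path, steps=2000):
    """Midpoint-rule approximation of the line integral of the gradient along `path`."""
    dim = len(path(0.0))
    attr = [0.0] * dim
    for k in range(steps):
        a = path(k / steps)
        b = path((k + 1) / steps)
        mid = [(ai + bi) / 2.0 for ai, bi in zip(a, b)]
        g = gradient(mid)
        for i in range(dim):
            attr[i] += g[i] * (b[i] - a[i])   # gradient component times path increment
    return attr

def straight(t):
    """Straight line from the reference (0, 0) to the input (1, 2)."""
    return [t, 2.0 * t]

def curved(t):
    """A curved path with the same endpoints."""
    return [t, 2.0 * t * t]

attr_straight = path_attribution(straight)
attr_curved = path_attribution(curved)
total_change = model([1.0, 2.0]) - model([0.0, 0.0])
```

Both attributions sum to the same total change in the model output (completeness), but they split it differently between the two features: approximately (2, 1) on the straight path and (5/3, 4/3) on the curved one. That difference is the modeling choice the abstract refers to.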

2602.22847 2026-05-14 cs.LG cs.AI stat.ML

Decentralized Ranking Aggregation via Gossip: Convergence and Robustness

Kerrian Le Caillec, Anna Van Elst, Igor Colin, Stephan Clémençon

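Affiliations: LTCI, Télécom-Paris, Institut Polytechnique de Paris

AI summary: This paper studies how to reach reliable and robust ranking consensus in a decentralized network. It proposes a method based on random gossip communication that lets each node compute a global ranking consensus through local interactions only, without central coordination. The method guarantees convergence while improving robustness to malicious nodes and reducing communication costs, offering a new approach to distributed preference analysis.

Comments: 33 pages, 5 figures

Abstract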

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical guarantees in a centralized setting, i.e., when all the ranking data to be aggregated can be brought together in a single computing unit. For many technologies (e.g., peer-to-peer networks, IoT, multi-agent systems), extending the ability to calculate consensus rankings with guarantees of convergence and resilience to potential contamination in a decentralized setting, when preference data is initially distributed across a communicating network, remains a major methodological challenge. Indeed, in recent years, the literature on decentralized computation has mainly focused on computing or optimizing statistics such as arithmetic means using gossip algorithms. The purpose of this article is precisely to study how to achieve reliable and resilient consensus on collective rankings in a decentralized setting, which raises new questions, in particular robustness to corrupted nodes and scalability through reduced communication costs. The approach proposed and analyzed here relies on the robustness guarantees offered by random gossip communication, which allows autonomous agents to compute a global ranking consensus using local interactions only, without coordination or a central authority.
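
One simple way to picture gossip-based ranking consensus is pairwise averaging of score vectors. The sketch below assumes Borda scores as the aggregation rule and a fully connected network, neither of which is stated in the abstract, and it does not model corrupted nodes or the paper's robustness analysis. All names and data are hypothetical.

```python
import random

def borda_scores(ranking, items):
    """Borda score per item: the top-ranked item gets len(items) - 1, the last gets 0."""
    m = len(items)
    position = {item: p for p, item in enumerate(ranking)}
    return [float(m - 1 - position[item]) for item in items]

def gossip_average(vectors, rounds, seed=0):
    """Random pairwise gossip: two random nodes replace their vectors by the average."""
    rng = random.Random(seed)
    state = [list(v) for v in vectors]
    for _ in range(rounds):
        i, j = rng.sample(range(len(state)), 2)
        avg = [(a + b) / 2.0 for a, b in zip(state[i], state[j])]
        state[i] = list(avg)
        state[j] = list(avg)
    return state

def ranking_from_scores(scores, items):
    """Order items by decreasing score."""
    return [item for _, item in sorted(zip(scores, items), key=lambda pair: -pair[0])]

items = ["A", "B", "C", "D"]
local_rankings = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "C", "D"],
    ["A", "B", "D", "C"],
    ["C", "A", "B", "D"],
]
initial = [borda_scores(r, items) for r in local_rankings]
final_state = gossip_average(initial, rounds=500)
consensus = [ranking_from_scores(s, items) for s in final_state]

# Centralized reference: average the Borda scores in one place.
mean_scores = [sum(col) / len(initial) for col in zip(*initial)]
central_ranking = ranking_from_scores(mean_scores, items)
```

Pairwise averaging preserves the network-wide sum of each item's score, so every node's vector converges to the global mean and each node can sort locally to obtain the same ranking a central aggregator would compute.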

2602.22251 2026-05-14 cs.LG cond-mat.mtrl-sci cs.AI

Zatom-1: Towards a Multimodal Foundation Model for 3D Molecules and Materials

Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney

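Affiliations: LBNL (Lawrence Berkeley National Laboratory); ICSI (International Computer Science Institute); University of Cambridge; Yale University; MIT (Massachusetts Institute of Technology); UC Berkeley

AI summary: This study proposes Zatom-1, a general-purpose foundation model that unifies generation and prediction tasks for 3D molecules and materials. Built on a simplified Transformer architecture, the model jointly models discrete atom types and continuous 3D structures through a multimodal flow matching objective, enabling cross-domain, multi-task learning. Experiments show that Zatom-1 matches or outperforms existing specialized models in both generation and prediction, substantially speeds up generative inference, and exhibits positive transfer from materials generative pretraining to molecular property prediction.

Comments: 38 pages, 10 figures, 15 tables. ICLR 2026 FM4Science. Code, data, and model weights are available at https://github.com/Zatom-AI/zatom

Abstract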

General-purpose 3D modeling in chemistry encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generation or prediction), which limits representation sharing and transfer. We introduce Zatom-1, a cross-domain, general-purpose model architecture that unifies generative and predictive learning of 3D molecules and materials. Zatom-1 is a deliberately simplified Transformer trained with a multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries. This approach supports scalable pretraining with predictable gains as model capacity increases, while enabling fast and stable sampling. We use cross-domain generative pretraining as a universal initialization for downstream multi-task prediction of properties, energies, and forces. Empirically, Zatom-1 outperforms or competes with specialized baselines on both multi-task generative and predictive benchmarks in data-controlled settings, while improving generative inference speed by more than an order of magnitude. Our experiments demonstrate positive predictive transfer between data domains from joint generative pretraining: modeling materials during generative pretraining improves molecular property prediction accuracy. Open-source code and model weights are freely available at https://github.com/Zatom-AI/zatom.

2602.17555 2026-05-14 cs.CV

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

Zixu Cheng, Da Li, Jian Hu, Yuhang Zang, Ziquan Liu, Shaogang Gong, Wei Li

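Affiliations: Queen Mary University of London; Samsung AI Centre Cambridge; Shanghai Artificial Intelligence Laboratory; Nanyang Technological University

AI summary: Video reasoning requires a fine-grained understanding of the temporal dependencies and event-level relations among objects and events in a video. Current multimodal large language models are prone to severe temporal hallucinations in video reasoning, which stem from weak visual-temporal grounding and the lack of explicit structure for modelling event relations. This paper proposes GraphThinker, a video reasoning method that uses reinforcement finetuning to build a structured event representation and strengthen visual grounding, reducing hallucinations during reasoning. Experiments show clear gains on several benchmark datasets.

Comments: Under review

Abstract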

Video reasoning requires a fine-grained understanding of the temporal dependencies and event-level relations between objects and events in videos. Current Multimodal Large Language Models (MLLMs) are prone to severe temporal hallucinations in video reasoning. An underlying cause of these hallucinations is weak visual-temporal grounding and the lack of explicit structure for modelling event relations. Models often rely on auxiliary text, such as dense captions, rather than explicitly anchoring their reasoning in actual visual evidence. However, these textual representations are inherently unstructured and fail to provide explicit causal constraints needed to guide the model's reasoning. In this work, we propose GraphThinker, a reinforcement finetuning method that constructs a structured event representation of a video and enforces visual grounding to jointly reduce reasoning hallucinations. Specifically, we employ an MLLM to construct an Event-based Video Scene Graph (EVSG) that captures both intra- and inter-event relations, guiding a structured video reasoning process. Moreover, we address the weak grounding issue by introducing a novel visual attention reward during reinforcement finetuning that encourages the model to actively attend to reliable visual cues. On the RexTime dataset, GraphThinker achieves an over 4% improvement in IoU=0.3 for moment localisation. On the VidHalluc dataset, GraphThinker achieves a 9.8% improvement in reducing temporal sequence hallucination and a 7.6% gain in Binary QA in reducing action hallucination, compared to the state-of-the-art methods.

2602.16246 2026-05-14 cs.AI

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Yun-Shiuan Chuang, Chaitanya Kulkarni, Alec Chiu, Avinash Thangali, Zijie Pan, Shivani Shekhar, Yirou Ge, Yixi Li, Uma Kona, Linsey Pang, Prakhar Mehrotra

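Affiliations: PayPal AI

AI summary: This study proposes a proxy state-based evaluation method for multi-turn, tool-calling large language model agents. An LLM simulator produces a structured proxy state without a deterministic backend, which lowers the cost of building and iterating on benchmarks. Experiments show that the framework differentiates models stably, stays consistent across inference-time reasoning settings, supports sensitivity analyses over user personas, and delivers reliable automated evaluation.

Abstract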

Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks, such as tau-bench, tau^2-bench, and AppWorld, rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across model families and inference-time reasoning efforts, and its on-/off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates, as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.

2602.07458 2026-05-14 cs.CV

SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang

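Affiliations: Harbin Institute of Technology, Shenzhen; The Hong Kong University of Science and Technology; Tsinghua Shenzhen International Graduate School, Tsinghua University

AI summary: Online reinforcement learning (RL) is promising for complex image editing but is currently limited by the lack of reliable, fine-grained reward signals. This paper proposes SpatialReward, a reward model that improves evaluation accuracy through explicit spatial reasoning and addresses the "attention collapse" that existing evaluators show in cross-image comparison and fine-grained detail capture. The model performs pixel-level verification anchored to predicted edit regions, achieves leading results on multiple benchmarks, and serves as a strong signal for online RL that markedly improves the image generation model.

Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract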

Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.

2602.07342 2026-05-14 cs.AI

SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management

Shengyue Guan, Yihao Liu, Lang Cao

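Affiliations: Alibaba Group; Peking University; University of Illinois Urbana-Champaign

AI summary: This paper presents SupChain-Bench, a unified benchmark for evaluating large language models in real-world supply chain management, focusing on domain knowledge and on long-horizon, multi-step task execution grounded in standard operating procedures (SOPs). The study finds substantial gaps in execution reliability among current models and proposes SupChain-ReAct, an SOP-free framework that autonomously generates executable tool-calling procedures and achieves the strongest and most consistent performance. The work provides a systematic benchmark for studying long-horizon orchestration in real-world settings and points to room for improvement in current supply chain agents.

Abstract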

Large language models (LLMs) have shown promise in complex reasoning and tool-based decision making, motivating their application to real-world supply chain management. However, supply chain workflows require reliable long-horizon, multi-step orchestration grounded in domain-specific procedures, which remains challenging for current models. To systematically evaluate LLM performance in this setting, we introduce SupChain-Bench, a unified real-world benchmark that assesses both supply chain domain knowledge and long-horizon tool-based orchestration grounded in standard operating procedures (SOPs). Our experiments reveal substantial gaps in execution reliability across models. We further propose SupChain-ReAct, an SOP-free framework that autonomously synthesizes executable procedures for tool use, achieving the strongest and most consistent tool-calling performance. Our work establishes a principled benchmark for studying reliable long-horizon orchestration in real-world operational settings and highlights significant room for improvement in LLM-based supply chain agents.

2602.04804 2026-05-14 cs.CL

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang

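Affiliations: New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA); Nanjing University; The Hong Kong University of Science; Sichuan University; Peking University

AI summary: OmniSIFT is a modality-asymmetric token compression framework for omni-modal large language models (Omni-LLMs), designed to reduce the computational overhead of processing multimodal sequences. It uses a two-stage compression strategy that compresses the video and audio modalities separately at a fine-grained level and is optimized end to end. Experiments show that OmniSIFT performs well on multiple benchmarks, adds only a small number of parameters while significantly reducing inference latency, and on some tasks even surpasses the full-token model.

Comments: [ICML 2026] Code Link: https://github.com/dingyue772/OmniSIFT

Abstract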

Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.

2602.03429 2026-05-14 cs.AI cs.CL cs.HC cs.LG

DiscoverLLM: From Executing Intents to Discovering Them

Tae Soo Kim, Yoonjoo Lee, Jaesang Yu, John Joon Young Chung, Juho Kim

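Affiliations * University of Michigan

AI summary To handle ambiguous and open-ended user requests, the study proposes DiscoverLLM, a framework that trains large language models to help users form and discover intents they have not yet made explicit. It introduces a novel user simulator that models the user's cognitive state with a hierarchy of intents and uses the degree of intent concretization as a reward signal, so that the model explores when intents are unclear and converges quickly when they are clear. Experiments show that DiscoverLLM improves task performance across several interactive tasks while shortening conversations, and a user study also shows higher satisfaction and efficiency.

Comments Accepted at ICML 2026

Details
English abstract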

To handle ambiguous and open-ended requests, Large Language Models (LLMs) are increasingly trained to interact with users to surface intents they have not yet expressed (e.g., ask clarification questions). However, users are often ambiguous because they have not yet formed their intents: they must observe and explore outcomes to discover what they want. Simply asking "what kind of tone do you want?" fails when users themselves do not know. We introduce DiscoverLLM, a novel and generalizable framework that trains LLMs to help users form and discover their intents. Central to our approach is a novel user simulator that models cognitive state with a hierarchy of intents that progressively concretize as the model surfaces relevant options -- where the degree of concretization serves as a reward signal that models can be trained to optimize. Resulting models learn to collaborate with users by adaptively diverging (i.e., explore options) when intents are unclear, and converging (i.e., refine and implement) when intents concretize. Across proposed interactive benchmarks in creative writing, technical writing, and SVG drawing, DiscoverLLM achieves over 10% higher task performance while reducing conversation length by up to 40%. In a user study with 75 human participants, DiscoverLLM improved conversation satisfaction and efficiency compared to baselines.

2602.02560 2026-05-14 cs.LG cs.AI cs.CV

Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions

Bartlomiej Sobieski, Jakub Grzywaczewski, Karol Dobiczek, Mateusz Wójcik, Tomasz Bartczak, Patryk Szatkowski, Przemysław Bombiński, Matthew Tivnan, Przemyslaw Biecek

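Affiliations * National Lung Screening Trial Research Team

AI summary This study causally verifies the decision mechanism of Sybil, a deep learning model for lung cancer risk prediction, and proposes S(H)NAP, a model-agnostic auditing framework. The method constructs generative interventional attributions, validated by expert radiologists, to isolate causal contributions to the model's risk score. The study finds that although Sybil often behaves like an expert, it has critical failure modes, including oversensitivity to clinically unjustified artifacts and a radial bias.

Comments ICML 2026

Details
English abstract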

Lung cancer remains the leading cause of cancer mortality, driving the development of automated screening tools to alleviate radiologist workload. Standing at the frontier of this effort is Sybil, a deep learning model capable of predicting future risk solely from computed tomography (CT) with high precision. However, despite extensive clinical validation, current assessments rely purely on observational metrics. This correlation-based approach overlooks the model's actual reasoning mechanism, necessitating a shift to causal verification to ensure robust decision-making before clinical deployment. We propose S(H)NAP, a model-agnostic auditing framework that constructs generative interventional attributions validated by expert radiologists. By leveraging realistic 3D diffusion bridge modeling to systematically modify anatomical features, our approach isolates object-specific causal contributions to the risk score. Providing the first interventional audit of Sybil, we demonstrate that while the model often exhibits behavior akin to an expert radiologist, differentiating malignant pulmonary nodules from benign ones, it suffers from critical failure modes, including dangerous sensitivity to clinically unjustified artifacts and a distinct radial bias.

2602.01629 2026-05-14 cs.LG cs.RO cs.SY eess.SY

AdaptNC: Adaptive Nonconformity Scores for Conformal Prediction under Distribution Shift

Renukanandan Tumu, Aditya Singh, Rahul Mangharam

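Affiliations * Department of Electrical and Systems Engineering, University of Pennsylvania

AI summary This paper studies how to improve uncertainty quantification with conformal prediction under distribution shift. Standard conformal prediction relies on an exchangeability assumption that is often violated in real robotic systems, and online methods that keep a static nonconformity score produce overly conservative prediction regions. The authors propose AdaptNC, a framework that adapts both the nonconformity score parameters and the conformal threshold online, using adaptive reweighting and a replay buffer to improve efficiency and stability. Experiments show that AdaptNC significantly reduces prediction region volume on several robotic benchmarks while maintaining target coverage.

Details
English abstract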

Rigorous uncertainty quantification is essential for the safe deployment of autonomous systems in unconstrained environments. Conformal Prediction (CP) provides a distribution-free framework for this task, yet its standard formulations rely on exchangeability assumptions that are violated by the distribution shifts inherent in real-world robotics. Existing online CP methods maintain target coverage by adaptively scaling the conformal threshold, but typically employ a static nonconformity score function. We show that this fixed geometry leads to highly conservative, volume-inefficient prediction regions when environments undergo structural shifts. To address this, we propose AdaptNC, a framework for the joint online adaptation of both the nonconformity score parameters and the conformal threshold. AdaptNC leverages an adaptive reweighting scheme to optimize score functions, and introduces a replay buffer mechanism to mitigate the coverage instability that occurs during score transitions. We evaluate AdaptNC on diverse robotic benchmarks involving multi-agent policy changes, environmental changes and sensor degradation. Our results demonstrate that AdaptNC significantly reduces prediction region volume compared to state-of-the-art threshold-only baselines while maintaining target coverage levels.
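The threshold-only setting that the abstract contrasts against can be illustrated with a small sketch. This is one common online update rule (in the style of adaptive conformal inference), not AdaptNC itself; the function name `online_threshold`, the step size `gamma`, and the toy score stream are assumptions made for illustration.

```python
import random

def online_threshold(scores, alpha=0.1, gamma=0.05, q0=1.0):
    """Track a conformal threshold q_t over a stream of nonconformity scores.

    The threshold moves up after a miss (score > q_t) and down after a
    cover, which pushes the long-run miscoverage rate toward alpha.
    """
    q = q0
    misses = 0.0
    for s in scores:
        err = 1.0 if s > q else 0.0   # 1 when the region failed to cover
        misses += err
        q += gamma * (err - alpha)    # grow on a miss, shrink on a cover
    return q, misses / len(scores)

# Toy stream whose scale doubles halfway through (a crude distribution shift).
rng = random.Random(0)
stream = [abs(rng.gauss(0, 1)) for _ in range(2000)]
stream += [abs(rng.gauss(0, 2)) for _ in range(2000)]

q_final, miss_rate = online_threshold(stream)
print(f"final threshold={q_final:.2f}, empirical miscoverage={miss_rate:.3f}")
```

Only the scalar threshold adapts here; the score function that produced `scores` stays fixed, which is the limitation the abstract attributes to existing online CP methods.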

2601.22868 2026-05-14 cs.CV cs.LG

Conditional Compatibility Learning for Context-Dependent Anomaly Detection

Shashank Mishra, Didier Stricker, Jason Rambach

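Affiliations * German Research Center for Artificial Intelligence (DFKI); RPTU Kaiserslautern

AI summary This paper studies context-dependent anomaly detection, where the same object may be normal or anomalous depending on the scene. Conventional methods usually assume that abnormality is an intrinsic property of the object, and the paper argues that this assumption does not hold in real-world scenarios. The authors propose conditional compatibility learning and instantiate it in CC-CLIP, a model that learns disentangled subject and context representations and fuses visual evidence through text-conditioned attention. CC-CLIP substantially outperforms existing methods on real-world contextual anomaly detection.

Comments Preprint. 9 pages main text, plus appendix

Details
English abstract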

Anomaly detection usually assumes that abnormality is an intrinsic property of an observation. A defect is a defect, and a rare object is rare, regardless of where it appears. Many real-world anomalies do not work this way. A runner on a track is normal, but the same runner on a highway is not. The subject is unchanged; only the context makes it anomalous. This setting, long recognized as contextual anomaly detection, remains largely underexplored in modern vision-language systems. The difficulty is not merely empirical; it is formal. When anomaly labels depend on the relation between a subject and its context, any detector reasoning from a global representation that conflates subject and context is provably non-identifiable: two different subject-context configurations can map to the same embedding while requiring opposite labels, and no such detector can be correct on both. This impossibility motivates a different formulation: instead of asking whether an observation deviates from a global notion of normality, the model should ask whether subjects are compatible with their surrounding context. We define this as conditional compatibility learning. We instantiate this framework in CC-CLIP, a vision-language architecture that learns disentangled subject- and context-aware representations from a single image and fuses visual evidence through text-conditioned attention. CC-CLIP achieves state-of-the-art results on real-world contextual anomaly detection, substantially outperforming all existing CLIP-based and context-reasoning baselines. A single-branch variant of CC-CLIP also achieves competitive performance on structural anomaly benchmarks.

2601.21975 2026-05-14 cs.AI cs.ET

Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models

Pranav Mahajan, Ihor Kendiukhov, Syed Hussain, Lydia Nottingham

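Affiliations * University of Oxford; Max Planck Institute for Biological Cybernetics; University of Tuebingen; Cardiff University; Cambridge–Boston Alignment Initiative (CBAI)

AI summary This study examines the stated-revealed (SvR) preference gap in language models and analyzes how different elicitation protocols affect it. Allowing neutrality or abstention during stated preference elicitation improves the correlation between stated and revealed preferences, but also allowing abstention in revealed preferences can sharply reduce that correlation. The study stresses that preference elicitation methods must account for indeterminate preferences in order to assess a model's value tendencies more accurately.

Comments Accepted to ACL 2026 Eval Eval Workshop and 3rd Technical AI Safety Conference (TAIS 2026)

Details
English abstract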

Recent work identifies a stated-revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Allowing neutrality and abstention during stated preference elicitation allows us to exclude weak signals, substantially improving Spearman's rank correlation ($ρ$) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives $ρ$ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.
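For readers unfamiliar with the metric, the following is a minimal sketch of Spearman's rank correlation, computed as the Pearson correlation of average ranks. The helper names and the example scores are hypothetical and are not data from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-value preference scores for one model.
stated = [0.9, 0.7, 0.4, 0.2, 0.1]
revealed = [0.8, 0.3, 0.6, 0.25, 0.05]
print(spearman_rho(stated, revealed))
```

Because only ranks enter the computation, the statistic is insensitive to how the two elicitation protocols scale their scores.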

2601.21366 2026-05-14 cs.LG math.OC

Perceptrons and localization of attention's mean-field landscape

Antonio Álvarez-López, Borjan Geshkovski, Domènec Ruiz-Balet

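Affiliations * Universidad Autónoma de Madrid; Laboratoire Jacques-Louis Lions Inria & Sorbonne Université; Universitat de Barcelona

AI summary This paper studies the role of the perceptron block in the mean-field landscape of attention in Transformers, modeling the forward pass as an interacting particle system on the unit sphere. Building on the gradient-flow structure available in some weight settings and on the mean-field limit of infinite context length, it shows that critical points are generically atomic and localized on subsets of the sphere, which reveals structural features of the attention landscape.

Details
English abstract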

The forward pass of a Transformer can be seen as an interacting particle system on the unit sphere: time plays the role of layers, particles that of token embeddings, and the unit sphere idealizes layer normalization. In some weight settings the system can even be seen as a gradient flow for an explicit energy, and one can make sense of the infinite context length (mean-field) limit thanks to Wasserstein gradient flows. In this paper we study the effect of the perceptron block in this setting, and show that critical points are generically atomic and localized on subsets of the sphere.

2601.21033 2026-05-14 cs.LG

Predict-Project-Renoise: Sampling Diffusion Models under Hard Constraints

Omer Rochman-Sharabi, Gilles Louppe

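Affiliations * University of Liège

AI summary Diffusion models struggle to enforce hard constraints, yet many applications in the physical sciences require exact satisfaction of conservation laws, boundary conditions, and observational consistency. This paper proposes the Predict-Project-Renoise (PPR) algorithm, which samples from pretrained diffusion models under hard constraints by iteratively projecting through the denoiser and renoising via the forward diffusion kernel. Across several experiments, the method achieves low constraint violation together with high distributional fidelity, a combination that existing methods fail to deliver.

Comments Code coming soon

Details
English abstract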

Diffusion models cannot enforce hard constraints, yet applications in the physical sciences demand exact satisfaction of conservation laws, boundary conditions, and observational consistency. In this work, we identify a corrector kernel whose unique stationary distribution is the constrained marginal at each noise level, and approximate it by iteratively projecting through the denoiser and renoising via the forward kernel. The resulting Predict-Project-Renoise (PPR) algorithm enables sampling from pretrained diffusion models under hard constraints. Its three components are each necessary: projecting through the denoiser keeps samples close to the data manifold, while renoising and iterating drive samples toward the constrained marginal. On 2D distributions, the Kuramoto-Sivashinsky equation, and global weather forecasting with a $10^8$-dimensional atmospheric model, PPR simultaneously achieves low constraint violations and high distributional fidelity, a combination that existing methods fail to deliver.
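The predict, project, renoise loop described above can be sketched on a toy problem. This is not the authors' implementation: an analytic denoiser for standard Gaussian data stands in for a pretrained network, the constraint is a single linear equality, and the noise schedule, the number of inner steps, and the way the sampler moves between noise levels are all assumptions made for illustration.

```python
import random

def denoise(x_t, sigma):
    """Posterior mean E[x0 | x_t] when x0 ~ N(0, I) and x_t = x0 + sigma * eps."""
    return [v / (1.0 + sigma ** 2) for v in x_t]

def project(x, c=1.0):
    """Orthogonal projection onto the hard constraint x[0] + x[1] = c."""
    shift = (c - (x[0] + x[1])) / 2.0
    return [x[0] + shift, x[1] + shift]

def ppr_sample(sigmas, inner_steps=5, seed=0):
    """Toy predict-project-renoise loop over a decreasing noise schedule."""
    rng = random.Random(seed)
    x_t = [sigmas[0] * rng.gauss(0, 1) for _ in range(2)]
    for level, sigma in enumerate(sigmas):
        for _ in range(inner_steps):
            # Predict a clean sample, then project it onto the constraint.
            x0 = project(denoise(x_t, sigma))
            # Renoise with the forward kernel at the same noise level.
            x_t = [v + sigma * rng.gauss(0, 1) for v in x0]
        if level + 1 < len(sigmas):
            # Assumed transition: renoise the projected prediction at the
            # next, lower noise level.
            x_t = [v + sigmas[level + 1] * rng.gauss(0, 1) for v in x0]
    return project(denoise(x_t, sigmas[-1]))

sample = ppr_sample([2.0, 1.0, 0.5, 0.1, 0.01])
print(sample, sample[0] + sample[1])
```

The returned sample satisfies the constraint up to floating-point error because the last operation is a projection; how well its distribution matches the constrained target depends on the denoiser and the schedule, which this toy does not examine.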

2601.20239 2026-05-14 cs.RO

TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance

Zhemeng Zhang, Jiahua Ma, Xincheng Yang, Xin Wen, Yuzhi Zhang, Boyan Li, Yiran Qin, Jin Liu, Can Zhao, Li Kang, Haoqin Hong, Zhenfei Yin, Philip Torr, Hao Su, Ruimao Zhang, Daolin Ma

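Affiliations * Shanghai Jiao Tong University; Xense Robotics; Sun Yat-sen University; Oxford; Shanghai AI Laboratory; University of Science and Technology of China; UCSD

AI summary This paper proposes TouchGuide, a method that steers visuomotor policies at inference time with tactile guidance to improve robot performance on fine-grained, contact-rich manipulation. It combines a pretrained visuomotor policy with a task-specific Contact Physical Model (CPM), fusing visual and tactile information in a low-dimensional action space to produce refined actions that satisfy physical contact constraints. The study also introduces TacUMI, a data collection system for obtaining reliable tactile data efficiently and at low cost. Experiments show that TouchGuide significantly outperforms existing methods on several challenging tasks.

Details
English abstract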

Fine-grained and contact-rich manipulation remain challenging for robots, largely due to the underutilization of tactile feedback. To address this, we introduce TouchGuide, a novel cross-policy visuo-tactile fusion paradigm that fuses modalities within a low-dimensional action space. Specifically, TouchGuide operates in two stages to guide a pre-trained diffusion or flow-matching visuomotor policy at inference time. First, the policy produces a coarse, visually-plausible action using only visual inputs during early sampling. Second, a task-specific Contact Physical Model (CPM) provides tactile guidance to steer and refine the action, ensuring it aligns with realistic physical contact conditions. Trained through contrastive learning on limited expert demonstrations, the CPM provides a tactile-informed feasibility score to steer the sampling process toward refined actions that satisfy physical contact constraints. Furthermore, to facilitate TouchGuide training with high-quality and cost-effective data, we introduce TacUMI, a data collection system. TacUMI achieves a favorable trade-off between precision and affordability; by leveraging rigid fingertips, it obtains direct tactile feedback, thereby enabling the collection of reliable tactile data. Extensive experiments on five challenging contact-rich tasks, such as shoe lacing and chip handover, show that TouchGuide consistently and significantly outperforms state-of-the-art visuo-tactile policies.

2601.18608 2026-05-14 cs.AI cs.LG

PolySHAP: Extending KernelSHAP with Interaction-Informed Polynomial Regression

Fabian Fumagalli, R. Teal Witter, Christopher Musco

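Affiliations * Bielefeld University; Claremont McKenna College; New York University

AI summary This paper proposes PolySHAP, which extends the KernelSHAP algorithm with higher-degree polynomial regression to capture non-linear interactions between features and thereby improve Shapley value estimates. PolySHAP gives empirically better estimates on several benchmark datasets, and the authors prove that its estimates are consistent. The work also establishes a theoretical connection between paired sampling (antithetic sampling) and second-order PolySHAP, providing the first strong theoretical justification for this widely used heuristic.

Comments Published at ICLR 2026: https://openreview.net/forum?id=M19J8UGguq

Details
English abstract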

Shapley values have emerged as a central game-theoretic tool in explainable AI (XAI). However, computing Shapley values exactly requires $2^d$ game evaluations for a model with $d$ features. Lundberg and Lee's KernelSHAP algorithm has emerged as a leading method for avoiding this exponential cost. KernelSHAP approximates Shapley values by approximating the game as a linear function, which is fit using a small number of game evaluations for random feature subsets. In this work, we extend KernelSHAP by approximating the game via higher degree polynomials, which capture non-linear interactions between features. Our resulting PolySHAP method yields empirically better Shapley value estimates for various benchmark datasets, and we prove that these estimates are consistent. Moreover, we connect our approach to paired sampling (antithetic sampling), a ubiquitous modification to KernelSHAP that improves empirical accuracy. We prove that paired sampling outputs exactly the same Shapley value approximations as second-order PolySHAP, without ever fitting a degree 2 polynomial. To the best of our knowledge, this finding provides the first strong theoretical justification for the excellent practical performance of the paired sampling heuristic.
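To make the exponential cost concrete, here is a minimal sketch of exact Shapley value computation by subset enumeration. It illustrates the quantity that KernelSHAP and PolySHAP approximate, not either algorithm; the toy game and all names are assumptions.

```python
from itertools import combinations
from math import factorial

def exact_shapley(game, d):
    """Exact Shapley values for a set function `game` over features 0..d-1.

    For each feature i, enumerates every subset S that excludes i and weights
    the marginal contribution game(S + {i}) - game(S) by |S|! (d-|S|-1)! / d!.
    """
    values = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        phi = 0.0
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi += weight * (game(s | {i}) - game(s))
        values.append(phi)
    return values

def toy_game(s):
    """Additive effects plus one pairwise interaction between features 0 and 1."""
    total = 1.0 * (0 in s) + 2.0 * (1 in s) + 3.0 * (2 in s)
    if 0 in s and 1 in s:
        total += 4.0
    return total

print(exact_shapley(toy_game, 3))
```

The interaction term is split equally between features 0 and 1, and the values sum to the difference between the game on all features and on none. The enumeration touches all $2^d$ subsets, which is why sampling-based estimators are needed for realistic $d$.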

2512.20211 2026-05-14 cs.SD eess.AS eess.SP

Aliasing-Free Neural Audio Synthesis

Yicheng Gu, Junan Zhang, Chaoren Wang, Jerry Li, Zhizheng Wu, Lauri Juvela

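Affiliations * Aalto University School of Science; Aalto University; School of Data Science, The Chinese University of Hong Kong, Shenzhen; Spellbrush, Akihabara, Tokyo

AI summary In neural audio synthesis, existing models fall short on high-fidelity music and singing voice because non-linear activation functions and upsampling layers introduce severe aliasing artifacts. This paper incorporates differentiable anti-aliasing techniques into the activation and upsampling modules and presents Pupu-Vocoder and Pupu-Codec. Experiments show that the new models outperform existing systems on music, singing voice, and general audio while achieving comparable performance on speech.

Comments Accepted by TASLP

Details
English abstract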

In neural audio synthesis, neural vocoders and codecs are models that reconstruct waveforms from acoustic and latent representations, which are essential to the resulting audio quality. While current models are capable of generating perceptually natural speech, they still struggle with high-fidelity music and singing voice synthesis, as severe aliasing artifacts are introduced by non-linear activation functions and upsampling layers in existing architectures. Although various anti-aliasing techniques have been proposed in digital signal processing, their integration into neural vocoders and codecs remains under-explored. This paper incorporates differentiable anti-aliasing techniques into the activation and upsampling modules to bridge this gap, and thus presents Pupu-Vocoder and Pupu-Codec. We build a test signal benchmark to evaluate the anti-aliased modules, and validate our proposed models on speech, singing voice, music, and audio. Experimental results show that Pupu-Vocoder and Pupu-Codec outperform existing systems on singing voice, music, and audio, while achieving comparable performance on speech. Demos, codes, and checkpoints are available at VocodexElysium.github.io/AliasingFreeNeuralAudioSynthesis/.

2512.16767 2026-05-14 cs.CV

Make-It-Poseable: Feed-forward Latent Posing Model for 3D Characters

Zhiyang Guo, Ori Zhang, Jax Xiang, Alan Zhao, Zhenxun Yuan, Wengang Zhou, Houqiang Li

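Affiliations * EEIS Department, University of Science and Technology of China; Tencent PCG, Shenzhen, China; Tencent PCG, New York, USA; Tencent PCG, Beijing, China; University of Science and Technology of China; Tencent PCG

AI summary This paper proposes Make-It-Poseable, a feed-forward framework that addresses key problems in posing 3D characters, such as inaccurate skinning weights, fixed mesh topologies, and poor pose conformance. It reformulates character posing as a skinning-free latent-space transformation problem and operates on compact latent representations to reconstruct characters in target poses. The framework combines a latent posing transformer, a dense pose representation, and an adaptive completion module. It handles topological changes and shows strong zero-shot generalization to diverse morphologies and 3D authoring tasks.

Comments Project page: https://jasongzy.github.io/Make-It-Poseable/

Details
English abstract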

Posing 3D characters is a fundamental task in computer graphics. However, existing paradigms, ranging from traditional auto-rigging to recent pose-conditioned generative models, frequently struggle with inaccurate skinning weights, fixed mesh topologies, and poor pose conformance. These challenges have become particularly pronounced with the recent explosion of AI-generated 3D assets, which often exhibit flawed structures and fused geometry. To address these issues, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a skinning-free latent-space transformation problem. By decoupling shape deformation from the constraints of fixed mesh connectivity, our method directly operates on compact latent representations to reconstruct characters in target poses. To achieve this, our framework integrates a latent posing transformer for shape manipulation, a dense pose representation for fine-grained control, and an adaptive completion module optimized via a bipartite-matched latent loss to robustly handle topological changes. Extensive experiments demonstrate that our method significantly outperforms existing baselines in posing quality. Furthermore, our skeleton-agnostic design exhibits remarkable zero-shot generalization to diverse morphologies including quadrupeds and seamlessly supports various 3D authoring applications such as part replacement and refinement.

2512.10931 2026-05-14 cs.LG cs.CL

Asynchronous Reasoning: Training-Free Interactive Thinking LLMs

George Yakushev, Nataliia Babina, Masoud Vahid Dastgerdi, Vyacheslav Zhdanovskiy, Denis Kuznedelev, Alina Shutova, Max Ryabinin

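Affiliations * Yandex; HSE University; The University of Tokyo; MATS; Together AI

AI summary Many state-of-the-art large language models reason before answering, but this sequential mode of interaction limits their use in real-time settings. This paper proposes a training-free method that lets reasoning-capable models think, listen, and write output asynchronously, as humans do. By exploiting properties of positional embeddings, the model can think, listen, and respond at the same time, which substantially reduces response delays.

Comments Preprint, work in progress

Details
English abstract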

Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embodied assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of positional embeddings to enable LLMs built for sequential generation to simultaneously think, listen, and write outputs. We evaluate our approach on math, commonsense, and safety reasoning: it allows models to generate accurate thinking-augmented answers while reducing time to first non-thinking token from minutes to ${\le}$ 5s and the overall delays by up to $12{\times}$.

2512.09972 2026-05-14 cs.LG cs.CL cs.NE

AP-BMM: Approximating Capability-Cost Pareto Sets of LLMs via Asynchronous Prior-Guided Bayesian Model Merging

Kesheng Chen, Yamin Hu, Zhenqian Zhu, Yiya Diao, Wenjian Luo

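Affiliations * Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies; Institute of Cyberspace Security; School of Computer Science and Technology

AI summary When deploying large language models (LLMs), the trade-off between reasoning capability and inference cost is an important concern. This paper proposes Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM), which performs layer-wise merging, uses parameter and reasoning-activation differences between the source models to guide the search, and improves compute efficiency through asynchronous optimization. Under a fixed evaluation budget, the method produces accuracy-cost Pareto sets of higher quality and broader coverage than synchronous optimization and conventional model-level merging.

Details
English abstract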

Serving Large Language Models (LLMs) often requires choosing between stronger reasoning and lower inference cost. Model merging offers a practical way to build several models between a reasoning-oriented model and a cheaper base model, but common model-level merging methods usually control this trade-off with only one or two global knobs. We study this setting as a multi-objective optimization problem: instead of producing one merged model, the goal is to find a set of merged models that cover different accuracy--token-cost preferences. Layer-wise merging is more flexible because it can assign different merge weights to different Transformer layers. However, it introduces two practical challenges. First, the layer-wise search space is large, and existing methods often search it without using helpful signals from the source models. Second, LLM evaluations can take very different amounts of time, so synchronous batch optimization wastes GPU time while waiting for slow evaluations. We propose Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM). AP-BMM uses parameter and reasoning-activation differences between the source models to suggest which layers should matter early in the search. It also uses an asynchronous Bayesian optimization loop that accounts for candidate models already being evaluated. A lightweight reranking step further spreads candidates across the accuracy--cost trade-off. Under fixed evaluation budgets, AP-BMM achieves stronger Pareto-set quality and broader trade-off coverage than synchronous layer-wise baselines and representative model-level merging baselines. Compared with the synchronous Bayesian baseline, it also reduces wall-clock time by improving GPU utilization. Code: https://github.com/MiLab-HITSZ/AP-BMM.
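The difference between model-level and layer-wise merging can be shown with a small sketch. Linear interpolation is assumed as the merge operator here (the abstract does not name one), and the layer names and parameter values are hypothetical stand-ins for real checkpoint tensors.

```python
def merge_layerwise(base, reasoning, weights):
    """Linearly interpolate two models layer by layer.

    base, reasoning: dicts mapping layer name -> list of floats (same shapes).
    weights: dict mapping layer name -> w in [0, 1]; w=0 keeps the base layer,
    w=1 takes the reasoning model's layer.
    """
    merged = {}
    for name, base_params in base.items():
        w = weights[name]
        merged[name] = [
            (1.0 - w) * b + w * r
            for b, r in zip(base_params, reasoning[name])
        ]
    return merged

# Toy two-layer "models"; real checkpoints hold tensors, not short lists.
base = {"layer0": [0.0, 2.0], "layer1": [1.0, 1.0]}
reasoning = {"layer0": [4.0, 6.0], "layer1": [3.0, 5.0]}

# A model-level merge uses one global knob; a layer-wise merge has one per layer.
global_merge = merge_layerwise(base, reasoning, {"layer0": 0.5, "layer1": 0.5})
layer_merge = merge_layerwise(base, reasoning, {"layer0": 0.25, "layer1": 1.0})
print(global_merge)
print(layer_merge)
```

With one weight per Transformer layer, the search space grows with model depth, which is the motivation the abstract gives for guiding the search with signals from the source models.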

2511.17001 2026-05-14 cs.RO

Unify Robot Actions in Camera Frame

Sicheng Xie, Lingchen Meng, Zijie Diao, Haidong Cao, Zhiying Du, Shuyuan Tu, Jiaqi Leng, Qiuyue Wang, Mingsheng Li, Shuai Bai, Zuxuan Wu, Yu-Gang Jiang

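Affiliations * Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institute; Qwen Team, Alibaba Inc.

AI summary This paper studies the consistency of action representations across robot platforms and proposes a unified action representation based on camera extrinsics, so that actions of different embodiments, including single-arm and bimanual robots, share the same geometric semantics in the camera frame. Because existing datasets lack camera extrinsic annotations, the authors propose CalibAll, a training-free, robot-independent annotation pipeline that estimates camera extrinsics with a coarse-to-fine calibration strategy and produces standardized action representations. Experiments show that cross-embodiment pretraining with camera-frame actions achieves state-of-the-art performance.

Details
English abstract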

Cross-embodiment robot learning requires a unified action representation with consistent semantics across robot platforms. Existing representations suffer from platform-specific inconsistencies, while current solutions either maintain embodiment-specific action heads or learn latent action spaces, without fundamentally resolving the mismatch. We propose to unify robot actions in the camera frame using camera extrinsics, so that actions share consistent geometric semantics across different robot embodiments, including both single-arm and bimanual robots. However, most existing datasets lack camera extrinsic annotations, and existing offline calibration methods either suffer from local minima or require robot-specific training data. To address this gap, we present CalibAll, a training-free, robot-independent annotation pipeline that estimates camera extrinsics for offline datasets and converts heterogeneous robot actions into standardized camera-frame actions. CalibAll follows a coarse-to-fine calibration strategy: temporal PnP provides a stable initialization, followed by differentiable rendering-based refinement for high precision. Beyond extrinsics, CalibAll produces standardized TCP-pose actions and auxiliary annotations. We apply CalibAll to 16 datasets across 4 robot platforms, producing approximately 97K calibrated data episodes. Downstream simulation and real-robot experiments show that cross-embodiment pretraining with camera-frame actions achieves state-of-the-art performance.
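The core geometric step, re-expressing a base-frame action in the camera frame once extrinsics are known, can be sketched as a single homogeneous transform. This illustrates the general idea only and is not the CalibAll pipeline; the extrinsics, the pose values, and the helper names are hypothetical.

```python
import math

def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def pose(yaw, tx, ty, tz):
    """Homogeneous transform: rotation by `yaw` about z, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def to_camera_frame(T_cam_base, T_base_tcp):
    """Express a base-frame TCP pose in the camera frame."""
    return matmul4(T_cam_base, T_base_tcp)

# Hypothetical extrinsics: the robot base is offset 1 m along the camera's
# z-axis and rotated 90 degrees about z relative to the camera.
T_cam_base = pose(math.pi / 2, 0.0, 0.0, 1.0)
# Target TCP pose commanded in the robot base frame.
T_base_tcp = pose(0.0, 0.3, 0.0, 0.2)

T_cam_tcp = to_camera_frame(T_cam_base, T_base_tcp)
print([round(row[3], 3) for row in T_cam_tcp[:3]])
```

Two robots with different base frames but the same camera-frame target would produce the same `T_cam_tcp`, which is the sense in which the representation is shared across embodiments.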