arXivDaily: daily arXiv digest, updated Monday through Friday
2604.13986 2026-05-14 cs.LG

PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling

Zichao Yan, Yan Wu, Mica Xu Ji, Chaitra Agrahar, Esther Wershof, Marcel Nassar, Mehrshad Sadria, Ridvan Eksi, Vladimir Trifonov, Ignacio Ibarra, Telmo Felgueira, Błażej Osiński, Rory Stark

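AI summary: PRiMeFlow is an end-to-end, flow-matching-based method that directly models the effects of genetic and small-molecule perturbations in gene expression space, addressing the modeling challenges posed by the heterogeneity of single-cell gene expression and latent gene dependencies. Its distribution-fitting approach accurately approximates the empirical distribution of single-cell gene expression, and extensive benchmarking in PerturBench validates its effectiveness. The study also validates key design choices through ablation experiments and, across multiple datasets, demonstrates outstanding performance on a perturbation prediction task for human embryonic stem cells.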

Abstract

Predicting the effects of perturbations in-silico on cell state can identify drivers of cell behavior at scale and accelerate drug discovery. However, modeling challenges remain due to the inherent heterogeneity of single cell gene expression and the complex, latent gene dependencies. Here, we present PRiMeFlow, an end-to-end flow matching based approach to directly model the effects of genetic and small molecule perturbations in the gene expression space. The distribution-fitting approach taken by PRiMeFlow enables it to accurately approximate the empirical distribution of single-cell gene expression, which we demonstrate through extensive benchmarking inside PerturBench. Through ablation studies, we also validate important model design choices such as operating in gene expression space and parameterizing the velocity field with a U-Net architecture. Finally, by scaling PRiMeFlow to a broad perturbation data atlas spanning multiple datasets and employing a carefully designed pretraining-finetuning strategy, we demonstrate its outstanding performance on the H1 human embryonic stem cells from the ARC Virtual Cell Challenge benchmark.
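
As a rough illustration of the flow matching objective mentioned above, the sketch below builds one training batch for a velocity-field model. It assumes a straight-line interpolation path between source and target samples and uses random arrays in place of expression profiles; the abstract does not state which path or source distribution PRiMeFlow uses, and the function names are hypothetical.

```python
import numpy as np

def flow_matching_batch(x0, x1, rng):
    """Build one conditional flow matching batch (linear path assumed).

    x0: source samples, shape (n, d), e.g. noise or control cells
    x1: target samples, shape (n, d), e.g. perturbed expression profiles
    Returns the interpolated points x_t, the times t, and the target velocities.
    """
    t = rng.uniform(size=(x0.shape[0], 1))   # one time per sample in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1            # point on the straight path
    v_target = x1 - x0                       # constant velocity along that path
    return x_t, t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean(np.sum((v_pred - v_target) ** 2, axis=1)))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 3))   # stand-in source samples
x1 = rng.normal(size=(4, 3))   # stand-in target samples
x_t, t, v_target = flow_matching_batch(x0, x1, rng)
```

In a real model, `v_pred` would come from a network (the abstract mentions a U-Net) evaluated at `(x_t, t)` and the perturbation condition; here only the regression target is shown.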

2604.11581 2026-05-14 cs.CL

Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking

Solomon Messing

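AI summary: LLM evaluation results strongly influence model deployment, safety standards, research conclusions, and projections of AI's impact on the labor market. However, existing evaluation methods usually ignore the uncertainty introduced by judge model choice, model temperature, and prompt phrasing, which leaves confidence intervals with insufficient coverage, a problem that worsens as the amount of data grows. The paper analyzes the sources of uncertainty in LLM evaluation pipelines, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and uses design-study projections to reduce total evaluation error, markedly improving the accuracy and reliability of evaluation results.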

Abstract

LLM evaluations drive which models get deployed, what safety standards get adopted, which research conclusions get published, and how projections of AI's labor-market impact get made. Yet standard confidence intervals ignore variability from judge model choice, model temperature, and prompt phrasing, producing under-coverage that worsens with more data. The omitted variance can shift results enough to reverse conclusions \citep{baumann2025llmhacking, huang2026dropping}; pipelines that fail to average over it leave the surface that "benchmark hacking" exploits \citep{singh2025leaderboard}. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and uses design-study projections to reduce total evaluation error (TEE). Across the demonstrations, naive standard errors are 40-60% smaller than the TEE-corrected SE. Using Chatbot Arena data, we show naive 95% CI coverage drops as $n$ grows while TEE-corrected coverage holds at 95%, and TEE-guided pipelines restrict the benchmark gaming surface from 56 to 32 Elo ($K=27$), below the human-leaderboard baseline. We show further that a small pilot recovers honest CIs and projects which design changes most improve precision. Acting on those projections halves MMLU estimation error against the answer key at equivalent cost, and raises per-match agreement with human votes by 7.9 percentage points on Chatbot Arena.
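
To make the under-coverage argument concrete, here is a minimal sketch that contrasts a naive standard error, computed within one pipeline configuration, with a standard error that treats configurations (judge, temperature, prompt) as exchangeable draws. It is not the paper's TEE estimator; the accuracies, the number of configurations, and the function names are all hypothetical.

```python
import numpy as np

def naive_se(scores):
    """SE that treats item sampling within one configuration as the only noise."""
    return float(np.std(scores, ddof=1) / np.sqrt(scores.size))

def design_aware_se(scores_by_config):
    """SE of the grand mean when each row is one pipeline configuration.

    Uses the spread of per-configuration means, so it does not shrink
    when only the number of items per configuration grows.
    """
    config_means = scores_by_config.mean(axis=1)
    return float(np.std(config_means, ddof=1) / np.sqrt(config_means.size))

rng = np.random.default_rng(0)
# Hypothetical true accuracies under six judge/prompt configurations
accuracy_by_config = np.array([0.64, 0.67, 0.69, 0.71, 0.73, 0.76])
n_items = 10_000
scores = (rng.uniform(size=(6, n_items)) < accuracy_by_config[:, None]).astype(float)

se_naive = naive_se(scores[0])       # one configuration, many items
se_design = design_aware_se(scores)  # accounts for configuration-to-configuration spread
```

With many items per configuration, `se_naive` keeps shrinking while `se_design` is dominated by the spread across configurations, which is the pattern the abstract describes.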

2604.10755 2026-05-14 cs.CV

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

Junzhi Ning, Jiashi Lin, Yingying Fang, Wei Li, Jiyao Liu, Cheng Tang, Chenglong Ma, Wenhao Tang, Tianbin Li, Ziyan Huang, Guang Yang, Junjun He

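AI summary: The paper presents MMRareBench, the first multimodal, multi-image medical benchmark for rare diseases, designed to evaluate models' overall capability across four clinical workflows: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark contains 1,756 question-answer pairs and 7,958 medical images, uses Orphanet-based ontology alignment and a rigorous evaluation protocol, and systematically reveals that existing large language models fall short when handling multi-image information in rare-disease scenarios, with particularly poor treatment-planning performance. The results show that medical-domain models perform relatively well on diagnosis but still lag significantly behind general-purpose models on multi-image tasks.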

Abstract

Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.

2604.10720 2026-05-14 cs.AI cs.CL cs.CY

Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation

Charles Koutcheme, Juho Leinonen, Arto Hellas

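AI summary: This paper proposes a new framework for training open-weight models that simulate programming learners: authentic student process data is converted into a conversational format that models the interaction between a student and an automated assessment system. The method combines supervised fine-tuning with preference optimization so that the model more closely matches real students' debugging behavior. Experiments show that it outperforms traditional code-only models and prompted large language models in functional alignment and code similarity.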

Comments 8 pages, 2 figures, 2 tables. Accepted to Educational Data Mining 2026

Abstract

Artificial students -- models that simulate how learners act and respond within educational systems -- are a promising tool for evaluating tutoring strategies and feedback mechanisms at scale. However, most existing approaches rely on prompting large, proprietary language models, limiting adaptability to specific courses and raising concerns around privacy, cost, and dependence. In this work, we propose a framework for training open-weight artificial programming learners directly from authentic student process data. Our approach serializes temporal log traces into a conversational format, representing each student's problem-solving process as a dialogue between the learner and their automated assessment system. Student code submissions and environment feedback, such as test outcomes, grades, and error traces, form alternating conversational turns, enabling models to learn from the iterative debugging process. We additionally introduce a training pipeline combining supervised fine-tuning with preference optimization to align models with authentic student debugging behavior. We evaluate our framework by training Qwen models at 4B and 8B scales on a large-scale dataset of real student submissions to Python programming assignments. Our results show that incorporating environment feedback strengthens models' ability to replicate student debugging behavior, improving over both prior code-only approaches and prompted large language model baselines in functional alignment and code similarity. We release our code to support reproducibility.

2604.10634 2026-05-14 cs.CV

NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Xin Li, Yeying Jin, Suhang Yao, Beibei Lin, Zhaoxin Fan, Wending Yan, Xin Jin, Zongwei Wu, Bingchen Li, Peishu Shi, Yufei Wang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Runzhe Li, Kui Jiang, Zhaocheng Yu, Yiang Chen, Junjun Jiang, Xianming Liu, Hongde Gu, Zeliang Li, Mache You, Jiangxin Dong, Jinshan Pan, Qiyu Rong, Bowen Shao, Hongyuan Jing, Mengmeng Zhang, Bo Ding, Hui Zhang, Yi Ren, Mohab Kishawy, Jun Chen, Anh-Kiet Duong, Petra Gomez-Kramer, Jean-Michel Carozza, Wangzhi Xing, Xin Lu, Enxuan Gu, Jingxi Zhang, Diqi Chen, Qiaosi Yi, Bingcai Wei, Wenjie Li, Bowen Tie, Heng Guo, Zhanyu Ma, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Cici Liu, Yaokun Shi, Paula Garrido Mellado, Daniel Feijoo, Alvaro Garcia Lara, Marcos V. Conde, Zhidong Zhu, Bangshu Xiong, Qiaofeng Ou, Zhibo Rao, Wei Li, Zida Zhang, Hui Geng, Qisheng Xu, Xuyao Deng, Changjian Wang, Kele Xu, Guanglu Dong, Qiyao Zhao, Tianheng Zheng, Chunlei Li, Lichao Mou, Chao Ren, Chang-De Peng, Chieh-Yu Tsai, Guan-Cheng Liu, Li-Wei Kang, Abhishek Rajak, Milan Kumar Singh, Ankit Kumar, Dimple Sonone, Kishor Upla, Kiran Raja, Huilin Zhao, Xing Xu, Chuan Chen, Yeming Lao, Wenjing Xun, Li Yang, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Hao Yang, Ruikun Zhang, Liyuan Pan

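AI summary: This paper gives an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Based on the real-world Raindrop Clarity dataset, the challenge aims to establish a practical raindrop removal benchmark across different illumination and focus conditions. It attracted 168 registered teams, 17 of which submitted final solutions; these achieved strong performance on the test set, demonstrating continued progress in the field.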

Comments Accepted by CVPR2026 Workshop; NTIRE 2026 Challenge Report

Abstract

This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset \cite{jin2024raindrop}. For this edition, we adjust the dataset to 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for the removal of raindrops under various illumination and focus conditions. In total, 168 teams registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.

2604.10547 2026-05-14 cs.AI

Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

Wanyi Chen, Xiao Yang, Xu Yang, Tianming Sha, Qizheng Li, Zhuo Wang, Bowen Xian, Fang Kong, Weiqing Liu, Jiang Bian

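AI summary: This paper introduces Agent² RL-Bench, a compact diagnostic benchmark for evaluating whether LLM agents can autonomously design and optimize RL post-training. Agents must train, debug, and evaluate models on their own within a limited budget, across tasks ranging from static rule-based training to closed-loop online RL. Experiments show that although some agents can effectively improve model performance, achieving stable, autonomous RL post-training under a fixed budget remains challenging overall; the benchmark provides an effective evaluation framework for future research.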

Comments 37 pages, 7 figures, 20 tables

Abstract

We introduce Agent^2 RL-Bench, a compact diagnostic benchmark for evaluating agentic RL post-training, which tests whether LLM agents can autonomously design, implement, debug, and execute post-training pipelines that improve foundation models. RL post-training increasingly drives model alignment and specialization, yet existing benchmarks are largely static, rewarding supervised fine-tuning or script generation without assessing an agent's ability to close an interactive RL loop. Agent^2 RL-Bench provides a unified agent-facing interface: each run starts from an isolated workspace containing a base model, task data, instructions, and a grading API, and agents must iterate within a fixed budget by training models and submitting artifacts for evaluation. The benchmark spans six tasks across three levels, from static rule-based training to judge-based optimization and closed-loop online RL with trajectory collection. Two diagnostic skills, namely runtime recording and post-hoc summarization, enable structured analysis of agent behavior, facilitating smooth and effective iteration of the benchmark's evaluation framework. Across five agent systems and six driver LLMs, agents show intelligent behavior but clear limitations: one RL-oriented run improves ALFWorld from 4.85 to 93.28 via SFT warm-up and GRPO with online rollouts, yet DeepSearchQA remains difficult, most successful routes rely on supervised pipelines, and interactive outcomes show large single-run differences across agent stacks. Overall, Agent^2 RL-Bench shows that current agents can sometimes engineer online RL, but stable agent-driven RL post-training remains rare under fixed budgets. It also demonstrates that our benchmark provides a strong and effective evaluation framework for future research in this direction. Code is available at https://github.com/microsoft/RD-Agent/blob/main/rdagent/scenarios/rl/autorl_bench/README.md

2604.09543 2026-05-14 cs.LG

ANTIC: Adaptive Neural Temporal In-situ Compressor

Sandeep S. Cranganore, Andrei Bodnar, Gianluca Galletti, Fabian Paischer, Johannes Brandstetter

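AI summary: This paper proposes ANTIC, an adaptive neural temporal in situ compression method, to address the massive data volumes produced when storing high-resolution spatiotemporal fields governed by high-dimensional partial differential equations. It combines an adaptive temporal selector with a neural spatial compression module based on continual fine-tuning, selecting key frames at simulation time and learning residual updates between adjacent snapshots, so that temporal and spatial compression are achieved jointly in a single streaming pass, greatly reducing storage requirements without significantly affecting the accuracy of the physical simulation. Experiments show storage reductions of several orders of magnitude.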

Comments 31 pages, 19 figures, 9 Tables; Accepted at ICML 2026; First authors contributed equally

Journal ref
The Forty-Third International Conference on Machine Learning 2026
Abstract

The persistent storage requirements for high-resolution, spatiotemporally evolving fields governed by large-scale and high-dimensional partial differential equations (PDEs) have reached the petabyte-to-exabyte scale. Transient simulations modeling Navier-Stokes equations, magnetohydrodynamics, plasma physics, or binary black hole mergers generate data volumes that are prohibitive for modern high-performance computing (HPC) infrastructures. To address this bottleneck, we introduce ANTIC (Adaptive Neural Temporal in situ Compressor), an end-to-end in situ compression pipeline. ANTIC consists of an adaptive temporal selector tailored to high-dimensional physics that identifies and filters informative snapshots at simulation time, combined with a spatial neural compression module based on continual fine-tuning that learns residual updates between adjacent snapshots using neural fields. By operating in a single streaming pass, ANTIC enables a combined compression of temporal and spatial components and effectively alleviates the need for explicit on-disk storage of entire time-evolved trajectories. Experimental results demonstrate how storage reductions of several orders of magnitude relate to physics accuracy.

2604.07969 2026-05-14 cs.CL

Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

George Fountzoulas

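AI summary: This paper presents Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes, needs no tokenizer or attention mechanism, and has fewer than 470K parameters. Its core components include oscillator-based sequence processing, an FFT-based encoding, a phase-harmonic non-linearity, and a content-dependent reverb mechanism. Experiments show that Kathleen matches or exceeds the pretrained baseline on several benchmark datasets while substantially reducing the parameter count.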

Comments 15 pages, 10 tables. v2: Added V9 architecture with Positional Decay Modulation. Pretraining eliminated. SST-2 improved from 83.3% to 85.8%

Abstract

We present Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing -- requiring no tokenizer, no attention mechanism, and under 470K parameters. Kathleen introduces several novel components: (1) RecurrentOscillatorBanks -- damped sinusoid convolutions with temporal memory for O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder that maps all 256 byte values using a single learnable vector (256 floats); (3) PhaseHarmonics -- a sinusoidal non-linearity with just 6 learnable phase parameters (+2.6% accuracy, <0.001% of model parameters); (4) Content-Dependent Reverb with Positional Decay Modulation -- a temporal memory mechanism whose decay rate is jointly conditioned on input content and a learned position-indexed bias vector; (5) Token-Level Module Sequencer with consonance and dissonance interference channels. Through iterative architecture evolution from an initial 733K-parameter baseline (Kathleen-Clean) to the current Kathleen-V9 (469K parameters), we demonstrate that pretraining can be entirely eliminated while improving accuracy. Kathleen-V9 achieves 88.5% +/- 0.2% on IMDB, 92.4% +/- 0.2% on AG News, and 85.8% +/- 0.5% on SST-2 (3-seed averages) -- matching or exceeding the pretrained baseline on all benchmarks with 36% fewer parameters. On SST-2, the improvement is +2.5% absolute over the pretrained predecessor. Kathleen processes sequences in O(L) time and memory.

2604.02753 2026-05-14 cs.CV

DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan, Linshan Li, Weiming Liu, Ruizhi Qian, Guangxin Wu, Hao Zhang, Jifeng Shen, Piotr Koniusz, Zhengtao Yao, Junhao Dong, Qiang Sun

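AI summary: This paper proposes DeCo-DETR, a decoupled cognition DETR framework that targets the efficiency and performance problems of open-vocabulary object detection (OVOD) in practical deployment. By building a hierarchical semantic prototype space from pre-trained multimodal models, it removes the dependence on a text encoder at inference time and thus improves detection efficiency. A training strategy that decouples semantic reasoning from localization balances detection accuracy against open-world generalization, and experiments show strong zero-shot detection performance on several benchmarks.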

Comments Accepted at ICLR 2026

Abstract

Open-vocabulary object detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.

2604.01938 2026-05-14 cs.CL cond-mat.stat-mech physics.soc-ph

How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization

Ramon Ferrer-i-Cancho

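AI summary: This paper studies how to measure the optimality of word order in languages, or of gesture order across languages, from the perspective of swap distance minimization. The author proposes a mathematical framework for assessing how optimal a word order is within the permutohedron and finds that crosslinguistic gesture orders are at least 77% optimal, which suggests that this optimality is not due to chance. The paper also introduces the quadratic assignment problem (QAP) as a unifying framework for multiple optimization problems in linguistics and postulates a general principle of optimal assignment that unifies several linguistic principles, including swap distance minimization.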

Comments Many corrections in Appendix C, especially in the proofs

Abstract

The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77\%$ optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
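
The swap distance described above, the length of a shortest path in the permutohedron, equals the number of pairs of elements that appear in opposite relative order in the two permutations. The sketch below computes it by counting such inversions. The subject/object/verb labels are only an illustrative choice of sequence, and the paper's optimality score itself is not reproduced here.

```python
from itertools import permutations

def swap_distance(source, target):
    """Minimum number of adjacent swaps that turn `source` into `target`.

    Both arguments must be orderings of the same distinct elements.
    """
    position = {item: i for i, item in enumerate(target)}
    ranks = [position[item] for item in source]
    # Each inverted pair needs exactly one adjacent swap to fix
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )

# Distances from a source order "SOV" to all six orders of three elements
orders = ["".join(p) for p in permutations("SOV")]
distances = {order: swap_distance("SOV", order) for order in orders}
# SOV: 0, SVO: 1, OSV: 1, OVS: 2, VSO: 2, VOS: 3
```

Under the swap distance minimization hypothesis, orders at smaller distances from the source order would be expected to be more likely than orders at larger distances.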

2603.29475 2026-05-14 cs.LG

Survival In-Context: Amortized Bayesian Survival Analysis via Prior-Fitted Networks

Dmitrii Seletkov, Paul Hager, Georgios Kaissis, Rickmer Braren, Daniel Rueckert, Raphael Rehms

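AI summary: The paper proposes Survival In-Context (SIC), a prior-fitted survival analysis model, to address the small sample sizes, censoring, and covariate heterogeneity that survival analysis faces in medicine and related fields. By constructing a controllable framework for generating a survival prior and pretraining on synthetic data, the method delivers individualized survival predictions without task-specific training or hyperparameter tuning. Experiments show that SIC performs well on multiple real-world survival datasets and, particularly on small and medium-sized datasets, outperforms classical and deep survival models, demonstrating the potential of the prior-fitted paradigm for survival analysis.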

Abstract

Survival analysis is crucial for many medical applications, but remains challenging for modern machine learning due to limited data, censoring, and the heterogeneity of tabular covariates. While the prior-fitted paradigm, which relies on pretraining models on large collections of synthetic datasets, has recently facilitated tabular foundation models for classification and regression, its suitability for time-to-event modeling remains unclear. We propose a flexible survival data generation framework that defines a rich survival prior with explicit control over covariates and time-event distributions. Building on this prior, we introduce Survival In-Context (SIC), a prior-fitted in-context learning model for survival analysis that is pretrained exclusively on synthetic data. SIC is trained to approximate Bayesian posterior predictive inference under the synthetic survival prior, enabling individualized survival prediction in a single forward pass, requiring no task-specific training or hyperparameter tuning. Across a broad evaluation on real-world survival datasets, SIC achieves competitive or superior performance compared to classical and deep survival models, particularly in small and medium-sized data regimes, highlighting the promise of a prior-fitted paradigm for survival analysis. The code and pretrained models will be made available upon publication.

2603.27910 2026-05-14 cs.AI cs.IR cs.MA

GAAMA: Graph Augmented Associative Memory for Agents

Swarna Kamal Paul, Shubhendu Sharma, Nitin Sareen

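AI summary: GAAMA is a graph-augmented associative memory system for agents that addresses long-term memory retention across multi-session interactions. It builds a structured knowledge graph of episode, fact, reflection, and concept nodes and combines cosine-similarity-based retrieval with edge-type-aware Personalized PageRank, avoiding the loss of structural relationships and the mega-hub effects of earlier approaches. Experiments show that GAAMA outperforms existing methods on multiple tasks, with larger gains in long-dialogue settings.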

Abstract

AI agents that interact with users across multiple sessions require persistent long-term memory to maintain coherent, personalized behavior. Current approaches either rely on flat retrieval-augmented generation (RAG), which loses structural relationships among memories, or use entity-centric knowledge graphs that suffer from mega-hub effects in conversational data, diluting graph-based relevance propagation. We propose GAAMA, a graph-augmented associative memory for agents that constructs a concept-mediated knowledge graph through a three-step pipeline: (1) verbatim episode preservation, (2) LLM-based extraction of atomic facts and topic-level concept nodes, and (3) synthesis of higher-order reflections. The resulting graph uses four node types (episode, fact, reflection, concept) connected by five structural edge types, with concept nodes providing cross-cutting traversal paths that avoid the mega-hub problem of entity-centric designs. Retrieval combines cosine-similarity-based k-nearest neighbor search with edge-type-aware Personalized PageRank (PPR) through an additive scoring function. We further introduce GRAFT (Graph Repair by Augmenting Facts & Topology), a post-retrieval corrective layer that diagnoses retrieval failures and surgically repairs the knowledge graph. On LoCoMo-10 (1,540 questions, 10 multi-session conversations), GAAMA achieves 79.1% mean reward, a +4.2 pp improvement over a tuned RAG baseline, the strongest comparator. On MemoryArena, GAAMA outperforms full-context baselines across three tasks: Group Travel (+0.4 pp), Web Shopping (+3.4 pp), and Progressive Search (+0.7 pp), with advantages growing monotonically with dialogue length. Notably, GAAMA delivers consistent performance across all categories, matching the best competing method in each, whereas every competitor degrades in at least one category.
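
The retrieval step described above, nearest-neighbor search combined additively with edge-type-aware Personalized PageRank, could look roughly like the sketch below. This is not GAAMA's implementation: the edge-type weights, the mixing coefficient `lam`, the restart probability `alpha`, the undirected treatment of edges, and all function names are assumptions made for illustration.

```python
import numpy as np

def cosine_scores(query, embeddings):
    """Cosine similarity between one query vector and each row of `embeddings`."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return e @ q

def personalized_pagerank(adj, seeds, alpha=0.15, iters=100):
    """Power iteration for PPR; adj[i, j] is the weight of edge i -> j."""
    row_sums = adj.sum(axis=1, keepdims=True)
    transition = np.divide(adj, row_sums, out=np.zeros_like(adj), where=row_sums > 0)
    dangling = row_sums.ravel() == 0
    p = seeds.copy()
    for _ in range(iters):
        leaked = p[dangling].sum()  # mass on nodes without edges returns to the seeds
        p = alpha * seeds + (1 - alpha) * (transition.T @ p + leaked * seeds)
    return p

def retrieve(query, embeddings, edges, type_weights, k=2, lam=0.5):
    """Additive score: cosine similarity plus lam times PPR seeded at the top-k hits."""
    n = embeddings.shape[0]
    adj = np.zeros((n, n))
    for src, dst, edge_type in edges:
        adj[src, dst] += type_weights[edge_type]  # edge weight depends on its type
        adj[dst, src] += type_weights[edge_type]  # edges treated as undirected (assumption)
    sim = cosine_scores(query, embeddings)
    seeds = np.zeros(n)
    seeds[np.argsort(-sim)[:k]] = 1.0 / k
    return sim + lam * personalized_pagerank(adj, seeds)

# Four toy memory nodes; node 2 is linked to node 0, node 3 is isolated
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0]])
edges = [(0, 2, "mentions")]          # hypothetical edge type
type_weights = {"mentions": 1.0}      # hypothetical per-type weight
scores = retrieve(np.array([1.0, 0.0]), embeddings, edges, type_weights)
```

In this toy graph, nodes 2 and 3 have identical similarity to the query, but node 2 scores higher because it is connected to a seed node, which is the kind of structural signal that flat retrieval would miss.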

2603.24649 2026-05-14 cs.CV

MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows

Weixiang Shen, Chengzhi Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu, Xiao Han, Zongyue Li, Jingpei Wu, Min Xu, Daguang Xu, Yueming Jin, Benedikt Wiestler, Daniel Rueckert, Jiazhen Pan

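AI summary: The study argues that current medical imaging benchmarks focus too heavily on pre-selected 2D images and fail to reflect the complex tasks of real clinical workflows. The authors therefore introduce MedFlowBench, a full-study medical imaging benchmark, and MedOpenClaw, a controlled runtime for medical imaging software, to evaluate vision-language models on complete studies. Experiments show that scoring only the final answer overestimates model performance; in realistic tasks, models must also produce auditable evidence to complete complex workflows correctly.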

Comments 33 pages

Abstract

Medical imaging benchmarks often evaluate VLMs on pre-selected 2D images, slices, crops, or patches, making evaluation closer to visual recognition. Real clinical workflows impose a different burden: readers must search through complete studies, operate imaging software, navigate across slices and magnifications, and document visual evidence that can be audited. We argue that this evidence-producing workflow is a critical missing evaluation axis for medical imaging agents. To study it, we introduce MedFlowBench, a full-study benchmark for VLM agents, together with MedOpenClaw, a controlled and replayable runtime in which agents operate medical imaging viewers such as 3D Slicer and QuPath. In each episode, an agent inspects a complete radiology study or whole-slide pathology image, returns a task answer, and submits structured evidence, including key slices, coordinates, regions of interest, or lesion-state fields. This evidence is automatically checked against withheld masks, annotations, and labels. Across evaluated models, final answer-only scoring gives an overly optimistic picture: when answers must also be supported by correct evidence, performance drops substantially on complex workflows. We further find that adding image-analysis tools does not by itself solve the problem. Tools help when they make a complex procedure simple and reliable, but agents still struggle when they must choose inputs, manage viewer state, and verify intermediate outputs over multiple steps. MedFlowBench exposes whether medical imaging agents can produce auditable evidence from complete studies, rather than plausible answers from selected images.

2603.24002 2026-05-14 cs.LG

Stochastic Dimension-Free Zeroth-Order Estimator for High-Dimensional and High-Order PINNs

Zhangyong Liang, Huanhuan Gao

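AI summary: The paper targets the excessive computational complexity and memory consumption of training high-dimensional, high-order physics-informed neural networks (PINNs) and proposes SDZE, a dimension-free zeroth-order estimator. It introduces common random numbers synchronization to eliminate the variance explosion in zeroth-order optimization and combines it with an implicit matrix-free subspace projection that substantially reduces parameter-exploration variance and memory usage. Experiments show that SDZE can efficiently train 10-million-dimensional PINNs on a single GPU, with large gains in speed and memory efficiency.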

Comments arXiv admin note: text overlap with arXiv:2412.00088, arXiv:2410.08989, arXiv:2307.12306 by other authors

Abstract

Physics-Informed Neural Networks (PINNs) for high-dimensional and high-order partial differential equations (PDEs) are primarily constrained by the $\mathcal{O}(d^k)$ spatial derivative complexity and the $\mathcal{O}(P)$ memory overhead of backpropagation (BP). While randomized spatial estimators successfully reduce the spatial complexity to $\mathcal{O}(1)$, their reliance on first-order optimization still leads to prohibitive memory consumption at scale. Zeroth-order (ZO) optimization offers a BP-free alternative; however, naively combining randomized spatial operators with ZO perturbations triggers a variance explosion of $\mathcal{O}(1/\varepsilon^2)$, leading to numerical divergence. To address these challenges, we propose the Stochastic Dimension-free Zeroth-order Estimator (SDZE), a unified framework that achieves dimension-independent complexity in both space and memory. Specifically, SDZE leverages Common Random Numbers Synchronization (CRNS) to algebraically cancel the $\mathcal{O}(1/\varepsilon^2)$ variance by locking spatial random seeds across perturbations. Furthermore, an implicit matrix-free subspace projection is introduced to reduce parameter exploration variance from $\mathcal{O}(P)$ to $\mathcal{O}(r)$ while maintaining an $\mathcal{O}(1)$ optimizer memory footprint. Empirical results demonstrate that SDZE enables the training of 10-million-dimensional PINNs on a single NVIDIA A100 GPU, delivering significant improvements in speed and memory efficiency over state-of-the-art baselines.
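
The variance argument above can be seen in a small experiment. The sketch below is not SDZE: it uses a toy quadratic loss evaluated on randomly sampled points as a stand-in for a randomized PINN residual, and a plain two-point zeroth-order estimator. It compares evaluating both perturbations with the same random seed (common random numbers) against using independent seeds; all names and constants are hypothetical.

```python
import numpy as np

def stochastic_loss(theta, seed, n_points=8):
    """Toy loss estimated on randomly sampled points (stand-in for a PINN residual)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_points, theta.size))
    return float(np.mean((x @ theta - 1.0) ** 2))

def zo_gradient(theta, direction, eps, seed_plus, seed_minus):
    """Two-point zeroth-order gradient estimate along a random direction."""
    loss_plus = stochastic_loss(theta + eps * direction, seed_plus)
    loss_minus = stochastic_loss(theta - eps * direction, seed_minus)
    return (loss_plus - loss_minus) / (2.0 * eps) * direction

theta = np.ones(5)
eps = 1e-3
direction_rng = np.random.default_rng(0)
with_crn, independent = [], []
for trial in range(200):
    u = direction_rng.normal(size=theta.size)
    s1, s2 = 2 * trial, 2 * trial + 1
    with_crn.append(zo_gradient(theta, u, eps, s1, s1))     # same seed for both evaluations
    independent.append(zo_gradient(theta, u, eps, s1, s2))  # different seeds

var_crn = float(np.var(np.array(with_crn), axis=0).sum())
var_independent = float(np.var(np.array(independent), axis=0).sum())
```

With independent seeds, the sampling noise in the two loss values does not cancel and is divided by `2 * eps`, so the estimator's variance grows as `eps` shrinks. With a shared seed, both evaluations see the same sample points and that noise cancels in the difference.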

2603.23777 2026-05-14 cs.RO cs.AI cs.SY eess.SY

Human-in-the-Loop Pareto Optimization: Trade-off Characterization for Assist-as-Needed Training and Performance Evaluation

Harun Tolasa, Volkan Patoglu

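AI summary: In human motor skill training and rehabilitation there is an inherent trade-off between task difficulty and user performance, and characterizing it accurately is essential for evaluating performance and designing assist-as-needed (AAN) protocols. The paper proposes a human-in-the-loop Pareto optimization method that combines a quantitative performance metric with a qualitative challenge metric to characterize, systematically and efficiently, the trade-off between task performance and perceived challenge level. Validated through a user study and three use cases, the method can be used to design and evaluate AAN training protocols and to fairly assess individual training progress and between-user performance differences under different assistance levels.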

Comments Under review for publication in IEEE Transactions on Haptics

Abstract

During human motor skill training and physical rehabilitation, there is an inherent trade-off between task difficulty and user performance. Characterizing this trade-off is crucial for evaluating user performance, designing assist-as-needed (AAN) protocols, and assessing the efficacy of training protocols. In this study, we propose a novel human-in-the-loop (HiL) Pareto optimization approach to characterize the trade-off between task performance and the perceived challenge level of motor learning or rehabilitation tasks. We adapt Bayesian multi-criteria optimization to systematically and efficiently perform HiL Pareto characterizations. Our HiL optimization employs a hybrid model that measures performance with a quantitative metric, while the perceived challenge level is captured with a qualitative metric. We demonstrate the feasibility of the proposed HiL Pareto characterization through a user study. Furthermore, we present the utility of the framework through three use cases in the context of a manual skill training task with haptic feedback. First, we demonstrate how the characterized trade-off can be used to design a sample AAN training protocol for a motor learning task and to evaluate the group-level efficacy of the proposed AAN protocol relative to a baseline adaptive assistance protocol. Second, we demonstrate that individual-level comparisons of the trade-offs characterized before and after the training session enable fair evaluation of training progress under different assistance levels. This evaluation method is more general than standard performance evaluations, as it can provide insights even when users cannot perform the task without assistance. Third, we show that the characterized trade-offs also enable fair performance comparisons among different users, as they capture the best possible performance of each user under all feasible assistance levels.
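
A trade-off of this kind is usually summarized by its Pareto front, the set of outcomes that no other outcome beats on both criteria at once. The sketch below only filters non-dominated points from a list of made-up trials. It assumes both criteria are to be maximized and does not reproduce the Bayesian multi-criteria optimization the paper adapts.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (both objectives maximized)."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (task performance, perceived challenge) pairs, one per assistance level
trials = [(0.9, 1.0), (0.8, 2.0), (0.7, 2.0), (0.6, 3.0), (0.5, 2.5)]
front = pareto_front(trials)
# front == [(0.9, 1.0), (0.8, 2.0), (0.6, 3.0)]
```

Comparing the fronts measured before and after training, or across users, is the kind of comparison the use cases in the abstract rely on.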

2603.22273 2026-05-14 cs.LG

Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

Zakaria Mhammedi, James Cohan

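AI summary: This paper proposes a new approach that decouples exploration from policy optimization to tackle hard exploration problems in reinforcement learning. It uses an uncertainty-guided tree search strategy that does not rely on the conventional reinforcement learning framework during exploration, which markedly improves exploration efficiency. Experiments show strong results on several hard exploration tasks, and the exploration trajectories can be turned into high-performing policies through supervised learning, without domain knowledge or expert demonstrations.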

Abstract

The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new approach that explicitly decouples exploration from policy optimization and bypasses RL entirely during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard exploration benchmarks. Further, we demonstrate that the trajectories discovered during exploration can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art performance by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.

2603.22267 2026-05-14 cs.CL cs.AI eess.AS

TiCo: Time-Controllable Spoken Dialogue Model

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass

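AI summary This paper presents TiCo, a time-controllable spoken dialogue model that generates spoken responses of controllable duration from time-constrained instructions (e.g., "generate a response of about 15 seconds"). To address the lack of time awareness in existing models, the work introduces TiCo-Bench, the first benchmark for evaluating time controllability, and uses Spoken Time Markers (STM) to help the model estimate elapsed time during generation and adjust its content to meet the target duration. Experiments show that TiCo, post-trained efficiently without question-answer paired data through self-generation and reinforcement learning with verifiable rewards, substantially improves duration control accuracy while preserving response quality.

Details
English abstract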

We introduce TiCo, a time-controllable spoken dialogue model (SDM) that follows time-constrained instructions (e.g., "Please generate a response lasting about 15 seconds") and generates spoken responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions. To systematically evaluate this, we introduce TiCo-Bench, the first benchmark for time-controllable instruction following in SDMs, on which existing open-source and commercial models frequently fail to satisfy explicit time constraints. TiCo addresses this limitation by enabling an SDM to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is post-trained efficiently without question-answer paired data, relying on self-generation and reinforcement learning with verifiable reward. Experimental results show that TiCo reduces duration error by 2.7x over its backbone and 1.6x over the strongest baseline, while preserving response quality.

2603.19185 2026-05-14 cs.LG

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori, John Jewell, Sana Ayromlou, Wei Pang, Veronica Chatrath, Gauri Sharma, Deval Pandya

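AI summary This paper studies the privacy performance of synthetic tabular data generated by diffusion models, in particular its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, the study explores multiple target models for membership inference and proposes black-box and white-box attacks tailored to these diffusion models, providing a comprehensive experimental basis for evaluating their privacy efficacy. The work offers a useful reference for understanding the potential and limits of generative models with respect to privacy and security.

Comments 4 pages, 1 table

Details
English abstract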

Synthetic data is often perceived as a silver-bullet solution to data anonymization and privacy-preserving data publishing. Drawn from generative models such as diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Recent developments in diffusion models have been effective on a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored. The MIDST challenge sought a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, multiple target models were explored for MIAs, including diffusion models for single tables of mixed data types and multi-relational tables with interconnected constraints. As a key outcome, MIDST inspired the development of novel black-box and white-box MIAs tailored to these target diffusion models, enabling a comprehensive evaluation of their privacy efficacy. The MIDST GitHub repository is available at https://github.com/VectorInstitute/MIDST

2603.05093 2026-05-14 cs.LG cs.AI cs.CV

From Baselines to Transport Geodesics: Axiomatic Attribution via Optimal Generative Flows

Cenwei Zhang, Lin Zhu, Manxi Lin, Lei You

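AI summary This paper studies the path selection problem in feature attribution and proposes an attribution method based on optimal generative flows. Rather than relying on hand-designed paths or the sensitivity geometry of the model, the authors choose the attribution path from the data-generating process by minimizing kinetic action during transport, which yields more stable and structured explanations. The work proves the uniqueness of the Aumann-Shapley integral for a fixed path and approximates the theory with Rectified Flow and related methods; experiments show that the new method improves attribution stability while preserving deletion faithfulness.

Comments 10 figures, 31 pages

Details
English abstract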

Feature attributions often hide a critical modeling choice: they explain a prediction along a counterfactual path from a reference state to an input. Different baselines, interpolations, and generative trajectories define different paths and can therefore produce different explanations. We study this path ambiguity as a modeling problem. Our central question is whether the path can be chosen by the data-generating transport process, rather than by a hand-designed interpolation or by the sensitivity geometry of the model being explained. We separate attribution into fixed-path credit allocation and path selection. For a fixed path, we prove that the Aumann-Shapley line integral is the unique attribution rule under standard fixed-path axioms and explicit coordinate-trace regularity. For path selection, we minimize kinetic action over flows that transport a reference distribution to the data distribution, yielding a transport-geodesic attribution principle. We approximate this ideal with Rectified Flow and Reflow and derive stability bounds linking vector-field error to attribution error. Experiments show that lower-action, transport-consistent paths produce more stable and structured explanations, preserving competitive deletion faithfulness, without claiming data-manifold membership. Our code is available at https://github.com/cenweizhang/OTFlowSHAP.
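
To make the path dependence described in this abstract concrete, the following is a minimal sketch of a path-integral (Aumann-Shapley style) attribution evaluated on a discretised path. Everything here is an illustrative assumption: the toy model `f`, its analytic gradient `grad_f`, and the helpers `path_attribution` and `straight_path` are hypothetical names, and the midpoint rule is just one simple quadrature choice. It is not the authors' OTFlowSHAP code and does not involve Rectified Flow.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def path_attribution(
    grad_f: Callable[[Sequence[float]], Vector],
    path: Sequence[Sequence[float]],
) -> Vector:
    """Approximate the line integral of grad_f along a piecewise-linear path.

    attribution[i] ~= sum over segments of df/dx_i(midpoint) * (change in x_i).
    """
    dim = len(path[0])
    attribution = [0.0] * dim
    for start, end in zip(path[:-1], path[1:]):
        midpoint = [(a + b) / 2.0 for a, b in zip(start, end)]
        gradient = grad_f(midpoint)
        for i in range(dim):
            attribution[i] += gradient[i] * (end[i] - start[i])
    return attribution


def straight_path(baseline: Vector, target: Vector, steps: int) -> List[Vector]:
    """Evenly spaced points on the straight line from baseline to target."""
    return [
        [b + (t - b) * k / steps for b, t in zip(baseline, target)]
        for k in range(steps + 1)
    ]


# Hypothetical toy model (an assumption for illustration only).
def f(x: Sequence[float]) -> float:
    return x[0] * x[1] + x[2] ** 2


def grad_f(x: Sequence[float]) -> Vector:
    return [x[1], x[0], 2.0 * x[2]]


baseline = [0.0, 0.0, 0.0]
target = [1.0, 2.0, 3.0]

# Path 1: straight-line interpolation between baseline and input.
line_attr = path_attribution(grad_f, straight_path(baseline, target, 100))

# Path 2: move one coordinate at a time (same endpoints, different route).
axis_path = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 2.0, 3.0]]
axis_attr = path_attribution(grad_f, axis_path)

print(line_attr, axis_attr, f(target) - f(baseline))
```

In this toy case both paths distribute the same total, `f(target) - f(baseline)`, but they split the interaction term `x[0] * x[1]` differently between the first two features, which is the path ambiguity the abstract refers to.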

2602.22847 2026-05-14 cs.LG cs.AI stat.ML

Decentralized Ranking Aggregation via Gossip: Convergence and Robustness

Kerrian Le Caillec, Anna Van Elst, Igor Colin, Stephan Clémençon

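AI summary This paper studies how to reach reliable and robust ranking consensus in a decentralized network and proposes an approach based on random gossip communication, in which each node computes a global ranking consensus through local interactions only, without central coordination. The approach guarantees convergence while improving robustness to malicious nodes and reducing communication costs, offering a new solution for distributed preference analysis.

Comments 33 pages, 5 figures

Details
English abstract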

The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical guarantees in a centralized setting, i.e., when all the ranking data to be aggregated can be brought together in a single computing unit. For many technologies (e.g., peer-to-peer networks, IoT, multi-agent systems), extending the ability to calculate consensus rankings with guarantees of convergence and resilience to potential contamination in a decentralized setting, when preference data is initially distributed across a communicating network, remains a major methodological challenge. Indeed, in recent years, the literature on decentralized computation has mainly focused on computing or optimizing statistics such as arithmetic means using gossip algorithms. The purpose of this article is precisely to study how to achieve reliable and resilient consensus on collective rankings in a decentralized setting, which raises new questions, in particular robustness to corrupted nodes and scalability through reduced communication costs. The approach proposed and analyzed here relies on the robustness guarantees offered by random gossip communication, which allows autonomous agents to compute a global ranking consensus using local interactions only, without coordination or a central authority.
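
As a rough illustration of the gossip idea mentioned in this abstract, the sketch below runs randomized pairwise averaging over Borda-style position scores on a fully connected network and reads a consensus ranking off each node's local estimate. The aggregation rule (mean position), the complete-graph topology, and the names `borda_scores`, `gossip_average`, and `consensus_ranking` are assumptions made for illustration; the paper's actual algorithms and robustness mechanisms are not described in the abstract and are not reproduced here.

```python
import random
from typing import List


def borda_scores(ranking: List[int], n_items: int) -> List[float]:
    """Score of an item = its position in the ranking (0 = most preferred)."""
    scores = [0.0] * n_items
    for position, item in enumerate(ranking):
        scores[item] = float(position)
    return scores


def gossip_average(local: List[List[float]], rounds: int, seed: int = 0) -> List[List[float]]:
    """Randomized pairwise gossip: two random nodes replace their vectors by the average."""
    rng = random.Random(seed)
    state = [list(vector) for vector in local]
    for _ in range(rounds):
        i, j = rng.sample(range(len(state)), 2)
        mean = [(a + b) / 2.0 for a, b in zip(state[i], state[j])]
        state[i] = mean
        state[j] = list(mean)
    return state


def consensus_ranking(scores: List[float]) -> List[int]:
    """Order items by increasing average position."""
    return sorted(range(len(scores)), key=lambda item: scores[item])


# Each node starts with its own ranking of 4 items (toy data, assumed).
rankings = [
    [0, 1, 2, 3],
    [0, 2, 1, 3],
    [1, 0, 2, 3],
    [0, 1, 3, 2],
    [0, 3, 1, 2],
]
initial_state = [borda_scores(r, 4) for r in rankings]
final_state = gossip_average(initial_state, rounds=400)

for node_scores in final_state:
    print(consensus_ranking(node_scores))
```

Pairwise averaging preserves the network-wide sum of scores, so every node's vector drifts toward the global mean position of each item without any node ever seeing all rankings at once.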

2602.22251 2026-05-14 cs.LG cond-mat.mtrl-sci cs.AI

Zatom-1: Towards a Multimodal Foundation Model for 3D Molecules and Materials

Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney

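AI summary This study proposes Zatom-1, a general-purpose foundation model that unifies generation and prediction tasks for 3D molecules and materials. Built on a simplified Transformer architecture, the model jointly models discrete atom types and continuous 3D structures with a multimodal flow matching objective, enabling cross-domain, multi-task learning. Experiments show that Zatom-1 outperforms or is competitive with specialized baselines in both generative and predictive performance and substantially speeds up generative inference, while demonstrating positive transfer from generative pretraining on materials to molecular property prediction.

Comments 38 pages, 10 figures, 15 tables. ICLR 2026 FM4Science. Code, data, and model weights are available at https://github.com/Zatom-AI/zatom

Details
English abstract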

General-purpose 3D modeling in chemistry encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generation or prediction), which limits representation sharing and transfer. We introduce Zatom-1, a cross-domain, general-purpose model architecture that unifies generative and predictive learning of 3D molecules and materials. Zatom-1 is a deliberately simplified Transformer trained with a multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries. This approach supports scalable pretraining with predictable gains as model capacity increases, while enabling fast and stable sampling. We use cross-domain generative pretraining as a universal initialization for downstream multi-task prediction of properties, energies, and forces. Empirically, Zatom-1 outperforms or competes with specialized baselines on both multi-task generative and predictive benchmarks in data-controlled settings, while improving generative inference speed by more than an order of magnitude. Our experiments demonstrate positive predictive transfer between data domains from joint generative pretraining: modeling materials during generative pretraining improves molecular property prediction accuracy. Open-source code and model weights are freely available at https://github.com/Zatom-AI/zatom.

2602.17555 2026-05-14 cs.CV

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

Zixu Cheng, Da Li, Jian Hu, Yuhang Zang, Ziquan Liu, Shaogang Gong, Wei Li

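AI summary Video reasoning requires a fine-grained understanding of the temporal dependencies and event-level relations between objects and events in videos. Current multimodal large language models are prone to severe temporal hallucinations in video reasoning, rooted in weak visual-temporal grounding and the lack of explicit structural modelling of event relations. To address this, the paper proposes GraphThinker, a video reasoning method that uses reinforcement finetuning to build a structured event representation and strengthen visual grounding, reducing hallucinations during reasoning. Experiments show clear performance gains on several benchmark datasets.

Comments Under review

Details
English abstract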

Video reasoning requires a fine-grained understanding of the temporal dependencies and event-level relations between objects and events in videos. Current Multimodal Large Language Models (MLLMs) are prone to severe temporal hallucinations in video reasoning. An underlying cause of these hallucinations is weak visual-temporal grounding and the lack of explicit structure for modelling event relations. Models often rely on auxiliary text, such as dense captions, rather than explicitly anchoring their reasoning in actual visual evidence. However, these textual representations are inherently unstructured and fail to provide the explicit causal constraints needed to guide the model's reasoning. In this work, we propose GraphThinker, a reinforcement finetuning method that constructs a structured event representation of a video and enforces visual grounding to jointly reduce reasoning hallucinations. Specifically, we employ an MLLM to construct an Event-based Video Scene Graph (EVSG) that captures both intra- and inter-event relations, guiding a structured video reasoning process. Moreover, we address the weak grounding issue by introducing a novel visual attention reward during reinforcement finetuning that encourages the model to actively attend to reliable visual cues. On the RexTime dataset, GraphThinker improves moment localisation by over 4% at IoU=0.3. On the VidHalluc dataset, GraphThinker achieves a 9.8% improvement in reducing temporal sequence hallucination and a 7.6% gain in Binary QA in reducing action hallucination, compared to state-of-the-art methods.
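
For readers unfamiliar with the "IoU=0.3" figure quoted for moment localisation, the sketch below shows how temporal intersection-over-union between a predicted and a ground-truth time span is commonly computed, and how a hit rate at a threshold follows from it. This assumes the usual definition of the metric; the exact evaluation protocol used on RexTime is not given in the abstract, and the function names `temporal_iou` and `recall_at_iou` are hypothetical.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Span], gts: Sequence[Span], threshold: float = 0.3) -> float:
    """Fraction of predictions whose IoU with the ground truth reaches the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= threshold)
    return hits / len(gts)


# Toy predictions and ground-truth moments (assumed values, in seconds).
predictions = [(2.0, 6.0), (0.0, 5.0), (20.0, 30.0)]
ground_truth = [(4.0, 10.0), (0.0, 10.0), (0.0, 10.0)]

print([temporal_iou(p, g) for p, g in zip(predictions, ground_truth)])
print(recall_at_iou(predictions, ground_truth, threshold=0.3))
```

With these toy values only the second prediction clears the 0.3 threshold, so the hit rate is one in three.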

2602.16246 2026-05-14 cs.AI

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Yun-Shiuan Chuang, Chaitanya Kulkarni, Alec Chiu, Avinash Thangali, Zijie Pan, Shivani Shekhar, Yirou Ge, Yixi Li, Uma Kona, Linsey Pang, Prakhar Mehrotra

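AI summary This study proposes a proxy state-based evaluation method for multi-turn tool-calling large language model agents. The method uses an LLM simulator to produce a structured proxy state without relying on a deterministic backend, which lowers the cost of building and iterating on benchmarks. Experiments show that the framework stably differentiates models, keeps evaluations consistent across inference-time reasoning settings, supports sensitivity analysis over user personas, and provides reliable automated evaluation.

Details
English abstract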

Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks, such as tau-bench, tau^2-bench, and AppWorld, rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across model families and inference-time reasoning efforts, and its on-/off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates, as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.

2602.07458 2026-05-14 cs.CV

SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang

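AI summary Online reinforcement learning (RL) is promising for complex image editing but is currently limited by the lack of reliable, fine-grained reward signals. This paper proposes SpatialReward, a reward model that improves evaluation accuracy through explicit spatial reasoning and addresses the "Attention Collapse" problem of existing evaluators in cross-image comparison and fine-grained detail capture. The model performs pixel-level verification based on predicted edit regions, which significantly improves evaluation quality; it achieves leading performance on several benchmarks and, as a strong signal for online RL, significantly improves image generation models.

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Details
English abstract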

Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench--surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.

2602.07342 2026-05-14 cs.AI

SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management

Shengyue Guan, Yihao Liu, Lang Cao

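AI summary This paper presents SupChain-Bench, a unified benchmark for evaluating large language models in real-world supply chain management, focusing on domain knowledge and on long-horizon, multi-step task execution grounded in standard operating procedures. The study finds substantial gaps in execution reliability among current models and proposes SupChain-ReAct, a framework that does not depend on standard operating procedures and autonomously synthesizes executable tool-calling procedures, achieving the strongest and most consistent performance. The work provides a systematic benchmark for studying long-horizon orchestration in real-world settings and points out room for improvement in current supply chain agents.

Details
English abstract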

Large language models (LLMs) have shown promise in complex reasoning and tool-based decision making, motivating their application to real-world supply chain management. However, supply chain workflows require reliable long-horizon, multi-step orchestration grounded in domain-specific procedures, which remains challenging for current models. To systematically evaluate LLM performance in this setting, we introduce SupChain-Bench, a unified real-world benchmark that assesses both supply chain domain knowledge and long-horizon tool-based orchestration grounded in standard operating procedures (SOPs). Our experiments reveal substantial gaps in execution reliability across models. We further propose SupChain-ReAct, an SOP-free framework that autonomously synthesizes executable procedures for tool use, achieving the strongest and most consistent tool-calling performance. Our work establishes a principled benchmark for studying reliable long-horizon orchestration in real-world operational settings and highlights significant room for improvement in LLM-based supply chain agents.

2602.04804 2026-05-14 cs.CL

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang

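AI summary OmniSIFT is a modality-asymmetric token compression framework designed for omni-modal large language models (Omni-LLMs), aimed at the heavy computational overhead of processing multimodal sequences. It adopts a two-stage compression strategy that compresses the video and audio modalities separately and is optimized end to end for efficiency. Experiments show that OmniSIFT performs well on several benchmarks, adds only a small number of parameters while keeping latency lower than training-free baselines, and on some tasks even surpasses the full-token model.

Comments [ICML 2026] Code Link: https://github.com/dingyue772/OmniSIFT

Details
English abstract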

Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.

2602.03429 2026-05-14 cs.AI cs.CL cs.HC cs.LG

DiscoverLLM: From Executing Intents to Discovering Them

Tae Soo Kim, Yoonjoo Lee, Jaesang Yu, John Joon Young Chung, Juho Kim

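AI summary To handle ambiguous and open-ended user requests, the study proposes DiscoverLLM, a framework that trains large language models to help users form and discover intents they have not yet made explicit. The approach introduces a novel user simulator that models the user's cognitive state with a hierarchy of intents and uses the degree of intent concretization as a reward signal for training, so that the model explores actively when intents are unclear and converges quickly when they are clear. Experiments show that DiscoverLLM improves task performance on several interactive tasks while shortening conversations, and a user study shows higher satisfaction and efficiency.

Comments Accepted at ICML 2026

Details
English abstract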

To handle ambiguous and open-ended requests, Large Language Models (LLMs) are increasingly trained to interact with users to surface intents they have not yet expressed (e.g., ask clarification questions). However, users are often ambiguous because they have not yet formed their intents: they must observe and explore outcomes to discover what they want. Simply asking "what kind of tone do you want?" fails when users themselves do not know. We introduce DiscoverLLM, a novel and generalizable framework that trains LLMs to help users form and discover their intents. Central to our approach is a novel user simulator that models cognitive state with a hierarchy of intents that progressively concretize as the model surfaces relevant options -- where the degree of concretization serves as a reward signal that models can be trained to optimize. Resulting models learn to collaborate with users by adaptively diverging (i.e., explore options) when intents are unclear, and converging (i.e., refine and implement) when intents concretize. Across proposed interactive benchmarks in creative writing, technical writing, and SVG drawing, DiscoverLLM achieves over 10% higher task performance while reducing conversation length by up to 40%. In a user study with 75 human participants, DiscoverLLM improved conversation satisfaction and efficiency compared to baselines.

2602.02560 2026-05-14 cs.LG cs.AI cs.CV

Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions

Bartlomiej Sobieski, Jakub Grzywaczewski, Karol Dobiczek, Mateusz Wójcik, Tomasz Bartczak, Patryk Szatkowski, Przemysław Bombiński, Matthew Tivnan, Przemyslaw Biecek

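AI summary This study causally verifies the decision mechanism of Sybil, a deep learning model for lung cancer risk prediction, and proposes S(H)NAP, a model-agnostic auditing framework. The method constructs generative interventional attributions, validated by expert radiologists, to systematically analyse causal contributions to the model's risk score. The study finds that although Sybil often behaves like an expert, it still has critical failure modes, including excessive sensitivity to clinically irrelevant artifacts and a radial bias.

Comments ICML 2026

Details
English abstract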

Lung cancer remains the leading cause of cancer mortality, driving the development of automated screening tools to alleviate radiologist workload. Standing at the frontier of this effort is Sybil, a deep learning model capable of predicting future risk solely from computed tomography (CT) with high precision. However, despite extensive clinical validation, current assessments rely purely on observational metrics. This correlation-based approach overlooks the model's actual reasoning mechanism, necessitating a shift to causal verification to ensure robust decision-making before clinical deployment. We propose S(H)NAP, a model-agnostic auditing framework that constructs generative interventional attributions validated by expert radiologists. By leveraging realistic 3D diffusion bridge modeling to systematically modify anatomical features, our approach isolates object-specific causal contributions to the risk score. Providing the first interventional audit of Sybil, we demonstrate that while the model often exhibits behavior akin to an expert radiologist, differentiating malignant pulmonary nodules from benign ones, it suffers from critical failure modes, including dangerous sensitivity to clinically unjustified artifacts and a distinct radial bias.

2602.01629 2026-05-14 cs.LG cs.RO cs.SY eess.SY

AdaptNC: Adaptive Nonconformity Scores for Conformal Prediction under Distribution Shift

Renukanandan Tumu, Aditya Singh, Rahul Mangharam

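AI summary This paper studies how to improve uncertainty quantification with conformal prediction under distribution shift. Standard conformal prediction relies on an exchangeability assumption that is often violated in real robotic systems, which leads to overly conservative prediction regions. The authors propose AdaptNC, a framework that adapts both the nonconformity score parameters and the conformal threshold online, using adaptive reweighting and a replay buffer to improve prediction efficiency and stability. Experiments show that AdaptNC significantly reduces prediction region volume on several robotic benchmarks while maintaining target coverage.

Details
English abstract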

Rigorous uncertainty quantification is essential for the safe deployment of autonomous systems in unconstrained environments. Conformal Prediction (CP) provides a distribution-free framework for this task, yet its standard formulations rely on exchangeability assumptions that are violated by the distribution shifts inherent in real-world robotics. Existing online CP methods maintain target coverage by adaptively scaling the conformal threshold, but typically employ a static nonconformity score function. We show that this fixed geometry leads to highly conservative, volume-inefficient prediction regions when environments undergo structural shifts. To address this, we propose AdaptNC, a framework for the joint online adaptation of both the nonconformity score parameters and the conformal threshold. AdaptNC leverages an adaptive reweighting scheme to optimize score functions, and introduces a replay buffer mechanism to mitigate the coverage instability that occurs during score transitions. We evaluate AdaptNC on diverse robotic benchmarks involving multi-agent policy changes, environmental changes and sensor degradation. Our results demonstrate that AdaptNC significantly reduces prediction region volume compared to state-of-the-art threshold-only baselines while maintaining target coverage levels.
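
The "threshold-only" style of online conformal prediction that this abstract contrasts itself with can be sketched as follows. This is a generic quantile-tracking update with a fixed (static) nonconformity score, written under stated assumptions: the update rule `q += lr * (err - alpha)`, the step size, the synthetic score stream, and the name `online_threshold_update` are all illustrative choices, and none of it is the AdaptNC method itself, which additionally adapts the score function.

```python
import random
from typing import List, Sequence, Tuple


def online_threshold_update(
    scores: Sequence[float],
    alpha: float = 0.1,
    lr: float = 0.05,
    q0: float = 0.0,
) -> Tuple[List[float], List[float]]:
    """Track a conformal threshold online.

    At each step the prediction region is {y : score(y) <= q}. If the true
    outcome falls outside (score > q) the threshold grows; otherwise it
    shrinks slightly, so the long-run miscoverage rate is pushed toward alpha.
    """
    q = q0
    thresholds: List[float] = []
    errors: List[float] = []
    for score in scores:
        thresholds.append(q)
        err = 1.0 if score > q else 0.0  # 1.0 means the region missed the outcome
        errors.append(err)
        q += lr * (err - alpha)
    return thresholds, errors


# Synthetic nonconformity scores with a distribution shift halfway through
# (the scale of the scores triples at step 2000; assumed toy data).
rng = random.Random(0)
scores = [rng.random() * (1.0 if t < 2000 else 3.0) for t in range(4000)]

thresholds, errors = online_threshold_update(scores, alpha=0.1)
coverage = 1.0 - sum(errors) / len(errors)
print(round(coverage, 3), round(thresholds[1999], 2), round(thresholds[-1], 2))
```

Because the score function is fixed, the only way this baseline can absorb the shift is by inflating the threshold, which is the kind of conservative, volume-inefficient behaviour the abstract attributes to static-score methods.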

2601.22868 2026-05-14 cs.CV cs.LG

Conditional Compatibility Learning for Context-Dependent Anomaly Detection

Shashank Mishra, Didier Stricker, Jason Rambach

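AI summary This paper studies context-dependent anomaly detection, where the same object can be normal or anomalous depending on the scene. Conventional methods usually assume that abnormality is a property of the object itself, an assumption the paper argues does not hold in real-world settings. The authors propose Conditional Compatibility Learning, which separates subject and context representations and fuses them through text-conditioned attention, and build the CC-CLIP model, which substantially outperforms existing methods on several real-world anomaly detection tasks.

Comments Preprint. 9 pages main text, plus appendix

Details
English abstract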

Anomaly detection usually assumes that abnormality is an intrinsic property of an observation. A defect is a defect, and a rare object is rare, regardless of where it appears. Many real-world anomalies do not work this way. A runner on a track is normal, but the same runner on a highway is not. The subject is unchanged; only the context makes it anomalous. This setting, long recognized as contextual anomaly detection, remains largely underexplored in modern vision-language systems. The difficulty is not merely empirical; it is formal. When anomaly labels depend on the relation between a subject and its context, any detector reasoning from a global representation that conflates subject and context is provably non-identifiable: two different subject-context configurations can map to the same embedding while requiring opposite labels, and no such detector can be correct on both. This impossibility motivates a different formulation: instead of asking whether an observation deviates from a global notion of normality, the model should ask whether subjects are compatible with their surrounding context. We define this as conditional compatibility learning. We instantiate this framework in CC-CLIP, a vision-language architecture that learns disentangled subject- and context-aware representations from a single image and fuses visual evidence through text-conditioned attention. CC-CLIP achieves state-of-the-art results on real-world contextual anomaly detection, substantially outperforming all existing CLIP-based and context-reasoning baselines. A single-branch variant of CC-CLIP also achieves competitive performance on structural anomaly benchmarks.