arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12350 2026-05-15 cs.LG cs.AI

A New Technique for AI Explainability using Feature Association Map

Sayantani Ghosh, Amit Kumar Das, Amlan Chakrabarti

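AI summary: This paper proposes FAMeX, a new explainable-AI algorithm based on a Feature Association Map (FAM), for explaining the decisions of AI systems. The method builds an association graph over features and analyzes feature importance from a graph-theoretic perspective to reveal the basis of a model's decisions more accurately. Experiments show that FAMeX outperforms existing explainability algorithms such as PFI and SHAP on classification tasks, showing stronger explanatory power and effectiveness.

English abstract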

Lack of transparency in AI systems poses challenges in critical real-life applications. It is important to be able to explain the decisions of an AI system to ensure trust in the system. Explainable AI (XAI) algorithms play a vital role in achieving this objective. In this paper, we propose a new algorithm for explaining AI systems, FAMeX (Feature Association Map based eXplainability). The proposed algorithm is based on a graph-theoretic formulation of the feature set termed the Feature Association Map (FAM). The foundation of the modelling is the association between features. The proposed FAMeX algorithm has been found to be better than the competing XAI algorithms Permutation Feature Importance (PFI) and SHapley Additive exPlanations (SHAP). Experiments conducted with eight benchmark algorithms show that FAMeX is able to gauge feature importance in the context of classification better than the competing algorithms. These results suggest that FAMeX is a promising algorithm for explaining the predictions of an AI system.

2605.12055 2026-05-15 cs.CL

Do Language Models Encode Knowledge of Linguistic Constraint Violations?

Hardy, Sebastian Padó

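AI summary: This study examines whether large language models (LLMs) encode representations of linguistic constraint violations in their parameters and selectively activate them when processing ungrammatical sentences. It uses sparse autoencoders to decompose polysemantic activations and extract candidate violation-related features, and introduces a sensitivity score to identify features that activate on constraint-violating inputs. The results show that current language models have not formed a unified mechanism for detecting grammatical violations, and that no features are consistently shared across linguistic phenomena.

English abstract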

Large Language Models (LLMs) achieve strong linguistic performance, yet their internal mechanisms for producing these predictions remain unclear. We investigate the hypothesis that LLMs encode representations of linguistic constraint violations within their parameters, which are selectively activated when processing ungrammatical sentences. To test this, we use sparse autoencoders to decompose polysemantic activations into sparse, monosemantic features and recover candidates for violation-related features. We introduce a sensitivity score for identifying features that are preferentially activated on constraint-violated versus well-formed inputs, enabling unsupervised detection of potential violation-specific features. We further propose a conjunctive falsification framework with three criteria evaluated jointly. Overall, the results are negative in two respects: (1) the falsification criteria are not jointly satisfied across linguistic phenomena, and (2) no features are consistently shared across all categories. While some phenomena show partial evidence of selective causal structure, the overall pattern provides limited support for a unified set of grammatical violation detectors in current LMs.
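
A sensitivity score of this kind can be pictured with a small sketch. The abstract does not state the formula, so the normalized difference of mean activations used here is an assumption, as are the function name `sensitivity_scores` and the toy activation values; it only illustrates ranking features by how much more they fire on violated inputs than on well-formed ones.

```python
from statistics import mean

def sensitivity_scores(violated, well_formed, eps=1e-8):
    """Hypothetical per-feature score in [-1, 1].

    violated / well_formed: lists of activation vectors, one vector per
    sentence, holding non-negative sparse-autoencoder feature activations.
    Score = (mean_violated - mean_ok) / (mean_violated + mean_ok + eps).
    This formula is an illustrative assumption, not the paper's definition.
    """
    n_features = len(violated[0])
    scores = []
    for j in range(n_features):
        mv = mean(row[j] for row in violated)
        mw = mean(row[j] for row in well_formed)
        scores.append((mv - mw) / (mv + mw + eps))
    return scores

# Toy data: feature 0 fires mostly on violated inputs, feature 1 fires
# equally on both, feature 2 never fires.
violated = [[0.9, 0.1, 0.0], [0.7, 0.2, 0.0]]
well_formed = [[0.1, 0.1, 0.0], [0.1, 0.2, 0.0]]
scores = sensitivity_scores(violated, well_formed)
```

Under this assumed score, features near 1 would be candidates for violation-specific features, while features near 0 respond to both input types alike.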

2605.11853 2026-05-15 cs.LG cs.AI cs.CL

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

Sijia Li, Yuchen Huang, Zifan Liu, Yanping Li, Jingjing Fu, Li Zhao, Jiang Bian, Ling Zhang, Jun Zhang, Rui Wang

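AI summary: This paper proposes GEAR, a granularity-adaptive advantage reweighting method that aims to improve reinforcement learning training for LLM agents. Using self-distillation, GEAR reweights the trajectory-level advantage with token- and segment-level signals to achieve finer-grained credit assignment. By comparing the policy with a teacher model, it dynamically adjusts the granularity of credit regions, making policy updates on long-horizon trajectories more effective. Experiments show that GEAR outperforms existing methods on several mathematical reasoning and tool-use benchmarks, with the largest gains on benchmarks where the baseline is weakest.

English abstract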

Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
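
The two ingredients named in the abstract can be sketched roughly as follows. The group-relative advantage is the usual GRPO-style standardization of rewards within a group of rollouts. The segment rule is a hypothetical simplification: the threshold `tau`, the gain `alpha`, and the choice to extend a segment until the next spike are all assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each rollout's reward in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

def segment_weights(divergence, tau=1.0, alpha=0.5):
    """Hypothetical per-token weights from a student-teacher divergence trace.

    Tokens before the first spike (divergence > tau) keep weight 1.0.
    From a spike onward, every token shares one weight set by the divergence
    at the spike (the anchor), until a later spike starts a new segment.
    """
    weights = []
    anchor = None
    for d in divergence:
        if d > tau:
            anchor = d  # a new segment starts at this departure point
        weights.append(1.0 if anchor is None else 1.0 + alpha * anchor)
    return weights

rewards = [1.0, 0.0, 0.0, 1.0]           # outcome rewards for four rollouts
advantages = group_relative_advantages(rewards)
divergence = [0.1, 0.2, 2.0, 0.3, 0.1]   # toy per-token divergence for rollout 0
weights = segment_weights(divergence)
token_advantages = [advantages[0] * w for w in weights]
```

In this toy trace the spike at the third token becomes the anchor, so the last three tokens share a weight even though their own divergence has dropped back down.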

2605.11775 2026-05-15 cs.LG cs.CL

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

Jiazheng Zhang, Ziche Fu, Junrui Shen, Yunbin Zhao, Yunke Zhang, Zhiheng Xi, Long Ma, Chenxin An, Zhihao Zhang, Shichun Liu, Dingwei Zhu, Shihan Dou, Shaofan Liu, Han Li, Wiggin Zhou, Aiden Adams, Tao Gui, Fei Huang, Qi Zhang, Xuanjing Huang

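AI summary: This paper studies the polarity of policy entropy in reinforcement learning and introduces entropy polarity, a new concept for predicting the direction in which a policy update changes entropy. A theoretical analysis reveals a structural asymmetry in entropy change, and on that basis the paper proposes a new policy optimization method, PAPO, which controls entropy precisely through advantage reweighting. Experiments show that PAPO delivers better training efficiency and reward improvements on mathematical reasoning and agentic benchmarks.

English abstract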

Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
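
The asymmetry described above can be illustrated with a standard softmax identity; this is not the paper's definition of entropy polarity, only a related textbook calculation. For a categorical policy with logits z and entropy H, the derivative of H with respect to logit k is -p_k (log p_k + H), so raising the logit of a token whose log-probability exceeds -H contracts entropy, and raising a less likely token expands it. The toy logits are an assumption.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def entropy_logit_gradient(logits):
    """dH/dz_k = -p_k * (log p_k + H) for a softmax distribution."""
    p = softmax(logits)
    h = entropy(p)
    return [-q * (math.log(q) + h) for q in p]

logits = [2.0, 0.0, 0.0]   # token 0 is the frequent, high-probability token
grad = entropy_logit_gradient(logits)

# Finite-difference check of the first component.
step = 1e-6
h0 = entropy(softmax(logits))
h1 = entropy(softmax([logits[0] + step, logits[1], logits[2]]))
numeric_grad0 = (h1 - h0) / step
```

Here `grad[0]` is negative (reinforcing the dominant token lowers entropy) and `grad[1]` is positive (reinforcing a rare token raises it), which matches the direction of the asymmetry the abstract describes.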

2605.11611 2026-05-15 cs.AI

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

Jianghan Shen, Siqi Luo, Xinyu Cheng, Jing Xiong, Yue Li, Jiyao Liu, Jiashi Lin, Yirong Chen, Junjun He

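AI summary: This paper proposes CuSearch, a curriculum rollout sampling framework for improving the training of agentic retrieval-augmented generation (RAG) systems under reinforcement learning with verifiable rewards (RLVR). The method uses search depth to adjust the rollout sampling strategy dynamically, focusing on deeper-search trajectories that contain more retrieval decision points and provide denser supervision. Experiments show that CuSearch substantially improves performance across models and retrieval frameworks, offering an effective, annotation-free way to optimize RLVR training.

English abstract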

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth and are therefore not equally informative: deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows over training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to this shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold as deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, achieving up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training.
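
A depth-greedy allocation of this kind might look like the sketch below. The abstract does not spell out the operator, so the tuple layout, the tie-breaking by batch order, and the rule for advancing the phase threshold (advance once at least `budget` trajectories exceed it) are assumptions made for illustration.

```python
def sdga_auto(trajectories, budget):
    """Keep the `budget` deepest trajectories in the batch.

    trajectories: list of (trajectory_id, search_depth) pairs.
    Python's sort is stable, so ties keep their original batch order.
    """
    ranked = sorted(trajectories, key=lambda t: -t[1])
    return ranked[:budget]

def sdga_phase_threshold(trajectories, threshold, budget):
    """Hypothetical phase rule: raise the depth threshold by one once at
    least `budget` trajectories in the batch are deeper than it."""
    deeper = sum(1 for _, depth in trajectories if depth >= threshold + 1)
    return threshold + 1 if deeper >= budget else threshold

batch = [("a", 1), ("b", 3), ("c", 2), ("d", 3), ("e", 0)]
selected = sdga_auto(batch, budget=3)
next_threshold = sdga_phase_threshold(batch, threshold=1, budget=3)
held_threshold = sdga_phase_threshold(batch, threshold=2, budget=3)
```

Because the selection always takes the deepest trajectories available, the effective depth of the training signal rises on its own as the policy learns to search more, which is the implicit curriculum the abstract attributes to SDGA-Auto.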

2605.11459 2026-05-15 cs.RO cs.AI cs.CV cs.LG

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

Yanyan Zhang, Chaoda Song, Vikash Singh, Xinpeng Li, Kai Ye, Zhe Hu, Zhongzhu Pu, Yu Yin, Vipin Chaudhary

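AI summary: Vision-Language-Action (VLA) models are flexible and generalize well, but most existing models use a single-frame observation paradigm and cannot perceive temporal dynamics, so their performance drops sharply in non-stationary environments. This paper proposes a training-free Pace-and-Path Correction method that applies a closed-form correction at inference time to chunked-action VLA models, compensating for the effect of dynamics. Starting from a single quadratic cost function, joint optimization yields two orthogonally decomposed channels, one compressing the execution pace and one adjusting the spatial path, which markedly improves task success rates in dynamic environments.

English abstract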

Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.

2605.11410 2026-05-15 cs.AI

What Do EEG Foundation Models Capture from Human Brain Signals?

Ling Tang, Qian Chen, Jilin Mei, Houshi Xu, Quanshi Zhang, Jing Shao, Na Zou, Xia Hu, Dongrui Liu

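AI summary: This study investigates what information EEG foundation models learn from human EEG signals and how their representations relate to traditional hand-crafted features. Using layer-wise ridge regression, cross-covariance subspace erasure, and related methods, it finds that EEG foundation models perform well on several clinical tasks and that their advantage comes mainly from frequency-domain features combined with other hand-crafted features. The study also reveals performance differences across tasks and gives a clear direction for future feature discovery.

English abstract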

Clinical electroencephalogram (EEG) analysis rests on a hand-crafted feature catalog refined over decades, e.g., band power, connectivity, complexity, and more. Modern EEG foundation models bypass this catalog, learn directly from raw signals via self-supervised pretraining, and match or outperform feature-engineered baselines on most clinical benchmarks. Whether the two representations align is an open question, which we decompose into three sub-questions: what does the model learn, what does the model use, and how much can be explained. We answer them with layer-wise ridge probing, LEACE-style cross-covariance subspace erasure, and a transparent classifier benchmarked against a random-feature baseline. The audit covers three foundation models (CSBrain, CBraMod, LaBraM), five clinical tasks (MDD, Stress, ISRUC-Sleep, TUSL, Siena), and a 6-family 63-feature lexicon. Of the 945 (model, task, feature) units, 648 (68.6%) are representation-causal (RC) and 199 (21.1%) are encoded-only. Across tasks, 50 features qualify as universal candidates with strong support (RC in all three architectures) in two or more tasks. Frequency-domain features dominate, but the other five families each contribute substantial causal mass. Confirmed features recover, on average, 79.3% of the foundation model's advantage over the random baseline, with a clean task gradient (MDD ≈ 0.99 down to Stress ≈ 0.56): tasks near ceiling are almost fully recovered by the lexicon, while harder tasks leave a non-trivial residual that pinpoints a concrete target for future concept discovery.

2605.10664 2026-05-15 cs.CL cs.AI

Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

Diancheng Kang, Zheyuan Liu, Ningshan Ma, Yue Huang, Zhaoxuan Tan, Meng Jiang

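AI summary: This paper studies how to control language model behavior more effectively in dialogue and proposes a new activation steering method that addresses the cumulative failure of conventional methods in long conversations. The authors find that KV-cache contamination is the main cause of degraded steering and propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from the effect of the system prompt on self-attention and gates them at the token level. Experiments show that the method preserves control over persona traits while substantially improving coherence and persona expression in long conversations.

Comments 23 pages, 5 figures. This paper proposes GCAD, an attention-level activation steering method for more stable multi-turn behavior control

English abstract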

Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.

2605.10550 2026-05-15 cs.CL

Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy

Denghao Ma, Qing Liu, Zulong Chen, Chuanfei Xu, Jia Xu, Zhibo Yang, Wei Shao, Zhao Li

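AI summary: This paper presents MMM-Bench, a multi-domain, multi-modal document classification benchmark that aims to address the oversimplification of existing document classification benchmarks. The benchmark builds a deep five-level taxonomy and collects 5,990 real multi-modal documents from 12 commercial domains at Alibaba, each annotated by domain experts with a complete hierarchical path. By establishing comprehensive baselines, the study systematically analyzes four core challenges in the benchmark and offers corresponding insights, providing a solid foundation for research on multi-level, multi-domain document classification.

English abstract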

Document classification forms the backbone of modern enterprise content management, yet existing benchmarks remain trapped in oversimplified paradigms -- single domain settings with flat label structures -- that bear little resemblance to the hierarchical, multi-modal, and cross-domain nature of real-world business documents. This gap not only misrepresents practical complexity but also stifles progress toward industrially viable document intelligence. To bridge this gap, we construct the first Multi-level, Multi-domain, Multi-modal document classification Benchmark (MMM-Bench). MMM-Bench includes (1) a deeply hierarchical taxonomy spanning five levels that capture the authentic organizational logic of business documentation; and (2) 5,990 real-world multi-modal documents meticulously curated from 12 commercial domains in Alibaba. Each document is manually annotated with a complete hierarchical path by domain experts. We establish comprehensive baselines on MMM-Bench, which consists of open-weight models and API-based models. Through systematic experiments, we identify four fundamental challenges within MMM-Bench and propose corresponding insights. To provide a solid foundation for advancing research in multi-level, multi-domain document classification, we release all of the data and the evaluation toolkit of MMM-Bench at https://github.com/MMMDC-Bench/MMMDC-Bench.

2605.10496 2026-05-15 cs.CV

M$^2$E-UAV: A Benchmark and Analysis for Onboard Motion-on-Motion Event-Based Tiny UAV Detection

Weiqi Yan, Lixin Chen, Xiangrui Hou, Zhipeng Cai, Youbiao Wang, Yangyang Shi, Yu Zang, Cheng Wang

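AI summary: This paper presents M$^2$E-UAV, the first dataset and benchmark for tiny UAV detection with a moving event camera, targeting the setting where observer and target move at the same time and detection suffers from heavy background-event interference and sparse targets. The dataset contains synchronized event streams and IMU data, and provides UAV foreground labels based on temporal propagation, suitable for evaluating models across multiple representations. Experiments show that existing methods remain substantially limited when faced with sparse targets and dense background events.

English abstract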

Tiny UAV detection from an onboard event camera is difficult when the observer and target move at the same time. In this motion-on-motion regime, ego-motion activates background edges across buildings, vegetation, and horizon structures, while the UAV may appear as a sparse event cluster. Unlike static- or ground-observer event-based UAV detection, onboard UAV-view detection breaks the clean-background assumption because sensor ego-motion can activate dense background events over the entire field of view. To explore this practical problem, we present M$^2$E-UAV, to the best of our knowledge, the first onboard UAV-view motion-on-motion event-based dataset and benchmark for tiny UAV detection, where both the sensing platform and the target UAV are moving. M$^2$E-UAV provides synchronized event streams and IMU measurements collected from an onboard sensing platform, together with event-level UAV foreground labels derived from temporally propagated 10 Hz bounding-box annotations. The processed benchmark contains 87,223 training samples and 21,395 validation samples across four scene families: sunny building-forest, sunny farm-village, sunset building-forest, and sunset farm-village. We define a train/validation split and an evaluation protocol for comparing representative existing baselines across event-frame, voxel-grid, and point-set representations, with optional IMU input. The benchmark results show that existing baselines remain limited under sparse tiny-target evidence and dense ego-motion-induced background events. Code and benchmark files will be released at https://github.com/Wickyan/M2E-UAV.

2605.10364 2026-05-15 cs.LG

DeepLévy: Learning Heavy-Tailed Uncertainty in Highly Volatile Time Series

Yang Yang, Du Yin, Hao Xue, Flora Salim

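AI summary: This paper studies the key problem of modeling uncertainty in highly volatile time series with heavy-tailed distributions and proposes a deep learning framework called DeepLévy. The method exploits the properties of Lévy stable distributions and learns a mixture of Lévy distributions by minimizing the discrepancy between the empirical and parametric characteristic functions, capturing the uncertainty of extreme events. Experiments show that DeepLévy outperforms existing state-of-the-art methods on tail risk metrics, especially under high volatility.

English abstract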

Modeling uncertainty in heavy-tailed time series remains a critical challenge for deep probabilistic forecasting models, which often struggle to capture abrupt, extreme events. While Lévy stable distributions offer a natural framework for modeling such non-Gaussian behaviors, the intractability of their probability density functions severely limits conventional likelihood-based inference. To address this, we introduce DeepLévy, a neural framework that learns mixtures of Lévy stable distributions by minimizing the discrepancy between empirical and parametric characteristic functions. DeepLévy incorporates a mixture mechanism that adaptively learns context-dependent weights and parameters over multiple Lévy components, enabling flexible multi-horizon uncertainty modeling. Evaluations on both real and synthetic datasets demonstrate that DeepLévy outperforms state-of-the-art deep probabilistic forecasting approaches in tail risk metrics, especially under extreme volatility.
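
The characteristic-function idea can be sketched without a neural network. Stable laws have no closed-form density in general, but their characteristic function is explicit, so a fit can compare it with the empirical characteristic function of the data. The sketch below uses one common parameterization (valid for alpha != 1); the squared-error discrepancy, the evaluation grid, and all function names are assumptions rather than the paper's loss.

```python
import cmath
import math

def stable_cf(t, alpha, beta, gamma, delta):
    """Characteristic function of a stable law (alpha != 1):
    exp(i*delta*t - |gamma*t|^alpha * (1 - i*beta*sign(t)*tan(pi*alpha/2)))."""
    if t == 0:
        return complex(1.0, 0.0)
    sign = 1.0 if t > 0 else -1.0
    skew = 1.0 - 1j * beta * sign * math.tan(math.pi * alpha / 2.0)
    return cmath.exp(1j * delta * t - (gamma * abs(t)) ** alpha * skew)

def mixture_cf(t, weights, components):
    """CF of a mixture is the weighted sum of the component CFs."""
    return sum(w * stable_cf(t, *params) for w, params in zip(weights, components))

def empirical_cf(t, samples):
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

def cf_discrepancy(samples, weights, components, grid):
    """Mean squared gap between empirical and mixture CFs over a grid of t
    values (an illustrative choice of discrepancy)."""
    gaps = [abs(empirical_cf(t, samples) - mixture_cf(t, weights, components)) ** 2
            for t in grid]
    return sum(gaps) / len(gaps)

# With alpha = 2 and beta = 0 the stable law is Gaussian with variance 2*gamma^2.
gaussian_like = stable_cf(1.0, 2.0, 0.0, 1.0, 0.0)
loss = cf_discrepancy([-1.0, 0.5, 3.0], [0.5, 0.5],
                      [(1.5, 0.0, 1.0, 0.0), (1.8, 0.0, 2.0, 0.0)],
                      [0.1, 0.5, 1.0])
```

In a learned model the weights and the (alpha, beta, gamma, delta) tuples would be network outputs, and a discrepancy of this kind would stand in for the likelihood that stable laws do not provide.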

2605.10310 2026-05-15 cs.AI cs.CY cs.HC q-bio.NC

Positive Alignment: Artificial Intelligence for Human Flourishing

Ruben Laukkonen, Seb Krier, Chloé Bakalar, Shamil Chandaria, Morten Kringelbach, Adam Elwood, Daniel Ford, Fernando Rosas, Maty Bohacek, Matija Franklin, Nenad Tomašev, Stephanie Chan, Verena Rieser, Roma Patel, Michael Levin, Arun Rao

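AI summary: This paper introduces the concept of Positive Alignment, which aims to develop AI systems that actively support human and ecological flourishing while remaining safe and cooperative. Unlike existing alignment research focused on safety and risk prevention, positive alignment emphasizes that systems should be pluralistic, polycentric, context-sensitive, and user-authored, and addresses many current alignment problems by cultivating virtues and promoting human well-being. The paper also presents technical directions and design principles across the lifecycle of LLMs and agents to promote tolerance of disagreement and decentralized governance.

English abstract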

Existing alignment research is dominated by concerns about safety and preventing harm: safeguards, controllability, and compliance. This paradigm of alignment parallels early psychology's focus on mental illness: necessary but incomplete. What we call Positive Alignment is the development of AI systems that (i) actively support human and ecological flourishing in a pluralistic, polycentric, context-sensitive, and user-authored way while (ii) remaining safe and cooperative. It is a distinct and necessary agenda within AI alignment research. We argue that several existing failures of alignment (e.g., engagement hacking, loss of human autonomy, failures in truth-seeking, low epistemic humility, error correction, lack of diverse viewpoints, and being primarily reactive rather than proactive) may be better addressed through positive alignment, including cultivating virtues and maximizing human flourishing. We highlight a range of challenges, open questions, and technical directions (e.g., data filtering and upsampling, pre- and post-training, evaluations, collaborative value collection) for different phases of the LLM and agents lifecycle. We end with design principles for promoting disagreement and decentralization through contextual grounding, community customization, continual adaptation, and polycentric governance; that is, many legitimate centers of oversight rather than one institutional or moral chokepoint.

2605.10289 2026-05-15 cs.LG stat.ML

Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

Bochao Li, Yao Fu, Wei Chen, Fang Kong

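AI summary: This paper studies offline-to-online learning under distribution shift, with the goal of using offline data to improve online decision-making. To address the estimation bias of conventional Thompson sampling (TS) under distribution shift, the authors propose sample-mean anchored Thompson sampling (Anchor-TS), which introduces a median-based anchoring rule that corrects the bias caused by distribution shift and improves the algorithm's stability and performance. Theoretical analysis shows that the method can safely use offline data to accelerate online learning, and experiments confirm its advantage across a range of scenarios.

English abstract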

Offline-to-online learning aims to improve online decision-making by leveraging offline logged data. A central challenge in this setting is the distribution shift between offline and online environments. While some existing works attempt to leverage shifted offline data, they largely rely on UCB-type algorithms. Thompson sampling (TS) represents another canonical class of bandit algorithms, well known for its strong empirical performance and naturally suited to offline-to-online learning through its Bayesian formulation. However, unlike UCB indices, posterior samples in TS are not guaranteed to be optimistic with respect to the true arm means. This makes indices constructed from purely online and hybrid data difficult to compare and complicates their use. To address this issue, we propose sample-mean anchored TS (Anchor-TS), which introduces a novel median-based anchoring rule that defines the arm index as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean. The median anchoring systematically corrects bias induced by distribution shift by mitigating over-estimation for suboptimal arms and under-estimation for optimal arms, while exploiting offline information to obtain more accurate estimates when the shift is small. We establish theoretical guarantees showing that the proposed algorithm safely leverages offline data to accelerate online learning, and quantifying how the degree of distribution shift and the size of offline data affect the resulting regret reduction. Extensive experiments demonstrate consistent improvements of our algorithm over baselines.
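
The median anchoring rule stated in the abstract is simple enough to sketch directly: the arm index is the median of an online posterior sample, a hybrid posterior sample, and the online sample mean. Everything else below is an assumption made to get a runnable example: Gaussian posteriors with unit reward variance, a hybrid posterior that simply pools offline and online rewards, and at least one online pull per arm.

```python
import math
import random
from statistics import median

def median_anchor(online_sample, hybrid_sample, online_mean):
    """Arm index = median of the three quantities named in the abstract."""
    return median([online_sample, hybrid_sample, online_mean])

def anchor_ts_index(online_rewards, offline_rewards, rng):
    """Illustrative index for one arm under assumed Gaussian posteriors
    (unit reward variance, flat prior)."""
    n_on = len(online_rewards)
    online_mean = sum(online_rewards) / n_on
    pooled = online_rewards + offline_rewards
    hybrid_mean = sum(pooled) / len(pooled)
    online_sample = rng.gauss(online_mean, 1.0 / math.sqrt(n_on))
    hybrid_sample = rng.gauss(hybrid_mean, 1.0 / math.sqrt(len(pooled)))
    return median_anchor(online_sample, hybrid_sample, online_mean)

def choose_arm(arms, rng):
    """arms: list of (online_rewards, offline_rewards); play the largest index."""
    indices = [anchor_ts_index(on, off, rng) for on, off in arms]
    return max(range(len(arms)), key=lambda i: indices[i])

rng = random.Random(0)
arms = [([0.4, 0.6], [0.9, 0.8, 1.0]),   # offline data looks optimistic
        ([0.7, 0.5], [0.1, 0.2, 0.0])]   # offline data looks pessimistic
picked = choose_arm(arms, rng)
```

The median keeps the hybrid sample from dominating: if shifted offline data drags the hybrid sample far from what has been observed online, the index falls back toward the online sample or the online mean.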

2605.10195 2026-05-15 cs.LG

Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

Shuzhang Zhong, Haochen Huang, Shengxuan Qiu, Pengfei Zuo, Runsheng Wang, Meng Li

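AI summary: Tree-of-Thought (ToT) reasoning improves LLM performance on complex tasks through a tree-structured search, but its efficiency is limited by the synchronization bottleneck created by the reward dependency barrier. This paper proposes SPEX, which breaks this limitation through speculative exploration and introduces path selection, resource allocation, and early termination as key techniques, substantially improving ToT reasoning efficiency. Experiments show that SPEX achieves 1.2x to 3x speedups across ToT algorithms and models, and up to 4.1x when combined with token-level speculative decoding, an important advance toward efficient and scalable ToT reasoning.

Comments OSDI 2026

English abstract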

Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier -- a synchronization bottleneck caused by sequential reward-guided exploration that limits search parallelism and introduces substantial latency. Prior system optimizations, mainly designed for linear Chain-of-Thought (CoT) reasoning, cannot address these challenges, leaving the efficiency of ToT underexplored. To enhance ToT reasoning efficiency, we observe that the reasoning paths can be explored speculatively to break the reward synchronization barrier. Therefore, in this paper, we propose SPEX and introduce three key techniques: (i) intra-query speculative path selection to predict and expand high-potential branches of ToT, (ii) inter-query budget allocation to balance speculative resource allocation across queries dynamically, and (iii) adaptive early termination to prune deep and redundant branches for a skewed search tree. We implement SPEX on top of the SGLang framework and evaluate it across diverse ToT algorithms and LLMs. Extensive experiments show that SPEX achieves $1.2 \sim 3 \times$ speedup for different ToT reasoning algorithms. Moreover, SPEX synergizes with token-level speculative decoding, achieving cumulative speedups of up to $4.1\times$. Ablation studies further confirm the contributions of each technique. Overall, SPEX represents a significant step toward efficient and scalable ToT reasoning, unlocking the parallelism required for high-performance inference-time scaling for LLMs.

2605.09825 2026-05-15 cs.LG cs.AI

Pretraining large language models with MXFP4 on Native FP4 Hardware

Musa Cim, Poovaiah Palangappa, Miro Hodak, Ravi Dwivedula, Meena Arunachalam, Mahmut Taylan Kandemir

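AI summary: This paper studies the training instability that arises when pretraining large language models with MXFP4 quantization on native FP4 hardware. Through controlled experiments that progressively enable FP4 in forward propagation, activation gradients, and weight gradients, it finds that quantizing weight gradients is the main cause of degraded convergence. The study further shows that deterministic Hadamard rotations restore stable optimization whereas randomized methods do not, indicating that the instability stems from structured micro-scaling errors along sensitive gradient paths rather than from insufficient randomness. The experiments run on AMD Instinct MI355X GPUs, so the conclusions are validated without relying on software emulation.

English abstract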

Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.
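
For readers unfamiliar with the format, the sketch below emulates MXFP4 block quantization in software: a block of values shares one power-of-two scale, and each element is rounded to the FP4 (E2M1) grid. It is a simplified illustration and not the paper's training pipeline; it assumes the shared exponent is floor(log2(max|x|)) minus the largest E2M1 exponent (2), breaks rounding ties toward the smaller magnitude, and does not enforce the 32-element block size.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_mxfp4(block):
    """Return (shared_scale, dequantized_values) for one block.

    Simplified emulation: power-of-two shared scale, nearest-grid rounding,
    saturation at 6.0 times the scale. Ties go to the smaller magnitude here,
    which need not match hardware rounding.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    shared_exp = math.floor(math.log2(amax)) - 2   # 2 = largest E2M1 exponent
    scale = 2.0 ** shared_exp
    dequantized = []
    for x in block:
        magnitude = min(abs(x) / scale, 6.0)
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        dequantized.append(math.copysign(nearest, x) * scale)
    return scale, dequantized

scale_a, values_a = quantize_block_mxfp4([6.0, 1.0, -2.4, 0.2])
scale_b, values_b = quantize_block_mxfp4([12.0, 3.0])
```

In the first block the small entry 0.2 is flushed to zero because the shared scale is set by the largest element. Errors of this block-structured kind are what the abstract calls micro-scaling errors.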

2605.09094 2026-05-15 cs.LG

A Tale of Two Problems: Multi-Task Bilevel Learning Meets Equality Constrained Multi-Objective Optimization

Zhiyao Zhang, Myeung Suk Oh, Zhen Qin, Jiaxiang Li, Xin Zhang, Jia Liu

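AI summary: This paper studies the multi-task bilevel learning (MTBL) problem and, for the first time, reformulates it as an equality constrained multi-objective optimization (ECMO) problem under a relaxed lower-level general convexity assumption. To solve ECMO, itself a new problem, the authors propose a KKT-based Pareto stationarity convergence criterion and design a weighted Chebyshev penalty algorithm with finite-time convergence in both deterministic and stochastic settings. The method systematically explores the Pareto front, and solutions of the ECMO problem correspond directly to solutions of the original problem, establishing a theoretical link between bilevel optimization and multi-objective optimization.

English abstract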

In recent years, bilevel optimization (BLO) has attracted significant attention for its broad applications in machine learning. However, most existing works on BLO remain confined to the single-task setting and rely on the lower-level strong convexity assumption, which significantly restricts their applicability to modern machine learning problems of growing complexity. In this paper, we make the first attempt to extend BLO to the multi-task setting under a relaxed lower-level general convexity (LLGC) assumption. To this end, we reformulate the multi-task bilevel learning (MTBL) problem with LLGC into an equality constrained multi-objective optimization (ECMO) problem. However, ECMO itself is a new problem that has not yet been studied in the literature. To address this gap, we first establish a new Karush-Kuhn-Tucker (KKT)-based Pareto stationarity as the convergence criterion for ECMO algorithm design. Based on this foundation, we propose a weighted Chebyshev (WC)-penalty algorithm that achieves a finite-time convergence rate of $O(ST^{-\frac{1}{2}})$ to KKT-based Pareto stationarity in both deterministic and stochastic settings, where $S$ denotes the number of objectives, and $T$ is the total iterations. Moreover, by varying the preference vector over the $S$-dimensional simplex, our WC-penalty method systematically explores the Pareto front. Finally, solutions to the ECMO problem translate directly into solutions for the original MTBL problem, thereby closing the loop between these two foundational optimization frameworks.
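
The role of the preference vector can be seen in a toy weighted Chebyshev scalarization, which replaces several objectives with the maximum of their weighted values. The sketch covers only the scalarization step: it omits the equality constraint and the penalty term of the WC-penalty method, and the two quadratic objectives and the grid search are assumptions chosen so the result is easy to check by hand.

```python
def weighted_chebyshev(values, weights):
    """Scalarize objective values as max_s w_s * f_s."""
    return max(w * v for w, v in zip(weights, values))

def objectives(x):
    """Two toy objectives with minimizers at x = 0 and x = 2."""
    return [x ** 2, (x - 2.0) ** 2]

def minimize_on_grid(weights, grid):
    """Brute-force minimizer of the scalarized objective over a 1-D grid."""
    return min(grid, key=lambda x: weighted_chebyshev(objectives(x), weights))

grid = [i / 1000.0 for i in range(-1000, 3001)]
x_balanced = minimize_on_grid([0.5, 0.5], grid)   # equal preference
x_skewed = minimize_on_grid([0.8, 0.2], grid)     # favor the first objective
```

Equal weights land at x = 1, midway between the two minimizers, while weights (0.8, 0.2) move the solution to about x = 2/3, closer to the first objective's minimizer. Sweeping the weights over the simplex traces out different Pareto-optimal points in this toy problem.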

2605.09038 2026-05-15 cs.AI

SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

Jinchao Hu, Meizhi Zhong, Kehai Chen, Min Zhang

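AI summary: This paper proposes SearchSkill, a framework that teaches large language models to use search tools more effectively, particularly in open-domain question answering. The method plans queries explicitly through a bank of reusable search skills: at each step the model first selects a skill and then generates a search or answer action based on it. The skill bank evolves and is refined from failure patterns during training, improving search efficiency and answer accuracy. Experiments show that SearchSkill improves exact match on several knowledge-intensive QA benchmarks and yields better search behavior, such as fewer copied initial queries, more focused queries, and more accurate answers within a limited search budget.

English abstract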

Teaching language models to use search tools is not only a question of whether they search, but also of whether they issue good queries. This is especially important in open-domain question answering, where broad or copied queries often waste retrieval budget and derail later reasoning. We propose SearchSkill, a framework that makes query planning explicit through reusable search skills. At each step, the model first selects a skill, then generates a search or answer action conditioned on the selected skill card. The skill inventory itself is not fixed: SearchSkill maintains an evolving SkillBank, expands or refines it from recurrent failure patterns, and reconstructs affected trajectories before supervised training. The resulting two-stage SFT recipe aligns training with the inference-time protocol of skill selection followed by skill-grounded execution. Across open-source and closed-source models, SearchSkill improves exact match on knowledge-intensive QA benchmarks and yields better retrieval behavior, including fewer copied first queries, more atomic hop-focused queries, and more correct answers within a small search budget. These results suggest that explicit skill-conditioned query planning is a lightweight alternative to treating search as an undifferentiated action.

2605.09028 2026-05-15 cs.LG

Diagnosing and Mitigating Domain Shift in Permission-Based Android Malware Detection

Md Rafid Islam

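AI summary: This paper studies the performance drop of permission-based Android malware detection models under domain shift. Using two complementary datasets and five ensemble classifiers, it reveals a pronounced asymmetry in cross-domain performance and finds that feature importance is highly unstable across domains. The study further proposes a hybrid training strategy based on common features that effectively improves cross-domain detection, offering a useful reference for building robust malware detection systems.

English abstract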

Machine learning-based Android malware detectors often fail in real-world deployment due to domain shift, where models trained on one data source perform poorly on applications from another. This paper presents a comprehensive study on the generalizability and interpretability of permission-based detectors under cross-domain conditions. Using two complementary datasets (PerMalDroid and NATICUSdroid) and five ensemble classifiers, we first establish an intra-domain baseline, where models achieve over 92% accuracy, and then quantify a severe asymmetric performance drop. While models trained on PerMalDroid generalize well to NATICUSdroid (86% accuracy), the reverse direction sees a drastic drop to 73% accuracy. Explainable AI analysis reveals bimodal feature distributions and shows that feature importance is highly unstable, with key permissions losing or gaining influence across domains. The predictive feature sets for different domains are fundamentally mismatched, as models rely on different, dataset-specific permissions. Most importantly, an ablation study demonstrates that for most models, training on a noisy feature set leads to poor generalization, confirming that domain-specific artifacts are a greater obstacle than missing features. To mitigate this, we validate a hybrid training strategy based on the intersection of common features and successfully recover cross-domain performance, achieving 88% accuracy on PerMalDroid and maintaining 97% on NATICUSdroid. These findings highlight the importance of explainable, cross-domain-robust malware detection systems and provide a practical pathway toward improving real-world deployment of permission-based Android malware detectors.

2605.09027 2026-05-15 cs.CL cs.AI cs.LG cs.MA

GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Alexandre Le Mercier, Chris Develder, Thomas Demeester

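AI summary In multi-agent systems, a single deceptive agent can undermine the performance of the whole agent collective and bypass defense mechanisms. To address shortcomings of existing adversarial robustness evaluations, this paper proposes the GAMBIT benchmark, which comprises three evaluation modes and two independent scores for assessing imposter detectors, with particular attention to how they adapt under distribution shift and novel attacks. Built on chess, GAMBIT introduces a generalizable adaptive deceptive agent and provides 27,804 labeled samples. It shows that zero-shot evaluation can give misleading results against adaptive adversaries and demonstrates the effectiveness of fast recalibration methods in adversarial systems.

Comments 46 pages, 16 figures

Details
English abstract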

In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.

2605.08913 2026-05-15 cs.LG cs.AR cs.CL cs.PF

Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

Willy Fitra Hendria

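AI summary This paper studies non-monotonic latency during Transformer decoding on the Apple MPS backend: as decoding length increases, latency does not grow smoothly but rises sharply in certain configurations. Experiments across multiple model families find latency spikes of up to 21x the normal level. The phenomenon occurs mainly in the decode phase, is not explained by memory pressure alone, and does not appear on CPU or NVIDIA CUDA backends. The study further reveals complex interactions between the key-value (KV) cache and the anomalous execution regimes, underscoring how hardware characteristics affect long-context inference performance.

Comments 9 pages, 5 figures, 6 tables

Details
English abstract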

Autoregressive inference is typically assumed to scale predictably with decoding length, with latency increasing smoothly as generated sequence length grows. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations during transformer decoding. Using multiple model families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies originate primarily during the decode phase rather than prefill, are not explained by memory pressure alone, and remain absent on CPU and NVIDIA CUDA backends under identical conditions. We further show that key-value (KV) cache interacts strongly with these pathological execution regimes: KV caching remains beneficial overall, but its practical speedup collapses sharply within anomalous configurations, while cache-disabled decoding still exhibits residual non-monotonic behavior. These findings suggest that autoregressive decoding on MPS enters discrete execution regimes that are not captured by coarse-grained benchmarking, highlighting the importance of hardware-aware evaluation for long-context inference.
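
One simple way to look for the "spike followed by recovery" pattern described above is to scan a latency sweep for points where latency falls sharply at the next, larger decoding budget. The sketch below is an illustration only: the function name, the threshold `min_ratio`, and the sample numbers are hypothetical and are not measurements or tooling from the paper.

```python
def find_latency_drops(budgets, latencies, min_ratio=3.0):
    """Return (budget, ratio) pairs where latency at a budget is at least
    `min_ratio` times the latency at the next larger budget."""
    if len(budgets) != len(latencies):
        raise ValueError("budgets and latencies must have the same length")
    drops = []
    for i in range(len(latencies) - 1):
        ratio = latencies[i] / latencies[i + 1]
        if ratio >= min_ratio:
            drops.append((budgets[i], ratio))
    return drops


# Hypothetical sweep: decoding budgets (tokens) and measured latencies (seconds).
budgets = [64, 128, 192, 256, 320]
latencies_s = [1.0, 2.0, 42.0, 4.0, 5.0]

# Under smooth scaling this list would be empty; a non-empty result marks
# budgets whose latency is far above that of the next larger budget.
anomalies = find_latency_drops(budgets, latencies_s)
```

A sweep with fine-grained budget steps is needed for this check to be meaningful, since the abstract notes that coarse-grained benchmarking can miss these regimes.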

2605.08888 2026-05-15 cs.CL cs.CV

DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding

Xiang Feng, Jiawei Zhou, Zhangfeng Huang, Kewei Wang, Shanshan Ye, Jinxin Hu, Zulong Chen, Yong Luo, Jing Zhang

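AI summary DocScope is a benchmark for evaluating the verifiable reasoning ability of multimodal large language models on long, visually rich documents. The study recasts long-document question answering as a structured reasoning trajectory prediction task that requires the model to output evidence pages, supporting regions, relevant factual statements, and a final answer, and it examines the reasoning process closely through a four-stage evaluation protocol. Experiments show that answer accuracy alone cannot fully assess model reliability, that rates of complete evidence chains are generally low, and that region grounding and cross-document evidence integration are the main current challenges.

Comments 50 pages, 25 figures, 14 tables

Details
English abstract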

Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol -- Page Localization, Region Grounding, Fact Extraction, and Answer Verification -- that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We benchmark 6 proprietary models, 12 open-weight models, and several domain-specific systems. Our experiments reveal that answer accuracy cannot substitute for trajectory-level evaluation: even among correct answers, the highest observed rate of complete evidence chains is only 29%. Across all models, region grounding remains the weakest trajectory stage. Furthermore, the primary difficulty stems from aggregating evidence dispersed across long distances and multiple document clusters, while an oracle study identifies faithful perception and fact extraction as the dominant capability bottleneck. Cross-architecture comparisons further suggest that activated parameter count matters more than total scale. The benchmark and code will be publicly released at https://github.com/MiliLab/DocScope.

2605.08851 2026-05-15 cs.CV cs.AI cs.LG

Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport

Jialin Li, Zhuo Zhang, Yue Cao, Guipeng Lan, Jiabao Wen, Shuai Xiao, Jiachen Yang

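AI summary To address the shortage of data for stenosis detection in coronary angiography, this study proposes a geometrically constrained stenosis editing method based on entropic optimal transport. By modeling localized editing as an entropic optimal transport problem guided by geometric information, the method achieves more precise structural control and image generation. Experiments show that the generated images substantially improve stenosis detection, with relative gains of 27.8% on a public dataset and 23.0% on a multi-center dataset.

Comments Accepted to ICML 2026

Details
English abstract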

The scarcity of high-quality imaging data for coronary angiography (CAG) stenosis limits the clinical translation of automated stenosis detection. Synthetic stenosis data provides a practical avenue to augment training sets, improving data quality, diversity, and distributional coverage, and enhancing detection precision and generalization. However, diffusion-based editing commonly relies on soft guidance in a noise-initialized reverse process, offering limited pixel-level precision and structure preservation. We propose the OT-Bridge Editor, which reframes localized editing as a constrained entropic optimal transport (OT) problem and leverages geometric information to steer the generation path, enabling stronger geometric control. Extensive experiments show that our synthesized angiograms consistently improve downstream stenosis detection, yielding substantial relative gains of 27.8% on the public ARCADE benchmark and 23.0% on our multi-center dataset, supported by consistent qualitative results.
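
For readers unfamiliar with entropic optimal transport, the standard Sinkhorn iteration is sketched below in plain Python. This is the generic textbook algorithm, not the OT-Bridge Editor: the paper's constraints and geometric guidance are not modeled, and the marginals, cost matrix, and `eps` value are hypothetical toy inputs.

```python
import math


def sinkhorn(a, b, cost, eps=0.5, iters=200):
    """Approximate the entropic OT plan between histograms a and b.

    a, b: source and target marginals (each sums to 1).
    cost: cost[i][j] is the cost of moving mass from bin i to bin j.
    eps:  entropic regularization strength.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-cost / eps).
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v).
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem: two source bins, two target bins, unit cost for crossing bins.
source = [0.5, 0.5]
target = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(source, target, cost)
```

Smaller `eps` values give plans closer to unregularized optimal transport but need more iterations and can underflow in this naive form; log-domain updates are the usual remedy.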

2605.08825 2026-05-15 cs.CV

Rethinking Event-Based Object Detection through Representation-Level Temporal Aggregation and Model-Level Hypergraph Reasoning

Meisen Wang, Hao Deng, Wei Bao, Ma Yuanxiao, Chengjie Wang, Zhiqiang Tian, Shaoyi Du, Siqi Li

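AI summary For event-camera-based object detection (EOD), this paper proposes Ev-DTAD, a unified detection framework that addresses the limitations of existing methods at the representation and model levels. It introduces Hierarchical Temporal Aggregation (HTA), which explicitly encodes temporal information at the representation level, and Frequency-aware Hypergraph Temporal Fusion (FHTF), which performs high-order relational reasoning at the model level, so that fragmented event responses are integrated more effectively. Experiments show that Ev-DTAD achieves higher detection accuracy and efficiency on multiple datasets, validating the approach.

Details
English abstract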

Event cameras provide microsecond-level temporal resolution, low latency, and high dynamic range, offering potential for perception under fast motion and challenging illumination conditions. However, existing Event-based Object Detection (EOD) methods face limitations at both the representation and model levels: prior event representations usually encode temporal information indirectly through redundant structures, while detection models struggle to explicitly aggregate fragmented event responses into coherent high-order object features. To address these limitations, we present Event Dual Temporal-Relational Aggregation Detector (Ev-DTAD), a unified EOD framework that integrates representation-level temporal encoding with model-level temporal-hypergraph reasoning. Specifically, we introduce Hierarchical Temporal Aggregation (HTA), a compact three-channel pseudo-RGB representation that explicitly embeds temporal information across intra- and inter-window events. To further enhance detection under sparse and fragmented event responses, we propose Frequency-aware Hypergraph Temporal Fusion (FHTF), which refines multi-scale event features through temporal evolution modeling and high-order relational reasoning. Extensive experiments on Gen1 (+0.8 mAP and 1.7× faster), 1Mpx/Gen4 (+0.5 mAP and 1.6× faster), and eTraM (+3.0 mAP and 2.0× faster) demonstrate that Ev-DTAD achieves a competitive accuracy-efficiency trade-off, validating the complementarity between compact temporal representation and temporal-hypergraph feature reasoning.

2605.08698 2026-05-15 cs.CV cs.LG

Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods

Md Abu Obaida Zishan, Jannatun Noor, Annajiat Alim Rasel

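AI summary This paper proposes a training-free method for improving the ability of diffusion models such as Stable Diffusion to generate high-resolution images. It scales convolution kernels by interpolation to address the object duplication artifacts that arise when conventional methods raise the resolution. The paper shows mathematically that interpolation scales convolution kernels correctly when multiplied by a constant coefficient, and it obtains results comparable to existing methods when generating images beyond the training resolution. The method also shows potential for fully connected layers and can reduce the memory footprint of neural network training.

Comments Updated the title for clarity. Removed background and redundant text from section 4.2,5. Improved organization in section 4 and clarity of text in Section 4.3

Details
English abstract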

Stable Diffusion (SD) has evolved DDPM (Denoising Diffusion Probabilistic Model) based image generation significantly by denoising in latent space instead of feature space. This popularized DDPM-based image generation as the cost and compute barrier was significantly lowered. However, these models could only generate fixed-resolution images according to their training configuration. When we attempt to generate higher resolutions, the resulting images show object duplication artifacts consistently. To solve this problem without finetuning SD models, recent works have tried dilating the convolution kernels of the models and have achieved a great level of success. But dilated kernels are harder to fine-tune due to being zero-gapped. Apart from this, other methods, such as patched diffusion, could not solve the object-duplication problem efficiently. Hence, to overcome the limitations of dilated convolutions, we propose kernel interpolation of SD models for higher-resolution image generation. In this work, we show mathematically that interpolation can correctly scale convolution kernels if multiplied by a constant coefficient and achieve competitive empirical results in generating beyond-training-resolution images with Stable Diffusion using zero training. Furthermore, we demonstrate that our method enables interpolation of deep neural networks to adapt to higher-dimensional training data, with a worst-case performance drop of 2.6% in accuracy and F1-Score relative to the baseline. This shows the applicability of our method to be general, where we interpolate fully-connected layers, going beyond convolution layers. We also discuss how we can reduce the memory footprints of training neural networks, using our method up to at least 4×.

2605.08522 2026-05-15 cs.CL

Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation

Adib Sakhawat, Tahsin Islam, Takia Farhin, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

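AI summary This paper proposes a unified Multi-Trait Multi-Method (MTMM) geometric framework for evaluating the capabilities of large language models (LLMs). It unifies nine existing evaluation metrics (such as Paraphrase Instability and Drift Score) in a shared latent coordinate space and interprets them as geometric measurements rather than isolated scalar values. Under this framework, model behavior is factorized into three orthogonal latent dimensions, which separates task-irrelevant perturbations from true capability spans and provides a theoretical basis for building robust and empirically stable evaluation benchmarks.

Comments The paper has the mistake of undertaking political spaces to semantic dimensions. This needs to be removed because this is a fatal flaw in consideration. The initial hypothesis and premise need to be rigorously formulated within the political landscape, not by generalizing the metrics. Hence a withdrawal for now is necessary

Details
English abstract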

The evaluation of Large Language Models (LLMs) faces a critical challenge in construct validity, where fragmented benchmarks and ad hoc metrics frequently conflate method variance, such as prompt sensitivity, with true latent capabilities. Concurrently, emerging research suggests that LLM capabilities and outputs can be modeled as continuous geometric manifolds. In this Systematization of Knowledge (SoK), we bridge these paradigms by proposing a generalized Multi-Trait Multi-Method (MTMM) framework for LLM evaluation. We formalize and unify nine evaluation metrics, including Paraphrase Instability, Drift Score, Overton Width, and Pluralism Score, interpreting them not as isolated scalar values but as geometric measurements within a shared latent coordinate space. This spatial unification factorizes model behavior into three orthogonal latent dimensions: (1) Instability and Sensitivity, (2) Position and Alignment, and (3) Coverage and Expressiveness. By systematically separating task-irrelevant perturbations from true capability spans, the framework provides a theoretically grounded and domain-agnostic taxonomy for robust and empirically stable benchmark design.

2605.08506 2026-05-15 cs.LG

Learning Polyhedral Conformal Sets for Robust Optimization

Shuyi Chen, Wenbin Zhou, Shixiang Zhu

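AI summary This study addresses the choice of uncertainty set in robust optimization and proposes a decision-aware conformal prediction framework that learns, in a data-driven way, polyhedral uncertainty sets aligned with the optimization objective. The method parameterizes the geometry of the uncertainty set with data-driven hyperplanes and learns its shape by minimizing the robust loss, while conformal calibration guarantees statistical validity. It also adds a re-calibration step on an independent dataset to correct the bias introduced by data-dependent selection. The resulting sets model directional and anisotropic uncertainty while remaining computationally tractable, and the paper provides finite-sample coverage guarantees and an analysis of sub-optimality bounds.

Details
English abstract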

Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its performance critically depends on the choice of the uncertainty set. While large sets ensure reliability, they often lead to overly conservative decisions, whereas small sets risk excluding the true outcome. Recent data-driven approaches, particularly conformal prediction, offer finite-sample validity guarantees but remain largely task-agnostic, ignoring the downstream decision structure. In this paper, we propose a decision-aware conformal framework that learns uncertainty sets tailored to robust optimization objectives. Our approach parameterizes a flexible family of polyhedral sets via data-driven hyperplanes and learns their geometry by directly minimizing the induced robust loss, while preserving statistical validity through conformal calibration. To correct for data-dependent selection, we incorporate a re-calibration step on an independent dataset to restore coverage. The resulting sets capture directional and anisotropic uncertainty aligned with the decision objective while remaining computationally tractable. We provide finite-sample coverage guarantees and bounds on the sub-optimality gap to an oracle decision. This work bridges the gap between statistical validity and decision optimality, providing a principled framework for data-driven robust optimization.
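
The calibration idea can be illustrated with ordinary split conformal prediction applied to a polyhedral score. In the sketch below the hyperplane directions are fixed by hand, whereas the paper learns them by minimizing the robust loss; the score definition, function names, and toy residuals are assumptions made for illustration only.

```python
import math


def polyhedral_score(residual, directions):
    """Largest projection of the residual onto any hyperplane normal."""
    return max(
        sum(a_i * r_i for a_i, r_i in zip(normal, residual))
        for normal in directions
    )


def conformal_radius(cal_residuals, directions, alpha):
    """Split-conformal quantile of the scores on held-out calibration residuals."""
    scores = sorted(polyhedral_score(r, directions) for r in cal_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the set must be unbounded to keep coverage.
    return math.inf if rank > n else scores[rank - 1]


def in_uncertainty_set(y, y_hat, directions, radius):
    """Check whether y lies in {y : a_k . (y - y_hat) <= radius for all k}."""
    residual = [yi - pi for yi, pi in zip(y, y_hat)]
    return polyhedral_score(residual, directions) <= radius


# Hypothetical axis-aligned normals (a box) and calibration residuals y - y_hat.
directions = [(1, 0), (-1, 0), (0, 1), (0, -1)]
cal_residuals = [
    (0.1, 0.0), (-0.3, 0.2), (0.0, 0.5), (0.4, -0.1),
    (-0.2, -0.2), (0.05, 0.6), (0.7, 0.0),
]
radius = conformal_radius(cal_residuals, directions, alpha=0.25)
```

Because the radius is computed on data that played no part in choosing the directions, the usual split-conformal coverage argument applies, which mirrors the role of the independent re-calibration set mentioned in the abstract.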

2605.08374 2026-05-15 cs.AI

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Junwei Liao, Haoting Shi, Ruiwen Zhou, Jiaqian Wang, Shengtao Zhang, Wei Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Bo Tang, Weinan Zhang, Muning Wen

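AI summary This paper proposes MemQ, a new memory agent framework that brings Q-learning into a memory system built on a provenance DAG, addressing the shortcomings of existing methods in handling dependencies among memories. MemQ updates memory Q-values with TD(λ) eligibility traces and propagates credit backward through the provenance DAG, so that dependencies among memories are assessed more accurately. Experiments show that MemQ achieves superior generalization and runtime learning on six benchmarks from different domains, with the largest gains on multi-step tasks.

Comments 22 pages, 11 figures (containing 43 individual image panels total)

Details
English abstract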

Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD(λ) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as (γλ)^d with DAG depth d, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7 pp) and smallest on single-step classification (+0.77 pp) where single-step updates already suffice. We further study how γ and λ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao-ai/MemQ.
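
The depth-based credit decay can be sketched as a backward breadth-first walk over the provenance DAG. The abstract only states that the credit weight is the product of γ and λ raised to the DAG depth; the additive update rule, the learning rate `lr`, the use of shortest-path depth, and the memory names below are assumptions for illustration and are not taken from the MemQ code.

```python
from collections import deque


def propagate_credit(q_values, parents, source, td_error, gamma, lam, lr):
    """Spread a TD error from `source` back to its ancestors in the DAG.

    parents[m] lists the memories that were retrieved when memory m was created.
    Each reached memory gets lr * (gamma * lam) ** depth * td_error added to its
    Q-value, where depth is the shortest backward distance from `source`.
    """
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        weight = (gamma * lam) ** depth[node]
        q_values[node] = q_values.get(node, 0.0) + lr * weight * td_error
        for parent in parents.get(node, []):
            if parent not in depth:  # visit each ancestor once
                depth[parent] = depth[node] + 1
                queue.append(parent)
    return depth


# Hypothetical provenance: m3 was created using m2 and m1; m1 was created using m0.
parents = {"m3": ["m2", "m1"], "m2": ["m1"], "m1": ["m0"]}
q_values = {}
depths = propagate_credit(
    q_values, parents, source="m3", td_error=1.0, gamma=0.5, lam=0.5, lr=1.0
)
```

With these toy values the weight shrinks by a factor of 0.25 per level, so memories that are structurally far from the rewarded one receive little credit regardless of when they were written.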

2605.08278 2026-05-15 cs.LG cs.AI cs.CR

Trapping Attacker in Dilemma: Examining Internal Correlations and External Influences of Trigger for Defending GNN Backdoors

Fan Yang, Binyan Xu, Di Tang, Kehuan Zhang

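AI summary This paper studies the defense of graph neural networks (GNNs) against backdoor attacks and proposes a new defense named PRAETORIAN. By analyzing internal correlations within potential trigger subgraphs and the external influence of nodes, the method detects abnormally injected structures and identifies trigger nodes with disproportionate influence, thereby recognizing attacks effectively. Experiments show that PRAETORIAN substantially lowers the attack success rate while keeping clean accuracy high, and that it remains effective against a variety of adaptive attacks, forcing attackers into an unfavorable trade-off between efficacy and detectability.

Details
English abstract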

GNNs have become a standard tool for learning on relational data, yet they remain highly vulnerable to backdoor attacks. Prior defenses often depend on inspecting specific subgraph patterns or node features, and thus can be circumvented by adaptive attackers. We propose PRAETORIAN, a new defense that targets intrinsic requirements of effective GNN backdoors rather than surface-level cues. Our key observation is that flipping a victim node's prediction requires substantial influence on the victim: attackers tend to either inject many trigger nodes or rely on a small set of highly influential ones. Building on this observation, PRAETORIAN (i) analyzes internal correlations within potential trigger subgraphs to detect abnormally large injected structures, and (ii) quantifies external node influence to identify triggers with disproportionate impact. Across our evaluations, PRAETORIAN reduces the average attack success rate (ASR) to 0.55% with only a 0.62% drop in clean accuracy (CA), whereas state-of-the-art defenses still yield an average ASR of >20% and a CA drop of >3% under the same conditions. Moreover, PRAETORIAN remains effective against a range of adaptive attacks, forcing adversaries to either inject many trigger nodes to achieve high ASR (>80%), which incurs a >10% CA drop, or preserve CA at the cost of limiting ASR to 18.1%. Overall, PRAETORIAN constrains attackers to an unfavorable trade-off between efficacy and detectability.

2605.07594 2026-05-15 cs.RO

MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

Xin Ding, Xinrui Wang, Yifan Yang, Hao Wu, Shiqi Jiang, Qianxi Zhang, Liang Mi, Hanxin Zhu, Kun Li, Yunxin Liu, Zhibo Chen, Ting Cao

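AI summary This paper proposes MemCompiler, a new memory system for embodied agents that addresses the mismatch between existing memory injection methods and the agent's state in dynamic environments. It reframes memory utilization as state-conditioned memory compilation: a learned memory compiler dynamically selects and compiles relevant memory according to the agent's current state and produces executable guidance. Experiments show that MemCompiler substantially improves agent performance across multiple task environments and reduces latency, confirming gains in both effectiveness and efficiency.

Details
English abstract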

Existing memory systems for embodied agents typically inject retrieved memory as static context at episode start, a paradigm we term Ahead-of-time Monolithic Memory Injection (AMMI). However, this static design quickly becomes misaligned with the agent's evolving state and may degrade lightweight executors below the no-memory baseline. To address this, we propose MemCompiler, which reframes memory utilization as State-Conditioned Memory Compilation. A learned Memory Compiler reads a structured Brief State capturing the agent's current execution state and dynamically selects and compiles only relevant memory into executable guidance. This guidance is delivered through a text channel and a latent Soft-Mem channel that preserves perceptual information not expressible in text. Across ALFWorld, EmbodiedBench, and ScienceWorld, MemCompiler consistently improves over no-memory across open-source backbones (up to +129%), matches or approaches frontier closed-source systems, and reduces per-step latency by 60%, demonstrating that state-aware memory compilation improves both effectiveness and efficiency.

2605.06132 2026-05-15 cs.CL

MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

Chunyu Li, Mengyuan Zhang, Jingyi Kang, Ding Chen, Jiajun Shen, Bo Tang, Xuanhe Zhou, Feiyu Xiong, Zhiyu Li

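AI summary In agent memory systems, the reranking model is the key bridge between user queries and long-term memory. Existing methods mostly adopt a two-stage "retrieve-then-rerank" paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning ability, so retrieved results can be semantically relevant yet miss the key information needed to answer the question. This paper therefore proposes MemReranker, a family of reranking models built on Qwen3-Reranker through multi-stage knowledge distillation: multi-teacher pairwise comparisons generate calibrated labels, BCE pointwise distillation shapes the score distribution, and InfoNCE contrastive learning strengthens discrimination of hard samples. Training combines general corpora with multi-turn dialogue data covering scenarios such as temporal constraints and causal reasoning. The models perform well on multiple benchmarks and clearly outperform existing models in reasoning ability and inference efficiency.

Details
English abstract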

In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the "retrieve-then-rerank" two-stage paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning capabilities, leading to a problem where recalled results are semantically highly relevant yet do not contain the key information needed to answer the question. This deficiency manifests in memory scenarios as three specific problems. First, relevance scores are miscalibrated, making threshold-based filtering difficult. Second, ranking degrades when facing temporal constraints, causal reasoning, and other complex queries. Third, the model cannot leverage dialogue context for semantic disambiguation. This report introduces MemReranker, a reranking model family (0.6B/4B) built on Qwen3-Reranker through multi-stage LLM knowledge distillation. Multi-teacher pairwise comparisons generate calibrated soft labels, BCE pointwise distillation establishes well-distributed scores, and InfoNCE contrastive learning enhances hard-sample discrimination. Training data combines general corpora with memory-specific multi-turn dialogue data covering temporal constraints, causal reasoning, and coreference resolution. On the memory retrieval benchmark, MemReranker-0.6B substantially outperforms BGE-Reranker and matches open-source 4B/8B models as well as GPT-4o-mini on key metrics. MemReranker-4B further achieves 0.737 MAP, with several metrics on par with Gemini-3-Flash, while maintaining inference latency at only 10--20% of large models. On finance and healthcare vertical-domain benchmarks, the models preserve generalization capabilities on par with mainstream large-parameter rerankers.
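
The two training objectives named in the abstract have standard textbook forms, sketched below in plain Python for a single example. How MemReranker combines or weights them, its temperature, and how negatives are mined are not stated in the abstract; the function names and the default `temperature` value are assumptions for illustration.

```python
import math


def bce_soft_label(logit, soft_label):
    """Binary cross-entropy between sigmoid(logit) and a soft label in [0, 1].

    Uses the stable identity BCE = softplus(logit) - soft_label * logit.
    """
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - soft_label * logit


def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max before exponentiating for stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# A teacher soft label of 0.9 pulls the student's logit toward a high score.
pointwise_loss = bce_soft_label(logit=2.0, soft_label=0.9)

# The contrastive term is small when the positive outscores the hard negatives.
contrastive_loss = info_nce(pos_score=0.8, neg_scores=[0.3, 0.1])
```

The pointwise term shapes the absolute score scale, which is what makes threshold-based filtering workable, while the contrastive term only cares about the relative ordering within each candidate set.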