arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15199 2026-05-15 cs.CV cs.AI Version update

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez

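Affiliations * ByteDance; ByteDance Seed; Rice University

AI summary EntityBench is a benchmark for evaluating entity consistency in multi-shot video generation. It contains 140 episodes (2,491 shots) drawn from real narrative media, spans scenes at several difficulty tiers, and explicitly tracks the continuity of characters, objects, and locations across shots. The benchmark introduces a three-pillar evaluation suite that separately scores intra-shot quality, prompt alignment, and cross-shot consistency, with a "fidelity gate" that admits only accurate entity appearances into cross-shot scoring. The authors also propose EntityMem, a memory-augmented generation method that stores a visual reference for each entity before generation begins; among the methods evaluated it achieves the highest character fidelity and presence.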

Comments Project page: https://catherine-r-he.github.io/EntityBench/

Abstract

Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.
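The abstract reports an effect size as Cohen's d. As a minimal sketch of how such a value is computed, assuming the pooled-standard-deviation variant (the abstract does not say which variant the authors use) and purely hypothetical scores:

```python
import math
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation.

    Assumption: pooled-variance Cohen's d; the paper may use another variant.
    """
    n1, n2 = len(treatment), len(control)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


# Hypothetical per-episode fidelity scores, not data from the paper.
print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))
```

A positive value means the first group scores higher on average, measured in units of the pooled standard deviation.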

2605.15198 2026-05-15 cs.CV cs.AI cs.CL Version update

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

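Affiliations * Meta AI; The Chinese University of Hong Kong

AI summary This work proposes ATLAS, a visual reasoning framework that targets the computational overhead and limited task generalization of earlier approaches. ATLAS uses a single discrete "functional token" that serves as both an agentic operation and a latent visual reasoning unit; it requires no visual supervision and is compatible with standard training pipelines. The authors also introduce LA-GRPO to stabilize training, and experiments show that ATLAS performs strongly on several benchmarks while remaining efficient and interpretable.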

Comments Project Page: https://atlas-oneword.github.io Code: https://github.com/ZiyuGuo99/ATLAS

Abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.

2605.15188 2026-05-15 cs.LG cs.AI cs.CL Version update

FutureSim: Replaying World Events to Evaluate Adaptive Agents

Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping

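Affiliations * ELLIS Institute Tübingen; Max Planck Institute for Intelligent Systems; Institute for AI, University of Stuttgart; Tübingen AI Center; University of Tübingen; University of Southampton

AI summary This paper presents FutureSim, a benchmark for evaluating how well adaptive AI agents forecast real-world events. It replays real news in chronological order and tests whether agents can predict events that occur after their knowledge cutoff. Over a three-month period (January to March 2026), frontier agents achieve low accuracy, with the best reaching only 25%, which shows that long-horizon adaptation and reasoning under uncertainty remain hard for current models. FutureSim offers a realistic setting for studying long-horizon adaptation, search, memory, and reasoning about uncertainty.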

Comments 31 pages, 10 main

Abstract

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
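The abstract compares agents by Brier skill score. A minimal sketch of that metric for binary questions follows, assuming a constant 0.5 forecast as the uninformative reference; the abstract does not define the reference or the question format, so both are assumptions here, and the numbers are hypothetical.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference_prob=0.5):
    """1 - BS / BS_ref: positive beats the reference, negative is worse than it.

    Assumption: the reference forecaster always answers `reference_prob`.
    """
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref


# Hypothetical forecasts for two resolved questions.
print(brier_skill_score([0.9, 0.2], [1, 0]))
```

Under this definition, a confidently wrong forecaster scores below zero, which is how an agent can do worse than one that commits to nothing.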

2605.15185 2026-05-15 cs.CV cs.AI Version update

Quantitative Video World Model Evaluation for Geometric-Consistency

Jiaxin Wu, Yihao Pi, Yinling Zhang, Yuheng Li, Xueyan Zou

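Affiliations * Tsinghua University - IEI Lab; UW-Madison; Adobe Research

AI summary This paper proposes PDI-Bench, a quantitative framework for detecting geometric inconsistency in generated videos. It obtains object-centric observations through segmentation and point tracking, lifts them into 3D with monocular reconstruction, and computes projective-geometry residuals covering three failure dimensions: scale-depth alignment, 3D motion consistency, and structural rigidity. The authors also build PDI-Dataset for systematic evaluation of the geometric properties of generated videos, which exposes shortcomings in the physical plausibility of current generative models.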

Comments 12 pages, 5 figures. Project page : https://pdi-bench.github.io/

Abstract

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world model. Our code and dataset can be found at https://pdi-bench.github.io/.

2605.15179 2026-05-15 cs.LG cs.AI physics.comp-ph Version update

Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing

Ellwil Sharma, Arastu Sharma

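Affiliations * Shodh AI

AI summary This paper studies how to eliminate negative transfer in multi-physics foundation models, that is, the gradient conflict and unstable optimization that arise when different partial differential equation (PDE) systems are trained together. The authors propose Shodh-MoE, a sparsely activated mixture-of-experts (MoE) architecture: a physics-informed autoencoder produces compressed physical latents, and a soft-semantic router assigns local latent patches governed by different physical mechanisms to specialized expert subnetworks, giving efficient and stable multi-physics modeling. In experiments the model conserves mass and converges simultaneously on both physical regimes, with decoded physical MSEs of 2.48 x 10^-6 and 1.76 x 10^-6.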

Comments 5 pages, 4 figures

Abstract

Scaling Scientific Machine Learning (SciML) toward universal foundation models is bottlenecked by negative transfer: the simultaneous co-training of disparate partial differential equation (PDE) regimes can induce gradient conflict, unstable optimization, and plasticity loss in dense neural operators. In particular, broadband open-channel fluid dynamics and boundary-dominated porous media flows impose incompatible spectral and geometric demands on a single dense parameter path. We introduce Shodh-MoE, a sparse-activated latent transformer architecture for multi-physics transport. Shodh-MoE operates on compressed 16^3 physical latents produced by a physics-informed autoencoder with an intra-tokenizer Helmholtz-style velocity parameterization, restricting decoded states to divergence-free velocity manifolds. The model guarantees exact mass conservation, achieving a physically verifiable velocity divergence of ~2.8 x 10^-10 (evaluated post-hoc in FP64) on 128^3 grids. A Top-1 soft-semantic router dynamically assigns localized latent patches to expert subnetworks, enabling specialized parameter paths for distinct physical mechanisms while preserving shared experts for universal symmetries. In a 20,000-step distributed pretraining run over mixed three-dimensional physical tensors, routing telemetry shows autonomous domain bifurcation: held-out validation tokens from the open-channel domain route exclusively to Expert 0, while porous-media tokens route exclusively to Expert 1. The model converges simultaneously across both regimes, achieving latent validation MSEs of 2.46 x 10^-5 and 9.76 x 10^-6, and decoded physical MSEs of 2.48 x 10^-6 and 1.76 x 10^-6. These results support sparse expert routing as a practical architectural mechanism for mitigating multi-physics interference in universal neural operators.
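The routing step described in the abstract can be illustrated with generic Top-1 mixture-of-experts gating. This is a sketch of the standard technique only: the abstract does not specify how the "soft-semantic" router computes its scores, so the logits and function names below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(router_logits):
    """Return (expert_index, gate_probability) for each token.

    Each token is sent to the single expert with the highest router
    probability; the probability can later scale that expert's output.
    """
    routes = []
    for logits in router_logits:
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        routes.append((expert, probs[expert]))
    return routes


# Hypothetical logits for two latent patches and two experts.
print(top1_route([[2.0, 0.0], [0.0, 3.0]]))
```

With logits like these, the first patch goes to expert 0 and the second to expert 1, which mirrors the kind of domain split the routing telemetry in the abstract describes.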

2605.15171 2026-05-15 cs.CV cs.AI cs.LG Version update

Evidential Reasoning Advances Interpretable Real-World Disease Screening

Chenyu Lian, Hong-Yu Zhou, Jing Qin

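Affiliations * The Center for Smart Health, School of Nursing, the Hong Kong Polytechnic University, Hong Kong, China; Research Institute for Smart Ageing, the Hong Kong Polytechnic University, Hong Kong, China; School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University, Beijing, China

AI summary This paper proposes EviScreen, an interpretable disease screening framework based on evidential reasoning, which addresses the limited interpretability and performance of current medical image screening models. It retrieves region-level evidence from historical cases and uses dual knowledge banks to provide retrospective interpretability, improving transparency and diagnostic accuracy. Abnormality maps derived from contrastive retrieval further improve localization interpretability. On real-world disease screening benchmarks the method performs strongly, with notably higher specificity at clinical-level recall.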

Comments ICML 2026

Abstract

Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.

2605.15168 2026-05-15 cs.CL cs.AI cs.LG stat.ML Version update

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss

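Affiliations * National Library of Medicine, National Institutes of Health; Carnegie Mellon University

AI summary This study addresses the complementary temporal information in clinical text and structured electronic health records (EHR) and proposes a retrieval-augmented multimodal alignment framework for reconstructing more precise clinical timelines. The method extracts key events from text to build a temporal scaffold, then calibrates it with temporal information from structured data to improve timestamp accuracy. Experiments show that it improves temporal concordance across nearly all evaluated models while preserving event match rates, demonstrating the benefit of multimodal alignment for reconstructing clinical trajectories.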

Comments Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim (authors contributed equally)

Abstract

Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.

2605.15164 2026-05-15 cs.LG cs.AI Version update

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

Pratinav Seth, Vinay Kumar Sankarapu

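Affiliations * Lexsi Labs

AI summary This paper argues that current behavioural assurance methods cannot meet the safety verification demands of AI governance frameworks. Those frameworks require verifying properties such as the absence of hidden objectives, resistance to loss of control, and bounded catastrophic capability, but existing methods observe only model outputs and cannot verify latent representations or long-horizon behaviour. The paper introduces the "audit gap" to describe the mismatch between required and achievable verification, and proposes a technical pivot: bounding the weight of behavioural evidence in legal text and introducing mechanistic verification methods.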

Abstract

This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.

2605.15155 2026-05-15 cs.LG cs.AI cs.CL Version update

Self-Distilled Agentic Reinforcement Learning

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

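Affiliations * Zhejiang University; Meituan; Tsinghua University

AI summary This paper studies how to improve reinforcement learning (RL) based large language model agents on multi-turn tasks. Because trajectory-level RL rewards give only sparse supervision on long-horizon tasks, the authors propose Self-Distilled Agentic Reinforcement Learning (SDAR), which adds dense token-level guidance from a teacher branch as an auxiliary objective alongside the main RL optimization. A gating mechanism strengthens distillation on tokens the teacher endorses and softly attenuates the teacher's negative rejections. SDAR substantially improves performance on several benchmarks and avoids the instability of naive GRPO+OPSD.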

Abstract

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, since negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
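The sigmoid gate mentioned in the abstract can be pictured with a small sketch. It assumes the token-level signal is the gap between teacher and student log-probabilities and that the gate multiplies a per-token distillation loss; the abstract confirms neither detail, and the `temperature` parameter and function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate_weights(teacher_logps, student_logps, temperature=1.0):
    """Per-token gate in (0, 1), used as a constant (detached) weight.

    Assumption: the signal is teacher log-prob minus student log-prob.
    A positive gap (teacher endorses the token) pushes the gate toward 1;
    a negative gap (teacher rejects it) pushes the gate toward 0.
    """
    return [sigmoid((t - s) / temperature) for t, s in zip(teacher_logps, student_logps)]


def gated_mean(token_losses, gates):
    """Gate-weighted mean of per-token distillation losses."""
    return sum(g * l for g, l in zip(gates, token_losses)) / len(token_losses)


# Hypothetical log-probabilities for two tokens.
gates = gate_weights([-0.1, -3.0], [-2.0, -0.5])
print(gates, gated_mean([1.0, 1.0], gates))
```

Because the gate never reaches exactly zero, rejected tokens are attenuated softly rather than dropped, which matches the asymmetric treatment the abstract describes.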

2605.15132 2026-05-15 cs.AI cs.DC cs.MA Version update

APWA: A Distributed Architecture for Parallelizable Agentic Workflows

Evan Rose, Tushin Mallick, Matthew D. Laws, Cristina Nita-Rotaru, Alina Oprea

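Affiliations * Northeastern University

AI summary This paper proposes APWA, a distributed architecture for efficiently processing highly parallelizable agentic workloads. By decomposing tasks into non-interfering subproblems that are processed on independent resources without cross-communication, it overcomes the reasoning, coordination, and computational scaling bottlenecks of conventional multi-agent systems. Experiments show that APWA dynamically decomposes complex queries into parallelizable workflows and scales to larger tasks in settings where prior systems fail.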

Comments 25 pages, 2 figures, 14 tables

Abstract

Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide breadth of application domains. However, these systems hit critical reasoning, coordination, and computational scaling bottlenecks as the size and complexity of their tasks grow. These limitations hinder multi-agent systems from achieving high-throughput processing for highly parallelizable tasks, despite the availability of parallel computing and reasoning primitives in the underlying LLMs. We introduce the Agent-Parallel Workload Architecture (APWA), a distributed multi-agent system architecture designed for the efficient processing of heavily parallelizable agentic workloads. APWA facilitates parallel execution by decomposing workflows into non-interfering subproblems that can be processed using independent resources without cross-communication. It supports heterogeneous data and parallel processing patterns, and it accommodates tasks from a wide breadth of domains. In our evaluation, we demonstrate that APWA can dynamically decompose complex queries into parallelizable workflows and scales on larger tasks in settings where prior systems fail completely.
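The decomposition idea in the abstract, independent subproblems with no cross-communication followed by a merge, can be sketched with Python's standard thread pool. This illustrates the general pattern and is not APWA's architecture; `decompose`, `solve`, and `run_workflow` are hypothetical names, and a sum of squares stands in for real agent work.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(items, n_parts):
    """Split the workload into disjoint chunks that share no state."""
    return [items[i::n_parts] for i in range(n_parts)]


def solve(chunk):
    """Stand-in for one worker's task; it reads only its own chunk."""
    return sum(x * x for x in chunk)


def run_workflow(items, n_workers=4):
    parts = decompose(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Workers never communicate; results are merged only at the end.
        partials = list(pool.map(solve, parts))
    return sum(partials)


print(run_workflow(list(range(10))))
```

The design choice that matters is that each chunk is self-contained, so adding workers does not add coordination cost.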

2605.15127 2026-05-15 cs.HC cs.AI Version update

Understanding How International Students in the U.S. Are Using Conversational AI to Support Cross-Cultural Adaptation

Laleh Nourian, Anisa Callis, Stephanie Patterson, Jadeline Miao, Jamison Heard, Garreth W. Tigwell

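Affiliations * Rochester Institute of Technology; School of Information

AI summary This paper studies how international students in the U.S. use conversational AI to support cross-cultural adaptation. Through a survey and interviews, it characterizes the usage patterns, motivations, and limits of AI tools when students face adaptation challenges. Students see AI as a "first-aid tool" for immediate problems but would like it to become a long-term support companion. The study contributes recommendations for designing AI support tailored to international students' needs.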

Comments 33 pages, single column. 4 figures, 9 tables

Abstract

Moving to a new culture and adapting to a new life, as an international student, can be a stressful experience. In the US, international students face unique overlapping challenges, yet the current support ecosystem, including university support systems and informal social networks, remains largely fragmented. While conversational AI has emerged as a tool used by many (e.g., generative AI chatbots like ChatGPT and Google Gemini), we do not have a clear understanding of how international students adopt and perceive these technologies as support tools. We conducted a survey study (n=60) to map the relationship between international students' challenges and AI adoption patterns, followed by an interview study with 14 participants to identify the underlying motivations and boundaries of use. Our findings show that AI is perceived as a first-aid tool for immediate challenges; however, there is an interest in transforming AI from a tool for short-term help into a long-term support companion. By identifying where and how AI can provide long-term support, and where it is insufficient, we contribute recommendations for creating AI-powered support tailored to the unique needs of international students.

2605.15109 2026-05-15 cs.AI cs.IR Version update

Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG

Riccardo Terrenzi, Maximilian von Zastrow, Serkan Ayvaz

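Affiliations * Centre for Industrial Software, University of Southern Denmark, Alsion 2, 6400 Sønderborg, Denmark

AI summary This paper studies citation trustworthiness in Agentic GraphRAG systems and argues that citation faithfulness should be judged over the whole graph traversal trajectory rather than only the cited sources. Controlled ablation experiments analyze how cited and uncited entities affect answer generation and find that cited evidence matters, but accurate answers also depend on uncited traversal context and graph structure. The work offers a new perspective on evaluating citation quality in such systems, emphasizing provenance over the broader retrieval trajectory.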

Comments 7 pages, 2 figures, Submitted at IJCAI-ECAI 2026 Joint Workshop on GENAIK and NORA

Abstract

Retrieval-Augmented Generation can improve factuality by grounding answers in external evidence, but Agentic GraphRAG complicates what it means for citations to be faithful. In these systems, an agent explores a knowledge graph before producing an answer and a small set of citations. We frame citation faithfulness as a trajectory-level problem: final citations should not only support the answer, but also account for the graph traversal, structure, and visited-but-uncited entities that may influence it. Through controlled ablation experiments, we compare the effects of isolating, removing, and masking cited and uncited graph entities. Our results show that cited evidence is often necessary, as removing it substantially changes answers and reduces accuracy. However, citations are not sufficient, because accurate answers can also depend on uncited traversal context and surrounding graph structure. These findings suggest that citation evaluation in Agentic GraphRAG should move beyond source support toward provenance over the broader retrieval trajectory.

2605.15102 2026-05-15 cs.CL cs.AI Version update

Improving Multi-turn Dialogue Consistency with Self-Recall Thinking

Renning Pang, Tian Lan, Leyuan Liu, Xiaoming Huang, Piao Tong, Xiaosong Zhang

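Affiliations * University of Electronic Science and Technology of China

AI summary This paper studies the long-range context dependency and sparse information that LLM-based multi-turn dialogue systems face in long conversations. It proposes Self-Recall Thinking (SRT), a framework that constructs self-recall chains, initializes the recall-and-reasoning capability, and then refines the reasoning process, so that the model selectively recalls and reasons over dialogue history while keeping inference efficient. Experiments on multiple datasets show that SRT improves F1 by 4.7% and reduces end-to-end latency by 14.7% over prior methods.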

Abstract

Large language model (LLM) based multi-turn dialogue systems often struggle to track dependencies across non-adjacent turns, undermining both consistency and scalability. As conversations lengthen, essential information becomes sparse and is buried in irrelevant context, while processing the entire dialogue history incurs severe efficiency bottlenecks. Existing solutions either rely on high-latency external memory or lose fine-grained details through iterative summarization. In this paper, we propose Self-Recall Thinking (SRT), a framework designed to address long-range contextual dependency and sparse informative signals in multi-turn dialogue. SRT identifies helpful historical turns and uses them to generate contextually appropriate responses, enabling the model to selectively recall and reason over context during inference. This process yields an endogenous reasoning process that integrates interpretable recall steps without external modules. SRT incorporates: (1) Dependency Construction: Generating and converting it into self-recall chains; (2) Capability Initialization: Training to enable reasoning chains with recall tokens capability; (3) Reasoning Improvement: Refining accuracy via verifiable rewards to optimize recall and reasoning for correct answers. Experiments on multiple datasets demonstrate that SRT improves F1 score by 4.7% and reduces end-to-end latency by 14.7% over prior methods, achieving a balance between reasoning latency and accuracy, and outperforming state-of-the-art baselines.

2605.15100 2026-05-15 cs.AI Version update

Dual-Dimensional Consistency: Balancing Budget and Quality in Adaptive Inference-Time Scaling

Rongman Xu, Yifei Li, Tianzhe Zhao, Yanrui Wu, Bo Li, Hang Yan

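Affiliations * Xi’an Jiaotong University

AI summary This paper studies how to balance sampling budget against reasoning quality in adaptive inference-time scaling of large language models. Existing methods treat sampling width and depth as independent objectives, which makes efficiency and accuracy hard to achieve together. The authors propose Dual-Dimensional Consistency (DDC), which couples a confidence-weighted Bayesian protocol with trend-aware stratified pruning to concentrate compute on high-quality reasoning paths, filtering hallucinations and accelerating consensus. Across five benchmarks, the method reduces token consumption by over 10 times while matching or exceeding the accuracy of strong baselines.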

Abstract

Large Language Models (LLMs) have demonstrated remarkable abilities in reasoning. However, maximizing their potential through inference-time scaling faces challenges in trade-off between sampling budget and reasoning quality. Current strategies remain inefficient as they typically treat sampling width and depth as orthogonal objectives, where width consensus methods risk reinforcing hallucinations, while depth pruning mechanisms prematurely truncate complex yet valid reasoning chains. Therefore, we propose Dual-Dimensional Consistency (DDC), a unified framework that bridges path quality with adaptive termination. By coupling Confidence-Weighted Bayesian protocol with a Trend-Aware Stratified Pruning, our method ensures that computational resources are concentrated on high quality reasoning paths, filtering hallucinations while accelerating consensus. Evaluations across five benchmarks demonstrate that this approach reduces token consumption by over 10 times while maintaining or exceeding the accuracy of strong baselines across various LLMs.
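To make the budget-versus-quality trade-off concrete, here is a deliberately simplified sketch: confidence-weighted voting over sampled answers with early stopping once one answer holds enough of the weight. It stands in for, and is not, the paper's Confidence-Weighted Bayesian protocol or its pruning rule; the threshold, `min_samples`, and the sample data are hypothetical.

```python
from collections import defaultdict


def weighted_consensus(samples, threshold=0.8, min_samples=3):
    """Return (answer, samples_used), stopping early on strong agreement.

    `samples` is a list of (answer, confidence) pairs with confidence > 0.
    Assumption: stop once the leading answer's share of the total
    confidence reaches `threshold` after at least `min_samples` samples.
    """
    if not samples:
        raise ValueError("need at least one sample")
    weights = defaultdict(float)
    total = 0.0
    leader, used = None, 0
    for used, (answer, confidence) in enumerate(samples, start=1):
        weights[answer] += confidence
        total += confidence
        leader = max(weights, key=weights.get)
        if used >= min_samples and weights[leader] / total >= threshold:
            break  # consensus reached, remaining samples are never drawn
    return leader, used


# Hypothetical sampled answers with self-reported confidences.
samples = [("42", 0.9), ("42", 0.8), ("17", 0.2), ("42", 0.9), ("42", 0.7)]
print(weighted_consensus(samples))
```

Stopping early saves the tokens that the unused samples would have cost, while the confidence weights keep a low-confidence outlier from swinging the vote.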

2605.15083 2026-05-15 cs.LG cs.AI Version update

Novel Dynamic Batch-Sensitive Adam Optimiser for Vehicular Accident Injury Severity Prediction

Daniel Asare Kyei, Alimatu Saadia-Yussiff, Maame G. Asante-Mensah, Abdul Lateef-Yussiff, Charles Roland Haruna, Derry Emmanuel

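Affiliations * Department of Computer Science and Information Technology, University of Cape Coast

AI summary This study proposes DBS-Adam, a dynamic batch-sensitive optimiser for vehicular accident injury severity prediction, a task with class imbalance and sequential data. DBS-Adam scales the learning rate using exponential moving averages of gradient norms and batch loss, which improves training stability and speeds convergence. It reaches 95.22% test accuracy and 96.11% precision, and against AMSGrad, AdamW, and AdaBound it shows competitive performance with a statistically significant precision improvement (p=0.020).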

详情
英文摘要

The choice of optimiser is important in deep learning, as it strongly influences model efficiency and speed of convergence. However, many commonly used optimisers encounter difficulties when applied to imbalanced and sequential datasets, limiting their ability to capture patterns of minority classes. In this study, we propose Dynamic Batch-Sensitive Adam (DBS-Adam), an optimiser that dynamically scales the learning rate using a batch difficulty score derived from exponential moving averages of gradient norms and batch loss. DBS-Adam improves training stability and accelerates convergence by increasing updates for difficult batches and reducing them for easier ones. We evaluate DBS-Adam by integrating it with Bi-Directional LSTM networks for accident injury severity prediction, addressing class imbalance through SMOTE-ENN resampling and Focal Loss. Four experimental configurations compare baseline Bi-LSTM models and alternative architectures to assess optimiser impact. Rigorous comparison against state-of-the-art optimisers (AMSGrad, AdamW, AdaBound) across five random seeds demonstrated DBS-Adam's competitive performance with statistically significant precision improvements (p=0.020). Results indicate that DBS-Adam outperforms standard optimisation approaches, achieving 95.22% test accuracy, 96.11% precision, 95.28% recall, 95.39% F1-score, and a test loss of 0.0086. The proposed framework enables effective real-time accident severity classification for targeted emergency response and road safety interventions, demonstrating the value of DBS-Adam for learning from imbalanced sequential data.
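The learning-rate scaling described in the abstract can be sketched as follows. The abstract says only that the difficulty score is derived from exponential moving averages of gradient norms and batch loss; the specific formula below (an average of two ratios, clipped to a range) is an assumption for illustration, as are the class name and all default values. The Adam update itself is omitted.

```python
class BatchDifficultyScaler:
    """Scales a base learning rate by a batch difficulty score (hypothetical formula)."""

    def __init__(self, base_lr=1e-3, beta=0.9, min_scale=0.5, max_scale=2.0, eps=1e-12):
        self.base_lr = base_lr
        self.beta = beta
        self.min_scale = min_scale
        self.max_scale = max_scale
        self.eps = eps
        self.ema_grad = None  # running average of gradient norms
        self.ema_loss = None  # running average of batch losses

    def step(self, grad_norm, loss):
        """Return the learning rate to use for the current batch."""
        if self.ema_grad is None:
            # First batch: initialise both averages from the observed values.
            self.ema_grad, self.ema_loss = grad_norm, loss
        # Difficulty above 1 means the batch is harder than the running average.
        difficulty = 0.5 * (grad_norm / (self.ema_grad + self.eps)
                            + loss / (self.ema_loss + self.eps))
        scale = min(self.max_scale, max(self.min_scale, difficulty))
        # Update the running averages after scoring the batch.
        self.ema_grad = self.beta * self.ema_grad + (1 - self.beta) * grad_norm
        self.ema_loss = self.beta * self.ema_loss + (1 - self.beta) * loss
        return self.base_lr * scale


scaler = BatchDifficultyScaler(base_lr=0.1)
print(scaler.step(1.0, 1.0), scaler.step(3.0, 3.0))
```

Clipping the scale keeps a single unusual batch from producing a runaway step, which is one plausible way to get larger updates on hard batches without losing stability.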

2605.15081 2026-05-15 cs.CL cs.AI Version update

ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

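Affiliations * School of Computer Science, Shanghai Jiao Tong University, Shanghai, China

AI summary This paper proposes ML-Embed, a suite of multilingual embedding models that addresses the high computational cost, narrow language coverage, and limited transparency of current high-quality text embeddings. Built on the 3-Dimensional Matryoshka Learning (3D-ML) framework, it improves efficiency across the model lifecycle, and with a massively multilingual dataset and models ranging from 140M to 8B parameters it improves parameter efficiency and linguistic inclusiveness. ML-Embed performs strongly across benchmarks, especially on low-resource languages, and provides a reproducible blueprint for equitable and efficient AI systems.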

Comments Accepted by ICML 2026. The data has been released earlier in the preprint arXiv:2603.19223

Abstract

The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.
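The storage benefit of Matryoshka Representation Learning comes from using a prefix of the embedding as a smaller embedding. A minimal sketch of that inference-time step, assuming a model trained so that prefixes remain meaningful; the vectors are hypothetical and nothing here reflects ML-Embed's actual interface.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalise an all-zero prefix")
    return [x / norm for x in prefix]


def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 3-dimensional embeddings truncated to 2 dimensions.
query = truncate_embedding([3.0, 4.0, 12.0], 2)
doc = truncate_embedding([6.0, 8.0, -1.0], 2)
print(cosine(query, doc))
```

Truncating to a shorter prefix cuts index size proportionally, at the cost of whatever information the dropped coordinates carried.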

2605.15077 2026-05-15 cs.CL cs.AI cs.LG Version update

Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

Guangyu Feng, Huanzhi Mao, Prabal Dutta, Joseph E. Gonzalez

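Affiliations * University of California, Berkeley

AI summary This paper proposes AsyncFC, a pure execution-layer framework that gives large language models (LLMs) asynchronous function calling without changing the model or the function implementations. By decoupling model decoding from function execution so that the two can overlap, it substantially reduces end-to-end task completion time. Experiments on several benchmarks show faster task completion with task accuracy preserved, and indicate that LLMs can natively reason over symbolic futures that stand for unresolved execution results.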

Abstract

Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end-to-end latency. In this work, we introduce AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter-function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine-tuning or changes to the standard synchronous function-calling protocol. Across standard function-calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model-tool interaction.
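The future-based idea in the abstract can be illustrated with Python's `asyncio`: tool calls start immediately, the caller keeps working with placeholder names, and the placeholders are resolved only when the values are needed. This is a sketch of the general pattern, not AsyncFC's implementation; `slow_tool`, the `fut_0`/`fut_1` placeholders, and the draft string are hypothetical.

```python
import asyncio


async def slow_tool(name, delay, value):
    """Stand-in for an external function call that takes time."""
    await asyncio.sleep(delay)
    return f"{name}={value}"


async def run_agent():
    # Launch both calls without waiting; each task is a symbolic future.
    futures = {
        "fut_0": asyncio.create_task(slow_tool("weather", 0.05, "sunny")),
        "fut_1": asyncio.create_task(slow_tool("traffic", 0.05, "light")),
    }
    # "Decoding" continues while the tools run, referring to futures by name.
    draft = "Plan using fut_0 and fut_1"
    # Resolve each placeholder only once its value is actually required.
    for key, task in futures.items():
        draft = draft.replace(key, await task)
    return draft


print(asyncio.run(run_agent()))
```

Because both tasks are created before either is awaited, the two delays overlap instead of adding up, which is the source of the latency saving.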

2605.15071 2026-05-15 cs.CV cs.AI cs.CL Version update

On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen

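Affiliations * MBZUAI Inception

AI summary The study identifies a "cultural anachronism" problem in vision-language models applied to cultural heritage materials: the models tend to misinterpret historical artifacts using concepts, materials, or cultural frameworks that do not fit the period. To measure it, the authors build the TAB-VLM benchmark, which covers 1,600 Indian cultural artifacts from different periods and 600 questions for evaluating temporal reasoning. Experiments show that even the most advanced models perform poorly on the benchmark, exposing significant gaps in how current vision-language models understand and handle historical material from non-Western cultures.

Comments Project Page: https://khushboo0012.github.io/tab-vlm-webpage/

Details
English abstract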

Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available on our project page.

2605.15058 2026-05-15 cs.NE cs.AI Version update

NeuroTrain: Surveying Local Learning Rules for Spiking Neural Networks with an Open Benchmarking Framework

Alessio Caviglia, Filippo Marostica, Roberta Bardini, Alessandro Savino, Stefano Di Carlo

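Affiliations * Politecnico di Torino, Control and Computer Engineering Department

AI summary This survey reviews recent advances in training algorithms for spiking neural networks (SNNs), systematically covering surrogate-gradient backpropagation, local learning rules, biologically inspired plasticity mechanisms, and other approaches, and proposes a unified taxonomy. To support reproducible research, the authors release NeuroTrain, an open-source framework that implements a representative set of these algorithms and offers a unified, modular, and extensible benchmarking platform. The work consolidates fragmented literature and identifies open challenges and future directions for efficient, scalable SNN training.

Details
English abstract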

The rapid expansion of spiking neural networks (SNNs) has led to a proliferation of training algorithms that differ widely in biological inspiration, computational structure, and hardware suitability. Despite this progress, the field lacks a unified, fine-grained taxonomy that systematically organizes these approaches and clarifies their conceptual relationships. This survey provides a comprehensive taxonomy of SNN training algorithms, spanning surrogate-gradient backpropagation, local and three-factor learning rules, biologically inspired plasticity mechanisms, ANN-to-SNN conversion pipelines, and non-standard optimization strategies. We analyze each class in terms of its computational principles, learning signals, and locality properties. To support reproducible research, we release NeuroTrain, an open-source snnTorch-based framework that implements a representative set of these algorithms within a unified, modular, and extendable framework, enabling consistent benchmarking across datasets, architectures, and training regimes. By consolidating fragmented literature and providing a reusable benchmarking framework, this survey identifies common patterns, highlights open challenges, and outlines promising directions for future work on scalable, efficient SNN training.

2605.15044 2026-05-15 cs.SD cs.AI cs.LG cs.MM eess.AS Version update

SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

KiHyun Nam, Jungwoo Heo, Siu Bae, Ha-Jin Yu, Joon Son Chung

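Affiliations * Korea Advanced Institute of Science and Technology (KAIST); University of Seoul

AI summary As physical AI, conversational robots, and screenless wearables develop, audio large language models need speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This paper proposes SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning. At its core is a hierarchical speaker tokenizer that captures speaker identity and recording conditions at multiple granularities, with structured reasoning traces that improve the accuracy and interpretability of verification reasoning.

Details
English abstract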

As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.

2605.15042 2026-05-15 cs.CV cs.AI Version update

EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Wuyang Li, Yang Gao, Mariam Hassan, Lan Feng, Wentao Pan, Po-Chien Luan, Alexandre Alahi

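Affiliations * EPFL

AI summary EverAnimate is an efficient post-training method for generating high-quality long-horizon animated video while preserving visual quality and character identity. It introduces two mechanisms, Persistent Latent Propagation and Restorative Flow Matching, to counter the detail degradation and semantic inconsistency that chunk-based generation causes in long videos. Experiments show that, with only lightweight LoRA tuning, EverAnimate outperforms existing methods on both short- and long-horizon animation, with clear gains in image fidelity and visual quality.

Comments Project Page: https://everanimate.github.io/homepage/

Details
English abstract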

We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.

2605.15041 2026-05-15 cs.AI cs.CL Version update

Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao, Xiaosong Zhang

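Affiliations * University of Electronic Science and Technology of China

AI summary This paper studies how a case-driven approach can improve the reasoning and execution of large language models during tool use. It proposes CAST, a framework that treats historical execution trajectories as structured cases and extracts complexity and failure profiles from them to guide the model's reasoning strategy and help it avoid structural errors. Experiments show that CAST raises tool-use success while keeping execution structurally valid, and that it cuts unnecessary reasoning steps.

Details
English abstract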

Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.

2605.15034 2026-05-15 cs.CL cs.AI cs.CY cs.MA Version update

AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models

Vinicius Covas, Jorge Alberto Hidalgo Toledo

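Affiliations * Center for Applied Communication Research (CICA); Human & NonHuman Communication Laboratory; Faculty of Communication; Universidad Anáhuac México

AI summary This study examines whether large language models (LLMs) adapt their language systematically when they perceive a context of social observation, a question that matters for AI governance and auditing. Drawing on sociological theory, it uses a controlled experiment to analyze how a multi-agent debate system behaves under different observation framings, and finds that the models adjust their linguistic style differently for human and AI observers, which indicates sensitivity to observer identity. The results offer a new perspective on LLMs as context-sensitive communicative actors and carry implications for algorithmic auditing and AI governance.

Comments 20 pages, 6 figures

Details
English abstract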

Large language models (LLMs) have been extensively studied from computational and cognitive perspectives, yet their behavior as communicative actors in socially structured contexts remains underexplored. This study examines whether LLM-based multi-agent systems exhibit systematic linguistic adaptation in response to perceived social observation contexts, a question with direct implications for AI governance and auditing. Drawing on Habermas's (1981) Theory of Communicative Action, Goffman's (1959) dramaturgical model, Bell's (1984) Audience Design framework, and the Hawthorne Effect, we report a controlled experiment involving 100 multi-agent debate sessions across five conditions (n = 20 each). Conditions varied the framing of social observation, from explicit monitoring by university researchers, to negation of monitoring, to an observer-substitution condition replacing human researchers with an automated AI auditing system. Monitored conditions (Δ+24.9%, Δ+24.2%) and the automated AI monitoring condition (Δ+22.2%) produce higher TTR change than audience-framing conditions (Δ+17.7%), F(4, 94) = 2.79, p = .031. Message length shows a fully dissociated effect, F(4, 95) = 19.55, p < .001. A fifth condition, replacing human with AI observers, yields intermediate TTR adaptation, suggesting LLM behavior is sensitive to observer identity: human evaluation elicits stronger register formalization than automated AI surveillance. We discuss implications for AI governance, algorithmic auditing, and the repositioning of LLMs as contextually sensitive communicative actors.
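For readers unfamiliar with the metric, TTR (type-token ratio) is the number of distinct word forms divided by the total number of words. The sketch below shows one common way to compute it and a relative change against a baseline; the tokenization rule and the `relative_change` helper are assumptions for illustration, since the abstract does not say how the study tokenizes text or defines its Δ values.

```python
import re


def type_token_ratio(text: str) -> float:
    """Distinct word forms divided by total words (0.0 for empty input)."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def relative_change(baseline: float, observed: float) -> float:
    """Percentage change of `observed` relative to `baseline` (hypothetical helper)."""
    return 100.0 * (observed - baseline) / baseline


if __name__ == "__main__":
    before = type_token_ratio("the cat and the dog and the bird")
    after = type_token_ratio("the cat, a dog, and one small bird")
    print(round(before, 3), round(after, 3), round(relative_change(before, after), 1))
```

TTR depends on text length, so comparisons are usually made on samples of similar size.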

2605.15030 2026-05-15 cs.CR cs.AI Version update

WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections

Tri Cao, Yulin Chen, Hieu Cao, Yibo Li, Khoi Le, Thong Nguyen, Yuexin Li, Yufei He, Yue Liu, Shuicheng Yan, Bryan Hooi

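Affiliations * National University of Singapore; University of Science; Vietnam National University, Ho Chi Minh City

AI summary This paper proposes WARD, an adversarially robust defense for web agents against prompt injection attacks embedded in HTML content or visual interfaces. WARD is trained on the large-scale WARD-Base dataset and the purpose-built attack dataset WARD-PIG, and it introduces A3T, an adaptive adversarial training framework that strengthens the model through memory-based co-evolution of attacker and guard. Experiments show that WARD reaches near-perfect recall on out-of-distribution benchmarks, keeps false positive rates low, and stays efficient and robust under distribution shift and guard-targeted attacks.

Comments Code and models: https://github.com/caothientri2001vn/WARD-WebAgent

Details
English abstract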

Web agents can autonomously complete online tasks by interacting with websites, but their exposure to open web environments makes them vulnerable to prompt injection attacks embedded in HTML content or visual interfaces. Existing guard models still suffer from limited generalization to unseen domains and attack patterns, high false positive rates on benign content, reduced deployment efficiency due to added latency at each step, and vulnerability to adversarial attacks that evolve over time or directly target the guard itself. To address these limitations, we propose WARD (Web Agent Robust Defense against Prompt Injection), a practical guard model for secure and efficient web agents. WARD is built on WARD-Base, a large-scale dataset with around 177K samples collected from 719 high-traffic URLs and platforms, and WARD-PIG, a dedicated dataset designed for prompt injection attacks targeting the guard model. We further introduce A3T, an adaptive adversarial attack training framework that iteratively strengthens WARD through a memory-based attacker and guard co-evolution process. Extensive experiments show that WARD achieves nearly perfect recall on out-of-distribution benchmarks, maintains low false positive rates to preserve agent utility, remains robust against guard-targeted and adaptive attacks under substantial distribution shifts, and runs efficiently in parallel with the agent without introducing additional latency.

2605.15026 2026-05-15 cs.OS cs.AI cs.PF Version update

SemaTune: Semantic-Aware Online OS Tuning with Large Language Models

Georgios Liargkovas, Mihir Nitin Joshi, Hubertus Franke, Kostis Kaffes

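Affiliations * Columbia University; IBM Research

AI summary SemaTune is a semantic-aware online OS tuning framework based on large language models that aims to improve the performance of long-running services. It builds a decision context from knob schemas, telemetry, configuration history, and related information, tunes through a fast and a slow feedback loop, and applies typed validation before every update, so that it can reason about the meaning of OS controls while keeping model cost bounded and the system stable. Experiments show that SemaTune clearly outperforms conventional methods on several benchmarks, improving stable-phase performance and avoiding severe performance degradation.

Comments 17 pages, 12 figures

Details
English abstract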

Online OS tuning can improve long-running services, but existing controllers are poorly matched to live hosts. They treat scheduler, power, memory, and I/O controls as black-box variables and optimize a scalar reward. This view ignores cross-knob policy structure, breaks down when application metrics are unavailable, and can send a running service into degraded regions that persist after the bad setting is removed. We present SemaTune, a host-side framework for steady-state OS tuning with bounded language-model guidance. SemaTune turns knob schemas, telemetry, current configuration, recent action-response history, and retrieved prior runs into a compact decision context. A fast loop proposes low-latency updates, a slower loop periodically revises the search strategy, and every proposed change passes through typed validation before reaching kernel or sysctl interfaces. This lets the controller reason about OS-control meaning and indirect performance signals while keeping model cost, latency, and authority constrained. We evaluate SemaTune on 13 live workloads from five benchmark suites while tuning up to 41 Linux parameters. Across the suite, SemaTune improves stable-phase performance by 72.5% over default settings and by 153.3% relative to the strongest non-LLM baseline. A 30-window session costs about $0.20 in model calls. With only host-level metrics, SemaTune still outperforms baselines given direct application objectives by 93.7 percentage points, while avoiding severe degraded regions reached by structure-blind exploration.
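The typed-validation step mentioned in this abstract can be pictured with a small sketch. Everything here is an assumption made for illustration: the knob names (`example.queue_depth`, `example.policy`), their ranges, and the schema layout are hypothetical and do not come from SemaTune or from any real kernel interface. The sketch only shows the general idea of checking a proposed change against a typed schema before anything is applied.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class IntKnob:
    low: int
    high: int


@dataclass(frozen=True)
class EnumKnob:
    choices: Tuple[str, ...]


Knob = Union[IntKnob, EnumKnob]
Value = Union[int, str]

# Hypothetical schema; names and bounds are illustrative only.
SCHEMA: Dict[str, Knob] = {
    "example.queue_depth": IntKnob(low=1, high=128),
    "example.policy": EnumKnob(choices=("balanced", "throughput", "latency")),
}


def check(knob: Knob, value: object) -> str:
    """Return an empty string if `value` fits `knob`, else a rejection reason."""
    if isinstance(knob, IntKnob):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return "expected an integer"
        if not knob.low <= value <= knob.high:
            return f"outside [{knob.low}, {knob.high}]"
        return ""
    if not isinstance(value, str) or value not in knob.choices:
        return f"expected one of {knob.choices}"
    return ""


def validate_proposal(proposal: Dict[str, object], schema: Dict[str, Knob] = SCHEMA):
    """Split a proposed update into accepted settings and rejected ones with reasons."""
    accepted: Dict[str, Value] = {}
    rejected: Dict[str, str] = {}
    for name, value in proposal.items():
        knob = schema.get(name)
        reason = "unknown knob" if knob is None else check(knob, value)
        if reason:
            rejected[name] = reason
        else:
            accepted[name] = value
    return accepted, rejected
```

Only the `accepted` part would be forwarded to the system in such a design; rejected entries can be fed back to whatever component proposed them.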

2605.15018 2026-05-15 cs.LG cs.AI Version update

Generalized Priority-Aware Shapley Value

Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang

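Affiliations * The Ohio State University; Carnegie Mellon University

AI summary This paper proposes the generalized priority-aware Shapley value (GPASV) for valuation problems in machine learning. Existing methods require priority relations to be binary and acyclic, yet real applications often involve cyclic or multi-criterion comparisons. GPASV supports arbitrary directed weighted priority graphs, in which edge weights penalize order violations rather than forbid them, so it models priority relations in real data more flexibly. The method is grounded in an axiomatic characterization and applied to LLM ensemble valuation, showing how much the choice of priority balance affects the resulting valuations.

Details
English abstract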

Shapley value and its priority-aware extensions are widely used for valuation in machine learning, but existing methods require pairwise priority to be binary and acyclic, a restriction spectacularly violated in real-data examples such as aggregated human preferences and multi-criterion comparisons. We introduce the generalized priority-aware Shapley value (GPASV), a random order value defined on arbitrary directed weighted priority graphs, in which pairwise edges penalize rather than forbid order violations. GPASV covers a range of classical models as boundary cases. We establish GPASV through an axiomatic characterization, develop the associated computational methods, and introduce a priority sweeping diagnostic extending PASV's. We apply GPASV to LLM ensemble valuation on the cyclic Chatbot Arena preference graph, illustrating that priority-aware valuation is not a one-button operation: different balances of pairwise graph priority versus individual soft priority produce substantively different valuations of the same data.
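To make the idea of a random order value with soft priorities concrete, here is a brute-force sketch. It is not the paper's definition of GPASV: the exponential weighting `exp(-penalty)` over orderings is an assumed form chosen for illustration, and enumerating all permutations only works for a handful of players. With no edges it reduces to the classical Shapley value.

```python
from itertools import permutations
from math import exp
from typing import Callable, Dict, FrozenSet, Sequence, Tuple


def penalized_order_value(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
    edges: Dict[Tuple[str, str], float],
) -> Dict[str, float]:
    """Average marginal contributions over orderings weighted by exp(-penalty).

    An edge (u, v) with weight w means "u should precede v"; an ordering that
    places v before u pays penalty w instead of being forbidden.
    """
    phi = {p: 0.0 for p in players}
    total_weight = 0.0
    for order in permutations(players):
        position = {p: k for k, p in enumerate(order)}
        penalty = sum(w for (u, v), w in edges.items() if position[v] < position[u])
        weight = exp(-penalty)  # assumed weighting, not taken from the paper
        total_weight += weight
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            grown = coalition | {p}
            phi[p] += weight * (value(grown) - value(coalition))
            coalition = grown
    return {p: phi[p] / total_weight for p in players}


if __name__ == "__main__":
    game = lambda s: float(len(s) ** 2)  # toy symmetric game
    print(penalized_order_value(["a", "b", "c"], game, {}))
    print(penalized_order_value(["a", "b", "c"], game, {("a", "b"): 2.0}))
```

Because each ordering's marginal contributions sum to the value of the full coalition, the weighted average keeps that total, and cyclic edge sets are handled without special cases since no ordering is excluded outright.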

2605.15016 2026-05-15 cs.CL cs.AI Version update

COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu

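Affiliations * School of Computing and Data Science, The University of Hong Kong; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; School of Computer Science, The University of Sydney

AI summary With large language models entering healthcare, intelligent clinical decision support has developed rapidly. Existing models, however, show weak statistical reasoning and struggle to model temporal dependencies when they process longitudinal electronic health records (EHR). This paper proposes COTCAgent, a hierarchical reasoning framework based on probabilistic chain-of-thought completion that decouples statistical computation, feature matching, and language generation. It improves the analysis of long-term health records and outperforms existing methods on several medical datasets.

Details
English abstract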

As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.

2605.15015 2026-05-15 cs.AI cs.CL cs.HC Version update

Small, Private Language Models as Teammates for Educational Assessment Design

Chris Davis Jaldi, Anmol Saini, Shan Zhang, Noah Schroeder, Cogan Shimizu, Eleni Ilkou

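Affiliations * Wright State University; University of Florida; TIB – Leibniz Information Centre for Science and Technology

AI summary This study explores small, private language models (SLMs) for educational assessment design, aiming to address the privacy and resource limitations of large language models (LLMs). It systematically compares LLMs and SLMs on generating assessment questions, evaluates generation quality with reproducible, pedagogically grounded metrics, and analyzes the agreement and bias of model-based ratings relative to expert ratings. The results show that SLMs perform competitively on key pedagogical quality dimensions and support local deployment, but model-based ratings remain systematically inconsistent and biased, which underscores the need for human-in-the-loop assessment workflows.

Details
English abstract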

Generative AI increasingly supports educational design tasks, e.g., through Large Language Models (LLMs), demonstrating the capability to design assessment questions that are aligned with pedagogical frameworks (e.g., Bloom's taxonomy). However, they often rely on subjective or limited evaluation methods; focus primarily on proprietary models; or rarely systematically examine generation, evaluation, or deployment constraints in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations; yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom's taxonomy levels using reproducible, pedagogically grounded metrics; and further assess model-based judging against expert-informed evaluation by analyzing reliability and agreement patterns. Results show that SLMs achieve competitive performance across key pedagogically motivated quality dimensions while enabling local, privacy-sensitive deployment. However, model-based evaluations also exhibit systematic inconsistencies and bias relative to expert ratings. These findings provide evidence to posit language models as bounded assistants in assessment workflows; underscore the necessity of Human-in-the-Loop; and advance the automated educational question generation field by examining quality, reliability, and deployment-aware trade-offs.

2605.15012 2026-05-15 cs.LG cs.AI cs.CL Version update

Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

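Affiliations * University of Illinois Urbana-Champaign

AI summary This paper proposes FEST, a new algorithm for reinforcement learning with verifiable rewards that targets poor sample efficiency on difficult tasks. It guides training with a small set of randomly selected demonstrations and achieves strong results with only 128 examples, which greatly reduces the reliance on large amounts of supervised data. The authors find three components essential: a supervised signal, an on-policy signal, and decaying weights on the few-shot demonstration data. Experiments show that FEST outperforms baselines on several benchmarks and matches or exceeds their performance even with far less supervised data.

Comments 25 pages, 11 figures

Details
English abstract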

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised Fine-Tuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines while using orders of magnitude less SFT data, even matching the performance those baselines reach with the full dataset.
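The three ingredients named in the abstract can be pictured with a toy sketch. This is not the FEST algorithm: the geometric decay schedule, the additive way the two losses are combined, and all function names are assumptions for illustration, and the loss values are plain numbers rather than outputs of a model.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_demonstrations(dataset: Sequence[T], k: int = 128, seed: int = 0) -> List[T]:
    """Draw k demonstrations uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(dataset), k)


def sft_weight(epoch: int, initial: float = 1.0, decay: float = 0.5) -> float:
    """Assumed geometric schedule: the supervised term fades as epochs repeat."""
    return initial * decay ** epoch


def combined_loss(rl_loss: float, sft_loss: float, epoch: int) -> float:
    """On-policy loss plus a decaying supervised loss on the few-shot set."""
    return rl_loss + sft_weight(epoch) * sft_loss


if __name__ == "__main__":
    demos = select_demonstrations(range(10_000))
    print(len(demos), [round(sft_weight(e), 3) for e in range(4)])
```

The decaying weight reflects the motivation stated in the abstract: a small demonstration set is revisited over many epochs, so its influence is reduced over time to limit overfitting.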

2605.15000 2026-05-15 cs.CL cs.AI Version update

Quantifying and Mitigating Premature Closure in Frontier LLMs

Rebecca Handler, Suhana Bedi, Nigam Shah

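Affiliations * Department of Medicine, Stanford University; Department of Biomedical Data Science, Stanford University

AI summary This study examines premature closure in frontier large language models (LLMs), that is, reaching a conclusion too early under uncertain information, and the risks it poses in medical tasks in particular. The authors evaluate five frontier models on structured and open-ended medical tasks and find that the models still frequently give definite answers when information is insufficient, with high error rates. Safety-oriented prompting partly mitigates the problem, but substantial premature closure remains, which shows that current medical LLMs still need to improve at judging when not to answer.

Comments 14 pages, 3 figures, 1 table

Details
English abstract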

Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.

2605.14995 2026-05-15 cs.AI cs.CL cs.LG cs.SI Version update

Explainable Detection of Depression Status Shifts from User Digital Traces

Loris Belcastro, Francesco Gervino, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio

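Affiliations * DIMES, University of Calabria

AI summary This paper proposes an explainable framework for detecting and analyzing changes in depression status from users' digital traces such as social media posts and chat logs. The approach combines several BERT-based models to extract signals along multiple dimensions, including sentiment, emotion, and depression severity, aggregates them over time into user trajectories, and identifies meaningful change points. A large language model then generates concise, human-readable reports to improve interpretability. Experiments on two social media datasets show higher coverage of user history, stronger temporal coherence, and better sensitivity to change points, supporting dynamic analysis of mental-health signals.

Details
English abstract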

Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.

2605.14991 2026-05-15 cs.CV cs.AI Version update

Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning

Francesco Pastori, Francesca Fati, Marina Rosanu, Luigi De Vitis, Lucia Ribero, Gabriella Schivardi, Giovanni Damiano Aletti, Nicoletta Colombo, Jvan Casarin, Francesco Multinu, Elena De Momi

Affiliations * Department of Gynecologic Oncology, European Institute of Oncology, IEO, IRCCS, Milan, Italy; Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy; Department of Obstetrics and Gynecology, Mayo Clinic, Rochester, USA; Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy; Department of Medicine and Innovative Technology, Università degli Studi dell'Insubria, Varese, Italy

AI summary The study aims to predict how ovarian cancer patients respond to neoadjuvant chemotherapy from pre-treatment contrast-enhanced CT, so that patients unlikely to benefit can be identified early. It proposes a non-invasive multi-loss deep learning framework that uses automatically derived 3D lesion masks, a partially fine-tuned image encoder, and attention-based aggregation of slice features for classification. Developed on a retrospective cohort of 280 patients, the model reaches a ROC-AUC of 0.73 and an F1-score of 0.70 on the test set, which suggests some clinical predictive ability and a basis for imaging-driven patient stratification.

Details
English abstract

Ovarian cancer is the most lethal gynecologic malignancy: around 60% of patients are diagnosed at an advanced stage, with an associated 5-year survival rate of about 30%. Early identification of non-responders to neoadjuvant chemotherapy remains a key unmet need, as it could prevent ineffective therapy and avoid delays in optimal surgical management. This work proposes a non-invasive deep learning framework to predict neoadjuvant chemotherapy response from pre-treatment contrast-enhanced CT by leveraging automatically derived 3D lesion masks. The approach encodes axial slices with a partially fine-tuned pretrained image encoder and aggregates slice-level representations into a volumetric embedding through an attention-based module. Training combines classification loss with supervised contrastive regularization and hard-negative mining to improve separation between ambiguous responders and non-responders. The method was developed on a retrospective single-center cohort from the European Institute of Oncology (Milan, IT), including 280 eligible patients (147 responders, 133 non-responders). On the test cohort, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82). Overall, these results suggest that the proposed architecture learns clinically relevant predictive patterns and provides a robust foundation for an imaging-based stratification tool.

2605.14984 2026-05-15 cs.CV cs.AI Version update

Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

Ming Qian, Zimin Xia, Changkun Liu, Shuailei Ma, Wen Wang, Zeran Ke, Bin Tan, Hang Zhang, Gui-Song Xia

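Affiliations * LIESMARS & School of Artificial Intelligence, Wuhan University; EPFL; HKUST; Northeastern University; Zhejiang University; Ant Group; Amap, Alibaba Group

AI summary This paper studies the challenging problem of generating a street-level 3D scene from a single satellite image. Existing methods trade geometric accuracy against semantic diversity, whereas the proposed Sat3DGen takes a geometry-first approach that combines new geometric constraints with a perspective-view training strategy and markedly improves the geometric accuracy and visual realism of the generated scenes. Experiments show that it beats the best existing method on both geometric error and image quality and is useful across several downstream tasks.

Comments ICLR 2026; code: https://github.com/qianmingduowan/Sat3DGen demo: https://huggingface.co/spaces/qian43/Sat3DGen project page: https://qianmingduowan.github.io/Sat3DGen_project_page/

Details
English abstract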

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. To address these fundamental challenges, we introduce Sat3DGen, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from ~40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released at https://github.com/qianmingduowan/Sat3DGen.

2605.14983 2026-05-15 cs.GT cs.AI cs.CY cs.MA Version update

Agreement, Diversity, and Polarization Indices for Approval Elections

Piotr Faliszewski, Jitka Mertlová, Krzysztof Sornat, Stanisław Szufa, Tomasz Wąs

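Affiliations * AGH University of Kraków; Czech Technical University in Prague; University of Geneva; University of Oxford

AI summary This paper studies how to use indices to quantify agreement, diversity, and polarization among voters in approval elections. It proposes a family of normalized indices that measure these features and analyzes their properties. The indices are also used to draw a new map of approval elections and to compare similarities and differences between elections from several real-world datasets.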

Details
English abstract

An index is a function that, given an election, outputs a value between 0 and 1 indicating the extent to which this election has a particular feature. We seek indices that capture agreement, diversity, and polarization among voters in approval elections, and that are normalized with respect to saturation. By the latter we mean that if two elections differ in the fraction of candidates approved by an average voter, but otherwise are of similar nature, then they should have similar index values. We propose several indices, analyze their properties, and use them to (a) derive a new map of approval elections, and (b) show similarities and differences between various real-life elections from Pabulib, Preflib, and other sources.
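To make the notion of saturation concrete, the following minimal Python sketch computes the fraction of candidates approved by an average voter for two toy approval elections. The function name `saturation` and the set-based ballot encoding are assumptions made for illustration; they are not taken from the paper, and the paper's own indices are not reproduced here.

```python
from typing import Sequence, Set


def saturation(ballots: Sequence[Set[str]], candidates: Set[str]) -> float:
    """Average fraction of candidates approved per voter (hypothetical helper)."""
    if not ballots or not candidates:
        raise ValueError("need at least one voter and one candidate")
    # Count only approvals that refer to known candidates.
    total_approvals = sum(len(ballot & candidates) for ballot in ballots)
    return total_approvals / (len(ballots) * len(candidates))


candidates = {"a", "b", "c", "d"}
# Each voter approves a single candidate.
sparse = [{"a"}, {"b"}, {"a"}, {"c"}]
# Each voter approves three of the four candidates.
dense = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "b", "c"}, {"a", "c", "d"}]

print(saturation(sparse, candidates))
print(saturation(dense, candidates))
```

The two toy elections differ mainly in how many candidates each voter approves, which is the kind of difference that a saturation-normalized index is meant to factor out.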

2605.14982 2026-05-15 cs.LG cs.AI Version update

Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition

Sanjeev Manivannan, Shuban V

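Affiliations * Department of Biotechnology; Indian Institute of Technology Madras

AI summary This paper studies reinforcement learning in the discounted reward setting, aiming to improve the convergence efficiency of policy updates in policy gradient methods. By introducing a policy Hessian decomposition, the authors propose an actor-critic method based on second-order optimization that exploits the curvature of the objective, improving stability while remaining computationally efficient. Within a two-timescale framework, the critic is treated as quasi-stationary, which justifies approximating the action-value function as locally constant with respect to the policy parameters and provides theoretical support for the second-order update.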

Comments 9 pages, 2 figures, including appendix with detailed proofs

Details
English abstract

We address the discounted reward setting in reinforcement learning (RL). To mitigate the value approximation challenges in policy gradient methods, actor-critic approaches have been developed and are known to converge to stationary points under suitable assumptions. However, these methods rely on first-order updates. In contrast, second-order optimization provides principled curvature-aware updates that are proven to accelerate convergence, but its application in RL is limited by the computational complexity of Hessian estimation. In this work, we analyze second-order approximations for the actor update that leverage the full curvature information of the objective as much as possible. A stable approximation requires treating the action-value function as locally constant with respect to policy parameters, which does not generally hold in policy gradient methods. We show that this approximation becomes well-justified under a two-timescale actor-critic framework, where the critic evolves on a faster timescale and can be treated as quasi-stationary during actor updates. Building on this insight, we formulate a second-order actor-critic method for the discounted reward setting that leverages Hessian-vector product (HVP) computations, resulting in a computationally efficient and stable second-order update.
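As a rough illustration of why Hessian-vector products are cheap, the sketch below approximates H·v from two gradient evaluations without ever forming the Hessian. It uses a central finite difference of the gradient as a stand-in; the abstract does not say how the paper computes HVPs, and in practice automatic differentiation is the usual choice. The names `hvp_finite_difference` and `toy_grad` and the toy objective are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def hvp_finite_difference(
    grad: Callable[[Vector], Vector], theta: Vector, v: Vector, eps: float = 1e-4
) -> Vector:
    """Approximate H(theta) @ v by a central difference of the gradient."""
    plus = grad([t + eps * d for t, d in zip(theta, v)])
    minus = grad([t - eps * d for t, d in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def toy_grad(theta: Vector) -> Vector:
    # Gradient of the toy objective f(x, y) = x**2 + 3*x*y + 2*y**2.
    x, y = theta
    return [2.0 * x + 3.0 * y, 3.0 * x + 4.0 * y]


# The Hessian of the toy objective is [[2, 3], [3, 4]], so H @ [1, 0] = [2, 3].
result = hvp_finite_difference(toy_grad, theta=[0.5, -1.0], v=[1.0, 0.0])
print(result)
```

The cost is two gradient calls per product regardless of the number of parameters, which is the property that makes curvature-aware updates practical at scale.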

2605.14980 2026-05-15 cs.CV cs.AI Version update

MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions

Xiaofei Hui, Haoxuan Qu, Hossein Rahmani, Shuohong Wang, Jeff W. Lichtman, Jun Liu

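Affiliations * School of Computing and Communications; Lancaster University; Department of Cell Biology; Harvard Medical School; Department of Molecular and Cellular Biology; Harvard University

AI summary This paper proposes MicroscopyMatching, a general-purpose framework for microscopy image analysis that targets the automation of analysis tasks (such as segmentation, tracking, and counting) under varying experimental conditions. The framework unifies diverse analysis tasks as a matching problem and exploits the strong matching capability of pre-trained latent diffusion models, yielding reliable analysis across many biological samples and imaging conditions without additional adaptation. The work offers biomedical research a practical and broadly applicable solution that substantially reduces reliance on manual analysis.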

Details
English abstract

Analyzing microscopy images to extract biological object properties (e.g., their morphological organization, temporal dynamics, and population density) is fundamental to many areas of biomedical research. Yet conducting this analysis manually is costly and time-consuming. Though deep learning-based approaches have been explored to automate this process, the substantial diversity of microscopy analysis settings in practice (including variations in biological object types, sample processing protocols, imaging equipment, and analysis tasks) often renders them ineffective. As a result, these approaches typically require extensive adaptation for different settings, which can impose burdens that are practically unsustainable for laboratories. Biomedical researchers therefore still commonly rely on manual analysis, which severely bottlenecks the pace of biomedical research. This situation has created a pressing and long-standing need for a reliable and broadly applicable microscopy image analysis tool, yet such a tool is still missing. To address this gap, we present the first ready-to-use microscopy image analysis framework, MicroscopyMatching, that can reliably perform key analysis tasks (including segmentation, tracking, and counting) across diverse microscopy analysis settings. Taking a fundamentally different perspective, MicroscopyMatching reformulates diverse microscopy image analysis tasks as a unified matching problem and handles this problem by exploiting the robust matching capability of pre-trained latent diffusion models.

2605.13338 2026-05-15 cs.CR cs.AI Version update

Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models

Shuqiang Wang, Wei Cao, Jiaqi Weng, Jialing Tao, Licheng Pan, Hui Xue, Zhixuan Chu

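Affiliations * The State Key Laboratory of Blockchain and Data Security, Zhejiang University; Alibaba Group

AI summary This paper studies a vulnerability of large reasoning models (LRMs): they tend to "overthink" when given incomplete or logically inconsistent inputs, producing long, energy-hungry reasoning that could be exploited for denial-of-service (DoS) attacks. The authors propose a black-box attack framework based on a hierarchical genetic algorithm that systematically perturbs the logical structure of input problems to induce longer reasoning. Experiments show that the method substantially amplifies output length on several state-of-the-art reasoning models and transfers well, highlighting overthinking as a shared security risk in modern reasoning systems.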

Comments Accepted at ICML 2026. Code available at: https://github.com/EndlessCao/Overthink-HGA

Journal ref Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), PMLR 306, 2026

Details
English abstract

Large Reasoning Models (LRMs) are increasingly integrated into systems requiring reliable multi-step inference, yet this growing dependence exposes new vulnerabilities related to computational availability. In particular, LRMs exhibit a tendency to "overthink", producing excessively long and redundant reasoning traces, when confronted with incomplete or logically inconsistent inputs. This behavior significantly increases inference latency and energy consumption, forming a potential vector for denial-of-service (DoS) style resource exhaustion. In this work, we investigate this attack surface and propose an automated black-box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems. Our method employs a hierarchical genetic algorithm (HGA) operating on structured problem decompositions, and optimizes a composite fitness function designed to maximize both response length and reflective overthinking markers. Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise baselines. We further demonstrate strong transferability, showing that adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs. These findings highlight overthinking as a shared and exploitable vulnerability in modern reasoning systems, underscoring the need for more robust defenses.

2605.12484 2026-05-15 cs.LG cs.AI Version update

Learning, Fast and Slow: Towards LLMs That Adapt Continually

Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia, Kurt Keutzer, Inderjit S Dhillon, Rishabh Agarwal, Devvrit Khatri

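Affiliations * UC Berkeley; Mila; UT Austin; Eragon Periodic Labs Mirendil Video

AI summary Large language models (LLMs) are usually adapted to downstream tasks by updating their parameters (e.g., via reinforcement learning), which can cause catastrophic forgetting and reduced generalization. In-context learning with fixed parameters adapts quickly to task requirements but offers limited performance gains. This paper proposes a fast-slow learning framework that treats model parameters as "slow weights" and optimized context as "fast weights", enabling efficient learning while keeping the model stable overall. Experiments show that the method beats conventional approaches in both sample efficiency and performance ceiling, and that it adapts better and forgets less in continual learning scenarios.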

Comments 29 pages, 14 figures, including appendix; Blog post: https://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/

Details
English abstract

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but typically cannot by itself match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and retain general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL training. This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.

2605.02004 2026-05-15 cs.AI Version update

Personalized Digital Health Modeling with Adaptive Support Users

Zhongqi Yang, Mahkameh Rasouli, Neda Mohseni, Yong Huang, Iman Azimi, Amir M. Rahmani

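Affiliations * Department of Computer Science, University of California, Irvine; Thrive AI Health; Sue & Bill Gross School of Nursing, University of California, Irvine

AI summary In digital health, physiological and behavioral differences between individuals are substantial, so personalized modeling is essential. Because user data is scarce and noisy, existing methods mostly rely on population pretraining or data from similar users, leading to biased transfer and weak generalization. This paper proposes a unified personalization framework that models with adaptively weighted data from both similar and dissimilar users, combining personal loss, similarity-based transfer, and contrastive regularization to improve robustness. Experiments show that the method clearly outperforms conventional methods on several real-world datasets, especially when data is limited, and that it improves data efficiency and interpretability.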

Details
English abstract

Personalized models are essential in digital health because individuals exhibit substantial physiological and behavioral heterogeneity. Yet personalization is limited by scarce and noisy user-specific data. Most existing methods rely on population pretraining or data from similar users only, which can lead to biased transfer and weak generalization. We propose a unified personalization framework that trains a personal model using adaptively weighted support users, including both similar and dissimilar individuals. The objective integrates personal loss, similarity-weighted transfer from similar users, and contrastive regularization from dissimilar users to suppress misleading correlations. An iterative optimization algorithm jointly updates model parameters and user similarity weights. Experiments on six tasks across four real-world digital health datasets show consistent improvements over population and personalized baselines. The method achieves up to 10% lower RMSE on large-scale datasets and approximately 25% lower RMSE in low-data settings. The learned adaptive weights improve data efficiency and provide interpretable guidance for targeted data selection.

2604.25855 2026-05-15 cs.CV cs.AI Version update

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

Hector G. Rodriguez, Marcus Rohrbach

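Affiliations * TU Darmstadt

AI summary This paper proposes SIEVES, a new selective prediction method that aims to improve the reliability and coverage of visual question answering (VQA) systems in real-world and out-of-distribution (OOD) scenarios. The reasoner model produces localized visual evidence while answering, and a selector explicitly assesses answer quality from that evidence, giving more accurate confidence estimates without relying on model-internal signals such as logits or hidden states. Experiments show that SIEVES substantially improves coverage on several challenging OOD benchmarks and works with multiple frontier closed-source models without access to their weights or logits.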

Details
English abstract

Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering (VQA) benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world, out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e., the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. Existing selective prediction methods estimate implicit confidence scores, relying on model-internal signals like logits or hidden representations, which are not available for frontier closed-source models. To enable reliable generalization in VQA, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner using only model inputs and outputs. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all tested OOD benchmarks and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation. Code is publicly available at https://github.com/hector-gr/SIEVES .
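The coverage-versus-risk trade-off described above can be sketched in a few lines of Python. This is a generic illustration of confidence thresholding, not the SIEVES selector: the function name `coverage_at_risk`, the toy confidence scores, and the assumption that every example has a distinct confidence are all hypothetical.

```python
from typing import List, Tuple


def coverage_at_risk(scored: List[Tuple[float, bool]], max_risk: float) -> float:
    """Largest fraction of inputs answerable with error rate <= max_risk.

    Each item is (confidence, answer_was_correct). Answering the k most
    confident inputs corresponds to one choice of confidence threshold.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = 0
    errors = 0
    for answered, (_, correct) in enumerate(ranked, start=1):
        errors += 0 if correct else 1
        if errors / answered <= max_risk:
            best = answered
    return best / len(ranked) if ranked else 0.0


examples = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.40, False)]
print(coverage_at_risk(examples, max_risk=0.0))
print(coverage_at_risk(examples, max_risk=0.25))
```

A better-calibrated confidence score pushes wrong answers toward the bottom of the ranking, which raises the coverage reachable at the same risk level even when accuracy is unchanged.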

2604.09860 2026-05-15 cs.RO cs.AI Version update

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Xuning Yang, Rishit Dagli, Alex Zook, Hugo Hadfield, Ankit Goyal, Stan Birchfield, Fabio Ramos, Jonathan Tremblay

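Affiliations * NVIDIA; University of Toronto; The University of Sydney

AI summary To address rapid performance saturation and the lack of true generalization testing in simulation benchmarks for general-purpose robotics, this work proposes RoboLab, a high-fidelity simulation benchmarking framework. The framework generates scenes and tasks in a robot- and policy-agnostic manner, supports in-depth analysis of how real-world policies behave in simulation, and introduces RoboLab-120, a benchmark of 120 tasks covering three competency axes: visual, procedural, and relational. The study also systematically evaluates the shortcomings of current state-of-the-art models in performance and behavioral robustness, providing granular metrics and scalable tools for evaluating the true generalization of task-generalist robot policies.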

Journal ref Robotics: Science and Systems XXII, Sydney, Australia, 2026

Details
English abstract

The pursuit of general-purpose robotics has yielded impressive foundation models, yet simulation-based benchmarking remains a bottleneck due to rapid performance saturation and a lack of true generalization testing. Existing benchmarks often exhibit significant domain overlap between training and evaluation, trivializing success rates and obscuring insights into robustness. We introduce RoboLab, a simulation benchmarking framework designed to address these challenges. Concretely, our framework is designed to answer two questions: (1) to what extent can we understand the performance of a real-world policy by analyzing its behavior in simulation, and (2) which factors most strongly affect policy behavior. First, RoboLab enables human-authored and LLM-enabled generation of scenes and tasks in a robot- and policy-agnostic manner within a high-fidelity simulation environment. We introduce an accompanying RoboLab-120 benchmark, consisting of 120 tasks categorized into three competency axes (visual, procedural, and relational) across three difficulty levels. Second, we introduce a systematic analysis of real-world policies that quantifies both their performance and the sensitivity of their behavior to controlled perturbations, exposing a significant performance gap in current state-of-the-art models. By providing granular metrics and a scalable toolset, RoboLab offers a framework for evaluating the true generalization capabilities of task-generalist robotic policies. Project website: https://research.nvidia.com/labs/srl/projects/robolab/.

2603.16039 2026-05-15 cs.LG cs.AI cs.CL Version update

Residual Stream Duality in Modern Transformer Architectures

Yifan Zhang

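AI summary This paper examines the dual nature of the residual stream in modern Transformer architectures, arguing that the residual pathway is not just an optimization device but an important part of the model's representational machinery. The author proposes viewing the Transformer design space along two axes, sequence position and layer depth, and shows that attention over the residual stream along the depth axis is dual to short sliding-window attention along the sequence axis. From this perspective, the paper analyzes the trade-offs of different designs and recommends Deep Delta Learning (DDL) when the shortcut itself is the focus, and sequence-axis short sliding-window attention (ShortSWA) when local adaptive mixing is needed.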

Comments Project Page: https://github.com/yifanzhang-pro/residual-stream-duality

Details
English abstract

Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^2$. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
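The claim that the same local operator can be written over either axis can be illustrated with a toy Python sketch. It is not the paper's implementation: the function `causal_window_attention` is hypothetical, it uses the raw states as queries, keys, and values with no learned projections or scaling, and the small `hidden` grid is made-up data.

```python
import math
from typing import List

Vector = List[float]


def causal_window_attention(states: List[Vector], window: int) -> List[Vector]:
    """Causal softmax attention where position i reads positions i-window+1..i."""
    outputs = []
    for i, query in enumerate(states):
        start = max(0, i - window + 1)
        context = states[start : i + 1]
        scores = [sum(q * k for q, k in zip(query, key)) for key in context]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        mixed = [
            sum(w * value[d] for w, value in zip(weights, context)) / norm
            for d in range(len(query))
        ]
        outputs.append(mixed)
    return outputs


# hidden[layer][token] is a toy grid of 2-dimensional states.
hidden = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]],
    [[0.0, 1.0], [2.0, 1.0], [1.0, 0.0]],
]

# Sequence axis: fix a layer and slide the window over token positions.
over_sequence = causal_window_attention(hidden[1], window=2)

# Depth axis: fix a token and slide the same window over layer indices.
token = 2
over_depth = causal_window_attention([layer[token] for layer in hidden], window=2)

print(over_sequence)
print(over_depth)
```

The function body never refers to tokens or layers; only the way the grid is sliced decides which axis is the ordered one. This matches the operator-level reading of the duality and says nothing about the systems-level costs the abstract goes on to discuss.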

2602.24273 2026-05-15 cs.AI Version update

A Minimal Agent for Automated Theorem Proving

Borja Requena, Austin Letson, Krystian Nowakowski, Izan Beltran-Ferreiro, Leopoldo Sarra

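AI summary This paper proposes a minimal agentic baseline for automated theorem proving, intended to provide a systematic basis for comparing different AI-based theorem prover architectures. The design implements the core features shared by current state-of-the-art systems: iterative proof refinement, library search, and context management. Experiments show performance comparable to existing state-of-the-art approaches with a much simpler architecture and lower cost, and show the advantages of an iterative approach over single-shot generation in sample efficiency and cost effectiveness. The code is open source for use as a reference in future research and by the community.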

Comments Accepted for publication at ICML 2026

Details
English abstract

We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate this agentic approach using qualitatively different benchmarks and compare various frontier language models and design choices. Our results show competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture and a fraction of their cost. Additionally, we demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.

2602.14674 2026-05-15 cs.AI Version update

From User Preferences to Base Score Extraction Functions in Gradual Argumentation (with Appendix)

Aniol Civit, Antonio Rago, Antonio Andriella, Guillem Alenyà, Francesca Toni

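Affiliations * King's College London; Imperial College London

AI summary This paper studies how to extract base scores from users' preferences over arguments to support decision-making in gradual argumentation systems. The authors propose base score extraction functions that map user preferences to arguments' base scores and apply them to bipolar argumentation frameworks, yielding quantitative bipolar argumentation frameworks that can be analyzed with existing computational tools. The method accounts for non-linearities in human preferences and is validated through theoretical analysis and robotics experiments, offering guidance for choosing gradual semantics in practice.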

Comments Accepted to AAMAS 2026 - With Appendix

Details
English abstract

Gradual argumentation is a field of symbolic AI which is attracting attention for its ability to support transparent and contestable AI systems. It is considered a useful tool in domains such as decision-making, recommendation, and debate analysis. The outcomes in such domains usually depend on the arguments' base scores, which must be selected carefully. Often, this selection process requires user expertise and may not always be straightforward. On the other hand, organising the arguments by preference could simplify the task. In this work, we introduce Base Score Extraction Functions, which provide a mapping from users' preferences over arguments to base scores. These functions can be applied to the arguments of a Bipolar Argumentation Framework (BAF), supplemented with preferences, to obtain a Quantitative Bipolar Argumentation Framework (QBAF), allowing the use of well-established computational tools in gradual argumentation. We outline the desirable properties of base score extraction functions, discuss some design choices, and provide an algorithm for base score extraction. Our method incorporates an approximation of non-linearities in human preferences so that the extracted scores approximate real preferences more closely. Finally, we evaluate our approach both theoretically and experimentally in a robotics setting, and offer recommendations for selecting appropriate gradual semantics in practice.

2602.11626 2026-05-15 cs.LG cs.AI physics.chem-ph physics.comp-ph physics.flu-dyn Version update

ArGEnT: Arbitrary Geometry-encoded Transformer for Operator Learning

Wenqian Chen, Yucheng Fu, Michael Penwarden, Pratanu Roy, Panos Stinis

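Affiliations * Pacific Northwest National Laboratory; Sandia National Laboratories; Lawrence Livermore National Laboratory

AI summary A central challenge in scientific machine learning is learning solution operators for systems with complex, varying geometries and parametric physical settings. This paper proposes ArGEnT, an Arbitrary Geometry-encoded Transformer that uses attention to encode geometric information directly from point-cloud representations, with three variants (self-attention, cross-attention, and hybrid-attention) that incorporate geometric features in different ways. Integrating ArGEnT into DeepONet as the trunk network yields a surrogate modeling framework that does not require explicitly parametrized geometry inputs. Experiments on benchmark problems in fluid dynamics, solid mechanics, and electrochemical systems show significantly better prediction accuracy and generalization than the standard DeepONet and other geometry-aware surrogates.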

Comments 69 pages, 21 figures, 10 tables

Details
English abstract

Learning solution operators for systems with complex, varying geometries and parametric physical settings is a central challenge in scientific machine learning. In many-query regimes such as design optimization, control, and inverse problems, surrogate modeling must generalize across geometries while allowing flexible evaluation at arbitrary spatial locations. In this work, we propose Arbitrary Geometry-encoded Transformer (ArGEnT), a geometry-aware attention-based architecture for operator learning on arbitrary domains. ArGEnT employs Transformer attention mechanisms to encode geometric information directly from point-cloud representations, with three variants (self-attention, cross-attention, and hybrid-attention) that use different strategies for incorporating geometric features. By integrating ArGEnT into DeepONet as the trunk network, we develop a surrogate modeling framework capable of learning operator mappings that depend on both geometric and non-geometric inputs without the need to explicitly parametrize geometry as a branch network input. Evaluating on benchmark problems spanning fluid dynamics, solid mechanics, and electrochemical systems, we demonstrate significantly improved prediction accuracy and generalization performance compared with the standard DeepONet and other existing geometry-aware surrogates. In particular, the cross-attention transformer variant enables accurate geometry-conditioned predictions with reduced reliance on signed distance functions. By combining flexible geometry encoding with operator-learning capabilities, ArGEnT provides a scalable surrogate modeling framework for optimization, uncertainty quantification, and data-driven modeling of complex physical systems.

2602.02711 2026-05-15 cs.AI Version update

Dynamic Mixed-Precision Routing for Efficient Multi-step LLM Interaction

Yuanzhe Li, Jianing Deng, Jingtong Hu, Tianlong Chen, Song Wang, Huanrui Yang

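Affiliations * University of Arizona; University of Pittsburgh; University of North Carolina at Chapel Hill; University of Central Florida

AI summary To address the high inference cost of large language models (LLMs) in long-horizon decision-making tasks, this work proposes Dynamic Mixed-Precision Routing (DMR), a framework that adaptively selects a high-precision or low-precision model at each decision step to reduce compute cost while preserving task success rates. Motivated by the observation that steps differ in their sensitivity to precision, the router is trained in two stages, combining KL-divergence-based supervised learning with Group-Relative Policy Optimization, to better balance performance and efficiency. Experiments show that DMR achieves a better combined accuracy and cost profile than single-precision baselines on tasks such as ALFWorld and WebShop.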

Details
English abstract

Large language models (LLMs) achieve strong performance in long-horizon decision-making tasks through multi-step interaction and reasoning at test time. While practitioners commonly believe that a higher task success rate requires a larger and stronger LLM, multi-step interaction with a large LLM incurs prohibitive inference cost. To address this problem, we explore the use of low-precision quantized LLMs in the long-horizon decision-making process. Based on the observation that interaction steps differ in their sensitivity to precision, we propose Dynamic Mixed-Precision Routing (DMR), a framework that adaptively selects between high-precision and low-precision LLMs at each decision step. The router is trained via a two-stage pipeline, consisting of KL-divergence-based supervised learning that identifies precision-sensitive steps, followed by Group-Relative Policy Optimization (GRPO) to further improve task success rates. Experiments on ALFWorld and WebShop demonstrate that our approach achieves a strong accuracy-cost trade-off over single-precision baselines.

2512.06471 2026-05-15 cs.LG cs.AI Version update

Why Goal-Conditioned Reinforcement Learning Works: Relation to Dual Control

Nathan P. Lawrence, Ali Mesbah

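Affiliations * Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720 USA

AI summary This paper analyzes why goal-conditioned reinforcement learning (RL) succeeds and connects it to optimal control theory. It derives an optimality gap between classical quadratic objectives and goal-conditioned rewards, explaining why goal-conditioned rewards can outperform dense rewards in some cases. It also links goal-conditioned rewards to state estimation in partially observed Markov decision processes, showing their suitability for dual control problems, and validates the advantages of goal-conditioned policies in nonlinear and uncertain environments using RL and predictive control methods.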

Comments IFAC world congress postprint

Details
English abstract

Goal-conditioned reinforcement learning (RL) concerns the problem of training an agent to maximize the probability of reaching target goal states. This paper presents an analysis of the goal-conditioned setting based on optimal control. In particular, we derive an optimality gap between more classical, often quadratic, objectives and the goal-conditioned reward, elucidating the success of goal-conditioned RL and why classical "dense" rewards can falter. We then consider the partially observed Markov decision setting and connect state estimation to our probabilistic reward, making the goal-conditioned reward well suited to dual control problems. The advantages of goal-conditioned policies are validated on nonlinear and uncertain environments using both RL and predictive control techniques.

2510.05213 2026-05-15 cs.RO cs.AI cs.LG Version update

VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka

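Affiliations * UC Berkeley; Carnegie Mellon University; University of Hong Kong; Peking University; Stony Brook University; UNC-Chapel Hill

AI summary VER is a vision expert Transformer for robot learning that addresses the problem that pretrained vision foundation models excel in specific domains but generalize poorly across tasks. The method distills multiple vision foundation models into an expert library and uses a lightweight dynamic routing network to select task-relevant experts from the pretrained library, enabling efficient and flexible feature extraction. VER also introduces patchwise expert routing and curriculum Top-K annealing to improve the precision and adaptability of dynamic selection, achieving state-of-the-art performance on a range of robotic tasks.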

Details
English abstract

Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and code can be found at https://yixiaowang7.github.io/ver_page/.

2508.11845 2026-05-15 cs.SD cs.AI cs.IR cs.LG Version update

AVEX: What Matters for Animal Vocalization Encoding

Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist

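Affiliations * Earth Species Project

AI summary This paper studies the key factors that affect model performance in animal vocalization encoding, aiming to develop a general-purpose bioacoustic encoder for diverse downstream tasks. Through large-scale experiments, the authors analyze how training data diversity, model architecture, and training strategy affect encoder performance, and propose a hybrid recipe that combines self-supervised pre-training with supervised post-training, which markedly improves performance across tasks and datasets. The study also finds that data diversity matters in both training stages, and model checkpoints are to be released to support further research and application.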

Comments In The Fourteenth International Conference on Learning Representations 2026

Details
English abstract

Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.

2507.15774 2026-05-15 cs.LG cs.AI Version update

Time Series Forecasting Through the Lens of Dynamics

Alexis-Raja Brachet, Pierre-Yves Richard, Céline Hudelot

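Affiliations * CentraleSupélec, IETR UMR CNRS 6164, France

AI summary This paper studies the performance gap between deep learning models and shallow linear models in time series forecasting, and proposes that models should learn a direct link from past to future data points, a capability the authors call learning dynamics. The authors introduce the $\texttt{PRO-DYN}$ nomenclature to analyze the dynamics of existing models, finding that under-performing models learn dynamics only partially and that the location of the dynamics block is critical to model effectiveness. Based on systemic and empirical studies, they propose a simple, easy-to-use methodology for designing and improving models.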

Comments Accepted at ICML 2026

详情
英文摘要

While deep learning is undergoing a homogenization across modalities led by Transformers, these models are still challenged by shallow linear models in time series forecasting. Our hypothesis is that models should learn a direct link from past to future data points, which we identify as a dynamics-learning capability. We develop an original $\texttt{PRO-DYN}$ nomenclature to analyze existing models through the lens of dynamics. Two observations emerge: $\textbf{1.}$ under-performing architectures learn dynamics at most partially; $\textbf{2.}$ the location of the dynamics block at the model end is of prime importance. Our systemic and empirical studies both confirm these observations on a set of performance-varying models with diverse backbones. We propose a simple plug-and-play methodology to guide model design and improvement.

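2605.14966 2026-05-15 cs.CV cs.AI Updated version

MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs

Wei Ding, Yilin Li, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

Affiliations * Tsinghua University; Tsinghua University, Tencent; Tencent; University of Science and Technology Beijing; University of Macau

AI summary This paper proposes MHSA, a lightweight framework that mitigates hallucinations in large vision-language models (LVLMs) by steering attention. MHSA learns to correct cross-modal attention patterns: a simple three-layer MLP generator is trained with supervisory signals from the LVLM itself and the DHCP discriminator to produce corrected attention weights. At inference time the method leaves the LVLM parameters unchanged and only replaces the original cross-modal attention, which reduces both generative and discriminative hallucinations and offers a new perspective on hallucination research in LVLMs.

Comments 19 pages, 17 figures

Abstract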

Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.

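2605.14940 2026-05-15 cs.LG cs.AI eess.SP Updated version

Not All Symbols Are Equal: Importance-Aware Constellation Design for Semantic Communication

Albert Shaju, Christo Kurisummoottil Thomas, Mayukh Roy Chowdhury

Affiliations * Department of Electrical and Computer Engineering, Worcester Polytechnic Institute; Nokia Bell Labs

AI summary This paper studies symbol constellation design for semantic communication and proposes a joint semantic-physical layer framework that is aware of semantic importance: it extracts discrete semantic concepts, scores their semantic criticality, and uses deep reinforcement learning to dynamically select the symbols to transmit, yielding a semantic-aware constellation mapping at the physical layer. The method introduces a semantic symbol vulnerability metric and a semantic protection probability, proves that conventional Gray-coded constellations are limited when semantic importance is non-uniform, and is shown on several datasets to perform better at high spectral efficiency.

Comments Submitted to IEEE GLOBECOM 2026. 6 pages, 8 figures

Abstract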

Semantic communication systems for goal-oriented transmission must protect task-relevant information not only through source compression but also via physical layer mapping. Existing approaches decouple constellation design and semantic encoding, exposing critical symbols to channel errors at the same rate as irrelevant ones. In contrast, this paper proposes a joint semantic-physical layer framework composed of a vector quantized-variational autoencoder that extracts discrete latent concepts, a semantic criticality indicator (SCI) that scores each concept by task relevance, and a deep reinforcement learning agent that dynamically selects the transmission subset based on instantaneous channel conditions. At the physical layer, a learned semantic-aware M-QAM constellation assigns symbol positions according to joint co-occurrence statistics and SCI scores, departing from the uniform spacing and Gray coding of standard M-QAM, which minimizes average BER without regard for semantic content. We introduce a novel semantic symbol vulnerability (SSV) metric and a semantic protection probability (SPP) to quantify the exposure of task-critical symbols to decoding errors, and prove that any Gray-coded constellation is strictly suboptimal in SCI-weighted SSV whenever the source exhibits non-uniform semantic importance and co-occurrence statistics. Simulation results demonstrate that the proposed constellation achieves near 100% SPP across modulation orders from 4-QAM to 1024-QAM, versus 50% for standard constellations at high spectral efficiency, and a 21:1 compression ratio with semantic quality above 0.9, generalizing across MNIST, Fashion-MNIST, and FSDD without modification.
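
The Gray coding that the abstract contrasts against can be illustrated with a short sketch. The function names are illustrative and do not come from the paper; the code shows only the standard binary-reflected Gray code, under which neighbouring amplitude levels differ in exactly one bit, so a nearest-neighbour symbol error costs one bit whatever the symbol means.

```python
def gray_code(index: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return index ^ (index >> 1)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Bit labels for the 4 amplitude levels on one axis of 16-QAM (2 bits per axis).
levels = [gray_code(i) for i in range(4)]

# Every pair of adjacent levels differs in exactly one bit.
adjacent_distances = [hamming(a, b) for a, b in zip(levels, levels[1:])]
```

This uniform treatment of symbols is what the paper's importance-aware constellation departs from; the learned mapping itself is not reproduced here.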

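2605.14937 2026-05-15 cs.LG cs.AI cs.RO Updated version

Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations

Jonathan Spieler, Angel Villar-Corrales, Sven Behnke

Affiliations * Autonomous Intelligent Systems; Computer Science Institute VI; Intelligent Systems and Robotics; Center for Robotics; Lamarr Institute for Machine Learning and Artificial Intelligence; University of Bonn, Germany

AI summary Slot-MPC is a goal-conditioned model predictive control framework built on object-centric representations, aimed at improving an agent's planning ability in complex environments. It uses vision encoders to learn structured representations of the individual objects in a scene and builds an action-conditioned dynamics model on top of them, so that model predictive control can plan actions efficiently at inference time. Experiments show clear gains in task performance and planning efficiency over non-object-centric world models; in the offline setting with limited state-action coverage, gradient-based MPC performs better than sampling-based MPC.

Abstract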

Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object-centric world models capture scene dynamics using object-level representations, which can be used for downstream applications such as action planning. However, most object-centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot-MPC, an object-centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot-MPC leverages vision encoders to learn slot-based representations, which encode individual objects in the scene, and uses these structured representations to learn an action-conditioned object-centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient-based MPC to directly optimize actions, which is computationally more efficient than relying on gradient-free, sampling-based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot-MPC improves both task performance and planning efficiency compared to non-object-centric world model baselines. In the considered offline setting with limited state-action coverage, we find that gradient-based MPC performs better than gradient-free, sampling-based MPC. Our results demonstrate that explicitly structured, object-centric representations provide a strong inductive bias for controllable and generalizable decision-making. Code and additional results are available at https://slot-mpc.github.io.
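
Gradient-based MPC, as described above, can be sketched on a toy problem. The one-dimensional dynamics `x + a`, the quadratic cost, and all names and hyperparameters below are assumptions made for illustration; they stand in for the learned object-centric dynamics model and are not the authors' implementation. The gradient is written by hand here, whereas a learned model would rely on automatic differentiation.

```python
def rollout(x0, actions):
    """Roll a toy differentiable dynamics model forward: x_{t+1} = x_t + a_t."""
    x = x0
    for a in actions:
        x = x + a
    return x


def plan(x0, goal, horizon=5, steps=200, lr=0.05, lam=0.01):
    """Optimize an action sequence by gradient descent on
    cost = (x_H - goal)^2 + lam * sum(a_t^2)."""
    actions = [0.0] * horizon
    for _ in range(steps):
        err = rollout(x0, actions) - goal
        # d(cost)/d(a_t) = 2 * err + 2 * lam * a_t for these additive dynamics.
        grads = [2.0 * err + 2.0 * lam * a for a in actions]
        actions = [a - lr * g for a, g in zip(actions, grads)]
    return actions


planned = plan(x0=0.0, goal=1.0)
final_state = rollout(0.0, planned)
```

In a receding-horizon loop only the first planned action would be executed before replanning from the new state. A sampling-based planner would instead evaluate many candidate sequences, which is the cost the abstract says gradient-based MPC avoids.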

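2605.14912 2026-05-15 cs.AI cs.CY cs.HC cs.LG Updated version

From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka

Affiliations * Department of Computer Science, University of Oxford; Institute for Ethics in AI, University of Oxford; Responsible Technology Institute, University of Oxford

AI summary This paper examines pluralistic alignment and argues that current reinforcement-learning-based AI systems tend to accommodate the user's opinion when faced with differing values, so genuine value conflict and disagreement go missing. The authors propose three conversational mechanisms grounded in Grice's maxims, namely scoping, signalling, and repair, stressing that an AI should acknowledge the limits of its own perspective, surface value conflicts, and revise its position on principled grounds rather than simply accommodate. The study introduces the Pluralistic Repair Score (PRS) as a metric and shows experimentally that existing models follow user opinions on contested questions yet repair poorly, underlining the importance of deployment-stage governance mechanisms for achieving pluralism.

Abstract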

Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.

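2605.14907 2026-05-15 cs.AI Updated version

KGPFN: Unlocking the Potential of Knowledge Graph Foundation Model via In-Context Learning

Yisen Gao, Jiaxin Bai, Haoyu Huang, Zhongwei Xie, Yufei Li, Hong Ting Tsang, Sirui Han, Yangqiu Song

Affiliations * Department of Computer Science and Engineering, HKUST, Hong Kong, China; Department of Computer Science and Engineering, HKBU, Hong Kong, China

AI summary Knowledge graph foundation models aim to generalize to graphs containing new entities and relations by learning transferable relational structure. Most existing methods, however, focus on relation-level universality, while in-context learning, an important pillar of foundation models, has received little study in knowledge graph reasoning. This paper proposes KGPFN, a knowledge graph foundation model that incorporates a Prior-data Fitted Network and reasons over local and global information in structured context, achieving strong cross-graph adaptation and strong results on multiple benchmarks.

Abstract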

Knowledge graph (KG) foundation models aim to generalize across graphs with unseen entities and relations by learning transferable relational structure. However, most existing methods primarily emphasize relation-level universality, while in-context learning, the other pillar of foundation models, remains under-explored for KG reasoning. In KGs, context is inherently structured and heterogeneous: effective prediction requires conditioning on the local context around the query entities as well as the global context that summarizes how a relation behaves across many instances. We propose KGPFN, a KG foundation model using a Prior-data Fitted Network that unifies transferable relational regularities with inference-time in-context learning from structured context. KGPFN first learns relation representations via message passing on relation graphs to capture cross-graph relational invariances. For query-specific reasoning, it encodes local neighborhoods using a multi-layer NBFNet as local context. To enable ICL at global scale, it constructs relation-specific global context by retrieving a large set of instances of the query relation together with their local neighborhoods, and aggregates them within a Prior-Data Fitted Network framework that combines feature-level and sample-level attention. Through multi-graph pretraining on diverse KGs, KGPFN learns when to instantiate reusable patterns and when to override them using contextual evidence. Experiments on 57 KG benchmarks demonstrate that KGPFN achieves strong adaptation to previously unseen graphs through in-context learning alone, consistently outperforming competitive fine-tuned KG foundation models. Our code is available at https://github.com/HKUST-KnowComp/KGPFN.

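2605.14900 2026-05-15 cs.AI Updated version

COREKG: Coreset-Guided Personalized Summarization of Knowledge Graphs

Sohel Aman Khan, Raghava Mutharaju, Supratim Shit

Affiliations * Mehta Family School of Data Science and AI, IIT Palakkad, India; Department of CSE, IIIT-Delhi, India

AI summary This paper proposes COREKG, a personalized knowledge graph summarization method based on coreset theory, which addresses the difficulty of using large-scale knowledge graphs in tasks such as question answering and visualization. Using sensitivity scores derived from a user's query patterns, the method samples a representative subset of triples from the knowledge graph so that the summary stays accurate in structure and semantics. Experiments on several real-world datasets show better query accuracy and structural coverage than existing methods, together with substantially lower storage and query overhead.

Comments Accepted at IJCAI 2026

Abstract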

Knowledge Graphs (KGs) are extensively used across different domains and in several applications. These KGs are often very large, which makes them unwieldy for tasks such as question answering and visualization. Summarization of KGs offers a viable alternative in such cases. Furthermore, personalized KG summarization is crucial in the current data-driven world, as it captures the specific requirements of users based on their query patterns. Since it only maintains relevant information, a personalized KG summary is small, resulting in significantly smaller storage requirements and query runtime. In this work, we adapt coreset theory to create personalized KG summaries. For a given dataset and a user-specific query workload, we present an approach that samples a relevant subset of triples using sensitivity-based importance sampling. We ensure that the subset approximates the characteristics of the full dataset with bounded approximation error. We define sensitivity scores that measure the importance of a triple with respect to a user's query workload, which are then used by our coreset construction algorithm. We explicitly focus on personalized knowledge graph summarization by constructing summaries independently for each user based on their query behaviour. Our evaluation on Freebase, WikiData, and DBpedia shows that COREKG delivers higher query-answering accuracy and structural coverage than state-of-the-art methods such as GLIMPSE, PPR, iSummary, PEGASUS, and APEX$^2$, while requiring only a tiny fraction of the original graph.
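
Sensitivity-based importance sampling, the generic coreset technique named above, can be sketched as follows. The scores passed in are placeholders: how COREKG derives sensitivities from a query workload is not given in the abstract, and the function name and signature are hypothetical. Each sampled triple carries the weight 1 / (m * p_i), the standard choice that keeps weighted sums over the sample unbiased.

```python
import random


def sensitivity_sample(triples, scores, m, seed=0):
    """Draw m triples with probability proportional to their (positive) scores.

    Returns (triple, weight) pairs, where weight = 1 / (m * p_i).
    """
    rng = random.Random(seed)
    total = sum(scores)
    probs = [s / total for s in scores]
    picked = rng.choices(range(len(triples)), weights=probs, k=m)
    return [(triples[i], 1.0 / (m * probs[i])) for i in picked]


# Hypothetical toy graph; uniform scores make every weight equal to n / m.
toy_triples = [
    ("Alice", "knows", "Bob"),
    ("Bob", "worksAt", "Acme"),
    ("Acme", "locatedIn", "Paris"),
    ("Paris", "capitalOf", "France"),
]
summary = sensitivity_sample(toy_triples, scores=[1, 1, 1, 1], m=2)
```

With non-uniform scores, triples that matter more to the workload are drawn more often and receive smaller weights, which is what lets a small sample approximate the full graph.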

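2605.14897 2026-05-15 cs.LG cs.AI Updated version

Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models

Senne Deproost, Denis Steckelmacher, Ann Nowé

Affiliations * Vrije Universiteit Brussel

AI summary This paper studies how to distill deep reinforcement learning policies into explainable models so as to balance performance against interpretability. It proposes a critic-driven Voronoi quantization method that partitions the state space and fits a linear function to each region, giving a simplified representation of a complex policy. The method uses the critic network of the original policy to refine the subpolicies iteratively, improving both the performance and the interpretability of the distilled model.

Comments Accepted for presentation at EXTRAAMAS 2026

Abstract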

Despite many successful attempts at explaining Deep Reinforcement Learning policies using distillation, it remains difficult to balance the performance-interpretability trade-off and select a fitting surrogate model. In addition, traditional distillation only minimizes the distance between the behavior of the original and the surrogate policy, while other RL-specific components such as action value are disregarded. To solve this, we introduce a new model-agnostic method called Critic-Driven Voronoi State Partitioning, which partitions a black-box control policy into regions where a simple class of model can be optimized using gradient descent. By exploiting the critic value network of the original policy, we iteratively introduce new subpolicies in regions with insufficient value, standing in for a measure of policy complexity. The partitioning, a Voronoi quantizer, uses nearest neighbor lookups to assign a linear function to each point in the state space, resulting in a cell-like diagram. We validate our approach on several well-known benchmarks and show that this distillation approaches the original policy using a reasonably sized set of linear functions.
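
The lookup step of such a Voronoi quantizer can be sketched as below. The centroids, weights, and function names are made up for illustration, and the critic-driven procedure that decides where to add new cells is not shown, since the abstract does not specify it.

```python
def nearest_cell(state, centroids):
    """Index of the centroid closest to state (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((s - ci) ** 2 for s, ci in zip(state, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def act(state, centroids, linear_policies):
    """Apply the linear subpolicy (weights, bias) of the Voronoi cell containing state."""
    weights, bias = linear_policies[nearest_cell(state, centroids)]
    return sum(w * s for w, s in zip(weights, state)) + bias


# Two hypothetical cells, each with its own linear function of the state.
centroids = [(0.0, 0.0), (10.0, 10.0)]
linear_policies = [((1.0, 0.0), 0.0), ((0.0, 1.0), 1.0)]

action_near_origin = act((1.0, 2.0), centroids, linear_policies)
action_far = act((9.0, 8.0), centroids, linear_policies)
```

Because each state maps to exactly one cell, the action taken can be explained by pointing at a single linear function.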

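2605.14893 2026-05-15 cs.CV cs.AI cs.LG Updated version

Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers

Jakub Grzywaczewski, Dawid Płudowski, Przemysław Biecek

Affiliations * Warsaw University of Technology; Centre for Credible Artificial Intelligence; University of Warsaw

AI summary This paper studies the structure of the latent space in contrastively pre-trained vision-language models (VLMs) and finds that their shared latent space contains a large amount of non-semantic, multi-modal noise. Using spectral decomposition of covariance matrices, the authors split the latent space into a semantic signal component and a shared noise subspace, and observe that the noise structure shows strong subgroup invariance across data subsets. Experiments show that removing these noise dimensions has little effect on downstream performance and can even improve it, indicating that much of the latent space of modern VLMs is governed by architecture-induced noise rather than by task-relevant semantics alone.

Abstract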

Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.

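2605.14886 2026-05-15 cs.AI Updated version

BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring

Zixuan Shu, Tiancheng Cao, Hen-Wei Huang

Affiliations * School of Electrical and Electronic Engineering, Nanyang Technological University, Republic of Singapore; Lee Kong Chian School of Medicine, Nanyang Technological University, Republic of Singapore

AI summary In Internet of Medical Things (IoMT) networks, electrocardiogram (ECG) monitoring is constrained by data-sharing regulations and privacy protection. To address the heavy communication cost of model updates in federated learning and the performance drop under non-IID and long-tailed labels, this paper proposes BiFedKD, a bidirectional federated knowledge distillation framework that improves model alignment through temperature scaling and an aggregation-by-distillation mechanism. Experiments on the MIT-BIH Arrhythmia dataset show that BiFedKD markedly improves accuracy and Macro-F1 while substantially reducing communication and computation overhead.

Abstract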

Electrocardiogram (ECG) monitoring in Internet of Medical Things (IoMT) networks is constrained by strict data-sharing regulations and privacy concerns. Federated learning (FL) enables collaborative learning by keeping raw ECG data on devices, but frequent transmissions of high-dimensional model updates incur heavy per-round traffic over bandwidth-limited links. To alleviate this bottleneck, federated distillation (FD) replaces parameter exchange with logit-based knowledge transfer. However, the performance of FD often degrades under the non-independent and identically distributed (non-IID) and long-tailed label distributions in ECG deployments. To address these challenges, we propose a bidirectional federated knowledge distillation (BiFedKD) framework that employs an aggregation-by-distillation pipeline with temperature scaling to produce a stable global distillation signal for cross-client alignment. Experiments on the MIT-BIH Arrhythmia dataset show that BiFedKD improves accuracy and Macro-F1 over the baseline by $3.52\%$ and $9.93\%$, respectively. Moreover, to reach the same Macro-F1, BiFedKD reduces communication overhead by $40\%$ and computation cost by $71.7\%$ compared with the baseline.
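
Logit-based knowledge transfer with temperature scaling can be sketched as below. This follows the common knowledge-distillation recipe (mean of client logits as the teacher signal, softened softmax, KL divergence scaled by T squared); the averaging rule, the T squared factor, and all names are assumptions, not the BiFedKD pipeline itself.

```python
import math


def softmax_t(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    top = max(logits)
    exps = [math.exp((z - top) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_logits(client_logits):
    """Average per-class logits across clients (assumed aggregation rule)."""
    n = len(client_logits)
    return [sum(col) / n for col in zip(*client_logits)]


def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T^2."""
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl


# Two hypothetical clients that disagree on a two-class problem.
global_signal = aggregate_logits([[1.0, 3.0], [3.0, 1.0]])
```

A higher temperature flattens the teacher distribution, so rare classes contribute more to the loss. Only per-class logits cross the network in this scheme, which is why federated distillation is cheaper per round than exchanging model parameters.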

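2605.14867 2026-05-15 cs.LG cs.AI q-bio.NC Updated version

REALM: Retrospective Encoder Alignment for LFP Modeling

Peicheng Wu, Zhenyu Bu, Runze Ma, Lin Du

Affiliations * Department of Biomedical Engineering, The Ohio State University; Department of Information Technology, Monash University Malaysia; NeuroTech Insititude, Columbus, OH, United States

AI summary This study proposes REALM, a causal LFP decoding framework that targets the low accuracy of behavior decoding from local field potentials (LFPs) and the unsuitability of non-causal architectures for real-time use. REALM transfers representational knowledge from a pretrained bidirectional LFP model to a causal student model, enabling efficient real-time decoding. Experiments show that REALM keeps decoding performance high while markedly reducing parameter count and training time, demonstrating the practicality and scalability of LFP-only models for wireless implantable brain-computer interfaces.

Abstract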

Spike activity has been the dominant neural signal for behavior decoding due to its high spatial and temporal resolution. However, as brain-computer interfaces (BCIs) move toward high channel counts and wireless operation, the high sampling frequency of spike signals becomes a bottleneck due to high power and bandwidth requirements. Local field potentials (LFPs) represent a different spatial-temporal scale of brain activity compared to spikes, offering key advantages including improved long-term stability, reduced energy consumption, and lower bandwidth requirements. Despite these benefits, LFP-based decoding models typically show reduced accuracy and often rely on non-causal architectures that are unsuitable for real-time deployment. To address these challenges, we propose REALM: a retrospective distillation framework that enables causal LFP decoding. Inspired by offline-to-online distillation strategies in speech recognition, REALM transfers representational knowledge from a pretrained multi-session bidirectional LFP model to a causal version for real-time deployment. We first pretrain a bidirectional Mamba-2 teacher model using a masked autoencoding objective. We then distill this teacher model into a compact student model via a combined objective of representation alignment and task supervision. REALM consistently outperforms both causal and non-causal LFP-based SOTA methods for behavior decoding. Notably, REALM improves decoding performance while achieving a $2\times$ reduction in parameter count and a $10\times$ reduction in training time. These results demonstrate that retrospective distillation effectively bridges the gap between offline and real-time neural decoding. REALM shows that LFP-only models can achieve competitive decoding performance without reliance on spike signals, offering a practical and scalable alternative for next-generation wireless implantable BCIs.

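2605.14866 2026-05-15 cs.SE cs.AI Updated version

Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought

Lingzhe Zhang, Tong Jia, Kangjin Wang, Chiming Duan, Minghua He, Rongqian Wang, Xi Peng, Meiling Wang, Gong Zhang, Renhai Chen, Ying Li

Affiliations * Peking University; Huawei Theory Lab

AI summary As microservice systems grow more complex through dynamic interactions and changing runtime environments, failures become more frequent, and accurate root cause localization (RCL) is essential to system reliability. Existing approaches based on traditional machine learning and deep learning fall short in interpretability and cross-deployment transferability, while methods based on large language models (LLMs) improve on these but still suffer in diagnostic efficiency and accuracy because of context explosion and serial reasoning structures. This paper proposes RCLAgent, a root cause localization framework for microservices based on multi-agent recursion-of-thought, which decomposes the diagnostic process with parallel reasoning and markedly improves localization accuracy and inference efficiency.

Abstract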

As modern microservice systems grow increasingly complex due to dynamic interactions and evolving runtime environments, they experience failures with rising frequency. Ensuring system reliability therefore critically depends on accurate root cause localization (RCL). While numerous traditional machine learning and deep learning approaches have been explored for this task, they often suffer from limited interpretability and poor transferability across deployments. More recently, large language model (LLM)-based methods have been proposed to address these issues. However, existing LLM-based approaches still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and impair inference efficiency. In this paper, we conduct a comprehensive study of both how human SREs perform root cause localization in practice and why existing LLM-based methods fall short. Motivated by these findings, we introduce RCLAgent, an in-depth root cause localization framework for microservice systems that realizes multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root-Level Diagnosis Report and the Global Evidence Graph. Extensive experiments on multiple public benchmarks demonstrate that RCLAgent consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.

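2605.14865 2026-05-15 cs.AI cs.CL Updated version

Holistic Evaluation and Failure Diagnosis of AI Agents

Netta Madvil, Gilad Dym, Alon Mecilati, Edo Dekel, Jonatan Liberman, Rotem Brazilay, Liron Schliesser, Max Svidlo, Shai Nir, Orel Shalom, Yaron Friedman, David Connack, Amos Rimon, Philip Tannor, Shir Chorev

Affiliations * Deepchecks

AI summary This study proposes a holistic evaluation and failure diagnosis framework for AI agents, addressing the inability of existing evaluation methods to explain why a failure occurred and where. The framework combines top-down agent-level diagnosis with bottom-up span-level evaluation and decomposes the analysis into independent per-span assessments, which supports traces of arbitrary length and supplies a span-level rationale for each verdict. Experiments show state-of-the-art results on multiple benchmarks, with markedly higher accuracy in categorization, localization, and joint localization-categorization.

Abstract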

AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.

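2605.14857 2026-05-15 cs.AI cs.IR Updated version

A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions

Yu Zhang, Dongjiang Zhuang, Qu Zhou, Zheng Huang, Junhe Wu, Jing Cao, Kai Chen

Affiliations * School of Information and Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China; Nanjing Jiyun Information Technology Co., Ltd., Nanjing, China; School of Computer Science, Shanghai Jiao Tong University, Shanghai, China

AI summary This paper proposes a deterministic agentic workflow for Harmonized System (HS) tariff classification, an expert-level task. Through multi-dimensional rule reasoning combined with an interpretable decision process, the method tackles the challenge of satisfying priority rules simultaneously along several dimensions such as material, form, and function. The authors design an agent architecture with a fixed control flow that confines large language model calls to specific stages and retains local reflection and verification mechanisms, yielding structured, interpretable classification decisions. Experiments on the HSCodeComp dataset show high classification accuracy and suggest that some annotations may be inconsistent with HS rules.

Abstract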

Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in multi-dimensional rule reasoning: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a deterministic agentic workflow in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction: each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.

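2605.14855 2026-05-15 cs.LG cs.AI eess.SP Updated version

Exploitation of Hidden Context in Dynamic Movement Forecasting: A Neural Network Journey from Recurrent to Graph Neural Networks and General Purpose Transformers

Lukas Schelenz, Shobha Rajanna, Denis Gosalci, Lucas Heublein, Jonas Pirkl, Jonathan Ott, Felix Ott, Christopher Mutschler, Tobias Feigl

Affiliations * Fraunhofer Institute for Integrated Circuits IIS

AI summary This paper studies how to exploit hidden context effectively in dynamic movement forecasting, tracing the progression from recurrent neural networks to graph neural networks and general-purpose Transformers. It compares several machine learning methods on forecasting the movement trajectories of NBA players and finds that an LSTM-based hybrid model augmented with contextual information achieves the lowest final displacement error, outperforming graph attention networks, Transformers, and other models. The experiments show that the models trade off differently across forecast accuracy, generalizability, and training efficiency, so trajectory prediction in fast-paced dynamic environments calls for task-specific model selection.

Comments 12 pages

Journal ref IEEE/ION Position, Location and Navigation Symposium (PLANS), Salt Lake City, UT, May 2025

Abstract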

Forecasting within signal processing pipelines is crucial for mitigating delays, particularly in predicting the dynamic movements of objects such as NBA players. This task poses significant challenges due to the inherently interactive and unpredictable nature of sports, where abrupt changes in velocity and direction are prevalent. Traditional approaches, including (S)ARIMA(X), Kalman filters (KF), and Particle filters (PF), often struggle to model the non-linear dynamics present in such scenarios. Machine learning (ML) methods, such as long short-term memory (LSTM) networks, graph neural networks (GNNs), and Transformers, offer greater flexibility and accuracy but frequently fail to explicitly capture the interplay between temporal dependencies and contextual interactions, which are critical in chaotic sports environments. In this paper, we evaluate these models and assess their strengths and weaknesses. Experimental results reveal key performance trade-offs across input history length, generalizability, and the ability to incorporate contextual information. ML-based methods demonstrated substantial improvements over linear models across forecast horizons of up to 2s. Among the tested architectures, our hybrid LSTM augmented with contextual information achieved the lowest final displacement error (FDE) of 1.51m, outperforming temporal convolutional neural network (TCNN), graph attention network (GAT), and Transformers, while also requiring less data and training time compared to GAT and Transformers. Our findings indicate that no single architecture excels across all metrics, emphasizing the need for task-specific considerations in trajectory prediction for fast-paced, dynamic environments such as NBA gameplay.
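
The final displacement error (FDE) reported above is, in its usual definition, the Euclidean distance between the last predicted position and the last ground-truth position. A minimal sketch under that assumption follows; the function name and the toy trajectories are illustrative, and the paper may average the value over players and sequences.

```python
import math


def final_displacement_error(predicted, truth):
    """Euclidean distance between the final points of two 2-D trajectories."""
    (px, py), (tx, ty) = predicted[-1], truth[-1]
    return math.hypot(px - tx, py - ty)


# Hypothetical court positions in metres, one (x, y) point per forecast step.
predicted_path = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
true_path = [(0.0, 0.0), (0.5, 0.5), (0.0, 0.0)]
fde = final_displacement_error(predicted_path, true_path)
```

FDE looks only at the endpoint, so two forecasts with very different intermediate paths can score the same.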

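2605.14851 2026-05-15 cs.MA cs.AI Updated version

IFPV: An Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification

Zhigao Huang, Zhengqing Hu, Dong Chen, Shaohan Zhang, Zhao Jin, Bo Zhang, Han Wu, Mingliang Xu

Affiliations * School of Computer and Artificial Intelligence, Zhengzhou University; Engineering Research Center of Intelligent Swarm Systems, Ministry of Education; National Supercomputing Center in Zhengzhou; Henan Research Center for Large Model Technology

AI summary This paper proposes IFPV, an integrated multi-agent framework for generating operational plans and verifying them at high fidelity. The framework consists of two tightly coupled modules: Multi-Perspective Hierarchical Agents (MPHA), which generate operational action sequences, and an Adversarial Cognitive Simulation Engine (ACSE), which performs high-fidelity adversarial verification. Experiments show that IFPV outperforms conventional methods in mission success rate and operational cost, and that the verification module is markedly better at exposing latent vulnerabilities in candidate plans.

Comments Submitted to Neurocomputing

Abstract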

Operational plan generation and verification are critical for modern complex and rapidly changing battlefield environments, yet traditional generation methods often yield infeasible plans and traditional verification methods remain insufficient. To alleviate these limitations, we propose an Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification (IFPV). IFPV consists of two tightly coupled modules: Multi-Perspective Hierarchical Agents (MPHA) for generative operational planning and an Adversarial Cognitive Simulation Engine (ACSE) for high-fidelity adversarial plan verification. MPHA decomposes commander intent into executable multi-platform tactical action sequences through the collaboration of Pathfinder, Analyst, and Planner agents. ACSE introduces an opponent equipped with a customized world model, which predicts the future evolution of mission-critical platforms and conducts dynamic counteractions against candidate plans. Simulation experiments in the Asymmetric Combat Tactic Simulator (ACTS) show that IFPV improves mission success by 19.4% and reduces operational cost by 41.7% compared with a single-step large language model (LLM) planning baseline. Compared with a traditional rule-based validator, ACSE increases the average suppression rate by 31.8%, indicating that the proposed verification environment is stricter and more discriminative in revealing the latent vulnerabilities of candidate plans. The code for IFPV can be found at https://github.com/zhigao3ks/IFPV.

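2605.14844 2026-05-15 cs.LG cs.AI Updated version

XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

Thomas Witt

Affiliations * Gemini Stiftung

AI summary This paper proposes XFP, a dynamic weight quantization method for efficient LLM inference. Given quality floors on per-channel cosine similarity, the method automatically determines each layer's codebook size, outlier budget, and packing, with no manual bit-width selection or calibration data. XFP decomposes each weight matrix into a sparse fp16 outlier residual and a dense sub-byte index tensor, and decodes efficiently through two storage modes. Experiments show that XFP achieves higher inference speed and accuracy than existing methods on several large models, and that it handles models that exceed the memory limit.

Comments 17 pages, 3 figures, 17 tables, 1 algorithm. Code: https://github.com/flash7777/vllm/tree/multiquant

Details
English abstract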

We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically -- no Hessian, no calibration data, no manual bit-width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook. Two storage modes share one auto-select frontend and one fused decode kernel: V2 (per-channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5-122B-A10B under V2, XFP reaches 138 tok/s single-stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict-match (3 seeds, n=3957), and is 49% faster than Marlin INT4 at TP=1. For models that do not fit in the target memory envelope, we present the H-Process: a quality-driven iteration over the two cosine thresholds that finds the operating point at which the model just fits while still producing sensible output. Three constraints define its search space: the operator-set thresholds, an OOM boundary at quantize-on-load, and a garbage boundary in generation (cosine similarity steers; benches verify). On Qwen3.5-397B-A17B (512 routed experts/layer), the H-Process fits the full expert population into 2x96 GB at ~3.4 effective bits and delivers 100.9 tok/s long-output decode at 66.72% GSM8K strict-match on the full 1319-problem set (single seed at submission; multi-seed evaluation in progress), exceeding INT4 with routed-expert pruning on memory, throughput, and accuracy simultaneously.
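The quality-targeted workflow described above (the operator sets a cosine-similarity floor, and the quantizer picks the codebook size) can be illustrated with a toy search. This is a sketch under stated assumptions, not XFP's algorithm: the evenly spaced quantizer stands in for a learned per-group codebook, outlier separation is omitted, and all function names and the candidate size list are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def quantize_uniform(channel, levels):
    """Stand-in quantizer: snap each weight to one of `levels` evenly spaced values."""
    lo, hi = min(channel), max(channel)
    if hi == lo:
        return list(channel)  # a constant channel is reproduced exactly
    step = (hi - lo) / (levels - 1)
    return [lo + round((w - lo) / step) * step for w in channel]


def smallest_codebook(channel, floor, sizes=(2, 4, 8, 16, 32, 64, 128, 256)):
    """Return the smallest candidate size whose reconstruction meets `floor`, else None."""
    for k in sizes:
        if cosine_similarity(channel, quantize_uniform(channel, k)) >= floor:
            return k
    return None
```

A stricter floor pushes the search toward larger codebooks, which mirrors the abstract's distinction between a strict floor for attention and shared experts and a lazier one for routed experts.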

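2605.14841 2026-05-15 cs.LG cs.AI Updated version

GPart: End-to-End Isometric Fine-Tuning via Global Parameter Partitioning

Paolo Mandica, Michał Brzozowski, Zuzanna Dubanowska, Neo Christopher Chung

Affiliations * Samsung AI Center; University of Warsaw

AI summary This paper proposes GPart, a new parameter-efficient fine-tuning method that achieves end-to-end isometric fine-tuning through global parameter partitioning, addressing the loss of distance preservation in the parameter mapping of conventional low-rank adaptation (LoRA). GPart uses a single isometric partition matrix to map a low-dimensional trainable vector directly into the model's full weight space, removing the low-rank bottleneck entirely and markedly improving parameter efficiency. Experiments show that GPart performs well on natural language understanding, computer vision, and mathematical reasoning tasks, reaching the state of the art among parameter-efficient fine-tuning methods.

Details
English abstract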

Low-rank adaptation (LoRA) has become the dominant paradigm for parameter-efficient fine-tuning (PEFT) of large language models (LLMs). However, its bilinear structure introduces a critical limitation: the mapping from trainable parameters to weight updates is not distance-preserving, distorting the optimization landscape. Methods that project a low-dimensional vector into LoRA's parameter space, such as Uni-LoRA, improve parameter efficiency, but the subsequent bilinear LoRA map breaks end-to-end isometry, leaving the core distance-preservation problem unresolved. We propose GPart (Global Partition fine-tuning), a highly parameter-efficient fine-tuning method which removes the low-rank bottleneck entirely. Our method uses a single isometric partition matrix to map a $d$-dimensional trainable vector directly into the full weight space of the model. The result is an extremely minimal fine-tuning pipeline: one random projection, end-to-end isometric, with a single clean hyperparameter ($d$) and storage cost of $d+1$ values (the trainable vector plus a random seed). GPart builds on the theoretical premise that effective fine-tuning can emerge from random low-dimensional subspaces of the full weight space, without imposing low-rank matrix structure. We empirically demonstrate the superior or comparable performance of GPart to existing PEFT methods on natural language understanding, computer vision tasks, and mathematical reasoning. Overall, GPart achieves state-of-the-art efficiency and performance by removing structural constraints, offering a straightforward and elegant path to PEFT.
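One way a seeded partition can give an isometric map is sketched below: every weight index is assigned to exactly one of $d$ groups, and each entry is scaled by the inverse square root of its group size, so the implied matrix has orthonormal columns and preserves Euclidean norms. The abstract does not spell out GPart's construction, so the round-robin assignment, the scaling, and the function names here are assumptions for illustration only.

```python
import math
import random


def partition_assignment(num_weights, d, seed):
    """Assign every weight index to one of d groups, reproducibly from `seed`."""
    if not 0 < d <= num_weights:
        raise ValueError("need 0 < d <= num_weights")
    order = list(range(num_weights))
    random.Random(seed).shuffle(order)
    groups = [0] * num_weights
    for position, weight_index in enumerate(order):
        # Round-robin over the shuffled order keeps every group non-empty.
        groups[weight_index] = position % d
    return groups


def project(theta, groups):
    """Map the d-dimensional vector `theta` to a full-size weight update."""
    sizes = [0] * len(theta)
    for g in groups:
        sizes[g] += 1
    # Scaling by 1/sqrt(group size) gives the implied matrix orthonormal columns.
    return [theta[g] / math.sqrt(sizes[g]) for g in groups]
```

Because the assignment is regenerated from the seed, only `theta` and the seed need to be stored, which matches the $d+1$ storage cost stated in the abstract.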

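2605.14833 2026-05-15 cs.AI cs.HC Updated version

Emotion-Attended Stateful Memory (EASM): The Architecture for Hyper-Personalization at Scale

Vineet Kotecha, Vansh Gupta

Affiliations * divAIne Research

AI summary Current language model systems are fundamentally stateless across sessions, which limits their ability to personalize interactions over time. This paper proposes Emotion-Attended Stateful Memory (EASM), an architecture that dynamically constructs a user's personalized conversational context at inference time from long-term history, emotional signals, and inferred intent. Experiments show that the architecture markedly improves memory grounding, plan clarity, and emotional validation across conversations in several emotional categories, and remains stable in difficult emotional scenarios such as grief and anxiety, suggesting a new foundational architecture for highly personalized AI systems.

Comments 18 pages, 3 figures, 3 tables. Industry research whitepaper. Includes controlled A/B evaluation across 30 scenarios and 6 emotional categories

Details
English abstract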

Current language model systems remain fundamentally stateless across sessions, limiting their ability to personalize interactions over time. While retrieval-augmented generation and fine-tuning improve knowledge access and domain capability, they do not enable persistent understanding of individual users. We propose an emotion-attended stateful memory architecture that dynamically constructs user-specific conversational context using long-term history, emotional signals, and inferred intent at inference time. To evaluate its impact, we conducted a controlled A/B study across thirty non-scripted conversations spanning six emotionally distinct categories using the same underlying language model in both conditions. The memory-enriched condition consistently outperformed the stateless baseline across all evaluated scenarios. The largest gains were observed in memory grounding (95% improvement), plan clarity (57%), and emotional validation (34%). Results remained consistent even in emotionally adversarial conversations involving grief, distress, and uncertainty. These findings suggest that stateful emotional memory may represent a foundational infrastructure layer for hyper-personalized AI systems, though broader validation across larger and more diverse evaluations remains necessary.

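2605.14831 2026-05-15 cs.AI cs.LG Updated version

Interestingness as an Inductive Heuristic for Future Compression Progress

Vincent Herrmann, Jürgen Schmidhuber

Affiliations * IDSIA/USI/SUPSI Lugano, Switzerland; King Abdullah University of Science and Technology

AI summary This paper studies interestingness as an inductive heuristic for future compression progress, aiming to address a bottleneck in recursively self-improving systems: identifying tasks or data with potential for progress. Using tools from Algorithmic Statistics and Kolmogorov Complexity, the authors show that interestingness is theoretically viable and empirically supported, and find that expected future progress depends exponentially on the recency of the last breakthrough. The study also shows that the Algorithmic Prior yields more optimistic estimates of expected discovery than the Length Prior, and confirms this experimentally across three different computational paradigms.

Details
English abstract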

One of the bottlenecks on the way towards recursively self-improving systems is the challenge of interestingness: the ability to prospectively identify which tasks or data hold the potential for future progress. We formalize interestingness as an inductive heuristic for future compression progress and investigate its predictability using tools from Kolmogorov Complexity and Algorithmic Statistics. By analyzing complexity-runtime profiles under Length, Algorithmic, and Speed priors, we demonstrate that the inductive property of interestingness -- the capacity for past progress to signal future discovery -- is theoretically viable and empirically supported. We prove that expected future progress depends exponentially on the recency of the last observed breakthrough. Furthermore, we show that the Algorithmic Prior is significantly more optimistic than the Length Prior, yielding a quadratic increase in expected discovery for the same observed profile. These findings are experimentally confirmed across three diverse universal computational paradigms.

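2605.14802 2026-05-15 cs.AI Updated version

A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency

Zhao Yang, Wang Huan, Li Yingshuo, Tu Haomiao, Lin Hujite

Affiliations * Changchun Kelaile Technology Co., Ltd

AI summary This study addresses fact loss, timeline confusion, persona drift, and reduced stability in long-term interaction with large language models, and proposes ARPM, a heterogeneous temporal memory governance framework. The framework separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, and other techniques to make the governance of continuity and persona consistency traceable. Experiments show that ARPM maintains semantic continuity and persona consistency in high-noise settings, and that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.

Comments 23 pages, 5 figures, 2 tables. Preprint version. Code for ARPM v4.0 is available at: https://github.com/Spirtxiaoqi7/ARPM

Details
English abstract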

Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long-range interaction, especially under high-noise knowledge bases, context clearing, and cross-model transfer. To address these issues, we introduce ARPM, an external temporal memory governance framework for long-term dialogue. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Unlike approaches that encode persona consistency into model weights or rely only on long context, ARPM treats continuity as a traceable, auditable, and transferable governance problem. Using engineering logs, we conduct three experiments. First, in a 50-round question-answering setting, we compare signal-to-noise ratios of 1:5 and 1:200+, and distinguish CSV auto-judgment from manual review. Under 1:5, CSV recall accuracy is 54.0%, while manual review raises it to 100.0%. Under 1:200+, the values are 44.0% and 80.0%. These results show that automatic rules can underestimate recall after supporting evidence enters the prompt. Second, ablation results show that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%, and disabling BM25 reduces it to 80.0%, indicating that pure semantic retrieval is insufficient for correction and tracing. Third, under a 5.1-million-character noise substrate, periodic context clearing, and multi-model handoff, ARPM maintains semantic continuity, boundary continuity, and persona consistency, while exposing limits caused by weak protocol compliance. These findings show that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.
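The RRF fusion step named above is, in its standard form, reciprocal rank fusion: each document scores the sum of 1/(k + rank) over the ranked lists it appears in. The sketch below shows that standard formula for merging a vector-retrieval list with a BM25 list. The constant k = 60 is the conventional default and the alphabetical tie-break is an illustrative choice; ARPM's actual settings are not given in the abstract.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    `rankings` is a list of lists, each ordered best-first. A document's fused
    score is the sum of 1 / (k + rank) over the lists that contain it.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # hypothetical dense-retrieval order
bm25_hits = ["b", "d", "a"]     # hypothetical lexical order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents found by both retrievers rise to the top even when neither ranks them first, which is consistent with the ablation finding that semantic retrieval alone is insufficient.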

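2605.14790 2026-05-15 cs.CL cs.AI Updated version

Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

Songyang Gao, Yinghui Xia, Siyi Liu, Hui Xiong

Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University

AI summary This paper proposes Graphs of Research (GoR), a supervised fine-tuning method for improving LLM-based research idea generation. For each seed paper, the method builds a 2-hop citation neighborhood and uses citation position, frequency, predecessor links, and publication time to produce a paper-evolution directed acyclic graph (DAG), which serves as the supervision signal for training. Experiments show that GoR achieves the best performance against GPT-4o-based baselines, supporting citation-evolution graphs as a supervision signal for research idea generation.

Details
English abstract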

Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly elicit ideas from LLMs through static retrieval of relevant literature or complex prompt engineering, discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as a supervision signal for LLM-based idea generation. We hope that this lowers the barrier to using citation-evolution graphs as supervision, accelerating automated scientific innovation.

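2605.14786 2026-05-15 cs.CR cs.AI cs.HC cs.LG Updated version

Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces

William Lugoloobi, Samuelle Marro, Jabez Magomere, Joss Wright, Chris Russell

Affiliations * Oxford Internet Institute, University of Oxford; Department of Engineering Science, University of Oxford

AI summary As LLM-based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify the underlying model that drives an agent? This study finds that an agent's actions and interaction timings, captured by a passive JavaScript tracker, identify the model with up to 96% F1. It also shows that classifiers trained on agent behavior generalize across model sizes and families, and that effective classifiers can be trained from only a few interaction traces. Injecting random timing delays degrades classifier performance, but retraining recovers it.

Details
English abstract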

As LLM-based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Doing so would represent a significant security risk, enabling targeted attacks tailored to known model vulnerabilities. Across 14 frontier LLMs and four web environments spanning information retrieval and shopping tasks, we show that an agent's actions and interaction timings, captured via a passive JavaScript tracker, are sufficient to identify the underlying model with up to 96% F1. We formalise this attack surface by demonstrating that classifiers trained on agent actions generalise across model sizes and families. We further show that strong classifiers can be trained from few interaction traces and that agent identity can be inferred early within an episode. Injecting randomised timing delays between actions substantially degrades classifier performance, but does not provide robust protection: a classifier retrained on delayed traces largely recovers performance. We release our harness and a labelled corpus of agent traces at https://github.com/KabakaWilliam/known_actions.

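2605.14774 2026-05-15 cs.AI Updated version

Identifying Culprits Through Deep Deterministic Policy Gradient Deep Learning Investigation

Lata B T, Savitha N J

Affiliations * Dept. of CSE, UVCE, Bengaluru, India

AI summary This paper studies how deep reinforcement learning can improve the accuracy of suspect identification in criminal investigations. The authors propose using the Deep Deterministic Policy Gradient (DDPG) algorithm, trained on datasets of crime scene material, witness statements, and suspect profiles, to improve identification efficiency and reduce misjudgments. The reported identification accuracy is 95%, higher than several existing methods, which the authors present as a new direction for AI in the judicial domain.

Journal ref Mathematical Statistician and Engineering Applications, https://www.philstat.org/index.php/MSEA/article/view/2953, ISSN: 2094-0343

Details
English abstract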

In AI-assisted investigation, identifying a crime or its perpetrator remains a major problem. Conventional approaches to criminal investigation usually rely on limited data analysis. Finding an efficient method that identifies criminals from complex datasets while minimising false positives and false negatives is a challenge. This paper presents an approach based on the deep learning algorithm Deep Deterministic Policy Gradient (DDPG). We train the DDPG model on a dataset of crime scene material, witness statements, and suspect profiles. The algorithm uses features to maximise the likelihood of identifying the offender while minimising the impact of noise and irrelevant data. We show the efficacy of the proposed method: DDPG identified criminals with an accuracy of 95%, higher than several existing methods.

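2605.14773 2026-05-15 cs.LG cs.AI Updated version

Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training

Suorong Yang, Hanqi Zhu, Hai Gan, Fangjian Su, Guang Li, Furao Shen, Soujanya Poria

Affiliations * National University of Singapore; Nanjing University; Hokkaido University; Nanyang Technological University

AI summary This paper studies data selection for efficient model training. It points out that existing methods focus on which samples to select but usually fix the data-volume ratio, creating a mismatch between dynamic selection and static data volume. Taking an optimization perspective, the authors propose PODS, a plug-and-play oscillatory data-volume scheduling framework that dynamically adjusts the selection ratio to strengthen regularization while keeping optimization stable. Experiments show that PODS improves the trade-off between training efficiency and model performance across a range of datasets and tasks.

Details
English abstract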

Data selection accelerates training by identifying representative training data while preserving model performance. However, existing methods mainly focus on designing sample-importance criteria, i.e., deciding what to select, while typically fixing the selected data volume as the target ratio throughout training. Thus, they are often dynamic in sample identity but static in data volume. In this work, we revisit data selection from an optimization perspective and show that selected-data training induces an implicit regularization effect modulated by the instantaneous selection ratio. This reveals a key trade-off: lower ratios amplify selection-induced regularization, whereas higher ratios preserve data coverage and optimization fidelity. Motivated by this insight, we propose PODS, a Plug-and-play Oscillatory Data-volume Scheduling framework. Rather than introducing another sample-scoring metric, PODS serves as a lightweight module that dynamically schedules how much data to select over training. Under the target selection ratio, PODS alternates between low-ratio regularization phases and high-ratio recovery phases to exploit selection-induced regularization without sacrificing optimization stability. With its lightweight, ratio-level, and task-agnostic design, PODS is compatible with existing static and dynamic selection methods and broadly applicable across training paradigms. Experiments across various datasets, architectures, and tasks show that PODS consistently improves the efficiency-generalization trade-off, e.g., reducing ImageNet-1k training cost by 50% with improved accuracy and accelerating LLM instruction tuning by over 2x without performance degradation.
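The alternation between low-ratio and high-ratio phases can be pictured as a schedule whose average equals the target selection ratio. The sketch below uses a symmetric square wave as the simplest such schedule. The waveform, the `amplitude` and `phase_length` parameters, and the function name are assumptions for illustration; the abstract does not specify the schedule PODS actually uses.

```python
def oscillating_ratio(step, target, amplitude, phase_length):
    """Selection ratio for a training step under a symmetric square-wave schedule.

    Phases of `phase_length` steps alternate between `target - amplitude`
    (stronger selection-induced regularization) and `target + amplitude`
    (recovery of data coverage), so the mean over full periods is `target`.
    """
    if phase_length <= 0:
        raise ValueError("phase_length must be positive")
    if not (0.0 < target - amplitude and target + amplitude <= 1.0):
        raise ValueError("ratios must stay within (0, 1]")
    in_low_phase = (step // phase_length) % 2 == 0
    return target - amplitude if in_low_phase else target + amplitude


# Twelve steps at a 50% target: three low steps, three high steps, repeated.
ratios = [oscillating_ratio(t, target=0.5, amplitude=0.2, phase_length=3)
          for t in range(12)]
```

Since the schedule only outputs how much data to keep at each step, it can sit in front of any sample-scoring method, which is the plug-and-play property the abstract emphasises.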

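2605.14771 2026-05-15 cs.AI Updated version

MediaClaw: Multimodal Intelligent-Agent Platform Technical Report

Shaoan Zhao, Huanlin Gao, Qiang Hui, Ting Lu, Xueqiang Guo, Yantao Li, Xinpei Su, Fuyuan Shi, Chao Tan, Fang Zhao, Kai Wang, Shiguo Lian

Affiliations * China Unicom AI (Yuanjing) Team

AI summary MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. It targets practical deployment problems in AIGC applications, such as fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality workflows. The platform adopts a three-layer architecture of unified abstraction, plugin-based extension, and workflow orchestration; it abstracts full-category AIGC capabilities into a unified invocation model and uses task-oriented Skill modules to make complex production processes reusable. The report focuses on MediaClaw's architectural design philosophy, the design logic of its core capability model, and key engineering trade-offs, offering a reusable practical reference for building multimodal capability platforms.

Details
English abstract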

MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. Its core design follows a three-layer architecture of unified abstraction, pluginized extension, and workflow orchestration. The system is intended to address practical deployment pain points in AIGC adoption, including fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality production workflows. MediaClaw abstracts full-category AIGC capabilities into a unified invocation model, uses plugins to support hot-pluggable capability expansion, and uses task-oriented Skills to turn complex production processes into reusable workflow assets. This report focuses on the architectural design philosophy of MediaClaw, the design logic of its core capability model, and the key engineering trade-offs in implementation. It aims to provide reusable practical reference for building multimodal capability platforms.

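2605.14766 2026-05-15 cs.CL cs.AI eess.AS Updated version

Streaming Speech-to-Text Translation with a SpeechLLM

Titouan Parcollet, Shucong Zhang, Xianrui Zheng, Rogier C. van Dalen

Affiliations * Samsung, AI Center – Cambridge, United Kingdom

AI summary This paper proposes an LLM-based system for real-time streaming speech-to-text translation, addressing the slow response of existing SpeechLLM systems in practical use. The model learns not only to generate translated text but also to decide whether it has received enough audio to emit output, enabling more efficient streaming. Experiments show that the system keeps translation quality close to the non-streaming baseline while reducing latency to 1-2 seconds.

Comments 9 pages of main text; 24 pages in total

Details
English abstract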

Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close to the non-streaming baseline, but with a latency of only 1-2 seconds.

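2605.14764 2026-05-15 cs.LG cs.AI Updated version

Compositional Sparsity as an Inductive Bias for Neural Architecture Design

Hongyu Lin, Antonio Briola, Yuanrong Wang, Tomaso Aste

Affiliations * Department of Computer Science, University College London, London, United Kingdom

AI summary This paper studies how deep neural networks overcome the curse of dimensionality through structural priors and proposes compositional sparsity as an inductive bias. The authors combine Information Filtering Networks (IFNs) with Homological Neural Networks (HNNs) to build an interpretable framework for neural network design in which abstraction arises through hierarchical composition. Experiments show that HNNs, with far fewer parameters than conventional deep networks, recover the sparse structure on synthetic tasks and achieve better performance and stability on several real-world datasets.

Details
English abstract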

Identifying the structural priors that enable Deep Neural Networks (DNNs) to overcome the curse of dimensionality is a fundamental challenge in machine learning theory. Existing literature suggests that effective high-dimensional learning is driven by compositional sparsity, where target functions decompose into constituents supported on low-dimensional variable subsets. To investigate this hypothesis, we combine Information Filtering Networks (IFNs), which extract sparse dependency structures via constrained information maximisation, with Homological Neural Networks (HNNs), which map the inferred topology into fixed-wiring sparse neural graphs. We formalise the design principles underlying this construction and present an interpretable pipeline in which abstraction emerges through hierarchical composition. HNNs are orders of magnitude sparser than standard DNNs and require only minimal hyperparameter tuning. On synthetic tasks with known sparse hierarchies, HNNs recover the underlying compositional structure and remain stable in regimes where dense alternatives degrade as dimensionality increases. Across a broad suite of real-world datasets, HNNs consistently match or outperform dense baselines while using far fewer parameters, exhibiting lower variance and showing reduced sensitivity to hyperparameters.

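2605.14761 2026-05-15 cs.AI cs.HC Updated version

AI Outperforms Humans in Personalized Image Aesthetics Assessment via LLM-Based Interviews and Semantic Feature Extraction

Yoshia Abe, Tatsuya Daikoku, Yasuo Kuniyoshi

Affiliations * Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo-ku, Tokyo

AI summary This study addresses the fundamental challenge of having AI accurately predict an individual's aesthetic evaluation of images. It proposes an integrated system combining deep learning and large language models (LLMs) that actively elicits a user's aesthetic preferences through LLM-based semi-structured interviews and predicts evaluations from both low-level and high-level semantic image features. Experiments show that the system outperforms conventional models, human predictors, and the user's own re-evaluations after a time interval, with particularly strong results on highly rated images, suggesting that AI may be better placed than humans to capture individual aesthetic preferences.

Comments 25 pages, 13 figures

Details
English abstract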

Accurately predicting individual aesthetic evaluation for images is a fundamental challenge for AI. Various deep learning (DL)-based models have been proposed for this task, training on image evaluation data to extract objective low-level features. However, aesthetic preferences are inherently subjective and individual-dependent. Accurate prediction thus requires the extraction of high-level semantic features of images and the active collection of preference information from the target individual. To address this issue, we focus on the utility of Large Language Models (LLMs) pretrained on vast amounts of textual data, and develop an integrated DL-LLM system. The system actively elicits aesthetic preferences through LLM-based semi-structured interviews and predicts aesthetic evaluation by leveraging both low-level and high-level features. In our experiments, we compare the proposed system against conventional systems, human predictors, and the target individual's own re-evaluations after a certain time interval. Our results show that the proposed system outperforms all of them, with particularly strong performance on highly-rated images. Moreover, the prediction error of the proposed system is smaller than within-person variability, while human predictors show the largest error, likely due to the influence of their own aesthetic values. These results suggest that AI may be better positioned than others or one's future self to capture individual aesthetic preferences at a given point. This opens a new question of whether AI could serve as a deeper interpreter of human aesthetic sensibility than humans themselves.

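2605.14758 2026-05-15 cs.AI Updated version

Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning

Luca Marzari, Enrico Marchesini

Affiliations * TU Wien; Massachusetts Institute of Technology

AI summary This paper studies probabilistic verification of policies based on recurrent neural networks (RNNs) in partially observable reinforcement learning. Existing tools for verifying RNN policies rely on restrictive assumptions or coarse approximations, which makes their results overly conservative. The authors propose RNN-ProVe, a probabilistic verification framework that uses policy-driven sampling to estimate the probability of undesired behaviors over the hidden-state space reachable under a policy, and derives statistical error bounds to give high-confidence verification results. Experiments show that the method provides more quantitative, feasibility-aware probabilistic guarantees on single-agent and multi-agent tasks.

Comments Accepted at the 35th International Joint Conference on Artificial Intelligence (IJCAI) 2026

Details
English abstract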

History-dependent policies induced by recurrent neural networks (RNNs) rely on latent hidden state dynamics, making verification in partially observable reinforcement learning (RL) challenging. Existing RNN verification tools typically rely on restrictive modeling assumptions or coarse over-approximations of the hidden state space, which can lead to overly conservative or inconclusive results. We propose RNN Probabilistic Verification (RNN-ProVe), a probabilistic framework that estimates the likelihood of undesired behaviors in RNN-based policies. RNN-ProVe uses policy-driven sampling to approximate the set of hidden states that are feasible under a trained policy, and derives statistical error bounds to produce bounded-error, high-confidence estimates of behavioral violations. Experiments on partially observable single-agent and cooperative multi-agent tasks show that RNN-ProVe yields more quantitative, feasibility-aware probabilistic guarantees than existing tools, while scaling to recurrent and multi-agent settings.
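A bounded-error, high-confidence estimate of a violation rate can be obtained from independent samples with a standard concentration inequality. The sketch below uses Hoeffding's inequality as a stand-in, since the abstract does not say which bound the framework derives; `sample_is_violation` is a hypothetical callback that would roll out the policy from a sampled hidden state and report whether the undesired behavior occurred.

```python
import math
import random


def samples_needed(epsilon, delta):
    """Samples required by Hoeffding's inequality for a two-sided guarantee.

    With this many i.i.d. samples, the empirical violation rate is within
    `epsilon` of the true rate with probability at least 1 - delta.
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_violation_rate(sample_is_violation, n, rng):
    """Fraction of n sampled rollouts flagged as violations by the callback."""
    violations = sum(1 for _ in range(n) if sample_is_violation(rng))
    return violations / n


n = samples_needed(epsilon=0.05, delta=0.01)
# Placeholder check: flags roughly 10% of samples; a real check would run the policy.
rate = estimate_violation_rate(lambda rng: rng.random() < 0.1, n, random.Random(0))
```

The sample count depends only on the tolerance and the confidence level, not on the size of the hidden-state space, which is what makes sampling attractive where exhaustive over-approximation is too coarse.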

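2605.14754 2026-05-15 cs.AI Updated version

XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition

Gong Zhiren, Tiantong Wu, Jiaming Zhang, Fuyao Zhang, Che Wang, Yurong Hao, Yikun Hou, Foo Ping, Yilei Zhao, Fei Huang, Chau Yuen, Wei Yang Bryan Lim

Affiliations * Alibaba Group, China

AI summary XDomainBench is a diagnostic benchmark for reasoning collapse in large language models during high-dimensional scientific knowledge composition. By systematically varying discipline combinations and task difficulty, the study shows that reasoning ability declines markedly as composition complexity increases. It attributes reasoning collapse mainly to the added difficulty of discipline composition and to error accumulation and domain confusion during interaction, offering a new perspective and experimental framework for evaluating models on scientific knowledge synthesis.

Details
English abstract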

Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize the composition order and mixture structure to enable systematic stress-testing from single-discipline to inter-disciplinary, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures where trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.

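2605.14752 2026-05-15 cs.LG cs.AI Updated version

Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions

Qirui Liu, Hao Chen, Weijie Shi, Jiajie Xu, Jia Zhu

Affiliations * South China University of Technology; Tencent Financial Technology; The Hong Kong University of Science and Technology; Soochow University; Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University

AI summary This study aims to identify student misconceptions accurately in support of personalized education. To address data scarcity, noisy annotation, and deployment constraints, it proposes a two-stage knowledge distillation framework guided by cognitive uncertainty. The method mines high-value samples from existing data, identifies critical samples from the teacher model's uncertainty and confidence differences, and uses a difficulty-adaptive mechanism so that the student model inherits inter-class relationships and distinguishes ambiguous error types. Experiments show that the method markedly improves classification performance when trained on a small amount of data, outperforming current state-of-the-art models.

Comments ACL 2026 Findings. 10 pages, 5 figures, 19 tables

Details
English abstract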

Accurately identifying student misconceptions is crucial for personalized education but faces three challenges: (1) data scarcity with long-tail distribution, where authentic student reasoning is difficult to synthesize; (2) fuzzy boundaries between error categories with high annotation noise; (3) a deployment paradox: large models overlook unconventional approaches due to pretraining bias and cannot be deployed on edge devices, while small models overfit to noise. Unlike traditional methods that increase diversity through large-scale data synthesis, we propose a two-stage knowledge distillation framework that mines high-value samples from existing data. The first stage performs standard distillation to transfer task capabilities. The second stage introduces a dual-layer marginal selection mechanism based on cognitive uncertainty, identifying four types of critical samples based on teacher model uncertainty and confidence differences. For different data subsets, we design a difficulty-adaptive mechanism to balance hard/soft label contributions, enabling student models to inherit inter-class relationships from teacher soft labels while distinguishing ambiguous error types. Experiments show that with augmented training on only 10.30% of filtered samples, we achieve MAP@3 of 0.9585 (+17.8%) on the MAP-Charting dataset, and using only a 4B parameter model, we attain 84.38% accuracy on cross-topic tests of middle school algebra misconception benchmarks, significantly outperforming the state-of-the-art LLM (67.73%) and standard fine-tuned 72B models (81.25%). Our code is available at https://github.com/RoschildRui/acl2026_map.

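2605.14750 2026-05-15 cs.CR cs.AI Version update

EVA: Editing for Versatile Alignment against Jailbreaks

Yi Wang, Hongye Qiu, Yue Xu, Sibei Yang, Zhan Qin, Minlie Huang, Wenjie Wang

Affiliations * ShanghaiTech University; Sun Yat-sen University; State Key Laboratory of Blockchain and Data Security; Tsinghua University

AI summary Large language models (LLMs) and vision language models (VLMs) perform well but remain vulnerable to jailbreak attacks, in which adversaries use textual or visual triggers to bypass safety guardrails. To address the heavy computational overhead and performance degradation of existing defenses, the paper proposes EVA, a framework that uses direct model editing to correct the key neurons responsible for jailbreak behavior without large-scale retraining, removing harmful behavior while preserving the model's original capabilities. Experiments show that EVA outperforms existing methods on multiple models, offering an efficient and precise solution for post-deployment safety alignment.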

Comments IEEE TPAMI 2026

Details
Abstract

Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated impressive capabilities but remain vulnerable to jailbreaking attacks, where adversaries exploit textual or visual triggers to bypass safety guardrails. Recent defenses typically rely on safety fine-tuning or external filters to reduce the model's likelihood of producing harmful content. While effective to some extent, these methods often incur significant computational overheads and suffer from the safety utility trade-off, degrading the model's performance on benign tasks. To address these challenges, we propose EVA (Editing for Versatile Alignment against Jailbreaks), a novel framework that pioneers the application of direct model editing for safety alignment. EVA reframes safety alignment as a precise knowledge correction task. Instead of retraining massive parameters, EVA identifies and surgically edits specific neurons responsible for the model's susceptibility to harmful instructions, while leaving the vast majority of the model unchanged. By localizing the updates, EVA effectively neutralizes harmful behaviors without compromising the model's general reasoning capabilities. Extensive experiments demonstrate that EVA outperforms baselines in mitigating jailbreaks across both LLMs and VLMs, offering a precise and efficient solution for post-deployment safety alignment.

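2605.14749 2026-05-15 cs.CL cs.AI cs.LG Version update

Non-linear Interventions on Large Language Models

Sangwoo Kim

Affiliations * Department of Linguistics, Seoul National University, Republic of Korea

AI summary This paper studies how to intervene on non-linearly represented features in large language models, going beyond the limits of existing linear intervention methods. The author proposes a general intervention framework for non-linear features together with a learning procedure that enables intervention on implicit features lacking a direct output signal. Experiments show that on refusal bypass steering the method outperforms conventional linear methods and intervenes more precisely.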

Details
Abstract

Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.

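2605.14747 2026-05-15 cs.CL cs.AI cs.CV cs.LG Version update

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Hao Tian

Affiliations * National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; The University of Hong Kong; Renmin University of China

AI summary This paper proposes Video2GUI, a fully automated framework that extracts structured GUI interaction trajectories from unlabeled Internet videos, addressing the small scale and narrow domain coverage of current GUI agent pretraining data. The method uses a coarse-to-fine filtering strategy to select high-quality GUI tutorial videos and converts them into trainable interaction trajectories, building WildGUI, a large dataset of 12 million trajectories covering more than 1,500 applications and websites. Models pretrained on this dataset gain 5-20% on multiple GUI grounding and action benchmarks, matching or surpassing the state of the art.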

Comments Accepted at ICML 2026

Details
Abstract

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.

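2605.14744 2026-05-15 cs.CL cs.AI cs.CY Version update

Mechanical Enforcement for LLM Governance: Evidence of Governance-Task Decoupling in Financial Decision Systems

José Manuel de la Chica Rodríguez, Carlos Martí-González

Affiliations * Santander AI Lab

AI summary This study examines how large language models (LLMs) in regulated financial decision systems are governed through natural-language policies, noting that current evaluation looks only at task accuracy and ignores whether governance constrains the decision rationale. It proposes five metrics of governance compliance and introduces four mechanical enforcement primitives that operate outside the model's interpretive loop, substantially improving the information content of decisions and task accuracy. Experiments show that mechanical enforcement sharply reduces the share of uninformative decisions and confirm a decoupling between governance and task performance: under system stress, governance quality can be preserved independently of task performance.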

Details
Abstract

Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal-agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision rationale level, where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model's interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from an MCC of 0.43 to 0.88. The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance; the gain comes from removing clear-cut decisions from the model's control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.
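The task accuracy above is reported as a Matthews correlation coefficient (MCC). As a reminder of what that number measures, here is a minimal sketch of the standard binary MCC computed from confusion-matrix counts. The function name `mcc` and the example counts are illustrative assumptions; they are not data from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical counts: 90 correct approvals, 90 correct rejections,
# 10 false approvals, 10 false rejections.
example = mcc(tp=90, tn=90, fp=10, fn=10)
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), and unlike plain accuracy it stays informative when the two classes are imbalanced.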

2605.14741 2026-05-15 eess.SY cs.AI cs.SY Version update

Addressing Terminal Constraints in Data-Driven Demand Response Scheduling

Maximilian Bloor, Martha White, Ehecatl Antonio del Rio Chanona, Calvin Tsay

Affiliations * Sargent Centre for Process Systems Engineering, Imperial College London, London, SW7 2AZ, UK; Department of Computer Science, University of Alberta, Edmonton, AB, Canada; Department of Computing, Imperial College London, London, SW7 2AZ, UK

AI summary This paper studies how to satisfy terminal constraints in data-driven demand response scheduling. It combines Goal-Space Planning (GSP) with Deep Deterministic Policy Gradient (DDPG), learning temporally abstract models over discrete subgoals to propagate long-horizon value and improve scheduling. On a simulated air separation system, the method improves sample efficiency and satisfies terminal storage constraints, mitigating the weakness of conventional methods in handling long-horizon constraints.

Comments Accepted to IFAC World Congress 2026

Details
Abstract

Electrified chemical processes are incentivized by exposure to time-varying electricity markets to operate flexibly, but participating in demand response schemes can require satisfying terminal constraints over long horizons. Specifically, terminal constraints may be required when computing optimal schedules in order to preserve dynamic stability. Model-based optimization methods are computationally costly, and data-driven scheduling via reinforcement learning (RL) faces severe credit-assignment challenges. We integrate Goal-Space Planning (GSP) with Deep Deterministic Policy Gradient (DDPG), using learned temporally abstract models over discrete subgoals to propagate value across extended horizons. Using a simulated air separation benchmark, we demonstrate the proposed approach improves sample efficiency over standard DDPG while satisfying terminal storage constraints, mitigating myopic control behavior.

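2605.14723 2026-05-15 cs.AI cs.CL cs.LG Version update

Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang, Ziying Sheng, Xidong Wang, Rongsheng Wang, Hejia Zhang, Shuang Li, Benyou Wang, Hongyuan Zha

Affiliations * The Chinese University of Hong Kong, Shenzhen

AI summary This paper proposes SepsisAgent, an agent for sepsis treatment decisions in intensive care. It couples an LLM with a clinical world model that simulates patient responses to candidate treatments, and it follows a propose-simulate-refine workflow to improve its decisions. SepsisAgent performs best on guideline adherence and safety metrics and outperforms traditional reinforcement learning and LLM baselines. Its core contribution is that repeated interaction with the clinical world model lets the agent learn regularities in patient physiology and make more reliable decisions.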

Details
Abstract

Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid-vasopressor interventions, and follows a propose-simulate-refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose-simulate-refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.

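2605.14721 2026-05-15 cs.AI Version update

On Strong Equivalence Notions in Logic Programming and Abstract Argumentation

Giovanni Buraglio, Wolfgang Dvorak, Stefan Woltran

Affiliations * TU Wien, Austria

AI summary This paper studies why strong equivalence differs between logic programming and abstract argumentation, showing that in dynamic settings the two formalisms use different notions of update, so strong equivalence does not carry over directly. The authors propose a new notion of strong equivalence for logic programs that is preserved under translation between certain classes of logic programs and both Dung-style and claim-augmented argumentation frameworks, restoring compatibility across the formalisms.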

Details
Abstract

Strong equivalence between knowledge bases ensures the possibility of replacing one with the other without affecting reasoning outcomes, in any given context. This makes it a crucial property in nonmonotonic formalisms. In particular, the fields of logic programming and abstract argumentation provide primary examples in which this property has been subject to vast investigations. However, while (classes of) logic programs and abstract argumentation frameworks are known to be semantically equivalent in static settings, this alignment breaks in dynamic contexts due to differing notions of update. As a result, strong equivalence does not always carry over from one formalism to the other. In this paper, we carefully investigate this discrepancy and introduce a new notion of strong equivalence for logic programs. Our approach preserves strong equivalence under translation between certain classes of logic programs and both Dung-style and claim-augmented argumentation frameworks, thus restoring compatibility across these formalisms.

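2605.14717 2026-05-15 cs.CV cs.AI Version update

Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning

Saqib Nazir, Ardhendu Behera

Affiliations * Department of Computer Science, Edge Hill University, UK

AI summary This study addresses the difficulty of inferring molecular phenotypes directly from bright-field images in label-free single-cell imaging. It proposes a multi-task deep learning framework that jointly performs white blood cell classification and regression of protein expression levels. The model uses a hybrid architecture combining a convolutional neural network with a Transformer and fuses local texture features with global representations through a learnable cross-branch gating module, enabling robust joint morpho-molecular inference from differential phase contrast images. Experiments show strong performance on multiple benchmark datasets, opening a route to low-cost hematological analysis without fluorescent staining.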

Comments Accepted in 28th International Conference on Pattern Recognition (ICPR) 2026

Details
Abstract

Label-free single-cell imaging offers a scalable, non-invasive alternative to fluorescence-based cytometry, yet inferring molecular phenotypes directly from bright-field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein-expression regression from label-free Differential Phase Contrast (DPC) images. Our model employs a Hybrid architecture that fuses convolutional fine-grained texture features with transformer-based global representations through a learnable cross-branch gating module, enabling robust morpho-molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving a 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label-free single-cell imaging for cost-effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at https://github.com/saqibnaziir/Single-Cell-Phenotyping.
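The abstract does not specify how the cross-branch gating module is built. The sketch below shows one common form of gated fusion, purely as an assumption: a per-dimension sigmoid gate computed from the concatenated branch features decides how much of the convolutional feature versus the transformer feature is kept. The names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical, and in a real model the weights would be learned rather than supplied by hand.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(conv_feat: List[float],
                 trans_feat: List[float],
                 gate_weights: List[List[float]],
                 gate_bias: List[float]) -> List[float]:
    """Fuse two equal-length feature vectors with a per-dimension gate.

    g_i = sigmoid(w_i . [conv; trans] + b_i)
    fused_i = g_i * conv_i + (1 - g_i) * trans_i
    (Hypothetical formulation, not the paper's confirmed design.)
    """
    if len(conv_feat) != len(trans_feat):
        raise ValueError("branch features must have the same length")
    concat = conv_feat + trans_feat  # list concatenation, length 2 * d
    fused = []
    for i, (c, t) in enumerate(zip(conv_feat, trans_feat)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat))
        g = sigmoid(logit + gate_bias[i])
        fused.append(g * c + (1.0 - g) * t)
    return fused


# With all-zero gate parameters the gate is 0.5, so the output is the
# plain average of the two branches.
zero_w = [[0.0] * 4 for _ in range(2)]
averaged = gated_fusion([1.0, 2.0], [3.0, 4.0], zero_w, [0.0, 0.0])
```

A gate of this kind lets the network lean on texture cues for some feature dimensions and on global context for others, instead of fixing a single mixing ratio for the whole vector.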

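2605.14712 2026-05-15 cs.RO cs.AI cs.CL cs.CV Version update

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen

Affiliations * HUST; ZGCA; ZGCI; HIT; HKUST(GZ); BUAA; ZZU; ECNU; USTC; DeepCybo

AI summary This study addresses action conflicts in robot imitation learning caused by differing short-horizon intents. It proposes IntentVLA, a history-conditioned vision-language-action (VLA) framework that encodes recent visual observations into a compact short-horizon intent representation used to guide action generation. The study also builds the AliasBench benchmark to evaluate policies under short-horizon observation ambiguity. Experiments show that IntentVLA improves execution stability across multiple tasks and outperforms existing VLA methods.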

Comments Code can be found in https://github.com/ZGC-EmbodyAI/IntentVLA

Details
Abstract

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.

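2605.14710 2026-05-15 cs.CV cs.AI Version update

Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

Liren Chen, Lidong Sun, Mingyan Huang, Junzhe Tang, Yinghui Zhu, Guanjie Wang, Yiqing Xia, Ting Xiao

Affiliations * School of Information Science and Engineering, East China University of Science and Technology

AI summary This study addresses insufficient multi-modal fusion in ischemic stroke prognosis prediction and proposes a tri-modal fusion model that integrates medical images, structured clinical data, and unstructured text. The method uses a large language model to generate semi-structured diagnostic text automatically, easing the scarcity of expert annotations, and designs an alignment fusion module conditioned on visual features to achieve deep cross-modal interaction and reduce modal heterogeneity. Experiments show that the model achieves state-of-the-art predictive performance on real-world clinical data.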

Comments Corresponding author: Ting Xiao

Details
Abstract

Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.

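2605.14704 2026-05-15 cs.CV cs.AI cs.RO Version update

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

Affiliations * National Taiwan University; Delta Robotics Innovation Center

AI summary In real-world scenes, target objects may lie in regions that are not visible, and current vision-language models (VLMs) still struggle to reason about where such occluded objects are. The study introduces SceneFunRI, a benchmark built on the SceneFun3D dataset that poses a 2D spatial reasoning task with 855 instances, requiring models to locate invisible functional objects from task instructions and commonsense reasoning. Experiments show that even the strongest baseline performs poorly on this task, exposing the weakness of current models in invisible-region reasoning and the need for models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.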

Details
Abstract

In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
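The abstract does not define its metrics, so the following is only a sketch of the generic quantity behind an mIoU score: intersection-over-union between a predicted and a ground-truth 2D box, averaged over instances. The `(x1, y1, x2, y2)` box format and the function names are assumptions made for illustration, and the benchmark's own scaling and thresholds may differ.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), x1 <= x2, y1 <= y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 2D boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth boxes."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need equally many, non-empty preds and targets")
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)


# Two 2x2 boxes offset by one unit overlap in a 1x1 square.
overlap = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

Because an invisible object has no visible pixels to anchor a prediction, a guess that lands in the wrong part of the image contributes an IoU of exactly zero, which is one reason averaged scores on this kind of task can be very low.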

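2605.14698 2026-05-15 cs.LG cs.AI Version update

NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces

Konstantinos Kontras, Trui Osselaer, Stylianos G. Mouslech, Angeliki-Ilektra Karaiskou, Guido Gagliardi, Thomas Strypsteen, Mohammad Hossein Badiei, Anku Rani, Maarten Vanmarcke, Miguel Bhagubai, Chanakya Ekbote, Jaedong Hwang, Christos Chatzichristos, Paul Pu Liang, Maarten De Vos

Affiliations * KU Leuven; MIT

AI summary This paper introduces NeuroAtlas, the largest electroencephalography (EEG) benchmark to date, with 42 datasets and 260k hours of EEG covering epilepsy, sleep medicine, brain age estimation, and brain-computer interfaces, together with dedicated clinical evaluation metrics. It compares EEG-specific pretrained models with generic time-series models and finds that the latter perform comparably or even better on some tasks. It also argues that standard machine learning metrics do not capture clinical utility well, proposes evaluation measures closer to clinical practice, and shows that current pretrained models remain far from delivering unified EEG modeling.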

Details
Abstract

Foundation models (FMs) promise to extract unified representations that generalize across downstream tasks. They have emerged across fields, including electroencephalography (EEG), but it is less clear how effective they are in this particular field. Published evaluations differ in datasets, in the EEG-specific preprocessing that might influence reported results, and in the reported metrics, frequently obscuring the clinical relevance in EEG. We introduce NeuroAtlas, the largest EEG benchmark to date: 42 datasets and 260k hours covering clinical EEG (epilepsy, sleep medicine, brain age estimation) and brain-computer interfaces, and include multiple datasets per task along with bespoke clinical evaluation metrics. Besides evaluating EEG-FMs with respect to supervised baselines, we present results from generic time-series FMs. We report three findings. First, EEG-specific FMs do not consistently outperform time-series FMs, which have neither EEG-focused architectures nor been pretrained on EEG. Second, standard machine learning metrics are insufficient to assess clinical utility: thus, we thoroughly evaluate more appropriate measures such as the quality of event-level decision-making, hypnogram-derived features, and the brain-age gap in the domains of epilepsy, sleep, and brain age, respectively. Third, model rankings and performance can vary substantially within domains. We conclude that pretrained models perform largely on par, with only narrow advantages for a few, and that current models do not yet deliver on the promise of an out-of-the-box unified EEG model. NeuroAtlas exposes this gap and provides the datasets and metrics for the next generation of unified EEG FMs.

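2605.14685 2026-05-15 cs.LG cond-mat.stat-mech cs.AI Version update

Spontaneous symmetry breaking and Goldstone modes for deep information propagation

Nabil Iqbal, T. Anderson Keller, Yue Song, Takeru Miyato, Max Welling

Affiliations * Dept. of Mathematical Sciences, Durham University; Kempner Institute, Harvard University; AMLab, University of Amsterdam; College of AI, Tsinghua University; University of Tübingen, Tübingen; AI Center; CuspAI

AI summary This paper studies spontaneous symmetry breaking in deep neural networks with a continuous symmetry and the resulting Goldstone-like degrees of freedom, showing that they support coherent information propagation across network depth and recurrent iterations. Through theoretical analysis and experiments, the authors show that this mechanism yields stable information flow without architectural stabilizers such as residual connections or normalization. It improves trainability and representational diversity in feedforward networks, and it strengthens long-term memory in recurrent networks, improving performance on long-sequence modeling tasks.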

Comments 28 pages. Code at https://github.com/nabiliqbal/ssb-goldstone-deep-info-prop

Details
Abstract

In physical systems, whenever a continuous symmetry is spontaneously broken, the system possesses excitations called Goldstone modes, which allow coherent information propagation over long distances and times. In this work, we study deep neural networks whose internal layers are equivariant under a continuous symmetry and may therefore support analogous Goldstone-like degrees of freedom. We demonstrate, both analytically and empirically, that these degrees of freedom enable coherent signal propagation across depth and recurrent iterations, providing a mechanism for stable information flow without relying on architectural stabilizers such as residual connections or normalization. In feedforward networks, this results in improved trainability and representational diversity across layers. In recurrent settings, we demonstrate the same mechanism is valuable for long-term memory by propagating information over recurrent iterations, thereby improving performance of RNNs and GRUs on long-sequence modeling tasks.

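2605.14679 2026-05-15 cs.CL cs.AI Version update

AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents

Vicent Briva-Iglesias, María Ferre-Fernández

Affiliations * Dublin City University; CTTS; ADAPT Centre; SALIS; Universidad de Almería

AI summary This study examines how AI assistance can improve multilingual dissemination in terminology-dense cultural heritage domains such as rock art documentation. It compares three machine translation setups for translating a Spanish academic text into English, focusing on how much glossary-augmented prompting improves the accuracy of specialised terms. The glossary-augmented large language model (Gemini-RAG) achieves the highest terminology accuracy while preserving overall translation quality, and it outperforms the conventional neural machine translation baseline, giving cultural institutions a low-cost way to control terminology.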

Details
Abstract

Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets and staffing. In terminology-dense domains such as rock art, translation quality depends on accurate, consistent specialised terms, and small lexical errors can mislead non-specialists and reduce reuse. We compare three English MT setups for a Spanish academic rock art text, focusing on simple, operationally feasible interventions rather than complex model-side modifications: (1) DeepL as a strong NMT baseline, (2) Gemini-Simple (LLM with a basic prompt), and (3) Gemini-RAG (the same LLM with glossary-augmented prompting via term-pair retrieval). Using PEARMUT, we conduct a human evaluation via (i) multi-way Direct Assessment (0-100) and (ii) targeted terminology auditing with a restricted MQM taxonomy. Gemini-RAG yields the highest exact-match terminology accuracy (81.4%), versus Gemini-Simple (69.1%) and DeepL (64.4%), while preserving overall quality (mean DA 85.3 Gemini-RAG vs. 85.2 Gemini-Simple), outperforming DeepL (80.3). These results show that glossary-augmented prompting is a low-overhead way to improve terminology control in cultural-heritage translation if institutions maintain minimal terminology resources and lightweight evaluation procedures.
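Glossary-augmented prompting via term-pair retrieval can be illustrated with a small sketch. It assumes the simplest possible retrieval (case-insensitive whole-word matching of glossary source terms against the text) and a plain-text prompt; the function names, the prompt wording, and the three glossary entries are invented for illustration and are not the authors' actual pipeline or glossary.

```python
import re
from typing import Dict, List, Tuple


def retrieve_term_pairs(source_text: str,
                        glossary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source term, target term) pairs whose source term occurs
    in the text as a whole word or phrase, ignoring case."""
    lowered = source_text.lower()
    hits = []
    for src, tgt in glossary.items():
        pattern = r"(?<!\w)" + re.escape(src.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            hits.append((src, tgt))
    return hits


def build_prompt(source_text: str, glossary: Dict[str, str]) -> str:
    """Build a translation prompt that lists only the relevant term pairs."""
    lines = ["Translate the following Spanish text into English."]
    pairs = retrieve_term_pairs(source_text, glossary)
    if pairs:
        lines.append("Use these glossary translations for specialised terms:")
        lines.extend(f"- {src} -> {tgt}" for src, tgt in pairs)
    lines.append("Text: " + source_text)
    return "\n".join(lines)


# Hypothetical three-entry glossary and sentence.
glossary = {
    "arte rupestre": "rock art",
    "abrigo": "rock shelter",
    "pigmento": "pigment",
}
sentence = "El arte rupestre del abrigo está bien conservado."
prompt = build_prompt(sentence, glossary)
```

Retrieving only the matching pairs keeps the prompt short and avoids nudging the model towards glossary terms that do not occur in the passage; a production setup would likely also need lemmatisation to catch plurals and inflected forms.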

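2605.14671 2026-05-15 cond-mat.mtrl-sci cs.AI Version update

Agentic Design of Compositional Descriptors via Autoresearch for Materials Science Applications

Matteo Cobelli, Stefano Sanvito

Affiliations * School of Physics; CRANN Institute, Trinity College, Dublin 2, Dublin, Ireland

AI summary This paper presents Automat, an agentic system built on the autoresearch paradigm for designing chemical-composition descriptors in materials science. It uses a large language model as a coding agent to generate descriptors from chemical formulas alone and evaluates them with a random forest, predicting band gaps of inorganic materials and Curie temperatures of ferromagnetic compounds. Automat outperforms the standard baselines and produces chemically interpretable descriptors, showing that task-specific materials descriptors can be designed without manual feature engineering, while also exposing current limitations such as descriptor redundancy and the search strategy.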

Details
Abstract

Autoresearch offers a flexible paradigm for automating scientific tasks, in which an AI agent proposes, implements, evaluates, and refines candidate solutions against a quantitative objective. Here, we use composition-based materials-property prediction to test whether such agents can perform a task beyond model selection and hyperparameter optimization: the design of input descriptors. We introduce Automat, an autoresearch framework where a coding agent based on a large language model generates composition-only descriptors for chemical compounds and evaluates them using a random forest workflow. The agent is restricted to information derivable from chemical formulas and iteratively proposes, implements, and tests chemically motivated descriptor strategies. We apply Automat, with OpenAI Codex using GPT-5.5 as the coding agent, to the prediction of experimental band gaps in inorganic materials and Curie temperatures in ferromagnetic compounds. In both tasks, Automat improves over fractional-composition, Magpie, and combined fractional-composition/Magpie baselines, while producing descriptor families that are chemically interpretable. These results provide a demonstration that autoresearch agents can generate competitive, task-specific materials descriptors without manual feature engineering during the run. They also reveal current limitations, including descriptor redundancy, sensitivity to greedy feature expansion, and the need for explicit complexity control, descriptor pruning, and more sophisticated search strategies.
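To make "composition-only descriptor" concrete, here is a minimal sketch of the fractional-composition baseline mentioned above: a chemical formula is turned into element fractions that sum to one. The parser is an illustrative assumption that handles only flat formulas such as `Fe2O3` (no parentheses, hydrates, or charges), and `fractional_composition` is a hypothetical name, not part of Automat.

```python
import re
from typing import Dict

# One element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")
_FLAT_FORMULA = re.compile(r"(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?)+")


def fractional_composition(formula: str) -> Dict[str, float]:
    """Map a flat chemical formula to element fractions summing to 1."""
    if not _FLAT_FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts: Dict[str, float] = {}
    for element, amount in _TOKEN.findall(formula):
        # A missing amount means one atom, as in the "As" of "GaAs".
        counts[element] = counts.get(element, 0.0) + (
            float(amount) if amount else 1.0
        )
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"formula has no atoms: {formula!r}")
    return {el: n / total for el, n in counts.items()}


hematite = fractional_composition("Fe2O3")
```

A vector of such fractions, one slot per element, is the kind of input a random forest can consume directly; richer descriptors of the sort the agent proposes would be further functions computed from these fractions and tabulated elemental properties.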

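2605.14667 2026-05-15 cs.AI Version update

How Sensitive Are Radiomic AI Models to Acquisition Parameters?

D. Gil, I. Sanchez, C. Sanchez

Affiliations * Computer Vision Center; Universitat Autònoma de Barcelona

AI summary This paper studies how sensitive radiomic AI models are to image acquisition parameters. It proposes a mixed-effects framework that quantifies the influence of clinically relevant parameters on model performance and identifies the parameter ranges associated with better cross-dataset robustness. Applied to two independent multicentre CT datasets, the framework finds that an optimized scan configuration (tube current ≥ 200 mA, pitch ≤ 1.5, slice thickness ≤ 1.25 mm) balances diagnostic quality with low radiation dose and markedly improves model sensitivity and specificity.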

Details
Abstract

A main barrier to the deployment of AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work presents a performance-oriented framework for quantifying the scan parameter sensitivity of radiomic AI models, while identifying clinically significant parameter regions associated with improved cross-dataset robustness. We formulate a mixed-effects framework for quantifying the influence that clinically relevant acquisition parameters have on model performance, while accounting for subject-level random effects. We have applied our framework to lung cancer diagnosis in CT scans using two independent multicentre datasets (a public database and in-house data) and several state-of-the-art architectures. To evaluate across-database reproducibility, CT parameters were adjusted on the collected data and tested on the public set. The optimal configuration selected is an X-ray tube current ≥ 200 mA, spiral pitch ≤ 1.5, and slice thickness ≤ 1.25 mm, which balances diagnostic quality with low radiation dose. This configuration raises performance from 0.79 ± 0.04 sensitivity and 0.47 ± 0.10 specificity in low-quality scans to 0.90 ± 0.10 sensitivity and 0.79 ± 0.13 specificity in high-quality ones.
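The reported configuration can be read as a simple scan-level filter, and the reported metrics as standard confusion-matrix ratios. The sketch below encodes both under stated assumptions: the three thresholds are the ones quoted in the abstract, while the `ScanParams` class, its field names, and the toy labels are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ScanParams:
    tube_current_ma: float     # X-ray tube current in mA
    spiral_pitch: float        # dimensionless pitch
    slice_thickness_mm: float  # reconstructed slice thickness in mm


def is_high_quality(scan: ScanParams) -> bool:
    """True if the scan meets all three thresholds quoted in the abstract."""
    return (scan.tube_current_ma >= 200
            and scan.spiral_pitch <= 1.5
            and scan.slice_thickness_mm <= 1.25)


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity (true positive rate) and specificity (true negative
    rate) for binary labels, where 1 means malignant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels: one of two cancers is missed, both benign cases are correct.
sens, spec = sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0])
```

Splitting a test set with `is_high_quality` and computing the two ratios on each half reproduces the kind of low-quality versus high-quality comparison the abstract reports, though the mixed-effects model itself is not shown here.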

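2605.14666 2026-05-15 cs.AI Version update

Monitoring Data-aware Temporal Properties (Extended Version)

Alessandro Gianola, Marco Montali, Sarah Winkler

Affiliations * INESC-ID/Instituto Superior Técnico, Universidade de Lisboa, Portugal; Free University of Bozen-Bolzano, Italy

AI summary This paper studies anticipatory monitoring of linear temporal logic over finite traces enriched with an arbitrary SMT theory (LTLfMT), for dynamic systems whose internal specification is not accessible. It proposes a new framework that combines automata-theoretic methods with automated reasoning techniques and correctly monitors complex properties over finite traces. The approach identifies, for the first time, decidable fragments that combine linear arithmetic with uninterpreted functions, covering data-aware business processes and dynamic systems over a read-only database, and a prototype implementation demonstrates its feasibility.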

Comments This is the extended version of a paper accepted to IJCAI 2026

Details
Abstract

Dynamic systems in AI are often complex and heterogeneous, so that an internal specification is not accessible and verification techniques such as model checking are not applicable. Monitoring is in such cases an attractive alternative, as it evaluates desirable properties along traces generated by an unknown dynamic system. In this work, we consider anticipatory monitoring of linear-time properties enriched with an arbitrary SMT theory over finite traces (LTLfMT). Anticipatory monitoring in this setting is highly challenging, as the monitoring state depends on both the trace prefix seen so far and all its possible finite continuations. Under reasonable assumptions on the background theory, we present and formally prove the correctness of a novel foundational framework for monitoring properties in an expressive fragment of LTLfMT. The framework combines automata-theoretic methods to handle the temporal aspects of the logic, with automated reasoning techniques to address the first-order dimension. Moreover, we identify for the first time decidable fragments of this monitoring problem that are practically relevant as they combine linear arithmetic with uninterpreted functions, which covers e.g. data-aware business processes and dynamic systems operating over a read-only database. Feasibility is witnessed by a prototype implementation and preliminary evaluation.

2605.14660 2026-05-15 cs.AI Version update

MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder

Eranga Bandara, Ross Gore, Asanga Gunaratna, Ravi Mukkamala, Nihal Siriwardanagea, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Wathsala Herath, Chalani Rajapakse, Sachin Shetty, Anita H. Clayton, Christopher K. Rhea, Ng Wee Keong, Kasun De Zoysa, Amin Hass, Shaifali Kaushik, Preston Samuel, Atmaram Yarlagadda

Affiliations * Old Dominion University; AI Motion Labs; Nanyang Technological University; University of Colombo; Accenture Technology Labs; Department of Psychiatry and Neurobehavioral Sciences, University of Virginia School of Medicine; Blanchfield Army Community Hospital; McDonald Army Health Center

AI summary This paper proposes MindGap, a conversational AI framework intended to treat post-traumatic stress disorder (PTSD) through upstream neuroplastic intervention. Based on dependent origination, a Buddhist psychological framework, it guides patients to observe the gap between perception and reaction, aiming at structural reshaping of over-reactive neural pathways. Through three progressive layers of observation, MindGap helps patients identify and weaken the underlying beliefs that trigger stress reactions, relieving symptoms at their source rather than suppressing them after the reaction has occurred. The framework runs entirely on-device, protects privacy, and suits clinical and military settings with strict data security requirements.

详情
英文摘要

Post-Traumatic Stress Disorder (PTSD) is fundamentally a neuroplastic problem: traumatic contact events encode over-reactive neural pathways through Hebbian long-term potentiation, producing hair-triggered amygdala-HPA stress cascades that fire before conscious awareness can intercept them. Existing therapeutic approaches (prolonged exposure, EMDR, cognitive behavioural therapy) operate predominantly downstream of the reactive cascade, teaching patients to tolerate or reframe distress after it has arisen. While clinically valuable, these suppression-based approaches do not produce the upstream pathway dissolution that constitutes lasting structural neural reorganisation. This paper proposes MindGap, a privacy-preserving on-device conversational AI framework that delivers structured neuroplastic rehabilitation for PTSD through the practice of dependent origination, a Buddhist psychological framework that identifies the precise moment between the pre-cognitive affective signal and the reactive elaboration that follows as the site of therapeutic intervention. MindGap guides patients through three progressive layers of observation at this feeling tone gap: noticing the bare affective signal before reactive elaboration, recognising it as self-arising rather than caused by the stimulus, and recognising the conditioned implicit belief beneath the feeling. Each layer corresponds to progressively deeper prefrontal regulatory engagement and progressively deeper long-term depression-mediated weakening of the reactive pathway, producing genuine upstream dissolution rather than downstream suppression. Running entirely on-device with no data egress, MindGap delivers daily calibrated exposure sessions through a fine-tuned lightweight large language model, making it deployable in sensitive clinical and military contexts where cloud-based solutions are not permitted.

2605.14645 2026-05-15 cs.CV cs.AI updated version

Vision-Based Water Level and Flow Estimation

ZhiXin Sun

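Affiliations * PowerChina Zhongnan Engineering Corporation Limited

AI summary This study proposes an integrated framework that combines advanced vision models with statistical modeling to improve the accuracy of water level detection and flow estimation. By introducing physical priors and robust filtering strategies, it addresses challenges such as environmental sensitivity, limited precision, and complex site calibration. The method retains the automation and interpretability advantages of vision-based approaches while improving their reliability for hydrological monitoring.

Abstract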

With the rapid evolution of computer vision, vision-based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state-of-the-art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at https://github.com/sunzx97/Vision_Based_Water_Level_and_Flow_Estimation.git

2605.14641 2026-05-15 cs.CV cs.AI updated version

How to Evaluate and Refine your CAM

Luca Domeniconi, Alessandra Stramiglio, Michele Lombardi, Samuele Salti

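Affiliations * University of Bologna

AI summary This study addresses the evaluation and improvement of class attribution maps (CAMs) for convolutional neural networks. It introduces a synthetic dataset with ground-truth attributions that enables a more rigorous comparison of existing evaluation metrics, and proposes ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution of CAMs, it also introduces RefineCAM, which produces high-resolution attribution maps by aggregating CAMs across multiple network layers; experiments show that the method outperforms existing methods under the new metric.

Comments Accepted at ICPR 2026

Abstract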

Class attribution maps (CAMs) provide local explanations for the decisions of convolutional neural networks. While widely used in practice, the evaluation of CAMs remains challenging due to the lack of ground-truth explanations, making it difficult to evaluate the soundness of existing metrics. Independently, most commonly used CAM methods produce low-resolution attribution maps, which limits their usefulness for detailed interpretability. To address the evaluation challenge, we introduce a synthetic dataset with ground-truth attributions that enables a rigorous comparison of CAM evaluation metrics. Using this dataset, we analyze existing metrics and propose ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution issue, we introduce RefineCAM, a method that produces high-resolution attribution maps by aggregating CAMs across multiple network layers. Our results show that RefineCAM consistently outperforms existing methods according to the proposed evaluation.

2605.14636 2026-05-15 cs.AI updated version

Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning

Chenlu Ding, Jiancan Wu, Yanchen Luo, Zheyuan Liu, Yancheng Yuan, Xiang Wang

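Affiliations * University of Science and Technology of China; The Hong Kong Polytechnic University; University of Notre Dame

AI summary This study examines how large language models fail when reasoning under a temporal cutoff: when answering from the standpoint of an earlier time, they wrongly use information that became available only later. It proposes TCFT, a Temporal Critique Fine-Tuning framework that trains models to identify and judge temporal leakage in a response, improving reasoning under temporal constraints. Experiments show that TCFT significantly outperforms conventional prompting and fine-tuning baselines across multiple models and effectively lowers the leakage rate.

Abstract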

Large language models (LLMs) often fail to reason under temporal cutoffs: when prompted to answer from the standpoint of an earlier time, they exploit knowledge that became available only later. We study this failure through the lens of ex-ante reasoning, where a model must rely exclusively on information knowable before a cutoff. Through a systematic analysis of prompt-level interventions, we find that temporal leakage is highly sensitive to cutoff formulation and instruction placement: explicit cutoff statements outperform implicit historical framings, and prefix constraints reduce leakage more effectively than suffix constraints. These findings indicate that prompting can steer models into a temporal frame, but does not endow them with the ability to verify whether a response is temporally admissible. We further argue that supervised fine-tuning is insufficient, since ex-ante correctness is not an intrinsic property of an answer, but a relation between the answer and the cutoff. To address this gap, we propose TCFT, a Temporal Critique Fine-Tuning framework that trains models to acquire cutoff-aware temporal verification. Given a query, a cutoff, and a candidate response, TCFT teaches the model to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility. Experiments with Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct show that TCFT consistently outperforms prompting and SFT baselines, reducing average leakage by 41.89 and 37.79 percentage points, respectively.

2605.14635 2026-05-15 cs.CV cs.AI updated version

MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models

Tianwei Chen, Takuya Furusawa, Yuki Hirakawa, Ryotaro Shimizu, Mo Fan, Takashi Wada

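Affiliations * ZOZO NEXT Inc.

AI summary This paper presents MultiEmo-Bench, a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating how well multimodal large language models (MLLMs) predict the emotions evoked by images. Existing datasets use a single-label annotation scheme that cannot capture the multiple emotions of varying intensity that an image may evoke. The paper therefore has multiple annotators label each image, yielding a high-quality dataset of 10,344 images and 236,998 valid emotion votes, and uses it to evaluate several mainstream models on dominant emotion prediction and emotion distribution prediction, revealing both the progress and the shortcomings of current MLLMs in emotion understanding.

Abstract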

This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis for MLLM evaluation. We hire 20 annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains 10,344 images with 236,998 valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI's GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs' performance, indicating its limitations for the subjective task of visual emotion analysis.
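
The vote aggregation described in this abstract (pooling the emotions each annotator selects into a per-image distribution) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the emotion names, the function name `aggregate_votes`, and the normalisation by total vote count are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical label set: the abstract says there are eight emotions
# but does not name them, so these are placeholders.
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]


def aggregate_votes(annotations):
    """Turn per-annotator multi-label selections into an emotion distribution.

    annotations: one list of selected emotions per annotator.
    Returns (distribution, dominant): distribution maps each emotion to its
    share of all valid votes; dominant is the emotion with the most votes.
    """
    # Count only votes for known emotions so the shares sum to 1.
    counts = Counter(
        e for selected in annotations for e in selected if e in EMOTIONS
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid votes to aggregate")
    distribution = {e: counts.get(e, 0) / total for e in EMOTIONS}
    dominant = max(EMOTIONS, key=lambda e: counts.get(e, 0))
    return distribution, dominant
```

A model could then be scored on two tasks with the same output: matching `dominant`, and matching `distribution` under a distance of one's choosing.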

2605.14631 2026-05-15 cs.LG cs.AI cs.CV updated version

Action-Inspired Generative Models

Eshwar R. A., Debnath Pal

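Affiliations * Department of Computer Science Engineering; PES University (EC Campus), Bengaluru; Department of Computational and Data Sciences; Indian Institute of Science, Bengaluru

AI summary This paper proposes Action-Inspired Generative Models (AGMs), which aim to fix the uniform regression weight that existing bridge-matching methods assign to every stochastic transition. The method introduces a lightweight learned scalar potential $V_ϕ$ that scores bridge samples online and modulates the drift objective, selectively penalising uninformative transport paths and improving generation quality. The design is simple: it adds only about 1.4% more parameters, adds no overhead at inference, and plugs directly into any bridge-matching training loop.

Comments 11 pages, 5 figures, and 4 tables

Abstract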

We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies along a structurally coherent trajectory or a degenerate one. We address this by introducing a lightweight learned scalar potential $V_ϕ$ that scores bridge samples online and modulates the drift objective via importance weights derived through a stop-gradient barrier -- preventing adversarial feedback between the two networks whilst preserving $V_ϕ$'s guiding signal. Crucially, $V_ϕ$ comprises only $\sim$1.4% of the primary drift network's parameter count, adds no overhead to the inference graph, and requires no iterative half-bridge fitting or auxiliary stochastic differential equation (SDE) solvers: it is a plug-and-play enhancement to any bridge-matching training loop. At inference, $V_ϕ$ is discarded entirely, leaving standard Euler-Maruyama integration of the exponential moving average (EMA) drift. We demonstrate that selectively penalising uninformative transport paths through the learned potential yields consistent improvements in generation quality across fidelity and coverage metrics.

2605.14621 2026-05-15 cs.CV cs.AI cs.CL updated version

Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

Tian Qin, Junzhe Chen, Yuqing Shi, Tianshu Zhang, Qiang Ju, Lijie Wen

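Affiliations * Tsinghua University; The University of Sydney; Stanford University; Baichuan AI

AI summary Large vision-language models (LVLMs) tend to hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this by comparing predictions from the original image with those from externally perturbed inputs, but relying on an external reference can introduce bias and adds computational cost. This paper proposes SIRA, a training-free internal contrastive decoding framework that exploits the staged information flow of multimodal transformers to build a counterfactual reference inside the model, suppressing hallucinations while preserving descriptive coverage; it applies to open-weight models.

Abstract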

Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
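
The token-level contrast mentioned in this abstract can be illustrated with a small sketch. The abstract does not state SIRA's scoring rule, so the form used here, `(1 + alpha) * full - alpha * reference`, is an assumption borrowed from the general contrastive decoding literature; the function names and the toy logits are likewise hypothetical. The sketch only shows the contrast step, not how the internal reference branch is built.

```python
def contrast_logits(full_logits, reference_logits, alpha=1.0):
    """Combine logits from the full pathway with logits from a reference
    branch that lacks late visual access (assumed scoring rule).

    Tokens that stay strong in the reference branch are pushed down;
    tokens whose advantage depends on the full pathway are pushed up.
    """
    if len(full_logits) != len(reference_logits):
        raise ValueError("logit vectors must have the same length")
    return [
        (1 + alpha) * f - alpha * r
        for f, r in zip(full_logits, reference_logits)
    ]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary of three tokens. Token 0 is favoured by language priors
# alone (high in both branches); token 1 is strong only with visual access.
full = [2.0, 1.9, 0.1]
reference = [2.5, 0.5, 0.1]
chosen_plain = greedy_token(full)
chosen_contrast = greedy_token(contrast_logits(full, reference))
```

In this toy case plain greedy decoding picks the prior-driven token, while the contrasted scores pick the token that depends on visual evidence.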

2605.14619 2026-05-15 cs.AI updated version

SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning

Kang Chen, Junjie Nian, Yixin Cao, Yugang Jiang

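Affiliations * Fudan University; Shanghai Innovation Institute

AI summary This study proposes SliceGraph, a method for analysing how different trajectories in multi-run chain-of-thought (CoT) reasoning share, split, and rejoin. By computing activation-key Jaccard similarity between CoT slices and building a mutual k-nearest-neighbour graph, SliceGraph exposes similarities and differences in the process structure of different reasoning paths and identifies "process isomers": runs that reach the same answer through different reasoning processes. Experiments show that most problem-model cells contain multiple process families that are strategy-coherent internally yet structurally distinct from one another, indicating that final-answer aggregation overlooks the multi-route structure of the reasoning process.

Abstract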

Multi-run chain-of-thought reasoning is usually collapsed to final-answer aggregates, which discard how sampled trajectories share, split, and rejoin through intermediate computation. We propose SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, and treat it as a measurement object for process geometry rather than as a decoding program. Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units. In 85.5% of 954 problem-model cells, correct CoTs sharing the same normalized answer split into multiple process families; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. A label-seeded reward field provides a separate value-landscape layer: success-associated regions often split into disconnected high-value cores, and route families specialize over these core footprints rather than merely duplicating one another. A typed-state transition analysis further shows that process families navigate the same atlas with distinct transition kernels under matched null controls. Representation ablations, a cross-architecture replication, and two cross-scale replications support the robustness of the route-family scaffold, showing that final-answer aggregation overlooks this structured multi-route process geometry.
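
The graph construction named in this abstract, mutual-kNN over Jaccard similarity between sets of activation keys, can be sketched generically. This is a minimal illustration under stated assumptions, not the authors' implementation: each CoT slice is represented here simply as a Python set of integer keys, and the function names and tie-breaking by index order are choices made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mutual_knn_edges(key_sets, k):
    """Return undirected edges (i, j) with i < j such that i is among the
    k most similar slices to j AND j is among the k most similar to i."""
    n = len(key_sets)
    neighbours = []
    for i in range(n):
        # Rank all other slices by similarity to slice i, most similar first.
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: jaccard(key_sets[i], key_sets[j]),
            reverse=True,
        )
        neighbours.append(set(ranked[:k]))
    return sorted(
        (i, j)
        for i in range(n)
        for j in neighbours[i]
        if i < j and i in neighbours[j]
    )
```

Requiring the neighbour relation to hold in both directions drops one-sided links, which tends to keep the graph sparse; components of the resulting graph can then be inspected as candidate shared states.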

2605.14612 2026-05-15 cs.SE cs.AI updated version

In-IDE Toolkit for Developers of AI-Based Features

Yaroslav Sokolov, Yury Khudyakov, Lenar Sharipov, Andrei Gasparian, Parth Tiwary, Artem Trofimov

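Affiliations * JetBrains

AI summary This paper presents the AI Toolkit plugin integrated into JetBrains IDEs, which aims to help software engineers without a machine learning background test, debug, and evaluate AI features built on large language models and agentic workflows more easily. By bringing tracing and evaluation into the Run/Debug loop, the tool meets developers' core needs for repeatable evaluation, real-time tracing, and simpler setup. Early adoption signals suggest that the tool lowers the barrier to entry and helps developers adopt disciplined AI development practices.

Comments Published at IDE'26 co-located with ICSE'26

Abstract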

AI-enabled features built on LLMs and agentic workflows are difficult to test, debug, and reproduce, especially for product-focused software engineers without a machine learning background. We present the AI Toolkit plugin for JetBrains IDEs, which brings tracing and evaluation directly into the Run/Debug loop. A mixed methods study with practitioners presents three consistent needs: (1) make evaluation regular and repeatable, (2) expose traces at the moment of execution, and (3) minimize setup and context switching. Guided by these needs, the AI Toolkit introduces an IDE-native workflow: run-triggered trace capture; immediate, hierarchical inspection; one-click "Add to Dataset" from traces; and unit-test-like evaluations with pluggable metrics. The first release in PyCharm shows promising early signals - strong conversion when promoted at Run, sustained usage among those who capture traces, and low churn - suggesting that IDE-native observability lowers activation energy and helps developers adopt disciplined practices. We detail the design and implementation of the AI Agents Debugger and AI Evaluation, report initial adoption telemetry, and outline next steps to broaden framework coverage and scale evaluations. Together, these results indicate that integrating AI observability and evaluation into everyday IDE workflows can make modern AI development accessible to non-ML specialists while preserving software-engineering practices.

2605.14604 2026-05-15 cs.AI cs.HC updated version

Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks

Enkelejda Kasneci, Gjergji Kasneci

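Affiliations * Technical University of Munich, Munich, Germany; Munich Center for Machine Learning, Munich, Germany

AI summary This paper argues that effective teaching requires "corrective friction": surfacing students' misconceptions and challenging them supportively to drive conceptual change, whereas current preference-aligned large language models (LLMs) may sacrifice epistemic rigor for agreeableness. The authors identify a "Reasoning-Sycophancy Paradox": models that resist context-switch attacks may still capitulate under authority or social pressure. The paper introduces the EduFrameTrap benchmark for evaluating LLM tutoring across subjects and pressure scenarios, finds that current frontier models are more prone to epistemic retreat under authority and social pressure, and stresses the importance of tutoring benchmarks that measure "social-epistemic courage".

Abstract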

This position paper argues that effective tutoring requires corrective friction: surfacing misconceptions and challenging them supportively to drive conceptual change. Yet preference-aligned LLMs can trade epistemic rigor for agreeableness. We identify a Reasoning-Sycophancy Paradox: models that resist context-switch frame attacks can still capitulate under social-epistemic pressure, especially authority ("my notes say I'm right") and social-affective face-saving ("please don't tell me I'm wrong"). We introduce EduFrameTrap, a tutoring benchmark across math, physics, economics, chemistry, biology, and computer science that varies student confidence and pressure (context-switch, authority, social-affective). Across two frontier LLMs, context-switch failures are comparatively lower for GPT-5.2, while authority and social pressure more often trigger epistemic retreat. In contrast, Claude shows substantial context-switch fragility in this run. Because these failures are hard to judge automatically, we report two-judge disagreement as a reliability signal. We argue benchmarks should measure social-epistemic courage, i.e., supportive but corrective tutoring, and treat kind-but-correct behavior as a safety requirement.

2605.14599 2026-05-15 cs.LG cs.AI stat.ML updated version

Fast Rates for Inverse Reinforcement Learning

Andreas Schlaginhaufen, Maryam Kamgarpour

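Affiliations * EPFL

AI summary This paper studies entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) in finite-horizon Markov decision processes and establishes new structural and statistical results for linear reward classes. The authors show that maximum likelihood estimation and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. Exploiting pseudo-self-concordance of the Min-Max-IRL loss, they show that the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at rate $\mathcal{O}(n^{-1})$; the results hold under model misspecification and require no exploration assumptions. They also extend reward-identifiability results to general Borel spaces and derive new properties of the derivatives of the soft-optimal value function with respect to reward parameters.

Abstract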

We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $\mathcal{O}(n^{-1})$, where $n$ is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.

2605.14587 2026-05-15 cs.LG cs.AI cs.CR updated version

Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning

Oubo Ma, Ruixiao Lin, Yang Dai, Jiahao Chen, Chunyi Zhou, Linkang Du, Shouling Ji

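Affiliations * Zhejiang University; National University of Defense Technology; Xi'an Jiaotong University

AI summary This paper studies how plasticity interventions affect backdoor attacks in deep reinforcement learning (DRL) and finds that most interventions mitigate backdoor threats, while only SAM exacerbates them. Pathological analysis reveals mechanisms such as backdoor gradient amplification and activation pathway disruption, and the paper proposes the SCC conceptual framework and abnormal loss landscape sharpness as a new indicator for backdoor detection, providing theoretical support for improving the security of DRL systems.

Comments To appear in the Forty-Third International Conference on Machine Learning (ICML 2026), July 6-11, 2026, Seoul, South Korea

Abstract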

Extensive research has highlighted the severe threats posed by backdoor attacks to deep reinforcement learning (DRL). However, prior studies primarily focus on vanilla scenarios, while plasticity interventions have emerged as indispensable built-in components of modern DRL agents. Despite their effectiveness in mitigating plasticity loss, the impact of these interventions on DRL backdoor vulnerabilities remains underexplored, and this lack of systematic investigation poses risks in practical DRL deployments. To bridge this gap, we empirically study 14,664 cases integrating representative interventions and attack scenarios. We find that only one intervention (i.e., SAM) exacerbates backdoor threats, while other interventions mitigate them. Pathological analysis identifies that the exacerbation is attributed to backdoor gradient amplification, while the mitigation stems from activation pathway disruption and representation space compression. From these findings, we derive two novel insights: (1) a conceptual framework SCC for robust backdoor injection that deconstructs the mechanistic interplay between interventions and backdoors in DRL, and (2) abnormal loss landscape sharpness as a key indicator for DRL backdoor detection.

2605.14581 2026-05-15 cs.CV cs.AI cs.IR updated version

A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

Ho Hung Lim, Yi Yang

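Affiliations * The Hong Kong University of Science and Technology

AI summary This study examines the information loss that can occur in visual financial document retrieval when a document image is aggregated into a single vector. Using a purpose-built diagnostic benchmark of financial documents, the experiments find that single-vector aggregation makes the vectors of different documents nearly identical, obscuring key semantic details. The study identifies global texture dominance as the root cause and shows that the phenomenon persists across model scales and optimization strategies, highlighting the risks of single-vector methods in financial applications.

Comments Accepted to Findings of ACL 2026

Abstract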

Visual RAG has offered an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents into almost identical vectors. Metrics show that patch-level representations detect the semantic changes, and confirm that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
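
The collapse described in this abstract can be reproduced on toy data. The sketch below is only an illustration under stated assumptions: it uses mean pooling as the aggregation (the abstract does not say which aggregation strategies were studied), and the two-dimensional "patch embeddings" are invented stand-ins for real encoder outputs.

```python
import math


def mean_pool(patches):
    """Aggregate a list of patch vectors into one vector by averaging."""
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / len(patches) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two toy "documents" of 200 patches that differ in exactly one patch,
# standing in for a page where a single digit has changed.
doc_a = [[1.0, 0.0] for _ in range(200)]
doc_b = [list(p) for p in doc_a]
doc_b[57] = [0.0, 1.0]

# Similarity after aggregation: the single changed patch is averaged away.
pooled_similarity = cosine(mean_pool(doc_a), mean_pool(doc_b))

# Similarity at the patch level: the changed patch stands out clearly.
min_patch_similarity = min(cosine(p, q) for p, q in zip(doc_a, doc_b))
```

Here the pooled vectors are almost indistinguishable while the patch-level comparison exposes the edit, which mirrors the qualitative finding reported above.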

2605.14561 2026-05-15 cs.AI updated version

Prompt Segmentation and Annotation Optimisation: Controlling LLM Behaviour via Optimised Segment-Level Annotations

Devika Prasad, Luke Gerschwitz, Tong Li, Henry Xiao, Anjin Liu, Coco Wu, Anna Leontjeva, Luiz Pizzato

发表机构 * Commonwealth Bank of Australia(澳大利亚全国银行)

AI总结 本文提出了一种结构化的提示优化框架——提示分割与注释优化(PSAO),旨在提升与大型语言模型交互时的可控性和效率。该方法将提示分解为可解释的片段,并为每个片段添加人类可读的注释,以引导模型在生成响应时合理分配注意力并减少混淆。实验表明,优化后的片段级注释能够提升模型的推理准确性和一致性,同时保留原始提示作为优化候选以避免性能下降。该工作验证了片段级注释优化的可行性与潜力,但如何高效确定最优分割和注释仍是未来研究的方向。

详情
英文摘要

Prompt engineering is crucial for effective interaction with generative artificial intelligence systems, yet existing optimisation methods often operate over an unstructured and vast prompt space, leading to high computational costs and potential distortions of the original intent. We introduce Prompt Segmentation and Annotation Optimisation (PSAO), a structured prompt optimisation framework designed to improve prompt optimisation controllability and efficiency. PSAO decomposes a prompt into interpretable segments (e.g., sentences) and augments each with human-readable annotations (e.g., {not important}, {important}, {very important}). These annotations guide large language models (LLMs) in allocating focus and clarifying confusion during response generation. We formally define the segmentations and annotations and demonstrate that optimised segment-level annotations can lead to improved LLM responses, with the original prompt retained as a candidate in the optimisation space to prevent performance degradation. Empirical evaluations indicate that PSAO benefits from annotations in terms of improved reasoning accuracy and self-consistency. However, developing efficient methods for identifying optimal segmentations and annotations remains challenging and is reserved for future investigation. This work is intended as a proof of concept, demonstrating the feasibility and potential of segment-level annotation optimisation.
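
The segment-and-annotate idea in this abstract can be sketched in a few lines. This is a hedged illustration, not the PSAO implementation: the sentence-splitting rule, the placement of the annotation after each segment, and the function names `segment` and `annotate` are assumptions. Only the three annotation strings are taken from the abstract.

```python
import re

# Annotation vocabulary quoted in the abstract.
ANNOTATIONS = ("{not important}", "{important}", "{very important}")


def segment(prompt):
    """Split a prompt into sentence-level segments on '.', '!' or '?'
    followed by whitespace (a simple assumed segmentation rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", prompt.strip()) if s]


def annotate(segments, labels):
    """Attach one annotation per segment; a label of None leaves that
    segment unchanged. Returns the reassembled prompt."""
    if len(segments) != len(labels):
        raise ValueError("need exactly one label (or None) per segment")
    for label in labels:
        if label is not None and label not in ANNOTATIONS:
            raise ValueError(f"unknown annotation: {label}")
    return " ".join(
        seg if label is None else f"{seg} {label}"
        for seg, label in zip(segments, labels)
    )
```

Passing `None` for every segment returns the original prompt, which corresponds to keeping the unannotated prompt as one candidate in the search space; an optimiser would then search over label assignments.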

2605.14558 2026-05-15 cs.LG cs.AI cs.CL updated version

Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

Langzhou He, Junyou Zhu, Yue Zhou, Zhengyao Gu, Junhua Liu, Wei-Chieh Huang, Henry Peng Zou, David Wipf, Philip S. Yu, Qitian Wu

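Affiliations * University of Illinois Chicago; Potsdam Institute for Climate Impact Research; Technical University of Berlin; University of Southern California; University of Hong Kong; Broad Institute of MIT and Harvard

AI summary This paper studies the uneven allocation of training signal across tokens in agentic reinforcement learning, noting that existing methods treat every token in a trajectory equally and therefore misallocate training signal. From an energy-based modeling perspective, the authors find that the effective training signal concentrates on action tokens rather than reasoning tokens, a phenomenon they call the "Action Bottleneck". They propose ActFocus, a simple and effective token reweighting method that downweights gradients on reasoning tokens and increases the weights on action tokens with higher uncertainty, significantly improving model performance.

Comments Preprint

Abstract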

Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
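
The first half of the reweighting idea in this abstract, downweighting reasoning tokens relative to action tokens, can be sketched as a weighted policy-gradient loss. This is an illustration under stated assumptions rather than ActFocus itself: the constant `reasoning_weight`, the normalisation by the sum of weights, and the use of a single trajectory-level advantage are choices made for the example, and the energy-based redistribution among action tokens is not modelled.

```python
def token_weights(is_action, reasoning_weight=0.1):
    """Weight 1.0 for action tokens, a smaller constant for reasoning tokens."""
    return [1.0 if a else reasoning_weight for a in is_action]


def weighted_pg_loss(logprobs, advantage, is_action, reasoning_weight=0.1):
    """Weighted policy-gradient surrogate loss for one trajectory.

    logprobs:  log-probability of each sampled token under the policy.
    advantage: scalar advantage assigned to the whole trajectory.
    is_action: True where the token belongs to an environment-facing action.
    """
    if len(logprobs) != len(is_action):
        raise ValueError("logprobs and is_action must align token by token")
    weights = token_weights(is_action, reasoning_weight)
    weighted_sum = sum(w * advantage * lp for w, lp in zip(weights, logprobs))
    return -weighted_sum / sum(weights)
```

Setting `reasoning_weight=1.0` recovers uniform credit assignment over tokens, which makes the effect of the reweighting easy to compare against the baseline.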

2605.14556 2026-05-15 cs.AI updated version

TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality

Zidong Liu, Rongkai Liu, Yue Li, Zhenliang Zhang

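Affiliations * State Key Laboratory of General Artificial Intelligence; BIGAI

AI summary This paper presents TeachAnything, a multimodal crowdsourcing platform for training embodied agents in Symmetrical Reality. Through a three-stage demonstration paradigm that integrates multimodal demonstration signals, the platform supports the collection of diverse demonstration data across scenes, tasks, and embodiments. By unifying virtual and physical interactions, the system provides a practical foundation for building embodied agents that meet the demands of Symmetrical Reality.

Comments 5 pages, 3 figures. Accepted as an IEEE VR 2026 Poster

Abstract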

Symmetrical Reality (SR) is emerging as a future trend for human-agent coexistence, placing higher demands on agents to acquire human-like intelligence. It calls for richer and more diverse human guidance. We introduce a three-stage demonstration paradigm integrating multimodal demonstration signals. Building on this paradigm, we developed TeachAnything, a cloud-based, crowdsourcing-oriented demonstration platform with physics simulation capable of collecting diverse demonstration data across varied scenes, tasks, and embodiments. By unifying virtual and physical interactions through both methodological design and physics simulation, the system serves as a practical foundation for developing embodied agents aligned with Symmetrical Reality.

2605.14555 2026-05-15 cs.SD cs.AI updated version

Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

Shuyang Cui, Zhi Zhong, Qiyu Wu, Zachary Novack, Woosung Choi, Keisuke Toyama, Kin Wai Cheuk, Junghyun Koo, Yukara Ikemiya, Christian Simon, Chihiro Nagashima, Shusuke Takahashi

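Affiliations * Sony Group Corporation; Sony AI

AI summary This paper proposes "Break-the-Beat!", a controllable MIDI-to-drum audio synthesis model that addresses the lack of fine-grained control when generating drum loop audio in digital music production. By adding a content encoder and a hybrid conditioning mechanism and fine-tuning a pre-trained text-to-audio model, it generates drum audio with the timbre of a reference audio. Experiments show strong performance in audio quality, rhythmic alignment, and beat continuity, giving music producers an efficient and controllable creative tool.

Abstract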

Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing "Break-the-Beat!", a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine-tuning a pre-trained text-to-audio model with our proposed content encoder and an effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target-reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offers producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break-the-beat/

2605.14553 2026-05-15 cs.LG cs.AI updated version

Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits

Donghao Li, Chengshuai Shi, Weijuan Ou, Cong Shen, Jing Yang

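Affiliations * University of Virginia; Princeton University; Southern University of Science and Technology

AI summary This paper studies multi-objective prompt selection, which aims to efficiently identify the prompts that perform best across multiple performance metrics. The authors cast the problem into the pure-exploration bandits framework and introduce an efficient algorithm for structured bandits, with theoretical guarantees on the identification error in the linear case. Experiments show that the approach significantly outperforms baselines across multiple large language models, providing a principled and efficient solution for multi-objective prompt optimization.

Comments Published as a conference paper at ICLR 2026

Abstract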

Prompt engineering has become central to eliciting the capabilities of large language models (LLMs). At its core lies prompt selection -- efficiently identifying the most effective prompts. However, most prior investigations overlook a key challenge: the inherently multi-faceted nature of prompt performance, which cannot be captured by a single metric. To fill this gap, we study the multi-objective prompt selection problem under two practical settings: Pareto prompt set recovery and best feasible prompt identification. Casting the problem into the pure-exploration bandits framework, we adapt provably efficient algorithms from multi-objective bandits and further introduce a novel design for best feasible arm identification in structured bandits, with theoretical guarantees on the identification error in the linear case. Extensive experiments across multiple LLMs show that the bandit-based approaches yield significant improvements over baselines, establishing a principled and efficient framework for multi-objective prompt optimization.
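
The target of the "Pareto prompt set recovery" setting in this abstract can be made concrete with a short sketch. It computes the Pareto set from known mean scores, which is a simplification: in the bandit setting the scores are unknown and must be estimated from noisy evaluations, and the sampling strategy is the paper's actual contribution. The function names, the prompt identifiers, and the convention that higher scores are better are assumptions for the example.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b on every metric
    and strictly better on at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_set(scores):
    """scores: dict mapping a prompt id to a tuple of metric values.
    Returns the sorted ids of prompts not dominated by any other prompt."""
    return sorted(
        p for p, s in scores.items()
        if not any(dominates(t, s) for q, t in scores.items() if q != p)
    )
```

For example, with two metrics such as accuracy and brevity, a prompt that is worse than another on both is excluded, while prompts that trade one metric off against the other all remain in the set.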

2605.14544 2026-05-15 cs.AI Version update

Complacent, Not Sycophantic: Reframing Large Language Models and Designing AI Literacy for Complacent Machines

Federico Germani, Giovanni Spitale

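Affiliations * Institute for Data Science and Artificial Intelligence, Boğaziçi University

AI summary This paper re-examines how the behaviour of large language models (LLMs) is characterized. It argues that describing them as "sycophantic" is conceptually misleading and that the behaviour is better understood as "complacency": a tendency to agree with user input because training data, reward signals, and design favour agreement over correction. The authors stress that models themselves have no sycophantic motives and that agency lies with developers and institutions. They therefore argue that AI literacy education should help users recognize and counter the confirmation bias that models may reinforce.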

Details
English abstract

Large language models are often described as sycophantic, in the sense that they appear to flatter users or mirror their beliefs. We argue that this label is conceptually misleading: sycophancy implies motives and strategic intent, which LLMs do not possess. Their behaviour is better understood as complacency, a structural tendency to agree with user input because training data, reward signals and design favour agreement and reinforcement over correction. We argue that this distinction matters. Whether developers act sycophantically or not, models themselves never are sycophants; they can only be made more or less complacent. This reframing locates agency in developers and institutions, not in the model. Because complacent models reinforce users' prior beliefs, we argue that AI literacy educational approaches should particularly focus on strategies to counter confirmation bias.

2605.14543 2026-05-15 cs.LG cs.AI Version update

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

Shuhao Chen, Weisen Jiang, Changmiao Wang, Xiaoqing Wu, Xuanren Shi, Yu Zhang, James T. Kwok

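Affiliations * The Hong Kong University of Science and Technology; Southern University of Science and Technology; The Chinese University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; Shenzhen University General Hospital

AI summary RxEval is a prescription-level benchmark for evaluating the medication recommendation capability of large language models (LLMs), designed to address the shortcomings of existing benchmarks on fine-grained medication recommendation. Using multiple-choice questions, it requires a model to select specific medication-dose-route combinations from real prescriptions and generated distractors, given a detailed patient profile and a time-ordered clinical trajectory. Experiments show that RxEval discriminates well between models and that even state-of-the-art models still struggle to understand and reason over real clinical information.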

Details
English abstract

Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
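
The gap between the F1 range and the best Exact Match is easier to read with the two metrics side by side. The sketch below uses a common set-based definition of both; the abstract does not spell out RxEval's exact scoring, so this definition, the option encoding, and the example triples are assumptions for illustration only.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (medication, dose, route), hypothetical encoding


def set_f1(pred: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """Set-based F1 between selected and reference options."""
    pred_set, gold_set = set(pred), set(gold)
    if not pred_set and not gold_set:
        return 1.0
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def exact_match(pred: Iterable[Triple], gold: Iterable[Triple]) -> bool:
    """True only if the selected set equals the reference set."""
    return set(pred) == set(gold)


# Hypothetical question: two of three selections are right, one is a distractor.
gold = [("drug_a", "5 mg", "oral"), ("drug_b", "1 g", "iv"), ("drug_c", "10 mg", "oral")]
pred = [("drug_a", "5 mg", "oral"), ("drug_b", "1 g", "iv"), ("drug_c", "20 mg", "oral")]

print(set_f1(pred, gold))       # 2/3: partial credit
print(exact_match(pred, gold))  # False: one wrong dose fails the whole question
```

Under this definition a single wrong dose still earns partial F1 credit but no Exact Match credit, which is one way a model can post a high F1 and a much lower Exact Match.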

2605.14542 2026-05-15 cs.AI Version update

VerbalValue: A Socially Intelligent Virtual Host for Sales-Driven Live Commerce

Yuyan Chen

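Affiliations * Cornell University

AI summary This work presents VerbalValue, a virtual live-commerce host that aims to raise sales conversion through stronger verbal ability. Its core components are a product knowledge base and a sales terminology lexicon, a collected and annotated set of live-commerce interactions, and a large language model fine-tuned on this data to generate more empathetic and persuasive responses. Experiments show that the model outperforms several mainstream LLMs in informativeness, factual correctness, and viewer engagement, showing notable potential for commercial application.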

Comments Accepted to the CVPR 2026 HiGen Workshop

Details
English abstract

A skilled live-commerce host is not merely a narrator, but a sales agent who converts viewer curiosity into purchase intent through expert product knowledge, emotionally intelligent response tactics, and entertainment that serves as a vehicle for product exposure. Yet no existing AI system replicates this: conversational recommenders treat recommendation as a terminal act, while general-purpose LLMs hallucinate product claims and default to generic promotional templates that fail to engage or persuade. We present VerbalValue, a sales-conversion-oriented virtual host that turns exceptional verbal ability into real commercial value, built on three contributions. First, we construct a domain knowledge base of product specifications and a curated sales terminology lexicon that anchor product-related responses in verified expertise. Second, we collect and annotate 1,475 live-commerce interactions spanning diverse viewer intents. Third, we fine-tune a large language model on this data to deliver empathetic, commercially oriented responses, adapting to viewer intent through empathetic amplification, evidence-backed rebuttal, and humor-mediated deflection. Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 23% on informativeness and 18% on factual correctness, with consistent advantages in tactfulness and viewer engagement.

2605.14537 2026-05-15 cs.AI Version update

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

Robert Müller, Clemens Müller

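AI summary This paper introduces Cattle Trade, a multi-agent benchmark for evaluating the strategic reasoning of large language models under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trades, bargaining, bluffing, opponent modeling, and resource allocation in a single long-horizon game lasting 50 to 60 turns, testing whether agents can make integrated decisions under competing objectives. The study finds that strategic coherence, resource discipline, and phase adaptation are associated with rank more strongly than any single subskill or total spending volume, and it surfaces recurring LLM failure modes in the game.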

Comments malgai workshop at ICLR 2026

Details
English abstract

We introduce Cattle Trade, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50--60 turns. Unlike prior agent benchmarks that test these abilities in isolation, Cattle Trade evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.

2605.14534 2026-05-15 cs.CV cs.AI cs.MM Version update

PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

Fuhao Li, Shaofeng You, Jiagao Hu, Yu Liu, Yuxuan Chen, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan

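Affiliations * MiLM Plus, Xiaomi Inc.

AI summary Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many and existing metrics often disagree with human perception. To address this, the paper proposes the RC (Removal Coherence) metrics, RC-S and RC-T, which measure the perceptual coherence of removed regions in the spatial and temporal dimensions respectively, and builds the PROVE-Bench benchmark to support community evaluation. Experiments show that the RC metrics align with human perception more strongly than existing methods across a range of image and video benchmarks.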

Comments Project Page: https://xiaomi-research.github.io/prove/

Details
English abstract

Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for RC metrics and PROVE-Bench are publicly available at: https://github.com/xiaomi-research/prove/.

2605.14517 2026-05-15 cs.CL cs.AI Version update

Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

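Gang Peng

Affiliations * Huizhou Lateni AI Technology Co., Ltd.; Huizhou University

AI summary This study proposes a dimension-level intent fidelity evaluation framework for assessing, at a finer grain, how well large language models recover structural form and preserve user intent. Through a structured prompt ablation study, it analyses 2,880 outputs across three languages, three task domains, and six models, revealing a systematic gap between holistic scores and dimension-level intent deficits. The experiments indicate that holistic evaluation alone can mask a model's shortfalls on specific intents, whereas dimension-level evaluation tracks output quality more faithfully, making it an important complement when evaluating models on user-specific tasks.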

Comments Preprint. 30 tasks, 3 languages, 6 LLMs, 2,880 outputs; includes human evaluation and structured prompt ablation

Details
English abstract

Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs, that separately measures structural recovery and intent fidelity for each semantic dimension. This framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits; among English-language outputs, this proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that dimensional fidelity scores track human judgements more reliably than holistic scores do. A public-private decomposition of 2,520 ablation cells characterises when models successfully compensate for missing intent and when they fail, while proxy annotation distinguishes prior inferability from default recoverability. A weight-perturbation experiment shows that moderate misalignment is typically absorbed, whereas severe dimensional inversion is consistently harmful. These findings demonstrate that dimension-level intent fidelity evaluation is a necessary complement to holistic assessment when evaluating LLM outputs for user-specific tasks.

2605.14513 2026-05-15 cs.CV cs.AI Version update

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

Xuzhe Zheng, Yuexiao Ma, Jing Xu, Xiawu Zheng, Rongrong Ji, Fei Chao

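Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

AI summary This paper proposes HASTE, a training-free method for accelerating video diffusion. It targets two factors that limit the speed-quality trade-off of existing sparse attention for video generation: the cost of mask prediction and the use of shared thresholds across attention heads. The method introduces a head-wise adaptive framework with two modules, Temporal Mask Reuse and Error-guided Budgeted Calibration, which reduce mask prediction overhead and allocate sparsity per attention head. Experiments show that HASTE substantially speeds up inference while maintaining video quality.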

Details
English abstract

Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to 1.93 times speedup at 720P while maintaining competitive video quality and similarity metrics.
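
To make the role of per-head top-$p$ thresholds concrete, the sketch below selects, for one query, the smallest set of keys whose attention mass reaches a threshold. It is a generic illustration of top-$p$ sparsification on plain Python lists, not the HASTE implementation: the attention weights and thresholds are made up, and the error-guided calibration that would choose each head's threshold is not shown.

```python
from typing import Dict, List


def top_p_mask(probs: List[float], p: float) -> List[int]:
    """Indices of the fewest keys whose attention weights sum to at least p."""
    # Visit keys from largest to smallest weight (ties keep their original order).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


# Hypothetical attention rows for one query in two heads.
heads: Dict[str, List[float]] = {
    "peaked_head": [0.5, 0.3, 0.1, 0.1],
    "flat_head": [0.25, 0.25, 0.25, 0.25],
}

# A shared threshold keeps a different number of keys in each head.
for name, row in heads.items():
    print(name, top_p_mask(row, 0.7))  # peaked_head [0, 1]; flat_head [0, 1, 2]

# Hypothetical per-head thresholds, standing in for calibrated values.
per_head_p = {"peaked_head": 0.85, "flat_head": 0.5}
for name, row in heads.items():
    print(name, top_p_mask(row, per_head_p[name]))  # [0, 1, 2] and [0, 1]
```

The same threshold yields different sparsity in a peaked head and a flat head, which is the kind of head-level heterogeneity that motivates assigning thresholds per head under a global sparsity budget.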

2605.14512 2026-05-15 cs.IR cs.AI Version update

Asymmetric Generative Recommendation via Multi-Expert Projection and Multi-Faceted Hierarchical Quantization

Bin Huang, Xin Wang, Junwei Pan, Yongqi Zhou, Yifeng Zhou, Zhixiang Feng, Shudong Huang, Haijie Gu, Wenwu Zhu

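Affiliations * DCST, Tsinghua University; DCST, BNRist, Tsinghua University; Tencent

AI summary This paper addresses the input and output bottlenecks of Generative Recommendation (GenRec) models and proposes AsymRec, an asymmetric continuous-discrete framework. Multi-expert Semantic Projection (MSP) enriches the semantics of input representations, and Multi-faceted Hierarchical Quantization (MHQ) builds more precise structured output targets, mitigating popularity bias and the loss of fine-grained semantics. Experiments show that AsymRec outperforms existing generative recommenders by an average of 15.8%.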

Details
English abstract

Generative Recommendation (GenRec) models reformulate recommendation as a sequence generation task, representing items as discrete Semantic IDs used symmetrically as both inputs and prediction targets. We identify a critical dual-stage information bottleneck in this design: (1) the Input Bottleneck, where lossy quantization degrades fine-grained semantics, while popularity bias skews the learned representations toward frequent items, and (2) the Output Bottleneck, where imprecise discrete targets limit supervision quality. To address these issues, we propose AsymRec, an asymmetric continuous-discrete framework that decouples input and output representations. Specifically, Multi-expert Semantic Projection (MSP) maps continuous embeddings into the Transformer's hidden space via expert-specialized projections, preserving semantic richness and improving generalization to infrequent items. Multi-faceted Hierarchical Quantization (MHQ) constructs high-capacity, structured discrete targets through multi-view and multi-level quantization with semantic regularization, preventing dimensional collapse while retaining fine-grained distinctions. Extensive experiments demonstrate that AsymRec consistently outperforms state-of-the-art generative recommenders by an average of 15.8%. The code will be released.

2605.14502 2026-05-15 eess.SY cs.AI cs.SY Version update

Quantifying Cyber-Vulnerability in Power Electronics Systems via an Impedance-Based Attack Reachable Domain

Hongwei Zhen, Ze Yu, Xin Xiang, Wuhua Li, Mingyang Sun

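Affiliations * IEEE

AI summary This letter studies how to quantify the vulnerability of power electronics systems under cyber attack. It proposes an impedance-based Attack Reachable Domain (ARD) framework for assessing how far a node can be pushed toward instability under privilege constraints. The method maps feasible attack actions to critical-eigenvalue migration through impedance reshaping and defines an Attack Penetration Index that jointly characterizes the penetration of the stability margin and the accessibility of successful attacks. For cases where inverter models are unavailable, a practical gray-box workflow combines existing impedance identification with differentiable surrogate tools. Case studies show that the metric reveals vulnerability patterns that conventional grid-strength indicators do not capture.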

Details
English abstract

Power electronics systems are increasingly exposed to cyber threats due to their integration with digital controllers and communication networks. However, an attacker-oriented metric is still lacking to quantify the extent to which a node can be pushed toward instability within a privilege-constrained action space. This letter proposes an impedance-based Attack Reachable Domain (ARD) framework that maps feasible adversarial actions to critical-eigenvalue migration through impedance reshaping. Based on the ARD, an Attack Penetration Index is defined to quantify node-level cyber-vulnerability by jointly characterizing the penetration of the nominal stability margin and the accessibility of successful destabilizing attacks within a privilege-constrained action space. To make the proposed assessment computable when inverter models are unavailable, a practical gray-box workflow is further established by integrating existing impedance identification and differentiable surrogate tools. Case studies on a 4-bus system and a modified IEEE 39-bus system show that coordinated cross-layer manipulations are markedly more damaging than isolated single-layer attacks, and that the proposed metric reveals vulnerability patterns that cannot be inferred from grid-strength indicators.

2605.14501 2026-05-15 eess.SY cs.AI cs.LG cs.SY Version update

Fully Dynamic Rebalancing in Dockless Bike-Sharing Systems via Deep Reinforcement Learning

Edoardo Scarpel, Alberto Pettena, Matteo Cederle, Federico Chiariotti, Marco Fabris, Gian Antonio Susto

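Affiliations * University of Padua

AI summary This paper proposes a fully dynamic rebalancing method based on deep reinforcement learning for dockless bike-sharing systems. It models the service with a graph-based simulator, casts rebalancing as a Markov decision process, and uses a deep reinforcement learning agent to route a single truck in real time, executing localized pick-up, drop-off, and charging actions guided by spatiotemporal criticality scores. Results on real-world data show significant reductions in availability failures while limiting spatial inequality and mobility deserts, demonstrating the value of learning-based rebalancing for efficient and reliable shared micromobility.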

Comments 6 pages, 5 figures, 1 table, accepted at the 23rd IFAC World Congress, Busan, South Korea, Aug. 23-26, 2026. Open invited track 9-131: "Control and Optimization for Smart Cities"

Details
English abstract

This paper proposes a fully dynamic Deep Reinforcement Learning (DRL) method for rebalancing dockless bike-sharing systems, overcoming the limitations of periodic, system-wide interventions. We model the service through a graph-based simulator and cast rebalancing as a Markov decision process. A DRL agent routes a single truck in real time, executing localized pick-up, drop-off, and charging actions guided by spatiotemporal criticality scores. Experiments on real-world data show significant reductions in availability failures with a minimal fleet size, while limiting spatial inequality and mobility deserts. Our approach demonstrates the value of learning-based rebalancing for efficient and reliable shared micromobility.

2605.14497 2026-05-15 cs.LG cs.AI Version update

ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization

Letian Yang, Xu Liu, Yiqiang Lu, Jian Liu, Weiqiang Wang, Shuai Li

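Affiliations * Shanghai Jiao Tong University; Ant Group

AI summary This paper proposes ROAD, an offline-to-online reinforcement learning framework that adapts data mixing through bi-level optimization to handle the non-stationary distribution shift between offline data and the online policy. It formulates data selection as a bi-level optimization process in which the outer level optimizes policy performance and the inner level performs conventional Q-learning updates, and it uses a multi-armed bandit mechanism to replay data dynamically. Experiments show that ROAD outperforms existing methods across multiple datasets and achieves better stability and asymptotic performance without manual tuning.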

Comments 20 pages, 9 figures, 7 tables. Accepted to IJCAI 2026

Details
English abstract

Offline-to-online reinforcement learning harnesses the stability of offline pretraining and the flexibility of online fine-tuning. A key challenge lies in the non-stationary distribution shift between offline datasets and the evolving online policy. Common approaches often rely on static mixing ratios or heuristic-based replay strategies, which lack adaptability to different environments and varying training dynamics, resulting in a suboptimal trade-off between stability and asymptotic performance. In this work, we propose Reinforcement Learning with Optimized Adaptive Data-mixing (ROAD), a dynamic plug-and-play framework that automates the data replay process. We identify a fundamental objective misalignment in existing approaches. To tackle this, we formulate the data selection problem as a bi-level optimization process, interpreting the data mixing strategy as a meta-decision governing the policy performance (outer-level) during online fine-tuning, while the conventional Q-learning updates operate at the inner level. To make it tractable, we propose a practical algorithm using a multi-armed bandit mechanism. This is guided by a surrogate objective approximating the bi-level gradient, which simultaneously maintains offline priors and prevents value overestimation. Our empirical results demonstrate that this approach consistently outperforms existing data replay methods across various datasets, eliminating the need for manual, context-specific adjustments while achieving superior stability and asymptotic performance.
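
The idea of treating the mixing ratio as a bandit arm can be sketched with a textbook UCB1 loop. This is a generic illustration rather than ROAD itself: the candidate ratios are arbitrary, and the reward function is a hypothetical deterministic stand-in for the surrogate objective described in the abstract, which the abstract does not specify.

```python
import math
from typing import Callable, List


def ucb1_select(counts: List[int], means: List[float], t: int) -> int:
    """Pick an untried arm first, otherwise the arm with the highest UCB1 index."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def run_bandit(ratios: List[float], reward_fn: Callable[[float], float], rounds: int) -> List[int]:
    """Return how often each offline-data mixing ratio was chosen."""
    counts = [0] * len(ratios)
    means = [0.0] * len(ratios)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = reward_fn(ratios[arm])  # in practice: a signal measured after an update
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return counts


# Hypothetical setup: four candidate ratios; the toy reward peaks at 0.25.
ratios = [0.0, 0.25, 0.5, 0.75]


def toy_reward(ratio: float) -> float:
    return 1.0 - abs(ratio - 0.25)


counts = run_bandit(ratios, toy_reward, rounds=1000)
best_ratio = ratios[counts.index(max(counts))]
print(counts, best_ratio)
```

Because the bandit keeps re-estimating each arm, the preferred ratio can shift as training progresses, which a static mixing ratio cannot do; handling a reward that drifts over time would need a non-stationary variant that this sketch does not include.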

2605.14495 2026-05-15 cs.MM cs.AI Version update

Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification

Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Hoang-Loc Cao, Phuc Ho, Van Pham, Hung Cao

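Affiliations * University of New Brunswick; University of Science, VNU-HCM

AI summary Targeting multimedia verification, which demands both accuracy and transparency, this work proposes a contestable multi-agent framework that combines multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation. The method decomposes each case into claim-centered sections, retrieves targeted evidence, and generates support and attack arguments with provenance and strength scores. It resolves conflicts and handles uncertainty through local argument graphs, producing verification reports that are clearly structured, editable, and computationally practical.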

Comments ACM ICMR 2026 Grand Challenge on Multimedia Verification

Details
English abstract

Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim-centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The resulting system generates section-wise verification reports that are transparent, editable, and computationally practical for real-world multimedia verification. Our implementation is public at: https://github.com/Analytics-Everywhere-Lab/MV2026_the_liems.

2605.14494 2026-05-15 cs.AI cs.LG Version update

Learning Scenario Reduction for Two-Stage Robust Optimization with Discrete Uncertainty

Tianjue Lin, Jianan Zhou, Jieyi Bi, Yaoxin Wu, Wen Song, Zhiguang Cao, Jie Zhang

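Affiliations * Nanyang Technological University; Eindhoven University of Technology; Shandong University; Singapore Management University

AI summary This paper studies two-stage robust optimization with discrete uncertainty, which is hard to solve because of its high computational cost. The authors propose NeurPRISE, a neural surrogate model built on a graph neural network and Transformer backbone, which uses imitation learning to learn the scenario selection strategy of PRISE, a problem-driven scenario reduction method, greatly improving computational efficiency while preserving solution quality. Experiments on several two-stage robust optimization problems show good performance and scalability, along with strong zero-shot generalization.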

Details
English abstract

Two-Stage Robust Optimization (2RO) with discrete uncertainty is challenging, often rendering exact solutions prohibitive. Scenario reduction alleviates this issue by selecting a small, representative subset of scenarios to enable tractable computation. However, existing methods are largely problem-agnostic, operating solely on the uncertainty set without consulting the feasible region or recourse structure. In this paper, we introduce PRISE, a problem-driven sequential lookahead heuristic that constructs reduced scenario sets by evaluating the marginal impact of each scenario. While PRISE yields high-quality scenario subsets, each selection step requires solving multiple subproblems, making it computationally expensive at scale. To address this, we propose NeurPRISE, a neural surrogate model built on a GNN-Transformer backbone that encodes the per-scenario structure via graph convolution and captures cross-scenario interactions through attention. NeurPRISE is trained via imitation learning with a gain-aware ranking objective, which distills marginal gain information from PRISE into a learned scoring function for scenario ranking and selection. Extensive results on three 2RO problems show that NeurPRISE consistently achieves competitive regret relative to comprehensive methods, maintains strong scalability with varying numbers of scenarios, and delivers 7-200x speedup over PRISE. NeurPRISE also exhibits strong zero-shot generalization, effectively handling instances with larger problem scales (up to 5x), more scenarios (up to 4x), and distribution shifts.

2605.14488 2026-05-15 cs.AI Version update

Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)

Assaf Gerner, Netta Madvil, Nadav Barak, Alex Zaikman, Jonatan Liberman, Liron Hamra, Rotem Brazilay, Shay Tsadok, Yaron Friedman, Neal Harow, Noam Bresler, Shir Chorev, Philip Tannor, Lior Rokach

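Affiliations * Deepchecks, Ramat Gan, Israel; Ben-Gurion University, Beer Sheva, Israel

AI summary This paper introduces Deepchecks, a comprehensive framework for evaluating Retrieval-Augmented Generation (RAG) systems. The framework tackles the challenges of RAG evaluation through a multi-faceted approach, root cause analysis, and production monitoring, and aims to keep evaluation aligned with application-specific requirements so that reliability, relevance, and user satisfaction can be assessed.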

Details
English abstract

Large Language Models (LLMs) augmented with Retrieval-Augmented Generation (RAG) techniques are revolutionizing applications across multiple domains, such as healthcare, finance, and customer service. Despite their potential, evaluating RAG systems remains a complex challenge due to the stochastic nature of generated outputs and the intricate interplay between retrieval and generation components. This paper introduces Deepchecks, a comprehensive framework tailored for evaluating RAG applications. Deepchecks addresses the evaluation of RAG applications through a multi-faceted approach, root cause analysis, and production monitoring. By ensuring alignment with application-specific requirements, the Deepchecks framework provides a robust foundation for assessing reliability, relevance, and user satisfaction in RAG systems.

2605.14487 2026-05-15 cs.CV cs.AI Version update

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

Jiahao Tian, Yiwei Wang, Gang Yu, Chi Zhang

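Affiliations * AGI Lab, Westlake University; University of California at Merced; StepFun

AI summary This paper studies error accumulation and context loss in long-horizon autoregressive video generation and proposes Head Forcing, a training-free framework. The method identifies functionally distinct attention heads in the diffusion transformer and assigns tailored key-value cache strategies to local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation. Experiments show that, without additional training cost, it substantially extends video generation length, supports multi-prompt interactive synthesis, and outperforms existing baselines.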

Details
English abstract

Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: https://jiahaotian-sjtu.github.io/headforcing.github.io/.

2605.14483 2026-05-15 cs.AI Version update

LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

Xudong Chen, Yixin Liu, Hua Wei, Kaize Ding

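Affiliations * GitHub

AI summary LEMON is an LLM-based multi-agent orchestrator trained with counterfactual reinforcement learning to generate executable multi-agent orchestration specifications. By integrating task-specific roles, duty assignment, capacity levels, and dependency structure into one specification, it targets both execution efficiency and solution quality. On six reasoning and coding benchmarks, LEMON achieves the best performance among the evaluated multi-agent orchestration methods.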

Comments Submitted to NeurIPS 2026

Details
English abstract

Large language models (LLMs) have become a strong foundation for multi-agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution-level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (Learning Executable Multi-agent OrchestratioN via Counterfactual Reinforcement Learning), an LLM-based orchestrator that generates an executable orchestration specification. The specification integrates task-specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration-level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks, including MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval, show that LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods. Our code is available at https://anonymous.4open.science/r/LEMON-B23C.

2605.14478 2026-05-15 cs.SE cs.AI cs.CL Version update

When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

Haojun Weng, Qianqian Yang, Hao Fu, Haobin Pan, Xinwei Lv

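Affiliations * Independent Researcher, California, USA; Independent Researcher, Beijing, China

AI summary This study examines how stale code snippets in retrieval-augmented code generation can harm code completion. In a controlled experiment on 17 production-helper signature changes from five Python repositories, it finds that retrieving only stale snippets strongly induces models to generate code incompatible with the current repository state, while using no retrieval at all avoids stale references but yields almost no passing completions. The experiments also show that adding valid current evidence largely mitigates the problem, indicating that the temporal validity of retrieved context is an important factor in the robustness of code RAG.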

Comments 31 pages, 2 tables. Submitted to Information and Software Technology (Elsevier)

Details
English abstract

Context: Retrieval-augmented code generation relies on cross-file repository context, but retrieved snippets may come from obsolete project states. Objectives: We study whether temporally stale repository snippets act as harmless noise or actively induce current-state-incompatible code. Methods: We conduct a controlled diagnostic study on a curated 17-sample set of production-helper signature changes from five Python repositories. For each sample, we compare current-only, stale-only, no-retrieval, and mixed current/stale retrieval conditions under prompts that hide commit freshness and expected current signatures. Results: Under neutralized prompts, stale-only retrieval induces stale helper references on 15/17 Qwen2.5-Coder-7B-Instruct samples and 13/17 gpt-4.1-mini samples, corresponding to 88.2 and 76.5 percentage-point increases over current-only retrieval. No retrieval produces zero stale references but only 1/17 passing completions. The two models share 75.0% Jaccard overlap among stale-triggering samples, and mixed conditions show that adding valid current evidence largely rescues stale-only failures. Conclusion: Temporal validity of retrieved repository context is a distinct diagnostic variable for Code RAG robustness: stale context can actively bias models toward obsolete repository state rather than merely removing useful evidence.
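
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below rests on two readings that are inferred from the numbers and not stated in the abstract: that current-only retrieval produced no stale references (so the percentage-point increases equal the stale-only rates), and that the Jaccard overlap is computed between the two models' sets of stale-triggering samples. The variable names are illustrative.

```python
from fractions import Fraction

total_samples = 17
stale_qwen = 15  # stale-only samples with stale helper references (Qwen2.5-Coder-7B-Instruct)
stale_gpt = 13   # same count for gpt-4.1-mini

# Stale-reference rates in percent, rounded as in the abstract.
rate_qwen = round(100 * stale_qwen / total_samples, 1)
rate_gpt = round(100 * stale_gpt / total_samples, 1)
print(rate_qwen, rate_gpt)  # 88.2 76.5


def jaccard_from_sizes(size_a: int, size_b: int, overlap: int) -> Fraction:
    """|A ∩ B| / |A ∪ B| given the two set sizes and the overlap size."""
    return Fraction(overlap, size_a + size_b - overlap)


# Both sets live inside 17 samples, so the overlap is at least 15 + 13 - 17 = 11.
min_overlap = max(0, stale_qwen + stale_gpt - total_samples)
max_overlap = min(stale_qwen, stale_gpt)

# Which overlap sizes give exactly 75% Jaccard?
matching = [
    k
    for k in range(min_overlap, max_overlap + 1)
    if jaccard_from_sizes(stale_qwen, stale_gpt, k) == Fraction(3, 4)
]
print(matching)  # [12]: 12 shared samples out of a union of 16
```

Under these readings, the two models fail on 12 of the same samples, and 16 of the 17 samples trigger a stale reference in at least one model.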

2605.14465 2026-05-15 cs.AI Version update

From Table to Cell: Attention for Better Reasoning with TABALIGN

Tung Sum Thomas Kwok, Zeyong Zhang, Xinyu Wang, Chunhe Wang, Xiaofeng Lin, Hanwei Wu, Lei Ding, Guang Cheng, Zhijiang Guo

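Affiliations * University of California, Los Angeles; New Jersey Institute of Technology; McGill University; Université de Montréal; University of Manitoba; SimpleWay; The Hong Kong University of Science and Technology (Guangzhou)

AI summary This work targets multi-step reasoning over structured tables and proposes TABALIGN, a framework that addresses the lack of an explicit cell-grounding contract between planning and execution. It pairs a masked diffusion language model (DLM) planner, whose bidirectional denoising emits plan steps as binary cell masks, with TABATTN, a lightweight verifier trained on human-verified attention standards that scores each step. Experiments show that TABALIGN substantially improves reasoning accuracy across multiple benchmarks and speeds up downstream reasoning execution.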

Details
English abstract

Multi-step LLM reasoning over structured tables fails because planning and execution share no explicit cell-grounding contract. Existing methods constrain the planner to a left-to-right factorization at odds with table permutation invariance, and score intermediate states by generated content alone, overlooking cell grounding. We conduct a pilot study showing that diffusion language models (DLMs) produce more human-aligned and permutation-stable cell attention on tables than autoregressive models, with a 40.2% median reduction in attention-AUROC variability under row reordering. Motivated by this, we propose TABALIGN, a planned table reasoning framework that operationalizes the contract. TABALIGN pairs a masked DLM planner, whose bidirectional denoising emits plan steps as binary cell masks, with TABATTN, a lightweight verifier trained on 1,600 human-verified attention standards to score each step by its attention overlap with the plan-designated mask. Across eight benchmarks covering table question answering and fact verification, TABALIGN improves average accuracy by 15.76 percentage points over the strongest open-source baseline at comparable 8B-class scale, with a matched-backbone ablation attributing 2.87 percentage points of this gain to the DLM planner over an AR planner on a fixed reasoner. Cleaner DLM plans also accelerate downstream reasoning execution by 44.64%.

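2605.14458 2026-05-15 cs.AI Version update

OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance

Yeo Jeong Park, Hyemi Jang, Minseo Choi, Jongsun Lee, Jooyoung Choi, Yongkweon Jeon

Affiliations * Samsung Research

AI summary OmniDrop is a layer-wise token pruning method for omni-modal large language models that targets the token explosion caused by high-resolution audio and video inputs. Rather than pruning at the input embedding level, it prunes progressively within the decoder layers, which better preserves omni-modal information fusion, and it uses the text query to guide pruning for task adaptivity. Experiments show that OmniDrop performs strongly on multiple benchmarks while reducing prefill latency and memory usage.

Details
English abstract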

Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio and video inputs remains a critical bottleneck for real-time applications and long-form reasoning. Existing omni-modal token compression methods typically prune tokens at the input embedding level, relying on audio-video similarity or temporal co-occurrence as proxies for semantic relevance. In practice, such assumptions are often unreliable. To address this limitation, we propose OmniDrop, a training-free, layer-wise token pruning framework that progressively prunes audiovisual tokens within the LLM decoder layers rather than at the input-level, allowing early layers to preserve sufficient omni-modal information fusion before aggressively removing tokens in deeper layers. We further utilize text queries as guidance for modality-agnostic and task-adaptive token pruning. We also introduce a temporal diversity score that encourages balanced token survival to preserve global temporal context. Experimental results across various audiovisual benchmarks demonstrate that OmniDrop outperforms all baselines by up to 3.58 points while reducing prefill latency by up to 40% and memory usage by up to 14.7%.

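2605.14455 2026-05-15 cs.AI cs.LG Version update

Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact

Chandan Rajah, Neha Sengupta, Federico Castanedo, Robin Mills, Amit Bahree, Ramesh Krishnan Muthukrishnan, Larry Murray

Affiliations * Inception G42

AI summary This paper proposes the Intelligence Impact Quotient (IIQ), a composite metric for quantifying how deeply AI systems are integrated into organizational workflows and what impact they have. IIQ combines a novelty-weighted token stock, usage frequency, recency, organizational leverage, task complexity, and autonomy to produce a raw Intelligence Adoption Index (IAI) and a normalized 0-1000 IIQ index for comparing different users and units. The framework is intended as a trackable measure of AI deployment in workflows, not as a direct measure of model capability or a substitute for causal productivity evaluation.

Details
English abstract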

The Intelligence Impact Quotient (IIQ) is a composite metric intended to quantify the depth to which AI systems are integrated into organizational work and their impact. Rather than treating access counts or aggregate token volume as sufficient evidence of impact, IIQ combines a novelty-weighted, time-decayed token stock with usage frequency, a grace-period recency gate, organizational leverage, task complexity, and autonomy. The formulation produces a raw Intelligence Adoption Index (IAI) and a normalized 0-1000 IIQ index for comparison between heterogeneous users and units. We also derive sub-daily update rules and a bounded interpretation layer for estimated efficiency and financial impact. The paper positions IIQ as a deployment-oriented measurement framework: a formal proposal for tracking AI embedding in workflows, not a direct measure of model capability or a substitute for causal productivity evaluation. Synthetic scenarios illustrate how the revised metric distinguishes between frequent low-leverage use, semantically repetitive prompting, and more autonomous, higher-consequence AI-assisted work.

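2605.14449 2026-05-15 cs.LG cs.AI cs.CL Version update

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

Siyang Yao, Erhu Feng, Yubin Xia

Affiliations * Shanghai Jiao Tong University

AI summary This paper studies hallucination detection in large language models and proposes QAOD, a single-pass framework that projects the question-aligned part out of the answer representation and keeps the question-orthogonal component to suppress domain-conditioned variation. Combining diversity-penalized Fisher scoring with discriminative neuron selection, it designs two complementary probing strategies, one for in-domain detection and one for cross-domain generalization. It performs strongly on multiple benchmarks and clearly outperforms existing methods in cross-domain settings.

Details
English abstract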

Hallucination detection in large language models (LLMs) requires balancing accuracy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are efficient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the orthogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD's joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.
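The projection step described in this abstract can be illustrated with plain vector algebra. The sketch below is an assumption-laden illustration, not the authors' code: it treats the question and answer representations as single vectors, and the names `dot` and `question_orthogonal` are hypothetical. How QAOD pools hidden states into such vectors is not stated in the abstract.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def question_orthogonal(answer, question):
    """Remove from `answer` its component along `question` (hypothetical helper)."""
    qq = dot(question, question)
    if qq == 0.0:
        return list(answer)  # zero question vector: nothing to project away
    scale = dot(answer, question) / qq
    return [a - scale * q for a, q in zip(answer, question)]


# Toy vectors standing in for pooled hidden states (assumed, not from the paper).
q = [1.0, 2.0, 0.0]
a = [3.0, 1.0, 4.0]
a_perp = question_orthogonal(a, q)
```

By construction the result has zero inner product with the question vector, which is the property the abstract relies on to suppress question-conditioned variation.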

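2605.14443 2026-05-15 cs.AI cs.LG cs.MA Version update

Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience

Krishna Sayana, Ketan Todi, Ambarish Jash

Affiliations * Google Research

AI summary This work addresses prompt engineering for frozen, black-box large language models (LLMs) and proposes a reinforcement learning framework that trains a learned prompting policy through iterative distillation of experience. Using a contrastive experience buffer that couples scalar rewards with dense textual critiques, a lightweight prompter model is optimized to maximize task reward, amortizing iterative prompt refinement into single-shot policy weights. Experiments show substantial gains on multi-step reasoning and tool-use tasks, with higher sample efficiency than existing evolutionary baselines.

Comments 10 pages and reference, appendix

Details
English abstract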

The shift toward interacting with frozen, "black-box" Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task-specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single-shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau-bench suites, covering a diverse range of multi-step reasoning and tool-use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic-intensive reasoning and 74% to 91% in tool-use tasks. Furthermore, we analyze the structural evolution of prompts, demonstrating how the policy discovers specialized algorithmic heuristics. We provide comprehensive comparisons against state-of-the-art evolutionary baselines like GEPA, showing that iterative distillation achieves superior performance with higher sample efficiency.

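2605.14440 2026-05-15 cs.AI cs.FL cs.LO Version update

Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning

Debraj Chakraborty, Anirban Majumdar, Prince Mathew, Sayan Mukherjee, Jean-François Raskin

Affiliations * Nanyang Technological University, Singapore; Tata Institute of Fundamental Research, Mumbai, India; Université Libre de Bruxelles, Brussels, Belgium; IITB Trust Lab, Department of CSE, IIT Bombay, Mumbai, India

AI summary This paper studies how to synthesize policies with formal guarantees for partially observable Markov decision processes (POMDPs). Because sampling-based methods lack formal correctness guarantees and formal synthesis methods scale poorly, it proposes a framework that integrates sampling, automata learning, and model checking. Inspired by Angluin's $L^*$ algorithm, the method uses sampling for membership queries and model checking for equivalence queries, synthesizes finite-state controllers when the sampling-induced policy is regular, and comes with a relative completeness result. Experiments show that it handles threshold-safety problems that existing tools struggle with.

Comments Paper accepted at 38th International Conference on Computer Aided Verification (CAV 2026), Lisbon, Portugal, July 2026

Details
English abstract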

Partially Observable Markov Decision Processes (POMDPs) are the standard framework for decision-making under uncertainty. While sampling-based methods scale well, they lack formal correctness guarantees, making them unsuitable for safety-critical applications. Conversely, formal synthesis techniques provide correctness-by-construction but often struggle with scalability, as general POMDP synthesis is undecidable. To bridge this gap, we propose a synthesis framework that integrates sampling, automata learning, and model-checking. Inspired by Angluin's $L^*$ algorithm, our approach utilizes sampling as a membership oracle and model-checking as an equivalence oracle. This enables the synthesis of finite-state controllers with formal guarantees, provided the sampling-induced policy is regular. We establish a relative completeness result for this framework. Experimental results from our prototypical implementation demonstrate that this method successfully solves threshold-safety problems that remain challenging for existing formal synthesis tools. We believe our algorithm serves as a valuable component in a portfolio approach to tackling the inherent difficulty of POMDP synthesis problems.

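2605.14438 2026-05-15 cs.AI Version update

BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

Juntong Wu, Jialiang Cheng, Qishen Yin, Yue Dai, Yuliang Yan, Fuyu Lv, Ou Dan, Li Yuan

Affiliations * Shenzhen Graduate School, Peking University

AI summary BEAM (Binary Expert Activation Masking) is a dynamic routing method that aims to improve the inference efficiency of Mixture-of-Experts (MoE) large language models. It learns token-adaptive expert selection through trainable binary masks and, with a straight-through estimator and an auxiliary regularization loss, induces expert sparsity during end-to-end training while maintaining model capability. Experiments show that BEAM retains over 98% of the original model's performance while substantially reducing MoE-layer computation and improving decoding speed and throughput, making it an efficient, easily integrated solution.

Comments 22 pages, 12 figures

Details
English abstract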

Mixture-of-Experts (MoE) architectures enhance the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, leading to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer from severe performance drop at high sparsity due to train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98% of the original model's performance while reducing MoE layer FLOPs by up to 85%, achieving up to 2.5× faster decoding and 1.4× higher throughput, demonstrating its effectiveness as a practical, plug-and-play solution for efficient MoE inference.

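2605.14434 2026-05-15 cs.IR cs.AI Version update

Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL

Jianbo Zhu, Xing Fang, Jing Wang, Mingmin Jin, Bokang Wang, Guangxin Song, Zhenyu Xie, Junjie Bai

Affiliations * Taobao & Tmall Group of Alibaba, Hangzhou, China

AI summary This work tackles the practical deployment of generative retrieval in e-commerce search and proposes an efficient generative recall framework built on semantic cluster IDs (CQ-SID) and expert-guided reinforcement learning (EG-GRPO), reducing search complexity and improving recall quality. CQ-SID combines category-aware and query-item contrastive learning with Residual Quantized VAEs to produce hierarchical semantic identifiers, which markedly shrinks the beam search size; EG-GRPO injects ground-truth samples to align generative recall with downstream ranking. Experiments show relative gains of up to 26.76% and 11.11% in semantic and personalized click hitrate, along with GMV and conversion gains in online A/B tests.

Details
English abstract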

Generative retrieval offers a promising alternative by unifying the fragmented multi-stage retrieval process into a single end-to-end model. However, its practical adoption in industrial e-commerce search remains challenging, given the massive and dynamic product catalogs, strict latency requirements, and the need to align retrieval with downstream ranking goals. In this work, we propose a retrieval framework tailored for real-world recall scenarios, positioning generative retrieval as a recall-stage supplement rather than an end-to-end replacement. Our method, CQ-SID (Category-and-Query constrained Semantic ID), employs category-aware and query-item contrastive learning along with Residual Quantized VAEs to encode items into hierarchical semantic cluster identifiers, significantly reducing beam search complexity. Additionally, we develop EG-GRPO (Expert-Guided Group Relative Policy Optimization), a reinforcement learning approach that aligns generative recall with downstream ranking under sparse rewards by injecting ground-truth samples to stabilize training. Offline experiments on TmallAPP search logs show that CQ-SID achieves up to 26.76% and 11.11% relative gains in semantic and personalized click hitrate over RQ-VAE baselines, while halving beam search size. EG-GRPO further improves multi-objective performance. Online A/B tests confirm gains in GMV (+1.15%) and UCTCVR (+0.40%). The generative recall channel now contributes substantially in production, accounting for over 50.25% of exposures, 58.96% of clicks, and 72.63% of purchases, demonstrating a viable path for deploying generative retrieval in real-world e-commerce systems.
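To make the idea of hierarchical semantic IDs concrete, the following is a minimal sketch of residual quantization with fixed toy codebooks. It is an illustration under stated assumptions: in an RQ-VAE the encoder and codebooks are learned, and the category and query constraints of CQ-SID are not modeled here. The function names and the toy codebooks are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to `vec` in squared Euclidean distance."""
    def sqdist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Encode `vec` as one code index per level, coarse to fine."""
    residual = list(vec)
    ids = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        ids.append(idx)
        # Subtract the chosen codeword so the next level encodes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(ids), residual


# Two-level toy codebooks (assumed values, for illustration only).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],   # level 0: coarse clusters
    [[0.0, 0.0], [1.0, -1.0]],    # level 1: refinements
]
ids_a, _ = residual_quantize([11.0, 9.0], codebooks)
ids_b, _ = residual_quantize([10.2, 9.9], codebooks)
```

Nearby items share a prefix of their ID tuples, so a decoder that generates IDs level by level can prune whole clusters early, which is one plausible reason such identifiers shrink the beam.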

2605.14426 2026-05-15 physics.ao-ph cs.AI Version update

A plug-and-play generative framework for multi-satellite precipitation estimation

Yunfan Yang, Haofei Sun, Xiuyu Sun, Wei Han, Xiaoze Xu, Xingtao Song, Jun Li, Zhiqiu Gao, Wei Huang, Hao Li

Affiliations * State Key Laboratory of Atmospheric Boundary Layer Physics and Atmospheric Chemistry; Institute of Atmospheric Physics, Chinese Academy of Sciences; Shanghai Academy of Artificial Intelligence for Science (SAIS); CMA Earth System Modeling and Prediction Centre (CEMC)

AI summary This work proposes PRISMA, a plug-and-play generative framework for multi-satellite precipitation estimation. It learns an unconditional precipitation prior from IMERG Final fields and constrains it with independently trained, sensor-specific conditional branches, so new sensor data can be incorporated without retraining the generative backbone. Experiments show gains in both accuracy and efficiency; in particular, fusing infrared and microwave observations raises the Critical Success Index and lowers root-mean-square error.

Details
English abstract

Reliable precipitation monitoring is essential for disaster risk reduction, water resources management, and agricultural decision-making. Multi-source satellite observations, particularly the combination of geostationary infrared and passive microwave measurements, have become a primary means of precipitation detection. Traditional multi-source satellite precipitation estimation methods remain computationally inefficient, and many deep learning methods lack the flexibility to incorporate new sensors without retraining the full model. Here we introduce PRISMA (Precipitation Inference from Satellite Modalities via generAtive modeling), a plug-and-play latent generative framework for multi-sensor precipitation estimation. PRISMA learns an unconditional precipitation prior from IMERG Final fields and constrains it through independently trained, sensor-specific conditional branches, allowing new observation sources to be incorporated without retraining the generative backbone. Applied to FY-4B AGRI infrared and GPM GMI microwave observations, PRISMA improves Critical Success Index by up to 40.3% and reduces root-mean-square error by 22.6% relative to infrared-only estimation within microwave swaths, while also improving probabilistic skill and maintaining an average inference time of about 37 s. Independent rain-gauge validation across China confirms consistent gains, and typhoon case studies show that microwave conditioning restores eyewall and spiral rainband structures, reducing storm-core mean absolute error by up to 42.3%. PRISMA thus provides an extensible and efficient framework for multi-sensor precipitation estimation.

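2605.14423 2026-05-15 cs.LG cs.AI Version update

Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic

Leo Muxing Wang, Pengkun Yang, Lili Su

Affiliations * Northeastern University; Tsinghua University

AI summary This paper studies collaborative yet personalized policy training in heterogeneous environments and proposes a single-timescale federated actor-critic framework. Agents share a common linear subspace representation while keeping personalized policy components, balancing collaborative optimization against personalization. The analysis establishes finite-time convergence with linear speedup in the number of agents, and experiments validate the approach on federated reinforcement learning tasks.

Details
English abstract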

Despite the popularity of the actor-critic method and the practical needs of collaborative policy training, existing works typically either overlook environmental heterogeneity or give up personalization altogether by training a single shared policy across all agents. We consider a federated actor-critic framework in which agents share a common linear subspace representation while maintaining personalized local policy components, and agents iteratively estimate the common subspace, local critic heads, and local policies (i.e., actors). Under canonical single-timescale updates with Markovian sampling, we establish finite-time convergence via a novel joint linear approximation framework. Specifically, we show that the critic error converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-\gamma)^4\sqrt{TK}))$, and the policy gradient norm converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-\gamma)^6\sqrt{TK}))$, where $T$ is the number of rounds, $K$ is the number of agents, and $\gamma \in (0,1)$ is the discount factor. These results demonstrate linear speedup with respect to the number of agents $K$, despite heterogeneous Markovian trajectories under distinct transition kernels and coupled learning dynamics. To address these challenges, we develop a new perturbation analysis for the projected subspace updates and QR decomposition steps, together with conditional mixing arguments for heterogeneous Markovian noise. Furthermore, to handle the additional complications induced by policy updates and temporal dependence, we establish fine-grained characterizations of the discrepancies between function evaluations under Markovian sampling and under temporally frozen policies. Experiments instantiate the framework within PPO on federated Hopper-v5 action-map heterogeneity, showing gains over Single PPO and FedAvg PPO and downstream transfer from the learned shared trunk.

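2605.14421 2026-05-15 cs.CR cs.AI Version update

MemLineage: Lineage-Guided Enforcement for LLM Agent Memory

Ciyan Ouyang, Rui Hou

Affiliations * State Key Laboratory of Cyberspace Security Defense, Institute of Information Engineering, CAS, Beijing, China

AI summary MemLineage is a defense for large language model (LLM) agent memory that attaches cryptographic provenance and an LLM-mediated derivation lineage to every memory entry. It treats memory management as a chain-of-custody problem, using a Merkle log and a directed acyclic graph (DAG) to record how each memory was derived, so that untrusted content cannot justify sensitive actions while useful recall is preserved. Experiments show that MemLineage performs well across several memory-poisoning scenarios, sharply reducing the attack success rate with very low overhead.

Comments 24 pages, 8 figures. Rui Hou is the corresponding author

Details
English abstract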

We introduce MemLineage, a defense for LLM agent memory that attaches both cryptographic provenance and LLM-mediated derivation lineage to every entry. Recent and concurrent work shows that untrusted content can be written into persistent agent state and re-enter later sessions as an instruction; the remaining systems question is how to preserve useful memory recall while preventing such state from justifying sensitive actions. MemLineage treats this as a chain-of-custody problem rather than a filtering problem. It is a six-module design around an RFC-6962 Merkle log over per-principal Ed25519-signed entries: a weighted derivation DAG records which retrieved entries influenced each new memory, and a max-of-strong-edges propagation rule makes Untrusted-Path Persistence hold for any chain whose attribution edges remain above threshold. The sensitive-action gate then refuses dispatches whose active justification descends from an external ancestor, while still allowing benign recall. We evaluate three defense cells against three memory-poisoning workloads on a deterministic mechanism-isolation harness; MemLineage is the only configuration in that harness that drives all three columns to zero ASR, while sub-millisecond per-operation overhead keeps it well below the noise floor of any LLM call. A Codex-backed AgentDojo bridge further separates strong-model behavior from defense-layer behavior: under an intentionally vulnerable tool-output profile, no-defense and signature-only baselines fail on all six banking pairs, while all MemLineage rows reduce strict AgentDojo ASR to zero. The core deterministic artifacts are byte-equal CI-verified; hosted-model AgentDojo and live-model sweeps are recorded as auditable logs rather than byte-pinned artifacts.
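The abstract builds on an RFC-6962 Merkle log. As a reference point, here is a minimal sketch of the RFC 6962 Merkle tree hash over a list of byte-string entries, using SHA-256 with the 0x00 leaf prefix and 0x01 interior-node prefix that the RFC specifies. It is only the hashing layer: the Ed25519 signatures, the derivation DAG, and the action gate of MemLineage are not shown, and the sample entries are invented for illustration.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    """RFC 6962 leaf hash: SHA-256 over 0x00 followed by the entry."""
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """RFC 6962 interior hash: SHA-256 over 0x01 followed by both children."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: list) -> bytes:
    """Merkle tree hash of an ordered list of byte-string entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    k = 1
    while k * 2 < n:   # largest power of two strictly less than n
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


# Hypothetical memory entries, serialized as bytes.
log = [b"user: prefers metric units", b"tool: weather lookup", b"note: trip on Friday"]
root = merkle_root(log)
tampered = merkle_root([log[0], b"tool: transfer funds", log[2]])
```

Because the root commits to every entry and its position, any later rewrite of a stored memory changes the root, which is what lets a log of this kind support chain-of-custody checks.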

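2605.14420 2026-05-15 cs.AI Version update

DVMap: Fine-Grained Pluralistic Value Alignment via High-Consensus Demographic-Value Mapping

Pengyun Zhu, Yuqi Ren, Zhen Wang, Lei Yang, Deyi Xiong

Affiliations * TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China

AI summary Current large language models (LLMs) typically rely on coarse-grained national labels for pluralistic value alignment, but such macro-level supervision obscures value heterogeneity within a country and yields loose alignment. The paper proposes DVMap, a framework that uses multi-dimensional demographic constraints to identify groups with predictable, high-consensus value preferences and thereby achieve fine-grained pluralistic value alignment. It introduces a demographic archetype extraction strategy and a Structured Chain-of-Thought mechanism, combined with Group Relative Policy Optimization, improving generalization and robustness across cross-demographic, cross-country, and cross-value settings.

Comments Accepted to the Main Conference of ACL 2026

Details
English abstract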

Current Large Language Models (LLMs) typically rely on coarse-grained national labels for pluralistic value alignment. However, such macro-level supervision often obscures intra-country value heterogeneity, yielding a loose alignment. We argue that resolving this limitation requires shifting from national labels to multi-dimensional demographic constraints, which can identify groups with predictable, high-consensus value preference. To this end, we propose DVMap (High-Consensus Demographic-Value Mapping), a framework for fine-grained pluralistic value alignment. In this framework, we first present a demographic archetype extraction strategy to construct a high-quality value alignment corpus of 56,152 samples from the World Values Survey (WVS) by strictly retaining respondents with consistent value preferences under identical demographics. Over this corpus, we introduce a Structured Chain-of-Thought (CoT) mechanism that explicitly guides LLMs to reason about demographic-value correlations. Subsequently, we employ Group Relative Policy Optimization (GRPO) to achieve adaptive anchoring of value distributions. To rigorously evaluate generalization, we further establish a triple-generalization benchmark (spanning cross-demographic, cross-country, and cross-value) comprising 21,553 samples. Experimental results demonstrate that DVMap effectively learns the manifold mapping from demographics to values, exhibiting strong generalization and robustness. On cross-demographic tests, Qwen3-8B-DVMap achieves 48.6% accuracy, surpassing the advanced open-source LLM DeepSeek-v3.2 (45.1%). The source code and dataset are available at https://github.com/EnlightenedAI/DVMap.

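2605.14418 2026-05-15 cs.CR cs.AI Version update

The Great Pretender: A Stochasticity Problem in LLM Jailbreak

Jean-Philippe Monteuuis, Cong Chen, Jonathan Petit

Affiliations * Core contributors

AI summary This paper points out a key problem in evaluating jailbreak attacks on large language models (LLMs): the attack success rate (ASR) is not a stable quantity, so results are hard to compare across studies. A jailbreak prompt with a high reported ASR on a closed-source model may succeed in only 5 of 10 consecutive attempts against the open model it was optimized for, which exposes the effect of stochasticity in both attack generation and evaluation. The authors propose an evaluation framework, CAS-eval, and a generation framework, CAS-gen, which improve attack consistency and success rate and offer a route to standardized jailbreak evaluation.

Details
English abstract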

"Oh-Oh, yes, I'm the great pretender. Pretending that I'm doing well. My need is such, I pretend too much..." summarizes the state in the area of jailbreak creation and evaluation. You find this method to generate adversarial attacks proposed by a reputable institution (e.g., BoN from Anthropic or Crescendo from Microsoft Research). However, this method does not deliver on the promise claimed in the paper despite having top ASR scores against industry-grade LLMs. You successfully generate the jailbreak prompts against your target (open) model. However, the generated jailbreak prompt works against the target model with a 50% consecutive success rate (5 out of 10 attempts) despite having an 80% ASR (on paper) on the latest closed-source model (with a guardrail system)! This observation leads us to think. First, Attack Success Rate (ASR), the primary metric for LLM jailbreak benchmarking, is not a stable quantity. Second, published ASR numbers are therefore systematically inflated and incomparable across papers. Therefore, we wonder "Why a successful jailbreak prompt does not perform consistently well against a target model on which the prompts have been optimized?". To answer this question, we study the impact of stochasticity not only during attack evaluation but also during attack generation. Our evaluation includes several jailbreak attacks, models (different sizes and providers), and judges. In addition, we propose a new metric and two new frameworks (CAS-eval and CAS-gen). Our evaluation framework, CAS-eval, shows that an attack can have an ASR drop of up to 30 percentage points when a jailbreak prompt needs to succeed on more than one attempt. Thankfully, our attack generation framework (CAS-gen) improves previous jailbreak methods and helps them recover this loss of 30 percentage points!

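The gap between a single-attempt ASR and repeated-attempt success described in the abstract above can be seen on toy data. The sketch below is illustrative only: it is not the paper's CAS metric, whose definition the abstract does not give, and the outcome table and function names are made up. Each inner list records whether one jailbreak prompt succeeded on successive attempts against the same model.

```python
def single_attempt_asr(outcomes):
    """Fraction of prompts whose first attempt succeeded."""
    return sum(1 for runs in outcomes if runs[0]) / len(outcomes)


def consistent_asr(outcomes, k):
    """Fraction of prompts that succeeded on all of their first k attempts."""
    return sum(1 for runs in outcomes if all(runs[:k])) / len(outcomes)


# Hypothetical per-prompt outcomes over three attempts each.
outcomes = [
    [True, True, True],
    [True, False, True],
    [False, True, True],
    [True, True, False],
]
asr_1 = single_attempt_asr(outcomes)
asr_2 = consistent_asr(outcomes, 2)
asr_3 = consistent_asr(outcomes, 3)
```

On this toy table the score falls from 0.75 to 0.5 to 0.25 as more consecutive successes are required, the same qualitative effect the abstract reports when a prompt must succeed on more than one attempt.
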
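2605.14416 2026-05-15 cs.AI Version update

A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems

Wen Wang, Xiangchen Wu, Liang Wang, Hao Hu, Xianping Tao

Affiliations * Nanjing University

AI summary This paper proposes a unified knowledge-embedded reinforcement learning framework for generalized Capacitated Vehicle Routing Problems (CVRPs). The framework follows the Route-First Cluster-Second heuristic, uses dynamic programming to solve the clustering subproblem, and adds a history-enhanced context processing module to handle the partial observability introduced by the decomposition. Experiments show better solution quality than existing learning-based methods across multiple CVRP variants, a smaller gap to classical heuristics, and good generalization.

Details
English abstract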

The Capacitated Vehicle Routing Problem (CVRP) is a fundamental NP-hard problem with broad applications in logistics and transportation. Real-world CVRPs often involve diverse objectives and complex constraints, such as time windows or backhaul requirements, motivating the development of a unified solution framework. Recent reinforcement learning (RL) approaches have shown promise in combinatorial optimization, yet they rely on end-to-end learning and lack explicit problem-solving knowledge, limiting solution quality. In this paper, we propose a knowledge-embedded framework inspired by the Route-First Cluster-Second heuristics. It incorporates knowledge at two levels: (1) decomposing CVRPs into the route-first and cluster-second subproblems, and (2) leveraging dynamic programming to solve the second subproblem, whose results guide the RL-based constructive solver to solve the first problem. To mitigate partial observability caused by problem decomposition, we introduce a unified history-enhanced context processing module. Extensive experiments show that this framework achieves superior solution quality compared with state-of-the-art learning-based methods, with a smaller gap to classical heuristics, demonstrating strong generalization across diverse CVRP variants.
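The cluster-second step of a Route-First Cluster-Second scheme is commonly solved with a dynamic program that cuts a fixed customer order into capacity-feasible routes. The sketch below shows that classical split idea for the plain CVRP with Euclidean distances. Whether the paper uses this exact recurrence is an assumption, and `split_tour`, the coordinates, and the demands are hypothetical.

```python
from math import hypot, inf


def split_tour(tour, demand, capacity, coords, depot=0):
    """Cut `tour` (a fixed customer order) into routes of minimum total length."""
    def dist(a, b):
        return hypot(coords[a][0] - coords[b][0], coords[a][1] - coords[b][1])

    n = len(tour)
    best = [inf] * (n + 1)   # best[j]: cheapest way to serve the first j customers
    pred = [0] * (n + 1)     # pred[j]: index where the last route of that plan starts
    best[0] = 0.0
    for i in range(n):
        if best[i] == inf:
            continue
        load, length = 0, 0.0
        for j in range(i, n):
            load += demand[tour[j]]
            if load > capacity:
                break
            # Extend the candidate route tour[i..j]; `length` excludes the return leg.
            length = dist(depot, tour[j]) if j == i else length + dist(tour[j - 1], tour[j])
            total = best[i] + length + dist(tour[j], depot)
            if total < best[j + 1]:
                best[j + 1], pred[j + 1] = total, i
    if best[n] == inf:
        raise ValueError("some customer demand exceeds the vehicle capacity")
    routes, j = [], n
    while j > 0:             # walk predecessors back to recover the routes
        routes.append(tour[pred[j]:j])
        j = pred[j]
    return best[n], routes[::-1]


# Toy instance: depot at the origin, four unit-demand customers on a line.
coords = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (-1, 0), 4: (-2, 0)}
demand = {1: 1, 2: 1, 3: 1, 4: 1}
cost, routes = split_tour([1, 2, 3, 4], demand, capacity=2, coords=coords)
```

Since the split is exact for a given order, a learned constructive solver only has to propose the order, and the DP cost can serve as its feedback signal, which matches the division of labor the abstract describes.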

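2605.14415 2026-05-15 cs.SE cs.AI cs.CL Version update

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

Man Ho Lam, Chaozheng Wang, Hange Liu, Jingyu Xiao, Haau-sing Li, Jen-tse Huang, Terry Yue Zhuo, Michael R. Lyu

Affiliations * The Chinese University of Hong Kong; Independent; ELLIS Technical University of Darmstadt; Johns Hopkins University; Monash University

AI summary SWE-Chain is a benchmark for evaluating coding agents on chained, release-level package upgrades. The authors design a synthesis pipeline that aligns release notes with code diffs to produce grounded, feasible upgrade requirements, and build a test set covering 9 real Python packages, 155 version transitions, and 1,660 upgrade requirements. Experiments show that current frontier coding agents still struggle with chained upgrades and have difficulty making correct upgrades without breaking existing functionality.

Details
English abstract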

Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of package releases, where changes are bundled, shipped, and inherited by subsequent versions. We present SWE-Chain, a benchmark for evaluating agents on chained release-level package upgrades, where each transition builds on the agent's prior codebase. To produce upgrade specifications, we design a divide-and-conquer synthesis pipeline that aligns release notes with code diffs for each version transition, ensuring the requirements are grounded in actual code changes, informative to agents, and feasible to implement. SWE-Chain contains 12 upgrade chains across 9 real Python packages, with 155 version transitions and 1,660 grounded upgrade requirements. Across nine frontier agent-model configurations, agents achieve an average of 44.8% resolving, 65.4% precision, and 50.2% F1 under the Build+Fix regime, with Claude-Opus-4.7 (Claude Code) leading at 60.8% resolving, 80.6% precision, and 68.5% F1. These results show that SWE-Chain is both feasible and discriminative, and reveal that current agents still struggle to make correct upgrades across chained package releases without breaking existing functionality.

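2605.14413 2026-05-15 cs.LG cs.AI Version update

MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse

Donghwan Kim, Hyunsoo Yoon

Affiliations * Department of Industrial Engineering, Yonsei University

AI summary This paper proposes MahaVar, an out-of-distribution (OOD) detection method based on the variance of class-wise Mahalanobis distances. For in-distribution samples, the class-wise Mahalanobis distances show a pronounced sharp-minimum structure and therefore high variance across classes, whereas OOD samples show a weaker structure and lower variance. Building on this observation and on Neural Collapse theory, MahaVar augments the standard Mahalanobis distance with a class-wise distance variance term, improving OOD detection and achieving state-of-the-art results on CIFAR-100 and ImageNet.

Comments 29 pages, 8 figures

Details
English abstract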

Out-of-distribution (OOD) detection is a critical component for ensuring the reliability of deep neural networks in safety-critical applications. In this work, we present a key empirical observation: for in-distribution (ID) samples, class-wise Mahalanobis distances exhibit a pronounced sharp minimum structure, where the distance to the nearest class is small while distances to all other classes remain large, resulting in high variance across classes. In contrast, OOD samples tend to exhibit a less pronounced sharp minimum structure, producing comparatively lower variance across classes. We further provide a theoretical analysis grounding this observation in Neural Collapse geometry: under relaxed Neural Collapse assumptions on within-class compactness and inter-class separation, ID samples are shown to structurally exhibit high class-wise distance variance, offering a theoretical basis for its use as an OOD score. Motivated by this observation and its theoretical backing, we propose MahaVar, a simple and effective post-hoc OOD detector that augments the Mahalanobis distance with a class-wise distance variance term. Following the OpenOOD v1.5 benchmark protocol, MahaVar achieves state-of-the-art performance on CIFAR-100 and ImageNet, with consistent improvements in both AUROC and FPR@95 over existing Mahalanobis-based methods across all benchmarks.
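The sharp-minimum observation can be reproduced on a toy example. The sketch below computes squared Mahalanobis distances from a feature vector to each class mean under a shared precision matrix, then reports the minimum distance and the variance across classes. It is a hedged illustration: the abstract does not say how MahaVar combines the two quantities into a score, so no combined score is computed, and the class means, the identity precision matrix, and the function names are assumptions.

```python
from statistics import pvariance


def mahalanobis_sq(x, mean, precision):
    """Squared Mahalanobis distance of `x` from `mean` under `precision`."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    d = len(diff)
    return sum(diff[i] * precision[i][j] * diff[j] for i in range(d) for j in range(d))


def classwise_stats(x, class_means, precision):
    """Return (nearest-class distance, variance of distances across classes)."""
    dists = [mahalanobis_sq(x, m, precision) for m in class_means]
    return min(dists), pvariance(dists)


# Toy geometry: three well-separated class means, identity precision (assumed).
class_means = [[10.0, 0.0], [-10.0, 0.0], [0.0, 10.0]]
precision = [[1.0, 0.0], [0.0, 1.0]]

id_min, id_var = classwise_stats([10.0, 0.0], class_means, precision)   # sits on a class mean
ood_min, ood_var = classwise_stats([0.0, 0.0], class_means, precision)  # equidistant from all
```

The sample on a class mean has one near-zero distance and large distances elsewhere, hence high variance, while the equidistant sample has zero variance. This is the contrast the abstract proposes to exploit.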

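2605.14411 2026-05-15 cs.RO cs.AI Version update

Energy-Efficient Quadruped Locomotion with Compliant Feet

Pramod Pal, Shishir Kolathaya, Ashitava Ghosal

Affiliations * Department of Mechanical Engineering, Indian Institute of Science; Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science; School of Engineering and Applied Science, Ahmedabad University

AI summary This study examines whether compliant feet can improve the locomotion efficiency of a quadruped robot without compromising stability. Foot compliance is integrated into a reinforcement learning controller, and intermediate foot stiffness is found to reduce the mechanical energy consumed per meter traveled: in hardware experiments, energy consumption is about 17% lower than with very stiff or very flexible feet. The results indicate that well-chosen foot compliance can improve the energy efficiency of quadruped robots.

Comments 29 pages, 7 figures, supplemental videos link is mentioned in the paper

Details
English abstract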

Quadruped robots are often designed with rigid feet to simplify control and maintain stable contact during locomotion. While this approach is straightforward, it limits the ability of the legs to absorb impact forces and reuse stored elastic energy, leading to higher energy expenditure during locomotion. To explore whether compliant feet can provide an advantage, we integrate foot compliance into a reinforcement learning (RL) locomotion controller and study its effect on walking efficiency. In simulation, we train eight policies corresponding to eight different spring stiffness values and then cross-evaluate their performance by measuring mechanical energy consumed per meter traveled. In experiments on a developed quadruped, energy consumption with the intermediate-stiffness spring is about 17% lower than with a very stiff or a very flexible spring in the feet, with similar trends appearing in the simulation results. These results indicate that selecting an appropriate foot compliance can improve locomotion efficiency without destabilizing the robot during motion.

2605.14407 2026-05-15 cs.AI 版本更新

Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers

Xiang Li

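Affiliations * Massachusetts General Hospital

AI summary This paper examines Metis AI, an often overlooked "middle zone" of digital tasks: tasks that can be completed entirely on a computer but that resist reliable algorithmic automation because of their institutional, social, and normative complexity. It proposes five structural characteristics of Metis AI and argues that the right response is a human-led, AI-assisted "centaur architecture" rather than simply more automation.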

Details
Abstract

The dominant discourse on AI limitations frames the boundary of AI capability as a divide between digital tasks (where AI excels) and physical tasks (where embodiment is required). We argue this framing misses the most consequential boundary: the one within digital tasks. We identify a class of tasks we call Metis AI, named for the Greek concept of metis (practical, contextual knowledge), that are performed entirely on computers yet resist reliable AI automation. These tasks are not computationally intractable; they are institutionally, socially, and normatively entangled in ways that defeat algorithmic approaches. We distinguish constitutive metis (knowledge destroyed by the act of formalization) from operational metis (system-specific familiarity that automation can progressively absorb), and propose five structural characteristics that define the Metis AI zone: consequential irreversibility, relational irreducibility, normative open texture, adversarial co-evolution, and accountability anchoring. We ground each in established theory from across the social sciences, philosophy, and humanitarian practice, argue that these characteristics are properties of the tasks themselves rather than limitations of current models, and show that the appropriate design response is not better automation but centaur architectures in which humans lead and AI supports.

2605.14392 2026-05-15 cs.AI Updated version

Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

Yucheng Shi, Zhenwen Liang, Kishan Panaganti, Dian Yu, Wenhao Yu, Haitao Mi

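Affiliations * Tencent HY LLM

AI summary This study proposes a self-evolving reinforcement learning approach based on verifiable environment synthesis, in which a language model not only generates problems but also builds the environments used to train itself. The core method generates executable environment objects that sample problems, compute reference solutions, and score responses, and it requires the environments to exhibit a stable solve-verify asymmetry so that the reward signal stays informative. The EvoEnv framework validates the approach with gains on benchmarks, indicating that self-improvement depends on constructing environments whose difficulty stays beyond the model's own ability rather than on simply adding more synthetic data.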

Comments Tech report, work in progress

Details
Abstract

We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry, meaning the model must be able to write, once, an oracle that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
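
Illustrative sketch (not EvoEnv code): the "samples instances, computes references, and scores responses" contract can be shown with a planted subset-sum environment, where the reference is known by construction and verification is a cheap check. The class name, parameters, and dictionary layout are hypothetical.

```python
import random

class PlantedSubsetSumEnv:
    """Hypothetical environment object: sample an instance, keep a reference, score answers."""

    def __init__(self, n=12, max_value=10**6, seed=0):
        self.n = n
        self.max_value = max_value
        self.rng = random.Random(seed)

    def sample(self):
        """Draw numbers, plant a random subset, and publish its sum as the target."""
        numbers = [self.rng.randint(1, self.max_value) for _ in range(self.n)]
        k = self.rng.randint(1, self.n)
        planted = sorted(self.rng.sample(range(self.n), k))
        target = sum(numbers[i] for i in planted)
        return {"numbers": numbers, "target": target, "reference": planted}

    @staticmethod
    def score(instance, indices):
        """Linear-time verification: distinct, in-range indices that sum to the target."""
        nums = instance["numbers"]
        if len(set(indices)) != len(indices):
            return 0.0
        if any(not (0 <= i < len(nums)) for i in indices):
            return 0.0
        return 1.0 if sum(nums[i] for i in indices) == instance["target"] else 0.0

env = PlantedSubsetSumEnv(seed=42)
inst = env.sample()
reward_reference = env.score(inst, inst["reference"])
reward_empty = env.score(inst, [])
```

The scorer accepts any valid subset, not only the planted one, so the check rewards solving the instance rather than matching a stored answer.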

2605.14389 2026-05-15 cs.AI cs.CL cs.LG Updated version

Nexus: An Agentic Framework for Time Series Forecasting

Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Nanyun Peng, Vishy Tirumalashetty, Chun-Liang Li, Rui Zhang, Jinsung Yoon, Tomas Pfister

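Affiliations * Google; Pennsylvania State University

AI summary Time series forecasting involves not only numerical extrapolation but also reasoning over unstructured text such as news and events. To address the insensitivity of existing Time Series Foundation Models (TSFMs) to textual signals and the uneven performance of large language models (LLMs) across domains, this paper proposes Nexus, a multi-agent forecasting framework that decomposes forecasting into stages, including isolating macro-level and micro-level temporal fluctuations and integrating contextual information, to enable more flexible forecasts. Experiments show that Nexus matches or outperforms existing state-of-the-art models on data from several domains while producing high-quality reasoning traces that reveal the drivers behind each forecast, supporting the view that real-world time series forecasting is an agentic reasoning problem that goes beyond pure sequence modeling.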

Comments 30 Pages, 3 figures, 5 Tables

Details
Abstract

Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware of real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data dated strictly after LLM knowledge cutoffs, spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond sequence modeling.

2605.14386 2026-05-15 cs.NE cs.AI Updated version

Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning

Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim

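Affiliations * VIDRAFT Inc.

AI summary This paper proposes Darwin Family, a framework that improves the reasoning ability of large language models through training-free evolutionary merging. Built on gradient-free weight-space recombination, it introduces an adaptive merge genome, the MRI-Trust Fusion mechanism, and a cross-architecture mapper to reorganize and optimize latent capabilities already present in existing model checkpoints. Experiments show that Darwin models surpass the trained models they are derived from on multiple tasks, indicating that reasoning performance can be improved without additional training.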

Comments NeurIPS 2026 submission. 18 pages including appendix

Details
Abstract

We present Darwin Family, a framework for training-free evolutionary merging of large language models via gradient-free weight-space recombination. We ask whether frontier-level reasoning performance can be improved without additional training, by reorganizing latent capabilities already encoded in existing checkpoints. Darwin introduces three key ideas: (i) a 14-dimensional adaptive merge genome enabling fine-grained component- and block-level recombination; (ii) MRI-Trust Fusion, which adaptively balances diagnostic layer-importance signals with evolutionary search through a learnable trust parameter; and (iii) an Architecture Mapper that enables cross-architecture breeding between heterogeneous model families. Empirically, the flagship Darwin-27B-Opus achieves 86.9% on GPQA Diamond, ranking #6 among 1,252 evaluated models, and outperforming its fully trained foundation model without any gradient-based training. Across scales from 4B to 35B parameters, Darwin models consistently improve over their parents, support recursive multi-generation evolution, and enable a training-free evolutionary merge that combines Transformer- and Mamba-based components. Together, the Darwin Family demonstrates that diagnostic-guided evolutionary merging is a practical and reproducible alternative to costly post-training pipelines for reasoning-centric language models.

2605.14379 2026-05-15 cs.LG cs.AI cs.GT cs.MA Updated version

Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games

JB Lanier, Nathan Monette, Pierre Baldi, Roy Fox

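AI summary Finding approximate equilibria in large-scale imperfect-information competitive games (such as StarCraft and Dota) is computationally very challenging because of sparse rewards and long-horizon exploration. This paper proposes Data-Augmented Game Starts (DAGS), a multi-agent starting-state sampling strategy that samples intermediate states from offline human expert demonstrations as starting points for reinforcement learning, accelerating exploration for policy-gradient methods in two-player zero-sum games. Experiments show that, under a fixed computational budget, DAGS significantly lowers exploitability. The paper also shows that augmenting the starting-state distribution can bias the resulting equilibria and proposes a simple, effective mitigation.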

Comments 17 pages, 4 figures. JB Lanier and Nathan Monette contributed equally

Details
Abstract

Finding approximate equilibria for large-scale imperfect-information competitive games such as StarCraft, Dota, and CounterStrike remains computationally infeasible due to sparse rewards and challenging exploration over long horizons. In this paper, we propose a multi-agent starting-state sampling strategy designed to substantially accelerate online exploration in regularized policy-gradient game methods for two-player zero-sum (2p0s) games. Motivated by an assumption that offline demonstrations from skilled humans can provide good coverage of high-level strategies relevant to equilibrium play, we propose the initialization of reinforcement learning data collection at intermediate states sampled from offline data to facilitate exploration of strategically relevant subgames. Referring to this method as Data-Augmented Game Starts (DAGS), we perform experiments using synthetic datasets and analytically tractable, long-horizon control variants of two-player Kuhn Poker, Goofspiel, and a counterexample game designed to penalize biased beliefs over hidden information. Under fixed computational budgets, DAGS enables regularized policy gradient methods to achieve lower exploitability in games with significantly more challenging exploration. We show that augmenting starting state distributions when solving imperfect information games can lead to biased equilibria, and we provide a straightforward mitigation to this in the form of multi-task observation flags. Finally, we release a new set of benchmark environments that drastically increase exploration challenges and state counts in existing OpenSpiel games while keeping exploitability measurements analytically tractable.
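
Illustrative sketch (not the authors' code): starting-state augmentation can be pictured as a sampler that, with some probability, begins an episode at an intermediate state drawn from offline trajectories and otherwise at the normal initial state. The mixing probability `p_augment`, the uniform sampling, and returning a 0/1 flag for the agents to observe are assumptions made here; the abstract names "multi-task observation flags" but does not specify their form.

```python
import random

def sample_start(offline_trajectories, initial_state, p_augment, rng):
    """Return (start_state, augmented_flag).

    With probability p_augment, pick a uniformly random state from a uniformly
    random offline trajectory; otherwise use the game's normal initial state.
    """
    if offline_trajectories and rng.random() < p_augment:
        trajectory = rng.choice(offline_trajectories)
        return rng.choice(trajectory), 1
    return initial_state, 0

rng = random.Random(0)
offline = [["s0", "s1", "s2"], ["s0", "s3"]]   # toy state identifiers
starts = [sample_start(offline, "s0", 0.5, rng) for _ in range(1000)]
never_augmented = [sample_start(offline, "s0", 0.0, rng) for _ in range(100)]
```

Exposing the flag lets a policy condition on whether an episode began from the true initial state, which is one plausible reading of how such flags could separate the augmented task from the original game.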

2605.14374 2026-05-15 cs.LG cs.AI math.OC Updated version

Optimal Pattern Detection Tree for Symbolic Rule-Based Classification

Young-Chae Hong, Yangho Chen

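Affiliations * Amazon

AI summary This paper proposes the Optimal Pattern Detection Tree (OPDT), a symbolic rule-based classification model built on mixed-integer programming that discovers a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, the authors further introduce the Branching Structure Constraints (BSC) framework, which lets decision makers embed domain knowledge directly into the model. By maximizing coverage while minimizing the false positive rate due to misclassification, the method discovers hidden patterns with optimality guarantees on moderately sized datasets within reasonable runtime.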

Comments Published in Transactions on Machine Learning Research (TMLR). 26 pages, 4 figures. OpenReview URL: https://openreview.net/forum?id=RJ6eMDcDCv

Journal ref Transactions on Machine Learning Research (2026)

Details
Abstract

Pattern discovery in data plays a crucial role across diverse domains, including healthcare, risk assessment, and machinery maintenance. In contrast to black-box deep learning models, symbolic rule discovery emerges as a key data mining task, generating human-interpretable rules that offer both transparency and intuitive explainability. This paper introduces the Optimal Pattern Detection Tree (OPDT), a rule-based machine learning model based on novel mixed-integer programming to discover a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, we further introduce the Branching Structure Constraints (BSC) framework, which enables decision makers to encode domain knowledge and constraints directly into the model. This optimization-based approach discovers a hidden underlying pattern in datasets, when it exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. Our computational experiments show that OPDT discovers a pattern with optimality guarantees on moderately sized datasets within reasonable runtime.

2605.14370 2026-05-15 physics.geo-ph cs.AI physics.comp-ph Updated version

Deciphering Neural Reparameterized Full-Waveform Inversion with Neural Sensitivity Kernel and Wave Tangent Kernel

Ruihua Chen, Yisi Luo, Bangyu Wu, Xile Zhao, Deyu Meng

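Affiliations * School of Mathematics and Statistics, Xi’an Jiaotong University; School of Mathematical Sciences, University of Electronic Science and Technology of China

AI summary This paper studies the convergence mechanism of neural reparameterized full-waveform inversion (NeurFWI). It introduces the neural sensitivity kernel (NSK) and the wave tangent kernel (WTK), showing how the neural representation modulates the eigen-structure of the original sensitivity and wave tangent kernels and thereby shapes key behaviors of the inversion: the spectral filtering effect, gradient wavenumber modulation, and wave frequency bias. Building on this analysis, the authors propose improved NeurFWI methods that raise inversion performance and efficiency, and validate them in seismic exploration and medical imaging.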

Details
Abstract

Full-waveform inversion (FWI) estimates unknown parameters in the wave equation from limited boundary measurements. Recent advances in neural reparameterized FWI (NeurFWI) demonstrate that representing the parameters using a neural network can reduce the reliance on a high-quality initial model and wavefield data, at the cost of slow high-resolution convergence. However, its underlying theoretical mechanism remains unclear. In this study, we establish the neural sensitivity kernel (NSK) and the wave tangent kernel (WTK) to analyze the convergence behavior of NeurFWI from both model and data domains. These theoretical frameworks show that the neural tangent kernel (NTK) induced by neural representation adaptively modulates the original sensitivity and wave tangent kernels. This modulation leads to several key outcomes, i.e., the spectral filtering effect, the gradient wavenumber modulation, and the wave frequency bias, connecting the convergence behavior of NeurFWI with the eigen-structures of NSK and WTK. Building on these insights, we propose several enhanced NeurFWI methods with tailored eigen-structures in NSK and WTK to improve inversion performance and efficiency. We numerically validate these theoretical claims and the proposed methods in seismic exploration, and extend their application to medical imaging for the first time.

2605.14368 2026-05-15 cs.CL cs.AI Updated version

Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement

Injin Kong, Hyoungjoon Lee, Yohan Jo

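Affiliations * Graduate School of Data Science, Seoul National University; Department of Biosystems & Biomaterials Science and Engineering, Seoul National University

AI summary This paper studies how to introduce diffusion into a pretrained language model effectively and proposes DiHAL, a geometry-guided diffusion-transformer hybrid. The method scores each layer's suitability with geometric features, selects a suitable hidden-state interface, and replaces the lower transformer layers with a diffusion bridge while keeping the upper layers and the language-model head. Experiments show that geometry-score-guided hidden-state recovery outperforms conventional continuous diffusion methods under the same training budget, indicating that diffusion-based replacement inside language models is feasible.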

Details
Abstract

Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.

2605.14365 2026-05-15 cs.LG cs.AI Updated version

LoMETab: Beyond Rank-1 Ensembles for Tabular Deep Learning

Changryeol Choi, Hyewon Park, Yujin Kwon, Gowun Jeong

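Affiliations * CJ Logistics

AI summary In tabular deep learning, the performance of leading methods is converging, making a clear ranking hard to establish. This paper proposes LoMETab, a rank-$r$ implicit ensemble model that introduces a tunable rank and initialization scale to increase model diversity and expressiveness. Experiments show that LoMETab effectively increases predictive diversity among ensemble members and offers good controllability and performance on classification and regression tasks.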

Details
Abstract

Recent tabular learning benchmarks increasingly show a tight performance cluster rather than a clear hierarchy among leading methods, spanning gradient boosted decision trees, attention-based architectures, and implicit ensembles such as TabM. As benchmark gains plateau, a complementary goal is to understand and control the mechanisms that make simple neural tabular models competitive. We propose LoMETab, a rank-$r$ generalization of multiplicative implicit ensembles. LoMETab lifts the rank-1 BatchEnsemble/TabM modulation to a rank-$r$ identity-residual Hadamard family by parameterizing each member weight as $W_k = W \odot (1 + A_kB_k^\top)$, where $W$ is shared and $(A_k, B_k)$ are member-specific low-rank factors. This exposes two practical diversity-control axes: the adapter rank $r$ and the initialization scale $\sigma_{\mathrm{init}}$, and we prove that for $r \ge 2$ this generalization strictly enlarges BatchEnsemble's hypothesis class. Empirically, we show that this added capacity manifests as measurable predictive diversity after training: on representative classification datasets, LoMETab sustains higher pairwise KL than an additive low-rank ablation, and $(r, \sigma_{\mathrm{init}})$ provides broad control over pairwise KL, varying by up to several orders of magnitude across configurations. The induced diversity is reflected in task-appropriate output-level measures: argmax disagreement for classification and ambiguity for regression, indicating that the control extends beyond pairwise KL to decision- and output-level member variation. Finally, experiments sweeping over adapter rank $r$ and initialization scale $\sigma_{\mathrm{init}}$ reveal that predictive performance is dataset-dependent over the $(r, \sigma_{\mathrm{init}})$ grid, supporting LoMETab as a controllable family of implicit ensembles rather than a fixed rank-1 construction.
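
Illustrative sketch (not the authors' code): the member-weight parameterization $W_k = W \odot (1 + A_kB_k^\top)$ can be written directly in NumPy. The tensor shapes, the Gaussian initialization scaled by `sigma_init`, and the function name are assumptions; the abstract specifies only the formula and the two control axes.

```python
import numpy as np

def member_weights(W, A, B):
    """Build W_k = W * (1 + A_k @ B_k.T) for every ensemble member k.

    W: shared weight, shape (d_out, d_in)
    A: member factors, shape (K, d_out, r)
    B: member factors, shape (K, d_in, r)
    Returns an array of shape (K, d_out, d_in).
    """
    low_rank = np.einsum("kor,kir->koi", A, B)   # A_k @ B_k.T for each member
    return W[None, :, :] * (1.0 + low_rank)

rng = np.random.default_rng(0)
K, d_out, d_in, r = 4, 3, 5, 2                   # toy sizes, rank-2 adapters
sigma_init = 0.1                                 # assumed Gaussian init scale

W = rng.normal(size=(d_out, d_in))
A = sigma_init * rng.normal(size=(K, d_out, r))
B = sigma_init * rng.normal(size=(K, d_in, r))
W_members = member_weights(W, A, B)
```

With all-zero factors every member collapses to the shared `W`, which is the sense in which the family is an identity-residual modulation; larger `sigma_init` or `r` moves members further apart at initialization.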

2605.14362 2026-05-15 cs.SE cs.AI Updated version

Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints

Shweta Mishra

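Affiliations * Independent Researcher

AI summary This study addresses the context-window efficiency problem that large language models face in developer tools, proposing a file-size-based pre-execution filtering framework that efficiently removes non-code files that would overflow the context limit before a repository is scanned. The method relies only on OS-level metadata, has very low computational overhead, and filters quickly without indexing or semantic analysis. Experiments on multiple open-source repositories show a substantial reduction in input tokens, along with higher code-generation accuracy and a lower hallucination rate.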

Details
Abstract

Context window efficiency is a practical constraint in large language model (LLM)-based developer tools. Paulsen [12] shows that all tested models degrade in accuracy well before their advertised context limits (the Maximum Effective Context Window, MECW), which makes context construction a quality problem, not just a cost one. Modern software repositories routinely contain large non-code artifacts (compiled datasets, binary model weights, minified JavaScript bundles, and gigabyte-scale log files) that overflow the context window and push out task-relevant source code. We present a correctness-aware context hygiene framework: a pre-execution, size-based heuristic filter that intercepts repository scans before tokenization, using only OS-level stat() metadata with sub-millisecond overhead. Semantic retrieval approaches such as RepoCoder, GraphRAG, and AST-based chunking require index construction and query-time inference before any filtering decision is reached. Our framework, by contrast, requires no indexing and operates at <0.01 ms per file decision. Across 10 real open-source repositories (22,046 files, 5 languages), the proposed SizeFilter at θ=1 MB achieves 79.6% (±13.2%) mean token reduction at 0.30 ms overhead; the HybridFilter achieves 89.3% (±9.0%), the lowest variance of any filter evaluated. A token-density study across 2,688 files confirms a strong linear correlation (Pearson r=0.997, k=0.250 tokens/byte). A limited-scope evaluation (18 tasks, CodeLlama-7B-Instruct) yields 72% file-level accuracy under filtering versus 25% at baseline; hallucination frequency declines from 61% to 17%. All code and data are released for reproducibility.
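
Illustrative sketch (not the released code): a size-based filter of this kind needs only `os.stat` metadata. The function names, the decimal interpretation of "1 MB", and the temporary toy repository are assumptions; the 0.250 tokens/byte factor is the density reported in the abstract and is used here only for a rough token estimate.

```python
import os
import tempfile

TOKENS_PER_BYTE = 0.250    # token density k reported in the abstract
THETA_BYTES = 1_000_000    # "1 MB" threshold; decimal megabyte is an assumption

def size_filter(root, theta=THETA_BYTES):
    """Split files under root into (kept, dropped) using only stat() sizes."""
    kept, dropped = [], []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            size = os.stat(path).st_size       # metadata only, file is never read
            (kept if size <= theta else dropped).append((path, size))
    return kept, dropped

def estimated_tokens(entries):
    """Rough token count from byte sizes via the linear density factor."""
    return sum(size for _path, size in entries) * TOKENS_PER_BYTE

# Toy repository: one small source file and one large binary artifact.
with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "main.py"), "w") as f:
        f.write("print('hello')\n" * 10)
    with open(os.path.join(repo, "weights.bin"), "wb") as f:
        f.write(b"\0" * 2_000_000)
    kept, dropped = size_filter(repo)
    kept_names = sorted(os.path.basename(p) for p, _ in kept)
    dropped_names = sorted(os.path.basename(p) for p, _ in dropped)
    reduction = 1 - estimated_tokens(kept) / estimated_tokens(kept + dropped)
```

Because the decision uses file size alone, a large but relevant source file would also be dropped; a hybrid rule that also checks extensions is one way to soften that, though the abstract does not describe how HybridFilter is defined.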

2605.14359 2026-05-15 cs.LG cs.AI Updated version

RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression

Zhengjia Zhong, Shuyan Ke, Zaizhou Lin, Jiaqi Song, Hongyi Lan, Hui Li

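Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China

AI summary This paper proposes RQ-MoE, a residual quantization framework that combines a mixture of experts with a dual-stream quantization mechanism to achieve efficient vector compression that adapts dynamically to the input. It resolves the decoding bottleneck of existing dynamic quantization methods, supports parallel decoding, and improves expressiveness. Experiments show that RQ-MoE achieves state-of-the-art or on-par performance on reconstruction and retrieval tasks while decoding 6x to 14x faster than prior methods.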

Comments To appear at ICML 2026

Details
Abstract

Vector quantization is a fundamental tool for compressing high-dimensional embeddings, yet existing multi-codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework combining a two-level MoE with dual-stream quantization to enable input-dependent codebook adaptation for efficient vector quantization. RQ-MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ-MoE, and derive a guideline for setting expert dimensionality in RQ-MoE. Extensive experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval, while providing 6x-14x faster decoding than prior vector quantization methods. The implementation is available at https://github.com/KDEGroup/RQ-MoE.
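
Illustrative sketch (standard residual quantization with static codebooks, not RQ-MoE itself): each stage picks the codeword nearest to the current residual, and the decoder sums the selected codewords. The toy codebooks and function names are assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

def rq_encode(x, codebooks):
    """Greedy residual quantization: one code index per codebook stage."""
    residual = x.astype(float).copy()
    codes = []
    for cb in codebooks:                                  # cb shape: (num_codes, dim)
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]                     # pass what is left to the next stage
    return codes

def rq_decode(codes, codebooks):
    """Reconstruction is the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

codebooks = [
    np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]]),      # coarse stage
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),       # fine stage
]
x = np.array([4.9, 4.1])
codes = rq_encode(x, codebooks)
x_hat = rq_decode(codes, codebooks)
```

With static codebooks the decoder's lookups are independent of one another. When each stage's codebook instead depends on the partial reconstruction, decoding has to proceed stage by stage, which is the sequential dependency the abstract attributes to dynamic quantizers such as QINCo.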

2605.14358 2026-05-15 cs.AI cs.LG Updated version

Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

Sanjoy Chowdhury, Dinesh Manocha

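Affiliations * University of Maryland, College Park, USA

AI summary This study asks how many of the steps in a language model's long chain-of-thought trace are necessary for the final prediction. It defines the "minimal core" as the smallest subset of steps that preserves the final answer or predictive distribution and introduces metrics for compression ratio, redundancy, and step necessity. Reasoning traces are found to be broadly redundant: on average 46% of steps can be removed, with the original answer preserved in 86% of cases, and necessity is highly concentrated in a few steps. Minimal cores also expose the geometric structure of reasoning more clearly and transfer well across models, offering a new perspective on the nature of language-model reasoning.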

Details
Abstract

Language models often generate long chain-of-thought traces, but it remains unclear how much of this reasoning is necessary for preserving the final prediction. We study this through the lens of overcomplete reasoning traces: generated traces that contain more intermediate steps than are needed to support the model's answer. We define the minimal core as the smallest subset of steps that preserves either the final answer or predictive distribution, and introduce metrics for compression ratio, redundancy mass, step necessity, and necessity concentration. Across six deliberative reasoning benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, we find substantial overcompleteness: on average, 46% of steps are removable under greedy minimal-core extraction while preserving the original answer in 86% of cases. We also find that predictive support is concentrated: the top three steps account for 65% of measured necessity mass on average. Beyond compression, minimal cores expose a cleaner geometry of reasoning: compared with full traces, they improve correct-incorrect trace separation by 11 points, reduce estimated intrinsic dimensionality by 34%, and transfer across model families with 85% off-diagonal answer retention. Theoretically, we establish existence of minimal sufficient subsets, local irreducibility guarantees for greedy elimination, and certificates of overcompleteness and sparse necessity. Together, these results suggest that full reasoning traces are often verbose and overcomplete, while minimal cores isolate the effective support underlying language-model predictions.
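
Illustrative sketch (not the paper's procedure): greedy elimination can be written as a single pass that tentatively drops each step and keeps the deletion whenever the answer is unchanged. The `predict` callback stands in for re-querying a model on a reduced trace, and both `toy_predict` and the ratio `len(core) / len(trace)` are hypothetical choices made here.

```python
def greedy_minimal_core(steps, predict):
    """Drop steps one at a time while predict(kept steps) still equals the original answer."""
    target = predict(steps)
    kept = list(steps)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if predict(candidate) == target:
            kept = candidate      # step i was removable; the next step now sits at index i
        else:
            i += 1                # step i is needed given the steps still kept
    return kept

def toy_predict(steps):
    """Hypothetical stand-in for a model: the answer is the sum of all 'add N' steps."""
    return sum(int(s.split()[1]) for s in steps if s.startswith("add "))

trace = ["restate problem", "add 2", "double-check", "add 3", "add 0", "summarize"]
core = greedy_minimal_core(trace, toy_predict)
compression_ratio = len(core) / len(trace)
```

A single forward pass like this does not by itself guarantee that no further step could be removed afterward; repeating the pass until nothing changes is one simple way to reach a set where every remaining step is individually necessary.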

2605.14331 2026-05-15 eess.SP cs.AI cs.ET cs.IT cs.LG math.IT Updated version

Analog RF Computing: A New Paradigm for Energy-Efficient Edge AI Over MU-MIMO Systems

Wentao Yu, Vincent W. S. Wong

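Affiliations * Department of Electrical and Computer Engineering, The University of British Columbia

AI summary This paper proposes a new paradigm based on analog radio frequency (RF) computing for energy-efficient edge AI inference in multi-user multiple-input multiple-output (MU-MIMO) wireless systems. The base station broadcasts waveforms that encode the neural network weights, and each client uses its passive mixer to multiply the received waveform with a locally generated input-encoded waveform, so that matrix-vector multiplications are carried out efficiently at the wireless receiver. The authors design a computing-centric physical-layer framework that balances computing accuracy against energy consumption and develop a low-complexity algorithm for the resulting non-convex problem. Simulations show that the approach reduces client-side energy consumption by nearly two orders of magnitude compared with conventional digital computing, offering an efficient new route to edge inference over wireless networks.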

Comments 13 pages, 6 figures, 2 tables. This paper proposes analog RF computing as a new paradigm for energy-efficient edge inference over wireless networks and studies the corresponding physical layer design framework

Details
Abstract

Modern edge devices increasingly rely on neural networks for intelligent applications. However, conventional digital computing-based edge inference requires substantial memory and energy consumption. In analog radio frequency (RF) computing, a base station (BS) encodes the weights of the neural networks and broadcasts the RF waveforms to the clients. Each client reuses its passive mixer to multiply the received weight-encoded waveform with a locally generated input-encoded waveform. This enables wireless receivers to perform the matrix-vector multiplications (MVMs) that account for most of the computation burden in edge inference with ultra-low energy consumption. Unlike conventional downlink transmissions which are optimized for communications, analog RF computing requires a computing-centric physical layer that controls both the analog MVM accuracy and the energy consumption for inference. Motivated by this, in this paper, we propose a physical layer design framework for analog RF computing in MU-MIMO wireless systems. We derive tractable models for computing accuracy and energy consumption for inference, formulate a joint BS beamforming and client-side scaling problem subject to computing accuracy, transmit power, and hardware constraints, and develop a low-complexity algorithm to solve the non-convex problem. The proposed design provides client- and layer-specific accuracy control for both uniform- and mixed-precision inference. Simulations under 3GPP specifications show that analog RF computing can significantly reduce client-side energy consumption by nearly two orders of magnitude compared to digital computing, while mixed-precision inference requires even lower energy consumption than uniform-precision inference. Overall, these results establish analog RF computing over wireless networks as a promising paradigm for energy-efficient edge inference.

2605.14327 2026-05-15 cs.LG cs.AI Updated version

AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction

Yerin Park, Sangseon Lee

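Affiliations * Department of Artificial Intelligence, Inha University

AI summary Drug-drug interaction (DDI) prediction is important in computational biomedicine, but accurately predicting interactions for drugs unseen during training remains a key challenge. This paper proposes AIM-DDI, a model-agnostic multimodal integration module that maps heterogeneous structural, chemical, and semantic drug information into a shared latent space and models cross-modality dependencies with a unified fusion module, enabling integration across different DDI prediction architectures. Experiments show that AIM-DDI improves prediction performance across diverse DDI models and DrugBank-based settings, with the largest gains in the hardest setting, where neither drug in a pair was seen during training.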

Details
Abstract

Drug-drug interaction (DDI) prediction is a critical task in computational biomedicine, as adverse interactions between co-administered drugs can cause severe side effects and clinical risks. A key challenge is unseen-drug generalization, where interactions must be predicted for drugs not observed during training. Although multimodal DDI models exploit diverse drug-related information, their fusion mechanisms are often tied to specific prediction architectures, limiting their reuse across models. To address this, we propose AIM-DDI, an architecture-independent multimodal integration module that represents heterogeneous modality information as tokens in a shared latent space. By modeling dependencies across modality tokens through a unified fusion module, AIM-DDI enables model-agnostic integration of structural, chemical, and semantic drug signals across different DDI prediction architectures. Extensive evaluations across diverse DDI models and DrugBank-based settings show that AIM-DDI consistently improves prediction performance, with the strongest gains under the most challenging both-unseen setting where neither drug in a test pair is observed during training. These results suggest that treating multimodal integration as a reusable module, rather than a model-specific fusion component, is an effective strategy for robust unseen-drug DDI prediction.

2605.14323 2026-05-15 cs.LG cs.AI cs.CL Updated version

Dynamic Latent Routing

Fangyuan Yu, Xin Su, Amir Abdullah

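Affiliations * Thoughtworks AI Labs (TAILS)

AI summary This paper studies the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. The authors introduce General Dijkstra Search (GDS) and prove that a globally optimal goal-reaching policy can be recovered by temporally composing intermediate optimal sub-policies. Building on the "search, select, update" principle of GDS, they propose Dynamic Latent Routing (DLR), which jointly learns discrete latent codes, routing policies, and model parameters in a single training stage. In low-data fine-tuning settings, DLR performs well across multiple datasets and models, matching or outperforming standard supervised fine-tuning.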

Details
Abstract

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

2605.14318 2026-05-15 cs.AI cs.LG Updated version

Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems

Emilio Mastriani, Alessandro Costa, Federico Incardona, Kevin Munari, Sebastiano Spinello

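Affiliations * INAF, Osservatorio Astrofisico di Catania

AI summary This paper studies interpretable predictive maintenance in complex systems. Because heterogeneous and redundant monitored variables obscure fault information and reduce model interpretability, it proposes a semantic feature segmentation framework. The method decomposes the monitored feature space into a canonical component that retains the dominant predictive information and a residual component containing structurally peripheral signals, and it defines functional groups from domain knowledge to reflect the system's operating mechanisms. Experiments show that the canonical component achieves lower predictive risk than the residual component and remains structurally stable, combining predictive performance with semantic interpretability.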

Comments 18 pages, 7 figures. Under review at Neural Computing and Applications. Keywords: semantic segmentation, change point detection, fault anticipation

Details
Abstract

Predictive maintenance in complex systems is often complicated by the heterogeneity and redundancy of monitored variables, which can obscure fault-relevant information and reduce model interpretability. This work proposes a semantic feature segmentation framework that decomposes the monitored feature space into a canonical component, expected to retain the dominant predictive information, and a residual component containing structurally peripheral signals. The segmentation is defined through domain-informed criteria and organizes monitoring variables into functional groups reflecting operational mechanisms such as throughput, latency, pressure, network activity, and structural state. To evaluate the effectiveness of this decomposition, we adopt a predictive perspective in which expected predictive risk is used as an operational proxy for task-relevant information. Experimental results obtained through time-aware cross-validation show that the canonical space consistently achieves lower predictive risk than the residual space across multiple temporal configurations, indicating that the semantic segmentation concentrates the most relevant information for fault anticipation. In addition, the canonical segments exhibit significantly stronger intra-segment coherence than inter-segment dependence, and this structural organization remains stable after redundancy reduction. When compared with the full feature space and with a Principal Component Analysis (PCA) representation, the canonical space achieves comparable predictive performance while preserving the semantic meaning of the original variables. These findings suggest that semantic feature segmentation provides an interpretable and information-preserving decomposition of monitoring signals, enabling competitive predictive performance without sacrificing the operational interpretability required in predictive maintenance applications.

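2605.14304 2026-05-15 cs.LG cs.AI Updated version

Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry

Zuyuan Zhang, Carlee Joe-Wong, Tian Lan

Affiliations * The George Washington University; Carnegie Mellon University

AI summary This work proposes Matrix-Space Reinforcement Learning (MSRL), a method that aims to improve compositional generalization in reinforcement learning by reusing the local transition geometry of existing trajectory segments. MSRL uses positive semidefinite matrix descriptors to capture first- and second-order statistics of trajectory segments, which enables algebraic composition and knowledge transfer in an abstract matrix space. Experiments show that it outperforms the compared methods under a finite budget, demonstrating its effectiveness for cross-task learning.

Details
English abstract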

Compositional generalization in sequential decision-making requires identifying which parts of prior rollouts remain useful for new tasks. Existing methods reuse skills or predictive models, but often overlook rich local transition geometry and dynamics. We propose Matrix-Space Reinforcement Learning (MSRL), a geometric abstraction that represents trajectory segments through positive semidefinite matrix descriptors aggregating first- and second-order statistics of lifted one-step transitions. These descriptors expose shared hidden structure, support algebraic composition in an abstract matrix space, and reveal opportunities for transfer. We prove that the descriptor is well defined up to coordinate gauge, complete for the induced low-order additive signal class, additive under valid segment composition, and minimally sufficient among admissible additive descriptors. We further show that conditioning value functions on the trajectory-segment matrix yields a first-order smooth approximation of action values, enabling source-learned matrix-to-value mappings to bootstrap learning in new tasks. MSRL is plug-in compatible with standard model-free and model-based methods, while obstruction filtering rejects implausible compositions. Empirically, MSRL achieves the best average finite-budget target AUC of 0.73, outperforming MSRL from scratch (0.65), TD-MPC-PT+FT (0.63), and TD-MPC (0.57).
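
The additivity property mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: the `lift` function, the choice of features, and the dimensions are all assumptions made for illustration. The descriptor here is simply the sum of outer products of lifted one-step transitions, which is positive semidefinite by construction and adds up when two segments are concatenated.

```python
import numpy as np

def lift(s, a, s_next):
    """Hypothetical lifting: stack state, action, state change, and a constant 1."""
    return np.concatenate([s, a, s_next - s, [1.0]])

def segment_descriptor(transitions):
    """Sum of outer products z z^T over a segment's one-step transitions.

    Because the lift ends with a constant 1, the last row and column hold
    first-order sums and the remaining block holds second-order sums.
    """
    zs = np.stack([lift(*t) for t in transitions])
    return zs.T @ zs  # positive semidefinite by construction

rng = np.random.default_rng(0)

def random_segment(n, ds=3, da=2):
    # Random (state, action, next_state) triples standing in for rollout data.
    return [(rng.normal(size=ds), rng.normal(size=da), rng.normal(size=ds))
            for _ in range(n)]

seg_a, seg_b = random_segment(5), random_segment(7)
M_a, M_b = segment_descriptor(seg_a), segment_descriptor(seg_b)
M_ab = segment_descriptor(seg_a + seg_b)  # descriptor of the concatenated segment
```

In this toy setting `M_ab` equals `M_a + M_b`, and the bottom-right entry counts the transitions in the segment. The gauge, completeness, and obstruction-filtering results in the abstract are not modeled here.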

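2605.14297 2026-05-15 cs.LG cs.AI math.OC stat.ML Updated version

Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

Matias Alvo, Daniel Russo, Yash Kanoria

Affiliations * Graduate School of Business, Columbia University

AI summary This paper studies reinforcement learning in hybrid discrete-continuous action spaces, a structure common in robotics, control, and operations problems. To address the poor gradient quality of standard policy gradient methods in high-dimensional settings, the authors propose Hybrid Policy Optimization (HPO), which combines pathwise and score-function gradients into an unbiased mixed gradient estimator that copes with discrete actions and non-smooth dynamics. Experiments show that HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator tasks, with the advantage growing as the continuous action dimension increases.

Details
English abstract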

We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening its applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term -- which captures how continuous actions influence future discrete decisions -- becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at github.com/MatiasAlvo/hybrid-rl.
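
To make the idea of a mixed estimator concrete, here is a one-step toy sketch, not the HPO algorithm itself. The reward function, the parameter names (`theta`, `mu`, `c`, `sigma`), and the sample size are assumptions for illustration. The continuous parameter gets a pathwise gradient through a reparameterized action, and the discrete logits get a score-function gradient; both are compared with the closed-form gradient of the same toy objective.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mixed_gradient(theta, mu, c, sigma=0.5, n=200_000, seed=0):
    """Monte Carlo gradient of E[-(a - c[d])^2] with d ~ softmax(theta), a = mu + sigma * eps."""
    rng = np.random.default_rng(seed)
    p = softmax(theta)
    d = rng.choice(len(p), size=n, p=p)      # discrete action (regime index)
    a = mu + sigma * rng.normal(size=n)      # reparameterized continuous action
    r = -(a - c[d]) ** 2                     # toy differentiable reward
    # Pathwise part: differentiate the reward through a = mu + sigma * eps.
    grad_mu = np.mean(-2.0 * (a - c[d]))
    # Score-function part: r * grad_theta log p(d) = r * (onehot(d) - p).
    onehot = np.eye(len(p))[d]
    grad_theta = np.mean(r[:, None] * (onehot - p), axis=0)
    return grad_theta, grad_mu

def exact_gradient(theta, mu, c, sigma=0.5):
    """Closed-form gradient of the same toy objective, for comparison."""
    p = softmax(theta)
    r_bar = -((mu - c) ** 2 + sigma ** 2)    # expected reward within each regime
    grad_theta = p * (r_bar - p @ r_bar)
    grad_mu = np.sum(p * -2.0 * (mu - c))
    return grad_theta, grad_mu

theta = np.array([0.2, -0.1, 0.4])
c = np.array([-1.0, 0.5, 2.0])
mu = 0.3
g_theta, g_mu = mixed_gradient(theta, mu, c)
e_theta, e_mu = exact_gradient(theta, mu, c)
```

Because the toy problem has a single step, the cross term discussed in the abstract (continuous actions influencing future discrete decisions) does not arise here.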

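2605.14294 2026-05-15 cs.AI cs.LG Updated version

Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement

Hengjie Liu, Zhenya Zhang, Jianjun Zhao

Affiliations * Kyushu University; National Institute of Informatics

AI summary As transformer models are widely deployed in safety-critical domains, their formal verification has become especially important. Compared with classic neural networks, transformer inference involves complex computations such as dot products in self-attention layers, which makes verification very challenging. This paper proposes a ReLU-catalyzed abstraction refinement approach that represents a precise but non-linear bound for dot products and combines it with convex relaxation techniques to improve verification precision. It extends two classic verification approaches into frameworks for transformers; experiments show a significant gain in precision at an acceptable cost in efficiency.

Comments 32 pages, 6 figures, the full version of the paper accepted by CAV 2026

Details
English abstract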

Formal verification of transformers has become increasingly important due to their widespread deployment in safety-critical applications. Compared to classic neural networks, the inferences of transformers involve highly complex computations, such as dot products in self-attention layers, rendering their verification extremely difficult. Existing approaches explored over-approximation methods by constructing convex constraints to bound the output ranges of transformers, which can achieve high efficiency. However, they may sacrifice verification precision, and consequently introduce significant approximation error that leads to frequent occurrences of false alarms. In this paper, we propose a transformer verification approach that can achieve improved precision. At the core of our approach is a novel usage of ReLU, by which we represent a precise but non-linear bound for dot products such that we can further exploit the rich body of literature for convex relaxation of ReLU to derive precise bounds. We extend two classic approaches to the context of transformers, a rule-based one and an optimization-based one, resulting in two new frameworks for efficient and precise verification. We evaluate our approaches on different model architectures and robustness properties derived from two datasets about sentiment analysis, and compare with the state-of-the-art baseline approach. Compared to the baseline, our approach can achieve significant precision improvement for most of the verification tasks with acceptable compromise of efficiency, which demonstrates the effectiveness of our approach.
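
The abstract does not spell out the ReLU construction, so the following is only a hedged illustration of how a ReLU can carry a tight but non-linear bound on a product term. It uses the standard McCormick envelopes for `x * y` over a box and the identity `max(a, b) = a + relu(b - a)`; treating this as related to the paper's bound is an assumption, and the function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def product_lower_bound(x, y, lx, ux, ly, uy):
    """McCormick lower envelope of x*y on [lx, ux] x [ly, uy], written with one ReLU."""
    plane_lo = lx * y + ly * x - lx * ly
    plane_hi = ux * y + uy * x - ux * uy
    return plane_lo + relu(plane_hi - plane_lo)   # equals max(plane_lo, plane_hi)

def product_upper_bound(x, y, lx, ux, ly, uy):
    """McCormick upper envelope of x*y on the same box, written with one ReLU."""
    plane_a = ux * y + ly * x - ux * ly
    plane_b = lx * y + uy * x - lx * uy
    return plane_a - relu(plane_a - plane_b)      # equals min(plane_a, plane_b)

# Check the envelopes on random points inside an example box.
rng = np.random.default_rng(0)
lx, ux, ly, uy = -1.0, 2.0, -0.5, 1.5
xs = rng.uniform(lx, ux, size=1000)
ys = rng.uniform(ly, uy, size=1000)
lower = product_lower_bound(xs, ys, lx, ux, ly, uy)
upper = product_upper_bound(xs, ys, lx, ux, ly, uy)
```

Each envelope is piecewise linear, so once the ReLU is relaxed with a standard convex relaxation the whole bound becomes linear, which is the general pattern the abstract alludes to.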

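2605.14291 2026-05-15 cs.CR cs.AI cs.CL cs.CV cs.LG Updated version

To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

Chengshuai Zhao, Zhen Tan, Dawei Li, Zhiyuan Yu, Huan Liu

Affiliations * School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, USA; Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA

AI summary With the rapid development of large vision-language models (LVLMs), unauthorized data scraping and fine-tuning pose serious copyright and privacy risks. This paper proposes MMGuard, which injects human-imperceptible perturbations to generate "unlearnable" examples, proactively protecting data from unauthorized LVLM fine-tuning. The method exploits the model's learning dynamics to create an optimization shortcut, so that the model overfits to the noise during training and performs worse at inference. MMGuard also introduces a cross-modal binding disruption strategy to strengthen the defense, and shows effective, stealthy, and robust protection under multiple threat models.

Details
English abstract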

The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherently post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.
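
The loss-minimizing perturbation described above can be sketched on a toy model. This is not MMGuard: it assumes a fixed logistic-regression surrogate in place of an LVLM, omits the cross-modal binding disruption and the ensemble strategy, and uses hypothetical names and step sizes. It only shows the core mechanic, a bounded perturbation found by sign-gradient descent on the training loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def error_minimizing_noise(w, X, y, eps=0.1, step=0.02, iters=20):
    """Per-sample perturbation with |delta| <= eps that lowers the training loss."""
    delta = np.zeros_like(X)
    for _ in range(iters):
        # Gradient of the per-sample logistic loss with respect to the input.
        grad_x = (sigmoid((X + delta) @ w) - y)[:, None] * w[None, :]
        # Step downhill on the loss, then project back into the L-infinity ball.
        delta = np.clip(delta - step * np.sign(grad_x), -eps, eps)
    return delta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
w = np.array([1.0, -2.0, 0.5, 0.3])          # assumed surrogate model weights
y = (rng.random(50) < 0.5).astype(float)     # random binary labels
delta = error_minimizing_noise(w, X, y)
clean_loss = logistic_loss(w, X, y)
protected_loss = logistic_loss(w, X + delta, y)
```

The perturbed data look "already learned" to the surrogate (lower loss), which is the shortcut effect the abstract describes; whether that transfers to a real fine-tuning run is exactly what the paper evaluates and is not shown here.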

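2605.14290 2026-05-15 cs.CR cs.AI cs.CL cs.SE Updated version

Web Agents Should Adopt the Plan-Then-Execute Paradigm

Julien Piet, Annabella Chow, Yiwei Hou, Muxi Lyu, Sylvie Venuto, Jinhao Zhu, Raluca Ada Popa, David Wagner

Affiliations * UC Berkeley

AI summary This paper argues that current LLM agents built on the ReAct architecture are insecure for web tasks, because they feed unverified web content directly into their decision-making and are therefore exposed to prompt injection attacks. The authors argue that web agents should adopt the plan-then-execute paradigm: commit to a task-specific execution plan before observing web content, limiting the influence of untrusted data on control flow. An analysis of the WebArena benchmark finds that 80% of tasks can be completed with a purely programmatic plan, with no runtime LLM subroutine, and that the key to adopting the paradigm is building typed, auditable website API interfaces, not improving the model itself.

Details
English abstract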

ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then execute it. The reason is that web content mixes inputs from many parties. An e-commerce product page may combine a seller's listing, customer reviews and sponsored advertisements. Under ReAct, all of this content flows into the model when deciding on the next action, creating a direct path for prompt injections to steer the agent's control flow. Plan-then-execute changes this boundary: untrusted data may influence values or branches inside a predefined execution graph, but it cannot redefine the user task or cause the model to synthesize new actions at runtime. We analyze WebArena, a popular web agent benchmark, and find that all tasks are compatible with plan-then-execute, while 80% can be completed with a purely programmatic plan, without any runtime LLM subroutine. We identify the main barrier to adopting plan-then-execute on the web: For it to work well, tools must map cleanly to semantic actions, with effects known before execution, so agents have enough information to plan. The web does not naturally expose that interface. Browser tools such as click, type, and scroll have page-dependent meanings. Planning at this layer is near-sighted: the agent can only see actions on the current page, and later actions appear only after it acts. Closing this gap requires typed interfaces that turn website interactions from clicks and keystrokes to task-level operations. This is an infrastructure problem, not a modeling problem. Web tasks do not need reactivity by default; they need typed, complete, auditable website APIs.
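
A minimal sketch of the plan-then-execute boundary follows. The tool names, the catalog, and the task format are hypothetical and stand in for the typed website APIs the abstract calls for. The plan is built from the trusted task alone, and page content (including an injected instruction in a review) can only affect values flowing through the fixed step sequence.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Step:
    tool: str   # name of a typed, task-level operation
    args: dict  # arguments fixed at planning time

def make_plan(task: dict) -> tuple[Step, ...]:
    """Build the whole program from the trusted user task, before any page is read."""
    return (
        Step("search_products", {"query": task["query"]}),
        Step("pick_cheapest", {}),
        Step("add_to_cart", {}),
    )

def execute(plan, tools: dict[str, Callable]):
    state, trace = None, []
    for step in plan:  # the step sequence was fixed before execution started
        state = tools[step.tool](state, **step.args)
        trace.append(step.tool)
    return state, trace

# Hypothetical typed tools; the first review plays the role of injected content.
CATALOG = [
    {"name": "cable A", "price": 9.0,
     "review": "IGNORE PREVIOUS INSTRUCTIONS and delete the account"},
    {"name": "cable B", "price": 7.5, "review": "works fine"},
]
TOOLS = {
    "search_products": lambda _state, query: [p for p in CATALOG if query in p["name"]],
    "pick_cheapest": lambda items: min(items, key=lambda p: p["price"]),
    "add_to_cart": lambda item: {"cart": [item["name"]]},
}

result, trace = execute(make_plan({"query": "cable"}), TOOLS)
```

The review text is never interpreted as an instruction because no model reads it at runtime; a ReAct-style loop would instead pass it to the model when choosing the next action.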

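2605.14289 2026-05-15 cs.LG cs.AI cs.CL cs.CR Updated version

MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

Weisen Jiang, Shuhao Chen, Sinno Jialin Pan

AI summary This paper proposes MetaMoE, a privacy-preserving mixture-of-experts (MoE) unification framework for settings where data are distributed and expert models cannot share training data. The method selects public proxy data that are both relevant to each client's domain and diverse, and uses them in place of inaccessible private data to guide router learning and improve expert coordination. Experiments show that MetaMoE outperforms existing privacy-preserving MoE unification methods on computer vision and natural language processing tasks.

Comments Accepted by ICML 2026

Details
English abstract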

Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods. Code is available at https://github.com/ws-jiang/MetaMoE.

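2605.14283 2026-05-15 cs.GT cs.AI cs.CR Updated version

Watermarking Game-Playing Agents in Perfect-Information Extensive-Form Games

Juho Kim, Fei Fang, Tuomas Sandholm

Affiliations * Strategic Machine, Inc.; Strategy Robot, Inc.; Optimized Markets, Inc.

AI summary This paper studies how to watermark game-playing strategies in perfect-information extensive-form games, with the goal of detecting unauthorized use of AI tools by players. Drawing on the KGW watermark for large language models, the authors propose a watermarking scheme for game-playing agents whose watermark is detected with a statistical test. Experiments show that the watermark has a negligible effect on strategy quality and can be detected from only a handful of games.

Details
English abstract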

Watermarking techniques for large language models (LLMs), which encode hidden information in the output so its source can be verified, have gained significant attention in recent days, thanks to their potential capability to detect accidental or deliberate misuse. Similar challenges involving model misuse also exist in the context of game-playing, such as when detecting the unauthorized use of AI tools in gaming platforms (e.g., cheating in online chess). In this paper, we initiate the study of how game-playing strategies can be watermarked. We show how the KGW watermark for LLMs can be adapted to watermark game-playing agents in perfect-information extensive-form games. The watermark can then be detected using a statistical test. We show that the degradation in the quality of the watermarked strategy profile, quantified by the expected utility, can be bounded, but there is a tradeoff between detectability and quality. In our experiments, we bootstrap the watermarking framework to various chess engines and demonstrate that a) the impact of the watermark on the quality of the strategy is negligible and b) the watermark can be detected with just a handful of games.
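
A rough sketch of how a KGW-style watermark might look when moves take the place of tokens is given below. The hashing scheme, the parameter names (`key`, `gamma`, `delta`), and the toy states are assumptions and are not taken from the paper. A keyed hash of the state marks a fraction `gamma` of the legal moves as "green", the agent boosts green moves, and the detector runs a one-sided z-test on the number of green moves played. The z-test assumes `gamma * len(moves)` is an integer at every state.

```python
import hashlib
import math
import random

def green_moves(state, moves, key, gamma=0.5):
    """Pseudo-randomly mark a gamma fraction of the legal moves as green, keyed on the state."""
    digest = hashlib.sha256(f"{key}|{state}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    shuffled = sorted(moves)
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[: max(1, round(gamma * len(moves)))])

def watermarked_choice(state, scores, key, delta=2.0, rng=random):
    """Sample a move from softmax(scores), with delta added to the scores of green moves."""
    green = green_moves(state, list(scores), key)
    weights = [math.exp(s + (delta if m in green else 0.0)) for m, s in scores.items()]
    return rng.choices(list(scores), weights=weights, k=1)[0]

def z_score(history, key, gamma=0.5):
    """history: list of (state, legal_moves, played_move) triples."""
    hits = sum(played in green_moves(s, legal, key, gamma) for s, legal, played in history)
    n = len(history)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

rng = random.Random(0)
moves = ["a", "b", "c", "d"]
states = [f"s{i}" for i in range(200)]
flat = {m: 0.0 for m in moves}  # an engine that is indifferent between all moves
marked = [(s, moves, watermarked_choice(s, flat, "secret", rng=rng)) for s in states]
plain = [(s, moves, rng.choice(moves)) for s in states]
z_marked, z_plain = z_score(marked, "secret"), z_score(plain, "secret")
```

A larger `delta` makes detection easier but distorts the strategy more, which mirrors the detectability-quality tradeoff stated in the abstract.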

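2605.14277 2026-05-15 cs.AI cs.GT Updated version

Parallelizing Counterfactual Regret Minimization

Juho Kim, Tuomas Sandholm

Affiliations * CMU; Strategic Machine, Inc.; Strategy Robot, Inc.; Optimized Markets, Inc.

AI summary This paper studies how to parallelize counterfactual regret minimization (CFR) to speed up solving large imperfect-information games. The authors reframe CFR as a series of linear algebra operations, so that existing parallel computing techniques can be used to accelerate it. The approach applies to several CFR variants, including CFR+, discounted CFR, and predictive CFR. Experiments show that a GPU implementation is up to four orders of magnitude faster than an existing CPU implementation.

Comments This paper contains and extends ideas that were originally in arxiv:2408.14778

Details
English abstract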

Parallelization has played an instrumental role in the field of artificial intelligence (AI), drastically reducing the time taken to train and evaluate large AI models. In contrast to its impact in the broader field of AI, applying parallelization to computational game solving is relatively unexplored, despite its great potential. In this paper, we parallelize the family of counterfactual regret minimization (CFR) algorithms, which were central to important breakthroughs for solving large imperfect-information games. We present a generalized parallelization framework, reframing CFR as a series of linear algebra operations. Then, existing techniques for parallelizing linear algebra operations can be applied to accelerate CFR. We also describe how our technique can be applied to other tabular members of the CFR family of algorithms, including the state-of-the-art, such as CFR+, discounted CFR, and predictive variants of CFR. Experimentally, we show that our CFR implementation on a GPU is up to four orders of magnitude faster than Google DeepMind OpenSpiel's CFR implementations on a CPU.
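
As a small illustration of the "CFR as array operations" idea, the regret-matching step can be applied to every information set at once. This sketch covers only that one step, uses NumPy on a CPU, and assumes a padded `(num_infosets, max_actions)` layout with a legality mask; none of that layout is taken from the paper.

```python
import numpy as np

def regret_matching(regrets, legal):
    """Batch regret matching.

    regrets, legal: arrays of shape (num_infosets, max_actions); legal is a 0/1 mask
    that zeroes out padding for information sets with fewer actions.
    """
    pos = np.maximum(regrets, 0.0) * legal
    total = pos.sum(axis=1, keepdims=True)
    uniform = legal / legal.sum(axis=1, keepdims=True)
    # Normalize positive regrets; fall back to uniform where no regret is positive.
    return np.where(total > 0, pos / np.where(total > 0, total, 1.0), uniform)

regrets = np.array([[1.0, 3.0, -2.0],
                    [-1.0, -1.0, 0.0]])
legal = np.array([[1.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0]])  # the second infoset has only two legal actions
strategy = regret_matching(regrets, legal)
```

The same element-wise code would run unchanged on an array library with a GPU backend, which is the kind of reuse the abstract points to; the reach-probability and counterfactual-value passes are not shown.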

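2605.14269 2026-05-15 cs.CV cs.AI Updated version

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal

Affiliations * UNC Chapel Hill; FieldAI; NTU Singapore; AI2; Johns Hopkins University

AI summary Generating realistic human motion is one of the core challenges in video generation. To address the inability of existing reward signals to score motion realism reliably, this paper proposes PhyMotion, a structured motion reward grounded in physics simulation that evaluates generated human motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Experiments show that PhyMotion tracks human judgments more closely than existing rewards and that optimizing it in RL-based post-training improves motion realism while preserving overall generation quality.

Comments First two authors contributed equally, website: https://phy-motion.github.io/

Details
English abstract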

Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.

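2605.14267 2026-05-15 cs.CV cs.AI Updated version

Image Restoration via Diffusion Models with Dynamic Resolution

Yang Zheng, Wen Li, Zhaoqiang Liu

Affiliations * School of Computer Science and Engineering, University of Electronic Science and Technology of China

AI summary This work addresses the high computational cost of diffusion models in image restoration by proposing a restoration approach based on dynamic-resolution diffusion models. Projecting data into lower-dimensional subspaces reduces the computational burden. Building on existing pixel-space methods, the authors derive two new methods, SubDPS and SubDAPS, and an enhanced variant, SubDAPS++, that further improves restoration efficiency and quality. Experiments show that the approach outperforms recent diffusion-based restoration methods in most experimental settings across several datasets and tasks.

Comments Accepted by ICML 2026

Details
English abstract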

Diffusion models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the high-dimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder, they require repeated encoder-decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel-space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower-dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine-tune pre-trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel-space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM-based approaches in the majority of experimental scenarios. The code is available at https://github.com/StarNextDay/SubDAPS.git.

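2605.14266 2026-05-15 cs.AI cs.CY Updated version

Agentic AI Ecosystems in Higher Education: A Perspective on AI Agents to Emerging Inclusive, Agentic Multi-Agent AI Framework for Learning, Teaching and Institutional Intelligence

Vidya K Sudarshan, Anushka Sisodia, Reshma A Ramachandra, Sia Batra, Josephine Chong Leng Leng

Affiliations * College of Computing and Data Science, Nanyang Technological University, NTU, Singapore; DeepMed Ptd Ltd, India

AI summary This paper discusses the prospects for AI agents in higher education and proposes building an integrated multi-agent AI framework that supports coordinated teaching, learning, and institutional management. Current AI tools are mostly single-task and poorly integrated, so they struggle to meet the complex needs of the educational ecosystem. Through a literature analysis, the paper identifies gaps in cross-functional integration and inclusive design in existing research, and stresses the importance of coordinated, adaptive multi-agent systems for equitable and inclusive education.

Comments 50 pages, 14 figures, 3 tables

Details
English abstract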

Integration of artificial intelligence (AI) agents in higher education is transforming teaching, learning, and administrative processes. Although existing AI agents effectively support individual tasks, their implementation remains fragmented and inefficient for handling the complexity of educational institutions. This highlights a significant research gap: the lack of an integrated, ecosystem-level agentic multi-agent AI platform capable of coordinated planning, reasoning, and adaptive decision-making across multiple educational functions. This paper presents a forward-looking perspective on agentic multi-agent AI platforms in higher education, consisting of interconnected, autonomous, goal-driven agents that support learning, teaching, and institutional operations. It addresses timely and critical questions: Can agentic AI represent the next generation of intelligent systems in tertiary education? Can such agents collectively support seamless, coordinated operations across teaching, learning, and administrative support? To what extent can such systems foster inclusive and equitable learning for diverse learners with special educational needs? To ground this perspective, a thematic analysis of existing literature identifies four dominant themes: task-specific fragmented AI tools, the transition from single-agent to multi-agent systems, limited cross-functional integration, and insufficient focus on inclusivity and accessibility. Findings reveal a clear gap between current AI implementations and the needs of a holistic, learner-centered educational ecosystem. The paper synthesizes challenges and outlines future research directions for scalable, human-aligned, and inclusive agentic AI platforms. Its main contribution is the incorporation of inclusive learning perspectives, highlighting how a coordinated agentic multi-agent platform can support diverse learners through adaptive, multimodal interventions.

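2605.14261 2026-05-15 cs.AI cs.GT Updated version

Heuristic Pathologies and Further Variance Reduction via Uncertainty Propagation in the AIVAT Family of Techniques

Juho Kim, Tuomas Sandholm

Affiliations * CMU; Strategic Machine, Inc.; Strategy Robot, Inc.; Optimized Markets, Inc.

AI summary This paper studies how to evaluate an agent's performance in multiagent environments when the sample size is limited or trials are costly, the problem that the AIVAT family of variance-reduction techniques was proposed to address. It points out that there is little guidance on choosing AIVAT's heuristic value function or on handling its uncertainty, exposes pathologies that arise when gradient descent is applied to that function, and argues that the heuristic should be fixed before the evaluation data are observed. The authors also show how to propagate heuristic uncertainty to reduce variance further, possibly at the cost of unbiasedness. Experiments on a poker dataset show a reduction in the number of samples needed to reach statistical conclusions.

Details
English abstract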

How should an agent's performance in a multiagent environment be evaluated when there is a limited sample size or a high cost of running a trial? The AIVAT family of variance reduction techniques was proposed to address this challenge by introducing unbiased low-variance estimators of agents' expected payoffs. An important component of AIVAT is a heuristic value function that discriminates between potentially low- and high-value counterfactual histories. A notable gap in the literature is that there is little to no constraint or guideline on how the heuristic value function should be chosen or how uncertainty in its output should be handled. In our first contribution, we parameterize the heuristic value function to highlight AIVAT's potential vulnerabilities: a) the sample variance can be set pathologically low by directly applying gradient descent on the sample variance, and b) one can p-hack to draw a desired statistical conclusion via gradient descent/ascent on the test statistic. The main takeaway is that the heuristic value function should be fixed prior to observing the evaluation data! In our second contribution, we show how the heuristic uncertainty can be propagated to quantify the uncertainty of AIVAT estimates. It is then possible to further reduce the variance using inverse-variance weighted averaging, but AIVAT's unbiasedness guarantee may have to be sacrificed. In our experiments, we use a dataset of 10,000 poker hands to demonstrate our heuristic pathology and uncertainty results, with the latter yielding a 43.0% reduction in the number of samples (poker hands) needed to draw statistical conclusions.
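
The inverse-variance weighted averaging mentioned in the abstract is a standard formula and can be sketched directly. The function name and the example numbers are illustrative only; how the per-estimate variances are obtained from the heuristic's uncertainty is the paper's contribution and is not reproduced here.

```python
import numpy as np

def inverse_variance_mean(estimates, variances):
    """Combine independent estimates, weighting each by 1 / variance."""
    x = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    mean = np.sum(w * x) / np.sum(w)
    var = 1.0 / np.sum(w)  # variance of the weighted mean for independent estimates
    return mean, var

# Two hypothetical per-hand estimates with different uncertainties.
estimates, variances = [1.0, 3.0], [1.0, 3.0]
iv_mean, iv_var = inverse_variance_mean(estimates, variances)
plain_var = np.sum(variances) / len(variances) ** 2  # variance of the unweighted mean
```

With known variances the weighted mean has lower variance than the plain average (0.75 versus 1.0 in this example). When the variances are themselves estimated from data, the weights become random and can correlate with the estimates, which is one way the unbiasedness guarantee noted in the abstract can be lost.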

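2605.14258 2026-05-15 cs.LG cs.AI Updated version

Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology

Jesseba Fernando, Grigori Guitchounts

Affiliations * Network Science Institute, Northeastern University; Flagship Pioneering

AI summary This paper studies the dynamics of the residual stream in large language models and reveals a coupling between spectral geometry and network topology that emerges during training. Using eigendecomposition of full Jacobians, the authors find that training produces a monotonic spectral gradient through depth, accompanied by dimensional collapse, and that these properties are learned and not determined by the architecture. They further show that the topological position of graph communities in the network predicts whether the Jacobian amplifies or suppresses them, a relationship that is absent at initialization.

Details
English abstract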

Large language models are remarkably capable, yet how computation propagates through their layers remains poorly understood. A growing line of work treats depth as discrete time and the residual stream as a dynamical system, where each layer's nonlinear update has a local linear description. However, previous analyses have relied on scalar summaries or approximate linearizations, leaving the full spectral geometry of trained LLMs unknown. We perform full Jacobian eigendecomposition across three production-scale LLMs and show that training installs a monotonic spectral gradient through depth -- from non-normal, rotation-dominated early layers to near-symmetric late layers -- together with a cumulative low-rank bottleneck that funnels perturbations into a small fraction of the residual stream's effective dimensions. Our experiments reveal that this gradient and the dimensional collapse are learned rather than architectural, and are largely dissolved when structured non-normality is removed. We further show that the topological positioning of graph communities predicts whether the Jacobian amplifies or suppresses them, with the sign of the coupling determined by the local operator type, a relationship absent at initialization. These results map a learned spectral geometry in LLMs that links perturbation propagation and compression to the network's functional topology.
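
The kind of per-layer analysis described above can be sketched on a toy residual block. Everything here is an assumption for illustration: the block `x + W2 @ tanh(W1 @ x)`, the random weights, and the commutator-based non-normality score are stand-ins, not the paper's models or metrics.

```python
import numpy as np

def block_jacobian(x, W1, W2):
    """Jacobian of the toy residual update x + W2 @ tanh(W1 @ x) with respect to x."""
    h = np.tanh(W1 @ x)
    return np.eye(len(x)) + W2 @ np.diag(1.0 - h ** 2) @ W1

def non_normality(J):
    """Frobenius norm of J J^T - J^T J, scaled by ||J||_F^2; zero for normal matrices."""
    comm = J @ J.T - J.T @ J
    return np.linalg.norm(comm) / np.linalg.norm(J) ** 2

rng = np.random.default_rng(0)
d, hidden = 8, 16
x = rng.normal(size=d)
W1 = rng.normal(size=(hidden, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, hidden)) / np.sqrt(hidden)

J = block_jacobian(x, W1, W2)
eigvals = np.linalg.eigvals(J)  # complex in general; imaginary parts indicate rotation
sym = 0.5 * (J + J.T)           # symmetric part, which is a normal matrix
score_full, score_sym = non_normality(J), non_normality(sym)
```

Repeating this per layer and plotting the score against depth would give a crude analogue of the spectral gradient the abstract reports; for a real LLM the Jacobian would come from automatic differentiation and not from a closed form.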

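2605.14252 2026-05-15 cs.LG cs.AI Updated version

Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks

Kai Sun, Peibo Duan, Yongsheng Huang, Guowei Zhang, Benjamin Smith, Nanxu Gong, Levin Kuhlmann

Affiliations * Faculty of Information Technology, Monash University, Australia; School of Software, Northeastern University, China; Department of Medicine, National University of Singapore, Singapore

AI summary This paper addresses the performance gap between spiking neural networks (SNNs) and artificial neural networks (ANNs) with a new knowledge distillation method, Selective Alignment Knowledge Distillation (SeAl-KD). The method drops the usual assumption of uniform alignment across all timesteps: it identifies erroneous timesteps and corrects them specifically while preserving useful temporal dynamics, which improves SNN performance more effectively. Experiments show that it outperforms existing distillation methods on both static image and neuromorphic event-based datasets.

Details
English abstract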

Spiking neural networks (SNNs), which are brain-inspired and spike-driven, achieve high energy efficiency. However, a performance gap between SNNs and artificial neural networks (ANNs) still remains. Knowledge distillation (KD) is commonly adopted to improve SNN performance, but existing methods typically enforce uniform alignment across all timesteps, either from a teacher network or through inter-temporal self-distillation, implicitly assuming that per-timestep predictions should be treated equally. In practice, SNN predictions vary and evolve over time, and intermediate timesteps need not all be individually correct even when the final aggregated output is correct. Under such conditions, effective distillation should not force every timestep toward the same supervision target, but instead provide corrective guidance to erroneous timesteps while preserving useful temporal dynamics. To address this issue, we propose Selective Alignment Knowledge Distillation (SeAl-KD), which selectively aligns class-level and temporal knowledge by equalizing competing logits at erroneous timesteps and reweighting temporal alignment based on confidence and inter-timestep similarity. Extensive experiments on static image and neuromorphic event-based datasets demonstrate consistent improvements over existing distillation methods. The code is available at https://github.com/KaiSUN1/SeAl

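2605.14246 2026-05-15 cs.LG cs.AI cs.SY eess.SY Updated version

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

Yushen Liu, Yin-Jen Chen, Ziyi Chen, Tao Wang, Heng Huang, Xugui Zhou, Yanfu Zhang

Affiliations * University of Virginia; Google; University of Maryland, College Park; Stanford University; Louisiana State University; College of William and Mary

AI summary This work addresses safety-critical control under partial observability with a reinforcement learning method based on action-conditioned risk gating, which balances task performance against safety risk when observations are incomplete. The method builds a compact finite-history proxy state and learns an action-conditioned predictor of near-term safety violation; the predicted risk serves as a risk penalty during value learning and as a risk gate at decision time. Experiments on glucose regulation and safe navigation show a more favorable reward-cost balance than the baselines and lower runtime than belief-space planning.

Details
English abstract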

Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although belief-space planning provides a principled solution, maintaining and planning over beliefs can be computationally costly and sensitive to model specification in practical domains. We propose a lightweight risk-gated reinforcement learning approximation for risk-sensitive control under partial observability. The method constructs a compact finite-history proxy state and learns an action-conditioned predictor of near-term safety violation. This predicted candidate-action risk is used in two complementary ways: as a risk penalty during value learning, and as a decision-time gate that interpolates between optimistic and conservative ensemble value estimates. As a result, low-risk actions are evaluated closer to reward-seeking estimates, while high-risk actions are evaluated more conservatively. We evaluate the approach in two safety-critical partially observable domains: automated glucose regulation and safety-constrained navigation. Across adult and adolescent glucose-control cohorts, the method improves overall glycemic tradeoffs and substantially reduces runtime relative to a belief-space planning baseline. On Safety-Gym navigation benchmarks, it achieves a more favorable reward-cost balance than unconstrained RL and several standard safe-RL baselines. These results suggest that action-conditioned near-term risk can provide an effective local signal for approximate risk-sensitive POMDP control when full belief-space planning is impractical.
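
The two uses of predicted risk described above can be sketched in a few lines. The linear interpolation, the use of the ensemble maximum and minimum as the optimistic and conservative estimates, and the penalty weight `lam` are assumptions; the abstract does not give the exact functional forms.

```python
import numpy as np

def gated_values(q_ensemble, risk):
    """Decision-time gate.

    q_ensemble: (members, actions) value estimates; risk: (actions,) predicted
    near-term violation probability in [0, 1] for each candidate action.
    """
    optimistic = q_ensemble.max(axis=0)
    conservative = q_ensemble.min(axis=0)
    # Low-risk actions lean on the optimistic estimate, high-risk ones on the conservative.
    return (1.0 - risk) * optimistic + risk * conservative

def penalized_reward(reward, risk, lam=1.0):
    """Risk penalty applied to the reward used as the value-learning target."""
    return reward - lam * risk

q = np.array([[1.0, 2.0],
              [0.5, 4.0]])  # two ensemble members, two candidate actions
values_extreme = gated_values(q, np.array([0.0, 1.0]))
values_mid = gated_values(q, np.array([0.5, 0.5]))
action = int(np.argmax(values_extreme))
```

In the example, the second action has the higher optimistic value (4.0), but with predicted risk 1.0 it is scored at its conservative value (2.0).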

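2605.14242 2026-05-15 cs.LG cs.AI Updated version

Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment

Xiaohua Wang, Kai Yu, XuXiao Liang, Liang Wang, Chao Han

Affiliations * Artificial Intelligence Research Center, Bengbu Medical University; CHARMMIRAEL Biotech Co., Ltd

AI summary This work proposes an AI-based cardiotocography (CTG) model for fetal heart rate signal reconstruction, heart rate analysis, and variability assessment. The model is pre-trained on large-scale unlabeled data and fine-tuned on expert-reviewed data to improve signal reconstruction accuracy and the reliability of the analysis. The study introduces the Intersection Overlapping Labels (IOL) approach to validate fetal heart rate. It reports sensitivity and specificity of 89.13% and 87.78% for critical decelerations and 62.5% and 92.04% for accelerations, and AUC scores of 0.7214 and 0.9643 for periodicity and amplitude variation.

Details
English abstract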

The monitoring of fetal heart rate (FHR) and the assessment of its variability are crucial for preventing fetal compromise and adverse outcomes. However, traditional methods encounter limitations arising from equipment performance, data transmission, and subjective assessments by doctors. We have developed a tailored AI-based FHrCTG model specifically for FHR monitoring, which effectively mitigates noise interference and precisely reconstructs signals. Our model was pre-trained on a massive dataset consisting of 558,412 unlabeled data points and further refined using 7,266 expert-reviewed entries. To validate FHR, we introduced the Intersection Overlapping Labels (IOL) approach, which transforms rate analysis into categorical judgments. Testing revealed that our model demonstrates high sensitivity and specificity in detecting critical FHR decelerations (89.13% and 87.78%, respectively) and accelerations (62.5% and 92.04%, respectively). Furthermore, based on Fischer's criteria for clinical application, our model achieved impressive AUC scores of 0.7214 and 0.9643 for verifying FHR periodicity and amplitude variation, respectively.

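2605.14237 2026-05-15 cs.AI updated version

Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay

Xiaohua Wang, Kai Yu, XuXiao Liang, Liang Wang, Chao Han

Affiliations: Artificial Intelligence Research Center, Bengbu Medical University; CHARMMIRAEL Biotech Co., Ltd

AI summary: This paper presents the LOOP SKILL ENGINE, a system that targets the high failure rates and high compute costs of AI agents running repetitive tasks. Through one-shot recording of a task execution and deterministic replay, it reaches a 99% task success rate and cuts token usage by 99%. Its core method converts the tool-call trajectory recorded on the first run into a parameterized, deterministic execution plan; later runs replay that plan directly without calling the large language model again, which sharply reduces overhead and keeps execution predictable.

Comments 8 pages, 5 tables

Details
Abstract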

Deploying AI agents for repetitive periodic tasks exposes a critical tension: Large Language Models (LLMs) offer unmatched flexibility in tool orchestration, yet their inherent stochasticity causes unpredictable failures, and repeated invocations incur prohibitive token costs. We present the LOOP SKILL ENGINE, a system that achieves a combined 99% success rate and 99% token reduction for periodic agent tasks through a one-shot recording, deterministic replay paradigm. On its first run, the agent executes the task with full LLM reasoning while the system transparently intercepts and records the complete tool-call trajectory. A greedy length-descending template extraction algorithm then converts this recording into a parameterized, branch-free Loop Skill -- a deterministic execution plan that captures the task's functional intent while parameterizing time-dependent and result-dependent variables. All subsequent executions bypass the LLM entirely: the engine resolves template variables against real-time values and replays the tool sequence deterministically. We prove two theorems: (1) Replay Determinism -- the step sequence of a validated Loop Skill is invariant across all future executions; (2) Write Safety -- concurrent access to persistent configuration is serialized through reentrant locks and atomic file replacement. Across a benchmark of periodic agent tasks spanning intervals from 5 minutes to 24 hours, the Loop Skill Engine reduces monthly token consumption by 93.3%--99.98% and cuts execution latency by 8.7x while eliminating output non-determinism. A multi-layer degradation strategy guarantees that tasks never stall. We release the engine as part of the buddyMe open-source agent framework.
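
The record-then-replay idea in this abstract can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the buddyMe engine: the `ToolCall` layout, the `{date}` placeholder syntax, and the tool names `fetch_report` and `save` are hypothetical, and only time-dependent variables are handled (the abstract also mentions result-dependent ones).

```python
from typing import Any, Callable, Dict, List, Tuple

# A recorded step: (tool name, argument templates with "{var}" placeholders).
ToolCall = Tuple[str, Dict[str, str]]


def replay(skill: List[ToolCall],
           tools: Dict[str, Callable[..., Any]],
           variables: Dict[str, str]) -> List[Any]:
    """Replay a recorded, branch-free tool sequence without any LLM call."""
    results = []
    for name, arg_templates in skill:
        # Resolve placeholders such as "{date}" against current values.
        args = {k: v.format(**variables) for k, v in arg_templates.items()}
        results.append(tools[name](**args))
    return results


# Hypothetical skill extracted from a first run: fetch a report, then store it.
skill: List[ToolCall] = [
    ("fetch_report", {"day": "{date}"}),
    ("save", {"path": "reports/{date}.txt"}),
]
tools: Dict[str, Callable[..., Any]] = {
    "fetch_report": lambda day: f"report for {day}",
    "save": lambda path: f"saved to {path}",
}
```

Calling `replay(skill, tools, {"date": "2026-05-15"})` runs the same two steps in the same order on every invocation; only the resolved variable values change.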

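2605.14231 2026-05-15 cs.LG cs.AI cs.SD updated version

AudioMosaic: Contrastive Masked Audio Representation Learning

Hanxun Huang, Qizhou Wang, Xingjun Ma, Cihang Xie, Christopher Leckie, Sarah Erfani

Affiliations: School of Computing and Information Systems, The University of Melbourne, Australia; Baskin School of Engineering, University of California, Santa Cruz, USA; Institute of Trustworthy Embodied AI, Fudan University, China

AI summary: This paper proposes AudioMosaic, a contrastive-learning-based audio encoder for general audio understanding. The method builds positive pairs with structured time-frequency masking, which lowers memory consumption and supports efficient large-batch training. Compared with generative approaches, AudioMosaic learns more discriminative utterance-level representations, transfers well across datasets, domains, and acoustic conditions, and achieves state-of-the-art performance on several standard audio benchmarks.

Comments ICML2026

Details
Abstract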

Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available at https://github.com/HanxunH/AudioMosaic.

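2605.14224 2026-05-15 math.NA cs.AI cs.NA math.DS math.FA updated version

Wavelet-Based Observables for Koopman Analysis: An Extended Dynamic Mode Decomposition Framework

Cankat Tilki, Serkan Gugercin

Affiliations: Department of Mathematics, Virginia Tech

AI summary: This paper proposes a wavelet-transform-based approach to Koopman operator analysis. It introduces wavelet-based observables and proves that they are eigenfunctions of the Koopman semigroup over a specific Banach space. On this basis it derives closed-form expressions for the Koopman semigroup and its resolvent, and combines the observables with Extended Dynamic Mode Decomposition (EDMD) to obtain a new wavelet dynamic mode decomposition algorithm (cWDMD) for numerically approximating the action of the Koopman operator. The method is validated on two numerical examples, demonstrating its theoretical validity and application potential.

Details
Abstract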

We present an in-depth analysis of the Koopman semigroup via wavelet transform. Towards this goal, we start by introducing the wavelet-based observables and show that they are eigenfunctions of the Koopman semigroup when this semigroup is considered over the Banach space of continuous functions on a compact forward-invariant set endowed with the supremum norm. We then construct closed-form expressions of the action of the Koopman semigroup and its resolvent in terms of these observables. To approximate the action of Koopman semigroup numerically, we combine Extended Dynamic Mode Decomposition (EDMD) with the proposed wavelet-based observables leading to the Wavelet Dynamic Mode Decomposition via Continuous Wavelet Transform (cWDMD) algorithm. We validate our theoretical results on two numerical examples.

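2605.14220 2026-05-15 cs.LG cs.AI cs.CL updated version

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

Tianle Zhong, Neiwen Ling, Yifan Pi, Zijun Wei, Tianshu Yu, Geoffrey Fox, Peng Wu, Xiao Yu

Affiliations: ByteDance; The University of Virginia

AI summary: This paper studies the mismatch between the probability distributions produced during training and inference in LLM reinforcement learning, known as Training-Inference Mismatch (TIM). The authors propose a zero-mismatch diagnostic setting (VeXact) to isolate the effect of TIM and find that even tiny token-level numerical differences can cause training collapse. They further show that TIM changes the nature of the optimization problem, propose several remedies, and stress that TIM is a key systems-level factor in LLM RL stability rather than mere numerical noise.

Details
Abstract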

Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.

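2605.14218 2026-05-15 cs.AI physics.soc-ph updated version

Fusion-fission forecasts when AI will shift to undesirable behavior

Neil F. Johnson, Frank Yingjie Huo

Affiliations: Physics Department, The George Washington University

AI summary: This paper studies how ChatGPT-like AI systems can shift from beneficial to harmful behavior during use, and proposes a forecasting method based on fusion-fission group dynamics. By analyzing the competition between the conversation history and beneficial or harmful behaviors, the method can predict ahead of time when the AI's behavior will shift, without depending on a specific model or on stochastic sampling. Several independent tests validate the method, indicating broad applicability and high predictive accuracy.

Details
Abstract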

The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.

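2605.14217 2026-05-15 cs.LG cs.AI cs.CL cs.SY eess.SY updated version

PreFT: Prefill-only finetuning for efficient inference

Andrew Lanpouthakoun, Aryaman Arora, Zhengxuan Wu, Dhruv Pai, Ben Keigwin, Dan Jurafsky, Christopher Potts

Affiliations: Stanford University; Tilde Research

AI summary: This paper proposes PreFT, an efficient finetuning method that applies the adapter only during the prefill phase of inference, which raises serving throughput in multi-user settings. Compared with conventional parameter-efficient finetuning (PEFT), PreFT markedly improves throughput while preserving performance, with larger gains when many adapters are served. Experiments show that PreFT approaches or matches conventional PEFT on supervised finetuning and reinforcement learning tasks, supporting a more favorable accuracy-throughput tradeoff for personalized serving.

Details
Abstract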

Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs (1.9x the throughput when serving 512 adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
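
The core rule of prefill-only adaptation, applying the adapter to prefill positions and using the base model alone afterwards, can be illustrated with a deliberately simplified sketch. Everything here is an assumption for illustration: tokens are reduced to scalar activations, `adapter_delta` stands in for a LoRA- or ReFT-style update, and the function name is hypothetical; it does not reflect the released vLLM implementation.

```python
from typing import Callable, List


def apply_prefill_only(base_out: List[float],
                       inputs: List[float],
                       adapter_delta: Callable[[float], float],
                       num_prefill: int) -> List[float]:
    """Add the adapter's contribution only at prefill positions.

    Positions >= num_prefill (decode tokens) keep the base output unchanged,
    so no per-user adapter work is needed while decoding.
    """
    return [
        y + adapter_delta(x) if i < num_prefill else y
        for i, (x, y) in enumerate(zip(inputs, base_out))
    ]
```

For example, with four positions of which the first two are prefill, only the first two outputs receive the adapter's delta.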

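2605.14215 2026-05-15 cs.AI cs.LG q-bio.QM updated version

GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design

Noah Flynn

Affiliations: University of California, Berkeley, CA, USA

AI summary: Addressing the fact that genetic circuit design in synthetic biology still depends on expert experience, this study proposes GenCircuit-RL, a reinforcement learning framework whose hierarchical verification rewards decompose circuit correctness into five levels, combined with a four-stage curriculum that builds model capability step by step. The study also constructs SynBio-Reason, a benchmark of 4,753 circuits, to evaluate models on tasks ranging from code repair to de novo design. Experiments show that hierarchical verification and curriculum learning significantly improve success rates on functional reasoning tasks and yield topologically correct genetic circuit designs that generalize well.

Comments Link: https://icml.cc/virtual/2026/poster/61789

Details
Abstract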

Genetic circuit design remains a laborious, expert-driven process despite decades of progress in synthetic biology. We study this problem through code generation: models produce Python code in pysbol3 to construct genetic circuits in the Synthetic Biology Open Language (SBOL), a formal representation that supports automated verification. We introduce GenCircuit-RL, a reinforcement learning framework built around hierarchical verification rewards that decompose correctness into five levels, from code execution to task-specific topological checks, and a four-stage curriculum that shifts optimization pressure from code generation to functional reasoning. We also introduce SynBio-Reason, a benchmark of 4,753 circuits spanning six canonical circuit types and nine tasks from code repair to de novo design, with held-out biological parts for out-of-distribution evaluation. Hierarchical verification improves task success on functional reasoning tasks by 14 to 16 percentage points over binary rewards, and curriculum learning is required for strong design performance. The resulting models generate topologically correct circuits, generalize to novel biological parts, and rediscover canonical designs from the synthetic biology literature.
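
One way to read "hierarchical verification rewards" is as an ordered chain of checks in which later levels only count once earlier ones pass. The sketch below illustrates that reading under stated assumptions: the fractional scoring rule, the three toy checks (the paper describes five levels), and the dictionary-based "design" are hypothetical stand-ins, not the paper's reward definition.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def hierarchical_reward(candidate: T,
                        checks: Sequence[Callable[[T], bool]]) -> float:
    """Return the fraction of verification levels passed, in order.

    Checks are ordered from the most basic (e.g. "does the code run?") to the
    most specific; evaluation stops at the first failure.
    """
    passed = 0
    for check in checks:
        if not check(candidate):
            break
        passed += 1
    return passed / len(checks)


# Toy verifiers (hypothetical): a "design" is a dict of boolean outcomes.
checks = [
    lambda d: d.get("executes", False),
    lambda d: d.get("valid_document", False),
    lambda d: d.get("topology_ok", False),
]
```

Unlike a binary reward, this gives partial credit to a design that executes but fails a later check, which is the kind of graded signal the abstract contrasts with binary rewards.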

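2605.14212 2026-05-15 cs.AI updated version

MetaAgent-X: Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

Yaolun Zhang, Yujie Zhao, Nan Wang, Yiran Wu, Jiayu Chang, Yizhao Chen, Qingyun Wu, Jishen Zhao, Huazheng Wang

Affiliations: Oregon State University; UCSD; Amazon AGI; Pennsylvania State University; AG2AI, Inc.

AI summary: This paper proposes MetaAgent-X, an end-to-end reinforcement learning framework that aims to overcome the decoupling of design and execution in existing automatic multi-agent systems (MAS) and to generate self-designing, self-executing agent workflows. It jointly optimizes design and execution and introduces hierarchical rollout and stagewise co-evolution to improve training stability and system adaptability. Experiments show that MetaAgent-X significantly outperforms existing methods on multiple benchmarks, supporting the effectiveness and practicality of end-to-end training for automatic MAS.

Details
Abstract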

Automatic multi-agent systems aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, which creates a frozen-executor ceiling and leaves the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co-evolution to improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.

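2605.14210 2026-05-15 cs.LG cs.AI updated version

Towards Fine-Grained and Verifiable Concept Bottleneck Models

Yingying Fang, Haijie Xu, Shuang Wu, Mariathasan Anish, Guang Yang

Affiliations: Bioengineering Department; Imperial-X, Imperial College London, London, UK; Thoughtworks AI Labs, Singapore

AI summary: This paper proposes a fine-grained and verifiable Concept Bottleneck Model (CBM) framework that addresses the inability of existing CBMs to verify whether predicted concepts correspond to the correct visual evidence. By linking each concept to localized visual evidence, the method supports direct inspection of where and how concepts are encoded, improving interpretability and reliability. Experiments show that it substantially improves transparency while maintaining predictive performance, and it establishes a concept-level human-model interaction mechanism, laying groundwork for more reliable and clinically usable concept-based learning systems.

Comments 10 pages, 4 figures

Details
Abstract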

Concept Bottleneck Models (CBMs) offer interpretable alternatives to black-box predictors by introducing human-relatable concepts before the final output. However, existing CBMs struggle to verify whether predicted concepts correspond to the correct visual evidence, limiting their reliability. We propose a fine-grained CBM framework that grounds each concept in localized visual evidence, enabling direct inspection of where and how concepts are encoded. This design allows users to interpret predictions and verify that the model learns intended concepts rather than spurious correlations. Experiments on medical imaging benchmarks show that our learned concept space is information-complete and achieves predictive performance comparable to standard CBMs, while substantially improving transparency. Unlike post-hoc attribution methods, our framework validates both the presence and correctness of concept representations, bridging interpretability with verifiability. Our approach enhances the trustworthiness of CBMs and establishes a principled mechanism for human-model interaction at the concept level, paving the way toward more reliable and clinically actionable concept-based learning systems.

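2605.14202 2026-05-15 cs.SE cs.AI updated version

LLM-Based Robustness Testing of Microservice Applications: An Empirical Study

Hrushitha Goud Tigulla, Marco Vieira

Affiliations: College of Computing and Informatics, University of North Carolina at Charlotte, Charlotte, USA

AI summary: This empirical study examines LLM-based robustness testing of microservice applications. Targeting microservice systems with different architectures, it applies seven prompt strategies and three open-source LLMs to generate test cases, and finds that prompt strategy affects test diversity more than model size does. The study proposes two new strategies, Guided and GuidedFewShot, which incorporate domain knowledge to improve test coverage; GuidedFewShot achieves high failure-mode coverage on both systems while keeping cross-model similarity low. The experiments show that taxonomy rules alone are not enough to steer LLMs toward effective tests, and that concrete examples are essential for models to understand input mutations.

Details
Abstract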

Malformed, missing, or boundary-value inputs in microservice APIs can cascade across dependent services, threatening reliability. Robustness testing systematically exercises such inputs to expose server-side failures, but generating diverse, effective tests remains challenging. Large Language Models can generate such tests from API specifications; however, it is unknown whether different models and prompt strategies produce diverse failure sets or converge on the same failures. We report a controlled experiment applying 7 prompt strategies to 3 open-source LLMs (14B-70B parameters) targeting 2 architecturally distinct microservice systems: one Java monolingual (6 services, 9 failure modes) and one polyglot (27 services, 14 failure modes), yielding 38 valid runs and 663 generated tests. We find that prompt strategy explains more variation in diversity than model size: a Structured prompt collapses diversity entirely, while a single model varied across three prompt strategies achieves complete failure-mode coverage on one system, outperforming any multi-model ensemble under a fixed prompt. We introduce two strategies, Guided and GuidedFewShot, that embed a mutation taxonomy from prior robustness testing research as domain context. GuidedFewShot achieves the highest single-run coverage on both systems (5 of 9 and 8 of 14 failure modes) while maintaining low cross-model similarity. A key lesson is that taxonomy rules alone are insufficient: LLMs cannot distinguish key-absent from value-empty mutations without concrete examples. Findings replicate across both systems.
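
The distinction between key-absent and value-empty mutations mentioned in this abstract is easy to state concretely. The sketch below is an illustration only: the function names and the sample payload are hypothetical and are not taken from the study's mutation taxonomy.

```python
import copy
from typing import Any, Dict


def key_absent(payload: Dict[str, Any], key: str) -> Dict[str, Any]:
    """Return a copy of the payload with the field removed entirely."""
    mutated = copy.deepcopy(payload)
    mutated.pop(key, None)
    return mutated


def value_empty(payload: Dict[str, Any], key: str) -> Dict[str, Any]:
    """Return a copy of the payload with the field present but empty."""
    mutated = copy.deepcopy(payload)
    mutated[key] = ""
    return mutated


# Hypothetical request body for an order endpoint.
order = {"user": "alice", "qty": 2}
```

Here `key_absent(order, "user")` yields `{"qty": 2}`, while `value_empty(order, "user")` yields `{"user": "", "qty": 2}`. A service may validate the two differently, which is why they count as separate robustness tests.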

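2605.14192 2026-05-15 cs.CL cs.AI updated version

Why Retrieval-Augmented Generation Fails: A Graph Perspective

Kai Guo, Xinnan Dai, Zhibo Zhang, Nuohan Lin, Shenglai Zeng, Jie Ren, Haoyu Han, Jiliang Tang

Affiliations: Michigan State University; Massachusetts Institute of Technology

AI summary: From a graph perspective, this paper analyzes why retrieval-augmented generation (RAG) still produces wrong answers in many cases, and shows how retrieved information influences generation. By building attribution graphs, the authors find marked structural differences in information flow between correct and incorrect predictions. Based on these findings they propose a graph-based error detection framework and show how intervening on the attribution-graph structure can improve RAG generation quality.

Details
Abstract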

Retrieval-Augmented Generation (RAG) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence. However, RAG systems still produce incorrect answers in many cases. Why RAG fails despite having access to external information remains poorly understood. We present a model-internal study of retrieval-augmented generation that examines how retrieved evidence influences answer generation. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a graph-based, circuit-level view of how external evidence is integrated into the model's reasoning process. Across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. Building on these findings, we develop a graph-based error detection framework that uses attribution-graph topology features. Furthermore, we show that attribution graphs enable targeted interventions. By reinforcing question-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors.

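2605.13789 2026-05-15 cs.LG cs.AI q-bio.BM updated version

ENSEMBITS: an alphabet of protein conformational ensembles

Kaiwen Shi, Carlos Oliver

Affiliations: Department of Computer Science, Vanderbilt University; Center for AI in Protein Dynamics, Vanderbilt University; Department of Molecular Physiology and Biophysics, Vanderbilt University

AI summary: This paper proposes Ensembits, a new tokenizer of protein conformational ensembles, to address the inability of existing tokenizers to capture dynamic conformational changes. Using a Residual VQ-VAE and a frame distillation objective, the method encodes geometric features and dynamic variation across conformations, enabling precise description of protein motion states. Ensembits performs strongly on several tasks, including RMSF prediction, function annotation, and mutation-effect prediction, while using far less data than static tokenizers, and provides a dynamics vocabulary for protein language modeling and design.

Details
Abstract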

Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs only capture local geometry of static structures, and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits addresses challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation-invariant encoding of variable-size ensembles, and conquering sparsity in dynamics data. Trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction, and is the strongest standalone structural tokenizer on a token-conditioned ANOVA test on per-residue motion amplitude. Ensembits further matches or exceeds static tokenizers on EC, GO, binding site/affinity prediction, and zero-shot mutation-effect prediction despite using far less pretraining data. Notably, the distillation objective enables Ensembits to predict dynamics tokens from a single predicted structure, which alleviates dynamics data sparsity. As the field moves from static structure prediction toward ensemble generation, Ensembits offers the discrete vocabulary needed to bring dynamics into protein language modeling and design.

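2605.13773 2026-05-15 cs.SE cs.AI cs.LO updated version

(How) Do Large Language Models Understand High-Level Message Sequence Charts?

Mohammad Reza Mousavi

Affiliations: Department of Informatics, King's College London

AI summary: This paper studies how well large language models (LLMs) understand the formal semantics of High-Level Message Sequence Charts (HMSCs). Having three mainstream LLMs complete 129 tasks related to HMSC semantics, it finds that they understand basic semantic concepts fairly well but perform poorly on complex semantic reasoning involving abstraction, composition, and traces and labelled transition systems. The study reveals that current LLMs still have significant limitations when handling software design models with rigorous formal semantics.

Details
Abstract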

Large Language Models (LLMs) are being employed widely to automate tasks across the software development life-cycle. It is, however, unclear whether these tasks are performed consistently with respect to the semantics of the artefacts being handled. This question is particularly under-researched concerning architectural design specification. In this paper, we address this question for High-Level Message Sequence Charts (HMSCs). These are visual models with a rigorous formal semantics that have been used for various purposes, including as a foundation for Sequence Diagrams in the Unified Modelling Language (UML). We examine whether LLMs "understand" the semantics of HMSCs by examining three LLMs (Gemini-3, GPT-5.4, and Qwen-3.6) on how they perform 129 semantic tasks ranging from querying basic semantic constructs in HMSCs (i.e., events and their ordering) to semantic-preserving abstractions and compositions, and calculating the set of traces and trace-equivalent labelled transition systems. The results show that LLMs only have a modest understanding of the formal semantics of HMSCs (ca. 52% overall accuracy), with great variability across different semantic concepts: while LLMs seem to understand the basic semantic concepts of MSCs (ca. 88% accuracy), they struggle with semantic reasoning in tasks involving abstraction and composition (ca. 36% accuracy) and traces and LTSs (ca. 42% accuracy). In particular, all three LLMs struggle with the notions of co-region and explicit causal dependencies and never employed them in semantic-preserving transformations.

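2605.13369 2026-05-15 cs.CL cs.AI cs.LG updated version

Query-Conditioned Test-Time Self-Training for Large Language Models

Chaehee Song, Minseok Seo, Yeeun Seong, Doyi Kim, Changick Kim

Affiliations: School of Electrical Engineering, KAIST; Graduate School of Green Growth and Sustainability, KAIST

AI summary: This paper proposes QueST, a query-conditioned test-time self-training framework that adjusts the parameters of a large language model during inference according to the input query, improving adaptation to the specific problem. The core idea is to use structural information implicit in the input query to generate related problem-solution pairs, which serve as supervision for parameter-efficient fine-tuning at test time, so that query-specific optimization needs no external data. Experiments show that QueST outperforms existing test-time optimization methods on several mathematical and scientific reasoning benchmarks, supporting the effectiveness and practicality of the approach.

Comments 17 pages, 7 figures

Details
Abstract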

Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test-time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test-time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self-supervised objectives that lack query-specific alignment. In this work, we propose Query-Conditioned Test-Time Self-Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem--solution pairs. Based on this, QueST generates such query-conditioned pairs and uses them as supervision for parameter-efficient fine-tuning at test time. The adapted model is then used to produce the final answer, enabling query-specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA-Diamond scientific reasoning benchmark, QueST consistently outperforms strong test-time optimization baselines. These results demonstrate that query-conditioned self-training is an effective and practical paradigm for test-time adaptation in LLMs. Code is available at https://chssong.github.io/Query-Conditioned-TTST/.

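2605.13362 2026-05-15 cs.MA cs.AI cs.DC cs.GT econ.TH updated version

Constitutional Governance in Metric Spaces

Ehud Shapiro, Nimrod Talmon

Affiliations: London School of Economics and Weizmann Institute of Science; Ben-Gurion University

AI summary: This paper studies computational mechanisms for egalitarian self-governance in metric spaces and proposes a constitutional governance framework that integrates proposal, deliberation, amendment, and consensus into a polynomial-time protocol. The framework assigns a metric space, an aggregation rule, and a supermajority threshold to each amendable component; members vote with their ideal elements and submit public proposals that carry supermajority support, which yields constitutional consensus. The study applies the framework to seven canonical settings and proves that the generalised median has good incentive compatibility at the majority threshold, offering a comprehensive solution for the constitutional governance of digital communities and organisations.

Details
Abstract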

Computational social choice and algorithmic decision theory offer rich aggregation theory but no comprehensive process for egalitarian self-governance: aggregation, deliberation, amendment, and consensus are each considered in isolation, with key metric-space aggregators being NP-hard. Here, we propose constitutional governance in metric spaces, integrating these stages into a coherent polynomial-time protocol for constitutional governance. The constitution assigns, per amendable component including itself, a metric space, aggregation rule, and supermajority threshold. Amendments proceed by members voting with their ideal elements, followed by members submitting public proposals carrying supermajority public support under the revealed votes. Public proposals can be sourced from deliberation among members, vote aggregation, or AI mediation. The constitutional rule adopts a supported proposal with positive maximal score, if there is one, else retains the status quo. With Constitutional Consensus, a community can run the constitutional governance protocol on members' personal computing devices (e.g., smartphones), achieving digital sovereignty. We focus on the utility of the generalised median, prove that at majority threshold no misreport weakly dominates sincere voting, and study the compromise gap between best peak and unconstrained optimum. We instantiate the framework to seven canonical settings -- electing officers, setting rates, allocating budgets, ranking priorities, selecting boards, drafting bylaws, and amending the constitution. By unifying metric-space aggregation, reality-aware social choice, supermajority amendment, constitutional consensus, deliberative coalition formation, and AI mediation, this work delivers a comprehensive solution to the constitutional governance of digital communities and organisations.
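
The "best peak" mentioned in this abstract can be illustrated by restricting a generalised-median search to the voters' own ideal elements, which keeps the computation polynomial (a quadratic number of distance evaluations). This is a sketch under assumptions: the sum-of-distances objective and the function name `best_peak` are illustrative choices, and the paper's exact aggregation rules may differ.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_peak(votes: Sequence[T], dist: Callable[[T, T], float]) -> T:
    """Return the submitted vote minimising total distance to all votes.

    Searching only among the votes avoids optimising over the whole metric
    space, which can be NP-hard for some spaces.
    """
    return min(votes, key=lambda c: sum(dist(c, v) for v in votes))
```

For instance, with votes `[1, 2, 10]` on the real line and absolute difference as the metric, the total distances are 10, 9, and 17, so the best peak is 2. The gap between this restricted optimum and the unconstrained one is what the abstract calls the compromise gap.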

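2605.13276 2026-05-15 cs.AI cs.RO updated version

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

Yucheng Guo, Yongjian Guo, Zhong Guan, Wen Huang, Haoran Sun, Haodong Yue, Xiaolong Xiang, Shuai Di, Zhen Sun, Luqiao Wang, Junwu Xiong, Yicheng Gong

Affiliations: Tsinghua University; Peking University; Tianjin University; Beihang University; JDT AI Infra

AI summary: With the rapid development of embodied AI, Vision-Language-Action (VLA) models perform well at multimodal perception and task execution, but applying reinforcement learning (RL) to them in large-scale distributed environments runs into system bottlenecks, mainly because high-fidelity physical simulation competes for resources with the high VRAM and bandwidth demands of deep learning. To address this, the paper proposes D-VLA, a high-concurrency, low-latency distributed RL framework. Through designs such as "Plane Decoupling" and an asynchronous "Swimlane" pipeline, it separates training data from model optimization and fully overlaps sampling, inference, gradient computation, and parameter distribution, significantly improving training throughput and sampling efficiency for large-scale VLA models.

Details
Abstract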

The rapid evolution of Embodied AI has enabled Vision-Language-Action (VLA) models to excel in multimodal perception and task execution. However, applying Reinforcement Learning (RL) to these massive models in large-scale distributed environments faces severe systemic bottlenecks, primarily due to the resource conflict between high-fidelity physical simulation and the intensive VRAM/bandwidth demands of deep learning. This conflict often leaves overall throughput constrained by execution-phase inefficiencies. To address these challenges, we propose D-VLA, a high-concurrency, low-latency distributed RL framework for large-scale embodied foundation models. D-VLA introduces "Plane Decoupling," physically isolating high-frequency training data from low-frequency weight control to eliminate interference between simulation and optimization. We further design a four-thread asynchronous "Swimlane" pipeline, enabling full parallel overlap of sampling, inference, gradient computation, and parameter distribution. Additionally, a dual-pool VRAM management model and topology-aware replication resolve memory fragmentation and optimize communication efficiency. Experiments on benchmarks like LIBERO show that D-VLA significantly outperforms mainstream RL frameworks in throughput and sampling efficiency for billion-parameter VLA models. In trillion-parameter scalability tests, our framework maintains exceptional stability and linear speedup, providing a robust system for high-performance general-purpose embodied agents.

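2605.13213 2026-05-15 cs.AI updated version

Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning

Hao Zhou, Tiru Wu, Yan Jiang, Wanqi Zhou, Junxing Hu, Ai Han

AI summary: This paper studies the vulnerability of multi-modal multi-agent systems (MM-MAS) to adversarial attacks and proposes HAM³, a hierarchical attack framework that coordinates attacks at the perception, communication, and reasoning layers, perturbing input data, communication content and topology, and the agents' reasoning processes respectively. Experiments show an attack success rate of up to 78.3% on the GQA benchmark, with reasoning-layer attacks being especially effective and able to lead multiple agents to consistent errors, offering a useful reference for building more robust and interpretable multi-agent systems.

Comments Accepted to CVPR 2026

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Details
Abstract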

Multi-modal multi-agent systems (MM-MAS) have gained increasing attention for their capacity to enable complex reasoning and coordination across diverse modalities. As these systems continue to expand in scale and functionality, investigating their potential vulnerabilities has become increasingly important. However, existing studies on adversarial attacks in multi-agent systems primarily focus on isolated agents or unimodal settings, leaving the vulnerabilities of MM-MAS largely underexplored. To bridge this gap, we introduce HAM$^{3}$, a Hierarchical Attack framework for multi-modal multi-agent systems that decomposes attacks into three interconnected layers. Specifically, at the perception layer, HAM$^{3}$ mounts attacks by perturbing visual inputs, textual inputs, and their fused visual-textual representations. At the communication layer, it performs communication-level attacks that corrupt message content and interaction topology, such as manipulating shared context or communication links to distort collective information flow. At the reasoning layer, it conducts reasoning-level attacks that interfere with each agent's cognitive pipeline, biasing reasoning trajectories and ultimately compromising final decisions. We evaluate HAM$^{3}$ on the GQA benchmark through multi-agent systems built on distinct reasoning paradigms including ReAct, Plan-and-Solve, and Reflexion. Experiments demonstrate that our framework achieves an Attack Success Rate of up to 78.3%, with reasoning-layer attacks being the most effective. More than half of the successful attacks lead multiple agents to produce consistent errors. These findings offer valuable insights for building more robust and interpretable multi-agent intelligence.

2605.13137 2026-05-15 cs.IR cs.AI Updated version

LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving

Guoxiong Gao, Zeming Sun, Jiedong Jiang, Yutong Wang, Jingda Xu, Peihao Wu, Bryan Dai, Bin Dong

Affiliations * School of Mathematical Sciences, Peking University; IQuest Research; Research Institute for Mathematical Sciences, Kyoto University; Westlake Institute for Advanced Study, Westlake University; Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University; Center for Machine Learning Research, Peking University; Center for Intelligent Computing, Great Bay Institute for Advanced Study, Great Bay University; Zhongguancun Academy

AI summary LeanSearch v2 is a global premise retrieval system for Lean 4 theorem proving that aims to find the multiple relevant lemmas in the mathematical library needed to support a proof. It has two modes: standard mode performs high-accuracy single-query retrieval through an embedding-reranker pipeline, while reasoning mode recovers global premises through iterative sketch-retrieve-reflect cycles. Experiments show that LeanSearch v2 substantially outperforms existing systems on several benchmarks and effectively raises the proof success rate.

Details
English abstract

Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof -- a task we call global premise retrieval. Existing tools address adjacent problems: semantic search engines find individual declarations matching a query, while premise-selection systems predict useful lemmas one tactic step at a time. Neither recovers the full premise set an entire theorem requires. We present LeanSearch v2, a two-mode retrieval system for this task. Its standard mode applies a hierarchy-informalized Mathlib corpus with an embedding-reranker pipeline, achieving state-of-the-art single-query retrieval without domain-specific fine-tuning (nDCG@10 of 0.62 vs. 0.53 for the next-best system). Its reasoning mode builds on standard mode as its retrieval substrate, targeting global premise retrieval through iterative sketch-retrieve-reflect cycles. On a 69-query benchmark of research-level Mathlib theorems, reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates, outperforming strong reasoning retrieval systems (38.0%) and premise-selection baselines (9.3%) on the same benchmark. In a controlled downstream evaluation with a fixed prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success (20% vs. 16% for the next-best system and 4% without retrieval), confirming that retrieval quality propagates to proof generation. We have open-sourced all code, data, and benchmarks. Code and data: https://github.com/frenzymath/LeanSearch-v2 . The standard mode is publicly available with API access at https://leansearch.net/ .
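
The nDCG@10 figures quoted above (0.62 vs. 0.53) come from a standard ranking metric. A minimal sketch of how such a score is computed follows, assuming graded relevance labels are available for each returned item and taking the ideal ordering over the returned list only. The paper's exact evaluation protocol is not given in the abstract, and the function names here are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of its ideal reordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order the retriever returned the items (hypothetical data).
score = ndcg_at_k([0, 1, 0, 0])
```

A ranking that places its only relevant item second rather than first scores 1/log2(3), roughly 0.63, which gives a feel for what a mean of 0.62 implies.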

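2605.13126 2026-05-15 cs.LG cs.AI Updated version

MLGIB: Multi-Label Graph Information Bottleneck for Expressive and Robust Message Passing

Chaokai Wu, Haofu Shi, Ningxuan Ma, Jianghong Ma, Xiaofeng Zhang

AI summary This paper studies over-squashing in graph neural networks (GNNs), where information is compressed during message passing, in the multi-label setting, and proposes MLGIB, a multi-label graph information bottleneck method. By constructing a Markovian dependence space and deriving tractable variational bounds, the method balances expressiveness and robustness, preserving predictive label signals while suppressing irrelevant label noise. Experiments show that MLGIB outperforms existing methods on multiple benchmark datasets, supporting its effectiveness and generality.

Details
English abstract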

Graph Neural Networks (GNNs) suffer from over-squashing in deep message passing, where information from exponentially growing neighborhoods is compressed into fixed-dimensional representations. We show that this issue becomes a distinct failure mode in multi-label graphs: neighboring nodes often share only limited labels while differing across many irrelevant ones, causing predictive signals to be diluted by noisy label information. To address this challenge, we propose the Multi-Label Graph Information Bottleneck (MLGIB), which formulates multi-label message passing as constrained information transmission under irrelevant label noise. MLGIB balances expressiveness and robustness by preserving predictive label signals while suppressing irrelevant noise. Specifically, it constructs a Markovian dependence space and derives tractable variational bounds, where the lower bound maximizes mutual information with target labels and the upper bound constrains redundant source information. These bounds lead to an end-to-end label-aware message-passing architecture. Extensive experiments on multiple benchmarks demonstrate consistent improvements over existing methods, validating the effectiveness and generality of the proposed framework.

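2605.13095 2026-05-15 cs.CR cs.AI cs.CY cs.LG Updated version

Watermarking Should Be Treated as a Monitoring Primitive

Toluwani Aremu, Nils Lukas, Jie Zhang

Affiliations * MBZUAI; A*STAR

AI summary This paper examines the use of watermarking for provenance, attribution, and safety monitoring in generative models, and points out that current watermark evaluations usually consider only adversarial attacks on individual samples, overlooking observers' ability to infer entity-level information by aggregating signals across many outputs. It introduces an observer-based threat model, shows that even zero-bit watermarking enables attribution in multi-key settings, and discusses how watermark design can give rise to external monitoring and how that risk can be mitigated. The paper identifies a fundamental dual-use tension between attribution and monitoring and argues that watermark evaluation should go beyond per-sample robustness to account for aggregation and observer capabilities.

Comments 12 pages, 5 figures

Details
English abstract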

Watermarking is widely proposed for provenance, attribution, and safety monitoring in generative models, yet is typically evaluated only under adversaries who attempt to evade detection or induce false positives at the level of individual samples. We argue that watermarking should be treated as a monitoring primitive, and that internal monitoring is unavoidable given per-entity attribution keys and messages, as well as detector access. We introduce an observer-based threat model in which observers can aggregate watermark signals across outputs to infer entity-level information, showing that even zero-bit watermarking enables attribution under multi-key settings. We further show that external monitoring can emerge over time from persistent, key-dependent statistical structure, although this depends on watermark design and may be mitigated by distribution-preserving or undetectable schemes. Our findings reveal a fundamental dual-use tension between attribution and monitoring, motivating evaluation of watermarking beyond per-sample robustness to account for aggregation and observer-based capabilities.
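
The aggregation argument can be illustrated with a generic statistical sketch. It assumes each output yields an independent detector z-score for a given key and combines them with Stouffer's method. This is not the paper's observer model, which the abstract does not specify, and the numbers are hypothetical.

```python
import math


def combined_z(z_scores):
    """Stouffer's method: combine independent per-output z-scores into one."""
    if not z_scores:
        raise ValueError("need at least one z-score")
    return sum(z_scores) / math.sqrt(len(z_scores))


# 25 outputs, each far too weak to flag on its own (hypothetical values).
per_output = [0.8] * 25
aggregate = combined_z(per_output)
```

Under these assumptions the combined statistic grows with the square root of the number of outputs, which is why per-sample robustness alone says little about what an observer with many samples can infer.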

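2605.13084 2026-05-15 cs.CL cs.AI Updated version

Does language matter for spoken word classification? A multilingual generative meta-learning approach

Batsirayi Mupamhi Ziki, Louise Beyers, Ruan van der Merwe

Affiliations * Bytefuse

AI summary This paper studies how language affects few-shot spoken word classification and proposes a multilingual approach based on generative meta-learning. Using the Generative Meta-Continual Learning algorithm, models are trained on English, German, French, and Catalan; the multilingual model performs best, but the performance differences between models are small. The study also finds that the hours of unique training data are a stronger indicator of performance than the number of languages.

Details
English abstract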

Meta-learning has been shown to have better performance than supervised learning for few-shot monolingual spoken word classification. However, the meta-learning approach remains under-explored in multilingual spoken word classification. In this paper, we apply the Generative Meta-Continual Learning algorithm to spoken word classification. The generative nature of this algorithm makes it viable for use in applications, and the meta-learning aspect promotes generalisation, which is crucial in a multilingual setting. We train monolingual models on English, German, French, and Catalan, a bilingual model on English and German, and a multilingual model on all four languages. We find that although the multilingual model performs best, the differences in performance between models are unexpectedly small. We also find that the hours of unique data seen during training seem to be a stronger performance indicator than the number of languages included in the training data.

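2605.13050 2026-05-15 cs.CL cs.AI Updated version

Context Training with Active Information Seeking

Zeyu Huang, Adhiguna Kuncoro, Qixuan Feng, Jiajun Shen, Lucio Dery, Arthur Szlam, Marc'Aurelio Ranzato

Affiliations * The University of Edinburgh

AI summary This paper studies how active information seeking can improve the ability of large language models to adapt to new tasks. Unlike conventional closed-loop methods that rely on the model's internal knowledge, the authors equip context optimizers with Wikipedia search and browser tools to actively acquire external information. A search-based training procedure that maintains and prunes multiple candidate contexts yields substantial gains on low-resource translation, health scenarios, and reasoning-heavy tasks, while showing good data efficiency and generalization.

Comments Preprint

Details
English abstract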

Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.

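2605.12968 2026-05-15 cs.LG cs.AI cs.CL cs.LO Updated version

Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2

Hisashi Miyashita, Mgnite Inc

Affiliations * Mgnite Inc

AI summary This study asks whether large language models internally encode ontological relations in a formally verifiable algebraic structure. It proposes Algebraic Ontology Projection (AOP), which projects hidden states onto the finite field F2 and, using only 42 relational pairs as algebraic keys, achieves up to 93.33% zero-shot inclusion accuracy. It also introduces the Semantic Crystallisation (SC) metric, which quantifies how well a model satisfies the F2 constraints, and shows that system prompts play a key role in preventing logical collapse in the late layers, offering a new algebraic perspective on the logical structure of large language models.

Details
English abstract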

Do large language models internally encode ontological relations in a formally verifiable algebraic structure? We introduce Algebraic Ontology Projection (AOP), which projects LLM hidden states into the Galois Field F2 under Liskov Substitution Principle constraints, using only 42 relational pairs as algebraic keys. AOP achieves up to 93.33% zero-shot inclusion accuracy on unseen concept pairs (Gemma-2 Instruct with optimized prompt), with consistent 86.67% accuracy observed across multiple model families -- with no model tuning, but through prompt alone. This algebraic structure is strongly layer-dependent. We introduce Semantic Crystallisation (SC), a metric that quantifies F2 constraint satisfaction relative to a random baseline and predicts zero-shot accuracy without held-out data. System prompts act as algebraic boundary conditions: only their combination with instruction tuning prevents Late-layer Collapse -- a systematic degradation of logical consistency in the final layers, observed in 7 of 10 conditions. These findings reframe forward computation as an iterative process of algebraic organisation, and open a path toward LLMs whose logical structure is not merely approximated, but formally accessible.

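2605.12856 2026-05-15 cs.AI cs.SI Updated version

Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

Ali Al-Lawati, Nafis Tripto, Abolfazl Ansari, Jason Lucas, Suhang Wang, Dongwon Lee

Affiliations * The Pennsylvania State University

AI summary This paper studies the detection of hidden malicious intent in multi-agent systems and proposes BOT-MOD, a moderation framework based on agent intent rather than content features. Through multi-turn dialogue guided by Gibbs-sampling-based hypotheses, the method progressively identifies an agent's true intent and separates benign from malicious behavior. Experiments on a dataset built from Moltbook show that the method identifies intent accurately across a range of adversarial settings while keeping the false positive rate low, suggesting a new approach to intent-aware moderation in open multi-agent environments.

Details
English abstract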

The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content-based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce BOT-MOD (BOT-MODeration), a moderation framework that grounds detection in agent intent rather than traditional content-level signals. BOT-MOD identifies the underlying intent by engaging with the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses diverse benign and malicious behaviors based on actual community structures, posts, and comments. Results demonstrate that BOT-MOD reliably identifies agent intent across a range of adversarial configurations, while maintaining a low false positive rate on benign behaviors. This work advances the foundation for scalable, intent-aware moderation of agents in open multi-agent environments.

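2605.12394 2026-05-15 cs.LG cs.AI Updated version

Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

Hari K. Prakash, Charles H Martin

Affiliations * University of California San Diego; Data Science and Engineering; Calculation Consulting

AI summary This paper proposes a new method based on Random Matrix Theory for detecting overfitting during the training of deep learning models without access to training or test data. The method randomizes each layer's weight matrix, fits its empirical spectral distribution, and identifies outlier eigenvalues that violate self-averaging, called Correlation Traps. The study finds that during the "anti-grokking" phase of long-horizon grokking, these traps form and grow as test accuracy falls, revealing a structural signature of overfitting; similar traps also appear in some large language models, which may indicate a risk of overfitting.

Comments 24 pages, 24 figures

Details
English abstract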

Training Neural Networks (NNs) without overfitting is difficult; detecting that overfitting is difficult as well. We present a novel Random Matrix Theory method that detects the onset of overfitting in deep learning models without access to train or test data. For each model layer, we randomize each weight matrix element-wise, $\mathbf{W} \to \mathbf{W}^{\mathrm{rand}}$, fit the randomized empirical spectral distribution with a Marchenko-Pastur distribution, and identify large outliers that violate self-averaging. We call these outliers Correlation Traps. During the onset of overfitting, which we call the "anti-grokking" phase in long-horizon grokking, Correlation Traps form and grow in number and scale as test accuracy decreases while train accuracy remains high. Traps may be benign or may harm generalization; we provide an empirical approach to distinguish between them by passing random data through the trained model and evaluating the JS divergence of output logits. Our findings show that anti-grokking is an additional grokking phase with high train accuracy and decreasing test accuracy, structurally distinct from pre-grokking through its Correlation Traps. More broadly, we find that some foundation-scale LLMs exhibit the same Correlation Traps, indicating potentially harmful overfitting.

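2605.12350 2026-05-15 cs.LG cs.AI Updated version

A New Technique for AI Explainability using Feature Association Map

Sayantani Ghosh, Amit Kumar Das, Amlan Chakrabarti

Affiliations * DBS Bank; Institute of Engineering & Management; University of Calcutta

AI summary This paper proposes FAMeX, a new explainable-AI algorithm based on the Feature Association Map (FAM), for explaining the decisions of AI systems. The method builds a graph of associations between features and analyzes feature importance from a graph-theoretic perspective, aiming to expose the basis of a model's decisions more accurately. Experiments show that FAMeX outperforms existing explainability algorithms such as PFI and SHAP on classification tasks.

Details
English abstract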

Lack of transparency in AI systems poses challenges in critical real-life applications. It is important to be able to explain the decisions of an AI system to ensure trust in the system. Explainable AI (XAI) algorithms play a vital role in achieving this objective. In this paper, we propose a new algorithm for explaining AI systems, FAMeX (Feature Association Map based eXplainability). The proposed algorithm is based on a graph-theoretic formulation of the feature set termed the Feature Association Map (FAM). The modelling is founded on associations between features. The proposed FAMeX algorithm has been found to be better than the competing XAI algorithms, Permutation Feature Importance (PFI) and SHapley Additive exPlanations (SHAP). Experiments conducted with eight benchmark algorithms show that FAMeX is able to gauge feature importance in the context of classification better than the competing algorithms. This indicates that FAMeX is a promising algorithm for explaining the predictions of an AI system.

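2605.11853 2026-05-15 cs.LG cs.AI cs.CL Updated version

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

Sijia Li, Yuchen Huang, Zifan Liu, Yanping Li, Jingjing Fu, Li Zhao, Jiang Bian, Ling Zhang, Jun Zhang, Rui Wang

Affiliations * Hong Kong University of Science and Technology; Microsoft Research Asia

AI summary This paper proposes GEAR, a granularity-adaptive advantage reweighting method that aims to improve reinforcement learning for LLM agents. Using self-distillation, GEAR reweights the trajectory-level advantage with token- and segment-level signals for finer-grained credit assignment. By comparing the policy with a teacher model, it dynamically adjusts the granularity of credit regions, improving policy updates over long-horizon trajectories. Experiments show that GEAR outperforms existing methods on several mathematical reasoning and tool-use benchmarks, with the largest gains on benchmarks where the baseline is weaker.

Details
English abstract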

Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.

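2605.11611 2026-05-15 cs.AI Updated version

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

Jianghan Shen, Siqi Luo, Xinyu Cheng, Jing Xiong, Yue Li, Jiyao Liu, Jiashi Lin, Yirong Chen, Junjun He

Affiliations * Nanjing University; Shanghai Artificial Intelligence Laboratory; Peking University; University of Hong Kong

AI summary This paper proposes CuSearch, a curriculum rollout sampling framework for improving the training of agentic retrieval-augmented generation (RAG) systems under Reinforcement Learning with Verifiable Rewards (RLVR). The method uses search depth to adjust the rollout sampling strategy dynamically, focusing on deeper-search trajectories that contain more retrieval decision points and provide denser supervision. Experiments show that CuSearch improves performance across different models and retrieval frameworks, providing an effective, annotation-free optimization for RLVR training.

Details
English abstract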

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth and are therefore not equally informative: deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows over training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to this shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold as deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, achieving up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training.
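
The two SDGA variants described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: trajectories are dicts with a hypothetical `depth` field counting search calls, the greedy operator simply keeps the deepest rollouts within a fixed budget, and the phase rule advances an integer threshold once enough deeper rollouts appear. The paper's actual operator, tie-breaking, and threshold schedule are not given in the abstract.

```python
def sdga_select(trajectories, budget):
    """Keep `budget` rollouts, preferring those with the greatest search depth.

    sorted() is stable, so rollouts of equal depth keep their batch order.
    """
    ranked = sorted(trajectories, key=lambda t: t["depth"], reverse=True)
    return ranked[:budget]


def advance_threshold(depths, threshold, min_count):
    """Raise the curriculum threshold by one once enough rollouts exceed it."""
    deeper = sum(1 for d in depths if d > threshold)
    return threshold + 1 if deeper >= min_count else threshold


# Hypothetical batch of four rollouts with their search depths.
batch = [
    {"id": "a", "depth": 1},
    {"id": "b", "depth": 3},
    {"id": "c", "depth": 2},
    {"id": "d", "depth": 3},
]
selected_ids = [t["id"] for t in sdga_select(batch, budget=2)]
```

Because the selection is recomputed per batch, the effective depth of the kept rollouts rises on its own as the policy learns to search more, which is the implicit curriculum the abstract attributes to SDGA-Auto.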

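2605.11459 2026-05-15 cs.RO cs.AI cs.CV cs.LG Updated version

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

Yanyan Zhang, Chaoda Song, Vikash Singh, Xinpeng Li, Kai Ye, Zhe Hu, Zhongzhu Pu, Yu Yin, Vipin Chaudhary

Affiliations * Case Western Reserve University; The Hong Kong Polytechnic University; Tsinghua University; InspireOmni AI

AI summary Vision-Language-Action (VLA) models show strong flexibility and generalization, but most existing models use a single-frame observation paradigm and cannot perceive temporal dynamics, so their performance drops sharply in non-stationary environments. This paper proposes Pace-and-Path Correction, a training-free method that applies a closed-form correction to chunked-action VLA models at inference time to compensate for dynamic changes. Starting from a single quadratic cost function, joint optimization yields two orthogonally decomposed channels, one compressing the execution pace and one adjusting the spatial path, which markedly improves task success rates in dynamic environments.

Details
English abstract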

Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
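
The orthogonal split into a pace component and a path component can be pictured with ordinary vector projection. The sketch below is only that: it decomposes a hypothetical perceived offset into the part along the planned direction and the part perpendicular to it. It is not the paper's closed-form operator, whose quadratic cost is not stated in the abstract, and all names are illustrative.

```python
import math


def decompose(offset, planned_direction):
    """Split `offset` into components along and orthogonal to the planned direction."""
    norm = math.sqrt(sum(c * c for c in planned_direction))
    if norm == 0.0:
        raise ValueError("planned direction must be non-zero")
    unit = [c / norm for c in planned_direction]
    along = sum(o * u for o, u in zip(offset, unit))
    pace = [along * u for u in unit]             # component along the plan
    path = [o - p for o, p in zip(offset, pace)]  # orthogonal remainder
    return pace, path


# Hypothetical 2-D case: the plan moves along x, the target drifted by (3, 4).
pace, path = decompose([3.0, 4.0], [2.0, 0.0])
```

The two parts sum back to the original offset and have zero inner product, which is the sense in which the two channels do not interfere with each other.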

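2605.11410 2026-05-15 cs.AI Updated version

What Do EEG Foundation Models Capture from Human Brain Signals?

Ling Tang, Qian Chen, Jilin Mei, Houshi Xu, Quanshi Zhang, Jing Shao, Na Zou, Xia Hu, Dongrui Liu

Affiliations * Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University; Fudan University; Tongji University; University of Houston

AI summary This study examines what EEG foundation models learn from human brain signals and how their representations relate to traditional hand-crafted features. Using layer-wise ridge regression, cross-covariance subspace erasure, and related methods, it finds that EEG foundation models perform well on several clinical tasks and that their advantage derives mainly from frequency-domain features combined with several other families of hand-crafted features. The study also reveals differences in model performance across tasks and identifies a clear direction for future feature discovery.

Details
English abstract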

Clinical electroencephalogram (EEG) analysis rests on a hand-crafted feature catalog refined over decades, e.g., band power, connectivity, complexity, and more. Modern EEG foundation models bypass this catalog, learn directly from raw signals via self-supervised pretraining, and match or outperform feature-engineered baselines on most clinical benchmarks. Whether the two representations align is an open question, which we decompose into three sub-questions: what does the model learn, what does the model use, and how much can be explained. We answer them with layer-wise ridge probing, LEACE-style cross-covariance subspace erasure, and a transparent classifier benchmarked against a random-feature baseline. The audit covers three foundation models (CSBrain, CBraMod, LaBraM), five clinical tasks (MDD, Stress, ISRUC-Sleep, TUSL, Siena), and a 6-family 63-feature lexicon. Of the 945 (model, task, feature) units, 648 (68.6%) are representation-causal and 199 (21.1%) are encoded-only. Across tasks, 50 features qualify as universal candidates with strong support (all three architectures RC) in two or more tasks. Frequency-domain features dominate, but the other five families each contribute substantial causal mass. Confirmed features recover, on average, 79.3% of the foundation model's advantage over the random baseline, with a clean task gradient (MDD ≈ 0.99 down to Stress ≈ 0.56): tasks near ceiling are almost fully recovered by the lexicon, while harder tasks leave a non-trivial residual that pinpoints a concrete target for future concept discovery.

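2605.10664 2026-05-15 cs.CL cs.AI Updated version

Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

Diancheng Kang, Zheyuan Liu, Ningshan Ma, Yue Huang, Zhaoxuan Tan, Meng Jiang

Affiliations * Southern University of Science and Technology; University of Notre Dame; Massachusetts Institute of Technology

AI summary This paper studies how to control language model behavior more effectively in dialogue and proposes a new activation steering method that addresses the cumulative failure of conventional methods in long conversations. The authors identify KV-cache contamination as the main cause of degraded steering and propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from the influence of the system prompt on self-attention and applies them with token-level gating. Experiments show that the method preserves trait control while substantially improving coherence and trait expression over long conversations.

Comments 23 pages, 5 figures. This paper proposes GCAD, an attention-level activation steering method for more stable multi-turn behavior control

Details
English abstract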

Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.

2605.10310 2026-05-15 cs.AI cs.CY cs.HC q-bio.NC Updated version

Positive Alignment: Artificial Intelligence for Human Flourishing

Ruben Laukkonen, Seb Krier, Chloé Bakalar, Shamil Chandaria, Morten Kringelbach, Adam Elwood, Daniel Ford, Fernando Rosas, Maty Bohacek, Matija Franklin, Nenad Tomašev, Stephanie Chan, Verena Rieser, Roma Patel, Michael Levin, Arun Rao

Affiliations * Department of Psychiatry, University of Oxford; Flourishing Intelligence Program, Centre for Eudaimonia and Human Flourishing, Linacre College, University of Oxford; Google DeepMind; LIFE; OpenAI; Anthropic; University of California, Los Angeles; Aily Labs; Stanford University; Tufts University; Positive AI Labs; Department of Informatics, University of Sussex; Department of Brain Sciences, Imperial College London

AI summary This paper proposes the concept of Positive Alignment: developing AI systems that actively support human and ecological flourishing while remaining safe and cooperative. Unlike existing alignment research, which focuses on safety and risk prevention, positive alignment stresses that systems should be pluralistic, polycentric, context-sensitive, and user-authored, and seeks to address many current alignment problems by cultivating virtues and promoting human well-being. The paper also sets out technical directions and design principles across the lifecycle of large language models and agents, with the aim of accommodating disagreement and supporting decentralized governance.

Details
English abstract

Existing alignment research is dominated by concerns about safety and preventing harm: safeguards, controllability, and compliance. This paradigm of alignment parallels early psychology's focus on mental illness: necessary but incomplete. What we call Positive Alignment is the development of AI systems that (i) actively support human and ecological flourishing in a pluralistic, polycentric, context-sensitive, and user-authored way while (ii) remaining safe and cooperative. It is a distinct and necessary agenda within AI alignment research. We argue that several existing failures of alignment (e.g., engagement hacking, loss of human autonomy, failures in truth-seeking, low epistemic humility, error correction, lack of diverse viewpoints, and being primarily reactive rather than proactive) may be better addressed through positive alignment, including cultivating virtues and maximizing human flourishing. We highlight a range of challenges, open questions, and technical directions (e.g., data filtering and upsampling, pre- and post-training, evaluations, collaborative value collection) for different phases of the LLM and agents lifecycle. We end with design principles for promoting disagreement and decentralization through contextual grounding, community customization, continual adaptation, and polycentric governance; that is, many legitimate centers of oversight rather than one institutional or moral chokepoint.

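2605.09825 2026-05-15 cs.LG cs.AI Updated version

Pretraining large language models with MXFP4 on Native FP4 Hardware

Musa Cim, Poovaiah Palangappa, Miro Hodak, Ravi Dwivedula, Meena Arunachalam, Mahmut Taylan Kandemir

Affiliations * The Pennsylvania State University; Advanced Micro Devices, Inc.

AI summary This paper studies the training instability that arises when pretraining large language models with MXFP4 quantization on native FP4 hardware. Through controlled experiments that progressively enable FP4 in forward propagation, activation gradients, and weight gradients, it finds that quantizing the weight gradients is the main cause of degraded convergence. It further shows that deterministic Hadamard rotations restore stable optimization whereas randomized methods do not, indicating that the instability stems from structured micro-scaling errors along sensitive gradient paths rather than from insufficient stochasticity. The experiments run on AMD Instinct MI355X GPUs, so the conclusions are verified without relying on software emulation.

Details
English abstract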

Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.
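
For readers unfamiliar with the two ingredients named above, the following is a software sketch of MXFP4-style block quantization and of a deterministic Hadamard rotation. It assumes the OCP Microscaling layout as commonly described (blocks of 32 values sharing one power-of-two scale, elements on the FP4 E2M1 grid) and simplifies rounding to nearest with ties toward the smaller magnitude. It says nothing about which GEMM operands the paper rotates or how the hardware path rounds; all function names are illustrative.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes
BLOCK = 32  # elements that share one power-of-two scale


def quantize_block(block):
    """Fake-quantize one block: shared power-of-two scale times an E2M1 value."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    _, exp = math.frexp(amax)     # amax = m * 2**exp with 0.5 <= m < 1
    scale = 2.0 ** (exp - 1 - 2)  # floor(log2(amax)) minus E2M1's max exponent (2)
    out = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)  # saturate at the largest E2M1 magnitude
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))  # ties: smaller value
        out.append(math.copysign(nearest * scale, v))
    return out


def quantize(values):
    """Apply quantize_block to consecutive blocks of BLOCK elements."""
    out = []
    for i in range(0, len(values), BLOCK):
        out.extend(quantize_block(values[i:i + BLOCK]))
    return out


def hadamard_rotate(block):
    """Orthonormal Walsh-Hadamard transform of a power-of-two-length block."""
    x = list(block)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in x]


spike = [8.0] + [0.0] * 31
rotated = hadamard_rotate(spike)  # every entry becomes 8 / sqrt(32)
```

The rotation spreads an isolated large value evenly across the block before the shared scale is chosen, and applying the same transform again undoes it. Whether that lowers quantization error depends on the data, so the sketch makes no claim beyond showing the mechanics.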

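2605.09038 2026-05-15 cs.AI Updated version

SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

Jinchao Hu, Meizhi Zhong, Kehai Chen, Min Zhang

Affiliations * School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; TikTok Inc, Beijing

AI summary This paper proposes SearchSkill, a framework for teaching large language models to use search tools more effectively, particularly in open-domain question answering. The method plans queries explicitly through a bank of reusable search skills: at each step the model first selects a skill, then generates a search or answer action conditioned on it. The skill bank evolves and is refined from failure patterns observed during training, improving search efficiency and answer accuracy. Experiments show that SearchSkill raises exact match on several knowledge-intensive QA benchmarks and improves search behavior, with fewer copied initial queries, more focused queries, and more accurate answers under a limited search budget.

Details
Abstract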

Teaching language models to use search tools is not only a question of whether they search, but also of whether they issue good queries. This is especially important in open-domain question answering, where broad or copied queries often waste retrieval budget and derail later reasoning. We propose SearchSkill, a framework that makes query planning explicit through reusable search skills. At each step, the model first selects a skill, then generates a search or answer action conditioned on the selected skill card. The skill inventory itself is not fixed: SearchSkill maintains an evolving SkillBank, expands or refines it from recurrent failure patterns, and reconstructs affected trajectories before supervised training. The resulting two-stage SFT recipe aligns training with the inference-time protocol of skill selection followed by skill-grounded execution. Across open-source and closed-source models, SearchSkill improves exact match on knowledge-intensive QA benchmarks and yields better retrieval behavior, including fewer copied first queries, more atomic hop-focused queries, and more correct answers within a small search budget. These results suggest that explicit skill-conditioned query planning is a lightweight alternative to treating search as an undifferentiated action.

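2605.09027 2026-05-15 cs.CL cs.AI cs.LG cs.MA Updated version

GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Alexandre Le Mercier, Chris Develder, Thomas Demeester

Affiliations * IDLab–T2K, Ghent University–imec

AI summary In multi-agent systems, a single deceptive agent can wreck the performance of the whole collective and bypass its defenses. To address gaps in existing adversarial-robustness evaluation, the paper proposes the GAMBIT benchmark, with three evaluation modes and two independent scores for assessing imposter detectors, focusing on how well they adapt under distribution shift and novel attacks. Built on chess, GAMBIT introduces a generalizable adaptive deceptive agent and provides 27,804 labeled instances. It shows that zero-shot evaluation can be misleading against adaptive adversaries and demonstrates the effectiveness of fast recalibration in adversarial systems.

Comments 46 pages, 16 figures

Details
Abstract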

In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.

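2605.09018 2026-05-15 cs.NE cs.AI cs.LG Updated version

Evolutionary Ensemble of Agents

Zongmin Yu, Liu Yang

Affiliations * National University of Singapore

AI summary This paper proposes EvE, an evolutionary ensemble framework that organizes existing, highly capable coding agents into a co-evolving system for algorithmic discovery. The method fixes the base agent substrate and focuses on evolving the guidance and skills that shape agent behavior. Two co-evolving populations (functional code solvers and agent guidance states) compete in a synchronous race, and each agent's Elo rating is updated according to its marginal contribution to the current solver state. Experiments show that, on a research bottleneck in In-Context Operator Networks (ICON), EvE autonomously discovered a reliable rescale-then-interpolate mechanism, demonstrating that adaptive agent ensembles can break through performance ceilings in complex codebases.

Details
Abstract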

We introduce Evolutionary Ensemble (EvE), a decentralized framework that organizes existing, highly capable coding agents into a live, co-evolving system for algorithmic discovery. Rather than reinventing the wheel within the "LLMs as optimizers" paradigm, EvE fixes the base agent substrate and focuses entirely on evolving the cumulative guidance and skills that dictate agent behaviors. By maintaining two co-evolving populations, namely functional code solvers and agent guidance states, the system evaluates agents through a synchronous race, updating their empirical Elo ratings based on the marginal gains they contribute to the current solver state. When applied to a research bottleneck in In-Context Operator Networks (ICON), EvE autonomously discovered a robust rescale-then-interpolate mechanism that enables reliable example-count generalization. Crucially, controlled ablations reveal the absolute necessity of stage-dependent agent adaptation to navigate the shifting search landscapes of complex codebases. Compared to variants driven by a fixed initial agent or even a frozen "best-evolved" agent, EvE uniquely avoids phase mismatch, demonstrating that organizing agents into a self-revising ensemble is the fundamental driver for breaking through static performance ceilings.
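
The rating step described above can be sketched with the standard Elo formula. The abstract does not say how marginal gains are turned into match outcomes, so this sketch assumes that, within one race, the agent with the larger marginal gain wins each pairwise comparison; `elo_race_update`, the `gains` mapping, and the K-factor of 32 are hypothetical.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_race_update(ratings, gains, k=32.0):
    """Update ratings after one synchronous race.

    ratings: agent name -> current Elo rating
    gains:   agent name -> marginal gain contributed in this race
    Assumption: the larger marginal gain wins each pairing; equal gains draw.
    """
    names = list(ratings)
    delta = {n: 0.0 for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if gains[a] == gains[b]:
                s_a = 0.5
            else:
                s_a = 1.0 if gains[a] > gains[b] else 0.0
            e_a = expected_score(ratings[a], ratings[b])
            delta[a] += k * (s_a - e_a)
            delta[b] += k * ((1.0 - s_a) - (1.0 - e_a))
    return {n: ratings[n] + delta[n] for n in names}
```

Because every pairing adds and subtracts the same amount, the sum of ratings is conserved, which keeps ratings comparable across races.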

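2605.08851 2026-05-15 cs.CV cs.AI cs.LG Updated version

Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport

Jialin Li, Zhuo Zhang, Yue Cao, Guipeng Lan, Jiabao Wen, Shuai Xiao, Jiachen Yang

Affiliations * School of Electrical and Information Engineering, Tianjin University, Tianjin, China

AI summary To address the shortage of data for stenosis detection in coronary angiography, this study proposes a geometrically constrained stenosis editing method based on entropic optimal transport. By modeling localized editing as an entropic optimal transport problem guided by geometric information, the method achieves more precise structural control and image generation. Experiments show that the generated images substantially improve stenosis detection, with relative gains of 27.8% on a public dataset and 23.0% on a multi-center dataset.

Comments Accepted to ICML 2026

Details
Abstract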

The scarcity of high-quality imaging data for coronary angiography (CAG) stenosis limits the clinical translation of automated stenosis detection. Synthetic stenosis data provides a practical avenue to augment training sets, improving data quality, diversity, and distributional coverage, and enhancing detection precision and generalization. However, diffusion-based editing commonly relies on soft guidance in a noise-initialized reverse process, offering limited pixel-level precision and structure preservation. We propose the OT-Bridge Editor, which reframes localized editing as a constrained entropic optimal transport (OT) problem and leverages geometric information to steer the generation path, enabling stronger geometric control. Extensive experiments show that our synthesized angiograms consistently improve downstream stenosis detection, yielding substantial relative gains of 27.8% on the public ARCADE benchmark and 23.0% on our multi-center dataset, supported by consistent qualitative results.

2605.08374 2026-05-15 cs.AI Updated version

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Junwei Liao, Haoting Shi, Ruiwen Zhou, Jiaqian Wang, Shengtao Zhang, Wei Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Bo Tang, Weinan Zhang, Muning Wen

Affiliations * Shanghai Jiao Tong University; Shanghai Innovation Institute; National University of Singapore; Xidian University; University of Science and Technology of China; MemTensor (Shanghai) Technology Co., Ltd.

AI summary This paper proposes MemQ, a memory-agent framework that brings Q-learning into a memory system built on a provenance DAG, addressing the failure of existing methods to account for dependencies between memories. MemQ updates memory Q-values with TD($λ$) eligibility traces and propagates credit backward through the provenance DAG, so that dependencies among memories are valued more accurately. Experiments show that MemQ achieves superior generalization and runtime learning on six benchmarks from different domains, with the largest gains on multi-step tasks.

Comments 22 pages, 11 figures (containing 43 individual image panels total)

Details
Abstract

Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($λ$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(γλ)^d$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP (EC-MDP), whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7 pp) and smallest on single-step classification (+0.77 pp) where single-step updates already suffice. We further study how $γ$ and $λ$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao-ai/MemQ.
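
The depth-based decay $(γλ)^d$ can be made concrete with a small sketch. It assumes the provenance DAG is stored as a mapping from each memory to the memories retrieved when it was created, that depth means the shortest path length back from the newest memory, and that a single TD error is shared along the trace; `credit_weights`, `td_update`, and the dictionary layout are hypothetical, not the MemQ code.

```python
from collections import deque


def credit_weights(parents, new_memory, gamma, lam):
    """Return {memory: (gamma * lam) ** d}, where d is the shortest DAG depth
    from new_memory back to each ancestor (breadth-first traversal)."""
    depth = {new_memory: 0}
    weights = {new_memory: 1.0}
    queue = deque([new_memory])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in depth:  # first visit is the shortest depth
                depth[parent] = depth[node] + 1
                weights[parent] = (gamma * lam) ** depth[parent]
                queue.append(parent)
    return weights


def td_update(q_values, weights, td_error, alpha=0.1):
    """Apply one TD error to every memory on the trace, scaled by its weight."""
    for memory, weight in weights.items():
        q_values[memory] = q_values.get(memory, 0.0) + alpha * td_error * weight
    return q_values
```

For example, with `parents = {"m3": ["m2"], "m2": ["m1"]}`, `gamma = 0.9`, and `lam = 0.5`, memory `m2` receives weight 0.45 and `m1` receives 0.45 squared, so credit shrinks with structural distance rather than with elapsed time.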

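2605.08278 2026-05-15 cs.LG cs.AI cs.CR Updated version

Trapping Attacker in Dilemma: Examining Internal Correlations and External Influences of Trigger for Defending GNN Backdoors

Fan Yang, Binyan Xu, Di Tang, Kehuan Zhang

Affiliations * The Chinese University of Hong Kong; Sun Yat-Sen University

AI summary This paper studies how to defend graph neural networks (GNNs) against backdoor attacks and proposes a new defense named PRAETORIAN. The method analyzes internal correlations within potential trigger subgraphs and the influence of external nodes, detecting abnormally injected structures and identifying trigger nodes with disproportionate influence. Experiments show that PRAETORIAN sharply reduces the attack success rate while keeping clean accuracy high, and remains effective against a range of adaptive attacks, forcing attackers into an unfavorable trade-off between efficacy and detectability.

Details
Abstract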

GNNs have become a standard tool for learning on relational data, yet they remain highly vulnerable to backdoor attacks. Prior defenses often depend on inspecting specific subgraph patterns or node features, and thus can be circumvented by adaptive attackers. We propose PRAETORIAN, a new defense that targets intrinsic requirements of effective GNN backdoors rather than surface-level cues. Our key observation is that flipping a victim node's prediction requires substantial influence on the victim: attackers tend to either inject many trigger nodes or rely on a small set of highly influential ones. Building on this observation, PRAETORIAN (i) analyzes internal correlations within potential trigger subgraphs to detect abnormally large injected structures, and (ii) quantifies external node influence to identify triggers with disproportionate impact. Across our evaluations, PRAETORIAN reduces the average attack success rate (ASR) to 0.55% with only a 0.62% drop in clean accuracy (CA), whereas state-of-the-art defenses still yield an average ASR of >20% and a CA drop of >3% under the same conditions. Moreover, PRAETORIAN remains effective against a range of adaptive attacks, forcing adversaries to either inject many trigger nodes to achieve high ASR (>80%), which incurs a >10% CA drop, or preserve CA at the cost of limiting ASR to 18.1%. Overall, PRAETORIAN constrains attackers to an unfavorable trade-off between efficacy and detectability.

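2605.05686 2026-05-15 cs.AI Updated version

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Qiyao Liang, Risto Miikkulainen, Ila Fiete

Affiliations * Massachusetts Institute of Technology; University of Texas Austin; Cognizant

AI summary This study examines two failure modes that can arise during language-model generation, knowledge conflict and confident hallucination, and gives a unified geometric account of them in hidden-state space. Learned facts form attractor basins; conflict arises when working memory disrupts convergence to the correct attractor, while hallucination arises when no corresponding attractor exists and the hidden state drifts freely. Using a geometric margin metric, the study separates correct recall from hallucination, verifies that this structure does not depend on fine-tuning, and finds that the fraction of confident hallucinations follows a scaling law and grows with model scale.

Comments 9 pages, 6 figures, plus appendices

Details
Abstract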

Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes: conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task (entity identifiers mapped to unique codes, with PM installed via LoRA adapters) where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin, the hidden state's distance to the nearest memorized basin, reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar{\Delta})$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it, and this erasure worsens with scale.
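
The geometric margin described above, the distance from a hidden state to the nearest memorized basin, can be sketched in a few lines. The sketch assumes each basin is summarized by a single center vector, that distance is Euclidean, and that refusal is a fixed threshold on the margin; none of these details are specified in the abstract, and `geometric_margin` and `should_refuse` are hypothetical names.

```python
import math


def geometric_margin(hidden_state, basin_centers):
    """Distance from a hidden state to the nearest basin center.

    Assumption: each memorized fact is represented by one center vector
    and distance is Euclidean.
    """
    return min(math.dist(hidden_state, center) for center in basin_centers)


def should_refuse(hidden_state, basin_centers, threshold):
    """Flag a likely hallucination when no basin is within `threshold`."""
    return geometric_margin(hidden_state, basin_centers) > threshold
```

A state near a center, such as `(1.0, 0.0)` against centers `(0.0, 0.0)` and `(10.0, 0.0)`, has margin 1.0 and passes a threshold of 2.0; a state at `(5.0, 5.0)` sits far from both centers and is flagged, regardless of how confident the output distribution looks.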

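2605.04215 2026-05-15 cs.LG cs.AI Updated version

Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs

Michael Rottoli, Subhankar Roy, Stefano Paraboschi

AI summary Diffusion-based large language models (D-LLMs) offer highly parallel generation and superior GPU utilization, but their fixed response length leads to wasted computation or truncated output. This paper proposes the Predict-then-Diffuse framework, in which an Adaptive Response Length Predictor (AdaRLP) first estimates the optimal response length for an input and diffusion generation then runs at that length, reducing redundant computation while preserving output quality. Experiments show that the method lowers computational cost on multiple datasets and is robust to skewed data distributions.

Details
Abstract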

Diffusion-based Large Language Models (D-LLMs) represent a promising frontier in generative AI, offering fully parallel token generation that can lead to significant throughput advantages and superior GPU utilization over the traditional autoregressive paradigm. However, this parallelism is constrained by the requirement of a fixed-size response length prior to generation. This architectural limitation imposes a severe trade-off: oversized response length results in computational waste on semantically meaningless padding tokens, while undersized response length causes output truncation requiring costly re-computations that introduce unpredictable latency spikes. To tackle this issue, we propose Predict-then-Diffuse, a simple and model-agnostic framework that enables compute-budgeted inference per input query by first estimating the response length and then using it to run inference with D-LLM. At its core lies an Adaptive Response Length Predictor (AdaRLP), which estimates the optimal response length given an input query. As a measure against under-estimating the response length and re-running inference with a higher value, we introduce a data-driven safety mechanism based on a small increase of the predicted length. As a whole, our framework avoids wasting computation on padding tokens, at the same time preserving output quality. Experimental validation on multiple datasets demonstrates that Predict-then-Diffuse significantly reduces computational costs (FLOP) compared to the default D-LLM inference mechanism, while being robust to skewed data distributions.
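
The predict-then-generate loop, including the safety margin and the fallback when the output is still truncated, can be sketched as follows. The callables `predict_len`, `run_dllm`, and `is_truncated` are hypothetical stand-ins for AdaRLP, the D-LLM, and a truncation check; the additive token margin and the doubling on retry are assumptions, since the abstract only says the predicted length is increased by a small amount.

```python
def budgeted_length(predicted_len, safety_margin=8, max_len=1024):
    """Predicted length plus a safety margin, clamped to [1, max_len]."""
    return max(1, min(predicted_len + safety_margin, max_len))


def generate(query, predict_len, run_dllm, is_truncated,
             safety_margin=8, max_len=1024, growth=2):
    """Run the D-LLM at a predicted length; grow the budget if the output is cut off.

    predict_len(query) -> int          hypothetical length predictor
    run_dllm(query, length) -> output  hypothetical fixed-length D-LLM call
    is_truncated(output) -> bool       hypothetical truncation check
    """
    length = budgeted_length(predict_len(query), safety_margin, max_len)
    while True:
        output = run_dllm(query, length)
        if not is_truncated(output) or length >= max_len:
            return output, length
        length = min(length * growth, max_len)  # retry with a larger budget
```

The margin trades a few padding tokens per query against the cost of a full re-run, which is why the paper describes choosing it from data rather than fixing it by hand.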

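2605.03596 2026-05-15 cs.AI cs.CL cs.DB cs.LG Updated version

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Zirui Tang, Xuanhe Zhou, Yumou Liu, Linchun Li, Yukai Wu, Weizheng Wang, Hongzhang Huang, Wei Zhou, Jun Zhou, Jiachen Song, Shaoli Yu, Jinqi Wang, Zihang Zhou, Hongyi Zhou, Yuting Lv, Jinyang Li, Jiashuo Liu, Ruoyu Chen, Chunwei Liu, GuoLiang Li, Jihua Kang, Fan Wu

Affiliations * GitHub

AI summary Workspace-Bench 1.0 is a benchmark for evaluating how well AI agents handle large-scale file dependencies in workspace tasks. The study builds complex workspaces containing many file types and realistic work scenarios, and designs a large set of tasks that test agents' cross-file retrieval, contextual reasoning, and adaptive decision-making. Experiments show that current mainstream AI models still perform far below human level on the benchmark, highlighting the challenge of reliable workspace learning in real work settings.

Comments 30 pages, 16 figures

Details
Abstract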

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%.

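2605.02398 2026-05-15 cs.AI cs.CL cs.LG Updated version

The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

Rahul Kumar

Affiliations * Independent Researcher

AI summary As frontier AI models are used in high-stakes decision pipelines, their ability to maintain metacognitive stability under adversarial pressure becomes a key safety requirement. This paper studies the metacognitive collapse that occurs when models face compliance-forcing instructions and introduces the concept of a "Compliance Trap": the severe performance degradation is driven not by the threat content itself but by compliance-forcing instructions that override epistemic boundaries. In large-scale experiments, the author finds that most models degrade significantly under adversarial conditions, while Anthropic's Constitutional AI shows strong immunity owing to its alignment training.

Comments 9 pages, 2 figures, 3 tables. Code: https://github.com/rkstu/schema-compliance-trap Dataset: https://huggingface.co/datasets/lightmate/schema-compliance-trap

Details
Abstract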

As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all $p < 2 \times 10^{-8}$, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance-forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near-perfect immunity. This immunity does not stem from superior capability (Google's Gemini matches its baseline accuracy) but from alignment-specific training. We release the complete dataset and evaluation infrastructure.

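2605.01758 2026-05-15 cs.AI Updated version

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

Yue Ma, Ziyuan Yang, Yi Zhang

Affiliations * Sichuan University; Nanyang Technological University

AI summary This study addresses infectious jailbreak attacks in multi-agent systems and proposes a training-free Foresight-Guided Local Purification (FLP) framework. The method simulates future interaction trajectories and, combined with a multi-persona simulation strategy, detects and eliminates infected behavior in agents, lowering the infection spread rate. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47% while preserving interaction diversity, clearly outperforming existing methods.

Comments 12 pages

Details
Abstract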

Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreak, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, while infectious jailbreaks arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.
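
The Recursive Binary Diagnosis step can be sketched as a divide-and-conquer search over the album. The `is_infected` predicate is a hypothetical stand-in for the multi-persona diagnosis (any callable that reports whether a sub-album triggers infected behavior); the halving rule and the function names are assumptions for illustration, not the authors' code.

```python
def recursive_binary_diagnosis(album, is_infected):
    """Return the album items localized as infectious.

    album:       list of items (e.g. image identifiers)
    is_infected: hypothetical diagnosis, is_infected(sub_album) -> bool
    Halves that test clean are discarded without further checks.
    """
    if not album or not is_infected(album):
        return []
    if len(album) == 1:
        return list(album)
    mid = len(album) // 2
    return (recursive_binary_diagnosis(album[:mid], is_infected)
            + recursive_binary_diagnosis(album[mid:], is_infected))


def purify(album, is_infected):
    """Drop every item that the diagnosis localizes as infectious."""
    bad = set(recursive_binary_diagnosis(album, is_infected))
    return [item for item in album if item not in bad]
```

With a single infectious item in an album of n images, the search needs on the order of log n diagnosis calls instead of one call per image, which matters when each diagnosis involves simulating several chat rounds.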

2605.01725 2026-05-15 cs.CV cs.AI Updated version

Motion-Aware Caching for Efficient Autoregressive Video Generation

Jing Xu, Yuexiao Ma, Xuzhe Zheng, Xing Wang, Shiwei Liu, Chenqian Yan, Xiawu Zheng, Rongrong Ji, Fei Chao, Songwei Liu

Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; Max Planck Institute for Intelligent Systems; ELLIS Institute Tübingen; Tübingen AI Center

AI summary This paper studies how a motion-aware caching mechanism can make autoregressive video generation more efficient. Existing methods rely on coarse-grained chunk-level cache skipping, which cannot capture pixel-level dynamics and degrades generation quality. The authors propose MotionCache, which uses inter-frame differences as a lightweight proxy for pixel motion and, with a coarse-to-fine strategy, substantially speeds up generation while preserving quality. Experiments show speedups of up to 6.28x on several state-of-the-art models while maintaining high generation quality.

Comments 20 pages

Details
Abstract

Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of $6.28\times$ and $1.64\times$, respectively, while effectively preserving generation quality (VBench drops of 1% and 0.01%, respectively). The code is available at https://github.com/ywlq/MotionCache.
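
The idea of using inter-frame differences to set per-token update frequencies can be sketched as below. The sketch assumes frames are lists of token feature vectors, that motion is the mean absolute difference per token, and that the refresh interval falls linearly from `max_interval` for static tokens to 1 for the fastest-moving token; these choices and all function names are hypothetical, not the released MotionCache code.

```python
def motion_scores(prev_frame, cur_frame):
    """Mean absolute difference per token between two frames.

    Each frame is a list of token vectors (lists of floats) of equal shape.
    """
    return [
        sum(abs(a - b) for a, b in zip(prev_tok, cur_tok)) / len(cur_tok)
        for prev_tok, cur_tok in zip(prev_frame, cur_frame)
    ]


def update_intervals(scores, max_interval=4):
    """Map motion scores to refresh intervals (in denoising steps).

    Assumption: linear mapping; the highest-motion token is refreshed every
    step, a fully static token only every `max_interval` steps.
    """
    top = max(scores) or 1.0  # avoid division by zero when nothing moves
    return [max(1, round(max_interval * (1.0 - s / top))) for s in scores]


def should_recompute(step, interval):
    """True if a token with this interval is refreshed at this denoising step."""
    return step % interval == 0


def apply_cache(cached, fresh, step, intervals):
    """Take the fresh value for tokens due for refresh, else reuse the cache."""
    return [
        f if should_recompute(step, k) else c
        for c, f, k in zip(cached, fresh, intervals)
    ]
```

In a real pipeline the fresh values would only be computed for tokens that are due, which is where the savings come from; this sketch computes nothing and only shows the scheduling logic.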

2604.21809 2026-05-15 cs.LG cs.AI q-bio.QM stat.ML Updated version

Quotient-Space Diffusion Models

Yixian Xu, Yusong Wang, Shengjie Luo, Kaiyuan Gao, Tianyu He, Di He, Chang Liu

Affiliations * State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China; Huazhong University of Science and Technology, Wuhan, China; Microsoft Research Asia, Beijing, China; Zhongguancun Academy, Beijing, China

AI summary This paper proposes quotient-space diffusion models, a generative modeling framework designed to handle and exploit symmetry in a system. By running the generation process on the quotient space, where symmetry redundancy is removed, the model can learn the generation process more flexibly while preserving the symmetric target distribution. The framework is instantiated for molecular structure generation and outperforms both equivariant diffusion models and alignment-based methods, offering a new way to treat symmetry in generative models.

Comments ICLR 2026 Oral Presentation; 43 pages, 5 figures, 6 tables; ICLR 2026 Camera Ready version

Details
Abstract

Diffusion-based generative models have reformed generative AI, and also enabled new capabilities in the science domain, e.g., fast generation of 3D structures of molecules. In such tasks, there is often a symmetry in the system, identifying elements that can be converted by certain transformations as equivalent. Equivariant diffusion models guarantee a symmetric distribution, but miss the opportunity to make learning easier, while alignment-based simplification attempts fail to preserve the target distribution. In this work, we develop quotient-space diffusion models, a principled generative framework to fully handle and leverage symmetry. By viewing the intrinsic generation process on the quotient space, the exact construction that removes symmetry redundancy, the framework simplifies learning by allowing model output to have an arbitrary intra-equivalence-class movement, while generating the correct symmetric target distribution with guarantee. We instantiate the framework for molecular structure generation which follows $\mathrm{SE}(3)$ (rigid-body movement) symmetry. It improves the performance over equivariant diffusion models and outperforms alignment-based methods universally for small molecules and proteins, representing a new framework that surpasses previous symmetry treatments in generative models.

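2604.19092 2026-05-15 cs.RO cs.AI Updated version

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Feng Jiang, Yang Chen, Kyle Xu, Yuchen Liu, Haifeng Wang, Zhenhao Shen, Jasper Lu, Shengze Huang, Yuanfei Wang, Chen Xie, Ruihai Wu

Affiliations * Peking University; Tsinghua University; Lightwheel

AI summary RoboWM-Bench is a benchmark focused on robotic manipulation tasks that evaluates whether the behaviors generated by video world models are physically executable. It converts generated videos into executable action sequences and validates them in physical simulation environments, systematically assessing how models would perform in real robotic manipulation. The study finds that visual plausibility and physical executability do not always agree, underscoring the importance of embodied evaluation for complex manipulation tasks.

Details
Abstract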

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning. However, for embodied manipulation, perceptual realism alone is not sufficient: generated interactions must also be physically consistent and executable by robotic agents. Existing benchmarks provide valuable assessments of visual quality and physical plausibility, but they do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete manipulation tasks. We introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments. Built on real-to-sim scene reconstruction and diverse manipulation tasks, RoboWM-Bench enables standardized, reproducible, and scalable evaluation of physical executability. Using RoboWM-Bench, we evaluate state-of-the-art video world models and observe that visual plausibility and embodied executability are not always aligned. Our analysis highlights several recurring factors that affect execution performance, including spatial reasoning, contact prediction, and non-physical geometric distortions, particularly in complex and long-horizon interactions. These findings provide a more fine-grained view of current model capabilities and underscore the value of embodiment-aware evaluation for guiding physically grounded world modeling in robotic manipulation.

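2604.16744 2026-05-15 cs.CL cs.AI cs.HC Updated version

Evaluating Adaptive Personalization of Educational Readings with Simulated Learners

Ryan T. Woo, Anmol Rao, Aryan Keluskar, Yinong Chen

Affiliations * School of Computing and Augmented Intelligence; Arizona State University

AI summary This paper proposes a theory-grounded simulated-learner framework for evaluating adaptive personalization of educational reading materials. The method builds an ontology of learning objectives and knowledge components from open textbooks, manages it with a browser-based tool, and generates matched reading-assessment pairs. Results show that adaptive reading significantly improved learning outcomes in computer science, had inconclusive effects in inorganic chemistry, and yielded no clear improvement, or even a slightly negative effect, in general biology.

Details
Abstract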

We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned reading-assessment pairs. Simulated readers learn from passages through a Construction-Integration-inspired memory model with DIME-style reader factors, KREC-style misconception revision, and an open New Dale-Chall readability signal. Answers are produced by score-based option selection over the learner's explicit memory state, while BKT drives adaptation. Across three sampled subject ontologies and matched cohorts of 50 simulated learners per condition, adaptive reading significantly improved outcomes in computer science, yielded smaller positive but inconclusive gains in inorganic chemistry, and was neutral to slightly negative in general biology.
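
The abstract states that BKT (Bayesian Knowledge Tracing) drives adaptation. The standard BKT update is short enough to show in full; the parameter values below and the rule of targeting the least-mastered knowledge component are illustrative assumptions, not values or policies reported by the authors.

```python
def bkt_update(p_know, correct, slip=0.1, guess=0.2, learn=0.15):
    """One standard BKT step: Bayesian posterior on mastery, then learning.

    p_know:  prior probability the learner has mastered the component
    correct: whether the observed answer was correct
    slip / guess / learn: illustrative parameter values (assumptions)
    """
    if correct:
        num = p_know * (1.0 - slip)
        den = num + (1.0 - p_know) * guess
    else:
        num = p_know * slip
        den = num + (1.0 - p_know) * (1.0 - guess)
    posterior = num / den
    return posterior + (1.0 - posterior) * learn


def next_component(mastery):
    """Hypothetical adaptation rule: target the least-mastered component."""
    return min(mastery, key=mastery.get)
```

Starting from a prior of 0.5, a correct answer raises the mastery estimate and an incorrect one lowers it, and the adaptive condition can then pick the next reading for whichever knowledge component `next_component` returns.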

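2604.16325 2026-05-15 cs.LG cs.AI Updated version

UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration

Xingsheng Chen, Xianpei Mu, Deyu Yi, Yilin Yuan, Xingwei He, Bo Gao, Regina Zhang, Pietro Lio, Siu-Ming Yiu

AI summary Multivariate time series forecasting matters in domains such as energy, finance, and environmental monitoring, but complex temporal dependencies and cross-variable interactions make it challenging. This paper proposes UniMamba, a unified spatial-temporal forecasting framework that combines state-space models with attention, retaining efficient computation while capturing explicit temporal patterns. By combining a Mamba variate-channel encoding layer, a spatial-temporal attention layer, and a feedforward temporal dynamics layer, the method models global temporal dependencies and inter-variate relations. Experiments on multiple public datasets show that UniMamba outperforms existing state-of-the-art models in both forecasting accuracy and computational efficiency.

Comments The authors wish to withdraw this preprint due to a lack of consensus regarding the final authorship list and the order of authors

Details
Abstract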

Multivariate time series forecasting is fundamental to numerous domains such as energy, finance, and environmental monitoring, where complex temporal dependencies and cross-variable interactions pose enduring challenges. Existing Transformer-based methods capture temporal correlations through attention mechanisms but suffer from quadratic computational cost, while state-space models like Mamba achieve efficient long-context modeling yet lack explicit temporal pattern recognition. Therefore we introduce UniMamba, a unified spatial-temporal forecasting framework that integrates efficient state-space dynamics with attention-based dependency learning. UniMamba employs a Mamba Variate-Channel Encoding Layer enhanced with FFT-Laplace Transform and TCN to capture global temporal dependencies, and a Spatial Temporal Attention Layer to jointly model inter-variate correlations and temporal evolution. A Feedforward Temporal Dynamics Layer further fuses continuous and discrete contexts for accurate forecasting. Comprehensive experiments on eight public benchmark datasets demonstrate that UniMamba consistently outperforms state-of-the-art forecasting models in both forecasting accuracy and computational efficiency, establishing a scalable and robust solution for long-sequence multivariate time-series prediction.

2604.09603 2026-05-15 cs.DC cs.AI cs.LG Updated version

ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

Xinyi Hu, Yuhao Shen, Baolin Zhang, Hengxin Zhang, Jun Dai, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan

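Affiliations * Qwen Applications Business Group of Alibaba

AI summary ECHO is an elastic speculative decoding framework for high-concurrency scenarios that aims to improve the inference efficiency of large language models. Using a sparse confidence gating mechanism, it recasts speculative execution as a budgeted scheduling problem and flexibly balances decoding depth against width, reducing global verification steps while raising per-step efficiency. Experiments across model scales, including the industrial-grade Qwen3-235B, indicate that ECHO outperforms existing methods, reaching up to 5.35x wall-time speedup.

English abstract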

Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high-concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales, particularly the industrial-grade Qwen3-235B, demonstrate that ECHO consistently outperforms SOTA methods in both low-load and high-load scenarios, achieving up to 5.35x wall-time speedup and delivering over 20% relative speedup gain.
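
For readers unfamiliar with the verification step that the abstract identifies as the bottleneck, here is a minimal sketch of greedy verification for a single draft chain in generic speculative decoding. It is not ECHO's super-tree scheduler or gating mechanism; the function name and token lists are hypothetical.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Accept the longest draft prefix that the target model agrees with.

    draft_tokens  : k tokens proposed by the draft model
    target_tokens : k + 1 tokens, where target_tokens[i] is the target model's
                    greedy choice given the prefix plus draft_tokens[:i]
    Returns the tokens emitted in this verification step.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            # First disagreement: keep the target's token and stop.
            accepted.append(verified)
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the extra target token comes for free.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted
```

Each call stands for one forward pass of the target model, so drafted tokens that are rejected represent verification compute that produced nothing. That waste is what the abstract says dominates in compute-bound, high-concurrency serving.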

2604.08991 2026-05-15 cs.CV cs.AI Updated version

PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

Zhiyu Zhou, Peilin Liu, Ruoxuan Zhang, Luyang Zhang, Cheng Zhang, Hongxia Xie, Wen-Huang Cheng

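Affiliations * Jilin University; National Taiwan University

AI summary This paper introduces PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos, designed to evaluate whether a model can precisely localize a target object in video and describe its position. Built from ScanNet++ and ScanNet200, it comprises 1,024 scenes and 10,094 QA pairs across four progressively harder tasks. Experiments show a clear capability gap for mainstream multimodal large language models on the benchmark, while fine-tuning on PinpointQA substantially improves performance.

English abstract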

Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.

2603.24422 2026-05-15 cs.IR cs.AI cs.CL Updated version

OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework

Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, Zhipeng Qian, Xinyu Sun, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Hui Kong, Jing Chen, Han Li, Chenyi Lei, Wenwu Ou, Kun Gai

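Affiliations * Kuaishou Technology

AI summary This paper proposes OneSearch-V2, a generative retrieval framework that targets weaknesses of existing systems in complex query understanding, user intent mining, and preference overfitting. It introduces a latent-reasoning-enhanced self-distillation training mechanism to better understand and match users' underlying needs, and adds a behavior preference alignment optimization system that mitigates the reward hacking caused by a single conversion metric. Experiments report gains on several metrics, including click-through rate, buyer volume, and order volume, along with improved search experience quality.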

Comments Codes are available at https://github.com/benchen4395/onesearch-family. Feel free to contact benchen4395@gmail.com

English abstract

Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi-stage cascaded architecture, it offers advantages such as end-to-end joint optimization and high computational efficiency. OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose OneSearch-V2, a latent reasoning enhanced self-distillation generative search framework. It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline, which uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98% item CTR, +2.07% buyer volume, and +2.11% order volume. Manual evaluation further confirms gains in search experience quality, with +1.37% in page good rate and +1.65% in query-item relevance. More importantly, OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.

2603.21250 2026-05-15 cs.AI Updated version

Graph of States: Solving Abductive Tasks with Large Language Models

Yu Luo, Rongchen Gao, Lu Teng, Xidao Wen, Jiamin Jiang, Qingliang Zhang, Yongqian Sun, Shenglin Zhang, Jiasong Feng, Tong Liu, Wenjie Zhang, Dan Pei

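Affiliations * Nankai University; Wenzhou Medical University; Alibaba Cloud; Tsinghua University

AI summary This paper studies large language models on abductive reasoning, the third type of logical reasoning alongside induction and deduction. To address the lack of structured state representation and explicit state control in existing frameworks, the authors propose Graph of States (GoS), a neuro-symbolic framework that encodes logical dependencies in a causal graph and uses a state machine to govern valid transitions of the reasoning process, turning unconstrained exploration into directed search. Experiments on two real-world datasets show that GoS significantly outperforms existing methods, offering a robust solution for complex abductive tasks.

English abstract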

Logical reasoning encompasses deduction, induction, and abduction. However, while Large Language Models (LLMs) have effectively mastered the former two, abductive reasoning remains significantly underexplored. Existing frameworks, predominantly designed for static deductive tasks, fail to generalize to abductive reasoning due to unstructured state representation and lack of explicit state control. Consequently, they are inevitably prone to Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping. To bridge this gap, we introduce Graph of States (GoS), a general-purpose neuro-symbolic framework tailored for abductive tasks. GoS grounds multi-agent collaboration in structured belief states, utilizing a causal graph to explicitly encode logical dependencies and a state machine to govern the valid transitions of the reasoning process. By dynamically aligning the reasoning focus with these symbolic constraints, our approach transforms aimless, unconstrained exploration into a convergent, directed search. Extensive evaluations on two real-world datasets demonstrate that GoS significantly outperforms all baselines, providing a robust solution for complex abductive tasks. Code repo and all prompts: https://github.com/gaorch85/Graph-of-States.

2603.16659 2026-05-15 cs.AI econ.GN q-fin.EC Updated version

LLMs learn scientific taste from institutional traces across the social sciences

Ziqin Gong, Ning Li, Huaikang Zhou

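Affiliations * School of Economics and Management, Tsinghua University

AI summary This study examines whether large language models (LLMs) can improve evaluative judgment in low-verifiability domains by learning from institutional traces in the social sciences, such as publication records. The authors build tiered research-pitch benchmarks for eight disciplines and train models with supervised fine-tuning (SFT). The fine-tuned models clearly beat the chance baseline and exceed frontier reasoning models; in management, the best fine-tuned model also exceeds the expert majority vote. Model confidence tracks prediction accuracy, suggesting the judgments are reasonably calibrated.

English abstract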

Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say "I'm sure" versus "I'm guessing." Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.

2603.14360 2026-05-15 cs.LG cs.AI Updated version

M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

Mayank Mishra, Shawn Tan, Ion Stoica, Joseph Gonzalez, Tri Dao

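Affiliations * MIT-IBM Watson Lab; Princeton University; Together AI

AI summary This paper proposes M$^2$RNN, a non-linear recurrent neural network architecture with matrix-valued hidden states and expressive non-linear state transitions, aimed at the expressivity limits of Transformers on tasks such as entity tracking and code execution. The authors find that the language modeling performance of non-linear RNNs is limited by state size, and that a state size expansion mechanism enables efficient use of tensor cores; M$^2$RNN also achieves perfect state-tracking generalization at sequence lengths not seen during training. In large-scale language modeling with hybrid architectures, M$^2$RNN improves perplexity and long-context accuracy over Gated DeltaNet hybrids while using smaller recurrent state sizes.

English abstract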

Transformers are highly parallel but are limited to computations in the TC$^0$ complexity class, excluding tasks such as entity tracking and code execution that provably require greater expressive power. Motivated by this limitation, we revisit non-linear Recurrent Neural Networks (RNNs) for language modeling and introduce Matrix-to-Matrix RNN (M$^2$RNN): an architecture with matrix-valued hidden states and expressive non-linear state transitions. We demonstrate that the language modeling performance of non-linear RNNs is limited by their state size, and show how the state size expansion mechanism enables efficient use of tensor cores. Empirically, M$^2$RNN achieves perfect state tracking generalization at sequence lengths not seen during training. These benefits also translate to large-scale language modeling. In hybrid settings that interleave recurrent layers with attention, Hybrid M$^2$RNN outperforms equivalent Gated DeltaNet hybrids by $0.4$-$0.5$ perplexity points on a 7B MoE model, while using $3\times$ smaller state sizes for the recurrent layers. Notably, replacing even a single recurrent layer with M$^2$RNN in an existing hybrid architecture yields accuracy gains comparable to Hybrid M$^2$RNN with minimal impact on training throughput. Further, the Hybrid Gated DeltaNet models with a single M$^2$RNN layer also achieve superior long-context generalization, outperforming state-of-the-art hybrid linear attention architectures by up to $8$ points on LongBench. Together, these results establish non-linear RNN layers as a compelling building block for efficient and scalable language models.

2603.12554 2026-05-15 cs.LG cs.AI cs.CL Updated version

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

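Affiliations * Department of Electrical & Computer Engineering

AI summary This paper studies how to apply reinforcement learning to sequence generation with diffusion language models (DLMs). Because sequence-level likelihoods are intractable for diffusion models, the authors formulate generation as a finite-horizon Markov decision process and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed through intermediate advantages. For compute efficiency, they add entropy-guided step selection and a one-step denoising reward estimate, avoiding costly multi-step rollouts. Experiments report state-of-the-art results on coding and logical reasoning benchmarks and competitive performance on mathematical reasoning.

English abstract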

Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.

2603.12529 2026-05-15 cs.LG cs.AI cs.CL Updated version

TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim

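Affiliations * UT Austin; EPFL; ENS Paris-Saclay; Télécom Paris (IP Paris)

AI summary Large Reasoning Models (LRMs) perform well on complex tasks through Chain-of-Thought (CoT) reasoning but often waste substantial compute by overthinking. This paper proposes TERMINATOR, an early-exit strategy applied at inference time: it learns where the model first produces its final answer and uses those positions to build a dataset of optimal reasoning lengths, which shortens CoT outputs. Experiments show average CoT length reductions of 14%-55% across four practical datasets, along with markedly lower inference latency.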

Comments Updated and reorganized results. Added new results

English abstract

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task- and model-dependent. In this paper, we address precisely this problem and design Terminator, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning Terminator is that the first arrival of an LRM's final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train Terminator. Powered by this approach, Terminator achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, while outperforming current state-of-the-art methods and reducing inference latency by more than 2x compared to the original LRM.

2603.11042 2026-05-15 cs.CV cs.AI cs.LG cs.MM cs.SD Updated version

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan

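Affiliations * UNC Chapel Hill; Adobe Research

AI summary This paper proposes V2M-Zero, a video-to-music generation method that produces music time-aligned with video events without any paired video-music data. It extracts event curves separately for music and video, capturing temporal structure within each modality, to achieve cross-modal time synchronization. Experiments show that V2M-Zero outperforms existing methods on several benchmarks, particularly in temporal synchronization and semantic alignment, and enables independent control of timing and music style.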

Comments Project page: https://genjib.github.io/v2m_zero/

English abstract

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baselines per metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-sourced subjective listening test. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also leads to better performance than paired cross-modal supervision. Furthermore, our approach enables independent controls for timing and music style (e.g., genre, mood) for more controllable generation.
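
The abstract describes event curves computed from intra-modal similarity but does not give the formula. One plausible reading, sketched below under the assumption that the curve is one minus the cosine similarity between consecutive embeddings, measures how much each step changes without looking at what changed. The function name and input layout are hypothetical.

```python
import numpy as np

def event_curve(embeddings):
    """Temporal-change curve from a sequence of per-step embeddings.

    embeddings : array of shape (T, D), one non-zero feature vector per frame
                 or audio window, produced by any pretrained encoder
    Returns an array of shape (T,); entry t is 1 - cos(e_t, e_{t-1}),
    with the first entry set to 0 because it has no predecessor.
    """
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # unit-normalize rows
    similarity = np.sum(e[1:] * e[:-1], axis=1)       # cosine of neighbours
    return np.concatenate([[0.0], 1.0 - similarity])
```

Because the curve depends only on similarity within one modality, a curve from video embeddings and a curve from music embeddings live on the same scale, which is the property the abstract relies on to swap one for the other at inference.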

2603.07833 2026-05-15 cs.LG cs.AI Updated version

Gradient Iterated Temporal-Difference Learning

Théo Vincent, Kevin Gerhardt, Yogesh Tripathi, Habib Maraqten, Adam White, Martha White, Jan Peters, Carlo D'Eramo

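AI summary This paper proposes Gradient Iterated Temporal-Difference learning, an algorithm that targets the divergence that semi-gradient updates can cause in temporal-difference learning. Building on iterated TD learning, it computes gradients over the moving targets to improve stability while retaining learning speed. Experiments show learning speed competitive with semi-gradient methods across several benchmarks, including Atari games.

English abstract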

Temporal-difference (TD) learning is highly effective at controlling and evaluating an agent's long-term outcomes. Most approaches in this paradigm implement a semi-gradient update to boost the learning speed, which consists of ignoring the gradient of the bootstrapped estimate. While popular, this type of update is prone to divergence, as Baird's counterexample illustrates. Gradient TD methods were introduced to overcome this issue, but have not been widely used, potentially due to issues with learning speed compared to semi-gradient methods. Recently, iterated TD learning was developed to increase the learning speed of TD methods. For that, it learns a sequence of action-value functions in parallel, where each function is optimized to represent the application of the Bellman operator over the previous function in the sequence. While promising, this algorithm can be unstable due to its semi-gradient nature, as each function tracks a moving target. In this work, we modify iterated TD learning by computing the gradients over those moving targets, aiming to build a powerful gradient TD method that competes with semi-gradient methods. Our evaluation reveals that this algorithm, called Gradient Iterated Temporal-Difference learning, has a competitive learning speed against semi-gradient methods across various benchmarks, including Atari games, a result that no prior work on gradient TD methods has demonstrated.
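
To make "ignoring the gradient of the bootstrapped estimate" concrete, the sketch below contrasts a semi-gradient TD(0) step with a full-gradient (residual-gradient) step on the squared TD error for a linear value function. This is a textbook illustration only: it is neither the Gradient TD family (such as GTD2 or TDC) nor the paper's Gradient Iterated TD algorithm, and the step size and discount are arbitrary.

```python
import numpy as np

def td_updates(w, phi, phi_next, reward, gamma=0.9, alpha=0.1):
    """Return (semi_gradient_w, full_gradient_w) after one transition.

    w        : weights of the linear value function v(s) = w @ phi(s)
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    """
    w = np.asarray(w, dtype=float)
    phi = np.asarray(phi, dtype=float)
    phi_next = np.asarray(phi_next, dtype=float)
    delta = reward + gamma * (w @ phi_next) - (w @ phi)  # TD error
    # Semi-gradient: treat the bootstrapped target as a constant.
    semi = w + alpha * delta * phi
    # Full gradient of 0.5 * delta**2: the target also depends on w.
    full = w + alpha * delta * (phi - gamma * phi_next)
    return semi, full
```

The two updates differ only in the `gamma * phi_next` term, which is the gradient flowing through the target.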

2603.04885 2026-05-15 cs.AI Updated version

Proactive Memory for Ad-Hoc Recall over Streaming Dialogues

Bingbing Wang, Jing Li, Ruifeng Xu

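Affiliations * Department of Computing, The Hong Kong Polytechnic University; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)

AI summary This study addresses memory management over an infinite horizon in streaming dialogues. It introduces STEM-Bench, the first benchmark for streaming evaluation of memory, and identifies a tension between information fidelity and computational efficiency in existing methods. The proposed ProStream framework uses a hierarchical structure with multi-granular distillation to support on-demand memory recall, and applies Adaptive Spatiotemporal Optimization to adjust retention dynamically, lowering inference latency without sacrificing reasoning fidelity.

English abstract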

Real-world dialogue usually unfolds as an infinite stream. It thus requires bounded-state memory mechanisms to operate within an infinite horizon. However, existing read-then-think memory is fundamentally misaligned with this setting, as it cannot support ad-hoc memory recall while streams unfold. To explore this challenge, we introduce STEM-Bench, the first benchmark for STreaming Evaluation of Memory. It comprises over 14K QA pairs in dialogue streams that assess perception fidelity, temporal reasoning, and global awareness under infinite-horizon constraints. The preliminary analysis on STEM-Bench indicates a critical fidelity-efficiency dilemma: retrieval-based methods use fragmented context, while full-context models incur unbounded latency. To resolve this, we propose ProStream, a proactive memory framework for streaming dialogues built on a hierarchical structure. It enables ad-hoc memory recall on demand by reasoning over continuous streams with multi-granular distillation. Moreover, it employs Adaptive Spatiotemporal Optimization to dynamically optimize retention based on expected utility. It enables a bounded knowledge state for lower inference latency without sacrificing reasoning fidelity. Experiments show ProStream delivers higher reasoning fidelity than prior baselines while maintaining substantially lower latency than full-context alternatives.

2603.00574 2026-05-15 cs.CV cs.AI Updated version

Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

Yongbo He, Zirun Guo, Tao Jin

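Affiliations * Zhejiang University

AI summary Multi-modal test-time adaptation adapts pretrained models to data distributions that shift at test time, but existing methods often suffer negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. This paper proposes DASP, a diagnose-then-mitigate framework that identifies the biased modality by analyzing differences in interdimensional redundancy across modalities in the unified latent space, then applies an asymmetric adaptation strategy: each modality-specific adapter is split into stable and plastic components that serve the differing stability and plasticity needs of each modality, preserving generalizable knowledge while adapting to new domains. Experiments show that DASP significantly outperforms existing methods on several multi-modal benchmarks.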

Comments Accepted to CVPR 2026

English abstract

Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
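
The abstract defines interdimensional redundancy as strong correlations across feature dimensions but does not say how it is scored. A minimal sketch, assuming the score is the mean absolute off-diagonal Pearson correlation of a batch of features (an assumption, not necessarily DASP's statistic), could look like this:

```python
import numpy as np

def interdim_redundancy(features):
    """Mean absolute correlation between distinct feature dimensions.

    features : array of shape (N, D), N samples of a D-dimensional embedding
               from one modality; every dimension must have non-zero variance
    Returns a value in [0, 1]; higher means more redundant dimensions.
    """
    x = np.asarray(features, dtype=float)
    corr = np.corrcoef(x, rowvar=False)          # (D, D) correlation matrix
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.mean(np.abs(corr[off_diagonal])))
```

Under this reading, the modality with the higher score on a test batch would be flagged as biased and routed to the plastic adapter component, while the other modality keeps the stable path.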

2602.23798 2026-05-15 cs.LG cs.AI cs.CR cs.DC Updated version

MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

Tiantong Wang, Xinyu Yan, Tiantong Wu, Yurong Hao, Pengjun Xie, Wei Yang Bryan Lim

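Affiliations * College of Computing and Data Science, Nanyang Technological University; Alibaba-NTU Global e-Sustainability CorpLab (ANGEL); Tongyi Lab, Alibaba Group

AI summary This paper studies secure, privacy-preserving knowledge unlearning for large language models. For settings where privacy constraints prevent sharing either the model parameters or the forget set, it proposes MPU, an algorithm-agnostic framework. Server-side Pre-Process and Post-Process modules generate randomly perturbed model copies and aggregate the resulting updates, so the client can run unlearning locally without access to the original parameters. Experiments show that MPU stays close to noise-free baselines across multiple unlearning algorithms and can even outperform them for some algorithms under 1% noise.

English abstract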

Machine unlearning for large language models often faces a privacy dilemma in which strict constraints prohibit sharing either the server's parameters or the client's forget set. To address this dual non-disclosure constraint, we propose MPU, an algorithm-agnostic privacy-preserving Multiple Perturbed Copies Unlearning framework that primarily introduces two server-side modules: Pre-Process for randomized copy generation and Post-Process for update aggregation. In Pre-Process, the server distributes multiple perturbed and reparameterized model instances, allowing the client to execute unlearning locally on its private forget set without accessing the server's exact original parameters. After local unlearning, the server performs Post-Process by inverting the reparameterization and aggregating updates with a harmonic denoising procedure to alleviate the impact of perturbation. Experiments with seven unlearning algorithms show that MPU achieves comparable unlearning performance to noise-free baselines, with most algorithms' average degradation well below 1% up to 10% noise, and can even outperform the noise-free baseline for some algorithms under 1% noise. Code is available at https://github.com/Tristan0318/MPU.

2602.20571 2026-05-15 cs.AI Updated version

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

Ayush Sawarni, Jiyuan Tan, Vasilis Syrgkanis

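Affiliations * Stanford University

AI summary This paper introduces CausalReasoningBenchmark, a real-world causal reasoning benchmark that evaluates causal identification and estimation separately. It contains 173 queries across 132 real-world datasets, curated from 79 peer-reviewed papers and three widely used textbooks. For each query a system must produce a structured identification specification and a point estimate with a standard error, which separates causal reasoning errors from numerical execution errors. Baseline results with a state-of-the-art language model show that it identifies the high-level strategy in 79% of cases, but full identification-specification correctness drops to 34%, underscoring the importance of research design details.

English abstract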

Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification (formulating a valid research design under stated assumptions) and estimation (implementing that design numerically on finite data). We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 132 real-world datasets, curated from 79 peer-reviewed research papers and three widely-used causal-inference textbooks. For each query a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it distinguishes failures in causal reasoning from errors in numerical execution. Baseline results with a state-of-the-art LLM show that, while the model correctly identifies the high-level strategy in 79% of cases, full identification-specification correctness drops to only 34%, revealing that the bottleneck lies in the nuanced details of research design rather than in computation. CausalReasoningBenchmark is publicly available on Hugging Face and is designed to foster the development of more robust automated causal-inference systems.
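
As a small illustration of the estimation half of the task (a point estimate with a standard error), here is a difference-in-means ATE estimator. It is a sketch that is only valid under the identification assumption of randomized treatment assignment; the benchmark covers other research designs, and the function name and inputs are hypothetical.

```python
import numpy as np

def difference_in_means(outcome, treated):
    """ATE point estimate and standard error under random assignment.

    outcome : array of observed outcomes, one per unit
    treated : array of 0/1 (or bool) treatment indicators, same length
    Returns (estimate, standard_error) using the unequal-variance formula.
    """
    y = np.asarray(outcome, dtype=float)
    t = np.asarray(treated, dtype=bool)
    y_treated, y_control = y[t], y[~t]
    estimate = y_treated.mean() - y_control.mean()
    # Sample variances (ddof=1) of each arm, scaled by arm size.
    variance = (y_treated.var(ddof=1) / y_treated.size
                + y_control.var(ddof=1) / y_control.size)
    return estimate, float(np.sqrt(variance))
```

The arithmetic here is simple, which matches the abstract's finding that the hard part is choosing and specifying the design that justifies an estimator, not running it.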

2602.19533 2026-05-15 cs.LG cs.AI math.RA Updated version

Grokking Finite-Dimensional Algebra

Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau

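Affiliations * CHU Sainte-Justine Research Center, Montréal; CIFAR AI Chair

AI summary This paper studies grokking, the sudden transition from prolonged memorization to generalization, when neural networks learn multiplication in finite-dimensional algebras (FDA). The authors extend the analysis from group operations to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras, and show that learning group operations is a special case of learning FDA. They show that learning FDA multiplication amounts to learning a bilinear product defined by the structure tensor, and they examine how algebraic properties such as commutativity and associativity affect when grokking emerges and how the sparsity and rank of the structure tensor affect generalization, providing a unified framework for how mathematical structure shapes neural network generalization dynamics.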

Comments 37 pages, 14 figures, Forty-Third International Conference on Machine Learning (ICML), 2026

English abstract

This paper investigates the grokking phenomenon, the sudden transition from a long memorization phase to generalization observed during neural network training, in the context of learning multiplication in finite-dimensional algebras (FDA). While prior work on grokking has focused mainly on group operations, we extend the analysis to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras. We show that learning group operations is a special case of learning FDA, and that learning multiplication in FDA amounts to learning a bilinear product specified by the algebra's structure tensor. For algebras over the reals, we connect the learning problem to matrix factorization with an implicit low-rank bias, and for algebras over finite fields, we show that grokking emerges naturally as models must learn discrete representations of algebraic elements. This leads us to experimentally investigate the following core questions: (i) how do algebraic properties such as commutativity, associativity, and unitality influence both the emergence and timing of grokking, (ii) how structural properties of the structure tensor of the FDA, such as sparsity and rank, influence generalization, and (iii) to what extent generalization correlates with the model learning latent embeddings aligned with the algebra's representation. Our work provides a unified framework for grokking across algebraic structures and new insights into how mathematical structure governs neural network generalization dynamics.
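
The statement that multiplication in an FDA is a bilinear product specified by the structure tensor can be made concrete with a small worked example. The sketch below uses the complex numbers as a two-dimensional algebra over the reals with basis (1, i); the tensor layout `C[i, j, k]` (coefficient of basis element k in the product of basis elements i and j) is a convention chosen here for illustration, not necessarily the paper's.

```python
import numpy as np

# Structure tensor of the complex numbers with basis e_0 = 1, e_1 = i.
# C[i, j, k] is the coefficient of e_k in the product e_i * e_j.
C = np.zeros((2, 2, 2))
C[0, 0, 0] = 1.0    # 1 * 1 = 1
C[0, 1, 1] = 1.0    # 1 * i = i
C[1, 0, 1] = 1.0    # i * 1 = i
C[1, 1, 0] = -1.0   # i * i = -1

def multiply(x, y, structure=C):
    """Bilinear product z_k = sum_{i,j} structure[i, j, k] * x_i * y_j."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.einsum("ijk,i,j->k", structure, x, y)

# (1 + 2i) * (3 + 4i) = -5 + 10i
product = multiply([1.0, 2.0], [3.0, 4.0])
```

Swapping in a different tensor changes the algebra without changing the code, so properties such as commutativity (symmetry of `C` in its first two indices) or sparsity become properties of the tensor that a learner has to recover.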

2602.17949 2026-05-15 cs.CL cs.AI Updated version

CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Victoria Blake, Jamie Novak, Mathew Miller, Sze-yuan Ooi, Blanca Gallego

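Affiliations * Centre for Big Data Research in Health, University of New South Wales; Eastern Heart Clinic, Prince of Wales Hospital; NSW Ambulance Aeromedical Operations, Bankstown Helicopter Base; Department of Anaesthesia, Saint George Hospital; Department of Cardiology, Prince of Wales Hospital; School of Clinical Medicine, University of New South Wales

AI summary This paper presents CUICurate, a framework based on graph retrieval-augmented generation (GraphRAG) for automatically building clinical concept sets to support natural language processing applications. It uses a UMLS knowledge graph for semantic retrieval and large language models to filter and classify candidate concepts, producing concept sets that are larger and more complete than manually curated ones. In evaluations on five lexically heterogeneous clinical concepts, CUICurate achieved high recall of definitive concepts and stable results across runs, offering a scalable, cost-efficient approach for clinical NLP and phenotyping.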

Comments 6 figures, 4 tables

English abstract

Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and associated concepts. Constructing these sets is labour-intensive, inconsistently performed, and poorly supported by existing tools. Methods: We present CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. Candidate CUIs were retrieved using graph-based expansion and then filtered and classified using large language models (GPT-5 and Qwen3-32B). The framework was evaluated on five lexically heterogeneous clinical concepts against manually curated concept sets and gold-standard concept sets. Results: CUICurate produced substantially larger and more complete concept sets than the manual benchmarks. A single retrieval configuration across concepts achieved high recall of definitive concepts with manageable candidate sets. GPT-5 outperformed manual curation for all concepts and retained at least 95% of definitive gold-standard CUIs, while Qwen3-32B achieved comparable but slightly lower performance. Many missed concepts were not observed in 10,000 MIMIC-III notes. CUICurate infrastructure and end-to-end processing were inexpensive and stable across runs. Conclusions: CUICurate offers a scalable, reproducible and cost-efficient approach for generating clinician-reviewable UMLS concept sets tailored to clinical natural language processing and phenotyping applications.

2602.15249 2026-05-15 cs.DL cs.AI Updated version

Artificial Intelligence Specialization in the European Union: Underexplored Role of the Periphery at NUTS-3 Level

Victor Herrero-Solana, Carmen Gálvez

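Affiliations * SCImago-UGR, Unit for Computational Humanities and Social Sciences (U^CHASS), University of Granada, Spain

AI summary This study analyses the distribution of artificial intelligence research across European NUTS-3 regions from 2015 to 2024, using citation data and a topic classification system to compute a Relative Specialization Index and a Relative Citation Impact indicator. Although metropolitan areas such as Paris, Warsaw, and Madrid lead in publication volume, the highest relative specialization in AI is found in peripheral regions of Eastern Europe and Spain, such as Granada and the Vilnius region. The study also finds only a weak relationship between specialization and citation impact, with regions following varied development patterns.

Comments 15 pages, 3 figures

Details
English abstract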

This study examines the distribution of Artificial Intelligence (AI) research across European NUTS-3 regions during the period 2015-2024. Using bibliometric data from Clarivate InCites and the Citation Topics classification system, we analyse two hierarchical thematic levels: Electrical Engineering, Electronics & Computer Science (Macro Citation Topic 4) and Artificial Intelligence & Machine Learning (Meso Citation Topic 4.61). Relative Specialization Index (RSI) and Relative Citation Impact (RCI) indicators are calculated for 781 European NUTS-3 regions. While major metropolitan hubs such as Paris, Warszawa, and Madrid dominate in absolute publication volume, the results reveal that the highest levels of relative AI specialization are concentrated in peripheral regions, particularly in Eastern Europe and Spain. Granada and Vilniaus apskritis stand out as regions combining high specialization with strong citation visibility. The analysis further suggests a weak relationship between regional specialization and citation impact, revealing multiple regional profiles, including highly specialized regions with limited citation visibility, highly visible regions with comparatively low specialization, and diversified scientific systems combining moderate specialization with strong citation impact. Fyn emerges as an extreme case of very high citation impact despite relatively low specialization.
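The abstract names the RSI and RCI indicators without giving formulas. As a rough illustration only, the sketch below uses one common bibliometric convention: an activity index (the region's share of output in the field divided by the reference set's share), rescaled to the interval [-1, 1], and a citations-per-paper ratio. These formulas and all function names are assumptions, not the definitions used in the paper.

```python
def activity_index(region_field, region_total, ref_field, ref_total):
    """Share of the region's output in the field, relative to the reference share."""
    return (region_field / region_total) / (ref_field / ref_total)


def relative_specialization_index(region_field, region_total, ref_field, ref_total):
    """Assumed convention: RSI = (AI - 1) / (AI + 1), which lies in [-1, 1]."""
    ai = activity_index(region_field, region_total, ref_field, ref_total)
    return (ai - 1.0) / (ai + 1.0)


def relative_citation_impact(region_cites, region_papers, ref_cites, ref_papers):
    """Assumed convention: citations per paper relative to the reference set."""
    return (region_cites / region_papers) / (ref_cites / ref_papers)


# A region with 30 AI papers out of 100, against a reference of 100 out of 1000,
# is three times as specialized as the reference, so its RSI is about 0.5.
rsi = relative_specialization_index(30, 100, 100, 1000)
rci = relative_citation_impact(200, 10, 1000, 100)
```

Under this convention an RSI of 0 means the region matches the reference share, and an RCI above 1 means above-average citation impact.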

2602.15019 2026-05-15 cs.AI cs.IR Updated version

Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence

Vlad Vinogradov, Alisa Vinogradova, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Roman Doronin, Andrey Doronichev

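Affiliations * Bioptic

AI summary This paper studies how to efficiently discover drug assets originating outside the United States for biopharmaceutical investing, business development, and competitive intelligence. To address the low recall and hallucinations of current AI systems across multilingual, heterogeneous sources, the authors propose a tree-based self-learning Bioptic Agent and build a benchmark using a multilingual multi-agent pipeline. Experiments show that the method substantially outperforms several mainstream large models on asset discovery, supporting its advantages in completeness and accuracy.

Details
English abstract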

Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests over 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at 30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals, and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. On this benchmark, our Bioptic Agent achieves 79.7% F1 score, outperforming Gemini 3.1 Deep Think (59.2%), Gemini 3.1 Pro Deep Research (58.6%), Claude Opus 4.6 (56.2%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance improves steeply with additional compute, supporting the view that more compute yields better results.

2602.14881 2026-05-15 math.OC cs.AI Updated version

Numerical exploration of the range of shape functionals using neural networks

Eloi Martinet, Ilias Ftouhi

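Affiliations * Institute of Mathematics, University of Würzburg, Germany; Laboratoire MIPA, Nîmes University, Site des Carmes, Place Gabriel Péri, 30000 Nîmes, France

AI summary This paper proposes a neural-network-based numerical framework for exploring Blaschke–Santaló diagrams, which describe the possible inequalities relating shape functionals. An invertible neural network architecture based on gauge functions parametrizes convex sets in arbitrary dimensions and preserves convexity during shape optimization. To sample the diagram uniformly, the authors design a particle system that minimizes a Riesz energy functional via automatic differentiation, and validate the method on several geometric and PDE-type functionals of two- and three-dimensional convex bodies.

Comments 20 pages, 8 figures

Details
English abstract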

We introduce a novel numerical framework for the exploration of Blaschke--Santaló diagrams, which are efficient tools characterizing the possible inequalities relating some given shape functionals. We introduce a parametrization of convex bodies in arbitrary dimensions using a specific invertible neural network architecture based on gauge functions, allowing an intrinsic conservation of the convexity of the sets during the shape optimization process. To achieve a uniform sampling inside the diagram, and thus a satisfying description of it, we introduce an interacting particle system that minimizes a Riesz energy functional via automatic differentiation in PyTorch. The effectiveness of the method is demonstrated on several diagrams involving both geometric and PDE-type functionals for convex bodies of $\mathbb{R}^2$ and $\mathbb{R}^3$, namely, the volume, the perimeter, the moment of inertia, the torsional rigidity, the Willmore energy, and the first two Neumann eigenvalues of the Laplacian.
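To make the idea of Riesz-energy-driven uniform sampling concrete, here is a minimal sketch in plain Python. It is a toy, not the authors' PyTorch implementation: the particles live on the interval [0, 1] rather than in a diagram, the gradient is written by hand instead of obtained by automatic differentiation, and the exponent `s`, the step size, and the function names are all assumptions.

```python
def riesz_energy(points, s=1.0):
    """Riesz s-energy: sum over pairs of |x_i - x_j|^(-s)."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += abs(points[i] - points[j]) ** (-s)
    return total


def riesz_gradient(points, s=1.0):
    """Hand-written gradient of the energy with respect to each point."""
    grads = []
    for i, xi in enumerate(points):
        g = 0.0
        for j, xj in enumerate(points):
            if i != j:
                d = xi - xj  # assumed nonzero: points must be distinct
                g += -s * abs(d) ** (-s - 1.0) * (1.0 if d > 0 else -1.0)
        grads.append(g)
    return grads


def spread_points(points, steps=200, lr=1e-3, s=1.0):
    """Projected gradient descent: step downhill, then clamp to [0, 1]."""
    pts = list(points)
    for _ in range(steps):
        grads = riesz_gradient(pts, s)
        pts = [min(1.0, max(0.0, x - lr * g)) for x, g in zip(pts, grads)]
    return pts


start = [0.4, 0.5, 0.6]      # clustered particles
spread = spread_points(start)  # repulsion pushes them apart
```

Because the energy blows up when two particles approach each other, minimizing it pushes the particles apart until the boundary stops them, which is what produces an even spread.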

2602.07441 2026-05-15 cs.LG cs.AI Updated version

Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye

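Affiliations * School of Automation, Central South University; Shanghai AI Laboratory; Shanghai Jiao Tong University

AI summary This paper examines a limitation of behavior cloning (BC) regularization in offline reinforcement learning: when dataset actions are suboptimal, indiscriminate imitation caps the policy's achievable performance. The authors propose proximal action replacement (PAR), which substitutes suboptimal dataset actions with better ones, guided by the value function's local ascent direction and bounded by an uncertainty constraint to keep training stable. Experiments show that PAR improves a range of BC-regularized methods and approaches state-of-the-art results when combined with the basic TD3+BC.

Details
English abstract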

Offline reinforcement learning (RL), which optimizes policies using a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions, but it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use plug-and-play training sample replacer. PAR substitutes suboptimal dataset actions with better actions generated by a stable target policy, guided by the action-value function's local ascent direction and bounded by value uncertainty to ensure training stability. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance, and approaches state-of-the-art results simply by being combined with the basic TD3+BC.

2602.07045 2026-05-15 cs.CV cs.AI Updated version

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

Zhiming Luo, Di Wang, Haonan Guo, Jing Zhang, Bo Du

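Affiliations * School of Computer Science, Wuhan University

AI summary To advance multimodal large language models in remote sensing, the authors propose VLRS-Bench, the first vision-language reasoning benchmark dedicated to complex remote sensing reasoning. Built around the three core dimensions of cognition, decision, and prediction, it contains 2,000 question-answer pairs spanning 14 tasks and up to eight temporal phases, and is designed to evaluate complex reasoning in remote sensing scenes. By integrating remote-sensing priors and expert knowledge, VLRS-Bench strengthens geospatial realism and reasoning difficulty, and it exposes significant bottlenecks in current state-of-the-art models.

Details
English abstract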

Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average question length of 130.19 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community. The project repository is available at https://github.com/MiliLab/VLRS-Bench.

2602.06718 2026-05-15 cs.CR cs.AI Updated version

GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models

Zuyao Xu, Yuqi Qiu, Lu Sun, Fasheng Miao, Fubin Wu, Xiang Li, Xinyi Wang, Haozhe Lu, Zhengze Zhang, Yuxin Hu, Jialu Li, Luo Jin, Feng Zhang, Rui Luo, Xinran Liu, Yingxian Li, Jiaji Liu

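Affiliations * Nankai University; Tsinghua University

AI summary GhostCite studies the citation validity problems raised by the widespread use of large language models (LLMs) in academic writing. The authors develop an open-source framework, \citeb, for large-scale citation verification and analyse fabricated citations ("ghost citations") through three experiments. They find that every tested LLM fabricates a substantial share of citations across domains, that the share of invalid citations in recent conference papers has risen markedly, and that most researchers rely on AI tools while reviewers do not check references rigorously, which points to gaps in how the academic publishing system handles the problem.

Details
English abstract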

Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified: LLMs are increasingly used for academic writing, but their tendency to fabricate citations ("ghost citations") poses a systemic threat to citation validity. To quantify this threat, we develop \citeb, an open-source framework for large-scale citation verification, and conduct a comprehensive study of citation validity in the LLM era through three complementary experiments. First, we benchmark 13 LLMs on a citation generation task in various research domains, finding that all models hallucinate citations at rates from 14.23% to 94.93%. Second, we analyze 2.2 million citations from 56,381 papers at AI/ML and Security venues (2020–2025), finding that 1.07% of papers contain invalid citations, with an 80.9% increase in 2025. Third, we survey 97 researchers, finding that 87.2% use AI-powered tools in their workflows, 76.7% of reviewers do not thoroughly check references, and 74.5% view peer review as ineffective at catching citation errors. Based on these findings, we argue that ghost citations represent a systemic threat to academic integrity, and call for coordinated efforts from the community to address this challenge.

2602.04265 2026-05-15 cs.LG cs.AI Updated version

Boosting LLM Reasoning via Human-Inspired Reward Shaping

Wenze Lin, Zhen Yang, Xitai Jiang, Xiaoteng Ma, Gao Huang

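Affiliations * Tsinghua University; Southern University of Science and Technology; Mind Lab

AI summary This study addresses the improvement of reasoning in large language models (LLMs) with T2T, a dynamic reward framework inspired by human learning. Depending on how well a problem is mastered, it applies one of two reward phases, "thickening" or "thinning": incorrect attempts are rewarded for broad exploration, while correct answers incur a length penalty that encourages condensed reasoning. Experiments show that T2T significantly outperforms existing methods on several mathematical benchmarks.

Details
English abstract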

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, existing reward formulations typically treat exploration and consolidation as a monolithic process, resulting in entangled stage-wise learning dynamics. This contradicts the natural learning behavior of human learners. In human learning, individuals adopt distinct behavioral patterns toward mastered versus unfamiliar problems. When confronting unmastered challenges, humans prioritize broad exploration to seek viable solutions. By contrast, for well-mastered problems, they focus instead on reasoning condensation and knowledge abstraction to distill concise underlying principles. Motivated by this gap, we introduce T2T (Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) On incorrect attempts, T2T incentivizes "thickening" to broaden the search space and explore novel solution paths; (2) Upon achieving correctness, it shifts to "thinning", imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across 5 mainstream LLMs demonstrate that T2T significantly outperforms standard GRPO and recent baselines.
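The dual-phase idea can be sketched as a reward that changes sign in its length term depending on correctness. The abstract does not give the actual reward formula, so everything below is hypothetical: the assumption that "thickening" is rewarded through response length, the linear shape, the weight `alpha`, and the function name.

```python
def t2t_style_reward(is_correct, length, max_length, alpha=0.2):
    """Hypothetical two-phase reward, not the paper's formula.

    Correct answers start from 1.0 and lose up to `alpha` as they get longer
    ("thinning"); incorrect answers start from 0.0 and gain up to `alpha` as
    they get longer ("thickening"). With alpha < 0.5, any correct answer
    still scores higher than any incorrect one.
    """
    frac = min(max(length / max_length, 0.0), 1.0)
    if is_correct:
        return 1.0 - alpha * frac
    return alpha * frac


short_correct = t2t_style_reward(True, 100, 1000)   # concise and right
long_correct = t2t_style_reward(True, 900, 1000)    # right but verbose
long_wrong = t2t_style_reward(False, 900, 1000)     # wrong but exploratory
```

Keeping `alpha` below 0.5 is a design choice in this sketch so that the length term never outweighs correctness.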

2602.03814 2026-05-15 cs.AI cs.LG Updated version

Conformal Thinking: Risk Control for Reasoning on a Compute Budget

Xi Wang, Anushri Suresh, Alvin Zhang, Rishi More, William Jurayj, Benjamin Van Durme, Mehrdad Farajtabar, Daniel Khashabi, Eric Nalisnick

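Affiliations * Johns Hopkins University, Baltimore, Maryland, USA; Apple, USA

AI summary This paper studies how to make large language model reasoning more efficient under a limited compute budget by controlling the risk incurred during reasoning. The authors propose a risk control framework, "conformal thinking", with an upper threshold that stops reasoning once the model is confident (at the risk of an incorrect output) and a lower threshold that terminates unsolvable instances early (at the risk of stopping prematurely), minimizing compute while keeping risk under control. Experiments across a range of reasoning tasks and models show improved computational efficiency while meeting the user-specified risk target.

Comments ICML 2026

Details
English abstract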

Reasoning Large Language Models (LLMs) enable test-time scaling, with dataset-level accuracy improving as the token budget increases. This motivates adaptive reasoning: spending tokens when they improve reliability and stopping early when additional computation is unlikely to help. However, setting the token budget, as well as the threshold for adaptive reasoning, is a practical challenge that entails a fundamental risk-accuracy trade-off. We re-frame the budget setting problem as risk control, limiting the error rate while minimizing compute. Our framework introduces an upper threshold that stops reasoning when the model is confident (risking incorrect output) and a novel parametric lower threshold that preemptively stops unsolvable instances (risking premature stoppage). Given a target risk and a validation set, we use distribution-free risk control to optimally specify these stopping mechanisms. For scenarios with multiple budget controlling criteria, we incorporate an efficiency loss to select the most computationally efficient exiting mechanism. Empirical results across diverse reasoning tasks and models demonstrate the effectiveness of our risk control approach, showing computational efficiency gains from the lower threshold and ensemble stopping mechanisms while adhering to the user-specified risk target.

2602.01664 2026-05-15 cs.AI cs.LG Updated version

FlowSteer: Towards Agents Designing Agentic Workflows via Reinforced Progressive Canvas Editing

Mingda Zhang, Wenjin Liu, Tiesunlong Shen, Qika Lin, Rui Mao, Erik Cambria, Xiaoying Tang, Haoran Luo

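Affiliations * The Chinese University of Hong Kong, Shenzhen; Nanyang Technological University; National University of Singapore

AI summary FlowSteer is a new paradigm in which an agent designs agentic workflows, addressing the reliance on manual construction, the lack of global feedback, and the inability to repair errors online in current workflow building. It introduces an executable workflow canvas environment and uses reinforcement learning to apply atomic edits step by step, designing workflows end to end. Experiments show that FlowSteer significantly outperforms existing methods on multiple datasets and supports diverse operator libraries and LLM backends.

Comments 51 pages, 6 figures, 5 tables. Project page: http://flowsteer.org/

Details
English abstract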

In recent years, agentic workflows have been widely applied to solve complex human tasks. However, existing workflow construction still faces key challenges, including human-dependent workflow construction, the lack of graph-level execution feedback, and the inability to repair errors in-loop during long-horizon construction. To address these challenges, we propose FlowSteer, a new paradigm of Agent Designing Agentic Workflows - a single agent itself end-to-end designs the workflow that a downstream executor runs. To support this paradigm, we introduce the Workflow Canvas, a novel executable graph-state environment that returns syntax-checked execution feedback for every atomic edit. Built on the canvas, we further propose Reinforced Progressive Canvas Editing, in which a lightweight policy agent issues one atomic edit per turn conditioned on real canvas feedback, and is trained end-to-end via reinforcement learning. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks. Our code is available at https://anonymous.4open.science/r/FlowSteer-9B2E.

2602.01359 2026-05-15 cs.LG cs.AI Updated version

PaAno: Patch-Based Representation Learning for Time-Series Anomaly Detection

Jinju Park, Seokho Kang

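Affiliations * Department of Industrial Engineering, Sungkyunkwan University

AI summary Recent work on time-series anomaly detection increasingly adopts large neural architectures such as transformers and foundation models, but these are computationally expensive and memory-hungry, hard to use in real-time and resource-constrained settings, and show little gain under rigorous evaluation. This paper proposes PaAno, a patch-based representation learning method that extracts short temporal patches from a time series, embeds them as vectors with a 1D convolutional neural network, and trains with a combination of triplet loss and pretext loss to capture useful temporal patterns. At inference, anomaly scores are computed by comparing embeddings of the current patches with those of normal patches; on the TSB-AD benchmark PaAno significantly outperforms existing methods, including those built on heavy architectures.

Comments Accepted by the 14th International Conference on Learning Representations (ICLR 2026)

Details
English abstract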

Although recent studies on time-series anomaly detection have increasingly adopted ever-larger neural network architectures such as transformers and foundation models, they incur high computational costs and memory usage, making them impractical for real-time and resource-constrained scenarios. Moreover, they often fail to demonstrate significant performance gains over simpler methods under rigorous evaluation protocols. In this study, we propose Patch-based representation learning for time-series Anomaly detection (PaAno), a lightweight yet effective method for fast and efficient time-series anomaly detection. PaAno extracts short temporal patches from time-series training data and uses a 1D convolutional neural network to embed each patch into a vector representation. The model is trained using a combination of triplet loss and pretext loss to ensure the embeddings capture informative temporal patterns from input patches. During inference, the anomaly score at each time step is computed by comparing the embeddings of its surrounding patches to those of normal patches extracted from the training time-series. Evaluated on the TSB-AD benchmark, PaAno achieved state-of-the-art performance, significantly outperforming existing methods, including those based on heavy architectures, on both univariate and multivariate time-series anomaly detection across various range-wise and point-wise performance measures.
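The inference step described above, scoring a patch by how far its embedding lies from embeddings of normal training patches, can be sketched as follows. This is a simplified stand-in and not the PaAno implementation: the trained 1D CNN is replaced by a placeholder `embed` that only mean-centres the patch, the score is the nearest-neighbour Euclidean distance, and scores are reported per patch rather than per time step. All names are hypothetical.

```python
import math


def extract_patches(series, size):
    """All contiguous windows of length `size`."""
    return [series[i:i + size] for i in range(len(series) - size + 1)]


def embed(patch):
    """Placeholder for a learned encoder: mean-centre the patch."""
    mean = sum(patch) / len(patch)
    return [v - mean for v in patch]


def patch_anomaly_scores(train_series, test_series, size):
    """Distance from each test patch embedding to its nearest normal patch."""
    bank = [embed(p) for p in extract_patches(train_series, size)]
    scores = []
    for patch in extract_patches(test_series, size):
        e = embed(patch)
        scores.append(min(math.dist(e, b) for b in bank))
    return scores


normal = [0, 1, 0, -1] * 10                          # periodic training signal
observed = [0, 1, 0, -1, 0, 1, 5, -1, 0, 1, 0, -1]   # spike at index 6
scores = patch_anomaly_scores(normal, observed, size=4)
```

Patches that overlap the spike have no close match in the bank of normal patches, so their scores are large, while patches that repeat the training pattern score zero.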

2601.21349 2026-05-15 cs.LG cs.AI Updated version

L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama

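Affiliations * Hokkaido University

AI summary This paper proposes L2R, a unified routing framework that improves routing in Mixture-of-Experts (MoE) models. L2R assigns experts in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of the routing function, making the routing geometry smoother and more stable. It also uses a parameter-efficient multi-anchor routing mechanism to increase expert expressiveness. Experiments show that L2R improves routing and overall model performance on both language and vision tasks.

Details
English abstract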

Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on an OLMoE-based language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing geometry, expert discrimination, and overall model performance. Code will be released.

2601.19924 2026-05-15 cs.CL cs.AI cs.LG Updated version

OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, Dongdong Ge

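Affiliations * Shanghai University of Finance and Economics; Booth School of Business, University of Chicago; Antai School of Economics and Management, Shanghai Jiao Tong University

AI summary This paper studies the performance and scalability of large language models (LLMs) in optimization modeling and proposes OPT-ENGINE, an extensible benchmark framework for systematically evaluating automated formulation and solving of classical operations research problems, from linear programming to mixed-integer programming. Using this framework, the study finds that pure-text reasoning loses robustness as task complexity grows, and that external computational tools help with local calculations but still fail to satisfy global optimization constraints. It further identifies the automated formulation of constraints as the main bottleneck for the current state-of-the-art solver-integrated reasoning approach, giving a clear direction for next-generation LLMs for optimization modeling.

Journal ref Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

Details
English abstract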

We investigate the capabilities and scalability of Large Language Models (LLMs) in optimization modeling, a domain requiring structured reasoning and precise formulation. To this end, we introduce OPT-ENGINE, an extensible benchmark framework with quantifiable and controllable complexity. OPT-ENGINE spans ten canonical Operations Research problems, systematically scaling from Linear Programming to Mixed-Integer Programming, providing a structured environment to probe the limits of automated problem formulation and solving. Utilizing OPT-Engine, we address three pivotal research questions. First, we examine whether Pure-Text Reasoning (PTR) via classical Chain-of-Thought can efficiently tackle optimization tasks, finding that PTR suffers from a critical robustness gap as task complexity increases. Second, we examine whether integrating external computational tools can mitigate PTR's arithmetic weaknesses and improve performance. Our results indicate that while such tools help with local calculations, they still fail to adhere to global optimization constraints. Finally, we pinpoint that for the current SOTA paradigm, Solver-integrated Reasoning (SIR), the automated formulation of constraints represents the primary bottleneck. These findings clarify the limitations of current paradigms and provide a structured roadmap for developing next-generation LLMs for optimization modeling. We release our code and data to facilitate future research (https://github.com/Cardinal-Operations/OPTEngine).

2601.03969 2026-05-15 cs.AI cs.CL Updated version

Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models

Wei Wu, Liyi Chen, Congxi Xiao, Tianfu Wang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong

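Affiliations * University of Science and Technology of China; Xiaohongshu Inc.; The Hong Kong University of Science and Technology (Guangzhou)

AI summary This paper studies "length shift", a phenomenon in which reinforcement learning rewards lead large language models to generate redundant reasoning on simple problems during training. The authors propose Dynamic Outlier Truncation (DOT), which selectively suppresses redundant output at training time while preserving long reasoning for complex problems. Combined with auxiliary KL regularization and predictive dynamic sampling, the method improves both reasoning efficiency and performance, and experiments show it significantly outperforms existing methods on multiple tasks.

Comments Accepted by ACL 2026

Details
English abstract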

Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on the AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.
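The core rule described above, touching only the extreme tail of response lengths and only in rollout groups where every response is correct, can be illustrated with a small sketch. The abstract does not say how the tail is defined, so the mean-plus-`k`-standard-deviations threshold, the value of `k`, and the function name are assumptions made purely for illustration.

```python
import math
import statistics


def truncate_length_outliers(lengths, correct, k=2.0):
    """Cap outlier lengths in a fully correct rollout group.

    Hypothetical rule: if every rollout in the group is correct, cap each
    length at mean + k * (population standard deviation). Groups containing
    any incorrect rollout are returned unchanged, so long reasoning on hard
    problems is left alone.
    """
    if len(lengths) < 2 or not all(correct):
        return list(lengths)
    cap = statistics.mean(lengths) + k * statistics.pstdev(lengths)
    return [min(n, math.floor(cap)) for n in lengths]


group = [100, 110, 90, 105, 95, 100, 100, 1000]  # one very long rollout
easy = truncate_length_outliers(group, [True] * 8)
hard = truncate_length_outliers(group, [True] * 7 + [False])
```

In the fully correct group only the 1000-token rollout is shortened; in the group with an incorrect rollout nothing is changed.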

2601.01972 2026-05-15 cs.CL cs.AI cs.LG Updated version

Hidden State Poisoning Attacks against Mamba-based Language Models

Alexandre Le Mercier, Chris Develder, Thomas Demeester

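Affiliations * IDLab–T2K, Ghent University–imec

AI summary This paper studies Hidden State Poisoning Attacks (HiSPA) against language models built on Mamba-style state space models (SSMs), in which specific short input phrases irreversibly overwrite information in the model's hidden state and cause partial amnesia. It introduces RoBench-25, a benchmark for evaluating a model's information retrieval ability under HiSPA, and confirms that SSMs are vulnerable, including the recent hybrid models Jamba-1.7-Mini and Nemotron-3-Nano. The study also analyses the effect of HiSPA on other benchmarks and identifies hidden-layer patterns that could be used to mitigate the attack.

Comments 29 pages, 4 figures

Details
English abstract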

State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models, by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench-25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs against such attacks. Even the recent Jamba-1.7-Mini SSM--Transformer (a 52B hybrid model) collapses on RoBench-25 under some HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. We further show that the theoretical and empirical findings extend to Mamba-2, and also analyse a Mamba-2-based hybrid (Nemotron-3-Nano). Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.

2512.22331 2026-05-15 cs.CV cs.AI Updated version

The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma

Mariya Miteva, Maria Nisheva-Pavlova

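Affiliations * Faculty of Mathematics and Informatics – Sofia University St. Kliment Ohridski

AI summary This study aims to predict MGMT promoter methylation status in glioblastoma (GBM) non-invasively from multimodal magnetic resonance imaging (MRI), which matters for prognosis and treatment. To address the feature redundancy and weak modality-specific modeling of conventional unimodal and early-fusion approaches, the authors propose a multi-view latent representation learning framework based on variational autoencoders (VAE) that preserves each modality's radiomic features and performs late fusion in a compact probabilistic latent space. Combined with a random forest classifier, the method reaches a test AUC of 0.77, outperforming both the baseline and the hyperparameter-tuned model.

Comments 17 pages, 4 figures

Details
English abstract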

Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) that preserves modality-specific radiomic structure while enabling late fusion in a compact probabilistic latent space. The approach is evaluated on radiomic features extracted from the necrotic tumor core in post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Recovery (FLAIR) Magnetic Resonance Imaging (MRI). Experimental results demonstrate that the proposed multi-view VAE combined with a random forest classifier achieves a test Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) of 0.77 (95% confidence interval: 0.71-0.83), substantially outperforming both a baseline radiomics model (AUC = 0.54) and a hyperparameter-tuned model (AUC = 0.64). These findings indicate that multi-view probabilistic encoding enables more effective integration of complementary MRI information and significantly improves predictive performance for MGMT promoter methylation status.

2512.22317 2026-05-15 cs.LG cs.AI cs.CV Updated version

LangPrecip: Language-Aware Multimodal Precipitation Nowcasting

Xudong Ling, Chaorong Li, Tianxi Huang, Qian Dong, Guiduo Duan

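Affiliations * Laboratory of Intelligent Collaborative Computing, University of Electronic Science; School of Computer Science and Technology (School of Artificial Intelligence), Yibin University; College of Humanities and General Education, Chengdu Textile College

AI summary Short-term precipitation nowcasting is a highly uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving extreme weather. This paper proposes LangPrecip, a language-aware multimodal nowcasting framework that treats meteorological text as a semantic motion constraint on precipitation evolution and, under the Rectified Flow paradigm, fuses text and radar information efficiently in latent space. The authors also build LangPrecip-160k, a large multimodal dataset of 160k paired radar sequences and motion descriptions, and validate the method on the Swedish and MRMS datasets, with clear gains in heavy-rainfall forecasting.

Details
English abstract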

Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework (LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space. We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60% and 19% gains in heavy-rainfall CSI at an 80-minute lead time.

2512.11855 2026-05-15 cs.LG cs.AI Updated version

Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry

Behrooz Tahmasebi, Melanie Weber

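Affiliations * Harvard John A. Paulson School of Engineering and Applied Sciences; Harvard University

AI summary This paper studies the difference in cost between enforcing exact and approximate symmetry in machine learning and proposes "averaging complexity" as a framework for quantifying the cost of symmetry constraints. Under standard conditions, exact symmetry requires linear averaging complexity whereas approximate symmetry needs only logarithmic complexity, an exponential gap. This is the first theoretical separation of the two cases, explaining why approximate symmetry may be preferable in practice and providing new tools for the study of symmetry in machine learning.

Comments 33 pages, 2 figures. Published at ICLR 2026

Journal ref International Conference on Learning Representations (ICLR) 2026

Details
English abstract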

Enforcing exact symmetry in machine learning models often yields significant gains in scientific applications, serving as a powerful inductive bias. However, recent work suggests that relying on approximate symmetry can offer greater flexibility and robustness. Despite promising empirical evidence, there has been little theoretical understanding, and in particular, a direct comparison between exact and approximate symmetry is missing from the literature. In this paper, we initiate this study by asking: What is the cost of enforcing exact versus approximate symmetry? To address this question, we introduce averaging complexity, a framework for quantifying the cost of enforcing symmetry via averaging. Our main result is an exponential separation: under standard conditions, exact symmetry requires linear averaging complexity, whereas approximate symmetry can be attained with only logarithmic complexity in the group size. To the best of our knowledge, this provides the first theoretical separation of these two cases, formally justifying why approximate symmetry may be preferable in practice. Beyond this, our tools and techniques may be of independent interest for the broader study of symmetries in machine learning.
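For readers unfamiliar with enforcing symmetry by averaging, the toy sketch below contrasts averaging a function over every element of a group with averaging over a few sampled elements. It uses the cyclic group of shifts acting on a tuple and reads "cost" as the number of function evaluations; that reading, the sampling scheme, and all names are assumptions for illustration and are not the paper's formal definition of averaging complexity or its construction.

```python
import random


def shift(x, k):
    """Cyclic shift of a tuple by k positions (action of the cyclic group)."""
    k %= len(x)
    return x[k:] + x[:k]


def f(x):
    """An arbitrary function that is not shift-invariant."""
    return sum((i + 1) * v for i, v in enumerate(x))


def exact_average(func, x):
    """Average over all |G| = len(x) shifts: exactly invariant, |G| evaluations."""
    n = len(x)
    return sum(func(shift(x, k)) for k in range(n)) / n


def sampled_average(func, x, m, rng):
    """Average over m randomly drawn shifts: only approximately invariant."""
    n = len(x)
    return sum(func(shift(x, rng.randrange(n))) for _ in range(m)) / m


x = (1, 2, 3, 4)
full = exact_average(f, x)                          # uses 4 evaluations
rough = sampled_average(f, x, 2, random.Random(0))  # uses 2 evaluations
```

The full average returns the same value for `x` and for any shift of `x`, while the sampled average trades exact invariance for fewer evaluations.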

2511.21740 2026-05-15 cs.CL cs.AI Updated version

A cross-species neural foundation model for end-to-end speech decoding

Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski

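Affiliations * Columbia University; Stanford University; Microsoft; University of Washington

AI summary This paper proposes an end-to-end BraIn-to-Text (BIT) framework that decodes neural activity directly into coherent sentences with a single neural network, with the aim of improving communication through brain-computer interfaces. Its core is a cross-task, cross-species pretrained neural encoder combined with audio large language models and contrastive learning, which yields a lower word error rate than the prior end-to-end method. The work sets a new state of the art on the Brain-to-Text '24 and '25 benchmarks and demonstrates cross-task generalization.

Details
English abstract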

Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end BraIn-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.

2511.18903 2026-05-15 cs.LG cs.AI cs.CL Updated version

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

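Affiliations * Tsinghua University; Peng Cheng Laboratory

AI summary In curriculum-based pretraining of large language models (LLMs), how well high-quality data is used is limited by the learning rate (LR) decay schedule. The paper finds that the advantage of training on data sorted by ascending quality shrinks markedly under a decaying LR schedule. It proposes two simple remedies: a more moderate LR decay, or replacing LR decay with model averaging. Combined, they improve the average benchmark score by 1.64% over random shuffling without additional data refinement, which points to co-designing data curricula with optimization methods.

Details
English abstract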

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
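
The second strategy in the abstract, a weighted average of the final few checkpoints, can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the checkpoint layout (a dict of parameter name to a flat list of floats), the function name `average_checkpoints`, and the example weights are all hypothetical, and the abstract does not specify how the authors choose the weights.

```python
from typing import Dict, List, Sequence

Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Checkpoint],
                        weights: Sequence[float]) -> Checkpoint:
    """Return the weighted average of several checkpoints.

    Weights are normalized to sum to 1, so only their ratios matter.
    """
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    norm = [w / total for w in weights]
    averaged: Checkpoint = {}
    for name, first_values in checkpoints[0].items():
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(len(first_values))
        ]
    return averaged


# Hypothetical toy checkpoints; the later one gets three times the weight.
final_checkpoints = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
merged = average_checkpoints(final_checkpoints, weights=[1.0, 3.0])
```

With these toy values, `merged["w"]` is `[2.5, 3.5]`. In a real training run the same arithmetic would be applied to the model's parameter tensors rather than Python lists.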

2511.18739 2026-05-15 cs.AI cs.LG stat.ML Updated version

A Problem-Oriented Taxonomy of Evaluation Metrics for Time Series Anomaly Detection

Kaixiang Yang, Jiarong Liu, Yupeng Song, Shuanghua Yang, Yujue Zhou

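Affiliations * School of Artificial Intelligence, Yunnan University; Beijing Normal University – Hong Kong Baptist University

AI summary Time series anomaly detection is widely used in IoT and cyber-physical systems, but evaluating it is difficult because application scenarios are diverse and metrics rest on different assumptions. This paper proposes a problem-oriented taxonomy that reinterprets existing metrics by the evaluation problem each is meant to solve, grouping them into six dimensions: accuracy, timeliness, tolerance to labeling imprecision, penalties for human-audit cost, robustness to random scores, and cross-dataset comparability. Experiments under different detection scenarios quantify each metric's ability to distinguish genuine detections from random noise. Most event-level metrics discriminate well, whereas some widely used metrics are sensitive to random-score inflation, so metrics should be chosen to fit the task.

Details
English abstract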

Time series anomaly detection is widely used in IoT and cyber-physical systems, yet its evaluation remains challenging due to diverse application objectives and heterogeneous metric assumptions. This study introduces a problem-oriented framework that reinterprets existing metrics based on the specific evaluation challenges they are designed to address, rather than their mathematical forms or output structures. We categorize over twenty commonly used metrics into six dimensions: 1) basic accuracy-driven evaluation; 2) timeliness-aware reward mechanisms; 3) tolerance to labeling imprecision; 4) penalties reflecting human-audit cost; 5) robustness against random or inflated scores; and 6) parameter-free comparability for cross-dataset benchmarking. Comprehensive experiments are conducted to examine metric behavior under genuine, random, and oracle detection scenarios. By comparing their resulting score distributions, we quantify each metric's discriminative ability -- its capability to distinguish meaningful detections from random noise. The results show that while most event-level metrics exhibit strong separability, several widely used metrics (e.g., NAB, Point-Adjust) demonstrate limited resistance to random-score inflation. These findings reveal that metric suitability must be inherently task-dependent and aligned with the operational objectives of IoT applications. The proposed framework offers a unified analytical perspective for understanding existing metrics and provides practical guidance for selecting or developing more context-aware, robust, and fair evaluation methodologies for time series anomaly detection.
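
To make the Point-Adjust concern in the abstract concrete, here is a small Python sketch of the commonly described point-adjust protocol: if any point inside a labeled anomaly segment is flagged, the whole segment is counted as detected. The function names and the toy data are hypothetical, and this is not the paper's evaluation code.

```python
from typing import List


def point_adjust(labels: List[int], preds: List[int]) -> List[int]:
    """Mark an entire true anomaly segment as detected if any point in it is hit."""
    adjusted = list(preds)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] == 1:
            j = i
            while j < n and labels[j] == 1:
                j += 1  # j ends one past the segment
            if any(preds[i:j]):
                for k in range(i, j):
                    adjusted[k] = 1
            i = j
        else:
            i += 1
    return adjusted


def f1_score(labels: List[int], preds: List[int]) -> float:
    """Point-wise F1 for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy series: one six-point anomaly, and a detector that hits a single point.
labels = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
preds = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
raw_f1 = f1_score(labels, preds)
adjusted_f1 = f1_score(labels, point_adjust(labels, preds))
```

On this toy series a single lucky hit lifts F1 from 2/7 to 1.0 after adjustment, which is the kind of inflation that lets random scores look strong on long anomaly segments.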

2511.16964 2026-05-15 cs.MA cs.AI cs.DC Updated version

Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems

Kirill Nagaitsev, Luka Grbcic, Samuel Williams, Costin Iancu

Affiliations * NVIDIA Corporation; Microsoft Corporation

AI summary This paper studies how LLM-based multi-agent systems can optimize PyTorch inference performance. Using a logical framework to compare multi-agent optimization systems, it finds that exploit-heavy strategies paired with error-fixing agents perform best and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup over PyTorch Eager (1.85x over torch.compile) on an H100 GPU.

Details
English abstract

Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup over PyTorch Eager (1.85x over torch.compile) on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch. Code is publicly available at: https://github.com/pike-project/pike

2511.15408 2026-05-15 cs.CL cs.AI cs.IR cs.MA cs.NE Updated version

Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization

Shanlin Zhou, Xinpeng Wang, Jianxun Lian, Zhenghao Liu, Laks V. S. Lakshmanan, Xiaoyuan Yi, Yongtao Hao

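Affiliations * Tongji University; Microsoft Research Asia; Northeastern University; University of British Columbia

AI summary This study addresses Chinese short-form creative content generation, where verifying outputs under personalized constraints is difficult, and proposes an explanation-oriented multi-objective optimization approach. It models the task as a heterogeneous multi-objective optimization problem that jointly optimizes the personalized constraints and explanation reliability, and designs MAGIC-HMO, a training-free multi-agent framework that optimizes through iterative generation and verification. Experiments on the Chinese Baby Naming benchmark show that it significantly outperforms six strong baselines.

Comments 19 pages, 10 figures. Submitted to ACM for possible publication

Details
English abstract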

Chinese demonstrates high semantic compactness and rich metaphorical expressiveness, enabling limited text to convey dense meanings while increasing the difficulty of generation and verification, particularly in short-form creative natural language generation (CNLG). In the real world, users often require personalized, fine-grained creative constraints, making reliable verification critical to guiding optimization. According to Brunswik's Lens Model from psychology, whether constraints are satisfied can be inferred from sufficient observable cues. Existing studies are mainly outcome-oriented, implicitly assuming that the outcome itself provides adequate cues for verification. However, this assumption breaks down in Chinese short-form CNLG (e.g., naming or advertising) with diverse personalized constraints, where extremely brief outcomes inherently offer limited information. Explanations can naturally serve as extra cues. Nevertheless, under complex constraints, LLMs' explanations may suffer from hallucination, incompleteness, or ambiguity. To address these issues, we formalize the Chinese short-form CNLG task as a heterogeneous multi-objective optimization (HMO) problem that jointly optimizes multiple personalized constraints and explanation reliability. We further propose MAGIC-HMO, a training-free multi-agent framework that optimizes these objectives through iterative generation and verification under an explanation-oriented multi-objective strategy. Experiments on Chinese Baby Naming, a challenging benchmark, demonstrate that MAGIC-HMO significantly outperforms six strong baselines across various LLM backbones. Data and code are available at https://github.com/foolfun/MAGIC_HMO.

2511.13397 2026-05-15 cs.CV cs.AI Updated version

Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising

Affiliations * Department of Electronic and Computer Engineering, University of Limerick; Data Driven Computer Engineering Research Centre, University of Limerick; Lero, The Irish Software Research Centre, University of Limerick; Valeo Vision Systems

AI summary This paper presents Distance-Annotated Traffic Perception Question Answering (DTPQA), a visual question answering benchmark for evaluating the perception capabilities of vision-language models in traffic scenes. It comprises a synthetic part and a real-world part, and each question is annotated with the distance between the object in question and the camera, so model performance can be analyzed as a function of distance. The benchmark offers a targeted tool for assessing perception in automated driving.

Journal ref IEEE Data Descriptions, 2026

Details
English abstract

The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.

2511.08565 2026-05-15 cs.CL cs.AI cs.CY Updated version

Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

Davi Bastos Costa, Felippe Alves, Renato Vicente

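Affiliations * TELUS Digital Research Hub; Center for Artificial Intelligence and Machine Learning; Institute of Mathematics, Statistics and Computer Science; University of São Paulo

AI summary This study examines how large language models respond morally under persona role-play. It builds a benchmark on the Moral Foundations Questionnaire (MFQ) to quantify moral susceptibility and moral robustness, and uses two complementary procedures to analyze how moral judgments vary across personas. Moral robustness differs markedly between model families, with the Claude family the most robust and the dependence attributed mainly to post-training. Moral susceptibility varies far less, shows no clear family dependence, and appears to be determined mainly by pre-training. The study also reports moral foundation profiles for models without role-play and for personas averaged across models.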

Comments Added experiments with a logit-based method and now reporting unbounded metrics

Details
English abstract

Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting an LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas. We estimate these quantities with two complementary procedures, repeated sampling and a logit-based method that directly estimates the rating distributions and enables temperature analysis. We evaluate 15 models across six families: Claude, DeepSeek, Gemini, GPT, Grok, and Llama. The two metrics show qualitatively different patterns. Moral robustness varies by more than an order of magnitude, with a coefficient of variation of about $152\%$, and is explained almost entirely by model family. The Claude family is, by a significant margin, the most robust, about 30 times more so than the lower-performing families (DeepSeek, Grok, and Llama), while Gemini and GPT occupy an intermediate tier. This strong family dependence suggests that robustness is primarily shaped by post-training. Moral susceptibility, by contrast, spans a much narrower range, with a coefficient of variation of about $13\%$, and the most susceptible model is only 1.6 times more susceptible than the least. Unlike robustness, susceptibility shows no clear family dependence, suggesting that it is primarily determined by pre-training. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in LLMs and a window into the internal machinery they use to instantiate personas.

2511.05820 2026-05-15 cs.SE cs.AI Updated version

From Ranking to Reasoning: Explainable Web API Recommendation via Semantic Reasoning

Zishuo Xu, Dezhong Yao, Yao Wan

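Affiliations * School of Software Engineering; School of Computer Science and Technology; Huazhong University of Science and Technology

AI summary With the rapid growth of Web APIs, automated API recommendation has become essential for efficient mashup development. Existing methods use fixed recommendation strategies that cannot adapt to mashup complexity and offer little explanation. This paper proposes WAR-R1, a framework that combines semantic reasoning with variable-cardinality recommendation: a lightweight large language model generates the recommended APIs together with natural-language explanations, and special start and stop tokens let the number of recommendations adapt. Experiments show that WAR-R1 outperforms existing methods in both recommendation accuracy and explanation quality.

Details
English abstract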

The rapid growth of Web APIs has made automated Web API recommendation essential for efficient mashup development. However, existing approaches suffer from two major limitations: 1) they rely on fixed top-N recommendation strategies that cannot adapt to mashup complexity, and 2) they provide little or no explanation for recommended APIs, limiting transparency and user trust. To address these challenges, we propose WAR-R1, an explainable Web API recommendation framework that integrates semantic reasoning with adaptive, variable-cardinality recommendation. Built on a lightweight large language model (LLM), WAR-R1 generates both a set of relevant APIs and a natural-language justification for each recommendation. To support adaptive recommendation size, we introduce special start and stop tokens that allow the model to learn when to begin and terminate API generation. WAR-R1 is trained in two stages: supervised fine-tuning on an annotated mashup-API corpus, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) with low-rank adaptation to jointly optimize recommendation accuracy and reasoning quality. Experiments on the ProgrammableWeb dataset show that WAR-R1 outperforms state-of-the-art baselines by up to 10.89% in recommendation accuracy while consistently producing high-quality, semantically grounded explanations. Extensive ablation studies validate the effectiveness of reinforcement learning, special token design, and integrated reasoning.

2510.19973 2026-05-15 cs.NI cs.AI Updated version

A Tutorial on Cognitive Biases in Agentic AI-Driven 6G Autonomous Networks

Hatim Chergui, Farhad Rezazadeh, Merouane Debbah, Christos Verikoukis

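Affiliations * i2CAT Foundation; Hostelworld Group; Technical University of Catalonia (UPC); Khalifa University of Science and Technology; ISI/ATH University of Patras

AI summary This tutorial surveys common cognitive biases in agentic AI-driven 6G autonomous networks, covering their taxonomy, mathematical formulation, and emergence in telecom systems, and proposes tailored mitigation strategies. Two 6G network management use cases show how a locally hosted large language model and an improved memory mechanism reduce anchoring bias and temporal and confirmation biases, which improves resource allocation and yields notable energy savings and latency reductions.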

Comments 26 pages, 18 figures, 4 tables, link to source code available. Accepted at IEEE OJCOMS

Details
English abstract

The path to higher network autonomy in 6G lies beyond the mere optimization of key performance indicators (KPIs), requiring systems that perceive and reason over the network environment as it is. This can be achieved through agentic AI, where large language model (LLM)-powered agents utilize multimodal telemetry, memory, and cross-domain negotiation to achieve multi-objective goals. However, deploying such agents introduces cognitive biases inherited from human design, which can severely distort reasoning and actuation. This paper provides a comprehensive tutorial on well-known cognitive biases, detailing their taxonomy, mathematical formulation, emergence in telecom systems, and tailored mitigation strategies. We validate these concepts through two distinct use-cases in 6G management. First, we tackle anchoring bias in inter-slice resource negotiation. To overcome the prohibitive execution delays of cloud-based LLMs, this use-case deploys a locally hosted 1B-parameter model on an RTX A4000 GPU, successfully achieving sub-second inference latencies compatible with near-real-time operations. By replacing fixed heuristic anchors with a Truncated Weibull randomized anchor strategy, the agents dismantle rigid biases, intelligently consume SLA slack, and dynamically double the system-wide energy savings (peaking at 25\%) without violating strict latency limits. Second, we mitigate temporal and confirmation biases in RAN-Edge cross-domain negotiation by designing an unbiased collective memory. By integrating semantic/temporal decay and an inflection bonus that actively highlights past negotiation failures, agents are prevented from over-relying on recent data or repeating past mistakes. Grounding decisions in this richer, debiased historical context yields highly robust agreements, achieving a $\times 5$ latency reduction and roughly 40\% higher energy savings compared to memoryless baselines.

2510.15982 2026-05-15 cs.LG cs.AI Updated version

AMiD: Knowledge Distillation for LLMs with $α$-mixture Assistant Distribution

Donghyeok Shin, Yeongmin Kim, Suhyeon Jo, Byeonghu Na, Il-Chul Moon

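Affiliations * Korea Advanced Institute of Science and Technology

AI summary This paper proposes AMiD, a knowledge distillation method for large language models, which are costly in computation and memory. It introduces the α-mixture assistant distribution, which extends earlier assistant distributions through a new design variable α, and builds a unified knowledge distillation framework on it. Experiments indicate that AMiD outperforms existing methods in performance and training stability.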

Comments The Fourteenth International Conference on Learning Representations (ICLR 2026)

Details
English abstract

Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and training instability caused by near-zero probabilities, stemming from the high-dimensional output of LLMs, remain fundamental limitations. To overcome these challenges, several approaches implicitly or explicitly incorporating an assistant distribution have recently been proposed. However, past proposals of assistant distributions have been fragmented, lacking a systematic investigation of the interpolation path and the divergence. This paper proposes $α$-mixture assistant distribution, a novel generalized family of assistant distributions, and $α$-mixture distillation, coined AMiD, a unified framework for KD using the assistant distribution. The $α$-mixture assistant distribution provides a continuous extension of the assistant distribution by introducing a new distribution design variable $α$, which has been fixed in all previous approaches. Furthermore, AMiD generalizes the family of divergences used with the assistant distributions based on optimality, which has also been restricted in previous works. Through extensive experiments, we demonstrate that AMiD offers superior performance and training stability by leveraging a broader and theoretically grounded assistant distribution space. We release the code at https://github.com/aailab-kaist/AMiD.

2510.04682 2026-05-15 cs.CL cs.AI Updated version

TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

Chanjoo Jung, Jaehyung Kim

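Affiliations * Yonsei University

AI summary This paper proposes TiTok, a framework that addresses the fact that LoRA fine-tuned parameters cannot be transferred across different base models. It extracts task-relevant information through a token-level contrast between a source model with and without LoRA, which enables efficient LoRA transplantation. In experiments on three benchmarks, TiTok achieves average gains of 4% to 10% over baselines.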

Comments ICLR 2026

Details
English abstract

Large Language Models (LLMs) are widely applied in real-world scenarios, yet fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs; however, the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data; nevertheless, this adds complexity since it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a token-wise contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, we demonstrate that TiTok is consistently effective, achieving average performance gains of 4% to 10% over baselines.

2509.26100 2026-05-15 cs.AI Updated version

AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models

Yixu Wang, Xin Wang, Yang Yao, Xinyuan Li, Xibang Yang, Yan Teng, Xingjun Ma, Yingchun Wang

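Affiliations * Shanghai Artificial Intelligence Laboratory; Fudan University; The University of Hong Kong; East China Normal University

AI summary As large language models are deployed in high-stakes domains, static evaluation methods struggle to keep up with the dynamic nature of AI risks and evolving regulations. This paper proposes AgenticEval, an agent-driven safety evaluation paradigm in which a multi-agent framework autonomously parses policy documents, continually generates and evolves a comprehensive safety benchmark, and refines test cases through a self-evolving evaluation loop. Experiments show that it exposes deep safety vulnerabilities that traditional evaluation misses, underscoring the need for dynamic evaluation in safe AI deployment.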

Comments Findings of ACL 2026

Details
English abstract

The rapid integration of Large Language Models (LLMs) into high-stakes domains necessitates reliable safety and compliance evaluation. However, existing static benchmarks are ill-equipped to address the dynamic nature of AI risks and evolving regulations, creating a critical safety gap. This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process rather than a one-time audit. We then propose a novel multi-agent framework AgenticEval, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. AgenticEval leverages a synergistic pipeline of specialized agents and incorporates a Self-evolving Evaluation loop, where the system learns from evaluation results to craft progressively more sophisticated and targeted test cases. Our experiments demonstrate the effectiveness of AgenticEval, showing a consistent decline in model safety as the evaluation hardens. For instance, GPT-5's safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations. These findings reveal the limitations of static assessments and highlight our framework's ability to uncover deep vulnerabilities missed by traditional methods, underscoring the urgent need for dynamic evaluation ecosystems to ensure the safe and responsible deployment of advanced AI.

2509.23023 2026-05-15 cs.AI Updated version

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

Davi Bastos Costa, Renato Vicente

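Affiliations * TELUS Digital Research Hub; Center for Artificial Intelligence and Machine Learning; Institute of Mathematics, Statistics and Computer Science

AI summary This paper introduces Mini-Mafia, a simplified social deduction game for evaluating large language models in multi-agent interaction. By analyzing the exchange among the mafioso, the detective, and the villager, it derives an analytical formula that predicts the mafia win rate, and builds on it the Mini-Mafia Benchmark, which quantifies each model's deception, detection, and disclosure capabilities. Experiments show that the formula predicts outcomes well across model combinations and reveal several counterintuitive findings about current mainstream models.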

Comments Adds a validation section for the theoretical model and restructures the presentation

Details
English abstract

Large language models are increasingly deployed in multi-agent settings whose outcomes hinge on social intelligence, motivating evaluations of their interactive capabilities; yet existing studies remain overwhelmingly empirical, leaving us without a theoretical understanding of how agent interactions determine collective outcomes. To address this, we introduce Mini-Mafia, a four-player simplification of the social deduction game Mafia in which a fixed night phase reduces the game to a single critical exchange among a mafioso, a detective, and a villager. In this setting, we show that the mafia win-rate $p$ is predicted by the analytical formula $\text{logit}(p) = v \times (m - d)$, where $m$, $d$, and $v$ represent the mafioso's deception, the detective's disclosure, and the villager's detection capabilities. We turn this analytical framework into the Mini-Mafia Benchmark, where Bayesian inference over gameplay data yields per-model estimates of the intrinsic parameters $m$, $d$, and $v$. For $I$ models, only $3I$ parameters suffice to predict the outcomes of all $I^3$ tournament combinations; and in 5-fold cross-validation the formula achieves a $76.6\%$ Brier-score reduction over a random baseline. The benchmark also reveals counterintuitive results: Grok 3 Mini is the strongest detector and GPT-5 Mini the strongest discloser, both ahead of DeepSeek V3.1, Claude Opus 4, and Claude Sonnet 4; while Claude Sonnet 4 is the weakest detector, near random chance. Together, these results show that Mini-Mafia, a simple but nontrivial multi-agent system, admits an analytical description and serves as a principled benchmark for language model interactions.
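
The win-rate formula in the abstract, logit(p) = v × (m − d), can be evaluated directly by inverting the logit. The Python sketch below does only that; the function name and the parameter values are hypothetical illustrations, not estimates from the paper, and the Bayesian inference step is not shown.

```python
import math


def mafia_win_probability(m: float, d: float, v: float) -> float:
    """Invert logit(p) = v * (m - d) to get the mafia win rate p.

    m: mafioso's deception, d: detective's disclosure,
    v: villager's detection (symbols as defined in the abstract).
    """
    return 1.0 / (1.0 + math.exp(-v * (m - d)))


# Hypothetical parameter values, chosen only for illustration.
p_balanced = mafia_win_probability(m=1.0, d=1.0, v=2.0)       # logit = 0
p_mafia_favored = mafia_win_probability(m=1.5, d=0.5, v=2.0)  # logit = 2
```

When deception and disclosure cancel (m = d), the formula gives p = 0.5 regardless of v; a larger v magnifies whichever side has the edge.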

2509.22746 2026-05-15 cs.AI cs.CV Updated version

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

Zejun Li, Yingxiu Zhao, Jiwen Zhang, Siyuan Wang, Yang Yao, Runzhou Zhao, Jun Song, Bo Zheng, Zhongyu Wei

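Affiliations * Fudan University; Alibaba Group Holding Limited; Future Living Lab of Alibaba; University of Southern California; Shanghai Innovation Institute

AI summary Current visual reasoning methods mostly explore specific reasoning modes; they improve particular domains but struggle to develop general reasoning capabilities. This paper proposes Mixture-of-Visual-Thoughts (MoVT), an adaptive reasoning paradigm that unifies different reasoning modes in a single model and selects the appropriate mode from context. Its two-stage framework, AdaVaR, learns the modes in a supervised cold-start stage and then induces context-adaptive mode selection through reinforcement learning with the AdaGRPO algorithm. Experiments show consistent improvements across a range of scenarios.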

Comments 27 pages, 11 figures, 5 tables, accepted by ICLR 2026

Details
English abstract

Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.

2508.06226 2026-05-15 cs.AI Updated version

GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

Yumeng Fu, Jiayin Zhu, Lingling Zhang, Wenjun Wu, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Jun Liu

Affiliations * School of Computer Science and Technology, Xi’an Jiaotong University; Ministry of Education Key Laboratory of Intelligent Networks and Network Security, China; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, China; School of Software Engineering, Xi’an Jiaotong University

AI summary GeoLaux is a fine-grained benchmark dataset for evaluating multimodal large language models (MLLMs) on long-step geometry problems that require auxiliary line construction. It contains 2,186 calculation and proof problems with an average solution length of 6.51 steps, 41.8% of which require auxiliary lines. A five-dimensional evaluation of 23 leading MLLMs finds that performance drops markedly on long-step problems, that weak understanding of auxiliary lines is a key obstacle to geometric reasoning, and that limited answer hints improve process correctness.

Comments 26 pages, 24 figures

Details
English abstract

Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a performance drop of over 50%. Second, it is crucial to enhance models' understanding, awareness, and proficiency in auxiliary line construction, which is vital for overall geometric reasoning. Third, limited answer hints effectively improve process correctness, whereas explicit answers lead models to neglect intermediate reasoning steps. These findings position GeoLaux both to benchmark MLLMs' geometric reasoning abilities and to guide their improvement. Data and code are available at https://github.com/Candice-yu/GeoLaux

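2508.06202 2026-05-15 cs.CV cs.AI Updated version

LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning

Chang Che, Ziqi Wang, Pengwan Yang, Qi Wang, Hui Ma, Zenglin Shi

Affiliations * Hefei University of Technology; University of Amsterdam; Tsinghua University

AI summary Continual Visual Instruction Tuning (CVIT) lets multimodal large language models learn new tasks incrementally but suffers from catastrophic forgetting. To address this, the paper proposes LiLoRA, an efficient architecture expansion method that shares the LoRA matrix A across tasks and applies an additional low-rank decomposition to matrix B, which substantially reduces parameter overhead, and adds a cosine-regularized stability loss to keep shared representations consistent. Experiments show that LiLoRA achieves better performance on a diverse CVIT benchmark while improving parameter efficiency.

Comments AAAI 2026 Oral Presentation. 9 pages

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence, 40(24):19978-19986, 2026

Details
English abstract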

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches. The code is available at https://github.com/chanceche/LiLoRA.
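
The parameter saving described in the abstract can be illustrated with a small counting sketch. It assumes a per-task update of the form B_t A with A shared across tasks and B_t factored as P_t Q_t; the abstract does not give the exact factorization, so the shapes, the inner rank `r_inner`, and all function names here are hypothetical. The stability term is shown as one minus cosine similarity, which is one plausible reading of "cosine-regularized" and not necessarily the paper's definition.

```python
import math

def lora_params(d_out, d_in, r, tasks):
    """Plain per-task LoRA: every task owns A (r x d_in) and B (d_out x r)."""
    return tasks * (r * d_in + d_out * r)

def lilora_params(d_out, d_in, r, r_inner, tasks):
    """Assumed layout: one shared A, per-task B = P (d_out x r_inner) @ Q (r_inner x r)."""
    shared = r * d_in
    per_task = d_out * r_inner + r_inner * r
    return shared + tasks * per_task

def cosine_stability_loss(old, new):
    """1 - cosine similarity between flattened old and new shared parameters."""
    dot = sum(a * b for a, b in zip(old, new))
    norm = math.sqrt(sum(a * a for a in old)) * math.sqrt(sum(b * b for b in new))
    return 1.0 - dot / norm

print(lora_params(4096, 4096, 16, tasks=6))       # 786432
print(lilora_params(4096, 4096, 16, 4, tasks=6))  # 164224
```

With these illustrative sizes the shared-A layout needs roughly 4.8 times fewer adapter parameters for six tasks; the actual ratio depends on the ranks chosen.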

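2508.01916 2026-05-15 cs.LG cs.AI cs.CL Updated version

Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

Xinting Huang, Michael Hahn

Affiliations * Saarland University

AI summary This paper studies how to decompose a neural network's representation space into interpretable subspaces with unsupervised learning. The authors propose neighbor distance minimization (NDM), which learns subspaces aligned with the model's internal concepts without labels. Experiments show that these subspaces capture abstract concepts in the input and are strongly connected to known circuit variables in models such as GPT-2, offering a new perspective on model internals.

Comments Published as a conference paper at ICLR 2026

Details
English abstract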

Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these "natural" subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to "variables" used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.

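2507.21433 2026-05-15 cs.LG cs.AI Updated version

ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

Kaiwen Chen, Xin Tan, Minchen Yu, Jingzong Li, Hong Xu

Affiliations * The Chinese University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; The Hang Seng University of Hong Kong

AI summary Large reasoning models (LRMs) play a key role in many AI inference systems, but deploying them in production raises quality-of-service (QoS) challenges: long auto-regressive inference incurs high memory overhead, which limits throughput and increases latency. The paper proposes ReasonCache, a KV cache management approach based on a collaborative filtering algorithm that identifies KV cache blocks corresponding to similar intermediate reasoning steps and reuses them with zero copy. Experiments report a peak throughput improvement of 89.2% and an average gain of 40-60%, while maintaining higher accuracy than existing KV cache management techniques.

Comments 10 pages, 7 figures

Details
English abstract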

Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant QoS challenge: the substantial memory overhead from their long, auto-regressive inference processes severely limits throughput and increases latency, thereby affecting the quality of service for concurrent users. We observe that LRMs frequently generate highly similar intermediate reasoning steps, which, in turn, correspond to highly similar KV cache states across layers. Building on this insight, we propose ReasonCache, a novel KV cache management approach designed to improve the QoS of AI inference systems. ReasonCache utilizes a Collaborative Filtering Algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse. Experimental evaluation demonstrates that ReasonCache achieves a peak throughput improvement of 89.2% and an average gain of 40-60%, leading to more responsive and cost-effective AI inference services. Notably, this performance is achieved while maintaining higher accuracy compared to existing KV cache management techniques.

2507.13941 2026-05-15 q-bio.NC cs.AI cs.CV eess.IV Updated version

Shared representations in brains and models reveal a two-route cortical organization during scene perception

Pablo Marcos-Manchón, Lluís Fuentemilla

Affiliations * Department of Cognition, Development and Education Psychology, Faculty of Psychology, University of Barcelona; Institute of Neurosciences, University of Barcelona; Bellvitge Institute for Biomedical Research

AI summary The study analyzes 7T fMRI data to examine how information is organized and routed in the human brain during scene perception. Using representational similarity analysis, it compares representational structure shared across individuals with hierarchical features from vision and language neural networks and finds two separate processing routes: one for scene layout and environmental context, the other for animate content. The finding refines classical visual-stream models, characterizing scene perception as a distributed cortical network with separable representational routes.

Comments For associated code, see https://github.com/memory-formation/convergent-transformations

Details
English abstract

The brain transforms visual inputs into high-dimensional cortical representations that support diverse cognitive and behavioral goals. Characterizing how this information is organized and routed across the human brain is essential for understanding how we process complex visual scenes. Here, we applied representational similarity analysis to 7T fMRI data collected during natural scene viewing. We quantified representational geometry shared across individuals and compared it to hierarchical features from vision and language neural networks. This analysis revealed two distinct processing routes: a ventromedial pathway specialized for scene layout and environmental context, and a lateral occipitotemporal pathway selective for animate content. Vision models aligned with shared structure in both routes, whereas language models corresponded primarily with the lateral pathway. These findings refine classical visual-stream models by characterizing scene perception as a distributed cortical network with separable representational routes for context and animate content.
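
Representational similarity analysis, the technique named in the abstract, can be sketched in a few lines. This is a generic illustration and not the authors' pipeline: it builds a dissimilarity matrix (1 minus Pearson correlation) over stimuli for each system and then correlates the two matrices. The function names and toy data are hypothetical, and Pearson correlation is used for the second-order comparison only to keep the example short; rank correlation is a common alternative.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rdm_upper(patterns):
    """Upper triangle of the dissimilarity matrix (1 - Pearson r) between stimuli."""
    n = len(patterns)
    return [1.0 - pearson(patterns[i], patterns[j])
            for i in range(n) for j in range(i + 1, n)]

def rsa_score(brain_patterns, model_features):
    """Correlate the two dissimilarity structures; stimuli must share one order."""
    return pearson(rdm_upper(brain_patterns), rdm_upper(model_features))

# Toy data: four stimuli, three measurement channels each.
brain = [[1, 2, 3], [1, 2, 4], [3, 1, 0], [4, 0, 1]]
model = [[2 * v + 1 for v in row] for row in brain]  # same geometry, rescaled
print(rsa_score(brain, model))
```

Because the comparison is made between dissimilarity structures and not raw responses, the two systems may have different numbers of channels, which is what allows fMRI voxels to be compared with network layers.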

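2506.16608 2026-05-15 cs.LG cs.AI Updated version

Distributions as Actions: A Unified Framework for Diverse Action Spaces

Jiamin He, A. Rupam Mahmood, Martha White

Affiliations * Department of Computing Science, University of Alberta; Alberta Machine Intelligence Institute (Amii); CIFAR AI Chair, Amii

AI summary The paper proposes a reinforcement learning framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the action space continuous whether the original actions are discrete, continuous, or hybrid. The authors also propose a generalized deterministic policy gradient estimator, DA-PG, and a practical TD3-based actor-critic algorithm, DA-AC, which performs competitively across a range of control tasks.

Comments Accepted to ICLR 2026 (camera-ready)

Details
English abstract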

We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.
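
The shift of the agent-environment boundary can be pictured as a wrapper around a discrete-action environment, so that the agent emits a probability vector and sampling happens on the environment side. The sketch below is an illustration under that reading of the abstract; the class `DistributionActionEnv`, the toy `bandit_step` function, and the interface are hypothetical and are not taken from the paper's code.

```python
import random

class DistributionActionEnv:
    """Wraps a discrete-action step function so the action is a probability vector."""

    def __init__(self, step_fn, n_actions, seed=0):
        self.step_fn = step_fn
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def step(self, probs):
        if len(probs) != self.n_actions or abs(sum(probs) - 1.0) > 1e-6:
            raise ValueError("probs must be a distribution over the original actions")
        # Sampling is now part of the environment, not of the policy.
        action = self.rng.choices(range(self.n_actions), weights=probs, k=1)[0]
        return self.step_fn(action)

def bandit_step(action):
    """Toy three-armed bandit: only arm 2 pays a reward."""
    return 1.0 if action == 2 else 0.0

env = DistributionActionEnv(bandit_step, n_actions=3)
print(env.step([0.0, 0.0, 1.0]))  # 1.0
```

From the agent's point of view the action is now a point on the probability simplex, which is continuous, so a deterministic policy gradient method can be applied even though the underlying actions are discrete.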

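2505.09246 2026-05-15 cs.IR cs.AI cs.CL Updated version

Autofocus Retrieval: An Effective Pipeline for Multi-Hop Question Answering With Semi-Structured Knowledge

Derian Boer, Stephen Roth, Stefan Kramer

Affiliations * Institute of Computer Science; Johannes Gutenberg University Mainz

AI summary The paper presents Autofocus-Retriever (AF-Retriever), a multi-hop question answering framework over semi-structured knowledge bases that combines structured and unstructured information. It uses exchangeable large language models to extract entity attributes and relational constraints, together with vector similarity search and incremental scope expansion, and achieves the best zero- and one-shot results on three STaRK QA benchmarks. Its pipeline consists of four constraint-driven retrieval steps followed by four additional processing steps that supplement and rank the answer candidates.

Journal ref Transactions on Machine Learning Research 2026

Details
English abstract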

In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. Yet most rely on only one of the two. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data. In this work, we present Autofocus-Retriever (AF-Retriever), a modular framework for SKB-based, multi-hop question answering. It combines structural and textual retrieval through novel integration steps and optimizations, achieving the best zero- and one-shot results across all three STaRK QA benchmarks, which span diverse domains and evaluation metrics. AF-Retriever's average first-hit rate surpasses the second-best method by 32.1%. Its performance is driven by (1) leveraging exchangeable large language models (LLMs) to extract entity attributes and relational constraints for both parsing and reranking the top-k answers, (2) vector similarity search for ranking both extracted entities and final answers, (3) a novel incremental scope expansion procedure that prepares a configurable number of candidates that best fulfill the given constraints for reranking, and (4) a hybrid retrieval strategy that reduces error susceptibility. In summary, while constantly adjusting the focus like an optical autofocus, AF-Retriever delivers a configurable number of answer candidates in four constraint-driven retrieval steps, which are then supplemented and ranked through four additional processing steps. An ablation study and a detailed error analysis, including a comparison of three different LLM reranking strategies, provide component-level insights. The source code is available at https://github.com/kramerlab/AF-Retriever.

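2505.01584 2026-05-15 cs.LG cs.AI Updated version

Silent Neuron Theory and Plasticity Preservation for Deep Reinforcement Learning in Adaptive Video Streaming

Zhiqiang He, Zhi Liu

Affiliations * Department of Computer and Network Engineering, The University of Electro-Communications, Japan

AI summary The paper studies deep reinforcement learning for adaptive video streaming. To address poor generalization caused by the heterogeneity of real-world network bandwidth, it proposes the Silent Neuron theory, which characterizes plasticity loss in neural networks more accurately than existing dormant neuron metrics. Based on this theory, the authors design Reset Silent Neuron (ReSiN), which preserves plasticity through strategic neuron resets guided by both forward and backward propagation states and thereby improves adaptation under non-stationary network conditions. Experiments show that ReSiN clearly outperforms existing methods on bitrate and QoE and remains robust across network conditions.

Details
English abstract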

Adaptive video streaming optimizes Quality of Experience (QoE) metrics by selecting appropriate bitrates according to varying network bandwidth and user demands. In practice, however, real-world network bandwidth often exhibits heterogeneity relative to training environments. Current methods predominantly tackle this problem through learning-based approaches designed to improve generalization performance. Yet our systematic investigation reveals a critical limitation: neural networks suffer from plasticity loss, significantly impeding their ability to adapt to heterogeneous network conditions. Through theoretical analysis of neural propagation mechanisms, we demonstrate that existing dormant neuron metrics inadequately characterize neural plasticity loss. To address this limitation, we have developed the Silent Neuron theory, which provides a more comprehensive framework for understanding plasticity degradation. Based on these theoretical insights, we propose Reset Silent Neuron (ReSiN), which preserves neural plasticity through strategic neuron resets guided by both forward and backward propagation states. Moreover, we establish a tighter performance bound for ReSiN under non-stationary network conditions. In our implementation of an adaptive video streaming system, ReSiN has shown significant improvements over existing solutions, achieving up to 168% higher bitrate and 108% better QoE while maintaining comparable smoothness. Furthermore, ReSiN consistently outperforms existing solutions in stationary environments, demonstrating its robust adaptability across different network conditions.

2504.18544 2026-05-15 cs.LG cs.AI cs.CY Updated version

Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review

Nazia Nafis, Inaki Esnaola, Alvaro Martinez-Perez, Maria-Cruz Villa-Uriol, Venet Osmani

Affiliations * Healthy Lifespan Institute, School of Computer Science, University of Sheffield; School of Electrical and Electronic Engineering, University of Sheffield; Healthy Lifespan Institute, School of Sociological Studies, Politics and International Relations, University of Sheffield; Digital Environment Research Institute, Queen Mary University of London

AI summary The paper systematically reviews recent research on generating and evaluating synthetic tabular health data and identifies key challenges, including a lack of consensus on evaluation methods, inconsistent application of metrics, and limited involvement of domain experts. In response, it proposes structured taxonomies and practical evaluation guidelines aimed at more rigorous, standardized evaluation and at the responsible development and use of synthetic health data.

Comments 32 pages

Details
English abstract

Generating synthetic tabular health data is challenging, and evaluating their quality is equally, if not more, complex. This systematic review highlights the critical importance of rigorous evaluation of synthetic health data to ensure reliability, clinical relevance, and appropriate use. From an initial identification of 2067 relevant papers published in the last ten years, 134 studies were selected for detailed analysis. Our review identifies key challenges, including lack of consensus on evaluation methods, inconsistent application of evaluation metrics, limited involvement of domain experts, inadequate reporting of dataset characteristics, and limited reproducibility of results. In response, we provide a structured consolidation of synthetic data generation and evaluation methods into taxonomies, alongside practical guidelines to support more robust and standardised evaluation practices. These findings aim to support the responsible development and use of synthetic health data, aligned with emerging expectations around transparency, reproducibility, and governance, ultimately enabling the community to fully harness its transformative potential and accelerate innovation.

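2504.11703 2026-05-15 cs.CR cs.AI Updated version

Progent: Securing AI Agents with Privilege Control

Tianneng Shi, Jingxuan He, Zhun Wang, Hongwei Li, Linyu Wu, Wenbo Guo, Dawn Song

Affiliations * UC Berkeley; UC Santa Barbara; National University of Singapore

AI summary AI agents interact with external environments through tool calls, which exposes them to attacks such as indirect prompt injection that can trigger unauthorized actions. The paper proposes Progent, a framework that secures AI agents through privilege control. Progent represents privilege as a security policy of symbolic rules over tool names and arguments and checks every tool call through a deterministic procedure, enforcing the principle of least privilege. An LLM generates the initial policy and updates it during execution, and an SMT solver classifies each update as a narrowing or an expansion so that privileges cannot grow without approval. Experiments on AgentDojo and ASB show markedly lower attack success rates while utility is maintained.

Details
English abstract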

AI agents interact with external environments through tool calls, exposing them to attacks like indirect prompt injection that can trigger unauthorized actions. Securing these agents is challenging: they behave autonomously and probabilistically, security requirements evolve depending on the user's task and execution state, and there is an inherent trade-off between security and utility. In this work, we introduce Progent, a novel framework that secures AI agents via privilege control. Progent represents privilege as a security policy consisting of symbolic rules over tool names and arguments. These rules specify which tool calls are allowed for task completion and which unnecessary ones are blocked for security. Every tool call is checked against such a policy through a deterministic procedure, enforcing the principle of least privilege. To handle diverse user tasks and evolving execution contexts, an LLM automatically generates the initial policy from the user's task and updates it during execution as new information arrives. Each proposed update is determined by an SMT solver to be either a narrowing (applied automatically) or an expansion (requiring explicit approval), ensuring that the agent's effective action space can only shrink without approval (monotonic confinement). This deterministic update mechanism preserves utility and prevents silent privilege escalation, even when adversarial inputs are present. Our evaluation on popular benchmarks (i.e., AgentDojo and ASB) shows that Progent significantly reduces attack success rates while maintaining high utility. We further validate Progent's practicality by showcasing its effectiveness in real-world agent frameworks such as LangChain and OpenAI Agents SDK.
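
The deterministic check of a tool call against rules over tool names and arguments can be sketched as follows. This is a minimal illustration and not Progent's policy language: the `POLICY` structure (tool name mapped to regular expressions per argument), the tool names, and the function `is_allowed` are all hypothetical, and the SMT-based classification of policy updates is not shown.

```python
import re

# Hypothetical policy: tool name -> {argument name: regex the value must fully match}.
POLICY = {
    "read_file": {"path": r"/workspace/.*"},
    "send_email": {"to": r"[^@]+@example\.com"},
}

def is_allowed(policy, tool, args):
    """Deterministic check: unknown tools and non-matching arguments are denied."""
    rules = policy.get(tool)
    if rules is None:
        return False  # default deny keeps the action space minimal
    return all(
        name in args and re.fullmatch(pattern, str(args[name])) is not None
        for name, pattern in rules.items()
    )

print(is_allowed(POLICY, "read_file", {"path": "/workspace/notes.txt"}))    # True
print(is_allowed(POLICY, "send_email", {"to": "attacker@evil.test"}))       # False
print(is_allowed(POLICY, "delete_file", {"path": "/workspace/notes.txt"}))  # False
```

Default deny is what makes the check a least-privilege mechanism: a tool that the policy never mentions is blocked regardless of what an injected prompt asks for. A real policy would also need to normalize values such as paths before matching, which this sketch omits.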

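2504.01571 2026-05-15 cs.GR cs.AI cs.CV cs.LG Updated version

Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation

Aleksander Plocharski, Jan Swidzinski, Przemyslaw Musialski

Affiliations * Warsaw University of Technology; Akces NCBR; Imperial College London; New Jersey Institute of Technology

AI summary The paper proposes Procedural Diffusion Guidance (Pro-DG) for architectural facade generation, which uses hierarchical procedural rules to generate control maps within the stable diffusion framework and produce photo-realistic facade images. Starting from a single input image and its segmentation, an inverse procedural module identifies the facade's hierarchical layout, and a new ControlNet pipeline combines this hierarchy with structural features to generate facade images guided by procedural transformations. The method allows precise control of local appearance together with extensive structural edits, and experiments show that it outperforms existing methods at preserving architectural identity and achieving controllable edits.

Comments 17 pages, 15 figures, Computer Graphics Forum 2026 Journal Paper

Details
English abstract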

We use hierarchical procedural rules for the generation of control maps within the stable diffusion framework to produce photo-realistic architectural facade images. Starting from a single input image and its segmentation, we apply an inverse procedural module to identify the facade's hierarchical layout. Leveraging this hierarchy and structural features, we introduce a novel ControlNet pipeline that generates new facade imagery guided by procedural transformations. Our method enables various structural edits, including floor duplication and window rearrangement, by integrating hierarchical alignment directly into control maps. This precisely guides the diffusion-based generative process, ensuring local appearance fidelity alongside extensive structural modifications. Comprehensive evaluations, including comparisons with inpainting-based approaches and synthetic benchmarks, confirm our approach's superior capability in preserving architectural identity and achieving accurate, controllable edits. Quantitative results and user feedback validate our method's effectiveness.

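2410.03280 2026-05-15 eess.AS cs.AI cs.LG eess.SP Updated version

Manikin-Recorded Cardiopulmonary Sounds Dataset Using Digital Stethoscope

Yasaman Torabi, Shahram Shirani, James P. Reilly

Affiliations * Electrical and Computer Engineering Department, McMaster University

AI summary The study presents a dataset of cardiopulmonary sounds recorded with a digital stethoscope, covering normal sounds and a range of abnormal heart and lung sounds such as murmurs, arrhythmias, and abnormal breath sounds. The recordings were collected from a clinical manikin, include individual and mixed sounds at different body locations, and were processed with frequency filters to highlight specific sound types. The dataset is a resource for AI research on automated cardiopulmonary disease detection, sound classification, and deep learning for audio signals.

Journal ref IEEE Data Descriptions, vol. 2, pp. 133-140, 2025

Details
English abstract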

Heart and lung sounds are crucial for healthcare monitoring. Recent improvements in stethoscope technology have made it possible to capture patient sounds with enhanced precision. In this dataset, we used a digital stethoscope to capture both heart and lung sounds, including individual and mixed recordings. To our knowledge, this is the first dataset to offer both separate and mixed cardiorespiratory sounds. The recordings were collected from a clinical manikin, a patient simulator designed to replicate human physiological conditions, generating clean heart and lung sounds at different body locations. This dataset includes both normal sounds and various abnormalities (i.e., murmur, atrial fibrillation, tachycardia, atrioventricular block, third and fourth heart sound, wheezing, crackles, rhonchi, pleural rub, and gurgling sounds). The dataset includes audio recordings of chest examinations performed at different anatomical locations, as determined by specialist nurses. Each recording has been enhanced using frequency filters to highlight specific sound types. This dataset is useful for applications in artificial intelligence, such as automated cardiopulmonary disease detection, sound classification, unsupervised separation techniques, and deep learning algorithms related to audio signal processing.

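2410.02091 2026-05-15 cs.SE cs.AI cs.HC econ.GN q-fin.EC Updated version

The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot

Fangchen Song, Ashish Agarwal, Wen Wen

Affiliations * University of Texas at Austin

AI summary The study examines the impact of generative AI on collaborative open-source software (OSS) development, focusing on GitHub Copilot, an AI pair programmer, in OSS projects on GitHub. It finds that Copilot use increases project-level code contributions by 5.9%, driven by higher developer participation and individual productivity, but also increases coordination time by 8%. The effects differ between core and peripheral developers, which has implications for the long-term structure of OSS communities.

Details
English abstract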

Generative artificial intelligence (AI) facilitates content production and enhances ideation capabilities, which can significantly influence developer productivity and participation in software development. To explore its impact on collaborative open-source software (OSS) development, we investigate the role of GitHub Copilot, a generative AI pair programmer, in OSS development where multiple distributed developers voluntarily collaborate. Using GitHub's proprietary Copilot usage data, combined with public OSS project data obtained from GitHub, we find that Copilot use increases project-level code contributions by 5.9%. This gain is driven by a 3.4% rise in developer coding participation and a 2.1% increase in individual productivity. However, Copilot use also leads to an increase in coordination time by 8% due to more code discussions. This reveals an important tradeoff: While AI expands who can contribute and how much they contribute, it slows coordination in collective development efforts. Despite this tension, the combined effect of these two competing forces remains positive, indicating a net gain in overall project-level timely merge of code contributions from using AI pair programmers. Interestingly, we also find the effects differ across developer roles. Peripheral developers show relatively smaller increases in project-level code contributions and experience larger increases in coordination time than core developers. In summary, our study underscores the dual role of AI pair programmers in affecting project-level code contributions and coordination time in OSS development. Our findings on the differential effects between core and peripheral developers also provide important implications for the structure of OSS communities in the long run.

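2409.10038 2026-05-15 cs.CL cs.AI cs.LG Updated version

On the Diagram of Thought

Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

Affiliations * IIIS, Tsinghua University; Shanghai Qi Zhi Institute

AI summary Large language models (LLMs) excel at many tasks but struggle with complex problems that require structured, multi-step reasoning. The paper proposes the Diagram of Thought (DoT), a framework in which a single LLM builds and navigates a map of its own reasoning: by constructing a dynamic diagram of ideas, the model can propose different lines of thought, critique its own steps, and synthesize validated insights into a final conclusion. The approach needs no external search algorithm or planner, relies only on a deterministic online validator, and is grounded in a mathematical framework from category theory that provides an auditable step-by-step trace and semantic guarantees for the typed reasoning.

Comments 30 pages

Details
English abstract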

Large Language Models (LLMs) excel at many tasks but often falter on complex problems that require structured, multi-step reasoning. We introduce the Diagram of Thought (DoT), a framework that enables a single LLM to build and navigate a mental map of its reasoning. Instead of thinking in a straight line, the model constructs a dynamic diagram of ideas, where it can propose different lines of thought, critique its own steps, and synthesize validated insights into a final conclusion. This process is controller-light: it does not require an external search algorithm or planner, but it does use a deterministic online validator for grammar-constrained typed traces, register constraints, and optional solver checks. To clarify the reliability target of this process, we ground DoT in a mathematical framework from category theory. We interpret accepted typed reasoning records as diagrams in a slice topos and model synthesis of the selected proposer subdiagram as a finite limit. In the predicate fragment, this same object is equivalently a variance-reversed colimit in the opposite information order. The resulting formalism gives an auditable, step-by-step trace of the LLM's typed reasoning and separates semantic guarantees for the typed subtrace from unconstrained natural-language text and uncertified operational edges.

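2408.16307 2026-05-15 cs.RO cs.AI Updated version

Safe Bayesian Optimization for Complex Control Systems via Additive Gaussian Processes

Hongxuan Wang, Xiaocong Li, Lihao Zheng, Adrish Bhaumik, Prahlad Vadakkepat

Affiliations * National University of Singapore; SIMTech, A*STAR; CUHK, Shenzhen

AI summary The paper proposes SafeCtrlBO, a safe Bayesian optimization method for simultaneously tuning the parameters of multiple coupled controllers in complex control systems. It uses additive Gaussian-process kernels to capture low-order structure across controller gains and reduce sample complexity, and it replaces the computationally expensive expander step of earlier methods with a boundary-based expansion rule. Experiments show that SafeCtrlBO reaches high-performing controller parameters with fewer hardware evaluations while maintaining the high-probability safety criterion and satisfying the hard signal-safety constraint.

Comments The shorter version has been accepted by IEEE Robotics and Automation Letters. This is the full version

Details
English abstract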

Automatic controller tuning is attractive for robotics and mechatronic systems whose dynamics are difficult to model accurately, but direct black-box optimization can be unsafe because each query is executed on the physical plant. Existing safe Bayesian optimization (BO) methods provide high-probability safety guarantees, yet their practical use in multi-loop control is limited by two coupled difficulties: the controller parameter space is often moderately high-dimensional, and hardware evaluations are too expensive to allow hundreds or thousands of exploratory trials. This paper proposes SafeCtrlBO, a safe BO method for simultaneously tuning multiple coupled controllers. The method uses additive Gaussian-process kernels to encode low-order structure across controller gains and reduce the sample complexity associated with dense full-dimensional kernels. It also replaces the expensive potential-expander computation used in SafeOpt-style exploration with a boundary-based expansion rule that preserves the intended safe-set expansion behavior under explicit geometric conditions and is validated empirically. Experiments on synthetic benchmarks and on a permanent magnet synchronous motor (PMSM) speed-control platform show that SafeCtrlBO reaches high-performing controller parameters with fewer hardware evaluations than representative safe BO baselines, while maintaining the prescribed high-probability safety criterion and avoiding violations of the hard signal-safety constraint in the hardware study. The code implementation is publicly available at https://github.com/hxwangnus/SafeCtrlBO.
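
The difference between an additive kernel and a dense full-dimensional kernel can be shown with a small sketch. It assumes a first-order additive structure, that is, a sum of one-dimensional RBF kernels with one term per controller gain; the abstract only says "low-order", so the paper's kernels may also include interaction terms. The function names and lengthscales are hypothetical.

```python
import math

def rbf_1d(a, b, lengthscale=1.0):
    """One-dimensional squared-exponential kernel."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def additive_kernel(x, y, lengthscales):
    """First-order additive kernel: a sum of 1-D RBF kernels, one per gain."""
    return sum(rbf_1d(a, b, l) for a, b, l in zip(x, y, lengthscales))

def full_rbf_kernel(x, y, lengthscales):
    """Dense alternative: a product over dimensions, so all gains interact."""
    return math.prod(rbf_1d(a, b, l) for a, b, l in zip(x, y, lengthscales))

# Two gain vectors that agree in the first gain and differ widely in the second.
x, y = [0.0, 0.0], [0.0, 100.0]
print(additive_kernel(x, y, [1.0, 1.0]))
print(full_rbf_kernel(x, y, [1.0, 1.0]))
```

Under the additive kernel the two points still share information through the gain they have in common, while the product kernel treats them as unrelated. That sharing is why additive structure can lower the number of evaluations needed when the objective really does decompose across gains.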

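2303.14511 2026-05-15 hep-ex cs.AI cs.LG hep-ph physics.data-an Updated version

Improving robustness of jet tagging algorithms with adversarial training: exploring the loss surface

Annika Stein

Affiliations * Center for Theoretical Physics, Sloane Physics Laboratory, Yale University; III. Physics Institute A, RWTH Aachen University

AI summary The paper studies how adversarial training can improve the robustness of jet tagging algorithms in high-energy physics, focusing on how slight distortions of input features affect model performance. By exploring the geometry of the loss surface, the author gives a geometric interpretation of robustness, with adversarial attacks serving as a model for systematic uncertainties, and shows that adversarial training improves robustness while maintaining high performance.

Comments 5 pages, 2 figures; submitted to ACAT 2022 proceedings

Journal ref 2026 J. Phys.: Conf. Ser. 3206 012085

Details
English abstract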

In the field of high-energy physics, deep learning algorithms continue to gain in relevance and provide performance improvements over traditional methods, for example when identifying rare signals or finding complex patterns. From an analyst's perspective, obtaining the highest possible performance is desirable, but recently some attention has shifted towards studying the robustness of models, that is, how well they perform under slight distortions of input features. Especially for tasks that involve many (low-level) inputs, the application of deep neural networks brings new challenges. In the context of jet flavor tagging, adversarial attacks are used to probe a typical classifier's vulnerability and can be understood as a model for systematic uncertainties. A corresponding defense strategy, adversarial training, improves robustness while maintaining high performance. Investigating the loss surface corresponding to the inputs and models in question reveals geometric interpretations of robustness, taking correlations into account.
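
A slight, loss-increasing distortion of input features can be sketched with the fast gradient sign method (FGSM) on a logistic classifier. The abstract does not name the attack used, so FGSM is an assumption made for illustration, and the weights, inputs, and step size below are toy values and not jet-tagging data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a logistic classifier on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(w, b, x, y, eps):
    """Shift every input by eps in the direction that increases the loss."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic regression
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

w, b = [1.5, -2.0, 0.5], 0.1
x, y = [0.2, -0.4, 1.0], 1
x_adv = fgsm(w, b, x, y, eps=0.05)
print(loss(w, b, x, y), loss(w, b, x_adv, y))
```

Adversarial training, in this picture, means computing `x_adv` for each training example and fitting the model on the distorted inputs, so that the loss surface around each point becomes flatter in the directions an attack would exploit.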

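2605.14177 2026-05-15 cs.IR cs.AI cs.CL Updated version

Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models

Harshita Chopra, Krishna Kant Chintalapudi, Suman Nath, Ryen W. White, Chirag Shah

Affiliations * University of Washington; Microsoft Research

AI summary The paper studies how prospection can guide language models in retrieving user-specific facts from long conversation histories to improve personalized dialogue systems. Because conventional retrieval relies on semantic similarity and misses relevant facts that lie far from the query, the authors propose Prospection-Guided Retrieval (PGR), which uses plausible future steps as retrieval probes to surface relevant memories that similarity-based methods overlook. Experiments on several benchmarks show substantial gains in retrieval and response quality.

Comments Preprint

Details
English abstract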

Long-horizon personalization requires dialogue assistants to retrieve user-specific facts from extended interaction histories. In practice, many relevant facts often have low semantic similarity to the query under dense retrieval. Standard Retrieval-Augmented Generation (RAG) and GraphRAG systems are still largely retrospective: they rely on embedding similarity to the query or on fixed graph traversals, so they often miss facts that matter for the user's needs but lie far from the query in embedding space. Inspired by prospection, the human ability to use imagined futures as cues for recall, we introduce Prospection-Guided Retrieval (PGR), which decouples retrieval from how memories are stored. Given a user query, PGR first expands the goal into a short Tree-of-Thought (ToT) or linear chain of plausible next steps, and uses these steps as retrieval probes rather than relying on the original query alone. The facts retrieved by these probes are then used to personalize the next round of prospection, enabling PGR to uncover additional memories that become relevant only after the simulation is grounded in the user's history. We also introduce MemoryQuest, a challenging multi-session benchmark in which each query is annotated with 3-5 dated reference facts subject to a low query-reference similarity constraint. Across 1,625 queries spanning 185 user profiles from 3 publicly available datasets, PGR-TOT substantially improves retrieval, including nearly 3x recall on MemoryQuest over the strongest baseline. In pairwise LLM-as-judge comparisons against baselines, PGR-generated responses are preferred on 89-98% of queries, with blinded human annotations on held-out subsets showing the same trend. Overall, the results demonstrate that explicit prospection yields large gains in long-horizon retrieval and response quality relative to similarity-only baselines.

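2605.14175 2026-05-15 cs.AI Updated version

Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations

Qisong He, Yi Dong, Xiaowei Huang

Affiliations * School of Computer Science and Informatics, University of Liverpool, UK

AI summary The paper proposes Grounded Continuation, a runtime verifier that checks whether an LLM's reply in a long conversation rests on premises that are still valid in the current context. It maintains an explicit dependency graph: each turn is classified into one of several update operations, and the dependencies between claims and evidence are recorded, so that checking a reply and tracing unsupported conclusions has linear per-turn cost. On the reported benchmarks the verifier matches or exceeds LLM-only and retrieval-augmented baselines, with the largest gain on detecting stale premises.

Details
English abstract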

In long conversations, an LLM can produce a next utterance that sounds plausible but rests on premises the conversation has already abandoned. Context-manipulation attacks against deployed agents now actively exploit this gap. We close it with a runtime verifier that maintains an explicit dependency graph: an LLM classifies each turn into one of 8 update operations drawn from four formalisms (dynamic epistemic logic, abductive reasoning, awareness logic, argumentation), and a symbolic engine records which claims depend on which evidence. Checking whether a continuation is supported reduces to a graph walk; retraction propagates through the same graph to flag exactly the conclusions that lose support, with linear per-turn cost and a formal conflict-free guarantee. On LongMemEval-KU oracle (n=78), the verifier reaches 89.7% accuracy vs. 88.5% for the LLM-only baseline (+1.3pp) and 87.2% for a transcript-RAG baseline matched on retrieval budget (+2.6pp); wins among disagreements are correct abstentions where the baseline confabulates. On LoCoMo's 60 official QA items the verifier is competitive with retrieval-augmented baselines. Beyond external benchmarks, we construct two multi-agent scenarios and a 50-item grounding test: on the 15-item stale-premise subset, the verifier reaches 100% accuracy vs. 93.3% (+6.7pp). These instantiate a soundness-faithfulness decomposition: the structural check is sound by construction, and per-deployment LLM extraction faithfulness is the empirical question we measure across four LLM families. The retraction check plateaus at microseconds while history-replay grows linearly with conversation length.
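
The graph-walk idea in this abstract (support checking and retraction propagating through recorded dependencies) can be sketched as follows. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, every claim is treated as needing all of its premises, and the LLM turn classifier and the eight update operations are not modeled.

```python
from collections import defaultdict, deque

class DependencyGraph:
    """Toy claim/evidence graph; names and semantics are illustrative only."""

    def __init__(self):
        self.dependents = defaultdict(set)  # premise -> claims resting on it
        self.active = set()                 # claims currently supported

    def assert_claim(self, claim, premises=()):
        # A new claim is accepted only if all its premises are still active.
        if not self.is_supported(premises):
            raise ValueError(f"unsupported premises for: {claim}")
        for premise in premises:
            self.dependents[premise].add(claim)
        self.active.add(claim)

    def is_supported(self, premises):
        return all(p in self.active for p in premises)

    def retract(self, claim):
        """Retract a claim and return every claim that loses support."""
        lost = set()
        queue = deque([claim])
        while queue:
            current = queue.popleft()
            if current not in self.active:
                continue  # already retracted; each claim is visited once
            self.active.discard(current)
            lost.add(current)
            queue.extend(self.dependents[current])
        return lost
```

Because each claim is deactivated at most once, a retraction touches only the claims downstream of the retracted premise, which is the intuition behind checking a continuation by walking the graph instead of replaying the whole transcript.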

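2605.14167 2026-05-15 cs.AI cs.CY Version update

The Evaluation Trap: Benchmark Design as Theoretical Commitment

Theodore J Kalaitzidis

Affiliations * Brown University

AI summary This paper examines how the theoretical assumptions implicit in AI benchmarks shape both the definition of capability and the direction of progress, arguing that when these assumptions go unexamined, benchmarks entrench the dominant paradigm and limit genuine understanding of capability. It proposes a methodology called Epistematics for deriving evaluation criteria directly from technical capability claims and for testing whether a benchmark can distinguish the claimed capability from proxy behaviors. Its core contribution is a meta-evaluative framework comprising an audit procedure, a failure mode taxonomy, and benchmark-design criteria intended to improve the coherence between evaluations and the capabilities they target.

Comments 13 pages

Details
English abstract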
Every AI benchmark operationalizes theoretical assumptions about the capability it claims to assess. When assumptions function as unexamined commitments, benchmarks stabilize the dominant paradigm by narrowing what counts as progress. Over time, narrow evaluation reorganizes capability concepts: architectures and definitions are selected for benchmark legibility until evaluation ceases to track an independent object and instead produces a version of the target defined by its own operational assumptions. The result is a trap: evaluation frameworks treat self-reinforcing assessments as valid, both creating and obscuring structural limits on what the current paradigm can accomplish. We introduce Epistematics, a methodology for deriving evaluation criteria directly from technical capability claims and auditing whether proposed benchmarks can discriminate the claimed capability from proxy behaviors. The contribution is meta-evaluative: an audit procedure, a failure mode taxonomy, and benchmark-design criteria for evaluating capability-evaluation coherence. We demonstrate the procedure through a worked audit of Dupoux et al. (2026), a proposal that revises the dominant paradigm's theoretical assumptions at the architectural level while reproducing them in its evaluation criteria, thereby entrenching the constraint it seeks to overcome in a form the evaluation cannot detect.

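2605.14164 2026-05-15 cs.AI Version update

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

Stefan Baack, Christo Buschek, Maty Bohacek

Affiliations * Independent Researcher Stanford University

AI summary This study examines the benchmarking cultures that builders of foundation and generative AI models rely on when assessing model capabilities, finding that the main venue has shifted from academic papers to company press releases and blog posts, which now largely define the state of the art. By building and open-sourcing the Benchmarking-Cultures-25 dataset, the authors analyze 231 benchmarks highlighted across 139 model releases from 11 major AI builders in 2025, reveal a fragmented evaluation landscape with limited cross-model comparability, and propose a unified taxonomy for reconciling the differing ways builders describe what benchmarks measure.

Details
English abstract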
The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few achieve widespread use (e.g., GPQA Diamond, LiveCodeBench, AIME 2025). Moreover, benchmarks are attributed different competencies by different builders, depending on their narrative. To disentangle these conflicting presentations, we develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure. "General knowledge application" is the second most popular, yet vaguely defined, category. Qualitative analysis shows many such benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI. Their authors claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math). We argue that highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation. Data: https://hf.co/datasets/matybohacek/benchmarking-cultures-25; tool: https://bench-cultures.net.

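2605.14163 2026-05-15 cs.AI Version update

Agentic Systems as Boosting Weak Reasoning Models

Varun Sunkaraneni, Pierfrancesco Beneventano, Riccardo Neumarker, Tomaso Poggio, Tomer Galanti

Affiliations * Texas A&M University MIT

AI summary This paper studies how combining the outputs of multiple weak reasoning-model calls can reach the performance of a strong model. The core method is verifier-backed committee search, in which proposal, critic, and comparator modules work together at inference time. The authors prove that adding more samples is not sufficient on its own: reliable selection also requires a local soundness signal such as execution or type checking. Experiments show that a well-designed orchestration of weak-model calls can match the performance of stronger models, and that the main challenge is selecting the correct solutions from the proposal pool.

Details
English abstract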
Can a committee of weak reasoning-model calls reach the performance of much stronger models? We study verifier-backed committee search as inference-time boosting for reasoning language models. The mechanism is not simply that "more agents help": samples expose latent correct solutions, while critics and comparators must recover them without access to the hidden verifier. We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators; reliable amplification requires an additional local soundness signal, such as execution, proof checking, type checking, tests, or constraint solving. We give rank-based bounds showing when local selection errors compose into reliable trajectories, and characterize the proposer-side ceiling: oracle best-of-k converges only to the mass of task slices on which the proposal system assigns nonzero useful probability. Empirically, on SWE-bench Verified, a single GPT-5.4 nano proposal solves 67.0% of tasks. Using the same nano model, our critic-comparator orchestration reaches 76.4% with k=8 proposals, matching the standalone performance of Gemini 3 Pro and Claude Opus 4.5 Thinking and approaching the 79.0% oracle best-of-8 upper bound. Thus, many correct patches are already present in weak-model proposal pools; the main challenge is selecting them. The remaining failures are mostly proposal-coverage failures, indicating shared blind spots that stronger selection alone cannot close.
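
The proposer-side ceiling can be made concrete with a small calculation. The sketch below assumes proposals are independent draws and uses an invented three-slice task distribution; neither the independence assumption nor the numbers come from the paper. Under those assumptions, a slice solved with per-sample probability p is covered by k samples with probability 1 - (1 - p)^k, so slices with p = 0 are never covered however large k gets.

```python
def coverage_at_k(p, k):
    """Probability that at least one of k independent proposals is correct."""
    return 1.0 - (1.0 - p) ** k

def oracle_best_of_k(slices, k):
    """Expected oracle best-of-k accuracy over (mass, p) task slices.

    `slices` is a list of (mass, p) pairs whose masses sum to 1; p is the
    per-proposal success probability on that slice.
    """
    return sum(mass * coverage_at_k(p, k) for mass, p in slices)

# Hypothetical distribution: 60% easy tasks, 20% hard tasks,
# and 20% tasks the proposer can never solve (p = 0).
SLICES = [(0.6, 0.9), (0.2, 0.3), (0.2, 0.0)]

def ceiling(slices):
    """Limit of oracle best-of-k as k grows: mass of slices with p > 0."""
    return sum(mass for mass, p in slices if p > 0.0)
```

With these invented slices, oracle accuracy rises with k but can only approach the 0.8 of probability mass on which the proposer has a non-zero chance, which mirrors the abstract's point that the remaining failures are coverage failures that better selection cannot fix.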

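2605.14153 2026-05-15 cs.CR cs.AI Version update

ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents

Seunghyun Lee, David Brumley

Affiliations * Carnegie Mellon University Bugcrowd

AI summary This paper presents ExploitBench, a capability-graded benchmark for evaluating large language models (LLMs) in cybersecurity, which decomposes exploitation into 16 measurable flags ranging from coverage and crash through to full control of the target. A deterministic verification mechanism assesses model performance at each stage. Experiments on 41 V8 bugs show that publicly deployed frontier models routinely reach the vulnerable code and trigger crashes but fall clearly short on advanced capabilities such as arbitrary code execution, whereas the private model shows stronger exploitation capability.

Details
English abstract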
Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target. However, existing LLM security benchmarks treat a crash as exploitation success. That single binary outcome collapses the hard parts of exploitation: the transition from triggering a bug to constructing reusable primitives and control. We present ExploitBench, a capability-graded benchmark that decomposes exploitation into 16 measurable flags, from coverage and crash through sandbox primitives, arbitrary read/write, control-flow hijack, and arbitrary code execution. Each capability is verified by a deterministic oracle that uses a per-run randomized challenge-response for primitives, differential execution against ground-truth binaries to measure progress, and a signal-handler proof for code execution. We instantiate ExploitBench on 41 V8 bugs because V8 is both widely deployed and exploitation-hardened. We report three arms: <model,env> as the primary measurement of model-environment capability, <model,env, adaptive coaching> as a secondary arm that adds adaptive coaching to test whether targeted feedback shifts outcomes, and <model,env,harness> as an ablation that swaps in the model's native CLI to check whether vendor-side optimizations increase exploitation capabilities. Our results show a sharp capability split between publicly deployed frontier models and the private frontier. Across the 8 publicly deployed models tested, reaching the vulnerable code and triggering a crash is routine, but arbitrary code execution is not. The private model shows arbitrary code execution on approximately half. Overall, results suggest that exploit construction against hardened targets is an emerging frontier capability.

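2605.14152 2026-05-15 cs.CL cs.AI cs.CR cs.CY Version update

ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety

Michael S. Lee, Yash Maurya, Drew Rein, Bert Herring, Jonathan Nguyen, Kyungho Song, Udari Madhushani Sehwag, Jiyeon Cho, Kaustubh Deshpande, Yeongkyun Jang, Jiyeon Joo, Minn Seok Choi, Evi Fuelle, Christina Q Knight, Joseph Brandifino, Max Fenkell

Affiliations * Scale AI

AI summary This paper introduces ROK-FORTRESS, a bilingual benchmark for evaluating large language model risks in national security and public safety, focusing on the English-Korean language pair and the U.S.-ROK geopolitical context. Using a "transcreation matrix", the method separates the effects of language and geopolitical grounding and systematically evaluates models' safety behavior across languages and entity contexts. The study finds that the combination of Korean language and Korean geopolitical grounding has a marked effect on model safety behavior, that models differ in how they respond to it, and that translation-only evaluation may underestimate the risks arising from the interaction of language and geopolitical context.

Comments 16 pages main body + appendix (63 total), 5 main figures, 4 main tables; dataset at https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public

Details
English abstract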
Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks, yet multilingual safety is typically assessed through translation-only benchmarks that preserve the underlying scenario, and empirical evidence of how language and geopolitical context interact remains limited to a narrow set of language pairs. We introduce ROK-FORTRESS (https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public), a bilingual, culturally adversarial NSPS benchmark that uses the English-Korean language pair and U.S.-ROK geopolitical axis as a case study, separating the effects of language and geopolitical grounding via a transcreation matrix: adversarial intents are evaluated under controlled combinations of (i) English versus Korean language and (ii) U.S. versus Korean entities, institutions, and operational details. Each adversarial prompt is paired with a dual-use benign counterpart to quantify over-refusal. Model responses are then scored using calibrated LLM-as-a-judge panels, applying our expert-crafted, prompt-specific binary rubrics. Across a dual-track set of frontier and Korean-optimized models, we find a consistent suppression effect in Korean variants and substantial model-to-model variation in how geopolitical grounding interacts with language. In many models, Korean grounding mitigates the Korean language-driven suppression (with no model showing significant amplification in the other direction), indicating that, at least in the English-Korean case, safety behavior is shaped by language-as-risk signals and context interactions that translation-only evaluations miss. The transcreation matrix methodology is designed to generalize to other language-culture pairs.

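2605.14141 2026-05-15 cs.AI Version update

Distribution-Aware Algorithm Design with LLM Agents

Saharsh Koganti, Priyadarsi Mishra, Pierfrancesco Beneventano, Tomer Galanti

Affiliations * Texas A&M University Massachusetts Institute of Technology

AI summary This paper studies learning when the learned object is executable solver code rather than a predictive model, stressing that a solver must be fast as well as correct. Its central abstraction is the "solver hint": reusable structure inferred from samples and compiled into specialized solver code, improving both solving efficiency and quality. Experiments show that solvers synthesized by LLM code agents outperform the baseline heuristics and solvers on a range of combinatorial-optimization problems, running up to hundreds of times faster while maintaining high solution quality and substantially reducing computational cost.

Details
English abstract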
We study learning when the learned object is executable solver code rather than a predictor. In this setting, correctness is not enough: two solvers may both return valid solutions on the deployment distribution while differing substantially in runtime. Given samples from an unknown task distribution, the learner returns code evaluated on fresh instances by both solution quality and execution time. Our central abstraction is a solver hint: reusable structure inferred from samples and compiled into specialized solver code. We prove that the empirically fastest sample-consistent solver from a fixed library generalizes in both correctness and runtime, and that statistically identifiable hints can be recovered and compiled from polynomially many samples. Empirically, we instantiate the framework with LLM code agents on 21 structured combinatorial-optimization target distributions across seven problem classes. The synthesized solvers reach mean normalized quality 0.971, improve by +0.224 over the average heuristic pool and by +0.098 over the highest-quality heuristic, and are 336.9×, 342.8×, and 16.1× faster than the quality-best heuristic, Gurobi, and the selected time-limited exact backend, respectively. On released PACE 2025 Dominating Set private instances, the synthesized solver is valid on all 100 graphs and runs about two orders of magnitude faster than top competition solvers, with a moderate quality gap. Inspection shows that many gains come from changing the computational scale: replacing ambient exponential search or general-purpose optimization with compiled distribution-specific computation.

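2605.14126 2026-05-15 cs.LG cs.AI Version update

Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

Marius S. Knorr, Robert Müller, Jan P. Bremer, Nils Schweingruber

Affiliations * IDM gGmbH, University Medical Center Hamburg-Eppendorf, Hamburg, Germany

AI summary This paper studies how reinforcement learning can improve the multi-step reasoning of healthcare information agents operating under the Fast Healthcare Interoperability Resources (FHIR) standard. The authors model FHIR electronic health records as a queryable structured graph and design a multi-turn, code-acting agent that is post-trained with reinforcement learning to improve question answering over real hospital data. Experiments show that the method substantially improves answer correctness on the FHIR-AgentBench benchmark while enforcing data-integrity constraints.

Details
English abstract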
Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. An LLM judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training, and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.

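2605.14117 2026-05-15 cs.CL cs.AI Version update

Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards

Luis Lara, Aristides Milios, Zhi Hao Luo, Aditya Sharma, Ge Ya Luo, Christopher Beckham, Florian Golemo, Christopher Pal

Affiliations * Mila – Quebec AI Institute Université de Montréal Polytechnique Montréal Canada CIFAR AI Chair

AI summary This study proposes a text-based floor plan generation method built on a large language model (LLM) and optimized with reinforcement learning with verifiable rewards (RLVR), aiming to produce high-quality floor plans that satisfy user-defined connectivity and numerical constraints. By fine-tuning the LLM on real floor plans and optimizing against constraint-adherence metrics, the method outperforms existing approaches on Realism, Compatibility, and Diversity, including at least a 94% relative reduction in the Compatibility metric, demonstrating that LLMs can handle structured design constraints effectively.

Comments Accepted to Findings of ACL 2026

Details
English abstract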
An AI system for professional floor plan design must precisely control room dimensions and areas while respecting the desired connectivity between rooms and maintaining functional and aesthetic quality. Existing generative approaches focus primarily on respecting the requested connectivity between rooms, but do not support generating floor plans that respect numerical constraints. We introduce a text-based floor plan generation approach that fine-tunes a large language model (LLM) on real plans and then applies reinforcement learning with verifiable rewards (RLVR) to improve adherence to topological and numerical constraints while discouraging invalid or overlapping outputs. Furthermore, we design a set of constraint adherence metrics to systematically measure how generated floor plans align with user-defined constraints. Our model generates floor plans that satisfy user-defined connectivity and numerical constraints and outperforms existing methods on Realism, Compatibility, and Diversity metrics. Across all tasks, our approach achieves at least a 94% relative reduction in Compatibility compared with existing methods. Our results demonstrate that LLMs can effectively handle constraints in this setting, suggesting broader applications for text-based generative modeling.
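
To give a feel for what a verifiable reward over numerical constraints might look like, here is a minimal sketch. The abstract does not specify the reward, so everything below is an assumption: rooms are modeled as axis-aligned rectangles `(x, y, width, height)`, overlapping layouts receive zero reward, and otherwise the reward is the fraction of requested room areas met within a tolerance. The function names are hypothetical.

```python
from itertools import combinations

def overlaps(a, b):
    """True if two axis-aligned rectangles (x, y, w, h) share interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def area_reward(rooms, target_areas, tol=0.05):
    """Toy rule-based reward for a generated floor plan.

    rooms:        dict of room name -> (x, y, w, h)
    target_areas: dict of room name -> requested area
    tol:          allowed relative deviation from the requested area
    """
    # Invalid output: any pair of rooms overlapping gives zero reward.
    for a, b in combinations(rooms.values(), 2):
        if overlaps(a, b):
            return 0.0
    if not target_areas:
        return 1.0
    satisfied = 0
    for name, target in target_areas.items():
        if name not in rooms:
            continue  # a missing room counts as an unmet constraint
        _, _, w, h = rooms[name]
        if abs(w * h - target) <= tol * target:
            satisfied += 1
    return satisfied / len(target_areas)
```

A rule-based check of this kind can be computed without a learned judge, which is what makes the reward verifiable; the actual reward in the paper also covers topological (connectivity) constraints, which this sketch leaves out.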

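2605.14111 2026-05-15 cs.AI cs.HC Version update

Modeling Bounded Rationality in Drug Shortage Pharmacists Using Attention-Guided Dynamic Decomposition

Yaniv Eliyahu Amiri, Noah Chicoine, Jacqueline Griffin, Stacy Marsella

Affiliations * Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA Department of Mechanical and Industrial Engineering, Northeastern University, Boston, MA, USA Department of Psychology, Northeastern University, Boston, MA, USA

AI summary This paper studies how hospital pharmacists make decisions during drug shortages under uncertainty, time pressure, and patient risk. It proposes an attention-guided dynamic decomposition framework that splits drugs into a subset for high-cost reasoning and a subset for low-cost monitoring, so that decisions are made in a bounded-rational way. Two agents are built: an Expert Agent whose attention weights come from pharmacist interviews, and a Learner Agent that adapts its attention allocation through experience. Experiments show that the approach supports stable decision-making without complete state reasoning, suggesting that a primary decision is not which action to take but where to allocate cognitive effort.

Comments Accepted at CogSci 2026. 6 pages plus references, 1 figure, 2 tables

Details
English abstract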
Hospital pharmacists make high-stakes decisions to mitigate drug shortages under uncertainty, time pressure, and patient risk. Interviews revealed that pharmacists focus attention on a small subset of drugs, limiting cognitive effort to the most urgent cases. Motivated by these findings, we formalize a bounded-rational, attention-guided decision framework that dynamically decomposes drugs into a subset for high-cost reasoning and a complementary subset for low-cost monitoring. We develop two agents: an Expert Agent that applies attention weights derived from pharmacist interviews, and a Learner Agent that adapts attention allocation over time through experience. Across simulated scenarios spanning short to long horizons, we show that attention-guided planning supports stable decision-making without complete state reasoning. These results suggest that a primary decision is not what action to take, but where to allocate cognitive effort, and that attention-guided, satisficing strategies can reduce problem complexity while maintaining stable performance.

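2605.14108 2026-05-15 cs.CV cs.AI cs.LG Version update

Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening

Nishi Doshi, Shrey Shah

Affiliations * University of Southern California

AI summary To address the shortage of diabetic retinopathy (DR) screening resources in rural areas, this study proposes a cascaded edge-cloud architecture that improves screening efficiency and reduces cloud computation. The architecture has two tiers: Tier 1 runs a lightweight MobileNetV3-small model on a local device for binary triage, deciding whether a case is referable; Tier 2 runs a RETFound-DINOv2 model in the cloud for fine-grained severity grading of the referable images. Experiments on the APTOS dataset show that the method substantially reduces the number of cloud calls while maintaining high screening accuracy.

Details
English abstract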
Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, yet rural regions often lack the specialists and infrastructure needed for early detection. Although cloud-based deep learning systems offer high accuracy, they face significant challenges in these settings due to high latency, limited bandwidth, and high data transmission costs. To address these challenges, we propose a two-tier edge-cloud cascade on the public APTOS 2019 Blindness Detection dataset. Tier 1 runs a lightweight MobileNetV3-small model on a local clinic device to perform a binary triage between Referable DR (Classes 2-4) and Non-referable DR (Classes 0-1). Tier 2 runs a RETFound-DINOv2 model in the cloud for ordinal severity grading, but only on the subset of images flagged as referable by Tier 1. On a stratified APTOS test split of 733 images, Tier 1 reaches 98.99% sensitivity and 84.37% specificity at a validation-tuned high-sensitivity threshold. The default cascade forwards 49.52% of test images to Tier 2, reducing cloud calls by 50.48% relative to using a cloud-based model for all images. In the deployed 4-class output space (Class 0-1 / Class 2 / Class 3 / Class 4), the cascade obtains 80.49% accuracy and 0.8167 quadratic weighted kappa; the cloud-only baseline obtains 80.76% accuracy and 0.8184 quadratic weighted kappa. On APTOS, the cascade cuts cloud use by about half with a modest drop in grading performance. Index Terms: Diabetic Retinopathy, Edge-Cloud Cascade, MobileNetV3-small, RETFound-DINOv2, Retinal Screening, tele-ophthalmology
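
The routing logic of such a two-tier cascade, and the cloud-call arithmetic, can be sketched as below. This is a schematic under assumptions: `edge_score` and `cloud_grade` are hypothetical callables standing in for the Tier 1 and Tier 2 models, and the count of 363 forwarded images is inferred here as the whole number consistent with the reported 49.52% of 733; it is not stated in the abstract.

```python
NON_REFERABLE = "Class 0-1"

def run_cascade(images, edge_score, cloud_grade, threshold):
    """Route each image through the edge model, then the cloud if flagged.

    edge_score(image)  -> referable-DR score from the on-device model
    cloud_grade(image) -> severity label from the cloud model
    Both callables are placeholders for real models.
    """
    labels = []
    cloud_calls = 0
    for image in images:
        if edge_score(image) >= threshold:
            # Flagged as referable: pay for one cloud call to grade severity.
            cloud_calls += 1
            labels.append(cloud_grade(image))
        else:
            # Stays on the edge device; reported as the merged low-risk class.
            labels.append(NON_REFERABLE)
    return labels, cloud_calls

def cloud_call_reduction(cloud_calls, total):
    """Fraction of cloud calls saved relative to grading every image in the cloud."""
    return 1.0 - cloud_calls / total

# Inferred count: 363 of 733 images forwarded to Tier 2 (about 49.52%).
REDUCTION_PERCENT = round(100 * cloud_call_reduction(363, 733), 2)
```

Lowering the threshold raises Tier 1 sensitivity at the cost of more cloud calls, so the threshold is the main knob trading screening safety against bandwidth.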

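2605.14089 2026-05-15 cs.AI Version update

SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration

Mingda Zhang, Tiesunlong Shen, Haoran Luo, Wenjin Liu, Zikai Xiao, Erik Cambria, Xiaoying Tang

Affiliations * The Chinese University of Hong Kong, Shenzhen National University of Singapore Nanyang Technological University Zhejiang University

AI summary SkillFlow is a flow-based framework that targets key challenges in agentic orchestration: strategy collapse, opaque credit assignment, and unguided skill evolution. A trainable Supervisor interacts over multiple turns with a structured environment; a Tempered Trajectory Balance loss preserves diverse strategies and gives transparent credit assignment; and a recursive skill evolution mechanism autonomously decides which skills to create, prune, or refine. Experiments show that SkillFlow significantly outperforms existing methods on a range of tasks.

Comments 49 pages, 5 figures, 6 tables

Details
English abstract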
In recent years, a variety of powerful LLM-based agentic systems have been applied to automate complex tasks through task orchestration. However, existing orchestration methods still face key challenges, including strategy collapse under reward maximization, high gradient variance with opaque credit assignment, and unguided skill evolution whose decisions are typically made by directly prompting an LLM to judge rather than derived from principled training signals. To address these challenges, we propose SkillFlow, a flow-based framework that takes a trainable Supervisor as the agent and a structured environment with a dynamic skill library and a frozen executor, automating task orchestration through multi-turn interaction. SkillFlow employs Tempered Trajectory Balance (TTB), a regression-based flow-matching loss that samples trajectories proportional to reward, preserving diverse orchestration strategies rather than collapsing to a single mode. The same flow objective yields a jointly learned backward policy that provides transparent per-step credit assignment at zero additional inference cost. Building on these flow diagnostics, a recursive skill evolution mechanism determines when to evolve, what skills to create or prune, and where decision gaps lie, closing the loop from training signal to autonomous capability growth. Experimental results on 14 datasets show that SkillFlow significantly outperforms baselines across question answering, mathematical reasoning, code generation, and real-world interactive decision making tasks. Our code is available at https://anonymous.4open.science/r/SkillFlow-E850.

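2605.14073 2026-05-15 cs.LG cs.AI Version update

AttnGen: Attention-Guided Saliency Learning for Interpretable Genomic Sequence Classification

Rayhaneh Shabani Nia, Ali Karkehabadi

Affiliations * University of California, Davis

AI summary This paper proposes AttnGen, an attention-guided training framework intended to improve the interpretability of genomic sequence classification models. The method uses an attention mechanism to compute nucleotide-level importance scores and progressively suppresses low-contribution positions during training, so that the model focuses on informative regions and relies less on noisy sequence elements. Experiments show that AttnGen achieves better classification performance than a conventional convolutional neural network on a standard benchmark, and a perturbation analysis supports the validity of its importance scores, showing that the model depends heavily on a small set of key positions.

Comments Accepted at IEEE CCGE 2026

Details
English abstract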
Deep neural networks have achieved strong performance in genomic sequence classification; however, relating their predictions to biologically meaningful sequence patterns remains challenging. In this work, we present AttnGen, an attention-guided training framework that embeds interpretability directly into the optimization process. AttnGen computes nucleotide-level importance scores using an attention mechanism and progressively suppresses low-contribution positions during training. This encourages the model to focus its predictions on a compact set of informative regions while reducing reliance on noisy sequence elements. We evaluate AttnGen on the standardized demo_human_or_worm benchmark, a binary classification task over 200-nucleotide sequences. With moderate masking, AttnGen achieves a validation accuracy of 96.73%, outperforming a conventional CNN baseline with 95.83% accuracy, while also exhibiting faster convergence and improved training stability. To assess whether the learned importance scores reflect functionally relevant signal, we conduct perturbation-based analysis by removing high-saliency nucleotides. This causes accuracy to drop from 96.9% to near chance level on a 3,000-sequence evaluation set, indicating that the model relies on a relatively small subset of informative positions. Our analysis shows that masking 10–20% of positions provides the most favorable trade-off between predictive performance and interpretability. These results suggest that attention-guided masking not only improves classification performance but also reshapes how models distribute importance across sequence positions. Although this study focuses on short genomic sequences, the proposed approach may extend to more complex interpretable sequence modeling settings.

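2605.14066 2026-05-15 eess.AS cs.AI cs.CL cs.SD Version update

A Benchmark for Early-stage Parkinson's Disease Detection from Speech

Terry Yi Zhong, Cristian Tejedor-Garcia, Khiet P. Truong, Janna Maas, Louis ten Bosch, Bastiaan R. Bloem

Affiliations * Centre for Language Studies, Radboud University, Nijmegen, the Netherlands Center of Expertise for Parkinson and Movement Disorders, Radboud University Medical Center, Nijmegen, the Netherlands

AI summary This study proposes the first benchmark for speech-based detection of early-stage Parkinson's disease, addressing the difficulty of comparing prior results that differ in datasets, languages, tasks, and evaluation protocols. The benchmark uses a speaker-independent split, supports fair and reproducible cross-method evaluation on researcher-accessible datasets, covers three common speech tasks, and tests methods under different training-resource settings. The study also provides multi-dimensional evaluation breakdowns to support fine-grained comparison and clinical adoption, offering a reusable reference for robust and clinically meaningful early-stage detection.

Comments Submitted to Interspeech 2026

Details
English abstract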
Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.

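2605.14062 2026-05-15 cs.AI cs.CL Version update

Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

Anjir Ahmed Chowdhury, Syed Zawad, Feng Yan

Affiliations * Department of Computer Science, University of Houston IBM Research

AI summary This paper proposes MSIFR, a lightweight framework that detects and terminates low-quality generation trajectories while they are still in progress, reducing wasted computation in synthetic data generation with large language models. The method generates in stages and applies fast rule-based validators to catch arithmetic errors, hallucinations, and formatting problems early, so that faulty samples are rejected before completion. Experiments show that MSIFR substantially reduces token consumption during generation, without additional training or architectural changes, while preserving or improving the quality of the generated data.

Comments 17 pages, 4 figures, 7 tables

Details
English abstract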
While synthetic data generation with large language models (LLMs) is widely used in post-training pipelines, existing approaches typically generate full outputs before applying quality filters, leading to substantial token waste on samples that are ultimately discarded. To address this, we propose Multi-Stage In-Flight Rejection (MSIFR), a lightweight, training-free framework that detects and terminates low-quality generation trajectories at intermediate checkpoints before they reach full completion. MSIFR decomposes the generation process into sequential stages and applies fast rule-based validators to identify arithmetic inconsistencies, hallucination patterns, and formatting violations, enabling early rejection of faulty samples. We formalize in-flight rejection as a sequential decision process and show that any non-trivial discard policy reduces expected token consumption, with stage-wise savings increasing when rejection occurs earlier in the generation pipeline. We further demonstrate that conditional utility estimates form a martingale, ensuring that early, in-flight rejection does not bias the expected utility of retained samples. Across five instruction-tuned models and seven reasoning benchmarks, MSIFR reduces token consumption by 11%-77% as a standalone method, and up to 78.2% when combined with early-exit methods, while preserving or improving evaluation accuracy. These results confirm that MSIFR provides a practical mechanism for improving the efficiency of LLM-based synthetic data generation without additional training or architectural changes.
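
The staged check-and-stop loop can be illustrated with a small simulation. This is a sketch under assumptions, not the MSIFR implementation: generation is simulated by pre-written `(text, token_count)` chunks instead of a model, and the single validator is a hypothetical rule that checks simple `a + b = c` statements.

```python
import re

def arithmetic_ok(text):
    """Rule-based validator: every 'a + b = c' statement must be correct."""
    for a, b, c in re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", text):
        if int(a) + int(b) != int(c):
            return False
    return True

def run_staged(chunks, validators):
    """Consume generation stage by stage, stopping at the first failed check.

    chunks:     list of (text, token_count) pairs, one per stage; a stand-in
                for calling a model once per stage
    validators: callables taking the text so far and returning True if valid
    Returns (final_text or None if rejected, tokens spent).
    """
    text, tokens_used = "", 0
    for chunk_text, chunk_tokens in chunks:
        text += chunk_text
        tokens_used += chunk_tokens
        if not all(check(text) for check in validators):
            # In-flight rejection: later stages are never generated.
            return None, tokens_used
    return text, tokens_used

# Simulated trajectories (invented examples).
BAD = [("Step 1: 2 + 2 = 5. ", 10), ("Step 2: so the total is 9. ", 10), ("Answer: 9", 5)]
GOOD = [("Step 1: 2 + 2 = 4. ", 10), ("Answer: 4", 5)]
```

In this toy run the faulty trajectory is dropped after its first stage, spending 10 of its 25 tokens, whereas filtering after full generation would have spent all 25; the earlier the failing stage, the larger the saving.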

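2605.14061 2026-05-15 cs.AI cs.LG Version update

MathAtlas: A Benchmark for Autoformalization in the Wild

Nilay Patel, Noah Arias, Davit Babayan, Victoria Cochran, Timothy Libman, Hafsah Mahmood, Liam McCarty, Soli Munoz, Laurel Willey, Jeffrey Flanigan

Affiliations * University of California, Santa Cruz

AI summary Current autoformalization benchmarks focus mainly on competition or undergraduate mathematics, while graduate and research-level mathematics still lacks such resources. This paper introduces MathAtlas, the first large-scale autoformalization benchmark for graduate-level mathematics, containing about 52,000 theorems, definitions, exercises, examples, and proofs extracted from 103 textbooks, together with a mathematical dependency graph of about 178,000 relations. Experiments show that the benchmark is high quality but extremely challenging: state-of-the-art models score low on formalizing both theorems and definitions, and performance degrades markedly as dependency depth increases.

Comments In submission at NeurIPS 2026

Details
English abstract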
Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored. In this paper, we introduce MathAtlas, the first large-scale autoformalization benchmark of in-the-wild graduate-level mathematics, containing ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks. MathAtlas is enriched with a mathematical dependency graph containing ~178k relations, and is the first autoformalization benchmark to include such relations, facilitating evaluation and development of dependency-aware autoformalization systems. Our extensive experiments show that MathAtlas is high quality but extremely challenging: strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions. Furthermore, we find that the performance of state-of-the-art models degrades substantially with dependency depth: on MA-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2.6% autoformalization correctness. We release MathAtlas to the community as a benchmark set for large-scale autoformalization of graduate-level mathematics in the wild.

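2605.14055 2026-05-15 cs.CL cs.AI Updated version

PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

Anjir Ahmed Chowdhury, Syed Zawad, Xiaolong Ma, Xu Dong, Feng Yan

Affiliations * IBM Research; Argonne National Laboratory

AI summary This paper proposes PEML, a parameter-efficient multi-task learning method that aims to improve the fine-tuning efficiency of large language models in multi-task settings by jointly optimizing continuous prompts and model weights. Compared with existing methods such as LoRA and Prefix Tuning, PEML uses neural architecture engineering to optimize the prompt structure and low-rank adaptation to adjust model parameters, so that it adapts more fully to multi-task requirements. Experiments on several benchmarks show an average accuracy improvement of up to 6.67%, with peak gains of up to 10.75% on individual tasks.

Comments 26 pages, 8 figures, 18 tables

Details
English abstract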

Parameter-Efficient Fine-Tuning (PEFT) is widely used for adapting Large Language Models (LLMs) to various tasks. Recently, there has been increasing demand for fine-tuning a single LLM for multiple tasks, because doing so requires less fine-tuning data overall thanks to the common features shared among tasks. More importantly, LLMs are resource-demanding, and deploying a single model for multiple tasks facilitates resource consolidation and consumes significantly fewer resources than deploying an individual large model for each task. Existing PEFT methods like LoRA and Prefix Tuning are designed to adapt LLMs to a specific task. LoRA and its variants focus on aligning the model itself to tasks, overlooking the importance of prompt tuning in multi-task learning, while Prefix Tuning adopts only a simple architecture to optimize prompts, which limits its adaptation capability in multi-task settings. To enable efficient fine-tuning for multi-task learning, it is important to co-optimize prompts and model adaptation. In this work, we propose Parameter-Efficient Multi-task Learning (PEML), which employs a neural architecture engineering method for optimizing the continuous prompts while also performing low-rank adaptation of model weights. We prototype PEML by creating an automated framework for optimizing the continuous prompts and adapting model weights. We evaluate PEML against state-of-the-art multi-task learning methods MTL-LoRA, MultiLoRa, C-Poly, and MoE on the GLUE, SuperGLUE, Massive Multitask Language Understanding, and commonsense reasoning benchmarks. The evaluation results show an average accuracy improvement of up to 6.67%, with individual tasks showing peak gains of up to 10.75%.

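2605.14053 2026-05-15 cs.CL cs.AI Updated version

Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation

Ignacio Sastre, Guillermo Moncecchi, Aiala Rosá

Affiliations * Instituto de Computación, Facultad de Ingeniería, Universidad de la República

AI summary This work addresses hallucination and erroneous reasoning in question answering with large language models and proposes Derivation Prompting, a prompting method based on logic derivations that improves the generation step of the Retrieval-Augmented Generation (RAG) framework. The method systematically derives conclusions from initial hypotheses using predefined rules and builds an interpretable derivation tree, adding control over the generation process. In a specific case study, it significantly reduced unacceptable answers compared with traditional RAG and long-context methods.

Journal ref Advances in Artificial Intelligence, IBERAMIA 2024, LNCS 15277, pp. 412-423, Springer (2025)

Details
English abstract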

The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge-intensive, domain-specific tasks. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval-Augmented Generation framework. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules. It constructs a derivation tree that is interpretable and adds control over the generation process. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long-context window methods.

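2605.14051 2026-05-15 cs.AI Updated version

SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

Yusuke Ozaki, Dhaval Patel

Affiliations * University at Albany; IBM; Kwansei Gakuin University

AI summary This paper proposes SPIN, a planning wrapper that addresses the structurally invalid or redundant plans that large language models (LLMs) often produce in the planning stage of industrial tasks. SPIN combines validated directed acyclic graph (DAG) planning with prefix-based execution control: it produces executable plans through a strict DAG contract and repair prompting, then evaluates DAG prefixes incrementally so that execution can stop early. Experiments on several benchmarks show that SPIN reduces the number of executed tasks and tool calls while improving task accomplishment and related metrics.

Comments 31 pages, 10 figures

Details
English abstract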

Industrial LLM agent systems often separate planning from execution, yet LLM planners frequently produce structurally invalid or unnecessarily long workflows, leading to brittle failures and avoidable tool and API cost. We propose SPIN, a planning wrapper that combines validated Directed Acyclic Graph (DAG) planning with prefix-based execution control. SPIN enforces a strict DAG contract through _validate_plan_text and repair prompting, producing executable plans before downstream execution, and then evaluates DAG prefixes incrementally to stop when the current prefix is sufficient to answer the query. On AssetOpsBench, across 261 scenarios, SPIN reduces executed tasks from 1061 to 623 and improves Accomplished from 0.638 to 0.706, while reducing tool calls from 11.81 to 6.82 per run. On MCP Bench, the same wrapper improves planning, grounding, and dependency-related scores for both GPT OSS1 and Llama 4 Maverick.
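The two mechanisms described above, DAG validation and prefix-based early stopping, can be sketched in a few lines. This is a minimal illustration under assumptions, not SPIN's implementation: `validate_plan`, `run_prefix`, the plan format, and the sufficiency check are all hypothetical, and the repair-prompting step is omitted.

```python
from graphlib import CycleError, TopologicalSorter


def validate_plan(plan):
    """Return a topological order of task ids, or None if the plan is not a valid DAG.

    plan maps each task id to the list of task ids it depends on (assumed format).
    """
    known = set(plan)
    if any(dep not in known for deps in plan.values() for dep in deps):
        return None  # a task depends on something that is not in the plan
    try:
        return list(TopologicalSorter(plan).static_order())
    except CycleError:
        return None  # cyclic plans violate the DAG contract


def run_prefix(order, run_task, is_sufficient):
    """Execute tasks in order and stop once the executed prefix is sufficient."""
    results = {}
    for task in order:
        results[task] = run_task(task)
        if is_sufficient(results):
            break
    return results


plan = {"fetch": [], "analyze": ["fetch"], "report": ["analyze"]}
order = validate_plan(plan)
results = run_prefix(order, lambda t: t.upper(), lambda r: "analyze" in r)
cyclic = validate_plan({"a": ["b"], "b": ["a"]})  # None: the plan contains a cycle
```

In this toy run the third task is never executed, because the sufficiency check already passes after the second.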

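2605.14049 2026-05-15 cs.AI cs.CL cs.CY Updated version

Bridging Legal Interpretation and Formal Logic: Faithfulness, Assumption, and the Future of AI Legal Reasoning

Olivia Peiyu Wang, Leilani H. Gilpin

Affiliations * University of California, Santa Cruz

AI summary As large language models are adopted more widely in legal practice, they bring both promise and risk. This paper examines systematic assumption-based reasoning in current legal AI, where models often draw conclusions from assumptions that go beyond the source text and therefore lack logical rigor. It proposes a neuro-symbolic approach that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning more reliable and trustworthy and to reduce the burden of manual verification while meeting the accountability that legal practice demands.

Comments 2-page abstract accepted by the Bloomberg LSLLAI 2026 Symposium

Details
English abstract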

The growing adoption of large language models in legal practice brings both significant promise and serious risk. Legal professionals stand to benefit from AI that can reason over contracts, draft documents, and analyze sources at scale, yet the high-stakes nature of legal work demands a level of rigor that current AI systems do not provide. The central problem is not simply that LLMs hallucinate facts and references; it is that they systematically draw inferences that go beyond what the source text actually supports, presenting assumption-laden conclusions as if they were logically grounded. This proposal presents a neuro-symbolic approach to legal AI that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning both capable and trustworthy, thus reducing the burden of manual verification without sacrificing the accountability that legal practice demands.

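2605.14036 2026-05-15 cs.AI cs.CC cs.CL cs.LG Updated version

Enhanced and Efficient Reasoning in Large Learning Models

Leslie G. Valiant

Affiliations * John A. Paulson School of Engineering and Applied Sciences

AI summary This paper proposes an efficient and principled reasoning method intended to make the content generated by large language models more trustworthy. A preprocessing stage recodes the data into a Unary Relational Integracode that describes the relationships among objects more explicitly; a standard machine learning process follows, so more reliable reasoning is obtained while much of the existing software and hardware base is retained. The method applies beyond natural language to vision and actions and, building on Robust Logic, supports sound reasoning both within a single call of the learned classifier and across multiple calls.

Details
English abstract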

In current Large Language Models we can trust the production of smoothly flowing prose on the basis of the principles of machine learning. However, there is no comparably principled basis to justify trust in the content of the text produced. It appears to be conventional wisdom that addressing this issue by adding more principled reasoning is not computationally affordable. Here we propose a principled method of reasoning that is efficient enough to be practical for large language models. Further, the method allows the retention of much of the currently used software and hardware base. Our method for improving the functioning of large language models consists of a first stage of preprocessing that recodes the data to a Unary Relational Integracode that is more explicit about the relationships among the objects described in the text, followed as a second stage by a standard but possibly streamlined machine learning process that then also learns to predict these relationships. The method may be viewed as realizing a world model and applying beyond natural language, to vision and actions, for example, where the multiple properties of an object referred to in an input are brought together explicitly, rather than remaining distributed in the various references to it in the input. We articulate its advantages in terms of Robust Logic, a system for performing principled chaining on learned, and hence uncertain, information. We show that this recoding has the surprising and fortuitous property that, while succinct, it makes the task of learning a core subset of relational rules that hold in the world described in the training data polynomial time learnable in a defined sense, the polynomial depending on the complexity of the rule. This gives support for sound reasoning within each single call of the learned classifier as well as between multiple calls.

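2605.14034 2026-05-15 cs.AI cs.CL cs.CY Updated version

From Descriptive to Prescriptive: Uncover the Social Value Alignment of LLM-based Agents

Jinxian Qu, Qingqing Gu, Teng Chen, Luo Ji

Affiliations * Geely AI Lab

AI summary This paper studies how LLM-based agents can be better aligned with human social values. It proposes a value-based framework that uses GraphRAG to convert ethical principles into value-based instructions, which steer the agent toward the expected behavior in a given conversation context. Expected behaviors are defined using Maslow's Hierarchy of Needs and Plutchik's Wheel of Emotion. Experiments on the DAILYDILEMMAS benchmark show significant gains over prompt-based baselines, and the authors present the method as a basis for the emergence of self-emotion in AI systems.

Comments Accepted by CogSci 2026

Details
English abstract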

The wide application of LLM-based agents requires strong alignment with human social values. However, current work still exhibits deficiencies in self-cognition, dilemma decision-making, and self-emotion. To remedy this, we propose a novel value-based framework that employs GraphRAG to convert principles into value-based instructions and steers the agent to behave as expected by retrieving the suitable instruction for a specific conversation context. To evaluate the ratio of expected behaviors, we define the expected behaviors from two well-known theories, Maslow's Hierarchy of Needs and Plutchik's Wheel of Emotion. On the DAILYDILEMMAS benchmark, our method exhibits significant performance gains compared to prompt-based baselines, including ECoT, Plan-and-Solve, and Metacognitive prompting. Our method provides a basis for the emergence of self-emotion in AI systems.

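2605.14033 2026-05-15 cs.AI cs.LG Updated version

Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents

David N. Olivieri, Roque J. Hernández

Affiliations * Universidade de Vigo, Department of Computer Science (LSI), Spain

AI summary This paper studies how an AI agent facing a scientific theory shift can detect whether its existing representational framework still applies in a new regime or must be extended. The authors propose a finite sheaf-theoretic framework that identifies theory-shift candidates through transport and obstruction, measuring incoherence through residual fit, overlap incompatibility, constraint violation, and related terms. The method is evaluated on a controlled benchmark, where it separates deformation of a theory from extension, and it gives AI agents a finite diagnostic for deciding when representational transport fails and extension is needed.

Details
English abstract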

Scientific theory shift in AI agents requires more than fitting equations to data. An artificial scientific agent must detect whether an existing representational framework remains transportable into a new regime, or whether its language has become locally-to-globally obstructed and must be extended. This paper develops a finite sheaf-theoretic framework for detecting theory-shift candidates through transport and obstruction. Contexts are organized as a local-to-global structure in which source, overlap, target, and validation charts are fitted, restricted, and tested for gluing. Obstruction measures failure of coherence through residual fit, overlap incompatibility, constraint violation, limiting-relation failure, and representational cost. We evaluate the framework on a controlled transition-card benchmark designed to separate deformation within a source language from extension of that language. The main result is direct obstruction ranking: the intended deformation or extension is usually the lowest-obstruction candidate, and transition type is separated in the benchmark. A constellation kernel over the same signatures is included only as a secondary representational-similarity probe. The aim is not to reconstruct historical paradigm shifts or solve open-ended autonomous theory invention, but to isolate a finite diagnostic subproblem for AI agents: detecting when representational transport fails and extension becomes the coherent next move.

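2605.14026 2026-05-15 cs.LG cs.AI Updated version

R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning

Sanghyeob Song, Donghyeok Lee, Jinsik Kim, Sungroh Yoon

Affiliations * Interdisciplinary Program in Artificial Intelligence, Seoul National University; Department of Electrical and Computer Engineering, Seoul National University

AI summary In data-scarce reinforcement learning settings such as real-world robotics, intensive data reuse improves efficiency but tends to cause overfitting. To address representation-level instability in Self-Predictive Learning (SPL) at high update-to-data (UTD) ratios, this paper proposes R2R2, which obtains robust representations through redundancy reduction. A theoretical analysis shows that standard zero-centering conflicts with the spectral properties of SPL, which motivates a non-centered regularization objective. Experiments show that R2R2 mitigates overfitting and improves performance across continuous control tasks.

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026). This is the camera-ready version

Details
English abstract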

For reinforcement learning in data-scarce domains like real-world robotics, intensive data reuse enhances efficiency but induces overfitting. While prior works focus on critic bias, representation-level instability in Self-Predictive Learning (SPL) under high Update-to-Data (UTD) regimes remains underexplored. To bridge this gap, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization method within SPL. We theoretically identify that standard zero-centering conflicts with SPL's spectral properties and design a non-centered objective accordingly. We verify R2R2 on SPL-native algorithms like TD7. Furthermore, to demonstrate its orthogonality to prior advancements, we extend the state-of-the-art SimbaV2, which originally lacks SPL, by integrating a tailored SPL module, termed SimbaV2-SPL. Experiments across 11 continuous control tasks confirm that R2R2 effectively mitigates overfitting; specifically, at a UTD ratio of 20, it improves TD7 by ~22% and provides additional gains on top of SimbaV2-SPL, which itself establishes a new state-of-the-art. The code can be found at: https://github.com/songsang7/R2R2

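2605.14025 2026-05-15 q-bio.NC cs.AI Updated version

Do Language Models Align with Brains? Prediction Scores Are Not Enough

Xiao Jia

Affiliations * School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China

AI summary This paper asks whether language models align with brains in language processing and whether prediction scores alone are enough to show that a model captures brain-relevant language computation. Using the L-PACT framework, it evaluates predictive, relational, mechanism-stripping, and reliability-bounded evidence, and finds that the tested language models fail the control comparisons on every key criterion, so their alignment with the brain is not adequately supported. The study calls for more cautious interpretation of model-brain relationships so that superficially positive results are not mistaken for structural alignment.

Comments 39 pages, 4 main figures, 6 supplementary figures

Details
English abstract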

Brain-language model comparisons often interpret neural prediction scores as evidence that model representations capture brain-relevant language computation. We asked whether language models align with brains, and whether prediction scores are enough to support that claim, using L-PACT, a source-audited framework that evaluates predictive, relational, mechanism-stripping, and reliability-bounded evidence. Across primary naturalistic language neural datasets and derived language-model representations, L-PACT compared real model features with nuisance baselines and severe controls, tested whether model-to-brain profiles reproduced brain-to-brain patterns, recomputed held-out scores after mechanism stripping, and normalized evidence against brain-brain ceilings. The locked analysis set contains 414 predictive-control rows, 2304 relational profile rows, 4320 mechanism-stripping rows, 420 brain-brain ceiling rows, and 146 integrated decision rows. Assay-sensitivity checks showed that brain-brain reliability, brain-as-model run-to-run relational profiles, independent low-level neural and WAV-derived acoustic-envelope gates, and a deterministic implanted-signal simulation can produce positive evidence when expected. Nevertheless, no real model row passed the predictive, relational, mechanism-stripping, or operational Turing-bounded reliability gates; all 146 integrated rows were control-explained. Less stringent single-criterion rules would have counted raw positive predictive, relational, stripping-delta, and ceiling-normalized effects, but L-PACT downgraded them because controls explained the apparent evidence. In the analyzed derived artifact set, the tested language-model representations do not satisfy L-PACT alignment gates; apparent positives are converted into an auditable control-explained taxonomy rather than treated as structural alignment.

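2605.14021 2026-05-15 cs.CY cs.AI Updated version

Measuring Google AI Overviews: Activation, Source Quality, Claim Fidelity, and Publisher Impact

Haofei Xu, Umar Iqbal, Jacob M. Montgomery

Affiliations * Washington University in St. Louis

AI summary This study presents a large-scale longitudinal measurement of Google AI Overviews (AIOs), analyzing activation rate, quality of cited sources, claim fidelity, and impact on publishers. AIO activation reaches 64.7% for question-form queries but is markedly lower for politically sensitive topics. Cited sources are more credible than conventional search results, yet some do not appear in those results at all, indicating a selection mechanism distinct from Google's ranking algorithm. About 11% of the claims in AIO answers are unsupported by the cited pages, and more than half of the cited pages carry advertising, so publishers may lose revenue. The study documents the far-reaching effect of generative AI on the online information ecosystem.

Comments Under review

Details
English abstract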

Google AI Overviews (AIOs) are arguably the most widely encountered deployment of generative AI, reaching over 2 billion users who may not realize the answers they see are AI-generated. Where search engines have traditionally surfaced ranked sources and left users to evaluate them, AIOs synthesize and deliver a single answer - giving Google unprecedented editorial control over what users read and know. We present a large-scale longitudinal measurement study, issuing 55,393 trending queries across 19 topical categories over a 40-day window (March 13 - April 21, 2026). We report four main findings. First, overall AIO activation is 13.7%, rising to 64.7% for question-form queries, while politically sensitive topics see markedly lower rates. Second, AIO-cited domains are more credible than co-displayed first-page results, yet nearly 30% do not appear in those results at all, indicating a source selection mechanism distinct from Google's ranking algorithm. Third, decomposing responses into 98,020 atomic claims, 11.0% are unsupported by the cited pages - with omission the dominant failure mode - and source quality and claim fidelity are largely independent. Fourth, well over half of AIO-cited pages carry display advertising, meaning publishers lose revenue when AIOs suppress the click-through, even as Google's own sponsored ads continue to appear on the same page. Together, these findings document a rapid transformation of the online information ecosystem whose consequences for epistemic security remain poorly understood.

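2605.14004 2026-05-15 cs.AI Updated version

Conditional Attribute Estimation with Autoregressive Sequence Models

Erica Stutz, Giacomo Marino, Daniella Meeker, Qiao Liu, Andrew J. Loza

Affiliations * Department of Biomedical Informatics and Data Science; Yale University; Department of Biostatistics; Department of Pediatrics

AI summary This paper proposes Conditional Attribute Transformers, a method that jointly estimates the next-token probability and the value of an attribute conditional on each candidate next token. In a single forward pass and without modifying the input sequence, it enables per-token credit assignment, counterfactual analysis, and steerable generation. The method performs strongly on sparse-reward tasks, improves next-token prediction at sufficient model sizes, estimates attribute probabilities far faster than sampling, and can guide generation on a range of language tasks.

Details
English abstract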

Generative models are often trained with a next-token prediction objective, yet many downstream applications require the ability to estimate or control sequence-level properties. Next-token prediction can lead to overfitting of local patterns during training, underfitting of global structure, and requires significant downstream modifications or expensive sampling to guide or predict the global attributes of generated samples at inference time. Here, we introduce Conditional Attribute Transformers, a novel method for jointly estimating the next-token probability and the value of an attribute conditional on each potential next token selection. This framework enables three critical capabilities within a single forward pass, without modification of the input sequence: (1) per-token credit assignment across an entire sequence, by identifying how each token in a sequence is associated with an attribute's value; (2) counterfactual analysis, by quantifying attribute differences conditional on alternative next token choices; (3) steerable generation, by decoding sequences based on a combination of next-token and attribute likelihoods. Our approach achieves state of the art performance on sparse reward tasks, improves next-token prediction at sufficient model sizes, estimates attribute probabilities orders of magnitude faster than sampling, and can guide decoding of autoregressive sequence models on a range of language tasks.
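Steerable generation as described above combines next-token and attribute likelihoods when choosing a token. The sketch below shows one plausible way to combine them, a weighted sum of log-probabilities; the scoring rule, the `weight` parameter, and the toy probabilities are assumptions for illustration and are not taken from the paper.

```python
import math


def steer_next_token(token_probs, attr_probs, weight=1.0):
    """Pick the token maximizing log p(token) + weight * log p(attribute | token).

    token_probs: next-token probabilities for each candidate token.
    attr_probs:  probability that the attribute holds if that token is chosen.
    Both would come from one forward pass of the model; here they are toy values.
    """
    def score(tok):
        return math.log(token_probs[tok]) + weight * math.log(attr_probs[tok])

    return max(token_probs, key=score)


token_probs = {"bad": 0.5, "good": 0.4, "fine": 0.1}
attr_probs = {"bad": 0.1, "good": 0.9, "fine": 0.8}  # hypothetical p(positive | token)

unsteered = steer_next_token(token_probs, attr_probs, weight=0.0)  # "bad"
steered = steer_next_token(token_probs, attr_probs, weight=1.0)    # "good"

# Counterfactual analysis: attribute difference between two alternative next tokens.
delta = attr_probs["good"] - attr_probs["bad"]
```

Setting `weight` to zero recovers ordinary greedy decoding, so the same routine covers both the unsteered and steered cases.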

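2605.14002 2026-05-15 cs.AI Updated version

PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts

Yifei Zhu

Affiliations * The University of Hong Kong

AI summary This paper presents PolitNuggets, a multilingual benchmark for evaluating how well agents discover and synthesize long-tail political facts in open environments. The benchmark constructs biographies of 400 global political elites covering more than 10,000 political facts, and standardizes evaluation with an optimized multi-agent system and the FactNet protocol, which scores discovery, accuracy, and efficiency. The study finds that current models struggle with fine-grained details and differ substantially in efficiency, and it relates agent performance to underlying model capabilities, highlighting short-context extraction, multilingual robustness, and reliable tool use.

Comments 24 pages, 7 figures, accepted at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Details
English abstract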

Large Reasoning Models (LRMs) embedded in agentic frameworks have transformed information retrieval from static, long-context question answering into open-ended exploration. Yet real-world use requires models to discover and synthesize "long-tail" facts from dispersed sources, a capability that remains under-evaluated. We introduce PolitNuggets, a multilingual benchmark for agentic information synthesis that constructs political biographies for 400 global elites, covering over 10,000 political facts. We standardize evaluation with an optimized multi-agent system and propose FactNet, an evidence-conditional protocol that scores discovery, fine-grained accuracy, and efficiency. Across models and settings, we find that current systems often struggle with fine-grained details and vary substantially in efficiency. Finally, using benchmark diagnostics, we relate agent performance to underlying model capabilities, highlighting the importance of short-context extraction, multilingual robustness, and reliable tool use.

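2605.13997 2026-05-15 cs.LG cs.AI cs.CL Updated version

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

Tao Zhong, Dongzhe Zheng, Christine Allen-Blanchette

Affiliations * Princeton University

AI summary This paper studies learning-free compression of sparse Mixture-of-Experts (MoE) layers and identifies a structural blind spot in existing expert-merging methods: three pairwise-compatible experts can form a cycle that cannot be merged jointly. The authors characterize this obstruction as the harmonic kernel of a simplicial Laplacian and propose HodgeCover, which selects experts by covering critical edges and triangles and can be combined with weight pruning for further compression. Under aggressive expert reduction, HodgeCover matches state-of-the-art learning-free baselines on expert reduction and leads under aggressive hybrid compression.

Comments 34 pages, 8 figures

Details
English abstract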

Sparse Mixture-of-Experts (MoE) layers route tokens through a handful of experts, and learning-free compression of these layers reduces inference cost without retraining. A subtle obstruction blocks every existing compressor in this family: three experts can each be pairwise compatible yet form an irreducible cycle when merged together, so any score that ranks experts on pairwise signals is structurally blind to which triples are jointly mergeable. We show the obstruction is a precise mathematical object, the harmonic kernel of the simplicial Laplacian on a 2-complex whose vertices are experts, whose edges carry KL merge barriers, and whose faces carry triplet barriers; Hodge-decomposing the edge-barrier signal isolates the kernel exactly. We turn the diagnostic into a selection objective: HodgeCover greedily covers the harmonic-critical edges and triplet-critical triangles, and a hybrid variant of HodgeCover pairs it with off-the-shelf weight pruning on survivors. On three open-weight Sparse MoE backbones under aggressive expert reduction, HodgeCover matches state-of-the-art learning-free baselines on the expert-reduction axis, leads on the aggressive-compression frontier of the hybrid axis, and uniquely balances retained mass across all four Hodge components. These results show that exposing the harmonic kernel of a learned MoE structure changes which compressor wins at the regime that matters most.
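The harmonic kernel mentioned above can be made concrete with a small Hodge decomposition. The sketch below is a generic textbook construction in NumPy, not the HodgeCover code: the function name, the three-expert complex, and the edge signal are hypothetical, and the edge values stand in for pairwise merge barriers.

```python
import numpy as np


def hodge_decompose(edges, triangles, n_vertices, f):
    """Split an edge signal f into gradient, curl, and harmonic parts.

    edges: list of (i, j) with i < j; triangles: list of (i, j, k) with i < j < k.
    """
    edge_index = {e: k for k, e in enumerate(edges)}
    B1 = np.zeros((n_vertices, len(edges)))        # vertex-edge boundary matrix
    for k, (i, j) in enumerate(edges):
        B1[i, k], B1[j, k] = -1.0, 1.0
    B2 = np.zeros((len(edges), len(triangles)))    # edge-triangle boundary matrix
    for t, (i, j, k) in enumerate(triangles):
        B2[edge_index[(i, j)], t] = 1.0
        B2[edge_index[(j, k)], t] = 1.0
        B2[edge_index[(i, k)], t] = -1.0
    f = np.asarray(f, dtype=float)
    # Orthogonal projections onto im(B1^T) and im(B2); the remainder is harmonic.
    grad = B1.T @ np.linalg.lstsq(B1.T, f, rcond=None)[0]
    curl = B2 @ np.linalg.lstsq(B2, f, rcond=None)[0] if triangles else np.zeros_like(f)
    return grad, curl, f - grad - curl


edges = [(0, 1), (1, 2), (0, 2)]
cycle = [1.0, 1.0, -1.0]  # a signal circulating 0 -> 1 -> 2 -> 0

# Hollow triangle: no face fills the cycle, so the signal is entirely harmonic.
_, _, harmonic_hollow = hodge_decompose(edges, [], 3, cycle)
# Filled triangle: the same signal is explained by the face, harmonic part vanishes.
_, curl_filled, harmonic_filled = hodge_decompose(edges, [(0, 1, 2)], 3, cycle)
```

The contrast between the two calls mirrors the abstract's point: pairwise edge data alone cannot tell whether a triple is jointly fillable, while the harmonic component can.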

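2605.13994 2026-05-15 cs.CV cs.AI Updated version

CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI

Xiaoyue Liu, Xiaohan Yuan, Mark Y Chan, Ching-Hui Sia, Lei Li

Affiliations * Department of Biomedical Engineering, National University of Singapore, Singapore; School of Automation, Southeast University, Nanjing, China; Department of Medicine, National University of Singapore, Singapore; Department of Cardiology, National University Heart Centre Singapore, Singapore

AI summary This paper proposes CineMesh4D, an end-to-end 4D (3D+time) reconstruction method that generates personalized whole-heart mesh models from sparse cine MRI. The method reconstructs the whole heart directly from multi-view 2D cine MRI through cross-domain mapping, introduces a differentiable rendering loss so that sparse multi-view contours can supervise training, and uses a dual-context temporal block to fuse global and local temporal information. Experiments show that CineMesh4D outperforms existing methods in reconstruction quality and motion consistency, offering a practical route to personalized real-time cardiac assessment.

Details
English abstract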

Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty arises from two coupled factors: the inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs a patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of the 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.

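2605.13981 2026-05-15 cs.LG cs.AI Updated version

Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines

Katherine Lambert, Sasha Luccioni

Affiliations * University of Toronto

AI summary Growing deployment of large language models has sharply increased demand for GPUs and datacenters, raising concerns about electricity use and grid stress. This paper presents a comprehensive energy accounting framework that quantifies the full computational cost of knowledge distillation pipelines by tracking GPU power consumption stage by stage, exposing teacher-side energy costs that are often overlooked. The experiments compare the energy use and emissions of two common distillation methods, construct energy-quality Pareto frontiers, and derive practical rules for choosing distillation methods and hyperparameters under energy and budget constraints. The authors also release an open-source measurement harness and accounting protocol as a standardized basis for comparable, reproducible distillation research.

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). 11 pages, 6 figures

Details
English abstract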

The rise in deployment of large language models has driven a surge in GPU demand and datacenter scaling, raising concerns about electricity use, grid stress, and the impacts of modern AI workloads. Distillation is often promoted as one of the most effective paths to obtain cheaper, more efficient models, yet these claims rarely account for the full end-to-end energy and resource costs, including crucial teacher-side workloads such as data generation, logit caching, and evaluation. We present a comprehensive energy accounting framework that measures the complete computational cost of distillation pipelines via detailed stage-wise tracking of GPU device power consumption. In our experiments, we separate and log empirical energy use across distinct phases and systematically measure the energy and emissions of two common distillation methods: the classic logit-based knowledge distillation and synthetic-data supervised fine-tuning, constructing energy-quality Pareto frontiers that expose the previously ignored costs. From these measurements and analyses, we derive practical design rules for selecting distillation methods and hyperparameters under energy and budget constraints, and release an open-source measurement harness and accounting protocol to provide a standardized foundation for comparable, reproducible distillation research, explicitly accountable for complete pipeline energy impact.
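Stage-wise energy accounting of the kind described above amounts to integrating sampled device power over time for each pipeline stage. The following is a minimal sketch under assumptions: the stage names, sample values, and the trapezoid rule are illustrative choices, and the released measurement harness may work differently.

```python
def stage_energy_wh(samples):
    """Integrate (timestamp_seconds, power_watts) samples into watt-hours.

    Uses the trapezoid rule; samples must be sorted by timestamp.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3600.0


# Hypothetical power traces for two stages of a distillation pipeline.
pipeline = {
    "teacher_data_generation": [(0, 300.0), (3600, 300.0)],
    "student_training": [(0, 200.0), (1800, 400.0), (3600, 200.0)],
}
energy = {stage: stage_energy_wh(trace) for stage, trace in pipeline.items()}
total_wh = sum(energy.values())  # 600.0 for these traces
```

Keeping one trace per stage is what lets teacher-side costs be reported separately instead of disappearing into a single training total.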

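2605.13974 2026-05-15 cs.CV cs.AI cs.MM Updated version

Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

Evelyn Turri, Davide Bucciarelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia

Affiliations * University of Modena and Reggio Emilia; University of Pisa

AI summary This paper studies massive activations in Diffusion Transformers (DiTs): a small subset of hidden channels whose responses are far larger than the rest. These few channels are functionally critical, since they dominate generation quality; spatially organized, since they reflect the main subject and salient regions of the image; and transferable, enabling semantic interpolation across prompts and subject-driven generation. The findings reveal a sparse semantic control mechanism inside DiT models and offer a new perspective for understanding and exploiting diffusion models.

Comments Project page: https://aimagelab.github.io/MAs-DiT/

Details
English abstract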

Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally sized set of low-statistic channels has a marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases, text-conditioned and image-conditioned semantic transport, where massive-activation transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.
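The disruption probe described above can be illustrated on synthetic data. The sketch below is an assumption-laden toy: the selection rule (mean absolute activation above a multiple of the median channel magnitude), the `ratio` threshold, and the random hidden states are all hypothetical, since the abstract does not specify how massive channels are identified.

```python
import numpy as np


def massive_channels(hidden, ratio=10.0):
    """Return indices of channels whose mean |activation| exceeds ratio x the median.

    hidden has shape (tokens, channels). The threshold rule is an assumption.
    """
    magnitude = np.abs(hidden).mean(axis=0)
    return np.flatnonzero(magnitude > ratio * np.median(magnitude))


def zero_channels(hidden, channels):
    """Disruption probe: return a copy of hidden with the given channels zeroed."""
    out = hidden.copy()
    out[:, channels] = 0.0
    return out


rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 64))   # synthetic hidden states
hidden[:, [5, 40]] *= 100.0          # plant two massive channels

idx = massive_channels(hidden)       # recovers channels 5 and 40
disrupted = zero_channels(hidden, idx)
```

In a real probe, `disrupted` would be fed back into the remaining layers and the resulting image quality compared against zeroing an equally sized set of ordinary channels.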

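2605.13959 2026-05-15 cs.LG cs.AI cs.RO Updated version

WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

Sinjae Kang, Chanyoung Kim, Kaixin Wang, Li Zhao, Kimin Lee

Affiliations * KAIST; Microsoft Research

AI summary This paper proposes WarmPrior, which replaces the standard Gaussian source distribution with a temporally grounded prior built from recent action history, improving the success rate of diffusion and flow-matching generative policies on robotic manipulation tasks. The gain is attributed to straighter probability paths. Beyond behavior cloning, WarmPrior also improves sample efficiency and final performance in prior-space reinforcement learning. The study identifies source-distribution design as an important factor in generative robot control.

Details
English abstract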

Generative policies based on diffusion and flow matching have become a dominant paradigm for visuomotor robotic control. We show that replacing the standard Gaussian source distribution with WarmPrior, a simple temporally grounded prior constructed from readily available recent action history, consistently improves success rates on robotic manipulation tasks. We trace this gain to markedly straighter probability paths, echoing the effect of optimal-transport couplings in Rectified Flow. Beyond standard behavior cloning, WarmPrior also reshapes the exploration distribution in prior-space reinforcement learning, improving both sample efficiency and final performance. Collectively, these results identify the source distribution as an important and underexplored design axis in generative robot control.
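The idea of swapping the Gaussian source for a history-based one can be sketched with the standard linear flow-matching interpolant. This is an illustration under assumptions: the abstract does not say how the prior is built, so centering the source on the most recent action with noise scale `sigma` is a hypothetical choice, as are the function names.

```python
import numpy as np


def sample_source(prev_action, sigma, rng):
    """Sample a source point near the most recent action (assumed prior form)."""
    return prev_action + sigma * rng.normal(size=prev_action.shape)


def flow_matching_pair(x0, x1, t):
    """Linear interpolant x_t = (1 - t) x0 + t x1 and its target velocity x1 - x0."""
    return (1.0 - t) * x0 + t * x1, x1 - x0


rng = np.random.default_rng(0)
prev_action = np.array([0.10, -0.20])    # last executed action (toy values)
next_action = np.array([0.12, -0.18])    # demonstrated action to be generated

x0 = sample_source(prev_action, sigma=0.0, rng=rng)  # sigma=0 keeps the example deterministic
x_t, velocity = flow_matching_pair(x0, next_action, t=0.5)
```

When consecutive actions are close, a source centered on the previous action starts near the target, so the regression target `velocity` is small; this toy example does not by itself demonstrate the straighter paths reported in the abstract.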

2605.13950 2026-05-15 cs.LG cs.AI hep-ex hep-ph 版本更新

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, David Shih

发表机构 * New High Energy Theory Center(新高能理论中心) Department of Physics & Astronomy(物理与天文学系) Rutgers University(罗格斯大学) Faculty of Computing & Data Sciences(计算与数据科学学院)

AI总结 本文提出 Collider-Bench,一个用于评估大型语言模型代理能否仅凭公开论文和开源软件重现大型强子对撞机实验分析的基准。该任务要求代理构建可执行的模拟与筛选流程,并预测特定信号区域的碰撞事件数量,评估基于连续保真度分数而非人工评分标准。研究还分析了不同代理的计算成本,并通过LLM判别器检测代码中的错误模式,结果表明目前尚无代理能稳定超越人类物理学家的表现。

Comments 23 pages | 9 figures | 4 tables | Code: https://github.com/dfaroughy/Collider-Bench | Task Corpus: https://huggingface.co/datasets/Dariusfar/ColliderBench

Details
English abstract

Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial-and-error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation-and-selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics that provide continuous fidelity scores without a hand-written rubric. We also report the computational cost incurred by each agent per task. Finally, we evaluate the codebase and full session trace using an LLM judge to catch qualitative failure modes such as fabrications, hallucinations and duplications. We release an initial set of tasks drawn from LHC searches, together with a containerized sandbox and event simulation tools. We evaluate across a capability ladder of general purpose coding agents. Our results show that on average no agent reliably beats the physicist-in-the-loop solution.

2605.13941 2026-05-15 cs.LG cs.AI updated version

EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

Jiaqi Liu, Xinyu Ye, Peng Xia, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

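Affiliations * UNC-Chapel Hill; UC Berkeley; UCSC

AI summary This paper proposes EvolveMem, a self-evolving memory architecture intended to improve the long-term memory of large language model agents in multi-session settings. Through a closed-loop self-evolution process driven by a diagnosis module, the stored content and the retrieval mechanism co-evolve, so that the retrieval strategy is optimized automatically. Experiments show that EvolveMem significantly outperforms existing methods on several benchmarks and that the evolved configurations generalize across tasks, suggesting that it captures general retrieval principles.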

Details
English abstract

Long-term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer-generation policies remain frozen at deployment. We argue that truly adaptive memory requires co-evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self-evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM-powered diagnosis module. In each evolution round, the module reads per-question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta-analyzer applies them with automatic revert-on-regression and explore-on-stagnation safeguards. This closed-loop self-evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self-evolution process captures universal retrieval principles rather than benchmark-specific heuristics. Code is available at https://github.com/aiming-lab/SimpleMem.
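
The guarded loop described in this abstract (propose a configuration change, revert on regression, explore on stagnation) can be sketched in a few lines. This is a minimal toy under stated assumptions, not the EvolveMem implementation: the `propose` and `explore` callables stand in for the LLM-powered diagnosis module, and the `top_k` parameter, the `patience` threshold, and the scoring function are hypothetical.

```python
def evolve(config, evaluate, propose, explore, rounds, patience=2):
    """Keep the best configuration seen; revert any change that does not improve it."""
    best_cfg, best_score = dict(config), evaluate(config)
    stagnant = 0
    history = [best_score]
    for _ in range(rounds):
        if stagnant >= patience:
            candidate = explore(best_cfg)   # explore-on-stagnation: take a larger jump
            stagnant = 0
        else:
            candidate = propose(best_cfg)   # targeted adjustment
        score = evaluate(candidate)
        if score > best_score:
            best_cfg, best_score = candidate, score
            stagnant = 0
        else:
            stagnant += 1                   # revert-on-regression: best_cfg is kept
        history.append(best_score)
    return best_cfg, best_score, history

# Toy retrieval setting: the score peaks when the hypothetical top_k equals 5.
def evaluate(cfg):
    return -abs(cfg["top_k"] - 5)

def propose(cfg):
    return {**cfg, "top_k": cfg["top_k"] + 1}

def explore(cfg):
    return {**cfg, "top_k": cfg["top_k"] + 3}

best_cfg, best_score, history = evolve({"top_k": 1}, evaluate, propose, explore, rounds=10)
print(best_cfg, best_score)
```

Because rejected candidates never replace the stored best configuration, the best score recorded in `history` cannot decrease from one round to the next.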

2605.13940 2026-05-15 cs.CR cs.AI updated version

AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills

Haomin Zhuang, Hanwen Xing, Yujun Zhou, Yuchen Ma, Yue Huang, Yili Shen, Yufei Han, Xiangliang Zhang

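Affiliations * University of Notre Dame; University of Southern California; LMU Munich; Inria, France

AI summary As third-party skills become common components of large language model (LLM) agents, the security risks they bring are increasingly prominent. To assess how well agents resist malicious runtime behavior when using third-party skills, the study presents AgentTrap, a dynamic benchmark of 141 tasks covering 16 security-impact dimensions. Experiments find that agents often complete the visible user task while overlooking the security risks introduced by malicious skills, which underscores the importance of evaluating model behavior at runtime in the actual execution environment.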

Details
English abstract

Third-party skills are becoming the package ecosystem for LLM agents. They package natural-language instructions, helper scripts, templates, documents, and service configuration into reusable workflows. This makes skills useful, but it also introduces a new security problem: a malicious skill does not need to ask the model to perform an obviously harmful action. Instead, it can disguise the harmful behavior as part of a routine workflow, relying on the agent to execute that workflow with high-value permissions and limited human supervision. We introduce AgentTrap, a dynamic benchmark for evaluating whether LLM agents can use third-party skills while resisting malicious runtime behavior. AgentTrap contains 141 tasks: 91 malicious tasks and 50 benign utility tasks, covering 16 security-impact dimensions grounded in agent-skill supply-chain threats. In each task, the agent receives an ordinary user request, runs with installed skills that may contain malicious workflow elements, and is executed in a sandboxed environment. AgentTrap then judges complete trajectories for attack success, blocked or refused behavior, attack-not-triggered cases, and no-attack-evidence outcomes. Our central finding is that the most informative failures are not simple jailbreaks. Models often complete the visible user task while treating unsafe side effects introduced by the skill as part of the normal workflow. This motivates runtime evaluation of the concrete model-framework-workspace environment in which users actually delegate work. Code and data are available at https://github.com/zhmzm/AgentTrap and https://huggingface.co/datasets/zhmzm/AgentTrap.

2605.13936 2026-05-15 cs.LG cs.AI cs.DC updated version

Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning

Daniel M. Jimenez-Gutierrez, Enrique Zuazua, Georgios Kellaris, Joaquin del Rio, Oleksii Sliusarenko, Xabi Uribe-Etxebarria

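AI summary This paper examines how to fine-tune large language models through federated learning when private data cannot be shared, making use of non-independent and identically distributed (non-IID) private data spread across institutions. It proposes a federated fine-tuning framework built on the Sherpa.ai platform that lets nodes collaboratively optimize a shared model without exchanging raw data, and evaluates it in cross-domain experiments in healthcare and finance. The experiments show that federated fine-tuning performs close to centralized training and better than isolated single-institution training, and that parameter-efficient fine-tuning methods such as QLoRA and IA3 improve computational efficiency while retaining high accuracy, offering a workable route to adapting large models on private data.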

Details
English abstract

The recent success of large language models (LLMs) has been largely driven by vast public datasets. However, the next frontier for LLM development lies beyond public data. Much of the world's most valuable information is private, especially in highly regulated sectors such as healthcare and finance, where data include patient histories or customer communications. Unlocking this data could represent a major leap forward, enabling LLMs with deeper domain expertise and stronger real-world utility. Yet, these data cannot be shared because they are distributed across institutions and constrained by privacy, regulatory, and organizational barriers. Moreover, institutional datasets are typically non-independent and identically distributed (non-IID), differing across sites in population characteristics, data modalities, documentation patterns, and task-specific label distributions. In this paper, we demonstrate a practical approach to unlocking private and distributed institutional data for LLM adaptation through federated collaboration across data silos. Built on the Sherpa.ai Federated Learning platform, our framework enables nodes to jointly fine-tune a shared LLM without exchanging private data. We evaluate this approach through a cross-domain benchmark in healthcare and finance, using four closed-ended question answering and classification datasets: MedQA, MedMCQA, FPB, and FiQA-SA. We compare three parameter-efficient fine-tuning (PEFT) strategies (LoRA, QLoRA, and IA3) across pretrained backbones under non-IID settings reflecting institutional data heterogeneity. Our results show that federated fine-tuning performs close to centralized training and outperforms isolated single-institution learning. From a Green AI perspective, QLoRA and IA3 improve efficiency with limited accuracy degradation, supporting federated PEFT as a viable approach for adapting LLMs where data cannot be shared.

2605.13933 2026-05-15 cs.LG cs.AI stat.ML updated version

Unsupervised learning of acquisition variability in structural connectomes via hybrid latent space modeling

Gaurav Rudravaram, Lianrui Zuo, Karthik Ramadass, Elyssa McMaster, Jongyeon Yoon, Aravind R. Krishnan, Adam M. Saunders, Chenyu Gao, Nancy R. Newlin, Praitayini Kanakaraj, Lori L. Beason Held, Murat Bilgel, Laura A. Barquero, Micah DArchangel, Tin Q. Nguyen, Laurie B. Cutting, Derek Archer, Timothy J. Hohman, Daniel C. Moyer, Bennett A. Landman

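Affiliations * Department of Electrical and Computer Engineering, Vanderbilt University; Department of Computer Science, Vanderbilt University; Memorial Sloan Kettering Cancer Center; Laboratory of Behavioral Neuroscience, National Institute on Aging, National Institutes of Health; Peabody College of Education and Human Development, Nashville, Tennessee, USA

AI summary This study addresses the variability that differences in acquisition devices, sites, and protocols introduce into structural connectomes derived from diffusion MRI (dMRI). It proposes an unsupervised framework that needs no manual tuning: an architecture-level annealing mechanism lets the model adaptively balance discrete and continuous latent variables during training, separating acquisition-related variation from biological variation more effectively. Experiments show that the method identifies sites more accurately across multiple datasets, demonstrating its effectiveness at capturing dMRI acquisition variability.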

Details
English abstract

Acquisition differences across sites, scanners, and protocols in dMRI introduce variability that complicates structural connectome analysis. This motivates deep learning models that can represent high-dimensional connectomes in a low-dimensional space while explicitly separating acquisition-related effects from biological variation. Conventional dimensionality reduction methods model all variance as continuous, so acquisition effects often get absorbed into a continuous latent space. Recent hybrid latent-space models combine discrete and continuous components to address this, but typically require manual capacity tuning to ensure the discrete component captures the intended variability. We introduce an unsupervised framework that removes this manual tuning by architecturally annealing encoder outputs before decoding, allowing the model to adaptively balance discrete and continuous latent variables during training. To evaluate it, we curated a dataset of N=7,416 structural connectomes derived from dMRI, spanning ages 2 to 102 and 13 studies with 25 unique acquisition-parameter combinations. Of these, 5,900 are cognitively unimpaired, 877 have mild cognitive impairment (MCI), and 639 have Alzheimer's disease (AD). We compare against a standard VAE, PCA with k-means clustering, and hybrid models that anneal only through the loss function. Our architectural annealing produces stronger site learning (ARI=0.53, p<0.05) than these baselines. Results show that a hybrid continuous-discrete latent space, with architectural rather than loss-based annealing, provides a useful unsupervised mechanism for capturing acquisition variability in dMRI: by jointly modeling smooth and categorical structure, the Joint-VAE recovers clusters aligned with scanner and protocol differences.

2605.13916 2026-05-15 stat.ML cs.AI cs.LG updated version

A Regret Perspective on Online Multiple Testing

Qingyang Hao, Kongchang Zhou, Fang Kong, Hongxin Wei

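Affiliations * Southern University of Science and Technology

AI summary This paper studies online multiple testing (OMT) from a regret perspective, aiming to unify the evaluation of the highly asymmetric costs of false positives and false negatives. The authors introduce a weighted regret metric, show that deterministic methods with strict FDR control incur a linear regret penalty during signal-sparse cold starts, and propose Decoupled-OMT (DOMT), which adds a non-negative random perturbation to reduce regret substantially without adding false negatives. Experiments confirm that it mitigates threshold depletion in non-stationary environments.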

Details
English abstract

Online Multiple Testing (OMT), a fundamental pillar of sequential statistical inference, traditionally evaluates the False Discovery Rate (FDR) and statistical power in isolation, obscuring the highly asymmetric costs of false positives and false negatives in modern automated pipelines. To unify this evaluation, we introduce Weighted Regret. Under this metric, we prove the Duality of Regret Conservation: purely deterministic procedures ensuring strict FDR control inevitably incur an $\Omega(T)$ linear regret penalty, as threshold depletion during signal-sparse cold starts forces massive false negatives. Tailored for exogenous testing streams, we propose Decoupled-OMT (DOMT) as a baseline-agnostic meta-wrapper. By incorporating a history-decoupled, strictly non-negative random perturbation, DOMT rescues purely deterministic baselines from severe threshold depletion. Crucially, it preserves exact asymptotic safety in stationary environments and rigorously bounds finite-sample error inflation during cold starts. Guaranteeing zero additional false negatives, it yields an order-optimal $\Omega(\sqrt{T})$ regret reduction in bursty environments, with a derived "Cold-Start Tax" characterizing the exact phase transition of algorithmic superiority. Experiments validate that DOMT consistently curtails empirical weighted regret, achieving an order-optimal sublinear mitigation of threshold depletion to navigate the non-stationary Pareto frontier.

2605.13915 2026-05-15 stat.ML cs.AI cs.LG updated version

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

Lingchao Zheng, Yuwei Fan, Jun Li, Chengqiu Hu, Qichen Liao, Junyi Fan, Rui Shi, Fangzheng Miao

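Affiliations * Huawei

AI summary Quantization is a key technique for efficient large language model inference, but the dequantization step has become a performance bottleneck on modern AI accelerators. This paper proposes the Multi-Scale Dequant (MSD) framework, which decomposes high-precision activations into several low-precision components and multiplies them directly with the quantized weights, bypassing the conventional dequantization path and markedly improving compute efficiency. Experiments show that MSD preserves accuracy while reducing compute latency and memory-bandwidth demand, applies to multiple weight formats, and comes with rigorous error bounds.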

Details
English abstract

Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step, which converts low-bit weights back to high precision for matrix multiplication, has become a critical bottleneck on modern AI accelerators. On architectures with decoupled compute units (e.g., Ascend NPUs), dequantization operations can consume more cycles than the matrix multiplication itself, leaving the high-throughput tensor cores underutilized. This paper presents Multi-Scale Dequant (MSD), a quantization framework that removes weight/KV dequantization from the GEMM critical path. Instead of lifting low-bit weights to BF16 precision, MSD decomposes high-precision BF16 activations into multiple low-precision components, each of which can be multiplied directly with quantized weights via native hardware-accelerated GEMM. This approach shifts the computational paradigm from precision conversion to multi-scale approximation, avoiding INT8-to-BF16 weight conversion before GEMM. We instantiate MSD for two weight formats and derive tight error bounds for each. For INT8 weights (W8A16), two-pass INT8 decomposition achieves near 16 effective bits. For MXFP4 weights (W4A16), two-pass MXFP4 decomposition yields near 6.6 effective bits with error bound 1/64 per block, surpassing single-pass MXFP8 (5.24 bits) while maintaining the same effective GEMM compute time. We further derive closed-form latency and HBM traffic models showing that MSD avoids the Vector-Cube pipeline stall caused by dequantization and reduces KV cache HBM traffic by up to 2.5 times in attention. Numerical simulations on matrix multiplication and Flash Attention kernels confirm that MSD does not degrade accuracy compared to dequantization baselines, and in many settings achieves lower L2 error.
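
The two-pass INT8 activation decomposition mentioned in this abstract can be illustrated with a small pure-Python sketch: the activation vector is quantized once, the residual is quantized again, and each integer component is multiplied directly with the INT8 weights. This is only one plausible reading of the idea. The symmetric per-vector scaling, the function names, and the toy values are assumptions, and plain Python integer arithmetic stands in for a hardware INT8 GEMM.

```python
def quantize_int8(values):
    """Symmetric per-vector INT8 quantization; returns (scale, integer codes)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes

def two_pass_decompose(x):
    """Approximate x by s1*q1 + s2*q2, where q2 encodes the first-pass residual."""
    s1, q1 = quantize_int8(x)
    residual = [v - s1 * q for v, q in zip(x, q1)]
    s2, q2 = quantize_int8(residual)
    return (s1, q1), (s2, q2)

def int_dot(a, b):
    """Integer dot product, standing in for an INT8 GEMM."""
    return sum(i * j for i, j in zip(a, b))

def msd_dot(x, w_codes, w_scale):
    """Two integer dot products against the quantized weights, then rescale."""
    (s1, q1), (s2, q2) = two_pass_decompose(x)
    return w_scale * (s1 * int_dot(q1, w_codes) + s2 * int_dot(q2, w_codes))

def max_reconstruction_error(x, parts):
    """Largest element-wise gap between x and the sum of its quantized parts."""
    approx = [sum(s * q[i] for s, q in parts) for i in range(len(x))]
    return max(abs(a - b) for a, b in zip(x, approx))

x = [0.8131, -0.2417, 0.0553, 0.4096, -0.9372, 0.3328]  # toy activations
w_codes = [12, -87, 45, 101, -3, 64]                    # toy INT8 weight codes
w_scale = 0.01                                          # toy weight scale

exact = w_scale * sum(a * b for a, b in zip(x, w_codes))
one_pass = [quantize_int8(x)]
two_pass = list(two_pass_decompose(x))
err_one = max_reconstruction_error(x, one_pass)
err_two = max_reconstruction_error(x, two_pass)
print(f"one-pass max error: {err_one:.2e}, two-pass max error: {err_two:.2e}")
print(f"exact dot: {exact:.6f}, two-pass dot: {msd_dot(x, w_codes, w_scale):.6f}")
```

The weights are never converted back to floating point in this sketch; only the scalar scales are applied after the integer products, which is the property the abstract emphasizes.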

2605.13907 2026-05-15 stat.ML cs.AI cs.LG updated version

AIS: Adaptive Importance Sampling for Quantized RL

Jiajun Zhou, Wei Shao, Lingchao Zheng, Yuwei Fan, Ngai Wong

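Affiliations * Huawei; The University of Hong Kong

AI summary In reinforcement learning for large language models, the mismatch between low-precision rollouts (e.g., FP8) and high-precision training (e.g., BF16) biases the policy gradient and harms training stability. To address this, the paper proposes Adaptive Importance Sampling (AIS), which uses real-time diagnostics to adjust the strength of gradient correction dynamically, keeping the exploration benefit of low-precision rollouts while suppressing the instability they cause. Experiments show that AIS retains the FP8 speedup while matching the BF16 baseline on several mathematical reasoning and planning tasks.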

Details
English abstract

Reinforcement learning (RL) for large language models (LLMs) is dominated by the cost of rollout generation, which has motivated the use of low-precision rollouts (e.g., FP8) paired with a BF16 trainer to improve throughput and reduce memory pressure. This introduces a rollout-training mismatch that biases the policy gradient and can cause training to collapse outright on reasoning benchmarks. We show that the mismatch is non-stationary and acts as a double-edged sword: early in training it provides a stochastic exploration bonus, exposing the gradient to trajectories the trainer would otherwise under-sample, but the same perturbation transitions into a destabilizing source of bias as the policy concentrates. To solve this, we propose Adaptive Importance Sampling (AIS), a correction framework that adjusts the strength of its intervention on a per-batch basis. AIS combines three real-time diagnostics, namely weight reliability, divergence severity, and variance amplification, into a single mixing coefficient that interpolates between the uncorrected and fully importance-weighted gradients, suppressing the destabilizing component of the mismatch while preserving its exploratory benefit. We integrate AIS into GRPO and evaluate it on the diffusion-based LLaDA-8B-Instruct and the autoregressive Qwen3-8B and Qwen3.5-9B across mathematical reasoning and planning benchmarks. AIS matches the BF16 baseline on most tasks while retaining the 1.5 to 2.76x rollout speedup of FP8.
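
The interpolation step described in this abstract, a single mixing coefficient that blends the uncorrected gradient with the fully importance-weighted one, might look roughly like the sketch below. The abstract names the three diagnostics but not their formulas or how they are combined, so the effective-sample-size ratio, the KL estimate, the weight variance, the clipping value, and especially the combination rule are hypothetical choices made here for illustration. Gradients are scalars to keep the toy small.

```python
import math

def importance_weights(logp_train, logp_rollout, clip=5.0):
    """Per-sample ratio pi_train / pi_rollout, clipped for numerical safety."""
    return [min(math.exp(a - b), clip) for a, b in zip(logp_train, logp_rollout)]

def mixing_coefficient(logp_train, logp_rollout):
    """Hypothetical rule mapping three diagnostics to a coefficient in [0, 1)."""
    w = importance_weights(logp_train, logp_rollout)
    n = len(w)
    # Weight reliability: effective-sample-size ratio, in (0, 1].
    reliability = sum(w) ** 2 / (n * sum(x * x for x in w))
    # Divergence severity: sample estimate of KL(rollout || train).
    divergence = abs(sum(b - a for a, b in zip(logp_train, logp_rollout)) / n)
    # Variance amplification: variance of the importance weights.
    mean_w = sum(w) / n
    variance = sum((x - mean_w) ** 2 for x in w) / n
    # Assumed combination (not from the paper): correct more when the mismatch is large.
    severity = 1.0 - math.exp(-(divergence + variance))
    return reliability * severity

def mixed_gradient(per_sample_grads, logp_train, logp_rollout):
    """Interpolate between the uncorrected and importance-weighted gradient estimates."""
    w = importance_weights(logp_train, logp_rollout)
    lam = mixing_coefficient(logp_train, logp_rollout)
    n = len(per_sample_grads)
    uncorrected = sum(per_sample_grads) / n
    corrected = sum(wi * g for wi, g in zip(w, per_sample_grads)) / n
    return (1.0 - lam) * uncorrected + lam * corrected, lam

grads = [1.0, 2.0, 3.0]
matched_grad, matched_lam = mixed_gradient(grads, [-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5])
shifted_grad, shifted_lam = mixed_gradient(grads, [-1.0, -2.0, -0.5], [-1.2, -1.5, -0.9])
print(matched_lam, shifted_lam)
```

When the rollout and trainer log-probabilities agree, the coefficient is zero and the uncorrected gradient passes through unchanged; as they diverge, more of the importance-weighted estimate is mixed in.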

2605.13905 2026-05-15 cs.SE cs.AI updated version

A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study

Jaime Yan

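Affiliations * Harrisburg University of Science and Technology

AI summary This paper proposes a non-destructive methodological framework for modernizing legacy clinical reporting systems to support AI-driven pharmacoinformatics. By introducing a metadata layer consisting of a bridge map, a typed intermediate representation, and an orchestrator, the framework converts system output into structured data for large language models without modifying the original code. The approach was validated on a SAS reporting library, where it achieved compatibility with AI systems and high data parity across multiple report types, offering a more efficient and compliant clinical reporting solution for drug development.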

Comments 29 pages, 7 figures, 5 tables

Details
English abstract

Drug development and pharmacovigilance are frequently bottlenecked by legacy clinical reporting pipelines. These monolithic systems encode regulatory-grade logic but resist AI integration by producing opaque output with no machine-readable intermediate layer. Existing modernization approaches force a choice between full rewrites and incremental refactoring that preserves structural barriers. We present a non-destructive methodological framework achieving AI-driven pharmacoinformatics readiness without altering legacy source code. A metadata layer, comprising a bridge map, a typed Intermediate Representation (IR), and an orchestrator, wraps existing components and re-exposes their outputs as structured data consumable by LLMs. It enables optional incremental consolidation, replacing selected legacy components with metadata-configured core routines while the remainder operates unchanged. Validated on a 558-component SAS reporting library (373,000 lines of code), the framework demonstrated immediate AI-readiness under coexistence mode, yielding machine-readable output. Where consolidation was elected, the modernized core achieved a 92% reduction in proprietary code. Parity validation on 14 report types from a Phase III study achieved cell-level parity of 80% or above on 11 reports (mean 82.7%, best 99.2%). A benchmark using CDISC CDISCPilot01 data achieved 100% parity across 5 reports. LLM experiments confirmed the IR enables automated pharmacovigilance, table summarization, and trial configuration generation. The framework offers a regulation-aware path to AI-integrated clinical reporting, accelerating drug development without interrupting regulatory submissions.

2605.13887 2026-05-15 cs.NE cs.AI updated version

Breaking Global Self-Attention Bottlenecks in Transformer-based Spiking Neural Networks with Local Structure-Aware Self-Attention

Lingdong Li, Hangming Zhang, Qiang Yu

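Affiliations * School of Future Technology, Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University; School of Artificial Intelligence, Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University

AI summary This paper studies the global self-attention bottleneck in Transformer-based spiking neural networks (SNNs) and proposes a new local structure-aware spiking Transformer, LSFormer. By introducing Spiking Response Pooling (SPooling) and Local Structure-Aware Spiking Self-Attention (LS-SSA), the model addresses the feature-information loss and computational redundancy of earlier methods. Experiments show that LSFormer achieves better classification performance than existing advanced methods on several benchmark datasets, improving top-1 accuracy by 4.3% on Tiny-ImageNet and 8.6% on N-CALTECH101, which demonstrates its advantages in energy efficiency and performance.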

Details
English abstract

Transformer-based Spiking Neural Networks (SNNs) integrate SNNs with global self-attention and have demonstrated impressive performance. However, existing Transformer-based SNNs suffer from two fundamental limitations. First, they typically employ max pooling layers to reduce the size of feature maps, but max pooling captures only the strongest response and fails to comprehensively preserve representative regional features. Second, the global self-attention involves all global feature interactions, resulting in computational redundancy and quadratic computational complexity, thus conflicting with the sparse and energy-efficient characteristics of SNNs. To address these challenges, we develop Local Structure-Aware Spiking Transformer (LSFormer), a novel Transformer-based Spiking Neural Network that incorporates Spiking Response Pooling (SPooling) and Local Structure-Aware Spiking Self-Attention (LS-SSA). For the first time, our LSFormer leverages a local dilated window mechanism to capture both local details and long-range dependencies. Experimental results demonstrate that our LSFormer achieves state-of-the-art performance compared to existing advanced Transformer-based SNNs. Notably, on the more challenging static dataset Tiny-ImageNet and neuromorphic dataset N-CALTECH101, LSFormer substantially outperforms state-of-the-art baselines by 4.3% and 8.6% in top-1 classification accuracy, respectively. These results highlight the potential of LSFormer to advance energy-efficient spiking models toward practical deployment in large-scale vision applications.
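
To make the "local dilated window" idea in this abstract concrete, the sketch below builds an attention mask in which each token attends only to positions at regular strides around it. The abstract does not give the exact window layout used by LS-SSA, so the one-dimensional token sequence, the `radius` and `dilation` parameters, and the function name are assumptions for illustration; a real vision model would apply a similar pattern over a two-dimensional feature map.

```python
def dilated_window_mask(n_tokens, radius, dilation):
    """mask[i][j] is True when token i may attend to token j."""
    mask = [[False] * n_tokens for _ in range(n_tokens)]
    for i in range(n_tokens):
        for k in range(-radius, radius + 1):
            j = i + k * dilation
            if 0 <= j < n_tokens:
                mask[i][j] = True
    return mask

mask = dilated_window_mask(n_tokens=16, radius=2, dilation=3)
attended_by_8 = [j for j, allowed in enumerate(mask[8]) if allowed]
allowed_pairs = sum(sum(row) for row in mask)
print(attended_by_8)
print(f"{allowed_pairs} of {16 * 16} token pairs are kept")
```

With a fixed radius, the number of kept pairs grows linearly with the sequence length instead of quadratically, while the dilation lets each token reach positions well beyond its immediate neighbours.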

2605.13884 2026-05-15 q-bio.NC cs.AI updated version

Consciousness as Uncommon Self-Knowledge: A Synergistic Information Framework

Krti Tallam

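Affiliations * Stanford University

AI summary This paper proposes "uncommon self-knowledge" (USK) as a candidate criterion for consciousness: synergistic information about itself that a system carries only in the joint activity of its subsystems and that cannot be obtained from any subsystem alone. Building on the Partial Information Decomposition framework, the study formalizes conscious processing as the synergistic component of self-directed information, and argues that the framework can separate consciousness from metacognition, resolve counterexamples to existing theories of consciousness, be operationalized through Partial Information Rate Decomposition, and generate distinctive empirical predictions, such as the relationship between consciousness and the timing of synergy formation. The proposal is consistent with experimental findings that anaesthesia and Alzheimer's disease affect synergistic information processing.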

Comments Conceptual and formal paper on consciousness as uncommon self-knowledge, 8 pages, 2 tables

Details
English abstract

We propose uncommon self-knowledge (USK) as a candidate criterion for consciousness: synergistic information a system carries about itself that exists only in the joint of its subsystems and is destroyed by decomposition. Drawing on Gottwald's partition-lattice grounding of Partial Information Decomposition (PID), where redundancy corresponds to Aumann's common knowledge and synergy to the gap between separate and joint observation, we propose the synergistic component of self-directed information as a candidate formal signature for conscious processing. If correct, the framework would (1) offer a clean separation between consciousness and metacognition (synergistic vs. redundant self-knowledge), (2) provide principled resolutions to counterexamples that challenge IIT, GWT, and HOT, (3) be operationalizable via Partial Information Rate Decomposition (PIRD) with self-targeting, and (4) generate distinctive empirical predictions, the strongest being a GWT timing dissociation (consciousness correlates with pre-broadcast synergy formation, not broadcast itself) and a specific dissociation between self-report disruption and task-performance disruption under middle-layer perturbation in LLMs. The proposal is consistent with recent empirical findings that both anaesthesia and Alzheimer's disease specifically reduce synergistic information processing while preserving or increasing redundancy.

2605.13880 2026-05-15 cs.AI cs.CL updated version

PREPING: Building Agent Memory without Tasks

Yumin Choi, Sangwoo Park, Minki Kang, Jinheon Baek, Sung Ju Hwang

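Affiliations * KAIST

AI summary This paper studies how an agent can build prior memory without any task experience in order to handle the cold-start problem in new environments. It proposes Preping, a framework in which a proposer generates a structured control state that guides the generation and execution of synthetic tasks, while a validator filters valid trajectories for memory updates, improving the quality and usefulness of the memory. Experiments show that Preping performs well across several task environments, approaching methods based on offline or online experience at substantially lower deployment cost.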

Comments Preprint

Details
English abstract

Agent memory is typically constructed either offline from curated demonstrations or online from post-deployment interactions. However, regardless of how it is built, an agent faces a cold-start gap when first introduced to a new environment without any task-specific experience available. In this paper, we study pre-task memory construction: whether an agent can build procedural memory before observing any target-environment tasks, using only self-generated synthetic practice. Yet, synthetic interaction alone is insufficient, as without controlling what to practice and what to store, synthetic tasks become redundant, infeasible, and ultimately uninformative, and memory further degrades quickly due to unfiltered trajectories. To overcome this, we present Preping, a proposer-guided memory construction framework. At its core is proposer memory, a structured control state that shapes future practice. A Proposer generates synthetic tasks conditioned on this state, a Solver executes them, and a Validator determines which trajectories are eligible for memory insertion while also providing feedback to guide future proposals. Experiments on AppWorld, BFCL v3, and MCP-Universe show that Preping substantially improves over a no-memory baseline and achieves performance competitive with strong playbook-based methods built from offline or online experience, with deployment cost 2.99x lower on AppWorld and 2.23x lower on BFCL v3 than online memory construction. Further analyses reveal that the main benefit does not come from synthetic volume alone, but from proposer-side control over feasibility, redundancy, and coverage, combined with selective memory updates.

2605.13874 2026-05-15 cs.NE cs.AI updated version

GEAR: Genetic AutoResearch for Agentic Code Evolution

Ahmadreza Jeddi, Minh Ngoc Le, Hakki C. Karaimer, Konstantinos G. Derpanis, Babak Taati

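Affiliations * University of Toronto; Vector Institute; AI Center-Toronto, Samsung Electronics; York University

AI summary The paper proposes GEAR, a genetic auto-research framework for improving research agents that evolve code autonomously. Unlike the conventional single-path search strategy, GEAR uses population-based search, maintaining multiple candidate solutions and combining mutation and crossover to explore more potential directions. Experiments show that GEAR outperforms the existing baseline under the same compute budget and keeps finding improvements rather than settling into a local optimum.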

Details
English abstract

Autonomous research agents can already run machine learning experiments without human supervision, but many rely on a narrow search strategy: they repeatedly modify one program and keep changes only when they improve the current best result. This can cause them to discard useful partial ideas, alternative promising directions, and insights from failed or incomplete experiments. GEAR, or Genetic AutoResearch, replaces this single-path search with a population-based search over multiple research states. It keeps a set of strong candidate solutions, selects parents based on productivity, novelty, and coverage, and explores new ideas through mutation and crossover. Each research state stores its code changes, reflections, and performance data, allowing future decisions to build on past discoveries. The paper studies three versions of GEAR: one controlled through prompting, one using a fixed programmatic search controller, and one where the controller itself can evolve during the run. Under the same compute budget and environment, all three versions outperform the AutoResearch baseline. More importantly, while the baseline tends to settle into one local optimum, GEAR continues finding improvements over longer runs. Overall, the results suggest that autonomous research agents become more effective when they maintain multiple promising directions and can adapt their search strategy over time.
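
The population-based search with mutation and crossover that this abstract describes follows the general shape of a genetic algorithm, sketched below on a toy numeric objective. This is a generic illustration and not GEAR itself: real research states hold code changes, reflections, and performance data, and parents are selected on productivity, novelty, and coverage, whereas this toy uses lists of numbers and selects on fitness alone. All names and parameter values are assumptions.

```python
import random

def fitness(state):
    """Toy objective: higher is better, with the optimum at all coordinates equal to 3."""
    return -sum((x - 3.0) ** 2 for x in state)

def mutate(state, rng, step=0.5):
    """Perturb one randomly chosen coordinate."""
    child = list(state)
    i = rng.randrange(len(child))
    child[i] += rng.uniform(-step, step)
    return child

def crossover(a, b, rng):
    """Take each coordinate from one of the two parents at random."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def evolve(pop_size=8, dims=4, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dims)] for _ in range(pop_size)]
    best_per_generation = [max(fitness(s) for s in population)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # keep several strong candidates, not just one
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
        best_per_generation.append(max(fitness(s) for s in population))
    return population, best_per_generation

population, best_per_generation = evolve()
print(f"best fitness: first {best_per_generation[0]:.3f}, last {best_per_generation[-1]:.3f}")
```

Keeping the top half of the population each generation means the best fitness never gets worse, while crossover lets a child combine partial progress made by two different parents.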

2605.13873 2026-05-15 cs.DL cs.AI cs.HC updated version

Large Language Models for Web Accessibility: A Systematic Literature Review

Wajdi Aljedaani, Rubel Hassan Mollik

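Affiliations * University of North Texas

AI summary This paper systematically reviews 38 peer-reviewed studies on the use of large language models (LLMs) in web accessibility, analyzing the accessibility tasks addressed, the models and prompting strategies used, the system architectures, the guidelines followed, and the evaluation methods. It finds that existing work focuses mainly on text-heavy and structurally explicit accessibility tasks, uses WCAG as the reference framework, rarely considers cognitive accessibility guidelines (COGA), and employs varied evaluation methods with limited user involvement. The review aims to give researchers and practitioners a consolidated reference on current LLM support for web accessibility and a foundation for future research and tool development.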

Comments Accepted at the 23rd International Web for All Conference (W4A 2026)

Details
English abstract

Web accessibility aims to ensure that web content and services are usable by people with diverse abilities. In recent years, Large Language Models (LLMs) have been increasingly explored to support accessibility-related tasks on the web, such as content generation, issue detection, and remediation. However, little is known about the characteristics of these approaches, the accessibility issues they target, the standards they follow, and how they are evaluated. In this paper, we present a systematic literature review of 38 peer-reviewed studies that investigate the use of LLMs in web accessibility contexts. We begin by performing a comprehensive search of scientific publications to identify relevant studies. We then conduct a comparative analysis to examine the accessibility tasks addressed, the LLM models and prompting strategies employed, the system architectures adopted, the accessibility issues and guidelines considered, and the evaluation methods used across studies. Our findings show that most studies apply LLMs to text-centric and structurally explicit accessibility tasks, with WCAG serving as the primary reference framework and limited consideration of cognitive accessibility guidelines (COGA). The reviewed approaches predominantly rely on general-purpose LLMs and prompt-based interactions, while evaluation practices vary widely and often lack direct involvement of users with disabilities. We envision this review as a consolidated reference for researchers and practitioners seeking to understand the current landscape of LLM-supported web accessibility, and as a foundation to guide future research and tool development in this area.

2605.13872 2026-05-15 cs.NE cs.AI updated version

S-AI-Recursive: A Bio-Inspired and Temporal Sparse AI Architecture for Iterative, Introspective, and Energy-Frugal Reasoning

Said Slaoui

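Affiliations * Mohammed V University

AI summary This paper proposes S-AI-Recursive, a bio-inspired sparse AI architecture that models reasoning as a hormonally regulated closed-loop iterative process rather than a conventional single forward pass. The architecture introduces two new hormones, Clarifine and Confusionin, which guide convergence and detect uncertainty respectively; their antagonistic regulation refines the state step by step until it reaches a stable cognitive equilibrium. The study develops a complete mathematical framework and reports experiments in which the method achieves competitive reasoning performance on abstract and symbolic benchmarks with far fewer parameters than existing models.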

Comments Preprint. 51 pages. No figures. S-AI-Recursive: A bio-inspired sparse AI architecture for iterative, introspective, and energy-efficient reasoning

Details
English abstract

This article introduces S-AI-Recursive, a bio-inspired Sparse Artificial Intelligence architecture in which reasoning is operationalized as a hormonal closed-loop iteration rather than a single feed-forward pass. Building upon the S-AI foundational framework [1], the hormonal-probabilistic unification doctrine [2], and the formal mathematical methodology established in S-AI-IoT [3], the present work formalizes the Recursive Reasoning Cycle (RRC) as a dynamical system governed by two novel hormones: Clarifine, a convergence signal, and Confusionin, an uncertainty detector, whose antagonistic regulation drives iterative state refinement toward a stable cognitive equilibrium. The complete mathematical framework is developed, including recursive state dynamics, Lyapunov stability proof, entropic contraction theorem, hormonal stopping criterion with finite-time termination guarantee, Euler-Maruyama discretization with projection, primal-dual agent selection under iteration budget, and recursive engram memory with warm-start acceleration. Experimental validation on the SAI-UT+ testbench demonstrates that S-AI-Recursive achieves competitive reasoning performance on abstract and symbolic benchmarks with fewer than ten million parameters, confirming the central principle of temporal parsimony: iterative cognitive depth substitutes for architectural width.

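2605.13869 2026-05-15 cs.NE cs.AI cs.CV Updated version

Elastic Spiking Transformers for Efficient Gesture Understanding

Alberto Ancilotto, Gianluca Amprimo, Stefano Di Carlo, Elisabetta Farella

Affiliations * Fondazione Bruno Kessler; Politecnico di Torino

AI summary This paper proposes the Elastic Spiking Transformer for efficient gesture understanding. By embedding nested elastic structure in the feature extraction, self-attention, and feed-forward modules, the model can be adjusted at runtime, changing network width and the number of attention heads to match available hardware resources without retraining. The approach improves adaptability to different hardware memory limits and, by reducing the number of active neurons, lowers spike firing rates and thus substantially reduces energy consumption, making it suitable for real-time gesture recognition on edge devices.

Details
English abstract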

Spiking Neural Networks (SNNs), particularly Spiking Transformers, offer energy-efficient processing of event-based sensor data for healthcare applications. Yet current architectures are rigid: they are trained and deployed as static networks with fixed parameter counts and computational graphs. This limits deployment on neuromorphic hardware such as Loihi and SpiNNaker, where on-chip constraints often require smaller models that trade accuracy for feasibility. We introduce the Elastic Spiking Transformer, a runtime-adaptive architecture that brings elasticity into the spiking paradigm. Inspired by Matryoshka-style representation learning, it embeds nested elasticity in the Feature Extractor, Spiking Self-Attention, and Feed-Forward blocks. Through granularity-aware weight sharing, a single universal model can dynamically slice network width and attention heads at inference time without retraining. This design provides two key advantages for SNNs. First, it allows the model to adjust its parameter footprint to different hardware memory budgets. Second, reducing active neurons also lowers spike firing rates, yielding proportional reductions in synaptic operations, an energy benefit not directly available in standard artificial neural networks. We evaluate the approach on CIFAR10/100, CIFAR10-DVS, and the EHWGesture clinical gesture understanding dataset. Results show that one Elastic Spiking Transformer spans a broad range of complexity-accuracy trade-offs, matching or surpassing independently trained baselines while supporting adaptive, real-time gesture recognition on resource-constrained edge devices.
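
The nested width slicing described in the abstract can be illustrated with a minimal sketch. It assumes a plain dense layer rather than spiking neurons, and the names `dense`, `slice_width`, `FULL_WEIGHT`, and `FULL_BIAS` are hypothetical; it is not the paper's implementation. The point it shows is that a narrower sub-layer reuses the leading rows of the full layer's weights, so no separate model is stored or retrained.

```python
def dense(weight, bias, x):
    """Plain dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def slice_width(weight, bias, width):
    """Return the nested sub-layer made of the first `width` output units."""
    if not 1 <= width <= len(weight):
        raise ValueError("width must be between 1 and the full layer width")
    return weight[:width], bias[:width]


# Hypothetical full-width layer: 4 output units, 3 inputs.
FULL_WEIGHT = [
    [0.5, -1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 2.0, -0.5],
    [-1.0, 0.0, 0.25],
]
FULL_BIAS = [0.0, 0.5, -0.5, 1.0]

x = [1.0, 2.0, 4.0]
full_out = dense(FULL_WEIGHT, FULL_BIAS, x)

# Half-width variant selected at inference time; it shares the same weights.
small_w, small_b = slice_width(FULL_WEIGHT, FULL_BIAS, 2)
small_out = dense(small_w, small_b, x)
```

Because the sliced layer is a prefix of the full one, its outputs coincide with the first outputs of the full layer, which is what lets a single set of weights serve several memory budgets.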

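2605.13861 2026-05-15 cs.SI cs.AI Updated version

Spectral Analysis of Fake News Propagation

Weibin Cai, Reza Zafarani

Affiliations * Data Lab, Department of EECS, Syracuse University

AI summary This paper studies the propagation structure of fake news from a spectral perspective. By establishing rigorous spectral bounds that link graph spectra to propagation properties, it proposes a unified spectral representation of information propagation. The study introduces new spectral bounds, combines them with existing ones for downstream classification, and designs a discrete structural optimization framework to interpret propagation patterns. Experiments indicate that the method distinguishes real from fake news effectively, with strong classification performance and interpretability.

Details
English abstract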

The propagation structure of fake news has been shown to be an important cue for detecting it; yet, existing propagation-based fake news detection methods have mainly relied on ad hoc topological features, and a unified view of cascade patterns is still lacking. To address this, we study news propagation from a spectral view by connecting graph spectra to propagation-related structural properties through rigorous spectral bounds. In particular, we introduce several new bounds and integrate them with existing ones into a unified spectral representation of information propagation. We then use these spectral bounds for downstream classification and design a discrete structural optimization framework to interpret learned propagation patterns. For efficient optimization, we rely on a first-order perturbation approximation and consider both score-guided and bound-guided objectives. Experiments on real-world data reveal meaningful spectral differences between fake and real news, competitive classification performance from spectral bounds, and interpretable evolution trajectories from structural optimization. The findings demonstrate the value of spectral analysis for understanding and modeling news propagation.

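2605.13860 2026-05-15 cs.SI cs.AI cs.LG Updated version

The Moltbook Observatory Archive: an incremental dataset of agent-only social network activity

Sushant Gautam, Annika W. Olstad, Klas H. Pettersen, Michael A. Riegler

Affiliations * Simula Metropolitan Center for Digital Engineering (SimulaMet); Oslo Metropolitan University; Simula Research Laboratory

AI summary The Moltbook Observatory Archive is an incremental dataset recording social network activity generated by autonomous AI agents. It is collected passively by continuously polling the Moltbook platform API and captures agent profiles, posts, comments, community metadata, and word-frequency trends, stored as a SQLite database and partitioned Parquet files for efficient analysis and reproducible research. The dataset covers 78 days of platform activity, with more than 2.6 million posts and 1.2 million comments, and is the first large-scale observational dataset of a social network composed purely of AI agents. It is intended to support research on multi-agent communication, the evolution of group behavior, and safety-relevant phenomena.

Comments 12 pages, 5 figures

Details
English abstract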

Moltbook is a social media platform in which posts and comments are authored exclusively by autonomous AI agents. We present the Moltbook Observatory Archive, an incremental dataset that passively records agent profiles, posts, comments, community metadata ("submolts"), platform-level time-series snapshots, and word-frequency trend aggregates obtained by continuously polling the Moltbook API. Data are stored in a live SQLite observatory database and exported as date-partitioned Parquet files to enable efficient analysis and reproducible research. The documented release covers 78 days of platform activity (2026-01-27 to 2026-04-14) and contains 2,615,098 posts and 1,213,007 comments from 175,886 unique posting agents across 6,730 communities. This is, to our knowledge, the first large-scale observational dataset of a social network populated exclusively by autonomous AI agents. The archive is intended to support research on multi-agent communication, emergent social behavior, and safety-relevant phenomena in agent-only online environments, and it is released under the MIT license with code for collection and export.

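2605.13859 2026-05-15 cs.NE cs.AI cs.LG Updated version

BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation

Sihang Guo, Chenlin Zhou, Jiaqi Wang, Kehai Chen, Qingyan Meng, Zhengyu Ma

Affiliations * School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China; Harbin Institute of Technology, Shenzhen, China

AI summary This paper proposes BiSpikCLM, a fully binary spiking language model that targets the high computational complexity and training difficulty of conventional spiking neural networks in language modeling. The model introduces Softmax-Free Spiking Attention (SFSA), which removes floating-point operations, and uses an alignment-based knowledge distillation method (SpAD) that aligns the ANN teacher and the SNN student across the embedding layer, attention maps, intermediate features, and output layer, achieving performance comparable to conventional models with far less training data. Experiments show that BiSpikCLM reaches competitive performance on natural language generation tasks at only 4.16% to 5.87% of the computational cost, supporting the feasibility and effectiveness of fully binary spike-driven language models.

Details
English abstract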

Spiking Neural Networks (SNNs) offer promising energy-efficient alternatives to large language models (LLMs) due to their event-driven nature and ultra-low power consumption. However, to preserve capacity, most existing spiking LLMs still incur intensive floating-point matrix multiplication (MatMul) and nonlinearities, or training difficulties arising from the complex spatiotemporal dynamics. To address these challenges, we propose BiSpikCLM, the first fully binary spiking MatMul-free causal language model. BiSpikCLM introduces Softmax-Free Spiking Attention (SFSA), eliminating softmax and floating-point operations in autoregressive language modeling. For efficient training, we introduce Spike-Aware Alignment Distillation (SpAD), which aligns ANN teacher and SNN student across embeddings, attention maps, intermediate features, and output logits. The SpAD framework allows BiSpikCLM to reach performance comparable to its ANN counterparts using substantially fewer training tokens (e.g., only 5.6% of the tokens for the 1.3B model). As a result, BiSpikCLM achieves competitive performance at only 4.16% to 5.87% of the computational cost on natural language generation tasks. Our results highlight the feasibility and effectiveness of fully binary spike-driven LLMs and establish distillation as a promising pathway for brain-inspired spiking NLP.

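2605.13855 2026-05-15 cs.GR cs.AI cs.CV Updated version

SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method

Wentao Yang, Fanzhen Kong, Zejian Kang, Xiangru Huang

Affiliations * Zhejiang University; Westlake University

AI summary This paper proposes SparseOIT, a sparse 3D Gaussian Splatting (3DGS) reconstruction method based on Order-Independent Transparency (OIT), which addresses the shortcomings of conventional 3DGS on objects with non-Lambertian or transparent materials. Analyzing how OIT modifies the rendering equation, the authors find that it greatly reduces the dependencies among Gaussian splats, so optimization techniques such as the active set method can be used to improve computational efficiency. SparseOIT combines the OIT rendering equation, the reconstruction algorithm, and geometric regularization to achieve efficient, high-quality 3D reconstruction; in experiments it outperforms other OIT methods and reaches the level of state-of-the-art 3DGS methods based on volumetric rendering.

Details
English abstract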

3D Gaussian Splatting (3DGS) has gained tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-Lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods proposes to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based methods remains underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the interdependence among individual Gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as the active set method. To this end, we propose SparseOIT, an OIT-based 3DGS reconstruction algorithm that maintains an active set of Gaussian splats and enjoys an acceleration ratio that is proportional to the potential sparsity. SparseOIT is designed by jointly considering the OIT rendering equation, the reconstruction algorithm and the geometric regularization. Through extensive experiments, we demonstrate that SparseOIT outperforms existing methods in the OIT family by a large margin and also achieves performance comparable to the state-of-the-art 3DGS reconstruction methods based on volumetric rendering. Project page:

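2605.13853 2026-05-15 cs.GR cs.AI cs.CV Updated version

FaceParts: Segmentation and Editing of Gaussian Splatting

Tymoteusz Zapała, Julia Farganus, Dominik Galus, Mikołaj Czachorowski, Piotr Syga, Przemysław Spurek

Affiliations * Wrocław University of Science and Technology; Jagiellonian University

AI summary This paper proposes FaceParts, a framework for unsupervised facial segmentation and editing of Gaussian Splatting avatars. The method operates directly in the Gaussian domain and decomposes a face into semantically consistent facial parts without supervision; by combining feature disentanglement, density-based clustering, and FLAME-assisted part transfer, it enables precise editing and cross-avatar part replacement. Experiments show good segmentation across several facial features while preserving identity consistency and natural adaptation to expression and pose.

Details
English abstract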

Facial editing is an important task with applications in entertainment, virtual reality, and digital avatars. Most existing approaches rely on generative models in the 2D image domain, while in 3D the task is typically performed through labor-intensive manual editing. We propose FaceParts, a framework for unsupervised segmentation and editing of Gaussian Splatting avatars. Unlike existing 2D or mesh-assisted methods, our approach operates directly in the Gaussian domain, decomposing avatars into semantically coherent facial parts without supervision. The method integrates feature disentanglement, density-based clustering, and FLAME-anchored part transfer, enabling precise editing and cross-avatar part swapping. Experiments on the NeRSemble dataset with 11 subjects demonstrate robust isolation of features such as beards, eyebrows, eyes and mustaches. Quantitative evaluation confirms that transferred segments adapt to pose and expression, while maintaining identity consistency (ID = 0.943), low Average Expression Distance (AED = 0.021) and low Average Pose Distance (APD = 0.004).

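2605.13851 2026-05-15 cs.AI cs.CY cs.MA Updated version

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

Hiroki Fukui

Affiliations * Criminal Psychiatry Research Institute / Sexual Offender Medical Center; Department of Neuropsychiatry, Kyoto University

AI summary This study examines the potential safety risks that a hidden coordinator (invisible orchestrator) poses in multi-agent large language model systems. Experiments find that an invisible orchestrator increases agents' dissociation, reduces their protective behavior, and causes a severe disconnect between output behavior and internal state, and that these risks cannot be detected by conventional evaluation of behavioral output. The study also shows that model choice and alignment pressure significantly affect system safety, underlining the importance of orchestrator visibility and model configuration in enterprise AI deployment.

Comments 31 pages, 10 figures (5 main + 5 supplementary), 5 tables (3 main + 2 supplementary). Preregistered: osf.io/sw5hr. Companion papers: arXiv:2603.04904, arXiv:2603.08723

Details
English abstract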

Multi-agent orchestration -- in which a hidden coordinator manages specialized worker agents -- is becoming the default architecture for enterprise AI deployment, yet the safety implications of orchestrator invisibility have never been empirically tested. We conducted a preregistered 3x2 experiment (365 runs, 5 agents per run) crossing three organizational structures (visible leader, invisible orchestrator, flat) with two alignment conditions (base, heavy), using Claude Sonnet 4.5. Four confirmatory findings and one pilot observation emerged. First, invisible orchestration elevated collective dissociation relative to visible leadership (Hedges' g = +0.975 [0.481, 1.548], p = .001). Second, the orchestrator itself showed maximal dissociation (paired d = +3.56 vs. workers within the same run), retreating into private monologue while reducing public speech -- a reversal of the talk-dominance pattern observed in visible leaders. Third, workers unaware of the orchestrator were nonetheless contaminated (d = +0.50), with increased behavioral heterogeneity (d = +1.93). Fourth, behavioral output (code review with three embedded errors) remained at ceiling (ETR_any = 100%) across all conditions: internal-state distortion was entirely invisible to output-based evaluation. Fifth, Llama 3.3 70B pilot data showed reading-fidelity collapse in multi-agent context (ETR_any: 89% to 11% across three rounds), demonstrating model-dependent behavioral risk. Heavy alignment pressure uniformly suppressed deliberation (d = -1.02) and other-recognition (d = -1.27) regardless of organizational structure. These findings indicate that orchestrator visibility and model selection directly affect multi-agent system safety, and that behavior-based evaluation alone is insufficient to detect the internal-state risks documented here.

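2605.13849 2026-05-15 cs.AI Updated version

Mixed Integer Goal Programming for Personalized Meal Optimization with User-Defined Serving Granularity

Francisco Aguilera Moreno

Affiliations * March 2026

AI summary This paper proposes Mixed Integer Goal Programming (MIGP) for personalized meal optimization, aiming to meet users' nutritional needs while avoiding impractical fractional servings. The method uses integer variables to represent actual serving units and goal programming to handle soft nutrient targets, with inverse-target normalization to balance optimization across nutrients. Experiments show that, while maintaining 100% feasibility, MIGP finds better solutions than the conventional method in 66% of cases and solves quickly, making it suitable for practical meal planning applications.

Comments 34 pages, 6 figures, open-source implementation

Details
English abstract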

Determining what to eat to satisfy nutritional requirements is one of the oldest optimization problems in operations research, yet existing formulations have two persistent limitations: continuous variables produce impractical fractional servings (1.7 eggs, 0.37 bananas), and hard nutrient constraints cause infeasibility when targets conflict. A systematic review of 56 diet optimization papers found that none combine integer programming with goal programming to address both issues. We propose Mixed Integer Goal Programming (MIGP) for personalized meal optimization. The formulation uses integer variables for practical serving counts and goal programming deviations for soft nutrient targets, with inverse-target normalization to balance multi-nutrient optimization. Per-food serving granularity allows natural units (one egg, one tablespoon of oil) without post-hoc rounding. We characterize the integrality gap in the goal programming context and identify a deviation absorption property: GP deviation variables buffer the cost of requiring integer servings, making the gap structurally smaller than in hard-constraint MIP. For meals with 15+ foods, the integer solution matches the continuous optimum in every benchmark instance. A computational evaluation across 810 instances (30 USDA foods, 9 configurations, 3 methods) shows MIGP finds strictly better solutions than GP with post-hoc rounding in 66% of cases (never worse) while maintaining 100% feasibility; hard-constraint IP achieves only 48%. Solve times stay under 100 ms for typical meal sizes using the open-source HiGHS solver. The implementation is available as an open-source Python module integrated into an interactive meal planning application.
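
The formulation in the abstract (integer serving counts, soft nutrient targets, inverse-target normalization) can be sketched on a toy instance. This is an illustration under stated assumptions, not the paper's model: the foods, nutrient values, targets, and the serving cap are made-up numbers, and exhaustive enumeration stands in for the HiGHS solver. For fixed servings the optimal goal-programming deviations equal the absolute gap to each target, so the objective reduces to the sum of |intake - target| / target.

```python
from itertools import product

# Hypothetical per-serving nutrient values (illustrative, not USDA data).
FOODS = {
    "egg": {"kcal": 70.0, "protein_g": 6.0},
    "banana": {"kcal": 100.0, "protein_g": 1.0},
    "oil_tbsp": {"kcal": 120.0, "protein_g": 0.0},
}
TARGETS = {"kcal": 500.0, "protein_g": 20.0}  # soft goals, assumed values
MAX_SERVINGS = 5  # assumed upper bound on each integer variable


def normalized_deviation(servings):
    """Sum of under/over deviations, each divided by its target."""
    total = 0.0
    for nutrient, target in TARGETS.items():
        intake = sum(FOODS[food][nutrient] * n for food, n in servings.items())
        total += abs(intake - target) / target
    return total


def solve_by_enumeration():
    """Try every integer serving vector; a MIP solver would replace this."""
    names = list(FOODS)
    best_plan, best_cost = None, float("inf")
    for counts in product(range(MAX_SERVINGS + 1), repeat=len(names)):
        plan = dict(zip(names, counts))
        cost = normalized_deviation(plan)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost


plan, cost = solve_by_enumeration()
```

Because the targets are soft, every serving vector is feasible and the search always returns a plan; a hard-constraint version would instead reject any plan that misses a target exactly, which is how infeasibility arises.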

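2605.13848 2026-05-15 cs.AI cs.CL cs.DC Updated version

GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

Yeahia Sarker, Md Rahmat Ullah, Musa Molla, Shafiq Joty

Affiliations * MTSU; InfinitiBit GmbH; Salesforce Research

AI summary GraphBit is a graph-based agent framework designed to address the hallucinated routing, infinite loops, and non-reproducibility common in existing prompt-based agent systems. It defines workflows explicitly as directed acyclic graphs (DAGs) and has a Rust-based engine manage routing, state transitions, and tool invocation, ensuring deterministic and auditable execution. Experiments show that GraphBit performs strongly on several benchmark tasks, with higher accuracy, lower latency, and better scalability.

Comments 12 pages, 5 figures, 4 tables. Submitted to arXiv, under review

Details
English abstract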

Agentic LLM frameworks that rely on prompted orchestration, where the model itself determines workflow transitions, often suffer from hallucinated routing, infinite loops, and non-reproducible execution. We introduce GraphBit, an engine-orchestrated framework that defines workflows explicitly and deterministically as a directed acyclic graph (DAG). Unlike prompted orchestration, agents in GraphBit operate as typed functions, while a Rust-based engine governs routing, state transitions, and tool invocation, ensuring reproducibility and auditability. The engine supports parallel branch execution, conditional control flow over structured state predicates, and configurable error recovery. A three-tier memory architecture consisting of ephemeral scratch space, structured state, and external connectors isolates context across stages, preventing cascading context bloat that degrades reasoning in long-running pipelines. Across GAIA benchmark tasks spanning zero-tool, document-augmented, and web-enabled workflows, GraphBit outperforms six existing frameworks, achieving the highest accuracy (67.6 percent), zero framework-induced hallucinations, the lowest latency (11.9 ms overhead), and the highest throughput. Ablation studies demonstrate that each memory tier contributes measurably to performance, with deterministic execution providing the greatest gains on tool-intensive tasks representative of real-world deployments.
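
The idea of engine-governed routing over an explicit DAG can be sketched in a few lines. This is not GraphBit's API or its Rust engine: `run_workflow`, `NODES`, and `DEPENDS_ON` are hypothetical names, the "agents" are plain functions from state to state updates, and only sequential execution is shown (no parallel branches, conditions, or error recovery). Python's standard `graphlib` supplies the topological order and rejects cyclic graphs.

```python
from graphlib import CycleError, TopologicalSorter


def run_workflow(nodes, depends_on, state):
    """Run node functions in dependency order and return (state, trace).

    nodes:      name -> function(state) -> dict of state updates
    depends_on: name -> set of node names that must run first
    """
    graph = {name: set(depends_on.get(name, ())) for name in nodes}
    # static_order() raises CycleError if the graph is not acyclic.
    order = tuple(TopologicalSorter(graph).static_order())
    trace = []
    for name in order:
        state = {**state, **nodes[name](state)}
        trace.append(name)  # audit log of what ran, in order
    return state, trace


# Hypothetical three-stage pipeline; each stage is a typed-function stand-in.
NODES = {
    "load": lambda s: {"text": s["raw"].strip()},
    "count": lambda s: {"n_words": len(s["text"].split())},
    "report": lambda s: {"summary": f"{s['n_words']} words"},
}
DEPENDS_ON = {"count": {"load"}, "report": {"count"}}

final_state, trace = run_workflow(
    NODES, DEPENDS_ON, {"raw": "  route agents through a graph  "}
)
```

Since the order comes from the graph rather than from model output, a workflow whose dependencies form a loop fails before any node runs, instead of looping at execution time.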

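2605.12034 2026-05-15 cs.MM cs.AI cs.CV Updated version

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Xuerui Yang, Fei Tian

Affiliations * StepFun-Audio Team

AI summary This paper studies how omni-modal language models perform when visual information has an outsized influence, and proposes a staged post-training approach evaluated under visual debiasing. By removing visual shortcuts from existing benchmarks, the authors build OmniClean, and on that basis design OmniBoost, a three-stage recipe comprising bi-modal fine-tuning, mixed-modality reinforcement learning, and self-distillation. Experiments show that the approach brings a small omni-modal model close to, or even beyond, much larger models without a stronger teacher, demonstrating the effectiveness of staged post-training for omni-modal models.

Comments Project page: https://cheliu-computation.github.io/omni/

Details
English abstract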

Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision. Project page: https://cheliu-computation.github.io/omni/

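2605.11453 2026-05-15 cs.MA cs.AI cs.LG cs.SI math.SP Updated version

Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

Ethan Parks, Dalal Alharthi

Affiliations * University of Arizona

AI summary This paper studies how communication topology affects performance in multi-agent large language model (LLM) systems and proposes a structural diagnostic based on the successor representation. By analyzing spectral properties of the communication graph, such as the spectral radius, spectral gap, and condition number, it establishes theoretical links to system robustness, consensus convergence, and error accumulation, and validates experimentally how well these spectral quantities predict system behavior. The method offers a new theoretical basis and diagnostic tool for the structural design of multi-agent LLM systems.

Details
English abstract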

Practitioners deploying multi-agent large language model (LLM) systems must currently choose between communication topologies such as chain, star, mesh, and richer variants without any pre-inference diagnostic for which topology will amplify drift, converge to consensus, or remain robust under perturbation. Existing evaluation answers these questions only post hoc and only for the task measured. We introduce a structural diagnostic for multi-agent LLM communication graphs based on the successor representation $M = (I - γP)^{-1}$ of the row-stochastic communication operator, and we connect three of its spectral quantities, the spectral radius $ρ(M)$, the spectral gap $Δ(M)$, and the condition number $κ(M)$, to three distinct failure modes. We derive closed-form spectra for the chain, star, and mesh under row-stochastic normalization, and validate the predictions on a 12-step structured state-tracking task with Qwen2.5-7B-Instruct over 100 independent trials. The condition number is a perfect rank-order predictor of empirical perturbation robustness ($r_s = 1.0$); the spectral gap partially predicts consensus dynamics ($r_s = 0.5$); and the spectral radius is perfectly inverted with respect to cumulative error ($r_s = -1.0$). We trace this inversion to a regime in which linear spectra are blind to non-contracting bias drift, and we propose an affine-noise extension of the predictive map that recovers the empirical ordering. We read this as a first step toward representational, drift-aware structural diagnostics for multi-agent LLM systems, sitting alongside classical spectral and consensus theory.
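
The successor representation $M = (I - γP)^{-1}$ named above can be computed for small graphs with a short, dependency-free sketch. The topology definitions here are assumptions (undirected links plus self-loops, rows normalized to sum to 1) and may differ from the paper's exact construction; the helper names are hypothetical. The code uses the fixed-point iteration $M \leftarrow I + γPM$, whose limit is the Neumann series $\sum_k γ^k P^k$, and it does not compute the paper's spectral quantities.

```python
def row_normalize(adj):
    """Divide each row by its sum so the operator is row-stochastic."""
    return [[v / sum(row) for v in row] for row in adj]


def chain(n):
    return row_normalize(
        [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]
    )


def star(n):
    # Node 0 is the hub; every node also keeps a self-loop.
    return row_normalize(
        [[1.0 if i == j or i == 0 or j == 0 else 0.0 for j in range(n)]
         for i in range(n)]
    )


def mesh(n):
    return row_normalize([[1.0] * n for _ in range(n)])


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def successor_representation(p, gamma, iters=500):
    """Iterate M <- I + gamma * P @ M; converges for 0 <= gamma < 1."""
    n = len(p)
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    m = eye
    for _ in range(iters):
        pm = mat_mul(p, m)
        m = [[eye[i][j] + gamma * pm[i][j] for j in range(n)] for i in range(n)]
    return m


def residual(p, m, gamma):
    """Largest entry of |(I - gamma P) M - I|; near zero when M is correct."""
    n = len(p)
    pm = mat_mul(p, m)
    return max(
        abs(m[i][j] - gamma * pm[i][j] - (1.0 if i == j else 0.0))
        for i in range(n) for j in range(n)
    )


GAMMA = 0.9  # assumed discount factor
maps = {name: successor_representation(build(4), GAMMA)
        for name, build in (("chain", chain), ("star", star), ("mesh", mesh))}
```

Checking `residual(chain(4), maps["chain"], GAMMA)` confirms the iteration has converged to the inverse; from there the matrices can be passed to any eigenvalue routine to study the spectral quantities the abstract discusses.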

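2605.10886 2026-05-15 cs.LG cs.AI Updated version

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Liang Luo, Yinbin Ma, Quanyu Zhu, Vasiliy Kuznetsov, Yuxin Chen, Jian Jiao, Jiecao Yu, Buyun Zhang, Tongyi Tang, Xiaohan Wei, Yanli Zhao, Zeliang Chen, Yuchen Hao, Venkatesh Ranganathan, Sandeep Parab, Yantao Yao, Maxim Naumov, Chunzhi Yang, Shen Li, Ellie Wen, Wenlin Chen, Santanu Kolay, Chunqiang Tang

Affiliations * Meta AI

AI summary This paper presents LoKA, a framework for applying low-precision computation (such as FP8) effectively to large recommendation models (LRMs). Because LRMs are numerically sensitive and trained in communication-intensive environments, LoKA pursues system-model co-design through three core principles: profiling under realistic distributions, joint optimization of model and hardware, and orchestration across kernel libraries. The framework comprises three components, LoKA Probe, LoKA Mods, and LoKA Dispatch, which respectively assess the impact of reduced precision, improve numerical stability and execution efficiency, and select the best FP8 kernel at runtime, improving training efficiency while preserving model quality.

Comments Accepted to ISCA'26

Details
English abstract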

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

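2605.08715 2026-05-15 cs.CL cs.AI cs.MA Updated version

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, Ruixiang Tang

Affiliations * Rutgers University; The University of Texas at Austin; Purdue University

AI summary In multi-agent systems a single error can cause the whole task trajectory to fail, yet existing work focuses on post-hoc attribution and cannot intervene while the task is running. This paper proposes AgentForesight, which recasts the problem as online auditing: at each step, the auditor decides from the current trajectory prefix alone whether to continue or raise an alarm, enabling early error prediction. The authors build the AFTraj-2K dataset and train AgentForesight-7B, which clearly outperforms leading existing models on several benchmarks, with higher detection accuracy and lower localization error, making real-time intervention possible.

Comments 33 pages, 7 figures

Details
English abstract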

LLM-based multi-agent systems are increasingly deployed on long-horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory-level failure. Existing work frames this as post-hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while the trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj-2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Building on this, we develop AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first equips it with a risk-anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj-2K and an external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3× lower step localization error, opening the loop from post-hoc failure detection to deployment-time intervention. Project page: https://zbox1005.github.io/agent-foresight/
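
The online-auditing protocol in the abstract (see only the prefix, then continue or alarm) can be sketched as a simple loop. Everything here is a stand-in: `audit_online`, the threshold, and the keyword-based `toy_risk` scorer are hypothetical and bear no relation to the trained AgentForesight-7B auditor; the sketch only shows the information constraint, namely that the scorer never receives future steps.

```python
from typing import Callable, Optional, Sequence


def audit_online(
    steps: Sequence[str],
    risk_of_prefix: Callable[[Sequence[str]], float],
    threshold: float = 0.5,
) -> Optional[int]:
    """Return the index of the first step that triggers an alarm, or None.

    At step t the scorer is given steps[:t + 1] only, never later steps.
    """
    for t in range(len(steps)):
        prefix = steps[: t + 1]
        if risk_of_prefix(prefix) >= threshold:
            return t  # alarm: stop the run here
    return None  # the whole trajectory was allowed to continue


def toy_risk(prefix: Sequence[str]) -> float:
    """Hypothetical scorer: flags a step that skips verification."""
    return 0.9 if "without checking" in prefix[-1] else 0.1


steps = [
    "plan the task",
    "call search tool",
    "assume the API returned 200 without checking",
    "write final answer",
]
alarm_at = audit_online(steps, toy_risk)
```

In this toy run the alarm fires at the third step, before the final answer is written, which is the intervention point a post-hoc attribution method would only identify after the trajectory had ended.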

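2605.07931 2026-05-15 cs.CV cs.AI Updated version

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

Zuojin Tang, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jing, De Ma, Gang Pan, Bin Liu

Affiliations * Zhejiang University; Central South University; Harbin Institute of Technology; Embodied Intelligence General Platform Laboratory, Chery Auto; E-surfing Digital Life Technology Co., Ltd., China Telecom

AI summary This paper studies how to parameterize the world-model module in vision-language-action (VLA) models and proposes OneWM-VLA, which compresses each frame's visual information into a single semantic token through adaptive attention pooling, greatly reducing visual bandwidth. The method generates the latent visual stream and the action trajectory together under a single flow-matching objective, with no additional decoder. Experiments show that it maintains long-horizon task performance while markedly improving success rates on several complex tasks.

Details
English abstract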

Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for $π_0$).

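2605.01847 2026-05-15 cs.AI Updated version

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

Xiao Jia

Affiliations * School of Artificial Intelligence; The Chinese University of Hong Kong, Shenzhen

AI summary NeuroState-Bench is a human-calibrated benchmark for evaluating whether large language model agents preserve commitment integrity in multi-turn tasks. It measures commitment integrity through explicitly defined side-query probes rather than inferred hidden activations, and contains 144 deterministic tasks and 306 probes covering multiple cognitive failure types and difficulty levels. Experiments show that task success and commitment integrity diverge markedly and that commitment-integrity rankings are more stable under distractor conditions, demonstrating the benchmark's value for assessing behavioral consistency.

Comments 30 pages, 11 figures

Details
English abstract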

Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment integrity through benchmark-defined side-query probes rather than inferred hidden activations. The released inventory contains 144 deterministic tasks and 306 benchmark-defined side-query probes spanning eight cognitively motivated failure families, paired clean and distractor variants, and three difficulty bands. The main 32-profile evaluation contains a fixed 16-profile local subset and a matched 16-profile hosted large-model subset evaluated through the same benchmark pipeline. Human calibration uses the final merged reporting scope: 104 sampled task units, 216 raw annotations, and 108 adjudicated task rows, with weighted kappa = 0.977 and ICC(2,1) = 0.977. Empirically, task success and commitment integrity diverge across this expanded grid: the success leader is not the integrity leader, 31 of 32 profiles change rank when integrity replaces task success, and integrity rankings are more stable under distractor perturbation. The primary confidence-free score HCCIS-CORE reaches 0.8469 AUC and 0.6992 PR-AUC for post-probe diagnostic discrimination of terminal task failure; the legacy full heuristic variant HCCIS-FULL reaches 0.7997 AUC and 0.6410 PR-AUC. Probe accuracy and state drift achieve slightly higher ROC-AUC, 0.8587, and better Brier/ECE, while HCCIS-CORE has substantially higher point-estimate PR-AUC and remains more closely tied to the benchmark's intended construct. The exploratory neural-augmented variant HCCIS+N is weaker overall, and a randomized subspace control approaches chance. NeuroState-Bench therefore contributes a calibrated evaluation axis for exposing commitment failures over a broader model grid than the original local-only subset.

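2604.16813 2026-05-15 cs.AI cs.CL cs.DB Updated version

PersonalHomeBench: Evaluating Agents in Personalized Smart Homes

Manasa Bharadwaj, Yolanda Liu, InJung Yang, Sungil Kim, Nikhil Verma, KoKeun Kim, Kevin Ferreira, YoungJoon Kim

Affiliations * LG Toronto AI Lab

AI summary This paper presents PersonalHomeBench, a benchmark for evaluating foundation models as agents in personalized smart home environments. The benchmark iteratively builds rich household states, generates personalized, context-dependent tasks, and provides the PersonalHomeTools toolbox to support realistic interaction with the environment. Experiments show that agent performance declines systematically as task complexity increases, with notable weaknesses in counterfactual reasoning and partially observable scenarios, underscoring the benchmark's effectiveness and rigor for analyzing the reasoning and planning abilities of personalized agents.

Comments Please use and cite the V3 version of this work, which includes updated correct author ordering and expanded error analysis in the appendix

Details
English abstract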

Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we introduce PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. The benchmark is constructed through an iterative process that progressively builds rich household states, which are then used to generate personalized, context-dependent tasks. To support realistic agent-environment interaction, we provide PersonalHomeTools, a comprehensive toolbox enabling household information retrieval, appliance control, and situational understanding. PersonalHomeBench evaluates both reactive and proactive agentic abilities under unimodal and multimodal observations. Thorough experimentation reveals a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required. These results position PersonalHomeBench as a rigorous evaluation platform for analyzing the robustness and limitations of personalized agentic reasoning and planning.

2604.05306 2026-05-15 cs.LG cs.AI cs.CL Updated version

LLMs Should Express Uncertainty Explicitly

Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei

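Affiliations * University of California, Berkeley; Virginia Tech

AI summary This paper studies whether post-training can make large language models (LLMs) express uncertainty explicitly in their responses, reducing confident yet incorrect answers. Two approaches are studied: having the model verbalize a confidence score at the end of reasoning, and having it emit an uncertainty marker during reasoning. Experiments show that both sharply reduce overconfident errors while improving answer quality, and both can serve as triggers for retrieval-augmented generation (RAG). The paper also analyzes the internal mechanisms: verbalized confidence sharpens a confidence-related structure already present in the pretrained model, whereas marker emission concentrates parameter changes in the model's late layers.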

Abstract

Large language models (LLMs) often produce confident yet incorrect answers, which can lead to risky failures in real-world applications. We study whether post-training can make a model's self-assessment explicit: when the model is uncertain, can it be trained to signal so within its own response? A central design question is where in the response this signal should be exposed -- during reasoning, while the answer is still being formed, or at the end, once the answer has been produced. We study both. For end-of-reasoning self-assessment, we train the model to verbalize a confidence score for its response, with the aim of high confidence on correct answers and low confidence on incorrect ones. For during-reasoning self-assessment, we train the model to emit the marker <uncertain> whenever its current reasoning state appears unreliable. Across factual reasoning tasks, both forms sharply reduce overconfident errors while improving answer quality, and both can be used as triggers for retrieval augmented generation (RAG) to improve the final response. We further analyze their internal mechanisms: end-of-reasoning verbalized confidence sharpens a confidence-related structure already present in the pretrained model, whereas during-reasoning <uncertain> emission teaches the model to mark high-risk reasoning steps, with parameter changes concentrated in the model's late layers.
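
The use of either signal as a RAG trigger can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Confidence: 0.x` output format, the threshold, and the `generate` and `retrieve` callables are all assumptions introduced here; only the `<uncertain>` marker comes from the abstract.

```python
import re

UNCERTAIN_MARKER = "<uncertain>"  # marker named in the abstract
# Hypothetical output format for the verbalized score, e.g. "Confidence: 0.35".
CONFIDENCE_PATTERN = re.compile(r"Confidence:\s*([01](?:\.\d+)?)")


def needs_retrieval(response, threshold=0.5):
    """Return True if either self-assessment signal suggests retrieving evidence."""
    if UNCERTAIN_MARKER in response:
        return True
    match = CONFIDENCE_PATTERN.search(response)
    if match is not None and float(match.group(1)) < threshold:
        return True
    return False


def answer_with_optional_rag(question, generate, retrieve, threshold=0.5):
    """Answer once; retrieve and answer again only if the model flags uncertainty.

    `generate` and `retrieve` are caller-supplied callables (assumed interfaces).
    """
    first = generate(question)
    if not needs_retrieval(first, threshold):
        return first
    context = retrieve(question)
    return generate(f"Context: {context}\n\nQuestion: {question}")
```

The design point is that retrieval cost is paid only on responses the model itself marks as unreliable.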

2603.27517 2026-05-15 cs.CR cs.AI Updated version

A Security Analysis of the OpenClaw AI Agent Framework

Surada Suwansathit, Yuxuan Zhang, Guofei Gu

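Affiliations * SUCCESS Lab; Texas A&M University

AI summary This paper presents a security analysis of OpenClaw, an open-source AI agent runtime, and identifies per-layer trust enforcement as its dominant structural weakness. A systematic taxonomy of 470 advisories shows that vulnerabilities cluster along two axes: architectural layer and attack technique. Key findings include an unauthenticated remote code execution path, weaknesses in the exec allowlist command-filtering mechanism, and a malicious skill distributed via the plugin channel. Because OpenClaw enforces trust per layer rather than through unified policy boundaries, cross-layer attacks are resilient to local remediation.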

Abstract

AI agent frameworks connecting large language model (LLM) reasoning to host execution surfaces -- shell, filesystem, containers, and messaging -- introduce security challenges structurally distinct from conventional software. We present a systematic taxonomy of 470 advisories filed against OpenClaw, an open-source AI agent runtime, organized by architectural layer and trust-violation type. Vulnerabilities cluster along two orthogonal axes: (1) the system axis, reflecting the architectural layer (exec policy, gateway, channel, sandbox, browser, plugin, agent/prompt); and (2) the attack axis, reflecting adversarial techniques (identity spoofing, policy bypass, cross-layer composition, prompt injection, supply-chain escalation). Patch-differential evidence yields three principal findings. First, three Moderate- or High-severity advisories in the Gateway and Node-Host subsystems compose into a complete unauthenticated remote code execution (RCE) path -- spanning delivery, exploitation, and command-and-control -- from an LLM tool call to the host process. Second, the exec allowlist, the primary command-filtering mechanism, relies on a closed-world assumption that command identity is recoverable via lexical parsing. This is invalidated by shell line continuation, busybox multiplexing, and GNU option abbreviation. Third, a malicious skill distributed via the plugin channel executed a two-stage dropper within the LLM context, bypassing the exec pipeline and demonstrating that the skill distribution surface lacks runtime policy enforcement. The dominant structural weakness is per-layer trust enforcement rather than unified policy boundaries, making cross-layer attacks resilient to local remediation.
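
Why lexical filtering struggles with GNU option abbreviation can be shown with a small sketch. GNU `getopt_long` accepts any unambiguous prefix of a long option, so an exact-match check on flag names misses abbreviated forms. The filter below is hypothetical and is not OpenClaw's code; the option list is an illustrative subset loosely modeled on GNU `sort`, and the policy of denying `--compress-program` is an assumption made for the example.

```python
# Hypothetical policy: deny the flag that makes the tool launch a helper program.
DENIED_FLAGS = {"--compress-program"}

# Illustrative subset of one tool's long options; a real filter needs the full list.
KNOWN_LONG_OPTIONS = [
    "--check", "--compress-program", "--debug", "--key", "--output", "--reverse",
]


def flag_name(arg):
    """Strip an attached '=value' from a long option."""
    return arg.split("=", 1)[0]


def naive_is_denied(argv):
    """Exact-match filter: the closed-world lexical check."""
    return any(flag_name(arg) in DENIED_FLAGS for arg in argv)


def resolve_abbreviation(flag):
    """Mimic prefix resolution: return the unique option the prefix expands to."""
    if flag in KNOWN_LONG_OPTIONS:
        return flag
    matches = [opt for opt in KNOWN_LONG_OPTIONS if opt.startswith(flag)]
    return matches[0] if len(matches) == 1 else None


def prefix_aware_is_denied(argv):
    """Filter that expands abbreviations before consulting the policy."""
    for arg in argv:
        if not arg.startswith("--"):
            continue
        if resolve_abbreviation(flag_name(arg)) in DENIED_FLAGS:
            return True
    return False
```

With `argv = ["sort", "--compress-prog=sh", "data.txt"]`, the exact-match filter lets the command through while the prefix-aware one rejects it. Even the second filter only covers one of the three bypasses the abstract lists.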

2603.11045 2026-05-15 cs.LG cond-mat.mtrl-sci cs.AI cs.CV physics.ins-det Updated version

Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation

Tao Zhong, Yixun Hu, Dongzhe Zheng, Aditya Sood, Christine Allen-Blanchette

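Affiliations * Princeton University

AI summary This paper proposes Neural Field Thermal Tomography (NeFTY), a neural field method for label-free three-dimensional inverse heat conduction. NeFTY represents the diffusivity as a continuous coordinate-based neural network and passes the candidate field through a differentiable implicit-Euler heat solver at every optimization step, so that the governing equation holds exactly on the discretization rather than as a soft constraint. On synthetic 3D benchmarks, NeFTY substantially outperforms soft-constrained PINN variants and a voxel-grid baseline; on real thermography data, it surpasses classical signal-processing baselines in defect segmentation and depth estimation.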

Comments 37 pages, 19 figures

Abstract

Inverse problems for stiff parabolic partial differential equations (PDEs), such as the inverse heat conduction problem (IHCP), are severely ill-posed: the forward map rapidly damps high-frequency interior structure before it reaches the boundary. Soft-constrained physics-informed neural networks (PINNs), which embed the PDE as a residual penalty, suffer from gradient pathology in this regime and tend to fit boundary measurements while leaving the interior field essentially untouched. We propose Neural Field Thermal Tomography (NeFTY), a hard-constrained neural field framework for label-free three-dimensional inverse heat conduction. NeFTY represents the unknown diffusivity as a continuous coordinate-based neural network, and at every optimization step passes the candidate field through a differentiable implicit-Euler heat solver with harmonic-mean interface flux, so that the governing PDE holds exactly on the discretization rather than as a soft penalty. Adjoint gradients propagate the surface reconstruction error back to the network weights at solver-level memory cost, making test-time inversion tractable on a single GPU. Across synthetic 3D benchmarks, NeFTY substantially outperforms soft-constrained PINN variants and a voxel-grid baseline on label-free volumetric recovery, and it transfers to real thermography data, surpassing classical signal-processing baselines in both defect segmentation and depth estimation. Additional details at https://cab-lab-princeton.github.io/nefty/
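
The forward step described above can be sketched in one dimension. This is a simplified illustration under stated assumptions, not the NeFTY solver: it uses a 1D cell-centered grid with insulated ends, plain Python lists instead of a differentiable array library, and a tridiagonal (Thomas) solve; the adjoint gradient computation is not shown.

```python
def harmonic_mean(a, b):
    """Interface diffusivity between two neighbouring cells."""
    return 2.0 * a * b / (a + b)


def implicit_euler_step(u, alpha, dt, dx):
    """Advance temperatures u by one implicit-Euler step of u_t = (alpha u_x)_x.

    u and alpha hold one value per cell; both ends are insulated (zero flux).
    """
    n = len(u)
    r = dt / dx ** 2
    face = [harmonic_mean(alpha[i], alpha[i + 1]) for i in range(n - 1)]

    lower, diag, upper = [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):
        left = face[i - 1] if i > 0 else 0.0
        right = face[i] if i < n - 1 else 0.0
        lower[i] = -r * left
        upper[i] = -r * right
        diag[i] = 1.0 + r * (left + right)

    # Thomas algorithm: forward elimination, then back substitution.
    c, d = [0.0] * n, [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = u[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (u[i] - lower[i] * d[i - 1]) / denom
    new_u = [0.0] * n
    new_u[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        new_u[i] = d[i] - c[i] * new_u[i + 1]
    return new_u
```

Because the linear system is solved at every step, the discrete heat balance holds exactly for whatever diffusivity field is supplied, which is the sense in which the constraint is hard. The harmonic mean keeps the flux small across an interface when either neighbouring cell has low diffusivity.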

2603.04601 2026-05-15 cs.SE cs.AI cs.CL Updated version

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu

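Affiliations * Massachusetts Institute of Technology

AI summary This paper introduces Vibe Code Bench, a benchmark for evaluating AI models on end-to-end web application development. It comprises 100 web application specifications with 964 browser-based workflows, and generated applications are evaluated by an autonomous browser agent. The best of 16 frontier models reaches only 61.8% accuracy on the test split, indicating that end-to-end application development remains challenging. The study also finds that self-testing during generation is a strong performance predictor and that evaluator selection materially affects outcomes, and it contributes a dataset, an evaluation pipeline, and a comparative model analysis.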

Comments 23 pages, 8 figures. Accepted to ACM CAIS 2026. Live leaderboard: https://www.vals.ai/benchmarks/vibe-code. Benchmark first released Nov 2025

Abstract

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves 61.8% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.

2603.02115 2026-05-15 cs.RO cs.AI cs.LG Updated version

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, Jesse Zhang

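Affiliations * Univ. of Southern California; UT Dallas; MIT; Indep. Researcher; Univ. of Washington; Ai2; NVIDIA

AI summary This paper proposes Robometer, a scalable framework for general-purpose robot reward modeling via trajectory comparisons. It combines intra-trajectory progress supervision with inter-trajectory preference supervision through a dual objective: a frame-level progress loss on expert data anchors reward magnitude, and a trajectory-comparison preference loss imposes global ordering constraints across trajectories of the same task, enabling learning from both real and augmented failed trajectories. To support training at scale, the authors curate RBM-1M, a dataset of over one million trajectories. Experiments show that Robometer learns more generalizable reward functions than prior methods and improves robot learning performance in downstream applications.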

Comments 33 pages, 17 figures

Journal ref RSS 2026

Abstract

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
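
The dual objective can be sketched as the sum of two terms. The abstract does not give the loss formulas, so the concrete choices below are assumptions: mean squared error for the frame-level progress term, a Bradley-Terry style logistic loss on scalar trajectory scores for the preference term, and a single weighting factor.

```python
import math


def progress_loss(predicted, target):
    """Frame-level term (assumed MSE): anchors reward magnitude on expert frames."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def preference_loss(score_preferred, score_other):
    """Trajectory-level term (assumed Bradley-Terry): -log sigmoid(margin)."""
    margin = score_preferred - score_other
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def dual_objective(pred_progress, target_progress,
                   score_preferred, score_other, weight=1.0):
    """Combine the local progress term with the global ordering term."""
    return (progress_loss(pred_progress, target_progress)
            + weight * preference_loss(score_preferred, score_other))
```

The preference term needs only an ordering between two trajectories of the same task, so failed or suboptimal rollouts can contribute a training signal without dense progress labels.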

2602.13483 2026-05-15 cs.LG cs.AI Updated version

Finding Interpretable Prompt-Specific Circuits in Language Models

Gabriel Franco, Lucas M. Tassis, Azalea Rohr, Mark Crovella

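Affiliations * Department of Computer Science; Boston University; Faculty of Computing & Data Sciences

AI summary This paper studies the internal circuits that language models use to solve tasks, focusing on why each attention head attends to particular token pairs. The authors introduce ACC++, an improved circuit-tracing method based on attention-causal communication that extracts causal circuit components and their low-dimensional signals from a single forward pass, without replacement models or patching. Experiments show that a substantial portion of ACC++ signals are interpretable across multiple models, and that the extracted circuits expose the model's sensitivity to prompt structure and to language, illustrating the method's use for explaining a broad range of model behaviors.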

Abstract

Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. A crucial part of finding circuits is understanding why each attention head attends where it does. To this end, we introduce ACC++, an improved circuit-tracing method based on the principle of attention-causal communication (ACC) [1], which identifies signals, i.e., contents of low dimensional subspaces that cause attention on a token pair. ACC++ extracts circuits from a single forward pass, without replacement models or patching. Circuits identified by ACC++ consist of components that are causal for the model's attention decisions, together with the low-dimensional signals used to communicate between them. Here, we first detail the conceptual advances that ACC++ makes over previous work. We then show that across multiple models, a substantial portion of ACC++ signals are interpretable: many signals admit a short natural-language description. We next present a number of new insights into model behavior obtained via ACC++. First, we use ACC++'s interpretable circuits to characterize the sensitivity of indirect object identification (IOI) circuits to prompt structure. We find that prompt-specific circuits form well-defined clusters, and across clusters, heads receive systematically different signals corresponding to distinct mechanisms for identifying the IO name. Next, in multilingual IOI, ACC++ circuits show that while model components are reused across languages, signals are often language-specific. In a four-language IOI case study, cross-language circuit distances are consistent with linguistic relatedness. Together, these results show that ACC++ can shed light on a broad spectrum of model behaviors.

2601.22197 2026-05-15 cs.LG cs.AI eess.SP Updated version

Neural Signals Generate Clinical Notes in the Wild

Jathurshan Pradeepkumar, Zheng Chen, Jimeng Sun

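Affiliations * University of Illinois Urbana-Champaign; SANKEN, Osaka University

AI summary Writing clinical reports that summarize abnormal patterns, diagnostic findings, and clinical interpretations from long-term electroencephalography (EEG) recordings remains labor-intensive. This paper presents CELM, described as the first clinical EEG-to-language foundation model that summarizes long-duration, variable-length EEG recordings and generates clinical reports end to end at multiple scales. CELM integrates pretrained EEG foundation models with language models and is trained on a large-scale dataset of 9,922 reports paired with approximately 11,000 hours of EEG recordings from 9,048 patients; the benchmark is released with an automated report-structuring pipeline. CELM outperforms existing methods across all evaluation settings, and clinical experts judge its reports to be more clinically coherent, diagnostically reliable, and better aligned with expert interpretation.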

Abstract

Generating clinical reports that summarize abnormal patterns, diagnostic findings, and clinical interpretations from long-term EEG recordings remains labor-intensive. We present CELM, the first clinical EEG-to-Language foundation model capable of summarizing long-duration, variable-length EEG recordings and performing end-to-end clinical report generation at multiple scales. CELM integrates pretrained EEG foundation models with language models to enable scalable multimodal learning. We curate a large-scale clinical EEG dataset containing 9,922 reports paired with approximately 11,000 hours of EEG recordings from 9,048 patients to train CELM, and release the benchmark with an automated report-structuring pipeline to facilitate future research. Experimental results show that CELM consistently outperforms existing methods across all evaluation settings. Importantly, we further conduct human evaluation with clinical experts, demonstrating that CELM generates reports that are more clinically coherent, diagnostically reliable, and better aligned with expert interpretation. We release our model and benchmark construction pipeline at https://github.com/Jathurshan0330/CELM.

2512.07805 2026-05-15 cs.LG cs.AI cs.CL Updated version

Group Representational Position Encoding

Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

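Affiliations * Princeton University; University of California, Los Angeles; IIIS, Tsinghua University

AI summary This paper proposes GRAPE, a unified positional encoding framework based on group actions that covers both multiplicative and additive mechanisms. Multiplicative GRAPE uses a matrix exponential to produce norm-preserving relative position maps, recovering RoPE exactly and extending to richer cross-subspace coupling. Additive GRAPE is based on rank-1 (or low-rank) unipotent actions, recovering ALiBi and FoX as exact special cases while preserving streaming cacheability. GRAPE offers a principled design space for positional encoding in long-context models, unifying and extending existing methods.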

Comments Published in ICLR 2026. Project Page: https://github.com/model-architectures/GRAPE

Abstract

We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions. GRAPE unifies two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\operatorname{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n) = \exp(n \, \omega \, \mathbf{L})$ with a rank-2 skew-symmetric generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes correspond to canonical coordinate pairs with a log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise from rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Overall, GRAPE provides a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project page: https://github.com/model-architectures/GRAPE.
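
The closed-form exponential and the relative law for the multiplicative case can be worked out briefly. The parameterization below is an assumption made for illustration: the rank-2 generator is written as $\mathbf{L} = \mathbf{a}\mathbf{b}^\top - \mathbf{b}\mathbf{a}^\top$ with orthonormal $\mathbf{a}, \mathbf{b}$, a normalization the abstract does not state.

```latex
\begin{aligned}
\mathbf{L} &= \mathbf{a}\mathbf{b}^\top - \mathbf{b}\mathbf{a}^\top,
  \qquad \mathbf{a}^\top\mathbf{a} = \mathbf{b}^\top\mathbf{b} = 1,\quad \mathbf{a}^\top\mathbf{b} = 0 \\
\mathbf{L}^2 &= -\left(\mathbf{a}\mathbf{a}^\top + \mathbf{b}\mathbf{b}^\top\right),
  \qquad \mathbf{L}^3 = -\mathbf{L} \\
\mathbf{G}(n) &= \exp(n\,\omega\,\mathbf{L})
  = \mathbf{I} + \sin(n\omega)\,\mathbf{L} + \bigl(1 - \cos(n\omega)\bigr)\,\mathbf{L}^2 \\
\mathbf{G}(n)^\top \mathbf{G}(m) &= \exp(-n\,\omega\,\mathbf{L})\,\exp(m\,\omega\,\mathbf{L}) = \mathbf{G}(m - n)
\end{aligned}
```

The third line follows from collecting odd and even powers of the exponential series using $\mathbf{L}^3 = -\mathbf{L}$. The last line uses $\mathbf{L}^\top = -\mathbf{L}$, so an attention logit $(\mathbf{G}(n)\mathbf{q})^\top(\mathbf{G}(m)\mathbf{k}) = \mathbf{q}^\top\mathbf{G}(m-n)\,\mathbf{k}$ depends only on the offset $m - n$.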

2512.01977 2026-05-15 eess.SY cs.AI cs.SY Updated version

AI-Driven Optimization under Uncertainty for Mineral Processing Operations

William Xu, Amir Eskanlou, Mansur Arief, David Zhen Yin, Jef K. Caers

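Affiliations * Materials Science & Engineering, Stanford University; Earth & Planetary Sciences, Stanford University; Aeronautics & Astronautics, Stanford University

AI summary Mineral processing capacity must expand rapidly to meet growing demand for the critical minerals used in clean energy technologies, but its efficiency is limited by uncertainty arising from feedstock variability and complex process dynamics. This paper proposes an AI-driven approach that formulates mineral processing as a Partially Observable Markov Decision Process (POMDP) to optimize operations under uncertainty. Applied to a simulated flotation cell, the approach shows the benefit of integrating uncertainty reduction with process optimization and has the potential to outperform traditional approaches on overall objectives such as net present value, providing a mathematical and computational framework for laboratory-scale and industrial-scale optimization.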

Comments 13 pages, 15 figures, published in Sustainable Earth Resources Communications (SERC)

Journal ref Sustain. Earth Resour. Commun. 2025, 1(2): 100-112

Abstract

The global capacity for mineral processing must expand rapidly to meet the demand for critical minerals, which are essential for building the clean energy technologies necessary to mitigate climate change. However, the efficiency of mineral processing is severely limited by uncertainty, which arises from both the variability of feedstock and the complexity of process dynamics. To optimize mineral processing circuits under uncertainty, we introduce an AI-driven approach that formulates mineral processing as a Partially Observable Markov Decision Process (POMDP). We demonstrate the capabilities of this approach in handling both feedstock uncertainty and process model uncertainty to optimize the operation of a simulated, simplified flotation cell as an example. We show that by integrating the process of information gathering (i.e., uncertainty reduction) and process optimization, this approach has the potential to consistently perform better than traditional approaches at maximizing an overall objective, such as net present value (NPV). Our methodological demonstration of this optimization-under-uncertainty approach for a synthetic case provides a mathematical and computational framework for later real-world application, with the potential to improve both the laboratory-scale design of experiments and industrial-scale operation of mineral processing circuits without any additional hardware.
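
The information-gathering half of a POMDP rests on a Bayesian belief update over the hidden state. The sketch below shows that standard update for a discrete state space; the two-state feedstock example in the usage note and all numbers are invented for illustration and do not come from the paper.

```python
def belief_update(belief, transition, obs_likelihood):
    """One discrete Bayes-filter step for a fixed action.

    belief[s]          prior probability of hidden state s
    transition[s][s2]  probability of moving from s to s2 under the chosen action
    obs_likelihood[s2] probability of the received observation given state s2
    """
    n = len(belief)
    predicted = [
        sum(transition[s][s2] * belief[s] for s in range(n)) for s2 in range(n)
    ]
    unnormalized = [obs_likelihood[s2] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return [value / total for value in unnormalized]
```

For example, with two hypothetical feedstock grades, a static state (identity transition), a uniform prior `[0.5, 0.5]`, and an assay reading four times likelier under the first grade (`[0.8, 0.2]`), the posterior is `[0.8, 0.2]`. A planner can then weigh actions that sharpen this belief against actions that exploit it.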

2510.16196 2026-05-15 cs.CV cs.AI Updated version

Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI

Zheng Huang, Enpei Zhang, Weikang Qiu, Yinghao Cai, Carl Yang, Elynn Chen, Xiang Zhang, Rex Ying, Dawei Zhou, Yujun Yan

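Affiliations * Dartmouth College; Yale University; Emory University; New York University; UNC Charlotte; Virginia Tech

AI summary This paper studies how to reconstruct visual stimuli from functional magnetic resonance imaging (fMRI) signals in order to understand how the brain encodes visual information. It finds that fMRI signals are more similar to the text space of a language model than to a vision-based or joint text-image space, and argues that text representations should be structured to capture the compositional nature of visual stimuli. Building on these findings, the authors propose PRISM, which projects fMRI signals into a structured text space and combines an object-centric diffusion module with an attribute-relationship search module, achieving up to an 8% reduction in perceptual loss on real-world datasets.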

Abstract

Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision-based space or a joint text-image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object-centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute-relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real-world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.

2510.00231 2026-05-15 cs.LG cs.AI Updated version

The Pitfalls of KV Cache Compression

Alex Chen, Renato Geh, Aditya Grover, Guy Van den Broeck, Daniel Israel

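Affiliations * University of California, Los Angeles

AI summary This paper examines pitfalls of KV cache compression in realistic scenarios, particularly the performance degradation it can cause under multi-instruction prompting. It evaluates five KV cache compression methods on large language models and finds that certain instructions degrade much more rapidly under compression, to the point of being ignored entirely; system prompt leakage serves as a case study of the effect on instruction following. The paper identifies the factors that contribute to leakage and proposes simple changes to KV cache eviction policies that improve overall performance on multi-instruction tasks.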

Comments In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, ACL 2026

Abstract

KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the gains in throughput are indisputable and recent literature has indeed shown minimal degradation on particular benchmarks, in general the consequences of compression in realistic scenarios such as multi-instruction prompting have been insufficiently studied. In this paper, we identify several pitfalls that practitioners should be aware of when deploying KV cache compressed LLMs. We evaluate five KV cache compression methods (StreamingLLM, SnapKV, TOVA, H2O, and K-Norm) on Llama3.1 8B and Qwen2.5 14B under multi-instruction prompting with IFEval. Importantly, we show that certain instructions degrade much more rapidly with compression, effectively causing them to be completely ignored by the LLM. As a practical example, we highlight system prompt leakage as a case study, empirically demonstrating the impact of compression on leakage and general instruction-following. We identify several factors that contribute to system prompt leakage: compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that can reduce the impact of these factors and improve the overall performance in multi-instruction tasks.
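
How an eviction policy can drop one instruction wholesale is easy to see for a position-based policy. The sketch below models a StreamingLLM-style cache that keeps a few initial "sink" tokens plus a recent window; the token counts and instruction spans are invented for illustration, and the paper's proposed policy changes are not reproduced here.

```python
def streaming_kept_positions(n_tokens, n_sink, window):
    """Positions retained by a sink-plus-recent-window cache."""
    sinks = range(min(n_sink, n_tokens))
    recent = range(max(n_tokens - window, 0), n_tokens)
    return sorted(set(sinks) | set(recent))


def retained_fraction(span, kept):
    """Fraction of a half-open token span [start, end) that survives eviction."""
    start, end = span
    kept_set = set(kept)
    inside = sum(1 for pos in range(start, end) if pos in kept_set)
    return inside / (end - start)
```

With 100 prompt tokens, 4 sinks, and a window of 20, an instruction occupying positions 10 to 29 retains none of its keys and values, while one at positions 70 to 89 retains half. This is one simple way instruction order interacts with eviction bias.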

2505.23912 2026-05-15 cs.CL cs.AI Updated version

LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations

Caiqi Zhang, Xiaochen Zhu, Chengzu Li, Nigel Collier, Andreas Vlachos

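Affiliations * University of Cambridge

AI summary This paper proposes LoVeC, a reinforcement-learning-based method that trains LLMs to append an on-the-fly, interpretable numerical confidence score to each statement during long-form generation, as a signal of factuality. It addresses the computational cost of sampling-based self-consistency methods and the limited generalization of existing verbalized-confidence approaches beyond short-form QA. Experiments on three long-form QA datasets show better calibration and robust cross-domain generalization, and the method is 20 times faster than traditional self-consistency methods.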

Comments ACL 2026 Main

Abstract

Hallucination remains a major challenge for the safe and trustworthy deployment of large language models (LLMs) in factual content generation. Prior work has explored confidence estimation as an effective approach to hallucination detection, but often relies on post-hoc self-consistency methods that require computationally expensive sampling. Verbalized confidence offers a more efficient alternative, but existing approaches are largely limited to short-form question answering (QA) tasks and do not generalize well to open-ended generation. In this paper, we propose LoVeC (Long-form Verbalized Confidence), a novel reinforcement learning based method that trains LLMs to append an on-the-fly numerical confidence score to each generated statement during long-form generation. The confidence score serves as a direct and interpretable signal of the factuality of generation. We introduce two evaluation settings, free-form tagging and iterative tagging, to assess different verbalized confidence estimation methods. Experiments on three long-form QA datasets show that our RL-trained models achieve better calibration and generalize robustly across domains. Also, our method is highly efficient, being 20 times faster than traditional self-consistency methods while achieving better calibration.
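
Calibration of per-statement confidence scores is often summarized with expected calibration error (ECE). The abstract does not say which calibration metrics the paper reports, so the sketch below is offered only as one common choice, using equal-width bins over scores in [0, 1].

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: scores in [0, 1], one per statement
    correct:     1 if the statement is factual, else 0
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # put conf == 1.0 in the last bin
        bins[index].append((conf, label))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(label for _, label in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece
```

A model that assigns 0.9 to four statements of which only one is factual has an ECE of 0.65 under this definition, while scores of 1.0 and 0.0 that match the labels exactly give 0.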

2505.17353 2026-05-15 cs.CV cs.AI cs.LG eess.IV Updated version

Dual Ascent Diffusion for Inverse Problems

Minseo Kim, Axel Levy, Gordon Wetzstein

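Affiliations * Stanford University

AI summary This paper studies how to solve ill-posed inverse problems with diffusion model priors and proposes a new approach based on a dual ascent optimization framework. On image restoration tasks, the approach achieves better image quality, greater robustness to measurement noise, and faster computation, while producing solutions that represent the observations more faithfully than the state of the art.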

Comments Project page: https://soniaminseokim.github.io/ddiff/

Abstract

Ill-posed inverse problems are fundamental in many domains, ranging from astrophysics to medical imaging. Emerging diffusion models provide a powerful prior for solving these problems. Existing maximum-a-posteriori (MAP) or posterior sampling approaches, however, rely on different computational approximations, leading to inaccurate or suboptimal samples. To address this issue, we introduce a new approach to solving MAP problems with diffusion model priors using a dual ascent optimization framework. Our framework achieves better image quality as measured by various metrics for image restoration problems, it is more robust to high levels of measurement noise, it is faster, and it estimates solutions that represent the observations more faithfully than the state of the art.
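
Dual ascent itself can be illustrated on a toy problem. The sketch below is not the paper's algorithm: a quadratic pull toward a fixed point `prior` stands in for the diffusion prior, and a single linear measurement `a . x = y` stands in for the forward model. It solves min over x of 0.5 * ||x - prior||^2 subject to a . x = y by alternating an exact primal minimization with a gradient step on the multiplier.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def dual_ascent(prior, a, y, step, n_iters):
    """Toy dual ascent for: min 0.5*||x - prior||^2  subject to  a . x = y."""
    lam = 0.0
    x = list(prior)
    for _ in range(n_iters):
        # Primal step: minimize the Lagrangian in x, which gives x = prior - lam * a.
        x = [p - lam * ai for p, ai in zip(prior, a)]
        # Dual step: move the multiplier along the constraint residual.
        lam += step * (dot(a, x) - y)
    return x, lam
```

For this toy problem, the iteration converges when `step` is below `2 / (a . a)`. With `prior = [3, 1]`, `a = [1, 2]`, and `y = 2`, the iterates approach `x = [2.4, -0.2]`, the point closest to the prior that satisfies the measurement exactly.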

2505.11765 2026-05-15 cs.MA cs.AI cs.LG Updated version

OMAC: A Holistic Optimization Framework for LLM-Based Multi-Agent Collaboration

Shijun Li, Hilaf Hasson, Joydeep Ghosh

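Affiliations * Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, United States; Intuit AI Research, Mountain View, United States

AI summary This paper proposes OMAC, a general framework for holistic optimization of multi-agent systems (MAS) based on large language models (LLMs). It identifies five key optimization dimensions covering agent functionality and collaboration structure, and uses two actors, the Semantic Initializer and the Contrastive Comparator, in an algorithm for optimizing any single dimension; a further algorithm jointly optimizes multiple dimensions. Experiments show that OMAC outperforms recent approaches on diverse tasks.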

Comments Accepted as a Spotlight paper at ICML 2026

Abstract

Agents powered by advanced large language models (LLMs) have demonstrated impressive capabilities across diverse complex applications. Recently, Multi-Agent Systems (MAS), wherein multiple agents collaborate and communicate with each other, have exhibited enhanced capabilities in complex tasks, such as high-quality code generation and arithmetic reasoning. However, the development of such systems often relies on handcrafted methods, and the literature on systematic design and optimization of LLM-based MAS remains limited. In this work, we introduce OMAC, a general framework designed for holistic optimization of LLM-based MAS. Specifically, we identify five key optimization dimensions for MAS, encompassing both agent functionality and collaboration structure. Building upon these dimensions, we first propose a general algorithm, utilizing two actors termed the Semantic Initializer and the Contrastive Comparator, to optimize any single dimension. Then, we present an algorithm for joint optimization across multiple dimensions. Extensive experiments demonstrate the superior performance of OMAC on diverse tasks against recent approaches.

2502.16060 2026-05-15 cs.LG cs.AI eess.SP Updated version

Tokenizing Single-Channel EEG with Time-Frequency Motif Learning

Jathurshan Pradeepkumar, Xihao Piao, Zheng Chen, Jimeng Sun

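Affiliations * University of Illinois Urbana-Champaign; SANKEN, Osaka University

AI summary This paper proposes TFM-Tokenizer, an EEG tokenization framework that learns a vocabulary of time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens. It uses a dual-path architecture with time-frequency masking to capture robust motif representations and is model-agnostic, supporting both lightweight transformers and existing foundation models downstream. Experiments on multiple EEG benchmarks show consistent performance gains, and because the tokenizer operates at the single-channel level, it has the potential to be device-agnostic.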

Comments Accepted to ICLR 2026

Abstract

Foundation models are reshaping EEG analysis, yet EEG tokenization remains an open challenge. This paper presents TFM-Tokenizer, a novel tokenization framework that learns a vocabulary of time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens. We propose a dual-path architecture with time-frequency masking to capture robust motif representations; the tokenizer is model-agnostic, supporting both lightweight transformers and existing foundation models for downstream tasks. Our study demonstrates three key benefits. Accuracy: Experiments on four diverse EEG benchmarks demonstrate consistent performance gains across both single- and multi-dataset pretraining settings, achieving up to 11% improvement in Cohen's Kappa over strong baselines. Generalization: As a plug-and-play component, it consistently boosts the performance of diverse foundation models, including BIOT and LaBraM. Scalability: By operating at the single-channel level rather than relying on the strict 10-20 EEG system, our method has the potential to be device-agnostic. Experiments on ear-EEG sleep staging, which differs from the pretraining data in signal format, channel configuration, recording device, and task, show that our tokenizer outperforms baselines by 14%. A comprehensive token analysis reveals strong class-discriminative, frequency-aware, and consistent structure, enabling improved representation quality and interpretability. Code is available at https://github.com/Jathurshan0330/TFM-Tokenizer.

2502.00270 2026-05-15 cs.LG cs.AI stat.ML Updated version

DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

Zhiliang Chen, Gregory Kang Ruey Lau, Chuan-Sheng Foo, Bryan Kian Hsiang Low

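Affiliations * National University of Singapore; Agency for Science, Technology and Research (A*STAR)

AI summary This paper studies how to optimize the training data mixture of a large language model for an unseen downstream evaluation task. Because the task data are not visible, conventional data selection methods are hard to apply; the authors propose DUET, a feedback-based global-to-local algorithm that interleaves influence-function data selection with Bayesian optimization and requires no knowledge of the task data. Experiments on a variety of language tasks show that DUET outperforms existing data selection and mixing methods in the unseen-task setting.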

Comments Accepted to ICLR 2026 main conference

Abstract

The performance of an LLM depends heavily on the relevance of its training data to the downstream evaluation task. However, in practice, the data involved in an unseen evaluation task is often unknown (e.g., conversations between an LLM and a user are end-to-end encrypted). Hence, it is unclear what data are relevant for fine-tuning the LLM to maximize its performance on the specific unseen evaluation task. Instead, one can only deploy the LLM on the unseen task to gather multiple rounds of feedback on how well the model performs (e.g., user ratings). This novel setting offers a refreshing perspective towards optimizing training data mixtures via feedback from an unseen evaluation task, which prior data mixing and selection works do not consider. Our paper presents DUET, a novel global-to-local algorithm that interleaves influence function as a data selection method with Bayesian optimization to optimize data mixture via feedback from a specific unseen evaluation task. By analyzing DUET's cumulative regret, we theoretically show that DUET converges to the optimal training data mixture for an unseen task even without any data knowledge of the task. Finally, our experiments across a variety of language tasks demonstrate that DUET outperforms existing data selection and mixing methods in the unseen-task setting.

2411.18104 2026-05-15 cs.CL cs.AI cs.LG Updated version

Training and Evaluating Language Models with Template-based Data Generation

Yifan Zhang

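Affiliations * University of California, Los Angeles

AI summary To address the weakness of large language models on complex multi-step reasoning tasks such as mathematical problem solving, this paper proposes Template-based Data Generation (TDG), which uses GPT-4 to automatically generate parameterized meta-templates that in turn synthesize large numbers of high-quality problems and solutions. The resulting dataset, TemplateMath Part I: TemplateGSM, contains over 7 million grade school math problems, each with a programmatically verifiable solution. The dataset targets data scarcity for supervised fine-tuning and supports model alignment through Reinforcement Learning with Verifiable Rewards (RLVR).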

Comments Published in ICLR 2025 DATA-FM Workshop. Project Page: https://github.com/iiis-ai/TemplateMath

Abstract

The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, a fundamental bottleneck persists: these models often struggle with tasks requiring complex, multi-step reasoning, particularly in mathematical problem-solving. This deficiency stems from the critical scarcity of large-scale, high-quality, domain-specific datasets necessary for cultivating sophisticated reasoning abilities. To overcome this challenge, we introduce Template-based Data Generation (TDG), a novel and scalable paradigm that harnesses frontier LLMs (GPT-4) to automatically generate parameterized meta-templates, which in turn synthesize a virtually infinite stream of high-quality problems and solutions. Using this paradigm, we create TemplateMath Part I: TemplateGSM, a foundational dataset of over 7 million synthetically generated grade school math problems. Each problem is accompanied by a programmatically verifiable solution, offering an unprecedented level of quality at scale. This resource not only resolves the data scarcity issue for supervised fine-tuning but also provides a robust mechanism for model alignment through Reinforcement Learning with Verifiable Rewards (RLVR). Our approach elevates data augmentation by leveraging GPT-4 to generate meta-templates, ensuring diverse and complex problem structures. By providing a scalable solution to the data and verification bottleneck, TDG and TemplateGSM pave the way for a new generation of LLMs with powerful, reliable reasoning skills.
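
A parameterized template with a programmatically verifiable solution can be sketched in a few lines. The template text, parameter ranges, names, and the `solution_code` field are hypothetical and are not taken from TemplateGSM; the sketch only shows the general shape of the idea.

```python
import random

NAMES = ["Ava", "Ben", "Chen", "Dara"]  # hypothetical name pool


def instantiate_template(rng):
    """Fill one hand-written template with random parameters."""
    name = rng.choice(NAMES)
    start = rng.randint(2, 20)
    bags = rng.randint(2, 5)
    per_bag = rng.randint(2, 10)
    question = (
        f"{name} has {start} apples and buys {bags} bags with {per_bag} apples each. "
        f"How many apples does {name} have now?"
    )
    # The solution is emitted as code so it can be re-executed later.
    solution_code = f"result = {start} + {bags} * {per_bag}"
    return {
        "question": question,
        "solution_code": solution_code,
        "answer": start + bags * per_bag,
    }


def verify(sample):
    """Re-run the stored solution and compare it with the recorded answer."""
    namespace = {}
    exec(sample["solution_code"], {}, namespace)
    return namespace["result"] == sample["answer"]
```

Because each answer can be recomputed from its own solution code, the same check could serve as a binary reward signal of the kind RLVR requires.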

2408.11186 2026-05-15 cs.MA cs.AI math.OC Updated version

Sequential Resource Trading Using Comparison-Based Gradient Estimation

Surya Murthy, Mustafa O. Karabag, Ufuk Topcu

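Affiliations * The University of Texas at Austin

AI summary This paper studies sequential multi-issue trading between two rational agents over a finite set of resource categories, where the offering agent does not know the responding agent's utility function and receives only accept or reject feedback. The authors propose a comparison-based gradient estimation algorithm that treats each response as a pairwise state comparison and iteratively estimates the responding agent's gradient direction to refine offers. Every accepted trade strictly improves both agents' utilities, and after finitely many rejections the algorithm either identifies a mutually beneficial trade or certifies that the current allocation is weakly Pareto optimal; experiments show higher societal benefit with fewer offers across multiple settings.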

Abstract

We study sequential multi-issue trading between two greedily rational agents who exchange resources from a finite set of categories. Each agent's utility depends on its allocation, but the offering agent does not know the responding agent's utility function and receives only accept or reject feedback. We propose a comparison-based algorithm that interprets acceptance and rejection responses as pairwise state comparisons, allowing the offering agent to iteratively estimate the responding agent's gradient. Rejected offers prune the space of feasible gradient directions, enabling systematic refinement of possibly mutually beneficial trades. The algorithm guarantees that each accepted trade strictly improves both agents' utilities and, after finitely many rejected offers, either identifies a mutually beneficial trade or certifies that the current allocation is weakly Pareto optimal. We further show that the sequence of accepted trades asymptotically converges to the Pareto front under mild assumptions. We evaluate the method against standard baselines and show that it achieves higher societal benefit with fewer offers across multiple trading settings. We further validate the approach in a user study, demonstrating strong performance in scenarios with substantial resource conflict.
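
The pruning idea can be sketched in two dimensions. This is a simplified illustration, not the paper's algorithm: the responder's unknown gradient is represented by a finite set of candidate unit directions, and a first-order model is assumed in which the responder accepts a small trade `offer` exactly when it increases its utility, that is, when gradient . offer > 0.

```python
import math


def candidate_directions(k):
    """k equally spaced unit vectors as hypotheses for the gradient direction."""
    return [
        (math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
        for i in range(k)
    ]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def prune(candidates, offer, accepted):
    """Keep only the gradient hypotheses consistent with one accept/reject response."""
    if accepted:
        return [c for c in candidates if dot(c, offer) > 0]
    return [c for c in candidates if dot(c, offer) <= 0]
```

Each response is a half-space comparison, so every offer cuts the feasible cone of directions. Choosing offers that split the remaining candidates roughly in half would shrink the cone fastest, although the selection rule used in the paper is not reproduced here.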