arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08496 2026-05-12 cs.AI

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

Linh Le, David Williams-King, Mohamed Amine Merzouk, Aton Kamanda, Adam Oberman

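AI summary: Current adversarial robustness methods for large language models require large datasets of harmful prompts for training, yet remain vulnerable to novel attacks and distribution shift. This paper proposes Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA reaches attack success rates comparable to methods trained on more than 150k examples while maintaining superior utility, and it markedly improves generalization to unseen attack distributions across six harm benchmarks.

Comments: Published at the Trustworthy AI Workshop, ICLR 2026

Abstract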

Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training models on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA achieves comparable attack success rates to methods trained on 150k+ examples, while maintaining superior utility. Critically, LPA generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x compared to baseline across six harm benchmarks -- without ever seeing harmful examples during training. Our results demonstrate that personality-based alignment offers a principled approach to building robust defenses with minimal cost.

2605.08495 2026-05-12 cs.LG q-bio.NC

NeuralBench: A Unifying Framework to Benchmark NeuroAI Models

Hubert Banville, Stéphane d'Ascoli, Simon Dahan, Jérémy Rapin, Marlène Careil, Yohann Benchetrit, Jarod Lévy, Saarang Panchavati, Antoine Ratouchniak, Mingfang Zhang, Elisa Cascardi, Katelyn Begany, Teon Brooks, Jean-Rémi King

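AI summary: This paper presents NeuralBench, a unified framework for evaluating NeuroAI models that process brain signals. Through the accompanying NeuralBench-EEG v1.0 benchmark, it systematically tests a range of deep learning models on 36 EEG tasks and reports two key findings: current foundation models perform close to task-specific models, and some cognitive decoding and clinical prediction tasks remain challenging. NeuralBench is designed to be flexible, supporting new tasks, datasets, and imaging modalities, and aims to advance unified evaluation standards and community collaboration for neuroimaging models.

Comments: 31 pages, 9 figures

Abstract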

Deep learning and large public datasets have recently catalyzed the proliferation of AI models for processing brain recordings. However, systematically evaluating these models remains a challenge: not only do the preprocessing pipelines, training and finetuning approaches largely vary across studies, but their downstream evaluation is often limited to small sets of tasks and/or datasets. Here, we present NeuralBench: a unified framework for benchmarking AI models of brain activity. We accompany this framework with NeuralBench-EEG v1.0 -- a large EEG benchmark that includes 36 electroencephalography (EEG) tasks and 14 deep learning architectures, and is evaluated on 94 datasets accessed through a standardized interface. This first EEG-focused release already highlights two main findings. First, current foundation models only marginally outperform task-specific models. Second, a large set of tasks (e.g. cognitive decoding, clinical predictions) remain highly challenging, even for the best models. Critically, NeuralBench is designed for the integration of new tasks, datasets, models, and neuroimaging modalities, as illustrated by preliminary extensions to MEG and fMRI datasets and models. Through this white paper, we invite the community to expand this open-source framework and work together toward a unified benchmarking standard for neuroimaging models.

2605.08493 2026-05-12 cs.CV

CapCLIP: A Vision-Language Representation Alignment Approach for Wireless Capsule Endoscopy Analysis

Haroon Wahab, Irfan Mehmood, Hassan Ugail

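AI summary: Wireless capsule endoscopy (WCE) allows non-invasive examination of the small bowel, but its clinical use is limited by the large number of images produced per examination and by variable imaging conditions. To address this, the paper proposes CapCLIP, a vision-language representation alignment framework for WCE that aligns capsule endoscopy images with clinical text built from standardised nomenclature and pathology-aware caption templates, learning embeddings that are semantically informed and transferable. Experiments show that CapCLIP consistently outperforms the compared baselines on tasks such as zero-shot classification and cross-modal retrieval, demonstrating the potential of language-guided representation learning to improve generalisation and semantic interpretability in WCE analysis.

Abstract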

Wireless capsule endoscopy (WCE) enables non-invasive visual assessment of the small bowel, but its clinical utility is constrained by the large volume of frames generated per examination and the difficulty of recognising subtle abnormalities under highly variable imaging conditions. Existing learning-based approaches for WCE are predominantly vision-only, often confined to narrow pathology sets, and show limited transfer across datasets and centres. To address these limitations, this study introduces CapCLIP, a domain-specific vision-language representation learning framework for WCE. CapCLIP aligns capsule endoscopy frames with clinically grounded textual descriptions derived from standardised nomenclature and pathology-aware caption templates, thereby learning embeddings that are both semantically informed and transferable. The proposed framework is evaluated against relevant open-source vision and vision-language foundation models under strict zero-shot conditions using unseen WCE datasets. Evaluation covers three downstream tasks: K-nearest neighbour classification, CLIP-style image-text classification, and text-to-image retrieval. Across these settings, CapCLIP consistently outperforms the compared baselines, with particularly strong gains in zero-shot image-text classification and cross-modal retrieval on out-of-distribution datasets. The results indicate that language-guided representation learning can improve both generalisation and semantic interpretability in WCE analysis. These findings position CapCLIP as a step toward foundation models tailored to capsule endoscopy and support the use of language-grounded WCE analysis.
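As a rough illustration of the CLIP-style image-text classification mentioned above, the sketch below scores one frame embedding against a few caption embeddings by cosine similarity and picks the best match. The three-dimensional vectors, the label names, and both function names are hypothetical placeholders; a real system would obtain embeddings from trained image and text encoders, and nothing here reproduces CapCLIP itself.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, text_embeddings):
    """Return the label whose caption embedding is closest to the image."""
    scores = {label: cosine(image_embedding, emb)
              for label, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores

# Hypothetical embeddings, chosen only to make the example concrete.
caption_embeddings = {
    "polyp": [0.9, 0.1, 0.0],
    "ulcer": [0.1, 0.9, 0.1],
    "normal mucosa": [0.0, 0.2, 0.9],
}
frame_embedding = [0.2, 0.8, 0.1]

predicted, scores = zero_shot_classify(frame_embedding, caption_embeddings)
print(predicted)
```

No labelled training frames are needed at inference time in this setup: adding a class only requires a new caption embedding.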

2605.08489 2026-05-12 cs.RO

LE-PAVD: Learning-Enhanced Physics-Aware Vehicle Dynamics for High-Speed Autonomous Navigation

Musabbir Ahmed Arrafi, Malik Ali, Nicholas M. Stiffler, Krishna Bhavithavya Kidambi

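AI summary: Accurate modeling of nonlinear vehicle dynamics is essential for high-speed autonomous navigation. This paper proposes LE-PAVD, a hybrid model that combines physics priors with learned components to improve physical consistency and prediction accuracy. The model introduces four physics modules (load-sensitive tire forces, longitudinal load transfer, lateral tire-force effects, and rate-limited actuator inputs) and is trained end-to-end on simulation and real-world data. Experiments show that LE-PAVD outperforms an existing deep dynamics model in both prediction error and inference efficiency, and achieves faster lap times in closed-loop simulation with no track boundary violations.

Abstract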

Accurate modeling of nonlinear vehicle dynamics is essential for high-speed autonomous racing, where controllers operate at the handling limits. Model-based methods are interpretable but rely on simplifying assumptions, while purely learned models capture nonlinearities yet often lack physical consistency and generalization. We propose LE-PAVD (Learning-Enhanced Physics-Aware Vehicle Dynamics), a hybrid model that integrates physics priors with learned components. Our architecture adds four components: load-sensitive Pacejka tire forces, longitudinal load transfer, lateral tire-force effects, and rate-limited actuator inputs. Trained end-to-end on simulation and real-world telemetry, LE-PAVD enforces physical consistency while improving state prediction accuracy. On an unseen track, LE-PAVD reduces average displacement error (ADE) by 16.1%, final displacement error (FDE) by 20.6%, and yaw-rate root mean squared error (RMSE) by 91.3% versus a deep dynamics baseline, while using 21.6% fewer FLOPs and achieving approximately 1.50x faster inference. In closed-loop simulations, LE-PAVD consistently outperforms the baseline, with lap times 17.4% faster on a training track and 9.5% faster on a test track, without any track boundary violations. Overall, LE-PAVD offers a compact, physics-grounded dynamics backbone that improves predictive fidelity and closed-loop performance while reducing inference cost.
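For readers unfamiliar with the tire model named above, here is a minimal sketch of a load-sensitive Pacejka "magic formula" for lateral force. The coefficient values, the linear load-sensitivity term, and the function name are assumptions made for illustration; the abstract does not state the paper's parameterisation, and in the paper such quantities are presumably learned or identified from data.

```python
import math

def pacejka_lateral_force(slip_angle, normal_load,
                          B=10.0, C=1.9, E=0.97,
                          mu0=1.0, load_sensitivity=1e-5,
                          nominal_load=4000.0):
    """Lateral tire force (N) from slip angle (rad) and normal load (N).

    All coefficients are illustrative assumptions, not values from the paper.
    """
    # Assumed load sensitivity: peak friction drops linearly as load rises.
    mu = mu0 - load_sensitivity * (normal_load - nominal_load)
    D = mu * normal_load  # peak force
    x = B * slip_angle
    # Magic formula: D * sin(C * atan(x - E * (x - atan(x))))
    return D * math.sin(C * math.atan(x - E * (x - math.atan(x))))

light = pacejka_lateral_force(0.05, 4000.0)
heavy = pacejka_lateral_force(0.05, 6000.0)
print(light / 4000.0, heavy / 6000.0)
```

With these assumed coefficients, the heavier-loaded tire produces more total force but less force per unit load, which is the qualitative effect that load sensitivity is meant to capture.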

2605.08482 2026-05-12 cs.LG cs.CL

ShifaMind: A Multiplicative Concept Bottleneck for Interpretable ICD-10 Coding

Mohammed Sameer Syed, Xuan Lu

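AI summary: This paper proposes ShifaMind, an interpretable ICD-10 coding model built around a Multiplicative Concept Bottleneck (MCB) architecture, which improves performance and interpretability by changing the form of the bottleneck rather than its width. The method retains a concept interface for clinical inspection while using a multiplicative gating mechanism to strengthen the model's representation of clinical text. Experiments show that, on MIMIC-IV, ShifaMind achieves coding performance competitive with the strongest baseline and substantial gains over a vanilla concept bottleneck model on interpretability-oriented metrics.

Abstract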

Automated ICD-10 coding from clinical discharge summaries requires models that are both accurate on long-tailed multi-label classification tasks and interpretable to clinicians. Concept Bottleneck Models (CBMs) offer a principled framework for interpretability by routing predictions through human-interpretable concepts, but this transparency often comes at a cost: compressing rich clinical text representations into a narrow concept layer can restrict gradient flow and limit predictive capacity. We present ShifaMind, a concept-grounded architecture built around a Multiplicative Concept Bottleneck (MCB), which changes the form, rather than the width, of the bottleneck. Instead of projecting through a narrow concept layer, ShifaMind uses a learned multiplicative gate over a concept-grounded representation while retaining a scalar concept interface for inspection. On MIMIC-IV top-50 ICD-10 coding, ShifaMind achieves performance competitive with LAAT, the strongest baseline, across F1, AUC, and ranking metrics, while outperforming five additional ICD-coding baselines and providing concept-mediated explanations. Its substantial gains over a capacity-matched Vanilla CBM in both predictive performance and interpretability-oriented metrics highlight the importance of the bottleneck design.

2605.08480 2026-05-12 cs.AI

AI-Care: A Conversational Agentic System for Task Coordination in Alzheimer's Disease Care

Preyash Yadav, Michelle Cohn, Priyanka Koppolu, Hritvik Agarwal, Amey Gohil, Tejas Patil, Sasha Pimento, Alyssa Weakley

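AI summary: AI-Care is a conversational agentic system intended to help people with Alzheimer's disease and related dementias manage everyday tasks, such as setting calendar reminders and organizing to-do lists, more easily. The system reduces cognitive load through natural-language interaction with a voice-first chatbot and uses stateful flow control to keep operations safe and reliable. In a preliminary pilot, users found the system trustworthy and were able to complete the evaluated coordination tasks.

Comments: 9 pages, 3 figures

Abstract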

Individuals with Alzheimer's disease (AD) and Alzheimer's disease-related dementia (ADRD) experience memory and thinking changes that impact their ability to use digital daily management tools. For example, adding an event to a digital calendar requires multiple steps that may act as barriers to independent use for individuals with AD/ADRD. This paper presents AI-Care, a conversational agentic artificial intelligence (AI) layer built on top of a remote caregiving platform co-designed with people with AD/ADRD. AI-Care is designed to reduce the cognitive load on individuals with AD/ADRD when managing everyday tasks such as setting calendar reminders and organizing to-do lists through natural-language interaction with a voice-first chatbot. The system uses a LangGraph-based stateful orchestration approach in which each request passes through sanitization, intent classification, context loading, safety checks, deterministic slot collection, tool execution, and response composition. Safety-critical responses, particularly around medications and allergies, are grounded in caregiver-verified records rather than free-form model generation. The system does not make autonomous medical or treatment decisions. Incomplete or ambiguous requests are handled through controlled multi-turn clarification rather than silent failure or guessing. The system supports both typed and spoken input, with voice output through ElevenLabs text-to-speech. Longer responses are chunked before synthesis to avoid rushed playback. A preliminary pilot with four individuals with mild-to-moderate AD/ADRD showed that users found the system trustworthy, competent, and likable, and were able to complete the evaluated coordination tasks through conversation. We describe the design goals, system architecture, safety controls, and findings from this formative evaluation.

2605.08478 2026-05-12 cs.LG

When Independent Sampling Outperforms Agentic Reasoning

Yihe Dong, Boris Shigida

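AI summary: This paper studies how to allocate inference compute for competitive programming problems under a fixed budget. Comparing agent-based reasoning with repeated independent sampling (k-shot), it finds that the latter offers a better tradeoff between accuracy and both cost and number of queries, and that this advantage holds across models and difficulty levels. The study also shows that, under resource constraints, independent sampling can outperform deeper agentic reasoning on self-contained algorithmic tasks, and it provides a budget-allocation analysis along with a proof characterizing the cost-optimal solver.

Abstract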

We study how to allocate inference-time compute for competitive programming under fixed budgets. Evaluating 216 Codeforces problems across Divisions 1-3, we compare agent-based reasoning with repeated independent sampling (k-shot) as a function of both cost and number of model calls. Across models and difficulty levels, k-shot consistently achieves a better accuracy-cost and accuracy-query tradeoff. This gap persists despite prompt caching in agent frameworks, indicating lower per-call effectiveness. Our results show that, for self-contained algorithmic tasks, independent exploration can outperform deeper agentic reasoning under realistic resource constraints. We also provide a budget-allocation analysis when the inference budget is fixed, and prove that a cost-optimal solver minimizes the principled metric log failure likelihood per dollar.
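One way to read the "log failure likelihood per dollar" metric is sketched below, under the assumption that attempts are independent and each succeeds with a fixed probability p at a fixed cost c. After k attempts the failure probability is (1 - p)^k, and a budget B buys about B / c attempts, so the log failure probability is roughly (B / c) * log(1 - p); the solver with the most negative log(1 - p) / c then fails least often for the same spend. The two solver profiles and all names are hypothetical, and costs are kept in integer cents to avoid floating-point floor-division surprises.

```python
import math

def failure_probability(p_success, cost_cents, budget_cents):
    """Probability that every independent attempt within the budget fails."""
    attempts = budget_cents // cost_cents
    return (1.0 - p_success) ** attempts

def log_failure_per_cent(p_success, cost_cents):
    """Log failure likelihood per unit cost; more negative is better."""
    return math.log(1.0 - p_success) / cost_cents

# Hypothetical solver profiles: a cheap weak sampler and a costly strong one.
cheap = {"p_success": 0.2, "cost_cents": 1}
strong = {"p_success": 0.5, "cost_cents": 5}
budget_cents = 100

cheap_fail = failure_probability(budget_cents=budget_cents, **cheap)
strong_fail = failure_probability(budget_cents=budget_cents, **strong)
print(cheap_fail, strong_fail)
print(log_failure_per_cent(**cheap), log_failure_per_cent(**strong))
```

In this toy setting the cheaper sampler wins on both the metric and the realised failure probability, even though each of its attempts is less likely to succeed.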

2605.08477 2026-05-12 cs.CL

Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

Naoki Otani, Nikita Bhutani, Hannah Kim, Dan Zhang, Estevam Hruschka

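AI summary: This paper asks whether LLM-based agents for data-centric tasks need to plan step by step. It compares two planning styles, full-horizon (FH) and single-step horizon (SH) planning, and finds that for well-defined tasks, FH combined with on-demand replanning reduces computation while preserving accuracy. Experiments show that FH matches SH accuracy across task depths, breadths, and tool robustness levels while being more efficient. The finding challenges the common assumption that step-by-step execution is necessary.

Comments: CAIS 2026

Abstract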

Explicit planning is a critical capability for LLM-based agents solving complex data-centric tasks, which require precise tool calling over external data sources. Existing strategies fall into two paradigms based on planning horizon: (1) full-horizon (FH), which generates a complete plan before execution, and (2) single-step horizon (SH), which interleaves each action (tool call) with incremental reasoning and observation. While step-by-step execution is a common default under the assumption that eager execution monitoring is necessary for adaptability, we revisit this assumption for well-defined data-centric tasks. Our controlled empirical study isolates planning horizon as the key architectural feature and systematically analyzes the effects of topological complexity and tool robustness on both paradigms. Our experiments across Knowledge Base Question Answering and Multi-hop QA show that FH planning with lazy replanning achieves accuracy parity with SH across varying depths, breadths, and robustness levels, while using 2-3x fewer tokens. These findings suggest that for well-defined data-centric tasks, eager step-wise monitoring is often unnecessary, and full-horizon planning with on-demand replanning can offer a more efficient default.

2605.08476 2026-05-12 cs.CL

A Computational Operationalisation of Competing Maturational Theories of Syntactic Development via Statistical Grammar Induction

Mila Marcheva, Suchir Salhan, Weiwei Sun

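AI summary: This paper studies which intermediate syntactic categories children acquire during first language development, and in what order. Because competing maturational theories (bottom-up and inward accounts) make different predictions, it operationalises these hypotheses computationally using statistical grammar induction. Holding the input and learning algorithm fixed, the study compares how different orderings of syntactic development affect the structure that can be learned; the bottom-up account significantly outperforms the inward account across three evaluation metrics.

Comments: In Proceedings of the Annual Meeting of the Cognitive Science Society (CogSci) 2026. Presentation in Rio de Janeiro, Brazil

Abstract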

This paper is concerned with what intermediate syntactic categories children acquire during first language development, and in what order. Maturational theories make different predictions. Bottom-up accounts (GROWING) propose that lexical and inflectional structure emerges first, while inward accounts (INWARD) predict early access to discourse-related categories. We computationally operationalise these hypotheses of staged syntactic emergence using statistical grammar induction, asking what each proposed ordering makes learnable when input and learning algorithm are held constant. Our framework makes category acquisition explicit and allows us to explore how different maturational orderings shape the structure that can be learned under identical conditions. Based on this operationalisation, the GROWING account significantly outperforms the INWARD account across three evaluation metrics.

2605.08472 2026-05-12 cs.AI

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Aswin RRV, Jacob Dineen, Divij Handa, Mihir Parmar, Ben Zhou, Swaroop Mishra, Chitta Baral

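AI summary: This study examines how introducing self-generated data in a mid-training stage can improve reinforcement learning (RL) in large language models. It proposes a guided data-generation framework based on George Polya's problem-solving approaches to produce multiple variants of correct solutions for training questions, and fine-tunes on them before RL. Experiments show that models initialized this way achieve consistent improvements on mathematical reasoning, code generation, and narrative reasoning tasks, indicating that learning multiple solution approaches benefits subsequent RL.

Abstract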

The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our investigative study shows that a language model learning multiple problem-solving approaches through self-generated data helps subsequent RL.

2605.08468 2026-05-12 cs.CL cs.AI cs.LG

PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents

Mehmet Iscan

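AI summary: This paper proposes PYTHALAB-MERA, a lightweight external controller that strengthens validation-conditioned code generation with a frozen language model. The method combines validation-grounded episodic memory, adaptive retrieval and action selection, delayed credit assignment, and structural skill reuse. In the reported strict-validation setting, the controller passed more validations than the self-refinement baseline and the investigated GRACE extension.

Comments: 28 pages, 4 figures, 7 tables; local CLI artifact evaluation

Abstract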

Local LLM-based coding agents increasingly work in settings where correctness is earned through execution feedback, persistent state, and bounded repair, not through a single fluent answer. Static retrieval, long-context prompting, self-refinement, execution-feedback repair, and reinforcement learning over model weights each address part of this setting, but they do not jointly provide validation-grounded episodic memory, adaptive retrieval-action selection, delayed credit assignment, and structural skill reuse around a frozen local model. We introduce PYTHALAB-MERA, a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. We evaluate the implementation as a local CLI artifact on reinforcement-learning coding tasks with strict validation gates. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations; the self-refinement baseline and the investigated GRACE extension each passed 0/9. These results support a deliberately bounded claim: in this recorded setting, the external memory-and-retrieval controller improved validation success. They do not establish general-purpose code synthesis, state-of-the-art performance, formal program correctness, or formal safety.
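The delayed credit assignment mentioned above can be pictured with a generic tabular TD(lambda) update using accumulating eligibility traces. This is a textbook-style sketch, not the paper's controller: the three step names, the single terminal reward, and the hyperparameters are all hypothetical.

```python
def td_lambda_episode(values, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    episode is a list of (state, reward, next_state) tuples, with
    next_state set to None on the terminal step.
    """
    traces = {state: 0.0 for state in values}
    for state, reward, next_state in episode:
        next_value = 0.0 if next_state is None else values[next_state]
        td_error = reward + gamma * next_value - values[state]
        traces[state] += 1.0  # mark the state just visited as eligible
        for s in values:
            values[s] += alpha * td_error * traces[s]
            traces[s] *= gamma * lam  # older visits receive less credit
    return values

# Hypothetical controller steps; only the final validation yields a reward.
values = {"retrieve": 0.0, "generate": 0.0, "validate": 0.0}
episode = [
    ("retrieve", 0.0, "generate"),
    ("generate", 0.0, "validate"),
    ("validate", 1.0, None),
]
td_lambda_episode(values, episode)
print(values)
```

Because the traces decay by gamma * lam per step, the reward earned at validation also raises the value of the earlier retrieval and generation steps, with the most recent step receiving the largest share.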

2605.08467 2026-05-12 cs.LG

CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

Shiyang Li, Zijian Zhang, Guangyan Sun, Yuebo Luo, Winson Chen, Yanzhi Wang, Mingyi Hong, Caiwen Ding

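AI summary: This paper introduces CUDAHercules, a benchmark for evaluating whether generated CUDA code reaches expert-level hardware optimization. The benchmark covers scenarios ranging from single kernels to full applications, is tested across multiple GPU architecture generations, and uses domain-specific semantic validators to check correctness. Experiments show that current state-of-the-art coding models do reasonably well at producing runnable CUDA code but fall well short of expert-level optimization strategies, indicating that automated CUDA programming remains challenging and requires stronger hardware reasoning and better tool use.

Abstract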

Large language models show promise for automated CUDA programming; however, even the strongest coding models (e.g., Claude-Opus-4.6) may still fall short of expert-level, architecture-aware optimization. We introduce CUDAHercules, a benchmark that evaluates generated CUDA against end-to-end human-expert SOTA systems. It spans single kernels, module-level operators, full applications, and unsolved challenge tasks across Ampere, Hopper, and Blackwell GPUs, with end-to-end tasks gated by domain-specific semantic validators. Evaluating models such as Claude-Opus-4.6 and GPT-5.4 shows a large gap between runnable CUDA and expert CUDA engineering: models often compile and pass tests, but rarely recover the optimization strategies needed to match expert performance. Application semantics further reduce success, and iterative or tool-augmented feedback can improve correctness while drifting toward slow fallback implementations. These results show that automated CUDA programming remains far from fully solved and requires stronger hardware reasoning, better tool use, and training objectives that connect code understanding to hardware architecture-grounded intelligence.

2605.08462 2026-05-12 cs.CL cs.AI

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

I. F. Atasoy, B. Mutlu, E. A. Sezer, A. Wahdan

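AI summary: This study examines LLM performance on hallucination detection in context-grounded tasks and asks whether existing benchmarks underestimate it. By comparing human annotations with predictions from Gemini 2.5 Flash and GPT-5 Mini, and adding adjudication by cross-cultural human adjudicators, it finds that adjudicators more often side with the models when they provide explicit reasoning, and that agreement and measured model accuracy increase after adjudication. The results suggest that, for ambiguity-prone tasks, model-assisted re-evaluation yields more reliable benchmarks.

Comments: Presented at the ROMCIR Workshop at ECIR 2026

Abstract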

Hallucination remains a persistent challenge in Large Language Models (LLMs), particularly in context-grounded settings such as RAG and agentic AI systems. This study focuses on contextual hallucination detection in summarization tasks. We analyze the QAGS-C and SummEval datasets by comparing original benchmark annotations with reason and span-based predictions from Gemini 2.5 Flash and GPT-5 Mini. To address systematic divergences between human labels and LLM judgments, we re-evaluated all conflicted samples through a human adjudication process involving 2 cross-cultural adjudicators. Following this re-evaluation, triple agreement (between human, GPT, and Gemini) increased by 6.38% for QAGS-C and 7.62% for SummEval. Similarly, model accuracy improved, with GPT increasing by 4.25% on QAGS-C and 2.34% on SummEval, while Gemini showed gains of 8.51% and 3.80%, respectively. Notably, adjudicators frequently sided with the models' judgments over original human annotations when LLMs provided explicit reasoning. Overall human adjudicator agreement ranged between 83% and 87%. These findings suggest that for ambiguity-prone tasks, single-pass annotations may be insufficient, and model-assisted re-evaluation yields more reliable benchmarks.

2605.08458 2026-05-12 cs.LG q-bio.NC

Neurally-plausible radial basis kernels using distributed Fourier embeddings

Jakeb Chouinard

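AI summary: This paper studies how to build continuous spatial representations within a neurally plausible framework, focusing on common radial basis kernels that can be realized in it. Working in the spatial semantic pointer framework, the author analyzes the capability and optimality of grid cell-like representations for realizing radial basis kernels, in support of unified representations of physical and perceptual phenomena.

Abstract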

Coherent, continuous spatial representations are critical for synthesizing physical and perceptual phenomena into a single representational space. Radial basis kernels provide a path forward for this type of distributed representation. In this work, we aim to characterize and analyze common radial basis kernels realizable in the neurally-plausible framework of spatial semantic pointers. Further, we analyze previous radial basis kernel work based on grid cell-like representations and demonstrate that such representations are both capable of and optimal for realizing radial basis kernels.

2605.08454 2026-05-12 cs.LG cs.AI

Recovering Physical Dynamics from Discrete Observations via Intrinsic Differential Consistency

Yuxiang Luo, Andrew Perrault

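AI summary: This paper studies how to recover continuous-time physical dynamics from discrete observations and proposes a method based on intrinsic differential consistency. It replaces conventional local supervision with the semi-group property as a global structural constraint, trains a time-conditioned secant velocity field, and uses Symmetry Rupture as both a training regularizer and an inference guide so that the model's dynamics stay consistent across temporal scales. Experiments on differential-equation benchmarks show improved prediction accuracy with fewer function evaluations.

Abstract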

Recovering continuous-time dynamics from discrete observations is difficult because local supervision (e.g., pointwise regression targets, derivative approximations, or equation residuals) loses fidelity as the observation interval grows. We replace local supervision with a global structural constraint: any flow representing autonomous dynamics must satisfy the semi-group property under time translation. We train a time-conditioned secant velocity field whose deviation from this property, which we call Symmetry Rupture, serves two purposes. As a training regularizer, it confines the hypothesis space to flows that compose consistently across temporal scales. As an inference oracle, it lets the solver select the largest step size that preserves internal consistency, replacing the local truncation error that conventional adaptive solvers depend on. On the diffusion-reaction benchmark under time-informed inference, our method reduces rollout RMSE by 87% while using 5x fewer function evaluations than a Neural ODE baseline. In the more demanding direct auto-regressive setting, where the model must predict distant future frames without intermediate temporal cues, our adaptive solver allocates compute based on local geometric complexity -- maintaining the lowest rollout RMSE on two of three PDE benchmarks while baselines either diverge or require up to an order of magnitude more function evaluations to remain stable.
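The semi-group property says that advancing a state by s and then by t must equal advancing it by s + t in one step. The sketch below measures the gap between the two for a simple decay equation dx/dt = -rate * x, comparing the exact flow with a single explicit Euler step used as a stand-in for an imperfect learned flow. The `composition_gap` function is an illustrative assumption; the abstract does not give the exact definition of Symmetry Rupture.

```python
import math

def exact_flow(x, t, rate=1.0):
    """Exact solution map of dx/dt = -rate * x over a time span t."""
    return x * math.exp(-rate * t)

def euler_flow(x, t, rate=1.0):
    """One explicit Euler step of size t, standing in for an imperfect model."""
    return x * (1.0 - rate * t)

def composition_gap(flow, x, s, t):
    """Absolute gap between one step of s + t and a step of s followed by t."""
    return abs(flow(x, s + t) - flow(flow(x, s), t))

gap_exact = composition_gap(exact_flow, 1.0, 0.5, 0.5)
gap_euler = composition_gap(euler_flow, 1.0, 0.5, 0.5)
print(gap_exact, gap_euler)
```

The exact flow composes consistently up to rounding error, while the single Euler step does not. A gap of this kind can be computed without ground-truth trajectories, which is what makes it usable both as a regularizer and as a step-size signal.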

2605.08453 2026-05-12 cs.LG cs.AI stat.ML

Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention

Peter Súkeník, Cristina López Amado, Christoph H. Lampert, Marco Mondelli

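AI summary: This paper studies the role of sinks and diagonal patterns in attention switching and oversmoothing prevention. By analyzing geometric conditions, it identifies the embedding alignment required to represent sinks, and it clarifies how sinks prevent oversmoothing by proving that dense attention smooths more than sparse attention under certain conditions and verifying empirically that these conditions are often met in practice. It also establishes an equivalence between sinks and hard attention switch, relaxes the hard switch by allowing token self-communication, and compares the representation costs of sinks and diagonal patterns, explaining why pretrained transformers favor sinks. Together, these results close the gap between what oversmoothing prevention requires and what sinks provide, and clarify why attention layers can behave like MLPs when token communication is not needed.

Abstract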

This paper studies the role of sinks and diagonal patterns as attention switch and anti-oversmoothing mechanisms. We analyze geometric conditions under which sinks can be represented, showing a necessary alignment between the embedding of the sink and all other embeddings. Next, we refine the current understanding of the role of sinks in oversmoothing prevention: we specify the conditions under which dense attention provably smooths more than sparse attention, and empirically verify that such conditions are often satisfied in practice. We further prove an equivalence between sinks and hard attention switch, in which the output of the attention is identically 0. Finally, we relax the hard attention switch by allowing token self-communication: we provide a quantitative comparison of the costs of representing sinks vs. diagonal patterns, showing why sinks are favored in pretrained transformers. The introduction and analysis of diagonal patterns and the generalization of the attention switch close the gap between what oversmoothing prevention requires and what sinks provide, while also establishing when and why attention layers act like MLPs if token communication is not necessary.
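A toy example may help picture the attention switch. In the sketch below, one token acts as a sink: its key is large and its value vector is zero, so a query aligned with the sink key puts nearly all softmax weight on it and the attention output is close to zero. The vectors, the zero-valued sink, and the function names are assumptions chosen for illustration; with finite scores the output is only approximately zero, whereas the abstract's hard switch is exact.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query dot-product attention without scaling."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Token 0 is the assumed sink: large key, zero value.
keys = [[10.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

switched_off = attend([2.0, 0.5], keys, values)  # query aligned with the sink
switched_on = attend([0.0, 3.0], keys, values)   # query aligned with token 1
print(switched_off, switched_on)
```

When the query ignores the sink direction, attention passes information from token 1 almost unchanged; when it aligns with the sink, the layer contributes almost nothing.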

2605.08452 2026-05-12 cs.CV

NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics

Jian Lan, Zhicheng Liu, Xinpeng Wang, Yuhao Zhou, Haokun Chen, Jiancheng Lv, Barbara Plank, Thomas Seidl

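AI summary: This study aims to understand how vision-language models (VLMs) perform on quantitative reasoning for kinematic physics, and whether they genuinely apply physical laws or merely guess their answers. The authors propose NICE and FACT, a dual-diagnostic framework: FACT diagnoses visual fidelity, physical law comprehension, and temporal grounding, while NICE uses a neighborhood-informed calibration method and new metrics to evaluate and improve the reliability of model confidence. Experiments show that current state-of-the-art VLMs have clear deficiencies in identifying visual preconditions and applying physical laws; the work offers a standardized diagnostic paradigm for building more reliable, physically grounded VLMs.

Abstract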

The ability to derive precise spatial and physical insights is a cornerstone of vision-language models (VLMs), yet their poor performance on related spatial intelligence tasks such as physical reasoning remains a fundamental barrier. The community critically lacks a scientific analysis revealing whether VLMs faithfully reach answers or plausibly make guesses. This work aims to provide a fundamental understanding of how VLMs perceive the physical world and utilize physical laws, while assessing the reliability of model confidence. We propose NICE and FACT, a dual-diagnostic paradigm that explicitly decomposes quantitative reasoning for kinematic physics: FACT diagnoses visual fidelity, physical law comprehension, and temporal grounding. NICE studies our novel neighborhood-informed calibration method and novel metrics to evaluate and calibrate confidence reliability. Evaluating six recent state-of-the-art VLMs, we find that models fail to identify visual preconditions or utilize necessary physical laws to reach answers. This work establishes a standardized diagnostic paradigm to guide the development of faithful, physically-grounded VLMs.

2605.08451 2026-05-12 cs.LG

RubiConv -- Efficient Boundary-Respecting Convolutions

Linda Friso, Annie Marsden, Xinyi Chen, Arushi Gupta, Peter Bartlett, Mark Braverman, Elad Hazan

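AI summary: This paper proposes RubiConv, a new convolution algorithm that addresses the difficulty of applying standard FFT convolution efficiently when training data is packed at scale. By performing boundary-respecting convolutions, the method improves practical training efficiency, and experiments show speedups over both attention and standard FFT-based baselines. The work bridges the gap between theoretical efficiency and practical use, making long-convolution models feasible for large-scale, real-world data packing.

Comments: 19 pages, 12 figures

Abstract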

Convolutional architectures have emerged as powerful alternatives to Transformers for sequence modeling. The primary advantage is that they offer improved theoretical sequence length complexity by leveraging the Fast Fourier Transform (FFT). However, this theoretical improvement does not always meaningfully land in practice. One critical obstacle is that applying standard FFTs is not amenable to the large-scale training pipeline wherein data is packed from different sources into a single sequence for hardware efficiency. Indeed, standard FFT algorithms are not easily amenable to document packing. Existing workarounds suffer from severe inefficiencies, crippling the practical performance of convolutional architectures. We close this gap with RubiConv, a novel algorithm for performing hardware-efficient, boundary-respecting convolutions on packed sequences. Extensive experiments show that RubiConv achieves significant speedups over both attention and standard FFT-based baselines. This work makes the theoretical efficiency of long convolutional models a practical reality for large-scale, real-world data packing.
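To make "boundary-respecting" concrete, the sketch below spells out reference semantics for a causal convolution on a packed sequence: each output position may only see earlier positions from its own document. It uses a direct per-document loop rather than an FFT, so it shows the result such a convolution should produce, not how RubiConv computes it efficiently. The function names, the filter, and the two toy documents are assumptions.

```python
def causal_conv(x, h):
    """Direct causal convolution: y[t] = sum_j h[j] * x[t - j]."""
    return [sum(h[j] * x[t - j] for j in range(min(len(h), t + 1)))
            for t in range(len(x))]

def boundary_respecting_conv(packed, doc_lengths, h):
    """Convolve each packed document separately so no filter tap crosses a boundary."""
    out = []
    start = 0
    for length in doc_lengths:
        out.extend(causal_conv(packed[start:start + length], h))
        start += length
    return out

# Two hypothetical documents packed into one sequence.
packed = [1.0, 2.0, 3.0, 4.0, 5.0]
doc_lengths = [3, 2]
h = [1.0, 0.5]

naive = causal_conv(packed, h)  # leaks the end of doc 1 into the start of doc 2
respecting = boundary_respecting_conv(packed, doc_lengths, h)
print(naive)
print(respecting)
```

The two outputs differ only at the first position of the second document, where the naive convolution mixes in the last token of the first document. Avoiding that leak without giving up FFT-level efficiency is the problem the abstract describes.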

2605.08450 2026-05-12 cs.LG cs.AI

Zero-shot Imitation Learning by Latent Topology Mapping

Maxwell J. Jacobson, Yexiang Xue

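AI summary: This paper studies zero-shot imitation learning on new tasks when expert demonstrations are limited. It proposes ZALT, which identifies latent hub states and builds a topology of transitions between them to plan and adapt efficiently on long-horizon tasks. Without additional demonstrations, the method completes start-goal tasks unseen during training, and in a complex 3D maze environment it achieves a markedly higher zero-shot success rate than existing methods.

Abstract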

Imitation learning is effective for training agents when expert demonstrations are available, but collecting demonstrations for every complex task in an environment is costly. We study the long-horizon, goal-conditioned setting where a fixed demonstration dataset contains useful behavior, but not complete examples for every task the agent must solve. Existing imitation learning methods can learn strong policies from demonstrations, but when solving long-horizon tasks, small errors accumulate over long primitive-action trajectories and make zero-shot adaptation to new tasks unreliable. We introduce Zero-shot Agents from Latent Topologies (ZALT), an imitation-learning method that solves unseen start-goal tasks beyond those demonstrated during training. ZALT identifies latent hub states where trajectories converge or diverge, learns policies and a dynamics model over hub-to-hub transitions, and plans over the hub topology to complete new tasks. This topology makes demonstrated behaviors explicitly composable while compressing long tasks into shorter sequences of abstract transitions -- combined, these enable ZALT to perform zero-shot adaptation. In a complex 3D maze environment, ZALT achieves 55% zero-shot success on unseen tasks, compared to 6% for the strongest baseline.
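The idea of planning over a hub topology can be sketched as a graph search: if demonstrations cover individual hub-to-hub transitions, a start-goal pair never demonstrated as a whole can still be solved by chaining them. The breadth-first search below is a generic illustration; the hub names and the transition table are hypothetical, and in ZALT the hubs, the per-transition policies, and the dynamics model are learned rather than given.

```python
from collections import deque

def plan_over_hubs(transitions, start, goal):
    """Breadth-first search over a directed hub graph.

    Returns the shortest hub sequence from start to goal, or None.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in transitions.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical hub graph assembled from separately demonstrated transitions.
hub_transitions = {"A": ["B", "E"], "B": ["C"], "C": ["D"], "E": []}

plan = plan_over_hubs(hub_transitions, "A", "D")
print(plan)
```

Each edge in the returned plan would then be executed by the policy for that transition, so a long primitive-action trajectory is replaced by a short sequence of abstract steps.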

2605.08448 2026-05-12 cs.AI cs.CL

LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification

Jacob Ativo, Bharaneeshwar Balasubramaniyam, Anh Tran, Khushboo Gupta, Hongmin Li, Doina Caragea, Cornelia Caragea

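AI summary This paper studies the use of large language model (LLM) guided semi-supervised learning to improve classification performance on social media crisis data. The authors compare two LLM-based semi-supervised methods, VerifyMatch and LLM-guided co-training (LG-CoTrain), against traditional semi-supervised methods. Experiments show that LG-CoTrain performs best when labeled data is limited, while self-training also becomes strongly competitive as the number of labels grows. The study further shows that LLM-guided semi-supervised learning can transfer knowledge from large models into smaller, more deployable models, offering a viable path for real-world disaster response applications.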

English abstract

Semi-supervised learning approaches have been investigated as a means to enhance the analysis of social media data in disaster management contexts. In this work, we present the first empirical evaluation of large language model (LLM) guided semi-supervised learning for crisis-related tweet classification. We compare two recent LLM-assisted semi-supervised methods, VerifyMatch and LLM-guided Co-Training (LG-CoTrain), against established semi-supervised baselines. Our results show that LG-CoTrain significantly outperforms classical semi-supervised approaches in low-resource settings with 5, 10, and 25 labeled examples per class, achieving the highest averaged Macro F1 across events. VerifyMatch achieves competitive performance while also demonstrating strong calibration properties. As the number of labeled examples increases, the performance gap narrows and Self Training emerges as a strong baseline. We further observe that compact semi-supervised models can, in some cases, outperform very large LLMs operating in zero-shot settings. This finding highlights the potential of transferring knowledge from LLMs into smaller and more deployable models through LLM-guided semi-supervised learning, offering a practical pathway for real-world disaster response applications. Our project repository on GitHub is here.

2605.08447 2026-05-12 cs.CL

Revisiting the syntax of imperatives in Yemeni Arabic: An Agree across phases approach

Mohammed Q. Shormani

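AI summary This article revisits the syntax of imperatives in Yemeni Arabic and proposes an Agree across phases (AAP) analysis. It argues that this approach accounts for both simple and complex imperative constructions, including A'-chain structures, and emphasizes the close interaction between syntax and discourse. The article further proposes that the thematic subject of imperatives is a 2-person pro, whereas an overt pronoun or noun preceding the imperative is a C-domain element that serves as a topic and is coreferential with pro; this relation is established through Match, yielding local or non-local A'-chains.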

Comments 33 pages

English abstract

This article revisits the syntax of imperatives in Yemeni Arabic, proposing an Agree across phases (AAP) approach. I argue that the AAP approach successfully accounts for both simple and complex imperative constructions, including A'-chain structures, by establishing a close interaction between syntax and discourse. The study demonstrates that this interface is motivated by the interpretive and performative functions associated with imperatives, linking informational structure with propositional structure. It is also proposed that the thematic subject of imperatives is a 2-person pro, whereas any overt pronominal or nominal element occurring preverbally is not a subject, but rather a C-domain element, precisely an aboutness topic. These topics serve as the logical subjects of imperatives and enter into a coreferentiality relationship with pro. This relation is analyzed as AAP involving Match, yielding both local and non-local A'-chains. For core imperatives, viz., those lacking an overt topic, I propose a null topic to (re)merge in Spec,TopP, whose interpretation depends on the discourse.

2605.08445 2026-05-12 cs.AI

Measuring What Matters: Benchmarking Generative, Multimodal, and Agentic AI in Healthcare

Prasanna Desikan, Harshit Rajgarhia, Shivali Dalmia, Ananya Mantravadi

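AI summary This paper proposes a benchmarking framework for evaluating generative, multimodal, and agentic AI in healthcare, aiming to address the inadequate measurement of reliability, safety, and clinical relevance of current healthcare AI systems on real clinical tasks. It notes that existing benchmarks mostly focus on what a model knows and neglect its actual performance in complex clinical workflows, so models underperform in real deployment. The paper stresses the need for systematic evaluation methods that accurately measure the practical value of AI in healthcare settings and support more reliable clinical applications.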

English abstract

AI models are increasingly deployed in live clinical environments where they must perform reliably across complex, high-stakes workflows that standard training and validation datasets were never designed to capture. Evaluating these systems requires benchmarks: structured combinations of tasks, datasets, and metrics that enable reproducible, comparable measurement of what a model can do. The central challenge in healthcare AI is not performance alone, but the absence of systematic methods to measure reliability, safety, and clinical relevance under real-world conditions. Most existing benchmarks test what a model knows; too few test whether it can perform reliably and without failing across the full complexity of real clinical tasks. Current benchmarks have accumulated through ad hoc dataset construction optimized for narrow task performance: frontier models achieve near-perfect scores on medical licensing examinations, but when evaluated across real clinical tasks, performance degrades sharply, scoring 0.74--0.85 on documentation, 0.61--0.76 on clinical decision support, and only 0.53--0.63 on administrative and workflow tasks (MedHELM). High benchmark scores give a false sense of deployment readiness, and the gap between performance and utility widens precisely as AI systems take on more consequential clinical roles. Without a principled framework for benchmark design, the field cannot determine whether poor clinical performance reflects model limitations or failures in how performance is being measured.

2605.06524 2026-05-12 cs.AI

Process Matters more than Output for Distinguishing Humans from Machines

Milena Rmus, Mathew D. Hardy, Thomas L. Griffiths, Mayank Agrawal

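AI summary This paper argues that, for distinguishing humans from machines, the behavioral process is more discriminative than the output. The study introduces CogCAPTCHA30, a battery of 30 cognitive tasks, and, by analyzing features of how tasks are carried out, finds that process features distinguish humans from AI systems more reliably even when outputs are matched. Experiments also show that process-level fine-tuning can improve a machine's ability to mimic human behavior, but only when suitable task-specific process representations are available.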

English abstract

Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features even when task performance is matched. Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). To evaluate agentic process differences, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7M human decisions), and two task-specific fine-tuning approaches applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT), which directly optimizes process features. Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve human behavioral mimicry, but only if appropriate task-specific process representations are available, highlighting process specification as a bottleneck for achieving human-like cognitive processes in machines.

2605.06375 2026-05-12 cs.LG cs.AI math.ST stat.TH

A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

Hao Yu

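AI summary This paper addresses large language model alignment via reinforcement learning from human preferences (RLHF) and proposes a unified Pair-GRPO family of methods to tackle unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance. It introduces two variants: Soft-Pair-GRPO, which adopts binary preference rewards while retaining the GRPO structure, and Hard-Pair-GRPO, which adds explicit probability constraints. The paper proves gradient stability and provides theoretical guarantees such as monotonic policy improvement and deterministic gradient direction. Experiments show that the method outperforms existing state-of-the-art methods on multiple benchmark tasks, significantly improving alignment quality and training stability.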

English abstract

Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO's clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under first-order Taylor expansion around the current policy, Soft-Pair-GRPO's gradient is a positive scalar multiple of standard GRPO's gradient, explaining its empirical stability despite discarding continuous reward magnitudes. Building on this foundation, we propose Hard-Pair-GRPO, an advanced variant introducing explicit local probability constraints and constrained KL-fitting optimization to further suppress gradient noise and global policy drift. We provide comprehensive theoretical guarantees for both variants--including monotonic policy improvement, deterministic gradient direction, gradient-variance reduction, and dynamic step-size convergence. Extensive experiments on standard LLM alignment benchmarks (HH-RLHF, UltraFeedback) and the MuJoCo continuous control task HalfCheetah-v4 demonstrate that our Pair-GRPO family consistently outperforms state-of-the-art baselines in alignment quality, human preference win rate, training stability, and generalization to general reinforcement learning. Ablation studies validate the critical contributions of each core component.
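
The contrast between group-normalized scalar rewards and binary pairwise preference rewards can be sketched in a few lines. The first function is the usual GRPO-style normalization (population standard deviation is used here for simplicity); the second is only one plausible encoding of a binary preference, with the +1/-1 values and the name `binary_pair_rewards` being assumptions, since the abstract does not give the exact reward definition.

```python
import statistics


def group_normalized_advantages(rewards):
    """GRPO-style advantages: center and scale rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # all responses tied: no signal
    return [(r - mean) / std for r in rewards]


def binary_pair_rewards(reward_a, reward_b):
    """Assumed binary encoding: +1 for the preferred response, -1 for the other."""
    if reward_a == reward_b:
        return (0.0, 0.0)
    return (1.0, -1.0) if reward_a > reward_b else (-1.0, 1.0)


print(group_normalized_advantages([1.0, 3.0]))  # [-1.0, 1.0]
# A large and a small reward gap collapse to the same binary signal.
print(binary_pair_rewards(0.9, 0.1), binary_pair_rewards(5.0, 4.9))
```

The second print shows what "discarding continuous reward magnitudes" means in practice: only the ordering within the pair survives.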

2605.06356 2026-05-12 cs.CV

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

YaoYang Liu, Yuechen Zhang, Wenbo Li, Yufei Zhao, Rui Liu, Long Chen

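AI summary This paper proposes SwiftI2V, an efficient high-resolution image-to-video generation method that aims to generate realistic temporal dynamics while preserving the details of the input image. To address the high computational cost and detail distortion of existing methods at high resolution, SwiftI2V adopts a two-stage design: it first generates a low-resolution motion reference to reduce computational cost, then performs 2K-resolution video synthesis under strong image conditioning, significantly improving efficiency while maintaining generation quality. The method introduces Conditional Segment-wise Generation for controllable segment-by-segment video synthesis, achieves performance comparable to end-to-end methods on VBench-I2V at 2K resolution, and reduces GPU time by 202x.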

Comments 27 pages, 17 figures

English abstract

High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency--fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).

2605.06300 2026-05-12 cs.LG

Region Seeding via Pre-Activation Regularization: A Geometric View of Piecewise Affine Neural Networks

Yi Wei, Xuan Qi, Furao Shen

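AI summary This paper studies the input-space partition induced by piecewise affine activation functions in deep neural networks and proposes a pre-activation regularization method for seeding data-relevant partition regions early in optimization. Through theoretical analysis, the authors give a sufficient condition under which neuron switching surfaces that lie close enough to data points increase the number of local affine regions, and on this basis design a plug-and-play regularizer that improves the model's expressive capacity and training performance. Experiments show that the method increases the number of regions and improves overall performance on multiple datasets.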

English abstract

Deep networks with continuous piecewise affine activations induce polyhedral partitions of the input space, making the number of realized affine regions a natural measure of expressive capacity and a key determinant of how well the model can approximate nonlinear target functions. In practice, standard training realizes far fewer region refinements in data-visited neighborhoods than the architecture could in principle support, while existing region-count theory is primarily architectural and offers little guidance on how optimization shapes the realized partition near the data. Our theory provides a sufficient condition under which bringing neuron switching surfaces sufficiently close to data points ensures their intersection with local neighborhoods, which in turn implies a strict increase in the local affine-region count, yielding a principled training-time handle for seeding data-relevant partitions early in optimization. Guided by these results, we propose a plug-and-play region-seeding regularizer that encourages early partitioning while allowing task-driven refinement to dominate later in training. Experiments show that the regularizer increases the number of realized affine regions via exact enumeration and improves overall performance on toy datasets, while also improving early-stage accuracy and achieving comparable (or slightly improved) final accuracy on ImageNet-1k for classical models.
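
The geometric intuition behind bringing switching surfaces close to data points can be checked numerically in the simplest case. The sketch below assumes a first-layer neuron, whose pre-activation w·x + b is affine in the input so that its switching surface is a hyperplane; deeper neurons have piecewise-linear switching surfaces and are not covered. The function names are illustrative, and this is not the paper's regularizer.

```python
import math


def distance_to_switching_hyperplane(w, b, x):
    """Distance from x to {z : w.z + b = 0}, i.e. |pre-activation| / ||w||."""
    pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(pre_activation) / math.sqrt(sum(wi * wi for wi in w))


def crosses_neighborhood(w, b, x, radius):
    """True if the hyperplane passes through the open ball of this radius at x."""
    return distance_to_switching_hyperplane(w, b, x) < radius


w, b, x = [3.0, 4.0], -5.0, [1.0, 1.0]
print(distance_to_switching_hyperplane(w, b, x))  # pre-activation 2, norm 5
print(crosses_neighborhood(w, b, x, 0.5), crosses_neighborhood(w, b, x, 0.3))
```

A small normalized pre-activation magnitude at a data point therefore means the neuron's switching surface cuts through that point's neighborhood, which is why a penalty on pre-activations can act as a handle on where partitions form.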

2605.06231 2026-05-12 cs.CL

YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling

Fengze Guo, Yue Chang

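AI summary This paper describes our system for SemEval-2026 Task 9, which aims to detect multilingual, multicultural, and multievent online polarization by identifying polarized social media content in 22 languages across three subtasks. We propose a heterogeneous ensemble that combines the multilingual pretrained models XLM-RoBERTa-large and mDeBERTa-v3-base, and explore techniques such as multi-task learning, translation-based data augmentation, and class weighting to cope with severe class imbalance. We find that independent task modeling combined with class weighting improves classification performance more effectively.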

Comments Accepted to the SemEval-2026 workshop of the ACL 2026 conference

English abstract

This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization, which identifies polarized social media content in 22 languages through three subtasks: binary detection, target classification, and manifestation identification. We propose a heterogeneous ensemble of multilingual pretrained models, combining XLM-RoBERTa-large and mDeBERTa-v3-base. We investigate techniques such as multi-task learning, translation-based data augmentation, and class weighting to improve classification performance under severe label imbalance. Our findings indicate that independent task modeling combined with class weighting is more effective.
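
Class weighting under label imbalance is commonly done with inverse-frequency weights. The sketch below uses the "balanced" heuristic, n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for `class_weight="balanced"`; the abstract does not state which weighting scheme the authors used, so this choice and the toy labels are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


# Toy imbalanced label set: 8 neutral posts, 2 polarized posts.
labels = ["neutral"] * 8 + ["polarized"] * 2
weights = balanced_class_weights(labels)
print(weights)  # {'neutral': 0.625, 'polarized': 2.5}
```

Multiplying each example's loss by the weight of its class makes the rare class contribute as much total loss as the frequent one.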

2605.06226 2026-05-12 cs.AI q-bio.GN

A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Tianyu Liu, Wangjie Zheng, Rui Yang, Benny Kai Guo Loo, Hui Zhang, Jeffries Lauran, Jianlei Gu, Botao Yu, Weihao Xuan, Kexin Huang, Nan Liu, James Zou, Yonghui Jiang, Hua Xu, Hongyu Zhao

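AI summary This paper proposes Hygieia, a multimodal AI agent system for precise diagnosis of rare diseases and prioritization of risk genes. The system integrates phenotypic features, genomic data, and clinical records, and uses a router-based, knowledge-enhanced framework that reduces errors and tailors diagnostic strategies to different disease categories. Experiments show that Hygieia achieves leading performance on multiple diagnostic benchmarks and, in real clinical use, significantly improves diagnostic accuracy and efficiency while reducing physicians' workload.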

Comments 32 pages, 6 figures

English abstract

Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians, with improvements ranging from 12% to 60%, and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.

2605.06225 2026-05-12 cs.LG cs.AI

Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker, John Sous

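AI summary This paper proposes Memory Inception (MI), a training-free method that steers large language models (LLMs) in latent space by inserting text-derived key-value (KV) caches at specific network layers. The method reduces cache redundancy while retaining control; compared with conventional instruction prompting and activation steering, MI performs better on structured steering tasks and is especially advantageous when guidance is persistent or expensive.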

English abstract

Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but typically weaker and does not support large structured reminders. We introduce memory inception (MI), a training-free method that steers in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. Rather than materializing reminder content throughout the prompt cache, MI treats steering as selective KV allocation, injecting latent slots only where the model routes to them. On matched personality-steering tasks, MI gives the best overall control--drift trade-off, remaining competitive with prompting while consistently outperforming CAA. On updateable guidance, MI supports mid-conversation behavior shifts without rewriting the visible transcript, achieving the highest post-shift alignment on Qwen3. On structured reasoning, MI outperforms visible prompting on HARDMath and PHYSICS (10/12 subject×mode cells), serving as proxies for structured reasoning in verifiable domains, while cutting content-matched KV storage by up to 118×. These results position MI as a powerful steering method when guidance is persistent, structured, or expensive to keep in the visible transcript.

2605.06222 2026-05-12 cs.RO cs.AI

When to Trust Imagination: Adaptive Action Execution for World Action Models

Rui Wang, Yue Zhang, Jiehong Lin, Kuncheng Luo, Jianan Wang, Zhongrui Wang, Xiaojuan Qi

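AI summary This paper studies how to achieve adaptive action execution in World Action Models (WAMs) to address mismatches between model predictions and the actual physical rollout. The authors propose Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly considers predicted actions, visual dynamics, real observations, and language instructions to judge whether the model's predicted future can still be trusted, and thereby dynamically adjusts how many actions are executed. They also introduce Mixture-of-Horizon Training to improve coverage of long-horizon trajectories. Experiments show that the method significantly improves task success rate and system robustness while maintaining efficient execution.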

English abstract

World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute longer when the WAM-predicted future remains reliable, and replan earlier when reality deviates from imagination. To this end, we propose Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly reasons over predicted future actions, predicted visual dynamics, real observations, and language instructions to estimate whether the remaining action rollout can still be trusted. FFDC enables adaptive action chunk sizes as an emergent consequence of prediction-observation consistency, preserving the efficiency of long-horizon execution while restoring responsiveness in contact-rich or difficult phases. We further introduce Mixture-of-Horizon Training to improve long-horizon trajectory coverage for adaptive execution. Experiments on the RoboTwin benchmark and in the real world demonstrate that our method achieves a strong robustness-efficiency trade-off: on RoboTwin, it reduces WAM forward passes by 69.10% and execution time by 34.02%, while improving success rate by 2.54% over the short-chunk baseline; in real-world experiments, it improves success rate by 35%.
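
The control flow of "execute longer while the imagined future holds, replan when it does not" can be sketched as a loop around a verifier. Everything here is a stand-in: `is_still_trusted` plays the role of a verifier such as FFDC but is just a callback, `act` is a hypothetical actuator hook, and the toy verifier below simply trusts the first four steps.

```python
def execute_adaptively(predicted_actions, is_still_trusted, act):
    """Execute actions in order, stopping at the first untrusted step.

    Returns the number of actions executed before a replan is requested.
    """
    executed = 0
    for step, action in enumerate(predicted_actions):
        if not is_still_trusted(step):
            break  # reality has drifted from the prediction: replan
        act(action)
        executed += 1
    return executed


# Toy stand-ins: the verifier trusts the first four steps only.
log = []
chunk = ["reach", "align", "grasp", "lift", "move", "place"]
n = execute_adaptively(chunk, lambda step: step < 4, log.append)
print(n, log)  # 4 ['reach', 'align', 'grasp', 'lift']
```

With a verifier that always returns True the loop degenerates to fixed-length chunk execution, so the chunk size becomes a consequence of the verifier's decisions rather than a hyperparameter.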